Collaborative Filtering

Using predictive analysis to make recommendations

Collaborative filtering on the Web has existed for a long time, dating all the way back to the original incarnations of sites like CDNow. Recommendation systems are a powerful tool for businesses to extract additional value from their e-commerce and customer databases. They benefit customers by helping them find products they like, and they help businesses by generating more sales.

We're going to look at some of the basic principles of predictive systems and introduce some methods you can utilize to make recommendations in your own applications. Along the way, we'll attempt to point out the benefits and limitations of each type of system.

Basic Predictions
At the most basic level, predictive information can be provided manually for your items. This can be built into the back-end administration of the site. When adding products to an e-commerce site, you could include a multiple select box listing all of the additional items in that category. Selecting items in the list would create a list of product IDs to be stored in an additional "related items" field in your database. With one additional query on a product detail page, you can pull up details on all of the related items that have been associated with the item being viewed.

This scenario can provide quick, quality recommendations as the computer is not guessing at the association and also does not have to perform any on-the-fly calculations. The technique suffers, however, by requiring your product administrator to have a deep knowledge of the products in your store, which may be unrealistic for larger sites. It also requires you to continuously update the "related items" lists of older items as new products are added to the catalog.
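The manual approach described above can be sketched in a few lines. This is a minimal illustration rather than the article's own implementation: it uses Python's built-in sqlite3 module, and it stores the associations in a separate join table instead of a single list-valued "related items" field. The table names, column names, and product names are all assumptions made up for the example.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from the article.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Items (ItemID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE RelatedItems (
        ItemID        INTEGER NOT NULL REFERENCES Items(ItemID),
        RelatedItemID INTEGER NOT NULL REFERENCES Items(ItemID),
        PRIMARY KEY (ItemID, RelatedItemID)
    );
""")
conn.executemany(
    "INSERT INTO Items VALUES (?, ?)",
    [(1, "Turntable"), (2, "Slipmat"), (3, "Headphones"), (4, "Mixer")],
)
# Simulates the administrator selecting items 2 and 4 for item 1.
conn.executemany("INSERT INTO RelatedItems VALUES (?, ?)", [(1, 2), (1, 4)])


def related_items(conn, item_id):
    """Return (ItemID, Name) rows manually associated with item_id."""
    return conn.execute(
        """SELECT i.ItemID, i.Name
           FROM RelatedItems AS r
           JOIN Items AS i ON i.ItemID = r.RelatedItemID
           WHERE r.ItemID = ?
           ORDER BY i.ItemID""",
        (item_id,),
    ).fetchall()


print(related_items(conn, 1))
```

The single extra query on the product detail page corresponds to the call to related_items. Nothing is computed at request time, which is why this approach is fast but depends entirely on the administrator keeping the associations current.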

User-Based Collaborative Filtering
A second approach to providing recommendations is to use collaborative filtering, a technique that makes predictions without any explicit relationships defined within the database. Two types of collaborative filtering are common: user-based and item-based. User-based filtering works by building a database of consumers' ratings of products (see Listing 1).

We'll assume an Items table and a Users table in the database with respective primary keys of ItemID and UserID, and we'll rate using a scale of 1 (lowest) to 5 (highest). You can go as high as you'd like, though statistically there's not much value in going above 7. The system will determine, on the fly, a community of like users whose ratings of items most closely match those of the current user. We'll set up a sample table of five users providing ratings for each of the colors of the rainbow (Figure 1).
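As a rough illustration of the data this setup collects, the sketch below keeps ratings in memory, keyed by user and then by item, and enforces the 1 to 5 scale. It is an assumption-laden stand-in for the database table, not the article's Listing 1: the function name add_rating and the sample values are hypothetical and are not the values from Figure 1.

```python
from collections import defaultdict

MIN_RATING, MAX_RATING = 1, 5  # scale described in the text

# ratings[user_id][item_id] -> rating; stands in for a Ratings table
# keyed on (UserID, ItemID).
ratings = defaultdict(dict)


def add_rating(user_id, item_id, rating):
    """Store one rating, rejecting values outside the 1-5 scale."""
    if not MIN_RATING <= rating <= MAX_RATING:
        raise ValueError(f"rating must be between {MIN_RATING} and {MAX_RATING}")
    ratings[user_id][item_id] = rating


# Hypothetical sample values, not the data from Figure 1.
add_rating(1, "red", 5)
add_rating(1, "blue", 2)
add_rating(2, "red", 4)
print(dict(ratings))
```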


To determine our community of users, we'll use the "Mean Squared Differences (MSD)" algorithm. This measures the degree of dissimilarity between two user profiles. Squaring adds more weight to the larger differences, which is appropriate since points further from the mean may be more significant (we care more about items that a user feels strongly about, positively or negatively, than about items they are ambivalent about). To perform the calculation in layman's terms: take the difference between the two users' ratings on each item that they have both rated, square that number, add those all up, and take the average. The lower the result, the closer that user's preferences are to the current user's. Listing 2 provides the query used to determine the community of users with the lowest mean squared difference to the user. Figure 2 provides the results of the query and the MSD values. We'll use a TOP value of 5 at the beginning of our query to display only the five users most similar to UserID 1.
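The calculation just described can be sketched in Python as follows. This is an illustration of the MSD idea, not a reproduction of the query in Listing 2. The ratings are hypothetical values chosen for the example (they are not the values from Figure 1), and the names "current" and "Pat" as well as both function names are assumptions.

```python
def mean_squared_difference(a, b):
    """Average squared rating difference over items both users rated.

    Returns None when the two users have no rated items in common.
    """
    shared = a.keys() & b.keys()
    if not shared:
        return None
    return sum((a[item] - b[item]) ** 2 for item in shared) / len(shared)


def nearest_neighbors(ratings, user, k):
    """Return up to k (other_user, msd) pairs, lowest MSD first."""
    scored = []
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        msd = mean_squared_difference(ratings[user], other_ratings)
        if msd is not None:
            scored.append((msd, other))
    scored.sort()
    return [(other, msd) for msd, other in scored[:k]]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "current": {"red": 5, "orange": 4, "yellow": 1},
    "Mike": {"red": 5, "orange": 4, "yellow": 2, "green": 5},
    "Laura": {"red": 4, "orange": 3, "yellow": 1, "blue": 4},
    "Sam": {"red": 3, "orange": 4, "yellow": 3, "green": 2, "blue": 5},
    "Pat": {"red": 1, "orange": 1, "yellow": 5, "green": 1},
}

for name, msd in nearest_neighbors(ratings, "current", 3):
    print(f"{name}: {msd:.3f}")
```

With these made-up numbers, Pat disagrees strongly with the current user on every shared item and so falls outside the three-user neighborhood.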


We're going to use the three most like-minded users to come up with predictions on what colors this user would like. We see from Figure 2 that our three closest neighbors are Mike, Laura, and Sam, since they have the lowest MSD values. Products that this community likes most will then be recommended to the user, who will probably like them too. We loop over each member in the community and assign a weighted rating (based upon their MSD value) to each of the other items that they have rated (see Listing 3). These weighted ratings from the query in Listing 3 are then inserted into a database table (see Listing 4) to aid with our calculations.

Now that we have all of our weighted ratings in the database, we total up the weighted ratings and divide by the total MSDs to give us the items with the highest weighted averages that have not already been rated by the user (see Listing 5).
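The sketch below shows one way to turn neighbor ratings into a weighted average per item. It does not reproduce Listings 3 through 5, whose exact weighting is not shown here. It assumes a weight of 1 / (1 + MSD) for each neighbor, so that more similar neighbors count for more, and it divides by the sum of those weights. The MSD values, the ratings, and the function name predict are all hypothetical.

```python
def predict(user_ratings, neighbor_ratings, neighbor_msd):
    """Rank items the user has not rated by weighted average rating.

    Assumption: each neighbor's weight is 1 / (1 + MSD), so a lower MSD
    (a more similar neighbor) contributes more to the average.
    """
    totals = {}  # item -> [sum of weight * rating, sum of weights]
    for name, msd in neighbor_msd.items():
        weight = 1.0 / (1.0 + msd)
        for item, rating in neighbor_ratings[name].items():
            if item in user_ratings:
                continue  # only predict items the user has not rated
            sums = totals.setdefault(item, [0.0, 0.0])
            sums[0] += weight * rating
            sums[1] += weight
    predictions = {item: ws / w for item, (ws, w) in totals.items()}
    return sorted(predictions.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical inputs; not the values from the article's figures.
user_ratings = {"red": 5, "orange": 4, "yellow": 1}
neighbor_ratings = {
    "Mike": {"red": 5, "green": 5},
    "Laura": {"red": 4, "blue": 4},
    "Sam": {"yellow": 3, "green": 2, "blue": 5},
}
neighbor_msd = {"Mike": 0.5, "Laura": 1.0, "Sam": 3.0}

for item, score in predict(user_ratings, neighbor_ratings, neighbor_msd):
    print(f"{item}: {score:.2f}")
```

Other weighting schemes are possible. Whatever scheme is used, the point is that the result stays on the original 1 to 5 scale because the weighted sum is divided by the total weight.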

Our final results are shown in Listing 6. Although this is a simplified example, it allows us to see where our recommendations come from. Better predictions would be gained by increasing the neighborhood size (up to a point), so you should experiment to find a reasonably large neighborhood size that does not significantly affect processing time. Since we were using a scale of 1-5, the higher the weighted average for the prediction, the more likely this user is to desire this item (or color in our case).

Although we used the Mean Squared Differences algorithm, there are several other mathematical formulas, each with its own drawbacks and limitations. The model presented could easily be modified to provide recommendations of favorite artists, authors, or whatever your site calls for. You could also base recommendations on the demographics of your users, or you may want to provide an explicit survey for all of your users to fill out so you can learn their preferences on whatever topic your site deals with.

Drawbacks of User-Based Collaborative Filtering
One of the major drawbacks of user-based predictive systems in general is that they do not scale well. The computational complexity of these methods grows linearly with the number of customers and items, which in commercial applications can each grow to be several million. Another problem is the sparsity of ratings in the data set, which might itself be quite large. In large e-commerce sites, even active customers may have purchased well under 1% of the products. With so little overlap between users, a system based upon nearest neighbors may be unable to make any product recommendations for a particular user. To address these scalability concerns, item-based recommendation techniques have been developed to identify relationships between the items themselves, and to use these to compute a list of recommendations.

Item-Based Collaborative Filtering
One way to make item-based recommendations is to simply look at items that a user has purchased together or that were part of the same transaction. Items that appear the most in orders in which the specific item appears would be the most likely to be a successful prediction. A sample query is provided in Listing 7.

This is the simplest way to provide quality item-based recommendations. It should perform quickly on the fly, but could always be run offline as a scheduled job for your entire database. A more in-depth discussion is beyond the scope of this article, but you can visit the link below for articles that will lead you in the right direction.
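The "purchased together" idea can be sketched as a single self-join, shown here through Python's built-in sqlite3 module. This is not the query from Listing 7. The OrderItems table, its columns, the sample orders, and the function name bought_together are all assumptions, and the sketch assumes each item appears at most once per order.

```python
import sqlite3

# Hypothetical order data: one row per (order, item) pair.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderItems (OrderID INTEGER, ItemID INTEGER)")
conn.executemany(
    "INSERT INTO OrderItems VALUES (?, ?)",
    [
        (1, 10), (1, 20), (1, 30),
        (2, 10), (2, 20),
        (3, 10), (3, 40),
        (4, 20), (4, 30),
    ],
)


def bought_together(conn, item_id, limit=5):
    """Return (ItemID, order count) for items sharing orders with item_id."""
    return conn.execute(
        """SELECT other.ItemID, COUNT(*) AS Orders
           FROM OrderItems AS target
           JOIN OrderItems AS other
             ON other.OrderID = target.OrderID
            AND other.ItemID <> target.ItemID
           WHERE target.ItemID = ?
           GROUP BY other.ItemID
           ORDER BY Orders DESC, other.ItemID
           LIMIT ?""",
        (item_id, limit),
    ).fetchall()


print(bought_together(conn, 10))
```

Because the query depends only on order history, its results could be computed in a scheduled job and stored alongside each item rather than recalculated on every page view.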

The recommendation technique you choose depends on the nature of your users and your application. You may have a small, controlled site with a limited set of users where user-based collaborative filtering may work just fine, or you may have a very large site with many items, which would necessitate an item-based solution. The key is to choose carefully and test things out to make sure they perform and scale correctly. It should also be noted that in many cases, it may make sense to perform the predictions themselves as a scheduled job and just store them in the database as part of the record for each item. Other cases may allow you to perform the recommendations on the fly in a brief amount of time.

Credit should also be given to Peter Boot who put out the first collaborative filter custom tag back in 2001. For more info on the science of collaborative filtering, you can visit to find links to more than 40 articles and research papers that deal with the subject. Much research continues to be done on the science of determining which collaborative filtering algorithms work best.

More Stories By Joe Danziger

Joe Danziger is a senior web applications developer with Multimax, Inc., a provider of Enterprise IT Services and Solutions supporting the critical missions of the Air Force, Army, Navy, and other Department of Defense components. He is certified as an Advanced Macromedia ColdFusion MX Developer, and also maintains the Building Blocks site dedicated to AJAX and ColdFusion, as well as DJ Central, a Website serving DJs and the electronic dance music industry.
