Amazon knows what I like

I buy a lot of books from amazon.com. You can view my online bookshelf here. One of the reasons I like amazon.com is because of its recommendations engine. Few days back it recommended me the book Thinking Fast and Slow. It is one of my favorite books.

amazonrecommends

It not only recommended the book but it also told me why it recommended the book. Since I rated the book The signal and the noise it recommended me the book Thinking Fast and Slow

amazonyourated

I got curious and wanted to understand how amazon does it. In this post I will explain the basic idea behind this recommendation.

Let us imagine that amazon has only 5 users and 5 books. The table given below contains the ratings of each user in the scale of 1 to 5. If the user has not rated the book then I have left the value as blank. I have added my ratings in the name of the user Me.

Users Signal and Noise Thinking Fast and Slow The Black Swan Outliers Talent is Overrated
Alice 4 4.1 4 3.8
Bob 4 4.2 3.8 3.5 3.2
Charles 4.5 4.5 4 3.3
David 4.8 4.2 3.6 2.9
Me 5 3.5

For each book find out its similarity with the other books. One way to find out the similarity between two books is to compute the correlation using their ratings.  Since there are 5 books we need to compute the correlation for 10 pairs (5C2). I have rated 2 books. Given below are the similarity scores (correlation) for those 2 books with the other books that I have not rated.

Thinking Fast and Slow Outliers Talent is Overrated
Signal and Noise 0.414387707 0.188982237 -0.755928946
The Black Swan 0.277350098 -1 1

Using ratings I have given and the similarity scores between the books we need to compute the weighted rating.

Weighted score for Thinking fast and slow:

My Rating (A) Thinking Fast and Slow (B) Weighted (A * B)
Signal and Noise 5 0.414387707 2.071938535
The Black Swan 3.5 0.277350098 0.970725343

Weighted score for Outliers:

My Rating (A) Outliers (B) Weighted (A * B)
Signal and Noise 5 0.188982237 0.944911185
The Black Swan 3.5 -1 -3.5

Weighted score for Talent is Overrated:

My Rating (A) Talent is Overrated (B) Weighted (A * B)
Signal and Noise 5 -0.755928946 -3.77964473
The Black Swan 3.5 1 3.5

The final step is to calculate the predicted rating for each book and recommend the book with the highest rating.

Let 

B = The Black Swan
S = Signal and noise
T = Thinking fast and slow

Predicted Rating(T) = 
             (My Rating(S) * Similarity(S,T) + 
              My Rating(B) * Similarity(B,T))) / (abs(Similarity(S,T)) + abs(Similarity(B,T)))

Predicted Rating(T) = ((5 * 0.414387707) + (3.5 * 0.277350098)) / (0.414387707 + 0.277350098)
Predicted Rating(T) = (2.071938535 + 0.970725343) / (0.691737805)
Predicted Rating(T) = 3.042663878 / 0.691737805

Predicted Rating(T) = 4.398579716

We can do the same calculations for the other 2 books. The final predicted ratings are

Thinking Fast and Slow 4.398579716
Talent is Overrated -0.159257429
Outliers -2.148971397

Thinking Fast and Slow has the highest predicted rating of 4.398. Hence amazon can safely recommend this to me.

This type of recommending items is called as Item-based collaborative filtering. It is defined as

Item-based collaborative filtering is a model-based algorithm for making recommendations. In the algorithm, the similarities between different items in the dataset are calculated by using one of a number of similarity measures, and then these similarity values are used to predict ratings for user-item pairs not present in the dataset.

This is different from the User-based collaborative filtering. In user based we find out the similarities between the users and recommended items from similar others. There are few challenges with the user based approach.

  1. Scalability – Amazon has over 100 million active customer accounts. Finding similarities between that many customers would be hard to scale. Compared to this the number of books that it carries will be much lesser. Hence finding similarities between books is going to be much faster.
  2. Sparsity – An average amazon user rates only a small fraction of the books. Hence there will be a little overlap between users, which can make it difficult to decide which users are similar. Imagine 100 million users have rated 30 books on an average. Hence there are 3 billion ratings in total. If you spread this evenly across 3 million books there are 1000 ratings for each book. Hence item based recommendation has a better chance of finding similar items.
  3. Stability – Items are stable for a very long period of time. Once a book gets published its contents will not change. Hence comparisons between items will not change as often as comparison between users. User preferences change all the time. Since items are stable we do not have to continuously calculate each item’s most similar items.