In non personalized recommenders everyone gets to see the same recommendation. The site does not know who you are. If I go to amazon.com as an anonymous user it shows items that are currently viewed by other members. This is one example of a non personalized recommendation.
There are two variations in this – Predictions and Recommendations. Predictions are bold statements and they come in the form of scores, stars and counts. Recommendations on the other hand are not bold statements. They just place items without any numbers associated with it.
One of my favorite book is The Signal and the Noise by Nate Silver. In the scale of 1 to 5 the book has got 4.3 stars out of 5. This is an example of prediction.
The math behind this is very simple.
Score = (364 * 5 + 161 * 4 + 62 * 3 + 21 * 2 + 21 * 1) Score = 2713 / 629 Score = 4.3 out of 5 stars
In the same page it also displays the information about the other books which the customers bought after buying Signal and Noise. This is an example of recommendation.
How would amazon do this? In order to understand this I came up with a matrix of members and different books purchased by them.
Signal and Noise | Naked Statistics | Influence | Made to Stick | How to read a book | |
Allen | O | O | O | ||
Bruce | O | O | O | ||
Peter | O | O | |||
Tim | O | O | |||
Tom | O | O |
Let X be the number of customers who purchased the book Signal and Noise. Let Y be the other books they purchased. You need to compute the ratio given below for each book and sort them by descending order. Pick up the Top K books and show them as related.
Total Customers who purchased X and Y / Total Customers who purchased X
If you do the above math for all the other four books you will get
Naked Statistics | 100% |
Influence | 100% |
Made to Stick | 0% |
How to read a book | 0% |
Naked Statistics makes perfect sense. But why did Influence came to the top when it has been purchased by customers who have not purchased Signal and Noise. This situation is called as banana trap. In a grocery store most of the customers will buy bananas. If someone buys a razor and a banana then you cannot tell that the purchase of a razor influenced the purchase of banana. Hence we need to adjust the formula to handle this case as well. The modified version is
(Total Customers who purchased X and Y / Total Customers who purchased X) / (Total Customers who did not purchase X but got Y / Total Customers who did not purchase X)
Substituting the numbers we get
Naked Statistics = (2 / 2) / (1 / 3) = 3 Influence = (2 / 2) / (3 / 3) = 1
The denominator acts as a normalizer and you can see that Naked Statistics clearly stands out.
Let move from amazon.com to Hacker news. Hacker News is a social news website that caters to programmers and entrepreneurs, delivering content related to computer science and entrepreneurship. How does the site decide which news should show at the top? Readers of the news articles can submit up votes and comments. The algorithm uses this information along with the time at which the post was submitted to decide its rank. Excerpt from their site.
How are stories ranked?
On the front page, by points divided by a power of the time since they were submitted. Comments in comment threads are ranked the same way. Stories on the new page and comments on the comments page are listed chronologically.
In this article the formula for calculating the score is given as
Score = (P-1) / (T+2) ^ G
where,
P = points of an item (and -1 is to negate submitters vote)
T = time since submission (in hours)
G = Gravity, defaults to 1.8 in news.arc
Points | Time since submission (in hours) | Score | |
Article 1 | 100 | 0 | 28.43 |
Article 2 | 200 | 2 | 16.41 |
Article 3 | 500 | 20 | 1.91 |
Article 4 | 2000 | 200 | 0.14 |
Even though Article 1 has lowest points it got the highest score. This makes perfect sense as more weight is given to the recent articles.
Reddit is another website which is called as the front page of internet. Users submit articles in the from of link and text. These articles can be voted up and down. The site uses a slightly different algorithm to score the articles. The details of the algorithm can be found here. Given below is the key piece of code which does the scoring.
def hot(ups, downs, date): s = score(ups, downs) order = log(max(abs(s), 1), 10) sign = 1 if s > 0 else -1 if s < 0 else 0 seconds = epoch_seconds(date) - 1134028003 return round(order + sign * seconds / 45000, 7)
Reddit differs from Hacker news in the following ways.
- Instead of using the absolute net points(ups – downs) the logarithm of the net points is used. Why would they do this? The site does not want to differentiate the scores for 100 and 200 points but it wants to differentiate the scores between 100 and 2500 points. Logarithms helps in solving this problem. In base10 Log(100) = 2 and Log(2500) = 3.39. If you need a refresher in logarithms click here.
- What does this number 1134028003 seconds mean? This comes to 35.95 years. Adding this to epoch of Jan 1 1970 this comes to the year when Reddit was founded.
- The scores will not decrease as time goes by. But the new articles are ranked higher the older articles.
Points | Time since submission (in seconds) | Score | |
Article 1 | 1100 | 252288000 | 5609.441393 |
Article 2 | 1500 | 252288000 | 5609.576091 |
Article 3 | 12000 | 252288000 | 5610.479181 |
Article 4 | 500000 | 252288000 | 5612.09897 |
All the articles are submitted at the same time in 2013. Even though there is a huge difference in the points between Article 1 and Article 4 the score differs only by 3 points as it uses logarithmic scale.
If you want to learn more about how recommendations systems work sign up for the free online course.