My wife bought a Kindle for her dad, and he asked me what new books he should read. I knew he is a big fan of autobiographies, but I still asked – "What kinds of books do you like to read?" He listed three genres, and I searched Google for the best sellers in those categories. This is what we naively do to recommend something, but a better way would be to use a book recommendation system. Recommendation systems, also known as recommendation engines, suggest products to new and existing customers. In fact, businesses are using these systems more than ever to provide personalized recommendations, which are expected to drive more sales in the coming years.
Book publishing is one of the world's traditional businesses. Revenue from the global book publishing market is forecast to grow from around $113 billion in 2015 to about $123 billion by 2020. There is also a substantial market for electronic and audio books. Amazon's e-book market share is holding steady, and it has acquired Goodreads, which has a user base of at least 20 million. There are several other book recommenders – What Should I Read Next, Bookish, Jellybooks, and the list goes on. The revenues of these companies depend on the quality of their recommendations, so they naturally want to make better predictions. Many of these systems are based on algorithms that learn from past data, and the availability of a vast amount of historical user data certainly helps.
The first step in building a book recommender is to understand users' past behavior. I have analyzed Amazon's book review dataset, which covers 1996-2014. It is a large JSON file with ~9 million rows, and I could not fit it all into my laptop's memory. I did an initial analysis in Python using the first 20,000 rows. Fortunately, in the era of big data, working with 9 million rows is not a problem with Google BigQuery. You need to know a bit of SQL, but the biggest advantage of BigQuery (for me and many others) is that it can be interfaced with a pandas DataFrame – the tool I use for data analysis and visualization. Please look here to see how I interfaced pandas with BigQuery.
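Sampling the first rows of a huge JSON file without loading it all can be done with pandas' chunked reader. A minimal sketch, assuming the dump is in JSON Lines format like the Amazon review data (a tiny in-memory sample stands in for the multi-gigabyte file here; the field names are illustrative):

```python
import io
import pandas as pd

# Tiny stand-in for the real multi-gigabyte review dump; the field names
# (reviewerID, overall, helpful) follow the Amazon review data format.
sample_jsonl = io.StringIO(
    '{"reviewerID": "A1", "overall": 5, "helpful": [3, 4]}\n'
    '{"reviewerID": "A2", "overall": 2, "helpful": [0, 1]}\n'
)

# lines=True + chunksize reads the file incrementally, so grabbing the
# first chunk (here 1 row; 20,000 in the real analysis) never loads
# the full ~9M rows into memory.
reader = pd.read_json(sample_jsonl, lines=True, chunksize=1)
first_chunk = next(iter(reader))
print(first_chunk.shape)
```

The same `chunksize` pattern works on a file path, so the initial 20,000-row analysis needs only the first chunk.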
What have I learnt?
Readers love to give good reviews – the numbers of 4- and 5-star ratings are significantly higher than those of 1, 2, and 3 stars.
2013 was a great year for books.
Sales are higher during summer and the holidays (no surprise!).
Did you notice how quiet 2007-2009 was? It may be correlated with the Great Recession.
Top five books (Need to have best sellers for a book recommendation system)
- Gone Girl by Gillian Flynn.
- The Hunger Games (Hunger Games Trilogy, Book 1) by Suzanne Collins.
- The Book Thief by Markus Zusak.
- Sycamore Row by John Grisham.
- Mockingjay (The Hunger Games) by Suzanne Collins.
Are the reviews helpful?
Amazon has a system where customers are asked to vote on whether they find a review helpful. You have probably seen this in Amazon product reviews:
In this example, 158 out of 163 people found the review helpful. We can compute the helpfulness rating for this review by dividing the helpful votes by the total votes. If we do that for all reviews, how does the distribution look?
A large share of reviews (57% to be exact) were voted not helpful at all (rating 0). However, about 2 million reviews were found completely helpful (rating 1). Can we pick a couple of the most active reviewers and look at their helpfulness ratings?
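The per-review helpfulness rating described above is a simple ratio. A minimal pandas sketch, assuming the dump's `helpful` field is a `[helpful_votes, total_votes]` pair as in the Amazon review data (the four rows below are made-up examples, the first matching the 158-of-163 review):

```python
import pandas as pd

# Each "helpful" entry is [helpful_votes, total_votes] per review.
df = pd.DataFrame({"helpful": [[158, 163], [0, 4], [2, 2], [1, 10]]})

# Split the pairs into columns, then take the ratio.
votes = pd.DataFrame(df["helpful"].tolist(), columns=["up", "total"])

# Guard against reviews nobody voted on (total == 0 becomes NaN, not an error).
votes["rating"] = (votes["up"] / votes["total"].where(votes["total"] > 0)).round(2)
print(votes["rating"].tolist())  # [0.97, 0.0, 1.0, 0.1]
```

A histogram of that `rating` column over all ~9 million reviews gives the distribution discussed above.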
The most active reviewer has written ~23,000 reviews and received a rating of 1 about 34% of the time. He or she also has numerous ratings above 50%. This is not bad – the reviewer seems genuinely helpful and could be tagged as a "trusted" one. The fifth most active reviewer received a rating of 0 about 1,500 times out of roughly 5,000 reviews. Well, book reviews are subjective, and it is not unusual for a review to be useful to some readers and useless to others.
To better understand helpfulness, we look at the 16 most active reviewers and divide their ratings into four helpfulness categories – below 25%, 25-50%, 51-75%, and 76-100%.
In the figure above, pink and orange represent ratings above 50%. The majority of the reviewers are helpful and have no ratings in the lowest category. Two reviewers (the two longest bars) have more than 10,000 reviews each. One of them has ratings in all four categories; the other has ratings in every group except 76-100%. Some reviewers have received only 76-100% or only below-25% ratings.
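The four-bucket grouping above is a straightforward binning step. A hedged sketch with `pd.cut`, using made-up stand-in ratings for one reviewer:

```python
import pandas as pd

# Stand-in helpfulness ratios for a single reviewer's reviews (illustrative).
ratings = pd.Series([0.1, 0.3, 0.6, 0.8, 0.95, 0.4, 0.7])

# Bin into the article's four categories; include_lowest keeps 0.0 in
# the first bucket.
buckets = pd.cut(
    ratings,
    bins=[0, 0.25, 0.50, 0.75, 1.0],
    labels=["below 25%", "25-50%", "51-75%", "76-100%"],
    include_lowest=True,
)
print(buckets.value_counts().sort_index().tolist())  # [1, 2, 2, 2]
```

Running this per reviewer and stacking the counts yields a bar chart like the one discussed above.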
The above analysis raises the question of which reviews are more likely to receive a higher helpfulness rating – 4 & 5 stars (positive reviews) or below 4 stars (negative reviews)? I have done a statistical test to answer this question. Helpfulness is divided into two groups – 'yes' or 'no' – where below 50% is considered NOT helpful. 20,000 reviews are chosen randomly from the total of ~9 million rows to ensure that the sampled reviews are independent. Out of these 20,000 samples, 51% of the negative reviews received a helpful vote, compared to 42% of the positive reviews. This means the negative reviews get a 9-percentage-point higher helpfulness rate in the random sample. The question then is: what is the probability that negative reviews are also more helpful in the entire dataset? We start from the null hypothesis that there is no difference in helpfulness between the two groups and that the observed 9% difference arose purely by chance. Applying the central limit theorem and a statistical z-test, I found a p-value – the probability of observing a difference at least this large if the null hypothesis were true – below 0.001 at the 95% confidence level, which suggests that the difference observed in the sample is statistically significant. We can then conclude that negative reviews are more likely to get a higher helpfulness rating. However, this does not mean that positive reviews are not helpful, as based on this data we can only reject the null hypothesis.
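The test above can be sketched as a two-proportion z-test from scratch. The counts below are illustrative stand-ins consistent with the 51% vs. 42% rates reported in the text – the actual per-group sizes of the 20,000-review sample are not stated, so the 5,000/15,000 split is an assumption:

```python
from math import erf, sqrt

# Illustrative counts: group sizes are assumed, rates match the article.
neg_helpful, neg_total = 2550, 5000    # negative reviews with a helpful vote (51%)
pos_helpful, pos_total = 6300, 15000   # positive reviews with a helpful vote (42%)

p1 = neg_helpful / neg_total
p2 = pos_helpful / pos_total

# Pooled proportion under the null hypothesis of no difference.
pooled = (neg_helpful + pos_helpful) / (neg_total + pos_total)
se = sqrt(pooled * (1 - pooled) * (1 / neg_total + 1 / pos_total))
z = (p1 - p2) / se

# One-sided p-value from the normal CDF (justified by the central limit theorem).
p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))
print(round(z, 1), p_value < 0.001)  # 11.1 True
```

With a difference this large relative to its standard error, the p-value is far below 0.001, matching the conclusion above.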
The text of the reviews is another important topic, and I am going to discuss it in the next blog post.
Code and complete analysis
Please see my GitHub.