Kindle Store Reviews - Sentiment Analysis and Recommendation Systems

#sentimentanalysis #datascience #machinelearning #python

As my capstone project at Flatiron School, I worked on Amazon Kindle Store reviews.

Why did I choose this project?

With the growing demand for e-commerce, the voice of the customer, expressed through reviews and surveys, is becoming more important. Customers cannot touch products online, so they want to learn from previous customers' experiences. If a company fails to meet this expectation, it can lose money. For this project I chose Kindle Store reviews because it is easy to buy and read books with Kindle, and the Kindle app is widely available across devices. As a result, many people prefer to buy books online and leave feedback about them online. I worked on book reviews, but the important point for me is that the same approach can be applied to other companies and other domains. I found this online business problem interesting, so I tried to solve it.

How did I solve this problem?

I have two main solutions. First, I did sentiment analysis to classify reviews as positive or negative. It is hard to read thousands of reviews regularly and understand which products customers like or dislike. It takes too much time and usually does not give a concise picture for comparing products. I built a model that predicts from the text whether a review is positive or negative, which saves that time.
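
To make the idea of classifying a review from its text concrete, here is a minimal sketch in plain Python. It is not the project's actual code: it uses a small multinomial naive Bayes classifier (a simpler technique than the models used in the project), and the four training reviews are made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def train_naive_bayes(texts, labels):
    """Count words per class; these counts are all the model needs."""
    word_counts = {label: Counter() for label in set(labels)}
    class_counts = Counter(labels)
    for text, label in zip(texts, labels):
        word_counts[label].update(tokenize(text))
    vocab = {word for counts in word_counts.values() for word in counts}
    return word_counts, class_counts, vocab


def predict(model, text):
    """Return the class with the highest log-probability for the text."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy reviews, only for illustration.
train_texts = [
    "I loved this book, great story",
    "Wonderful characters and a great ending",
    "Boring plot, I hated it",
    "Terrible writing and a boring story",
]
train_labels = ["positive", "positive", "negative", "negative"]

model = train_naive_bayes(train_texts, train_labels)
print(predict(model, "a great and wonderful book"))
print(predict(model, "a dull, boring plot"))
```

With real data, the same train/predict split applies: fit on labeled reviews, then score new review text without anyone having to read it.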

My second solution is a recommendation system that meets customers' expectations by recommending good books to them, which in turn helps sell more products.

Models

I tried different models for sentiment analysis. First, I tried classical machine learning models: LogReg, DecisionTree, ExtraTree, RandomForest, XGBoost, and LGBM classifiers. The best of them reached 87% accuracy, so I decided to try neural nets. I built a Torch model with a FastText class, then tried CNNs and RNNs with different numbers and types of layers in Keras. Finally, I used transfer learning with a pre-trained BERT model, which gave me 95% accuracy on the small sample and 97% on the large-scale data. The results for the small-scale data are shown below:

Recommendation Systems

For the recommender system, I used collaborative filtering, which finds similar users based on their past ratings of the same products and recommends to user B a book that user A also liked. This approach has many advantages, such as making it easy to discover customers' interests and to add new data to the system, but it has one big disadvantage: the cold-start problem. A new user has no rating history, so there are no similar users to find.
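
The user-based idea above can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the project's implementation: the user names, book IDs, and ratings are made up, similarity is plain cosine similarity over rating vectors, and candidate books are scored by a similarity-weighted sum of ratings.

```python
import math


def cosine_similarity(ratings_a, ratings_b):
    """Cosine similarity between two users' rating dicts (missing = 0)."""
    dot = sum(r * ratings_b[book] for book, r in ratings_a.items() if book in ratings_b)
    norm_a = math.sqrt(sum(r * r for r in ratings_a.values()))
    norm_b = math.sqrt(sum(r * r for r in ratings_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings is similar to nobody
    return dot / (norm_a * norm_b)


def recommend(ratings, target, top_n=3):
    """Rank books the target has not rated by similarity-weighted ratings."""
    target_ratings = ratings.get(target, {})
    scores = {}
    for user, user_ratings in ratings.items():
        if user == target:
            continue
        sim = cosine_similarity(target_ratings, user_ratings)
        if sim <= 0:
            continue
        for book, rating in user_ratings.items():
            if book not in target_ratings:
                scores[book] = scores.get(book, 0.0) + sim * rating
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Made-up ratings on a 1-5 scale, only for illustration.
ratings = {
    "user_a": {"book_1": 5, "book_2": 4, "book_3": 5},
    "user_b": {"book_1": 5, "book_2": 4},
    "user_c": {"book_1": 1, "book_4": 5},
    "new_user": {},
}

print(recommend(ratings, "user_b", top_n=1))  # book liked by the most similar user
print(recommend(ratings, "new_user"))         # cold start: nothing to compare
```

The second call shows the cold-start problem directly: with no ratings, every similarity is zero and the list of recommendations is empty.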

To solve this issue, I built a system that finds similarities between keywords and reviews. It is easy to type keywords into a search box and get recommendations, but what if I have no information about a product other than its reviews, or the best book for me is not the best book for a new user? So I chose to calculate the text similarity between the new user's keywords and the past reviews of other users. Then I find the N most similar users and make recommendations to the new user based on them.
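
A minimal sketch of this keyword-based approach is below. Several details are assumptions made for illustration: text similarity is cosine similarity over simple word counts, each user's reviews are joined into one text, and the books recommended are the ones the top N users rated at least `min_rating`. The review data and book titles are made up.

```python
import math
import re
from collections import Counter, defaultdict


def bag_of_words(text):
    """Word-count vector for a piece of text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items() if word in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def recommend_from_keywords(keywords, reviews, n_users=2, min_rating=4):
    """Find the users whose reviews best match the keywords, then return
    the books those users rated highly (an assumed selection rule)."""
    query = bag_of_words(keywords)
    user_text = defaultdict(str)
    for review in reviews:
        user_text[review["user"]] += " " + review["text"]
    sims = {user: cosine(query, bag_of_words(text)) for user, text in user_text.items()}
    matching = [user for user in sims if sims[user] > 0]
    top_users = set(sorted(matching, key=sims.get, reverse=True)[:n_users])
    books = []
    for review in reviews:
        if (
            review["user"] in top_users
            and review["rating"] >= min_rating
            and review["book"] not in books
        ):
            books.append(review["book"])
    return books


# Made-up reviews, only for illustration.
reviews = [
    {"user": "u1", "book": "Space Saga", "rating": 5, "text": "Epic space adventure with aliens"},
    {"user": "u1", "book": "Star Pilot", "rating": 4, "text": "Fun space battles"},
    {"user": "u2", "book": "Love in Paris", "rating": 5, "text": "Sweet romance in Paris"},
    {"user": "u3", "book": "Space Saga", "rating": 2, "text": "Too many aliens, slow adventure"},
]

print(recommend_from_keywords("space adventure", reviews, n_users=1))
```

Because the new user only supplies keywords, this works without any rating history, which is what makes it a workaround for cold start.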

Challenges of This Project

The first challenge was that the data is very large. There are more than 2.2 million reviews from 139,815 users for 98,824 books. I used the metadata as a helper dataset to match book titles with book IDs. Every trial could take hours, and my personal computer could not run the models on the whole dataset. So I found workarounds. First, to run and compare models easily, I took a sample from the whole set, chose my model using that small set, and then ran the chosen model on the whole set. I also used Google Colab to run some models that could take hours.
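
One way to take a uniform random sample from a dataset that is too big to load at once is reservoir sampling. The sketch below is an assumption about how such a sample could be drawn, not a description of the sampling used in this project; the function name, the sample size, and the fake review strings are all made up for illustration.

```python
import random


def reservoir_sample(items, k, seed=42):
    """Uniformly sample k items from an iterable in a single pass,
    keeping only k items in memory (Algorithm R)."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = []
    for i, item in enumerate(items):
        if i < k:
            sample.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = item  # replace with probability k / (i + 1)
    return sample


# Stand-in for streaming reviews from a large file, one at a time.
fake_reviews = (f"review {i}" for i in range(100_000))
small_set = reservoir_sample(fake_reviews, k=1000)
print(len(small_set))
```

The same function works on an open file handle, so the full dataset never has to fit in memory while the small comparison set is built.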

The second challenge was building neural networks. Before using transfer learning, I tried to train Torch and Keras models. It is not easy to understand which layers are necessary or what each layer does.

The third challenge is that it is not easy to say whether a recommendation system works well. We are trying to guess people's likes and dislikes, and human nature is hard to predict.

In summary, this project can help companies save time, because they no longer need to read many reviews to compare products. If I recommend good books, people may buy more products and recommend my online store to others. The project also makes it easy to compare products and to identify dislikes quickly. And if customers find the books they are interested in easily and spend less time on the server, that means fewer problems for my server.

All the details and code can be found in my GitHub repo.

On a personal level, I learned many different techniques and models, and I learned more about deep learning. It was a good hands-on experience with neural networks.

Overall, I am very happy with the level I have reached in 15 weeks. When I compare my first projects with my capstone, I feel that I have improved a lot. I am grateful to our instructor Bryan Arnold and our coach Lindsey Berlin for always helping and supporting us to push our limits. And special thanks to my dear husband, who always encourages me to try my best.