DEV Community

Tebid Quinsy

Analysis of a Reviews dataset

Looking through Kaggle, there are a lot of projects you and I can do when dipping our feet into the scary world of data science. Among all the datasets, discussions and notebooks on the platform is the Amazon Reviews dataset. But why take on a project without first understanding its future deliverables and defining the project process?

In undertaking this project, I wondered:

  • Why is this project important?
  • How will this project help in the real world?
  • What will a potential employer see when I add this project to my portfolio? What do I want them to see?
  • How do I showcase this project in my portfolio?
  • What is the main takeaway from this project for myself, and anyone who comes across it?

Defining the project

By defining the project, we are able to manage expectations, set a time frame and plan steps to execute the project. Generally, for most data projects, the outline is:

  • Data collection and Initial analysis
  • Pre-processing
  • Feature Extraction
  • Model selection and Evaluation
  • Deployment

I'm not looking to re-invent the wheel so I'm using the same approach.

Process

For this project, the data is taken from the Amazon reviews data source. In the source, the data was split into train and test files, but I merged them back into a single dataset so that any sampling would be random across the whole set.

[Figure: Loading the data]
Looking through the data, I found 6 columns: 2 numerical and 4 textual. The text columns turned out to be pieces of the same review, split across columns.

To preprocess the data, I merged the text columns into a single review and derived a single rating value for each review.
[Figure: Data Preprocessing]
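A minimal sketch of this preprocessing step with pandas. The column names are hypothetical (the post does not list the real ones), and averaging the two numeric columns is only an assumption about how a single rating could be derived, not necessarily what was done here.

```python
import pandas as pd

# Hypothetical column names; replace them with the real ones from the dataset.
TEXT_COLUMNS = ["text_1", "text_2", "text_3", "text_4"]
RATING_COLUMNS = ["rating_a", "rating_b"]


def merge_review_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return one 'review' string and one 'rating' number per row."""
    out = pd.DataFrame(index=df.index)
    # Join the non-empty text fragments of each row with a single space.
    out["review"] = (
        df[TEXT_COLUMNS]
        .fillna("")
        .astype(str)
        .apply(
            lambda row: " ".join(part.strip() for part in row if part.strip()),
            axis=1,
        )
    )
    # Assumption: collapse the two numeric columns with a row-wise mean.
    out["rating"] = df[RATING_COLUMNS].mean(axis=1)
    return out
```

Missing fragments are treated as empty strings, so a review that only fills some of the text columns still produces a clean merged string.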
I was interested in looking at the general ratings and identifying the main topic of each review.
But why is this knowledge important? Who needs insights like these? Retailers need to understand which products are selling and which aren't. Location information would have made the analysis more useful to retailers, but alas the dataset doesn't include it.

After preparing the data, the next step was to extract topics from the reviews (Feature Extraction).

Hugging Face is known for its Transformers Python library, which simplifies the process of downloading and training machine learning models. I used 2 of these models to 'predict' the rating of each review and compared the predictions to the provided ratings.
To evaluate the models' predictions, I wrote an evaluate_models function.
[Figure: Evaluate Models]
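The original function is only shown as an image, so here is a rough sketch of what such a comparison could look like. The signature and the choice of exact-match accuracy as the metric are assumptions for illustration, not the function from the project.

```python
from typing import Dict, List


def evaluate_models(
    true_ratings: List[int], predictions: Dict[str, List[int]]
) -> Dict[str, float]:
    """Return, per model, the share of predictions that match the true rating."""
    if not true_ratings:
        raise ValueError("true_ratings must not be empty")

    scores = {}
    for name, predicted in predictions.items():
        if len(predicted) != len(true_ratings):
            raise ValueError(
                f"{name}: expected {len(true_ratings)} predictions, "
                f"got {len(predicted)}"
            )
        # Count positions where the predicted rating equals the provided one.
        matches = sum(p == t for p, t in zip(predicted, true_ratings))
        scores[name] = matches / len(true_ratings)
    return scores
```

Passing the predictions as a dictionary keyed by model name makes it easy to compare any number of models side by side.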
I found that the model I loaded through AutoTokenizer was less accurate than Hugging Face's DistilBERT; regardless, I left it in there as a visible lesson.

Finally, I used CountVectorizer to convert the text data into numerical representations that machine learning models can work with. CountVectorizer tokenizes the text, removes common English stop words (when configured with stop_words='english'), and builds a vocabulary of known words. Each review then becomes a fixed-length vector counting how often each vocabulary word occurs in it.
After vectorizing the data, I used LDA, NMF and SVD to extract topics from each review.
[Figure: Topic Extraction]
From these topic columns, I generated another column holding the topics common to all of them.
[Figure: Common Topics]
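One way to build such a column is to keep only the words that every method agrees on. The sketch below assumes each method produced a list of topic words for a review; the helper name is hypothetical.

```python
def common_topics(*topic_lists):
    """Keep the words present in every list, in the order of the first list."""
    if not topic_lists:
        return []

    # Words shared by all of the lists.
    shared = set(topic_lists[0]).intersection(*topic_lists[1:])

    result = []
    seen = set()
    for word in topic_lists[0]:
        if word in shared and word not in seen:
            result.append(word)
            seen.add(word)
    return result
```

Applied row by row to the LDA, NMF and SVD columns, this gives one list of agreed-upon topic words per review, which may be empty when the methods disagree.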

Generating these columns made me want to actually look at the ratings alongside their topics.
The data we ended up with after all this manipulation looks like this:
[Figure: Final data]
And from the data, these charts were produced:

[Figure: Common Topics and Ratings]

[Figure: Common Topics and Average Ratings]

Challenges

While working on this project, I faced the issue of too much data. The data was hard to manipulate due to its volume.

  • To overcome this, I sampled the data and then processed batches of the sample concurrently. A high-level concurrency interface hides much of the complexity of multi-threading and multi-processing and lets batches be executed asynchronously, with the results collected afterwards.
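A minimal sketch of the sample-then-batch idea, assuming Python's standard `concurrent.futures` module. The helper names and the word-count workload are placeholders; in the real project the per-batch work would be the model predictions or topic extraction.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split a list into consecutive batches of at most batch_size items."""
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def process_batch(batch):
    """Placeholder workload: count the words in each review of the batch."""
    return [len(review.split()) for review in batch]


def process_sample(reviews, sample_size, batch_size, max_workers=4, seed=0):
    """Sample the reviews, then process the sample in concurrent batches."""
    rng = random.Random(seed)
    sample = rng.sample(reviews, min(sample_size, len(reviews)))
    batches = make_batches(sample, batch_size)

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map returns results in the same order as the input batches.
        batch_results = executor.map(process_batch, batches)
        results = [value for batch in batch_results for value in batch]
    return sample, results
```

Threads suit workloads that wait on I/O; for CPU-heavy work, `ProcessPoolExecutor` from the same module offers the same interface with separate processes.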

Lessons learned

  • As data scientists, we are accustomed to working in notebooks rather than scripts. Nowadays, though, a data scientist is expected to be familiar and comfortable with writing and working with scripts.
  • Before starting a project, especially a portfolio project, think about what you want to showcase and what you are learning, and always practice time management. Be sure to look at the data and think about where it comes from (in the future, where would this data come from, and who is going to collect it?); then think about how this can be automated.
  • What is your final product? Where is it? What about updates? I chose to deploy the results on Streamlit.

Future progress

  • In the future, I want to pull data from an API. In that case the data will be more relevant and could include other attributes such as location, which will be important to retailers. The model will be updated accordingly.
  • If I use an API, I could use Apache Airflow to build a pipeline that automates this via a DAG defined in a script.

Conclusion

These are my thoughts on a beginner data science project covering data analysis in Python and sentiment analysis of an Amazon reviews dataset collected over a span of 18 years. They should be relevant to a beginner data scientist or anyone wanting to get into the data science field.
Hopefully, you learnt something from my ramblings.
