Abstract:
The increasing number of social media users has generated large amounts of data, including customer reviews for online women's clothing. Sentiment analysis, using machine learning algorithms such as random forest and naïve Bayes, can help companies determine the sentiment of customer reviews and identify the value of their products in the market. This analysis aids customers in making informed purchasing decisions and allows companies to optimize their product offerings for increased conversions.
- Background of Study: Reading through numerous customer reviews to make a purchasing decision can be a tedious and time-consuming task. It can also result in a loss of interest in buying a product. To illustrate this, consider the scenario of a mother searching for a dress to gift her daughter. As she navigates through the reviews, she encounters comments such as "good dress," "perfect dress," "the dress looks horrible," "when I wear the clothes I look terrible," "I was an angel in these clothes," "not perfectly fit," and so on. The abundance of these comments can leave the mother in a challenging position, making it difficult for her to draw a conclusive judgment about the dress.
However, it is important to recognize that analyzing customer reviews provides valuable insights to companies. It allows them to identify any shortcomings in their products and determine areas for improvement. Additionally, it helps companies understand the preferences of specific age groups or regions. By analyzing customer comments, companies can effectively target specific regions and age groups, tailoring their products accordingly. This analysis also assists companies in deciding which products should be discontinued or disregarded based on customer feedback. In summary, while the process of reviewing customer comments can be overwhelming for consumers, it serves as a valuable tool for companies to gather feedback, identify areas of improvement, and make informed decisions about their products.
The question of how to solve a problem like the one mentioned above is where Sentiment Analysis comes into play. Sentiment Analysis allows for the classification of products into positive, negative, or neutral categories, which ultimately saves time for buyers and enhances their decision-making process when purchasing products. It is important to note that some of the feedback received, such as customer comments, may contain inaccuracies.
To ensure the accuracy of the results, Machine Learning algorithms are employed and their predictions are tested. In today's business landscape, companies are confronted with vast amounts of text data originating from sources like emails, customer support transcripts, social media comments, and reviews. This text is analyzed to classify people's opinions as good, bad, or neutral.
2.0 METHODOLOGY
The methodology involves collecting and pre-processing text data, extracting features, training a model using machine learning or deep learning algorithms, and evaluating its performance for sentiment classification and deployment.
2.1 Dataset Source: The data was downloaded from the Kaggle website. Dataset Contents: The dataset contains 11 attributes, also known as columns. Dataset Size: It consists of 23,487 observations, also known as rows.
Attribute Description:
a. Clothing ID: This attribute serves as the primary key or unique identifier assigned to each item of clothing. Each item in the dataset has its own unique number.
b. Age: This attribute represents the age of the customer who purchased the product.
c. Title: The Title attribute refers to the heading or title of the customer's review.
d. Review Text: This attribute contains the actual comment or review provided by the customer.
e. Rating: The Rating attribute denotes the rank or rating given to each product.
f. Recommended Id: This attribute indicates whether the product was recommended by the customer (Yes/No).
g. Positive Feedback Count: This attribute represents the count of positive feedback received for the product.
h. Division Name: This attribute classifies the clothing into different divisions or categories.
i. Department Name: The Department Name attribute provides the classification of clothes based on different departments or segments.
j. Class Name: This attribute categorizes the clothes into specific classes or types.
2.2 IMPORTATION OF LIBRARIES
Various libraries will be imported to facilitate sentiment analysis. The following libraries will be utilized, each serving a specific purpose:
a. Pandas: Pandas is a powerful data manipulation library that provides data structures and functions for efficient data analysis and preprocessing.
b. NumPy: NumPy is a fundamental library for scientific computing in Python. It provides support for large, multi-dimensional arrays and mathematical functions to efficiently perform numerical operations.
c. NLTK (Natural Language Toolkit): NLTK is a comprehensive library for natural language processing. It offers various tools and resources for tasks such as tokenization, stemming, part-of-speech tagging, and sentiment analysis.
d. Scikit-learn: Scikit-learn is a widely used machine learning library that provides tools for data preprocessing, feature extraction, and model training and evaluation. It includes various algorithms and metrics for classification and regression tasks.
e. Matplotlib: Matplotlib is a popular data visualization library that allows for the creation of graphs, plots, and charts to visually analyze and present the results.
f. Seaborn: Seaborn is a high-level data visualization library built on top of Matplotlib. It provides a more intuitive and aesthetically pleasing interface for creating statistical graphics.
2.3 We perform a comprehensive analysis of the data's nature and distribution, including percentiles (25th, 50th, 75th), mean, count, and maximum values for columns such as clothing ID, age, rating, recommended ID, and positive feedback count.
Figure 1 Diagram of the description of the data
Additionally, we assess the data types (e.g., string, integer) and identify any missing values in both rows and columns. We can see this in the diagram below.
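The inspection described above can be sketched with pandas roughly as follows. This is a minimal illustration and not the original code: the small DataFrame and its values are made up so that the snippet runs without the Kaggle file, and only the column names are taken from the attribute list.

```python
import pandas as pd

# Hypothetical stand-in rows; the real data would be loaded from the Kaggle CSV.
df = pd.DataFrame({
    "Age": [33, 34, 60, 50, 47],
    "Title": [None, "Love it", "Some flaws", None, "Flattering"],
    "Review Text": ["Silky and comfortable", "Love this dress", None,
                    "Runs small", "Great shirt"],
    "Rating": [4, 5, 3, 2, 5],
})

# Count, mean, std, min, 25%/50%/75% percentiles and max of the numeric columns
summary = df.describe()
print(summary)

# Data type of every column
print(df.dtypes)

# Number of missing values per column
missing = df.isnull().sum()
print(missing)
```

By default describe() only summarizes numeric columns, which is why text attributes such as the title do not appear in the summary table.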
2.3 PREPROCESSING
This is a critical step in Machine Learning that involves cleaning and organizing raw data to make it suitable for building and training machine learning models. This process ensures that the data is in a consistent and usable format, reducing noise, inconsistencies, and irrelevant information that can negatively impact the model's performance and accuracy.
2.3.1 Data cleaning:
Handling missing values: We conduct a missing-value analysis on the attributes of the data, such as the title, review text, division name, department name, and class name. The number of missing values, or observations with null values, is provided for each attribute. This format allows for a clear understanding of the presence of missing data in the specified attributes. We can see the diagram below.
Figure 2: checking for missing value
From the diagram above, we find that the title attribute has 3,810 missing values, the review text has 45, and the department name, division name, and class name each have 14.
Removing Irrelevant Data: Columns that have no significance for the sentiment analysis were dropped from the data. These columns are clothing ID, title, division name, and department name.
Figure 3: Code for dropping the redundant data
The first five rows of the data were then viewed after dropping the columns. The image below shows the data with the redundant columns omitted.
Figure 4. View of the data after dropping redundant columns
We first examined the shape of the data to identify any missing rows. After detecting the presence of missing data, we proceeded to remove the corresponding rows from the dataset. Subsequently, we displayed the modified data to verify that the missing rows had been successfully dropped. Please refer to the image below for a visual representation of this process.
Figure 5: Image of null row drop
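Since the code in the figures is not reproduced in the text, here is a minimal pandas sketch of the two cleaning steps: dropping the irrelevant columns and then removing rows with nulls. The rows are invented for illustration, and the exact column spellings are an assumption based on the attribute list.

```python
import pandas as pd

# Hypothetical stand-in rows; only the column names follow the article.
df = pd.DataFrame({
    "Clothing ID": [767, 1080, 1077, 1049],
    "Age": [33, 34, 60, 50],
    "Title": [None, "Love it", "Some flaws", "Nice"],
    "Review Text": ["Silky and comfortable", None, "Runs small", "Great fit"],
    "Rating": [4, 5, 3, 5],
    "Division Name": ["General", "General", "General", None],
    "Department Name": ["Intimate", "Dresses", "Dresses", None],
    "Class Name": ["Intimates", "Dresses", "Dresses", None],
})

# Drop the columns the article treats as irrelevant for sentiment analysis
df = df.drop(columns=["Clothing ID", "Title", "Division Name", "Department Name"])

print(df.shape)   # rows and columns before removing null rows
df = df.dropna()  # remove every row that still contains a missing value
print(df.shape)   # rows and columns after removing null rows
print(df.head())  # first five rows of the cleaned data
```

Dropping the title column before calling dropna() matters here: the title has by far the most missing values, so removing rows first would discard many reviews that are otherwise complete.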
2.3.2 Creating a sentiment column: In this project, we label each review with one of three sentiment categories: good, bad, or neutral. The labelling is rating-based: reviews with a rating of 4 or 5 are categorized as good, reviews with a rating of 3 are classified as neutral, and reviews with a rating below 3 are identified as bad. These labels serve as the target classes for the models that follow. For a visual representation of the code implementation, please refer to the accompanying diagram below.
Figure 6: Creation of sentimental analysis attribute
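The rating-based rule can be written as a small function. The sketch below uses plain Python and made-up ratings; the function name is hypothetical, and with a pandas DataFrame the same function could be applied to the rating column using Series.apply.

```python
from collections import Counter

def rating_to_sentiment(rating: int) -> str:
    """Map a 1-5 rating to the article's three sentiment labels."""
    if rating >= 4:
        return "good"
    if rating == 3:
        return "neutral"
    return "bad"

ratings = [5, 4, 3, 2, 1, 5]  # made-up ratings
sentiments = [rating_to_sentiment(r) for r in ratings]
print(sentiments)

# Number of reviews per label, as a quick sanity check on the mapping
print(Counter(sentiments))
```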
2.4 Converting all the string values to integers for the purpose of implementing the model:
In the given table, we observed that all columns, except for "sentiment," "class name," and "review text," consist of integer values. To prepare the data for a machine learning model, we need to extract these string columns and convert them into integer format. This conversion will enable us to further analyze the data and train the model effectively.
Once the string column has been converted, we can proceed to analyze the sentiment classification. To achieve this, we will employ the "count" function on the "sentiment" column. This function will provide us with the number of occurrences for each sentiment classification, namely "good," "bad," and "neutral."
Figure 7: Count of sentiment column
From the diagram above, we find that bad has a total of 2370, good has 17435, and neutral has 42863.
By employing the NLTK library and the provided preprocess_text() function, we removed punctuation and irrelevant words from the review text. Additionally, we used a stop-word list to eliminate further irrelevant words. The resulting text was then tokenized into individual words for further analysis and modeling purposes. We can see it in the code block below.
Figure 8: block code for Tokenization
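The body of preprocess_text() is not shown in the text, so the following is only a stand-in that illustrates the same steps with the standard library. The tiny hand-written stop-word set replaces NLTK's much longer English list, and str.split replaces NLTK's tokenizer; both substitutions are assumptions made so the sketch runs without downloads.

```python
import string

# Tiny illustrative stop-word set; NLTK ships a much longer English list.
STOP_WORDS = {"i", "the", "a", "an", "is", "it", "this", "and", "in", "was", "to", "my"}

def preprocess_text(text: str) -> list:
    """Lowercase, strip punctuation, split into words and drop stop words."""
    no_punct = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]

tokens = preprocess_text("I love this dress, it is perfect!")
print(tokens)
```

Lowercasing before the stop-word check keeps "I" and "i" from being treated as different words.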
The resulting visualization displays the frequency distribution of the preprocessed words, allowing us to observe the most common terms in the review text after preprocessing. This visualization aids in gaining insights into the changes made during the pre-processing step and provides a foundation for further analysis and modeling. We can see the output of the image below.
Figure 9: The output of the review text after tokenization
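A frequency distribution like the one described can be computed with collections.Counter. The token lists below are invented stand-ins for the preprocessed reviews; the resulting counts could then be passed to a Matplotlib bar chart.

```python
from collections import Counter

# Made-up token lists standing in for the preprocessed reviews.
tokenized_reviews = [
    ["love", "dress", "perfect"],
    ["dress", "looks", "horrible"],
    ["perfect", "dress", "fits", "well"],
]

# Count every word across all reviews
word_counts = Counter(word for review in tokenized_reviews for word in review)

# The two most frequent words with their counts
top_words = word_counts.most_common(2)
print(top_words)
```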
Then the review text, sentiment, and class name are converted into integers with the help of the values attribute, which exposes each column as a NumPy array. We can see the implementation and the result below.
Figure 10: View of the data after conversion into integers
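The text does not show how the strings are mapped to integers, so the sketch below is one possible approach rather than the author's method: each distinct label gets an integer code in alphabetical order, which is also the order scikit-learn's LabelEncoder uses. The helper name encode_labels is hypothetical.

```python
def encode_labels(values):
    """Assign each distinct string an integer code, in sorted order."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], mapping

sentiments = ["good", "bad", "neutral", "good"]  # made-up labels
codes, mapping = encode_labels(sentiments)
print(mapping)
print(codes)
```

Under this alphabetical assumption, bad, good, and neutral would become 0, 1, and 2 respectively; the source does not confirm which code each sentiment actually received.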
3.0 IMPLEMENTATION OF THE MODEL AND RESULT
Implementing a machine learning model involves training the model using labelled data and then using the trained model to make predictions or classify new, unseen data.
3.1 Identifying the target variable and independent variables for the model:
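Now that all the columns have been converted to integers, we can train and test the model. In this project, we use the sentiment as the dependent variable (the target to be predicted) and the remaining attributes, apart from the positive feedback count and the rating, as the independent variables (the predictors). Leaving the rating out of X matters because the sentiment label was derived directly from it. We can see it in the format below: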
X = df_hm.drop(['Positive Feedback Count', 'Sentiment', 'Rating'], axis=1)
y = df_hm['Sentiment']
Figure 11: Code for Testing and Training
From the code, X contains the independent attributes and y contains the dependent attribute.
3.2 Implementing the Testing and training of the data:
The data was divided into two segments: training and testing. 70% of the data is allocated for training, while the remaining 30% is reserved for testing and evaluation. We also set the random state to 42 so that rerunning the code produces the same split. This allows for reliable comparisons and evaluations of the model's performance across different runs. We can see the code used to implement the train and test split in the diagram.
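The description matches scikit-learn's train_test_split called with test_size=0.3 and random_state=42, although the original code is not shown. To illustrate what the fixed random state buys, here is a standard-library stand-in; the function name is hypothetical and its shuffled order will differ from scikit-learn's.

```python
import random

def train_test_split_simple(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of rows with a fixed seed, then cut it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(10))  # stand-in for ten (features, label) records
train, test = train_test_split_simple(rows)
print(len(train), len(test))

# Same seed, same split: a second call returns identical partitions
print(train_test_split_simple(rows) == (train, test))
```

Without a fixed seed, every run would evaluate the model on a different 30% of the data, so accuracy differences between runs could come from the split rather than from the model.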
3.3 Implementing the Gaussian model: The Gaussian model was implemented after dividing the data into the training and testing segments, using the code below.
Figure 12: Gaussian sentiment analysis model implementation.
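The implementation itself is only available as a figure, so the following is a hedged sketch rather than the original code. It assumes that "Gaussian model" refers to scikit-learn's GaussianNB (the abstract names naïve Bayes), and it uses synthetic three-class data from make_classification in place of the encoded review features.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic three-class data standing in for the encoded review features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Fit on the training part, predict on the held-out part
model = GaussianNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Per-class precision, recall, F1 and support, plus macro and weighted averages
print(classification_report(y_test, y_pred))
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
```

The scores printed here describe the synthetic data only and are not comparable with the figures reported for the review dataset.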
The result of the Gaussian model after the implementation is shown below.
Figure 13: Output of Gaussian sentiment analysis model
The overall accuracy of the Gaussian sentiment analysis model was found to be 0.86, indicating the proportion of correctly predicted instances in the testing set. To provide a broader assessment, macro average and weighted average metrics were calculated:
Macro Average: Precision: 0.49, Recall: 0.65, F1-Score: 0.55
Weighted Average: Precision: 0.78, Recall: 0.86, F1-Score: 0.81
These metrics are crucial in evaluating the performance of the Gaussian sentiment analysis model. Precision is the fraction of instances predicted as a given class that truly belong to that class, while recall is the fraction of the actual instances of a class that the model identifies. The F1-score is the harmonic mean of precision and recall, providing a balanced measure. The support column denotes the number of instances belonging to each class. The macro average gives every class equal weight, whereas the weighted average weights each class by its support, so a poorly predicted minority class lowers the macro average far more than the weighted one. It is important to note that the Gaussian sentiment analysis model exhibited challenges in predicting Class 2, as indicated by the low precision, recall, and F1-score values for that class. This suggests the need for further improvements or alternative approaches for effectively classifying instances in this category.
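To make the difference between the macro and weighted averages concrete, here is a small plain-Python calculation on made-up labels in which one class dominates and another is never predicted. The helper name and the data are illustrative assumptions, not the article's results.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1, support)} for every label in y_true."""
    metrics = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        precision = tp / predicted if predicted else 0.0
        recall = tp / support
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1, support)
    return metrics

# Made-up labels: class 1 dominates, class 2 is rare and never predicted.
y_true = [1, 1, 1, 1, 1, 1, 0, 0, 2, 2]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 1, 1]

metrics = per_class_metrics(y_true, y_pred)
total = len(y_true)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
macro_f1 = sum(m[2] for m in metrics.values()) / len(metrics)
weighted_f1 = sum(m[2] * m[3] for m in metrics.values()) / total

print(accuracy, macro_f1, weighted_f1)
```

In this toy case the accuracy is 0.7 and the weighted F1 is about 0.61, while the macro F1 is only about 0.49, because the class that is never predicted contributes a zero that counts as much as the dominant class.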
Figure 14: Accuracy score of Gaussian sentiment analysis model.
The Gaussian classifier achieved an accuracy of 0.8624245102371483. Accuracy is calculated as the ratio of the number of correctly predicted instances to the total number of instances in the testing set, so this value means that approximately 86.24% of the instances in the testing set were correctly classified by the Gaussian classifier.
3.4 Implementing the random forest classifier:
The random forest classifier was implemented with the code below.
Figure 15: Random forest classifier analysis model implementation
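As with the Gaussian model, the original code is only shown as a figure. The sketch below illustrates a typical scikit-learn workflow with RandomForestClassifier on synthetic three-class data; the number of trees and the random state are assumptions, not values taken from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Synthetic three-class data standing in for the encoded review features.
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# n_estimators and random_state are illustrative choices
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
forest_pred = forest.predict(X_test)

print(classification_report(y_test, forest_pred))
forest_accuracy = accuracy_score(y_test, forest_pred)
print(forest_accuracy)
```

Reusing the same split for both classifiers keeps the comparison between them fair, since both are scored on identical test rows.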
The output is shown in the diagram below.
Figure 16: output of random forest classifier sentiment analysis model
From the diagram above, the overall accuracy of the Random Forest classifier for sentiment analysis was measured to be 0.84, indicating the proportion of correctly predicted instances in the testing set. Additionally, the macro average and weighted average metrics provide an overall assessment of the model's performance:
Macro Average: Precision: 0.61, Recall: 0.60, F1-Score: 0.60
Weighted Average: Precision: 0.82, Recall: 0.84, F1-Score: 0.83
These metrics offer insights into the effectiveness of the Random Forest classifier in classifying sentiment in the given dataset. The precision metric measures the accuracy of positive predictions for each class, while recall evaluates the model's ability to correctly identify positive instances. The F1-Score provides a balanced measure by considering both precision and recall. It is worth noting that the performance of the Random Forest classifier varies across classes, with class 2 showing relatively lower precision, recall, and F1-Score values compared to classes 0 and 1. By considering these metrics, we gain valuable insights into the Random Forest classifier's performance in sentiment analysis tasks. This information aids in decision-making and further
References
- Mbaabu, O., 2020. Introduction to Random Forest in Machine Learning [WWW Document]. Engineering Education (EngEd) Program | Section. URL https://www.section.io/engineering-education/introduction-to-random-forest-in-machine-learning/ (accessed 1.11.22).
- Parveen, N., Santhi, M.V.B.T., Burra, L.R., Pellakuri, H., 2021. Women’s e-commerce clothing sentiment analysis by probabilistic model LDA using R-SPARK [WWW Document]. unknown. URL https://www.researchgate.net/publication/348366577_Women’s_e-commerce_clothing_sentiment_analysis_by_probabilistic_model_LDA_using_R-SPARK (accessed 1.11.22).
- Python Programming Tutorials [WWW Document], n.d. . Pythonprogramming.net. URL https://pythonprogramming.net/tokenizing-words-sentences-nltk-tutorial/ (accessed 1.11.22).
- Sentiment Analysis and Naive Bayes Classification on e-commerce reviews [WWW Document], n.d. URL https://learn.hackwagon.com/showcase/6ARd2TMtWgGLgcTgx (accessed 1.11.22).
- Sparleanu, C., 2021. Algorithms for Fashion. How Machine Learning Makes the Apparel Industry More Sustainable [WWW Document]. Supertrends. URL https://www.supertrends.com/algorithms-for-fashion-how-machine-learning-makes-the-apparel-industry-more-sustainable/ (accessed 1.11.22).
- TokenEx, n.d. What is Tokenization? Everything You Need to Know [WWW Document]. TokenEx. URL https://www.tokenex.com/resource-center/what-is-tokenization (accessed 1.11.22).
- Xie, S., 2019. Sentiment Analysis using machine learning algorithms: online women clothing reviews (Research Project).
- aniass, n.d. GitHub - aniass/Sentiment-analysis-reviews: Sentiment analysis of women’s clothes reviews by using machine learning algorithms and Neural Networks (LSTM). [WWW Document]. GitHub. URL https://github.com/aniass/Sentiment-analysis-reviews (accessed 1.13.22).