Sentiment analysis is a field that tries to understand human emotions through software. If the sentiments are in written form, we can classify them as positive, negative, or neutral.
It is often called opinion mining because we are trying to figure out the opinion or attitude of the customer with respect to a particular product and extract valuable information from it.
Remember the last time you left a review for a product or a mobile app, or made a textual comment on Twitter or Instagram? Algorithms have most probably already analyzed that comment to extract valuable information.
A customer plays a very big role in the market: the customer can either make or break your business. Businesses make decisions based on the information extracted from textual data given by customers. For example, suppose person A has a company that produces product X, but the product is not selling well in the market. A data scientist in the company will analyze the reviews of the product to find out why it is not selling well and to see the attitude of customers towards it, so that the product can be improved.
The information extracted through sentiment analysis can be used to determine market strategy.
**Applications of sentiment analysis**
1. Review classification - to know the sentiment behind the many reviews from customers (classify the sentiments as positive, negative, or neutral).
2. Product review mining - to know which features of the product customers love and/or hate, so as to improve the product.
In this article we will go through sentiment analysis in Python using machine learning.
Here is a link to a repository in my GitHub of a project that explains sentiment analysis: https://github.com/Em-me/twitter-sentiment-analysis. You can download the data from https://www.kaggle.com/datasets/kazanova/sentiment140 and follow the steps in the GitHub repository.
**Side notes for the project and explanation of some of the steps**
- Checking for null values
Checking for null values is an important step in machine learning as missing data can affect the accuracy of your model's predictions. There are several ways to check for null values in machine learning, including:
- Using the isnull() function: This function returns a Boolean value indicating whether each value in the dataset is null or not. You can then use the sum() function to count the number of null values in each column.
- Using the info() function: This function provides information about the dataset, including the number of non-null values in each column. If the number of non-null values is less than the total number of rows in the dataset, then there are null values present.
- Using visualization tools: Visualizing the dataset can often help identify null values. For example, you can use a heatmap to visualize the null values in the dataset.
Once you have identified the null values, you can choose to either remove the rows or columns with null values, or impute the null values with an appropriate value, such as the mean or median of the column. The choice will depend on the specifics of your dataset and the problem you are trying to solve.
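As a rough sketch (not the project's exact code), here is how those checks and the two handling options might look with pandas; the DataFrame `df` and its columns are illustrative placeholders.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# A small illustrative DataFrame standing in for the real dataset
df = pd.DataFrame({"text": ["good phone", None, "bad battery"],
                   "target": [1, 0, None]})

print(df.isnull().sum())   # number of null values per column
df.info()                  # non-null counts per column

# Optional: visualize where the nulls sit
sns.heatmap(df.isnull(), cbar=False)
plt.show()

df_dropped = df.dropna()                                   # option 1: drop rows with nulls
df_imputed = df.fillna({"target": df["target"].median()})  # option 2: impute the column median
```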
The project is to assess Twitter sentiments, so we have to drop the columns which are not associated with the sentiments and remain with the text column.
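A minimal sketch of that step, assuming the Sentiment140 CSV is loaded with the column names commonly assigned to it (the file ships without a header row); the file name and column names are assumptions, not taken from the project itself.

```python
import pandas as pd

# Column names commonly given to the Sentiment140 CSV (assumed; the file has no header row)
columns = ["target", "ids", "date", "flag", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=columns)

# Keep only the sentiment label and the tweet text; drop everything else
df = df[["target", "text"]]
```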
Data processing
Data processing is an essential step in sentiment analysis, which involves the analysis of the subjective information in text data.
Text cleaning: This step involves removing unnecessary elements from the text data such as special characters, punctuation, stop words, and numbers. Text cleaning also involves converting all the text to lowercase, removing any HTML tags, and reducing words to their root forms by stemming.
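Here is a minimal text-cleaning sketch using NLTK for stop words and stemming (the project may clean the text differently); the sample texts and column names are illustrative.

```python
import re

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

nltk.download("stopwords", quiet=True)  # fetch the stop-word list once

stop_words = set(stopwords.words("english"))
stemmer = PorterStemmer()

def clean_text(text: str) -> str:
    text = text.lower()                    # convert to lowercase
    text = re.sub(r"<.*?>", " ", text)     # remove HTML tags
    text = re.sub(r"[^a-z\s]", " ", text)  # remove numbers, punctuation, special characters
    words = [w for w in text.split() if w not in stop_words]  # remove stop words
    return " ".join(stemmer.stem(w) for w in words)           # reduce words to their stems

df = pd.DataFrame({"text": ["I LOVED this product!!! 10/10",
                            "Worst purchase <br> ever..."]})
df["clean_text"] = df["text"].apply(clean_text)
print(df["clean_text"])
```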
Tokenization: Tokenization is the process of splitting the text into smaller chunks called tokens. Each token represents a single word or a group of words that convey a particular meaning.
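A short tokenization sketch with NLTK's word_tokenize (one common option; the sentence is illustrative):

```python
import nltk
from nltk.tokenize import word_tokenize

# word_tokenize needs the Punkt tokenizer data; the package name varies by NLTK version
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

tokens = word_tokenize("the battery life is great but the screen scratches easily")
print(tokens)
# ['the', 'battery', 'life', 'is', 'great', 'but', 'the', 'screen', 'scratches', 'easily']
```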
Calculating the polarity of the text data
This involves determining the overall sentiment of a piece of text as positive, negative, or neutral.
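One common way to score polarity is with TextBlob, shown below as an illustration (the project may use a different library); polarity ranges from -1 (most negative) to 1 (most positive).

```python
from textblob import TextBlob

def label_sentiment(text: str) -> str:
    polarity = TextBlob(text).sentiment.polarity  # value between -1 and 1
    if polarity > 0:
        return "positive"
    if polarity < 0:
        return "negative"
    return "neutral"

print(label_sentiment("I absolutely love this phone"))                   # positive
print(label_sentiment("The delivery was late and the box was damaged"))  # negative
```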
Word cloud
A word cloud is a graphical representation of text data, where the size of each word is proportional to its frequency in the text. Word clouds are often used in sentiment analysis to visualize the most commonly used words in the text and to identify the overall sentiment of the text.
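A quick sketch with the wordcloud and matplotlib packages; all_text is an illustrative string standing in for the concatenated tweets.

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

all_text = "great battery great camera poor screen great price slow updates"

cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)

plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```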
Bigram model
A bigram model is a type of language model that analyzes the frequency of occurrence of pairs of words (bigrams) in a piece of text. In sentiment analysis, bigram models can be used to identify common phrases or expressions that are associated with positive or negative sentiments.
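A short sketch of counting bigrams with scikit-learn's CountVectorizer; the sample reviews are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer

reviews = [
    "battery life is great",
    "battery life is terrible",
    "great camera but battery life is short",
]

vectorizer = CountVectorizer(ngram_range=(2, 2))  # pairs of consecutive words only
counts = vectorizer.fit_transform(reviews)

# Total frequency of each bigram across the corpus, most common first
totals = counts.sum(axis=0).A1
for bigram, total in sorted(zip(vectorizer.get_feature_names_out(), totals),
                            key=lambda pair: -pair[1]):
    print(bigram, int(total))
```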
Building the model
Splitting the data into training and testing subsets.
A typical train/test split would be to use 70% of the data for training and 30% of the data for testing.
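A minimal end-to-end sketch with scikit-learn: a 70/30 split, TF-IDF features, and a logistic regression classifier. The tiny dataset, feature choice, and classifier are illustrative assumptions, not necessarily what the project uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

texts = ["love it", "hate it", "great product", "terrible product",
         "works well", "broke immediately", "highly recommend", "waste of money"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

# 70% of the data for training, 30% for testing
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.3, random_state=42
)

vectorizer = TfidfVectorizer()
model = LogisticRegression()
model.fit(vectorizer.fit_transform(X_train), y_train)

accuracy = model.score(vectorizer.transform(X_test), y_test)
print(f"Test accuracy: {accuracy:.2f}")
```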
Testing/evaluating the model
Metrics
In this section, I'll discuss common metrics used to evaluate models.
**Classification metrics**
When performing classification predictions, there are four types of outcomes that could occur.
True positives are when you predict an observation belongs to a class and it actually does belong to that class.
True negatives are when you predict an observation does not belong to a class and it actually does not belong to that class.
False positives occur when you predict an observation belongs to a class when in reality it does not.
False negatives occur when you predict an observation does not belong to a class when in fact it does.
These four outcomes are often plotted on a confusion matrix as shown in the project in the repository above.
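For illustration, here is how a confusion matrix and related metrics can be produced with scikit-learn; y_true and y_pred are made-up labels, not the project's actual predictions.

```python
from sklearn.metrics import classification_report, confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0]  # actual labels (1 = positive, 0 = negative)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]  # predicted labels

print(confusion_matrix(y_true, y_pred))       # rows: actual class, columns: predicted class
print(classification_report(y_true, y_pred))  # precision, recall, and F1 per class
```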
**Conclusion**
In this article, we discussed using machine learning models to extract information from textual data. This knowledge may then be used to inform business choices, such as the direction of the company or even investment plans. Using sentiment analysis methods, we also looked at how these machine learning models work and the kind of information that can be obtained from such textual data.