The advent of the internet has revolutionized how people communicate their thoughts and opinions. Today, millions of people share their daily lives and express their emotions on social media platforms such as Facebook and Twitter.
A significant amount of sentiment-rich data is generated on social media in the form of tweets, status updates, blog posts, comments, reviews, and so on. This data drives analysts to uncover insights and patterns through sentiment analysis.
Sentiment Analysis
Sentiment analysis is the process of extracting emotions from a user's written text: unstructured text is processed, and a model is built to extract knowledge from it. Businesses often use it to detect sentiment in social data, gauge brand reputation, and understand customers.
It combines data mining, machine learning (ML), artificial intelligence, and computational linguistics to mine text for sentiment and subjective information, such as whether the text expresses positive, negative, or neutral feelings.
Different approaches for sentiment analysis
There are various approaches to sentiment analysis of linguistic data; which approach to use depends on the nature of the data and the platform one is working with.
Most research carried out in the field of sentiment analysis employs lexicon-based analysis or machine learning techniques.
Lexicon-Based approach
Also known as the dictionary-based approach, it classifies sentiment data using lexical databases such as SentiWordNet and WordNet.
Each word in the sentence or document is looked up in the lexical database and annotated with a score. The polarity of the text is then derived from this set of weighted words, aggregating the individual scores into an overall sentiment for the text.
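As a rough illustration, here is a minimal sketch of lexicon-based scoring using NLTK's SentiWordNet interface; the per-word averaging heuristic below is our own simplification for illustration, not a standard algorithm:

```python
# Minimal lexicon-based scoring sketch using NLTK's SentiWordNet interface.
# The averaging heuristic is a simplification for illustration only.
import nltk
from nltk.corpus import sentiwordnet as swn

nltk.download("wordnet")
nltk.download("sentiwordnet")

def word_polarity(word):
    """Average (positive - negative) score across all senses of a word."""
    synsets = list(swn.senti_synsets(word))
    if not synsets:
        return 0.0
    return sum(s.pos_score() - s.neg_score() for s in synsets) / len(synsets)

sentence = "the movie was surprisingly good"
score = sum(word_polarity(w) for w in sentence.split())
print("positive" if score > 0 else "negative" if score < 0 else "neutral")
```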
Machine Learning approach
In this approach, the words in a sentence are represented as vectors and analyzed with machine learning algorithms such as Naïve Bayes, Support Vector Machines (SVM), and Maximum Entropy.
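As a sketch of what this looks like in practice (using scikit-learn, with toy data invented purely for illustration), tweets can be turned into bag-of-words vectors and fed to a Naïve Bayes classifier:

```python
# Bag-of-words + Naive Bayes sketch with scikit-learn (toy data for illustration).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = ["I love this phone", "worst service ever",
         "great experience", "I hate waiting"]
labels = ["positive", "negative", "positive", "negative"]

# Vectorize the texts, then fit a Multinomial Naive Bayes classifier.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)
print(model.predict(["what a great day"]))
```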
In this article, we will use the Sentiment140 dataset, available at:
https://www.kaggle.com/datasets/kazanova/sentiment140
The data consists of sentiments expressed by users in tweets. Each tweet is a record, classified as either positive or negative.
The data is filtered and analyzed using natural language processing techniques, and sentiment polarity is calculated from the emotion words detected in the tweets. The approach is implemented in the Python programming language with the Natural Language Toolkit (NLTK).
Text-Preprocessing
Natural Language Processing (NLP) is a branch of data science that deals with text data. Text data is unstructured and therefore needs extensive preprocessing.
Some steps of the preprocessing are:
Lower casing
Removing Hyperlinks
Removing punctuations
Removing Stop words
Tokenization
Stemming
Lemmatization
Let's start by loading the data!
Our columns of interest are those of unstructured textual tweets and sentiment. Therefore, the rest of the columns are dropped.
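A sketch of loading the Kaggle CSV with pandas; the file name and column names follow the Sentiment140 description, so adjust the path to wherever you saved the download:

```python
import pandas as pd

# Sentiment140 ships without a header row, so we name the columns ourselves.
cols = ["sentiment", "id", "date", "query", "user", "text"]
df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                 encoding="latin-1", names=cols)

# Keep only the sentiment label and the raw tweet text.
df = df[["sentiment", "text"]]
print(df.head())
```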
Lowercase all the tweets
The first step is transforming the tweets to lowercase to keep the text consistent across the subsequent NLP and text-mining tasks.
For example, 'Nation' and 'nation' would otherwise be treated as two different words; lowercasing every word in the tweets avoids this duplication.
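With the DataFrame from the previous step, this is a one-liner:

```python
# Lowercase every tweet in a single vectorized step.
df["text"] = df["text"].str.lower()
```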
Remove Hyperlinks
Hyperlinks are very common in tweets and add no information relevant to our sentiment analysis problem.
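A simple regular expression handles both http(s) and www-style links:

```python
# Strip http(s):// and www. links from the tweets.
df["text"] = df["text"].str.replace(r"https?://\S+|www\.\S+", "", regex=True)
```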
Remove Punctuations
For most NLP problems, punctuation provides no additional linguistic information and is generally removed.
Punctuation symbols are likewise not crucial for sentiment analysis; they are redundant, and removing them before text modelling is highly recommended.
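One straightforward way is a translation table built from Python's string.punctuation:

```python
import string

# Delete every punctuation character listed in string.punctuation.
df["text"] = df["text"].str.translate(str.maketrans("", "", string.punctuation))
```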
Remove Stop words
Stop words are English words that do not add much meaning to a sentence. They are removed as they do not add value to the analysis.
The NLTK library provides a list of words that are considered stop words for the English language. Some of them are: [i, me, my, myself, we, our, ours]
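A sketch of filtering them out of each tweet using NLTK's stop word list:

```python
import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")
stop_words = set(stopwords.words("english"))

# Keep only the words that are not in NLTK's English stop word list.
df["text"] = df["text"].apply(
    lambda tweet: " ".join(w for w in tweet.split() if w not in stop_words)
)
```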
Tokenization
This refers to splitting a larger body of text into smaller units such as sentences and words. These pieces are called tokens (word tokens or sentence tokens). Tokens help in understanding the context and in building a vocabulary.
Below is an example of a string of data:
"What is your favourite food joint?"
In order for this sentence to be understood by a machine, tokenization is performed on the string to break it into individual parts.
Tokens:
"What" "is" "your" "favourite" "food" "joint" "?"
Code sample:
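Here is one way to do it with NLTK's word_tokenize (the punkt tokenizer models are assumed to be downloaded):

```python
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")

print(word_tokenize("What is your favourite food joint?"))
# ['What', 'is', 'your', 'favourite', 'food', 'joint', '?']

# Tokenize every tweet, keeping the tokens in a new column.
df["tokens"] = df["text"].apply(word_tokenize)
```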
Stemming and Lemmatization
These are text normalization techniques used in Natural Language Processing. Both processes aim to reduce a word to a common base or root form.
Stemming
This is the process of reducing a word to its stem by stripping common suffixes such as -ing, -ed, and -es.
Pros: Faster to execute on large datasets.
Cons: It may produce meaningless words.
Lemmatization
The process of reducing a word to its lemma (its dictionary form) through linguistic analysis of the word.
Pros: Preserves the meaning of the word after extracting its root.
Cons: Computationally expensive.
Lemmatization is almost always preferred over stemming unless there is a need for very fast execution on a massive corpus of text data.
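A small side-by-side sketch makes the trade-off visible (pos="v" tells the lemmatizer to treat each word as a verb):

```python
import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet")

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["studies", "studying", "caring"]:
    # Stemming may yield non-words (e.g. "studi"); lemmatization returns real words.
    print(word, "->", stemmer.stem(word), "|", lemmatizer.lemmatize(word, pos="v"))
```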
Applying lemmatization to the tweets:
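A sketch that builds on the tokens column created in the tokenization step:

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()

# Lemmatize each token and join the tweet back into a single string.
df["text"] = df["tokens"].apply(
    lambda tokens: " ".join(lemmatizer.lemmatize(t) for t in tokens)
)
```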
Text Exploratory Analysis
First, let's analyze the text length for the different sentiments.
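A quick sketch with matplotlib; in Sentiment140 the label 0 marks negative tweets and 4 marks positive ones:

```python
import matplotlib.pyplot as plt

# Character length of each (preprocessed) tweet.
df["length"] = df["text"].str.len()

# Overlay the length distributions of the two sentiment classes.
df.loc[df["sentiment"] == 0, "length"].plot(kind="hist", alpha=0.5, label="negative")
df.loc[df["sentiment"] == 4, "length"].plot(kind="hist", alpha=0.5, label="positive")
plt.xlabel("Tweet length (characters)")
plt.legend()
plt.show()
```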
Word Cloud
A word cloud is a graphical representation of word frequency: the larger a word appears in the visualization, the more frequently it occurred in the document(s).
Word cloud for positive tweets in our dataset:
Word cloud for negative tweets in our dataset:
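Both clouds can be generated with the third-party wordcloud package (pip install wordcloud); swapping the label filter from 4 to 0 switches between positive and negative tweets:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Concatenate all positive tweets (label 4) into one document.
positive_text = " ".join(df.loc[df["sentiment"] == 4, "text"])

# Build and display the word cloud.
cloud = WordCloud(width=800, height=400, background_color="white").generate(positive_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```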