DEV Community



Getting started with Sentiment Analysis


Sentiment analysis is a technique for determining the emotional tone behind a piece of text. For example, a business can use sentiment analysis to classify reviews as positive, negative or neutral.

Analysing online reviews reveals the sentiment behind each one and makes it possible to identify the themes that reviewers mention most often. Based on these insights, a business, an organisation or an individual can make informed decisions about their operations.

Advances in technology have made it possible for systems to learn how to do tasks through Artificial Intelligence (AI). A system can therefore also be taught to perform sentiment analysis, which removes the need for a human to analyse the data repetitively.

In this article, we will briefly go over how to get a computer to perform sentiment analysis by itself using machine learning algorithms.


To do this, we will use Sentiment140, a collection of about 1.6 million tweets that is hosted on Kaggle.

The tweets in the dataset were collected in 2009 using the Twitter API (the timestamps in the file run from April to June 2009) and were labeled with sentiment polarity using the emoticons present in the tweets. For instance, tweets with positive emoticons like :) were labeled as positive and tweets with negative emoticons like :( were labeled as negative. The labeling scheme uses 0 for negative, 2 for neutral and 4 for positive, but as far as I can tell the training file used here contains only the values 0 and 4.

The Sentiment140 dataset is commonly used in research and industry for sentiment analysis tasks due to its large size and labeled sentiment polarity. Researchers and practitioners can use this dataset to develop and evaluate machine learning models for sentiment analysis tasks, such as sentiment classification or sentiment regression.
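The emoticon-based labeling described above can be sketched roughly as follows. This is only an illustration: the emoticon lists and the `label_tweet` and `strip_emoticons` helpers are hypothetical, not the exact rules used to build Sentiment140, and tweets without a clear signal are simply left unlabeled here.

```python
# Hypothetical sketch of emoticon-based labeling.
# The emoticon lists and helper names are illustrative assumptions,
# not the exact rules used to build Sentiment140.
POSITIVE_EMOTICONS = (":)", ":-)", ":D", "=)")
NEGATIVE_EMOTICONS = (":(", ":-(")

NEGATIVE, POSITIVE = 0, 4  # label values used in the dataset's target column


def label_tweet(text):
    """Return a polarity label from emoticons, or None if there is no clear signal."""
    has_pos = any(e in text for e in POSITIVE_EMOTICONS)
    has_neg = any(e in text for e in NEGATIVE_EMOTICONS)
    if has_pos == has_neg:  # no emoticon at all, or both kinds present
        return None
    return POSITIVE if has_pos else NEGATIVE


def strip_emoticons(text):
    """Remove the emoticons so a model cannot simply memorize them."""
    for e in POSITIVE_EMOTICONS + NEGATIVE_EMOTICONS:
        text = text.replace(e, "")
    return " ".join(text.split())


for tweet in ["Loving the new phone :)", "Missed my bus again :(", "Just landed"]:
    print(label_tweet(tweet), "|", strip_emoticons(tweet))
```

The file name of the training set (`noemoticon`) suggests that the emoticons were stripped after labeling in a similar way, so the model has to learn from the words themselves.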


As with any data science project, the analysis follows a general sequence of steps. In this case, the steps are:

1. Data Collection:

Instead of downloading the data to the local machine, the dataset will be extracted from Kaggle directly into Colab where the analysis will happen.

Authenticating the Kaggle API client

import os

# Get the username and key from your Kaggle account
os.environ['KAGGLE_USERNAME'] = "username"
os.environ['KAGGLE_KEY'] = "key"

Download and unzip the dataset from Kaggle

!kaggle datasets download -d kazanova/sentiment140

# Unzip the downloaded dataset
!unzip sentiment140

Load the downloaded dataset

import pandas as pd
tweets_df = pd.read_csv('training.1600000.processed.noemoticon.csv', encoding='latin-1', header=None)  # the CSV has no header row

Loaded DataFrame

2. Data Pre-Processing:

The next step is to preprocess the data by cleaning it and converting it into a structured format that can be used for analysis.

# Assign a list of column names through the DataFrame's .columns attribute
tweets_df.columns = ['target', 'id', 'date', 'flag', 'user', 'text']

Column Headers Added

Pre-process the text column using regular expressions to remove elements such as punctuation, special characters, URLs, hashtag symbols, stop words and usernames, and convert everything to lowercase.

Before making any structural changes to the dataset, I created a copy of the original DataFrame and worked on the copy.

# import NLTK, Natural Language Toolkit, library
# This library provides good tools for loading and cleaning text
import nltk
import re
from nltk.corpus import stopwords

# Download the stop-word lists (only needed once per environment)
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

# Work on a copy so that the original DataFrame stays untouched
tweets_cp = tweets_df.copy()

# define a function to implement the pre-processing & cleaning of the text data
def clean_text(text):
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@[^\s]+', '', text)  # Remove usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)  # Strip the '#' but keep the hashtag word
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation
    text = text.lower()  # Convert to lowercase
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Remove stopwords
    return text

# Apply the above clean_text function to the text column values
# Drop the text column after adding the clean_text column to the dataframe
tweets_cp['clean_text'] = tweets_cp['text'].apply(clean_text)
tweets_cp = tweets_cp.drop(columns=['text'])  # drop() returns a new DataFrame, so reassign it

Cleaned Text
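To see what the cleaning steps do to a single tweet, here is a small standalone sketch. It repeats the same regular expressions, but `clean_text_demo`, the sample tweet and the tiny `STOP_WORDS` set are assumptions made for illustration so that it runs without downloading the NLTK stop-word list.

```python
import re

# Assumption: a tiny hand-picked stop-word set stands in for NLTK's English
# list so that the sketch runs without downloading anything.
STOP_WORDS = {"i", "the", "a", "is", "to", "and", "my", "at", "so"}


def clean_text_demo(text):
    text = re.sub(r'http\S+', '', text)       # drop URLs
    text = re.sub(r'@[^\s]+', '', text)       # drop @usernames
    text = re.sub(r'#([^\s]+)', r'\1', text)  # keep the hashtag word, drop '#'
    text = re.sub(r'[^\w\s]', '', text)       # drop punctuation
    text = text.lower()
    return ' '.join(w for w in text.split() if w not in STOP_WORDS)


sample = "@bob I LOVE the new #Python release!!! http://example.com"
print(clean_text_demo(sample))  # love new python release
```

Note that punctuation is removed before the stop-word check, so a contraction like "don't" becomes "dont" and would not match a stop-word list that stores it with the apostrophe.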

3. Feature Extraction

After preprocessing, the cleaned text has to be converted into a numerical format that a model can work with. One common technique is TF-IDF (Term Frequency-Inverse Document Frequency). TF-IDF scores how relevant a word is to a given text: the score rises with how often the word appears in that text and falls with how many texts in the corpus contain it.
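To make the weighting concrete, here is a minimal pure-Python sketch on a made-up three-document corpus. It uses a smoothed IDF, which to my knowledge matches scikit-learn's default formula, but it skips the length normalization that `TfidfVectorizer` also applies, so the numbers will not match the library output exactly.

```python
import math
from collections import Counter

# Made-up corpus for illustration
docs = [
    "good movie",
    "bad movie",
    "good good day",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each word appear?
df = Counter(word for tokens in tokenized for word in set(tokens))


def idf(word):
    # Smoothed inverse document frequency (one common variant)
    return math.log((1 + n_docs) / (1 + df[word])) + 1


def tfidf(tokens):
    # Term frequency (raw count) multiplied by the word's IDF
    counts = Counter(tokens)
    return {word: count * idf(word) for word, count in counts.items()}


for doc, tokens in zip(docs, tokenized):
    print(doc, "->", {w: round(s, 3) for w, s in tfidf(tokens).items()})
```

In the second document, "bad" scores higher than "movie" because "movie" appears in two of the three documents and is therefore less distinctive.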

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Convert the text data into numerical features using TF-IDF
tfidf = TfidfVectorizer(stop_words='english', max_features=5000)
X = tfidf.fit_transform(tweets_cp['clean_text'])

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, tweets_cp['target'], test_size=0.3, random_state=42)

4. Model Selection

The next step is to pick an appropriate machine learning algorithm to classify the sentiment of the tweet text. In this case we will try this with Naive Bayes.
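For intuition about what a Naive Bayes classifier does, here is a toy sketch of the multinomial variant with Laplace smoothing, written in plain Python on raw word counts. The training sentences, the labels and the `predict` helper are made up for illustration, and this is not how scikit-learn implements it internally.

```python
import math
from collections import Counter, defaultdict

# Toy training data; the texts and labels are made up for illustration.
train = [
    ("love this phone", 4),
    ("great day love it", 4),
    ("hate this traffic", 0),
    ("awful day hate it", 0),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # word_counts[label][word] -> count
for text, label in train:
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}


def predict(text, alpha=1.0):
    """Pick the class with the highest log posterior (Laplace smoothing alpha)."""
    scores = {}
    for label in class_counts:
        total = sum(word_counts[label].values())
        score = math.log(class_counts[label] / len(train))  # log prior
        for word in text.split():
            if word not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


print(predict("love this day"))  # 4
print(predict("hate mondays"))   # 0
```

The "naive" part is the assumption that words are independent given the class, which is why the per-word log likelihoods can simply be added up.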

from sklearn.naive_bayes import MultinomialNB

# Create a Naive Bayes classifier
nb = MultinomialNB()

5. Model Training

We will train the model on the labeled training set that we split off in the Feature Extraction step, and then use it to predict labels for the test set.

# Train the Naive Bayes classifier on the training data
nb.fit(X_train, y_train)

# Test the model on the testing data
y_pred = nb.predict(X_test)

6. Model Evaluation

After training the model, we need to evaluate its performance on the test set (30% of the original dataset) that we split off in the Feature Extraction step.

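from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Assumption: target still holds the raw labels (0 = negative, 4 = positive),
# so the positive class is passed explicitly via pos_label
print('Accuracy:', accuracy_score(y_test, y_pred))
print('Precision:', precision_score(y_test, y_pred, pos_label=4))
print('Recall:', recall_score(y_test, y_pred, pos_label=4))
print('F1-Score:', f1_score(y_test, y_pred, pos_label=4))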

Without any fine-tuning, the model reaches an accuracy of about 75% and a precision of about 76%.

Evaluation Score

Accuracy: 0.7511354166666667
Precision: 0.7564523638210522
Recall: 0.7427183457378064
F1-Score: 0.7495224455818614
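For reference, all four metrics come from the same counts of true and false positives and negatives. The sketch below computes them by hand on a handful of made-up labels; the `binary_metrics` helper and the choice of 4 as the positive label are assumptions for illustration.

```python
def binary_metrics(y_true, y_pred, positive=4):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return accuracy, precision, recall, f1


# Made-up labels for illustration (0 = negative, 4 = positive)
y_true = [4, 4, 4, 0, 0, 0, 4, 0]
y_pred = [4, 4, 0, 0, 0, 4, 4, 0]
print(binary_metrics(y_true, y_pred))
```

Precision answers "of the tweets predicted positive, how many really were?", while recall answers "of the truly positive tweets, how many did the model find?".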

7. Predict

We will now try to predict the sentiment of a new tweet using the model we have trained, tested and evaluated.

new_tweet = 'I hate Mondays'
new_tweet_cleaned = clean_text(new_tweet)
new_tweet_vectorized = tfidf.transform([new_tweet_cleaned])
sentiment = nb.predict(new_tweet_vectorized)[0]
print('Sentiment:', sentiment)

The model predicts that the new tweet has a negative tone (label 0).

Sentiment: 0


Sentiment analysis can help gauge how the outside world feels about a business, a product, a trend and much more. With machine learning models integrated into such analysis, the results can be outstanding. Even a simple model like the one we just built can, with some fine-tuning, inform decision-making at the entity in question.

You can find the model code at this Link.

Why did the sentiment analyst's computer keep crashing? It couldn't handle all the feelings.

Exploring the Possibilities: Let's Collaborate on Your Next Data Venture! You can check me out at this Link
