Sai Vishwa B

Posted on Aug 20

✈️ Model Comparison for Sentiment Analysis Using the Airline Tweet Dataset

Sentiment analysis is a practical way to understand public opinion at scale, whether the goal is gauging customer satisfaction or monitoring brand reputation. In this blog post, we'll walk through a model comparison using the airline tweet dataset, showing how different machine learning models perform on the task of sentiment classification.

📊 Dataset Overview

The airline tweet dataset consists of tweets directed at various airlines. Each tweet is labeled with one of three sentiments: positive, neutral, or negative. This makes it an ideal dataset for sentiment classification tasks. The goal is to build a model that can accurately classify the sentiment of new, unseen tweets.

🛠️ Importing the Required Libraries

Before diving into the analysis, let's import the necessary libraries:

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
nltk.download('stopwords')

🧹 Data Loading and Preprocessing

The first step is to load the dataset and perform some basic preprocessing, such as cleaning the text and converting the labels into a format suitable for modeling.
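To see what "converting the labels" means in practice, here is a minimal sketch of label encoding on a toy list. The `toy_labels` list is hypothetical and only stands in for the real sentiment column; the point is that `LabelEncoder` assigns integers in alphabetical order of the class names.

```python
from sklearn.preprocessing import LabelEncoder

# Toy labels standing in for the sentiment column (hypothetical data)
toy_labels = ["positive", "negative", "neutral", "negative"]

toy_encoder = LabelEncoder()
toy_encoded = toy_encoder.fit_transform(toy_labels)

# classes_ is sorted alphabetically, so the integer codes follow that order
toy_mapping = {str(cls): idx for idx, cls in enumerate(toy_encoder.classes_)}
print(toy_mapping)           # {'negative': 0, 'neutral': 1, 'positive': 2}
print(toy_encoded.tolist())  # [2, 0, 1, 0]

# inverse_transform recovers the original strings from the integer codes
toy_decoded = toy_encoder.inverse_transform(toy_encoded).tolist()
print(toy_decoded)           # ['positive', 'negative', 'neutral', 'negative']
```

Keeping the mapping around is handy later, when a model's integer predictions need to be read back as sentiment names.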

# Load the dataset
data = pd.read_csv("AirlineTwitterData.csv", encoding = "ISO-8859-1")

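# Keep only the tweet text and its sentiment label
df = data[['text', 'airline_sentiment']].copy()

df.rename(columns = {"text" : "tweet", "airline_sentiment" : "sentiment"}, inplace = True)

# Encode the sentiment labels as integers (LabelEncoder sorts the classes alphabetically)
le = LabelEncoder()
df['sentiment'] = le.fit_transform(df['sentiment'])

# Keep the label-to-integer mapping so predictions can be read back as sentiment names
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))

# Remove English stopwords from each tweet
stop_words = set(stopwords.words('english'))

df['clean_tweet'] = df['tweet'].apply(lambda x : ' '.join([word for word in x.split() if word.lower() not in stop_words]))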

def clean_text(text):
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)  # Remove mentions, hashtags, and URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)  # Remove non-alphanumeric characters
    return text.strip().lower()  # Strip whitespace and convert to lowercase

# Apply the cleaning function to the DataFrame
df['clean_tweet'] = df['clean_tweet'].apply(clean_text)

df.drop('tweet', axis = 1, inplace = True)
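To check what the cleaning steps above actually do, here is a small sketch that runs them on two made-up tweets. It reuses the post's `clean_text` function and the same order of operations (stopwords first, then cleaning). The `toy_stop_words` set and the `preprocess` helper are assumptions for illustration; the post itself uses NLTK's full English stopword list.

```python
import re

# A tiny hand-picked stopword set for illustration only (hypothetical);
# the post uses NLTK's full English list instead
toy_stop_words = {"for", "the", "a", "to", "is"}

def clean_text(text):
    # Remove mentions, hashtags, and URLs
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)
    # Remove non-alphanumeric characters
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)
    # Strip whitespace and convert to lowercase
    return text.strip().lower()

def preprocess(tweet):
    # Same order as the post: drop stopwords first, then clean the text
    no_stops = ' '.join(w for w in tweet.split() if w.lower() not in toy_stop_words)
    return clean_text(no_stops)

sample = "@united Thanks for the GREAT flight! #happy https://t.co/abc123"
print(preprocess(sample))  # thanks great flight

# A stopword glued to punctuation is not caught by the whitespace split
edge_case = "Late again. The, worst!"
print(preprocess(edge_case))  # late again the worst
```

The second example shows a side effect of this ordering: because stopwords are matched before punctuation is stripped, a token like "The," slips through and ends up in the cleaned text as "the". If that matters for your use case, one option is to run the stopword filter after cleaning instead.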

🧠 Feature Extraction

To feed the text data into our machine learning models, we need to convert it into numerical features. We'll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for this purpose.

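# Use unigrams, bigrams, and trigrams, keeping the 10,000 most frequent terms
Tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
# fit_transform returns a sparse matrix; .toarray() converts it to a dense NumPy array
X = Tfidf.fit_transform(df.clean_tweet).toarray()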
y = df['sentiment']
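To make the TF-IDF step more concrete, here is a minimal sketch on a three-tweet toy corpus. The corpus and the `toy_` names are hypothetical; the vectorizer uses the same `ngram_range=(1, 3)` as above but drops `max_features`, since the toy vocabulary is tiny.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the cleaned tweets (hypothetical data)
toy_corpus = ["great flight", "flight was delayed", "great crew"]

# Same n-gram range as the post; max_features is omitted for the toy vocabulary
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3))
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)  # SciPy sparse matrix

# Sorting the vocabulary reproduces the column order of the matrix
toy_features = sorted(toy_vectorizer.vocabulary_)
print(toy_matrix.shape)  # (3, 10)
print(toy_features)

# Rarer terms get a larger IDF weight than terms shared by several tweets
idf = dict(zip(toy_features, toy_vectorizer.idf_))
print(idf["crew"] > idf["flight"])  # True
```

Each row is one tweet and each column is a unigram, bigram, or trigram. Note that the result is sparse by default. Calling `.toarray()` on the full dataset, as done above, produces a dense array with one column per feature, which can take a lot of memory.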

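✂️ Train-Test Split

We'll split the dataset into training and testing sets to evaluate our models. The split comes first so that SMOTE's synthetic samples are generated from the training data only and never leak into the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Handle class imbalance with SMOTE, applied to the training set only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)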
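As an optional variation that is not part of the original pipeline, `train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in both splits. The sketch below uses made-up, imbalanced toy labels (`X_toy`, `y_toy`) to show the effect.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (hypothetical): 70 of class 0, 20 of class 1, 10 of class 2
y_toy = [0] * 70 + [1] * 20 + [2] * 10
X_toy = [[i] for i in range(len(y_toy))]

# stratify=y_toy keeps the 70/20/10 class ratio in both the train and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))            # 80 20
print(sorted(Counter(y_te).items()))   # [(0, 14), (1, 4), (2, 2)]
print(sorted(Counter(y_tr).items()))   # [(0, 56), (1, 16), (2, 8)]
```

With an imbalanced dataset like airline tweets, this helps make sure the test set contains a representative share of the smaller classes.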