In today's data-driven world, sentiment analysis has become a powerful tool for understanding public opinion. Whether it's gauging customer satisfaction or monitoring brand reputation, the ability to analyze textual data at scale is invaluable. In this blog post, we'll walk through a model comparison using the airline tweet dataset, showcasing how different machine learning models perform on the task of sentiment analysis.
📊 Dataset Overview
The airline tweet dataset consists of tweets directed at various airlines. Each tweet is labeled with one of three sentiments: positive, neutral, or negative. This makes it an ideal dataset for sentiment classification tasks. The goal is to build a model that can accurately classify the sentiment of new, unseen tweets.
🛠️ Importing the Required Libraries
Before diving into the analysis, let's import the necessary libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
nltk.download('stopwords')
🧹 Data Loading and Preprocessing
The first step is to load the dataset and perform some basic preprocessing, such as cleaning the text and converting the labels into a format suitable for modeling.
# Load the dataset
data = pd.read_csv("AirlineTwitterData.csv", encoding = "ISO-8859-1")
df = data[['text', 'airline_sentiment']].copy()
df.rename(columns = {"text" : "tweet", "airline_sentiment" : "sentiment"}, inplace = True)
le = LabelEncoder()
df.sentiment = le.fit_transform(df.sentiment)
label_mapping = dict(zip(le.classes_, range(len(le.classes_))))
stop_words = set(stopwords.words('english'))
df['clean_tweet'] = df['tweet'].apply(lambda x : ' '.join([word for word in x.split() if word.lower() not in stop_words]))
def clean_text(text):
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text) # Remove mentions, hashtags, and URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text) # Remove non-alphanumeric characters
    return text.strip().lower() # Strip whitespace and convert to lowercase
# Apply the cleaning function to the DataFrame
df['clean_tweet'] = df['clean_tweet'].apply(clean_text)
df.drop('tweet', axis = 1, inplace = True)
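To see what the two cleaning steps do to a single tweet, here is a small standalone sketch. The tweet and the tiny `STOP_WORDS` set are made up for illustration (the post itself uses NLTK's full English list); `clean_text` is copied from the code above.

```python
import re

# Hypothetical, tiny stop-word set standing in for NLTK's English list
STOP_WORDS = {"the", "is", "a", "to", "my", "was"}

def remove_stop_words(text):
    # Same idea as the lambda above: drop whitespace-separated tokens
    # whose lowercase form is in the stop list
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)

def clean_text(text):
    # Remove mentions, hashtags, and URLs
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)
    # Remove everything except letters, digits, and whitespace
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)
    return text.strip().lower()

sample = "@AcmeAir my flight was delayed 3 hours!!! #fail http://t.co/abc123"  # made-up tweet
cleaned = clean_text(remove_stop_words(sample))
print(cleaned)

# Stop words are removed before punctuation is stripped, so "the," survives
leftover = clean_text(remove_stop_words("Is the, seat free"))
print(leftover)
```

The second call shows a side effect of the ordering: a stop word with punctuation attached ("the,") does not match the stop list, so it stays in the cleaned text. Whether that matters depends on how noisy the tweets are.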
🧠 Feature Extraction
To feed the text data into our machine learning models, we need to convert it into numerical features. We'll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for this purpose.
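To make the weighting concrete, here is a hand-rolled sketch of TF-IDF on three made-up tweets. It assumes the defaults that scikit-learn documents for `TfidfVectorizer` (raw term counts, smoothed IDF, L2 normalization) and, for brevity, uses single words only; the names `doc_freq`, `idf`, and `tfidf_vector` are illustrative, not part of any library.

```python
import math
from collections import Counter

docs = [
    "flight delayed again",
    "great flight great crew",
    "lost my bag",
]  # made-up, already-cleaned tweets

tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
n_docs = len(docs)

# Document frequency: in how many tweets each term appears
doc_freq = {t: sum(t in doc for doc in tokenized) for t in vocab}

# Smoothed inverse document frequency: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_vector(tokens):
    # Term count times IDF, then scale the vector to unit length (L2 norm)
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

vectors = [tfidf_vector(doc) for doc in tokenized]
weights = dict(zip(vocab, vectors[1]))  # weights for "great flight great crew"
print({t: round(w, 3) for t, w in weights.items() if w > 0})
```

In the second tweet, "great" gets the largest weight because it appears twice and only in that tweet, while "flight" is down-weighted because it also appears elsewhere. The post relies on scikit-learn's `TfidfVectorizer` to do this at scale, with up to 10,000 features over unigrams, bigrams, and trigrams: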
Tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X = Tfidf.fit_transform(df.clean_tweet).toarray()
y = df['sentiment']
✂️ Train-Test Split
We'll split the dataset into training and testing sets to evaluate our models, then oversample only the training set with SMOTE so that the test set contains nothing but real tweets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Handle class imbalance with SMOTE on the training data only
smote = SMOTE(random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
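For intuition about what SMOTE adds to the data, here is a simplified sketch of its core idea: each synthetic sample is placed on the line segment between a minority-class point and one of its nearest minority-class neighbours. This is an illustration under that assumption, not imbalanced-learn's implementation; `smote_like_samples` and the toy points are hypothetical.

```python
import math
import random

def smote_like_samples(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority points
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 2-D feature vectors
new_points = smote_like_samples(minority, n_new=5)
for point in new_points:
    print([round(v, 3) for v in point])
```

Because the new points are blends of existing ones, resampling before the split would let information from test tweets shape the training data, which is one reason to resample the training portion only.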
🆚 Model Comparison
We'll compare the performance of several machine learning models:
- Logistic Regression
- Naive Bayes
- Random Forest
- K-Nearest Neighbors
- Decision Tree
- XGBoost
Each of these models has its strengths and weaknesses, making it crucial to evaluate them on the same dataset.
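One way to keep the comparison consistent is to run every model through the same fit, predict, and score loop. The sketch below shows that pattern with two hypothetical stand-in classes (`MajorityBaseline` and `OneNearestNeighbour`) and toy data, so it runs without any dependencies; `compare_models` is an illustrative helper, not part of the original project.

```python
from collections import Counter

class MajorityBaseline:
    """Hypothetical stand-in model: always predicts the most common training label."""
    def fit(self, X, y):
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]

class OneNearestNeighbour:
    """Hypothetical stand-in model: copies the label of the closest training row."""
    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        def sq_dist(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))
        return [
            self.y_[min(range(len(self.X_)), key=lambda i: sq_dist(row, self.X_[i]))]
            for row in X
        ]

def compare_models(models, X_train, y_train, X_test, y_test):
    """Fit every model on the same split and return {name: accuracy}."""
    scores = {}
    for name, model in models.items():
        preds = model.fit(X_train, y_train).predict(X_test)
        scores[name] = sum(p == t for p, t in zip(preds, y_test)) / len(y_test)
    return scores

# Toy data: label 0 clusters near the origin, label 1 near (5, 5)
toy_X_train = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
toy_y_train = [0, 0, 0, 1, 1]
toy_X_test = [[0.5, 0.5], [5.5, 5.5], [4, 5], [1, 1]]
toy_y_test = [0, 1, 1, 0]

scores = compare_models(
    {"Majority baseline": MajorityBaseline(), "1-NN": OneNearestNeighbour()},
    toy_X_train, toy_y_train, toy_X_test, toy_y_test,
)
for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2f}")
```

The same loop should work with the scikit-learn and XGBoost estimators used in this post, since their `fit` methods return the estimator itself. A majority-class baseline is also a useful reference point: a model that barely beats it has learned little.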
Logistic Regression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
Naive Bayes
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))
Random Forest
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
K-Nearest Neighbors
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("K-Nearest Neighbors Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
Decision Tree
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))
XGBoost
xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
📈 Results and Discussion
After running each model, we can compare their performance based on accuracy, precision, recall, and F1-score. Ensemble models like XGBoost and Random Forest may outperform simpler models like Naive Bayes, but this depends on the dataset and the specific task, so it is worth checking the per-class numbers rather than accuracy alone. The accuracy comparison graph can be seen below.
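To show how these metrics relate, here is a small sketch that computes them from raw counts. The labels are made up (0 = negative, 1 = neutral, 2 = positive, which is the order `LabelEncoder` assigns because it sorts class names), and `per_class_metrics` is an illustrative helper; in the post, `classification_report` produces the same kind of numbers.

```python
def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f1)} computed from raw counts."""
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = (precision, recall, f1)
    return metrics

# Made-up predictions from a model that leans heavily towards "negative"
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 2, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = per_class_metrics(y_true, y_pred)
macro_f1 = sum(f1 for _, _, f1 in metrics.values()) / len(metrics)

print(f"accuracy: {accuracy:.2f}")
for label, (p, r, f1) in metrics.items():
    print(f"class {label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"macro F1: {macro_f1:.2f}")
```

In this toy case the accuracy looks reasonable even though the neutral class is never predicted, which the per-class recall and the macro-averaged F1 make visible. That is why the comparison benefits from looking beyond a single accuracy figure.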
The complete Airline Twitter Sentiment Analysis project is available in my GitHub repo: https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis
The app is also live: https://airline-twittersentimentanalysis-1.onrender.com/