<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sai Vishwa B</title>
    <description>The latest articles on DEV Community by Sai Vishwa B (@saivishwa).</description>
    <link>https://dev.to/saivishwa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1859994%2Fc998342b-0202-4398-a42a-082e663dd13c.jpg</url>
      <title>DEV Community: Sai Vishwa B</title>
      <link>https://dev.to/saivishwa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saivishwa"/>
    <language>en</language>
    <item>
      <title>✈️ Model Comparison for Sentiment Analysis Using the Airline Tweet Dataset</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Tue, 20 Aug 2024 11:54:10 +0000</pubDate>
      <link>https://dev.to/saivishwa/model-comparison-for-sentiment-analysis-using-the-airline-tweet-dataset-3b4g</link>
      <guid>https://dev.to/saivishwa/model-comparison-for-sentiment-analysis-using-the-airline-tweet-dataset-3b4g</guid>
      <description>&lt;p&gt;In today's data-driven world, sentiment analysis has become a powerful tool for understanding public opinion. Whether it's gauging customer satisfaction or monitoring brand reputation, the ability to analyze textual data at scale is invaluable. In this blog post, we'll walk through a model comparison using the airline tweet dataset, showcasing how different machine learning models perform on the task of sentiment analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Dataset Overview
&lt;/h2&gt;

&lt;p&gt;The airline tweet dataset consists of tweets directed at various airlines. Each tweet is labeled with one of three sentiments: positive, neutral, or negative. This makes it an ideal dataset for sentiment classification tasks. The goal is to build a model that can accurately classify the sentiment of new, unseen tweets.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Importing the Required Libraries
&lt;/h2&gt;

&lt;p&gt;Before diving into the analysis, let's import the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
nltk.download('stopwords')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Data Loading and Preprocessing
&lt;/h2&gt;

&lt;p&gt;The first step is to load the dataset and perform some basic preprocessing, such as cleaning the text and converting the labels into a format suitable for modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load the dataset
data = pd.read_csv("AirlineTwitterData.csv", encoding = "ISO-8859-1")

df = data[['text', 'airline_sentiment']].copy()

df.rename(columns = {"text" : "tweet", "airline_sentiment" : "sentiment"}, inplace = True)

le = LabelEncoder()
df.sentiment = le.fit_transform(df.sentiment)

label_mapping = dict(zip(le.classes_, range(len(le.classes_))))

stop_words = set(stopwords.words('english'))

df['clean_tweet'] = df['tweet'].apply(lambda x : ' '.join([word for word in x.split() if word.lower() not in stop_words]))

def clean_text(text):
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)  # Remove mentions, hashtags, and URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)  # Remove non-alphanumeric characters
    return text.strip().lower()  # Strip whitespace and convert to lowercase

# Apply the cleaning function to the DataFrame
df['clean_tweet'] = df['clean_tweet'].apply(clean_text)

df.drop('tweet', axis = 1, inplace = True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧠 Feature Extraction
&lt;/h2&gt;

&lt;p&gt;To feed the text data into our machine learning models, we need to convert it into numerical features. We'll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for this purpose.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X = Tfidf.fit_transform(df.clean_tweet).toarray()
y = df['sentiment']

# Handling class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✂️ Train-Test Split
&lt;/h2&gt;

&lt;p&gt;We'll split the dataset into training and testing sets to evaluate our models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🆚 Model Comparison
&lt;/h2&gt;

&lt;p&gt;We'll compare the performance of several machine learning models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;Naive Bayes&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors&lt;/li&gt;
&lt;li&gt;Decision Tree&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these models has its strengths and weaknesses, making it crucial to evaluate them on the same dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logistic Regression
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld1of6x4p8tbdda64wwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld1of6x4p8tbdda64wwk.png" alt="Image description" width="614" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Naive Bayes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ir1vf1xthlkir0da6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ir1vf1xthlkir0da6q.png" alt="Image description" width="642" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fium0vvmfwhx2vksk03zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fium0vvmfwhx2vksk03zz.png" alt="Image description" width="622" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  K-Nearest Neighbors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("K-Nearest Neighbors Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5nqy13inasejpmylf5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5nqy13inasejpmylf5k.png" alt="Image description" width="604" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxdl09b9b8te9ouwzhlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxdl09b9b8te9ouwzhlp.png" alt="Image description" width="619" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6m7pyxofaugt993p2zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6m7pyxofaugt993p2zq.png" alt="Image description" width="596" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Results and Discussion
&lt;/h2&gt;

&lt;p&gt;After running each model, we can compare their performance based on accuracy, precision, recall, and F1-score. Typically, ensemble models like XGBoost and Random Forest outperform simpler models like Naive Bayes, though this depends on the dataset and the specific task. The accuracy comparison graph can be seen below.&lt;/p&gt;
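A comparison chart like the one below can be generated in a few lines of matplotlib. The scores here are placeholders; substitute the `accuracy_score` values printed by each model above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; in a notebook, plt.show() works instead
import matplotlib.pyplot as plt

# Placeholder scores -- substitute the accuracy_score values printed above
accuracies = {
    "Logistic Regression": 0.79,
    "Naive Bayes": 0.75,
    "Random Forest": 0.85,
    "KNN": 0.72,
    "Decision Tree": 0.78,
    "XGBoost": 0.86,
}

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(list(accuracies.keys()), list(accuracies.values()), color="steelblue")
ax.set_ylabel("Accuracy")
ax.set_ylim(0, 1)
ax.set_title("Model Accuracy Comparison")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("model_accuracy_comparison.png")
```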

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbpdswisb1dx8h0ydwt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbpdswisb1dx8h0ydwt2.png" alt="Image description" width="768" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the complete Airline Twitter Sentiment Analysis project, take a look at my GitHub repo: &lt;a href="https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis" rel="noopener noreferrer"&gt;https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app is also live: &lt;a href="https://airline-twittersentimentanalysis-1.onrender.com/" rel="noopener noreferrer"&gt;https://airline-twittersentimentanalysis-1.onrender.com/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dive into the Top 25 Movies with a Flask App: From Classics to Modern Blockbusters!</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Wed, 14 Aug 2024 12:09:12 +0000</pubDate>
      <link>https://dev.to/saivishwa/dive-into-the-top-25-movies-with-a-flask-app-from-classics-to-modern-blockbusters-10a3</link>
      <guid>https://dev.to/saivishwa/dive-into-the-top-25-movies-with-a-flask-app-from-classics-to-modern-blockbusters-10a3</guid>
      <description>&lt;p&gt;Ever found yourself endlessly scrolling through IMDB, trying to decide which movie to watch? What if you could just type in a year and instantly get the top 25 movies? Well, now you can! Welcome to the &lt;strong&gt;Top 25 Movies Flask App&lt;/strong&gt;—a fun and simple way to explore the best movies from any year. 🎬&lt;/p&gt;

&lt;p&gt;Check the live page: &lt;a href="https://top25moviez-kp5g.onrender.com/" rel="noopener noreferrer"&gt;https://top25moviez-kp5g.onrender.com/&lt;/a&gt;&lt;br&gt;
GitHub repo: &lt;a href="https://github.com/SaiVishwa021/Top25Movies" rel="noopener noreferrer"&gt;https://github.com/SaiVishwa021/Top25Movies&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Web Scraping?
&lt;/h2&gt;

&lt;p&gt;Before we dive into the app, let's talk a bit about web scraping. Imagine you're on a treasure hunt, but instead of gold, you're after data hidden within web pages. Web scraping is just that—a technique to fetch and extract data from websites, simulating human browsing behavior. It’s like being a data detective! 🕵️‍♂️&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Top 25 Movies Flask App Works
&lt;/h2&gt;

&lt;p&gt;Here’s a sneak peek into how the magic happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send HTTP Request&lt;/strong&gt;: The app sends an HTTP GET request to IMDB to fetch the top movies for a specific year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parse HTML Content&lt;/strong&gt;: Once the response is received, BeautifulSoup steps in. This powerful library parses the HTML, allowing us to navigate through the web page’s structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract Movie Information&lt;/strong&gt;: We then hunt for movie elements within the HTML, grabbing juicy details like movie titles, ranks, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save Data&lt;/strong&gt;: After collecting all the movie info, it’s saved neatly into a CSV file in the dataset folder. Now, you’ve got a handy reference list of top movies!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
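Steps 1-3 depend on IMDB's markup, which changes over time, so the exact selectors live in the repo. Step 4, however, is plain Python; a minimal sketch of the save step, with hypothetical rows standing in for what step 3 extracts:

```python
import csv
from pathlib import Path

# Hypothetical rows standing in for what step 3 extracts: (rank, title, year)
movies = [
    (1, "The Shawshank Redemption", 1994),
    (2, "The Godfather", 1972),
    (3, "The Dark Knight", 2008),
]

# Step 4: save the collected movie info into the dataset folder
out_dir = Path("dataset")
out_dir.mkdir(exist_ok=True)
out_file = out_dir / "top_25_movies_demo.csv"

with out_file.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title", "year"])
    writer.writerows(movies)

print(f"Saved {len(movies)} movies to {out_file}")
```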
&lt;h2&gt;
  
  
  Features You’ll Love
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrape movie data from IMDB from 1950 onwards.&lt;/strong&gt; Want to explore the classics or check out modern blockbusters? We’ve got you covered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save the movie info&lt;/strong&gt; to your very own dataset folder for further analysis or just for fun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple and intuitive user interface.&lt;/strong&gt; No need to be a tech wizard; it’s user-friendly!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation: Get the App Running in No Time
&lt;/h2&gt;

&lt;p&gt;Ready to explore the top movies? Here’s how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repository:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/SaiVishwa021/Top25Movies.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Install the Required Packages&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To get started, install the necessary packages by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run the Flask application:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To fetch the top 25 movies for the year 2023, just enter 2023 in the input field and click "Submit". The app will not only display the top 25 movies but also save the data to dataset/top_25_movies_2023.csv. 🎉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feel free to share your thoughts or add more features to make this app even cooler. Happy movie hunting! 🍿&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Predicting Customer Churn with XGBoost: A Comprehensive Guide🚀</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Thu, 08 Aug 2024 10:20:05 +0000</pubDate>
      <link>https://dev.to/saivishwa/predicting-customer-churn-with-xgboost-a-comprehensive-guide-5011</link>
      <guid>https://dev.to/saivishwa/predicting-customer-churn-with-xgboost-a-comprehensive-guide-5011</guid>
      <description>&lt;p&gt;Customer churn prediction is a critical task for businesses, particularly in the banking sector. Identifying customers who are likely to leave allows for proactive retention strategies, potentially saving significant revenue. In this blog post, I'll walk you through my project on predicting customer churn using the XGBoost algorithm, covering everything from data preprocessing to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  📋 Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Project Overview&lt;/li&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Usage&lt;/li&gt;
&lt;li&gt;Model Comparison&lt;/li&gt;
&lt;li&gt;Model Training&lt;/li&gt;
&lt;li&gt;Understanding XGBoost&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📌 Introduction
&lt;/h2&gt;

&lt;p&gt;Customer churn is when customers stop using a company's products or services. Predicting churn helps businesses take proactive measures to retain customers, thus improving long-term profitability. In this project, I used the XGBoost algorithm, known for its efficiency and performance, to build a model for predicting customer churn in a bank.&lt;/p&gt;

&lt;p&gt;For a deeper look, check out the project's &lt;a href="https://github.com/SaiVishwa021/ChurnPredictionUsingXGBoost" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Project Overview
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to build a machine learning model that predicts whether a customer will churn based on various features such as credit score, age, gender, balance, and more. I compared multiple algorithms, including Logistic Regression, Random Forest, KNN, and Naive Bayes, but ultimately chose XGBoost for its superior performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Installation
&lt;/h2&gt;

&lt;p&gt;To run this project, ensure you have Python installed. Clone the repository and install the required packages using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the Flask app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 Usage
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository.&lt;/li&gt;
&lt;li&gt;Place the dataset &lt;code&gt;Churn_Modelling.csv&lt;/code&gt; in the project directory.&lt;/li&gt;
&lt;li&gt;Run the &lt;code&gt;xgb.py&lt;/code&gt; script to train the model.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;app.py&lt;/code&gt; to serve the model and make predictions via a web interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🔍 Model Comparison
&lt;/h2&gt;

&lt;p&gt;In the &lt;code&gt;ChurnPrediction.ipynb&lt;/code&gt; notebook, I compared the performance of five different machine learning algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Logistic Regression&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XGBoost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Random Forest&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;K-Nearest Neighbors (KNN)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Naive Bayes&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This comparison helps in understanding which model performs best for our churn prediction task.&lt;/p&gt;
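The notebook's comparison boils down to a loop over estimators. Here is a sketch with a synthetic dataset standing in for the churn data; XGBoost is left out so the snippet runs with scikit-learn alone, but an `XGBClassifier` slots into the same loop when installed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Churn_Modelling.csv features and target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```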

&lt;h2&gt;
  
  
  🏋️ Model Training
&lt;/h2&gt;

&lt;p&gt;The model training process involves several key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Loading&lt;/strong&gt;: Load the customer data from the CSV file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preprocessing&lt;/strong&gt;: Encode categorical variables, drop unnecessary columns, and split the data into features and target variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Balancing&lt;/strong&gt;: Use SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training&lt;/strong&gt;: Train an XGBoost classifier on the balanced training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: Evaluate the model using classification metrics.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🧠 Understanding XGBoost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  XGBoost (Extreme Gradient Boosting)
&lt;/h3&gt;

&lt;p&gt;XGBoost is a scalable and efficient implementation of gradient boosted decision trees. Here's a brief overview of how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision Trees&lt;/strong&gt;: XGBoost builds an ensemble of decision trees, where each tree is trained to correct the errors of the previous ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Boosting&lt;/strong&gt;: Uses gradient descent to minimize the loss function by adjusting weights. New trees are added sequentially, correcting errors from existing trees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization&lt;/strong&gt;: Includes regularization terms to control overfitting and improve generalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Processing&lt;/strong&gt;: Leverages parallel processing for faster computation.&lt;/li&gt;
&lt;/ul&gt;
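The first two bullets, an ensemble in which each new tree corrects its predecessors, can be illustrated with depth-1 regression stumps in plain NumPy. This is a sketch of gradient boosting with squared loss, not XGBoost's full regularized algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + rng.normal(0, 0.1, size=200)

def fit_stump(X, y):
    """Best single-threshold stump by squared error (a depth-1 'tree')."""
    best_err, best = np.inf, None
    for t in np.quantile(X, np.linspace(0.05, 0.95, 20)):
        mask = X > t
        if mask.all() or (~mask).all():
            continue
        pred = np.where(mask, y[mask].mean(), y[~mask].mean())
        err = ((y - pred) ** 2).sum()
        if best_err > err:
            best_err, best = err, (t, y[~mask].mean(), y[mask].mean())
    return best

def predict_stump(stump, X):
    t, left_mean, right_mean = stump
    return np.where(X > t, right_mean, left_mean)

# Boosting loop: each stump is fit to the residuals of the ensemble so far,
# so every round "corrects the errors of the previous ones"
pred = np.zeros_like(y)
learning_rate = 0.5
errors = []
for _ in range(50):
    residuals = y - pred
    stump = fit_stump(X, residuals)
    pred = pred + learning_rate * predict_stump(stump, X)
    errors.append(((y - pred) ** 2).mean())

print(f"MSE after round 1: {errors[0]:.4f}, after round 50: {errors[-1]:.4f}")
```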

&lt;h2&gt;
  
  
  📈 Results
&lt;/h2&gt;

&lt;p&gt;The model's performance is evaluated using metrics such as precision, recall, F1-score, and accuracy. Below are the classification reports for both the training and test datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification Report
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fc9e6438f-063b-4514-8eba-6108c6e29ddd" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fc9e6438f-063b-4514-8eba-6108c6e29ddd" alt="Training Data Classification Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F59d4ddc1-a995-4f79-aa3d-5899614de08a" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F59d4ddc1-a995-4f79-aa3d-5899614de08a" alt="Test Data Classification Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;In this project, I demonstrated how to predict customer churn using the XGBoost algorithm. By comparing various models and fine-tuning the chosen algorithm, I achieved a high-performance model capable of accurately predicting customer churn. This project highlights the importance of data preprocessing, handling class imbalance, and choosing the right algorithm for the task.&lt;/p&gt;

&lt;p&gt;Feel free to check out the &lt;a href="https://github.com/SaiVishwa021/ChurnPredictionUsingXGBoost" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the complete code and dataset. Happy coding!&lt;/p&gt;




</description>
    </item>
    <item>
      <title>🔍 Comparing and Contrasting Popular Probability Distributions: A Practical Approach 📊</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Thu, 01 Aug 2024 04:39:15 +0000</pubDate>
      <link>https://dev.to/saivishwa/comparing-and-contrasting-popular-probability-distributions-a-practical-approach-1j2i</link>
      <guid>https://dev.to/saivishwa/comparing-and-contrasting-popular-probability-distributions-a-practical-approach-1j2i</guid>
      <description>&lt;p&gt;Understanding different statistical distributions and their properties is crucial for data analysis and modeling. In this blog, we'll explore several types of distributions using Python, including binomial, uniform, and log-normal distributions. We'll use libraries such as NumPy, Matplotlib, and Seaborn for this purpose. Let's dive in! 🚀&lt;/p&gt;

&lt;p&gt;A distribution model is a mathematical function that describes the probability of different outcomes or values in a dataset. It helps to understand the patterns and structure of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Distribution Models are Used in Machine Learning?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding Data&lt;/strong&gt;: Helps in summarizing and describing the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Generation&lt;/strong&gt;: Creates synthetic data for testing algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Assumptions&lt;/strong&gt;: Many algorithms assume specific data distributions (e.g., normal distribution in linear regression).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Engineering&lt;/strong&gt;: Transforms data to meet model assumptions (e.g., using logarithms for skewed data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability-Based Models&lt;/strong&gt;: Used in probabilistic methods like Naive Bayes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;: Helps in evaluating and improving model performance by understanding error distributions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some common distribution models are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bernoulli distribution&lt;/li&gt;
&lt;li&gt;Uniform distribution&lt;/li&gt;
&lt;li&gt;Binomial distribution&lt;/li&gt;
&lt;li&gt;Normal distribution&lt;/li&gt;
&lt;li&gt;Poisson distribution&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🎯 Bernoulli Distribution
&lt;/h2&gt;

&lt;p&gt;Represents the outcome of a single experiment with two possible outcomes: success (1) or failure (0).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi5ydiq6aydtrfp0sj0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi5ydiq6aydtrfp0sj0n.png" alt="Image description" width="397" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two outcomes are complementary: if the probability of success is p, the probability of failure is 1 − p.&lt;/p&gt;

&lt;p&gt;Example: Flipping a coin once. If heads is considered a success (p = 0.5), the probability of getting heads (success) is 0.5, and the probability of getting tails (failure) is also 0.5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Binomial Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of successes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔀 Uniform Distribution
&lt;/h2&gt;

&lt;p&gt;All outcomes are equally likely(each outcome of an experiment has an equal probability of occurring) within a certain range.&lt;/p&gt;

&lt;p&gt;For discrete values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs03motzk18f376b614d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs03motzk18f376b614d.png" alt="Image description" width="370" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where n is the number of possible outcomes.&lt;/p&gt;

&lt;p&gt;For continuous values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8902i2bvg4mt1gt4qj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8902i2bvg4mt1gt4qj.png" alt="Image description" width="675" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where a and b are the minimum and maximum values.&lt;/p&gt;

&lt;p&gt;For example, rolling a fair six-sided die: each number (1 through 6) has an equal probability of 1/6.&lt;/p&gt;
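&lt;p&gt;As a quick check, the die example can be simulated with NumPy (a minimal sketch; the generator seed is arbitrary) to confirm that each face appears with relative frequency close to 1/6:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=60000)  # discrete uniform over {1, ..., 6}
values, counts = np.unique(rolls, return_counts=True)
freqs = counts / rolls.size  # each entry should be close to 1/6
```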

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Uniform Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Visualizing Binomial Distribution with Seaborn
&lt;/h2&gt;

&lt;p&gt;The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdaq7crkycy33yjm5j8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdaq7crkycy33yjm5j8w.png" alt="Image description" width="641" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Binomial and Bernoulli distributions may look similar, but they differ:&lt;/p&gt;

&lt;p&gt;Bernoulli: You flip a coin once. The distribution tells you the probability of getting heads (success) or tails (failure).&lt;/p&gt;

&lt;p&gt;Binomial: You flip a coin 10 times. The distribution tells you the probability of getting a certain number of heads (e.g., exactly 5 heads) out of 10 flips.&lt;/p&gt;
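&lt;p&gt;The coin example can be checked directly with SciPy (a small sketch): the probability of exactly 5 heads in 10 fair flips is C(10, 5) / 2&lt;sup&gt;10&lt;/sup&gt; = 0.24609375.&lt;/p&gt;

```python
from scipy.stats import binom

# P(exactly 5 heads in 10 fair flips)
p_five = binom.pmf(5, n=10, p=0.5)
print(round(p_five, 8))  # 0.24609375
```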

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;binom&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;binom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1010&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;density&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Binomial Distribution with Seaborn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of successes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Density&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🌟 Normal Distribution
&lt;/h2&gt;

&lt;p&gt;Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a "bell curve" when graphed.&lt;/p&gt;

&lt;p&gt;This distribution is characterized by its mean (average) and standard deviation (which measures the spread of data).&lt;/p&gt;

&lt;p&gt;The mean, median, and mode are all equal and located at the center of the distribution, a direct consequence of the symmetrical bell-shaped curve. For reference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mean&lt;/strong&gt;: The average value of all the data points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Median&lt;/strong&gt;: The middle value when all the data points are arranged in ascending order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode&lt;/strong&gt;: The most frequently occurring value in the data set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fuvvdujfqjtyu94snh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fuvvdujfqjtyu94snh5.png" alt="Image description" width="498" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
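&lt;p&gt;This symmetry is easy to verify numerically (a minimal sketch with arbitrary parameters): for a large normal sample, the mean and median nearly coincide.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
s = rng.normal(loc=5.0, scale=2.0, size=100_000)
# for a symmetric distribution, the sample mean and median are almost equal
print(abs(np.mean(s) - np.median(s)) < 0.05)  # True
```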

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lognormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# sigma = std value
&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lognorm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;floc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;num_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Log-Normal Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎲 Poisson Distribution
&lt;/h2&gt;

&lt;p&gt;The Poisson distribution describes the number of events occurring within a fixed interval of time or space, where events happen at a known constant mean rate and independently of the time since the last event. 🌟&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwivx15b6eiefu9vsdp7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwivx15b6eiefu9vsdp7.png" alt="Image description" width="638" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number of emails a person receives in an hour. If a person receives an average of 4 emails per hour (𝜆=4), the probability of receiving exactly 2 emails in an hour is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2a9tb2wq9tvbsn39t9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2a9tb2wq9tvbsn39t9k.png" alt="Image description" width="468" height="93"&gt;&lt;/a&gt;&lt;/p&gt;
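&lt;p&gt;Plugging the numbers into the formula above (a quick sketch using only the standard library): with 𝜆 = 4 and k = 2, P(X = 2) = 4&lt;sup&gt;2&lt;/sup&gt; e&lt;sup&gt;−4&lt;/sup&gt; / 2! ≈ 0.1465.&lt;/p&gt;

```python
import math

lam, k = 4, 2
# Poisson PMF: lambda^k * e^(-lambda) / k!
p = lam**k * math.exp(-lam) / math.factorial(k)
print(round(p, 4))  # 0.1465
```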

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Poisson Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Colab notebook: &lt;a href="https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing" rel="noopener noreferrer"&gt;https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By understanding these distributions and their properties, we can better analyze and interpret data in various fields such as finance, science, and engineering. Happy analyzing! 🧠📈&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to preprocess your Dataset</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Tue, 30 Jul 2024 06:50:32 +0000</pubDate>
      <link>https://dev.to/saivishwa/how-to-preprocess-your-dataset-3j0b</link>
      <guid>https://dev.to/saivishwa/how-to-preprocess-your-dataset-3j0b</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Titanic dataset is a classic dataset used in data science and machine learning projects. It contains information about the passengers on the Titanic, and the goal is often to predict which passengers survived the disaster. Before building any predictive model, it's crucial to preprocess the data to ensure it's clean and suitable for analysis. This blog post will guide you through the essential steps of preprocessing the Titanic dataset using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Loading the Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first step in any data analysis project is loading the dataset. We use the pandas library to read the CSV file containing the Titanic data. This dataset includes features like &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Sex&lt;/code&gt;, &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Fare&lt;/code&gt;, and whether the passenger survived (&lt;code&gt;Survived&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Load the Titanic dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;titanic = pd.read_csv('titanic.csv')
titanic.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The dataset contains the following variables related to passengers on the Titanic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Survival&lt;/strong&gt;: Indicates if the passenger survived.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 = No&lt;/li&gt;
&lt;li&gt;1 = Yes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Pclass&lt;/strong&gt;: Ticket class of the passenger.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 = 1st class&lt;/li&gt;
&lt;li&gt;2 = 2nd class&lt;/li&gt;
&lt;li&gt;3 = 3rd class&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sex&lt;/strong&gt;: Gender of the passenger.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Age&lt;/strong&gt;: Age of the passenger in years.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;SibSp&lt;/strong&gt;: Number of siblings or spouses aboard the Titanic.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parch&lt;/strong&gt;: Number of parents or children aboard the Titanic.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ticket&lt;/strong&gt;: Ticket number.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fare&lt;/strong&gt;: Passenger fare.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cabin&lt;/strong&gt;: Cabin number.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Embarked&lt;/strong&gt;: Port of embarkation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C = Cherbourg&lt;/li&gt;
&lt;li&gt;Q = Queenstown&lt;/li&gt;
&lt;li&gt;S = Southampton&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Exploratory Data Analysis (EDA)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Exploratory Data Analysis (EDA) involves examining the dataset to understand its structure and the relationships between different variables. This step helps identify any patterns, trends, or anomalies in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview of the Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We start by displaying the first few rows of the dataset and getting a summary of the statistics. This gives us an idea of the data types, the range of values, and the presence of any missing values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Display the first few rows
print(titanic.head())

# Summary statistics
print(titanic.describe(include='all'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Data Cleaning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data cleaning is the process of handling missing values, correcting data types, and removing any inconsistencies. In the Titanic dataset, features like &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Cabin&lt;/code&gt;, and &lt;code&gt;Embarked&lt;/code&gt; have missing values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Missing Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To handle missing values, we can fill them with appropriate values or drop rows/columns with missing data. For example, we can fill missing &lt;code&gt;Age&lt;/code&gt; values with the median age and drop rows with missing &lt;code&gt;Embarked&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fill missing age values with the mode
titanic['Age'].fillna(titanic['Age'].mode(), inplace=True)

# Drop rows with missing 'Embarked' values
titanic.dropna(subset=['Embarked'], inplace=True)

# Check remaining missing values
print(titanic.isnull().sum())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Feature Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feature engineering involves creating new features or transforming existing ones to improve model performance. This step can include encoding categorical variables and scaling numerical features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encoding Categorical Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning algorithms require numerical input, so we need to convert categorical features into numerical ones. We can use label encoding for a binary feature like &lt;code&gt;Sex&lt;/code&gt; and one-hot encoding for a multi-category feature like &lt;code&gt;Embarked&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Convert categorical features to numerical
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

#fit the required column to be transformed
le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
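&lt;p&gt;For a multi-category feature like &lt;code&gt;Embarked&lt;/code&gt;, one-hot encoding avoids implying an order between categories. A minimal sketch with a toy stand-in column, using pandas' &lt;code&gt;get_dummies&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# toy stand-in for the 'Embarked' column
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
encoded = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
# produces indicator columns Embarked_C, Embarked_Q, Embarked_S
print(list(encoded.columns))
```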



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Preprocessing is a critical step in any data science project. In this blog post, we covered the essential steps of loading data, performing exploratory data analysis, cleaning the data, and feature engineering. These steps help ensure our data is ready for analysis or model building. The next step is to use this preprocessed data to build predictive models and evaluate their performance. For further insights, take a look at my &lt;a href="https://colab.research.google.com/drive/1U8BpL5lZTcnMDVSAFJdbwZjuxTIc9O17?usp=sharing" rel="noopener noreferrer"&gt;Colab notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By following these steps, beginners can get a solid foundation in data preprocessing, setting the stage for more advanced data analysis and machine learning tasks. Happy coding!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
