<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sai Vishwa B</title>
    <description>The latest articles on DEV Community by Sai Vishwa B (@saivishwa).</description>
    <link>https://dev.to/saivishwa</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1859994%2Fc998342b-0202-4398-a42a-082e663dd13c.jpg</url>
      <title>DEV Community: Sai Vishwa B</title>
      <link>https://dev.to/saivishwa</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/saivishwa"/>
    <language>en</language>
    <item>
      <title>✈️ Model Comparison for Sentiment Analysis Using the Airline Tweet Dataset</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Tue, 20 Aug 2024 11:54:10 +0000</pubDate>
      <link>https://dev.to/saivishwa/model-comparison-for-sentiment-analysis-using-the-airline-tweet-dataset-3b4g</link>
      <guid>https://dev.to/saivishwa/model-comparison-for-sentiment-analysis-using-the-airline-tweet-dataset-3b4g</guid>
      <description>&lt;p&gt;In today's data-driven world, sentiment analysis has become a powerful tool for understanding public opinion. Whether it's gauging customer satisfaction or monitoring brand reputation, the ability to analyze textual data at scale is invaluable. In this blog post, we'll walk through a model comparison using the airline tweet dataset, showcasing how different machine learning models perform on the task of sentiment analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  📊 Dataset Overview
&lt;/h2&gt;

&lt;p&gt;The airline tweet dataset consists of tweets directed at various airlines. Each tweet is labeled with one of three sentiments: positive, neutral, or negative. This makes it an ideal dataset for sentiment classification tasks. The goal is to build a model that can accurately classify the sentiment of new, unseen tweets.&lt;/p&gt;

&lt;h2&gt;
  
  
  🛠️ Importing the Required Libraries
&lt;/h2&gt;

&lt;p&gt;Before diving into the analysis, let's import the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import re
import nltk
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
from xgboost import XGBClassifier
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE
from nltk.corpus import stopwords
nltk.download('stopwords')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧹 Data Loading and Preprocessing
&lt;/h2&gt;

&lt;p&gt;The first step is to load the dataset and perform some basic preprocessing, such as cleaning the text and converting the labels into a format suitable for modeling.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Load the dataset
data = pd.read_csv("AirlineTwitterData.csv", encoding = "ISO-8859-1")

df = data[['text', 'airline_sentiment']].copy()

df.rename(columns = {"text" : "tweet", "airline_sentiment" : "sentiment"}, inplace = True)

le = LabelEncoder()
df.sentiment = le.fit_transform(df.sentiment)

label_mapping = dict(zip(le.classes_, range(len(le.classes_))))

stop_words = set(stopwords.words('english'))

df['clean_tweet'] = df['tweet'].apply(lambda x : ' '.join([word for word in x.split() if word.lower() not in stop_words]))

def clean_text(text):
    text = re.sub(r'@\w+|#\w+|https?://(?:www\.)?[^\s/$.?#].[^\s]*', '', text)  # Remove mentions, hashtags, and URLs
    text = re.sub(r"[^a-zA-Z0-9\s]", '', text)  # Remove non-alphanumeric characters
    return text.strip().lower()  # Strip whitespace and convert to lowercase

# Apply the cleaning function to the DataFrame
df['clean_tweet'] = df['clean_tweet'].apply(clean_text)

df.drop('tweet', axis = 1, inplace = True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧠 Feature Extraction
&lt;/h2&gt;

&lt;p&gt;To feed the text data into our machine learning models, we need to convert it into numerical features. We'll use TF-IDF (Term Frequency-Inverse Document Frequency) vectorization for this purpose.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tfidf = TfidfVectorizer(max_features=10000, ngram_range=(1, 3))
X = Tfidf.fit_transform(df.clean_tweet).toarray()
y = df['sentiment']

# Handling class imbalance using SMOTE
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ✂️ Train-Test Split
&lt;/h2&gt;

&lt;p&gt;We'll split the dataset into training and testing sets to evaluate our models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🆚 Model Comparison
&lt;/h2&gt;

&lt;p&gt;We'll compare the performance of several machine learning models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;Naive Bayes&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;li&gt;K-Nearest Neighbors&lt;/li&gt;
&lt;li&gt;Decision Tree&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these models has its strengths and weaknesses, making it crucial to evaluate them on the same dataset.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logistic Regression
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_logreg))
print(classification_report(y_test, y_pred_logreg))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld1of6x4p8tbdda64wwk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fld1of6x4p8tbdda64wwk.png" alt="Image description" width="614" height="235"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Naive Bayes
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred_nb = nb.predict(X_test)
print("Naive Bayes Accuracy:", accuracy_score(y_test, y_pred_nb))
print(classification_report(y_test, y_pred_nb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ir1vf1xthlkir0da6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F27ir1vf1xthlkir0da6q.png" alt="Image description" width="642" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Random Forest
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fium0vvmfwhx2vksk03zz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fium0vvmfwhx2vksk03zz.png" alt="Image description" width="622" height="254"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  K-Nearest Neighbors
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
y_pred_knn = knn.predict(X_test)
print("K-Nearest Neighbors Accuracy:", accuracy_score(y_test, y_pred_knn))
print(classification_report(y_test, y_pred_knn))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5nqy13inasejpmylf5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5nqy13inasejpmylf5k.png" alt="Image description" width="604" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Decision Tree
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
y_pred_dt = dt.predict(X_test)
print("Decision Tree Accuracy:", accuracy_score(y_test, y_pred_dt))
print(classification_report(y_test, y_pred_dt))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxdl09b9b8te9ouwzhlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxdl09b9b8te9ouwzhlp.png" alt="Image description" width="619" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  XGBoost
&lt;/h3&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;xgb = XGBClassifier()
xgb.fit(X_train, y_train)
y_pred_xgb = xgb.predict(X_test)
print("XGBoost Accuracy:", accuracy_score(y_test, y_pred_xgb))
print(classification_report(y_test, y_pred_xgb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6m7pyxofaugt993p2zq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh6m7pyxofaugt993p2zq.png" alt="Image description" width="596" height="236"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  📈 Results and Discussion
&lt;/h2&gt;

&lt;p&gt;After running each model, we can compare their performance based on accuracy, precision, recall, and F1-score. Typically, ensemble models like XGBoost and Random Forest outperform simpler models like Naive Bayes, though this depends on the dataset and the specific task. The accuracy comparison graph can be seen below.&lt;/p&gt;
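A comparison chart like the one below can be generated in a few lines of matplotlib. The scores here are placeholders; substitute the `accuracy_score` values printed by each model above:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; in a notebook, plt.show() works instead
import matplotlib.pyplot as plt

# Placeholder scores -- substitute the accuracy_score values printed above
accuracies = {
    "Logistic Regression": 0.79,
    "Naive Bayes": 0.75,
    "Random Forest": 0.85,
    "KNN": 0.72,
    "Decision Tree": 0.78,
    "XGBoost": 0.86,
}

fig, ax = plt.subplots(figsize=(8, 5))
bars = ax.bar(list(accuracies.keys()), list(accuracies.values()), color="steelblue")
ax.set_ylabel("Accuracy")
ax.set_ylim(0, 1)
ax.set_title("Model Accuracy Comparison")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("model_accuracy_comparison.png")
```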

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbpdswisb1dx8h0ydwt2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frbpdswisb1dx8h0ydwt2.png" alt="Image description" width="768" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the complete Airline Twitter Sentiment Analysis project, take a look at my GitHub repo: &lt;a href="https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis" rel="noopener noreferrer"&gt;https://github.com/SaiVishwa021/Airline_TwitterSentimentAnalysis&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The app is also live: &lt;a href="https://airline-twittersentimentanalysis-1.onrender.com/" rel="noopener noreferrer"&gt;https://airline-twittersentimentanalysis-1.onrender.com/&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Dive into the Top 25 Movies with a Flask App: From Classics to Modern Blockbusters!</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Wed, 14 Aug 2024 12:09:12 +0000</pubDate>
      <link>https://dev.to/saivishwa/dive-into-the-top-25-movies-with-a-flask-app-from-classics-to-modern-blockbusters-10a3</link>
      <guid>https://dev.to/saivishwa/dive-into-the-top-25-movies-with-a-flask-app-from-classics-to-modern-blockbusters-10a3</guid>
      <description>&lt;p&gt;Ever found yourself endlessly scrolling through IMDB, trying to decide which movie to watch? What if you could just type in a year and instantly get the top 25 movies? Well, now you can! Welcome to the &lt;strong&gt;Top 25 Movies Flask App&lt;/strong&gt;—a fun and simple way to explore the best movies from any year. 🎬&lt;/p&gt;

&lt;p&gt;Check the live page: &lt;a href="https://top25moviez-kp5g.onrender.com/" rel="noopener noreferrer"&gt;https://top25moviez-kp5g.onrender.com/&lt;/a&gt;&lt;br&gt;
GitHub repo: &lt;a href="https://github.com/SaiVishwa021/Top25Movies" rel="noopener noreferrer"&gt;https://github.com/SaiVishwa021/Top25Movies&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Web Scraping?
&lt;/h2&gt;

&lt;p&gt;Before we dive into the app, let's talk a bit about web scraping. Imagine you're on a treasure hunt, but instead of gold, you're after data hidden within web pages. Web scraping is just that—a technique to fetch and extract data from websites, simulating human browsing behavior. It’s like being a data detective! 🕵️‍♂️&lt;/p&gt;
&lt;h2&gt;
  
  
  How the Top 25 Movies Flask App Works
&lt;/h2&gt;

&lt;p&gt;Here’s a sneak peek into how the magic happens behind the scenes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Send HTTP Request&lt;/strong&gt;: The app sends an HTTP GET request to IMDB to fetch the top movies for a specific year.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parse HTML Content&lt;/strong&gt;: Once the response is received, BeautifulSoup steps in. This powerful library parses the HTML, allowing us to navigate through the web page’s structure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Extract Movie Information&lt;/strong&gt;: We then hunt for movie elements within the HTML, grabbing juicy details like movie titles, ranks, and more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Save Data&lt;/strong&gt;: After collecting all the movie info, it’s saved neatly into a CSV file in the dataset folder. Now, you’ve got a handy reference list of top movies!&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
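Steps 1-3 depend on IMDB's markup, which changes over time, so the exact selectors live in the repo. Step 4, however, is plain Python; a minimal sketch of the save step, with hypothetical rows standing in for what step 3 extracts:

```python
import csv
from pathlib import Path

# Hypothetical rows standing in for what step 3 extracts: (rank, title, year)
movies = [
    (1, "The Shawshank Redemption", 1994),
    (2, "The Godfather", 1972),
    (3, "The Dark Knight", 2008),
]

# Step 4: save the collected movie info into the dataset folder
out_dir = Path("dataset")
out_dir.mkdir(exist_ok=True)
out_file = out_dir / "top_25_movies_demo.csv"

with out_file.open("w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["rank", "title", "year"])
    writer.writerows(movies)

print(f"Saved {len(movies)} movies to {out_file}")
```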
&lt;h2&gt;
  
  
  Features You’ll Love
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scrape movie data from IMDB from 1950 onwards.&lt;/strong&gt; Want to explore the classics or check out modern blockbusters? We’ve got you covered.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Save the movie info&lt;/strong&gt; to your very own dataset folder for further analysis or just for fun.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple and intuitive user interface.&lt;/strong&gt; No need to be a tech wizard; it’s user-friendly!&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Installation: Get the App Running in No Time
&lt;/h2&gt;

&lt;p&gt;Ready to explore the top movies? Here’s how to get started:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Clone the repository:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/SaiVishwa021/Top25Movies.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;ol start="2"&gt;
&lt;li&gt;&lt;strong&gt;Install the Required Packages&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To get started, install the necessary packages by running the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;Run the Flask application:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To fetch the top 25 movies for the year 2023, just enter 2023 in the input field and click "Submit". The app will not only display the top 25 movies but also save the data to dataset/top_25_movies_2023.csv. 🎉&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feel free to share your thoughts or add more features to make this app even cooler. Happy movie hunting! 🍿&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Predicting Customer Churn with XGBoost: A Comprehensive Guide🚀</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Thu, 08 Aug 2024 10:20:05 +0000</pubDate>
      <link>https://dev.to/saivishwa/predicting-customer-churn-with-xgboost-a-comprehensive-guide-5011</link>
      <guid>https://dev.to/saivishwa/predicting-customer-churn-with-xgboost-a-comprehensive-guide-5011</guid>
      <description>&lt;p&gt;Customer churn prediction is a critical task for businesses, particularly in the banking sector. Identifying customers who are likely to leave allows for proactive retention strategies, potentially saving significant revenue. In this blog post, I'll walk you through my project on predicting customer churn using the XGBoost algorithm, covering everything from data preprocessing to model evaluation. &lt;/p&gt;

&lt;h2&gt;
  
  
  📋 Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Introduction&lt;/li&gt;
&lt;li&gt;Project Overview&lt;/li&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Usage&lt;/li&gt;
&lt;li&gt;Model Comparison&lt;/li&gt;
&lt;li&gt;Model Training&lt;/li&gt;
&lt;li&gt;Understanding XGBoost&lt;/li&gt;
&lt;li&gt;Results&lt;/li&gt;
&lt;li&gt;Conclusion&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  📌 Introduction
&lt;/h2&gt;

&lt;p&gt;Customer churn is when customers stop using a company's products or services. Predicting churn helps businesses take proactive measures to retain customers, thus improving long-term profitability. In this project, I used the XGBoost algorithm, known for its efficiency and performance, to build a model for predicting customer churn in a bank.&lt;/p&gt;

&lt;p&gt;For a deeper look, check out the project's &lt;a href="https://github.com/SaiVishwa021/ChurnPredictionUsingXGBoost" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  💡 Project Overview
&lt;/h2&gt;

&lt;p&gt;The goal of this project is to build a machine learning model that predicts whether a customer will churn based on various features such as credit score, age, gender, balance, and more. I compared multiple algorithms, including Logistic Regression, Random Forest, KNN, and Naive Bayes, but ultimately chose XGBoost for its superior performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  ⚙️ Installation
&lt;/h2&gt;

&lt;p&gt;To run this project, ensure you have Python installed. Clone the repository and install the required packages using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To run the Flask app:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python app.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🚀 Usage
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clone the repository.&lt;/li&gt;
&lt;li&gt;Place the dataset &lt;code&gt;Churn_Modelling.csv&lt;/code&gt; in the project directory.&lt;/li&gt;
&lt;li&gt;Run the &lt;code&gt;xgb.py&lt;/code&gt; script to train the model.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;app.py&lt;/code&gt; to serve the model and make predictions via a web interface.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🔍 Model Comparison
&lt;/h2&gt;

&lt;p&gt;In the &lt;code&gt;ChurnPrediction.ipynb&lt;/code&gt; notebook, I compared the performance of five different machine learning algorithms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Logistic Regression&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;XGBoost&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Random Forest&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;K-Nearest Neighbors (KNN)&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Naive Bayes&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This comparison helps in understanding which model performs best for our churn prediction task.&lt;/p&gt;
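The notebook's comparison boils down to a loop over estimators. Here is a sketch with a synthetic dataset standing in for the churn data; XGBoost is left out so the snippet runs with scikit-learn alone, but an `XGBClassifier` slots into the same loop when installed:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the Churn_Modelling.csv features and target
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit each model on the same split and record its test accuracy
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: {scores[name]:.3f}")
```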

&lt;h2&gt;
  
  
  🏋️ Model Training
&lt;/h2&gt;

&lt;p&gt;The model training process involves several key steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Loading&lt;/strong&gt;: Load the customer data from the CSV file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Preprocessing&lt;/strong&gt;: Encode categorical variables, drop unnecessary columns, and split the data into features and target variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Balancing&lt;/strong&gt;: Use SMOTE (Synthetic Minority Over-sampling Technique) to handle class imbalance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Training&lt;/strong&gt;: Train an XGBoost classifier on the balanced training data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Evaluation&lt;/strong&gt;: Evaluate the model using classification metrics.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🧠 Understanding XGBoost
&lt;/h2&gt;

&lt;h3&gt;
  
  
  XGBoost (Extreme Gradient Boosting)
&lt;/h3&gt;

&lt;p&gt;XGBoost is a scalable and efficient implementation of gradient boosted decision trees. Here's a brief overview of how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Decision Trees&lt;/strong&gt;: XGBoost builds an ensemble of decision trees, where each tree is trained to correct the errors of the previous ones.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gradient Boosting&lt;/strong&gt;: Uses gradient descent to minimize the loss function by adjusting weights. New trees are added sequentially, correcting errors from existing trees.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Regularization&lt;/strong&gt;: Includes regularization terms to control overfitting and improve generalization.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parallel Processing&lt;/strong&gt;: Leverages parallel processing for faster computation.&lt;/li&gt;
&lt;/ul&gt;
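The first two bullets, an ensemble in which each new tree corrects its predecessors, can be illustrated with depth-1 regression stumps in plain NumPy. This is a sketch of gradient boosting with squared loss, not XGBoost's full regularized algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = np.sin(X) + rng.normal(0, 0.1, size=200)

def fit_stump(X, y):
    """Best single-threshold stump by squared error (a depth-1 'tree')."""
    best_err, best = np.inf, None
    for t in np.quantile(X, np.linspace(0.05, 0.95, 20)):
        mask = X > t
        if mask.all() or (~mask).all():
            continue
        pred = np.where(mask, y[mask].mean(), y[~mask].mean())
        err = ((y - pred) ** 2).sum()
        if best_err > err:
            best_err, best = err, (t, y[~mask].mean(), y[mask].mean())
    return best

def predict_stump(stump, X):
    t, left_mean, right_mean = stump
    return np.where(X > t, right_mean, left_mean)

# Boosting loop: each stump is fit to the residuals of the ensemble so far,
# so every round "corrects the errors of the previous ones"
pred = np.zeros_like(y)
learning_rate = 0.5
errors = []
for _ in range(50):
    residuals = y - pred
    stump = fit_stump(X, residuals)
    pred = pred + learning_rate * predict_stump(stump, X)
    errors.append(((y - pred) ** 2).mean())

print(f"MSE after round 1: {errors[0]:.4f}, after round 50: {errors[-1]:.4f}")
```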

&lt;h2&gt;
  
  
  📈 Results
&lt;/h2&gt;

&lt;p&gt;The model's performance is evaluated using metrics such as precision, recall, F1-score, and accuracy. Below are the classification reports for both the training and test datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Classification Report
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fc9e6438f-063b-4514-8eba-6108c6e29ddd" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2Fc9e6438f-063b-4514-8eba-6108c6e29ddd" alt="Training Data Classification Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F59d4ddc1-a995-4f79-aa3d-5899614de08a" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Fuser-attachments%2Fassets%2F59d4ddc1-a995-4f79-aa3d-5899614de08a" alt="Test Data Classification Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🏁 Conclusion
&lt;/h2&gt;

&lt;p&gt;In this project, I demonstrated how to predict customer churn using the XGBoost algorithm. By comparing various models and fine-tuning the chosen algorithm, I achieved a high-performance model capable of accurately predicting customer churn. This project highlights the importance of data preprocessing, handling class imbalance, and choosing the right algorithm for the task.&lt;/p&gt;

&lt;p&gt;Feel free to check out the &lt;a href="https://github.com/SaiVishwa021/ChurnPredictionUsingXGBoost" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the complete code and dataset. Happy coding!&lt;/p&gt;




</description>
    </item>
    <item>
      <title>🔍 Comparing and Contrasting Popular Probability Distributions: A Practical Approach 📊</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Thu, 01 Aug 2024 04:39:15 +0000</pubDate>
      <link>https://dev.to/saivishwa/comparing-and-contrasting-popular-probability-distributions-a-practical-approach-1j2i</link>
      <guid>https://dev.to/saivishwa/comparing-and-contrasting-popular-probability-distributions-a-practical-approach-1j2i</guid>
      <description>&lt;p&gt;Understanding different statistical distributions and their properties is crucial for data analysis and modeling. In this blog, we'll explore several types of distributions using Python, including binomial, uniform, and log-normal distributions. We'll use libraries such as NumPy, Matplotlib, and Seaborn for this purpose. Let's dive in! 🚀&lt;/p&gt;

&lt;p&gt;A distribution model is a mathematical function that describes the probability of different outcomes or values in a dataset. It helps to understand the patterns and structure of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Distribution Models are Used in Machine Learning?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understanding Data&lt;/strong&gt;: Helps in summarizing and describing the dataset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Generation&lt;/strong&gt;: Creates synthetic data for testing algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Assumptions&lt;/strong&gt;: Many algorithms assume specific data distributions (e.g., normal distribution in linear regression).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feature Engineering&lt;/strong&gt;: Transforms data to meet model assumptions (e.g., using logarithms for skewed data).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Probability-Based Models&lt;/strong&gt;: Used in probabilistic methods like Naive Bayes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluation Metrics&lt;/strong&gt;: Helps in evaluating and improving model performance by understanding error distributions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Some common distribution models are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Bernoulli distribution&lt;/li&gt;
&lt;li&gt;Uniform distribution&lt;/li&gt;
&lt;li&gt;Binomial distribution&lt;/li&gt;
&lt;li&gt;Normal distribution&lt;/li&gt;
&lt;li&gt;Poisson distribution&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🎯 Bernoulli Distribution
&lt;/h2&gt;

&lt;p&gt;Represents the outcome of a single experiment with two possible outcomes: success (1) or failure (0).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi5ydiq6aydtrfp0sj0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhi5ydiq6aydtrfp0sj0n.png" alt="Image description" width="397" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The two outcomes are complementary: if the probability of success is p, the probability of failure is 1 − p.&lt;/p&gt;

&lt;p&gt;Example: Flipping a coin once. If heads is considered a success (p = 0.5), the probability of getting heads (success) is 0.5, and the probability of getting tails (failure) is also 0.5.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;binomial&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Binomial Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of successes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔀 Uniform Distribution
&lt;/h2&gt;

&lt;p&gt;All outcomes are equally likely(each outcome of an experiment has an equal probability of occurring) within a certain range.&lt;/p&gt;

&lt;p&gt;For discrete values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs03motzk18f376b614d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjs03motzk18f376b614d.png" alt="Image description" width="370" height="156"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where n is the number of possible outcomes.&lt;/p&gt;

&lt;p&gt;For continuous values:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8902i2bvg4mt1gt4qj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5o8902i2bvg4mt1gt4qj.png" alt="Image description" width="675" height="165"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;where a and b are the minimum and maximum values.&lt;/p&gt;

&lt;p&gt;For example, rolling a fair six-sided die: each number (1 through 6) has an equal probability of 1/6.&lt;/p&gt;
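&lt;p&gt;As a quick check, the die example can be simulated with NumPy (a minimal sketch; the generator seed is arbitrary) to confirm that each face appears with relative frequency close to 1/6:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=60000)  # discrete uniform over {1, ..., 6}
values, counts = np.unique(rolls, return_counts=True)
freqs = counts / rolls.size  # each entry should be close to 1/6
```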

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uniform&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;low&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;high&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edgecolor&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;black&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.75&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Uniform Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  📊 Visualizing Binomial Distribution with Seaborn
&lt;/h2&gt;

&lt;p&gt;The binomial distribution describes the number of successes in a fixed number of independent Bernoulli trials.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdaq7crkycy33yjm5j8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkdaq7crkycy33yjm5j8w.png" alt="Image description" width="641" height="359"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Binomial and Bernoulli distributions may look similar, but they differ:&lt;/p&gt;

&lt;p&gt;Bernoulli: You flip a coin once. The distribution tells you the probability of getting heads (success) or tails (failure).&lt;/p&gt;

&lt;p&gt;Binomial: You flip a coin 10 times. The distribution tells you the probability of getting a certain number of heads (e.g., exactly 5 heads) out of 10 flips.&lt;/p&gt;
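&lt;p&gt;The coin example can be checked directly with SciPy (a small sketch): the probability of exactly 5 heads in 10 fair flips is C(10, 5) / 2&lt;sup&gt;10&lt;/sup&gt; = 0.24609375.&lt;/p&gt;

```python
from scipy.stats import binom

# P(exactly 5 heads in 10 fair flips)
p_five = binom.pmf(5, n=10, p=0.5)
print(round(p_five, 8))  # 0.24609375
```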

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;seaborn&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;binom&lt;/span&gt;

&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;binom&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rvs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;17&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1010&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ax&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;histplot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;kde&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;g&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;density&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;element&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;step&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;linewidth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;2.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Binomial Distribution with Seaborn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of successes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Density&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🌟 Normal Distribution
&lt;/h2&gt;

&lt;p&gt;Normal distribution, also known as the Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data near the mean are more frequent in occurrence than data far from the mean. The normal distribution appears as a "bell curve" when graphed.&lt;/p&gt;

&lt;p&gt;This distribution is characterized by its mean (average) and standard deviation (which measures the spread of data).&lt;/p&gt;

&lt;p&gt;The mean, median, and mode are all equal and located at the center of the distribution, a direct consequence of the symmetrical bell-shaped curve. For reference:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mean&lt;/strong&gt;: The average value of all the data points.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Median&lt;/strong&gt;: The middle value when all the data points are arranged in ascending order.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mode&lt;/strong&gt;: The most frequently occurring value in the data set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fuvvdujfqjtyu94snh5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6fuvvdujfqjtyu94snh5.png" alt="Image description" width="498" height="400"&gt;&lt;/a&gt;&lt;/p&gt;
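&lt;p&gt;This symmetry is easy to verify numerically (a minimal sketch with arbitrary parameters): for a large normal sample, the mean and median nearly coincide.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(42)
s = rng.normal(loc=5.0, scale=2.0, size=100_000)
# for a symmetric distribution, the sample mean and median are almost equal
print(abs(np.mean(s) - np.median(s)) < 0.05)  # True
```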

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scipy.stats&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;matplotlib.pyplot&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;

&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;seed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;samples&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lognormal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sigma&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# sigma = std value
&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loc&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;scale&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scipy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lognorm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;floc&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;num_bins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;
&lt;span class="n"&gt;counts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;edges&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;patches&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;samples&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bins&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;num_bins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Log-Normal Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Value&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🎲 Poisson Distribution
&lt;/h2&gt;

&lt;p&gt;The Poisson distribution describes the number of events occurring within a fixed interval of time or space, where events happen at a known constant mean rate and independently of the time since the last event. 🌟&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwivx15b6eiefu9vsdp7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbwivx15b6eiefu9vsdp7.png" alt="Image description" width="638" height="273"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number of emails a person receives in an hour. If a person receives an average of 4 emails per hour (𝜆=4), the probability of receiving exactly 2 emails in an hour is:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2a9tb2wq9tvbsn39t9k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft2a9tb2wq9tvbsn39t9k.png" alt="Image description" width="468" height="93"&gt;&lt;/a&gt;&lt;/p&gt;
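&lt;p&gt;Plugging the numbers into the formula above (a quick sketch using only the standard library): with 𝜆 = 4 and k = 2, P(X = 2) = 4&lt;sup&gt;2&lt;/sup&gt; e&lt;sup&gt;−4&lt;/sup&gt; / 2! ≈ 0.1465.&lt;/p&gt;

```python
import math

lam, k = 4, 2
# Poisson PMF: lambda^k * e^(-lambda) / k!
p = lam**k * math.exp(-lam) / math.factorial(k)
print(round(p, 4))  # 0.1465
```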

&lt;p&gt;&lt;strong&gt;Sample implementation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;poisson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hist&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;color&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Poisson Distribution&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;xlabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Number of events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ylabel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Frequency&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;plt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Colab notebook: &lt;a href="https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing" rel="noopener noreferrer"&gt;https://colab.research.google.com/drive/1uKp3FCC5QmQy53fz83eS7hwengOhk9zx?usp=sharing&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By understanding these distributions and their properties, we can better analyze and interpret data in various fields such as finance, science, and engineering. Happy analyzing! 🧠📈&lt;/p&gt;

</description>
      <category>statistics</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>How to preprocess your Dataset</title>
      <dc:creator>Sai Vishwa B</dc:creator>
      <pubDate>Tue, 30 Jul 2024 06:50:32 +0000</pubDate>
      <link>https://dev.to/saivishwa/how-to-preprocess-your-dataset-3j0b</link>
      <guid>https://dev.to/saivishwa/how-to-preprocess-your-dataset-3j0b</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The Titanic dataset is a classic dataset used in data science and machine learning projects. It contains information about the passengers on the Titanic, and the goal is often to predict which passengers survived the disaster. Before building any predictive model, it's crucial to preprocess the data to ensure it's clean and suitable for analysis. This blog post will guide you through the essential steps of preprocessing the Titanic dataset using Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 1: Loading the Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The first step in any data analysis project is loading the dataset. We use the pandas library to read the CSV file containing the Titanic data. This dataset includes features like &lt;code&gt;Name&lt;/code&gt;, &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Sex&lt;/code&gt;, &lt;code&gt;Ticket&lt;/code&gt;, &lt;code&gt;Fare&lt;/code&gt;, and whether the passenger survived (&lt;code&gt;Survived&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Load the Titanic dataset&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;titanic = pd.read_csv('titanic.csv')
titanic.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Understanding the Data&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The dataset contains the following variables related to passengers on the Titanic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Survival&lt;/strong&gt;: Indicates if the passenger survived.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;0 = No&lt;/li&gt;
&lt;li&gt;1 = Yes&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Pclass&lt;/strong&gt;: Ticket class of the passenger.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 = 1st class&lt;/li&gt;
&lt;li&gt;2 = 2nd class&lt;/li&gt;
&lt;li&gt;3 = 3rd class&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sex&lt;/strong&gt;: Gender of the passenger.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Age&lt;/strong&gt;: Age of the passenger in years.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;SibSp&lt;/strong&gt;: Number of siblings or spouses aboard the Titanic.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Parch&lt;/strong&gt;: Number of parents or children aboard the Titanic.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ticket&lt;/strong&gt;: Ticket number.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fare&lt;/strong&gt;: Passenger fare.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cabin&lt;/strong&gt;: Cabin number.&lt;/p&gt;&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Embarked&lt;/strong&gt;: Port of embarkation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;C = Cherbourg&lt;/li&gt;
&lt;li&gt;Q = Queenstown&lt;/li&gt;
&lt;li&gt;S = Southampton&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Step 2: Exploratory Data Analysis (EDA)&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Exploratory Data Analysis (EDA) involves examining the dataset to understand its structure and the relationships between different variables. This step helps identify any patterns, trends, or anomalies in the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Overview of the Dataset&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We start by displaying the first few rows of the dataset and getting a summary of the statistics. This gives us an idea of the data types, the range of values, and the presence of any missing values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Display the first few rows
print(titanic.head())

# Summary statistics
print(titanic.describe(include='all'))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 3: Data Cleaning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Data cleaning is the process of handling missing values, correcting data types, and removing any inconsistencies. In the Titanic dataset, features like &lt;code&gt;Age&lt;/code&gt;, &lt;code&gt;Cabin&lt;/code&gt;, and &lt;code&gt;Embarked&lt;/code&gt; have missing values.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handling Missing Values&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To handle missing values, we can fill them with appropriate values or drop rows/columns with missing data. For example, we can fill missing &lt;code&gt;Age&lt;/code&gt; values with the median age and drop rows with missing &lt;code&gt;Embarked&lt;/code&gt; values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Fill missing age values with the mode
titanic['Age'].fillna(titanic['Age'].mode(), inplace=True)

# Drop rows with missing 'Embarked' values
titanic.dropna(subset=['Embarked'], inplace=True)

# Check remaining missing values
print(titanic.isnull().sum())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Step 4: Feature Engineering&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Feature engineering involves creating new features or transforming existing ones to improve model performance. This step can include encoding categorical variables and scaling numerical features.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Encoding Categorical Variables&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Machine learning algorithms require numerical input, so we need to convert categorical features into numerical ones. We can use label encoding for a binary feature like &lt;code&gt;Sex&lt;/code&gt; and one-hot encoding for a multi-category feature like &lt;code&gt;Embarked&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Convert categorical features to numerical
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

#fit the required column to be transformed
le.fit(df['Sex'])
df['Sex'] = le.transform(df['Sex'])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
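&lt;p&gt;For a multi-category feature like &lt;code&gt;Embarked&lt;/code&gt;, one-hot encoding avoids implying an order between categories. A minimal sketch with a toy stand-in column, using pandas' &lt;code&gt;get_dummies&lt;/code&gt;:&lt;/p&gt;

```python
import pandas as pd

# toy stand-in for the 'Embarked' column
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
encoded = pd.get_dummies(df, columns=['Embarked'], prefix='Embarked')
# produces indicator columns Embarked_C, Embarked_Q, Embarked_S
print(list(encoded.columns))
```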



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Preprocessing is a critical step in any data science project. In this blog post, we covered the essential steps of loading data, performing exploratory data analysis, cleaning the data, and feature engineering. These steps help ensure our data is ready for analysis or model building. The next step is to use this preprocessed data to build predictive models and evaluate their performance. For further insights, take a look at my &lt;a href="https://colab.research.google.com/drive/1U8BpL5lZTcnMDVSAFJdbwZjuxTIc9O17?usp=sharing" rel="noopener noreferrer"&gt;Colab notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;By following these steps, beginners can get a solid foundation in data preprocessing, setting the stage for more advanced data analysis and machine learning tasks. Happy coding!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>programming</category>
      <category>python</category>
    </item>
  </channel>
</rss>
