Hey there, curious minds! 👋
If you're new to data science and machine learning like me, this post is for you. I'll walk you through one of the most classic beginner projects, predicting survival on the Titanic 🚢, in a beginner-friendly way. We'll cover the steps I took, the code I wrote, and the lessons I learned, keeping things simple and clear.
📚 What's the Titanic Project?
The Titanic dataset is one of the most popular datasets used to learn data science. The goal is to predict whether a passenger survived or not based on information like their age, gender, ticket class, etc.
This project is perfect for learning how to:
- Explore data 📊
- Clean and prepare it 🧹
- Visualize patterns 🎨
- Apply machine learning 🤖
📚 Project Overview
Concretely, this walkthrough covers:
- Data analysis and visualization
- Handling missing data
- Feature engineering
- Training a basic machine learning model (Logistic Regression)
🧰 Tools & Technologies Used
- Language: Python
- Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
- IDE: Jupyter Notebook
🔍 Step-by-Step Walkthrough
1. Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
2. Loading the Data
df = pd.read_csv("titanic.csv")
df.head()
3. Checking for Missing Values
df.isnull().sum()
Age: 177 missing → filled with median
Cabin: 687 missing → dropped
Embarked: 2 missing → filled with mode
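Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea on a tiny made-up table (the `toy` DataFrame and its values are my own illustration, not rows from the real dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are made up for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "S"],
})

# Number of missing values per column, same call as df.isnull().sum().
missing_count = toy.isnull().sum()

# Share of missing values per column, in percent.
missing_pct = (toy.isnull().mean() * 100).round(1)

summary = pd.DataFrame({"missing": missing_count, "percent": missing_pct})
print(summary.sort_values("percent", ascending=False))
```

A column that is mostly empty, like Cabin here, is a reasonable candidate for dropping, while a column with only a few gaps is usually worth filling.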
4. Handling Missing Data
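# Assign the results back: fillna(..., inplace=True) on a selected column
# raises a FutureWarning in recent pandas and may leave df unchanged.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(columns='Cabin')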
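To see what these three steps actually do, here is a small sketch on made-up rows (the `toy` table is hypothetical, and I assign the results back instead of using `inplace=True`):

```python
import numpy as np
import pandas as pd

# Made-up rows with gaps in the same three columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Embarked": ["S", "C", None, "S", "Q"],
    "Cabin": [None, "C85", None, "C123", None],
})

# median() and mode() both skip missing values by default.
age_median = toy["Age"].median()           # median of 22, 26, 35
embarked_mode = toy["Embarked"].mode()[0]  # most frequent port

toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(embarked_mode)
toy = toy.drop(columns="Cabin")

print(toy)
print("Remaining missing values:", int(toy.isnull().sum().sum()))
```

The median is a common choice for a column like Age because a few very old or very young passengers pull it around less than they would the mean.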
5. Visualizing the Data
Survival by Gender
sns.countplot(x='Survived', hue='Sex', data=df)
plt.title("Survival Count by Gender")
Survival by Passenger Class
sns.countplot(x='Survived', hue='Pclass', data=df)
plt.title("Survival Count by Class")
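Count plots show raw counts, so groups of different sizes can be hard to compare. One way to back up the plots with numbers is a survival rate per group. A minimal sketch on made-up rows (the `toy` data is my own example, not the real dataset):

```python
import pandas as pd

# Made-up passengers: 1 = survived, 0 = did not survive.
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female", "male"],
    "Survived": [0, 0, 1, 1, 1, 0],
})

# The mean of a 0/1 column is the share of 1s, i.e. the survival rate.
rate = toy.groupby("Sex")["Survived"].mean()
print(rate)
```

On the real data the same line would be `df.groupby('Sex')['Survived'].mean()`, and swapping `'Sex'` for `'Pclass'` gives the rate per ticket class.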
6. Correlation Heatmap
plt.figure(figsize=(10, 6))
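# numeric_only=True skips text columns, which df.corr() cannot handle in pandas 2.x
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')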
plt.title("Correlation Heatmap")
7. Feature Encoding
df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
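To check what the encoding produces, here is the same pair of calls on a tiny made-up table (the `toy` rows are hypothetical):

```python
import pandas as pd

# Made-up rows covering both sexes and all three ports.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
})

# Binary column: a simple 0/1 mapping is enough.
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

# Three ports become two indicator columns; drop_first=True drops the
# first category alphabetically (C), which is then implied when both are 0.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```

This is why the model later uses `Embarked_Q` and `Embarked_S` but no `Embarked_C`: a passenger who boarded at C is the row where both indicators are off.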
8. Model Preparation
X = df[['Pclass', 'Sex_encoded', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
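A variation worth knowing about, which the split above does not use, is the `stratify` argument of `train_test_split`. It keeps the share of each class the same in the training and test sets. A small sketch on made-up labels (the 60/40 class balance is an assumption for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 60 of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(100).reshape(-1, 1)

# stratify=y asks for the same class proportions in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train/test sizes:", X_train.shape[0], X_test.shape[0])
print("Share of class 1 in train:", y_train.mean())
print("Share of class 1 in test:", y_test.mean())
```

With only a few hundred rows, an unlucky random split can leave the test set with noticeably more or fewer survivors than the training set, which makes the accuracy number harder to trust.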
9. Logistic Regression Model
# A higher max_iter gives the solver room to converge on unscaled features like Fare
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
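Accuracy is a single number, so it hides which kind of mistake the model makes. A confusion matrix breaks it down. Here is a minimal sketch with hand-written labels (both `y_true` and `y_hat` are made up, not output from the model above):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels: 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are the true class, columns the predicted class.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

acc = accuracy_score(y_true, y_hat)

print(cm)
print("Correct predictions:", tn + tp, "out of", len(y_true))
print("Accuracy:", acc)
```

Accuracy is just `(tn + tp)` divided by the number of rows. On the real model, `confusion_matrix(y_test, y_pred)` gives the same breakdown.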
🧠 Key Learnings
✅ How to explore and clean a real-world dataset
✅ Understanding visual patterns in data
✅ Feature encoding and selection
✅ Building a logistic regression model
✅ Evaluating model accuracy and performance
💭 Reflections
This project was more than just code — it helped me gain confidence in using ML tools and understand the real process behind building predictive models. I'm now more excited than ever to keep exploring!
Thanks for reading! If you're also starting out in machine learning or have suggestions for improvement, I’d love to connect and hear your thoughts 💬