Shamanta Sristy

Posted on May 13, 2025

🛳️ Titanic Survival Prediction: A Gentle Introduction to Data Science for Beginners📊

#machinelearning #datasciencebeginner #titanicproject #pythonprojects

Hey there, curious minds! 👋
If you're new to data science and machine learning like me, this post is just for you. In this blog, I’ll walk you through one of the most classic beginner projects — predicting survival on the Titanic 🚢 — in a way that’s super beginner-friendly. We’ll explore the steps I took, the code I wrote, and the lessons I learned, all while keeping things simple and clear.

📚 What's the Titanic Project?

The Titanic dataset is one of the most popular datasets used to learn data science. The goal is to predict whether a passenger survived or not based on information like their age, gender, ticket class, etc.

This project is perfect for learning how to:

Explore data 📊
Clean and prepare it 🧹
Visualize patterns 🎨
Apply machine learning 🤖

📚 Project Overview

Using passenger features like age, sex, and ticket class as inputs, the walkthrough below covers:

Data analysis and visualization
Handling missing data
Feature engineering
Training a basic machine learning model (Logistic Regression)

🧰 Tools & Technologies Used

Language: Python
Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
IDE: Jupyter Notebook

🔍 Step-by-Step Walkthrough

1. Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

2. Loading the Data

df = pd.read_csv("titanic.csv")
df.head()

3. Checking for Missing Values

df.isnull().sum()

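The output shows three columns with missing values, and here is what I did with each:

Age: 177 missing → filled with the median
Cabin: 687 missing → dropped
Embarked: 2 missing → filled with the mode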
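Raw counts are easier to judge as a share of all rows. Here is a minimal sketch of that idea on a tiny made-up table (the `toy` DataFrame and its values are my own illustration, not the real Titanic data):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic data; the values are invented for illustration.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", None],
})

# isnull() marks missing cells as True; mean() turns that into a fraction per column.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty (like Cabin here) is a reasonable candidate for dropping, while a column with only a few gaps is usually worth filling in.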

4. Handling Missing Data

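# Assign the result back instead of calling fillna(..., inplace=True) on a column;
# newer pandas versions warn about that pattern and may not update df at all.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df = df.drop(columns='Cabin')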
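To see why the median is a gentle choice for Age, here is a small sketch on invented values (the `toy` table is hypothetical). One very old passenger pulls the mean up, but the median stays close to the typical age:

```python
import numpy as np
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 80.0],
    "Embarked": ["S", "C", "S", None],
})

age_mean = toy["Age"].mean()      # pulled upward by the 80-year-old
age_median = toy["Age"].median()  # middle value of the known ages

# Fill numeric gaps with the median and categorical gaps with the most common value.
toy["Age"] = toy["Age"].fillna(age_median)
toy["Embarked"] = toy["Embarked"].fillna(toy["Embarked"].mode()[0])

# Sanity check: nothing should be missing any more.
remaining_missing = int(toy.isnull().sum().sum())
print(toy)
print("Missing values left:", remaining_missing)
```

Running a quick "anything still missing?" check like this after cleaning is a cheap way to catch a column you forgot.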

5. Visualizing the Data

Survival by Gender

sns.countplot(x='Survived', hue='Sex', data=df)
plt.title("Survival Count by Gender")
plt.show()

Survival by Passenger Class

sns.countplot(x='Survived', hue='Pclass', data=df)
plt.title("Survival Count by Class")
plt.show()
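Count plots show how many people are in each bar, but groups of different sizes are hard to compare by eye. A survival rate per group can help. This is a rough sketch on a made-up six-row table (the `toy` values are hypothetical, so the rates below say nothing about the real dataset):

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Pclass": [3, 1, 3, 1, 2, 3],
    "Survived": [0, 1, 0, 1, 1, 0],
})

# Because Survived is 0 or 1, its mean within a group is that group's survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
rate_by_class = toy.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

On the real data, the same two lines work with `df` in place of `toy`.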

6. Correlation Heatmap

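plt.figure(figsize=(10, 6))
# numeric_only=True skips text columns such as Name and Sex, which pandas 2.x would otherwise reject.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title("Correlation Heatmap")
plt.show()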

7. Feature Encoding

df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
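If the encoding step feels a bit magical, this small sketch shows what it produces on three invented rows (the `toy` table is hypothetical). With `drop_first=True`, the first category (`C`) gets no column of its own, which is why only `Embarked_Q` and `Embarked_S` appear later in the feature list:

```python
import pandas as pd

# Invented rows for illustration only.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Binary column: map each label to 0 or 1.
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked; drop_first=True removes the first category (C),
# so a row with both remaining columns off means "embarked at C".
toy = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(toy.columns.tolist())
print(toy)
```

Dropping one category avoids storing redundant information: if a passenger did not board at Q or S, they must have boarded at C.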

8. Model Preparation

X = df[['Pclass', 'Sex_encoded', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
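One optional tweak I did not use above is the `stratify` argument of `train_test_split`, which keeps the share of survivors the same in the training and test sets. Here is a minimal sketch on made-up data (`toy_X` and `toy_y` are hypothetical stand-ins for `X` and `y`):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 50 rows, of which 20 (40%) are labelled 1.
toy_X = pd.DataFrame({"feature": range(50)})
toy_y = pd.Series([1] * 20 + [0] * 30)

# stratify=toy_y asks for the same class balance in both parts of the split.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print("Share of 1s in train:", y_tr.mean())
print("Share of 1s in test: ", y_te.mean())
```

Without stratification, a small test set can end up with noticeably more or fewer survivors than the training set just by chance, which makes the accuracy number harder to trust.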

9. Logistic Regression Model

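# max_iter is raised from the default of 100 so the solver has room to converge on unscaled features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)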

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
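A single 80/20 split gives one accuracy number, and that number can shift depending on which rows land in the test set. A common way to get a steadier estimate is cross-validation. The sketch below is not part of my original notebook: it uses synthetic data from `make_classification` instead of the Titanic file, and the scaling step and the names `X_demo`, `y_demo`, and `pipeline` are my own assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the Titanic dataset): 200 rows, 6 features.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=42)

# Scaling first helps logistic regression converge. Wrapping both steps in a
# pipeline means the scaler is refit on the training part of every fold.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# cv=5 trains and evaluates the model five times, each time holding out a different fifth.
scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="accuracy")

print("Fold accuracies:", scores.round(3))
print("Mean accuracy:", scores.mean().round(3))
```

To try this on the Titanic features, `X` and `y` from the model preparation step can be passed in place of `X_demo` and `y_demo`.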

🧠 Key Learnings

✅ How to explore and clean a real-world dataset
✅ Understanding visual patterns in data
✅ Feature encoding and selection
✅ Building a logistic regression model
✅ Evaluating model accuracy and performance

💭 Reflections

This project was more than just code — it helped me gain confidence in using ML tools and understand the real process behind building predictive models. I'm now more excited than ever to keep exploring!

Thanks for reading! If you're also starting out in machine learning or have suggestions for improvement, I’d love to connect and hear your thoughts 💬

DEV Community