Shamanta Sristy

🛳️ Titanic Survival Prediction: A Gentle Introduction to Data Science for Beginners📊

Hey there, curious minds! 👋
If you're new to data science and machine learning like me, this post is just for you. In this blog, I’ll walk you through one of the most classic beginner projects — predicting survival on the Titanic 🚢 — in a way that’s super beginner-friendly. We’ll explore the steps I took, the code I wrote, and the lessons I learned, all while keeping things simple and clear.


📚 What's the Titanic Project?

The Titanic dataset is one of the most popular datasets used to learn data science. The goal is to predict whether a passenger survived or not based on information like their age, gender, ticket class, etc.

This project is perfect for learning how to:

  • Explore data 📊
  • Clean and prepare it 🧹
  • Visualize patterns 🎨
  • Apply machine learning 🤖

📚 Project Overview

We'll use features like age, sex, and ticket class to predict whether each passenger survived. Along the way, the project touches on:

  • Data analysis and visualization
  • Handling missing data
  • Feature engineering
  • Training a basic machine learning model (Logistic Regression)

🧰 Tools & Technologies Used

  • Language: Python
  • Libraries: Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn
  • IDE: Jupyter Notebook

🔍 Step-by-Step Walkthrough

1. Importing Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

2. Loading the Data

df = pd.read_csv("titanic.csv")
df.head()
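
Before moving on, it can help to confirm that the file loaded with the columns the later steps rely on. Here is a minimal sketch, assuming the usual Kaggle-style column names. The three rows in SAMPLE_CSV are made up and only stand in for titanic.csv, and check_columns is a hypothetical helper, not a pandas function.

```python
import io

import pandas as pd

# Made-up rows that mimic the layout of titanic.csv (illustration only).
SAMPLE_CSV = """PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin,Embarked
1,0,3,male,22,7.25,,S
2,1,1,female,38,71.28,C85,C
3,1,3,female,,7.92,,S
"""

# Columns the rest of the walkthrough uses.
REQUIRED = ["Survived", "Pclass", "Sex", "Age", "Fare", "Cabin", "Embarked"]


def check_columns(df, required=REQUIRED):
    """Return the required columns that are absent from df (hypothetical helper)."""
    return [col for col in required if col not in df.columns]


# read_csv accepts a file-like object, so no file on disk is needed here.
sample_df = pd.read_csv(io.StringIO(SAMPLE_CSV))
missing_cols = check_columns(sample_df)

print(sample_df.shape)   # (rows, columns)
print(sample_df.dtypes)  # column types as inferred by pandas
print("Missing columns:", missing_cols)
```

With the real file, the same check would be check_columns(df) right after pd.read_csv("titanic.csv").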

3. Checking for Missing Values

df.isnull().sum()

Three columns have missing values:

  • Age: 177 missing → filled with the median
  • Cabin: 687 missing → dropped
  • Embarked: 2 missing → filled with the mode
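
Raw counts are easier to judge as a share of all rows. The sketch below adds a percentage column to the missing-value report. The helper name missing_summary and the toy data are made up for illustration. If the file is the standard 891-row Kaggle training set (an assumption, since the post does not state the row count), 687 missing cabins would be roughly 77% of rows, which is a common reason to drop a column rather than fill it.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Made-up toy frame (illustration only, not the real Titanic data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, np.nan, np.nan, "C85"],
    "Fare": [7.25, 71.28, 8.05, 53.1],
})

report = missing_summary(toy)
print(report)
```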


4. Handling Missing Data

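# Assign the result back instead of calling fillna(inplace=True) on a column:
# newer pandas versions warn about that pattern and may not update df.
df['Age'] = df['Age'].fillna(df['Age'].median())
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
df.drop(columns='Cabin', inplace=True)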
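
One way to keep the cleaning steps reusable is to wrap them in a function that returns a cleaned copy. This is only a sketch: clean_titanic is a hypothetical name, and the rows below are made up so the result is easy to check by eye.

```python
import numpy as np
import pandas as pd


def clean_titanic(df):
    """Return a cleaned copy: median Age, most common Embarked, no Cabin."""
    out = df.copy()  # leave the caller's frame untouched
    out["Age"] = out["Age"].fillna(out["Age"].median())
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return out.drop(columns="Cabin")


# Made-up rows (illustration only).
raw = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Embarked": ["S", "S", None, "C"],
    "Cabin": [None, "C85", None, None],
})

cleaned = clean_titanic(raw)
print(cleaned)
print("Missing values left:", int(cleaned.isnull().sum().sum()))
```

The median is used for Age because it is less sensitive to a few very old or very young passengers than the mean.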

5. Visualizing the Data

Survival by Gender

sns.countplot(x='Survived', hue='Sex', data=df)
plt.title("Survival Count by Gender")

Survival by Passenger Class

sns.countplot(x='Survived', hue='Pclass', data=df)
plt.title("Survival Count by Class")
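
Count plots show how many passengers fall into each group, but groups differ in size, so a survival rate is often easier to compare. Here is a small sketch of that idea. The survival_rate helper and the toy rows are made up for illustration and are not real passenger data.

```python
import pandas as pd


def survival_rate(df, by):
    """Share of survivors within each group of column `by`."""
    # Survived is 0/1, so its mean is the fraction who survived.
    return df.groupby(by)["Survived"].mean()


# Made-up rows (illustration only).
toy = pd.DataFrame({
    "Sex": ["male", "male", "male", "male", "female", "female"],
    "Pclass": [3, 3, 1, 2, 1, 3],
    "Survived": [0, 0, 1, 0, 1, 1],
})

rate_by_sex = survival_rate(toy, "Sex")
rate_by_class = survival_rate(toy, "Pclass")
print(rate_by_sex)
print(rate_by_class)
```

On the real data, survival_rate(df, 'Sex') and survival_rate(df, 'Pclass') would give the numbers behind the two plots above.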

6. Correlation Heatmap

plt.figure(figsize=(10, 6))
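# numeric_only=True skips text columns such as Sex, which would
# otherwise raise an error in pandas 2.0 and later.
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')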
plt.title("Correlation Heatmap")
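
A heatmap shows every pair of columns at once. If the main question is which features move with survival, a sorted list of correlations with Survived can be quicker to read. This is a sketch on made-up rows, so the values mean nothing beyond showing the mechanics.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 1, 3, 2, 3],
    "Fare": [7.3, 71.3, 53.1, 8.1, 13.0, 7.9],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# numeric_only=True leaves out text columns such as Sex.
corr_with_target = (
    toy.corr(numeric_only=True)["Survived"]
    .drop("Survived")   # a column always correlates perfectly with itself
    .sort_values()
)
print(corr_with_target)
```

A negative value for Pclass means that a higher class number (3rd class) goes with lower survival in the data being measured.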

7. Feature Encoding

df['Sex_encoded'] = df['Sex'].map({'male': 0, 'female': 1})
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)
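
To see what these two lines actually produce, here is the same encoding on three made-up rows. The point to notice is that drop_first=True removes the first category ("C" here, since it sorts first), so a row with both Embarked_Q and Embarked_S false means the passenger embarked at C.

```python
import pandas as pd

# Made-up rows (illustration only).
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
})

# Two categories fit in a single 0/1 column.
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

# One-hot encode Embarked; the dropped category "C" becomes the baseline.
encoded = pd.get_dummies(toy, columns=["Embarked"], drop_first=True)

print(encoded.columns.tolist())
print(encoded)
```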

8. Model Preparation

X = df[['Pclass', 'Sex_encoded', 'Age', 'Fare', 'Embarked_Q', 'Embarked_S']]
y = df['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
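
An optional variation, not used in the split above, is to pass stratify=y so the training and test sets keep the same share of survivors. The sketch below shows the effect on made-up data with 40% survivors. The variable names are chosen for the example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40% survivors (illustration only).
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([1] * 40 + [0] * 60, name="Survived")

# stratify=y_toy keeps the survivor share the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))      # 80 20
print(y_tr.mean(), y_te.mean())  # survivor share in each part
```

Without stratification, a small test set can end up with noticeably more or fewer survivors than the training set purely by chance.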

9. Logistic Regression Model

# A higher max_iter gives the solver room to converge on unscaled features.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
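
Accuracy is a single number, so it hides which kind of mistake the model makes. A confusion matrix breaks it down. Here is a minimal sketch on made-up labels (not the model's real predictions) showing how accuracy falls out of the matrix.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels (illustration only): 1 = survived, 0 = did not survive.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 0]
y_hat = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of predictions that sit on the diagonal.
manual_accuracy = (tn + tp) / cm.sum()

print(cm)
print("Accuracy from matrix:", manual_accuracy)
print("accuracy_score:", accuracy_score(y_true, y_hat))
```

With the model above, the same call would be confusion_matrix(y_test, y_pred).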

🧠 Key Learnings

  • ✅ How to explore and clean a real-world dataset
  • ✅ Understanding visual patterns in data
  • ✅ Feature encoding and selection
  • ✅ Building a logistic regression model
  • ✅ Evaluating model accuracy and performance


💭 Reflections

This project was more than just code — it helped me gain confidence in using ML tools and understand the real process behind building predictive models. I'm now more excited than ever to keep exploring!

Thanks for reading! If you're also starting out in machine learning or have suggestions for improvement, I’d love to connect and hear your thoughts 💬
