Sandip Subedi

Posted on Jun 11 • Originally published at pythoninnercode.hashnode.dev

Titanic Survival Analysis — What the Data Reveals About Who Lived and Who Died

#ai #python #data #analyst

The Titanic disaster of 1912 is one of the most studied events in history. Over 1,500 people lost their lives when the ship sank in the North Atlantic. The passenger data shows a clear pattern: survival was not random. Your chances of surviving depended heavily on who you were and which class you travelled in.

In this project, I analyzed the Titanic passenger dataset to answer one central question: What factors determined whether a passenger survived?

This is my second data analysis project. My first project was an HR Employee Attrition Analysis — if you haven't read that one yet, check it out. For this project, I followed the same structured 5-phase approach and pushed myself to go deeper with the visualizations.

Full notebook on GitHub: github.com/sandipsubedi0/titanic-survival-analysis

Dataset Overview
Source: Kaggle — Titanic: Machine Learning from Disaster

Rows: 891 passengers

Columns: 12 features including Age, Sex, Pclass, Fare, Cabin, Embarked, and Survived

Tools used: Python, Pandas, NumPy, Matplotlib, Seaborn

Phase 1 — Setup and Data Loading
I started by importing the necessary libraries and loading the dataset using a relative file path. A simple but important habit — never use a hardcoded local path like C:\Users... in a shared notebook, because it will break on every other computer.

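# Core libraries for data handling and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
# Jupyter/IPython magic: render charts inside the notebook (remove it in a plain .py script)
%matplotlib inline
sns.set_style("whitegrid")

# Relative path, resolved from the notebook's working directory
data = pd.read_csv("Titanic-Dataset.csv")
data.head()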

First look at the data: 891 rows, 12 columns. The dataset includes passenger demographics, ticket details, cabin information, and whether they survived.

Phase 2 — Data Exploration (Before Cleaning)
Before touching anything, I explored the raw data to understand what I was working with.

Missing values:

Column      Missing Count   Missing %
Cabin       687             77.1%
Age         177             19.9%
Embarked    2               0.2%

The missing values heatmap made this visually clear: Cabin had a massive gap running through the entire column.
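A table like the one above can be produced in a few lines. This is a minimal sketch rather than the notebook's exact code: the toy DataFrame and the missing_summary helper are illustrative names I am assuming here, and the values are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, 54.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", "Q"],
})

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing counts and percentages, largest gap first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing_count": counts,
        "missing_pct": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["missing_count"] > 0]
    return summary.sort_values("missing_count", ascending=False)

summary = missing_summary(toy)
print(summary)
```

On the real data, calling missing_summary(data) should list Cabin, Age, and Embarked in that order.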

I also ran data.describe() to check the numerical columns. A few things stood out immediately:

Age ranged from 0.42 to 80 years — the youngest passenger was less than 1 year old

Fare had a huge range — minimum 0, maximum 512 — signaling strong economic inequality among passengers

Only 38% of passengers survived (Survived mean = 0.38)

For value_counts(), I only checked meaningful categorical columns: Survived, Pclass, Sex, and Embarked. Columns like PassengerId, Name, and Ticket are identifiers with (nearly) one distinct value per passenger, so counting their values produces no useful insight.
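One rough way to separate categorical columns from identifiers is to compare the number of distinct values with the number of rows. The sketch below uses a toy DataFrame with made-up values, and the 0.8 cutoff is an arbitrary assumption of mine, not a rule from the notebook.

```python
import pandas as pd

# Toy data (hypothetical values) with one identifier column and three categorical ones.
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Survived": [0, 1, 1, 0, 0],
    "Pclass": [3, 1, 3, 3, 2],
    "Sex": ["male", "female", "female", "male", "male"],
})

MAX_DISTINCT_RATIO = 0.8  # assumed cutoff: above this, treat the column as an identifier

# Distinct values per column, as a fraction of the row count.
distinct_ratio = toy.nunique() / len(toy)
categorical_cols = [col for col in toy.columns if distinct_ratio[col] <= MAX_DISTINCT_RATIO]

# Share of rows in each category, for the categorical columns only.
shares = {col: toy[col].value_counts(normalize=True) for col in categorical_cols}
for col in categorical_cols:
    print(shares[col], "\n")
```

In practice I would still pick the columns by hand, as above; the ratio is only a quick sanity check.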

Phase 3 — Data Cleaning
I made a working copy of the original data before applying any changes — always keep the raw data intact as a reference.

df = data.copy()

Three cleaning decisions:

Age: filled with median

Age had 177 missing values (19.9%). I filled with the median, not the mean. Why? Age has outliers (very young children, elderly passengers) that pull the mean away from the typical passenger. The median is more robust.

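# Assign the result back: calling fillna(..., inplace=True) on a selected column
# raises a FutureWarning in recent pandas and may not update df at all
df["Age"] = df["Age"].fillna(df["Age"].median())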
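To see why the median is the safer fill value, here is a small sketch with made-up ages (not taken from the dataset). A single elderly passenger moves the mean a long way but barely touches the median.

```python
import numpy as np
import pandas as pd

# Hypothetical ages: five young adults, one 80-year-old, one missing value.
ages = pd.Series([22.0, 24.0, 26.0, 28.0, 30.0, 80.0, np.nan])

mean_age = ages.mean()      # NaN is skipped; the single 80 pulls this up
median_age = ages.median()  # middle of the six observed values

# Fill the gap with the median and confirm nothing is missing afterwards.
filled = ages.fillna(median_age)

print(f"mean={mean_age}, median={median_age}")
```

Here the mean comes out at 35.0 while the median is 27.0, which is much closer to what a typical passenger in this toy sample looks like.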

Cabin: dropped entirely

77.1% of values were missing. That's too high to fill reliably, since any filling method would be guesswork on that scale. Dropping it was the right call.

df.drop(columns=["Cabin"], inplace=True)

Embarked: filled with mode

Only 2 values were missing. With such a small gap, filling with the most common value (mode) is a low-risk choice.

# Same pattern as Age: assign back instead of using inplace=True on a column
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

Verification: After cleaning, df.isnull().sum() showed zero missing values across all columns. The after-cleaning heatmap confirmed this — completely blank, exactly what we want to see.
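The three cleaning decisions and the verification step can also be wrapped in one function. This is a sketch under assumptions: clean_passengers is a hypothetical helper name, the toy rows are made up, and I use assignment rather than inplace=True as a stylistic choice.

```python
import numpy as np
import pandas as pd

def clean_passengers(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleaning steps to a copy, leaving the raw data untouched."""
    cleaned = raw.copy()
    cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
    cleaned = cleaned.drop(columns=["Cabin"])
    cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])
    return cleaned

# Toy rows (hypothetical values) with one gap in each affected column.
raw_toy = pd.DataFrame({
    "Age": [22.0, np.nan, 40.0],
    "Cabin": [np.nan, "C85", np.nan],
    "Embarked": ["S", np.nan, "S"],
})

cleaned_toy = clean_passengers(raw_toy)

# Verification: no missing values remain anywhere in the cleaned copy.
assert cleaned_toy.isnull().sum().sum() == 0
print(cleaned_toy)
```

Turning the check into an assert means the notebook stops immediately if a later change to the cleaning steps lets a missing value slip through.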

Phase 4 — Exploratory Data Analysis and Visualizations
This is where the real story begins. I built 7 charts, each designed to answer a specific question.

Chart 1 — Overall Survival Count

The first question: how many people actually survived?

Out of 891 passengers in the dataset, 342 survived (38.4%) and 549 did not (61.6%). More than 6 in 10 of these passengers did not make it. That sets the baseline for everything that follows.
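The percentages follow directly from the two counts. A quick check in plain Python, using only the numbers reported above (the variable names are mine):

```python
# Counts as reported in this section of the analysis.
survived = 342
did_not_survive = 549

total = survived + did_not_survive
survived_pct = round(survived / total * 100, 1)
did_not_survive_pct = round(did_not_survive / total * 100, 1)

print(f"{total} passengers: {survived_pct}% survived, {did_not_survive_pct}% did not")
```

With pandas, the same shares should come out of df["Survived"].value_counts(normalize=True).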

Chart 2 — Survival Rate by Gender
This is where the data gets striking.

Female survival rate: ~74%

Male survival rate: ~19%

Women were nearly 4x as likely to survive as men. This is the clearest pattern in the entire dataset. The "women and children first" evacuation protocol was not just a phrase: the data is consistent with it having been followed.

I used a bar chart here (not a pie chart). Survival rates are separate values for two groups — they don't add to 100% of anything, so a pie chart would be misleading.
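Because Survived is coded as 0 and 1, the mean of the column within each group is the survival rate. The sketch below shows the idea on made-up rows, so its numbers differ from the real ones. With the rates reported above, the ratio works out to 0.74 / 0.19, or roughly 3.9.

```python
import pandas as pd

# Toy rows (hypothetical): 4 women of whom 3 survived, 5 men of whom 1 survived.
toy = pd.DataFrame({
    "Sex": ["female"] * 4 + ["male"] * 5,
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0, 0],
})

# Mean of a 0/1 column per group = survival rate per group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# How many times more likely women were to survive than men in this toy sample.
ratio = rate_by_sex["female"] / rate_by_sex["male"]

print(rate_by_sex)
print(f"female-to-male ratio: {ratio:.2f}")
```

Passing rate_by_sex to a bar chart gives one bar per group, which is the reason a bar chart fits here better than a pie chart.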

Chart 3 — Survival Rate by Passenger Class

Passenger class tells us where on the ship you were located — and how close you were to the lifeboats.

1st Class: ~63% survival rate

2nd Class: ~47% survival rate

3rd Class: ~24% survival rate

The survival gap between 1st and 3rd class is enormous. Third-class passengers were housed in the lower decks, further from the lifeboats and with less time to reach the top deck. The dataset cannot show that mechanism directly, but the rates are consistent with economic status shaping survival chances.
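When comparing rates across classes, it helps to keep the group sizes next to the rates, since a rate from a handful of passengers is less trustworthy than one from hundreds. A small sketch on made-up rows, with column names for the output that I chose for illustration:

```python
import pandas as pd

# Toy rows (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 1, 0],
})

# Named aggregation: group size, number of survivors, and survival rate per class.
by_class = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    rate="mean",
)

print(by_class)
```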

Chart 4 — Age Distribution of All Passengers

A histogram of all passenger ages shows a right-skewed distribution. Most passengers were between 20 and 40 years old. There were relatively few children and elderly passengers compared to working-age adults.

The youngest passenger recorded was under 1 year old. The oldest was 80.
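The shape of the histogram can be backed up with two numbers: the skewness (positive means a longer tail to the right) and the share of passengers aged 20 to 40. This is a sketch on made-up ages, so the values are illustrative only.

```python
import pandas as pd

# Hypothetical ages, chosen to mimic a cluster of adults plus a few older passengers.
ages = pd.Series([2, 18, 22, 24, 27, 29, 31, 35, 38, 45, 62, 80], dtype=float)

# Sample skewness: greater than 0 indicates a right-skewed distribution.
skewness = ages.skew()

# Fraction of passengers aged 20 to 40 (both ends inclusive).
share_20_to_40 = ages.between(20, 40).mean()

print(f"skewness={skewness:.2f}, share aged 20-40={share_20_to_40:.2f}")
```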

Chart 5 — Age by Survival (Overlapping Histogram)
This is one of the most informative charts in the project. By plotting two histograms on the same axes — one for survivors, one for non-survivors — with alpha=0.6 on both so they're visible through each other, the overlap pattern becomes clear.

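# Shared bin edges so the two histograms line up bar for bar
bins = np.linspace(0, 80, 21)
plt.hist(df[df["Survived"]==1]["Age"], alpha=0.6, label="Survived", bins=bins)
plt.hist(df[df["Survived"]==0]["Age"], alpha=0.6, label="Did not survive", bins=bins)
plt.xlabel("Age")
plt.legend()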

Young children (under ~10) show a higher proportion of survivors relative to non-survivors, consistent with "children first." For adults aged 20–40, non-survivors heavily outnumber survivors, reflecting the large number of 3rd-class male passengers in that age group. Keep in mind that the 177 median-filled ages from Phase 3 all land in a single bin in this range, which exaggerates that peak.
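The same comparison can be put into numbers by cutting Age into bands and taking the survival rate per band. The band edges and labels below are my own assumption, and the rows are made up.

```python
import pandas as pd

# Toy rows (hypothetical values, not the real dataset).
toy = pd.DataFrame({
    "Age": [4, 8, 25, 30, 35, 60, 2, 28],
    "Survived": [1, 1, 0, 0, 1, 0, 0, 0],
})

# pd.cut uses right-inclusive intervals by default: (0, 10], (10, 20], (20, 40], (40, 80].
bands = pd.cut(
    toy["Age"],
    bins=[0, 10, 20, 40, 80],
    labels=["0-10", "10-20", "20-40", "40-80"],
)

# Survival rate per age band; observed=True drops bands with no passengers.
rate_by_band = toy.groupby(bands, observed=True)["Survived"].mean()

print(rate_by_band)
```

A table of rates per band is easier to quote than reading bar heights off an overlapping histogram, though the histogram is still better for seeing the overall shape.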

Chart 6 — Fare Distribution

The fare histogram reveals extreme economic inequality on board. The distribution is heavily right-skewed: the vast majority of passengers paid low fares (under £50), while a small number paid extremely high amounts (up to about £512).

This roughly maps to passenger class: 3rd-class passengers paid low fares, 1st-class passengers paid high fares. And as we saw in Chart 3, class directly correlated with survival.
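The link between fare and class can be checked with the median fare per class and the share of passengers under a fare threshold. This is a sketch on made-up fares; the £50 threshold comes from the text above, while everything else is illustrative.

```python
import pandas as pd

# Toy rows (hypothetical fares, not the real dataset).
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3],
    "Fare": [512.0, 80.0, 26.0, 13.0, 8.0, 7.5, 7.0],
})

# Median rather than mean, because a few very high fares would distort the mean.
median_fare = toy.groupby("Pclass")["Fare"].median()

# Fraction of passengers who paid less than 50.
share_under_50 = (toy["Fare"] < 50).mean()

print(median_fare)
print(f"share under 50: {share_under_50:.2f}")
```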

Chart 7 — Gender × Class Survival Heatmap

This is the most powerful chart in the project. Instead of looking at gender and class separately, I combined them into a single heatmap using a pivot table.

pivot = df.pivot_table(values="Survived", index="Sex", columns="Pclass", aggfunc="mean")
sns.heatmap(pivot, annot=True, fmt=".2f", cmap="Blues")

The results:

Sex       1st Class   2nd Class   3rd Class
Female    ~0.97       ~0.92       ~0.50
Male      ~0.37       ~0.16       ~0.14

1st-class females had a ~97% survival rate. They were almost certain to survive. 3rd-class males had a ~14% survival rate. They had almost no chance.

The difference between those two groups is 83 percentage points — from the same disaster, on the same ship, at the same time.
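The 83-point figure is a difference in percentage points between the best-off and worst-off cells of the table. A quick check using the rounded rates from the results above (the dictionary layout is mine):

```python
# Rounded survival rates from the gender x class results, keyed by (sex, class).
rates = {
    ("female", 1): 0.97, ("female", 2): 0.92, ("female", 3): 0.50,
    ("male", 1): 0.37, ("male", 2): 0.16, ("male", 3): 0.14,
}

best_group = max(rates, key=rates.get)
worst_group = min(rates, key=rates.get)

# Difference between two rates, expressed in percentage points.
gap_pp = round((rates[best_group] - rates[worst_group]) * 100)

print(best_group, worst_group, gap_pp)
```

Percentage points are the right unit here: in relative terms, the 1st-class female rate is about seven times the 3rd-class male rate.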

Phase 5 — Key Findings and Conclusion
Key Findings
Only 38% of passengers survived — the majority of people on board did not make it.

Gender was the strongest single factor: female passengers survived at ~74% vs ~19% for males, consistent with the "women and children first" evacuation protocol.

Passenger class tracked survival closely: 1st class survived at ~63%, 3rd class at only ~24%. The class you travelled in was strongly tied to your chances of survival.

The combined effect was extreme — 1st-class females had a ~97% survival rate while 3rd-class males had only ~14%. The gap between best and worst case is 83 percentage points.

Children showed higher survival rates — the overlapping age histogram showed young children were more likely to survive relative to adults.

Fare inequality mirrored class inequality: most passengers paid very little, a few paid enormous amounts, and higher fares went hand in hand with the higher classes that survived at higher rates.

Conclusion
The Titanic data tells a clear story: survival was not random. Gender and passenger class were the two dominant factors, and when combined, they produced an extreme range of outcomes. A 1st-class female passenger had near-certain survival. A 3rd-class male passenger had almost no chance.

The "women and children first" protocol shows up clearly in the data. But access to the upper decks, proximity to lifeboats, and crew assistance were all filtered through socioeconomic status, and the class gap in survival rates is consistent with wealthier passengers having structural advantages.

This project taught me how to move from raw data to real insight — not just running code, but understanding what the numbers actually mean about human lives.

What's Next
This is Project 2 of my data analyst portfolio. I'm continuing to build projects that cover real-world datasets and develop my skills in Python, Pandas, and visualization.

Connect with me:

🔗 GitHub: github.com/sandipsubedi0

💼 LinkedIn: linkedin.com/in/sandip-subedi-5694b136a

📸 Instagram: @sandipsubedi0

Thanks for reading. If you found this useful, share it with someone learning data analysis.