Predicting Survival on the Titanic: A Machine Learning Approach

1. Introduction

The sinking of the RMS Titanic in 1912 was one of the most tragic maritime catastrophes in history, claiming a great number of lives. Beyond the disaster's immense scale, the intricate interplay of factors that affected survival rates has long captivated scholars.

To create a predictive model, this project examines the Titanic dataset, a comprehensive collection of passenger data. We aim to identify the primary factors influencing survival and to build a robust model that can forecast a person's chances of surviving the disaster from characteristics such as age, gender, and passenger class.

In addition to illuminating past trends, the knowledge gathered from this analysis will show how effective machine learning is at deriving insightful forecasts from intricate real-world data.

2. Understanding the Dataset

The dataset used (https://github.com/Loi2008/Data_Science_Assignments/blob/main/tested.csv) contains information about Titanic passengers. The goal is to develop a model that predicts the likelihood of survival (the Survived column: 0 = No, 1 = Yes) from the other features. The dataset contains the following columns:

  • PassengerId: Unique identifier for each passenger.
  • Survived: Survival (0 = No, 1 = Yes) - Target variable.
  • Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
  • Name: Passenger's name.
  • Sex: Sex (male/female).
  • Age: Age in years.
  • SibSp: Number of siblings/spouses aboard.
  • Parch: Number of parents/children aboard.
  • Ticket: Ticket number.
  • Fare: Passenger fare.
  • Cabin: Cabin number.
  • Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).

3. Building the Model

Goal: predict each passenger's survival likelihood (the 'Survived' column) from the other passenger features. Below are the steps involved, together with the Python code for each step:

a. Import the Required Libraries

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder 
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

b. Suppress Warnings for Cleaner Output

warnings.filterwarnings('ignore')

c. Load the Data

Read the CSV file into a pandas DataFrame.

df = pd.read_csv(r"tested.csv")
df.head()

Output

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 0 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 1 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 0 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 0 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 1 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S

d. Exploratory Data Analysis (EDA)

Understand the data (data types, missing values and distributions).

df.info()

Output

RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):

# Column Non-Null Count Dtype
0 PassengerId 418 non-null int64
1 Survived 418 non-null int64
2 Pclass 418 non-null int64
3 Name 418 non-null object
4 Sex 418 non-null object
5 Age 332 non-null float64
6 SibSp 418 non-null int64
7 Parch 418 non-null int64
8 Ticket 418 non-null object
9 Fare 417 non-null float64
10 Cabin 91 non-null object
11 Embarked 418 non-null object

dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB

Handling missing values
Number of missing values for each column

missing = df.isnull().sum()
non_null = df.notnull().sum()
total = len(df)

# Build summary DataFrame
summary = pd.DataFrame({
    "Non-Null Count": non_null,
    "Missing Values": missing,
    "Missing %": (missing / total * 100).round(1),
    "Dtype": df.dtypes
})

# Rename index to 'Column'
summary.index.name = "Column"

# Display
summary

Output

Column Non-Null Count Missing Values Missing % Dtype
PassengerId 418 0 0.0% int64
Survived 418 0 0.0% int64
Pclass 418 0 0.0% int64
Name 418 0 0.0% object
Sex 418 0 0.0% object
Age 332 86 20.6% float64
SibSp 418 0 0.0% int64
Parch 418 0 0.0% int64
Ticket 418 0 0.0% object
Fare 417 1 0.2% float64
Cabin 91 327 78.2% object
Embarked 418 0 0.0% object

Create a copy of the dataset so that later modifications do not alter the original DataFrame (df).

df_processed = df.copy()

The dataset's statistical distribution
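
The summary below appears to have been generated with the pandas describe() method on the processed DataFrame (the encoded Sex and Embarked columns appear in it); a minimal sketch:

# Summary statistics (count, mean, std, quartiles) for the numeric columns
df_processed.describe()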

Output

Survived Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
count 418.0000 418.0000 418.0000 418.0000 418.0000 418.0000 418.0000 418.0000 418.0000 418.0000
mean 0.3636 2.2656 0.6364 29.5993 0.4474 0.3923 35.5765 0.2440 0.1100 0.6459
std 0.4816 0.8418 0.4816 12.7038 0.8968 0.9814 55.8501 0.4300 0.3133 0.4788
min 0.0000 1.0000 0.0000 0.1700 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
25% 0.0000 1.0000 0.0000 23.0000 0.0000 0.0000 7.8958 0.0000 0.0000 0.0000
50% 0.0000 3.0000 1.0000 27.0000 0.0000 0.0000 14.4542 0.0000 0.0000 1.0000
75% 1.0000 3.0000 1.0000 35.7500 1.0000 0.0000 31.4719 0.0000 0.0000 1.0000
max 1.0000 3.0000 1.0000 76.0000 8.0000 9.0000 512.3292 1.0000 1.0000 1.0000

Impute the missing values with the median. SimpleImputer is used instead of fillna() because, once fitted, the same imputer can be reused on new data later. The median is preferred over the mean because the data is skewed: in the statistical summary above, the means of "Age" and "Fare" differ noticeably from their medians.

imputer_age = SimpleImputer(strategy='median')
df_processed['Age'] = imputer_age.fit_transform(df_processed[['Age']])
imputer_fare = SimpleImputer(strategy='median')
df_processed['Fare'] = imputer_fare.fit_transform(df_processed[['Fare']])

Drop the Cabin column: it has a high percentage of missing values (78.2%), and using it directly would require complex feature engineering.

df_processed = df_processed.drop('Cabin', axis=1)

Dropping irrelevant columns.
PassengerId, Name, Ticket are usually unique identifiers or free-text fields that don't directly contribute to survival prediction for basic models.

df_processed = df_processed.drop(['PassengerId', 'Name', 'Ticket'], axis=1)

Encode categorical columns - 'Sex' and 'Embarked'.

  • Sex: A binary categorical feature (male/female). LabelEncoder converts it to numbers the models can work with (here female = 0 and male = 1, following alphabetical order).
  • Embarked: Has three categories (S, C, Q). OneHotEncoder converts it into separate binary columns - Embarked_C, Embarked_Q, Embarked_S. This prevents the model from assuming an ordinal relationship between the categories.

# 'Sex': Use LabelEncoder (binary feature)
le = LabelEncoder()
df_processed['Sex'] = le.fit_transform(df_processed['Sex']) # male=1, female=0. Or male=0, female=1 depending on internal sorting.

# 'Embarked': Use OneHotEncoder
print("Unique Embarked values before OneHotEncoding:", df_processed['Embarked'].unique())
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
embarked_encoded = ohe.fit_transform(df_processed[['Embarked']])
embarked_df = pd.DataFrame(embarked_encoded, columns=ohe.get_feature_names_out(['Embarked']), index=df_processed.index)
df_processed = pd.concat([df_processed.drop('Embarked', axis=1), embarked_df], axis=1)

print("DataFrame after preprocessing. First 5 rows:")
print(df_processed.head())
print("\n")
print("DataFrame Info after preprocessing:")
df_processed.info()
print("\n")

Output

Survived Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
0 0 3 1 34.5 0 0 7.8292 0.0 1.0 0.0
1 1 3 0 47.0 1 0 7.0000 0.0 0.0 1.0
2 0 2 1 62.0 0 0 9.6875 0.0 1.0 0.0
3 0 3 1 27.0 0 0 8.6625 0.0 0.0 1.0
4 1 3 0 22.0 1 1 12.2875 0.0 0.0 1.0

The final columns

# Column Non-Null Count Dtype
0 Survived 418 non-null int64
1 Pclass 418 non-null int64
2 Sex 418 non-null int64
3 Age 418 non-null float64
4 SibSp 418 non-null int64
5 Parch 418 non-null int64
6 Fare 418 non-null float64
7 Embarked_C 418 non-null float64
8 Embarked_Q 418 non-null float64
9 Embarked_S 418 non-null float64

e. Splitting Data

The dataset is divided into training and testing sets.

print("Splitting Data into Training and Test Sets")
X = df_processed.drop('Survived', axis=1) 
y = df_processed['Survived']             

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("\n")

Output
The data is split into training and test sets as illustrated:

  • X_train shape: (334, 9)
  • X_test shape: (84, 9)
  • y_train shape: (334,)
  • y_test shape: (84,)

Elaboration on the splitting code

  • X = df_processed.drop('Survived', axis=1): Creates the feature matrix X by dropping the target variable.
  • y = df_processed['Survived']: Creates the target vector y.
  • train_test_split(X, y, test_size=0.2, random_state=42, stratify=y): Splits the data into 80% for training and 20% for testing.
  • random_state: Ensures reproducibility of the split.
  • stratify=y: Important for classification problems to ensure that the proportion of 'Survived' (0s and 1s) is roughly the same in both training and test sets.
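
As a quick sanity check (not part of the original notebook), you can confirm that stratification kept the survivor proportion comparable in both splits:

# Proportion of each Survived class in the training and test sets
print(y_train.value_counts(normalize=True).round(3))
print(y_test.value_counts(normalize=True).round(3))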

f. Scaling Numerical Features (Columns)

StandardScaler standardizes features by removing the mean and scaling to unit variance. This is important for algorithms that are sensitive to the scale of the input features, such as Logistic Regression, because it prevents features with larger numeric ranges from dominating the learning process and ensures each feature contributes on a comparable scale.

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("X_train_scaled (first 5 rows):")
print(X_train_scaled[:5])
print("\n")
print("X_test_scaled (first 5 rows):")
print(X_test_scaled[:5])
print("\n")

Output
X_train_scaled (first 5 rows)

Row Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
1 0.85435834 0.75370758 -0.21477642 -0.48043064 -0.41184087 -0.50912957 -0.54736724 -0.33665016 0.70551956
2 0.85435834 0.75370758 -0.68236136 -0.48043064 -0.41184087 -0.49535877 -0.54736724 -0.33665016 0.70551956
3 0.85435834 0.75370758 -0.21477642 -0.48043064 -0.41184087 -0.49615131 -0.54736724 2.97044263 -1.41739515
4 0.85435834 0.75370758 -1.61753124 -0.48043064 0.67126820 -0.57539142 -0.54736724 -0.33665016 0.70551956
5 0.85435834 -1.32677450 -0.21477642 -0.48043064 -0.41184087 -0.49564602 -0.54736724 2.97044263 -1.41739515

X_test_scaled (first 5 rows)

Row Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
1 0.85435834 -1.32677450 -0.21477642 -0.48043064 -0.41184087 -0.49615131 -0.54736724 2.97044263 -1.41739515
2 -1.49424813 -1.32677450 -0.37063806 0.61116007 -0.41184087 0.32912293 1.82692702 -0.33665016 -1.41739515
3 0.85435834 0.75370758 -0.21477642 -0.48043064 -0.41184087 -0.49362833 -0.54736724 -0.33665016 0.70551956
4 -0.31994489 -1.32677450 -0.76029218 -0.48043064 -0.41184087 0.00567507 -0.54736724 -0.33665016 0.70551956
5 -1.49424813 0.75370758 -0.37063806 -0.48043064 -0.41184087 -0.18034678 1.82692702 -0.33665016 -1.41739515

4. Choosing and Training the Model

Two models were chosen:
Logistic Regression: Chosen for its clear interpretability, providing insight into how each factor linearly influences survival probability, and serving as a strong, efficient baseline.

print("Training Logistic Regression Model...")
log_reg_model = LogisticRegression(random_state=42, solver='liblinear')
log_reg_model.fit(X_train_scaled, y_train)
print("Logistic Regression Model Trained.\n")

log_reg_model.fit(X_train_scaled, y_train): Trains the model using the scaled training data.

Random Forest Classifier: An ensemble method that can offer higher accuracy by capturing complex, non-linear relationships and interactions without extensive data preprocessing.

print("Training Random Forest Classifier Model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Random Forest is less sensitive to feature scaling, so we can use unscaled X
rf_model.fit(X_train, y_train)
print("Random Forest Classifier Model Trained.\n")

rf_model.fit(X_train, y_train): Trains the model. Random Forests are less sensitive to feature scaling, so we can use the unscaled X_train here.

Together, they allow for a comprehensive analysis, balancing transparency with predictive power to understand Titanic survival.

5. Evaluating the Models

The predictions on the scaled test set for Logistic Regression are generated (y_pred_log_reg = log_reg_model.predict(X_test_scaled)).

y_pred_log_reg = log_reg_model.predict(X_test_scaled)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {accuracy_log_reg:.4f}")
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log_reg))
print("\n")

Output

Trained Logistic Regression Model Output

The predictions on the unscaled test set for Random Forest are generated (y_pred_rf = rf_model.predict(X_test))

y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("\n")

Output

Trained Random Forest Model Output
From the output of the two models:

  • accuracy_score: Calculates the proportion of correctly classified instances.
  • confusion_matrix: Shows the number of true positives, true negatives, false positives, and false negatives.
  • classification_report: Provides a detailed summary of precision, recall, f1-score, and support for each class.
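
To put the two models side by side, a small convenience snippet (not in the original code) can reuse the accuracy values computed above:

# Side-by-side comparison of the accuracy scores computed earlier
comparison = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest'],
    'Accuracy': [accuracy_log_reg, accuracy_rf]
})
print(comparison)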

6. Making predictions using sample data

  • Using hypothetical new single passenger and then multiple passengers
  • The data will be processed through the same steps
    • Removing irrelevant columns
    • Impute missing values if any.
    • Encode categorical features using the fitted encoders
    • Ensuring the column order matches the training data (X_train) - this is crucial (create a DataFrame with the same columns as X_train, then fill it)
    • Prediction using logistic regression and random forest
    • Visualizing key relationships

a. Capturing the data

Using a Single Passenger

new_passenger_features_raw = pd.DataFrame({
    'Pclass': [1],
    'Name': ['New, Mrs. Example (Test User)'], # Name added for consistency, but will be dropped
    'Sex': ['female'],
    'Age': [30],
    'SibSp': [0],
    'Parch': [0],
    'Ticket': ['TEST12345'], # Ticket added for consistency, but will be dropped
    'Fare': [50],
    'Cabin': [None], # Cabin added for consistency, but will be dropped
    'Embarked': ['S']
})

b. Dropping irrelevant columns

new_passenger_processed = new_passenger_features_raw.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')

c. Imputing missing values

new_passenger_processed['Age'] = imputer_age.transform(new_passenger_processed[['Age']])
new_passenger_processed['Fare'] = imputer_fare.transform(new_passenger_processed[['Fare']])

d. Encoding categorical features

new_passenger_processed['Sex'] = le.transform(new_passenger_processed['Sex'])
embarked_new = ohe.transform(new_passenger_processed[['Embarked']])
embarked_new_df = pd.DataFrame(embarked_new, columns=ohe.get_feature_names_out(['Embarked']))
new_passenger_processed = pd.concat([new_passenger_processed.drop('Embarked', axis=1), embarked_new_df], axis=1)

e. Ensure column order matches training data for X_train

new_passenger_final = pd.DataFrame(columns=X.columns)
new_passenger_final = pd.concat([new_passenger_final, new_passenger_processed], ignore_index=True)
new_passenger_final = new_passenger_final.fillna(0) 

f. Prediction with the models

Logistic Regression using scaled data

new_passenger_scaled = scaler.transform(new_passenger_final)
log_reg_prediction = log_reg_model.predict(new_passenger_scaled)
print(f"Logistic Regression predicts survival for new passenger: {'Survived' if log_reg_prediction[0] == 1 else 'Not Survived'}")

Output: Logistic Regression predicts survival for new passenger: Survived

Random Forest using unscaled data

rf_prediction = rf_model.predict(new_passenger_final)
print(f"Random Forest predicts survival for new passenger: {'Survived' if rf_prediction[0] == 1 else 'Not Survived'}")

Output: Random Forest predicts survival for new passenger: Survived
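
Beyond the hard class label, both fitted models can also report a survival probability through predict_proba; a minimal sketch (the exact values depend on the trained models):

# Probability of survival (class 1) for the single new passenger
lr_prob = log_reg_model.predict_proba(new_passenger_scaled)[0, 1]
rf_prob = rf_model.predict_proba(new_passenger_final)[0, 1]
print(f"Logistic Regression survival probability: {lr_prob:.2f}")
print(f"Random Forest survival probability: {rf_prob:.2f}")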

Using Multiple Passengers

a. Capture multiple passengers' data

new_passengers_features_raw = pd.DataFrame({
    'Pclass': [1, 2, 3],
    'Name': [
        'New, Mrs. Example (Test User 1)',
        'New, Mr. Sample (Test User 2)',
        'New, Miss Demo (Test User 3)'
    ],
    'Sex': ['female', 'male', 'female'],
    'Age': [30, None, 22],        # Example: missing Age for passenger 2
    'SibSp': [0, 1, 0],
    'Parch': [0, 0, 1],
    'Ticket': ['TEST12345', 'TEST12346', 'TEST12347'],
    'Fare': [50, None, 15],       # Example: missing Fare for passenger 2
    'Cabin': [None, None, None],
    'Embarked': ['S', 'C', 'Q']
})

b. Dropping irrelevant columns

new_passengers_processed = new_passengers_features_raw.drop(
    ['PassengerId', 'Name', 'Ticket', 'Cabin'],
    axis=1,
    errors='ignore'
)

c. Imputing missing values using fitted imputers

new_passengers_processed['Age'] = pd.Series(
    imputer_age.transform(new_passengers_processed[['Age']]).flatten(),
    index=new_passengers_processed.index
)
new_passengers_processed['Fare'] = pd.Series(
    imputer_fare.transform(new_passengers_processed[['Fare']]).flatten(),
    index=new_passengers_processed.index
)

d. Encoding categorical features

#Label encode 'Sex'
new_passengers_processed['Sex'] = le.transform(new_passengers_processed['Sex'])

#One-hot encode 'Embarked'
embarked_new = ohe.transform(new_passengers_processed[['Embarked']])
embarked_new_df = pd.DataFrame(
    embarked_new, 
    columns=ohe.get_feature_names_out(['Embarked']),
    index=new_passengers_processed.index
)
new_passengers_processed = pd.concat(
    [new_passengers_processed.drop('Embarked', axis=1), embarked_new_df],
    axis=1
)

e. Align columns with training data

This handles cases where new_passengers_processed might be missing a dummy column (e.g., if the new passengers only produce 'Embarked_S' and not 'Embarked_C').

new_passengers_final = pd.DataFrame(columns=X.columns)
new_passengers_final = pd.concat([new_passengers_final, new_passengers_processed], ignore_index=True)
new_passengers_final = new_passengers_final.fillna(0)

print(new_passengers_final)

Output

Pclass Sex Age SibSp Parch Fare Embarked_C Embarked_Q Embarked_S
1 0 30.0 0 0 50.0000 0.0 0.0 1.0
2 1 27.0 1 0 14.4542 1.0 0.0 0.0
3 0 22.0 0 1 15.0000 0.0 1.0 0.0
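
An equivalent and more concise way to align new data with the training columns is DataFrame.reindex, which creates any missing dummy columns and fills them with 0 (an alternative sketch, not the approach used above):

# Align columns with X.columns; missing columns are added and filled with 0
new_passengers_final = new_passengers_processed.reindex(columns=X.columns, fill_value=0)
print(new_passengers_final)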

f. Scale features and predict survival for multiple passengers

Logistic Regression

print("--- Logistic Regression Predictions ---")
new_passengers_scaled = scaler.transform(new_passengers_final)
log_reg_predictions = log_reg_model.predict(new_passengers_scaled)
log_reg_probabilities = log_reg_model.predict_proba(new_passengers_scaled)[:, 1] # Probability of survival

# Create results_lr using the *index* of new_passengers_final
results_lr = pd.DataFrame({
    'Predicted_Survival_LR': ['Survived' if p == 1 else 'Not Survived' for p in log_reg_predictions],
    'Survival_Probability_LR': log_reg_probabilities.round(2)
}, index=new_passengers_final.index) # Use the index of the processed features for results_lr

print(results_lr)
print("\n")

Output

# Predicted_Survival_LR Survival_Probability_LR
0 Survived 0.99
1 Not Survived 0.01
2 Survived 0.99

Random Forest

print("--- Random Forest Predictions ---")
# Random Forest uses unscaled data, so use new_passengers_final directly
rf_predictions = rf_model.predict(new_passengers_final)
rf_probabilities = rf_model.predict_proba(new_passengers_final)[:, 1] # Probability of survival

results_rf = pd.DataFrame({
    'Predicted_Survival_RF': ['Survived' if p == 1 else 'Not Survived' for p in rf_predictions],
    'Survival_Probability_RF': rf_probabilities.round(4)
})
print(results_rf)
print("\n")

Output

# Predicted_Survival_RF Survival_Probability_RF
0 Survived 0.96
1 Not Survived 0.01
2 Survived 0.96

Combined Prediction

combined_results = pd.merge(results_lr, results_rf, left_index=True, right_index=True)
print("Combined Predictions (merged on index)")
print(combined_results)
print("\n")

Output

# Predicted_Survival_LR Survival_Probability_LR Predicted_Survival_RF Survival_Probability_RF
0 Survived 0.99 Survived 0.96
1 Not Survived 0.01 Not Survived 0.01
2 Survived 0.99 Survived 0.96

7. Visualizing Key Relationships

Survival rate by Sex

print("\n Visualizing Key Relationships - By Sex")
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df_processed.replace({'Sex': {0: 'Female', 1: 'Male'}}))
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()

This visually shows the higher survival rate for females.
Predicted Survival Rate by Sex

Survival rate by Pclass -
illustrates how survival rates vary across different passenger classes.

print("\n Visualizing Key Relationships - By PClass")
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df_processed)
plt.title('Survival Rate by Pclass')
plt.ylabel('Survival Rate')
plt.show()

Survival Rate by Passenger Class

Distribution of Age and its relation to Survival

The histogram compares the age distribution of survivors vs. non-survivors.

plt.figure(figsize=(10, 6))
sns.histplot(df_processed[df_processed['Survived'] == 0]['Age'], color='red', label='Not Survived', kde=True)
sns.histplot(df_processed[df_processed['Survived'] == 1]['Age'], color='green', label='Survived', kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()

Relationship between Age and Survival Rate

Feature Importance for Random Forest
Since Random Forest can report feature importances, visualizing or printing rf_model.feature_importances_ adds interpretability and credibility to our model.

importances = rf_model.feature_importances_
feature_names = X.columns
forest_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)

plt.figure(figsize=(10, 6))
sns.barplot(x=forest_importances.values, y=forest_importances.index)
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()

Feature Importance

8. Conclusion and Summary

Conclusion
This project effectively leveraged machine learning to forecast the survival of Titanic passengers. The tragic historical event emphasizes the crucial role that socioeconomic and demographic factors play in crisis outcomes. The predictions confirmed that sex and passenger class were by far the most important factors influencing survival. This was evident in both the Random Forest Classifier and Logistic Regression, with women and those in higher classes having a significantly higher probability of being saved. Additionally, age was a factor, favoring very young children in particular.

Complementary insights were obtained by applying the two separate models. As a strong starting point, Logistic Regression provided a clear, intelligible explanation of the linear influence of each characteristic on survival probability. The Random Forest Classifier demonstrated the effectiveness of ensemble approaches for subtle pattern identification; it typically achieved greater predictive accuracy thanks to its capacity to capture complicated non-linear relationships and interactions. The results were further supported by the visual analyses, which offered clear graphical evidence of the differences in survival rates.
In conclusion, this project demonstrated machine learning's ability to glean significant, actionable insights from historical data, shedding light on historical trends and providing a potent tool for comprehending intricate real-world phenomena, in addition to producing strong predictive models for the Titanic disaster.

Summary
The process of creating and assessing Titanic passenger survival prediction models was covered at length in this article. It began with an overview of the problem and proceeded through the fundamental stages of a machine learning workflow, with emphasis on training, evaluating, and testing the models using sample data.

References

  • Atieno, L., Robina, F., & Otieno, M. (2025). Survival Likelihood Model. (https://github.com/Loi2008/Data_Science_Assignments/blob/main/Prediction_Model_Titanic_Dataset.ipynb)
  • Dawson, E. (1997). The Titanic Disaster: Historical and Social Perspectives. Journal of Maritime History.
  • Kaggle. (n.d.). Titanic: Machine Learning from Disaster.
  • Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
  • Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. Wiley Series in Probability and Statistics.

Further Reading

  • Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media.
  • Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
  • Raschka, S., & Mirjalili, V. (2019). Python Machine Learning. Packt Publishing.
  • Blog: Top 10 Machine Learning Algorithms You Should Know – Towards Data Science
