1. Introduction
The sinking of the RMS Titanic in 1912 remains one of the most tragic maritime catastrophes in history, claiming approximately 1,500 lives. Beyond the sheer scale of the disaster, scholars have long been fascinated by the complex interplay of factors that influenced who survived.
To build a predictive model, this analysis examines the Titanic dataset, a collection of passenger records. By studying characteristics such as age, sex, and passenger class, we aim to identify the primary factors influencing survival and develop a robust model that can forecast a passenger's chances of surviving the disaster.
Beyond illuminating historical patterns, the insights gathered from this analysis demonstrate how effectively machine learning can derive useful predictions from complex real-world data.
2. Understanding the Dataset
The dataset used (https://github.com/Loi2008/Data_Science_Assignments/blob/main/tested.csv) contains information about Titanic passengers. The goal is to build a model that predicts the likelihood of survival (the Survived column: 0 = No, 1 = Yes) from the other features. The dataset contains the following columns:
- PassengerId: Unique identifier for each passenger.
- Survived: Survival (0 = No, 1 = Yes) - Target variable.
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Name: Passenger's name.
- Sex: Sex (male/female).
- Age: Age in years.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Ticket: Ticket number.
- Fare: Passenger fare.
- Cabin: Cabin number.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
3. Building the Model
Goal: Predict each passenger's survival likelihood (the 'Survived' column) from the other passenger features. Below are the steps involved, together with the Python code for each step:
a. Import the Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
b. Suppress Warnings for Cleaner Output
warnings.filterwarnings('ignore')
c. Load the Data
Read the CSV file into a pandas DataFrame.
df = pd.read_csv(r"tested.csv")
df.head()
Output
# | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
d. Exploratory Data Analysis (EDA)
Understand the data (data types, missing values and distributions).
df.info()
Output
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | PassengerId | 418 non-null | int64 |
1 | Survived | 418 non-null | int64 |
2 | Pclass | 418 non-null | int64 |
3 | Name | 418 non-null | object |
4 | Sex | 418 non-null | object |
5 | Age | 332 non-null | float64 |
6 | SibSp | 418 non-null | int64 |
7 | Parch | 418 non-null | int64 |
8 | Ticket | 418 non-null | object |
9 | Fare | 417 non-null | float64 |
10 | Cabin | 91 non-null | object |
11 | Embarked | 418 non-null | object |
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
Handling missing values
Number of missing values for each column
missing = df.isnull().sum()
non_null = df.notnull().sum()
total = len(df)
# Build summary DataFrame
summary = pd.DataFrame({
"Non-Null Count": non_null,
"Missing Values": missing,
"Missing %": (missing / total * 100).round(1),
"Dtype": df.dtypes
})
# Rename index to 'Column'
summary.index.name = "Column"
# Display
summary
Output
Column | Non-Null Count | Missing Values | Missing % | Dtype |
---|---|---|---|---|
PassengerId | 418 | 0 | 0.0% | int64 |
Survived | 418 | 0 | 0.0% | int64 |
Pclass | 418 | 0 | 0.0% | int64 |
Name | 418 | 0 | 0.0% | object |
Sex | 418 | 0 | 0.0% | object |
Age | 332 | 86 | 20.6% | float64 |
SibSp | 418 | 0 | 0.0% | int64 |
Parch | 418 | 0 | 0.0% | int64 |
Ticket | 418 | 0 | 0.0% | object |
Fare | 417 | 1 | 0.2% | float64 |
Cabin | 91 | 327 | 78.2% | object |
Embarked | 418 | 0 | 0.0% | object |
Creating a copy of the dataset so that the original DataFrame (df) is not modified during the preprocessing steps below.
df_processed = df.copy()
The dataset's statistical distribution:
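The summary statistics shown under Output below were presumably generated with pandas' describe(); note that Sex and Embarked already appear in numeric/one-hot form, so the table reflects the frame after the encoding steps described later in this section. A minimal call would be:
# Summary statistics (count, mean, std, quartiles, min, max) for every numeric column
df_processed.describe()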
Output
Statistic | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
count | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 |
mean | 0.3636 | 2.2656 | 0.6364 | 29.5993 | 0.4474 | 0.3923 | 35.5765 | 0.2440 | 0.1100 | 0.6459 |
std | 0.4816 | 0.8418 | 0.4816 | 12.7038 | 0.8968 | 0.9814 | 55.8501 | 0.4300 | 0.3133 | 0.4788 |
min | 0.0000 | 1.0000 | 0.0000 | 0.1700 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
25% | 0.0000 | 1.0000 | 0.0000 | 23.0000 | 0.0000 | 0.0000 | 7.8958 | 0.0000 | 0.0000 | 0.0000 |
50% | 0.0000 | 3.0000 | 1.0000 | 27.0000 | 0.0000 | 0.0000 | 14.4542 | 0.0000 | 0.0000 | 1.0000 |
75% | 1.0000 | 3.0000 | 1.0000 | 35.7500 | 1.0000 | 0.0000 | 31.4719 | 0.0000 | 0.0000 | 1.0000 |
max | 1.0000 | 3.0000 | 1.0000 | 76.0000 | 8.0000 | 9.0000 | 512.3292 | 1.0000 | 1.0000 | 1.0000 |
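As a quick check on the skew claim made below, we can compare the mean and median of the numeric columns directly and compute their skewness (a small hedged addition; exact values depend on the data):
# Compare mean vs. median and compute skewness for Age and Fare.
# A strong positive skew (long right tail) pulls the mean above the median.
print(df[['Age', 'Fare']].agg(['mean', 'median', 'skew']))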
Impute the missing values with the median. SimpleImputer is used instead of fillna() because it is more robust and the fitted imputer can be reused later on new data (as we do when predicting for new passengers). The median is preferred over the mean because the data is skewed: in the statistical distribution above, the means of "Age" and "Fare" differ noticeably from their medians.
imputer_age = SimpleImputer(strategy='median')
df_processed['Age'] = imputer_age.fit_transform(df_processed[['Age']])
imputer_fare = SimpleImputer(strategy='median')
df_processed['Fare'] = imputer_fare.fit_transform(df_processed[['Fare']])
Dropping the column Cabin. This is because it has a high percentage of missing values (78.2%), and directly using it often requires complex feature engineering.
df_processed = df_processed.drop('Cabin', axis=1)
Dropping irrelevant columns.
PassengerId, Name, Ticket are usually unique identifiers or free-text fields that don't directly contribute to survival prediction for basic models.
df_processed = df_processed.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
Encode categorical columns - 'Sex' and 'Embarked'.
- Sex: A binary categorical feature (male, female). LabelEncoder converts it to numbers so the models can work with it.
- Embarked: Has three categories (S, C, Q). OneHotEncoder converts it into separate binary columns - Embarked_C, Embarked_Q, Embarked_S. This prevents the model from assuming an ordinal relationship between the categories.
# 'Sex': Use LabelEncoder (binary feature)
le = LabelEncoder()
df_processed['Sex'] = le.fit_transform(df_processed['Sex'])  # LabelEncoder sorts labels alphabetically: female=0, male=1
# 'Embarked': Use OneHotEncoder
print("Unique Embarked values before OneHotEncoding:", df_processed['Embarked'].unique())
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
embarked_encoded = ohe.fit_transform(df_processed[['Embarked']])
embarked_df = pd.DataFrame(embarked_encoded, columns=ohe.get_feature_names_out(['Embarked']), index=df_processed.index)
df_processed = pd.concat([df_processed.drop('Embarked', axis=1), embarked_df], axis=1)
print("DataFrame after preprocessing. First 5 rows:")
print(df_processed.head())
print("\n")
print("DataFrame Info after preprocessing:")
df_processed.info()
print("\n")
Output
# | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 34.5 | 0 | 0 | 7.8292 | 0.0 | 1.0 | 0.0 |
1 | 1 | 3 | 0 | 47.0 | 1 | 0 | 7.0000 | 0.0 | 0.0 | 1.0 |
2 | 0 | 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 0.0 | 1.0 | 0.0 |
3 | 0 | 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 0.0 | 0.0 | 1.0 |
4 | 1 | 3 | 0 | 22.0 | 1 | 1 | 12.2875 | 0.0 | 0.0 | 1.0 |
The final columns
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Survived | 418 non-null | int64 |
1 | Pclass | 418 non-null | int64 |
2 | Sex | 418 non-null | int64 |
3 | Age | 418 non-null | float64 |
4 | SibSp | 418 non-null | int64 |
5 | Parch | 418 non-null | int64 |
6 | Fare | 418 non-null | float64 |
7 | Embarked_C | 418 non-null | float64 |
8 | Embarked_Q | 418 non-null | float64 |
9 | Embarked_S | 418 non-null | float64 |
e. Splitting Data
The dataset is divided into training and testing sets.
print("Splitting Data into Training and Test Sets")
X = df_processed.drop('Survived', axis=1)
y = df_processed['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("\n")
Output
The data is split into training and test sets as illustrated:
- X_train shape: (334, 9)
- X_test shape: (84, 9)
- y_train shape: (334,)
- y_test shape: (84,)
Elaboration on the splitting code
- X = df_processed.drop('Survived', axis=1): Creates the feature matrix X by dropping the target variable.
- y = df_processed['Survived']: Creates the target vector y.
- train_test_split(X, y, test_size=0.2, random_state=42, stratify=y): Splits the data into 80% for training and 20% for testing.
- random_state: Ensures reproducibility of the split.
- stratify=y: Important for classification problems to ensure that the proportion of 'Survived' (0s and 1s) is roughly the same in both training and test sets.
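A quick check (a small sketch using the variables defined above) confirms that stratification keeps the class balance roughly the same in both splits:
# Compare the proportion of survivors in the full data, the training set and the test set
print("Overall:", y.value_counts(normalize=True).round(3).to_dict())
print("Train:", y_train.value_counts(normalize=True).round(3).to_dict())
print("Test:", y_test.value_counts(normalize=True).round(3).to_dict())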
f. Scaling Numerical Features (Columns)
StandardScaler standardizes features by removing the mean and scaling to unit variance. This is important for algorithms that are sensitive to the scale of the input features, such as Logistic Regression, because it prevents features with larger values from dominating the learning process. Since we are using Logistic Regression, scaling ensures each feature contributes on a comparable scale.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("X_train_scaled (first 5 rows):")
print(X_train_scaled[:5])
print("\n")
print("X_test_scaled (first 5 rows):")
print(X_test_scaled[:5])
print("\n")
Output
X_train_scaled (first 5 rows)
Row | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|
1 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.50912957 | -0.54736724 | -0.33665016 | 0.70551956 |
2 | 0.85435834 | 0.75370758 | -0.68236136 | -0.48043064 | -0.41184087 | -0.49535877 | -0.54736724 | -0.33665016 | 0.70551956 |
3 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49615131 | -0.54736724 | 2.97044263 | -1.41739515 |
4 | 0.85435834 | 0.75370758 | -1.61753124 | -0.48043064 | 0.67126820 | -0.57539142 | -0.54736724 | -0.33665016 | 0.70551956 |
5 | 0.85435834 | -1.32677450 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49564602 | -0.54736724 | 2.97044263 | -1.41739515 |
X_test_scaled (first 5 rows)
Row | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|
1 | 0.85435834 | -1.32677450 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49615131 | -0.54736724 | 2.97044263 | -1.41739515 |
2 | -1.49424813 | -1.32677450 | -0.37063806 | 0.61116007 | -0.41184087 | 0.32912293 | 1.82692702 | -0.33665016 | -1.41739515 |
3 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49362833 | -0.54736724 | -0.33665016 | 0.70551956 |
4 | -0.31994489 | -1.32677450 | -0.76029218 | -0.48043064 | -0.41184087 | 0.00567507 | -0.54736724 | -0.33665016 | 0.70551956 |
5 | -1.49424813 | 0.75370758 | -0.37063806 | -0.48043064 | -0.41184087 | -0.18034678 | 1.82692702 | -0.33665016 | -1.41739515 |
4. Choosing and Training the Model
Two models were chosen:
Logistic Regression: Chosen for its clear interpretability, providing insights into how each factor linearly influences survival probability, and serving as a strong, efficient baseline.
print("Training Logistic Regression Model...")
log_reg_model = LogisticRegression(random_state=42, solver='liblinear')
log_reg_model.fit(X_train_scaled, y_train)
print("Logistic Regression Model Trained.\n")
log_reg_model.fit(X_train_scaled, y_train): Trains the model using the scaled training data.
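To make the interpretability claim concrete, the fitted coefficients can be paired with the feature names; a minimal, hedged sketch using the variables defined above (the exact values depend on the training run):
# Positive coefficients push a prediction towards 'Survived' (1),
# negative coefficients push it towards 'Not Survived' (0).
coefficients = pd.Series(log_reg_model.coef_[0], index=X.columns).sort_values()
print(coefficients)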
Random Forest Classifier: An ensemble method that often achieves higher accuracy by capturing complex, non-linear relationships and interactions without extensive data preprocessing.
print("Training Random Forest Classifier Model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Random Forest is less sensitive to feature scaling, so we can use unscaled X
rf_model.fit(X_train, y_train)
print("Random Forest Classifier Model Trained.\n")
rf_model.fit(X_train, y_train): Trains the model. Random Forests are less sensitive to feature scaling, so we can use the unscaled X_train here.
Together, they allow for a comprehensive analysis, balancing transparency with predictive power to understand Titanic survival.
5. Evaluating the Models
The predictions on the scaled test set for Logistic Regression are generated (y_pred_log_reg = log_reg_model.predict(X_test_scaled)).
y_pred_log_reg = log_reg_model.predict(X_test_scaled)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {accuracy_log_reg:.4f}")
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log_reg))
print("\n")
Output
The predictions on the unscaled test set for Random Forest are generated (y_pred_rf = rf_model.predict(X_test))
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("\n")
Output
From the output of the two models:
- accuracy_score: Calculates the proportion of correctly classified instances.
- confusion_matrix: Shows the number of true positives, true negatives, false positives, and false negatives.
- classification_report: Provides a detailed summary of precision, recall, f1-score, and support for each class.
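For reference, the headline quantities in the classification report can be derived directly from the confusion matrix; a minimal sketch for the Logistic Regression predictions, using the variables defined above:
# scikit-learn's binary confusion matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_log_reg).ravel()
precision = tp / (tp + fp)  # of passengers predicted to survive, the fraction who actually did
recall = tp / (tp + fn)     # of passengers who survived, the fraction the model caught
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")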
6. Making predictions using sample data
- Using a hypothetical new single passenger and then multiple passengers.
- The new data is passed through the same preprocessing steps:
- Removing irrelevant columns.
- Imputing missing values, if any, with the fitted imputers.
- Encoding categorical features with the fitted encoders.
- Ensuring the column order matches the training data (X_train). This is crucial: create a DataFrame with the same columns as X_train, then fill it.
- Predicting with Logistic Regression and Random Forest.
- Visualizing key relationships
a. Capturing the data
**_Using a Single Passenger_**
new_passenger_features_raw = pd.DataFrame({
'Pclass': [1],
'Name': ['New, Mrs. Example (Test User)'], # Name added for consistency, but will be dropped
'Sex': ['female'],
'Age': [30],
'SibSp': [0],
'Parch': [0],
'Ticket': ['TEST12345'], # Ticket added for consistency, but will be dropped
'Fare': [50],
'Cabin': [None], # Cabin added for consistency, but will be dropped
'Embarked': ['S']
})
b. Dropping irrelevant columns
new_passenger_processed = new_passenger_features_raw.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')
c. Imputing missing values
new_passenger_processed['Age'] = imputer_age.transform(new_passenger_processed[['Age']])
new_passenger_processed['Fare'] = imputer_fare.transform(new_passenger_processed[['Fare']])
d. Encoding categorical features
new_passenger_processed['Sex'] = le.transform(new_passenger_processed['Sex'])
embarked_new = ohe.transform(new_passenger_processed[['Embarked']])
embarked_new_df = pd.DataFrame(embarked_new, columns=ohe.get_feature_names_out(['Embarked']))
new_passenger_processed = pd.concat([new_passenger_processed.drop('Embarked', axis=1), embarked_new_df], axis=1)
e. Ensure column order matches training data for X_train
new_passenger_final = pd.DataFrame(columns=X.columns)
new_passenger_final = pd.concat([new_passenger_final, new_passenger_processed], ignore_index=True)
new_passenger_final = new_passenger_final.fillna(0)
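An equivalent and arguably more concise way to enforce the training column order (and fill any missing one-hot columns with 0) is pandas' reindex; a hedged alternative sketch, assuming the same variables as above:
# Reorder the new passenger's columns to match the training features; any column
# absent from the new data (e.g. an unused Embarked dummy) is created and filled with 0.
new_passenger_final = new_passenger_processed.reindex(columns=X.columns, fill_value=0)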
f. Prediction with the models
Logistic Regression using scaled data
new_passenger_scaled = scaler.transform(new_passenger_final)
log_reg_prediction = log_reg_model.predict(new_passenger_scaled)
print(f"Logistic Regression predicts survival for new passenger: {'Survived' if log_reg_prediction[0] == 1 else 'Not Survived'}")
Output: Logistic Regression predicts survival for new passenger: Survived
Random Forest using unscaled data
rf_prediction = rf_model.predict(new_passenger_final)
print(f"Random Forest predicts survival for new passenger: {'Survived' if rf_prediction[0] == 1 else 'Not Survived'}")
Output: Random Forest predicts survival for new passenger: Survived
Using Multiple Passengers
a. Capture multiple passengers' data
new_passengers_features_raw = pd.DataFrame({
'Pclass': [1, 2, 3],
'Name': [
'New, Mrs. Example (Test User 1)',
'New, Mr. Sample (Test User 2)',
'New, Miss Demo (Test User 3)'
],
'Sex': ['female', 'male', 'female'],
'Age': [30, None, 22], # Example: missing Age for passenger 2
'SibSp': [0, 1, 0],
'Parch': [0, 0, 1],
'Ticket': ['TEST12345', 'TEST12346', 'TEST12347'],
'Fare': [50, None, 15], # Example: missing Fare for passenger 2
'Cabin': [None, None, None],
'Embarked': ['S', 'C', 'Q']
})
b. Dropping irrelevant columns
new_passengers_processed = new_passengers_features_raw.drop(
['PassengerId', 'Name', 'Ticket', 'Cabin'],
axis=1,
errors='ignore'
)
c. Imputing missing values using fitted imputers
new_passengers_processed['Age'] = pd.Series(
imputer_age.transform(new_passengers_processed[['Age']]).flatten(),
index=new_passengers_processed.index
)
new_passengers_processed['Fare'] = pd.Series(
imputer_fare.transform(new_passengers_processed[['Fare']]).flatten(),
index=new_passengers_processed.index
)
d. Encoding categorical features
#Label encode 'Sex'
new_passengers_processed['Sex'] = le.transform(new_passengers_processed['Sex'])
#One-hot encode 'Embarked'
embarked_new = ohe.transform(new_passengers_processed[['Embarked']])
embarked_new_df = pd.DataFrame(
embarked_new,
columns=ohe.get_feature_names_out(['Embarked']),
index=new_passengers_processed.index
)
new_passengers_processed = pd.concat(
[new_passengers_processed.drop('Embarked', axis=1), embarked_new_df],
axis=1
)
e. Align columns with training data
This handles cases where new_passengers_processed might be missing a dummy column (e.g., if the new passengers only produce 'Embarked_S' and not 'Embarked_C').
new_passengers_final = pd.DataFrame(columns=X.columns)
new_passengers_final = pd.concat([new_passengers_final, new_passengers_processed], ignore_index=True)
new_passengers_final = new_passengers_final.fillna(0)
print(new_passengers_final)
Output
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|
1 | 0 | 30.0 | 0 | 0 | 50.0000 | 0.0 | 0.0 | 1.0 |
2 | 1 | 27.0 | 1 | 0 | 14.4542 | 1.0 | 0.0 | 0.0 |
3 | 0 | 22.0 | 0 | 1 | 15.0000 | 0.0 | 1.0 | 0.0 |
f. Scale features and predict survival for multiple passengers
Logistic Regression
print("--- Logistic Regression Predictions ---")
new_passengers_scaled = scaler.transform(new_passengers_final)
log_reg_predictions = log_reg_model.predict(new_passengers_scaled)
log_reg_probabilities = log_reg_model.predict_proba(new_passengers_scaled)[:, 1] # Probability of survival
# Create results_lr using the *index* of new_passengers_final
results_lr = pd.DataFrame({
'Predicted_Survival_LR': ['Survived' if p == 1 else 'Not Survived' for p in log_reg_predictions],
'Survival_Probability_LR': log_reg_probabilities.round(2)
}, index=new_passengers_final.index) # Use the index of the processed features for results_lr
print(results_lr)
print("\n")
Output
# | Predicted_Survival_LR | Survival_Probability_LR |
---|---|---|
0 | Survived | 0.99 |
1 | Not Survived | 0.01 |
2 | Survived | 0.99 |
Random Forest
print("--- Random Forest Predictions ---")
# Random Forest uses unscaled data, so use new_passengers_final directly
rf_predictions = rf_model.predict(new_passengers_final)
rf_probabilities = rf_model.predict_proba(new_passengers_final)[:, 1] # Probability of survival
results_rf = pd.DataFrame({
'Predicted_Survival_RF': ['Survived' if p == 1 else 'Not Survived' for p in rf_predictions],
'Survival_Probability_RF': rf_probabilities.round(4)
})
print(results_rf)
print("\n")
Output
# | Predicted_Survival_RF | Survival_Probability_RF |
---|---|---|
0 | Survived | 0.96 |
1 | Not Survived | 0.01 |
2 | Survived | 0.96 |
Combined Prediction
combined_results = pd.merge(results_lr, results_rf, left_index=True, right_index=True)
print("Combined Predictions (merged on index)")
print(combined_results)
print("\n")
Output
# | Predicted_Survival_LR | Survival_Probability_LR | Predicted_Survival_RF | Survival_Probability_RF |
---|---|---|---|---|
0 | Survived | 0.99 | Survived | 0.96 |
1 | Not Survived | 0.01 | Not Survived | 0.01 |
2 | Survived | 0.99 | Survived | 0.96 |
7. Visualizing Key Relationships
Survival rate by Sex
print("\n Visualizing Key Relationships - By Sex")
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df_processed.replace({'Sex': {0: 'Female', 1: 'Male'}}))
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()
The plot visually shows the markedly higher survival rate for female passengers.
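The same relationship can be summarized numerically with a simple groupby (a small hedged addition; recall that after label encoding, 0 = female and 1 = male):
# The mean of the 0/1 Survived column per Sex value is the survival rate
print(df_processed.groupby('Sex')['Survived'].mean().round(3))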
Survival rate by Pclass
This plot illustrates how survival rates vary across the different passenger classes.
print("\n Visualizing Key Relationships - By PClass")
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df_processed)
plt.title('Survival Rate by Pclass')
plt.ylabel('Survival Rate')
plt.show()
Distribution of Age and its relation to Survival
The histogram compares the age distribution of survivors vs. non-survivors.
plt.figure(figsize=(10, 6))
sns.histplot(df_processed[df_processed['Survived'] == 0]['Age'], color='red', label='Not Survived', kde=True)
sns.histplot(df_processed[df_processed['Survived'] == 1]['Age'], color='green', label='Survived', kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
Feature Importance for Random Forest
Since Random Forest exposes feature importances, visualizing or printing rf_model.feature_importances_ shows which features drive the model's predictions and adds credibility to the analysis.
importances = rf_model.feature_importances_
feature_names = X.columns
forest_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=forest_importances.values, y=forest_importances.index)
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
8. Conclusion and Summary
Conclusion
This project effectively leveraged machine learning to forecast the survival of Titanic passengers. The tragic historical event underscores the crucial role that socioeconomic and demographic factors play in crisis outcomes. The predictions confirmed that sex and passenger class were by far the most important factors influencing survival. This was evident in both the Logistic Regression and Random Forest models, with women and passengers in higher classes having significantly higher probabilities of being saved. Age was also a factor, favoring very young children in particular.
Applying the two models yielded complementary insights. As a strong starting point, Logistic Regression provided a clear, intelligible account of each feature's linear influence on survival probability. The Random Forest Classifier demonstrated the effectiveness of ensemble approaches for identifying subtle patterns, typically achieving higher predictive accuracy thanks to its capacity to capture complicated non-linear relationships and interactions. The visual analyses further supported the results, offering clear graphical evidence of the differences in survival rates.
In conclusion, this project demonstrated machine learning's ability to glean significant, actionable insights from historical data: it shed light on historical trends and produced strong predictive models for the Titanic disaster, while providing a potent tool for understanding intricate real-world phenomena.
Summary
This article covered at length the process of creating and assessing survival prediction models for Titanic passengers. It began with an overview of the problem and proceeded through the fundamental stages of a machine learning workflow, with emphasis on training and evaluating the models and testing them on sample data.
References
- Atieno, L., Robina, F., & Otieno, M. (2025). Survival Likelihood Model. (https://github.com/Loi2008/Data_Science_Assignments/blob/main/Prediction_Model_Titanic_Dataset.ipynb)
- Dawson, E. (1997). The Titanic Disaster: Historical and Social Perspectives. Journal of Maritime History.
- Kaggle. (n.d.). Titanic: Machine Learning from Disaster.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. Wiley Series in Probability and Statistics.
Further Reading
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Raschka, S., & Mirjalili, V. (2019). Python Machine Learning. Packt Publishing.
- Blog: Top 10 Machine Learning Algorithms You Should Know – Towards Data Science