1. Introduction
The sinking of the RMS Titanic in 1912 remains one of the most tragic maritime catastrophes in history, claiming approximately 1,500 lives. Beyond the sheer scale of the disaster, scholars have long been fascinated by the complex interplay of factors that influenced who survived.
To build a predictive model, this analysis examines the Titanic dataset, a collection of passenger records. By studying characteristics such as age, sex, and passenger class, we aim to identify the primary factors influencing survival and develop a robust model that can forecast a passenger's chances of surviving the disaster.
Beyond illuminating historical patterns, the insights gathered from this analysis demonstrate how effectively machine learning can derive useful predictions from complex real-world data.
2. Understanding the Dataset
The dataset used (https://github.com/Loi2008/Data_Science_Assignments/blob/main/tested.csv) contains information about Titanic passengers. The goal is to build a model that predicts the likelihood of survival (the Survived column: 0 = No, 1 = Yes) from the other features. The dataset contains the following columns:
- PassengerId: Unique identifier for each passenger.
- Survived: Survival (0 = No, 1 = Yes) - Target variable.
- Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd).
- Name: Passenger's name.
- Sex: Sex (male/female).
- Age: Age in years.
- SibSp: Number of siblings/spouses aboard.
- Parch: Number of parents/children aboard.
- Ticket: Ticket number.
- Fare: Passenger fare.
- Cabin: Cabin number.
- Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton).
3. Building the Model
Goal: Predict each passenger's survival likelihood (the 'Survived' column) from the other passenger features. Below are the steps involved, together with the Python code for each step:
a. Import the Required Libraries
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
b. Suppress Warnings for Cleaner Output
warnings.filterwarnings('ignore')
c. Load the Data
Read the CSV file into a pandas DataFrame.
df = pd.read_csv(r"tested.csv")
df.head()
Output
# | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 892 | 0 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
1 | 893 | 1 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0000 | NaN | S |
2 | 894 | 0 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
3 | 895 | 0 | 3 | Wirz, Mr. Albert | male | 27.0 | 0 | 0 | 315154 | 8.6625 | NaN | S |
4 | 896 | 1 | 3 | Hirvonen, Mrs. Alexander (Helga E Lindqvist) | female | 22.0 | 1 | 1 | 3101298 | 12.2875 | NaN | S |
d. Exploratory Data Analysis (EDA)
Understand the data (data types, missing values and distributions).
df.info()
Output
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | PassengerId | 418 non-null | int64 |
1 | Survived | 418 non-null | int64 |
2 | Pclass | 418 non-null | int64 |
3 | Name | 418 non-null | object |
4 | Sex | 418 non-null | object |
5 | Age | 332 non-null | float64 |
6 | SibSp | 418 non-null | int64 |
7 | Parch | 418 non-null | int64 |
8 | Ticket | 418 non-null | object |
9 | Fare | 417 non-null | float64 |
10 | Cabin | 91 non-null | object |
11 | Embarked | 418 non-null | object |
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB
Handling missing values
Number of missing values for each column
missing = df.isnull().sum()
non_null = df.notnull().sum()
total = len(df)
# Build summary DataFrame
summary = pd.DataFrame({
"Non-Null Count": non_null,
"Missing Values": missing,
"Missing %": (missing / total * 100).round(1),
"Dtype": df.dtypes
})
# Rename index to 'Column'
summary.index.name = "Column"
# Display
summary
Output
Column | Non-Null Count | Missing Values | Missing % | Dtype |
---|---|---|---|---|
PassengerId | 418 | 0 | 0.0% | int64 |
Survived | 418 | 0 | 0.0% | int64 |
Pclass | 418 | 0 | 0.0% | int64 |
Name | 418 | 0 | 0.0% | object |
Sex | 418 | 0 | 0.0% | object |
Age | 332 | 86 | 20.6% | float64 |
SibSp | 418 | 0 | 0.0% | int64 |
Parch | 418 | 0 | 0.0% | int64 |
Ticket | 418 | 0 | 0.0% | object |
Fare | 417 | 1 | 0.2% | float64 |
Cabin | 91 | 327 | 78.2% | object |
Embarked | 418 | 0 | 0.0% | object |
Creating a copy of the dataset so that the original DataFrame (df) is not modified during the preprocessing steps below.
df_processed = df.copy()
The dataset's statistical distribution:
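The summary statistics shown under Output below were presumably generated with pandas' describe(); note that Sex and Embarked already appear in numeric/one-hot form, so the table reflects the frame after the encoding steps described later in this section. A minimal call would be:
# Summary statistics (count, mean, std, quartiles, min, max) for every numeric column
df_processed.describe()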
Output
Statistic | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
count | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 | 418.0000 |
mean | 0.3636 | 2.2656 | 0.6364 | 29.5993 | 0.4474 | 0.3923 | 35.5765 | 0.2440 | 0.1100 | 0.6459 |
std | 0.4816 | 0.8418 | 0.4816 | 12.7038 | 0.8968 | 0.9814 | 55.8501 | 0.4300 | 0.3133 | 0.4788 |
min | 0.0000 | 1.0000 | 0.0000 | 0.1700 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 |
25% | 0.0000 | 1.0000 | 0.0000 | 23.0000 | 0.0000 | 0.0000 | 7.8958 | 0.0000 | 0.0000 | 0.0000 |
50% | 0.0000 | 3.0000 | 1.0000 | 27.0000 | 0.0000 | 0.0000 | 14.4542 | 0.0000 | 0.0000 | 1.0000 |
75% | 1.0000 | 3.0000 | 1.0000 | 35.7500 | 1.0000 | 0.0000 | 31.4719 | 0.0000 | 0.0000 | 1.0000 |
max | 1.0000 | 3.0000 | 1.0000 | 76.0000 | 8.0000 | 9.0000 | 512.3292 | 1.0000 | 1.0000 | 1.0000 |
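As a quick check on the skew claim made below, we can compare the mean and median of the numeric columns directly and compute their skewness (a small hedged addition; exact values depend on the data):
# Compare mean vs. median and compute skewness for Age and Fare.
# A strong positive skew (long right tail) pulls the mean above the median.
print(df[['Age', 'Fare']].agg(['mean', 'median', 'skew']))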
Impute the missing values with the median. SimpleImputer is used instead of fillna() because it is more robust and the fitted imputer can be reused later on new data (as we do when predicting for new passengers). The median is preferred over the mean because the data is skewed: in the statistical distribution above, the means of "Age" and "Fare" differ noticeably from their medians.
imputer_age = SimpleImputer(strategy='median')
df_processed['Age'] = imputer_age.fit_transform(df_processed[['Age']])
imputer_fare = SimpleImputer(strategy='median')
df_processed['Fare'] = imputer_fare.fit_transform(df_processed[['Fare']])
Dropping the column Cabin. This is because it has a high percentage of missing values (78.2%), and directly using it often requires complex feature engineering.
df_processed = df_processed.drop('Cabin', axis=1)
Dropping irrelevant columns.
PassengerId, Name, Ticket are usually unique identifiers or free-text fields that don't directly contribute to survival prediction for basic models.
df_processed = df_processed.drop(['PassengerId', 'Name', 'Ticket'], axis=1)
Encode categorical columns - 'Sex' and 'Embarked'.
- Sex: A binary categorical feature (male, female). LabelEncoder converts it to numbers so the models can work with it.
- Embarked: Has three categories (S, C, Q). OneHotEncoder converts it into separate binary columns - Embarked_C, Embarked_Q, Embarked_S. This prevents the model from assuming an ordinal relationship between the categories.
# 'Sex': Use LabelEncoder (binary feature)
le = LabelEncoder()
df_processed['Sex'] = le.fit_transform(df_processed['Sex'])  # LabelEncoder sorts labels alphabetically: female=0, male=1
# 'Embarked': Use OneHotEncoder
print("Unique Embarked values before OneHotEncoding:", df_processed['Embarked'].unique())
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
embarked_encoded = ohe.fit_transform(df_processed[['Embarked']])
embarked_df = pd.DataFrame(embarked_encoded, columns=ohe.get_feature_names_out(['Embarked']), index=df_processed.index)
df_processed = pd.concat([df_processed.drop('Embarked', axis=1), embarked_df], axis=1)
print("DataFrame after preprocessing. First 5 rows:")
print(df_processed.head())
print("\n")
print("DataFrame Info after preprocessing:")
df_processed.info()
print("\n")
Output
# | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 3 | 1 | 34.5 | 0 | 0 | 7.8292 | 0.0 | 1.0 | 0.0 |
1 | 1 | 3 | 0 | 47.0 | 1 | 0 | 7.0000 | 0.0 | 0.0 | 1.0 |
2 | 0 | 2 | 1 | 62.0 | 0 | 0 | 9.6875 | 0.0 | 1.0 | 0.0 |
3 | 0 | 3 | 1 | 27.0 | 0 | 0 | 8.6625 | 0.0 | 0.0 | 1.0 |
4 | 1 | 3 | 0 | 22.0 | 1 | 1 | 12.2875 | 0.0 | 0.0 | 1.0 |
The final columns
# | Column | Non-Null Count | Dtype |
---|---|---|---|
0 | Survived | 418 non-null | int64 |
1 | Pclass | 418 non-null | int64 |
2 | Sex | 418 non-null | int64 |
3 | Age | 418 non-null | float64 |
4 | SibSp | 418 non-null | int64 |
5 | Parch | 418 non-null | int64 |
6 | Fare | 418 non-null | float64 |
7 | Embarked_C | 418 non-null | float64 |
8 | Embarked_Q | 418 non-null | float64 |
9 | Embarked_S | 418 non-null | float64 |
e. Splitting Data
The dataset is divided into training and testing sets.
print("Splitting Data into Training and Test Sets")
X = df_processed.drop('Survived', axis=1)
y = df_processed['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"X_train shape: {X_train.shape}")
print(f"X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}")
print(f"y_test shape: {y_test.shape}")
print("\n")
Output
The data is split into training and test sets as illustrated:
- X_train shape: (334, 9)
- X_test shape: (84, 9)
- y_train shape: (334,)
- y_test shape: (84,)
Elaboration on the splitting code
- X = df_processed.drop('Survived', axis=1): Creates the feature matrix X by dropping the target variable.
- y = df_processed['Survived']: Creates the target vector y.
- train_test_split(X, y, test_size=0.2, random_state=42, stratify=y): Splits the data into 80% for training and 20% for testing.
- random_state: Ensures reproducibility of the split.
- stratify=y: Important for classification problems to ensure that the proportion of 'Survived' (0s and 1s) is roughly the same in both training and test sets.
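A quick check (a small sketch using the variables defined above) confirms that stratification keeps the class balance roughly the same in both splits:
# Compare the proportion of survivors in the full data, the training set and the test set
print("Overall:", y.value_counts(normalize=True).round(3).to_dict())
print("Train:", y_train.value_counts(normalize=True).round(3).to_dict())
print("Test:", y_test.value_counts(normalize=True).round(3).to_dict())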
f. Scaling Numerical Features (Columns)
StandardScaler standardizes features by removing the mean and scaling to unit variance. This is important for algorithms that are sensitive to the scale of the input features, such as Logistic Regression, because it prevents features with larger values from dominating the learning process. Since we are using Logistic Regression, scaling ensures each feature contributes on a comparable scale.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
print("X_train_scaled (first 5 rows):")
print(X_train_scaled[:5])
print("\n")
print("X_test_scaled (first 5 rows):")
print(X_test_scaled[:5])
print("\n")
Output
X_train_scaled (first 5 rows)
Row | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|
1 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.50912957 | -0.54736724 | -0.33665016 | 0.70551956 |
2 | 0.85435834 | 0.75370758 | -0.68236136 | -0.48043064 | -0.41184087 | -0.49535877 | -0.54736724 | -0.33665016 | 0.70551956 |
3 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49615131 | -0.54736724 | 2.97044263 | -1.41739515 |
4 | 0.85435834 | 0.75370758 | -1.61753124 | -0.48043064 | 0.67126820 | -0.57539142 | -0.54736724 | -0.33665016 | 0.70551956 |
5 | 0.85435834 | -1.32677450 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49564602 | -0.54736724 | 2.97044263 | -1.41739515 |
X_test_scaled (first 5 rows)
Row | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|---|
1 | 0.85435834 | -1.32677450 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49615131 | -0.54736724 | 2.97044263 | -1.41739515 |
2 | -1.49424813 | -1.32677450 | -0.37063806 | 0.61116007 | -0.41184087 | 0.32912293 | 1.82692702 | -0.33665016 | -1.41739515 |
3 | 0.85435834 | 0.75370758 | -0.21477642 | -0.48043064 | -0.41184087 | -0.49362833 | -0.54736724 | -0.33665016 | 0.70551956 |
4 | -0.31994489 | -1.32677450 | -0.76029218 | -0.48043064 | -0.41184087 | 0.00567507 | -0.54736724 | -0.33665016 | 0.70551956 |
5 | -1.49424813 | 0.75370758 | -0.37063806 | -0.48043064 | -0.41184087 | -0.18034678 | 1.82692702 | -0.33665016 | -1.41739515 |
4. Choosing and Training the Model
Two models were chosen:
Logistic Regression: Chosen for its clear interpretability, providing insights into how each factor linearly influences survival probability, and serving as a strong, efficient baseline.
print("Training Logistic Regression Model...")
log_reg_model = LogisticRegression(random_state=42, solver='liblinear')
log_reg_model.fit(X_train_scaled, y_train)
print("Logistic Regression Model Trained.\n")
log_reg_model.fit(X_train_scaled, y_train): Trains the model using the scaled training data.
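To make the interpretability claim concrete, the fitted coefficients can be paired with the feature names; a minimal, hedged sketch using the variables defined above (the exact values depend on the training run):
# Positive coefficients push a prediction towards 'Survived' (1),
# negative coefficients push it towards 'Not Survived' (0).
coefficients = pd.Series(log_reg_model.coef_[0], index=X.columns).sort_values()
print(coefficients)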
Random Forest Classifier: An ensemble method that often achieves higher accuracy by capturing complex, non-linear relationships and interactions without extensive data preprocessing.
print("Training Random Forest Classifier Model...")
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Random Forest is less sensitive to feature scaling, so we can use unscaled X
rf_model.fit(X_train, y_train)
print("Random Forest Classifier Model Trained.\n")
rf_model.fit(X_train, y_train): Trains the model. Random Forests are less sensitive to feature scaling, so we can use the unscaled X_train here.
Together, they allow for a comprehensive analysis, balancing transparency with predictive power to understand Titanic survival.
5. Evaluating the Models
The predictions on the scaled test set for Logistic Regression are generated (y_pred_log_reg = log_reg_model.predict(X_test_scaled)).
y_pred_log_reg = log_reg_model.predict(X_test_scaled)
accuracy_log_reg = accuracy_score(y_test, y_pred_log_reg)
print(f"Logistic Regression Accuracy: {accuracy_log_reg:.4f}")
print("Logistic Regression Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_log_reg))
print("Logistic Regression Classification Report:")
print(classification_report(y_test, y_pred_log_reg))
print("\n")
Output
The predictions on the unscaled test set for Random Forest are generated (y_pred_rf = rf_model.predict(X_test))
y_pred_rf = rf_model.predict(X_test)
accuracy_rf = accuracy_score(y_test, y_pred_rf)
print(f"Random Forest Accuracy: {accuracy_rf:.4f}")
print("Random Forest Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_rf))
print("Random Forest Classification Report:")
print(classification_report(y_test, y_pred_rf))
print("\n")
Output
From the output of the two models:
- accuracy_score: Calculates the proportion of correctly classified instances.
- confusion_matrix: Shows the number of true positives, true negatives, false positives, and false negatives.
- classification_report: Provides a detailed summary of precision, recall, f1-score, and support for each class.
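For reference, the headline quantities in the classification report can be derived directly from the confusion matrix; a minimal sketch for the Logistic Regression predictions, using the variables defined above:
# scikit-learn's binary confusion matrix is laid out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred_log_reg).ravel()
precision = tp / (tp + fp)  # of passengers predicted to survive, the fraction who actually did
recall = tp / (tp + fn)     # of passengers who survived, the fraction the model caught
f1 = 2 * precision * recall / (precision + recall)
print(f"Precision: {precision:.4f}, Recall: {recall:.4f}, F1: {f1:.4f}")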
6. Making predictions using sample data
- Using a hypothetical new single passenger and then multiple passengers.
- The new data is passed through the same preprocessing steps:
- Removing irrelevant columns.
- Imputing missing values, if any, with the fitted imputers.
- Encoding categorical features with the fitted encoders.
- Ensuring the column order matches the training data (X_train). This is crucial: create a DataFrame with the same columns as X_train, then fill it.
- Predicting with Logistic Regression and Random Forest.
- Visualizing key relationships
a. Capturing the data
**_Using a Single Passenger_**
new_passenger_features_raw = pd.DataFrame({
'Pclass': [1],
'Name': ['New, Mrs. Example (Test User)'], # Name added for consistency, but will be dropped
'Sex': ['female'],
'Age': [30],
'SibSp': [0],
'Parch': [0],
'Ticket': ['TEST12345'], # Ticket added for consistency, but will be dropped
'Fare': [50],
'Cabin': [None], # Cabin added for consistency, but will be dropped
'Embarked': ['S']
})
b. Dropping irrelevant columns
new_passenger_processed = new_passenger_features_raw.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, errors='ignore')
c. Imputing missing values
new_passenger_processed['Age'] = imputer_age.transform(new_passenger_processed[['Age']])
new_passenger_processed['Fare'] = imputer_fare.transform(new_passenger_processed[['Fare']])
d. Encoding categorical features
new_passenger_processed['Sex'] = le.transform(new_passenger_processed['Sex'])
embarked_new = ohe.transform(new_passenger_processed[['Embarked']])
embarked_new_df = pd.DataFrame(embarked_new, columns=ohe.get_feature_names_out(['Embarked']))
new_passenger_processed = pd.concat([new_passenger_processed.drop('Embarked', axis=1), embarked_new_df], axis=1)
e. Ensure column order matches training data for X_train
new_passenger_final = pd.DataFrame(columns=X.columns)
new_passenger_final = pd.concat([new_passenger_final, new_passenger_processed], ignore_index=True)
new_passenger_final = new_passenger_final.fillna(0)
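An equivalent and arguably more concise way to enforce the training column order (and fill any missing one-hot columns with 0) is pandas' reindex; a hedged alternative sketch, assuming the same variables as above:
# Reorder the new passenger's columns to match the training features; any column
# absent from the new data (e.g. an unused Embarked dummy) is created and filled with 0.
new_passenger_final = new_passenger_processed.reindex(columns=X.columns, fill_value=0)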
f. Prediction with the models
Logistic Regression using scaled data
new_passenger_scaled = scaler.transform(new_passenger_final)
log_reg_prediction = log_reg_model.predict(new_passenger_scaled)
print(f"Logistic Regression predicts survival for new passenger: {'Survived' if log_reg_prediction[0] == 1 else 'Not Survived'}")
Output: Logistic Regression predicts survival for new passenger: Survived
Random Forest using unscaled data
rf_prediction = rf_model.predict(new_passenger_final)
print(f"Random Forest predicts survival for new passenger: {'Survived' if rf_prediction[0] == 1 else 'Not Survived'}")
Output: Random Forest predicts survival for new passenger: Survived
Using Multiple Passengers
a. Capture multiple passengers' data
new_passengers_features_raw = pd.DataFrame({
'Pclass': [1, 2, 3],
'Name': [
'New, Mrs. Example (Test User 1)',
'New, Mr. Sample (Test User 2)',
'New, Miss Demo (Test User 3)'
],
'Sex': ['female', 'male', 'female'],
'Age': [30, None, 22], # Example: missing Age for passenger 2
'SibSp': [0, 1, 0],
'Parch': [0, 0, 1],
'Ticket': ['TEST12345', 'TEST12346', 'TEST12347'],
'Fare': [50, None, 15], # Example: missing Fare for passenger 2
'Cabin': [None, None, None],
'Embarked': ['S', 'C', 'Q']
})
b. Dropping irrelevant columns
new_passengers_processed = new_passengers_features_raw.drop(
['PassengerId', 'Name', 'Ticket', 'Cabin'],
axis=1,
errors='ignore'
)
c. Imputing missing values using fitted imputers
new_passengers_processed['Age'] = pd.Series(
imputer_age.transform(new_passengers_processed[['Age']]).flatten(),
index=new_passengers_processed.index
)
new_passengers_processed['Fare'] = pd.Series(
imputer_fare.transform(new_passengers_processed[['Fare']]).flatten(),
index=new_passengers_processed.index
)
d. Encoding categorical features
#Label encode 'Sex'
new_passengers_processed['Sex'] = le.transform(new_passengers_processed['Sex'])
#One-hot encode 'Embarked'
embarked_new = ohe.transform(new_passengers_processed[['Embarked']])
embarked_new_df = pd.DataFrame(
embarked_new,
columns=ohe.get_feature_names_out(['Embarked']),
index=new_passengers_processed.index
)
new_passengers_processed = pd.concat(
[new_passengers_processed.drop('Embarked', axis=1), embarked_new_df],
axis=1
)
e. Align columns with training data
This handles cases where new_passengers_processed might be missing a dummy column (e.g., if the new passengers only produce 'Embarked_S' and not 'Embarked_C').
new_passengers_final = pd.DataFrame(columns=X.columns)
new_passengers_final = pd.concat([new_passengers_final, new_passengers_processed], ignore_index=True)
new_passengers_final = new_passengers_final.fillna(0)
print(new_passengers_final)
Output
Pclass | Sex | Age | SibSp | Parch | Fare | Embarked_C | Embarked_Q | Embarked_S |
---|---|---|---|---|---|---|---|---|
1 | 0 | 30.0 | 0 | 0 | 50.0000 | 0.0 | 0.0 | 1.0 |
2 | 1 | 27.0 | 1 | 0 | 14.4542 | 1.0 | 0.0 | 0.0 |
3 | 0 | 22.0 | 0 | 1 | 15.0000 | 0.0 | 1.0 | 0.0 |
f. Scale features and predict survival for multiple passengers
Logistic Regression
print("--- Logistic Regression Predictions ---")
new_passengers_scaled = scaler.transform(new_passengers_final)
log_reg_predictions = log_reg_model.predict(new_passengers_scaled)
log_reg_probabilities = log_reg_model.predict_proba(new_passengers_scaled)[:, 1] # Probability of survival
# Create results_lr using the *index* of new_passengers_final
results_lr = pd.DataFrame({
'Predicted_Survival_LR': ['Survived' if p == 1 else 'Not Survived' for p in log_reg_predictions],
'Survival_Probability_LR': log_reg_probabilities.round(2)
}, index=new_passengers_final.index) # Use the index of the processed features for results_lr
print(results_lr)
print("\n")
Output
# | Predicted_Survival_LR | Survival_Probability_LR |
---|---|---|
0 | Survived | 0.99 |
1 | Not Survived | 0.01 |
2 | Survived | 0.99 |
Random Forest
print("--- Random Forest Predictions ---")
# Random Forest uses unscaled data, so use new_passengers_final directly
rf_predictions = rf_model.predict(new_passengers_final)
rf_probabilities = rf_model.predict_proba(new_passengers_final)[:, 1] # Probability of survival
results_rf = pd.DataFrame({
'Predicted_Survival_RF': ['Survived' if p == 1 else 'Not Survived' for p in rf_predictions],
'Survival_Probability_RF': rf_probabilities.round(4)
})
print(results_rf)
print("\n")
Output
# | Predicted_Survival_RF | Survival_Probability_RF |
---|---|---|
0 | Survived | 0.96 |
1 | Not Survived | 0.01 |
2 | Survived | 0.96 |
Combined Prediction
combined_results = pd.merge(results_lr, results_rf, left_index=True, right_index=True)
print("Combined Predictions (merged on index)")
print(combined_results)
print("\n")
Output
# | Predicted_Survival_LR | Survival_Probability_LR | Predicted_Survival_RF | Survival_Probability_RF |
---|---|---|---|---|
0 | Survived | 0.99 | Survived | 0.96 |
1 | Not Survived | 0.01 | Not Survived | 0.01 |
2 | Survived | 0.99 | Survived | 0.96 |
7. Visualizing Key Relationships
Survival rate by Sex
print("\n Visualizing Key Relationships - By Sex")
plt.figure(figsize=(6, 4))
sns.barplot(x='Sex', y='Survived', data=df_processed.replace({'Sex': {0: 'Female', 1: 'Male'}}))
plt.title('Survival Rate by Sex')
plt.ylabel('Survival Rate')
plt.show()
The plot visually shows the markedly higher survival rate for female passengers.
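The same relationship can be summarized numerically with a simple groupby (a small hedged addition; recall that after label encoding, 0 = female and 1 = male):
# The mean of the 0/1 Survived column per Sex value is the survival rate
print(df_processed.groupby('Sex')['Survived'].mean().round(3))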
Survival rate by Pclass
This plot illustrates how survival rates vary across the different passenger classes.
print("\n Visualizing Key Relationships - By PClass")
plt.figure(figsize=(6, 4))
sns.barplot(x='Pclass', y='Survived', data=df_processed)
plt.title('Survival Rate by Pclass')
plt.ylabel('Survival Rate')
plt.show()
Distribution of Age and its relation to Survival
The histogram compares the age distribution of survivors vs. non-survivors.
plt.figure(figsize=(10, 6))
sns.histplot(df_processed[df_processed['Survived'] == 0]['Age'], color='red', label='Not Survived', kde=True)
sns.histplot(df_processed[df_processed['Survived'] == 1]['Age'], color='green', label='Survived', kde=True)
plt.title('Age Distribution by Survival')
plt.xlabel('Age')
plt.ylabel('Number of Passengers')
plt.legend()
plt.show()
Feature Importance for Random Forest
Since Random Forest exposes feature importances, visualizing or printing rf_model.feature_importances_ shows which features drive the model's predictions and adds credibility to the analysis.
importances = rf_model.feature_importances_
feature_names = X.columns
forest_importances = pd.Series(importances, index=feature_names).sort_values(ascending=False)
plt.figure(figsize=(10, 6))
sns.barplot(x=forest_importances.values, y=forest_importances.index)
plt.title('Random Forest Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Feature')
plt.show()
8. Conclusion and Summary
Conclusion
This project effectively leveraged machine learning to forecast the survival of Titanic passengers. The tragic historical event underscores the crucial role that socioeconomic and demographic factors play in crisis outcomes. The predictions confirmed that sex and passenger class were by far the most important factors influencing survival. This was evident in both the Logistic Regression and Random Forest models, with women and passengers in higher classes having significantly higher probabilities of being saved. Age was also a factor, favoring very young children in particular.
Applying the two models yielded complementary insights. As a strong starting point, Logistic Regression provided a clear, intelligible account of each feature's linear influence on survival probability. The Random Forest Classifier demonstrated the effectiveness of ensemble approaches for identifying subtle patterns, typically achieving higher predictive accuracy thanks to its capacity to capture complicated non-linear relationships and interactions. The visual analyses further supported the results, offering clear graphical evidence of the differences in survival rates.
In conclusion, this project demonstrated machine learning's ability to glean significant, actionable insights from historical data: it shed light on historical trends and produced strong predictive models for the Titanic disaster, while providing a potent tool for understanding intricate real-world phenomena.
Summary
This article covered at length the process of creating and assessing survival prediction models for Titanic passengers. It began with an overview of the problem and proceeded through the fundamental stages of a machine learning workflow, with emphasis on training and evaluating the models and testing them on sample data.
References
- Atieno, L., Robina, F., & Otieno, M. (2025). Survival Likelihood Model. (https://github.com/Loi2008/Data_Science_Assignments/blob/main/Prediction_Model_Titanic_Dataset.ipynb)
- Dawson, E. (1997). The Titanic Disaster: Historical and Social Perspectives. Journal of Maritime History.
- Kaggle. (n.d.). Titanic: Machine Learning from Disaster.
- Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
- Hosmer, D. W., & Lemeshow, S. (2000). Applied Logistic Regression. Wiley Series in Probability and Statistics.
Further Reading
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O’Reilly Media.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
- Raschka, S., & Mirjalili, V. (2019). Python Machine Learning. Packt Publishing.
- Blog: Top 10 Machine Learning Algorithms You Should Know – Towards Data Science