Argha Sarkar

Posted on Jun 11

Titanic Survival Prediction Using Machine Learning: Complete Data Science Project

#ai #webdev #programming #productivity

When I first started learning Machine Learning, the Titanic dataset was one of the most popular projects I found on Kaggle. Honestly, at first it looked like a simple classification problem. But after working on it, I understood that it teaches many important concepts: data cleaning, feature engineering, exploratory data analysis, model building, hyperparameter tuning, and model interpretation.

The goal of this project was simple: predict whether a passenger survived the Titanic disaster, using information such as age, gender, ticket fare, passenger class, and family details.

The first step was understanding the dataset. After loading the data with Pandas, I checked data types, summary statistics, and missing values. Missing data was one major challenge: the Cabin column had many missing values, and the Age column also had several missing records. So data cleaning was necessary before building any model.

After this basic analysis, I performed Exploratory Data Analysis (EDA). This part helped me understand which passenger characteristics were related to survival.

One interesting observation was that female passengers survived at a much higher rate than male passengers. Another was that first-class passengers had higher survival rates than second- and third-class passengers. Age and fare also showed some relationship with survival.
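Observations like these can be checked with a simple group-by. The sketch below is a minimal example, assuming the standard Kaggle column names `Sex`, `Pclass`, and `Survived`; the rows in `toy_df` are made up for illustration and are not the real data.

```python
import pandas as pd

# Toy rows for illustration only; the real numbers come from train.csv.
toy_df = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 3, 1, 3, 1],
    "Survived": [1, 1, 0, 1, 0, 1],
})

# Survived is 0/1, so the mean of each group is its survival rate.
rate_by_sex = toy_df.groupby("Sex")["Survived"].mean()
rate_by_class = toy_df.groupby("Pclass")["Survived"].mean()

print(rate_by_sex)
print(rate_by_class)
```

Running the same two group-bys on the real training frame gives the survival rate per gender and per passenger class.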

For visualization, I used Matplotlib and Seaborn. Count plots, distribution plots, box plots, and heatmaps helped identify patterns in the data.

Then came feature engineering, which I think was one of the most important parts of the project.

Several new features were created:

FamilySize
CabinKnown
TicketGroup
IsAlone
Title (passenger title extracted from Name)

These new features added information that was not directly available in the original dataset.

After feature engineering, I created a preprocessing pipeline. Missing values were handled with imputers, numerical features were scaled, and categorical variables were encoded with One-Hot Encoding. This ensured that the data was ready for the machine learning models.

Multiple machine learning algorithms were considered, such as Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, and XGBoost.
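The post does not say how the candidates were compared. One common approach is to score each model with the same cross-validation split, and the sketch below shows it under a few assumptions: the data is synthetic (from `make_classification`) rather than the Titanic features, the `candidates` dictionary is a hypothetical name, and XGBoost is left out so the example needs only scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed Titanic features and labels.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold accuracy per model, computed on identical folds.
cv_scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Print the models from best to worst mean accuracy.
for name, score in sorted(cv_scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.3f}")
```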

Among these models, XGBoost was selected as the main model because it generally performs very well on structured tabular data.

To improve performance further, Optuna was used for hyperparameter tuning. Different combinations of parameters like learning rate, maximum depth, subsample ratio, and number of estimators were tested automatically. Cross-validation was used during optimization to make model evaluation more reliable.

After training, model performance was measured using accuracy score, classification report, and confusion matrix. These metrics helped evaluate how well the model classified survivors and non-survivors.

The final step was model explainability using SHAP. This was actually my favorite part. SHAP values helped explain which features influenced the predictions the most. Instead of treating the model as a black box, we could understand why it predicted survival for a given passenger.

From what I understand, this project is a complete end-to-end Machine Learning workflow. It starts from raw data and ends with model interpretation. For beginners and intermediate learners, the Titanic project is still one of the best ways to learn practical Machine Learning because it combines data analysis, feature creation, model training, tuning, and explainability in a single project.

Below are some code snippets from the project. This is not the whole code:

Data Loading

import pandas as pd

train_df = pd.read_csv("train.csv")

# info() prints its summary directly and returns None, so no print() is needed
train_df.info()
print(train_df.describe())
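To see how much data is missing in columns such as Cabin and Age, it helps to count nulls per column. This is a minimal sketch on a tiny hand-made frame; the values in `toy_df` are made up for illustration so the example runs without the Kaggle file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df; values are illustrative only.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Count and share of missing values per column, worst first.
missing_count = toy_df.isnull().sum().sort_values(ascending=False)
missing_share = toy_df.isnull().mean().sort_values(ascending=False)

print(missing_count)
print(missing_share)
```

The same two lines applied to `train_df` show which columns need imputation and which are too sparse to use directly.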

Feature Engineering

train_df["FamilySize"] = train_df["SibSp"] + train_df["Parch"] + 1

train_df["IsAlone"] = (
    train_df["FamilySize"] == 1
).astype(int)

train_df["CabinKnown"] = (
    train_df["Cabin"].notnull()
).astype(int)

# Raw string so "\." is a literal dot, not an invalid escape sequence
train_df["Title"] = train_df["Name"].str.extract(
    r' ([A-Za-z]+)\.',
    expand=False
)
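TicketGroup is listed among the new features but is not in the snippets above. The sketch below is one plausible definition, not necessarily the one used in the project: it assumes TicketGroup is the number of passengers who share the same `Ticket` value. The ticket strings in `toy_df` are illustrative only.

```python
import pandas as pd

# Toy tickets for illustration; some are shared by several passengers.
toy_df = pd.DataFrame({
    "Ticket": ["A/5 21171", "PC 17599", "113803", "113803", "PC 17599", "113803"],
})

# Assumed definition: number of rows that share each ticket number.
toy_df["TicketGroup"] = toy_df.groupby("Ticket")["Ticket"].transform("count")

print(toy_df)
```

Using `transform` keeps the result aligned with the original rows, so the new column can be assigned directly.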

Preprocessing Pipeline

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (
    OneHotEncoder,
    StandardScaler
)
from sklearn.impute import SimpleImputer

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore"))
])
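The two pipelines above still need to be attached to columns, which is what the imported `ColumnTransformer` is for. The sketch below shows one way to do that. The column lists `numeric_cols` and `categorical_cols` are hypothetical choices for illustration, and `toy_df` is made-up data, so adjust both to the features actually kept.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column split; not taken from the original project.
numeric_cols = ["Age", "Fare"]
categorical_cols = ["Sex", "Embarked"]

numeric_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

categorical_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore")),
])

# Route each group of columns through its own pipeline.
preprocessor = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

# Toy rows with gaps in both a numeric and a categorical column.
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", np.nan, "S"],
})

X_toy = preprocessor.fit_transform(toy_df)
print(X_toy.shape)
```

On real data, fit the preprocessor on the training split only and call `transform` on the test split, so no test information leaks into the imputers or the scaler.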

XGBoost

from xgboost import XGBClassifier

model = XGBClassifier(
    random_state=42,
    eval_metric="logloss"
)

Hyperparameter Tuning with Optuna

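import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):

    # Search space: Optuna samples one value per parameter for each trial
    params = {
        "n_estimators":
            trial.suggest_int(
                "n_estimators",
                100,
                1000
            ),

        "max_depth":
            trial.suggest_int(
                "max_depth",
                3,
                10
            ),

        "learning_rate":
            trial.suggest_float(
                "learning_rate",
                0.01,
                0.3
            )
    }

    model = XGBClassifier(
        random_state=42,
        eval_metric="logloss",
        **params
    )

    # Assumption: X_train and y_train are the preprocessed training
    # features and labels (not defined in this snippet), and 5 folds
    # are used; the mean accuracy is the value Optuna optimizes
    score = cross_val_score(
        model,
        X_train,
        y_train,
        cv=5,
        scoring="accuracy"
    ).mean()

    return score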

Model Evaluation

from sklearn.metrics import (
    accuracy_score,
    classification_report
)

# Assumes model is already fitted and X_test, y_test are the held-out split
preds = model.predict(X_test)

print(
    accuracy_score(
        y_test,
        preds
    )
)

print(
    classification_report(
        y_test,
        preds
    )
)
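The confusion matrix mentioned earlier is not in the snippet above. Here is a minimal sketch of how it can be computed with scikit-learn; the labels in `y_true` and `y_pred` are made up for illustration and are not the project's results.

```python
from sklearn.metrics import confusion_matrix

# Toy labels for illustration: 1 = survived, 0 = did not survive.
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# For binary labels, ravel() yields the four cells in this order.
tn, fp, fn, tp = cm.ravel()

print(cm)
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

With the real model, pass `y_test` and `preds` instead of the toy lists.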

SHAP Explainability

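import shap

# trained_xgb_model is the fitted XGBoost model on its own,
# without the preprocessing steps
explainer = shap.TreeExplainer(
    trained_xgb_model
)

# X_test_preprocessed_df holds the already-transformed test features
# as a DataFrame, so the plot can label each feature by name
shap_values = explainer.shap_values(
    X_test_preprocessed_df
)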

shap.summary_plot(
    shap_values,
    X_test_preprocessed_df
)

Full code:

https://github.com/argha-sarkar/30-days-ml-code
