Kaggle Titanic Practice 3: Cabin Feature Engineering (Is It Really Effective?)

#kaggle #machinelearning #python #lightgbm

Abstract

I extracted the deck (floor) from the first letter of the Cabin column and added it as a new feature.
This post summarizes how that feature performed under 5-fold cross-validation (CV) on several models, including LightGBM.

Introduction

This post continues my work on the Kaggle Titanic competition.
In the Titanic dataset, the Cabin (cabin number) column has more than 70% missing values, so I had previously excluded it.
However, a passenger's chance of survival may have depended on the cabin deck, since both the physical distance to the lifeboat stations and how quickly a cabin flooded as the ship sank could differ by deck. I therefore thought the deck could be a meaningful feature.
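As a quick illustration of how the missing rate mentioned above can be checked, here is a minimal sketch. The `sample` frame is made up for the example and stands in for the real train.csv, so the printed rate is not the Titanic figure.

```python
import pandas as pd

# Tiny made-up sample standing in for train.csv; only the Cabin column matters here.
sample = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4, 5],
    "Cabin": [None, "C85", None, "C123", None],
})

# isna() marks missing cells; the mean of the boolean mask is the missing share.
missing_rate = sample["Cabin"].isna().mean()
print(f"Cabin missing rate: {missing_rate:.1%}")
```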

Implementation

I implemented a preprocessing function to extract the first letter of Cabin as the Deck feature, filling missing values with 'U' (Unknown).
The code is as follows:

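from sklearn.preprocessing import LabelEncoder

def feature_engineering(df):
    df = df.copy()
    # Existing preprocessing (e.g. extracting Title, filling Age) goes here;
    # the 'Title' column encoded below is assumed to be created in that step.
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # Extract the first letter of Cabin as Deck; missing cabins become 'U' (Unknown)
    df['Deck'] = df['Cabin'].fillna('U').str[0]

    # Label-encode the categorical columns (the encoder is refit for each column)
    le = LabelEncoder()
    for col in ['Sex', 'Embarked', 'Title', 'Deck']:
        df[col] = le.fit_transform(df[col])
    return df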

After preprocessing, I encoded the categorical variables using LabelEncoder and added the Deck feature to the model inputs.
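One thing to watch with this approach: if the function is called separately on the training and test data, each call fits its own encoder, so the same deck letter is not guaranteed to get the same integer in both. The sketch below shows one possible way around that, using a shared mapping built from both frames. The sample rows, `add_deck`, and `deck_to_code` are hypothetical names for illustration, not part of the original pipeline.

```python
import pandas as pd

# Made-up rows for illustration; real data would come from train.csv / test.csv.
train = pd.DataFrame({"Cabin": ["C85", None, "E46", None, "B28"]})
test = pd.DataFrame({"Cabin": [None, "B45", "F G73"]})

def add_deck(df):
    """Return a copy with a Deck column: first letter of Cabin, 'U' if missing."""
    out = df.copy()
    out["Deck"] = out["Cabin"].fillna("U").str[0]
    return out

train, test = add_deck(train), add_deck(test)

# Build one shared category list so a given letter maps to the same integer
# in both frames (fitting an encoder separately on each frame would not
# guarantee this).
decks = sorted(set(train["Deck"]) | set(test["Deck"]))
deck_to_code = {deck: code for code, deck in enumerate(decks)}
train["DeckCode"] = train["Deck"].map(deck_to_code)
test["DeckCode"] = test["Deck"].map(deck_to_code)

print(deck_to_code)
print(train[["Cabin", "Deck", "DeckCode"]])
```

Note that a cabin string listing several cabins, such as "F G73", is reduced to its first letter only, which matches the first-letter rule used in the post.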

Results

Evaluation results using 5-Fold CV:

Model	Before (Baseline)	After (Cabin Added)	Difference
Logistic Regression	0.8014 +/- 0.0133	0.7991 +/- 0.0199	-0.0023
Random Forest	0.8227 +/- 0.0077	0.8148 +/- 0.0149	-0.0079
XGBoost	0.8181 +/- 0.0220	0.8227 +/- 0.0159	+0.0046
LightGBM	0.8350 +/- 0.0178	0.8361 +/- 0.0278	+0.0011
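
For reference, a before/after comparison like the one in the table can be set up roughly as follows. This is a minimal sketch, not the exact evaluation code used for the table: it runs on synthetic data, uses a Random Forest only, and assumes stratified folds with accuracy as the metric. The helper name `cv_accuracy` and the column lists are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed training data (not the Titanic set).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Pclass": rng.integers(1, 4, n),
    "Sex": rng.integers(0, 2, n),
    "Fare": rng.uniform(5, 100, n),
    "Deck": rng.integers(0, 8, n),
})
# Label depends mostly on Sex, with some noise mixed in.
y = ((X["Sex"] + rng.normal(0, 0.5, n)) > 0.5).astype(int)

def cv_accuracy(features):
    """Mean and standard deviation of accuracy over 5 stratified folds."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X[features], y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()

baseline = ["Pclass", "Sex", "Fare"]
for name, cols in [("Before", baseline), ("After", baseline + ["Deck"])]:
    mean, std = cv_accuracy(cols)
    print(f"{name}: {mean:.4f} +/- {std:.4f}")
```

Fixing the fold split with `random_state` keeps the before and after runs on identical folds, so the two scores differ only because of the added column.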

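The validation score of LightGBM improved slightly from 0.8350 to 0.8361.
XGBoost, another gradient-boosted tree model, also improved (+0.0046), so the feature may have some effect on that family of models. On the other hand, Logistic Regression and Random Forest got slightly worse, and every difference is smaller than the fold-to-fold standard deviation, so the CV results alone do not show a clear benefit.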
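
One rough way to judge whether these differences are meaningful is to compare each one against the fold-to-fold standard deviation reported in the results table. The sketch below does that arithmetic with the table's numbers. Taking the larger of the two standard deviations as the noise scale is a simplification I am assuming here, not a formal significance test.

```python
# (baseline mean, baseline std, after mean, after std) copied from the results table.
results = {
    "Logistic Regression": (0.8014, 0.0133, 0.7991, 0.0199),
    "Random Forest": (0.8227, 0.0077, 0.8148, 0.0149),
    "XGBoost": (0.8181, 0.0220, 0.8227, 0.0159),
    "LightGBM": (0.8350, 0.0178, 0.8361, 0.0278),
}

ratios = {}
for model, (base, base_std, after, after_std) in results.items():
    diff = after - base
    # Use the larger of the two fold standard deviations as a rough noise scale.
    noise = max(base_std, after_std)
    ratios[model] = abs(diff) / noise
    print(f"{model:20s} diff={diff:+.4f}  noise~{noise:.4f}  |diff|/noise={ratios[model]:.2f}")
```

A ratio below 1 means the change is smaller than the typical variation between folds. That holds for all four models here.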

Submitting to Kaggle

When I submitted the predictions to the Titanic competition, the public score dropped slightly from 0.77272 to 0.77033. Simply adding this feature may have introduced noise and led to overfitting.
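
To put the size of that drop in perspective, the scores can be converted back into a count of correct predictions. This sketch assumes the public score is plain accuracy over all 418 rows of test.csv. That is my assumption for the arithmetic and is not confirmed here.

```python
# Assumption: the public score is plain accuracy over all 418 rows of test.csv.
N_TEST = 418

before, after = 0.77272, 0.77033

# Convert each accuracy back into a whole number of correct predictions.
correct_before = round(before * N_TEST)
correct_after = round(after * N_TEST)

print(correct_before, correct_after, correct_before - correct_after)
```

Under that assumption the two submissions differ by a single passenger, so the drop on its own is weak evidence either way.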