kito2718

Kaggle Titanic: Cabin Feature Engineering (Is It Really Effective?)

Kaggle Practice 1: Setting Up a Local Environment for the Kaggle Titanic Competition
Kaggle Practice 2: First Submission
Kaggle Practice 3: Feature Engineering for Cabin
Kaggle Practice 4: Feature Engineering (Imputing Age with Random Forest)

https://www.kaggle.com/c/titanic

Available on GitHub

Abstract

  • Extracted the deck (floor) information from the first letter of the Cabin feature and created a new feature.
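  • Evaluated the new feature with 5-fold cross-validation (CV) on four models (Logistic Regression, Random Forest, XGBoost, LightGBM) and compared the public leaderboard score before and after.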

Introduction

This post continues my work on the Kaggle Titanic competition.
In the Titanic dataset, the Cabin (cabin number) column has more than 70% missing values, so I had previously excluded it.
However, a passenger's chance of survival may have depended on the deck their cabin was on: the physical distance to the lifeboat stations and how quickly the area flooded during the sinking could both have differed by deck. I therefore thought the deck could be a meaningful feature.
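Before building the feature, it can help to look at how much of Cabin is missing and whether survival differs by deck at all. The snippet below is only a minimal sketch on made-up rows; the `toy` frame and its values are assumptions for illustration, not the real train.csv.

```python
import pandas as pd

# Toy rows standing in for train.csv (values are made up for illustration).
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None, "B28", None, None, "C123"],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1],
})

# Fraction of rows with no cabin recorded.
missing_rate = toy["Cabin"].isna().mean()

# First letter of the cabin string, with "U" standing for unknown.
deck = toy["Cabin"].fillna("U").str[0].rename("Deck")

# Survival rate and row count per deck letter.
by_deck = toy.groupby(deck)["Survived"].agg(["mean", "count"])

print(f"missing rate: {missing_rate:.2f}")
print(by_deck)
```

On the real data, the `count` column is worth watching: with most cabins missing, some decks will have very few passengers, so their survival rates are noisy.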

Implementation

I implemented a preprocessing function to extract the first letter of Cabin as the Deck feature, filling missing values with 'U' (Unknown).
The code is as follows:

from sklearn.preprocessing import LabelEncoder


def feature_engineering(df):
    df = df.copy()
    # Existing preprocessing (e.g. extracting Title, filling Age) is omitted here;
    # the 'Title' column must already exist before the encoding loop below.
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

    # Extract the first letter of Cabin as Deck; missing cabins become 'U' (Unknown)
    df['Deck'] = df['Cabin'].fillna('U').str[0]

    # Label-encode each categorical column into integer codes
    for col in ['Sex', 'Embarked', 'Title', 'Deck']:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df

After preprocessing, I encoded the categorical variables using LabelEncoder and added the Deck feature to the model inputs.
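One thing to watch with this approach: `fit_transform` learns the mapping from whatever frame it is given. If `feature_engineering` is called separately on the training and test frames, and a deck letter shows up in only one of them, the same letter can receive different integer codes in each split. The sketch below illustrates that on made-up data and shows one possible fix (fitting the encoder once on the values from both splits). The toy frames and the `Deck_code` column name are assumptions for illustration; I am not claiming this is what happened in my pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy splits: deck "T" appears only in the training rows (made-up data).
train = pd.DataFrame({"Deck": ["C", "T", "U", "U"]})
test = pd.DataFrame({"Deck": ["C", "U", "U"]})

# Fitting separately gives "U" a different code in each split.
sep_train = LabelEncoder().fit(train["Deck"])
sep_test = LabelEncoder().fit(test["Deck"])
u_train = int(sep_train.transform(["U"])[0])
u_test = int(sep_test.transform(["U"])[0])
print("separate fits:", u_train, u_test)

# Fitting once on all observed values keeps the mapping identical.
shared = LabelEncoder().fit(pd.concat([train["Deck"], test["Deck"]]))
train["Deck_code"] = shared.transform(train["Deck"])
test["Deck_code"] = shared.transform(test["Deck"])
print(train["Deck_code"].tolist(), test["Deck_code"].tolist())
```

With the shared encoder, "U" maps to the same integer in both frames, so a model trained on one sees the same meaning in the other.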

Results

Evaluation results using 5-Fold CV:

| Model               | Before (Baseline)  | After (Cabin Added) | Difference |
|---------------------|--------------------|---------------------|------------|
| Logistic Regression | 0.8014 +/- 0.0133  | 0.7991 +/- 0.0199   | -0.0023    |
| Random Forest       | 0.8227 +/- 0.0077  | 0.8148 +/- 0.0149   | -0.0079    |
| XGBoost             | 0.8181 +/- 0.0220  | 0.8227 +/- 0.0159   | +0.0046    |
| LightGBM            | 0.8350 +/- 0.0178  | 0.8361 +/- 0.0278   | +0.0011    |

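The validation score of LightGBM improved slightly from 0.8350 to 0.8361, and XGBoost, the other gradient-boosting model, also improved (+0.0046).
Logistic Regression and Random Forest, however, got slightly worse, and every difference is smaller than the reported +/- spread. The feature may have some effect on the boosting models, but the CV evidence alone is weak.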
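For reference, a before/after comparison of this kind can be sketched with scikit-learn as follows. This is a minimal sketch, not my actual evaluation script: the data is synthetic, only Random Forest is used to keep dependencies small, and the accuracy metric, the stratified shuffled folds, the seeds, and the `cv_accuracy` helper are all assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed training frame (not Titanic data).
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Deck": rng.integers(0, 9, n),
})
# Label depends mostly on Sex, plus some random noise.
y = ((X["Sex"] == 1) | (rng.random(n) < 0.2)).astype(int)


def cv_accuracy(features):
    """Return mean and std of 5-fold CV accuracy for the given columns."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    scores = cross_val_score(model, X[features], y, cv=cv, scoring="accuracy")
    return scores.mean(), scores.std()


base_mean, base_std = cv_accuracy(["Sex", "Pclass"])
deck_mean, deck_std = cv_accuracy(["Sex", "Pclass", "Deck"])
print(f"Before: {base_mean:.4f} +/- {base_std:.4f}")
print(f"After:  {deck_mean:.4f} +/- {deck_std:.4f}")
print(f"Diff:   {deck_mean - base_mean:+.4f}")
```

Using the same fold splitter and seed for both runs means each model is scored on identical folds, so the difference reflects the added column rather than a different split.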

Submitting to Kaggle

When I submitted the predictions to the Titanic competition, the public score dropped slightly from 0.77272 to 0.77033. Simply adding this feature may be introducing noise and causing the model to overfit the training data.
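To put the size of that drop in perspective, the two scores can be converted back into a number of correctly predicted passengers. This is a rough sketch that assumes the public score is plain accuracy computed over all 418 rows of the test set; if only part of the test set is used for the public leaderboard, the counts would differ.

```python
# Public scores reported above.
before, after = 0.77272, 0.77033

# The Titanic test set has 418 rows; assumed here that every row is scored.
n_test = 418

# Convert each accuracy back to a whole number of correct predictions.
correct_before = round(before * n_test)
correct_after = round(after * n_test)

print(correct_before, correct_after, correct_before - correct_after)
# → 323 322 1
```

Under that assumption, the drop corresponds to a single passenger, so the submission result is also too small a difference to read much into by itself.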

Summary

Given the score drop after submission, I will next try creating group features by clustering the cabin information.

Hope this helps!

Japanese version:
Kaggle Practice 1: Setting up Kaggle Titanic Environment on a Local PC
Kaggle Practice 2: First Submission
Kaggle Practice 3: Feature Engineering for Cabin
Kaggle Practice 4: Feature Engineering (Imputing Age with Random Forest)
