Kaggle Practice 1: Setting Up a Local Environment for the Kaggle Titanic Competition
Kaggle Practice 2: First Submission
Kaggle Practice 3: Feature Engineering for Cabin
Kaggle Practice 4: Feature Engineering (Imputing Age with Random Forest)
https://www.kaggle.com/c/titanic
Abstract
- Extracted the deck (floor) information from the first letter of the Cabin feature and created a new feature.
- Summary of the evaluation using 5-Fold CV (Cross-Validation) on models like LightGBM.
Introduction
This post continues my work on the Kaggle Titanic competition.
In the Titanic dataset, the Cabin (cabin number) column has more than 70% missing values, so I had previously excluded it.
However, the physical distance to the lifeboat stations and the rate of flooding during the sinking might have differed depending on the cabin deck, which could affect a passenger's chance of survival. Therefore, I thought the deck could be a meaningful feature.
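Both points are easy to check before building the feature. The sketch below computes the share of missing Cabin values and the survival rate per deck letter. It uses a tiny hand-made DataFrame (`toy`, a hypothetical stand-in for train.csv), so the numbers it prints are illustrative only; with the real data you would load the file with `pd.read_csv` instead.

```python
import pandas as pd

# Toy stand-in for train.csv; the values are made up for illustration.
toy = pd.DataFrame({
    "Cabin": ["C85", None, "E46", None, None, "B28", None, "C123"],
    "Survived": [1, 0, 0, 0, 1, 1, 0, 1],
})

# Fraction of rows where Cabin is missing.
missing_ratio = toy["Cabin"].isna().mean()

# Deck letter per passenger; 'U' marks an unknown (missing) cabin.
deck = toy["Cabin"].fillna("U").str[0]

# Survival rate and passenger count per deck letter.
by_deck = toy.groupby(deck)["Survived"].agg(["mean", "count"])

print(f"missing ratio: {missing_ratio:.3f}")
print(by_deck)
```

The `count` column matters as much as the `mean`: a deck with only a handful of passengers gives a survival rate that is too noisy to trust.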
Implementation
I implemented a preprocessing function to extract the first letter of Cabin as the Deck feature, filling missing values with 'U' (Unknown).
The code is as follows:
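from sklearn.preprocessing import LabelEncoder

def feature_engineering(df):
    df = df.copy()
    # Existing preprocessing (e.g. extracting Title, filling Age) is omitted here;
    # the 'Title' column encoded below is assumed to be created in that step.
    df['Fare'] = df['Fare'].fillna(df['Fare'].median())
    df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])
    # Extract the first letter of Cabin as Deck; missing cabins become 'U' (Unknown)
    df['Deck'] = df['Cabin'].fillna('U').str[0]
    # Encode each categorical column as integer labels (a fresh encoder per column)
    for col in ['Sex', 'Embarked', 'Title', 'Deck']:
        df[col] = LabelEncoder().fit_transform(df[col])
    return df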
After preprocessing, I encoded the categorical variables using LabelEncoder and added the Deck feature to the model inputs.
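One thing to watch with this approach: if `feature_engineering` is called separately on the training and test data, `LabelEncoder` is fit twice, and a deck letter that appears in only one of the two sets shifts the integer codes of the others. The sketch below shows one way around that, fitting a single encoder on the decks from both sets. The two small DataFrames and the helper name `add_deck` are made up for illustration and are not part of the original pipeline.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames; deck 'T' appears only in the training data (illustrative values).
train = pd.DataFrame({"Cabin": ["C85", None, "T", "E46"]})
test = pd.DataFrame({"Cabin": [None, "E31", "C22"]})

def add_deck(df):
    """Hypothetical helper: add a Deck column from the first letter of Cabin."""
    df = df.copy()
    df["Deck"] = df["Cabin"].fillna("U").str[0]
    return df

train, test = add_deck(train), add_deck(test)

# Fit one encoder on the union of both sets so the codes match everywhere.
le = LabelEncoder()
le.fit(pd.concat([train["Deck"], test["Deck"]]))
train["Deck"] = le.transform(train["Deck"])
test["Deck"] = le.transform(test["Deck"])

print(dict(zip(le.classes_, range(len(le.classes_)))))
print(train["Deck"].tolist(), test["Deck"].tolist())
```

With separate fits, 'U' would be encoded as 3 in `train` but 2 in `test` here, so the model would see the same deck under two different codes.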
Results
Evaluation results using 5-Fold CV:
| Model | Before (Baseline) | After (Cabin Added) | Difference |
|---|---|---|---|
| Logistic Regression | 0.8014 +/- 0.0133 | 0.7991 +/- 0.0199 | -0.0023 |
| Random Forest | 0.8227 +/- 0.0077 | 0.8148 +/- 0.0149 | -0.0079 |
| XGBoost | 0.8181 +/- 0.0220 | 0.8227 +/- 0.0159 | +0.0046 |
| LightGBM | 0.8350 +/- 0.0178 | 0.8361 +/- 0.0278 | +0.0011 |
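The validation score of LightGBM improved slightly from 0.8350 to 0.8361.
Since XGBoost, another gradient-boosting model, also improved (0.8181 to 0.8227), the feature seems to have some effect on that family of models. Logistic Regression and Random Forest got slightly worse, however, and every difference is within roughly one standard deviation of the fold scores, so the evidence is weak.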
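A comparison like the one in the table can be set up along these lines. This is a minimal sketch, not the exact script behind the results: it uses synthetic data from `make_classification` as a stand-in for the preprocessed Titanic features, covers only two of the four models, and assumes accuracy as the metric with a shuffled `StratifiedKFold`. The fold setup and random seeds are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the preprocessed feature matrix and target (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Reuse the same 5 folds for every model so the scores are comparable.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.4f} +/- {scores.std():.4f}")
```

Running the same loop once with and once without the Deck column, on identical folds, gives the Before and After columns; printing the standard deviation next to the mean makes it easier to see whether a difference is larger than the fold-to-fold spread.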
Submitting to Kaggle
When I submitted the predictions to the Titanic competition, the public score dropped slightly, from 0.77272 to 0.77033. Adding the feature as-is may be introducing noise and causing the model to overfit.
Summary
Given the score drop after submission, I will next try creating group features by clustering the cabin information.
Hope this helps!