Hi everyone! 👋 I'm currently working through the Titanic dataset as part of the CampusX YouTube course, and I ran into an interesting issue involving `OneHotEncoder` and `SimpleImputer` that I finally understood after digging into the problem.

This blog is all about that journey: what caused the shape mismatch between training and testing data, and how I fixed it. If you're also preprocessing categorical variables in machine learning, this might save you a few hours of debugging!
## The Setup
We're using the Titanic dataset for classification (predicting survival), and like most people, I'm preprocessing the `Sex` and `Embarked` columns using:

- `SimpleImputer` to handle missing values
- `OneHotEncoder` to convert categorical variables into numerical format
Here's a snippet of what I had:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

si_embarked = SimpleImputer(strategy='most_frequent')
# Note: in scikit-learn >= 1.2 the argument is sparse_output, not sparse
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')

# Imputation
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = si_embarked.transform(x_test[['Embarked']])

# Encoding
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])
```
## The Issue

After running this, I checked the shapes:

```python
x_train_embarked.shape  # (712, 4)
x_test_embarked.shape   # (179, 3)
```

Wait, what?! Why does the train set have 4 columns and the test set only 3, even though I used `handle_unknown='ignore'`? Wasn't that supposed to handle unknown categories safely?
## Investigating the Root Cause

I ran a few more checks and realized something sneaky:

```python
x_train['Embarked'].isnull().sum()  # Output: 2
```

Hmm, that's weird. I thought I had already imputed missing values. But then I remembered this part of my code:

```python
x_train_embarked = si_embarked.fit_transform(x_train[['Embarked']])
```

Aha! I had imputed the missing values into a new variable, `x_train_embarked`, but I never updated the original `x_train` DataFrame!

That means the original `x_train['Embarked']` still had `NaN` values when I called `.fit()` on the encoder:

```python
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
```
This caused the encoder to treat `NaN` as a valid category, so it learned four categories:

```python
['C', 'Q', 'S', nan]
```

But the test data contained no `NaN` values, only:

```python
['C', 'Q', 'S']
```

So the test encoding covered only the three real ports, and I ended up with:

```python
x_test_embarked.shape  # (179, 3)
```
## The Fix

The correct way was to assign the imputed values back to the original DataFrames:

```python
x_train['Embarked'] = si_embarked.fit_transform(x_train[['Embarked']])
x_test['Embarked'] = si_embarked.transform(x_test[['Embarked']])
```

Now, when I fit the encoder:

```python
ohe_embarked = OneHotEncoder(sparse=False, handle_unknown='ignore')
x_train_embarked = ohe_embarked.fit_transform(x_train[['Embarked']])
x_test_embarked = ohe_embarked.transform(x_test[['Embarked']])
```

the shapes finally matched:

```python
x_train_embarked.shape  # (712, 3)
x_test_embarked.shape   # (179, 3)
```
## Bonus: What `handle_unknown='ignore'` Really Means

Here's a quick visual explanation. Imagine your training data had these categories:

```python
['Red', 'Blue', 'Green']
```

And you encode them like this:

| Red | Blue | Green |
|-----|------|-------|
| 1   | 0    | 0     |
| 0   | 1    | 0     |
| 0   | 0    | 1     |

Now your test data contains a new category: `'Yellow'`. If you use:

```python
OneHotEncoder(handle_unknown='ignore')
```

then the encoder will just assign all zeros for `'Yellow'`:

| Red | Blue | Green |
|-----|------|-------|
| 0   | 0    | 0     |

No crash. But also: you now have a row of all zeros!
## Final Takeaways

- Always handle missing values before encoding.
- If you're using `SimpleImputer`, assign the output back to your original DataFrame.
- `handle_unknown='ignore'` prevents errors, but doesn't fix shape mismatches caused by categories seen only during `.fit()`.
This was a great learning moment for me while working through the Titanic dataset with CampusX. Hope this helps anyone else facing the same mystery!
Let me know if you've run into similar preprocessing surprises!