Vineet Chauhan

Posted on May 28

Feature Engineering is Not Just “Cleaning Data”: What I Learned While Building a Real ML Pipeline

#machinelearning #ai #programming #datascience

Most machine learning tutorials make preprocessing look straightforward.

Handle missing values.
Encode categorical features.
Train the model.
Get accuracy.

But while working on a real classification dataset, I realized feature engineering is far less about applying textbook techniques and far more about making careful decisions under uncertainty.

This project completely changed how I think about preprocessing.

Instead of writing another “complete guide to feature engineering,” I wanted to document the actual engineering problems I faced while building a preprocessing pipeline — including debugging mistakes, failed assumptions, distribution shifts, encoding challenges, and how preprocessing itself changed model behaviour.

The Dataset Looked Simple at First

Initially, the dataset looked manageable:

Numerical features
Categorical features
Missing values
Binary target variable

Nothing seemed unusual.

But the moment preprocessing started, the real complexity appeared.

The First Problem: Missing Values Were Uneven Everywhere

One of the first things I checked was the percentage of missing values across columns.

The rate varied widely from column to column:

some had less than 1% missing values
some had 5–10%
others had more than 30%
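A minimal sketch of that check, assuming a pandas DataFrame. The toy data, the column names `training_hours` and `experience`, and the 5% / 30% cut-offs are illustrative assumptions, not values from the real dataset:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (hypothetical columns and values).
df = pd.DataFrame({
    "training_hours": np.arange(40, dtype=float),
    "gender": ["Male", "Female"] * 20,
    "experience": np.arange(40),
})
df.loc[0, "training_hours"] = np.nan   # 1 of 40 rows missing
df.loc[:15, "gender"] = np.nan         # .loc slices are inclusive: 16 of 40 rows missing

# isna() marks missing cells; the column-wise mean of those booleans is the missing fraction.
missing_pct = df.isna().mean().mul(100).sort_values(ascending=False)
print(missing_pct.round(1))

# Bucket columns by how much is missing (thresholds are illustrative).
low_missing = missing_pct[(missing_pct > 0) & (missing_pct < 5)].index.tolist()
high_missing = missing_pct[missing_pct >= 30].index.tolist()
print("low:", low_missing, "high:", high_missing)
```

Sorting the percentages makes it easier to see which columns might need a different strategy before any rows are dropped or values filled.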

This immediately raised an important question:

Should every missing value be handled using the same strategy?

The answer quickly became no.

My Initial Mistake: Applying One Strategy Everywhere

At first, I tried treating all missing values similarly.

That approach failed quickly because:

Complete Case Analysis removed too many rows
KNN Imputation behaved poorly on categorical-heavy features
Encoded categorical values introduced unrealistic numeric relationships
Feature distributions started changing unexpectedly

This was the first moment I realized:

Feature engineering is not a fixed recipe.

Different features require different preprocessing decisions.

Using Complete Case Analysis (CCA)

For columns with less than 5% missing values, I used Complete Case Analysis.

At first, this seemed harmless.

But then I decided to compare feature distributions before and after row deletion.

This turned out to be one of the most important observations in the project.

Even deleting a small number of rows slightly shifted the density and distribution of the remaining features.

That was the moment I understood:

Missing value handling is not only about removing NaNs — it can also reshape the dataset itself.
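One way to run that before/after comparison is sketched below. The data is synthetic and the column names `score` and `hours` are hypothetical. The missingness is deliberately tied to `score` so the effect is visible; how strongly this happens in a real dataset depends on why the values are missing.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic data: "score" is fully observed, "hours" will get gaps.
df = pd.DataFrame({
    "score": rng.normal(50, 10, size=1000),
    "hours": rng.normal(40, 5, size=1000),
})

# Assumption for illustration: missingness is not random.
# Blank out "hours" in the 40 highest-scoring rows (4% of all rows).
top = df["score"].nlargest(40).index
df.loc[top, "hours"] = np.nan

# Complete Case Analysis on the column with gaps.
after = df.dropna(subset=["hours"])

# Compare a column that had no missing values of its own.
summary = pd.DataFrame({
    "before": df["score"].describe(),
    "after": after["score"].describe(),
})
summary["diff"] = summary["after"] - summary["before"]
print(summary.loc[["count", "mean", "std", "50%", "75%"]])
```

A histogram or KDE plot of the same two series would show the change in density more directly; the summary table is just the smallest check that needs no plotting library.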

A Small Pandas Mistake That Broke My Pipeline

One debugging issue confused me for quite a while.

Initially, I wrote:

df[cols].dropna()

Because df[cols] selects only those columns before dropna runs, the result no longer contained any of the other columns in the dataframe.

The correct approach was:

df.dropna(subset=cols)

The difference was tiny syntactically but huge logically.
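The two calls can be compared side by side on a small frame. The column names here are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical three-column frame; only "a" and "b" are checked for NaNs.
df = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],
    "b": [1.0, 2.0, np.nan],
    "target": [0, 1, 0],
})
cols = ["a", "b"]

# Selects the columns first, so "target" is already gone when dropna runs.
selected_then_dropped = df[cols].dropna()

# Looks for NaNs only in cols, but keeps every column of the surviving rows.
dropped_on_subset = df.dropna(subset=cols)

print(selected_then_dropped.columns.tolist())
print(dropped_on_subset.columns.tolist())
```

Both versions keep the same rows; only the second keeps the full set of columns.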

This taught me something surprisingly important:

Many machine learning problems are not model problems.
They are dataframe manipulation problems.

Why KNN Imputer Became Complicated

Initially, I planned to use KNN Imputer for all remaining missing values.

But another issue appeared immediately.

KNN relies on distance calculations.

Distance works naturally for numerical data, but categorical columns require encoding first.

That introduced several complications:

Label encoding created artificial numeric relationships
One Hot Encoding exploded feature dimensionality
NaN values were accidentally converted into strings during preprocessing
Encoded categories distorted neighbor similarity

This made me realize:

KNN Imputer works much better on numerical features than on heavily categorical datasets.
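A minimal sketch of KNN imputation restricted to numerical columns, using scikit-learn's `KNNImputer`. The columns `city_index` and `training_hours` and their values are made up for illustration, and scaling before imputing is an assumption of this sketch rather than something the original pipeline is known to have done:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numerical columns on very different scales, forming two loose groups.
X_num = pd.DataFrame({
    "city_index": [0.90, 0.92, 0.91, 0.60, 0.62, np.nan],
    "training_hours": [100.0, 110.0, np.nan, 20.0, 30.0, 25.0],
})

# Scale first so no single column dominates the distance, then impute.
# StandardScaler ignores NaNs when fitting and leaves them in place on transform.
knn_numeric = make_pipeline(StandardScaler(), KNNImputer(n_neighbors=2))
X_filled_scaled = knn_numeric.fit_transform(X_num)

# Undo the scaling to read the imputed values in their original units.
scaler = knn_numeric.named_steps["standardscaler"]
X_filled = pd.DataFrame(
    scaler.inverse_transform(X_filled_scaled),
    columns=X_num.columns,
)
print(X_filled)
```

Each gap is filled with the average of the two rows that are closest on the columns both rows have observed, which is why the approach only makes sense when distance between rows is meaningful.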

Eventually, I switched to a hybrid preprocessing strategy instead of forcing one solution everywhere.

Final Missing Value Strategy

I ended up using:

Feature type | Strategy
Low-missing numerical | Complete Case Analysis
High-missing numerical | Median / KNN imputation
High-missing categorical | Most frequent / “Missing” category

This hybrid approach worked far better than blindly applying one technique globally.
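The strategy above can be sketched with `dropna` plus a `ColumnTransformer` of `SimpleImputer`s. `gender` and `company_type` are column names mentioned in this post; `experience`, `training_hours`, the toy values, and the assignment of each strategy to each column are assumptions made for the example:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "experience": [1.0, 2.0, 3.0, 4.0, np.nan],
    "training_hours": [10.0, np.nan, 30.0, 1000.0, 50.0],
    "company_type": ["Pvt Ltd", np.nan, "Pvt Ltd", np.nan, "NGO"],
    "gender": ["Male", np.nan, "Female", "Male", "Female"],
})

# Step 1: Complete Case Analysis only on the low-missing column.
low_missing = ["experience"]
df_cca = df.dropna(subset=low_missing)

# Step 2: a different imputer per column group.
imputer = ColumnTransformer(
    transformers=[
        ("num_median", SimpleImputer(strategy="median"), ["training_hours"]),
        ("cat_missing", SimpleImputer(strategy="constant", fill_value="Missing"), ["company_type"]),
        ("cat_frequent", SimpleImputer(strategy="most_frequent"), ["gender"]),
    ],
    remainder="passthrough",  # untouched columns are appended after the imputed ones
)

# Output order is: transformer columns in the order listed, then the remainder.
out_cols = ["training_hours", "company_type", "gender", "experience"]
filled = pd.DataFrame(imputer.fit_transform(df_cca), columns=out_cols, index=df_cca.index)
print(filled)
```

The median is used for the numerical column here because it is less sensitive to an outlier such as the 1000 in the toy data; `KNNImputer` could be swapped in for that slot.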

Encoding Was More Important Than I Expected

Encoding looked simple in theory.

But in practice, deciding between:

One Hot Encoding
Ordinal Encoding
Label Encoding

actually mattered a lot.

I used:

One Hot Encoding

For:

gender
major_discipline
company_type
enrolled_university

because these categories had no natural order.

Ordinal Encoding

For:

education_level

because education levels have a natural ranking:

Primary School < High School < Graduate < Masters < PhD

This distinction improved model behavior more than I initially expected.
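A small sketch of that split, using scikit-learn's `OneHotEncoder` and `OrdinalEncoder`. The column names and the education ranking come from this post; the sample rows are invented, and only `gender` is shown for the one-hot side to keep the example short:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "education_level": ["Graduate", "PhD", "High School", "Masters"],
})

# Explicit order, lowest to highest; OrdinalEncoder maps these to 0..4.
education_order = ["Primary School", "High School", "Graduate", "Masters", "PhD"]

encoder = ColumnTransformer(
    transformers=[
        # No natural order: one binary column per category.
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["gender"]),
        # Natural order: a single integer-coded column that preserves the ranking.
        ("ordinal", OrdinalEncoder(categories=[education_order]), ["education_level"]),
    ],
    sparse_threshold=0,  # always return a dense array so it is easy to print
)
encoded = encoder.fit_transform(df)
print(encoded)
```

Passing `categories` explicitly matters: without it, `OrdinalEncoder` sorts the labels alphabetically, which would not match the educational ranking.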

The Most Interesting Observation: Preprocessing Changed the Models More Than the Models Changed Themselves

I trained multiple models:

Logistic Regression
Decision Tree
Random Forest

both before and after preprocessing.

The results were surprisingly different.

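The linear model (Logistic Regression) improved substantially after scaling and proper encoding.

Random Forest stayed comparatively stable with or without aggressive preprocessing, which fits the fact that tree splits do not depend on feature scale.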

That observation completely changed my perspective.

Data preprocessing often influences performance more than changing the algorithm itself.
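A rough way to set up this kind of comparison is sketched below on synthetic data from `make_classification`, not the dataset from this project. The inflated feature scale, the model settings, and the split are all assumptions, and the scores it prints say nothing about the results reported here:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification problem (stand-in data).
X, y = make_classification(n_samples=600, n_features=8, n_informative=5, random_state=0)
X[:, 0] *= 10_000  # put one feature on a much larger scale than the rest

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Same two algorithms, each with and without scaling.
models = {
    "logreg_raw": LogisticRegression(max_iter=200),
    "logreg_scaled": make_pipeline(StandardScaler(), LogisticRegression(max_iter=200)),
    "forest_raw": RandomForestClassifier(n_estimators=100, random_state=0),
    "forest_scaled": make_pipeline(
        StandardScaler(), RandomForestClassifier(n_estimators=100, random_state=0)
    ),
}

# Fit on the training split only, then score on the held-out split.
scores = {name: m.fit(X_train, y_train).score(X_test, y_test) for name, m in models.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Keeping the split and random seeds fixed means any difference between the raw and scaled rows comes from the preprocessing rather than from a different sample.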

Another Real Problem: Feature Explosion

After applying One Hot Encoding, the number of features increased rapidly.

This was another practical challenge rarely discussed in beginner tutorials.

Encoding solved categorical representation issues, but it also increased dimensionality and preprocessing complexity significantly.
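The growth can be estimated before encoding by counting distinct categories. A minimal sketch, where the `city` and `training_hours` columns and all values are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Male"],
    "company_type": ["Pvt Ltd", "NGO", "Public Sector", "Funded Startup"],
    "city": ["city_1", "city_2", "city_3", "city_4"],
    "training_hours": [10, 20, 30, 40],
})
cat_cols = ["gender", "company_type", "city"]

# One-hot encoding creates one column per distinct category,
# so the total number of new columns is the sum of the cardinalities.
cardinality = df[cat_cols].nunique()
expected_new_cols = int(cardinality.sum())

encoded = pd.get_dummies(df, columns=cat_cols)

print(cardinality)
print(f"{df.shape[1]} columns before, {encoded.shape[1]} after")
```

Checking `nunique()` first makes high-cardinality columns stand out before they are expanded.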

Why Pipelines Became Necessary

At one point, preprocessing became chaotic.

Different transformations were happening separately:

scaling
imputation
encoding
train-test transformations

Tracking transformed columns manually became painful.

This was when I finally understood why sklearn Pipelines and ColumnTransformers matter so much.

Not because they look advanced, but because preprocessing becomes unmanageable very quickly in real projects.
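A compact sketch of how `Pipeline` and `ColumnTransformer` can hold those steps together. The data is random, `training_hours` is a hypothetical column, and the choice of imputers, encoder, and classifier is an assumption for illustration rather than the exact pipeline from this project:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 200

# Random stand-in data with gaps in both column types.
df = pd.DataFrame({
    "training_hours": rng.integers(1, 300, size=n).astype(float),
    "gender": rng.choice(["Male", "Female", "Other"], size=n),
    "target": rng.integers(0, 2, size=n),
})
df.loc[::10, "training_hours"] = np.nan
df.loc[::7, "gender"] = np.nan

# Numerical branch: impute, then scale.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
# Categorical branch: explicit "Missing" category, then one-hot encode.
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="constant", fill_value="Missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["training_hours"]),
    ("cat", categorical, ["gender"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

X = df.drop(columns="target")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# fit() learns medians, scaling statistics and categories from the training split only;
# score() then applies the same fitted transformations to the test split.
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Because every transformation lives inside one object, the train-test handling no longer has to be tracked by hand, and the target here is random, so the printed accuracy is not meaningful in itself.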

What This Project Changed for Me

Before this project, I thought feature engineering mostly meant:

removing null values
encoding categories
scaling features

Now I think differently.

Feature engineering is closer to:

understanding how data behaves under transformation.

Every preprocessing decision changes:

distributions
feature relationships
dimensionality
information retention
model assumptions

Even small preprocessing choices can significantly change model behavior.

Final Thoughts
One thing became very clear after this project:

Machine learning is not just model training.

Most real effort goes into:

understanding data
debugging preprocessing
handling edge cases
preserving useful information
testing assumptions

Feature engineering is where datasets stop behaving like clean classroom examples and start behaving like real systems.

And honestly, that is where machine learning starts becoming interesting.
