Machine Learning Data Preprocessing: The Mistakes That Break Models Before Training
Your model isn't "not learning."
It's learning the wrong thing — because the data was already broken before training began.
I've seen it countless times: someone spends weeks tuning hyperparameters only to discover the real problem was a preprocessing mistake made in the first 10 lines of code.
🌐 This is a cross-post from my interactive tutorial site mathisimple.com, where every chart and diagram is fully interactive — adjust parameters and watch how small preprocessing decisions dramatically change model performance.
Here are the five most damaging preprocessing mistakes I see in practice, demonstrated with a real estate price prediction example.
Our Dataset
We're predicting house prices using these features:
- numeric: square footage, number of bedrooms, age of house
- categorical: neighborhood type (urban, suburban, rural), house style (modern, traditional, cottage)
- problematic: some missing values, a few extreme outliers in price
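To make the later snippets concrete, here's a synthetic stand-in for this dataset — column names, value ranges, and the price formula are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Tiny synthetic version of the dataset described above
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "sqft": rng.integers(800, 5001, n).astype(float),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.integers(0, 80, n),
    "neighborhood": rng.choice(["urban", "suburban", "rural"], n),
    "style": rng.choice(["modern", "traditional", "cottage"], n),
})
# A rough linear price plus noise - purely illustrative
df["price"] = (150 * df["sqft"] + 20_000 * df["bedrooms"]
               - 500 * df["age"] + rng.normal(0, 30_000, n))
df.loc[rng.choice(n, 10, replace=False), "sqft"] = np.nan  # missing values
df.loc[:2, "price"] *= 5                                    # extreme outliers
```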
Mistake #1: Data Leakage from Improper Train-Test Split
The cardinal sin.
# WRONG - leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # using all data!
X_train, X_test = train_test_split(X_scaled)
Correct way:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # only on training data
X_test_scaled = scaler.transform(X_test) # never fit on test
Why it matters: If you scale using the entire dataset, information from the test set leaks into training. Your "impressive" 94% R² score is fake.
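You can watch the leak happen on synthetic data: a scaler fitted on everything memorizes statistics it could only know by peeking at the test rows (the lognormal feature is made up, standing in for skewed prices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Skewed synthetic feature, like prices with a few extreme values
X = rng.lognormal(mean=7, sigma=1, size=(1000, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

leaky = StandardScaler().fit(X)         # sees the test rows
honest = StandardScaler().fit(X_train)  # training rows only

# The fitted statistics differ, so every transformed value differs too
print(leaky.mean_[0], honest.mean_[0])
```

Every downstream number the model sees is shifted by information it was never supposed to have.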
Mistake #2: Imputing Missing Values Before Splitting (or Not at All)
Never use df.fillna(df.mean()) on the whole dataset — that mean is computed over test rows too, which is the same leakage as Mistake #1.
Better strategies:
- For numerical: use training set median (more robust to outliers)
- For categorical: use the most frequent category from training set only
- Consider adding a "missing" indicator column — sometimes the fact that data is missing is predictive
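All three strategies fit into scikit-learn's SimpleImputer — a small sketch with toy numbers (add_indicator appends the "was missing" flag column):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy "square footage" column; NaN marks missing entries
X_train = np.array([[800.], [1200.], [np.nan], [2400.], [5000.], [1500.]])
X_test = np.array([[np.nan], [900.]])

# Median imputation plus a binary missing-indicator column,
# both learned from the training set only
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuses the training median (1500.0)

print(X_test_imp)  # [[1500. 1.], [900. 0.]]
```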
Mistake #3: Wrong Categorical Encoding
Using LabelEncoder on nominal categories (like neighborhood) is dangerous — it implies order where none exists.
Use OneHotEncoder or pd.get_dummies() for nominal data.
For ordinal data (like education level: high school < bachelor < master), use OrdinalEncoder.
Mistake #4: Ignoring Feature Scales
Let's say square footage ranges from 800 to 5000, while number of bedrooms is 1-5.
Tree-based models (Random Forest, XGBoost) are largely scale-invariant, but distance-based models (kNN, SVM) and gradient-trained models (neural nets) will be dominated by the larger-scale feature.
This is why feature scaling exists.
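A toy distance check makes the domination concrete (house values are made up): on raw features, a 20 sqft gap plus 4 extra bedrooms looks ~70x closer than a 1500 sqft gap; after standardization the two differences carry comparable weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: [square footage, bedrooms]
X = np.array([[1500., 1.],   # reference house
              [1520., 5.],   # nearly same size, 4 more bedrooms
              [3000., 1.]])  # twice the size, same bedrooms

# Raw Euclidean distances from the reference house
d_bedrooms = np.linalg.norm(X[0] - X[1])  # ~20: tiny, bedrooms barely register
d_sqft = np.linalg.norm(X[0] - X[2])      # 1500: driven entirely by square footage

Xs = StandardScaler().fit_transform(X)
ds_bedrooms = np.linalg.norm(Xs[0] - Xs[1])
ds_sqft = np.linalg.norm(Xs[0] - Xs[2])
# After scaling, both neighbors sit at roughly the same distance (~2.1 each)
```

For a kNN regressor, that's the difference between "bedrooms matter" and "bedrooms are invisible".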
Mistake #5: Not Reproducing Preprocessing in Production
The preprocessing pipeline you used during training must be saved and applied identically in production.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Numeric and categorical features need different transformers, so route
# them with a ColumnTransformer (column names are illustrative):
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['sqft', 'bedrooms', 'age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['neighborhood', 'style'])
])
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])
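Once everything lives in one Pipeline, a single joblib artifact carries the preprocessing and the model together. A minimal persistence sketch — toy data and LinearRegression stand in for the real model, and the file path is made up:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a tiny pipeline on toy data
X = np.array([[800., 2.], [1500., 3.], [2400., 4.], [5000., 5.]])
y = np.array([100_000., 200_000., 320_000., 600_000.])
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

# Persist the whole pipeline, then reload it as production would
path = os.path.join(tempfile.mkdtemp(), "house_price_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# Identical preprocessing + model -> identical predictions
assert np.allclose(pipe.predict(X), loaded.predict(X))
```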
The Interactive Version Shows It Best
On mathisimple.com, you can:
- Introduce different types of data problems (missing values, outliers, imbalanced categories)
- See how each mistake affects final model performance in real time
- Compare "naive" preprocessing vs correct pipeline approach
- Experiment with different models to see which are more sensitive to these issues
👉 Open the interactive preprocessing pitfalls tutorial
You'll gain an intuitive understanding of why "my model was working in the notebook but failed in production" happens so often.
This article is part of the Machine Learning Foundations series. Next up: why feature scaling can completely flip your model's predictions in certain cases.
Have you ever chased a bug for days only to discover it was a preprocessing issue? Share your war stories below.