hqqqqy
The Preprocessing Checklist I Wish I Had on My First ML Project

My first production ML model predicted house prices with 95% accuracy in testing. In production, it predicted negative prices for 30% of houses. The bug wasn't in the model — it was in preprocessing steps I didn't even know I needed.

The Three Silent Bugs

Bug #1: I encoded categorical variables after splitting train/test, so the test set had categories the model never saw.

Bug #2: I filled missing values with the mean of the entire dataset, leaking test statistics into training.

Bug #3: I scaled features using the test set's mean and standard deviation, not the training set's.

All three bugs were invisible in development because train and test came from the same distribution. In production, new data had different patterns, and the model collapsed.

The 5-Minute Preprocessing Checklist

Here's the exact sequence I follow now, in order. The order matters — doing these steps out of sequence causes subtle bugs.

Step 1: Split First, Preprocess Second

from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('houses.csv')

# CRITICAL: Split BEFORE any preprocessing
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Now preprocess train and test separately, using only train statistics

Why this matters: Any statistics you calculate (mean, median, categories, scaling parameters) must come from training data only. If you preprocess before splitting, test statistics leak into your preprocessing.

In my exploration of data preprocessing pitfalls, I found that this single mistake — preprocessing before splitting — is the most common cause of models that work in testing but fail in production.

Step 2: Handle Missing Values (Train Statistics Only)

from sklearn.impute import SimpleImputer

# WRONG: Calculate mean from entire dataset
mean_wrong = X['square_feet'].mean()  # Includes test data!
X_train['square_feet'] = X_train['square_feet'].fillna(mean_wrong)
X_test['square_feet'] = X_test['square_feet'].fillna(mean_wrong)

# RIGHT: Calculate mean from training data only
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train[['square_feet']])  # Fit on train only

# Assign back with double brackets: transform returns a 2D array
X_train[['square_feet']] = imputer.transform(X_train[['square_feet']])
X_test[['square_feet']] = imputer.transform(X_test[['square_feet']])

My missing value decision tree:

| Data type   | Missing < 5%                  | Missing 5-40%                                   | Missing > 40% |
| ----------- | ----------------------------- | ----------------------------------------------- | ------------- |
| Numerical   | Mean/median imputation        | Model-based imputation or add missing indicator | Drop feature  |
| Categorical | Mode imputation               | Add "missing" category                          | Drop feature  |
| Time series | Forward fill or interpolation | Seasonal imputation                             | Drop feature  |
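Two branches of this table can be sketched with SimpleImputer — a minimal example with made-up column names (`square_feet`, `heating`), using np.nan to mark the missing entries:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frames; np.nan marks the missing entries
X_train = pd.DataFrame({'square_feet': [1200.0, np.nan, 1500.0, 1800.0],
                        'heating': ['gas', 'electric', np.nan, 'gas']})
X_test = pd.DataFrame({'square_feet': [np.nan, 1600.0],
                       'heating': [np.nan, 'electric']})

# Numeric, moderate missingness: median imputation plus a was-missing indicator column
num_imputer = SimpleImputer(strategy='median', add_indicator=True)
num_train = num_imputer.fit_transform(X_train[['square_feet']])  # fit on train only
num_test = num_imputer.transform(X_test[['square_feet']])

# Categorical: make missingness an explicit category instead of guessing a value
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
cat_train = cat_imputer.fit_transform(X_train[['heating']])
cat_test = cat_imputer.transform(X_test[['heating']])
```

The indicator column lets the model learn whether missingness itself is predictive, which is often the case with real-estate data.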

Step 3: Encode Categorical Variables (Handle Unseen Categories)

import numpy as np
from sklearn.preprocessing import LabelEncoder

# WRONG: Encode train and test separately
le_wrong = LabelEncoder()
X_train['neighborhood'] = le_wrong.fit_transform(X_train['neighborhood'])
X_test['neighborhood'] = LabelEncoder().fit_transform(X_test['neighborhood'])  # Different encoding!

# RIGHT: Fit on train, handle unseen categories in test
le_right = LabelEncoder()
le_right.fit(X_train['neighborhood'])

# Handle categories in test that weren't in train
X_test['neighborhood'] = X_test['neighborhood'].map(
    lambda x: x if x in le_right.classes_ else 'unknown'
)

# Add 'unknown' to the known classes (keep classes_ sorted so transform stays consistent)
if 'unknown' not in le_right.classes_:
    le_right.classes_ = np.sort(np.append(le_right.classes_, 'unknown'))

X_train['neighborhood'] = le_right.transform(X_train['neighborhood'])
X_test['neighborhood'] = le_right.transform(X_test['neighborhood'])

Better approach: LabelEncoder is really intended for target labels, not features. For features, use OneHotEncoder with handle_unknown='ignore':

from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X_train[['neighborhood']])

X_train_encoded = encoder.transform(X_train[['neighborhood']])
X_test_encoded = encoder.transform(X_test[['neighborhood']])  # Unseen categories become all zeros

Step 4: Feature Scaling (Train Statistics Only)

from sklearn.preprocessing import StandardScaler

# WRONG: Fit scaler on test data
scaler_wrong = StandardScaler()
X_train_scaled = scaler_wrong.fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)  # Uses test mean/std!

# RIGHT: Fit on train, transform both
scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train)
X_test_scaled = scaler_right.transform(X_test)  # Uses train mean/std

When to scale:

  • Must scale: kNN, SVM, neural networks, PCA, clustering
  • Don't scale: Tree-based models (Random Forest, XGBoost, LightGBM)
  • Depends: Linear/logistic regression (scale when using regularization or gradient-based solvers; otherwise mainly for coefficient interpretability)
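To see why distance-based models need scaling, compare Euclidean distances before and after standardization — a toy example with two made-up houses:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two houses: very different bedroom counts, barely different square footage
X = np.array([[1.0, 2000.0],   # [bedrooms, square_feet]
              [5.0, 2050.0]])

# Unscaled, square_feet dominates the distance: sqrt(4**2 + 50**2) ~ 50.2
raw_dist = np.linalg.norm(X[0] - X[1])

# After standardization, both features contribute equally: sqrt(2**2 + 2**2) ~ 2.83
X_scaled = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(X_scaled[0] - X_scaled[1])
```

A kNN model fed the raw features would effectively ignore the bedroom count.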

Step 5: Check for Data Leakage

Data leakage happens when information from outside the training set — test data, or data from the future — influences training. It produces optimistic test accuracy that doesn't hold in production.

# Common leakage sources:
# 1. Target leakage: Features that contain the target
# 2. Train-test contamination: Test statistics in preprocessing
# 3. Temporal leakage: Using future data to predict the past

# Check for suspiciously high correlations with target
correlations = X_train.corrwith(y_train, numeric_only=True).abs().sort_values(ascending=False)
print("Top correlations with target:")
print(correlations.head(10))

# If any feature has correlation > 0.95, investigate for leakage
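For the temporal case, the fix is to split on time rather than randomly — a sketch assuming a hypothetical 'date' column:

```python
import pandas as pd

# Hypothetical time-stamped sales
df = pd.DataFrame({
    'date': pd.to_datetime(['2023-01-05', '2023-03-10', '2023-06-20',
                            '2023-09-01', '2023-11-15']),
    'price': [300_000, 310_000, 295_000, 320_000, 330_000],
})

# Train on the past, test on the future -- never the other way around
cutoff = pd.Timestamp('2023-07-01')
train = df[df['date'] < cutoff]
test = df[df['date'] >= cutoff]

assert train['date'].max() < test['date'].min()  # no future data in training
```

For model selection on time-ordered data, sklearn's TimeSeriesSplit generalizes this idea to cross-validation.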

The Pipeline Pattern: Preventing Mistakes

The best way to avoid preprocessing bugs is to use sklearn's Pipeline. It applies the steps in order and fits every transformer only on the data passed to fit(), which prevents leakage:

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Define preprocessing and model in one pipeline
# (assumes all-numeric features; use ColumnTransformer to add categorical steps)
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Predict on test data (preprocessing applied automatically)
y_pred = pipeline.predict(X_test)

# Save entire pipeline for production
import joblib
joblib.dump(pipeline, 'model_pipeline.pkl')

Why pipelines prevent bugs:

  1. Fit only on training data: pipeline.fit(X_train, y_train) ensures all preprocessing uses train statistics
  2. Consistent preprocessing: Test and production data get identical preprocessing
  3. Easy deployment: Save one object, not separate preprocessors and model
  4. Cross-validation safe: Works correctly with cross_val_score and GridSearchCV
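Point 4 is worth demonstrating: inside cross_val_score, the pipeline's imputer and scaler are re-fit on each fold's training portion only. A runnable sketch on synthetic data standing in for the housing features:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data with some injected missing values
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=42)
X[::10, 0] = np.nan

pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Each fold fits the imputer/scaler on that fold's training split only,
# so validation-fold statistics never leak into preprocessing
scores = cross_val_score(pipeline, X, y, cv=5)
```

Doing the same with standalone preprocessors would silently fit them on the full dataset before cross-validation — exactly the contamination bug from earlier.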

What Most Tutorials Miss

The biggest mistake I made was not saving the preprocessing objects. I trained a model, saved it, then in production I had to recreate the preprocessing from scratch. The new preprocessing had slightly different parameters (different mean, different categories), and predictions were garbage.

Always save these objects:

import joblib

# Save the entire pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Or save preprocessors separately if not using Pipeline
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(encoder, 'encoder.pkl')
joblib.dump(imputer, 'imputer.pkl')

# In production, load and use the same objects
pipeline = joblib.load('model_pipeline.pkl')
predictions = pipeline.predict(new_data)

Another gotcha: checking for missing values after splitting but not checking again after preprocessing. Some transformations can introduce NaN or inf values (for example, log-transforming zero or negative values, or dividing by a zero standard deviation in hand-rolled scaling):

# After each preprocessing step, check for NaN/inf
def check_data_quality(X, step_name):
    if np.any(np.isnan(X)):
        print(f"Warning: NaN values after {step_name}")
    if np.any(np.isinf(X)):
        print(f"Warning: Inf values after {step_name}")

check_data_quality(X_train_scaled, "scaling")

Key Takeaways for Developers

  • Always split before preprocessing — any statistics (mean, categories, scaling) must come from training data only
  • Use Pipeline to bundle preprocessing and model — it prevents leakage and makes deployment easier
  • Handle unseen categories in test/production — use handle_unknown='ignore' in encoders
  • Save all preprocessing objects alongside the model — you'll need them for production predictions
  • Check for data leakage by looking for suspiciously high correlations with the target

The three bugs that broke my first production model now take five minutes to prevent with this checklist. If you want to see interactive examples of how preprocessing order affects model performance, check out the data preprocessing visualizer — it shows exactly how leakage happens and how to prevent it.

For more on preprocessing best practices, see the scikit-learn preprocessing guide and this comprehensive paper on data leakage.
