My first production ML model predicted house prices with 95% accuracy in testing. In production, it predicted negative prices for 30% of houses. The bug wasn't in the model — it was in preprocessing steps I didn't even know I needed.
## The Three Silent Bugs
Bug #1: I fit separate categorical encoders on train and test, so the same labels got inconsistent codes and the test set had categories the model never saw.
Bug #2: I filled missing values with the mean of the entire dataset, leaking test statistics into training.
Bug #3: I scaled features using the test set's mean and standard deviation, not the training set's.
All three bugs were invisible in development because train and test came from the same distribution. In production, new data had different patterns, and the model collapsed.
## The 5-Minute Preprocessing Checklist
Here's the exact sequence I follow now, in order. The order matters — doing these steps out of sequence causes subtle bugs.
### Step 1: Split First, Preprocess Second

```python
from sklearn.model_selection import train_test_split
import pandas as pd
import numpy as np

# Load data
df = pd.read_csv('houses.csv')

# CRITICAL: Split BEFORE any preprocessing
X = df.drop('price', axis=1)
y = df['price']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Now preprocess train and test separately, using only train statistics
```
Why this matters: Any statistics you calculate (mean, median, categories, scaling parameters) must come from training data only. If you preprocess before splitting, test statistics leak into your preprocessing.
In my exploration of data preprocessing pitfalls, I found that this single mistake — preprocessing before splitting — is the most common cause of models that work in testing but fail in production.
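To make the leak concrete, here's a small sketch (synthetic data, not the article's `houses.csv`) in which the feature drifts upward late in the data. A statistic computed before splitting quietly includes the test rows:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, time-ordered feature whose distribution drifts upward late in the data
rng = np.random.default_rng(0)
X = np.concatenate([
    rng.normal(100, 10, 800),   # early rows
    rng.normal(140, 10, 200),   # late rows (the "future")
]).reshape(-1, 1)

# shuffle=False mimics a chronological split
X_train, X_test = train_test_split(X, test_size=0.2, shuffle=False)

# Leaky: statistic computed on the full dataset includes the test rows
leaky_mean = X.mean()

# Correct: statistic computed from the training split only
scaler = StandardScaler().fit(X_train)
train_mean = scaler.mean_[0]

print(f"full-data mean: {leaky_mean:.1f}, train-only mean: {train_mean:.1f}")
```

The two means differ by roughly eight units here, which is exactly the kind of gap that makes development metrics look better than production ones.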
### Step 2: Handle Missing Values (Train Statistics Only)

```python
from sklearn.impute import SimpleImputer

# WRONG: Calculate mean from entire dataset
mean_wrong = X['square_feet'].mean()  # Includes test data!
X_train['square_feet'] = X_train['square_feet'].fillna(mean_wrong)
X_test['square_feet'] = X_test['square_feet'].fillna(mean_wrong)

# RIGHT: Calculate mean from training data only
imputer = SimpleImputer(strategy='mean')
imputer.fit(X_train[['square_feet']])  # Fit on train only

# transform() returns a 2D array, so flatten it before assigning to a column
X_train['square_feet'] = imputer.transform(X_train[['square_feet']]).ravel()
X_test['square_feet'] = imputer.transform(X_test[['square_feet']]).ravel()
```
My missing value decision tree:
| Data Type | Missing < 5% | Missing 5-40% | Missing > 40% |
|---|---|---|---|
| Numerical | Mean/median imputation | Model-based imputation or add missing indicator | Drop feature |
| Categorical | Mode imputation | Add "missing" category | Drop feature |
| Time series | Forward fill or interpolation | Seasonal imputation | Drop feature |
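The "add missing indicator" option from the middle column is built into `SimpleImputer`; a quick sketch with made-up numbers:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical column with ~30% missing values
df = pd.DataFrame({"square_feet": [1200.0, np.nan, 1500.0, np.nan, 900.0,
                                   1100.0, np.nan, 1300.0, 1000.0, 1400.0]})

# add_indicator=True appends a 0/1 column marking which rows were imputed,
# letting the model treat "was missing" as a signal in its own right
imputer = SimpleImputer(strategy="median", add_indicator=True)
out = imputer.fit_transform(df[["square_feet"]])

print(out.shape)   # (10, 2): imputed value + missing-value indicator
print(out[1])      # row 1 was missing -> [median, 1.0]
```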
### Step 3: Encode Categorical Variables (Handle Unseen Categories)

```python
from sklearn.preprocessing import LabelEncoder
import numpy as np

# WRONG: Encode train and test separately
le_wrong = LabelEncoder()
X_train['neighborhood'] = le_wrong.fit_transform(X_train['neighborhood'])
X_test['neighborhood'] = LabelEncoder().fit_transform(X_test['neighborhood'])  # Different encoding!

# RIGHT: Fit on train, handle unseen categories in test
le_right = LabelEncoder()
le_right.fit(X_train['neighborhood'])

# Map categories in test that weren't in train to 'unknown'
X_test['neighborhood'] = X_test['neighborhood'].map(
    lambda x: x if x in le_right.classes_ else 'unknown'
)

# Add 'unknown' to the encoder's classes (kept sorted, as LabelEncoder expects)
if 'unknown' not in le_right.classes_:
    le_right.classes_ = np.sort(np.append(le_right.classes_, 'unknown'))

X_train['neighborhood'] = le_right.transform(X_train['neighborhood'])
X_test['neighborhood'] = le_right.transform(X_test['neighborhood'])
```
Better approach: Use `OneHotEncoder` with `handle_unknown='ignore'`:

```python
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
encoder.fit(X_train[['neighborhood']])

X_train_encoded = encoder.transform(X_train[['neighborhood']])
X_test_encoded = encoder.transform(X_test[['neighborhood']])  # Unseen categories become all zeros
```
### Step 4: Feature Scaling (Train Statistics Only)

```python
from sklearn.preprocessing import StandardScaler

# WRONG: Fit scaler on test data
scaler_wrong = StandardScaler()
X_train_scaled = scaler_wrong.fit_transform(X_train)
X_test_scaled = StandardScaler().fit_transform(X_test)  # Uses test mean/std!

# RIGHT: Fit on train, transform both
scaler_right = StandardScaler()
X_train_scaled = scaler_right.fit_transform(X_train)
X_test_scaled = scaler_right.transform(X_test)  # Uses train mean/std
```
When to scale:
- Must scale: kNN, SVM, neural networks, PCA, clustering
- Don't scale: Tree-based models (Random Forest, XGBoost, LightGBM)
- Depends: Linear/logistic regression (scale for interpretability, not performance)
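Whichever scaler you pick, the fitted object is what carries the train statistics; a sketch with toy numbers shows `transform` reusing them on a test value outside the training range:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[40.0]])  # outside the training range

scaler = StandardScaler().fit(X_train)

# The fitted scaler stores *training* statistics only
train_mean = scaler.mean_[0]    # 20.0
train_std = scaler.scale_[0]    # population std of the train column

# transform() reuses those statistics unchanged on test data
z = scaler.transform(X_test)[0, 0]
print(train_mean, round(z, 3))
```

This is also why saving the fitted scaler matters: `mean_` and `scale_` are state, not something you can recompute from production data.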
### Step 5: Check for Data Leakage
Data leakage happens when information that won't be available at prediction time (including anything from the test set) sneaks into training. It produces optimistic accuracy that doesn't hold in production.
```python
# Common leakage sources:
# 1. Target leakage: features that contain the target
# 2. Train-test contamination: test statistics used in preprocessing
# 3. Temporal leakage: using future data to predict the past

# Check for suspiciously high correlations with the target
# (corrwith only considers numeric columns)
correlations = X_train.corrwith(y_train).abs().sort_values(ascending=False)
print("Top correlations with target:")
print(correlations.head(10))

# If any feature has correlation > 0.95, investigate for leakage
```
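Temporal leakage (source #3) needs its own splitting strategy: train on the past, validate on the future. A sketch using scikit-learn's `TimeSeriesSplit`:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# For time-ordered data, a random split trains the model on the future.
# TimeSeriesSplit keeps every validation fold strictly after its training fold.
X = np.arange(20).reshape(-1, 1)  # rows assumed sorted by time

splits = list(TimeSeriesSplit(n_splits=3).split(X))
for train_idx, test_idx in splits:
    print(f"train rows 0-{train_idx.max()}, test rows {test_idx.min()}-{test_idx.max()}")
```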
## The Pipeline Pattern: Preventing Mistakes
The best way to avoid preprocessing bugs is to use sklearn's Pipeline. It applies the steps in a fixed order, fits them on training data only, and refits the preprocessing inside each fold during cross-validation, so test data can't leak into training:
```python
import joblib
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

# Define preprocessing and model in one pipeline
pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler()),
    ('model', RandomForestRegressor(random_state=42))
])

# Fit pipeline on training data
pipeline.fit(X_train, y_train)

# Predict on test data (preprocessing applied automatically)
y_pred = pipeline.predict(X_test)

# Save entire pipeline for production
joblib.dump(pipeline, 'model_pipeline.pkl')
```
Why pipelines prevent bugs:
- Fit only on training data: `pipeline.fit(X_train, y_train)` ensures all preprocessing uses train statistics
- Consistent preprocessing: test and production data get identical transformations
- Easy deployment: save one object, not separate preprocessors and a model
- Cross-validation safe: works correctly with `cross_val_score` and `GridSearchCV`
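In practice the imputer, encoder, and scaler from Steps 2-4 need to hit different columns; `ColumnTransformer` composes them into one pipeline. A sketch with hypothetical column names (adjust them to your dataset):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names for illustration
numeric_cols = ["square_feet", "bedrooms"]
categorical_cols = ["neighborhood"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

pipeline = Pipeline([("preprocess", preprocess),
                     ("model", RandomForestRegressor(random_state=42))])

# Tiny toy frame just to show the pipeline runs end to end
df = pd.DataFrame({
    "square_feet": [1200.0, np.nan, 1500.0, 900.0],
    "bedrooms": [3, 2, 4, 2],
    "neighborhood": ["downtown", "suburb", np.nan, "downtown"],
    "price": [300_000, 250_000, 400_000, 200_000],
})
pipeline.fit(df.drop(columns="price"), df["price"])
preds = pipeline.predict(df.drop(columns="price"))
print(len(preds))  # 4
```

Everything — imputation statistics, learned categories, scaling parameters, and the model — now lives in one object you can fit, cross-validate, and save atomically.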
## What Most Tutorials Miss
The biggest mistake I made was not saving the preprocessing objects. I trained a model, saved it, then in production I had to recreate the preprocessing from scratch. The new preprocessing had slightly different parameters (different mean, different categories), and predictions were garbage.
Always save these objects:

```python
import joblib

# Save the entire pipeline
joblib.dump(pipeline, 'model_pipeline.pkl')

# Or save preprocessors separately if not using Pipeline
joblib.dump(scaler, 'scaler.pkl')
joblib.dump(encoder, 'encoder.pkl')
joblib.dump(imputer, 'imputer.pkl')

# In production, load and use the same objects
pipeline = joblib.load('model_pipeline.pkl')
predictions = pipeline.predict(new_data)
```
Another gotcha: checking for missing values after splitting but never again after preprocessing. Some operations can still introduce NaN or inf values (for example, dividing by a zero standard deviation when scaling by hand, or log-transforming zero or negative values):
```python
import numpy as np

# After each preprocessing step, check for NaN/inf
def check_data_quality(X, step_name):
    if np.any(np.isnan(X)):
        print(f"Warning: NaN values after {step_name}")
    if np.any(np.isinf(X)):
        print(f"Warning: Inf values after {step_name}")

check_data_quality(X_train_scaled, "scaling")
```
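Before shipping, it's also worth a quick round-trip check that the saved pipeline reproduces the in-memory one. A sketch using synthetic data and a linear model:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_regression(n_samples=50, n_features=3, random_state=0)
pipeline = Pipeline([("scaler", StandardScaler()),
                     ("model", LinearRegression())]).fit(X, y)

# Round-trip through disk and confirm the loaded pipeline predicts identically
path = os.path.join(tempfile.mkdtemp(), "model_pipeline.pkl")
joblib.dump(pipeline, path)
loaded = joblib.load(path)

match = np.allclose(pipeline.predict(X), loaded.predict(X))
print(match)  # True
```

A mismatch here usually means some state (a custom transformer attribute, a lookup table) isn't being pickled with the pipeline.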
## Key Takeaways for Developers
- Always split before preprocessing — any statistics (mean, categories, scaling) must come from training data only
- Use `Pipeline` to bundle preprocessing and model — it prevents leakage and makes deployment easier
- Handle unseen categories in test/production — use `handle_unknown='ignore'` in encoders
- Save all preprocessing objects alongside the model — you'll need them for production predictions
- Check for data leakage by looking for suspiciously high correlations with the target
The three bugs that broke my first production model now take five minutes to prevent with this checklist. If you want to see interactive examples of how preprocessing order affects model performance, check out the data preprocessing visualizer — it shows exactly how leakage happens and how to prevent it.
For more on preprocessing best practices, see the scikit-learn preprocessing guide and this comprehensive paper on data leakage.