Machine Learning Data Preprocessing: The Mistakes That Break Models Before Training
Your model isn't "not learning."
It's learning the wrong thing — because the data was already broken before training began.
I've seen it countless times: someone spends weeks tuning hyperparameters only to discover the real problem was a preprocessing mistake made in the first 10 lines of code.
🌐 This is a cross-post from my interactive tutorial site mathisimple.com, where every chart and diagram is fully interactive — adjust parameters and watch how small preprocessing decisions dramatically change model performance.
Here are the five most damaging preprocessing mistakes I see in practice, demonstrated with a real estate price prediction example.
Our Dataset
We're predicting house prices using these features:
- numeric: square footage, number of bedrooms, age of house
- categorical: neighborhood type (urban, suburban, rural), house style (modern, traditional, cottage)
- problematic: some missing values, a few extreme outliers in price
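To make the later snippets concrete, here's a synthetic stand-in for this dataset — column names, value ranges, and the price formula are all made up for illustration:

```python
import numpy as np
import pandas as pd

# Tiny synthetic version of the dataset described above
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "sqft": rng.integers(800, 5001, n).astype(float),
    "bedrooms": rng.integers(1, 6, n),
    "age": rng.integers(0, 80, n),
    "neighborhood": rng.choice(["urban", "suburban", "rural"], n),
    "style": rng.choice(["modern", "traditional", "cottage"], n),
})
# A rough linear price plus noise - purely illustrative
df["price"] = (150 * df["sqft"] + 20_000 * df["bedrooms"]
               - 500 * df["age"] + rng.normal(0, 30_000, n))
df.loc[rng.choice(n, 10, replace=False), "sqft"] = np.nan  # missing values
df.loc[:2, "price"] *= 5                                    # extreme outliers
```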
Mistake #1: Data Leakage from Improper Train-Test Split
The cardinal sin.
# WRONG - leakage!
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X) # using all data!
X_train, X_test = train_test_split(X_scaled)
Correct way:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # only on training data
X_test_scaled = scaler.transform(X_test) # never fit on test
Why it matters: If you scale using the entire dataset, information from the test set leaks into training. Your "impressive" 94% R² score is fake.
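You can watch the leak happen on synthetic data: a scaler fitted on everything memorizes statistics it could only know by peeking at the test rows (the lognormal feature is made up, standing in for skewed prices):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Skewed synthetic feature, like prices with a few extreme values
X = rng.lognormal(mean=7, sigma=1, size=(1000, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

leaky = StandardScaler().fit(X)         # sees the test rows
honest = StandardScaler().fit(X_train)  # training rows only

# The fitted statistics differ, so every transformed value differs too
print(leaky.mean_[0], honest.mean_[0])
```

Every downstream number the model sees is shifted by information it was never supposed to have.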
Mistake #2: Imputing Missing Values Before Splitting (or Not at All)
Never use df.fillna(df.mean()) on the whole dataset — that mean is computed over test rows too, which is the same leakage as Mistake #1.
Better strategies:
- For numerical: use training set median (more robust to outliers)
- For categorical: use the most frequent category from training set only
- Consider adding a "missing" indicator column — sometimes the fact that data is missing is predictive
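All three strategies fit into scikit-learn's SimpleImputer — a small sketch with toy numbers (add_indicator appends the "was missing" flag column):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy "square footage" column; NaN marks missing entries
X_train = np.array([[800.], [1200.], [np.nan], [2400.], [5000.], [1500.]])
X_test = np.array([[np.nan], [900.]])

# Median imputation plus a binary missing-indicator column,
# both learned from the training set only
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_train_imp = imputer.fit_transform(X_train)
X_test_imp = imputer.transform(X_test)  # reuses the training median (1500.0)

print(X_test_imp)  # [[1500. 1.], [900. 0.]]
```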
Mistake #3: Wrong Categorical Encoding
Using LabelEncoder on nominal categories (like neighborhood) is dangerous — it implies order where none exists.
Use OneHotEncoder or pd.get_dummies() for nominal data.
For ordinal data (like education level: high school < bachelor < master), use OrdinalEncoder.
Mistake #4: Ignoring Feature Scales
Let's say square footage ranges from 800 to 5000, while number of bedrooms is 1-5.
Tree-based models (Random Forest, XGBoost) are largely scale-invariant, but distance-based models (kNN, SVM) and gradient-trained models (neural nets) will be dominated by the larger-scale feature.
This is why feature scaling exists.
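A toy distance check makes the domination concrete (house values are made up): on raw features, a 20 sqft gap plus 4 extra bedrooms looks ~70x closer than a 1500 sqft gap; after standardization the two differences carry comparable weight.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# columns: [square footage, bedrooms]
X = np.array([[1500., 1.],   # reference house
              [1520., 5.],   # nearly same size, 4 more bedrooms
              [3000., 1.]])  # twice the size, same bedrooms

# Raw Euclidean distances from the reference house
d_bedrooms = np.linalg.norm(X[0] - X[1])  # ~20: tiny, bedrooms barely register
d_sqft = np.linalg.norm(X[0] - X[2])      # 1500: driven entirely by square footage

Xs = StandardScaler().fit_transform(X)
ds_bedrooms = np.linalg.norm(Xs[0] - Xs[1])
ds_sqft = np.linalg.norm(Xs[0] - Xs[2])
# After scaling, both neighbors sit at roughly the same distance (~2.1 each)
```

For a kNN regressor, that's the difference between "bedrooms matter" and "bedrooms are invisible".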
Mistake #5: Not Reproducing Preprocessing in Production
The preprocessing pipeline you used during training must be saved and applied identically in production.
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# Numeric and categorical features need different transformers, so route
# them with a ColumnTransformer (column names are illustrative):
preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['sqft', 'bedrooms', 'age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['neighborhood', 'style'])
])
model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])
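Once everything lives in one Pipeline, a single joblib artifact carries the preprocessing and the model together. A minimal persistence sketch — toy data and LinearRegression stand in for the real model, and the file path is made up:

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Fit a tiny pipeline on toy data
X = np.array([[800., 2.], [1500., 3.], [2400., 4.], [5000., 5.]])
y = np.array([100_000., 200_000., 320_000., 600_000.])
pipe = Pipeline([("scaler", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

# Persist the whole pipeline, then reload it as production would
path = os.path.join(tempfile.mkdtemp(), "house_price_pipeline.joblib")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# Identical preprocessing + model -> identical predictions
assert np.allclose(pipe.predict(X), loaded.predict(X))
```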
The Interactive Version Shows It Best
On mathisimple.com, you can:
- Introduce different types of data problems (missing values, outliers, imbalanced categories)
- See how each mistake affects final model performance in real time
- Compare "naive" preprocessing vs correct pipeline approach
- Experiment with different models to see which are more sensitive to these issues
👉 Open the interactive preprocessing pitfalls tutorial
You'll gain an intuitive understanding of why "my model was working in the notebook but failed in production" happens so often.
This article is part of the Machine Learning Foundations series. Next up: why feature scaling can completely flip your model's predictions in certain cases.
Have you ever chased a bug for days only to discover it was a preprocessing issue? Share your war stories below.