Data Preprocessing - this is exactly the right next step after Pandas. Think of Data Preprocessing as the bridge between raw data and usable ML input. You can find more information about Pandas here: The next basic concept of Machine Learning after NumPy: Pandas
What is Data Preprocessing in Machine Learning?
Data preprocessing is the process of cleaning, transforming, and preparing data so that a machine learning model can learn from it effectively.
A model can only be as good as the data you feed it.
Why is Data Preprocessing Critical?
Raw data usually has:
- Missing values
- Different scales (Age vs Salary)
- Categorical text values
- Noise & irrelevant features
ML algorithms assume clean, numerical, well-scaled data.
Core Data Preprocessing Concepts (Must-Know)
1. Train–Test Split
Concept: We don’t train and evaluate on the same data.
- Training set → learn patterns
- Test set → evaluate performance
What does it mean?
We divide data into:
- Training data → teaches the model
- Testing data → checks how well it learned
Typical split:
- 80% train / 20% test
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Why 80:20 or 70:30?
Imagine you have 100 exam questions:
- You practice with 80 questions
- You test yourself with 20 new ones
If you test on questions you already practiced → false confidence.
Common ratios:
- 80% train / 20% test → most common
- 70% / 30% → small datasets
- 90% / 10% → very large datasets
If you use too much data for training:
- Test set too small → unreliable evaluation
If you use too much for testing:
- Model doesn't learn enough
Why it matters: Prevents overfitting and gives a realistic estimate of performance.
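As a quick illustration of how these ratios translate into row counts, here is a minimal sketch on a purely synthetic 100-row dataset (the data itself is meaningless, only the split sizes matter):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Purely synthetic data: 100 rows, 3 features
X = np.random.rand(100, 3)
y = np.random.randint(0, 2, size=100)

for ratio in (0.2, 0.3, 0.1):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ratio, random_state=42)
    print(f"test_size={ratio}: {len(X_train)} train rows / {len(X_test)} test rows")
# test_size=0.2: 80 train rows / 20 test rows
# test_size=0.3: 70 train rows / 30 test rows
# test_size=0.1: 90 train rows / 10 test rows
```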
2. Handling Missing Values
Problem: Most ML models cannot work with NaN values.
Common strategies:
- Remove rows/columns (small datasets → risky)
- Replace with:
- Mean / Median (numerical)
- Mode (categorical)
df.fillna(df.mean())
Rule of thumb:
- Use median if data has outliers
- Never fill missing values in the test data using statistics computed from the test set (data leakage!)
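A minimal sketch of the leakage-safe pattern, assuming a hypothetical DataFrame with a numeric Salary column that contains missing values:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical DataFrame with a numeric Salary column that has missing values
df = pd.DataFrame({"Salary": [45000, 48000, None, 52000, 49000, None, 2_000_000, 50000]})

train_df, test_df = train_test_split(df, test_size=0.25, random_state=42)

# Learn the fill value from the TRAINING rows only (median is robust to the outlier)
fill_value = train_df["Salary"].median()

# Apply that same training statistic to both sets; no test statistics are used
train_df = train_df.fillna({"Salary": fill_value})
test_df = test_df.fillna({"Salary": fill_value})
```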
What is an outlier?
A value that is very different from the rest.
Example:
Salaries in a company:
[45k, 48k, 50k, 52k, 49k, 2,000k]
That 2,000k (2 million) salary:
- Skews the average
- Confuses the model
Why is it bad?
- Mean salary becomes unrealistic
- Model learns wrong patterns
What we do:
- Remove it
- Cap it
- Use median instead of mean
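A quick check on the salary list above shows how strongly the single outlier pulls the mean while the median barely moves:

```python
import numpy as np

# Salaries from the example above, in thousands
salaries = np.array([45, 48, 50, 52, 49, 2000])

print("Mean:  ", salaries.mean())      # 374.0 -> dragged far up by the single outlier
print("Median:", np.median(salaries))  # 49.5  -> barely affected
```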
3. Encoding Categorical Variables
Problem: ML models only understand numbers, not text.
Types:
- Label Encoding → ordered categories
- One-Hot Encoding → unordered categories (most common)
pd.get_dummies(df["City"])
Example:
City = ["Delhi", "New York City", "Delhi"]
❌ Wrong way:
Delhi = 1, New York City = 2
Model thinks New York City > Delhi ❌ (no meaning!)
✅ Correct way: One-Hot Encoding
Create separate columns:
[Delhi] = [1, 0, 1]
[New York City] = [0, 1, 0]
Now:
- No false ordering
- Model understands categories correctly
Key idea:
Never give false numeric meaning to categories.
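A minimal sketch of one-hot encoding for the City example above, using pd.get_dummies (the dtype=int argument simply forces 0/1 output, since newer pandas versions default to booleans):

```python
import pandas as pd

# Hypothetical DataFrame matching the City example above
df = pd.DataFrame({"City": ["Delhi", "New York City", "Delhi"]})

one_hot = pd.get_dummies(df["City"], dtype=int)
print(one_hot)
#    Delhi  New York City
# 0      1              0
# 1      0              1
# 2      1              0
```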
4. Feature Scaling
Problem: Different features have very different ranges (scales).
Example 1:
- Age → 0–100
- Salary → 0–100000
This breaks distance-based models (KNN, SVM).
Example 2:
- Age → 18–60
- Salary → 20,000–200,000
Model pays more attention to Salary just because numbers are bigger ❌
✅ Solution: Scaling
Bring all values to similar ranges.
Two common methods:
- Standardization → most used
- Normalization → 0 to 1 range
🔹 Standardization (most used) =
(x − mean) / std
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
🔹 Normalization =
(x − min) / (max − min)
Important:
- Fit scaler on training data only
- Apply the same transformation to test data
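A minimal sketch of that leakage-safe scaling pattern, using toy Age/Salary numbers (purely illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Toy Age/Salary matrix (illustrative numbers only)
X = np.array([[25, 30000],
              [40, 52000],
              [35, 48000],
              [58, 120000],
              [29, 41000]], dtype=float)

X_train, X_test = train_test_split(X, test_size=0.4, random_state=42)

# Standardization: learn mean/std from the training data only, then reuse them
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Normalization to the 0-1 range follows the same fit-on-train / transform-test pattern
minmax = MinMaxScaler()
X_train_norm = minmax.fit_transform(X_train)
X_test_norm = minmax.transform(X_test)
```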
5. Feature Selection
Goal: Keep only useful features and remove useless ones.
Why?
- Reduces noise
- Improves performance
- Avoids overfitting
Examples:
- Remove constant columns
- Remove highly correlated features
- Domain knowledge–based selection
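A rough sketch of the first two checks, assuming a small hypothetical DataFrame (the column names and the 0.95 correlation threshold are illustrative choices, not fixed rules):

```python
import numpy as np
import pandas as pd

# Hypothetical DataFrame with numeric features (illustrative only)
df = pd.DataFrame({
    "size_sqft":  [800, 1200, 1500, 950, 2000],
    "size_sqm":   [74, 111, 139, 88, 186],   # almost perfectly correlated with size_sqft
    "country_id": [1, 1, 1, 1, 1],           # constant column -> carries no information
    "price":      [120, 200, 260, 150, 340],
})

# 1) Remove constant columns
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(columns=constant_cols)

# 2) Remove one column from each highly correlated pair
features = df.drop(columns="price")
corr = features.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)

print(df.columns.tolist())  # ['size_sqft', 'price']
```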
Example:
Predicting house price:
- ✅ Size
- ✅ Location
- ❌ Owner name
- ❌ Phone number
Why?
- Less noise
- Faster training
- Better accuracy
6. Outlier Handling
Outliers distort learning.
Common approaches:
- Remove extreme values
- Cap values (winsorization), as shown in the sketch at the end of this section
- Use robust scalers
Tree-based models are less sensitive to outliers.
Outliers are not always wrong!
Examples:
- Billionaires exist
- Olympic athletes exist
Options:
- Remove (if error)
- Cap (limit max/min)
- Keep (if meaningful)
Models affected:
- Linear models → very sensitive
- Tree models → less sensitive
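Here is a minimal capping (winsorization) sketch using the salary example from earlier; the IQR-based limits are one common rule of thumb, not the only option:

```python
import pandas as pd

# Salary example from above, with one extreme value (in thousands)
salaries = pd.Series([45, 48, 50, 52, 49, 2000], dtype=float)

# IQR rule of thumb: cap anything beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR
q1, q3 = salaries.quantile(0.25), salaries.quantile(0.75)
iqr = q3 - q1
capped = salaries.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

print(capped.tolist())  # the 2000k outlier is capped near 56k; the other values are unchanged
```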
7. Data Leakage (CRITICAL CONCEPT)
What is it?
Using information during training that would not be available in real life, such as future data or statistics computed from the test set.
🚫 Examples:
- Scaling before train-test split
- Filling missing values using the entire dataset
- Filling missing values using entire dataset
- Using future data to predict past
Rule:
All preprocessing decisions must be learned from training data only.
❌ Bad example:
- Scaling entire dataset before split
- Finding mean using full data
Model secretly sees test data ❌
✅ Correct way:
- Split data
- Learn statistics from training
- Apply to test
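A short sketch contrasting the leaky and the honest order of operations (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 3)              # toy data, illustrative only
y = np.random.randint(0, 2, size=100)

# ❌ Leaky: scaling BEFORE the split lets the scaler see the future test rows
# X_leaky = StandardScaler().fit_transform(X)

# ✅ Honest: split first, learn the statistics from the training set only
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # fit + transform on training data
X_test = scaler.transform(X_test)        # transform only; test data is never "fit" on
```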
Final Mental Model (Remember this):
Clean data → Fair split → Honest training → Reliable model
Typical ML Preprocessing Pipeline
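One way to tie these steps together is scikit-learn's Pipeline and ColumnTransformer. The sketch below assumes hypothetical Age, Salary, and City columns and a simple classifier; adapt the names to your own data:

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column lists; adapt to your own DataFrame
numeric_features = ["Age", "Salary"]
categorical_features = ["City"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill NaNs with the training median
    ("scale", StandardScaler()),                   # standardize numeric features
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_features),
    ("cat", categorical_pipeline, categorical_features),
])

model = Pipeline([("preprocess", preprocess), ("clf", LogisticRegression())])

# model.fit(X_train, y_train)    # all statistics are learned from training data only
# model.score(X_test, y_test)    # the same preprocessing is applied consistently to the test set
```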
To summarize:
Data preprocessing is where ML models are made or broken — it’s more important than the algorithm itself.
