Juhi Kushwah

Understanding Data Preprocessing

Data Preprocessing is exactly the right next step after Pandas. Think of it as the bridge between raw data and usable ML input. You can find more information about Pandas here: The next basic concept of Machine Learning after NumPy: Pandas

What is Data Preprocessing in Machine Learning?

Data preprocessing is the process of cleaning, transforming, and preparing data so that a machine learning model can learn from it effectively.

A model can only be as good as the data you feed it.

Why is Data Preprocessing Critical?
Raw data usually has:

  • Missing values
  • Different scales (Age vs Salary)
  • Categorical text values
  • Noise & irrelevant features

ML algorithms assume clean, numerical, well-scaled data.

Core Data Preprocessing Concepts (Must-Know)

  1. Train–Test Split

    Concept: We don’t train and evaluate on the same data.

    • Training set → learn patterns
    • Test set → evaluate performance

    What does it mean?
    We divide data into:

    • Training data → teaches the model
    • Testing data → checks how well it learned

    Typical split:

    • 80% train / 20% test
     from sklearn.model_selection import train_test_split
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    

    Why 80:20 or 70:30?
    Imagine you have 100 exam questions:
    - You practice with 80 questions
    - You test yourself with 20 new ones

    If you test on questions you already practiced → false confidence.
    Common ratios:
    - 80% train / 20% test → most common
    - 70% / 30% → small datasets
    - 90% / 10% → very large datasets

    If you use too much training:
    - Test set too small → unreliable evaluation

    If you use too much testing:
    - Model doesn’t learn enough

    Why it matters: Prevents overfitting and gives realistic performance.

  2. Handling Missing Values
    Problem: ML models cannot work with NaN

    Common strategies:

    • Remove rows/columns (small datasets → risky)
    • Replace with:
      • Mean / Median (numerical)
      • Mode (categorical)
     df.fillna(df.mean(numeric_only=True))  # column-wise means for numeric columns only
    

    Rule of thumb:

    • Use median if data has outliers
    • Never fill test data using test statistics (data leakage!)

    What is an outlier?
    A value that is very different from the rest.
    Example:
    Salaries in a company:
    [45k, 48k, 50k, 52k, 49k, 2,000k]

    That 2,000k (2 million) salary:
    • Skews the average
    • Confuses the model

    Why it’s bad?

    • Mean salary becomes unrealistic
    • Model learns wrong patterns

    What we do:

    • Remove it
    • Cap it
    • Use median instead of mean
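    A minimal sketch (with made-up numbers) of these rules: fill numeric gaps with the median, categorical gaps with the mode, and learn both fill values from the training data only:

     import pandas as pd

     # Made-up example data with missing entries
     train = pd.DataFrame({"Salary": [45_000, 48_000, 50_000, None, 2_000_000],
                           "City": ["Delhi", "Delhi", None, "Mumbai", "Delhi"]})
     test = pd.DataFrame({"Salary": [None, 52_000], "City": ["Mumbai", None]})

     # Learn fill values from the TRAINING data only
     # (median is robust to the 2,000,000 outlier that would drag the mean up)
     salary_fill = train["Salary"].median()
     city_fill = train["City"].mode()[0]

     # Apply the same fill values to train and test (no leakage)
     for df in (train, test):
         df["Salary"] = df["Salary"].fillna(salary_fill)
         df["City"] = df["City"].fillna(city_fill)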
  3. Encoding Categorical Variables
    Problem: ML models only understand numbers, not text.
    Types:

    • Label Encoding → ordered categories
    • One-Hot Encoding → unordered categories (most common)
     pd.get_dummies(df["City"])

    Example:
    City = ["Delhi", "New York City", "Delhi"]

    ❌ **Wrong Way:**

    Delhi = 1
    New York City = 2
    

    Model thinks New York City > Delhi ❌ (no meaning!)

    Correct way: One-Hot Encoding
    Create separate columns:
    [Delhi] = [1, 0, 1]
    [New York City] = [0, 1, 0]

    Now:

    • No false ordering
    • Model understands categories correctly

    Key idea:
    Never give false numeric meaning to categories.
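    A quick sketch of the City example above with pandas (the output columns are simply whatever categories appear in the data):

     import pandas as pd

     df = pd.DataFrame({"City": ["Delhi", "New York City", "Delhi"]})

     # One 0/1 column per category, so there is no false ordering between cities
     encoded = pd.get_dummies(df["City"], dtype=int)
     print(encoded)
     #    Delhi  New York City
     # 0      1              0
     # 1      0              1
     # 2      1              0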

  4. Feature Scaling
    Problem: Different features sit on very different ranges and scales.
    Example 1:

    • Age → 0–100
    • Salary → 0–100000

    This breaks distance-based models (KNN, SVM).

    Example 2:

    • Age → 18–60
    • Salary → 20,000–200,000

    Model pays more attention to Salary just because numbers are bigger ❌

    Solution: Scaling
    Bring all values to similar ranges.

    Two common methods:

    • Standardization → most used
    • Normalization → 0 to 1 range

    🔹 Standardization (most used) = (x − mean) / std

       from sklearn.preprocessing import StandardScaler
       scaler = StandardScaler()
       X_scaled = scaler.fit_transform(X)
    

    🔹 Normalization = (x − min) / (max − min)

    Important:

    • Fit scaler on training data only
    • Apply the same transformation to test data
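    A small sketch of normalization with scikit-learn's MinMaxScaler, assuming X_train and X_test come from the earlier train–test split:

     from sklearn.preprocessing import MinMaxScaler

     scaler = MinMaxScaler()                         # scales each feature to the 0–1 range
     X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from training data only
     X_test_scaled = scaler.transform(X_test)        # same min/max reused on test data (no leakage)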
  5. Feature Selection
    Goal: Keep only useful features and remove useless ones.
    Why?

    • Reduces noise
    • Improves performance
    • Avoids overfitting

    Examples:

    • Remove constant columns
    • Remove highly correlated features
    • Domain knowledge–based selection

    Example:
    Predicting house price:

    • ✅ Size
    • ✅ Location
    • ❌ Owner name
    • ❌ Phone number

    Why?

    • Less noise
    • Faster training
    • Better accuracy
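    A rough sketch (one possible approach, not the only one) of dropping constant columns and one column from each highly correlated numeric pair with pandas:

     import numpy as np
     import pandas as pd

     def drop_constant_and_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
         # Constant columns carry no information
         df = df.loc[:, df.nunique() > 1]

         # Absolute correlations between numeric columns, upper triangle only
         corr = df.select_dtypes("number").corr().abs()
         upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

         # Drop one column from every pair correlated above the threshold
         to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
         return df.drop(columns=to_drop)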
  6. Outlier Handling
    Outliers distort learning.
    Common approaches:

    • Remove extreme values
    • Cap values (winsorization)
    • Use robust scalers

    Tree-based models are less sensitive to outliers.

    Outliers are not always wrong!
    Examples:

    • Billionaires exist
    • Olympic athletes exist

    Options:

    • Remove (if error)
    • Cap (limit max/min)
    • Keep (if meaningful)

    Models affected:

    • Linear models → very sensitive
    • Tree models → less sensitive
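    A small sketch of capping (winsorizing) the salary example above using the common 1.5 × IQR rule:

     import pandas as pd

     salaries = pd.Series([45_000, 48_000, 50_000, 52_000, 49_000, 2_000_000])

     # Cap anything outside the 1.5 * IQR "whiskers"
     q1, q3 = salaries.quantile([0.25, 0.75])
     iqr = q3 - q1
     capped = salaries.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
     # The 2,000,000 outlier is pulled down to the upper cap; normal salaries are untouched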
  7. Data Leakage (CRITICAL CONCEPT)
    What is it?
    Using information during training that would not be available in real life, such as future data or statistics computed from the test set.
    🚫 Examples:

    • Scaling before train-test split
    • Filling missing values using entire dataset
    • Using future data to predict past

    Rule:

    All preprocessing decisions must be learned from training data only.

    ❌ Bad example:

    • Scaling entire dataset before split
    • Finding mean using full data

    Model secretly sees test data ❌

    ✅ Correct way:

    • Split data
    • Learn statistics from training
    • Apply to test
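    Those three steps in code (a sketch, reusing X and y from earlier):

     from sklearn.model_selection import train_test_split
     from sklearn.preprocessing import StandardScaler

     # ❌ Leaky: StandardScaler().fit_transform(X) before splitting lets the scaler see test rows
     # ✅ Leak-free: split first, learn statistics from the training data only
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     scaler = StandardScaler()
     X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data
     X_test_scaled = scaler.transform(X_test)        # test data is only transformed, never fitted on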

    Final Mental Model (Remember this):
    Clean data → Fair split → Honest training → Reliable model

Typical ML Preprocessing Pipeline

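A sketch of what such a pipeline can look like with scikit-learn (the column names Age, Salary and City are just placeholders taken from the examples above):

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_features = ["Age", "Salary"]      # placeholder column names
    categorical_features = ["City"]

    preprocess = ColumnTransformer([
        # Numeric columns: fill gaps with the median, then standardize
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_features),
        # Categorical columns: fill gaps with the most frequent value, then one-hot encode
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
    ])

    # Fit on training data only, then apply the same transformations to the test data
    X_train_prepared = preprocess.fit_transform(X_train)
    X_test_prepared = preprocess.transform(X_test)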

To summarize:

Data preprocessing is where ML models are made or broken — it’s more important than the algorithm itself.
