Juhi Kushwah

Understanding Data Preprocessing

Data Preprocessing is exactly the right next step after Pandas. Think of it as the bridge between raw data and usable ML input. You can find more information about Pandas here: The next basic concept of Machine Learning after NumPy: Pandas

What is Data Preprocessing in Machine Learning?

Data preprocessing is the process of cleaning, transforming, and preparing data so that a machine learning model can learn from it effectively.

A model can only be as good as the data you feed it.

Why is Data Preprocessing Critical?
Raw data usually has:

  • Missing values
  • Different scales (Age vs Salary)
  • Categorical text values
  • Noise & irrelevant features

ML algorithms assume clean, numerical, well-scaled data.

Core Data Preprocessing Concepts (Must-Know)

  1. Train–Test Split

    Concept: We don’t train and evaluate on the same data.

    • Training set → learn patterns
    • Test set → evaluate performance

    What does it mean?
    We divide data into:

    • Training data → teaches the model
    • Testing data → checks how well it learned

    Typical split:

    • 80% train / 20% test
     from sklearn.model_selection import train_test_split
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    

    Why 80:20 or 70:30?
    Imagine you have 100 exam questions:
    - You practice with 80 questions
    - You test yourself with 20 new ones

    If you test on questions you already practiced → false confidence.
    Common ratios:
    - 80% train / 20% test → most common
    - 70% / 30% → small datasets
    - 90% / 10% → very large datasets

    If you use too much training:
    - Test set too small → unreliable evaluation

    If you use too much testing:
    - Model doesn’t learn enough

    Why it matters: Prevents overfitting and gives realistic performance.

  2. Handling Missing Values
    Problem: ML models cannot work with NaN

    Common strategies:

    • Remove rows/columns (small datasets → risky)
    • Replace with:
      • Mean / Median (numerical)
      • Mode (categorical)
     df.fillna(df.mean(numeric_only=True))  # column-wise means for numeric columns only
    

    Rule of thumb:

    • Use median if data has outliers
    • Never fill test data using test statistics (data leakage!)

    What is an outlier?
    A value that is very different from the rest.
    Example:
    Salaries in a company:
    [45k, 48k, 50k, 52k, 49k, 2,000k]

    That 2,000k (2 million) salary:
    • Skews the average
    • Confuses the model

    Why it’s bad?

    • Mean salary becomes unrealistic
    • Model learns wrong patterns

    What we do:

    • Remove it
    • Cap it
    • Use median instead of mean
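    A minimal sketch (with made-up numbers) of these rules: fill numeric gaps with the median, categorical gaps with the mode, and learn both fill values from the training data only:

     import pandas as pd

     # Made-up example data with missing entries
     train = pd.DataFrame({"Salary": [45_000, 48_000, 50_000, None, 2_000_000],
                           "City": ["Delhi", "Delhi", None, "Mumbai", "Delhi"]})
     test = pd.DataFrame({"Salary": [None, 52_000], "City": ["Mumbai", None]})

     # Learn fill values from the TRAINING data only
     # (median is robust to the 2,000,000 outlier that would drag the mean up)
     salary_fill = train["Salary"].median()
     city_fill = train["City"].mode()[0]

     # Apply the same fill values to train and test (no leakage)
     for df in (train, test):
         df["Salary"] = df["Salary"].fillna(salary_fill)
         df["City"] = df["City"].fillna(city_fill)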
  3. Encoding Categorical Variables
    Problem: ML models only understand numbers, not text.
    Types:

    • Label Encoding → ordered categories
    • One-Hot Encoding → unordered categories (most common)
     pd.get_dummies(df["City"])

    Example:
    City = ["Delhi", "New York City", "Delhi"]

    ❌ **Wrong Way:**

    Delhi = 1
    New York City = 2
    

    Model thinks New York City > Delhi ❌ (no meaning!)

    Correct way: One-Hot Encoding
    Create separate columns:
    [Delhi] = [1, 0, 1]
    [New York City] = [0, 1, 0]

    Now:

    • No false ordering
    • Model understands categories correctly

    Key idea:
    Never give false numeric meaning to categories.
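    A quick sketch of the City example above with pandas (the output columns are simply whatever categories appear in the data):

     import pandas as pd

     df = pd.DataFrame({"City": ["Delhi", "New York City", "Delhi"]})

     # One 0/1 column per category, so there is no false ordering between cities
     encoded = pd.get_dummies(df["City"], dtype=int)
     print(encoded)
     #    Delhi  New York City
     # 0      1              0
     # 1      0              1
     # 2      1              0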

  4. Feature Scaling
    Problem: Different features sit on very different ranges and scales.
    Example 1:

    • Age → 0–100
    • Salary → 0–100000

    This breaks distance-based models (KNN, SVM).

    Example 2:

    • Age → 18–60
    • Salary → 20,000–200,000

    Model pays more attention to Salary just because numbers are bigger ❌

    Solution: Scaling
    Bring all values to similar ranges.

    Two common methods:

    • Standardization → most used
    • Normalization → 0 to 1 range

    🔹 Standardization (most used) = (x − mean) / std

       from sklearn.preprocessing import StandardScaler
       scaler = StandardScaler()
       X_scaled = scaler.fit_transform(X)
    

    🔹 Normalization = (x − min) / (max − min)

    Important:

    • Fit scaler on training data only
    • Apply the same transformation to test data
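    A small sketch of normalization with scikit-learn's MinMaxScaler, assuming X_train and X_test come from the earlier train–test split:

     from sklearn.preprocessing import MinMaxScaler

     scaler = MinMaxScaler()                         # scales each feature to the 0–1 range
     X_train_scaled = scaler.fit_transform(X_train)  # min/max learned from training data only
     X_test_scaled = scaler.transform(X_test)        # same min/max reused on test data (no leakage)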
  5. Feature Selection
    Goal: Keep only useful features and remove useless ones.
    Why?

    • Reduces noise
    • Improves performance
    • Avoids overfitting

    Examples:

    • Remove constant columns
    • Remove highly correlated features
    • Domain knowledge–based selection

    Example:
    Predicting house price:

    • ✅ Size
    • ✅ Location
    • ❌ Owner name
    • ❌ Phone number

    Why?

    • Less noise
    • Faster training
    • Better accuracy
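    A rough sketch (one possible approach, not the only one) of dropping constant columns and one column from each highly correlated numeric pair with pandas:

     import numpy as np
     import pandas as pd

     def drop_constant_and_correlated(df: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
         # Constant columns carry no information
         df = df.loc[:, df.nunique() > 1]

         # Absolute correlations between numeric columns, upper triangle only
         corr = df.select_dtypes("number").corr().abs()
         upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

         # Drop one column from every pair correlated above the threshold
         to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
         return df.drop(columns=to_drop)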
  6. Outlier Handling
    Outliers distort learning.
    Common approaches:

    • Remove extreme values
    • Cap values (winsorization)
    • Use robust scalers

    Tree-based models are less sensitive to outliers.

    Outliers are not always wrong!
    Examples:

    • Billionaires exist
    • Olympic athletes exist

    Options:

    • Remove (if error)
    • Cap (limit max/min)
    • Keep (if meaningful)

    Models affected:

    • Linear models → very sensitive
    • Tree models → less sensitive
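    A small sketch of capping (winsorizing) the salary example above using the common 1.5 × IQR rule:

     import pandas as pd

     salaries = pd.Series([45_000, 48_000, 50_000, 52_000, 49_000, 2_000_000])

     # Cap anything outside the 1.5 * IQR "whiskers"
     q1, q3 = salaries.quantile([0.25, 0.75])
     iqr = q3 - q1
     capped = salaries.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)
     # The 2,000,000 outlier is pulled down to the upper cap; normal salaries are untouched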
  7. Data Leakage (CRITICAL CONCEPT)
    What is it?
    Using information during training that would not be available in real life, such as future data or statistics computed from the test set.
    🚫 Examples:

    • Scaling before train-test split
    • Filling missing values using entire dataset
    • Using future data to predict past

    Rule:

    All preprocessing decisions must be learned from training data only.

    ❌ Bad example:

    • Scaling entire dataset before split
    • Finding mean using full data

    Model secretly sees test data ❌

    ✅ Correct way:

    • Split data
    • Learn statistics from training
    • Apply to test
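    Those three steps in code (a sketch, reusing X and y from earlier):

     from sklearn.model_selection import train_test_split
     from sklearn.preprocessing import StandardScaler

     # ❌ Leaky: StandardScaler().fit_transform(X) before splitting lets the scaler see test rows
     # ✅ Leak-free: split first, learn statistics from the training data only
     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
     scaler = StandardScaler()
     X_train_scaled = scaler.fit_transform(X_train)  # statistics learned from training data
     X_test_scaled = scaler.transform(X_test)        # test data is only transformed, never fitted on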

    Final Mental Model (Remember this):
    Clean data → Fair split → Honest training → Reliable model

Typical ML Preprocessing Pipeline

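A sketch of what such a pipeline can look like with scikit-learn (the column names Age, Salary and City are just placeholders taken from the examples above):

    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric_features = ["Age", "Salary"]      # placeholder column names
    categorical_features = ["City"]

    preprocess = ColumnTransformer([
        # Numeric columns: fill gaps with the median, then standardize
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric_features),
        # Categorical columns: fill gaps with the most frequent value, then one-hot encode
        ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                          ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_features),
    ])

    # Fit on training data only, then apply the same transformations to the test data
    X_train_prepared = preprocess.fit_transform(X_train)
    X_test_prepared = preprocess.transform(X_test)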

To summarize:

Data preprocessing is where ML models are made or broken — it’s more important than the algorithm itself.
