The One-Line Summary: Categorical variables are words your model can't understand. You must convert them to numbers — but HOW you convert them determines whether your model learns truth or nonsense.
The Foreign Exchange Student
Meet Alex, an exchange student from Mars.
Alex is brilliant at math. Give Alex any numbers, and he'll find patterns, make predictions, solve problems.
But Alex has one limitation: he only understands numbers.
You show Alex a dataset about Earth cars:
Car      Color   Size     Origin
──────────────────────────────────
Toyota   Red     Medium   Japan
BMW      Blue    Large    Germany
Honda    Green   Small    Japan
Tesla    White   Large    USA
Alex stares at it, confused.
"What is 'Red'? What is 'Japan'? These symbols mean nothing to me. Give me NUMBERS!"
You need to translate. But here's the problem:
The wrong translation will teach Alex lies.
The Disastrous First Attempt
You think: "Easy! I'll just number them!"
Color: Red=1, Blue=2, Green=3, White=4
Size: Small=1, Medium=2, Large=3
Origin: Japan=1, Germany=2, USA=3
Alex is happy. Numbers! He starts analyzing.
Then he announces his findings:
"I've discovered that White cars are 4 times better than Red cars!"
"USA is clearly the best country because it has the highest number!"
"If I average Blue and White, I get Green!"
You've taught Alex complete nonsense.
By assigning arbitrary numbers, you implied a mathematical relationship that doesn't exist. There's no universe where (Blue + White) / 2 = Green.
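You can watch the nonsense happen in a few lines of Python. This is a toy sketch using the arbitrary mapping above; the point is that the arithmetic runs without complaint:
color_code = {'Red': 1, 'Blue': 2, 'Green': 3, 'White': 4}
print(color_code['White'] / color_code['Red'])          # 4.0, i.e. "White is 4x better than Red"?
print((color_code['Blue'] + color_code['White']) / 2)   # 3.0, which is the code for Green?!
# Both lines execute happily. Both conclusions are pure fiction.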
The Problem: Categories vs. Numbers
Categorical variables come in two flavors:
Nominal: No Order
Categories with no natural ranking.
Colors: Red, Blue, Green, Yellow
Countries: USA, Japan, Germany, France
Car brands: Toyota, BMW, Honda, Tesla
Is Red > Blue? No.
Is USA > Japan? No.
These are just LABELS.
Ordinal: Natural Order
Categories with a meaningful sequence.
T-shirt sizes: Small < Medium < Large < XL
Education: High School < Bachelor's < Master's < PhD
Ratings: Poor < Fair < Good < Excellent
These have ORDER, but the gaps aren't equal.
Is the jump from Small→Medium the same as Large→XL? Not necessarily.
Your encoding strategy depends on which type you have.
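If you work in pandas, you can make this distinction explicit before you ever encode anything. A small illustrative sketch (the values are made up):
import pandas as pd
colors = pd.Categorical(['Red', 'Blue', 'Green'])             # nominal: no order declared
sizes = pd.Categorical(['Small', 'Large', 'Medium'],
                       categories=['Small', 'Medium', 'Large'],
                       ordered=True)                           # ordinal: order declared
print(colors.ordered)            # False
print(sizes.ordered)             # True
print(sizes.min(), sizes.max())  # Small Large (comparisons now make sense)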
The Encoding Arsenal
Let me show you every weapon in the categorical encoding toolkit.
Method 1: Label Encoding
The idea: Assign each category a unique integer.
from sklearn.preprocessing import LabelEncoder
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded) # [2, 0, 1, 2, 0]
# Blue=0, Green=1, Red=2 (alphabetical order)
Visual:
Original: [Red] [Blue] [Green] [Red] [Blue]
↓ ↓ ↓ ↓ ↓
Encoded: [ 2 ] [ 0 ] [ 1 ] [ 2 ] [ 0 ]
When It Works ✅
Ordinal data where order matters:
sizes = ['Small', 'Medium', 'Large', 'XL']
# Manual mapping preserves order
size_map = {'Small': 0, 'Medium': 1, 'Large': 2, 'XL': 3}
encoded_sizes = [size_map[s] for s in sizes]
# [0, 1, 2, 3] — Order is meaningful!
When It Fails ❌
Nominal data where order is meaningless:
colors = ['Red', 'Blue', 'Green']
# Encoded as [2, 0, 1]
# Model now thinks:
# Blue(0) < Green(1) < Red(2)
# Red - Blue = 2 (meaningful math on meaningless categories!)
Tree-based models (Random Forest, XGBoost) can tolerate label encoding for nominal data because they split on thresholds rather than doing arithmetic on the values. Linear models, by contrast, treat those integers as magnitudes and get misled.
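Here's a tiny synthetic sketch of that difference. The data and the "only Green cars sell" rule are invented for illustration: a logistic regression fed label-encoded colors can't beat the majority class, while a shallow tree isolates the middle integer with two threshold splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1))    # label-encoded colors: Blue=0, Green=1, Red=2
y = (X[:, 0] == 1).astype(int)           # pretend only Green cars get purchased
print(LogisticRegression().fit(X, y).score(X, y))                 # roughly 0.67: stuck at the majority class
print(DecisionTreeClassifier(max_depth=2).fit(X, y).score(X, y))  # 1.0: splits at 0.5 and 1.5 isolate Green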
Method 2: One-Hot Encoding
The idea: Create a separate binary column for each category.
import pandas as pd
df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})
# One-hot encode
one_hot = pd.get_dummies(df, columns=['color'])
print(one_hot)
Output:
   color_Blue  color_Green  color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           0            0          1
Visual:
Original: [Red] [Blue] [Green] [Red]
↓ ↓ ↓ ↓
One-Hot: [0,0,1] [1,0,0] [0,1,0] [0,0,1]
B G R B G R B G R B G R
Each color gets its own column. A car is Red? Put 1 in the Red column, 0 everywhere else.
Why It's Safe
No false relationships! The model can't think "Red > Blue" because they're separate features.
Red = [0, 0, 1]
Blue = [1, 0, 0]
Red - Blue = [-1, 0, 1] ← Not a meaningful category
Red × 2 = [0, 0, 2] ← Not a meaningful category
No accidental math!
The Scikit-Learn Way
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
colors = [['Red'], ['Blue'], ['Green'], ['Red']]
encoded = encoder.fit_transform(colors)
print(encoded)
# [[0. 0. 1.]
# [1. 0. 0.]
# [0. 1. 0.]
# [0. 0. 1.]]
# Get feature names
print(encoder.get_feature_names_out())
# ['x0_Blue' 'x0_Green' 'x0_Red']
The Dummy Variable Trap 🪤
If you have K categories, you only need K-1 columns!
Why? Because if it's not Blue and not Green, it MUST be Red.
# With drop='first', we avoid multicollinearity
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(colors)
# Now only 2 columns for 3 colors:
# [Green, Red] — Blue is implicit when both are 0
Original: Red Blue Green
Full: [0,0,1] [1,0,0] [0,1,0] ← 3 columns
Dropped: [0,1] [0,0] [1,0] ← 2 columns (Blue = reference)
Linear models NEED this. Tree models don't care.
When One-Hot Fails: The Curse of Cardinality
What if you have 10,000 categories?
# Country of origin: 195 countries
# → 195 new columns!
# Product ID: 50,000 products
# → 50,000 new columns! 💀
This is called high cardinality. One-hot encoding explodes your feature space.
Solutions:
- Group rare categories into "Other" (sketched right after this list)
- Use target encoding (Method 5)
- Use embeddings (Method 7)
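Grouping rare categories is often the cheapest fix. A minimal sketch, assuming a DataFrame df with a high-cardinality 'brand' column and an arbitrary 1% threshold:
freq = df['brand'].value_counts(normalize=True)
rare = freq[freq < 0.01].index                   # brands seen in fewer than 1% of rows
df['brand_grouped'] = df['brand'].where(~df['brand'].isin(rare), 'Other')
# One-hot encoding 'brand_grouped' now produces far fewer columns than 'brand' would.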
Method 3: Ordinal Encoding
The idea: Like label encoding, but YOU define the order.
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
size_order = [['Small', 'Medium', 'Large', 'XL']]
encoder = OrdinalEncoder(categories=size_order)
sizes = [['Medium'], ['XL'], ['Small'], ['Large']]
encoded = encoder.fit_transform(sizes)
print(encoded)
# [[1.] # Medium
# [3.] # XL
# [0.] # Small
# [2.]] # Large
When to use: When categories have a natural order that you want to preserve.
# Education levels
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Customer satisfaction
satisfaction_order = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
# Priority levels
priority_order = ['Low', 'Medium', 'High', 'Critical']
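OrdinalEncoder accepts one order list per column, so several ordinal features can be encoded in one pass. A sketch using the order lists above and a made-up DataFrame:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df_survey = pd.DataFrame({
    'education': ['Master', 'High School', 'PhD'],
    'satisfaction': ['Happy', 'Neutral', 'Very Happy'],
    'priority': ['High', 'Low', 'Critical'],
})
encoder = OrdinalEncoder(categories=[education_order, satisfaction_order, priority_order])
print(encoder.fit_transform(df_survey))
# [[2. 3. 2.]
#  [0. 2. 0.]
#  [3. 4. 3.]]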
Method 4: Binary Encoding
The idea: Convert category index to binary representation.
# 8 categories → 3 binary columns (2³ = 8)
Category Index → Binary
0 → 0 0 0
1 → 0 0 1
2 → 0 1 0
3 → 0 1 1
4 → 1 0 0
5 → 1 0 1
6 → 1 1 0
7 → 1 1 1
Code:
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'country': ['USA', 'Japan', 'Germany', 'USA', 'France']})
encoder = ce.BinaryEncoder(cols=['country'])
df_encoded = encoder.fit_transform(df)  # 4 unique countries → a few binary columns
Why use it:
- 1000 categories → Only 10 columns (2¹⁰ = 1024)
- Much more compact than one-hot
- Preserves some information
Trade-off: categories that happen to share binary digits look related to the model, even though the bit patterns are completely arbitrary.
Method 5: Target Encoding (Mean Encoding)
The idea: Replace each category with the mean of the target variable for that category.
import pandas as pd
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago'],
'purchased': [1, 0, 1, 0, 1, 1, 0]
})
# Calculate mean target for each city
target_means = df.groupby('city')['purchased'].mean()
print(target_means)
# Chicago 0.00
# LA 0.50
# NYC 1.00
# Replace city with its mean
df['city_encoded'] = df['city'].map(target_means)
print(df)
Output:
city purchased city_encoded
0 NYC 1 1.00
1 LA 0 0.50
2 NYC 1 1.00
3 Chicago 0 0.00
4 LA 1 0.50
5 NYC 1 1.00
6 Chicago 0 0.00
Visual:
NYC customers bought 100% of the time → NYC = 1.0
LA customers bought 50% of the time → LA = 0.5
Chicago customers bought 0% → Chicago = 0.0
Why It's Powerful
- Captures the relationship between category and target
- Single column regardless of cardinality
- Works great for high-cardinality features
The Danger: Data Leakage! 🚨
Problem: You're using the target to encode features. If not done carefully, you leak target information into features.
Solution: Use proper cross-validation or smoothing.
import category_encoders as ce
# Proper target encoding with regularization
encoder = ce.TargetEncoder(cols=['city'], smoothing=10)
encoder.fit(X_train, y_train)
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
Smoothing blends the category mean with the global mean, preventing overfitting on rare categories.
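If you want the cross-validation route without an extra library, here's a minimal out-of-fold sketch in plain pandas and scikit-learn. Each row is encoded using only the other folds, so no row ever sees its own target value; the column names come from the toy df above, and the fold count is an arbitrary choice:
import numpy as np
from sklearn.model_selection import KFold
def oof_target_encode(df, col, target, n_splits=5, seed=42):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Category means computed on the other folds only
        fold_means = df.iloc[fit_idx].groupby(col)[target].mean()
        # Categories missing from those folds fall back to the global mean
        encoded[enc_idx] = df.iloc[enc_idx][col].map(fold_means).fillna(global_mean)
    return encoded
df['city_oof'] = oof_target_encode(df, 'city', 'purchased')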
Method 6: Frequency Encoding
The idea: Replace each category with how often it appears.
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'LA']
})
# Count frequencies
freq = df['city'].value_counts(normalize=True)
print(freq)
# NYC 0.428571
# LA 0.428571
# Chicago 0.142857
# Encode
df['city_freq'] = df['city'].map(freq)
When to use:
- Frequency itself is predictive
- E.g., Popular products might sell differently than rare ones
- No target leakage risk
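One gotcha: .map() returns NaN for categories that never appeared in training, so fill them explicitly. A sketch with a hypothetical test frame:
df_test = pd.DataFrame({'city': ['NYC', 'Boston']})         # 'Boston' was never seen
df_test['city_freq'] = df_test['city'].map(freq).fillna(0)  # unseen city → frequency 0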
Method 7: Embeddings (Deep Learning)
The idea: Learn a dense vector representation for each category.
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Flatten, Dense
from tensorflow.keras.models import Model
# 100 unique cities, each becomes a 10-dimensional vector
n_cities = 100
embedding_dim = 10
# Model with embedding layer
input_city = Input(shape=(1,))
embedded = Embedding(input_dim=n_cities, output_dim=embedding_dim)(input_city)
flat = Flatten()(embedded)
output = Dense(1, activation='sigmoid')(flat)
model = Model(inputs=input_city, outputs=output)
Visual:
One-Hot (100 cities): [0,0,0,0,0,1,0,0,0,0,...,0] ← 100 dimensions
Embedding (100 cities): [0.23, -0.15, 0.87, ..., 0.42] ← 10 dimensions!
Why use it:
- Dramatically reduces dimensionality
- Learns meaningful relationships (similar cities have similar embeddings)
- State-of-the-art for recommender systems
When to use:
- Deep learning models
- Very high cardinality (millions of categories)
- When you have lots of data to learn embeddings
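Once the model above is compiled and trained (model.compile and model.fit are omitted in the snippet), the learned vectors live in the embedding layer's weight matrix, one row per city index. A sketch, assuming the layer order from the snippet (InputLayer, Embedding, Flatten, Dense):
city_vectors = model.layers[1].get_weights()[0]  # shape: (n_cities, embedding_dim) = (100, 10)
print(city_vectors.shape)
# Rows that end up close together (e.g. by cosine similarity) are cities the model treats as similar.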
The Decision Flowchart
START
  │
  ▼
Is the variable ORDINAL (has natural order)?
  │
  ├─ YES ─────────────────────────────► Ordinal Encoding
  │                                      (preserve order)
  │
  └─ NO (Nominal)
        │
        ▼
     How many unique categories?
        │
        ├─ Few (< 10-15) ──────────────► One-Hot Encoding
        │                                 (safe, no assumptions)
        │
        ├─ Medium (15-100)
        │     │
        │     ├─ Tree-based model? ────► Label Encoding OK
        │     │                           (trees handle it)
        │     │
        │     └─ Linear model? ────────► Target Encoding
        │                                 or Binary Encoding
        │
        └─ High (100+) ────────────────► Target Encoding,
                                          Frequency Encoding,
                                          or Embeddings
Complete Code Example
Let's wire several encoders together on a synthetic dataset. The values are randomly generated, so don't read anything into the accuracy; the point is the preprocessing pipeline:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import category_encoders as ce
# Create sample data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'color': np.random.choice(['Red', 'Blue', 'Green', 'White'], n),
'size': np.random.choice(['Small', 'Medium', 'Large'], n),
'brand': np.random.choice([f'Brand_{i}' for i in range(50)], n), # High cardinality
'age': np.random.randint(18, 70, n),
'purchased': np.random.randint(0, 2, n)
})
print("=== Sample Data ===")
print(df.head(10))
print(f"\nUnique values:")
print(f" color: {df['color'].nunique()}")
print(f" size: {df['size'].nunique()}")
print(f" brand: {df['brand'].nunique()}")
# Split
X = df.drop('purchased', axis=1)
y = df['purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define column types
nominal_low_cardinality = ['color'] # One-hot
ordinal_cols = ['size'] # Ordinal
nominal_high_cardinality = ['brand'] # Target encoding
numeric_cols = ['age']
# Create preprocessing pipelines
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_low_cardinality),
('ordinal', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]), ordinal_cols),
('target', ce.TargetEncoder(cols=['brand']), nominal_high_cardinality),
('passthrough', 'passthrough', numeric_cols)
]
)
# Full pipeline with model
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"\n=== Model Performance ===")
print(f"Random Forest Accuracy: {score:.1%}")
# Show what each encoder did
print("\n=== Encoding Examples ===")
# One-hot for color
onehot = OneHotEncoder(drop='first', sparse_output=False)
color_encoded = onehot.fit_transform(df[['color']])
print(f"\nColor (One-Hot, dropped first):")
print(f" Original: {df['color'].unique()}")
print(f" Columns: {onehot.get_feature_names_out()}")
print(f" Example: Red → {onehot.transform([['Red']])[0]}")
# Ordinal for size
ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_encoded = ordinal.fit_transform(df[['size']])
print(f"\nSize (Ordinal):")
print(f" Small=0, Medium=1, Large=2")
print(f" Example: Medium → {ordinal.transform([['Medium']])[0]}")
# Target encoding for brand
target_enc = ce.TargetEncoder(cols=['brand'])
target_enc.fit(X_train[['brand']], y_train)
print(f"\nBrand (Target Encoding):")
print(f" 50 brands → 1 column")
brand_means = df.groupby('brand')['purchased'].mean().sort_values()
print(f" Lowest purchase rate: {brand_means.index[0]} ({brand_means.iloc[0]:.2%})")
print(f" Highest purchase rate: {brand_means.index[-1]} ({brand_means.iloc[-1]:.2%})")
Output:
=== Sample Data ===
color size brand age purchased
0 Blue Large Brand_23 52 1
1 Red Small Brand_41 39 1
2 Blue Medium Brand_12 67 0
3 Blue Large Brand_33 40 0
4 Red Large Brand_18 24 1
Unique values:
color: 4
size: 3
brand: 50
=== Model Performance ===
Random Forest Accuracy: 51.5%
=== Encoding Examples ===
Color (One-Hot, dropped first):
Original: ['Blue' 'Red' 'Green' 'White']
Columns: ['color_Green' 'color_Red' 'color_White']
Example: Red → [0. 1. 0.]
Size (Ordinal):
Small=0, Medium=1, Large=2
Example: Medium → [1.]
Brand (Target Encoding):
50 brands → 1 column
Lowest purchase rate: Brand_7 (36.00%)
Highest purchase rate: Brand_28 (68.18%)
Handling Unknown Categories
What happens when test data has categories not seen during training?
# Training data has: Red, Blue, Green
# Test data has: Purple (NEW!)
# One-Hot Encoder
encoder = OneHotEncoder(handle_unknown='ignore') # Purple → all zeros
encoder = OneHotEncoder(handle_unknown='error') # Raises error
# Label Encoder — NO built-in handling!
# You must handle manually
# Target Encoder
encoder = ce.TargetEncoder(handle_unknown='value') # Uses global mean
Best practice: for one-hot encoders headed to production, set handle_unknown='ignore' so unseen categories become all-zero vectors instead of crashing the pipeline.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(
drop='first',
handle_unknown='ignore', # New categories → zero vector
sparse_output=False
)
Common Mistakes
Mistake 1: One-Hot for High Cardinality
# ❌ WRONG: 10,000 product IDs → 10,000 columns!
encoder = OneHotEncoder()
encoded = encoder.fit_transform(products) # Memory explosion 💥
# ✅ RIGHT: Use target encoding or embeddings
encoder = ce.TargetEncoder()
encoded = encoder.fit_transform(products, target)
Mistake 2: Label Encoding Nominal Variables for Linear Models
# ❌ WRONG: Linear model learns Red(2) > Blue(0)
encoder = LabelEncoder()
colors_encoded = encoder.fit_transform(colors).reshape(-1, 1)
linear_model.fit(colors_encoded, target)
# ✅ RIGHT: One-hot for linear models
encoder = OneHotEncoder(sparse_output=False)
colors_encoded = encoder.fit_transform([[c] for c in colors])
linear_model.fit(colors_encoded, target)
Mistake 3: Target Encoding Without Cross-Validation
# ❌ WRONG: Target leakage!
df['city_encoded'] = df.groupby('city')['target'].transform('mean')
# ✅ RIGHT: Use proper library with smoothing
encoder = ce.TargetEncoder(smoothing=1.0)
encoder.fit(X_train, y_train) # Fit only on training!
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
Mistake 4: Forgetting to Handle Unknown Categories
# ❌ WRONG: Crashes on new categories in production
encoder = OneHotEncoder()
encoder.fit(training_cities)
encoder.transform([['New City']]) # ERROR!
# ✅ RIGHT: Ignore unknown
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(training_cities)
encoder.transform([['New City']]) # Works! Returns zeros.
Mistake 5: Not Dropping One Column in One-Hot
# ❌ WRONG for linear models: Multicollinearity
encoder = OneHotEncoder(drop=None) # All K columns
# ✅ RIGHT for linear models: K-1 columns
encoder = OneHotEncoder(drop='first') # Reference category dropped
# For tree-based models: Either is fine
The Cheat Sheet
| Encoding | Best For | Columns | Handles Unknown | Risk |
|---|---|---|---|---|
| One-Hot | Nominal, low cardinality | K-1 or K | Yes (zeros) | Dimension explosion |
| Label | Ordinal, or trees | 1 | No | False ordering |
| Ordinal | Ordinal | 1 | No | Must define order |
| Target | High cardinality | 1 | Yes (global mean) | Target leakage |
| Frequency | When frequency matters | 1 | Yes (0 or small) | Collisions |
| Binary | Medium cardinality | log₂(K) | Partial | Arbitrary patterns |
| Embedding | Deep learning, very high K | Custom | Learned | Needs lots of data |
Quick Reference: Which Encoding?
| Situation | Encoding |
|---|---|
| Colors, countries (few, no order) | One-Hot |
| Sizes, ratings (ordered) | Ordinal |
| User IDs (millions) | Embedding |
| Product categories (hundreds) | Target Encoding |
| Linear model + nominal | One-Hot (drop first) |
| Tree model + any | Label/One-Hot both work |
| Unknown categories expected | One-Hot (handle_unknown='ignore') |
Key Takeaways
- Nominal ≠ Ordinal — Know the difference before encoding
- One-Hot is safest for low-cardinality nominal variables
- Label Encoding implies order — Only use for ordinal data or tree models
- Target Encoding rocks for high cardinality — But watch for leakage
- Drop one column for linear models — Avoid multicollinearity
- Handle unknown categories — Production data WILL surprise you
- Tree models are forgiving — They can use label encoding even for nominal data
- Embeddings for deep learning — Learn rich representations
The One-Sentence Summary
Your model is Alex from Mars — it only speaks numbers. Translate your categories wisely, or you'll teach it that Tokyo is three times better than New York.
What's Next?
Now that you understand categorical encoding, you're ready for:
- Outlier Detection & Treatment — Finding extreme values
- Feature Engineering — Creating powerful new features
- Handling Imbalanced Data — When classes aren't equal
- Dimensionality Reduction — PCA, t-SNE, and beyond
Follow me for the next article in this series!
Let's Connect!
If this finally made categorical encoding click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your go-to encoding method? Target encoding? One-hot? I'm curious!
The difference between a model that learns "New York is great for sales" and one that learns "3 > 1 so New York > London"? How you encoded your categories. Don't let arbitrary numbers become accidental truths.
Share this with someone who's been label encoding everything. They need to meet one-hot.
Happy encoding!