The One-Line Summary: Categorical variables are words your model can't understand. You must convert them to numbers — but HOW you convert them determines whether your model learns truth or nonsense.
The Foreign Exchange Student
Meet Alex, an exchange student from Mars.
Alex is brilliant at math. Give Alex any numbers, and he'll find patterns, make predictions, solve problems.
But Alex has one limitation: he only understands numbers.
You show Alex a dataset about Earth cars:
Car      Color   Size     Origin
──────────────────────────────────
Toyota   Red     Medium   Japan
BMW      Blue    Large    Germany
Honda    Green   Small    Japan
Tesla    White   Large    USA
Alex stares at it, confused.
"What is 'Red'? What is 'Japan'? These symbols mean nothing to me. Give me NUMBERS!"
You need to translate. But here's the problem:
The wrong translation will teach Alex lies.
The Disastrous First Attempt
You think: "Easy! I'll just number them!"
Color: Red=1, Blue=2, Green=3, White=4
Size: Small=1, Medium=2, Large=3
Origin: Japan=1, Germany=2, USA=3
Alex is happy. Numbers! He starts analyzing.
Then he announces his findings:
"I've discovered that White cars are 4 times better than Red cars!"
"USA is clearly the best country because it has the highest number!"
"If I average Blue and White, I get Green!"
You've taught Alex complete nonsense.
By assigning arbitrary numbers, you implied a mathematical relationship that doesn't exist. There's no universe where (Blue + White) / 2 = Green.
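You can watch the nonsense happen in a few lines of Python. This is a toy sketch using the arbitrary mapping above; the point is that the arithmetic runs without complaint:
color_code = {'Red': 1, 'Blue': 2, 'Green': 3, 'White': 4}
print(color_code['White'] / color_code['Red'])          # 4.0, i.e. "White is 4x better than Red"?
print((color_code['Blue'] + color_code['White']) / 2)   # 3.0, which is the code for Green?!
# Both lines execute happily. Both conclusions are pure fiction.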
The Problem: Categories vs. Numbers
Categorical variables come in two flavors:
Nominal: No Order
Categories with no natural ranking.
Colors: Red, Blue, Green, Yellow
Countries: USA, Japan, Germany, France
Car brands: Toyota, BMW, Honda, Tesla
Is Red > Blue? No.
Is USA > Japan? No.
These are just LABELS.
Ordinal: Natural Order
Categories with a meaningful sequence.
T-shirt sizes: Small < Medium < Large < XL
Education: High School < Bachelor's < Master's < PhD
Ratings: Poor < Fair < Good < Excellent
These have ORDER, but the gaps aren't equal.
Is the jump from Small→Medium the same as Large→XL? Not necessarily.
Your encoding strategy depends on which type you have.
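If you work in pandas, you can make this distinction explicit before you ever encode anything. A small illustrative sketch (the values are made up):
import pandas as pd
colors = pd.Categorical(['Red', 'Blue', 'Green'])             # nominal: no order declared
sizes = pd.Categorical(['Small', 'Large', 'Medium'],
                       categories=['Small', 'Medium', 'Large'],
                       ordered=True)                           # ordinal: order declared
print(colors.ordered)            # False
print(sizes.ordered)             # True
print(sizes.min(), sizes.max())  # Small Large (comparisons now make sense)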
The Encoding Arsenal
Let me show you every weapon in the categorical encoding toolkit.
Method 1: Label Encoding
The idea: Assign each category a unique integer.
from sklearn.preprocessing import LabelEncoder
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded) # [2, 0, 1, 2, 0]
# Blue=0, Green=1, Red=2 (alphabetical order)
Visual:
Original: [Red] [Blue] [Green] [Red] [Blue]
↓ ↓ ↓ ↓ ↓
Encoded: [ 2 ] [ 0 ] [ 1 ] [ 2 ] [ 0 ]
When It Works ✅
Ordinal data where order matters:
sizes = ['Small', 'Medium', 'Large', 'XL']
# Manual mapping preserves order
size_map = {'Small': 0, 'Medium': 1, 'Large': 2, 'XL': 3}
encoded_sizes = [size_map[s] for s in sizes]
# [0, 1, 2, 3] — Order is meaningful!
When It Fails ❌
Nominal data where order is meaningless:
colors = ['Red', 'Blue', 'Green']
# Encoded as [2, 0, 1]
# Model now thinks:
# Blue(0) < Green(1) < Red(2)
# Red - Blue = 2 (meaningful math on meaningless categories!)
Tree-based models (Random Forest, XGBoost) can tolerate label encoding for nominal data because they split on thresholds rather than doing arithmetic on the values. Linear models, by contrast, treat those integers as magnitudes and get misled.
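Here's a tiny synthetic sketch of that difference. The data and the "only Green cars sell" rule are invented for illustration: a logistic regression fed label-encoded colors can't beat the majority class, while a shallow tree isolates the middle integer with two threshold splits.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 1))    # label-encoded colors: Blue=0, Green=1, Red=2
y = (X[:, 0] == 1).astype(int)           # pretend only Green cars get purchased
print(LogisticRegression().fit(X, y).score(X, y))                 # roughly 0.67: stuck at the majority class
print(DecisionTreeClassifier(max_depth=2).fit(X, y).score(X, y))  # 1.0: splits at 0.5 and 1.5 isolate Green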
Method 2: One-Hot Encoding
The idea: Create a separate binary column for each category.
import pandas as pd
df = pd.DataFrame({'color': ['Red', 'Blue', 'Green', 'Red']})
# One-hot encode
one_hot = pd.get_dummies(df, columns=['color'])
print(one_hot)
Output:
   color_Blue  color_Green  color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           0            0          1
Visual:
Original: [Red] [Blue] [Green] [Red]
↓ ↓ ↓ ↓
One-Hot: [0,0,1] [1,0,0] [0,1,0] [0,0,1]
B G R B G R B G R B G R
Each color gets its own column. A car is Red? Put 1 in the Red column, 0 everywhere else.
Why It's Safe
No false relationships! The model can't think "Red > Blue" because they're separate features.
Red = [0, 0, 1]
Blue = [1, 0, 0]
Red - Blue = [-1, 0, 1] ← Not a meaningful category
Red × 2 = [0, 0, 2] ← Not a meaningful category
No accidental math!
The Scikit-Learn Way
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
colors = [['Red'], ['Blue'], ['Green'], ['Red']]
encoded = encoder.fit_transform(colors)
print(encoded)
# [[0. 0. 1.]
# [1. 0. 0.]
# [0. 1. 0.]
# [0. 0. 1.]]
# Get feature names
print(encoder.get_feature_names_out())
# ['x0_Blue' 'x0_Green' 'x0_Red']
The Dummy Variable Trap 🪤
If you have K categories, you only need K-1 columns!
Why? Because if it's not Blue and not Green, it MUST be Red.
# With drop='first', we avoid multicollinearity
encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded = encoder.fit_transform(colors)
# Now only 2 columns for 3 colors:
# [Green, Red] — Blue is implicit when both are 0
Original: Red Blue Green
Full: [0,0,1] [1,0,0] [0,1,0] ← 3 columns
Dropped: [0,1] [0,0] [1,0] ← 2 columns (Blue = reference)
Linear models NEED this. Tree models don't care.
When One-Hot Fails: The Curse of Cardinality
What if you have 10,000 categories?
# Country of origin: 195 countries
# → 195 new columns!
# Product ID: 50,000 products
# → 50,000 new columns! 💀
This is called high cardinality. One-hot encoding explodes your feature space.
Solutions:
- Group rare categories into "Other" (sketched right after this list)
- Use target encoding (Method 5)
- Use embeddings (Method 7)
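Grouping rare categories is often the cheapest fix. A minimal sketch, assuming a DataFrame df with a high-cardinality 'brand' column and an arbitrary 1% threshold:
freq = df['brand'].value_counts(normalize=True)
rare = freq[freq < 0.01].index                   # brands seen in fewer than 1% of rows
df['brand_grouped'] = df['brand'].where(~df['brand'].isin(rare), 'Other')
# One-hot encoding 'brand_grouped' now produces far fewer columns than 'brand' would.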
Method 3: Ordinal Encoding
The idea: Like label encoding, but YOU define the order.
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly
size_order = [['Small', 'Medium', 'Large', 'XL']]
encoder = OrdinalEncoder(categories=size_order)
sizes = [['Medium'], ['XL'], ['Small'], ['Large']]
encoded = encoder.fit_transform(sizes)
print(encoded)
# [[1.] # Medium
# [3.] # XL
# [0.] # Small
# [2.]] # Large
When to use: When categories have a natural order that you want to preserve.
# Education levels
education_order = ['High School', 'Bachelor', 'Master', 'PhD']
# Customer satisfaction
satisfaction_order = ['Very Unhappy', 'Unhappy', 'Neutral', 'Happy', 'Very Happy']
# Priority levels
priority_order = ['Low', 'Medium', 'High', 'Critical']
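OrdinalEncoder accepts one order list per column, so several ordinal features can be encoded in one pass. A sketch using the order lists above and a made-up DataFrame:
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
df_survey = pd.DataFrame({
    'education': ['Master', 'High School', 'PhD'],
    'satisfaction': ['Happy', 'Neutral', 'Very Happy'],
    'priority': ['High', 'Low', 'Critical'],
})
encoder = OrdinalEncoder(categories=[education_order, satisfaction_order, priority_order])
print(encoder.fit_transform(df_survey))
# [[2. 3. 2.]
#  [0. 2. 0.]
#  [3. 4. 3.]]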
Method 4: Binary Encoding
The idea: Convert category index to binary representation.
# 8 categories → 3 binary columns (2³ = 8)
Category Index → Binary
0 → 0 0 0
1 → 0 0 1
2 → 0 1 0
3 → 0 1 1
4 → 1 0 0
5 → 1 0 1
6 → 1 1 0
7 → 1 1 1
Code:
import pandas as pd
import category_encoders as ce
df = pd.DataFrame({'country': ['USA', 'Japan', 'Germany', 'USA', 'France']})
encoder = ce.BinaryEncoder(cols=['country'])
df_encoded = encoder.fit_transform(df)  # 4 unique countries → a few binary columns
Why use it:
- 1000 categories → Only 10 columns (2¹⁰ = 1024)
- Much more compact than one-hot
- Preserves some information
Trade-off: categories that happen to share binary digits look related to the model, even though the bit patterns are completely arbitrary.
Method 5: Target Encoding (Mean Encoding)
The idea: Replace each category with the mean of the target variable for that category.
import pandas as pd
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'Chicago'],
'purchased': [1, 0, 1, 0, 1, 1, 0]
})
# Calculate mean target for each city
target_means = df.groupby('city')['purchased'].mean()
print(target_means)
# Chicago 0.00
# LA 0.50
# NYC 1.00
# Replace city with its mean
df['city_encoded'] = df['city'].map(target_means)
print(df)
Output:
city purchased city_encoded
0 NYC 1 1.00
1 LA 0 0.50
2 NYC 1 1.00
3 Chicago 0 0.00
4 LA 1 0.50
5 NYC 1 1.00
6 Chicago 0 0.00
Visual:
NYC customers bought 100% of the time → NYC = 1.0
LA customers bought 50% of the time → LA = 0.5
Chicago customers bought 0% → Chicago = 0.0
Why It's Powerful
- Captures the relationship between category and target
- Single column regardless of cardinality
- Works great for high-cardinality features
The Danger: Data Leakage! 🚨
Problem: You're using the target to encode features. If not done carefully, you leak target information into features.
Solution: Use proper cross-validation or smoothing.
import category_encoders as ce
# Proper target encoding with regularization
encoder = ce.TargetEncoder(cols=['city'], smoothing=10)
encoder.fit(X_train, y_train)
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
Smoothing blends the category mean with the global mean, preventing overfitting on rare categories.
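If you want the cross-validation route without an extra library, here's a minimal out-of-fold sketch in plain pandas and scikit-learn. Each row is encoded using only the other folds, so no row ever sees its own target value; the column names come from the toy df above, and the fold count is an arbitrary choice:
import numpy as np
from sklearn.model_selection import KFold
def oof_target_encode(df, col, target, n_splits=5, seed=42):
    encoded = np.zeros(len(df))
    global_mean = df[target].mean()
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for fit_idx, enc_idx in kf.split(df):
        # Category means computed on the other folds only
        fold_means = df.iloc[fit_idx].groupby(col)[target].mean()
        # Categories missing from those folds fall back to the global mean
        encoded[enc_idx] = df.iloc[enc_idx][col].map(fold_means).fillna(global_mean)
    return encoded
df['city_oof'] = oof_target_encode(df, 'city', 'purchased')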
Method 6: Frequency Encoding
The idea: Replace each category with how often it appears.
df = pd.DataFrame({
'city': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'NYC', 'LA']
})
# Count frequencies
freq = df['city'].value_counts(normalize=True)
print(freq)
# NYC 0.428571
# LA 0.428571
# Chicago 0.142857
# Encode
df['city_freq'] = df['city'].map(freq)
When to use:
- Frequency itself is predictive
- E.g., Popular products might sell differently than rare ones
- No target leakage risk
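One gotcha: .map() returns NaN for categories that never appeared in training, so fill them explicitly. A sketch with a hypothetical test frame:
df_test = pd.DataFrame({'city': ['NYC', 'Boston']})         # 'Boston' was never seen
df_test['city_freq'] = df_test['city'].map(freq).fillna(0)  # unseen city → frequency 0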
Method 7: Embeddings (Deep Learning)
The idea: Learn a dense vector representation for each category.
import tensorflow as tf
from tensorflow.keras.layers import Embedding, Input, Flatten, Dense
from tensorflow.keras.models import Model
# 100 unique cities, each becomes a 10-dimensional vector
n_cities = 100
embedding_dim = 10
# Model with embedding layer
input_city = Input(shape=(1,))
embedded = Embedding(input_dim=n_cities, output_dim=embedding_dim)(input_city)
flat = Flatten()(embedded)
output = Dense(1, activation='sigmoid')(flat)
model = Model(inputs=input_city, outputs=output)
Visual:
One-Hot (100 cities): [0,0,0,0,0,1,0,0,0,0,...,0] ← 100 dimensions
Embedding (100 cities): [0.23, -0.15, 0.87, ..., 0.42] ← 10 dimensions!
Why use it:
- Dramatically reduces dimensionality
- Learns meaningful relationships (similar cities have similar embeddings)
- State-of-the-art for recommender systems
When to use:
- Deep learning models
- Very high cardinality (millions of categories)
- When you have lots of data to learn embeddings
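Once the model above is compiled and trained (model.compile and model.fit are omitted in the snippet), the learned vectors live in the embedding layer's weight matrix, one row per city index. A sketch, assuming the layer order from the snippet (InputLayer, Embedding, Flatten, Dense):
city_vectors = model.layers[1].get_weights()[0]  # shape: (n_cities, embedding_dim) = (100, 10)
print(city_vectors.shape)
# Rows that end up close together (e.g. by cosine similarity) are cities the model treats as similar.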
The Decision Flowchart
START
  │
  ▼
Is the variable ORDINAL (has natural order)?
  │
  ├─ YES ─────────────────────────────► Ordinal Encoding
  │                                      (preserve order)
  │
  └─ NO (Nominal)
        │
        ▼
     How many unique categories?
        │
        ├─ Few (< 10-15) ──────────────► One-Hot Encoding
        │                                 (safe, no assumptions)
        │
        ├─ Medium (15-100)
        │     │
        │     ├─ Tree-based model? ────► Label Encoding OK
        │     │                           (trees handle it)
        │     │
        │     └─ Linear model? ────────► Target Encoding
        │                                 or Binary Encoding
        │
        └─ High (100+) ────────────────► Target Encoding,
                                          Frequency Encoding,
                                          or Embeddings
Complete Code Example
Let's wire several encoders together on a synthetic dataset. The values are randomly generated, so don't read anything into the accuracy; the point is the preprocessing pipeline:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import category_encoders as ce
# Create sample data
np.random.seed(42)
n = 1000
df = pd.DataFrame({
'color': np.random.choice(['Red', 'Blue', 'Green', 'White'], n),
'size': np.random.choice(['Small', 'Medium', 'Large'], n),
'brand': np.random.choice([f'Brand_{i}' for i in range(50)], n), # High cardinality
'age': np.random.randint(18, 70, n),
'purchased': np.random.randint(0, 2, n)
})
print("=== Sample Data ===")
print(df.head(10))
print(f"\nUnique values:")
print(f" color: {df['color'].nunique()}")
print(f" size: {df['size'].nunique()}")
print(f" brand: {df['brand'].nunique()}")
# Split
X = df.drop('purchased', axis=1)
y = df['purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Define column types
nominal_low_cardinality = ['color'] # One-hot
ordinal_cols = ['size'] # Ordinal
nominal_high_cardinality = ['brand'] # Target encoding
numeric_cols = ['age']
# Create preprocessing pipelines
preprocessor = ColumnTransformer(
transformers=[
('onehot', OneHotEncoder(drop='first', handle_unknown='ignore'), nominal_low_cardinality),
('ordinal', OrdinalEncoder(categories=[['Small', 'Medium', 'Large']]), ordinal_cols),
('target', ce.TargetEncoder(cols=['brand']), nominal_high_cardinality),
('passthrough', 'passthrough', numeric_cols)
]
)
# Full pipeline with model
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100, random_state=42))
])
# Fit and evaluate
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
print(f"\n=== Model Performance ===")
print(f"Random Forest Accuracy: {score:.1%}")
# Show what each encoder did
print("\n=== Encoding Examples ===")
# One-hot for color
onehot = OneHotEncoder(drop='first', sparse_output=False)
color_encoded = onehot.fit_transform(df[['color']])
print(f"\nColor (One-Hot, dropped first):")
print(f" Original: {df['color'].unique()}")
print(f" Columns: {onehot.get_feature_names_out()}")
print(f" Example: Red → {onehot.transform([['Red']])[0]}")
# Ordinal for size
ordinal = OrdinalEncoder(categories=[['Small', 'Medium', 'Large']])
size_encoded = ordinal.fit_transform(df[['size']])
print(f"\nSize (Ordinal):")
print(f" Small=0, Medium=1, Large=2")
print(f" Example: Medium → {ordinal.transform([['Medium']])[0]}")
# Target encoding for brand
target_enc = ce.TargetEncoder(cols=['brand'])
target_enc.fit(X_train[['brand']], y_train)
print(f"\nBrand (Target Encoding):")
print(f" 50 brands → 1 column")
brand_means = df.groupby('brand')['purchased'].mean().sort_values()
print(f" Lowest purchase rate: {brand_means.index[0]} ({brand_means.iloc[0]:.2%})")
print(f" Highest purchase rate: {brand_means.index[-1]} ({brand_means.iloc[-1]:.2%})")
Output:
=== Sample Data ===
color size brand age purchased
0 Blue Large Brand_23 52 1
1 Red Small Brand_41 39 1
2 Blue Medium Brand_12 67 0
3 Blue Large Brand_33 40 0
4 Red Large Brand_18 24 1
Unique values:
color: 4
size: 3
brand: 50
=== Model Performance ===
Random Forest Accuracy: 51.5%
=== Encoding Examples ===
Color (One-Hot, dropped first):
Original: ['Blue' 'Red' 'Green' 'White']
Columns: ['color_Green' 'color_Red' 'color_White']
Example: Red → [0. 1. 0.]
Size (Ordinal):
Small=0, Medium=1, Large=2
Example: Medium → [1.]
Brand (Target Encoding):
50 brands → 1 column
Lowest purchase rate: Brand_7 (36.00%)
Highest purchase rate: Brand_28 (68.18%)
Handling Unknown Categories
What happens when test data has categories not seen during training?
# Training data has: Red, Blue, Green
# Test data has: Purple (NEW!)
# One-Hot Encoder
encoder = OneHotEncoder(handle_unknown='ignore') # Purple → all zeros
encoder = OneHotEncoder(handle_unknown='error') # Raises error
# Label Encoder — NO built-in handling!
# You must handle manually
# Target Encoder
encoder = ce.TargetEncoder(handle_unknown='value') # Uses global mean
Best practice: for one-hot encoders headed to production, set handle_unknown='ignore' so unseen categories become all-zero vectors instead of crashing the pipeline.
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(
drop='first',
handle_unknown='ignore', # New categories → zero vector
sparse_output=False
)
Common Mistakes
Mistake 1: One-Hot for High Cardinality
# ❌ WRONG: 10,000 product IDs → 10,000 columns!
encoder = OneHotEncoder()
encoded = encoder.fit_transform(products) # Memory explosion 💥
# ✅ RIGHT: Use target encoding or embeddings
encoder = ce.TargetEncoder()
encoded = encoder.fit_transform(products, target)
Mistake 2: Label Encoding Nominal Variables for Linear Models
# ❌ WRONG: Linear model learns Red(2) > Blue(0)
encoder = LabelEncoder()
colors_encoded = encoder.fit_transform(colors).reshape(-1, 1)
linear_model.fit(colors_encoded, target)
# ✅ RIGHT: One-hot for linear models
encoder = OneHotEncoder(sparse_output=False)
colors_encoded = encoder.fit_transform([[c] for c in colors])
linear_model.fit(colors_encoded, target)
Mistake 3: Target Encoding Without Cross-Validation
# ❌ WRONG: Target leakage!
df['city_encoded'] = df.groupby('city')['target'].transform('mean')
# ✅ RIGHT: Use proper library with smoothing
encoder = ce.TargetEncoder(smoothing=1.0)
encoder.fit(X_train, y_train) # Fit only on training!
X_train_encoded = encoder.transform(X_train)
X_test_encoded = encoder.transform(X_test)
Mistake 4: Forgetting to Handle Unknown Categories
# ❌ WRONG: Crashes on new categories in production
encoder = OneHotEncoder()
encoder.fit(training_cities)
encoder.transform([['New City']]) # ERROR!
# ✅ RIGHT: Ignore unknown
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(training_cities)
encoder.transform([['New City']]) # Works! Returns zeros.
Mistake 5: Not Dropping One Column in One-Hot
# ❌ WRONG for linear models: Multicollinearity
encoder = OneHotEncoder(drop=None) # All K columns
# ✅ RIGHT for linear models: K-1 columns
encoder = OneHotEncoder(drop='first') # Reference category dropped
# For tree-based models: Either is fine
The Cheat Sheet
| Encoding | Best For | Columns | Handles Unknown | Risk |
|---|---|---|---|---|
| One-Hot | Nominal, low cardinality | K-1 or K | Yes (zeros) | Dimension explosion |
| Label | Ordinal, or trees | 1 | No | False ordering |
| Ordinal | Ordinal | 1 | No | Must define order |
| Target | High cardinality | 1 | Yes (global mean) | Target leakage |
| Frequency | When frequency matters | 1 | Yes (0 or small) | Collisions |
| Binary | Medium cardinality | log₂(K) | Partial | Arbitrary patterns |
| Embedding | Deep learning, very high K | Custom | Learned | Needs lots of data |
Quick Reference: Which Encoding?
| Situation | Encoding |
|---|---|
| Colors, countries (few, no order) | One-Hot |
| Sizes, ratings (ordered) | Ordinal |
| User IDs (millions) | Embedding |
| Product categories (hundreds) | Target Encoding |
| Linear model + nominal | One-Hot (drop first) |
| Tree model + any | Label/One-Hot both work |
| Unknown categories expected | One-Hot (handle_unknown='ignore') |
Key Takeaways
- Nominal ≠ Ordinal — Know the difference before encoding
- One-Hot is safest for low-cardinality nominal variables
- Label Encoding implies order — Only use for ordinal data or tree models
- Target Encoding rocks for high cardinality — But watch for leakage
- Drop one column for linear models — Avoid multicollinearity
- Handle unknown categories — Production data WILL surprise you
- Tree models are forgiving — They can use label encoding even for nominal data
- Embeddings for deep learning — Learn rich representations
The One-Sentence Summary
Your model is Alex from Mars — it only speaks numbers. Translate your categories wisely, or you'll teach it that Tokyo is three times better than New York.
What's Next?
Now that you understand categorical encoding, you're ready for:
- Outlier Detection & Treatment — Finding extreme values
- Feature Engineering — Creating powerful new features
- Handling Imbalanced Data — When classes aren't equal
- Dimensionality Reduction — PCA, t-SNE, and beyond
Follow me for the next article in this series!
Let's Connect!
If this finally made categorical encoding click, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
What's your go-to encoding method? Target encoding? One-hot? I'm curious!
The difference between a model that learns "New York is great for sales" and one that learns "3 > 1 so New York > London"? How you encoded your categories. Don't let arbitrary numbers become accidental truths.
Share this with someone who's been label encoding everything. They need to meet one-hot.
Happy encoding!