The One-Line Summary: Label encoding assigns each category a number. It's perfect when order matters (Small→Medium→Large), dangerous when order doesn't exist (Red→Blue→Green), and surprisingly fine for tree-based models either way.
The Stadium Seat Problem
You're designing a ticketing system for a stadium.
Each section needs a code that computers can process. You can only use numbers.
You look at your sections:
Section A - Behind the goal
Section B - Midfield lower
Section C - Midfield upper
Section D - Corner seats
Section E - VIP boxes
You think: "Easy! I'll just number them!"
Section A = 1
Section B = 2
Section C = 3
Section D = 4
Section E = 5
Done. Ship it.
Three months later, chaos.
Your pricing algorithm has learned some... interesting things:
"Section E (5) is worth 5 times more than Section A (1)!"
"Section B (2) + Section C (3) = Section E (5)!"
"The average of Section A and Section E is Section C!"
None of this is true. VIP boxes aren't "5 times" anything. You can't add sections together. The relationships are nonsense.
But your algorithm believed the numbers. And numbers have mathematical properties.
Now imagine a different scenario.
You're encoding t-shirt sizes:
XS = 1
S = 2
M = 3
L = 4
XL = 5
Now the math makes sense!
- L (4) IS greater than S (2) ✓
- M (3) IS between S (2) and L (4) ✓
- The order IS meaningful ✓
Same technique. Completely different outcome.
The difference? One has natural order. The other doesn't.
What Is Label Encoding?
Label encoding is the simplest categorical encoding: assign each unique category a unique integer.
from sklearn.preprocessing import LabelEncoder
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
print(encoded)
# [2 0 1 2 0 1]
print(encoder.classes_)
# ['Blue' 'Green' 'Red'] (alphabetical order!)
Visual:
Original: [Red] [Blue] [Green] [Red] [Blue] [Green]
↓ ↓ ↓ ↓ ↓ ↓
Encoded: [ 2 ] [ 0 ] [ 1 ] [ 2 ] [ 0 ] [ 1 ]
Mapping (alphabetical):
Blue → 0
Green → 1
Red → 2
That's it. Each category gets a number. Simple, compact, fast.
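The mapping works in both directions: if you later need to turn integer predictions back into category names, the same fitted encoder can reverse it. A quick sketch, reusing the encoder fitted above:
# Encode new values with the SAME fitted encoder
print(encoder.transform(['Green', 'Red']))
# [1 2]
# Decode integer predictions back to the original labels
print(encoder.inverse_transform([0, 2, 1]))
# ['Blue' 'Red' 'Green']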
But simplicity hides danger.
The One Question That Determines Everything
Before using label encoding, ask yourself:
"Does the order of these categories have meaning?"
If YES → Label Encoding is Perfect ✅
Categories with natural order are called ordinal variables.
T-shirt sizes: XS < S < M < L < XL ✓ Order matters!
Education: High School < Bachelor < Master < PhD ✓
Ratings: Poor < Fair < Good < Excellent ✓
Temperature feel: Cold < Cool < Warm < Hot ✓
Priority: Low < Medium < High < Critical ✓
For these, the numbers SHOULD imply order. That's the whole point!
# Perfect use of label encoding
sizes = ['S', 'M', 'L', 'XL', 'S', 'M']
# Manual mapping to preserve order
size_map = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}
encoded = [size_map[s] for s in sizes]
# Now: L(3) > S(1) is TRUE and MEANINGFUL
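If your data already lives in pandas, an ordered Categorical gives you the same explicit control. A minimal sketch of the same idea:
import pandas as pd

sizes = pd.Series(['S', 'M', 'L', 'XL', 'S', 'M'])
ordered = pd.Categorical(sizes, categories=['XS', 'S', 'M', 'L', 'XL'], ordered=True)
print(ordered.codes)
# [1 2 3 4 1 2]   <- XS=0, S=1, M=2, L=3, XL=4, same mapping as size_map above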
If NO → Label Encoding is Dangerous ❌
Categories without natural order are called nominal variables.
Colors: Red, Blue, Green ✗ No order!
Countries: USA, Japan, France ✗ No order!
Blood types: A, B, AB, O ✗ No order!
Car brands: Toyota, BMW, Tesla ✗ No order!
For these, any order you impose is arbitrary and misleading.
# Dangerous use of label encoding
colors = ['Red', 'Blue', 'Green']
encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
# Blue=0, Green=1, Red=2
# Now the model might learn:
# - Red(2) > Blue(0) → FALSE!
# - Green(1) is "between" Blue and Red → NONSENSE!
# - Red - Blue = 2 → MEANINGLESS!
The Decision Tree Exception 🌳
Here's the plot twist that confuses everyone:
Tree-based models don't care about the ordering problem!
Why? Because trees only ask "Is X <= threshold?" questions.
Decision Tree with label-encoded colors:
[Is color <= 0.5?]
/ \
YES NO
| |
[color = 0] [Is color <= 1.5?]
(Blue) / \
YES NO
| |
[color = 1] [color = 2]
(Green) (Red)
The tree doesn't think "Red is greater than Blue." It just splits on thresholds. Each category ends up in its own branch anyway.
For trees, label encoding nominal variables is FINE.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder
# This works perfectly for Random Forest!
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']
target = [1, 0, 1, 1, 0, 1]  # toy labels, just for illustration
colors_encoded = LabelEncoder().fit_transform(colors)
model = RandomForestClassifier()
model.fit(colors_encoded.reshape(-1, 1), target)
But for linear models, SVMs, neural networks, k-NN — the danger remains.
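Here's a small way to see that danger for a distance-based model like k-NN. Under label encoding, Blue looks "closer" to Green than to Red purely because of the arbitrary integers; one-hot keeps every pair of distinct colors equally far apart. A minimal sketch with toy data:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

colors = ['Blue', 'Green', 'Red']

# Label encoding: Blue=0, Green=1, Red=2 (alphabetical)
label = LabelEncoder().fit_transform(colors).astype(float)
print(abs(label[0] - label[2]))  # Blue vs Red   -> 2.0
print(abs(label[0] - label[1]))  # Blue vs Green -> 1.0 (why would Blue be "closer" to Green?)

# One-hot encoding: all distinct pairs are equally distant
onehot = pd.get_dummies(pd.Series(colors)).to_numpy().astype(float)
print(np.linalg.norm(onehot[0] - onehot[2]))  # ~1.41
print(np.linalg.norm(onehot[0] - onehot[1]))  # ~1.41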
The Visual Guide
DOES ORDER MATTER?
│
┌────────────┴────────────┐
│ │
YES NO
(Ordinal) (Nominal)
│ │
▼ ▼
┌─────────────────┐ ┌─────────────────────────┐
│ LABEL ENCODING │ │ WHAT MODEL ARE YOU │
│ IS PERFECT! ✅ │ │ USING? │
│ │ │ │
│ Small=0 │ │ ┌─────────┴──────┐ │
│ Medium=1 │ │ │ │ │
│ Large=2 │ │ TREE-BASED LINEAR/NN│
│ │ │ │ │ │
└─────────────────┘ │ ▼ ▼ │
│ Label One-Hot│
│ encoding encoding│
│ is FINE ✅ required │
└─────────────────────────┘
Label Encoding in Practice
The Right Way: Ordinal Variables
from sklearn.preprocessing import OrdinalEncoder
# Define the order explicitly!
size_order = [['XS', 'S', 'M', 'L', 'XL']]
encoder = OrdinalEncoder(categories=size_order)
sizes = [['M'], ['XL'], ['S'], ['L'], ['XS']]
encoded = encoder.fit_transform(sizes)
print(encoded)
# [[2.] # M
# [4.] # XL
# [1.] # S
# [3.] # L
# [0.]] # XS
Why OrdinalEncoder instead of LabelEncoder?
- LabelEncoder sorts classes alphabetically (L=0, M=1, S=2, XL=3, XS=4 — wrong order!), and it's really designed for target labels (y), not feature columns.
- OrdinalEncoder lets YOU define the order (XS=0, S=1, M=2, L=3, XL=4 — correct!) and works on feature columns directly.
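You can verify the difference directly (the sizes list is just the example from above):
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ['XS', 'S', 'M', 'L', 'XL']

le = LabelEncoder().fit(sizes)
print(dict(zip(le.classes_.tolist(), range(len(le.classes_)))))
# {'L': 0, 'M': 1, 'S': 2, 'XL': 3, 'XS': 4}   <- alphabetical, order lost

oe = OrdinalEncoder(categories=[sizes])
print(oe.fit_transform([[s] for s in sizes]).ravel())
# [0. 1. 2. 3. 4.]   <- the order you defined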
The Acceptable Way: Nominal Variables + Tree Models
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb
# For tree-based models, label encoding nominal variables is fine
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
target = [1, 0, 1, 1, 0]
le = LabelEncoder()
colors_encoded = le.fit_transform(colors)
# All of these work fine!
RandomForestClassifier().fit(colors_encoded.reshape(-1, 1), target)
GradientBoostingClassifier().fit(colors_encoded.reshape(-1, 1), target)
xgb.XGBClassifier().fit(colors_encoded.reshape(-1, 1), target)
lgb.LGBMClassifier().fit(colors_encoded.reshape(-1, 1), target)
The Dangerous Way: Nominal Variables + Linear Models
from sklearn.linear_model import LogisticRegression
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
target = [1, 0, 1, 1, 0]
# ❌ DANGEROUS!
le = LabelEncoder()
colors_encoded = le.fit_transform(colors)
# Blue=0, Green=1, Red=2
model = LogisticRegression()
model.fit(colors_encoded.reshape(-1, 1), target)
# The model now has ONE coefficient for "color"
# It thinks: Higher color value → some effect
# This implies Red(2) has 2x the effect of Green(1)
# NONSENSE!
What the model learns:
log(odds) = β₀ + β₁ × color_encoded
If β₁ = 0.5:
Blue(0): log(odds) = β₀ + 0.5 × 0 = β₀
Green(1): log(odds) = β₀ + 0.5 × 1 = β₀ + 0.5
Red(2): log(odds) = β₀ + 0.5 × 2 = β₀ + 1.0
The model thinks Red's effect is exactly 2× Green's effect!
This is ARBITRARY based on alphabetical ordering!
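To drive the point home, here's a minimal sketch on synthetic data (hypothetical purchase rates, not from any real dataset): the same data, encoded with two equally arbitrary integer assignments, gives the linear model different coefficients and different accuracy.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
colors = rng.choice(['Red', 'Blue', 'Green'], 500)
rate = {'Red': 0.8, 'Blue': 0.5, 'Green': 0.2}  # made-up purchase rates
y = np.array([rng.random() < rate[c] for c in colors], dtype=int)

mappings = {
    'alphabetical (Blue=0, Green=1, Red=2)': {'Blue': 0, 'Green': 1, 'Red': 2},
    'another choice (Green=0, Blue=1, Red=2)': {'Green': 0, 'Blue': 1, 'Red': 2},
}
for name, mapping in mappings.items():
    X = np.array([mapping[c] for c in colors]).reshape(-1, 1)
    model = LogisticRegression().fit(X, y)
    print(f"{name}: coef={model.coef_[0][0]:+.2f}, accuracy={model.score(X, y):.1%}")
# The coefficient and the accuracy both change with the mapping: the model is
# fitting a straight line through whatever arbitrary order it was handed.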
Side-by-Side: Label vs One-Hot for Linear Models
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score
# Create data where color DOES matter
np.random.seed(42)
n = 1000
# Red customers buy 80%, Blue 50%, Green 20%
colors = np.random.choice(['Red', 'Blue', 'Green'], n)
purchase_prob = {'Red': 0.8, 'Blue': 0.5, 'Green': 0.2}
target = [1 if np.random.random() < purchase_prob[c] else 0 for c in colors]
# Method 1: Label Encoding (WRONG for linear models)
le = LabelEncoder()
X_label = le.fit_transform(colors).reshape(-1, 1)
model_label = LogisticRegression()
scores_label = cross_val_score(model_label, X_label, target, cv=5)
print(f"Label Encoding Accuracy: {scores_label.mean():.1%}")
# Method 2: One-Hot Encoding (CORRECT for linear models)
X_onehot = pd.get_dummies(pd.DataFrame({'color': colors}))
model_onehot = LogisticRegression()
scores_onehot = cross_val_score(model_onehot, X_onehot, target, cv=5)
print(f"One-Hot Encoding Accuracy: {scores_onehot.mean():.1%}")
Output:
Label Encoding Accuracy: 62.4%
One-Hot Encoding Accuracy: 68.7%
A 6-point accuracy difference just from the encoding choice! The one-hot model learned each color's TRUE effect. The label-encoded model was constrained to a false linear relationship.
When Label Encoding Shines
Advantage 1: Memory Efficiency
import numpy as np
n_samples = 100_000
n_categories = 100
# Label Encoding: 1 column
label_memory = n_samples * 8 # 8 bytes per int64
print(f"Label Encoding: {label_memory / 1e6:.2f} MB")
# One-Hot Encoding: 100 columns
onehot_memory = n_samples * n_categories * 8
print(f"One-Hot Encoding: {onehot_memory / 1e6:.2f} MB")
Output:
Label Encoding: 0.80 MB
One-Hot Encoding: 80.00 MB
100x less memory!
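Side note: since the encoded values here are just small integers (0 to 99), you can usually shrink the column even further by downcasting. A hedged sketch, reusing n_samples and n_categories from above:
import numpy as np

codes = np.random.randint(0, n_categories, size=n_samples)  # int64 on most platforms
print(f"int64: {codes.nbytes / 1e6:.2f} MB")   # ~0.80 MB

codes_small = codes.astype(np.uint8)  # 100 categories fit comfortably in uint8
print(f"uint8: {codes_small.nbytes / 1e6:.2f} MB")   # ~0.10 MB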
Advantage 2: Tree Model Performance
For gradient boosting, label encoding can actually be BETTER than one-hot:
import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
# Create dataset
np.random.seed(42)
n = 10000
n_categories = 50
df = pd.DataFrame({
'category': np.random.choice([f'cat_{i}' for i in range(n_categories)], n),
'numeric': np.random.randn(n)
})
target = (df['category'].str.extract(r'(\d+)')[0].astype(int) > 25).astype(int)
# Label Encoding
le = LabelEncoder()
df_label = df.copy()
df_label['category'] = le.fit_transform(df_label['category'])
# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['category'])
# Compare with LightGBM
print("LightGBM Performance:")
model = lgb.LGBMClassifier(verbose=-1)
scores_label = cross_val_score(model, df_label, target, cv=5)
print(f" Label Encoding: {scores_label.mean():.1%}")
scores_onehot = cross_val_score(model, df_onehot, target, cv=5)
print(f" One-Hot Encoding: {scores_onehot.mean():.1%}")
Output:
LightGBM Performance:
Label Encoding: 99.2%
One-Hot Encoding: 98.8%
Label encoding is simpler, faster, uses less memory, and performs just as well (or better!) for tree models.
Advantage 3: Native Categorical Support
Modern libraries handle label-encoded categories natively:
import lightgbm as lgb
# LightGBM native categorical handling
df['category'] = df['category'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df[['category', 'numeric']], target, categorical_feature=['category'])
# Even better: LightGBM figures out optimal splits!
The Complete Decision Framework
START
│
▼
Is the variable ORDINAL (has natural order)?
│
├── YES ──────────────────────────────────────────────────┐
│ │
│ Examples: Size, Rating, Education, Priority │
│ │
│ ┌─────────────────────────────────────────────────┐ │
│ │ USE LABEL/ORDINAL ENCODING ✅ │ │
│ │ │ │
│ │ But DEFINE THE ORDER yourself! │ │
│ │ Don't trust alphabetical sorting. │ │
│ │ │ │
│ │ OrdinalEncoder(categories=[['S','M','L','XL']]) │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└── NO (Nominal) ─────────────────────────────────────────┤
│
Examples: Color, Country, Product ID, Name │
│
What MODEL are you using? │
│ │
├── TREE-BASED (RF, XGB, LightGBM, CatBoost) │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ LABEL ENCODING IS FINE ✅ │ │
│ │ │ │
│ │ Trees split on thresholds, not values. │ │
│ │ Saves memory vs one-hot. │ │
│ │ Use native categorical support if │ │
│ │ available (LightGBM, CatBoost). │ │
│ └─────────────────────────────────────────┘ │
│ │
└── LINEAR / NEURAL NET / KNN / SVM │
│ │
▼ │
┌─────────────────────────────────────────┐ │
│ DON'T USE LABEL ENCODING ❌ │ │
│ │ │
│ Use instead: │ │
│ • One-Hot (low cardinality) │ │
│ • Target Encoding (high cardinality) │ │
│ • Embeddings (neural networks) │ │
└─────────────────────────────────────────┘ │
Common Mistakes
Mistake 1: Using LabelEncoder for Ordinal Data
# ❌ WRONG: Alphabetical order!
sizes = ['Small', 'Medium', 'Large', 'XL']
le = LabelEncoder()
encoded = le.fit_transform(sizes)
print(dict(zip(sizes, encoded)))
# {'Small': 2, 'Medium': 1, 'Large': 0, 'XL': 3}
# Large=0, Medium=1, Small=2, XL=3 — WRONG ORDER!
# ✅ RIGHT: Define order explicitly
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large', 'XL']])
encoded = encoder.fit_transform([['Medium'], ['XL'], ['Small']])
# Small=0, Medium=1, Large=2, XL=3 — CORRECT!
Mistake 2: Label Encoding Nominal Variables for Linear Models
# ❌ WRONG: Creates false ordering
countries = ['USA', 'Japan', 'France']
encoded = LabelEncoder().fit_transform(countries)
LogisticRegression().fit(encoded.reshape(-1,1), target)
# Model thinks: France(0) < Japan(1) < USA(2)
# ✅ RIGHT: Use one-hot encoding
encoded = pd.get_dummies(pd.DataFrame({'country': countries}))
LogisticRegression().fit(encoded, target)
Mistake 3: Not Handling Unknown Categories
# ❌ WRONG: Crashes on new categories
le = LabelEncoder()
le.fit(['Red', 'Blue', 'Green'])
le.transform(['Purple']) # 💥 ValueError!
# ✅ RIGHT: Handle manually or use different encoder
def safe_transform(encoder, values):
    known = set(encoder.classes_)
    return [encoder.transform([v])[0] if v in known else -1 for v in values]
# Or use category_encoders with handle_unknown
import category_encoders as ce
encoder = ce.OrdinalEncoder(handle_unknown='value')
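For clarity, here's how the safe_transform helper above behaves once the encoder is fitted (the -1 sentinel is just a convention; pick whatever your downstream model can handle):
le = LabelEncoder()
le.fit(['Red', 'Blue', 'Green'])
print(safe_transform(le, ['Red', 'Purple', 'Blue']))
# Red -> 2, Purple -> -1 (unknown), Blue -> 0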
Mistake 4: Assuming All Trees Are the Same
# ⚠️ CAUTION: sklearn trees DO care about order somewhat
from sklearn.tree import DecisionTreeClassifier
# For sklearn, label encoding can create suboptimal splits,
# because each node splits at a single numeric threshold, so isolating
# one category can take several consecutive splits
# ✅ BETTER: Use LightGBM or CatBoost with native categorical support
import lightgbm as lgb
df['category'] = df['category'].astype('category')
lgb.LGBMClassifier().fit(df, target)
Mistake 5: Forgetting Encoding During Inference
# ❌ WRONG: Re-fitting a new encoder at inference time
le = LabelEncoder()
le.fit(train_data['color'])
# Later, in production...
new_color = 'Blue'
new_encoder = LabelEncoder().fit(production_data['color'])  # Different data → different mapping!
encoded = new_encoder.transform([new_color])  # 'Blue' may now get a different integer
# ✅ RIGHT: Save and load the encoder
import joblib
joblib.dump(le, 'color_encoder.pkl')
# In production
le = joblib.load('color_encoder.pkl')
encoded = le.transform([new_color])
Label Encoding vs Alternatives
| Scenario | Label Encoding | One-Hot | Target Encoding |
|---|---|---|---|
| Ordinal variable | ✅ Perfect | ❌ Loses order | ⚠️ Possible |
| Nominal + Trees | ✅ Great | ✅ Works | ✅ Works |
| Nominal + Linear | ❌ Dangerous | ✅ Required | ✅ Good |
| High cardinality | ✅ Compact | ❌ Explodes | ✅ Compact |
| Memory constrained | ✅ Minimal | ❌ Huge | ✅ Minimal |
| Interpretability | ⚠️ Confusing | ✅ Clear | ⚠️ Less clear |
Quick Reference Code
# === ORDINAL ENCODING (for ordered categories) ===
from sklearn.preprocessing import OrdinalEncoder
# Define YOUR order!
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(df[['priority']])
# === LABEL ENCODING (simple, for trees) ===
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])
# === NATIVE CATEGORICAL (LightGBM) ===
import lightgbm as lgb
df['category'] = df['category'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df, target)
# === NATIVE CATEGORICAL (CatBoost) ===
from catboost import CatBoostClassifier
model = CatBoostClassifier(cat_features=['category'])
model.fit(df, target)
# === SAVE ENCODER FOR PRODUCTION ===
import joblib
joblib.dump(le, 'encoder.pkl')
le_loaded = joblib.load('encoder.pkl')
The Cheat Sheet
| Question | Answer | Use Label Encoding? |
|---|---|---|
| Is it ordinal? | Yes | ✅ Yes (define order!) |
| Is it nominal + tree model? | Yes | ✅ Yes (safe) |
| Is it nominal + linear model? | Yes | ❌ No (use one-hot) |
| Is it nominal + neural net? | Yes | ❌ No (use embedding) |
| Is cardinality very high? | Yes | ✅ Yes for trees |
| Do you need interpretability? | Yes | ⚠️ Maybe not ideal |
Key Takeaways
- Label encoding assigns integers to categories — Simple but has implications
- Ordinal variables → Label encoding is perfect — Order is meaningful
- Nominal + Linear models → Dangerous! — False ordering kills accuracy
- Nominal + Tree models → Totally fine! — Trees split on thresholds
- Use OrdinalEncoder, not LabelEncoder for ordinal variables — Control the order!
- Modern GBMs have native categorical support — Use it when available
- Save your encoder for production — Same encoding at train and inference
- When in doubt, ask: "Does order matter?" — That's the key question
The One-Sentence Summary
Label encoding is a loaded gun — perfectly safe when pointing at ordinal variables or tree models, potentially deadly when aimed at nominal variables and linear models.
What's Next?
Now that you understand label encoding, you're ready for:
- Target Encoding — When one-hot explodes and label encoding lies
- Embedding Layers — Deep learning's answer to categories
- CatBoost & LightGBM Native Categoricals — The modern approach
- Handling Unknown Categories — Production-ready encoding
Follow me for the next article in this series!
Let's Connect!
If this saved you from the label encoding trap, drop a heart!
Questions? Ask in the comments — I read and respond to every one.
Have you been burned by label encoding before? Share your war stories!
The difference between a model that captures "Red customers buy more" and one that thinks "Color value 2 correlates with purchases"? Understanding when label encoding is appropriate. One question: Does order matter?
Share this with someone who's label encoding everything without asking why. They need to see the decision tree.
Happy encoding!