Sachin Kr. Rajput
Label Encoding: The Simple Trick That's Either Genius or Disaster Depending on One Question

The One-Line Summary: Label encoding assigns each category a number. It's perfect when order matters (Small→Medium→Large), dangerous when order doesn't exist (Red→Blue→Green), and surprisingly fine for tree-based models either way.


The Stadium Seat Problem

You're designing a ticketing system for a stadium.

Each section needs a code that computers can process. You can only use numbers.

You look at your sections:

Section A - Behind the goal
Section B - Midfield lower
Section C - Midfield upper
Section D - Corner seats
Section E - VIP boxes

You think: "Easy! I'll just number them!"

Section A = 1
Section B = 2
Section C = 3
Section D = 4
Section E = 5

Done. Ship it.


Three months later, chaos.

Your pricing algorithm has learned some... interesting things:

"Section E (5) is worth 5 times more than Section A (1)!"

"Section B (2) + Section C (3) = Section E (5)!"

"The average of Section A and Section E is Section C!"

None of this is true. VIP boxes aren't "5 times" anything. You can't add sections together. The relationships are nonsense.

But your algorithm believed the numbers. And numbers have mathematical properties.
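The trap is easy to reproduce with nothing but the codes themselves. Here is a toy illustration using the section numbers from above:

# Toy illustration: the integer codes carry arithmetic the sections don't have
sections = {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}

print(sections['E'] / sections['A'])                          # 5.0, but VIP boxes aren't "5x" anything
print(sections['B'] + sections['C'] == sections['E'])         # True, but Section B + Section C isn't Section E
print((sections['A'] + sections['E']) / 2 == sections['C'])   # True, but C isn't the "average" section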


Now imagine a different scenario.

You're encoding t-shirt sizes:

XS = 1
S = 2
M = 3
L = 4
XL = 5

Now the math makes sense!

  • L (4) IS greater than S (2) ✓
  • M (3) IS between S (2) and L (4) ✓
  • The order IS meaningful ✓

Same technique. Completely different outcome.

The difference? One has natural order. The other doesn't.


What Is Label Encoding?

Label encoding is the simplest categorical encoding: assign each unique category a unique integer.

from sklearn.preprocessing import LabelEncoder

colors = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)

print(encoded)
# [2, 0, 1, 2, 0, 1]

print(encoder.classes_)
# ['Blue', 'Green', 'Red']  (alphabetical order!)

Visual:

Original:   [Red]   [Blue]  [Green]  [Red]   [Blue]  [Green]
              ↓       ↓        ↓       ↓        ↓        ↓
Encoded:    [ 2 ]   [ 0 ]   [ 1 ]   [ 2 ]   [ 0 ]   [ 1 ]

Mapping (alphabetical):
  Blue  → 0
  Green → 1
  Red   → 2

That's it. Each category gets a number. Simple, compact, fast.

But simplicity hides danger.


The One Question That Determines Everything

Before using label encoding, ask yourself:

"Does the order of these categories have meaning?"

If YES → Label Encoding is Perfect ✅

Categories with natural order are called ordinal variables.

T-shirt sizes:     XS < S < M < L < XL       ✓ Order matters!
Education:         High School < Bachelor < Master < PhD    ✓
Ratings:           Poor < Fair < Good < Excellent           ✓
Temperature feel:  Cold < Cool < Warm < Hot                 ✓
Priority:          Low < Medium < High < Critical           ✓

For these, the numbers SHOULD imply order. That's the whole point!

# Perfect use of label encoding
sizes = ['S', 'M', 'L', 'XL', 'S', 'M']

# Manual mapping to preserve order
size_map = {'XS': 0, 'S': 1, 'M': 2, 'L': 3, 'XL': 4}
encoded = [size_map[s] for s in sizes]

# Now: L(3) > S(1) is TRUE and MEANINGFUL

If NO → Label Encoding is Dangerous ❌

Categories without natural order are called nominal variables.

Colors:      Red, Blue, Green          ✗ No order!
Countries:   USA, Japan, France        ✗ No order!
Blood types: A, B, AB, O               ✗ No order!
Car brands:  Toyota, BMW, Tesla        ✗ No order!

For these, any order you impose is arbitrary and misleading.

# Dangerous use of label encoding
colors = ['Red', 'Blue', 'Green']

encoder = LabelEncoder()
encoded = encoder.fit_transform(colors)
# Blue=0, Green=1, Red=2

# Now the model might learn:
# - Red(2) > Blue(0) → FALSE!
# - Green(1) is "between" Blue and Red → NONSENSE!
# - Red - Blue = 2 → MEANINGLESS!

The Decision Tree Exception 🌳

Here's the plot twist that confuses everyone:

Tree-based models don't care about the ordering problem!

Why? Because trees only ask "Is X <= threshold?" questions.

Decision Tree with label-encoded colors:

                    [Is color <= 0.5?]
                    /              \
                 YES                NO
                  |                  |
            [color = 0]      [Is color <= 1.5?]
            (Blue)           /              \
                          YES                NO
                           |                  |
                     [color = 1]        [color = 2]
                     (Green)            (Red)

The tree doesn't think "Red is greater than Blue." It just splits on thresholds. Each category ends up in its own branch anyway.

For trees, label encoding nominal variables is FINE.

from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# This works perfectly for Random Forest!
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green']
target = [1, 0, 1, 1, 0, 1]

colors_encoded = LabelEncoder().fit_transform(colors)
model = RandomForestClassifier()
model.fit(colors_encoded.reshape(-1, 1), target)
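If you'd rather see the splits than trust the diagram, here's a minimal self-contained sketch (the colors/target pattern is made up for illustration) that fits a plain decision tree and prints its rules:

from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text

# Made-up pattern: Green customers buy, Red and Blue don't
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue', 'Green'] * 10
target = [0, 0, 1, 0, 0, 1] * 10

X = LabelEncoder().fit_transform(colors).reshape(-1, 1)  # Blue=0, Green=1, Red=2
tree = DecisionTreeClassifier().fit(X, target)

print(export_text(tree, feature_names=['color']))
# Every rule is a threshold comparison (color <= 0.50, color <= 1.50);
# two splits are enough to give Green its own leaf, with no ordering meaning assumed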

But for linear models, SVMs, neural networks, k-NN — the danger remains.
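A quick way to see why distance-based models stay exposed: with label codes, the "distance between colors" becomes a real number that depends entirely on which integers the encoder happened to assign (toy numbers below):

# With Blue=0, Green=1, Red=2, a k-NN or SVM effectively measures these distances:
blue, green, red = 0, 1, 2

print(abs(red - blue))   # 2 -> Red looks "twice as far" from Blue as Green does
print(abs(red - green))  # 1
# With one-hot vectors, every pair of distinct colors is equally far apart,
# which is exactly what "no natural order" means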


The Visual Guide

                    DOES ORDER MATTER?
                           │
              ┌────────────┴────────────┐
              │                         │
             YES                        NO
         (Ordinal)                  (Nominal)
              │                         │
              ▼                         ▼
    ┌─────────────────┐      ┌─────────────────────────┐
    │ LABEL ENCODING  │      │   WHAT MODEL ARE YOU    │
    │ IS PERFECT! ✅  │      │        USING?           │
    │                 │      │                         │
    │ Small=0         │      │    ┌─────────┴──────┐   │
    │ Medium=1        │      │    │                │   │
    │ Large=2         │      │  TREE-BASED    LINEAR/NN│
    │                 │      │    │                │   │
    └─────────────────┘      │    ▼                ▼   │
                             │  Label           One-Hot│
                             │  encoding        encoding│
                             │  is FINE ✅     required │
                             └─────────────────────────┘

Label Encoding in Practice

The Right Way: Ordinal Variables

from sklearn.preprocessing import OrdinalEncoder

# Define the order explicitly!
size_order = [['XS', 'S', 'M', 'L', 'XL']]

encoder = OrdinalEncoder(categories=size_order)
sizes = [['M'], ['XL'], ['S'], ['L'], ['XS']]

encoded = encoder.fit_transform(sizes)
print(encoded)
# [[2.]   # M
#  [4.]   # XL
#  [1.]   # S
#  [3.]   # L
#  [0.]]  # XS

Why OrdinalEncoder instead of LabelEncoder?

  • LabelEncoder: alphabetical order (L=0, M=1, S=2, XL=3, XS=4), which scrambles the size order. Wrong!
  • OrdinalEncoder: YOU define the order (XS=0, S=1, M=2, L=3, XL=4). Correct! The quick check below makes the difference concrete.
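
Here is that quick check, a small self-contained snippet comparing the two encoders on the same size labels:

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

sizes = ['XS', 'S', 'M', 'L', 'XL']

le = LabelEncoder().fit(sizes)
print({str(c): i for i, c in enumerate(le.classes_)})
# L=0, M=1, S=2, XL=3, XS=4 -> alphabetical, the size order is scrambled

oe = OrdinalEncoder(categories=[sizes])
codes = oe.fit_transform([[s] for s in sizes]).ravel()
print(dict(zip(sizes, codes.astype(int).tolist())))
# XS=0, S=1, M=2, L=3, XL=4 -> the order you defined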

The Acceptable Way: Nominal Variables + Tree Models

from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
import xgboost as xgb
import lightgbm as lgb

# For tree-based models, label encoding nominal variables is fine
colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
target = [1, 0, 1, 1, 0]

le = LabelEncoder()
colors_encoded = le.fit_transform(colors)

# All of these work fine!
RandomForestClassifier().fit(colors_encoded.reshape(-1, 1), target)
GradientBoostingClassifier().fit(colors_encoded.reshape(-1, 1), target)
xgb.XGBClassifier().fit(colors_encoded.reshape(-1, 1), target)
lgb.LGBMClassifier().fit(colors_encoded.reshape(-1, 1), target)

The Dangerous Way: Nominal Variables + Linear Models

from sklearn.linear_model import LogisticRegression

colors = ['Red', 'Blue', 'Green', 'Red', 'Blue']
target = [1, 0, 1, 1, 0]

# ❌ DANGEROUS!
le = LabelEncoder()
colors_encoded = le.fit_transform(colors)
# Blue=0, Green=1, Red=2

model = LogisticRegression()
model.fit(colors_encoded.reshape(-1, 1), target)

# The model now has ONE coefficient for "color"
# It thinks: Higher color value → some effect
# This implies Red(2) has 2x the effect of Green(1)
# NONSENSE!

What the model learns:

log(odds) = β₀ + β₁ × color_encoded

If β₁ = 0.5:
  Blue(0):  log(odds) = β₀ + 0.5 × 0 = β₀
  Green(1): log(odds) = β₀ + 0.5 × 1 = β₀ + 0.5
  Red(2):   log(odds) = β₀ + 0.5 × 2 = β₀ + 1.0

The model thinks Red's effect is exactly 2× Green's effect!
This is ARBITRARY based on alphabetical ordering!

Side-by-Side: Label vs One-Hot for Linear Models

import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score

# Create data where color DOES matter
np.random.seed(42)
n = 1000

# Red customers buy 80%, Blue 50%, Green 20%
colors = np.random.choice(['Red', 'Blue', 'Green'], n)
purchase_prob = {'Red': 0.8, 'Blue': 0.5, 'Green': 0.2}
target = [1 if np.random.random() < purchase_prob[c] else 0 for c in colors]

# Method 1: Label Encoding (WRONG for linear models)
le = LabelEncoder()
X_label = le.fit_transform(colors).reshape(-1, 1)

model_label = LogisticRegression()
scores_label = cross_val_score(model_label, X_label, target, cv=5)
print(f"Label Encoding Accuracy: {scores_label.mean():.1%}")

# Method 2: One-Hot Encoding (CORRECT for linear models)
X_onehot = pd.get_dummies(pd.DataFrame({'color': colors}))

model_onehot = LogisticRegression()
scores_onehot = cross_val_score(model_onehot, X_onehot, target, cv=5)
print(f"One-Hot Encoding Accuracy: {scores_onehot.mean():.1%}")

Output:

Label Encoding Accuracy: 62.4%
One-Hot Encoding Accuracy: 68.7%

A 6-point accuracy difference just from the encoding choice! The one-hot model learned each color's TRUE effect. The label-encoded model was constrained to a false linear relationship.
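You can also peek at what each model actually learned. The sketch below repeats the setup so it runs on its own; the exact coefficient values will vary, but the shape of the result is the point:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder

np.random.seed(42)
colors = np.random.choice(['Red', 'Blue', 'Green'], 1000)
prob = {'Red': 0.8, 'Blue': 0.5, 'Green': 0.2}
y = [1 if np.random.random() < prob[c] else 0 for c in colors]

X_label = LabelEncoder().fit_transform(colors).reshape(-1, 1)
X_onehot = pd.get_dummies(pd.DataFrame({'color': colors}))

coef_label = LogisticRegression().fit(X_label, y).coef_[0]
coef_onehot = LogisticRegression().fit(X_onehot, y).coef_[0]

print(coef_label)
# One slope for "color": the model is forced into a monotone Blue < Green < Red effect

print({c: round(float(w), 2) for c, w in zip(X_onehot.columns, coef_onehot)})
# One weight per color: Red high, Green low, Blue in between -> each effect learned freely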


When Label Encoding Shines

Advantage 1: Memory Efficiency

import numpy as np

n_samples = 100_000
n_categories = 100

# Label Encoding: 1 column
label_memory = n_samples * 8  # 8 bytes per int64
print(f"Label Encoding: {label_memory / 1e6:.2f} MB")

# One-Hot Encoding: 100 columns
onehot_memory = n_samples * n_categories * 8
print(f"One-Hot Encoding: {onehot_memory / 1e6:.2f} MB")

Output:

Label Encoding: 0.80 MB
One-Hot Encoding: 80.00 MB

100x less memory!
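One caveat worth stating: the 80 MB figure assumes a dense int64 matrix. In scikit-learn 1.2+, OneHotEncoder can return a sparse matrix (only the non-zero entries are stored), which narrows the gap, though a single integer column is still the most compact option. A rough sketch:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

n_samples, n_categories = 100_000, 100
X = np.random.randint(0, n_categories, size=(n_samples, 1))

sparse_onehot = OneHotEncoder(sparse_output=True).fit_transform(X)
print(f"Sparse one-hot non-zero values: ~{sparse_onehot.data.nbytes / 1e6:.2f} MB")
# ~0.80 MB of stored values (plus index overhead), versus 80 MB dense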


Advantage 2: Tree Model Performance

For gradient boosting, label encoding can actually be BETTER than one-hot:

import lightgbm as lgb
import xgboost as xgb
from sklearn.preprocessing import LabelEncoder
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score

# Create dataset
np.random.seed(42)
n = 10000
n_categories = 50

df = pd.DataFrame({
    'category': np.random.choice([f'cat_{i}' for i in range(n_categories)], n),
    'numeric': np.random.randn(n)
})
target = (df['category'].str.extract(r'(\d+)')[0].astype(int) > 25).astype(int)

# Label Encoding
le = LabelEncoder()
df_label = df.copy()
df_label['category'] = le.fit_transform(df_label['category'])

# One-Hot Encoding
df_onehot = pd.get_dummies(df, columns=['category'])

# Compare with LightGBM
print("LightGBM Performance:")
model = lgb.LGBMClassifier(verbose=-1)
scores_label = cross_val_score(model, df_label, target, cv=5)
print(f"  Label Encoding: {scores_label.mean():.1%}")

scores_onehot = cross_val_score(model, df_onehot, target, cv=5)
print(f"  One-Hot Encoding: {scores_onehot.mean():.1%}")

Output:

LightGBM Performance:
  Label Encoding: 99.2%
  One-Hot Encoding: 98.8%

Label encoding is simpler, faster, uses less memory, and performs just as well (or better!) for tree models.


Advantage 3: Native Categorical Support

Modern libraries handle label-encoded categories natively:

import lightgbm as lgb

# LightGBM native categorical handling
df['category'] = df['category'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df[['category', 'numeric']], target, categorical_feature=['category'])

# Even better: LightGBM figures out optimal splits!

The Complete Decision Framework

START
  │
  ▼
Is the variable ORDINAL (has natural order)?
  │
  ├── YES ──────────────────────────────────────────────────┐
  │                                                         │
  │   Examples: Size, Rating, Education, Priority           │
  │                                                         │
  │   ┌─────────────────────────────────────────────────┐   │
  │   │ USE LABEL/ORDINAL ENCODING ✅                   │   │
  │   │                                                 │   │
  │   │ But DEFINE THE ORDER yourself!                  │   │
  │   │ Don't trust alphabetical sorting.               │   │
  │   │                                                 │   │
  │   │ OrdinalEncoder(categories=[['S','M','L','XL']]) │   │
  │   └─────────────────────────────────────────────────┘   │
  │                                                         │
  └── NO (Nominal) ─────────────────────────────────────────┤
                                                            │
      Examples: Color, Country, Product ID, Name            │
                                                            │
      What MODEL are you using?                             │
        │                                                   │
        ├── TREE-BASED (RF, XGB, LightGBM, CatBoost)       │
        │     │                                             │
        │     ▼                                             │
        │   ┌─────────────────────────────────────────┐     │
        │   │ LABEL ENCODING IS FINE ✅               │     │
        │   │                                         │     │
        │   │ Trees split on thresholds, not values.  │     │
        │   │ Saves memory vs one-hot.                │     │
        │   │ Use native categorical support if       │     │
        │   │ available (LightGBM, CatBoost).         │     │
        │   └─────────────────────────────────────────┘     │
        │                                                   │
        └── LINEAR / NEURAL NET / KNN / SVM                │
              │                                             │
              ▼                                             │
            ┌─────────────────────────────────────────┐     │
            │ DON'T USE LABEL ENCODING ❌             │     │
            │                                         │     │
            │ Use instead:                            │     │
            │ • One-Hot (low cardinality)             │     │
            │ • Target Encoding (high cardinality)    │     │
            │ • Embeddings (neural networks)          │     │
            └─────────────────────────────────────────┘     │
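Target encoding comes up twice in that framework but isn't shown anywhere above, so here is a minimal sketch of one common variant: a smoothed category mean of the target, learned on the training split only so the target doesn't leak. The DataFrame and column names ('category', 'target') are placeholders:

import pandas as pd

def fit_target_encoding(train: pd.DataFrame, col: str, target: str, smoothing: float = 10.0) -> pd.Series:
    """Learn a smoothed mean-target mapping: blend each category's mean with the global mean."""
    global_mean = train[target].mean()
    stats = train.groupby(col)[target].agg(['mean', 'count'])
    weight = stats['count'] / (stats['count'] + smoothing)
    return weight * stats['mean'] + (1 - weight) * global_mean

# Usage sketch: learn the mapping on the training split, then apply it everywhere
# mapping = fit_target_encoding(train, 'category', 'target')
# train['category_te'] = train['category'].map(mapping)
# test['category_te'] = test['category'].map(mapping).fillna(mapping.mean())  # unseen categories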

Common Mistakes

Mistake 1: Using LabelEncoder for Ordinal Data

# ❌ WRONG: Alphabetical order!
sizes = ['Small', 'Medium', 'Large', 'XL']
le = LabelEncoder()
encoded = le.fit_transform(sizes)
print(dict(zip(sizes, encoded)))
# {'Small': 2, 'Medium': 1, 'Large': 0, 'XL': 3}
# Large=0, Medium=1, Small=2, XL=3 — WRONG ORDER!

# ✅ RIGHT: Define order explicitly
from sklearn.preprocessing import OrdinalEncoder
encoder = OrdinalEncoder(categories=[['Small', 'Medium', 'Large', 'XL']])
encoded = encoder.fit_transform([['Medium'], ['XL'], ['Small']])
# Small=0, Medium=1, Large=2, XL=3 — CORRECT!

Mistake 2: Label Encoding Nominal Variables for Linear Models

# ❌ WRONG: Creates false ordering
countries = ['USA', 'Japan', 'France']
encoded = LabelEncoder().fit_transform(countries)
LogisticRegression().fit(encoded.reshape(-1,1), target)
# Model thinks: France(0) < Japan(1) < USA(2)

# ✅ RIGHT: Use one-hot encoding
encoded = pd.get_dummies(pd.DataFrame({'country': countries}))
LogisticRegression().fit(encoded, target)

Mistake 3: Not Handling Unknown Categories

# ❌ WRONG: Crashes on new categories
le = LabelEncoder()
le.fit(['Red', 'Blue', 'Green'])
le.transform(['Purple'])  # 💥 ValueError!

# ✅ RIGHT: Handle manually or use different encoder
def safe_transform(encoder, values):
    known = set(encoder.classes_)
    return [encoder.transform([v])[0] if v in known else -1 for v in values]

# Or use category_encoders with handle_unknown
import category_encoders as ce
encoder = ce.OrdinalEncoder(handle_unknown='value')

Mistake 4: Assuming All Trees Are the Same

# ⚠️ CAUTION: sklearn trees DO care about order somewhat
from sklearn.tree import DecisionTreeClassifier

# For sklearn trees, label-encoded nominal features can produce suboptimal splits:
# each split uses a single numeric threshold, so isolating an arbitrary group of
# categories may take several splits, and which groups are easy to reach depends
# on the arbitrary integer ordering

# ✅ BETTER: Use LightGBM or CatBoost with native categorical support
import lightgbm as lgb
df['category'] = df['category'].astype('category')
lgb.LGBMClassifier().fit(df, target)

Mistake 5: Forgetting Encoding During Inference

# ❌ WRONG: Different encoding at inference
le = LabelEncoder()
le.fit(train_data['color'])

# Later, in production...
new_color = 'Blue'
encoded = le.transform([new_color])  # Must use SAME encoder!

# ✅ RIGHT: Save and load the encoder
import joblib
joblib.dump(le, 'color_encoder.pkl')

# In production
le = joblib.load('color_encoder.pkl')
encoded = le.transform([new_color])

Label Encoding vs Alternatives

Scenario             Label Encoding    One-Hot           Target Encoding
Ordinal variable     ✅ Perfect        ❌ Loses order    ⚠️ Possible
Nominal + Trees      ✅ Great          ✅ Works          ✅ Works
Nominal + Linear     ❌ Dangerous      ✅ Required       ✅ Good
High cardinality     ✅ Compact        ❌ Explodes       ✅ Compact
Memory constrained   ✅ Minimal        ❌ Huge           ✅ Minimal
Interpretability     ⚠️ Confusing      ✅ Clear          ⚠️ Less clear

Quick Reference Code

# === ORDINAL ENCODING (for ordered categories) ===
from sklearn.preprocessing import OrdinalEncoder

# Define YOUR order!
encoder = OrdinalEncoder(categories=[['Low', 'Medium', 'High']])
encoded = encoder.fit_transform(df[['priority']])


# === LABEL ENCODING (simple, for trees) ===
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['category_encoded'] = le.fit_transform(df['category'])


# === NATIVE CATEGORICAL (LightGBM) ===
import lightgbm as lgb

df['category'] = df['category'].astype('category')
model = lgb.LGBMClassifier()
model.fit(df, target)


# === NATIVE CATEGORICAL (CatBoost) ===
from catboost import CatBoostClassifier

model = CatBoostClassifier(cat_features=['category'])
model.fit(df, target)


# === SAVE ENCODER FOR PRODUCTION ===
import joblib

joblib.dump(le, 'encoder.pkl')
le_loaded = joblib.load('encoder.pkl')

The Cheat Sheet

Question                         Answer   Use Label Encoding?
Is it ordinal?                   Yes      ✅ Yes (define the order!)
Is it nominal + tree model?      Yes      ✅ Yes (safe)
Is it nominal + linear model?    Yes      ❌ No (use one-hot)
Is it nominal + neural net?      Yes      ❌ No (use embeddings)
Is cardinality very high?        Yes      ✅ Yes, for trees
Do you need interpretability?    Yes      ⚠️ Maybe not ideal

Key Takeaways

  1. Label encoding assigns integers to categories — Simple but has implications

  2. Ordinal variables → Label encoding is perfect — Order is meaningful

  3. Nominal + Linear models → Dangerous! — False ordering kills accuracy

  4. Nominal + Tree models → Totally fine! — Trees split on thresholds

  5. Use OrdinalEncoder, not LabelEncoder for ordinal variables — Control the order!

  6. Modern GBMs have native categorical support — Use it when available

  7. Save your encoder for production — Same encoding at train and inference

  8. When in doubt, ask: "Does order matter?" — That's the key question


The One-Sentence Summary

Label encoding is a loaded gun — perfectly safe when pointing at ordinal variables or tree models, potentially deadly when aimed at nominal variables and linear models.


What's Next?

Now that you understand label encoding, you're ready for:

  • Target Encoding — When one-hot explodes and label encoding lies
  • Embedding Layers — Deep learning's answer to categories
  • CatBoost & LightGBM Native Categoricals — The modern approach
  • Handling Unknown Categories — Production-ready encoding

Follow me for the next article in this series!


Let's Connect!

If this saved you from the label encoding trap, drop a heart!

Questions? Ask in the comments — I read and respond to every one.

Have you been burned by label encoding before? Share your war stories!


The difference between a model that captures "Red customers buy more" and one that thinks "Color value 2 correlates with purchases"? Understanding when label encoding is appropriate. One question: Does order matter?


Share this with someone who's label encoding everything without asking why. They need to see the decision tree.

Happy encoding!
