Adeniyi Olanrewaju

When to Use LabelEncoder and OneHotEncoder in Machine Learning

Imagine trying to teach a computer what Small, Medium, and Large mean.
To you, these words are simple. But for a machine, they’re just strange strings of letters.
Machines don’t understand text; they only understand numbers.
So before we train a machine learning model, we need to translate these text categories into numbers that the model can actually process.
Two popular translators are:

  • LabelEncoder
  • OneHotEncoder

But when do you use each one? Let’s make this super clear.

1. What Is Categorical Data?

Categorical data is data made up of labels or names, not numbers.
Think about these examples:

  • Size: Small, Medium, Large
  • Weather: Sunny, Rainy, Cloudy
  • Cities: Lagos, Abuja, Kano

We can’t just throw these words at a model; we need to convert them into numbers first.

2. Label Encoding

Label Encoding simply assigns a number to each category, starting from 0.

Example with sizes:

Small  → 0  
Medium → 1  
Large  → 2

When Should You Use Label Encoding?

Use it when the categories have a natural order or ranking.
Examples:

  • Low < Medium < High

  • Cold < Warm < Hot

  • Small < Medium < Large

Never use LabelEncoder for categories like colors or city names, because the model will treat one color as numerically greater than another (for example, Red (2) > Blue (0)), which is meaningless!

from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Large", "Medium", "Small", "Large"]

# fit_transform learns the unique classes (sorted alphabetically) and maps each one to an integer
label_encoder = LabelEncoder()
encoded_sizes = label_encoder.fit_transform(sizes)

print("Original:", sizes)
print("Encoded:", encoded_sizes)
print("Classes:", label_encoder.classes_)

Output:

Original: ['Small', 'Large', 'Medium', 'Small', 'Large']
Encoded: [2 0 1 2 0]
Classes: ['Large' 'Medium' 'Small']

Notice how it encoded alphabetically (Large=0, Medium=1, Small=2).
If you want Small=0, Medium=1, Large=2, you can map it manually:

size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
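
If you would rather not hard-code the mapping, scikit-learn's OrdinalEncoder lets you pass the order explicitly. Here is a minimal sketch using the same sizes as above:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# OrdinalEncoder works on 2D input, so reshape the list into a single column
sizes = np.array(["Small", "Large", "Medium", "Small", "Large"]).reshape(-1, 1)

# Pass the categories in the order you want them numbered: Small=0, Medium=1, Large=2
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
print(ordinal_encoder.fit_transform(sizes).ravel())  # [0. 2. 1. 0. 2.]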

3. One-Hot Encoding

One-Hot Encoding creates separate columns for each category and marks them with 0 or 1.

For example:

Color: Red   → [1, 0, 0]
       Blue  → [0, 1, 0]
       Green → [0, 0, 1]


This way, no category is greater or less than another.

When Should You Use One-Hot Encoding?

Use it when the categories have no order, like colors, cities, or animal names.
It avoids the fake ranking problem that LabelEncoder might create.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

# OneHotEncoder expects a 2D array, so reshape the 1D list into a single column
colors = np.array(["Red", "Blue", "Green", "Blue", "Red"]).reshape(-1, 1)

# sparse_output=False returns a regular NumPy array instead of a sparse matrix
onehot_encoder = OneHotEncoder(sparse_output=False)
encoded_colors = onehot_encoder.fit_transform(colors)

print("Original:", colors.flatten())
print("Encoded:\n", encoded_colors)
print("Categories:", onehot_encoder.categories_)

Output:

Original: ['Red' 'Blue' 'Green' 'Blue' 'Red']
Encoded:
 [[0. 0. 1.]
  [1. 0. 0.]
  [0. 1. 0.]
  [1. 0. 0.]
  [0. 0. 1.]]
Categories: [array(['Blue', 'Green', 'Red'], dtype=object)]
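
One practical note: by default, OneHotEncoder raises an error if it meets a category at prediction time that it never saw during fit. Setting handle_unknown="ignore" encodes such values as all zeros instead. A minimal sketch (the "Yellow" value is just an example of an unseen category):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

colors = np.array(["Red", "Blue", "Green"]).reshape(-1, 1)

# handle_unknown="ignore" encodes unseen categories as all zeros instead of raising an error
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit(colors)

print(encoder.transform(np.array([["Blue"], ["Yellow"]])))
# [[1. 0. 0.]
#  [0. 0. 0.]]  <- "Yellow" was never seen during fit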

4. LabelEncoder vs OneHotEncoder

Think of it like this:

LabelEncoder says: I’ll give each category a number. You figure out what it means.

OneHotEncoder says: I’ll give each category its own column so no one feels more important than the other.

Aspect   | LabelEncoder               | OneHotEncoder
---------|----------------------------|------------------------
Best for | Ordered categories         | Unordered categories
Output   | Single column (0, 1, 2...) | Multiple columns (0/1)
Risk     | Fake order for labels      | No fake order
Example  | Small < Medium < Large     | Red, Blue, Green
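
To see the two approaches side by side, here is a minimal sketch that puts them in one ColumnTransformer; the toy DataFrame and its column names (size, color) are made up for illustration:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
import pandas as pd

# Hypothetical toy data: one ordered column, one unordered column
df = pd.DataFrame({
    "size": ["Small", "Large", "Medium"],
    "color": ["Red", "Blue", "Green"],
})

preprocess = ColumnTransformer([
    # Ordered column: ordinal encoding with an explicit order
    ("size", OrdinalEncoder(categories=[["Small", "Medium", "Large"]]), ["size"]),
    # Unordered column: one column per color
    ("color", OneHotEncoder(sparse_output=False), ["color"]),
])

print(preprocess.fit_transform(df))
# [[0. 0. 0. 1.]
#  [2. 1. 0. 0.]
#  [1. 0. 1. 0.]]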

5. What About 2 Categories?

If you only have 2 categories, LabelEncoder is fine because it will just give 0 and 1.

Example:

from sklearn.preprocessing import LabelEncoder

binary = ["Yes", "No", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(binary)

print("Encoded:", encoded)  # [1 0 1 0]
print("Classes:", encoder.classes_)  # ['No' 'Yes']

Here:

No  → 0
Yes → 1
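
If you prefer to stay with OneHotEncoder even for a binary column, drop="first" removes the redundant column so you still end up with a single 0/1 column. A minimal sketch:

from sklearn.preprocessing import OneHotEncoder
import numpy as np

binary = np.array(["Yes", "No", "Yes", "No"]).reshape(-1, 1)

# drop="first" drops the first category's column ("No"), leaving one 0/1 column for "Yes"
encoder = OneHotEncoder(drop="first", sparse_output=False)
print(encoder.fit_transform(binary).ravel())  # [1. 0. 1. 0.]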

6. Things to Keep in Mind

  • Use LabelEncoder when your data has a natural order (e.g., Small < Medium < Large).

  • Use OneHotEncoder when your data has no order (e.g., colors, cities).

  • For 2 categories, LabelEncoder automatically uses 0 and 1.

  • Encoders learn the set of categories from the values themselves, so the order of rows in your CSV does not matter.


Think of LabelEncoder as the tool for ranking categories that have order.
Think of OneHotEncoder as the tool for naming categories when order doesn’t exist.

If you mix them up (like using LabelEncoder on colors), your model may learn a ranking that doesn’t exist and make worse predictions.
