When to Use LabelEncoder and OneHotEncoder in Machine Learning

#machinelearning #python #datascience #ai

Imagine trying to teach a computer what Small, Medium, and Large mean.
To you, these words are simple. But for a machine, they’re just strange strings of letters.
Machines don’t understand text; they only work with numbers.
So before we train a machine learning model, we need to translate these text categories into numbers that the model can actually process.
Two popular translators are:

LabelEncoder
OneHotEncoder

But when do you use each one? Let’s make this super clear.

1. What Is Categorical Data?

Categorical data is data made up of labels or names, not numbers.
Think about these examples:

Size: Small, Medium, Large
Weather: Sunny, Rainy, Cloudy
Cities: Lagos, Abuja, Kano

We can’t just throw these words at a model; we need to convert them into numbers first.

2. Label Encoding

Label Encoding simply assigns a number to each category, starting from 0.

Example with sizes:

Small  → 0  
Medium → 1  
Large  → 2

When Should You Use Label Encoding?

Use it when the categories have a natural order or ranking.
Examples:

Low < Medium < High
Cold < Warm < Hot
Small < Medium < Large

Avoid LabelEncoder for categories like colors or city names, because a model can read the numbers as a ranking and conclude that Green (2) > Red (0), which is meaningless!

from sklearn.preprocessing import LabelEncoder

sizes = ["Small", "Large", "Medium", "Small", "Large"]

label_encoder = LabelEncoder()
encoded_sizes = label_encoder.fit_transform(sizes)

print("Original:", sizes)
print("Encoded:", encoded_sizes)
print("Classes:", label_encoder.classes_)

Output:

Original: ['Small', 'Large', 'Medium', 'Small', 'Large']
Encoded: [2 0 1 2 0]
Classes: ['Large' 'Medium' 'Small']

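Notice that LabelEncoder assigned the numbers alphabetically (Large=0, Medium=1, Small=2), not by size.
Also note that the scikit-learn documentation describes LabelEncoder as an encoder for target labels (y) rather than input features.
If you want Small=0, Medium=1, Large=2, you can map it manually:

size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
encoded_sizes = [size_order[s] for s in sizes]  # [0, 2, 1, 0, 2]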
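
As an alternative to a hand-written dictionary, scikit-learn's OrdinalEncoder accepts an explicit category order for feature columns. The snippet below is a minimal sketch that reuses the sizes from the example above; the variable names are just for illustration.

```python
from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# OrdinalEncoder expects 2D input: one row per sample, one column per feature.
sizes = np.array(["Small", "Large", "Medium", "Small", "Large"]).reshape(-1, 1)

# Pass the desired order explicitly, one list per feature column.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

print(encoded_sizes.ravel())  # [0. 2. 1. 0. 2.]
```

The result is a float array by default, and the numbers now follow the real ranking instead of the alphabet.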

3. One-Hot Encoding

One-Hot Encoding creates separate columns for each category and marks them with 0 or 1.

For example:

Color: Red   → [1, 0, 0]
       Blue  → [0, 1, 0]
       Green → [0, 0, 1]

This way, no category is greater or less than another.
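
One way to see why this works is to compare distances between categories. The sketch below is an illustration written by hand with NumPy (no scikit-learn involved): with one-hot vectors every pair of categories is equally far apart, while integer codes make some pairs look closer than others.

```python
import numpy as np

# One-hot vectors for three categories (rows of the identity matrix).
one_hot = np.eye(3)  # Red, Blue, Green
# Integer codes for the same three categories.
codes = np.array([[0.0], [1.0], [2.0]])

def pairwise_distances(points):
    """Euclidean distance between every pair of rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

one_hot_dist = pairwise_distances(one_hot)
code_dist = pairwise_distances(codes)

print(one_hot_dist.round(3))  # every off-diagonal entry is 1.414
print(code_dist)              # 0 vs 2 is twice as far as 0 vs 1
```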

When Should You Use One-Hot Encoding?

Use it when the categories have no order, like colors, cities, or animal names.
It avoids the fake ranking problem that LabelEncoder might create.

from sklearn.preprocessing import OneHotEncoder
import numpy as np

colors = np.array(["Red", "Blue", "Green", "Blue", "Red"]).reshape(-1, 1)

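# sparse_output=False returns a regular NumPy array (scikit-learn 1.2+; older versions use sparse=False)
onehot_encoder = OneHotEncoder(sparse_output=False)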
encoded_colors = onehot_encoder.fit_transform(colors)

print("Original:", colors.flatten())
print("Encoded:\n", encoded_colors)
print("Categories:", onehot_encoder.categories_)

Output:

Original: ['Red' 'Blue' 'Green' 'Blue' 'Red']
Encoded:
 [[0. 0. 1.]
 [1. 0. 0.]
 [0. 1. 0.]
 [1. 0. 0.]
 [0. 0. 1.]]
Categories: [array(['Blue', 'Green', 'Red'], dtype='<U5')]

4. LabelEncoder vs OneHotEncoder

Think of it like this:

LabelEncoder says: I’ll give each category a number. You figure out what it means.

OneHotEncoder says: I’ll give each category its own column so no one feels more important than the other.

Aspect     LabelEncoder                  OneHotEncoder
Best for   Ordered categories            Unordered categories
Output     Single column (0, 1, 2...)    Multiple columns (0/1)
Risk       Fake order for labels         No fake order
Example    Small < Medium < Large        Red, Blue, Green
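
In a real dataset you will often have both kinds of columns at once. Here is a minimal sketch, assuming a made-up two-column array (size and color), that uses ColumnTransformer to apply an ordered encoding to one column and one-hot encoding to the other. OrdinalEncoder stands in for the ordered case because it works on feature columns and accepts an explicit order.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Hypothetical toy data: column 0 is size (ordered), column 1 is color (unordered).
data = np.array([
    ["Small", "Red"],
    ["Large", "Blue"],
    ["Medium", "Green"],
])

preprocessor = ColumnTransformer(
    transformers=[
        ("size", OrdinalEncoder(categories=[["Small", "Medium", "Large"]]), [0]),
        ("color", OneHotEncoder(), [1]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocessor.fit_transform(data)
# Columns: size rank, then color_Blue, color_Green, color_Red
print(features)
```

Each row ends up with one ranked number for size and three 0/1 columns for color.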

5. What About 2 Categories?

If you only have 2 categories, LabelEncoder is fine because it will just give 0 and 1.

Example:

from sklearn.preprocessing import LabelEncoder

binary = ["Yes", "No", "Yes", "No"]

encoder = LabelEncoder()
encoded = encoder.fit_transform(binary)

print("Encoded:", encoded)  # [1 0 1 0]
print("Classes:", encoder.classes_)  # ['No' 'Yes']

Here:

No  → 0
Yes → 1
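
If the two-category column is an input feature rather than the target, a similar single 0/1 column can be produced with OneHotEncoder and its drop="if_binary" option. This is a small sketch of that alternative, not a replacement for the example above.

```python
from sklearn.preprocessing import OneHotEncoder
import numpy as np

answers = np.array(["Yes", "No", "Yes", "No"]).reshape(-1, 1)

# drop="if_binary" keeps a single 0/1 column when a feature has exactly two categories.
encoder = OneHotEncoder(drop="if_binary")
encoded = encoder.fit_transform(answers).toarray()

print(encoded.ravel())  # [1. 0. 1. 0.]
```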

6. Things to Keep in Mind

Use LabelEncoder when your data has a natural order (e.g., Small < Medium < Large).
Use OneHotEncoder when your data has no order (e.g., colors, cities).
For 2 categories, LabelEncoder automatically uses 0 and 1.
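The order of rows in your CSV does not matter: both encoders sort the unique categories first, so the same set of values always gets the same numbers.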
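
To check the point about row order yourself, here is a quick sketch that fits LabelEncoder on the same values in two different orders. The shuffled list is written by hand for the example.

```python
from sklearn.preprocessing import LabelEncoder

rows = ["Small", "Large", "Medium", "Small", "Large"]
shuffled = ["Large", "Small", "Small", "Medium", "Large"]  # same values, different order

first = LabelEncoder().fit(rows)
second = LabelEncoder().fit(shuffled)

# Both encoders learn the same sorted classes, so the mapping is identical.
print(list(first.classes_) == list(second.classes_))  # True
print(first.transform(["Medium"]), second.transform(["Medium"]))  # [1] [1]
```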

Think of LabelEncoder as the tool for ranking categories that have order.
Think of OneHotEncoder as the tool for naming categories when order doesn’t exist.

If you mix them up (like using LabelEncoder on colors), your model may treat arbitrary numbers as a ranking, learn patterns that aren’t there, and make bad predictions.