Imagine trying to teach a computer what Small, Medium, and Large mean.
To you, these words are simple. But for a machine, they’re just strange strings of letters.
Most machine learning models can’t work with raw text; they only operate on numbers.
So before we train a machine learning model, we need to translate these text categories into numbers that the model can actually process.
Two popular translators are:
- LabelEncoder
- OneHotEncoder
But when do you use each one? Let’s make this super clear.
1. What Is Categorical Data?
Categorical data is data made up of labels or names, not numbers.
Think about these examples:
- Size: Small, Medium, Large
- Weather: Sunny, Rainy, Cloudy
- Cities: Lagos, Abuja, Kano
We can’t just throw these words at a model; we need to convert them into numbers first.
2. Label Encoding
Label Encoding simply assigns a number to each category, starting from 0.
Example with sizes:
Small → 0
Medium → 1
Large → 2
When Should You Use Label Encoding?
Use it when the categories have a natural order or ranking.
Examples:
Low < Medium < High
Cold < Warm < Hot
Small < Medium < Large
Never use LabelEncoder for categories like colors or city names, because the model will think Red (2) > Blue (0), which is meaningless!
from sklearn.preprocessing import LabelEncoder
sizes = ["Small", "Large", "Medium", "Small", "Large"]
label_encoder = LabelEncoder()
encoded_sizes = label_encoder.fit_transform(sizes)
print("Original:", sizes)
print("Encoded:", encoded_sizes)
print("Classes:", label_encoder.classes_)
Output:
Original: ['Small', 'Large', 'Medium', 'Small', 'Large']
Encoded: [2 0 1 2 0]
Classes: ['Large' 'Medium' 'Small']
Notice that LabelEncoder assigned the numbers alphabetically (Large=0, Medium=1, Small=2), not in size order.
If you want Small=0, Medium=1, Large=2, you can map it manually:
size_order = {'Small': 0, 'Medium': 1, 'Large': 2}
encoded_sizes = [size_order[size] for size in sizes]  # [0, 2, 1, 0, 2]
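As an alternative to a hand-written dictionary, here is a minimal sketch using scikit-learn’s OrdinalEncoder, which accepts an explicit category order. Note that this is a different encoder from LabelEncoder, and the variable names are only illustrative. It expects 2D input, so each value is wrapped in its own list.

```python
from sklearn.preprocessing import OrdinalEncoder

# One feature column, so each row is a one-element list.
sizes = [["Small"], ["Large"], ["Medium"], ["Small"], ["Large"]]

# The explicit category list sets the order: Small=0, Medium=1, Large=2.
ordinal_encoder = OrdinalEncoder(categories=[["Small", "Medium", "Large"]])
encoded_sizes = ordinal_encoder.fit_transform(sizes)

# Flatten the (n_samples, 1) float array into a plain list of ints.
encoded_list = encoded_sizes.ravel().astype(int).tolist()
print("Encoded:", encoded_list)
```

With the order stated up front, the numbers follow the real ranking of the sizes instead of the alphabet.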
3. One-Hot Encoding
One-Hot Encoding creates separate columns for each category and marks them with 0 or 1.
For example, for a Color column:
Red → [1, 0, 0]
Blue → [0, 1, 0]
Green → [0, 0, 1]
This way, no category is "greater" or "less" than another.
When Should You Use One-Hot Encoding?
Use it when the categories have no order, like colors, cities, or animal names.
It avoids the "fake ranking" problem that LabelEncoder might create.
from sklearn.preprocessing import OneHotEncoder
import numpy as np
colors = np.array(["Red", "Blue", "Green", "Blue", "Red"]).reshape(-1, 1)
onehot_encoder = OneHotEncoder(sparse_output=False)
encoded_colors = onehot_encoder.fit_transform(colors)
print("Original:", colors.flatten())
print("Encoded:\n", encoded_colors)
print("Categories:", onehot_encoder.categories_)
Output:
Original: ['Red' 'Blue' 'Green' 'Blue' 'Red']
Encoded:
[[0. 0. 1.]
[1. 0. 0.]
[0. 1. 0.]
[1. 0. 0.]
[0. 0. 1.]]
Categories: [array(['Blue', 'Green', 'Red'], dtype=object)]
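To see why one-hot vectors avoid a fake ranking, the sketch below compares distances between categories under both encodings. The code values are written out by hand (matching the alphabetical Blue, Green, Red order) and the helper name `pairwise_distances` is hypothetical; it only uses the Python standard library.

```python
import math

# Hand-written encodings for three unordered colors (illustrative values).
label_codes = {"Blue": [0], "Green": [1], "Red": [2]}
onehot_codes = {"Blue": [1, 0, 0], "Green": [0, 1, 0], "Red": [0, 0, 1]}

def pairwise_distances(codes):
    """Euclidean distance between every pair of encoded categories."""
    names = sorted(codes)
    return {
        (a, b): math.dist(codes[a], codes[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }

label_dist = pairwise_distances(label_codes)
onehot_dist = pairwise_distances(onehot_codes)

print("Label codes:  ", label_dist)
print("One-hot codes:", onehot_dist)
```

With label codes, Blue ends up twice as far from Red as from Green, which says nothing true about colors. With one-hot vectors, every pair is the same distance apart.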
4. LabelEncoder vs OneHotEncoder
Think of it like this:
LabelEncoder says: I’ll give each category a number. You figure out what it means.
OneHotEncoder says: I’ll give each category its own column so no one feels more important than the other.
Aspect | LabelEncoder | OneHotEncoder |
---|---|---|
Best for | Ordered categories | Unordered categories |
Output | Single column (0, 1, 2...) | Multiple columns (0/1) |
Risk | Fake order if used on unordered labels | No fake order, but one extra column per category |
Example | Small < Medium < Large | Red, Blue, Green |
5. What About 2 Categories?
If you only have 2 categories, LabelEncoder is fine because it will just give 0 and 1.
Example:
from sklearn.preprocessing import LabelEncoder
binary = ["Yes", "No", "Yes", "No"]
encoder = LabelEncoder()
encoded = encoder.fit_transform(binary)
print("Encoded:", encoded) # [1 0 1 0]
print("Classes:", encoder.classes_) # ['No' 'Yes']
Here:
No → 0
Yes → 1
6. Things to Keep in Mind
Use LabelEncoder when your data has a natural order (e.g., Small < Medium < Large).
Use OneHotEncoder when your data has no order (e.g., colors, cities).
For 2 categories, LabelEncoder automatically uses 0 and 1.
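The order of rows in your CSV does not matter: LabelEncoder sorts the unique categories before assigning numbers, so the mapping is the same no matter which row comes first.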
Think of LabelEncoder as the tool for ranking categories that have order.
Think of OneHotEncoder as the tool for naming categories when order doesn’t exist.
If you mix them up (like using LabelEncoder on colors), your model might get confused and make bad predictions.
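The point about row order can be checked with a quick sketch. The two lists below are made-up stand-ins for the same column read from differently sorted CSV files.

```python
from sklearn.preprocessing import LabelEncoder

rows_a = ["Small", "Large", "Medium"]
rows_b = ["Medium", "Small", "Large"]  # same categories, different row order

# classes_ holds the sorted unique categories; the index is the code.
classes_a = LabelEncoder().fit(rows_a).classes_.tolist()
classes_b = LabelEncoder().fit(rows_b).classes_.tolist()

print(classes_a)
print(classes_b)
```

Both fits learn the same classes in the same order, so each category gets the same number either way.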