Introduction
In machine learning, data preprocessing plays a critical role in building accurate and reliable models. Since most machine learning algorithms work with numerical data, categorical values must first be transformed into a numerical format.
This process is known as encoding, a key preprocessing technique used to prepare categorical data for machine learning models.
Without encoding, most algorithms cannot interpret categorical values such as names, colors, or payment methods.
Types of data
Before looking at encoding, it's important to distinguish between two types of categorical data:
- Ordinal data - Categorical data that has a natural order or ranking (e.g., size: small < medium < large).
- Nominal data - Categorical data with no inherent order (e.g., color: red, yellow, green).
Types of Encoding
- Label Encoding - Used for ordinal data. Each category is assigned a unique integer that reflects its order or ranking.
Example:
| Size | Encoded value |
|---|---|
| Small | 0 |
| Medium | 1 |
| Large | 2 |
import pandas as pd

df_size = pd.DataFrame({
    'size': ['small', 'medium', 'large']
})
# Explicit mapping so the integers follow the natural order of the sizes
size_map = {'small': 0, 'medium': 1, 'large': 2}
df_size['size_encoded'] = df_size['size'].map(size_map)
print(df_size)
Output
size size_encoded
0 small 0
1 medium 1
2 large 2
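As an alternative to a hand-written dictionary, the same ordering can be declared with an ordered pandas categorical. This is only a minimal sketch: the `size_order` list and the extra sample rows are assumptions added for illustration, not part of the example above.

```python
import pandas as pd

# Illustrative data; the rows are deliberately not in sorted order.
df_size = pd.DataFrame({"size": ["medium", "small", "large", "small"]})

# Declare the ranking once; each code is the position of the value in this list.
size_order = ["small", "medium", "large"]
ordered_sizes = pd.Categorical(df_size["size"], categories=size_order, ordered=True)

df_size["size_encoded"] = ordered_sizes.codes
print(df_size)
```

A value that does not appear in `size_order` receives the code -1, which makes unexpected categories easy to spot.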
- One-Hot Encoding - Used for nominal categorical data. It converts each category into a separate binary feature column, where 1 indicates the presence of the category and 0 indicates its absence.
Example:
| payment_method | mpesa | cash | card |
|---|---|---|---|
| card | 0 | 0 | 1 |
| cash | 0 | 1 | 0 |
| mpesa | 1 | 0 | 0 |
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'payment_method': ['mpesa', 'cash', 'card']
})
# fit_transform returns a sparse matrix by default
oe = OneHotEncoder()
payment_encoded = oe.fit_transform(df[['payment_method']])
# Convert to a dense array and label the columns for printing
encoded_df = pd.DataFrame(
    payment_encoded.toarray(),
    columns=oe.get_feature_names_out(['payment_method'])
)
print(encoded_df)
Output
payment_method_card payment_method_cash payment_method_mpesa
0 0.0 0.0 1.0
1 0.0 1.0 0.0
2 1.0 0.0 0.0
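For quick exploration, the same result can be produced with pandas alone. The sketch below uses `pd.get_dummies` on the same three payment methods. It is a convenience shown for comparison, not a replacement for the scikit-learn encoder above, which remembers the categories it was fitted on.

```python
import pandas as pd

df = pd.DataFrame({
    "payment_method": ["mpesa", "cash", "card"]
})

# get_dummies creates one column per category, named <column>_<category>.
# dtype=int requests 0/1 integers instead of booleans.
dummies = pd.get_dummies(df, columns=["payment_method"], dtype=int)
print(dummies)
```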
- Target Encoding - A technique for categorical variables with high cardinality (many unique categories). Instead of creating many dummy columns, it replaces each category with the mean of the target variable for that category.
Example:
Suppose we want to predict house prices across neighborhoods in Nairobi:
| Neighborhood | Price |
|---|---|
| downtown | 500k |
| downtown | 600k |
| downtown | 700k |
| uptown | 1000k |
| uptown | 1200k |
| suburbs | 200k |
| suburbs | 300k |
Step 1: Compute mean target per category
| neighborhood | Mean Target |
|---|---|
| downtown | 600k |
| uptown | 1100k |
| suburbs | 250k |
Step 2: Replace categories
| neighborhood (encoded) |
|---|
| 600k |
| 600k |
| 600k |
| 1100k |
| 1100k |
| 250k |
| 250k |
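The two steps can be sketched in pandas as follows. The data is copied from the neighborhood table, with prices written in thousands. The names `df_houses` and `mean_price` are illustrative assumptions.

```python
import pandas as pd

# Toy data from the neighborhood example; prices are in thousands.
df_houses = pd.DataFrame({
    "neighborhood": ["downtown", "downtown", "downtown",
                     "uptown", "uptown", "suburbs", "suburbs"],
    "price": [500, 600, 700, 1000, 1200, 200, 300],
})

# Step 1: compute the mean target per category.
mean_price = df_houses.groupby("neighborhood")["price"].mean()

# Step 2: replace each category with its mean.
df_houses["neighborhood_encoded"] = df_houses["neighborhood"].map(mean_price)
print(df_houses)
```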
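Note: Target encoding must be handled carefully to avoid data leakage. The category means should be computed from the training data only, otherwise information about the target leaks into the features.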
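One simple way to reduce leakage is to learn the category means on the training rows and only apply them to unseen rows. The sketch below assumes a hypothetical train/test split and a made-up unseen category (`westlands`). Falling back to the global training mean for unseen categories is one common choice, not the only one.

```python
import pandas as pd

# Hypothetical training rows (prices in thousands) and unseen rows to encode.
train = pd.DataFrame({
    "neighborhood": ["downtown", "downtown", "uptown", "uptown", "suburbs"],
    "price": [500, 700, 1000, 1200, 200],
})
test = pd.DataFrame({"neighborhood": ["downtown", "suburbs", "westlands"]})

# Learn the encoding from the training rows only.
train_means = train.groupby("neighborhood")["price"].mean()
global_mean = train["price"].mean()

# Apply it to the unseen rows; categories never seen in training
# fall back to the global training mean.
test["neighborhood_encoded"] = test["neighborhood"].map(train_means).fillna(global_mean)
print(test)
```

Within the training set itself, each row's encoding still includes its own target value. Computing the means out-of-fold (for example with cross-validation) is a frequently used way to limit that effect.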
Conclusion
Encoding is a crucial step in data preprocessing that allows machine learning models to work with categorical data.
Choosing the right encoding technique depends on the type of data and the problem you are solving.