Musungu (Ruth) Ambogo

Posted on May 27

Encoding in Machine Learning Explained

#beginners #datascience #machinelearning #tutorial

Introduction

In machine learning, data preprocessing plays a critical role in building accurate and reliable models. Since most machine learning algorithms work with numerical data, categorical values must first be transformed into a numerical format.

This process is known as encoding, a key preprocessing technique used to prepare categorical data for machine learning models.

Without encoding, most algorithms cannot interpret categorical values such as names, colors, or payment methods.

Types of data

Before understanding encoding, it's important to distinguish between two types of categorical data:

Ordinal data - Categorical data that has a natural order or ranking (e.g., size: small < medium < large).
Nominal data - Categorical data with no inherent order (e.g., color: red, yellow, green).

Types of Encoding

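Label Encoding - Used for ordinal data: each category is assigned a unique integer that reflects its position in the ranking. The mapping should be defined explicitly, because encoders that assign integers automatically (such as scikit-learn's LabelEncoder) number categories in sorted order rather than by rank.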

Example:

Size	Encoded value
Small	0
Medium	1
Large	2

import pandas as pd

df_size = pd.DataFrame({
    'size': ['small', 'medium', 'large']
})

# Map each category to an integer that preserves the ranking
size_map = {'small': 0, 'medium': 1, 'large': 2}
df_size['size_encoded'] = df_size['size'].map(size_map)
print(df_size)

Output

     size  size_encoded
0   small             0
1  medium             1
2   large             2
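As an alternative to a hand-written dictionary, the same ranking can be expressed with an ordered pandas categorical. This is only a sketch: the `size_order` list and the repeated 'medium' row are assumptions added for illustration, not part of the example above.

```python
import pandas as pd

df_size = pd.DataFrame({'size': ['small', 'medium', 'large', 'medium']})

# Assumed ranking, listed from lowest to highest
size_order = ['small', 'medium', 'large']

# An ordered categorical stores the ranking alongside the values
df_size['size'] = pd.Categorical(df_size['size'], categories=size_order, ordered=True)

# .cat.codes returns each value's position in size_order
df_size['size_encoded'] = df_size['size'].cat.codes

print(df_size)
```

A value that is missing from `size_order` becomes NaN in the categorical column and gets the code -1, so it is worth checking for -1 before training.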

One-Hot Encoding - Used for nominal categorical data.

It converts each category into a separate binary feature column, where 1 indicates the presence of the category and 0 indicates its absence.

Example

payment_method	mpesa	cash	card
mpesa	1	0	0
cash	0	1	0
card	0	0	1

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({
    'payment_method': ['mpesa', 'cash', 'card']
})

# Fit the encoder and transform the column; the result is a sparse matrix
oe = OneHotEncoder()
payment_encoded = oe.fit_transform(df[['payment_method']])

# Convert to a dense DataFrame with readable column names
encoded_df = pd.DataFrame(
    payment_encoded.toarray(),
    columns=oe.get_feature_names_out(['payment_method'])
)

print(encoded_df)

Output

   payment_method_card  payment_method_cash  payment_method_mpesa
0                  0.0                  0.0                   1.0
1                  0.0                  1.0                   0.0
2                  1.0                  0.0                   0.0
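For quick exploration, pandas can produce the same columns without scikit-learn. The sketch below uses `pd.get_dummies`; the variable name `dummies` is chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'payment_method': ['mpesa', 'cash', 'card']})

# One binary column per category; dtype=int gives 0/1 instead of booleans
dummies = pd.get_dummies(df['payment_method'], prefix='payment_method', dtype=int)

print(dummies)
```

Unlike a fitted OneHotEncoder, `get_dummies` does not remember which categories it saw, so it can produce different columns on new data. That makes the fitted encoder the safer choice inside a training pipeline.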

Target Encoding - Used for categorical variables with high cardinality (many unique categories). Instead of creating one dummy column per category, it replaces each category with the mean of the target variable for that category.

Example

Suppose we want to predict house prices across neighborhoods in Nairobi:

Neighborhood	Price
downtown	500k
downtown	600k
downtown	700k
uptown	1000k
uptown	1200k
suburbs	200k
suburbs	300k

Step 1: Compute mean target per category

Neighborhood	Mean price
downtown	600k
uptown	1100k
suburbs	250k

Step 2: Replace categories

neighborhood (encoded)

600k
600k
600k
1100k
1100k
250k
250k
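The two steps can be sketched in pandas as follows. The DataFrame name `df_house` and the column names are assumptions made for this illustration, and prices are written in thousands to match the table.

```python
import pandas as pd

# Prices in thousands, copied from the table above
df_house = pd.DataFrame({
    'neighborhood': ['downtown', 'downtown', 'downtown',
                     'uptown', 'uptown', 'suburbs', 'suburbs'],
    'price': [500, 600, 700, 1000, 1200, 200, 300],
})

# Step 1: compute the mean target per category
mean_price = df_house.groupby('neighborhood')['price'].mean()

# Step 2: replace each category with its mean
df_house['neighborhood_encoded'] = df_house['neighborhood'].map(mean_price)

print(df_house)
```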

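Note: Target encoding must be handled carefully to avoid data leakage. Because the encoded value is computed from the target itself, the category means should be calculated on the training data only and then applied to validation and test rows. Computing them on the full dataset lets information about the target leak into the features.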
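One common way to reduce leakage is to learn the category means from the training rows and reuse them on unseen rows. The sketch below assumes a hypothetical train/test split and a hypothetical unseen category ('westlands'); falling back to the overall training mean for unseen categories is one reasonable choice, not the only one.

```python
import pandas as pd

# Hypothetical training rows (prices in thousands)
train = pd.DataFrame({
    'neighborhood': ['downtown', 'downtown', 'uptown', 'uptown', 'suburbs'],
    'price': [500, 600, 1000, 1200, 200],
})

# Hypothetical rows to predict; 'westlands' never appears in training
test = pd.DataFrame({'neighborhood': ['downtown', 'uptown', 'westlands']})

# Learn the encoding from the training data only
category_means = train.groupby('neighborhood')['price'].mean()
global_mean = train['price'].mean()

# Apply the learned means; unseen categories fall back to the global mean
test['neighborhood_encoded'] = (
    test['neighborhood'].map(category_means).fillna(global_mean)
)

print(test)
```

The test rows never contribute to the means, so their encoded values carry no information about their own prices.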

Conclusion

Encoding is a crucial step in data preprocessing that allows machine learning models to work with categorical data.
Choosing the right encoding technique depends on the type of data and the problem you are solving.
