DEV Community

Musungu (Ruth) Ambogo

Encoding in Machine Learning Explained

Introduction

In machine learning, data preprocessing plays a critical role in building accurate and reliable models. Since most machine learning algorithms work with numerical data, categorical values must first be transformed into a numerical format.

This process is known as encoding, a key preprocessing technique used to prepare categorical data for machine learning models.

Without encoding, most algorithms cannot interpret categorical values such as names, colors, or payment methods.

Types of Data

Before looking at encoding techniques, it’s important to distinguish between two types of categorical data:

  1. Ordinal data - Categorical data that has a natural order or ranking (e.g., size: small < medium < large).
  2. Nominal data - Categorical data with no inherent order (e.g., color: red, yellow, green).
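
One way to make this distinction explicit in code is the pandas categorical type, which can be declared as ordered or unordered. The snippet below is a minimal sketch; the variable names and values are illustrative only.

```python
import pandas as pd

# Ordinal: the categories have a meaningful order (small < medium < large).
sizes = pd.Categorical(
    ["small", "large", "medium"],
    categories=["small", "medium", "large"],
    ordered=True,
)

# Nominal: the categories are only labels, so no order is declared.
colors = pd.Categorical(["red", "yellow", "green"], ordered=False)

print(sizes.ordered)           # True
print(colors.ordered)          # False
print(sizes.min(), sizes.max())  # smallest and largest category by rank
print(sizes.codes.tolist())    # position of each value in `categories`
```

Comparisons such as min and max are only defined for the ordered case, which mirrors the difference between ordinal and nominal data.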

Types of Encoding

  1. Label Encoding (also called ordinal encoding when applied to features) - is used for ordinal data. Each category is assigned a unique integer that reflects its order or ranking.

Example:

Size     Encoded value
Small    0
Medium   1
Large    2
import pandas as pd

df_size = pd.DataFrame({
    'size': ['small', 'medium', 'large']
})

# Map each category to an integer that reflects its rank
size_map = {'small': 0, 'medium': 1, 'large': 2}
df_size['size_encoded'] = df_size['size'].map(size_map)
print(df_size)

Output

     size  size_encoded
0   small             0
1  medium             1
2   large             2
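
As an alternative to a hand-written dictionary, scikit-learn's `OrdinalEncoder` can apply the same mapping. This is a sketch rather than the approach used above, and it assumes scikit-learn is installed. The `categories` argument matters here: it lists the values in rank order.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df_size = pd.DataFrame({"size": ["small", "medium", "large"]})

# Pass the categories in rank order. Without this argument, scikit-learn
# sorts them alphabetically (large, medium, small) and the ranking is lost.
encoder = OrdinalEncoder(categories=[["small", "medium", "large"]])
encoded = encoder.fit_transform(df_size[["size"]])

# fit_transform returns a 2-D float array, so flatten it and cast to int.
df_size["size_encoded"] = encoded.ravel().astype(int)

print(df_size)
```

A fitted encoder can be reused on new data with `encoder.transform`, which keeps the mapping consistent between training and prediction.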
  2. One-Hot Encoding - is used for nominal categorical data.

It converts each category into a separate binary feature column, where 1 indicates the presence of the category and 0 indicates its absence.

Example

payment_method   card   cash   mpesa
card             1      0      0
cash             0      1      0
mpesa            0      0      1
import pandas as pd
df=pd.DataFrame({
    'payment_method':['mpesa', 'cash', 'card']
})
from sklearn.preprocessing import OneHotEncoder
oe = OneHotEncoder()
payment_encoded = oe.fit_transform(df[['payment_method']])

# Convert the sparse result to a DataFrame with readable column names
encoded_df = pd.DataFrame(
    payment_encoded.toarray(),
    columns=oe.get_feature_names_out(['payment_method'])
)

print(encoded_df)

Output

   payment_method_card  payment_method_cash  payment_method_mpesa
0                  0.0                  0.0                   1.0
1                  0.0                  1.0                   0.0
2                  1.0                  0.0                   0.0
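
For quick exploration, pandas offers `get_dummies`, which produces the same kind of binary columns without scikit-learn. The following is a minimal sketch on the same toy data, not a replacement for the encoder above.

```python
import pandas as pd

df = pd.DataFrame({"payment_method": ["mpesa", "cash", "card"]})

# One binary column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(df, columns=["payment_method"], dtype=int)

print(dummies)
```

Unlike a fitted `OneHotEncoder`, `get_dummies` does not remember which categories it saw, so a fitted encoder is usually the safer choice when the same encoding has to be applied to new data later.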
  3. Target Encoding - is a technique used for categorical variables with high cardinality (many unique categories). Instead of creating one dummy column per category, it replaces each category with the mean of the target variable for that category.

Example

Suppose we want to predict house prices across neighborhoods in Nairobi:

Neighborhood   Price
downtown       500k
downtown       600k
downtown       700k
uptown         1000k
uptown         1200k
suburbs        200k
suburbs        300k

Step 1: Compute mean target per category

Neighborhood   Mean target
downtown       600k
uptown         1100k
suburbs        250k

Step 2: Replace categories

Neighborhood   Encoded value
downtown       600k
downtown       600k
downtown       600k
uptown         1100k
uptown         1100k
suburbs        250k
suburbs        250k
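
The two steps can be sketched in pandas as follows. The column names `price_k` and `neighborhood_encoded` are assumptions made for this example, with prices expressed in thousands.

```python
import pandas as pd

df_house = pd.DataFrame({
    "neighborhood": ["downtown", "downtown", "downtown",
                     "uptown", "uptown", "suburbs", "suburbs"],
    "price_k": [500, 600, 700, 1000, 1200, 200, 300],  # prices in thousands
})

# Step 1: compute the mean target per category.
mean_price = df_house.groupby("neighborhood")["price_k"].mean()

# Step 2: replace each category with its mean.
df_house["neighborhood_encoded"] = df_house["neighborhood"].map(mean_price)

print(df_house)
```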

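Note: Target encoding must be handled carefully to avoid data leakage. Because the encoded value is derived from the target itself, computing it over rows that are later used for evaluation feeds information about the answer into the features.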
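
A minimal sketch of one way to limit leakage: learn the category means from the training rows only, then apply them to new rows. The train/test split, the `price_k` column, and the extra neighborhood `westlands` are hypothetical and exist only to show how an unseen category can be handled.

```python
import pandas as pd

# Hypothetical training rows (prices in thousands).
train = pd.DataFrame({
    "neighborhood": ["downtown", "downtown", "uptown", "suburbs"],
    "price_k": [500, 700, 1000, 200],
})

# Hypothetical new rows; the target is unknown here.
test = pd.DataFrame({"neighborhood": ["downtown", "uptown", "westlands"]})

# Learn the category means from the training rows only.
train_means = train.groupby("neighborhood")["price_k"].mean()
global_mean = train["price_k"].mean()

# Apply the learned means to the new rows. Categories that never appeared
# in training fall back to the overall training mean.
test["neighborhood_encoded"] = (
    test["neighborhood"].map(train_means).fillna(global_mean)
)

print(test)
```

More robust variants, such as computing the means out-of-fold or smoothing them toward the global mean, follow the same principle of never encoding a row with its own target value.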

Conclusion

Encoding is a crucial step in data preprocessing that allows machine learning models to work with categorical data.
Choosing the right encoding technique depends on the type of data and the problem you are solving.
