Different Encoding Methods for your Dataset.

Hey there, data enthusiasts! 🎀

In the exciting world of data science and machine learning, one of the first and most crucial steps is turning raw data into a format that our models can understand and learn from. This process, called data preprocessing, involves several important steps:

  1. Data Cleaning: Remove noise and inconsistent data. Say a feature has 80% null values. Would you still keep it? What about one with 20% null values? Those can easily be filled with statistics such as the mean for numerical features or the mode for categorical ones (see the sketch after this list).
  2. Data Integration: Combine multiple data sources for better predictions, e.g. combining a driver's medical record with race and season data to predict their position in an F1 race. The health data alone may not help much, but using it as a weight on previous race positions can drastically increase its importance!
  3. Data Selection: Select the important and useful data. Try feature engineering to get the best features for your model.
  4. Data Transformation: Transform and consolidate the data for mining through encodings and feature engineering. I consider this the most important step before data mining, since without encoding, data mining is useless and unhelpful.
  5. Data Mining: Apply intelligent methods to extract data patterns, i.e. the extraction of implicit, previously unknown, and potentially useful information from data. E.g. using the race year and a driver's date of birth to derive the driver's age, providing a new insight while removing two columns from the model.
  6. Pattern Evaluation: Identify the truly interesting patterns using various evaluation metrics.
  7. Knowledge Presentation: Create graphs and statistics such as charts, heatmaps, and much more. Understand your data and improve it wherever needed using the steps above.
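As a quick illustration of step 1, here is a minimal sketch of how such null handling might look in pandas (the column names and data are made up for illustration):

import pandas as pd
import numpy as np

# Hypothetical dataset: 'score' is numerical, 'team' is categorical
df = pd.DataFrame({
    "score": [10.0, np.nan, 8.0, 12.0],
    "team": ["red", "blue", None, "blue"],
    "mostly_empty": [None, None, None, 1]  # mostly null (3 of 4 values)
})

# A feature that is mostly null rarely carries useful signal; dropping it is common
df = df.drop(columns=["mostly_empty"])

# Fill numerical nulls with the mean, categorical nulls with the mode
df["score"] = df["score"].fillna(df["score"].mean())
df["team"] = df["team"].fillna(df["team"].mode()[0])
print(df)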

Central to this preprocessing is the task of encoding. This blog delves into the various encoding methodologies, providing a comprehensive analysis of them.

Importance of Encoding

Encoding is a crucial step in the data preprocessing pipeline, especially when dealing with categorical data. Categorical variables, which represent data that can be divided into specific groups or categories, often need to be converted into a numerical format for machine learning algorithms to process them effectively. This conversion process is known as encoding. Machine learning models typically require numerical input because they are based on mathematical calculations that cannot interpret categorical data directly. By transforming categorical data into numerical values through various encoding techniques, we can ensure that our models can leverage all available information, leading to better performance and more accurate predictions. Encoding not only makes data suitable for analysis but also helps preserve the relationships and characteristics inherent in the original categorical variables.

Prerequisites

No sane person codes on paper; he who codes on paper has mastered the essence of coding, or the truth behind the universe itself. - ME🎀

Install the following required Python libraries:

pip install scikit-learn pandas category_encoders

Different datasets require different encoding methods, so a different example dataset may be used for each encoding method.

Types of Encoding

While there are many encoding methods, we will focus on the most important and widely used ones.

  1. Multi-Hot Encoding
  2. Label Encoding
  3. Ordinal Encoding
  4. Binary Encoding
  5. Target Encoding
  6. Frequency Encoding

Multi-Hot Encoding

This method converts categorical data into binary vectors. Each set of categorical values is mapped to a binary vector of length equal to the number of categories. This method is usually used in classification models.

Example: Imagine you have a dataset of music tracks.

| Name | Artist | Genre |
| --- | --- | --- |
| Fly Me to the Moon | The Macarons Project | ["slow", "acoustic", "pop"] |
| Mad at Disney | Salem ilese | ["dance", "pop"] |

Here, the genre is the feature we need to encode, since passing an array of genre names directly to the model would be ineffective.

from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Creating the dataframe with list of genres per song
df = pd.DataFrame({
    "name": ["Fly Me to the Moon", "Mad at Disney"],
    "artist": ["The Macarons Project", "Salem ilese"],
    "genre": [["slow", "acoustic", "pop"], ["dance", "pop"]]
})

# Using MultiLabelBinarizer to handle the list of genres
mlb = MultiLabelBinarizer()
x_encoded = mlb.fit_transform(df["genre"])

# Creating the encoded dataframe
encoded_df = pd.DataFrame(x_encoded, columns=mlb.classes_)

# Concatenating the original columns with the encoded genres
df_final = pd.concat([df.drop(columns=["genre"]), encoded_df], axis=1)
print(df_final)
                 name                artist  acoustic  dance  pop  slow
0  Fly Me to the Moon  The Macarons Project         1      0    1     1
1       Mad at Disney           Salem ilese         0      1    1     0

The data is now encoded per genre, where 1 means HOT (present) and 0 means COLD (absent). A similar approach is One-Hot Encoding, used when each row has exactly one category, though Binary Encoding or Label Encoding is often the better choice in those cases.
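For comparison, here is a minimal sketch of One-Hot Encoding for the case where each row has exactly one category (the single-genre column below is made up for illustration):

import pandas as pd

# Hypothetical single-genre version of the data
df = pd.DataFrame({
    "name": ["Fly Me to the Moon", "Mad at Disney"],
    "genre": ["pop", "dance"]
})

# One binary column per category; each row has exactly one 1
one_hot = pd.get_dummies(df["genre"], prefix="genre", dtype=int)
df_final = pd.concat([df.drop(columns=["genre"]), one_hot], axis=1)
print(df_final)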

Label Encoding

This method converts each categorical value into a numerical value.

It is similar to multi-hot encoding in a way. The key difference is that Label Encoding might inadvertently introduce ordinal relationships where none exist, which can mislead some algorithms; multi-hot encoding avoids this by treating each category independently.

Example: A company sells shirts of different sizes and colours at various prices.

| Colour | Size | Company | Price |
| --- | --- | --- | --- |
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |

where we need to encode all three columns: Colour, Size, and Company. We will use Label Encoding here, since the numeric ordering it introduces can act as a mild bias that helps the model predict with better accuracy.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour', 'Size', and 'Company'
label_encoder = LabelEncoder()
df['Colour_encoded'] = label_encoder.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder.fit_transform(df['Size'])
df['Company_encoded'] = label_encoder.fit_transform(df['Company'])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
   Price  Colour_encoded  Size_encoded  Company_encoded
0    300               2             0                2
1    230               0             1                0
2    568               2             2                3
3    927               1             1                1

By default, the numerical values are assigned by sorting the categories (alphabetically or numerically). If we want to intentionally give the categories a preference order instead, we should look into Ordinal Encoding.

Ordinal Encoding

Similar to Label Encoding, with the only difference that we ourselves provide a specific order of importance for the categories (unlike how the label encoder sorts all categories before numbering them).

Example: Building on the Label Encoding example, the Company column deserves an explicit preference order, since we know companies like Gucci or Zara sell T-shirts at expensive prices.

| Colour | Size | Company | Price |
| --- | --- | --- | --- |
| red | L | Max | 300 |
| blue | S | ACM | 230 |
| red | XL | Zara | 568 |
| green | S | Gucci | 927 |

Let's use ["ACM", "Max", "Zara", "Gucci"] as our order of cheap to expensive T-shirts.

import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Creating the dataframe
df = pd.DataFrame({
    'Colour': ['red', 'blue', 'red', 'green'],
    'Size': ['L', 'S', 'XL', 'S'],
    'Company': ['Max', 'ACM', 'Zara', 'Gucci'],
    'Price': [300, 230, 568, 927]
})

# Label Encoding for 'Colour' and 'Size'
label_encoder_colour = LabelEncoder()
label_encoder_size = LabelEncoder()

df['Colour_encoded'] = label_encoder_colour.fit_transform(df['Colour'])
df['Size_encoded'] = label_encoder_size.fit_transform(df['Size'])

# Ordinal Encoding for 'Company' with the specified reversed order
company_order = ["ACM", "Max", "Zara", "Gucci"]
ordinal_encoder = OrdinalEncoder(categories=[company_order])

df['Company_encoded'] = ordinal_encoder.fit_transform(df[['Company']])

# Drop the original categorical columns after encoding
df_final = df.drop(columns=['Colour', 'Size', 'Company'])
print(df_final)
   Price  Colour_encoded  Size_encoded  Company_encoded
0    300               2             0              1.0
1    230               0             1              0.0
2    568               2             2              2.0
3    927               1             1              3.0

This adds a bias to the model based on the company name.

Binary Encoding

This method converts each categorical value into binary digits (0s and 1s) and stores each digit in a separate column. It is useful when you have many categories and want to reduce dimensionality compared to multi-hot encoding.

Each category is first mapped to an integer and then to its binary code, whose digits are split into separate columns. This results in roughly log2(N) columns, whereas multi-hot encoding would produce N columns.
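To make the dimensionality difference concrete, here is a small sketch comparing column counts on a made-up column with 8 distinct cities (the exact binary column count depends on the library's internal ordinal mapping):

import pandas as pd
from category_encoders import BinaryEncoder

# Hypothetical column with 8 distinct categories
cities = pd.DataFrame({"City": [f"City_{i}" for i in range(8)]})

one_hot_cols = pd.get_dummies(cities["City"]).shape[1]
binary_cols = BinaryEncoder(cols=["City"]).fit_transform(cities).shape[1]

print(one_hot_cols, binary_cols)  # 8 one-hot columns vs ~4 binary columns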

Example: Encoding just the Colours into something suitable.

| Colour |
| --- |
| Red |
| Green |
| Blue |
| Red |

import pandas as pd
from category_encoders import BinaryEncoder

# Sample data
data = pd.DataFrame({'Colour': ['Red', 'Green', 'Blue', 'Red']})

# Create a BinaryEncoder object
encoder = BinaryEncoder(cols=['Colour'])

# Encode the categorical feature
encoded_data = encoder.fit_transform(data)
print(encoded_data)
   Colour_0  Colour_1
0         0         1
1         1         0
2         1         1
3         0         1

Most of the time, if the number of categories is small, we should use multi-hot encoding or label encoding instead.

Target Encoding

Also known as Mean Encoding or Likelihood Encoding. This method encodes categorical values by replacing each category with a statistic of the target variable within that category.

Highly recommended and very useful for handling high-cardinality categorical variables. It captures the relationship between the categorical variable and the target variable more effectively than one-hot encoding.

Formula:

\text{Encoding Value} = \frac{(n \times \text{Category Mean}) + (m \times \text{Global Mean})}{n + m}

here:

  • n: number of samples in the category.
  • m: smoothing parameter (weight given to the global mean).

Example: In a house price prediction model, encoding neighborhood names with the mean house price in each area provides more insight than plain label encoding.

| House Number | Price | Neighborhood | Size (sq meter) |
| --- | --- | --- | --- |
| 1 | 500000 | Downtown | 200 |
| 2 | 350000 | Suburb | 150 |
| 3 | 700000 | City Center | 300 |
| 4 | 450000 | Suburb | 180 |
| 5 | 600000 | Downtown | 250 |

import pandas as pd

# Original dataset
data = {
    'House Number': [1, 2, 3, 4, 5],
    'Price': [500000, 350000, 700000, 450000, 600000],
    'Neighborhood': ['Downtown', 'Suburb', 'City Center', 'Suburb', 'Downtown'],
    'Size (sq meter)': [200, 150, 300, 180, 250]
}

df = pd.DataFrame(data)

# Calculate mean price for each neighborhood
neighborhood_means = df.groupby('Neighborhood')['Price'].mean().to_dict()

# Map mean prices back to the original dataset
df['Neighborhood'] = df['Neighborhood'].map(neighborhood_means)
# Display the encoded dataset
print(df)
   House Number   Price  Neighborhood  Size (sq meter)
0             1  500000      550000.0              200
1             2  350000      400000.0              150
2             3  700000      700000.0              300
3             4  450000      400000.0              180
4             5  600000      550000.0              250
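Note that the snippet above uses the plain per-category mean, i.e. the formula with m = 0. Below is a minimal sketch of the smoothed version; the value m = 2 is an arbitrary illustrative choice:

import pandas as pd

# Rebuild the original columns (before the mean-mapping above)
df = pd.DataFrame({
    'Price': [500000, 350000, 700000, 450000, 600000],
    'Neighborhood': ['Downtown', 'Suburb', 'City Center', 'Suburb', 'Downtown']
})

m = 2  # smoothing parameter (illustrative choice)
global_mean = df['Price'].mean()

# Per-category sample count (n) and mean, then the smoothing formula
stats = df.groupby('Neighborhood')['Price'].agg(['count', 'mean'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)

df['Neighborhood_encoded'] = df['Neighborhood'].map(smoothed)
print(df)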

Frequency Encoding

This method replaces each categorical value with its frequency or count within the training dataset.

Formula:

\text{Frequency}(\text{category}) = \frac{\text{Count}(\text{category})}{\text{Total observations}}

Example: Encoding cities based on the number of times each city appears in the dataset.

| Transaction ID | Amount | City | Product Category |
| --- | --- | --- | --- |
| 1 | 100 | New York | Electronics |
| 2 | 200 | Los Angeles | Clothing |
| 3 | 150 | Chicago | Electronics |
| 4 | 300 | New York | Groceries |
| 5 | 250 | Chicago | Clothing |

import pandas as pd

# Example dataset with customer transactions
data = {
    'Transaction ID': [1, 2, 3, 4, 5],
    'Amount': [100, 200, 150, 300, 250],
    'City': ['New York', 'Los Angeles', 'Chicago', 'New York', 'Chicago'],
    'Product Category': ['Electronics', 'Clothing', 'Electronics', 'Groceries', 'Clothing']
}

df = pd.DataFrame(data)

# Using 'City' as a parameter (simple example)
selected_city = 'New York'

# Filter the dataset for the selected city
filtered_data = df[df['City'] == selected_city]

print(f"Data for transactions in {selected_city}:")
print(filtered_data)

# Applying frequency encoding to 'City'
city_frequency = df['City'].value_counts(normalize=True)
df['City'] = df['City'].map(city_frequency)
print(df)
   Transaction ID  Amount  City Product Category
0               1     100   0.4      Electronics
1               2     200   0.2         Clothing
2               3     150   0.4      Electronics
3               4     300   0.4        Groceries
4               5     250   0.4         Clothing
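Since these frequencies come from the training dataset, the same mapping should be reused for new or unseen data rather than recomputed; here is a minimal sketch (falling back to 0 for unseen cities is an assumption, not a fixed rule):

# Reuse the training-set frequencies on new data; unseen cities get 0
new_data = pd.DataFrame({'City': ['Chicago', 'Boston']})
new_data['City'] = new_data['City'].map(city_frequency).fillna(0)
print(new_data)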

Conclusion

With this, all the important and necessary encoding methods are covered! Choosing the right encoding method can significantly impact the performance of your machine learning models.
