Category Magic: Transforming Categorical Data in ML

#machinelearning #sklearn #python #softwaredevelopment

If you are learning machine learning, you will run into this during the data preprocessing step: if the data contains categorical features that are significant for predicting the output, or if the dependent variable itself is categorical, we must turn that data into numerical form. This process is known as encoding.

Why do we transform Categorical Data into Numerical Form?

We all know that machines only understand 0's and 1's, and machine learning algorithms are no exception. They work with numerical data. So, before we feed data to an algorithm, we must encode the categorical data into numerical form.

How do we Encode Categorical Data?

To transform categorical data into numerical data, we use the concept of dummy variables.

A dummy variable is a binary variable that takes the value 0 or 1. Each category in categorical data such as [India, Japan, South Korea] gets its own dummy variable, which is 1 when that category applies and 0 otherwise.
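To make the idea concrete, here is a minimal sketch using pandas. The small DataFrame is made up for illustration (it is not the dataset used later), and pd.get_dummies is just one convenient way to build dummy variables:

```python
import pandas as pd

# Hypothetical sample: one categorical column with the three countries from the text.
df = pd.DataFrame({"Country": ["India", "Japan", "South Korea", "India"]})

# One dummy (0/1) column per category; dtype=int gives 0/1 instead of True/False.
dummies = pd.get_dummies(df["Country"], dtype=int)
print(dummies)
```

Each row has a single 1, in the column of the country that row belongs to.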

If you wish to learn more about dummy variables, please click on the following link:
Concept of Dummy Variable and Dummy Variable Trap

Converting Categorical Data into Numerical Data using scikit-learn:

Let's use dummy data to demonstrate encoding in scikit-learn:

Step 1: Importing the libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Step 2: Importing the dataset
You can download the dataset from here.

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:,:-1].values
y = dataset.iloc[:,-1].values

Here, I separated the independent variables (X) from the dependent variable (y).

Step 3: Taking care of Missing Data

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan,strategy='mean')
imputer.fit(X[:,1:3])
X[:,1:3] = imputer.transform(X[:,1:3])
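
SimpleImputer with strategy='mean' replaces each missing value with the mean of its column. As a rough sketch of what that does, here it is on a tiny made-up array (the values are hypothetical, not taken from Data.csv):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric columns (age, salary) with one missing value in each.
values = np.array([[30.0, 50000.0],
                   [np.nan, 60000.0],
                   [40.0, np.nan]])

imputer_demo = SimpleImputer(missing_values=np.nan, strategy="mean")

# fit_transform learns the column means and fills the gaps in one call.
filled = imputer_demo.fit_transform(values)
print(filled)
```

The missing age becomes 35.0 (the mean of 30 and 40) and the missing salary becomes 55000.0 (the mean of 50000 and 60000).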

Step 4: Encoding Categorical Data

We have reached this step at last.
Upon examining our data, we see that both our dependent variable (y) and one of our independent variables (the country column of X) contain categorical data.

One thing to notice here: we don't want to expand our output variable (y) into several columns of dummy variables. Instead, we assign a unique integer label to each category.

This conversion is known as label encoding. Keep in mind that scikit-learn's LabelEncoder assigns the integers in sorted order of the category values, so an ordinal relationship is preserved only if it happens to match that sorted order.

So let's convert the dependent variable (y) into numerical values using label encoding.

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = np.array(le.fit_transform(y))

Its output is:

[0 1 0 0 1 1 0 1 0 1]
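
If you want to check which category each number stands for, the fitted encoder keeps that mapping. Here is a small sketch; the "No"/"Yes" labels are an assumption made for illustration, since the real values come from the last column of Data.csv:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels; the real y comes from the last column of Data.csv.
labels = ["No", "Yes", "No", "Yes", "Yes"]

le_demo = LabelEncoder()
encoded = le_demo.fit_transform(labels)

# classes_ holds the categories in sorted order; a category's index is its label.
print(le_demo.classes_)
print(encoded)

# inverse_transform maps the integers back to the original categories.
print(le_demo.inverse_transform(encoded))
```

With these labels, "No" is encoded as 0 and "Yes" as 1, because the categories are sorted before the integers are assigned.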

Now let's convert the country column of our independent variables (X) into numerical data. For this we use one-hot encoding.

One-Hot Encoding converts each categorical value into a binary vector, creating new binary columns for each category.

This can be done as follows:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder',OneHotEncoder(),[0])],remainder='passthrough')
X = np.array(ct.fit_transform(X))

Its output is:

[[1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [0.0 1.0 0.0 30.0 54000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 35.0 58000.0]
 [0.0 0.0 1.0 38.77777777777778 52000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 37.0 67000.0]]
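
The first three columns are the one-hot columns, followed by the two numeric columns that were passed through. To see which category each one-hot column stands for, you can ask the fitted encoder. The sketch below uses three made-up rows (country, age, salary) and hypothetical variable names instead of the real dataset:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical rows: country, age, salary (object dtype, like dataset.iloc[...].values).
X_demo = np.array([["India", 44.0, 72000.0],
                   ["South Korea", 27.0, 48000.0],
                   ["Japan", 30.0, 54000.0]], dtype=object)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct_demo.fit_transform(X_demo))

# The fitted encoder lists its categories in the same order as the new columns.
categories = ct_demo.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded)
```

Here the one-hot columns appear in sorted category order (India, Japan, South Korea), so the "South Korea" row has its 1 in the third column.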

🎉 Tadaa! 🎉 Finally, we've mastered categorical encoding! 📊💻🔤💡

Here's a quick summary of what we've learned:

One-Hot Encoding 🔥🆕:

We create binary columns for each category.
Assign a '1' to the category that applies and '0' to others.

Label Encoding 🏷️🔢:

We replace categories with numerical values.

After that, we can continue with the next steps, such as feature scaling, training the model, and testing the model.

👉 You can access the full code from this GitHub repository: Link to Repository

Feel free to explore the code and learn more about categorical encoding! 🔍💻📂📝😊