Most of the machine learning models work with the numerical data. Most often, we encounter with categorical data in our dataset. But we still want to use those features in our dataset so that our machine learning model can learn and predict well.
Here comes the encoding to our rescue.
What is the main confusion?
The main thing is that we need to decide which type of encoding we need to use for our categorical data.
How to decide which encoding to use?
Check your category is nominal (unorder) or ordered. In other word is there a relation b/w different values for the category or not.
For example:
Category Feature for Shirt Size with possible values Large, Medium and Small.
Which encoding we should use?
To answer these questions, we need to answer following questions.
Is it order?
Yes, we can rank/order the shirt w.r.t size.
Is there a relation b/w category value?
Yes, all of them are telling us about the shirt size.
When answer to both of the above question is Yes simply use ordinal encoding.
Ordinal Encoding
Ordinal encoding simply assign a unique numerical value to each category. The number are given based on the order they are appeared in the dataset by default.
For example:
Large => 1
Medium => 2
Small => 3
If there is no order nor is the relation b/w different category for the feature say Color of the Shirt. We will use the One hot encoding.
One Hot Encoding
It creates a vector for instance, If we have feature SHIRT COLOR with two colors in our dataset say red and blue. It will represent each color with the following mapping
Red => [1,0]
Blue => [0,1]
Dummy Encoding
Dummy Encoding is just an extended form of the One Hot Encoding in where the vector length is one less than in One Hot Encoding. For the same SHIRT COLOR feature, we will have following mapping
Red => [0]
Blue => [1]
Label Encoding
Label Encoding is used to transform our dataset target variable to Numbers where each Label/class is represented by unique number. It is kind of similar to ordinal encoding however these numbers can't be ranked/ordered.
We must not use Label Encoding for the features. It is used only for target variable encoding.
Top comments (0)