When working with datasets for Machine Learning, some of the data will often be in text format. For example, a medical dataset containing patient details, or a dataset containing details of individuals in an organization, usually has a gender column indicating whether the person is male or female.
These non-numeric columns might contain important information needed for training the model. The problem is that ML and Deep Learning models operate on numbers and cannot consume raw text directly. To give these features as inputs to the model, we have to encode the text data into numbers first.
This process of converting text data into numbers is called Feature Encoding. There are different types of feature encoding:
- Label Encoding
- Ordinal Encoding
- One Hot Encoding
Label Encoding
In label encoding, we label each unique value in the column with a unique number. For example, the gender column has only two unique values, male and female, so we can label male as 1 and female as 2 in another column, as shown.
As you can see in the dataset above, we created another column for the label-encoded values of the gender column. In the label-encoded column, male is represented by 1 and female by 2.
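The mapping described here can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the `label_encode` helper and the sample `genders` list are hypothetical names made up for this example, and numbering starts at 1 only to match the male = 1, female = 2 labels used in this section.

```python
def label_encode(values, start=1):
    """Assign each unique value a number, in order of first appearance.

    Hypothetical helper for illustration; `start=1` mirrors the
    male = 1, female = 2 labels used in the text.
    """
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused number is start + how many labels exist so far.
            mapping[value] = start + len(mapping)
    encoded = [mapping[value] for value in values]
    return encoded, mapping


# Made-up sample column, not taken from a real dataset.
genders = ["male", "female", "female", "male"]

gender_encoded, gender_mapping = label_encode(genders)
print(gender_mapping)  # {'male': 1, 'female': 2}
print(gender_encoded)  # [1, 2, 2, 1]
```

Note that the numbers depend only on the order in which values first appear, so they carry no meaning beyond identifying the category.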
Ordinal Encoding
If the values in a column have a natural order, ordinal encoding is the appropriate choice. Consider a customer reviews dataset containing a rating column.
The rating column here has the values good, fine, and bad. When we encode them as numbers, we usually want to give a higher number to a good rating and a lower number to a bad rating. This is called ordinal encoding. The ordinal-encoded values for the rating column are also shown above.
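As a rough sketch of the idea, the order can be written down explicitly and each rating replaced by its position in that order. The specific numbers (bad = 1, fine = 2, good = 3) and the sample `ratings` list are assumptions chosen for illustration; the only property that matters is that a better rating gets a higher number.

```python
# Ratings listed from worst to best; the order is supplied by us,
# because it cannot be inferred from the strings themselves.
rating_order = ["bad", "fine", "good"]

# Assumed numbering for this sketch: bad = 1, fine = 2, good = 3.
rating_to_rank = {rating: rank for rank, rating in enumerate(rating_order, start=1)}

# Made-up sample column, not taken from a real dataset.
ratings = ["good", "bad", "fine", "good"]

rating_encoded = [rating_to_rank[rating] for rating in ratings]
print(rating_encoded)  # [3, 1, 2, 3]
```

The difference from label encoding is that the mapping is chosen deliberately, so comparisons such as "3 is greater than 1" reflect "good is better than bad".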
One Hot Encoding (Categorical Encoding)
Let's take the previous example of the gender column. Instead of converting male to 1 and female to 2, we can create two separate columns for these two unique values, as shown.
Here, you can see that the is_male and is_female columns are added. These columns act as better indicators of gender to the ML model than labeling each gender with a number. This type of encoding is called Categorical Encoding or One Hot Encoding.
The is_male column acts as a Boolean feature for the ML model instead of being just a labeled number in the dataset. So, overall, this is a good approach. However, it has a disadvantage related to the cardinality of the column.
Cardinality: This is a measure of the number of unique values in a column. E.g., the cardinality of the gender column is just 2.
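A minimal plain-Python sketch of this transformation is given below. The `one_hot_encode` helper and the sample `genders` list are hypothetical and exist only for this example; the `is_` prefix follows the is_male and is_female column names used in this section.

```python
def one_hot_encode(values, prefix="is_"):
    """Create one 0/1 indicator column per unique value.

    Hypothetical helper for illustration. Returns a dict that maps
    each new column name to its list of 0/1 entries.
    """
    categories = sorted(set(values))
    return {
        f"{prefix}{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Made-up sample column, not taken from a real dataset.
genders = ["male", "female", "female", "male"]

one_hot = one_hot_encode(genders)
print(one_hot["is_male"])    # [1, 0, 0, 1]
print(one_hot["is_female"])  # [0, 1, 1, 0]
```

Each row has exactly one 1 across the new columns, which is where the name "one hot" comes from.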
Problem
Now consider a column with 100 unique values. One hot encoding would add 100 more columns to the dataset, which might make our Machine Learning model slower. So, one hot encoding is best reserved for low-cardinality columns.
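One way to act on this is to measure a column's cardinality before deciding how to encode it. The sketch below is only an illustration: the helper names and the cutoff of 10 are arbitrary assumptions for the example, not a recommended value.

```python
# Hypothetical cutoff for this sketch; a sensible value depends on
# the dataset and the model.
MAX_ONE_HOT_CARDINALITY = 10


def cardinality(values):
    """Number of unique values in a column."""
    return len(set(values))


def is_one_hot_candidate(values, limit=MAX_ONE_HOT_CARDINALITY):
    """True when one hot encoding would add at most `limit` columns."""
    return cardinality(values) <= limit


# Made-up sample columns, not taken from a real dataset.
genders = ["male", "female", "female", "male"]
many_values = [f"value_{i}" for i in range(100)]  # 100 unique values

print(cardinality(genders))               # 2
print(is_one_hot_candidate(genders))      # True
print(cardinality(many_values))           # 100
print(is_one_hot_candidate(many_values))  # False
```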
So, these are the most important encoding types you need to know in Machine Learning.