Penumudi Varun

Non-numeric data for ML (Encoding Data)

When working with datasets for machine learning, some of the data will be in text form. For example, a medical dataset of patient details or a dataset describing the individuals in an organization usually has a gender column recording whether each person is male or female.

Non-Numeric Columns

These non-numeric columns might contain important information needed for training the model. The problem is that ML and deep learning models operate on numbers and cannot take raw text as input. To use these features as inputs, we have to encode the text as numbers before giving it to the model.

This process of converting text data into numbers is called feature encoding. There are different types of feature encoding:

  • Label Encoding
  • Ordinal Encoding
  • One Hot Encoding

Label Encoding

In label encoding, we label each unique value in the column with a unique number. For example, the gender column has only two unique values, male and female, so we can label male as 1 and female as 2 in another column as shown.

[Image: Label Encoding]

As you can see in the dataset above, we created another column holding the label-encoded values of the gender column. In the label-encoded column, male is represented by 1 and female by 2.
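Here is a minimal sketch of label encoding in plain Python. The `label_encode` helper and the sample `gender` list are made up for illustration; the numbering starts at 1 only to match the male = 1, female = 2 example above.

```python
def label_encode(values, start=1):
    """Give each unique value an integer, in order of first appearance."""
    mapping = {}
    for value in values:
        if value not in mapping:
            # The next unused number: start, start + 1, start + 2, ...
            mapping[value] = start + len(mapping)
    encoded = [mapping[value] for value in values]
    return encoded, mapping


# Hypothetical sample column, not taken from the article's dataset.
gender = ["male", "female", "female", "male"]
gender_encoded, gender_mapping = label_encode(gender)

print(gender_mapping)  # {'male': 1, 'female': 2}
print(gender_encoded)  # [1, 2, 2, 1]
```

In practice you would usually reach for a library instead. For example, scikit-learn's `LabelEncoder` does the same job, but it numbers the classes from 0.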

Ordinal Encoding

If the column values have a natural order, then you should go for ordinal encoding. Consider a customer reviews dataset containing a rating column.

[Image: Ordinal Encoding]

The rating column here has the values good, fine, and bad. When we encode them as numbers, we usually want to give a higher number to a good rating and a lower number to a bad rating, so that the numeric order matches the order of the ratings. This is called ordinal encoding. The ordinal-encoded values for the rating column are also shown above.
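A small sketch of ordinal encoding, assuming the order bad < fine < good. The exact numbers (1, 2, 3) are my assumption and may differ from the ones in the image; `RATING_ORDER` and `ordinal_encode` are illustrative names.

```python
# Categories listed from lowest to highest (assumed order).
RATING_ORDER = ["bad", "fine", "good"]


def ordinal_encode(values, ordered_categories):
    """Map each value to its 1-based rank in ordered_categories."""
    rank = {category: i + 1 for i, category in enumerate(ordered_categories)}
    return [rank[value] for value in values]


# Hypothetical sample column.
ratings = ["good", "bad", "fine", "good"]
ratings_encoded = ordinal_encode(ratings, RATING_ORDER)

print(ratings_encoded)  # [3, 1, 2, 3]
```

The key difference from label encoding is that the order is chosen by you, not by whichever value happens to appear first.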

One Hot Encoding (Categorical Encoding)

Let's return to the earlier example of the gender column. Instead of converting male to 1 and female to 2, we can create two separate columns, one for each unique value, as shown.

[Image: Categorical Encoding]

Here, you can see that the is_male and is_female columns are added. These columns act as better indicators of gender to the ML model than labeling each gender with a number, because an arbitrary number can suggest an order or magnitude that does not exist. This type of encoding is called Categorical Encoding or One Hot Encoding.

The is_male column acts as a Boolean feature for the ML model instead of being just a labeled number in the dataset. So, overall this is a good approach. But it has a disadvantage that depends on the cardinality of the column.
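One way to sketch one hot encoding in plain Python is shown below. The `one_hot_encode` helper and the sample data are hypothetical; the `is_` prefix just mirrors the is_male and is_female column names used above.

```python
def one_hot_encode(values, prefix="is_"):
    """Build one 0/1 indicator column per unique value."""
    categories = sorted(set(values))  # sorted so the column order is stable
    return {
        f"{prefix}{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


# Hypothetical sample column.
gender = ["male", "female", "female", "male"]
one_hot = one_hot_encode(gender)

for column_name, column_values in one_hot.items():
    print(column_name, column_values)
# is_female [0, 1, 1, 0]
# is_male [1, 0, 0, 1]
```

Each row has exactly one 1 across the new columns, which is where the name "one hot" comes from. Libraries such as pandas (`get_dummies`) and scikit-learn (`OneHotEncoder`) provide this out of the box.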

Cardinality: This is a measure of the number of unique values in a column. E.g., the cardinality of the gender column is just 2.
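Cardinality is easy to check in code. A quick sketch, using a made-up `gender` list as the column:

```python
# Hypothetical sample column.
gender = ["male", "female", "female", "male"]

# A set keeps only the unique values, so its size is the cardinality.
gender_cardinality = len(set(gender))

print(gender_cardinality)  # 2
```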

Problem

Now consider a column with 100 unique values: one hot encoding adds 100 more columns to the dataset, which might make our machine learning model slower. So, we should prefer low-cardinality columns for one hot encoding.

So, these are the most important encoding types you need to know in machine learning.
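As a rough sketch, you could filter the columns by cardinality before one hot encoding them. The `max_unique=10` threshold, the helper names, and the sample table below are all assumptions for illustration, not a fixed rule.

```python
def cardinality(values):
    """Number of unique values in a column."""
    return len(set(values))


def low_cardinality_columns(table, max_unique=10):
    """Names of columns with at most max_unique distinct values."""
    return [
        name for name, values in table.items()
        if cardinality(values) <= max_unique
    ]


# Hypothetical table with 100 rows: one low- and one high-cardinality column.
table = {
    "gender": ["male", "female"] * 50,
    "user_id": [f"user_{i}" for i in range(100)],
}

cardinalities = {name: cardinality(values) for name, values in table.items()}
selected = low_cardinality_columns(table)

print(cardinalities)  # {'gender': 2, 'user_id': 100}
print(selected)       # ['gender']
```

Here one hot encoding `user_id` would add 100 columns, while `gender` adds only 2, so only `gender` passes the filter.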
