DEV Community

DEVunderdog


One Hot Encoding

ChatGPT describes one-hot encoding as follows:

One-Hot Encoding is a categorical feature transformation technique used to convert non-numerical labels (like “red”, “blue”, “green”) into numerical binary vectors, so machine-learning models can understand them.
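As a quick illustration of that definition, here is a minimal sketch on a made-up colour column (the toy DataFrame and its column name are assumptions for illustration, not part of the dataset used later in this post):

```python
import pandas as pd

# Toy data, invented for illustration only.
colors = pd.DataFrame({"color": ["red", "blue", "green", "blue"]})

# One binary column per distinct label; dtype=int gives 0/1 instead of booleans.
encoded = pd.get_dummies(colors, columns=["color"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1, in the column that matches its original label.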

One Hot Encoding

The problem with this approach shows up when a feature has hundreds of categories: with 200 distinct values, you are not going to create 200 separate columns, one per value.

import pandas as pd
import numpy as np

data = pd.read_csv('test.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5','X6'])
data.head()

This is what the dataset looks like:

test datasets

We don't know what these categories represent, and we don't know how many unique values each column contains, so let's find out.

for col in data.columns:
   print(col, ": ", len(data[col].unique()), ' labels')

unique categorical values

This gives us the number of unique categories in each column.

Now suppose we convert every one of those unique categories into its own column:

pd.get_dummies(data, drop_first=True).shape

That yields (4209, 121), which means we would have 121 columns even after dropping the first level of each feature. That increases the dimensionality, which is not what we want.
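To see where a column count like that comes from, here is a small sketch on invented data (the `toy` frame and its columns `A` and `B` are hypothetical). With `drop_first=True`, each feature contributes one column fewer than its number of unique labels:

```python
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({
    "A": ["x", "y", "z", "x"],  # 3 unique labels
    "B": ["p", "q", "p", "q"],  # 2 unique labels
})

# Each feature contributes (number of unique labels - 1) columns.
expected_cols = int((toy.nunique() - 1).sum())
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]
print(expected_cols, actual_cols)
```

So the width of the encoded frame grows with the total number of distinct labels across all columns.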

So let's find the 10 most frequent categories in the X2 column:

data.X2.value_counts().sort_values(ascending=False).head(10)

Top 10 most frequent categories

Let's store those top categories in a list:

top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]

That would yield
['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']
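As a side note, `value_counts()` already returns counts sorted from most to least frequent by default, so the same kind of list can be built a little more compactly. A minimal sketch on made-up data (the Series `s` is an assumption for illustration):

```python
import pandas as pd

# Toy data, invented for illustration only.
s = pd.Series(["b", "a", "b", "c", "b", "a"])

# value_counts() sorts in descending order of frequency by default.
counts = s.value_counts()

# Take the labels of the two most frequent categories as a plain list.
top_2 = counts.head(2).index.tolist()
print(top_2)
```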

Now let's create a binary column for each of these categories that indicates whether a row belongs to it:

for label in top_10:
  data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top_10].head(20)

Top 20 labels

Let's wrap that into a function:

def one_hot_top(df, variable, top_x_labels):
  for label in top_x_labels:
    df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)

data = pd.read_csv('test.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5','X6'])

one_hot_top(data, 'X2', top_10)
data.head()

Notice that each category we chose gets its own column, with 0s and 1s marking whether a row belongs to it. If all of those columns are 0 for a row, its category is not one of the ones we care about; if one of them is 1, that is the row's category.
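If you want to make the "all zeros" case explicit, one option is to flag those rows in an extra column. This is only a sketch on made-up data; the `toy` frame and the `X2_other` column name are assumptions for illustration, not something the dataset or the function above provides:

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only.
toy = pd.DataFrame({"X2": ["as", "ae", "as", "zz", "ae", "q"]})

# Keep only the two most frequent labels.
top_2 = toy["X2"].value_counts().head(2).index.tolist()
for label in top_2:
    toy["X2_" + label] = np.where(toy["X2"] == label, 1, 0)

# Rows where every indicator is 0 belong to a category we did not keep.
indicator_cols = ["X2_" + label for label in top_2]
toy["X2_other"] = (toy[indicator_cols].sum(axis=1) == 0).astype(int)
print(toy)
```

Here the rows with labels `zz` and `q` are the ones that end up flagged.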

I have shared what I learned here and am open to feedback.

Thanks :)
