One Hot Encoding

#machinelearning

ChatGPT describes one-hot encoding as follows:

One-Hot Encoding is a categorical feature transformation technique used to convert non-numerical labels (like “red”, “blue”, “green”) into numerical binary vectors, so machine-learning models can understand them.
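To make that definition concrete, here is a minimal sketch using pandas. The `colors` frame is a made-up toy example (it is not part of the dataset used later in this post), and `dtype=int` is only there so the output shows 0/1 instead of booleans.

```python
import pandas as pd

# Hypothetical toy column, not taken from the dataset used in this post.
colors = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# One binary column per distinct label; dtype=int gives 0/1 instead of True/False.
encoded = pd.get_dummies(colors, columns=["color"], dtype=int)
print(encoded)
```

Each row ends up with exactly one 1, placed in the column that matches its original label.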

The catch with this approach is what happens when a column has hundreds of distinct categories: with 200 categories, you are not going to want to create 200 separate columns, one for each value.

import pandas as pd
import numpy as np

data = pd.read_csv('test.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5','X6'])
data.head()

This shows what the dataset consists of.

We don't know what these columns represent, and we also don't know how many unique categorical values each one contains, so let's find out.

# Print the number of distinct labels in each column
for col in data.columns:
    print(col, ": ", len(data[col].unique()), " labels")

That gives us the number of unique categorical values in each column.

Now suppose we convert every one of those unique values into its own column:

pd.get_dummies(data, drop_first=True).shape

That yields (4209, 121), which means we would end up with 121 columns. That greatly increases the dimensionality of the data, and that's not what we want.
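As a rough sanity check on where such a column count comes from: with `drop_first=True`, each categorical column contributes its number of unique labels minus one. The sketch below shows that on a made-up `toy` frame (hypothetical data, not the `test.csv` used in this post).

```python
import pandas as pd

# Hypothetical toy frame, only to illustrate how the column count adds up.
toy = pd.DataFrame({
    "A": ["a", "b", "c", "a"],  # 3 unique labels -> 2 dummy columns
    "B": ["x", "y", "x", "y"],  # 2 unique labels -> 1 dummy column
})

# With drop_first=True, each column yields (number of unique labels - 1) dummies.
expected_cols = sum(toy[col].nunique() - 1 for col in toy.columns)
actual_cols = pd.get_dummies(toy, drop_first=True).shape[1]
print(expected_cols, actual_cols)
```

The two numbers agree, so the width of the encoded frame grows directly with the number of distinct labels.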

So let's find the 10 most frequent categories, specifically for the X2 column.

data.X2.value_counts().sort_values(ascending=False).head(10)

Let's store the labels of those top 10 categories (the index of the result) in a list.

top_10 = [x for x in data.X2.value_counts().sort_values(ascending=False).head(10).index]

That would yield
['as', 'ae', 'ai', 'm', 'ak', 'r', 'n', 's', 'f', 'e']

Now let's create a binary column for each of those categories that indicates whether the category is present in a given row.

for label in top_10:
  data[label] = np.where(data['X2']==label, 1, 0)

data[['X2']+top_10].head(20)

Let's wrap that in a function.

def one_hot_top(df, variable, top_x_labels):
    # Add one 0/1 column per label, named <variable>_<label>, to df in place
    for label in top_x_labels:
        df[variable+'_'+label] = np.where(df[variable]==label, 1, 0)

data = pd.read_csv('test.csv', usecols=['X1', 'X2', 'X3', 'X4', 'X5','X6'])

one_hot_top(data, 'X2', top_10)
data.head()

So basically, we create an individual column only for each category we care about and mark its presence with 0s and 1s. If all of those columns are 0 for a row, the row holds a category outside our top 10. If one of them is 1, that column is the row's category.
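One possible variation, sketched below, finds the top labels inside the function and returns a copy instead of modifying the frame it receives. The name `one_hot_top_n`, its `n` parameter, and the `toy` data are all hypothetical and only serve to illustrate the idea.

```python
import numpy as np
import pandas as pd

def one_hot_top_n(df, variable, n=10):
    """Return a copy of df with 0/1 columns for the n most frequent labels of `variable`."""
    out = df.copy()
    # value_counts() is sorted by frequency (descending), so head(n) gives the top n labels.
    top_labels = out[variable].value_counts().head(n).index
    for label in top_labels:
        out[f"{variable}_{label}"] = np.where(out[variable] == label, 1, 0)
    return out

# Hypothetical toy data: 'ae' appears 3 times, 'as' twice, 'm' once.
toy = pd.DataFrame({"X2": ["as", "as", "ae", "ae", "ae", "m"]})
result = one_hot_top_n(toy, "X2", n=2)
print(result)
```

The row holding the rare label `m` gets 0 in every new column, which is the "not a category we care about" case described above.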

I have shared what I learned here, and I'm open to feedback.

Thanks :)
