DEV Community

Cover image for Feature Engineering - Expand category column
Mage
Mage

Posted on

Feature Engineering - Expand category column

TLDR

In this Mage Academy lesson, we’ll go over how to expand a category column and see how this can help us understand the data better.

Glossary

  • Definition
  • Why is it necessary
  • How to gain information
  • Conceptual example
  • How to code

Definition

Expanding the category column means splitting a categorical column into multiple numerical columns (i.e., the values of these columns are in the form of numbers), and the number of numerical columns is equal to the number of categories in the categorical column.

Why is it necessary

Data is available in 2 forms (numbers or strings). Machine learning models can understand only numbers, so before building any machine learning model, we’ve to convert strings into numbers without losing any information. Expanding category columns is one such action that helps to represent categorical columns in the form of numbers without losing information.

Now, let’s see in detail what the expand category column is and how to perform this action.

How to gain information

As mentioned, expanding the category column splits the categorical column into multiple new numerical columns (i.e., the values in these columns are in the form of numbers). These new columns have the frequency of each category as column values, and these frequency values help us prioritize the data.

For example, there’s sample data with columns "ID" and "Dish". Using the expand category column action, we can prioritize the dishes based on the number of times they've been ordered.

And with this information, we can predict how many dishes we can sell next month or we can estimate the quantity of ingredients required for the next month.

Image description

Image descriptionFrequency of each category

Image description

"Noodles" were ordered 4 times, so we can conclude that "Noodles" are the most popular dish. "Ice cream" was ordered twice, making it the second best-selling dish, and "Biryani" was ordered only once, making it the least favorite among the customers who visited the restaurant. As "Noodles" are our best-selling dish, we'll need to buy more ingredients to make this recipe more frequently.

Note: After expanding the category column "Dish", we can remove this column as this column’s information is already in the new columns "Noodles," "Biryani," and "Ice cream."

Conceptual example

Let’s understand the importance of expanding category columns in detail with the help of a more complicated example.

Image descriptionImage by digital designer from Pixabay

You own 4 pizza restaurants. Now, you'd like to know how frequently each variety of pizza is sold at each restaurant, as this helps you estimate the quantity of groceries you'll need for the next month at each restaurant.

Image descriptionImage by Alexas_Fotos from Pixabay

Image description

Restaurant names

All 4 restaurants sell 2 varieties of pizza.

Image descriptionMenu

First, let’s take a look at the order data of all restaurants.

Image descriptionImage by Photo Mix from Pixabay

Image description

Step-1: Let’s group the orders by restaurant names.

Image description

Step-2: Group the data by types of pizzas sold at each restaurant.

Image description

We see that there are 4 restaurants and each restaurant has 2 varieties or categories of pizzas sold. Now, for each variety or category, let’s count the number of pizzas sold at each restaurant.

Step-3: Display the count of pizzas sold at each restaurant.

Image description

We can also display the same result in a more readable format as shown below.

Image description

Step-4: Add the results back to the original data.

Image description

How to code
Until now, we’ve seen how to manually calculate the count of pizzas sold per variety at each restaurant. Now, let’s see how we can automate the process through code.

Find the count of each variety of pizza sold at each restaurant.

Step-1:
First, let’s group orders by company name and then by types of pizzas sold. To do this, we’ll iterate through the rows of order data and save the “Company” as keys of the dictionary. The final result should be as shown below.

Output format: {‘CompanyA’: [List of pizzas sold], ‘CompanyB’:[List of pizzas sold],....}

Image description

Image description

Step-2:
Calculate the number of pizzas sold in each category at each restaurant.

The number of pizzas sold in each category at each restaurant is calculated by counting the number of times a pizza category appears in the list corresponding to each restaurant. The final result should be as shown below.

Output format: {‘CompanyA’: {Type1: count, Type2: count}, ‘CompanyB’: {Type1: count, Type2: count},....}

Image description

Image description

First, we’ve to group the data to calculate the frequency of each category and then we’ve to join the result back into the original dataset.

Step-1: We can either use crosstab(), groupby() or pivot_table() functions to get the count of the pizzas sold in each variety or category.

Using crosstab()

Image description

Image description

Using groupby()

Image description

Image description

Using pivot_table()

Image description

Image description

Step-2: Finally, create new columns in the original data frame for each category with values equal to the number of pizzas sold in each category with respect to each restaurant and add the result back to the original dataset using join() function.

Image description

Image descriptionOriginal data set after expanding the “Type” column

Image description

We've got 2 categories in the column "Type", so 2 new columns "Cheese Pizza" and "Speciality Pizza" are created after expand the category column action. The values in the new columns represent the frequency or count of pizzas sold at each restaurant. From this, we can say "Domino’s Pizza," "Godfather’s Pizza," and "IMO’s Pizza" are famous for specialty pizzas. IMO’s Pizza sold the least, i.e., 0 cheese pizzas, while "Pizza Hut" sold the most, i.e., cheese pizzas. Based on this, we can say that 3 of the restaurants require more ingredients to make specialty pizzas than cheese pizza ingredients.

Note: After expanding the "Type" column, we can remove this column from the dataset as its information is already in the new columns "Cheese Pizza" and "Speciality Pizza."

For quick analysis and results, try our product, Mage. Our service features an "Edit data" area with “Expand category column” action.

Image descriptionData set before expanding the category column action

Image description

Image description

Image descriptionData set after expanding “type
” column_

Image description

Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮

Latest comments (1)

Collapse
 
dangerous profile image
Tommy DANGerous

This is one of the feature engineering techniques I use for almost every dataset. This is a secret weapon to becoming an ML pro.