DEV Community

Mage

Feature Engineering - Add with conditional

TLDR

It seems absurd to add more columns to an already large dataset, right? 👀 See how deriving new columns from existing ones can give the model further insight when predicting.

Outline

  • What is it?

  • The significance of boolean values in your dataset

  • Ways to implement in code! 👩‍💻

  • Magical no-code solution ✨🔮

What is it?

Adding a new column with a conditional is an operation in a machine learning analysis that takes existing data, possibly several columns of varying ranges, and applies a conditional comparison to condense those columns into a single value (often boolean) for each row.

The significance of boolean values in your dataset

To understand the importance of generalizing some of your features/columns, consider what boolean values represent: true or false values that clearly indicate to the model what the positive or negative outcomes are in this prediction.

This column of boolean values can improve model accuracy because it summarizes a range of values into a true-or-false indicator of whether certain conditions are fulfilled. It’s like a test study guide: a summary that helps predict the likelihood of an outcome (like passing or not passing a test).

Image: Confused lady math meme

If that definition was a little abstract, here are some examples! The following problems are simplified by adding new columns using a conditional:

  • Classify a range and give it a value -> Assign “Pass/No pass” to a numerical grade.
  • Simplify column data -> Reduce the complexity of store records by using a boolean value to indicate whether a line of seasonal clothing sales was a net gain or loss.
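
As a rough sketch of the second example, the snippet below derives a boolean column from two existing columns. The store records, the column names (revenue, cost, net_gain), and the numbers are all made up for illustration; they are not from a real dataset.

```python
import pandas as pd

# Hypothetical store records; names and numbers are invented for illustration.
sales = pd.DataFrame({
    "line": ["winter coats", "swimwear", "rain boots"],
    "revenue": [12000, 4500, 3000],
    "cost": [9000, 6000, 3000],
})

# The comparison runs row by row and yields one boolean per row:
# True when the line earned more than it cost, False otherwise.
sales["net_gain"] = sales["revenue"] > sales["cost"]

print(sales)
```

Whether breaking even counts as a gain is a judgment call; this sketch treats it as not a gain by using a strict comparison.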

When ranking student satisfaction with their colleges, academic performance is only one aspect of the student experience (others include extracurriculars, leadership, and other life responsibilities). Thus, we don’t need to be so specific with typical 0-100 grades, as the wider range of values in a column can introduce more noise. Adding a “Pass/No pass” feature can improve the accuracy of the model because a numerical range of grades is much harder to generalize (and predict) than “Pass/No pass.”

Image: UCLA student life (Source: Daily Bruin)

Ways to implement in code! 👩‍💻

Using the grades example discussed earlier, we can simplify a numerical grade like “95% or 72%” into “pass or no pass” values. We have a dataset of grades (or “marks”) that we’ll go through to determine whether the student passed the exam or not.

import pandas as pd
df = pd.read_csv('Marks10.csv')
df

In regular Python, we’re only interested in the Exampoints column, which holds the grades. So we save that column under the name “grades” and print it as a list of fifty exam scores.

grades = df['Exampoints']
print(list(grades))

Next, we’d loop through all the scores and use a conditional comparison to check whether each grade is passing (which is anything at or above 60%).

# included full list for readability
grades = [31, 45, 23, 69, 78, 45, 23, 89, 100, 97,
          56, 11, 9, 55, 43, 44, 45, 46, 47, 48,
          49, 23, 24, 25, 26, 69, 70, 71, 72, 73,
          74, 75, 76, 34, 35, 36, 37, 38, 39, 40,
          41, 42, 43, 44, 45, 46, 47, 48, 49, 50]
pass_no_pass = []

for grade in grades:
    if grade >= 60:
        pass_no_pass.append('Pass')
    else:
        pass_no_pass.append('No pass')

Our list of “pass or no pass” values would then contain all of the records that we need to make a new column of data!

print(pass_no_pass)
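
As a quick sanity check on the loop, one option is to count how many of each label came out. This sketch reuses the grades list and the 60-point pass mark from the example; collections.Counter is just one convenient way to tally the labels, not something the original walkthrough uses.

```python
from collections import Counter

# Same fifty exam scores as in the example above.
grades = [31, 45, 23, 69, 78, 45, 23, 89, 100, 97,
          56, 11, 9, 55, 43, 44, 45, 46, 47, 48,
          49, 23, 24, 25, 26, 69, 70, 71, 72, 73,
          74, 75, 76, 34, 35, 36, 37, 38, 39, 40,
          41, 42, 43, 44, 45, 46, 47, 48, 49, 50]

pass_no_pass = []
for grade in grades:
    if grade >= 60:
        pass_no_pass.append('Pass')
    else:
        pass_no_pass.append('No pass')

# Tally the labels; there should be one label per grade.
counts = Counter(pass_no_pass)
print(counts)
print(len(pass_no_pass) == len(grades))
```

For this list, the tally comes out to 13 passing scores and 37 non-passing ones.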

Now that you know the logic behind performing a conditional operation on an existing column to determine values in a new one, next we’ll show you how you’d add the additional column to the dataset for a side-by-side comparison of grades with pass/no pass:

df['pass_no_pass'] = pass_no_pass
df

Another option is a list comprehension. It’s a clean way for a Python developer to go through rows of data, and it’s typically much faster than iterrows(), a Pandas function many developers avoid for its slowness. It’s a no-nonsense approach that extracts the column we’re analyzing and looks only at that. In a way, it’s the method most similar to our first, loop-based example.

First, we write a function that takes in the exam score as an input and determines whether the grade is passing or not:

import pandas as pd
df = pd.read_csv('Marks10.csv')

def passed(grade):
    # True when the score meets the 60-point pass mark
    if grade >= 60:
        return True
    else:
        return False

Then, we iterate over just the values in the “Exampoints” column and pass each one into the function we created earlier. We save the returned values in our new column, “Passed.”

pass_no_pass = [passed(score) for score in df['Exampoints']]
df['Passed'] = pass_no_pass

df
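
For comparison, here is a sketch that builds the same values once with iterrows() and once with the list comprehension, then checks that they agree. The small inline DataFrame is a made-up stand-in for Marks10.csv so the snippet runs on its own, and passed is a compact version of the helper with the same 60-point pass mark.

```python
import pandas as pd

# Stand-in for Marks10.csv so the snippet runs on its own (values are made up).
df = pd.DataFrame({"Exampoints": [31, 69, 60, 59, 100]})

def passed(grade):
    # True when the score meets the 60-point pass mark
    return grade >= 60

# Row by row with iterrows(): every row is built as a Series before we read it.
via_iterrows = [passed(row["Exampoints"]) for _, row in df.iterrows()]

# List comprehension over the single column we care about.
via_list = [passed(score) for score in df["Exampoints"]]

print(via_iterrows == via_list)
df["Passed"] = via_list
print(df)
```

Both routes give the same answers; the difference is the per-row overhead that iterrows() adds, which tends to matter more as the dataset grows.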

Although this method is fast, we still need to write a helper function to compute the boolean values. Next, we’ll use a built-in numpy function to do the same thing!

Let’s see how to do this with a Pandas dataframe and the numpy function np.where(), again adding a “Passed” column with boolean values.

We store booleans rather than the “Pass/No pass” strings from the first example because a “True” value clearly tells the model that the outcome named by the column, “Passed,” did occur.

The function np.where() takes three parameters:

  1. The comparison, which in our case is whether the student scored >= 60 (the same pass mark as in the earlier examples).
  2. The value to store where the comparison is true; if the student passed, we save this row value as “True.”
  3. The value to store otherwise; here, “False.”

import pandas as pd
import numpy as np

df = pd.read_csv('Marks10.csv')

df['Passed'] = np.where(df['Exampoints'] >= 60, True, False)

df
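
Here are a few hedged variations on the np.where() call, sketched on a small made-up DataFrame rather than Marks10.csv and assuming the same 60-or-higher pass mark as the loop example. The comparison on its own already yields booleans, np.where() can also write string labels, and a boolean column can be cast to 0/1 if a model expects numbers. The column names Passed_direct, Label, and Passed_int are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up scores standing in for the Exampoints column.
df = pd.DataFrame({"Exampoints": [31, 69, 60, 59, 100]})

# np.where picks the second argument where the condition holds, else the third.
df["Passed"] = np.where(df["Exampoints"] >= 60, True, False)

# The comparison itself is already a boolean Series, so this is equivalent.
df["Passed_direct"] = df["Exampoints"] >= 60

# np.where is more useful when the two outcomes are not simply True/False.
df["Label"] = np.where(df["Exampoints"] >= 60, "Pass", "No pass")

# Booleans cast cleanly to 1/0 for tools that want numeric input.
df["Passed_int"] = df["Passed"].astype(int)

print(df)
```

Which form to keep is mostly a matter of taste: the direct comparison is shorter, while np.where() makes the two outcomes explicit.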

With this true-or-false column, our model now has a clear signal for what “good” grades look like! And every outcome in “Passed” helps the model make predictions about student performance, including the scores of students who didn’t pass!

What’s fascinating about a data-driven analysis is that we can learn something valuable about student experience, performance, etc. from everyone, not just the high-achieving students. A model can only be accurate and generalizable with a high volume of diverse data points.

Therefore, even though a D or below may not satisfy you or your parents, know that to our model, your worth is immeasurable 😊.

Image (Source: Bugcat Capoo)

Okay, technically we can measure how influential a row of data is to training results thanks to ML, but let’s not be so cerebral all the time 😂

Magical no-code solution ✨🔮

Speaking of not thinking so hard, sometimes, when experimenting with data, we’d rather face data-related challenges than coding ones. Mage can alleviate the coding-related burdens for you!

To add a new column using a conditional operation, which can improve your model’s accuracy and generalizability, follow these steps:

  1. Edit data > Add column
  2. Fill in “conditional” for logic
  3. Then, fill in the logic for the columns you’re comparing (ex: “Exampoints >= 60”)
  4. Set the first outcome of the comparison evaluation as “True,” and the second as “False”


Happy experimenting! Hope Mage can give you a magical experience working with data 🌟
Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮
