Mage

Posted on Sep 30, 2021

Product developers’ guide to getting started with AI — Part 3: Terraforming dataframes

TLDR

Terraforming a planet means reshaping it on a large scale so that people can survive there. We’ll begin by terraforming datasets to calculate the cost of survival on the Titanic.

Outline

Introduction
Before we begin
Functional programming
Applying Function
Aggregating Data
Transforming Data
Data Analysis
Conclusion

From the SHAP article, we know that people in some groups were more likely to survive when the Titanic sank. But what did it cost to survive the Titanic?

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jfue71jxrb2d79yfplxk.jpeg)_Titanic meets Iceberg (Source: [Britannica](https://m.mage.ai/how-to-interpret-and-explain-your-machine-learning-models-using-shap-values-471c2635b78e))_

In Part 3 of this guide, we’ll look at the price point of a “golden ticket” that gives the best chance of survival. Based on the SHAP values calculated, survival is strongly related to sex, passenger class, fare, and age.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wotn4nd9tnidp8ozdkg3.png)_Mage Analyzer Page (Source: SHAP)_

Manipulating datasets is a quick and easy way to rearrange data and extract what you need. In this series we’ve gone over how to pick and search through data, so it’s time to look at transforming the underlying data.

Before we Begin

It is highly advised to read part 2 before continuing. In this guide, we’ll be using the Titanic dataset along with Google Colab. I’ll briefly reuse techniques from the previous parts, such as surfing and extracting, to quickly start us off with an ideal dataframe for applying transformations and functions.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mt8u97zvcm7q21v8d4xf.png)_Part 2: Surfing through dataframes_

Functional Programming

Python supports functional programming: functions are first-class values, so they can be stored in variables and passed as arguments to other functions. This is important as later on in this guide we’ll be creating functions and passing lambda expressions to apply and transform. If you are comfortable enough with Python, you may skip this section. Otherwise, keep reading for a quick refresher on the syntax for defining functions and lambda expressions.

In Python, a function is created by the “def” keyword and takes in a number of arguments.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wz4dzavkgsd1h0dc89dt.png)_Basic Adder that adds 1 to the value_

The adder function can be rewritten as a lambda expression, which is a shorthand for small functions.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gq26efazb0ceoye2tiix.png)_Lambda expression of the adder_

For a small operation, like the adder above, a lambda expression is a convenient shorthand, especially when it is passed directly to another function. For more complex calculations, or ones that are used multiple times, define a function. When in doubt, check whether there is a simpler way and how much repetition will occur.
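As a rough sketch of the two forms described above, here is the adder written both ways. The names `adder` and `add_one` are only illustrative and may not match the ones in the screenshots.

```python
# Named function: takes one argument and returns it plus 1.
def adder(value):
    return value + 1


# Equivalent lambda expression, bound to a name here only for comparison.
add_one = lambda value: value + 1

print(adder(41))    # 42
print(add_one(41))  # 42

# Lambdas are most useful when passed inline to another function.
incremented = list(map(lambda value: value + 1, [1, 2, 3]))
print(incremented)  # [2, 3, 4]
```

Both versions behave the same. The lambda simply saves a few lines when the function is short and used once.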

Applying Function

The simplest way of manipulating a dataframe is by using apply. Apply takes in a function and runs it on every column or every row of a dataframe. This is useful for quickly calculating or encrypting data.
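To make the column versus row behavior concrete, here is a minimal sketch. The rows are made up for illustration (they are not the real Titanic data), and only the column names are borrowed from the dataset.

```python
import pandas as pd

# Made-up rows using Titanic-style column names (not the real data).
df = pd.DataFrame({
    "Age": [20.0, 40.0, 10.0],
    "Fare": [10.0, 80.0, 5.0],
})

# Default axis=0: the function receives each column as a Series.
column_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series.
fare_per_year = df.apply(lambda row: row["Fare"] / row["Age"], axis=1)

# On a single column (a Series), apply runs the function on each value.
doubled_fare = df["Fare"].apply(lambda fare: fare * 2)

print(column_max["Fare"])      # 80.0
print(fare_per_year.tolist())  # [0.5, 2.0, 0.5]
print(doubled_fare.tolist())   # [20.0, 160.0, 10.0]
```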

Based on the SHAP values, we form a hypothesis that women and children are more likely to survive. One possible reason is that they could board the lifeboats first. Another is that the upper class areas of the ship were less densely populated, which let those passengers escape more quickly than passengers in the lower class.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ggsxn22vvxihjzl5ehzc.jpeg)_Lifeboats on the Titanic (Source: DailyMail)_

To find the average price point of the winning ticket, a ticket for a young lady in 1st class, we first need to filter down our rows and columns. In the dataframe, “Pclass” represents whether a passenger is located in the 1st class, 2nd class, or 3rd class area of the Titanic. The average is calculated as the sum of the prices divided by the count of items, and may also be calculated with the mean method.

Using what we’ve learned in part 2, we filter the rows using the sex, passenger class, and age columns. We define our filter as

Having the sex of a female
Passenger class of only 1st class
Age must be no higher than 40 years old

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ex7qhtf652swi2qo6csa.png)

Then reduce it to show only the relevant information: ‘Fare’, the price of the golden ticket.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j3he7r6oq95au018vq2h.png)
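The filter and the column selection can be sketched roughly as follows. The passengers are made up, and the age cutoff is an assumption: "young" is treated here as 40 or younger, which may differ from the exact condition in the screenshots.

```python
import pandas as pd

# Made-up passengers with Titanic-style columns (not the real data).
df = pd.DataFrame({
    "Sex": ["female", "female", "male", "female", "female"],
    "Pclass": [1, 1, 1, 3, 1],
    "Age": [29.0, 35.0, 30.0, 22.0, 58.0],
    "Fare": [80.0, 120.0, 50.0, 8.0, 150.0],
})

# Combine the three conditions; each comparison needs its own parentheses.
# Assumed cutoff: a "young" passenger is 40 or younger.
is_golden = (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] <= 40)

# Keep only the matching rows, then only the 'Fare' column.
golden = df[is_golden]
golden_fares = golden["Fare"]

print(golden_fares.tolist())  # [80.0, 120.0]
```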

Then, we take the sum of the ‘Fare’ column and divide by the total number of items.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iy2avix0es4m85ie5j2b.png)_The total price of all golden tickets is $6484.80_ ![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uivkqongv9d4s5hba9y1.png)_Average price of $113.77_

Unlike part 2, where we overwrote the values, we instead store the result in a new variable called average_price. This lets us preserve the old data.

We can confirm that this matches the result of calculating the mean of the prices.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6o926xcmrv3yo7xu59o4.png)_The mean matches the average price of $113.77_

Pandas has multiple other built-in mathematical functions, such as median and more.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/v44dllv9bctghqh9d37v.png)_Median is $86.50_
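Here is a small sketch of the same arithmetic on made-up fares, so the numbers will not match the $113.77 and $86.50 above. The name `golden_fares` is a placeholder for the filtered ‘Fare’ column.

```python
import pandas as pd

# Made-up fares standing in for the filtered 'Fare' column.
golden_fares = pd.Series([80.0, 120.0, 100.0, 60.0, 240.0], name="Fare")

# Average by hand: sum of the prices divided by the number of items.
total_price = golden_fares.sum()
average_price = total_price / len(golden_fares)

# The built-in methods give the same mean, plus the median.
mean_price = golden_fares.mean()
median_price = golden_fares.median()

print(total_price)    # 600.0
print(average_price)  # 120.0
print(mean_price)     # 120.0
print(median_price)   # 100.0
```

One thing to keep in mind: `len` counts every row, while `mean` skips missing values, so the two averages only agree when the column has no gaps.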

Unfortunately, each of these must be calculated separately. That makes apply good for short functions, but what about longer ones? That’s where aggregate, or agg, shines in removing repetition.

Aggregating Data

If you know which aggregations you want ahead of time, use agg instead. When calculating several statistics such as the sum, mean, or standard deviation, aggregate is neater than using apply.

For instance, with aggregate we can grab multiple statistics all at once. For our next section, we’ll need the standard deviation so let’s calculate that as well. Note: agg is an alias for aggregate, so the two are functionally equivalent.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xmqvcsdkamcz6mp1rf0t.png)_1 liner for sum, mean, max, and median_
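A one-liner along these lines might look like the sketch below. The fares are made up, and the list of statistics (including the standard deviation mentioned above) is just one possible choice.

```python
import pandas as pd

# Made-up fares standing in for the filtered 'Fare' column.
golden_fares = pd.Series([80.0, 120.0, 100.0, 60.0, 240.0], name="Fare")

# Pass a list of aggregation names to get every statistic in one call.
summary = golden_fares.agg(["sum", "mean", "max", "median", "std"])

print(summary["sum"])     # 600.0
print(summary["mean"])    # 120.0
print(summary["max"])     # 240.0
print(summary["median"])  # 100.0
print(summary["std"])     # sample standard deviation
```

The result is a Series indexed by the names of the statistics, so each value can be looked up by name.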

Transforming Data

Another way of manipulating a dataframe is by using transform. This is similar to apply, except that the result always has the same shape as the input: the function is applied to each column and returns one output value for every input value. Because the output lines up with the original rows, it can be stored alongside the original data and reused in further operations.

Because the output must line up with the input, the result must be the same length as the original input. This means that functions such as sum(), mean(), and max()/min() don’t work on their own, as they condense or aggregate all the data into 1 value.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4eq6r2xdj06flstjtugl.png) ![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ad9oapysbn1rcu6wu4ir.png)
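The length rule can be checked with a short sketch on made-up fares. The expectation that pandas raises a `ValueError` for an aggregation reflects the behavior of recent pandas versions, so treat the exact error type as an assumption if you are on a much older release.

```python
import pandas as pd

# Made-up fares (not the real data).
df = pd.DataFrame({"Fare": [80.0, 120.0, 100.0]})

# An element-wise function keeps one output row per input row.
doubled = df.transform(lambda col: col * 2)
same_length = len(doubled) == len(df)

# An aggregation collapses the column to one value, so transform rejects it.
try:
    df.transform("sum")
    rejected = False
except ValueError:
    rejected = True

print(same_length)  # True
print(rejected)     # True
```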

Back to the original problem: find out what percentage of passengers have a “golden ticket”. Using transform, we can combine an aggregation with a series to calculate the individual values, dividing each fare by the total. This makes transform more useful for looking at the finer details.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/1ux2zdrbxtjpm3mt8koj.png)_Calculate individual percentages_

Likewise, summing the individual results should give 1.0 (100%).

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/h4oratww7d88ploc7m74.png)_Sanity Check_
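A minimal sketch of the percentage calculation and the sanity check, using made-up fares chosen so the shares come out as round numbers. The name `share` is illustrative.

```python
import pandas as pd

# Made-up fares (not the real data).
fares = pd.DataFrame({"Fare": [50.0, 150.0, 200.0]})

# Divide every fare by the column total; the output keeps one row per fare.
share = fares.transform(lambda col: col / col.sum())

print(share["Fare"].tolist())  # [0.125, 0.375, 0.5]

# Sanity check: the individual shares add up to 1.0 (100%).
print(share["Fare"].sum())     # 1.0
```

The aggregation (`col.sum()`) happens inside the function, but the function still returns one value per row, which is why transform accepts it.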

Data Analysis

To find out how many passengers paid top dollar, first we take the original dataset and calculate the percentages. We leverage transform’s ability to maintain length, along with groupby to group our data. After a groupby, transform repeats each group’s result on every row of that group.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2qqez2y7yt8wdulkmev0.png)

What slice of the “pie” do the golden ticket passengers make up?

23% of all income on the ship is from golden ticket sales.

What percentage of passengers own a golden ticket?

Only 6% of all passengers purchased a golden ticket.
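The two questions above can be sketched with groupby and transform as follows. The data is made up, so the shares will not match the 23% and 6% reported here, and the `Golden` column is a hypothetical flag marking the rows that passed the golden ticket filter.

```python
import pandas as pd

# Made-up passengers; "Golden" is a hypothetical flag for golden ticket rows.
df = pd.DataFrame({
    "Golden": [True, True, False, False, False, False, False, False],
    "Fare": [100.0, 100.0, 25.0, 25.0, 50.0, 50.0, 25.0, 25.0],
})

# After a groupby, transform broadcasts each group's total back to its rows,
# so the result still has one value per passenger.
df["GroupFare"] = df.groupby("Golden")["Fare"].transform("sum")
df["GroupFareShare"] = df["GroupFare"] / df["Fare"].sum()

# Same idea for head counts: group size divided by the number of passengers.
df["GroupSize"] = df.groupby("Golden")["Fare"].transform("count")
df["GroupPassengerShare"] = df["GroupSize"] / len(df)

golden_row = df[df["Golden"]].iloc[0]
print(golden_row["GroupFareShare"])       # 0.5
print(golden_row["GroupPassengerShare"])  # 0.25
```

In this toy example, a quarter of the passengers account for half of the fare income, which is the same kind of comparison as the 6% versus 23% result.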

Key Differences

Transform returns a result based on its input, so the output must have the same length. Therefore, transform on its own can’t handle aggregate methods (sum, mean, standard deviation, etc.).
Apply handles one calculation at a time, while agg can take several aggregations at once.

Highlights

Transform is best used to add a new column to a table to see fine detail.
Aggregate and apply are useful for calculating a single summary value.

Conclusion

That’s it! Now you’re ready to tackle future problems in data science. As a next step in familiarizing yourself with these core AI concepts, I suggest using your newfound knowledge to modify the steps and calculate what percentage of golden ticket holders survived. As always, stay tuned for future guides where we’ll go over more topics, ranging from joining datasets to deploying a machine learning model to the cloud.

![Alt Text](https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2y43q2ndwmphnu00g0t4.gif)_I’ve got a Golden Ticket! (Source: South Park)_
