TLDR
Terraforming a planet is a large-scale project to make another world habitable for survival. We’ll begin on a smaller scale by terraforming datasets to calculate the cost of surviving the Titanic.
Outline
- Introduction
- Before we begin
- Functional programming
- Applying Functions
- Aggregating Data
- Transforming Data
- Data Analysis
- Conclusion
From the SHAP article, we know that people in some groups were more likely to survive when the Titanic sank. But what does it cost to survive the Titanic?
In “Product developers’ guide to getting started with AI — Part 3: Terraforming dataframes”, we’ll look at the price point of a “golden ticket” that gives the best chance of survival. Based on the SHAP values we calculated, survival correlates directly with a passenger’s sex, class, fare, and age.
Manipulating datasets is a quick and easy way to rearrange data and extract exactly what you need. In this series we’ve gone over how to pick out and search through data, so it’s time to look at transforming the underlying data itself.
Before we Begin
It is highly advised to have read part 2 before continuing. In this guide, we’ll be using the Titanic dataset along with Google Colab. I’ll briefly reuse techniques from previous guides, such as surfing and extracting, to quickly set us up with an ideal dataframe for applying transformations and functions.
Functional Programming
Python supports functional programming, which means operations can be expressed as functions. This matters because later in this guide we’ll be creating functions and passing lambda expressions to apply and transform. If you’re already comfortable with Python, you may skip this section. Otherwise, keep reading for a quick refresher on the syntax for defining functions and lambda expressions.
In Python, a function is created by the “def” keyword and takes in a number of arguments.
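For example, a small adder function might look like this (a minimal sketch; the article’s original code isn’t reproduced here):

```python
# A simple function defined with the def keyword.
def adder(a, b):
    return a + b

print(adder(2, 3))  # 5
```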
We can rewrite the adder function as a lambda expression for shorthand.
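As a sketch, the same adder written as a lambda expression could look like:

```python
# The same adder as a one-line lambda expression.
adder = lambda a, b: a + b

print(adder(2, 3))  # 5
```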
For a small operation, like the adder above, it’s best practice to use a lambda expression. For more complex calculations that are reused multiple times, use a function. When in doubt, consider whether there is a simpler way to express it and how often it will be repeated.
Applying Functions
The simplest way to manipulate a dataframe is with apply. Apply takes in a function and runs it over every column or row of a dataframe. This is useful for quick calculations or for encoding data.
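As a hedged sketch (the column names follow the Kaggle-style Titanic CSV used in this series; the file path and new column names are placeholders of my own):

```python
import pandas as pd

# Placeholder path: point this at wherever your Titanic CSV lives.
df = pd.read_csv("titanic.csv")

# Run a function over every value in a single column,
# e.g. encoding "female"/"male" as 1/0.
df["Sex_encoded"] = df["Sex"].apply(lambda s: 1 if s == "female" else 0)

# axis=1 runs the function over each row instead of each column.
df["Is_minor"] = df.apply(lambda row: row["Age"] < 18, axis=1)
```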
Based on the SHAP values, we form a hypothesis that women and children are more likely to survive, possibly because they board the lifeboats first, and because the upper-class areas of the ship are less densely populated, allowing a quicker escape than from the lower-class areas.
To find the average price point of the winning ticket, a ticket for a lady in 1st class, we first need to filter down our rows and columns. In the dataframe, “Pclass” indicates whether a passenger is located in the 1st, 2nd, or 3rd class area of the Titanic. The average is the sum of the prices divided by the count of items, but it can also be calculated with the mean method.
Using what we’ve learned in part 2, we filter the rows down using conditions on the sex, passenger class, and age columns. We define our filter as:
- Sex is female
- Passenger class is 1st class only
- Age is no lower than 40 years old
Then we reduce it to only the relevant information: “Fare”, the price of the golden ticket, as sketched below.
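A sketch of that filter, assuming the Kaggle-style column names (“Sex”, “Pclass”, “Age”, “Fare”) and a variable name of my own choosing:

```python
# Filter rows: female passengers, in 1st class, aged 40 or older.
golden = df[(df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] >= 40)]

# Reduce to the relevant column: the fare, i.e. the price of the golden ticket.
golden_fares = golden["Fare"]
```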
Then, we take the sum of the ‘Fare’ column and divide by the total number of items.
Unlike in part 2, where we overwrote the values, here we store the result in a new variable called average_price. This lets us preserve the old data.
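A minimal sketch of that calculation, continuing with the assumed golden_fares series from above:

```python
# Average price = sum of the fares divided by how many fares there are.
average_price = golden_fares.sum() / golden_fares.count()
print(average_price)
```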
We can confirm this is the same when calculating the mean of the prices.
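Continuing the sketch:

```python
# The mean method gives the same result as sum / count.
print(golden_fares.mean())
```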
Pandas has multiple other built-in mathematical functions, such as median, min, and max.
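For example, still using the assumed golden_fares series:

```python
# A few of pandas' other built-in statistics on the same column.
print(golden_fares.median())
print(golden_fares.min())
print(golden_fares.max())
```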
Unfortunately, each of these must be calculated separately, which makes apply fine for short, one-off functions. But what about running several calculations at once? That’s where aggregate, or agg, shines in removing repetition.
Aggregating Data
If you know which aggregations you want to apply ahead of time, use agg instead. When you need several calculations, such as the sum, mean, or standard deviation, aggregate is a neater way to get them than chaining apply.
For instance, if we use aggregate, we can grab multiple statistics all at once. For the next section we’ll need the standard deviation, so let’s calculate that as well. Note: the shorthand agg is functionally equivalent to aggregate.
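A sketch of what that might look like with the assumed golden_fares series:

```python
# Several statistics in one call; "std" is the standard deviation
# we'll need in the next section.
summary = golden_fares.agg(["sum", "count", "mean", "std"])
print(summary)
```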
Transforming Data
Another way of manipulating a dataframe is by using transform. This is similar to apply, except that transform returns a result shaped like the input itself: the function is applied to every column and the output lines up row for row with the original dataframe. Because the result maps back onto the original, it’s easy to chain multiple operations or write the values straight back into new columns.
Because transform maps back onto the original, the result must be the same length as the input. This means that aggregating functions such as sum(), mean(), and max()/min() don’t work on their own, as they condense all the data into a single value.
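A quick illustration (the exact error message may vary across pandas versions):

```python
# Aggregating functions collapse the column to a single value,
# so transform is expected to reject them.
try:
    df["Fare"].transform("sum")
except ValueError as err:
    print(err)  # e.g. "Function did not transform"
```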
Back to the original problem: find out what percentage of passengers hold a “golden ticket”. With transform we can fold an aggregate, such as a total, back into a series of individual per-row values, which makes transform well suited to looking at the finer details.
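One way to sketch this (the fare_share name is mine, not necessarily the original notebook’s): hand transform a function that divides the column by its own sum, so every passenger gets their share of the total fare.

```python
# df is the Titanic dataframe loaded earlier.
# Each passenger's fare as a fraction of the total fare collected.
# Using a one-column dataframe means the lambda receives the whole
# "Fare" column, and the result keeps the original length.
fare_share = df[["Fare"]].transform(lambda x: x / x.sum())
print(fare_share.head())
```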
Likewise, summing the individual results should add back up to 1.0 (100%):
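```python
# The individual shares add back up to (approximately) 1.0, i.e. 100%.
print(fare_share["Fare"].sum())
```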
Data Analysis
To find out how many passengers paid top dollar, we take the original dataset and calculate the percentages, leveraging transform’s ability to maintain length along with groupby to group our data (see the sketch after the questions below).
What slice of the “pie” do the golden ticket passengers make up?
What percentage of passengers own a golden ticket?
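A hedged sketch of that calculation; the golden_ticket and group_share column names are mine, and the golden-ticket definition follows the filter from earlier:

```python
# Flag each passenger: do they hold a "golden ticket"
# (female, 1st class, aged 40 or above)?
df["golden_ticket"] = (
    (df["Sex"] == "female") & (df["Pclass"] == 1) & (df["Age"] >= 40)
)

# transform("count") keeps the original length, so every row carries the
# size of its golden_ticket group; dividing by the total row count turns
# that into each group's share of all passengers.
df["group_share"] = df.groupby("golden_ticket")["Fare"].transform("count") / len(df)

# The slice of the "pie" that golden ticket passengers make up.
print(df.loc[df["golden_ticket"], "group_share"].iloc[0])
```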
Key Differences
- Transform returns a result aligned with its input, so the equal-length requirement must be satisfied. Therefore, transform can’t handle aggregating methods (sum, mean, standard deviation, etc.) on its own.
- Apply can’t take multiple aggregations at once (one column at a time), while agg can.
Highlights
- Transform is best used to create a new column in a table so you can see the finer detail.
- Aggregate and apply are useful for calculating a single summary value.
Conclusion
That’s it! You’re now ready to tackle future problems in data science. Using your newfound knowledge, I suggest modifying the steps above to calculate what percentage of golden ticket holders survived, as your next step in familiarizing yourself with these core AI concepts. As always, stay tuned for future guides, where we’ll cover more topics ranging from joining datasets to deploying a machine learning model to the cloud.