The Problem
When working with aggregating dataframes in pandas, I've found myself frustrated with how the results of aggregated columns are named. By default, they inherit the name of the column of which you're aggregating. For example,
import pandas as pd
import numpy as np
iris = pd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv')
iris.groupby('species').agg({
'sepal_length': np.mean
}).round(2)
sepal_length | |
---|---|
species | |
setosa | 5.01 |
versicolor | 5.94 |
virginica | 6.59 |
So obviously, we as the writers of the above code know that we took a mean of sepal length. But just looking at the output we have no idea what was done to the sepal length value. We can get around this if we enclose the aggregate function in a list:
iris.groupby('species').agg({
'sepal_length': [np.mean]
}).round(2)
sepal_length | |
---|---|
mean | |
species | |
setosa | 5.01 |
versicolor | 5.94 |
virginica | 6.59 |
Pandas adds a row (technically adds a level, creating a multiIndex) to tell us the different aggregate functions we applied to the column. In this case, we only applied one, but you could see how it would work for multiple aggregation expressions.
This approach works well. If you want to collapse the multiIndex to create more accessible columns, you can leverage a concatenation approach, inspired by this stack overflow post (note that other implementations similarly use .ravel()
):
df = iris.groupby('species').agg({
'sepal_length': [np.mean]
}).round(2)
df.columns = ['_'.join(gp) for gp in df.columns.values]
df
sepal_length_mean | |
---|---|
species | |
setosa | 5.01 |
versicolor | 5.94 |
virginica | 6.59 |
Both of these solutions have a few immediate issues:
- Column names can still be far from readable English;
- The concatenation approach may not scale for all applications;
- Pandas takes the
__name__
attribute of any custom functions and uses it for the column name here. In the case of aggregating with custom functions or lambda functions, it's not likely the column names will make sense in these formats.
A Different Solution
We can leverage the __name__
attribute to create a clearer column name and maybe even one others can make sense of. 👍
To be clear: we could obviously rename any of these columns after the dataframe is returned, but in this case I wanted a solution where I could set column names on the fly.
Taking Advantage of the __name__
Attribute
If you're unfamiliar, the __name__
attribute is something every function you or someone else defines in python comes along with.
def this_function():
pass
print(this_function.__name__)
this_function
We can change this attribute after we define it:
def this_function():
pass
this_function.__name__ = 'that.'
print(this_function.__name__)
that.
There are also some great options for adjusting a function __name__
as you define the function using decorators. More about that here.
Returning to our application, lets examine the following situation:
def my_agg(x):
return (x/20).sum()
iris.groupby('species').agg({
'sepal_length': [my_agg],
'sepal_width': [my_agg]
}).round(2)
sepal_length | sepal_width | |
---|---|---|
my_agg | my_agg | |
species | ||
setosa | 12.52 | 8.57 |
versicolor | 14.84 | 6.92 |
virginica | 16.47 | 7.44 |
We could add a line adjusting the __name__
of my_agg()
before we start our aggregation. But what if we could rename the function as we were aggregating? Similar to how we can rename columns in a SQL statement as we define them.
Higher-order Renaming Function
To solve this problem, we can define a higher-order function which returns a copy of our original function, but with the name attribute changed. It looks like this:
def renamer(agg_func,desired_name):
def return_func(x):
return agg_func(x)
return_func.__name__ = desired_name
return return_func
We can apply this function outside of our application of my_agg
to reset the __name__
on-the-fly:
iris.groupby('species').agg({
'sepal_length': [renamer(my_agg,'Cool Name')],
'sepal_width': [renamer(my_agg,'Better Name')]
}).round(2)
sepal_length | sepal_width | |
---|---|---|
Cool Name | Better Name | |
species | ||
setosa | 12.52 | 8.57 |
versicolor | 14.84 | 6.92 |
virginica | 16.47 | 7.44 |
Realistic Example
Here's a perfect scenario to utilize this solution:
from numpy import percentile
iris.groupby('species').agg({
'sepal_length': [renamer(lambda x: percentile(x,25),'25th Percentile')],
'sepal_width': [renamer(lambda x: percentile(x,75),'75th Percentile')]
}).round(2)
sepal_length | sepal_width | |
---|---|---|
25th Percentile | 75th Percentile | |
species | ||
setosa | 4.80 | 3.68 |
versicolor | 5.60 | 3.00 |
virginica | 6.22 | 3.18 |
In order to get various percentiles of sepal widths and lengths, we can leverage lambda functions and not have to bother defining our own. We use the renamer to fix give these lambda functions understandable names.
To take this a step further, we can include the column name in the rename string and drop the top level of the column multiIndex:
from numpy import percentile
df3 = iris.groupby('species').agg({
'sepal_length': [renamer(lambda x: percentile(x,25),'Length 25th Percentile')],
'sepal_width': [renamer(lambda x: percentile(x,75),'Width 75th Percentile')]
}).round(2)
df3.columns = df3.columns.droplevel()
df3
Length 25th Percentile | Width 75th Percentile | |
---|---|---|
species | ||
setosa | 4.80 | 3.68 |
versicolor | 5.60 | 3.00 |
virginica | 6.22 | 3.18 |
Final Thoughts
There are many ways to skin a cat when working with pandas dataframes, but I'm constantly looking for ways to simplify and speed-up my work-flow. This solution helps me work through aggregation steps and easily create sharable tables. It certainly won't work for all situations, but consider using it the next time you get frustrated with unhelpful column names!
Top comments (0)