This is Part 3 of the Pandas Series, where we explore common Pandas gotchas that can level up your interview answers and your daily data work.
Imagine you're in a FAANG interview. The hiring manager says, "We have a petabyte-scale dataset of user transactions. I need you to generate a daily summary report showing, for each country, the total revenue, the average order value, and the number of unique customers who made a purchase. How would you approach this with Pandas?"
This is the quintessential groupby problem. Your ability to answer it cleanly and efficiently is a massive signal of your skill level.
The Core Concept: Split-Apply-Combine
This is the mental model for any groupby operation. It's a three-step process:
Split: The data is broken into smaller groups based on the criteria you specify (e.g., all rows for 'USA', all rows for 'India', etc.).
Apply: A function is applied to each of these smaller groups independently (e.g., `sum()`, `mean()`, `count()`).
Combine: The results from each group are collected and combined into a new DataFrame or Series.
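The three steps above can be seen end to end with a tiny made-up DataFrame (the countries and revenue figures here are illustrative, not from any real dataset):

```python
import pandas as pd

# Hypothetical sample transactions, for illustration only
df = pd.DataFrame({
    'country': ['USA', 'USA', 'India', 'India', 'India'],
    'revenue': [100.0, 200.0, 50.0, 75.0, 25.0],
})

# Split on 'country', apply sum() to each group's revenue,
# combine the per-group results into a single Series
totals = df.groupby('country')['revenue'].sum()
# totals['USA'] -> 300.0, totals['India'] -> 150.0
```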
From Basic Aggregations to .agg()
The simplest groupby is applying one function to one column: `df.groupby('country')['revenue'].sum()`
But what if you need the sum, mean, AND count? You could run the code three times, but that's inefficient. The answer is the `.agg()` method.
Level 1: Multiple functions on one column:

```python
df.groupby('country')['revenue'].agg(['sum', 'mean', 'count'])
```
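On sample data, this produces one row per country and one output column per function (the figures below are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    'country': ['USA', 'USA', 'India'],
    'revenue': [100.0, 300.0, 50.0],
})

# One source column, three aggregations -> columns 'sum', 'mean', 'count'
stats = df.groupby('country')['revenue'].agg(['sum', 'mean', 'count'])
# stats.loc['USA'] -> sum=400.0, mean=200.0, count=2
```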
Level 2: Different functions on different columns. This is where it gets powerful. You use a dictionary to specify which function to apply to which column.
```python
summary = df.groupby('country').agg(
    {'revenue': 'sum',           # Sum of the revenue column
     'order_id': 'count',        # Count of all orders
     'customer_id': 'nunique'}   # Count of unique customers
)
```
Level 3: Named Aggregation. The dictionary method works, but the column names in the output can get messy. The modern, preferred, and most readable approach is Named Aggregation: it lets you control the output column names directly.
```python
summary = df.groupby('country').agg(
    total_revenue=('revenue', 'sum'),
    number_of_orders=('order_id', 'count'),
    unique_customers=('customer_id', 'nunique')
)
```
This syntax, `new_column_name=('source_column', 'function')`, is clean, explicit, and produces a perfectly formatted DataFrame. This is what interviewers love to see.
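Run on a small hypothetical transactions table, named aggregation yields exactly the column names you chose (sample values are invented for illustration):

```python
import pandas as pd

# Hypothetical transactions table
df = pd.DataFrame({
    'country':     ['USA', 'USA', 'India'],
    'order_id':    [1, 2, 3],
    'customer_id': ['a', 'a', 'b'],
    'revenue':     [100.0, 300.0, 50.0],
})

summary = df.groupby('country').agg(
    total_revenue=('revenue', 'sum'),
    number_of_orders=('order_id', 'count'),
    unique_customers=('customer_id', 'nunique'),
)
# summary.columns -> ['total_revenue', 'number_of_orders', 'unique_customers']
# USA: two orders from one customer, so unique_customers is 1
```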
⚡ For petabyte-scale data, this logic translates directly to PySpark’s groupBy().agg() — same concept, distributed execution.
And that’s a wrap for Part 3 of the Pandas Mastery Series 🎯
Next up: Should we dive into .apply() magic or merging/joining DataFrames next? Drop your pick below 👇