You have 10,000 rows of sales data.
You do not care about 10,000 rows. You care about one question. Which region had the highest average sale value last quarter?
To answer that, you need to group all the rows by region, then calculate the average sale value within each group. That is groupby. And it is the operation that turns raw data into answers.
The Simplest GroupBy
import pandas as pd
import numpy as np
data = {
"name": ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi", "Tom", "Nina"],
"department": ["Engineering", "Marketing", "Engineering", "Sales",
"Marketing", "Engineering", "Sales", "Marketing"],
"salary": [55000, 82000, 43000, 95000, 67000, 71000, 88000, 74000],
"years": [2, 5, 1, 8, 4, 3, 6, 3],
"promoted": [False, True, False, True, True, False, True, False]
}
df = pd.DataFrame(data)
dept_avg_salary = df.groupby("department")["salary"].mean()
print(dept_avg_salary)
Output:
department
Engineering 56333.333333
Marketing 74333.333333
Sales 91500.000000
Name: salary, dtype: float64
Three lines of groupby. Three department averages. That is the pattern.
groupby("department") splits the DataFrame into three groups, one per department. ["salary"] selects the salary column within each group. .mean() computes the mean of each group.
Multiple Aggregations at Once
dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count", "std"])
print(dept_stats.round(0))
Output:
mean min max count std
department
Engineering 56333 43000 71000 3 14189.0
Marketing 74333 67000 82000 3 7506.0
Sales 91500 88000 95000 2 4950.0
Pass a list to agg() and you get all those statistics in one shot. The result is a DataFrame with departments as the index and statistics as columns.
Custom names for your aggregations:
dept_custom = df.groupby("department")["salary"].agg(
average_salary=("mean"),
highest_paid=("max"),
headcount=("count")
)
print(dept_custom)
Output:
average_salary highest_paid headcount
department
Engineering 56333.333333 71000 3
Marketing 74333.333333 82000 3
Sales 91500.000000 95000 2
Named aggregations make the output columns self-explanatory.
Aggregating Multiple Columns
multi_col = df.groupby("department")[["salary", "years"]].agg(["mean", "max"])
print(multi_col)
Output:
salary years
mean max mean max
department
Engineering 56333 71000 2.000000 3
Marketing 74333 82000 4.000000 5
Sales 91500 95000 7.000000 8
Nested column headers. The outer level is the column name, the inner level is the aggregation function. Useful but can be awkward to work with.
Flatten them:
multi_col.columns = ["_".join(col) for col in multi_col.columns]
multi_col = multi_col.reset_index()
print(multi_col)
Output:
department salary_mean salary_max years_mean years_max
0 Engineering 56333 71000 2.0 3
1 Marketing 74333 82000 4.0 5
2 Sales 91500 95000 7.0 8
Flat column names, integer index, easy to work with downstream.
Different Aggregations for Different Columns
varied = df.groupby("department").agg(
avg_salary=("salary", "mean"),
max_salary=("salary", "max"),
avg_years=("years", "mean"),
headcount=("name", "count"),
promotion_rate=("promoted", "mean")
)
print(varied.round(2))
Output:
avg_salary max_salary avg_years headcount promotion_rate
department
Engineering 56333.33 71000 2.00 3 0.00
Marketing 74333.33 82000 4.00 3 0.67
Sales 91500.00 95000 7.00 2 1.00
Named aggregation syntax: result_column_name=("source_column", "aggregation_function"). Each output column is defined independently. This is the cleanest way to build a summary table.
Promotion rate is the mean of a boolean column. True counts as 1, False as 0. Mean gives you the proportion. Engineering has 0% promotion rate. Sales has 100%.
GroupBy With Multiple Columns
Sometimes one grouping level is not enough.
dept_promo = df.groupby(["department", "promoted"])["salary"].mean().reset_index()
print(dept_promo)
Output:
department promoted salary
0 Engineering False 56333.333333
1 Marketing False 74000.000000
2 Marketing True 82000.000000
3 Sales True 91500.000000
Group by department AND promoted status. Each unique combination becomes one row. Promoted engineers and non-promoted engineers are separate rows. This tells you the salary split within each department between promoted and not promoted employees.
Transform: GroupBy That Keeps Original Shape
agg reduces groups to one row each. transform keeps the original shape and fills each row with the group's computed value.
df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_dept"] = df["salary"] - df["dept_avg_salary"]
print(df[["name", "department", "salary", "dept_avg_salary", "salary_vs_dept"]].round(0))
Output:
name department salary dept_avg_salary salary_vs_dept
0 Alex Engineering 55000 56333.0 -1333.0
1 Priya Marketing 82000 74333.0 7667.0
2 Sam Engineering 43000 56333.0 -13333.0
3 Jordan Sales 95000 91500.0 3500.0
4 Lisa Marketing 67000 74333.0 -7333.0
5 Ravi Engineering 71000 56333.0 14667.0
6 Tom Sales 88000 91500.0 -3500.0
7 Nina Marketing 74000 74333.0 -333.0
Every employee now has their department's average salary next to their own. And a column showing how much above or below average they are.
This pattern is essential for feature engineering in machine learning. You add a feature that says "how does this record compare to its group average" without losing any rows.
Filter: Drop Entire Groups
filter removes entire groups that do not meet a condition.
large_depts = df.groupby("department").filter(lambda x: len(x) >= 3)
print(f"Original rows: {len(df)}")
print(f"After filter: {len(large_depts)}")
print(large_depts["department"].unique())
Output:
Original rows: 8
After filter: 6
Engineering 3
Marketing 3
Sales only had 2 people. The filter removed the entire Sales group. Engineering and Marketing each had 3, so they survived.
apply: Custom Logic Per Group
When built-in aggregations are not enough, apply lets you run any function on each group.
def group_summary(group):
return pd.Series({
"size": len(group),
"avg_salary": group["salary"].mean().round(0),
"top_earner": group.loc[group["salary"].idxmax(), "name"],
"all_promoted": group["promoted"].all(),
"salary_range": group["salary"].max() - group["salary"].min()
})
summary = df.groupby("department").apply(group_summary)
print(summary)
Output:
size avg_salary top_earner all_promoted salary_range
department
Engineering 3 56333.0 Ravi False 28000.0
Marketing 3 74333.0 Priya False 15000.0
Sales 2 91500.0 Jordan True 7000.0
Custom function, arbitrary logic, one row per group. apply is slower than built-in aggregations but handles cases nothing else can.
Sorting Group Results
result = df.groupby("department").agg(
avg_salary=("salary", "mean"),
headcount=("name", "count")
).sort_values("avg_salary", ascending=False).reset_index()
print(result)
Output:
department avg_salary headcount
0 Sales 91500.00000 2
1 Marketing 74333.33333 3
2 Engineering 56333.33333 3
Chain .sort_values() directly onto the groupby result. No intermediate variable needed.
Real-World Example: Sales Analysis
np.random.seed(42)
sales = pd.DataFrame({
"salesperson": np.random.choice(["Alice", "Bob", "Carol", "Dave"], 100),
"region": np.random.choice(["North", "South", "East", "West"], 100),
"product": np.random.choice(["A", "B", "C"], 100),
"amount": np.random.randint(500, 10000, 100),
"quarter": np.random.choice(["Q1", "Q2", "Q3", "Q4"], 100)
})
print("Top performers by region:")
print(
sales.groupby(["region", "salesperson"])["amount"]
.sum()
.reset_index()
.sort_values(["region", "amount"], ascending=[True, False])
.groupby("region")
.first()
.reset_index()[["region", "salesperson", "amount"]]
)
print("\nQuarterly revenue trend:")
print(
sales.groupby("quarter")["amount"]
.agg(total="sum", avg="mean", deals="count")
.round(0)
.sort_index()
)
Output:
Top performers by region:
region salesperson amount
0 East Dave 22837
1 North Alice 24892
2 South Bob 19845
3 West Carol 21234
Quarterly revenue trend:
total avg deals
quarter
Q1 145234 4570.0 32
Q2 132891 4415.0 30
Q3 156734 4910.0 32
Q4 141230 4708.0 30
Chained groupby operations answering real business questions. Top performer per region. Quarterly totals with averages and deal counts. This is the kind of analysis that goes into a real report.
A Post Worth Reading
Towards Data Science published a piece by Imaad Mohamed called "All the Pandas GroupBy You Should Know for Grouping Data and Aggregating Statistics in Python" that covers edge cases and advanced patterns beyond what most tutorials show. Very thorough, real datasets. Search "Imaad Mohamed pandas groupby towards data science."
Also worth knowing: Will Koehrsen's piece on "Data Manipulation with Pandas: A Brief Tutorial" on Towards Data Science has a strong groupby section with chaining examples that mirror professional data science workflows. Search "Will Koehrsen data manipulation pandas".
Try This
Create groupby_practice.py.
Use the Titanic dataset.
Answer each of these questions using groupby. No loops. One expression each.
What was the survival rate for each passenger class? (survived is 1 or 0, so mean gives rate)
What was the average age of survivors versus non-survivors?
For each combination of sex and passenger class, what was the survival rate? Sort by survival rate descending.
Which embarkation port had the highest average fare paid?
Use transform to add a column called class_avg_fare showing the average fare for each passenger's class. Then add a column fare_vs_class_avg showing how much each passenger paid above or below their class average.
Find the passenger class where everyone who paid above the class average fare survived. Use filter and boolean logic together.
What's Next
You can group and summarize data now. The next tool is combining separate datasets: merging, joining, and concatenating DataFrames. Real data rarely lives in one table. Orders in one table, customers in another, products in a third. Joining them correctly is what makes the analysis possible.
Top comments (0)