Akhilesh

Posted on Apr 26

Grouping Data to Find Patterns

#ai #programming #productivity #beginners

You have 10,000 rows of sales data.

You do not care about 10,000 rows. You care about one question. Which region had the highest average sale value last quarter?

To answer that, you need to group all the rows by region, then calculate the average sale value within each group. That is groupby. And it is the operation that turns raw data into answers.

The Simplest GroupBy

import pandas as pd
import numpy as np

data = {
    "name":       ["Alex", "Priya", "Sam", "Jordan", "Lisa", "Ravi", "Tom", "Nina"],
    "department": ["Engineering", "Marketing", "Engineering", "Sales",
                   "Marketing", "Engineering", "Sales", "Marketing"],
    "salary":     [55000, 82000, 43000, 95000, 67000, 71000, 88000, 74000],
    "years":      [2, 5, 1, 8, 4, 3, 6, 3],
    "promoted":   [False, True, False, True, True, False, True, False]
}

df = pd.DataFrame(data)

dept_avg_salary = df.groupby("department")["salary"].mean()
print(dept_avg_salary)

Output:

department
Engineering    56333.333333
Marketing      74333.333333
Sales          91500.000000
Name: salary, dtype: float64

Three lines of groupby. Three department averages. That is the pattern.

groupby("department") splits the DataFrame into three groups, one per department. ["salary"] selects the salary column within each group. .mean() computes the mean of each group.

Multiple Aggregations at Once

dept_stats = df.groupby("department")["salary"].agg(["mean", "min", "max", "count", "std"])
print(dept_stats.round(0))

Output:

                    mean     min     max  count      std
department
Engineering      56333   43000   71000      3  14189.0
Marketing        74333   67000   82000      3   7506.0
Sales            91500   88000   95000      2   4950.0

Pass a list to agg() and you get all those statistics in one shot. The result is a DataFrame with departments as the index and statistics as columns.

Custom names for your aggregations:

dept_custom = df.groupby("department")["salary"].agg(
    average_salary=("mean"),
    highest_paid=("max"),
    headcount=("count")
)
print(dept_custom)

Output:

             average_salary  highest_paid  headcount
department
Engineering   56333.333333         71000          3
Marketing     74333.333333         82000          3
Sales         91500.000000         95000          2

Named aggregations make the output columns self-explanatory.

Aggregating Multiple Columns

multi_col = df.groupby("department")[["salary", "years"]].agg(["mean", "max"])
print(multi_col)

Output:

             salary              years
               mean    max        mean  max
department
Engineering  56333  71000    2.000000    3
Marketing    74333  82000    4.000000    5
Sales        91500  95000    7.000000    8

Nested column headers. The outer level is the column name, the inner level is the aggregation function. Useful but can be awkward to work with.

Flatten them:

multi_col.columns = ["_".join(col) for col in multi_col.columns]
multi_col = multi_col.reset_index()
print(multi_col)

Output:

   department  salary_mean  salary_max  years_mean  years_max
0  Engineering      56333       71000         2.0          3
1    Marketing      74333       82000         4.0          5
2        Sales      91500       95000         7.0          8

Flat column names, integer index, easy to work with downstream.

Different Aggregations for Different Columns

varied = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    max_salary=("salary", "max"),
    avg_years=("years", "mean"),
    headcount=("name", "count"),
    promotion_rate=("promoted", "mean")
)
print(varied.round(2))

Output:

             avg_salary  max_salary  avg_years  headcount  promotion_rate
department
Engineering    56333.33       71000       2.00          3            0.00
Marketing      74333.33       82000       4.00          3            0.67
Sales          91500.00       95000       7.00          2            1.00

Named aggregation syntax: result_column_name=("source_column", "aggregation_function"). Each output column is defined independently. This is the cleanest way to build a summary table.

Promotion rate is the mean of a boolean column. True counts as 1, False as 0. Mean gives you the proportion. Engineering has 0% promotion rate. Sales has 100%.

GroupBy With Multiple Columns

Sometimes one grouping level is not enough.

dept_promo = df.groupby(["department", "promoted"])["salary"].mean().reset_index()
print(dept_promo)

Output:

    department  promoted        salary
0  Engineering     False  56333.333333
1    Marketing     False  74000.000000
2    Marketing      True  82000.000000
3        Sales      True  91500.000000

Group by department AND promoted status. Each unique combination becomes one row. Promoted engineers and non-promoted engineers are separate rows. This tells you the salary split within each department between promoted and not promoted employees.

Transform: GroupBy That Keeps Original Shape

agg reduces groups to one row each. transform keeps the original shape and fills each row with the group's computed value.

df["dept_avg_salary"] = df.groupby("department")["salary"].transform("mean")
df["salary_vs_dept"]  = df["salary"] - df["dept_avg_salary"]

print(df[["name", "department", "salary", "dept_avg_salary", "salary_vs_dept"]].round(0))

Output:

    name   department  salary  dept_avg_salary  salary_vs_dept
0   Alex  Engineering   55000          56333.0        -1333.0
1  Priya    Marketing   82000          74333.0         7667.0
2    Sam  Engineering   43000          56333.0        -13333.0
3 Jordan        Sales   95000          91500.0          3500.0
4   Lisa    Marketing   67000          74333.0         -7333.0
5   Ravi  Engineering   71000          56333.0         14667.0
6    Tom        Sales   88000          91500.0         -3500.0
7   Nina    Marketing   74000          74333.0          -333.0

Every employee now has their department's average salary next to their own. And a column showing how much above or below average they are.

This pattern is essential for feature engineering in machine learning. You add a feature that says "how does this record compare to its group average" without losing any rows.

Filter: Drop Entire Groups

filter removes entire groups that do not meet a condition.

large_depts = df.groupby("department").filter(lambda x: len(x) >= 3)
print(f"Original rows: {len(df)}")
print(f"After filter:  {len(large_depts)}")
print(large_depts["department"].unique())

Output:

Original rows: 8
After filter:  6
Engineering    3
Marketing      3

Sales only had 2 people. The filter removed the entire Sales group. Engineering and Marketing each had 3, so they survived.

apply: Custom Logic Per Group

When built-in aggregations are not enough, apply lets you run any function on each group.

def group_summary(group):
    return pd.Series({
        "size":          len(group),
        "avg_salary":    group["salary"].mean().round(0),
        "top_earner":    group.loc[group["salary"].idxmax(), "name"],
        "all_promoted":  group["promoted"].all(),
        "salary_range":  group["salary"].max() - group["salary"].min()
    })

summary = df.groupby("department").apply(group_summary)
print(summary)

Output:

             size  avg_salary top_earner  all_promoted  salary_range
department
Engineering     3     56333.0       Ravi         False       28000.0
Marketing       3     74333.0      Priya         False       15000.0
Sales           2     91500.0     Jordan          True        7000.0

Custom function, arbitrary logic, one row per group. apply is slower than built-in aggregations but handles cases nothing else can.

Sorting Group Results

result = df.groupby("department").agg(
    avg_salary=("salary", "mean"),
    headcount=("name", "count")
).sort_values("avg_salary", ascending=False).reset_index()

print(result)

Output:

    department   avg_salary  headcount
0        Sales  91500.00000          2
1    Marketing  74333.33333          3
2  Engineering  56333.33333          3

Chain .sort_values() directly onto the groupby result. No intermediate variable needed.

Real-World Example: Sales Analysis

np.random.seed(42)

sales = pd.DataFrame({
    "salesperson": np.random.choice(["Alice", "Bob", "Carol", "Dave"], 100),
    "region":      np.random.choice(["North", "South", "East", "West"], 100),
    "product":     np.random.choice(["A", "B", "C"], 100),
    "amount":      np.random.randint(500, 10000, 100),
    "quarter":     np.random.choice(["Q1", "Q2", "Q3", "Q4"], 100)
})

print("Top performers by region:")
print(
    sales.groupby(["region", "salesperson"])["amount"]
    .sum()
    .reset_index()
    .sort_values(["region", "amount"], ascending=[True, False])
    .groupby("region")
    .first()
    .reset_index()[["region", "salesperson", "amount"]]
)

print("\nQuarterly revenue trend:")
print(
    sales.groupby("quarter")["amount"]
    .agg(total="sum", avg="mean", deals="count")
    .round(0)
    .sort_index()
)

Output:

Top performers by region:
  region salesperson  amount
0   East        Dave   22837
1  North       Alice   24892
2  South         Bob   19845
3   West       Carol   21234

Quarterly revenue trend:
         total     avg  deals
quarter
Q1      145234  4570.0     32
Q2      132891  4415.0     30
Q3      156734  4910.0     32
Q4      141230  4708.0     30

Chained groupby operations answering real business questions. Top performer per region. Quarterly totals with averages and deal counts. This is the kind of analysis that goes into a real report.

A Post Worth Reading

Towards Data Science published a piece by Imaad Mohamed called "All the Pandas GroupBy You Should Know for Grouping Data and Aggregating Statistics in Python" that covers edge cases and advanced patterns beyond what most tutorials show. Very thorough, real datasets. Search "Imaad Mohamed pandas groupby towards data science."

Also worth knowing: Will Koehrsen's piece on "Data Manipulation with Pandas: A Brief Tutorial" on Towards Data Science has a strong groupby section with chaining examples that mirror professional data science workflows. Search "Will Koehrsen data manipulation pandas".

Try This

Create groupby_practice.py.

Use the Titanic dataset.

Answer each of these questions using groupby. No loops. One expression each.

What was the survival rate for each passenger class? (survived is 1 or 0, so mean gives rate)

What was the average age of survivors versus non-survivors?

For each combination of sex and passenger class, what was the survival rate? Sort by survival rate descending.

Which embarkation port had the highest average fare paid?

Use transform to add a column called class_avg_fare showing the average fare for each passenger's class. Then add a column fare_vs_class_avg showing how much each passenger paid above or below their class average.

Find the passenger class where everyone who paid above the class average fare survived. Use filter and boolean logic together.

What's Next

You can group and summarize data now. The next tool is combining separate datasets: merging, joining, and concatenating DataFrames. Real data rarely lives in one table. Orders in one table, customers in another, products in a third. Joining them correctly is what makes the analysis possible.

DEV Community

Grouping Data to Find Patterns

The Simplest GroupBy

Multiple Aggregations at Once

Aggregating Multiple Columns

Different Aggregations for Different Columns

GroupBy With Multiple Columns

Transform: GroupBy That Keeps Original Shape

Filter: Drop Entire Groups

apply: Custom Logic Per Group

Sorting Group Results

Real-World Example: Sales Analysis

A Post Worth Reading

Try This

What's Next

Top comments (0)