Pandas GroupBy 2x Faster: Categorical Dtypes on 1M Rows

#pandas #groupby #categorical #dataanalysis

The One-Line Change That Cut My GroupBy Time in Half

GroupBy on a million rows: 1.8 seconds. Same operation, same data, after converting to categorical: 0.7 seconds. Not a new algorithm. Not Polars. Just astype('category') on two columns.

I stumbled onto this while profiling a sales analytics pipeline. The aggregation logic was fine; the bottleneck was pandas spending most of its time hashing and comparing Python strings instead of working with integers. Categorical dtypes avoid this by mapping each unique string to an integer code internally, so the grouping step operates on those integer codes rather than on the strings themselves.
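To make the "integer code" idea concrete, here is a minimal sketch on a toy column (the `regions` values are made up for illustration, not taken from the sales pipeline). It shows the two pieces a categorical stores: the unique labels, kept once, and one small integer per row pointing into them.

```python
import pandas as pd

# Toy column with repeated string labels (illustrative values only).
regions = pd.Series(["north", "south", "north", "east", "south", "north"])

# Convert to categorical: unique labels are stored once,
# and each row holds an integer code that indexes into them.
as_cat = regions.astype("category")

print(as_cat.cat.categories.tolist())  # ['east', 'north', 'south']
print(as_cat.cat.codes.tolist())       # [1, 2, 1, 0, 2, 1]
print(as_cat.cat.codes.dtype)          # int8
```

With only three distinct labels the codes fit in `int8`, which is also why categoricals tend to use far less memory than columns of Python strings.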

Let's see exactly where this speedup comes from, where it breaks down, and why the 2x claim only holds under specific conditions.


Generating Realistic Test Data: 1 Million Transactions

Before benchmarking anything, I need data that actually resembles production workloads. Random integers don't cut it—real datasets have skewed distributions, missing values, and that one category that appears 80% of the time.


```python
import pandas as pd
```
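The excerpt cuts off before the generator itself, so what follows is not the article's code. It is a rough sketch of what skewed test data and a before/after timing could look like, under stated assumptions: the column names (`region`, `product`, `amount`), the label sets, the 80% weight on one region, and the roughly 1% missing rate are all hypothetical choices made for illustration.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1_000_000

# Hypothetical labels; a real pipeline would have its own.
regions = np.array(["north", "south", "east", "west"])
products = np.array([f"product_{i}" for i in range(50)])

# Skewed weights: one region takes ~80% of rows,
# products follow a 1/rank (Zipf-like) distribution.
region_p = [0.80, 0.10, 0.06, 0.04]
product_p = 1.0 / np.arange(1, len(products) + 1)
product_p /= product_p.sum()

df = pd.DataFrame({
    # dtype=object keeps these as plain Python strings for the baseline.
    "region": pd.Series(rng.choice(regions, size=n_rows, p=region_p), dtype=object),
    "product": pd.Series(rng.choice(products, size=n_rows, p=product_p), dtype=object),
    "amount": rng.gamma(shape=2.0, scale=50.0, size=n_rows),
})

# Blank out roughly 1% of amounts to mimic missing values.
df.loc[rng.random(n_rows) < 0.01, "amount"] = np.nan


def time_groupby(frame):
    """Run the same aggregation and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = frame.groupby(["region", "product"], observed=True)["amount"].sum()
    return result, time.perf_counter() - start


# Same data, with the two key columns converted to categorical.
df_cat = df.astype({"region": "category", "product": "category"})

res_obj, t_obj = time_groupby(df)
res_cat, t_cat = time_groupby(df_cat)

print(f"object keys:      {t_obj:.3f}s")
print(f"categorical keys: {t_cat:.3f}s")
```

Timings will vary with hardware, pandas version, and the number of distinct keys, so a single run like this is only a sanity check; repeating it several times (for example with `timeit`) gives a more stable number. Passing `observed=True` keeps the categorical result limited to key combinations that actually occur in the data.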

---

*Continue reading the full article on [TildAlice](https://tildalice.io/pandas-groupby-categorical-dtype-speedup/)*
