The One-Line Change That Cut My GroupBy Time in Half
GroupBy on a million rows: 1.8 seconds. Same operation, same data, after converting to categorical: 0.7 seconds. Not a new algorithm. Not Polars. Just astype('category') on two columns.
I stumbled onto this while profiling a sales analytics pipeline. The aggregation logic was fine; the bottleneck was pandas spending most of its time hashing and comparing Python strings instead of integers. A categorical dtype avoids this by storing each unique string once and representing every row as an integer code, so grouping works on those integers rather than on the strings themselves.
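To make that mapping concrete, here is a small sketch of what a categorical column holds internally. The `regions` series and its labels are made up for illustration; `.cat.categories` and `.cat.codes` are the standard pandas accessors.

```python
import pandas as pd

# Hypothetical column of region labels, used only for illustration.
regions = pd.Series(["east", "west", "east", "north", "west", "east"])

as_cat = regions.astype("category")

# The unique labels are stored once, in sorted order by default.
print(list(as_cat.cat.categories))  # ['east', 'north', 'west']

# Each row holds a small integer index into that list of labels.
print(as_cat.cat.codes.tolist())    # [0, 2, 0, 1, 2, 0]

# With fewer than 128 categories, pandas picks the narrowest integer type.
print(as_cat.cat.codes.dtype)       # int8
```

Six strings collapse to three stored labels plus six one-byte codes, and those codes are what a group-by can work with directly.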
Let's see exactly where this speedup comes from, where it breaks down, and why the 2x claim only holds under specific conditions.
Generating Realistic Test Data: 1 Million Transactions
Before benchmarking anything, I need data that actually resembles production workloads. Random integers don't cut it—real datasets have skewed distributions, missing values, and that one category that appears 80% of the time.
```python
import pandas as pd
```
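A minimal sketch of data with those properties, plus a rough timing harness, might look like the following. The column names (`store`, `product`, `amount`), the category weights, the 2% missing rate, and the `time_groupby` helper are all assumptions for illustration, not the setup behind the numbers quoted above, so the timings it prints will differ by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_rows = 1_000_000

# Hypothetical labels and weights: one store gets 80% of the rows.
stores = np.array(["store_a", "store_b", "store_c", "store_d", "store_e"])
store_weights = [0.80, 0.08, 0.06, 0.04, 0.02]
products = np.array([f"sku_{i:03d}" for i in range(200)])

df = pd.DataFrame({
    "store": rng.choice(stores, size=n_rows, p=store_weights),
    "product": rng.choice(products, size=n_rows),
    # Right-skewed amounts, loosely resembling transaction values.
    "amount": rng.gamma(shape=2.0, scale=25.0, size=n_rows).round(2),
})

# Blank out roughly 2% of the amounts to mimic missing values.
df.loc[rng.random(n_rows) < 0.02, "amount"] = np.nan


def time_groupby(frame):
    """Time one grouped sum and return (elapsed seconds, result)."""
    start = time.perf_counter()
    result = frame.groupby(["store", "product"], observed=True)["amount"].sum()
    return time.perf_counter() - start, result


# Baseline: string keys as generated.
t_str, res_str = time_groupby(df)

# Same data with both key columns converted to categorical.
df_cat = df.astype({"store": "category", "product": "category"})
t_cat, res_cat = time_groupby(df_cat)

print(f"string keys: {t_str:.3f}s  categorical keys: {t_cat:.3f}s")
```

A single `perf_counter` measurement is noisy; repeating each call several times and taking the median gives a fairer comparison. Passing `observed=True` keeps the categorical result limited to key combinations that actually occur, so both runs produce the same groups.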
---
*Continue reading the full article on [TildAlice](https://tildalice.io/pandas-groupby-categorical-dtype-speedup/)*
