The .apply() Trap That Killed Our ETL
One groupby().apply() call ate 47 minutes on 120 million rows. The same operation finished in 4 minutes after switching to map-reduce patterns.
Most tutorials teach .apply() first because it's flexible. That flexibility comes at a brutal cost: per-group Python execution that bypasses NumPy's C-level vectorization. When your grouped data hits millions of rows, the interpreter overhead scales with the number of groups, and you're looking at order-of-magnitude slowdowns.
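The trap in miniature. This is a sketch with made-up column names and synthetic data, but the shape is the one that bites in production: a lambda handed to .apply() that pandas must call back into the interpreter once per group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "key": rng.integers(0, 10_000, size=1_000_000),
    "value": rng.normal(size=1_000_000),
})

# The trap: an arbitrary Python callable per group. Pandas can only
# invoke it group by group, so the C fast path is unavailable.
means = df.groupby("key")["value"].apply(lambda s: s.mean())
```

With 10,000 groups, that lambda fires 10,000 times, each call paying full Python function-call overhead.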
Here's what actually works at scale.
Why GroupBy Performance Collapses
Pandas groupby follows the split-apply-combine model: split partitions rows by key, apply runs a computation on each partition, and combine stitches the results back together. The split is always fast, essentially a hash-based factorization of the keys. The apply step is where everything breaks.
.apply() executes arbitrary Python functions on each group. Pandas can't optimize what it can't see. Every function call crosses the Python-C boundary, loses vectorization, and triggers memory allocation. With 10,000 groups of 10,000 rows each, that's 10,000 separate DataFrame constructions.
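The fix, sketched on synthetic data: swap the opaque Python callable for a named aggregation pandas recognizes, so the whole reduction stays in compiled code. Column names here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, size=100_000),
    "value": rng.normal(size=100_000),
})

# Python-side: one interpreter call and one temporary Series per group.
slow = df.groupby("key")["value"].apply(lambda s: s.mean())

# C-side: the named aggregation is dispatched to a compiled kernel
# and never materializes per-group objects.
fast = df.groupby("key")["value"].mean()

# Identical numbers, very different runtime at scale.
assert np.allclose(slow.to_numpy(), fast.to_numpy())
```

The same principle extends to per-row results: .transform("mean") or arithmetic on broadcast aggregates replaces a per-group lambda wherever the operation can be expressed as a built-in.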