The .apply() Trap That Killed Our ETL
One groupby().apply() call ate 47 minutes on 120 million rows. The same operation finished in 4 minutes after switching to map-reduce patterns.
Most tutorials teach .apply() first because it's flexible. That flexibility comes at a brutal cost: per-group Python execution that bypasses NumPy's C-level vectorization. When your grouped data hits millions of rows, the interpreter overhead scales with the number of groups, and you're looking at order-of-magnitude slowdowns.
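The trap in miniature. This is a sketch with made-up column names and synthetic data, but the shape is the one that bites in production: a lambda handed to .apply() that pandas must call back into the interpreter once per group.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame({
    "key": rng.integers(0, 10_000, size=1_000_000),
    "value": rng.normal(size=1_000_000),
})

# The trap: an arbitrary Python callable per group. Pandas can only
# invoke it group by group, so the C fast path is unavailable.
means = df.groupby("key")["value"].apply(lambda s: s.mean())
```

With 10,000 groups, that lambda fires 10,000 times, each call paying full Python function-call overhead.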
Here's what actually works at scale.
Why GroupBy Performance Collapses
Pandas groupby follows the split-apply-combine model: split partitions rows by key, apply runs a computation on each partition, and combine stitches the results back together. The split is always fast, essentially a hash-based factorization of the keys. The apply step is where everything breaks.
.apply() executes arbitrary Python functions on each group. Pandas can't optimize what it can't see. Every function call crosses the Python-C boundary, loses vectorization, and triggers memory allocation. With 10,000 groups of 10,000 rows each, that's 10,000 separate DataFrame constructions.
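The fix, sketched on synthetic data: swap the opaque Python callable for a named aggregation pandas recognizes, so the whole reduction stays in compiled code. Column names here are illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key": rng.integers(0, 1_000, size=100_000),
    "value": rng.normal(size=100_000),
})

# Python-side: one interpreter call and one temporary Series per group.
slow = df.groupby("key")["value"].apply(lambda s: s.mean())

# C-side: the named aggregation is dispatched to a compiled kernel
# and never materializes per-group objects.
fast = df.groupby("key")["value"].mean()

# Identical numbers, very different runtime at scale.
assert np.allclose(slow.to_numpy(), fast.to_numpy())
```

The same principle extends to per-row results: .transform("mean") or arithmetic on broadcast aggregates replaces a per-group lambda wherever the operation can be expressed as a built-in.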