Window Functions Are Slower Than You Think
Most data engineers I know reach for PySpark window functions the moment they see "rolling average" in a requirement. The result? Jobs that take 45 minutes to process what should finish in 8. The problem isn't window functions themselves—it's that most implementations force full shuffles on terabyte-scale data when you only need partial ordering within logical groups.
I spent two weeks optimizing a time series anomaly detection pipeline that was choking on 500GB of IoT sensor data. The original code used a naive Window.partitionBy().orderBy() that triggered three separate shuffle operations. After switching to a rangeFrame with proper partition pruning, the same job finished in 11 minutes.
Here's what actually works when you're doing rolling aggregations on data that doesn't fit in memory.
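To make the win concrete, here's the semantics a range frame computes, sketched in plain Python rather than PySpark. This is what a rangeBetween(-window, 0) frame partitioned by device_id and ordered by timestamp produces; the device IDs, timestamps, and the 10-minute window width are illustrative, not values from the original pipeline:

```python
from collections import defaultdict

def rolling_avg_by_device(rows, window_seconds=600):
    """Per-device rolling average over a trailing time range -- the
    result a rangeBetween(-window_seconds, 0) frame would give when
    partitioned by device_id and ordered by timestamp."""
    by_device = defaultdict(list)
    for device_id, ts, value in rows:
        by_device[device_id].append((ts, value))

    out = {}
    for device_id, series in by_device.items():
        series.sort()  # order by timestamp within the partition
        for ts, _ in series:
            # every reading inside the trailing window, current row included
            in_range = [v for t, v in series if ts - window_seconds <= t <= ts]
            out[(device_id, ts)] = sum(in_range) / len(in_range)
    return out

rows = [
    ("sensor-a", 0,   10.0),
    ("sensor-a", 300, 20.0),
    ("sensor-a", 900, 30.0),  # the ts=0 reading falls outside the 600s window here
    ("sensor-b", 0,   5.0),
]
result = rolling_avg_by_device(rows)
```

The point of the range frame is visible in the ts=900 row: the frame is defined by timestamp distance, not row count, so stale readings age out naturally and Spark never has to materialize an unbounded ordered frame per partition.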
The Shuffle Tax: Why Your Window Query Costs $340
Every time you call Window.partitionBy("device_id").orderBy("timestamp"), Spark has to:
- Hash partition all rows by device_id across executors
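That first step is ordinary hash partitioning: every row with the same device_id must land on the same executor before any per-device ordering can happen. A toy stand-in for the idea (the hash function and partition count here are illustrative, not Spark's actual partitioner):

```python
import zlib

def assign_partition(device_id: str, num_partitions: int) -> int:
    """Toy hash partitioner: rows sharing a device_id always map to
    the same partition, so one executor ends up holding a device's
    full history and can sort it locally by timestamp."""
    return zlib.crc32(device_id.encode("utf-8")) % num_partitions

# Same key -> same partition, so the subsequent orderBy is a local sort.
partitions = {d: assign_partition(d, 8) for d in ["sensor-a", "sensor-b", "sensor-a"]}
```

The shuffle cost comes from the "across executors" part: hashing is cheap, but moving every row of a terabyte-scale table over the network to satisfy that co-location guarantee is not.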
