
TildAlice

Posted on • Originally published at tildalice.io

PySpark Window Functions: 3 Patterns That Scale Rolling Aggs

Window Functions Are Slower Than You Think

Most data engineers I know reach for PySpark window functions the moment they see "rolling average" in a requirement. The result? Jobs that take 45 minutes to process what should finish in 8. The problem isn't window functions themselves—it's that most implementations force full shuffles on terabyte-scale data when you only need partial ordering within logical groups.

I spent two weeks optimizing a time series anomaly detection pipeline that was choking on 500GB of IoT sensor data. The original code used a naive Window.partitionBy().orderBy() that triggered three separate shuffle operations. After switching to a range-based frame (rangeBetween) with proper partition pruning, the same job finished in 11 minutes.

Here's what actually works when you're doing rolling aggregations on data that doesn't fit in memory.


The Shuffle Tax: Why Your Window Query Costs $340

Every time you call Window.partitionBy("device_id").orderBy("timestamp"), Spark has to:

  1. Hash partition all rows by device_id across executors

