Window Functions Are Slower Than You Think
Most data engineers I know reach for PySpark window functions the moment they see "rolling average" in a requirement. The result? Jobs that take 45 minutes to process what should finish in 8. The problem isn't window functions themselves—it's that most implementations force full shuffles on terabyte-scale data when you only need partial ordering within logical groups.
I spent two weeks optimizing a time series anomaly detection pipeline that was choking on 500GB of IoT sensor data. The original code used a naive Window.partitionBy().orderBy() that triggered three separate shuffle operations. After switching to a rangeFrame with proper partition pruning, the same job finished in 11 minutes.
Here's what actually works when you're doing rolling aggregations on data that doesn't fit in memory.
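To make the win concrete, here's the semantics a range frame computes, sketched in plain Python rather than PySpark. This is what a rangeBetween(-window, 0) frame partitioned by device_id and ordered by timestamp produces; the device IDs, timestamps, and the 10-minute window width are illustrative, not values from the original pipeline:

```python
from collections import defaultdict

def rolling_avg_by_device(rows, window_seconds=600):
    """Per-device rolling average over a trailing time range -- the
    result a rangeBetween(-window_seconds, 0) frame would give when
    partitioned by device_id and ordered by timestamp."""
    by_device = defaultdict(list)
    for device_id, ts, value in rows:
        by_device[device_id].append((ts, value))

    out = {}
    for device_id, series in by_device.items():
        series.sort()  # order by timestamp within the partition
        for ts, _ in series:
            # every reading inside the trailing window, current row included
            in_range = [v for t, v in series if ts - window_seconds <= t <= ts]
            out[(device_id, ts)] = sum(in_range) / len(in_range)
    return out

rows = [
    ("sensor-a", 0,   10.0),
    ("sensor-a", 300, 20.0),
    ("sensor-a", 900, 30.0),  # the ts=0 reading falls outside the 600s window here
    ("sensor-b", 0,   5.0),
]
result = rolling_avg_by_device(rows)
```

The point of the range frame is visible in the ts=900 row: the frame is defined by timestamp distance, not row count, so stale readings age out naturally and Spark never has to materialize an unbounded ordered frame per partition.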
The Shuffle Tax: Why Your Window Query Costs $340
Every time you call Window.partitionBy("device_id").orderBy("timestamp"), Spark has to:
- Hash partition all rows by device_id across executors
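That first step is ordinary hash partitioning: every row with the same device_id must land on the same executor before any per-device ordering can happen. A toy stand-in for the idea (the hash function and partition count here are illustrative, not Spark's actual partitioner):

```python
import zlib

def assign_partition(device_id: str, num_partitions: int) -> int:
    """Toy hash partitioner: rows sharing a device_id always map to
    the same partition, so one executor ends up holding a device's
    full history and can sort it locally by timestamp."""
    return zlib.crc32(device_id.encode("utf-8")) % num_partitions

# Same key -> same partition, so the subsequent orderBy is a local sort.
partitions = {d: assign_partition(d, 8) for d in ["sensor-a", "sensor-b", "sensor-a"]}
```

The shuffle cost comes from the "across executors" part: hashing is cheap, but moving every row of a terabyte-scale table over the network to satisfy that co-location guarantee is not.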
