We Stopped Reaching for PySpark by Habit. Polars Made Our Small Jobs Boringly Fast.

You know those “we migrated and everything is 10x faster” posts that leave out the messy bits? This isn’t one of them.

I'm a data engineer working in financial services, partnering with Palantir on one of our in-house strategic platforms*. Big, distributed data is part of the day job, so PySpark is the comfortable hoodie we’ve worn for years. But here’s the plot twist: for our small to mid-sized datasets (think: tens of MBs to a few GBs, not petabytes), we started swapping PySpark pipelines for Polars. And the dev loop went from coffee-break to “wait, it’s done?”

Let me tell you how that happened, where Polars shines, where Spark still wins, and exactly how to translate those “Spark-isms” you’ve internalized into Polars without wanting to throw your laptop.

*Disclaimer: The code and project shown here are my personal work and are not affiliated with, endorsed by, or the property of my employer.


TL;DR (for the colleague skimming this on a train)

Polars vs PySpark for small to mid-sized data: Polars’ Rust engine, lazy evaluation, and expression API deliver faster runtimes and quicker iteration on a single machine. Keep Spark for cluster-scale workloads, governed lakehouse writes, and production streaming. For migration, replace UDFs with expressions, lean on join_asof, group_by_dynamic, struct/list ops, and sink_parquet() for outputs.


The short version (so you can send this to your team chat)

  • If your data fits on a single beefy machine, Polars will often run circles around PySpark and make your codebase smaller.

  • When you actually need a cluster, a metastore, ACID lakehouse writes, or battle-hardened streaming semantics, Spark is still your friend.

  • The switch is less painful than you think. Expressions > UDFs, embrace lazy queries, and learn three or four Polars-specific tricks (as-of joins, dynamic windows, struct/list ops).


A quick story from the trenches

We had a workforce modeling DAG that joined dimension tables, applied a thicket of conditional business rules, and rolled up time-based metrics. In PySpark, the job was fine, just… ritualistic: spin a session, serialize UDFs, wait out the cluster shuffle tax, and go stare out the window while Catalyst did its thing.
One day we rewrote the same pipeline in Polars with lazy scans and expressions. No UDFs. No session boot. Same logic, fewer lines. The run finished before the “I’ll grab a coffee” instinct completed. The drama? The cluster didn’t even get invited to the meeting.


Why Polars works so well on “normal-sized” data

  • It’s Rust under the hood. Vectorized, multi-threaded, cache-friendly. Python just orchestrates expressions.
  • Lazy when you opt in. You build a plan (scan_parquet -> transforms -> collect()), and Polars fuses steps to avoid unnecessary passes (see the sketch after this list).
  • Expression API beats UDFs. The minute your logic is expressible, you stop paying Python <-> JVM costs and serialization overhead.
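
To make that concrete, here's a minimal sketch of the lazy pattern (the file name and columns are invented for illustration):

# Polars
import polars as pl

# Build the plan lazily: nothing is read or computed yet.
plan = (
    pl.scan_parquet("events.parquet")                        # hypothetical input
      .filter(pl.col("status") == "active")
      .with_columns((pl.col("amount") * 1.1).alias("amount_adj"))
      .group_by("region")
      .agg(pl.col("amount_adj").sum().alias("total"))
)

print(plan.explain())    # inspect the optimized plan (pushdowns, fused steps)
result = plan.collect()  # execution happens here
# plan.sink_parquet("totals.parquet")  # or stream the output straight to disk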

The non-obvious differences that actually matter

[Comparison table: Polars vs PySpark]


“But Spark can do X and I don’t see it in Polars…” (Here’s how to do it)

Below are the patterns we actually used while migrating pipelines.

1. Time-range join (point-in-time lookup)

# PySpark
from pyspark.sql import functions as F, Window as W

# For each event, attach the most recent dimension row effective <= event.ts
cond = [events.abc == dim.abc, events.ts >= dim.eff_ts]
joined = (events.join(dim, cond, "left")
          .withColumn("rn", F.row_number().over(
              W.partitionBy("event_id").orderBy(dim["eff_ts"].desc())
          ))
          .filter("rn=1").drop("rn"))

# --- same idea in Polars ---
# Polars
import polars as pl

events = pl.scan_parquet("events.parquet")
dim    = pl.scan_parquet("dim.parquet")

# Natural fit: as-of join on timestamp with a key.
# Both frames must be sorted by the "on" column for join_asof.
out = (events.sort("ts")
             .join_asof(dim.sort("ts"), on="ts", by="abc",
                        strategy="backward", tolerance="5m")
             .collect())

When join_asof isn’t enough (e.g., need ts BETWEEN start AND end): pre-expand dimension rows into (start, end) boundaries and filter after a cheap join:

# Small dim? Join on the key, then filter the time range (fine for small; pre-bucket by day for bigger dims).
# Unmatched rows are dropped by the filter, so this behaves like an inner join.
out = (events.join(dim, on="abc", how="left")
             .filter((pl.col("ts") >= pl.col("start")) & (pl.col("ts") < pl.col("end")))
             .sort(["event_id", "ts"])   # keep a stable order if downstream code relies on it
             .collect())

2. Complex conditional with many CASE WHENs

# PySpark
from pyspark.sql import functions as F

df = df.withColumn(
    "final_score",
    F.when(F.col("score") >= 90, "A")
     .when(F.col("score") >= 75, "B")
     .when((F.col("score") >= 60) & (F.col("region") == "Lake Region"), "C+")
     .otherwise("C")
)

# --- same idea in Polars ---
# Polars
import polars as pl

df = df.with_columns(
    # pl.lit(...) matters: a bare string in then()/otherwise() is read as a column name.
    pl.when(pl.col("score") >= 90).then(pl.lit("A"))
      .when(pl.col("score") >= 75).then(pl.lit("B"))
      .when((pl.col("score") >= 60) & (pl.col("region") == "Lake Region")).then(pl.lit("C+"))
      .otherwise(pl.lit("C"))
      .alias("final_score")
)

The nested when/then/otherwise in Polars is ergonomic and compiles into a single expression tree.

3. “Collect a struct, then explode it later”

# PySpark
aggd = (df.groupBy("forest")
          .agg(F.collect_list(F.struct("region","place")).alias("full_place")))
flat = aggd.select("forest", F.explode_outer("full_place").alias("p")).select("forest","p.*")

# --- same idea in Polars ---
# Polars
aggd = (df.group_by("forest")
          .agg(pl.struct(["region", "place"]).alias("full_place")))  # implicitly a list of structs per group

flat = (aggd
        .explode("full_place")    # one row per struct
        .unnest("full_place"))    # struct fields become columns

unnest on struct is a great trick: no manual field projection necessary.

4. Rolling window over time with business calendars

# PySpark
from pyspark.sql import functions as F, Window as W

w = W.partitionBy("abc").orderBy("ts").rowsBetween(-6, 0)
out = df.withColumn("w7", F.avg("value").over(w))

# --- same idea in Polars ---
# Polars
out = (df.sort(["abc", "ts"])
         .with_columns(
             # Unlike Spark's rowsBetween(-6, 0), the first 6 rows per group are null
             # until a full 7-row window is available (relax the minimum-samples option if needed).
             pl.col("value").rolling_mean(window_size=7).over("abc").alias("w7")
         ))

If your index is a timestamp with irregular spacing, reach for group_by_dynamic:

roll = (df.sort("ts")   # group_by_dynamic needs the index column sorted
          .group_by_dynamic("ts", every="1w", period="1w", group_by="abc")
          .agg(pl.col("value").mean().alias("w7")))

5. Broadcast a tiny dimension without… a hint

# PySpark
small_dim = spark.table("dim_small")
# Hint the small side, not the big one.
df = df.join(small_dim.hint("broadcast"), "key", "left")

# --- same idea in Polars ---
# Polars
# Just join. If the right side is tiny, Polars keeps it in memory.
df = df.join(small_dim, on="key", how="left")

No hint needed on a single machine; Polars will do the sensible thing.


Where we still reach for Spark

  • Data too big for one box. If your joins or shuffles exceed a single host’s RAM/IO comfort zone, you want a cluster.
  • Lakehouse governance. Need native Delta/Iceberg writes with compaction, vacuuming, and a metastore? Spark’s the grown-up in the room.
  • Production streaming semantics. Exactly-once sinks, watermarking across stateful operators — Spark’s Structured Streaming is still the benchmark.

A tiny migration playbook (what we actually did)

  1. Start lazy. Prefer pl.scan_parquet(...) -> transform -> collect()/sink_parquet().
  2. Delete UDFs. If you’re writing Python UDFs in Spark, translate them into expressions in Polars. It’s almost always possible.
  3. Explode early, unnest often. Nested data becomes readable with unnest, explode, and the list/struct ops.
  4. Test with golden data. Same sample through both engines; compare row counts, sums, and a few canonical hashes (see the sketch after this list).
  5. Keep Spark for the big guns. Don’t martyr yourself trying to force a 2-TB join into one box.
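
Here's a minimal parity-check sketch for step 4 (file names, key column, and metric column are hypothetical):

# Polars
import polars as pl

# The same golden sample was run through both engines and written to Parquet.
spark_out  = pl.read_parquet("golden_spark_output.parquet").sort("event_id")
polars_out = pl.read_parquet("golden_polars_output.parquet").sort("event_id")

assert spark_out.height == polars_out.height, "row counts differ"

# Spot-check an aggregate (use a tolerance for float columns).
assert abs(spark_out["amount"].sum() - polars_out["amount"].sum()) < 1e-6

# Cheap canonical hash: hash every row and compare the sorted hashes
# (requires identical column order and dtypes on both sides).
assert spark_out.hash_rows().sort().equals(polars_out.hash_rows().sort())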

Two-step migration (no drama version)

Step 1: Like-for-like swap. We replaced PySpark with Polars while keeping the business logic identical. Built the plan with scan_* + expressions, then checked parity with row counts, hashes, and a few KPI spot checks. Goal: prove equivalence, not chase speed.
Step 2: Optimize for Polars. We refactored to lean on lazy evaluation and vectorized expressions, trimmed intermediates, and ditched Python UDFs.
The payoff came in two clean drops: first from shedding distributed overhead (think ~5 -> ~2 minutes), then to consistently sub-minute after the Polars-native rewrite. Fewer lines, steadier memory, snappier UI.


Gotchas we hit (so you don’t)

  • Null propagation: Polars is strict. Use fill_null, coalesce (pl.coalesce([colA, colB])) intentionally.
  • Types are not vibes. Specify dtype on read if your CSVs are “creative.”
  • Non-equi joins: Reach for join_asof or filter-after-join patterns. If it looks like a cross join… it is a cross join—use it sparingly.
  • Don’t apply unless you must. If you write a Python lambda over rows, you’re leaving performance on the floor. Stick to expressions.

What I’d do on Monday

Pick a pipeline under a few GB, port the transforms you can without UDFs, wire up scan_* -> expressions -> collect()/sink_parquet(), and compare outputs. If your run time drops and your code shrinks, congratulations! You just freed up time for the hard problems.
If it doesn’t? That probably means you genuinely need Spark’s distributed muscle. Use the right tool and move on with your life.
