DEV Community: Hanbo Wang

I built a Delta-style table format in Rust (and why it’s faster)

Hanbo Wang — Fri, 20 Feb 2026 20:30:19 +0000

Once you internalize “append-only log + snapshots,” a lot of modern data systems start looking like the same idea in different clothes.

That’s the rabbit hole that led me to build a small Delta-style table format in Rust, tuned for time-series appends. In my benchmark it beats Postgres / Delta + Spark / ClickHouse on append throughput (~3–6x).

This post is the 10-minute tour of how it works.

If you’ve ever managed a pipeline that appends daily Parquet files and wished the plumbing was simpler, this might be your kind of rabbit hole too.

If you’re mostly here for the performance results, scroll to Benchmarks; I won’t make you wait until the end.

The project is called timeseries-table-format -- a Rust library with Python bindings, implementing a minimal Delta-style table format optimized for time-series append workloads. Everything below works with just pip install.

The moment it clicked

While I was learning Kafka (docs + blogs + YouTube tutorials), one theme kept coming up: the more useful way to think about Kafka isn’t “a message queue”, but “an immutable append-only log”.

Around the same time, I was reading about how big data stacks evolved from Hadoop + Hive to the lakehouse era — when I dug into table formats like Delta Lake and Iceberg, I noticed the same pattern again: an append-only history of metadata that describes table state over time.

Once that clicked, the question became unavoidable: if the core idea is just “log + snapshots + a bit of concurrency control”, how hard would it be to build a small version myself — and tune it specifically for time-series data? That question turned into a learn-by-doing project… and eventually into the table format I’m writing about in this post.

Lakehouse table format 101 (Delta-style, then I map it to my repo)

My repo maps almost 1-to-1 onto Delta’s mental model — so I’ll explain the minimum concepts once, then show exactly where they live in my code and on disk.

Here’s what you need to know:

Immutable data files Data lives in immutable files (often Parquet). Appending means writing new files; the table format decides which files are “in” the table.
An append-only transaction log Every change is recorded as an append-only sequence of commits (“here’s what changed”: add/remove files, update table metadata).
Versioning + concurrency control (OCC) Writers commit version N+1 only if they started from the latest version N; if someone else won first, you detect a conflict and retry.
A current snapshot for readers (and checkpoints later) Readers need a consistent view: “the table as of the latest committed version”. Many systems add checkpoints later so readers don’t replay a huge log.
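To make the four ideas above concrete, here is a minimal sketch of an append-only log with optimistic concurrency control. It is not this repo's implementation: the `add`/`remove` action names, the `commit`/`snapshot` helpers, and the file layout are hypothetical, and it assumes a local filesystem where exclusive file creation is atomic.

```python
import json
import tempfile
from pathlib import Path


class CommitConflict(Exception):
    """Raised when another writer already committed the target version."""


def commit(log_dir: Path, expected_version: int, actions: list) -> int:
    """Try to write version expected_version + 1; fail if it already exists."""
    new_version = expected_version + 1
    path = log_dir / f"{new_version:010d}.json"
    try:
        # Mode "x" creates the file exclusively, so only one writer can win.
        with open(path, "x") as f:
            json.dump(actions, f)
    except FileExistsError:
        raise CommitConflict(f"version {new_version} already committed") from None
    return new_version


def latest_version(log_dir: Path) -> int:
    """Highest committed version, or 0 for an empty log."""
    return max((int(p.stem) for p in log_dir.glob("*.json")), default=0)


def snapshot(log_dir: Path) -> set:
    """Replay the log in version order to compute the set of live data files."""
    live = set()
    for p in sorted(log_dir.glob("*.json")):
        for action in json.loads(p.read_text()):
            if "add" in action:
                live.add(action["add"])
            elif "remove" in action:
                live.discard(action["remove"])
    return live


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as d:
        log_dir = Path(d)
        v1 = commit(log_dir, latest_version(log_dir), [{"add": "a.parquet"}])
        # Two writers both start from version 1; only the first commit succeeds.
        v2 = commit(log_dir, v1, [{"add": "b.parquet"}])
        try:
            commit(log_dir, v1, [{"add": "c.parquet"}])
            conflicted = False
        except CommitConflict:
            conflicted = True
        return v2, conflicted, sorted(snapshot(log_dir))
```

The losing writer would normally reload the latest version, re-check its assumptions, and retry. Replaying every commit in `snapshot` is also what checkpoints exist to avoid once the log grows long.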

Delta concepts -> this repo (quick mapping)

Here’s the whole lifecycle in one picture:

Walkthrough: watch one append turn into a queryable table

We just talked about “immutable files + an append-only log + versioning + a current snapshot”. Now let’s watch those concepts play out in a real append.

Step 1) Create a table, append one Parquet file

from datetime import datetime, timezone
from pathlib import Path
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq

import timeseries_table_format as ttf

with tempfile.TemporaryDirectory() as d:
    root = Path(d) / "prices_tbl"
    tbl = ttf.TimeSeriesTable.create(
        table_root=str(root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=None,
    )

    incoming = Path(d) / "incoming.parquet"
    pq.write_table(
        pa.table(
            {
                "ts": pa.array([datetime(2024, 6, 1, tzinfo=timezone.utc)], type=pa.timestamp("us", tz="UTC")),
                "symbol": pa.array(["NVDA"]),
                "close": pa.array([10.0]),
            }
        ),
        str(incoming),
    )

    print("new version:", tbl.append_parquet(str(incoming)))

What landed on disk (conceptually):

prices_tbl/_timeseries_log/CURRENT
prices_tbl/_timeseries_log/0000000001.json (table metadata commit)
prices_tbl/data/incoming.parquet (if the input file was outside the table root and had to be copied in)
prices_tbl/_timeseries_log/0000000002.json (append commit)
prices_tbl/_timeseries_log/CURRENT now points to version 2

Step 2) The artifact: a real AddSegment action

A new data file becomes part of the table only after it’s logged.

An actual AddSegment action from this repo (from examples/nvda_table/_timeseries_log/0000000002.json):

{
  "AddSegment": {
    "segment_id": "seg-f0573298681657796623719468bf1133",
    "path": "data/nvda_1h.parquet",
    "format": "parquet",
    "ts_min": "2024-06-01T00:00:00Z",
    "ts_max": "2024-06-10T23:00:00Z",
    "row_count": 240,
    "file_size": 14272,
    "coverage_path": "_coverage/segments/segcov-ca3cea172cc538ce04756e34beaea4a4.roar"
  }
}

Notice the coverage_path -- we'll come back to that.

If you squint, you can already see the reader-side wins:

ts_min/ts_max enable coarse pruning (skip files that can’t match a time filter).
the log entry is human-inspectable and replayable.
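As a rough illustration of the coarse pruning idea, the sketch below keeps only segments whose `[ts_min, ts_max]` range intersects a query window. The `prune` helper and the second segment entry are hypothetical and are not part of the library's API; only the first entry's path and timestamps come from the log entry above.

```python
from datetime import datetime, timezone


def parse_ts(s: str) -> datetime:
    # Accept the trailing "Z" used in the log entry above.
    return datetime.fromisoformat(s.replace("Z", "+00:00"))


def prune(segments, query_start: datetime, query_end: datetime) -> list:
    """Return paths of segments whose time range overlaps [query_start, query_end]."""
    kept = []
    for seg in segments:
        ends_before = parse_ts(seg["ts_max"]) < query_start
        starts_after = parse_ts(seg["ts_min"]) > query_end
        if ends_before or starts_after:
            continue  # this file cannot contain matching rows, so skip it
        kept.append(seg["path"])
    return kept


SEGMENTS = [
    {"path": "data/nvda_1h.parquet",
     "ts_min": "2024-06-01T00:00:00Z", "ts_max": "2024-06-10T23:00:00Z"},
    # Hypothetical second segment, added only to show a file being skipped.
    {"path": "data/hypothetical_july.parquet",
     "ts_min": "2024-07-01T00:00:00Z", "ts_max": "2024-07-10T23:00:00Z"},
]

june_5 = datetime(2024, 6, 5, tzinfo=timezone.utc)
june_6 = datetime(2024, 6, 6, tzinfo=timezone.utc)
print(prune(SEGMENTS, june_5, june_6))
```

The check needs only the log metadata, so files outside the window are never opened.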

So far we’ve looked at one table, one append. But the more interesting question is: can you register multiple tables and query across them? That’s what Session is for.

Try it yourself: 60 seconds to a join (Python)

Here’s why this matters: Session isn't just "a query wrapper for one table". It's a single SQL session backed by Apache DataFusion where you can register multiple tables and run real joins across them.

pip install timeseries-table-format

from datetime import datetime, timezone
import tempfile
from pathlib import Path

import pyarrow as pa
import pyarrow.parquet as pq

import timeseries_table_format as ttf

with tempfile.TemporaryDirectory() as d:
    base = Path(d)

    # None = no timezone normalization (use timestamps as stored in Parquet)
    tz_config = None

    prices_root = base / "prices_tbl"
    prices = ttf.TimeSeriesTable.create(
        table_root=str(prices_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=tz_config,
    )
    prices_seg = base / "prices.parquet"
    pq.write_table(
        pa.table(
            {
                "ts": pa.array(
                    [datetime(2024, 6, 1, tzinfo=timezone.utc), datetime(2024, 6, 1, 1, tzinfo=timezone.utc)],
                    type=pa.timestamp("us", tz="UTC"),
                ),
                "symbol": pa.array(["NVDA", "NVDA"]),
                "close": pa.array([10.0, 11.0]),
            }
        ),
        str(prices_seg),
    )
    prices.append_parquet(str(prices_seg))

    volumes_root = base / "volumes_tbl"
    volumes = ttf.TimeSeriesTable.create(
        table_root=str(volumes_root),
        time_column="ts",
        bucket="1h",
        entity_columns=["symbol"],
        timezone=tz_config,
    )
    volumes_seg = base / "volumes.parquet"
    pq.write_table(
        pa.table(
            {
                "ts": pa.array(
                    [datetime(2024, 6, 1, tzinfo=timezone.utc), datetime(2024, 6, 1, 1, tzinfo=timezone.utc)],
                    type=pa.timestamp("us", tz="UTC"),
                ),
                "symbol": pa.array(["NVDA", "NVDA"]),
                "volume": pa.array([100, 120]),
            }
        ),
        str(volumes_seg),
    )
    volumes.append_parquet(str(volumes_seg))

    sess = ttf.Session()
    sess.register_tstable("prices", str(prices_root))
    sess.register_tstable("volumes", str(volumes_root))

    out = sess.sql("""
    select p.ts as ts, p.symbol as symbol, p.close as close, v.volume as volume
    from prices p
    join volumes v
      on p.ts = v.ts and p.symbol = v.symbol
    order by ts
    """)

    print(out) # in Jupyter, use just `out` for a rich HTML table

Two tables, one SQL join, pure Python — no Rust toolchain, no Spark cluster.

If you’re wondering how this pure-Python script executes SQL so quickly without a heavy JVM or cluster, it’s because Python is just the steering wheel here. Under the hood, the engine is written in Rust and powered by Apache DataFusion. You get the ergonomics of Python, but the multi-threaded performance of a compiled language.

That join worked because the same log + snapshot design extends naturally to multiple tables in one session. But there’s one more piece that makes this format specifically useful for time-series work: coverage tracking.

Why this isn’t just Delta-in-Rust: coverage tracking

Remember this field from the AddSegment JSON earlier?

"coverage_path": "_coverage/segments/segcov-ca3cea172cc538ce04756e34beaea4a4.roar"

Time-series users keep asking questions like:

“Do I have full coverage for this time range?”
“Where are the gaps?”
“Did I already ingest this time window, or am I about to overlap/duplicate data?”

Coverage is how I solved it — and it bought me two things I didn’t expect to get for free.

Gap/coverage questions become metadata reads, not Parquet rescans.
Overlap-safe ingestion becomes the default, not “best-effort”.

What “coverage” means (in one sentence)

If you created a table with bucket="1h", coverage is just "which 1-hour slots have data".

What _coverage/ stores

Under the table root, _coverage/ stores small sidecar files:

_coverage/segments/.roar - coverage for a segment (compressed roaring bitmaps -- fast set operations on bucket IDs)
_coverage/table/-.roar - a snapshot coverage for the whole table at a log version

The table snapshot is basically the union of segment coverages so far.

How append uses coverage (end-to-end)

When you append a Parquet file, the flow becomes:

Map the segment’s timestamps into bucket IDs (based on your bucket, like 1h).
Load the current table coverage snapshot (or empty for the first append).
Check overlap: segment_coverage & table_coverage.
If overlap is non-empty, reject the append (this surfaces as CoverageOverlapError in Python).
Otherwise:

write the segment coverage sidecar (coverage_path)
write the new table snapshot sidecar
commit the log update (same Delta-style OCC as before)
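The flow above can be sketched with plain Python sets standing in for the roaring bitmaps. This is a simplified model, not the library's code: `bucket_ids`, `append_segment`, `gaps`, and the `CoverageOverlap` exception are hypothetical names, and the sketch assumes timezone-aware timestamps and a fixed 1-hour bucket.

```python
from datetime import datetime, timezone

BUCKET_SECONDS = 3600  # corresponds to bucket="1h"


class CoverageOverlap(Exception):
    """Stand-in for the overlap error raised when buckets are already covered."""


def bucket_ids(timestamps) -> set:
    """Map timestamps to integer bucket IDs (hours since the Unix epoch)."""
    return {int(ts.timestamp()) // BUCKET_SECONDS for ts in timestamps}


def append_segment(table_coverage: set, timestamps) -> set:
    """Return the new table coverage, or raise if the segment overlaps it."""
    segment_coverage = bucket_ids(timestamps)
    overlap = segment_coverage & table_coverage
    if overlap:
        raise CoverageOverlap(f"{len(overlap)} bucket(s) already covered")
    # The new table snapshot is the union of all segment coverages so far.
    return table_coverage | segment_coverage


def gaps(table_coverage: set, start: datetime, end: datetime) -> list:
    """Bucket IDs in [start, end) that have no data."""
    first = int(start.timestamp()) // BUCKET_SECONDS
    last = int(end.timestamp()) // BUCKET_SECONDS
    return [b for b in range(first, last) if b not in table_coverage]


def hour(n: int) -> datetime:
    return datetime(2024, 6, 1, n, tzinfo=timezone.utc)


coverage = append_segment(set(), [hour(0), hour(1)])
coverage = append_segment(coverage, [hour(3)])
missing = gaps(coverage, hour(0), hour(4))  # only the 02:00 bucket is missing
```

Both the overlap check and the gap query touch only bucket IDs. A roaring bitmap plays the same role as the `set` here, with a much smaller footprint for long, dense ranges.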

The first time I saw the overlap check catch a duplicate ingest during testing, I knew this was the right abstraction — it was doing exactly the kind of silent data corruption prevention that I’d always had to bolt on manually in other pipelines.

That’s why the coverage_path shows up right next to ts_min/ts_max in the commit JSON: it's just more metadata that makes common time-series questions cheap.

“Why not just use Delta or Iceberg?” Fair question. You should, if your workload needs what they’re built for: schema evolution, MERGE/upsert, cloud object stores, the full Spark ecosystem. They’re battle-tested and general-purpose. This project exists because time-series append workloads have a narrower contract: you’re writing immutable, time-ordered segments, and your most common questions are about coverage and gaps, not schema changes. A format designed for that specific contract can bake in overlap detection and metadata-only coverage queries while skipping the complexity you don’t need, and that’s where the speed comes from.

Benchmarks

Anyone can claim “faster.” Here’s what the numbers actually look like.

I ran the same workload across ClickHouse, Delta Lake + Spark, PostgreSQL, and TimescaleDB using the NYC TLC FHVHV trip dataset (April-June 2024, ~73M rows). The test I care most about is “daily append”: 90 day-sized files appended one after another, like a real ETL pipeline.

Headline results (lower is better):

On daily appends, this format is ~3.3x faster than ClickHouse, ~4.3x faster than Delta + Spark, and ~5.5x faster than PostgreSQL in this setup.

The query story holds up too: on time-range scans it’s ~2.5x faster than ClickHouse and ~80x faster than PostgreSQL here. (Aggregations are also competitive with ClickHouse: within ~3% in this benchmark.)

full benchmark methodology

(Also in the repo under docs/benchmarks/README.md.)

Limitations / non-goals (v0)

I intentionally scoped this as a narrow v0. Every feature I left out was a deliberate choice to keep the core sharp and measurable:

Local filesystem tables (no S3/GCS/Azure object store yet)
No compaction / merge (overlap is rejected; no upsert semantics)
No schema evolution story yet
No distributed coordinator (single-writer OCC at the log level; conflicts surface as errors you retry)
Reader side is “replay the log” (no checkpointing yet)

Try it / feedback

The quickest “does it feel nice?” path is the Python quickstart earlier in this post (“Try it yourself: 60 seconds to a join (Python)”).

Everything — code, benchmarks, docs — lives here: timeseries-table-format on GitHub — PyPI

If this post was useful, a star helps — and if you have workload ideas or strong opinions on v1 priorities (compaction, object storage, schema evolution), open an issue.

Optimizing Spark Window Functions: From 33 Minutes to 12 Minutes

Hanbo Wang — Mon, 29 Sep 2025 19:53:54 +0000

TL;DR

A ROW_NUMBER() window (PARTITION BY game_id, pitch_uid, position_num, event_time ORDER BY processed_year DESC, processed_month DESC, processed_day DESC, tie_breakers) forced a cluster-wide shuffle + full sort, causing massive disk spill and a ~33 min runtime on a serverless SQL warehouse.
Replaced that pattern with MAX_BY on a packed 64‑bit ordering key (date + tie‑breakers), turning full sorts into O(1) incremental comparisons per row.
Runtime dropped to 12 min 37 s (≈ 2.7× faster); disk spill decreased from 391 GB → 297 GB (still I/O‑bound due to fixed executor sizing).
HashAggregate isn’t selected when buffer types aren’t mutable / grouping keys aren’t binary-stable for unsafe rows; ObjectHashAggregate offered minor gains but with higher heap/GC pressure at very high cardinality.
If you only need “top‑1 per group,” avoid window sorts; prefer MAX_BY(value, ordering) (or arg_max-style patterns) under a GROUP BY.
For extremely high cardinality, consider packing your ordering struct (e.g., date key + inverted millifeet coords) into a single BIGINT to speed comparisons.
Remaining spill is largely from grouping hash map page evictions; a classic cluster with larger executors can run spill‑free.

Before vs After (at a glance)

Intro

I work for an MLB team, and we process massive amounts of baseball tracking data every day. Recently, we had a Spark job on Databricks that needed to join an enormous fact table (player tracking records with coordinates for each frame) with a dimension table (mapping position number to player UID), then deduplicate the result using a window function.

1. The Original Query

The core logic used a window function to handle duplicate records (pseudo-SQL):

WITH ranked AS (
  SELECT 
    tracking.*,
    lineup.fielder_id,
    lineup.position_alpha,
    ROW_NUMBER() OVER (
      PARTITION BY game_id, pitch_uid, position_num, event_time
      ORDER BY 
        processed_year DESC, 
        processed_month DESC, 
        processed_day DESC, 
        tie_breakers -- placeholder for the x, y, z coordinate columns
    ) AS rn
  FROM hawkeye_tracking tracking
  JOIN hawkeye_lineup lineup ON (...)
)
SELECT * EXCEPT (rn)
FROM ranked 
WHERE rn = 1

The query doesn’t look complex, but its long execution time raised concerns. With the table at only a few hundred GB, why would it take over 30 minutes to run on Databricks, even on a medium-sized SQL warehouse? Something was clearly wrong, so I dove into the Databricks query profile to investigate.

Warehouse Size Investigation: I tested the same query on both an x-small serverless warehouse and a medium warehouse, and surprisingly, the execution times were nearly identical (~33 minutes). This immediately suggested that the bottleneck wasn’t compute capacity but rather an algorithmic or I/O problem.

2. Query Profile Analysis

Looking at the query profile, I immediately spotted the issue:

3. The Real Problem: Massive Disk Spilling

391.9 GB spilled to disk. This was the biggest red flag. The query spilled nearly 400 GB to disk, about 2.7x the amount of data actually read (146.66 GB). This indicates that Spark could not fit the intermediate results in memory and had to write them to disk repeatedly, which is a huge performance killer.

4. Digging Deeper: Top Operators Analysis

To understand what was causing this massive spilling, I clicked into the “Top operators” view:

The analysis revealed:

Shuffle operations consumed 47.8% of the total time (1.05 hours)
Sort operations took 34.6% of the time (56.27 minutes)
The window function was clearly the bottleneck

5. Root cause: Window Function Sorting Bottleneck

Clicking into Sort operation #11, I found the smoking gun:

90.37 GB spilled to disk from this single sort operation! The sort order was exactly the PARTITION BY keys followed by the ORDER BY clause from our ROW_NUMBER() window function:

src.game_id ASC NULLS FIRST
src.pitch_uid ASC NULLS FIRST
src.position_num ASC NULLS FIRST
src.event_time ASC NULLS FIRST
src.processed_year DESC NULLS LAST
src.processed_month DESC NULLS LAST
src.processed_day DESC NULLS LAST

This confirms that our window function’s ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) was forcing Spark to:

Shuffle all data by the partition keys
Sort massive partitions by the ORDER BY clause
Spill to disk when the sorted data exceeded memory

The combination of a large dataset with complex partitioning and sorting was overwhelming Spark’s memory management, causing the performance disaster we observed.

6. The Double Shuffle-Sort Mystery

But here’s what puzzled me initially: Why were there multiple shuffle operations consuming so much time?

Looking at the operations DAG here:

(huge sort on #11 and #7, massive shuffle on #8 and #4)

My first instinct was: “Is Spark doing a duplicate shuffle-sort? Why is that?”

7. Understanding What Happened: Sort-based Aggregation

To understand what went wrong, I needed to dig into how Spark handles large datasets. It turns out Spark has two strategies for aggregation:

Hash-based Aggregation (Preferred):

Creates an in-memory hash table for each unique group key
Updates aggregate values directly as it processes rows
Fast because it avoids sorting
Critical limitation: Requires all intermediate results to fit in memory

Sort-based Aggregation (Fallback):

Used when hash-based aggregation fails due to memory constraints
Process: Shuffle data by grouping keys → Sort within partitions → Aggregate
Can handle larger datasets but much slower due to sorting overhead
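The difference between the two strategies can be sketched in a few lines of plain Python. This is a toy model of the idea, not Spark's implementation: `hash_aggregate` and `sort_aggregate` are hypothetical helpers that sum a value per key.

```python
from itertools import groupby
from operator import itemgetter


def hash_aggregate(rows) -> dict:
    """Single pass: keep one running total per key in an in-memory hash table."""
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals


def sort_aggregate(rows) -> dict:
    """Sort by key first, then fold each run of equal keys."""
    ordered = sorted(rows, key=itemgetter(0))  # the expensive O(n log n) step
    return {
        key: sum(value for _, value in group)
        for key, group in groupby(ordered, key=itemgetter(0))
    }


ROWS = [("a", 1), ("b", 5), ("a", 2), ("c", 7), ("b", 1)]
print(hash_aggregate(ROWS) == sort_aggregate(ROWS))
```

Both produce the same result. The hash version never reorders the input but needs one buffer per distinct key in memory, while the sort version pays for a full sort up front and then only needs one group's state at a time.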

Here’s what happened in our case: Our dataset was too large for hash-based aggregation, so Spark automatically fell back to sort-based aggregation. But we weren’t just doing a simple aggregation. We also had a window function, which created a perfect storm.

8. The Double Penalty: Why We Had Multiple Sorts

Our query had both aggregation-like behavior AND a window function, which meant we got hit with a double penalty:

Sort-based aggregation phase: Because our intermediate results were too large for memory, Spark had to use sort-based aggregation for the preliminary data processing. This involved:

Shuffling data by certain grouping keys
Sorting within each partition to enable efficient aggregation
Preparing data for the subsequent window function

Window function sorting phase: The ROW_NUMBER() window function then required its own sorting operations:

Additional shuffle by the PARTITION BY keys (game_id, pitch_uid, position_num, event_time)
Additional sort within each partition by the ORDER BY clause (processed_year DESC, processed_month DESC, processed_day DESC)

So our execution plan looked like:

Data → Sort-based Agg Shuffle → Sort-based Agg Sort → Window Shuffle → Window Sort → Results

Each arrow represents massive data movement with 2.3 billion rows, and each sort operation risked spilling to disk when memory was exceeded.

9. Why This Was So Expensive

The problem compounded because:

Volume amplification: Each operation had to process the full 2.3 billion rows
Memory pressure cascade: The first sort operation filled up memory, making subsequent operations more likely to spill
Multiple network shuffles: Data moved across the cluster multiple times
Disk I/O bottleneck: Once spilling started (391 GB!), the entire pipeline became I/O bound

This explains the performance characteristics we observed:

Shuffle operations taking 47.8% of time: Multiple data redistribution phases
Sort operations taking 34.6% of time: Multiple sorting phases with different criteria
Warehouse size didn’t matter: Once you’re spilling hundreds of GBs to disk, more CPU cores can’t help

The combination of large dataset size, memory-constrained sort-based aggregation, AND window function requirements created a performance disaster where each operation made the next one worse.

10. The Real Problem: We Don’t Need Full Sorting!

Here’s the key insight: We’re doing way more work than necessary.

Our goal was simple: for each combination of game_id, pitch_uid, position_num, and event_time, we only wanted the record with the highest values of processed_year, processed_month, and processed_day.

But our ROW_NUMBER() window function forced Spark to:

Shuffle all 2.3 billion rows by partition keys
Fully sort each partition by the ORDER BY clause
Assign row numbers to every single row
Filter out everything except rn = 1

We were essentially sorting every row in each partition just to pick the “top 1” — that’s massively inefficient!

11. The Solution: max_by Function

After some research, I found that Spark 3.0 added the max_by function, which is exactly what we need:

-- Instead of this expensive window function:
WITH ranked AS (
  SELECT 
    tracking.*,
    lineup.fielder_id,
    lineup.position_alpha,
    ROW_NUMBER() OVER (
      PARTITION BY game_id, pitch_uid, position_num, event_time
      ORDER BY processed_year DESC, processed_month DESC, processed_day DESC
    ) AS rn
  FROM hawkeye_tracking tracking
  JOIN hawkeye_lineup lineup ON (...)
)
SELECT * EXCEPT (rn) FROM ranked WHERE rn = 1

-- We can use this much more efficient approach.
-- Star expansion only works on a named struct column, so aggregate in a subquery first.
-- The output matches the window query: tracking.* plus the two lineup columns.
SELECT picked.*
FROM (
  SELECT
    MAX_BY(
      STRUCT(tracking.*, lineup.fielder_id, lineup.position_alpha),
      STRUCT(processed_year, processed_month, processed_day, tie_breakers)
    ) AS picked
  FROM hawkeye_tracking tracking
  JOIN hawkeye_lineup lineup ON (...)
  GROUP BY game_id, pitch_uid, position_num, event_time
) winners

12. How max_by Works Under the Hood

To understand why max_by is so much more efficient, I dug into the Spark source code. Here's the key implementation from MaxByAndMinBy.scala:

/**
 * The shared abstract superclass for `MaxBy` and `MinBy` SQL aggregate functions.
 */
abstract class MaxMinBy extends DeclarativeAggregate with BinaryLike[Expression] {

  // The attributes used to keep extremum (max or min) and associated aggregated values.
  private lazy val extremumOrdering =
    AttributeReference("extremumOrdering", orderingExpr.dataType)()
  private lazy val valueWithExtremumOrdering =
    AttributeReference("valueWithExtremumOrdering", valueExpr.dataType)()

  override lazy val updateExpressions: Seq[Expression] = Seq(
    /* valueWithExtremumOrdering = */
    CaseWhen(
      (extremumOrdering.isNull && orderingExpr.isNull, nullValue) ::
        (extremumOrdering.isNull, valueExpr) ::
        (orderingExpr.isNull, valueWithExtremumOrdering) :: Nil,
      If(predicate(extremumOrdering, orderingExpr), valueWithExtremumOrdering, valueExpr)
    ),
    /* extremumOrdering = */ orderingUpdater(extremumOrdering, orderingExpr)
  )
}

case class MaxBy(valueExpr: Expression, orderingExpr: Expression) extends MaxMinBy {
  override def prettyName: String = "max_by"

  override protected def predicate(oldExpr: Expression, newExpr: Expression): Expression =
    oldExpr > newExpr

  override protected def orderingUpdater(oldExpr: Expression, newExpr: Expression): Expression =
    greatest(oldExpr, newExpr)
}

The key insight: instead of collecting and sorting all rows in each group, max_by uses an incremental aggregation approach:

1. For each group, maintain only two values:

valueWithExtremumOrdering: The current "best" record
extremumOrdering: The current maximum ordering value

2. For each new row, do a simple comparison:

If new_ordering > current_max_ordering, update both values
Otherwise, keep the current values

3. No sorting required: just O(1) work per row instead of an O(n log n) sort

This means:

Memory usage: O(1) per group instead of O(n)
CPU complexity: O(n) instead of O(n log n)
Less disk spilling: The intermediate state fits more easily in memory
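To see the contrast outside Spark, here is a small Python sketch of both ways to pick the top row per group. It is an illustration of the idea rather than Spark's code: `top1_by_sort` and `top1_by_max_by` are hypothetical helpers, the sample rows are made up, and tie handling is not modelled.

```python
def top1_by_sort(rows, key_fn, order_fn) -> dict:
    """ROW_NUMBER()-style: buffer every row of a group, sort it, keep the first."""
    groups = {}
    for row in rows:
        groups.setdefault(key_fn(row), []).append(row)
    return {k: sorted(g, key=order_fn, reverse=True)[0] for k, g in groups.items()}


def top1_by_max_by(rows, key_fn, order_fn) -> dict:
    """MAX_BY-style: keep only the current best row and its ordering per group."""
    best = {}
    for row in rows:
        key, ordering = key_fn(row), order_fn(row)
        if key not in best or ordering > best[key][0]:
            best[key] = (ordering, row)
    return {k: row for k, (_, row) in best.items()}


# Made-up sample data: "ord" plays the role of (year, month, day).
ROWS = [
    {"game": "g1", "ord": (2024, 6, 1), "payload": "old"},
    {"game": "g1", "ord": (2024, 6, 3), "payload": "new"},
    {"game": "g1", "ord": (2024, 6, 2), "payload": "mid"},
    {"game": "g2", "ord": (2024, 5, 9), "payload": "only"},
]

by_sort = top1_by_sort(ROWS, lambda r: r["game"], lambda r: r["ord"])
by_max = top1_by_max_by(ROWS, lambda r: r["game"], lambda r: r["ord"])
```

Both return the same winners, but the second version never holds more than one row per group, which is the property that reduces spilling.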

13. Packed Ordering Key Optimisation

To optimize the query further, we can pack the struct used as the ordering key into a single int64 with bit manipulation, so that comparing two records becomes a single integer comparison.

-- instead of directly using this expression
MAX_BY(
    STRUCT(tracking.*, lineup.fielder_id, lineup.position_alpha),
    STRUCT(processed_year, processed_month, processed_day, tie_breakers)
  ).*

-- we convert the STRUCT(processed_year, processed_month, processed_day, tie_breakers) into one single int64

scored AS (
  SELECT
    *,
    -- days since 2000-01-01 (newer date ⇒ larger value)
    datediff(
      to_date('2000-01-01'),
      to_date(
        concat(
          processed_year,
          '-',
          processed_month,
          '-',
          processed_day
        )
      )
    ) * -1 AS date_key,
    -- millifeet integers, inverted so that smaller coords win
    CAST(262143 - ROUND(x_ft * 1000) AS BIGINT) AS x_inv,
    -- BIGINT
    CAST(262143 - ROUND(y_ft * 1000) AS BIGINT) AS y_inv,
    CAST(8191 - ROUND(z_ft * 1000) AS BIGINT) AS z_inv
  FROM
    hashed
),
keys_only as (
  SELECT
    game_id,
    pitch_uid,
    position_num,
    event_time,
    pos_id64,
    /* 64-bit packed ordering key */
    (
      SHIFTLEFT(CAST(date_key AS BIGINT), 49) | SHIFTLEFT(x_inv, 31) | SHIFTLEFT(y_inv, 13) | z_inv
    ) AS ord_key
  FROM
    scored
),
winner AS (
  -- hash aggregate, no spill
  SELECT
    MAX_BY(pos_id64, ord_key) AS pos_id64_winner,
    game_id,
    pitch_uid,
    position_num,
    event_time
  FROM
    keys_only
  GROUP BY
    game_id,
    pitch_uid,
    position_num,
    event_time
)
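A quick way to sanity-check the bit layout is to reproduce the packing in Python and confirm that the packed integer sorts the same way as the tuple it replaces. This sketch assumes coordinates are non-negative and small enough to fit their fields (18 bits for x and y, 13 bits for z, matching the constants 262143 and 8191 above). `pack_ord_key` and `tuple_key` are hypothetical helpers, and Python's `round` rounds halves differently from Spark's ROUND, which does not matter for the sample values used here.

```python
from datetime import date

EPOCH = date(2000, 1, 1)
X_MAX = Y_MAX = 262143  # 2**18 - 1, so x and y each get 18 bits
Z_MAX = 8191            # 2**13 - 1, so z gets 13 bits


def pack_ord_key(d: date, x_ft: float, y_ft: float, z_ft: float) -> int:
    """Pack (date, x, y, z) into one integer: newer date wins, then smaller coords."""
    date_key = (d - EPOCH).days
    x_inv = X_MAX - round(x_ft * 1000)  # millifeet, inverted so smaller x sorts higher
    y_inv = Y_MAX - round(y_ft * 1000)
    z_inv = Z_MAX - round(z_ft * 1000)
    # Each field must fit its bit width, or it would bleed into its neighbour.
    assert 0 <= x_inv <= X_MAX and 0 <= y_inv <= Y_MAX and 0 <= z_inv <= Z_MAX
    # 64 - 49 = 15 bits remain for the date, one of which is the sign bit.
    assert 0 <= date_key < (1 << 14)
    return (date_key << 49) | (x_inv << 31) | (y_inv << 13) | z_inv


def tuple_key(d: date, x_ft: float, y_ft: float, z_ft: float) -> tuple:
    """The unpacked ordering that the packed key is meant to reproduce."""
    return (d, -x_ft, -y_ft, -z_ft)


SAMPLES = [
    (date(2024, 6, 1), 10.0, 20.0, 1.0),
    (date(2024, 6, 2), 10.0, 20.0, 1.0),
    (date(2024, 6, 2), 9.5, 20.0, 1.0),
    (date(2024, 6, 2), 9.5, 20.0, 0.5),
]

packed_order = sorted(SAMPLES, key=lambda s: pack_ord_key(*s))
tuple_order = sorted(SAMPLES, key=lambda s: tuple_key(*s))
```

If the two orders match, MAX_BY on the packed key picks the same winner as MAX_BY on the struct. The range assertions matter: a negative or out-of-range coordinate would overflow its field and silently corrupt the ordering.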

14. Hash-Agg vs Sort-Agg Deep Dive

So this removed the sorting inside the window function, but what about the sorting before it? In other words, how can we make sure we are using hash agg instead of sort agg?

Spark doesn’t directly expose an option for choosing between them and supposedly just picks hash agg whenever possible, so let’s check Spark’s source code again to confirm.

From Spark’s AggUtils.scala we can see:

private def createAggregate(
      requiredChildDistributionExpressions: Option[Seq[Expression]] = None,
      isStreaming: Boolean = false,
      groupingExpressions: Seq[NamedExpression] = Nil,
      aggregateExpressions: Seq[AggregateExpression] = Nil,
      aggregateAttributes: Seq[Attribute] = Nil,
      initialInputBufferOffset: Int = 0,
      resultExpressions: Seq[NamedExpression] = Nil,
      child: SparkPlan): SparkPlan = {
    val useHash = Aggregate.supportsHashAggregate(
      aggregateExpressions.flatMap(_.aggregateFunction.aggBufferAttributes), groupingExpressions)

    val forceObjHashAggregate = forceApplyObjectHashAggregate(child.conf)
    val forceSortAggregate = forceApplySortAggregate(child.conf)

    if (useHash && !forceSortAggregate && !forceObjHashAggregate) {
      HashAggregateExec(
        requiredChildDistributionExpressions = requiredChildDistributionExpressions,
        isStreaming = isStreaming,
        numShufflePartitions = None,
        groupingExpressions = groupingExpressions,
        aggregateExpressions = mayRemoveAggFilters(aggregateExpressions),
        aggregateAttributes = aggregateAttributes,
        initialInputBufferOffset = initialInputBufferOffset,
        resultExpressions = resultExpressions,
        child = child)
    } else {
      val objectHashEnabled = child.conf.useObjectHashAggregation
      val useObjectHash = Aggregate.supportsObjectHashAggregate(
        aggregateExpressions, groupingExpressions)

      if (forceObjHashAggregate || (objectHashEnabled && useObjectHash && !forceSortAggregate)) {
        ObjectHashAggregateExec(
          requiredChildDistributionExpressions = requiredChildDistributionExpressions,
          isStreaming = isStreaming,
          numShufflePartitions = None,
          groupingExpressions = groupingExpressions,
          aggregateExpressions = mayRemoveAggFilters(aggregateExpressions),
          aggregateAttributes = aggregateAttributes,
          initialInputBufferOffset = initialInputBufferOffset,
          resultExpressions = resultExpressions,
          child = child)
      } else {
        SortAggregateExec(
          requiredChildDistributionExpressions = requiredChildDistributionExpressions,
          isStreaming = isStreaming,
          numShufflePartitions = None,
          groupingExpressions = groupingExpressions,
          aggregateExpressions = mayRemoveAggFilters(aggregateExpressions),
          aggregateAttributes = aggregateAttributes,
          initialInputBufferOffset = initialInputBufferOffset,
          resultExpressions = resultExpressions,
          child = child)
      }
    }
  }

To understand what Aggregate.supportsHashAggregate does, we look at basicLogicalOperators.scala:

def supportsHashAggregate(
      aggregateBufferAttributes: Seq[Attribute], groupingExpression: Seq[Expression]): Boolean = {
    val aggregationBufferSchema = DataTypeUtils.fromAttributes(aggregateBufferAttributes)
    isAggregateBufferMutable(aggregationBufferSchema) &&
      groupingExpression.forall(e => UnsafeRowUtils.isBinaryStable(e.dataType))
  }

This means that every field in the aggregation buffer has to be mutable, and every grouping expression’s data type has to be binary stable.

Mutable:

based on source code from here:

public static boolean isMutable(DataType dt) {
    if (dt instanceof UserDefinedType udt) {
      return isMutable(udt.sqlType());
    }
    PhysicalDataType pdt = PhysicalDataType.apply(dt);
    return pdt instanceof PhysicalPrimitiveType || pdt instanceof PhysicalDecimalType ||
      pdt instanceof PhysicalCalendarIntervalType;
  }

Only these data types are considered mutable:

1. PhysicalPrimitiveType, which includes:

PhysicalBooleanType
PhysicalByteType
PhysicalShortType
PhysicalIntegerType
PhysicalLongType
PhysicalFloatType
PhysicalDoubleType
PhysicalNullType

2. PhysicalDecimalType

3. PhysicalCalendarIntervalType

Binary stable:

from the source code here:

def isBinaryStable(dataType: DataType): Boolean = !dataType.existsRecursively {
    case st: StringType =>
      !st.supportsBinaryEquality
    case _ => false
  }

And for supportsBinaryEquality, the source shows:

private[sql] def supportsBinaryEquality: Boolean =
    collationId == CollationFactory.UTF8_BINARY_COLLATION_ID ||
      CollationFactory.fetchCollation(collationId).supportsBinaryEquality

So it seems that as long as we don’t use a collation other than UTF8_BINARY, all strings should be binary stable.

Unfortunately, because the fields in our max_by function include string types, which are not mutable, a regular hash agg is not feasible for our scenario.
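The selection rule can be modelled in a few lines of Python. This is a deliberately simplified sketch of the logic quoted above, not Spark's code: types are plain strings, nested types are ignored, only the UTF8_BINARY collation is treated as binary stable, and `supports_hash_aggregate` here is a hypothetical stand-in for the Scala method.

```python
# Flat stand-ins for the physical types listed above as mutable.
MUTABLE_TYPES = {
    "boolean", "byte", "short", "int", "long", "float", "double", "null",
    "decimal", "calendar_interval",
}


def is_binary_stable(dtype: str, collation: str = "UTF8_BINARY") -> bool:
    """Only strings can be unstable, and only under a non-binary collation."""
    return dtype != "string" or collation == "UTF8_BINARY"


def supports_hash_aggregate(buffer_types, grouping_types) -> bool:
    """Every buffer field must be mutable and every grouping key binary stable."""
    return (
        all(t in MUTABLE_TYPES for t in buffer_types)
        and all(is_binary_stable(t) for t in grouping_types)
    )


GROUPING = ["string", "string", "int", "long"]  # illustrative key types only

# MAX_BY over a row that carries string fields: the buffer holds a non-mutable type.
wide_row = supports_hash_aggregate(["string", "long"], GROUPING)

# MAX_BY(pos_id64, ord_key): both buffer fields are 64-bit integers.
keys_only = supports_hash_aggregate(["long", "long"], GROUPING)
```

Under this model the wide-row aggregation is rejected while the keys-only one qualifies, which is the motivation for aggregating on `pos_id64` and `ord_key` first.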

15. Object-Hash Benchmark

But Spark does have a middle-ground aggregation strategy called ObjectHashAggregate, introduced in version 2.2.0.

Unlike the HashAggregateExec which stores aggregation buffers in the UnsafeRow in off-heap memory, the ObjectHashAggregateExec stores the aggregation buffers in the SpecificInternalRow which internally holds a Java Array collection of aggregation buffer fields in Java heap memory.

The ObjectHashAggregateExec uses an ObjectAggregationMap instance as the hash map instead of the UnsafeFixedWidthAggregationMap used by the HashAggregateExec. The ObjectAggregationMap supports storing arbitrary Java objects as aggregate buffer values.

But for our use case, because our grouping key has very high cardinality (a few hundred million groups), ObjectHashAggregate would consume much more RAM, since it creates Java objects for each record, and would put enormous pressure on the JVM GC because they are stored on-heap.

Still, I tested the same SQL query with sort agg and with object hash agg:

import time

from pyspark.sql import functions as F

# 1) Warm up the cache of 'scored' so both experiments start from memory
print("Warming up scored cache…")
scored.count()

# 2) the same agg_sql as before, using unpacked primitives and 'ord'
agg_sql = """
SELECT struct(
    ...
) AS picked_row
FROM (
  SELECT *,
    (SHIFTLEFT(date_key,49)
     | SHIFTLEFT(x_inv,31)
     | SHIFTLEFT(y_inv,13)
     | z_inv) AS ord
  FROM scored
) t
GROUP BY game_id, pitch_uid, position_num, event_time
"""
spark.conf.set("spark.sql.objectHashAggregate.sortBased.fallbackThreshold", 1000000000)

def bench(label, use_obj_hash: bool):
    spark.conf.set("spark.sql.execution.useObjectHashAggregateExec", str(use_obj_hash).lower())
    start = time.time()
    cnt = spark.sql(agg_sql).count()
    print(f"{label:>20} (useObjectHash={use_obj_hash}): {time.time()-start:.1f}s, rows={cnt}")

# 3) Benchmark sort-based vs object-hash back-to-back
bench("Sort-based", False)
bench("Object-hash-agg", True)

And the difference is not that significant:

Warming up scored cache…
Sort-based (useObjectHash=False): 47.7s, rows=274967745
Object-hash-agg (useObjectHash=True): 41.5s, rows=274967745

As noted above, object hash agg takes more RAM and puts more pressure on the JVM GC, so it is not ideal for our scenario, where the grouping-key cardinality is very high.

16. Post-Optimisation Results (12-min Run)

Key takeaways after replacing ROW_NUMBER() with max_by:

Why some spill remains: Serverless SQL warehouses use fixed executor sizes. With 2.9 billion rows, the grouping hash map still evicts pages to disk. A classic cluster with beefier executors runs spill-free.