DEV Community

Cover image for Trino vs Presto vs Athena: Federated SQL Engines for the Modern Lakehouse
Gowtham Potureddi
Gowtham Potureddi

Posted on

Trino vs Presto vs Athena: Federated SQL Engines for the Modern Lakehouse

trino vs presto looks like a single-word product comparison to a junior — interviewers know it is actually a twelve-year lineage question that splits Facebook's 2012 Presto into PrestoDB (Meta + Linux Foundation) and Trino (Starburst), with AWS Athena straddling both engines and Starburst layering commercial governance on top. The result is the most under-explained question in the modern lakehouse stack: four engine names, one shared architecture, and four very different operational, cost, and connector profiles that every senior data engineer is expected to defend at an interview whiteboard.

This guide is the cheat sheet that decodes the entire family. It walks through the trino presto athena lineage, the coordinator + workers + connectors architecture that every distributed sql engine shares, the connector ecosystem that drives real query federation, the athena vs presto cost-versus-utilisation decision, and the trino interview questions interviewers love to probe — the prestodb vs trino rename, predicate pushdown, cross-source joins, and when fault-tolerant execution actually earns its keep. Each section pairs a teaching block with a Solution-Tail interview answer — code, a step-by-step trace, an output table, then a concept-by-concept breakdown of why it works.

PipeCode blog header for Trino vs Presto vs Athena — bold white headline 'Trino vs Presto vs Athena' with subtitle 'federated SQL · connector ecosystem · lakehouse engines' and three stylised engine orbs (Trino purple, Presto blue, Athena orange) linked by glowing federation lines on a dark gradient with a small pipecode.ai attribution.

When you want hands-on reps immediately after reading, drill the SQL practice library →, rehearse on joins practice →, and stack the query-shape muscles with aggregation problems →.


On this page


1. Why federated SQL became the lakehouse default

Federated SQL is the answer to "the data lives in seven places" — the lakehouse needs a query engine, not another database

The one-sentence invariant: a federated SQL engine pushes a single SQL statement across many storage backends without copying data into one warehouse first — that is the precise property the lakehouse pattern depends on, and the precise reason Trino, PrestoDB, and Athena exist as a family rather than as competitors to Snowflake or BigQuery. Once you internalise that "query travels to data, not data to query," the entire trino vs presto interview surface stops being a feature checklist and becomes a single architectural argument.

The three pressures that produced federated SQL.

  • Data lake explosion. By 2018 every large company had at least one petabyte-scale object store (S3, ADLS, GCS) that no traditional warehouse could read natively without a copy step. The "load it into Redshift first" answer stopped fitting in the budget and the SLO.
  • Multi-source analytics. Product analytics, finance ledgers, CRM exports, event streams, and ML feature stores started living in different engines (Postgres, MySQL, Kafka, Iceberg, Elasticsearch). Stitching them into one report meant either a nightly ETL run or a query engine that could fan out reads in flight.
  • Decoupled storage and compute. The S3-as-table-storage thesis (table formats: Hive, Iceberg, Delta, Hudi) turned storage into a flat layer that any compute engine could attach to. The natural next step was a thin SQL engine on top — no storage, no transactions, just plan-execute-return.

Federated SQL vs traditional MPP — the contract.

  • MPP warehouse (Snowflake, BigQuery, Redshift). Owns its storage format, its statistics, its transaction log, its compute cluster, and its catalog. Optimal when every byte you query lives inside that warehouse.
  • Federated SQL engine (Trino, PrestoDB, Athena). Owns only the query coordinator and the compute. Storage, statistics, transactions, and catalogs all live in the connector — the engine asks each connector "what can you push down, what statistics do you have, how do I split this scan?" and dispatches accordingly. This is why a single Trino SELECT can join a Postgres customers table to an Iceberg events table to a Kafka clicks topic in one statement.

Why "query engine, not database" is the right framing.

  • No durable storage of its own. The engine has memory and scratch disk only. Catastrophic worker loss costs the in-flight query, not the data.
  • No transactional commit semantics. Writes are delegated to the connector (the Iceberg connector writes Iceberg snapshots; the Hive connector writes Parquet files into Hive partitions).
  • No proprietary statistics. Cost-based optimisation reads stats from the catalog (Hive metastore, Glue, Iceberg metadata) — there is no "ANALYZE the warehouse" step the engine owns.

The four engines you actually compare in 2026.

  • Trino. The post-2019 fork led by Starburst-affiliated maintainers. Monthly release cadence. Most aggressive connector and execution roadmap. The default open-source answer.
  • PrestoDB. The Meta + Linux Foundation continuation of the 2012 Facebook code base. Quarterly releases. Strong on Spark integration (Presto on Spark) and RaptorX hot-data caching.
  • AWS Athena. Serverless managed engine. Engine v2 was PrestoDB; engine v3 (2023+) is Trino. No cluster to run. Per-query pricing.
  • Starburst. Commercial Trino distribution with governance, caching (Warp Speed), and an enterprise control plane. Often the "we want Trino with a support contract" answer.

The 2026 reality.

  • Trino has won the open-source mindshare but Athena is the deployment majority for AWS-anchored shops because the per-query pricing kills the operational tax for spiky workloads.
  • Iceberg has become the assumed table format for new lakehouses — every engine in this family ships first-class Iceberg connectors.
  • Fault-tolerant execution (Trino) and RaptorX caching (PrestoDB) finally make "long-running ETL on a query engine" feasible, narrowing the gap with Spark for analytical workloads.

Worked example — when federated SQL is the wrong choice

Detailed explanation. A common interview opener is "would you put Trino in front of an OLTP database for low-latency dashboards?" The correct answer is no — every federated SQL engine pays a coordinator-planning tax that makes sub-100ms queries unrealistic, and every backing connector becomes a bottleneck when the workload pattern is "thousands of small reads per second." Federated SQL was designed for analytical workloads — large scans, complex joins, multi-source — not for transactional workloads.

Question. A team wants to run a customer-facing dashboard with 200 ms p95 latency over a Postgres OLTP database. Should they front Postgres with Trino? Why or why not?

Input — workload characteristics.

Workload Query rate Rows scanned Latency target
Customer dashboard 5,000 qps 50 rows / query < 200 ms p95
Analyst exploration 5 qps 1B rows / query < 30 s p95

Code — what not to do.

-- Anti-pattern: putting Trino in front of OLTP
-- A user-facing dashboard query hits Trino, which hits Postgres via JDBC connector.
-- Every request pays: coordinator planning + JDBC round trip + result marshalling.
SELECT customer_id, plan, last_login_at
FROM postgres.public.customers
WHERE customer_id = ?;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Trino coordinator parses the SQL, plans it, dispatches a single split to one worker — every step costs tens of milliseconds even when the underlying query is trivial.
  2. The worker opens a JDBC connection to Postgres (or borrows one from the connector pool) and issues the same SELECT. The Postgres query itself returns in 5 ms.
  3. The result is serialised back through the worker to the coordinator, then to the client. The hop adds another 10 ms.
  4. At 5,000 qps you are running thousands of JDBC connections per second and burning coordinator CPU on planning overhead for tiny queries.
  5. The right answer is "skip Trino — let the dashboard read Postgres directly, or read a materialised aggregate cache (Redis, ClickHouse, a denormalised replica)."

Output.

Approach Median latency p95 Notes
Direct Postgres 5 ms 15 ms Native, designed for this
Trino in front 80 ms 250 ms Coordinator + JDBC overhead
ClickHouse cache 8 ms 25 ms Best for high-qps analytics

Rule of thumb. Federated SQL engines are analytical engines. If the workload is "thousands of small queries per second with sub-100 ms latency," reach for the source database directly or a purpose-built serving layer. Trino, PrestoDB, and Athena all shine when query rate is low and scan volume is high — exactly the inverse profile.

Worked example — the "before federated SQL" pipeline

Detailed explanation. A second favourite interview probe: "what did teams do before Trino existed?" The honest answer is "they ran nightly ETL jobs that copied every source into a single warehouse, then queried the warehouse." Recognising the pattern that federated SQL replaces makes the value of the engines obvious, and surfaces the cost trade-off you accept by adopting one.

Question. Sketch the data flow for a single dashboard that needs Postgres customer data joined to S3 event data, before and after a federated SQL engine. What changes?

Input — what the dashboard needs.

Source Live in Daily volume
customers Postgres OLTP 50 GB
events S3 / Iceberg 5 TB

Code.

-- BEFORE — nightly ETL approach
-- 1) Extract Postgres customers via CDC into S3 staging
-- 2) Run Spark ETL to land into the warehouse (Snowflake)
-- 3) Run Spark/Glue to land events into the warehouse
-- 4) Run the join inside Snowflake
SELECT c.plan, COUNT(*) AS events
FROM warehouse.customers c
JOIN warehouse.events e ON e.customer_id = c.customer_id
GROUP BY c.plan;

-- AFTER — federated SQL via Trino
-- Same SQL, no ETL — two connectors, one statement
SELECT c.plan, COUNT(*) AS events
FROM postgres.public.customers c
JOIN iceberg.lakehouse.events e ON e.customer_id = c.customer_id
GROUP BY c.plan;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The BEFORE pipeline pays an "ETL latency tax" — yesterday's events join yesterday's customers because both arrive in the warehouse on a nightly schedule.
  2. It also pays a "storage duplication tax" — every byte lives once in Postgres or S3 and once in Snowflake.
  3. The AFTER pipeline drops both taxes. The Trino coordinator reads customers live from Postgres and events live from Iceberg, joins them in flight, and returns the aggregate.
  4. The cost trade-off: every dashboard hit re-reads Postgres and S3. If the dashboard runs thousands of times a day, the cumulative read cost on the source systems may exceed the cost of materialising the join nightly.
  5. The right hybrid is "federated SQL for ad-hoc exploration, materialised tables for hot dashboards" — Trino is the discovery tool, the warehouse is the serving tool.

Output.

Pattern Freshness Source load Storage cost
Nightly ETL 24 h stale low (one read/day) 2x (copy in warehouse)
Federated SQL live high (per query) 1x (no copy)

Rule of thumb. Federated SQL trades fresh data and storage savings for source-system read load. Use it where freshness matters and read volume is low; materialise the result where read volume is high and freshness can lag.

SQL interview question on federated SQL fundamentals

A senior interviewer often opens with: "Explain when you'd reach for Trino instead of Snowflake. Walk me through one concrete workload where Trino wins and one where Snowflake wins, and explain the architectural reason — not just the price tag."

Solution Using the federated-vs-MPP decision frame

-- Federated SQL wins: live cross-source join, no ETL latency
-- Trino reads two backends in flight, joins in compute, returns aggregate
SELECT
    c.plan,
    DATE_TRUNC('day', e.event_ts) AS day,
    COUNT(*)                       AS events,
    COUNT(DISTINCT e.customer_id)  AS active_customers
FROM postgres.crm.customers c
JOIN iceberg.lake.events    e
       ON e.customer_id = c.customer_id
WHERE e.event_ts >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY c.plan, DATE_TRUNC('day', e.event_ts);

-- MPP warehouse wins: 50 concurrent users hitting the same dashboard
-- Snowflake's pre-aggregated micro-partitions + result cache are the right primitive
-- This query would punish Trino by re-reading both backends per request
SELECT plan, total_revenue, active_customers
FROM analytics.daily_dashboard_v1
WHERE day = CURRENT_DATE;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Workload Engine Why
Live multi-source join, 1 query Trino No ETL needed; engine fans out to sources
Pre-aggregated dashboard, 50 qps Snowflake Result cache + micro-partition prune
1B-row ad-hoc on Iceberg Trino Native Iceberg, no warehouse load step
Hourly KPI batch Snowflake Tasks + streams own the warehouse pattern

The trace highlights the architectural divide: Trino is the right primitive when the query plan crosses backends or the data lives in S3 and must not be copied; the warehouse is the right primitive when the workload is stable, high-concurrency, and pre-aggregated.

Output:

Pattern Winner Reason
Cross-source ad-hoc Trino Federation, no copy
Iceberg ad-hoc Trino First-class connector
High-concurrency BI Snowflake Result cache, MPP
Hourly batch Snowflake Tasks, streams, transactions

Why this works — concept by concept:

  • Federated SQL trades freshness for source load — every Trino query is a live read against the backend, so you pay source-system load for every dashboard hit. Materialising the result trades freshness for cost.
  • MPP storage owns statistics — Snowflake's micro-partition statistics live inside the warehouse, which is what powers its concurrency. A query engine without owned storage relies on the connector's stats — often weaker.
  • Result cache vs cluster cache — Snowflake's result cache returns identical queries in milliseconds; Trino's per-cluster cache (or PrestoDB RaptorX) caches data blocks, not results, so warming the cache requires similar scans.
  • No transactional contract — Trino does not own a transaction log, so writes happen at the connector layer (Iceberg snapshots, Hive directory renames). Cross-source writes are not atomic.
  • Cost — Trino: O(rows scanned per query × source cost). Snowflake: O(rows × credit) + cache hit factor. The crossover is workload concurrency and read frequency, not "engine speed."

SQL
Topic — SQL fundamentals
SQL practice library

Practice →


2. The Presto → Trino → Athena lineage

One Facebook project became three engine names — the lineage is the most-asked Trino interview question

The mental model in one line: Presto was born at Facebook in 2012; the original maintainers forked it in 2019 as PrestoSQL (renamed Trino in 2020); Meta and the Linux Foundation kept the original code base under the PrestoDB name; AWS Athena moved from PrestoDB (engine v2) to Trino (engine v3). Once you can say that fluently, the rest of the prestodb vs trino interview surface is a Q&A on motivation, licensing, and roadmap rather than a knowledge gap.

Visual timeline diagram showing the lineage from Facebook Presto (2012) branching into PrestoDB (Meta / Linux Foundation) and Trino (Starburst), with Athena moving from PrestoDB engine v2 to Trino engine v3, plus a Starburst commercial branch, on a light PipeCode card.

The 12-year lineage in one paragraph.

  • 2012 — Facebook ships Presto. Built to run interactive SQL over Hive warehouses where Hive was too slow. Open-sourced in 2013. The early connectors target Hive and a handful of relational sources.
  • 2019 — the fork. The original founders (Martin Traverso, Dain Sundstrom, David Phillips) leave Facebook and form Starburst Data. They fork the project as PrestoSQL and continue active development. Facebook keeps the original code base under the PrestoDB name and donates it to the Linux Foundation as the Presto Foundation.
  • 2020 — PrestoSQL becomes Trino. A trademark dispute resolves with the fork rebranding to Trino. The codebase, releases, and community move under the new name.
  • 2021 — Athena engine v2. AWS Athena (launched 2016 on PrestoDB) is still on PrestoDB at this point.
  • 2023 — Athena engine v3. AWS moves Athena to Trino. The migration is transparent to the SQL surface, but unlocks faster release cadence and the Trino connector ecosystem.
  • 2024–2026 — Trino dominates the open-source narrative. Monthly release cadence vs PrestoDB's quarterly cadence; far broader connector coverage; Starburst's commercial offering grows around governance and caching.

Why the rename matters in interviews.

  • License clarity. Trino and PrestoDB are both Apache 2.0 — same OSS license — but they are different code bases with different release cadences and roadmaps. A query that works on Trino 435 may not parse on PrestoDB 0.288.
  • Connector parity. Trino tends to ship connector enhancements months earlier (Iceberg writes, Delta UniForm support, materialised views). PrestoDB catches up on the major ones eventually.
  • Cloud distribution. Athena's move to engine v3 (Trino) is a strong industry signal — the largest managed deployment of either engine chose Trino as its forward platform.

The release cadence comparison.

Engine Cadence License Governance
Trino Monthly Apache 2.0 Trino Software Foundation (Starburst-led)
PrestoDB Quarterly Apache 2.0 Presto Foundation (Linux Foundation)
Athena Continuous (managed) Proprietary AWS service AWS
Starburst Enterprise Quarterly Commercial (Trino core) Starburst Data

Common interview probes on the lineage.

  • "What's the difference between Presto and Trino?" — same DNA, different forks since 2019. Trino is the active fork; PrestoDB is the Meta-anchored continuation. Trino has the faster release cadence and broader connector ecosystem.
  • "Is Athena Presto or Trino?" — both, depending on era. Engine v2 was PrestoDB; engine v3 (2023+) is Trino.
  • "Why did the fork happen?" — governance disagreement and the founders wanting to ship faster than Facebook's review cycle allowed. The trademark dispute later forced the PrestoSQL → Trino rename.
  • "Is Starburst the same as Trino?" — Starburst Enterprise is a commercial distribution built on top of Trino, with added governance, caching (Warp Speed), and support. Trino is the core; Starburst is the wrapper.

Worked example — picking the right engine name in an interview answer

Detailed explanation. The classic mistake is to say "Presto" when you mean Trino, or "PrestoDB" when you mean Athena engine v3. Interviewers track the precision because it correlates with real platform choices — saying "we use Presto" without specifying which fork tells them you have not actually compared the engines. The fix is to always name the fork and the version.

Question. A platform team says "we run Presto on AWS." Which of the four engines do they actually mean, and what follow-up question would you ask to disambiguate?

Input — what "we run Presto on AWS" can mean.

Phrase used Likely engine Disambiguating follow-up
"Presto on EMR" PrestoDB (EMR's default) "EMR version? PrestoDB or Trino flavour?"
"Presto on EKS" Trino (community charts target Trino) "What's the connector for the lake?"
"Athena" Trino (engine v3) "Are you on engine v2 or v3?"
"Starburst Galaxy / Enterprise" Trino + Starburst "Self-hosted or Galaxy SaaS?"

Code — making the difference real with a version probe.

-- Trino — version probe
SELECT node_version FROM system.runtime.nodes LIMIT 1;
-- Returns: 435 (or similar three-digit Trino release)

-- PrestoDB — version probe
SELECT version() AS presto_version;
-- Returns: 0.288 (or similar 0.xxx PrestoDB release)

-- Athena — version probe in the AWS console
-- Workgroup settings show "Engine version 2" or "Engine version 3"
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Ask "what is the version string?" — Trino versions look like 435; PrestoDB versions look like 0.288. The format alone disambiguates.
  2. If the answer is "Athena," ask "engine v2 or v3?" — that maps directly to PrestoDB vs Trino under the hood and changes everything from connector support to SQL dialect quirks.
  3. If the answer is "Starburst," ask "Galaxy or Enterprise?" — Galaxy is the SaaS multi-tenant; Enterprise is self-hosted. Both wrap Trino, but operational responsibility differs.
  4. With the engine pinned, you can speak confidently about connector availability, feature flags, and known issues — without the engine pinned, every answer is a guess.

Output.

Engine Version format Fork
Trino three-digit (e.g. 435) Starburst-led OSS
PrestoDB 0.xxx (e.g. 0.288) Meta + Linux Foundation
Athena v3 "Engine version 3" Trino under the hood
Athena v2 "Engine version 2" PrestoDB under the hood

Rule of thumb. Every time someone says "Presto," ask "which one?" — the conversation is more productive after that. In interview answers, name the fork and the major version every time you reference a feature; that one habit signals "I have actually shipped on these engines" louder than any benchmark number.

Worked example — what the fork unlocked

Detailed explanation. A frequent interview probe is "what did the Trino fork actually enable that PrestoDB does not have?" The honest answer is release velocity plus a more aggressive execution roadmap — dynamic filtering, fault-tolerant execution (Project Tardigrade), and a much faster cadence on table-format support (Iceberg, Delta, Hudi writes). PrestoDB has its own innovations (RaptorX caching, Presto on Spark) but ships them quarterly.

Question. Sketch a side-by-side feature comparison for Trino vs PrestoDB as of 2026, focusing on the three or four features interviewers commonly probe.

Input — feature axes that matter in practice.

Axis Trino PrestoDB
Release cadence monthly quarterly
Iceberg writes mature (table maintenance, MERGE) available, slower roadmap
Fault-tolerant execution yes (Project Tardigrade) partial (Presto on Spark)
Caching community plugins + Starburst Warp Speed RaptorX (built-in)
Connector breadth wider narrower
MPP join enhancements dynamic filtering, AQE AQE-style improvements

Code — a feature flag check.

-- Trino: enable fault-tolerant execution at session level
SET SESSION retry_policy = 'TASK';

-- Run a long ETL query; failed tasks are retried instead of failing the query
INSERT INTO iceberg.lake.daily_kpis
SELECT day, plan, COUNT(*) AS events
FROM iceberg.lake.events
WHERE event_ts >= CURRENT_DATE - INTERVAL '1' DAY
GROUP BY day, plan;

-- PrestoDB equivalent: route the query through Presto on Spark for resiliency
-- This is a deployment switch, not a session setting
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The Trino SET SESSION retry_policy = 'TASK' enables Project Tardigrade — a worker failure no longer kills the whole query; the failed task retries on another worker.
  2. PrestoDB's nearest equivalent is to run long queries on the "Presto on Spark" runtime, which inherits Spark's task-level resilience.
  3. The trade-off is operational: Trino's TFE is in-process; Presto on Spark is a separate execution path with its own driver setup.
  4. For a typical ETL workload that runs 10–60 minutes, TFE is the easier setup; for hours-long jobs that already have a Spark pipeline, Presto on Spark is the natural fit.

Output.

Scenario Trino PrestoDB
5-min interactive query sub-second feature parity sub-second feature parity
30-min ETL with worker churn TFE retries tasks needs Presto on Spark
Iceberg writes first-class MERGE / OPTIMIZE available, less polished

Rule of thumb. For greenfield builds in 2026, Trino is the default open-source pick unless you have a specific reason to stay on PrestoDB (existing investment, EMR contract, RaptorX cache savings). For managed deployments, Athena is the right starting point on AWS unless cluster utilisation is high enough to justify self-hosting Trino on EKS.

SQL interview question on the Trino / Presto / Athena lineage

A senior interviewer often frames this as: "Tell me the lineage of Trino, Presto, and Athena. Why does it matter for SQL portability? Walk me through one query that behaves differently on each."

Solution Using a SQL dialect probe across the family

-- A query that exposes the three dialect quirks
SELECT
    user_id,
    -- 1) ARRAY indexing — Trino & PrestoDB are 1-indexed; Athena inherits whichever engine version
    tags[1] AS first_tag,
    -- 2) DATE arithmetic — Trino prefers INTERVAL '7' DAY; PrestoDB allows both DATE + 7 forms
    CURRENT_DATE - INTERVAL '7' DAY AS week_ago,
    -- 3) Lambda / higher-order functions — same syntax across all three
    REDUCE(amounts, 0, (s, x) -> s + x, s -> s) AS total_spend
FROM events_table;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Quirk Trino PrestoDB Athena v2 Athena v3
array[1] 1-indexed 1-indexed 1-indexed 1-indexed
DATE - INTERVAL '7' DAY yes yes yes yes
DATE - 7 no yes yes no
REDUCE(...) yes yes yes yes
LISTAGG yes (since 421) no no yes (v3)
MERGE on Iceberg yes newer no (v2) yes (v3)

The trace highlights that the common subset of SQL works everywhere, but the long tail (LISTAGG, MERGE on table formats, certain window-function frame syntaxes) diverges. The senior signal is to write portable SQL by default and reach for engine-specific sugar only when you must.

Output:

Query family Portable answer
Aggregations SUM, AVG, COUNT(DISTINCT ...) — works everywhere
Window functions ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) — universal
Date arithmetic INTERVAL '7' DAY — portable; avoid DATE - integer
Array index arr[1] — portable; avoid element_at quirks

Why this works — concept by concept:

  • Same lineage = same SQL core — every engine in the family parses the same ANSI-style SQL surface for SELECT / JOIN / WHERE / GROUP BY / ORDER BY plus the Presto-specific extensions (UNNEST, lambda, complex types).
  • Divergence lives at the edges — date arithmetic, certain aggregates (LISTAGG, GROUP_CONCAT), and MERGE / UPSERT semantics on table formats are where the engines drift apart.
  • Athena inherits its engine version — v2 quirks match PrestoDB of the same era; v3 quirks match Trino. Re-test queries when AWS announces engine upgrades.
  • Connector quirks compound__ — even when the SQL parses, the Iceberg connector on Trino may support a write mode the same connector on PrestoDB does not. Always check connector docs against the engine version.
  • Cost — engine-portable SQL has zero runtime cost; the cost is the discipline of not reaching for non-portable sugar.

SQL
Topic — case expression
CASE expression problems (SQL)

Practice →


3. Architecture compared — coordinator, workers, connectors

Every federated SQL engine is a coordinator + N workers + N connectors — the differences live in execution and ops, not topology

The mental model in one line: one coordinator parses and plans the SQL; many workers execute the splits; many connectors translate plan fragments into reads against external storage — Trino, PrestoDB, and Athena are all built that way, and naming the topology cold is the table-stakes architecture answer interviewers expect. The differences live in how tasks are scheduled, how failures are handled, and how the operator deploys the cluster — not in the topology itself.

Three side-by-side architecture cards comparing Trino, PrestoDB and Athena — each showing a coordinator orb, three worker hex cards, and a connector ring; Trino adds dynamic filtering badge, PrestoDB adds RaptorX cache badge, Athena adds serverless badge, on a light PipeCode card.

The shared core in detail.

  • Coordinator. Single JVM (per cluster) that owns: SQL parsing, semantic analysis, query planning, cost-based optimisation, split generation, task scheduling, and the metadata API the catalog talks to. Every client connection lands at the coordinator.
  • Workers. Stateless JVMs that execute the plan fragments handed to them. A worker holds the in-flight data for its share of the query in memory and on local scratch disk; if the worker dies, the in-flight data dies with it (unless fault-tolerant execution is enabled).
  • Connectors. Pluggable Java modules implementing the SPI (Service Provider Interface). Each connector knows how to list tables in its catalog, fetch column metadata, generate splits, push down predicates, and read / write rows. The connector is the only thing that knows about S3, Hive, Iceberg, MySQL, Kafka, etc. — the engine itself is storage-agnostic.

Trino-specific architecture.

  • Dynamic filtering. When a small join build side is materialised, the coordinator broadcasts the inferred predicate ("only customer_ids in this set") to the probe side's scan; the scan uses it as a runtime filter to skip rows before reading. Huge wins on star-schema joins.
  • Fault-tolerant execution (Project Tardigrade). Long queries can run with retry_policy = 'TASK' (task retries) or 'QUERY' (whole-query retries). Intermediate data spills to a shared "exchange" (S3 / HDFS / Azure) so a failed worker no longer kills the query.
  • Adaptive Query Execution (AQE). Some query reshaping happens after the first stage materialises — the planner re-decides whether to broadcast or shuffle a join based on actual row counts.
  • Connector cadence. Trino is the first engine to land most new connector features (Iceberg writes, Delta UniForm, Hudi read-on-write, materialised views over external catalogs).

PrestoDB-specific architecture.

  • RaptorX (hierarchical caching). Built-in caching layer with file-list cache, file-handle cache, file fragment cache, and metastore cache. Strong for repeated scans against the same Hive/Iceberg partition set.
  • Presto on Spark. A deployment mode where Presto's planner emits a query that runs as a Spark application — inheriting Spark's task-level retry and elasticity. Best for very long ETL queries.
  • Native execution (Velox / Prestissimo). The C++ vectorised execution engine, originally a Meta project, that PrestoDB is gradually adopting as its worker runtime for big perf wins on CPU-bound workloads.

Athena-specific architecture.

  • Serverless coordinator + workers. There is no cluster — AWS manages the pool of coordinators and workers and routes the query to whichever capacity is available. The user sees only a workgroup, a database catalog, and a per-query bytes-scanned charge.
  • Per-query pricing. $5 per terabyte of compressed data scanned (varies by region) — that single metric replaces "how big is my cluster?"
  • Workgroups. Logical buckets for billing, query limits, and engine version pinning. A v2 workgroup runs PrestoDB; a v3 workgroup runs Trino.
  • Athena Federation. A Lambda-backed extension that adds connectors AWS does not natively support — implemented as user-deployed Lambda functions the engine calls. Slower than native connectors, but extends the federation surface.

The split — open-source vs managed vs SaaS.

Tier Example Who runs the cluster Who picks the version
Open-source self-host Trino on EKS you you
Open-source self-host PrestoDB on EMR you you
Managed serverless AWS Athena AWS AWS (you pick v2 or v3)
Commercial managed Starburst Galaxy Starburst Starburst
Commercial self-host Starburst Enterprise you you (Starburst rebuilds)

Worked example — sizing a Trino cluster on EKS

Detailed explanation. A common interview probe is "how would you size a Trino cluster for an analytics team of 50 engineers running 200 queries / hour?" The senior answer is concurrency × per-query memory, plus headroom — not "throw more CPUs at it." Knowing the back-of-envelope numbers tells the interviewer you have actually run the engine.

Question. Size a Trino cluster on EKS for 50 engineers, ~200 queries/hour peak, average query touches 50 GB of Parquet (after partition pruning), p95 query duration target of 15 seconds.

Input — workload profile.

Metric Value
Concurrent queries (p95) 8
Average data scanned / query 50 GB
Average query memory 12 GB
Coordinator 1
Required p95 latency 15 s

Code — Helm values sketch.

# values.yaml for trino-helm-chart (simplified)
coordinator:
  resources:
    requests:
      cpu: 8
      memory: 32Gi
    limits:
      cpu: 16
      memory: 64Gi

worker:
  replicas: 12
  resources:
    requests:
      cpu: 16
      memory: 64Gi
    limits:
      cpu: 32
      memory: 128Gi

server:
  config:
    query.max-memory-per-node: 50GB
    query.max-memory: 600GB
    query.max-total-memory-per-node: 60GB
    discovery-server.enabled: true
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Coordinator: single node, sized for plan compilation and task dispatch — 8 CPU / 32 GB is plenty for 200 queries/hour as long as the planner is not the bottleneck.
  2. Workers: 12 nodes at 16 vCPU / 64 GB each = 192 vCPU and 768 GB across the cluster. With p95 concurrency of 8 and average memory 12 GB, that gives ~12 GB worker memory headroom per query.
  3. query.max-memory-per-node and query.max-memory cap individual queries — set them so a runaway query cannot OOM the whole cluster.
  4. For p95 latency of 15 s on 50 GB scans, the math is "50 GB / (12 nodes × ~1 GB/s S3 read per worker)" ≈ 4 s of pure scan time. The remaining 11 s budget covers planning, join shuffles, and aggregate.
  5. Add 30% headroom for partition skew and unexpected load. The right answer is "round up to 15 workers" if the SLO is tight.

Output.

Sizing variable Value
Workers 12 (× 16 vCPU / 64 GB)
Coordinator 1 (× 8 vCPU / 32 GB)
Concurrent query slots ~8
Per-query memory ceiling 50 GB
Cluster memory ceiling 600 GB

Rule of thumb. Size workers by concurrent_queries × per_query_memory + 30% headroom. Size the coordinator for plan rate, not data — even 1000 queries/hour rarely needs more than 16 vCPU on the coordinator. Always cap query.max-memory so one bad query cannot blast the cluster.

Worked example — Athena workgroup vs self-hosted Trino — same SQL, different ops

Detailed explanation. A second favourite interview probe: "if you can run the same SQL on Athena and on self-hosted Trino, why pick one over the other?" The answer is operational responsibility plus cost shape, not "performance." Same query, same data, very different bill and incident pager.

Question. A team runs a daily ETL that scans 500 GB / day across 30 minutes. Compare the operational and cost profile on Athena vs self-hosted Trino on EKS.

Input — daily workload.

Metric Value
Daily scan 500 GB
Daily queries 200
Average wall-clock / query 30 s
Peak concurrent queries 6

Code — same SQL, two deployments.

-- Athena (engine v3 / Trino) workgroup query
-- Runs in the AWS service; you see only the bytes-scanned charge
SELECT day, plan, COUNT(*) AS events
FROM iceberg.lake.events
WHERE day = CURRENT_DATE
GROUP BY day, plan;

-- Self-hosted Trino (Trino on EKS) — same query, you run the JVMs
SELECT day, plan, COUNT(*) AS events
FROM iceberg.lake.events
WHERE day = CURRENT_DATE
GROUP BY day, plan;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. Athena cost: 500 GB × $5/TB = ~$2.50 per day in scan. Zero cluster cost. The bill scales with scan volume.
  2. Self-hosted Trino cost: 6 workers × m5.4xlarge × 24 h = ~$70 / day on EC2 (with reservations / Spot, less). The cluster runs whether or not it is busy.
  3. Cost crossover: at this load, Athena is dramatically cheaper. The crossover happens around steady-state utilisation > 30% of a self-hosted cluster — then the fixed cost amortises.
  4. Operational profile: Athena hands you a "queries failed" CloudWatch metric and zero cluster pager. Self-hosted Trino hands you JVM tuning, pod restarts, version upgrades, and connector configuration.
  5. The right answer is "Athena for spiky / low-volume; self-host Trino when utilisation is steady and the team can pay the ops tax for the savings."

Output.

Approach Daily cost Ops effort Best fit
Athena v3 ~$2.50 low spiky, low scan
Self-host Trino ~$70 fixed medium steady, high scan
Starburst Galaxy ~$X (variable) low when you want Trino with support

Rule of thumb. Compute "Athena equivalent cost" = total daily bytes scanned × $5 / TB. Compare against your cluster fixed cost. If the cluster cost is less than half the Athena cost, self-hosting earns its keep; if the cluster cost is more than the Athena cost, you are paying for capacity you do not need.

SQL interview question on engine architecture

A senior interviewer might frame this as: "Walk me through what happens from SELECT * FROM iceberg.lake.events WHERE day = '2026-06-10' AND plan = 'gold' arriving at the coordinator, to rows returning to my client. Be specific about coordinator vs worker vs connector responsibilities."

Solution Using a stage-by-stage trace

-- The query under trace
SELECT *
FROM iceberg.lake.events
WHERE day = DATE '2026-06-10'
  AND plan = 'gold';

-- Inspect what the engine actually does:
EXPLAIN (TYPE DISTRIBUTED)
SELECT *
FROM iceberg.lake.events
WHERE day = DATE '2026-06-10'
  AND plan = 'gold';
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Stage Component What happens
1. Parse Coordinator SQL string → AST → logical plan
2. Plan Coordinator Logical plan → optimised distributed plan
3. Push down Iceberg connector day = ... and plan = ... translated into Iceberg partition + column filters; connector returns only matching data files
4. Split Coordinator Generate one split per data file (or per row group)
5. Schedule Coordinator Assign splits to workers based on data locality / fair share
6. Scan Workers + connector Read Parquet row groups from S3; apply residual predicates per row
7. Exchange Workers (For non-trivial queries) shuffle rows between stages
8. Return Coordinator Stream rows back to client over JDBC / HTTP

The trace highlights where each component owns work. The Iceberg connector eliminates entire partitions and entire files via metadata before any worker reads a byte. Workers do the I/O-heavy scan; the coordinator never reads data, it just plans and dispatches.

Output:

Component Owns
Coordinator Parse, plan, schedule, return
Workers Scan, exchange, aggregate
Connector Metadata, splits, pushdown, read/write

Why this works — concept by concept:

  • Pushdown happens at the connector — the engine asks the connector "what can you do with these predicates?" and the Iceberg connector translates day = ... into a partition filter, skipping entire S3 prefixes.
  • Splits are the unit of parallelism — each data file (or row group) becomes one split, dispatched to one worker. More splits = more parallelism, up to the worker count.
  • Workers are stateless and short-lived — without TFE, a worker death kills the query. With TFE, intermediate data persists in the exchange and the failed split retries.
  • Coordinator never reads data — its job ends at "the rows have been streamed back to the client." This is why coordinator sizing is about CPU for planning, not memory for scans.
  • Cost — Parse + plan: O(query size). Scan: O(rows after pushdown). Exchange: O(rows × join fan-out). The "magic" of pushdown reduces the dominant term to "only the partitions you asked for."

SQL
Topic — joins
JOIN problems for federated SQL

Practice →


4. Connector ecosystem & federation patterns

Connectors are how a federated SQL engine reaches the lakehouse — and how cross-source joins actually work

The mental model in one line: a connector is a Java module that implements the engine's SPI for one external system — Hive, Iceberg, Delta, Hudi, MySQL, Postgres, Kafka, Elasticsearch, MongoDB — exposing it as a SQL catalog that you can SELECT from and (sometimes) INSERT into. Once you internalise that "every external system is just another catalog," the entire query federation interview surface collapses to "which connectors exist, which support pushdown, and how do cross-source joins behave?"

Hub-and-spoke connector diagram with a central Trino coordinator orb connected by glowing federation lines to eight connector cards in a ring — Hive, Iceberg, Delta, Hudi, Postgres, MySQL, Kafka, Elasticsearch — each with a pushdown badge indicating predicate support, on a light PipeCode card.

The four connector families.

  • Lakehouse / table-format connectors. Hive, Iceberg, Delta, Hudi. These read columnar files (Parquet, ORC) on object storage with table-format metadata for partitioning, schema evolution, and snapshot isolation. The lakehouse default.
  • JDBC connectors. Postgres, MySQL, SQL Server, Oracle, Redshift, Snowflake. These wrap the source database's own SQL engine over a JDBC connection. Predicate pushdown converts WHERE clauses into native SQL the source database executes.
  • NoSQL / specialty connectors. Kafka, Elasticsearch, MongoDB, Redis, Cassandra. Each translates SQL semantics onto a non-relational API as best it can; pushdown is partial.
  • Cloud-native / system connectors. system (engine introspection), jmx, tpch / tpcds (generated test data), Iceberg REST catalog, Glue catalog, Unity catalog.

The lakehouse default — Hive vs Iceberg.

  • Hive connector. Reads Hive-managed warehouses: external tables backed by a Hive metastore (HMS) or AWS Glue catalog, with partition discovery in the catalog. Mature, ubiquitous, but lacks atomic writes, schema evolution beyond renames, and time-travel.
  • Iceberg connector. Reads Iceberg tables — same Parquet files on S3, but with a manifest-based metadata layer that gives atomic writes, full schema evolution, hidden partitioning, time-travel, and table maintenance commands (OPTIMIZE, EXPIRE_SNAPSHOTS).
  • Delta connector. Reads Delta Lake tables — Databricks' table format with a transaction log (_delta_log) on S3. Conceptually similar to Iceberg; lock-in to Databricks pipelines is the differentiator.
  • Hudi connector. Reads Apache Hudi tables — built for streaming ingest patterns (copy-on-write vs merge-on-read storage modes).

Pushdown — the most important property.

  • Predicate pushdown. The engine translates WHERE clause filters into something the connector can apply at the source. Hive/Iceberg: partition + column statistics filtering. JDBC: native WHERE in the remote SQL. Kafka: usually no pushdown (offsets, not columns). Elasticsearch: partial (term filters).
  • Projection pushdown. Only the requested columns are read from columnar storage. Critical for wide tables on Iceberg / Delta / Hudi.
  • Aggregate pushdown. Some connectors push SUM / COUNT / MAX down to the source. The Iceberg connector pushes count and min/max from statistics; JDBC connectors push the aggregate as native SQL.
  • Limit pushdown. A LIMIT 100 at the engine becomes a LIMIT 100 at the source — invaluable for "find me one row" queries on huge tables.

Federated join behaviour — the gotcha.

  • Pushdown stops at the join. Each side of a cross-source join is read separately; the rows are shipped to the engine workers and joined there. A 1B-row Iceberg table joined to a 10B-row MySQL table is not a 10B-row MySQL scan — it is a full read of both tables across the network.
  • Always filter before the join. Push the most-selective predicate as deep into each side as possible. If you can pre-aggregate a side in the source (with a CTE that the connector compiles to native SQL), do it.
  • Beware of large JDBC pulls. A SELECT * FROM jdbc.huge_table will saturate the source database's network. Push a WHERE and a column list every time.

Common interview probes on connectors.

  • "Which connectors support predicate pushdown?" — the lakehouse and JDBC connectors do, fully. Kafka and Elasticsearch do partially. Mongo and Cassandra are partial.
  • "How do I join across connectors in one SQL statement?" — SELECT ... FROM cat1.schema1.t1 JOIN cat2.schema2.t2 ON ... — the engine plans and shuffles between the two source reads.
  • "What is the difference between the Hive and Iceberg connectors?" — Hive reads partitioned Parquet/ORC with metastore catalogs; Iceberg adds atomic writes, schema evolution, snapshot isolation, and time-travel via manifest metadata.
  • "Why does the Kafka connector seem slow?" — Kafka is not a queryable table; the connector scans topics by offset, with limited pushdown. Use it for low-volume ad-hoc joins, not large scans.

Worked example — federated join: Iceberg events × Postgres customers

Detailed explanation. The canonical federation example: a customer profile lives in a Postgres OLTP database; the event stream lives as Iceberg tables in S3. A single Trino query joins them — the engine reads each side with its own connector and joins in the worker layer. The gotcha is which side ships across the network and how to keep the join fast.

Question. Write a Trino query that joins Iceberg events to Postgres customers to produce per-plan event counts for the last 7 days. Show the EXPLAIN output and explain why the join uses dynamic filtering.

Input — events (Iceberg).

event_id customer_id event_ts
1 100 2026-06-10 09:01
2 101 2026-06-10 09:02
3 100 2026-06-11 12:30

Input — customers (Postgres).

customer_id plan
100 gold
101 silver
102 gold

Code.

-- Federated join in one SQL statement
SELECT
    c.plan,
    COUNT(*) AS events_7d
FROM postgres.public.customers c
JOIN iceberg.lake.events     e
       ON e.customer_id = c.customer_id
WHERE e.event_ts >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
  AND c.is_active = true
GROUP BY c.plan;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The coordinator plans the query and pushes the is_active = true filter into the Postgres connector (compiled as native Postgres SQL WHERE is_active = true). Postgres returns only active customers.
  2. The coordinator pushes the event_ts >= ... filter into the Iceberg connector (compiled as a partition + column filter on the event table). Iceberg returns only files within the last 7 days.
  3. Dynamic filtering kicks in: once the Postgres side has materialised (small build side), the coordinator broadcasts the customer_id set to the Iceberg scan as a runtime filter. The Iceberg scan now reads only matching files / row groups.
  4. The two sides are shuffled by customer_id and joined in worker memory.
  5. The GROUP BY aggregates the joined rows per plan.

Output.

plan events_7d
gold 2
silver 1

Rule of thumb. Federated joins live or die on dynamic filtering plus selective predicates on both sides. Always filter both sides as deeply as you can before the join; let dynamic filtering on the larger side ride on the smaller side's predicate set. If neither side filters meaningfully, the engine ships the world across the network and you wait.

Worked example — predicate pushdown check with EXPLAIN

Detailed explanation. A senior interviewer often asks "how would you verify pushdown actually happened?" The answer is EXPLAIN — every federated SQL engine exposes a plan command that shows whether a predicate landed at the connector or at the engine. The senior signal is reading the plan output and naming the operators.

Question. Given a JDBC connector over Postgres, write an EXPLAIN and identify the pushed-down predicate in the plan output. Why does this matter?

Input — Postgres connector configured as pg catalog.

customer_id created_at plan
100 2026-01-15 gold
101 2026-03-01 silver

Code.

-- Inspect the plan
EXPLAIN (TYPE DISTRIBUTED, FORMAT TEXT)
SELECT customer_id, plan
FROM pg.public.customers
WHERE created_at >= DATE '2026-06-01'
  AND plan IN ('gold', 'silver');

-- Sample (abbreviated) output:
-- - Output[customerId, plan]
--   - RemoteSource[1]
-- Fragment 1
--   - TableScan
--       table = pg:public.customers
--       columns = [customer_id, plan]
--       predicate pushdown = (created_at >= DATE '2026-06-01' AND plan IN ('gold','silver'))
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The EXPLAIN output shows a TableScan node with a predicate pushdown line listing the predicates the connector accepted. Any predicate listed there is being applied by Postgres, not by the engine.
  2. If a predicate is not listed, the engine applies it after reading every row — a sign you should rewrite the predicate to be pushdown-friendly (use simple comparisons, avoid casts on the column side).
  3. The columns = [...] line shows projection pushdown — only the requested columns are read.
  4. For Iceberg, the equivalent plan shows splits = N and a dynamic filter ID reference once dynamic filtering attaches.
  5. The senior interview habit: run EXPLAIN before merging any production query against a federated source.

Output.

Plan element What it tells you
predicate pushdown = ... which WHERE went to the source
columns = [...] projection pushdown
splits = N parallelism on lakehouse scans
dynamic filter join-build-side filter on probe scan

Rule of thumb. EXPLAIN is the source-of-truth for pushdown. Run it on every cross-source query; if a WHERE clause did not push down, refactor the SQL (cast on the literal side, avoid wrapping the column in a function) until it does. The runtime difference is often 10–100x.

Worked example — cross-source join gotcha: shipping rows over the network

Detailed explanation. A frequent senior interview question: "why is my cross-source join slow even though both tables have indexes?" The honest answer is because the engine reads both sides over the network into worker memory; the source indexes only help with the initial filter, not with the join itself. Naming the cost gives the senior signal.

Question. A team complains a Trino query joining 50M Postgres rows to 5B Iceberg rows takes 40 minutes. Diagnose the bottleneck and propose two fixes.

Input — symptom.

Side Rows Storage
customers (Postgres) 50M OLTP table
events (Iceberg) 5B S3 Parquet

Code — the slow query.

-- Slow: 40-minute runtime
SELECT c.plan, COUNT(*) AS events
FROM pg.public.customers c
JOIN iceberg.lake.events  e ON e.customer_id = c.customer_id
WHERE e.event_ts >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY c.plan;

-- Diagnosis: the Postgres side has no selective filter,
-- so 50M rows ship over JDBC. The Iceberg side reads the
-- 7-day window (~50M rows) and joins. The join itself is fast;
-- the JDBC pull is the bottleneck.

-- Fix 1: filter the Postgres side more aggressively
SELECT c.plan, COUNT(*) AS events
FROM pg.public.customers c
JOIN iceberg.lake.events  e ON e.customer_id = c.customer_id
WHERE e.event_ts >= CURRENT_DATE - INTERVAL '7' DAY
  AND c.is_active = true
  AND c.created_at >= CURRENT_DATE - INTERVAL '180' DAY
GROUP BY c.plan;

-- Fix 2: pre-aggregate the Iceberg side, then look up customers
WITH event_counts AS (
    SELECT customer_id, COUNT(*) AS event_count
    FROM iceberg.lake.events
    WHERE event_ts >= CURRENT_DATE - INTERVAL '7' DAY
    GROUP BY customer_id
)
SELECT c.plan, SUM(ec.event_count) AS events
FROM pg.public.customers c
JOIN event_counts ec ON ec.customer_id = c.customer_id
GROUP BY c.plan;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The slow query asks the engine to ship 50M Postgres rows (the whole customers table) across JDBC into worker memory. The JDBC pull is single-stream-per-connection and slow on wide tables.
  2. Fix 1 filters the Postgres side down to active customers in the last 180 days. The pushed-down WHERE reduces the JDBC pull from 50M to perhaps 5M rows — 10x faster fetch.
  3. Fix 2 pre-aggregates the Iceberg side first (5B rows → maybe 1M customer-event-count rows), then joins to the smaller pre-aggregate. The Iceberg scan is fast (parallel S3 reads); the join is now tiny.
  4. The pattern "filter both sides, prefer pre-aggregating the lakehouse side, keep JDBC pulls small" is the federation playbook.

Output.

Strategy Time
Original 40 min
Fix 1 (filter Postgres) 6 min
Fix 2 (pre-aggregate Iceberg) 3 min

Rule of thumb. In a cross-source join, the slow side is whichever side has the smallest selective predicate. JDBC connectors stream sequentially over a single (or pooled) socket — a 50M-row JDBC pull will dominate the runtime even on a fast cluster. Iceberg / Hive scans are parallel — Spark them harder is cheap; the JDBC pull is the bottleneck to hunt.

SQL interview question on connector ecosystem

A senior interviewer often asks: "Design a single SQL query that produces a daily customer activity report by joining Postgres customers, Iceberg events, and MySQL subscription_revenue. How would you tune it for federation?"

Solution Using three-connector federation with selective pushdown

-- Federated SELECT across three connectors
WITH
events_today AS (
    SELECT customer_id, COUNT(*) AS event_count
    FROM iceberg.lake.events
    WHERE event_ts >= CURRENT_DATE
      AND event_ts <  CURRENT_DATE + INTERVAL '1' DAY
    GROUP BY customer_id
),
revenue_today AS (
    SELECT customer_id, SUM(amount) AS revenue
    FROM mysql.billing.subscription_revenue
    WHERE charge_date = CURRENT_DATE
    GROUP BY customer_id
)
SELECT
    c.customer_id,
    c.plan,
    COALESCE(e.event_count, 0) AS events,
    COALESCE(r.revenue,     0) AS revenue
FROM postgres.crm.customers c
LEFT JOIN events_today  e ON e.customer_id = c.customer_id
LEFT JOIN revenue_today r ON r.customer_id = c.customer_id
WHERE c.is_active = true;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step Connector Action
1 Iceberg partition-filter on event_ts, GROUP BY pushed partially
2 MySQL JDBC WHERE charge_date = CURRENT_DATE, SUM pushed as native MySQL
3 Postgres JDBC WHERE is_active = true pushed as native Postgres
4 Engine hash join (Postgres ⨝ events_today) and (Postgres ⨝ revenue_today) on customer_id
5 Engine final SELECT with COALESCE — returns one row per active customer

The trace shows three CTEs that each pre-aggregate at the source. The final join is between three small, pre-aggregated streams — not three full table scans.

Output:

customer_id plan events revenue
100 gold 12 49.99
101 silver 3 9.99
102 gold 0 49.99

Why this works — concept by concept:

  • Predicate pushdown reduces each side first — each CTE filters at the connector, so each side ships only the rows that pass the WHERE.
  • Aggregate pushdown shrinks the JDBC pull — the MySQL SUM(amount) runs natively on MySQL; only one row per customer crosses the wire.
  • LEFT JOIN preserves customers with zero activity — Postgres customers without matching events or revenue still appear in the report with COALESCE(0).
  • Final aggregate is in the engine — once the three sources are pre-aggregated, the federated join is cheap and runs in memory on the workers.
  • Cost — Postgres: O(active customers). MySQL: O(today's chargers). Iceberg: O(today's events) parallel S3 reads. Join: O(active customers × max fan-out per side).

SQL
Topic — aggregation
Aggregation problems (SQL)

Practice →


5. Cost, performance & when to pick which

Engine choice is a utilisation question, not a benchmark question — Athena, Trino, and Starburst each win at a different cluster utilisation band

The mental model in one line: Athena charges $5 per TB scanned with no cluster cost; self-hosted Trino charges fixed EC2 + ops cost regardless of scan volume; Starburst charges a commercial license on top of either — so the right engine is whichever one's cost curve matches your workload's utilisation profile. Once you can draw that crossover by hand, the entire athena vs presto and distributed sql engine cost interview collapses into a single arithmetic question.

Two-panel diagram — left panel a cost-vs-utilization curve showing Athena flat per-query line crossing Trino self-hosted fixed-cluster line at ~30% utilization; right panel a decision tree with three branches (ad-hoc → Athena, steady → Trino, enterprise → Starburst), on a light PipeCode card.

The Athena pricing model in detail.

  • Per-query scan charge. ~$5 per terabyte of compressed data scanned (US-East rates; varies by region). Rounded up to 10 MB per query.
  • No cluster cost. Zero capacity charge. The bill is purely a function of data scanned.
  • DDL is free. CREATE TABLE, ALTER TABLE, schema discovery on Glue — no scan charges.
  • Optimisation levers. Compression (Parquet + Snappy / ZSTD), partition pruning (eliminate scanned partitions), columnar projection (avoid SELECT *), file size (256 MB–1 GB Parquet files for the right scan parallelism).

Self-hosted Trino / PrestoDB pricing model.

  • EC2 / EKS compute. Coordinator + workers running 24/7 (or scaled with autoscaling). Cost is roughly workers × instance hourly × utilization_hours.
  • Object storage egress. Reading S3 from EC2 in the same region is free; cross-region or cross-cloud incurs egress.
  • Ops cost. Engineering time for upgrades, alerting, JVM tuning, connector configuration. Often the dominant "real" cost for small teams.
  • Optimisation levers. Right-sizing the cluster, Spot / Graviton instances, autoscaling, result caching plugins, fault-tolerant execution to avoid restart cost on long queries.

Starburst (commercial) pricing model.

  • License fee. Per-vCPU or per-worker subscription on top of self-hosted Trino. Galaxy SaaS adds a per-cluster-hour or per-query fee.
  • Value-adds. Warp Speed caching, query result cache, role-based access control, lineage, materialised views, support contract.
  • Right fit. Enterprises with governance / compliance requirements that open-source Trino does not meet, and the engineering budget to justify the license.

The performance comparison in detail.

  • Athena cold start. The first query in a workgroup pays a small warmup tax (sub-second). Subsequent queries benefit from query plan cache and scaling capacity.
  • Trino warm cluster. A pre-warmed cluster runs identical queries faster than Athena because no scheduling latency exists between client and worker. Caching plugins (RaptorX in PrestoDB; Starburst Warp Speed) close the data-locality gap.
  • Concurrency profile. Athena scales horizontally for free; Trino's concurrency is capped by cluster size. For 50+ concurrent users, Athena often wins on absolute throughput.
  • Long-running ETL. Trino + TFE outperforms Athena for queries that run > 10 minutes because Athena imposes a 30-minute query timeout (raised, but still bounded).

Workload fit matrix.

Workload Athena Trino self-host Starburst
Ad-hoc exploration (5 qps) excellent overkill on cost overkill on cost
Scheduled BI (200 qps) depends on scan excellent excellent
Interactive BI (50 concurrent) excellent needs sizing excellent
Long ETL (30+ min) tight (timeout) excellent (TFE) excellent
ML feature gen excellent on scan excellent on cost if utilised excellent
Multi-source federation excellent (v3) excellent excellent

Decision framework.

  • Utilisation < 20%. Athena. The per-query model dominates anything you'd self-host.
  • Utilisation 20–50%. Both viable. Athena for ops simplicity; Trino for steady cost.
  • Utilisation > 50%. Self-host Trino (or Starburst Enterprise). The fixed cost amortises and Athena's scan charges add up.
  • Governance / compliance hard requirement. Starburst (Enterprise or Galaxy).
  • AWS-only stack, AWS-anchored team. Athena (v3) unless ETL exceeds 30 minutes.

Cost optimisation tactics — universal.

  • Partition pruning. Filter on partition columns (date, region) so the connector skips entire directories. The single biggest scan-reduction lever.
  • Columnar formats. Parquet + ZSTD or Snappy. Wide tables with SELECT col1, col2 should scan only those columns.
  • File sizing. 256 MB to 1 GB Parquet files. Too small wastes file-list time; too big serialises scans.
  • Result caching. Athena query result cache (per query string); Starburst result cache; Trino + Hazelcast plugins.
  • EXPLAIN before you SELECT. Every cost optimisation starts with reading the plan.

Worked example — Athena cost reduction by partition pruning

Detailed explanation. A team queries an Iceberg events table on Athena and the bill is $40/day. The query selects event_ts filters across the whole table. The fix is partition pruning — Iceberg supports partition transforms (days(event_ts)) that let the query skip entire days of data.

Question. Given a 30 TB Iceberg events table partitioned by days(event_ts), a query filters the last 7 days. What is the scan reduction and the cost reduction?

Input — table profile.

Field Value
Total table size (Parquet, ZSTD) 30 TB
Partition strategy days(event_ts)
Days retained 365
Query filter last 7 days

Code.

-- Without partition pruning awareness — accidentally scans all 30 TB
SELECT plan, COUNT(*) AS events
FROM iceberg.lake.events
WHERE CAST(event_ts AS DATE) >= CURRENT_DATE - INTERVAL '7' DAY
GROUP BY plan;
-- The CAST on the left side defeats partition pruning;
-- the connector cannot translate the predicate into a partition filter.

-- With partition pruning — scans only 7 days of data
SELECT plan, COUNT(*) AS events
FROM iceberg.lake.events
WHERE event_ts >= CURRENT_TIMESTAMP - INTERVAL '7' DAY
GROUP BY plan;
-- The predicate is on the raw column; Iceberg's hidden partitioning
-- transforms `event_ts` → day partitions and skips the other 358 days.
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. The buggy query wraps event_ts in CAST(... AS DATE). The Iceberg connector cannot translate that into a partition filter because the transform on the column side is opaque.
  2. The Iceberg connector falls back to a full scan, reading all 30 TB to apply the residual predicate at row level.
  3. The fixed query keeps event_ts on the left of the predicate without a function wrap. The connector matches the partition transform days(event_ts) and prunes 358 days.
  4. Scan reduction: 30 TB × (7 / 365) ≈ 0.575 TB. On Athena: 30 TB × $5 = $150 per query vs 0.575 TB × $5 = $2.88 per query — 52x cost reduction.
  5. The 30 ms code change buys $147 per query. On 200 queries / day, that is $29,400 per day saved.

Output.

Query form Scan Athena cost
CAST(event_ts AS DATE) >= ... 30 TB $150
event_ts >= ... 0.575 TB $2.88

Rule of thumb. Never wrap a partition column in a function on the left side of a predicate. Always cast the literal instead. The partition pruning savings are typically 10–100x, and they are deterministic — no engine tuning required.

Worked example — choosing Athena vs Trino by utilisation

Detailed explanation. A common interview probe: "given a workload, how do you choose between Athena and self-hosted Trino?" The senior answer computes both bills and reports the crossover utilisation — not "Trino is faster" or "Athena is cheaper." Knowing the math makes you defensible at the architecture review.

Question. A team scans 200 TB / month across 1500 queries. Compare Athena vs a 10-worker Trino cluster on EKS (m6i.4xlarge, $0.77/hr each). At what monthly scan volume does Trino become cheaper?

Input — cost components.

Cost component Athena Trino self-host
Per-TB scan $5 $0
Worker hourly $0 $0.77 × 10
Coordinator hourly $0 $0.50
Hours / month n/a 24 × 30 = 720

Code — the crossover calculation.

# Athena monthly cost
athena = 200 * 5  # = $1000

# Trino monthly cost (fixed)
trino_worker_cost = 10 * 0.77 * 720  # = $5544
trino_coord_cost  = 0.50 * 720       # = $360
trino_ops_cost    = 1500             # ~$1500/mo for upgrades, alerting, etc.
trino_total       = trino_worker_cost + trino_coord_cost + trino_ops_cost
# = $7404 / month

# Crossover: solve  scan_tb * 5 = 7404
crossover_tb = 7404 / 5  # ≈ 1481 TB / month
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. At 200 TB / month, Athena costs $1000 and Trino costs ~$7,400 — Athena wins decisively.
  2. The crossover happens around 1500 TB / month — at that scan volume, Athena's per-query bill matches the fixed cluster bill.
  3. Real-world considerations push the crossover lower in practice: Spot / Graviton instances cut the Trino bill by 30–50%; reserved capacity discounts another 20–40%; result caching pushes effective utilisation up.
  4. The same math with Spot pricing and reservations puts the realistic crossover around 600–800 TB / month for an utilisation-conscious team.
  5. The senior signal is to quote the crossover number, then say "at our actual 200 TB / month, Athena is dramatically cheaper — let's revisit when we hit 500 TB / month."

Output.

Monthly scan Athena Trino Winner
200 TB $1000 $7400 Athena
500 TB $2500 $7400 Athena
1500 TB $7500 $7400 Tie
3000 TB $15000 $7400 Trino

Rule of thumb. Compute Athena cost as monthly_scan_tb × $5. Compute Trino cost as cluster_fixed_monthly + ops_overhead. The crossover is cluster_fixed_monthly / 5 TB. Below that, Athena wins; above that, Trino wins — if you can keep utilisation steady.

Worked example — Trino fault-tolerant execution for a 90-minute ETL

Detailed explanation. A common interview probe: "you have a 90-minute ETL on Trino that fails 1 in 4 runs due to spot interruption. What changes?" The answer is fault-tolerant execution (TFE) with a shared exchange — Trino retries failed tasks against new workers and recovers from spot interruption without losing the whole query.

Question. Configure Trino TFE for a daily ETL that builds an Iceberg snapshot from 3 TB of upstream events, with spot worker churn.

Input — workload.

Property Value
Input 3 TB Iceberg events
Output Iceberg snapshot
Wall-clock target < 90 min
Worker type Spot
Expected spot churn ~25% of workers / hour

Code.

-- Enable fault-tolerant execution for this session
SET SESSION retry_policy = 'TASK';
SET SESSION exchange_compression = true;

-- Run the ETL: failed tasks retry on other workers
INSERT INTO iceberg.lake.daily_aggregates
SELECT
    DATE(event_ts) AS day,
    plan,
    COUNT(*)       AS events,
    SUM(amount)    AS revenue
FROM iceberg.lake.events
WHERE event_ts >= CURRENT_DATE - INTERVAL '1' DAY
  AND event_ts <  CURRENT_DATE
GROUP BY DATE(event_ts), plan;
Enter fullscreen mode Exit fullscreen mode

Step-by-step explanation.

  1. SET SESSION retry_policy = 'TASK' activates TFE for the query — intermediate task output is spilled to a configured exchange (S3 bucket or HDFS path) instead of held in worker memory.
  2. When a Spot worker is interrupted mid-query, the coordinator detects the failure and re-dispatches the failed task to another worker, which reads its inputs from the exchange and re-runs.
  3. The query continues. The wall-clock impact is bounded — each failed task costs only its own re-run, not the whole query.
  4. The exchange S3 / HDFS path adds writeback latency (typically 5–15% overhead) but eliminates the "spot interruption kills the 89-minute job" failure mode.
  5. For interactive queries, TFE is overkill — leave it off. For ETL > 10 minutes, especially on Spot, turn it on.

Output.

Setting Without TFE With TFE
Spot interruption survival no yes
Wall-clock overhead 0 5–15%
Cost of interruption full re-run failed task only

Rule of thumb. Turn TFE on for any query expected to run more than 10 minutes, especially on Spot workers. The 5–15% overhead is a cheap insurance premium against losing 80+ minutes of work to a single Spot interruption.

SQL interview question on engine selection and cost tuning

A senior interviewer often asks: "Walk me through a cost optimisation for an Athena workload that suddenly doubled in spend last month. What do you look at first?"

Solution Using a layered cost audit

-- 1) Find the heaviest queries by bytes scanned
SELECT
    query_string,
    bytes_scanned,
    ROUND(bytes_scanned / 1024.0 / 1024 / 1024 / 1024 * 5, 2) AS est_cost_usd,
    execution_time_ms
FROM information_schema.queries
WHERE date(start_time) >= DATE '2026-06-01'
ORDER BY bytes_scanned DESC
LIMIT 20;

-- 2) Find queries missing partition pruning (full-table scans)
SELECT
    query_id,
    query_string,
    table_name,
    bytes_scanned
FROM query_audit
WHERE bytes_scanned > 1024 * 1024 * 1024 * 100  -- > 100 GB
  AND query_string NOT LIKE '%CURRENT_DATE%'
  AND query_string NOT LIKE '%WHERE % = DATE %';

-- 3) Find tables that should be re-partitioned or compacted
SELECT
    table_name,
    AVG(bytes_scanned)         AS avg_scan,
    COUNT(*)                   AS query_count,
    SUM(bytes_scanned) / 1024.0 / 1024 / 1024 / 1024 * 5 AS month_cost_usd
FROM query_audit
WHERE date(start_time) >= DATE '2026-06-01'
GROUP BY table_name
ORDER BY month_cost_usd DESC
LIMIT 10;
Enter fullscreen mode Exit fullscreen mode

Step-by-step trace.

Step What you check What you find
1 top queries by scan events table dominates 60% of bill
2 queries missing pruning dashboard query has CAST(event_ts AS DATE) — pruning broken
3 tables by aggregate scan events is the big spender — compact / re-partition
4 fix and re-run scan drops 30x; bill drops accordingly

The trace highlights the audit pattern: top queries → broken pruning → table-level review. Most cost regressions in Athena come from a small number of unbounded queries or partition-pruning misses.

Output:

Action Effect
Fix CAST(...) predicate 30x scan reduction on dashboard
Compact small files 1.5x speedup, modest cost reduction
Add result cache repeat-query cost → $0
Move ETL to Trino self-host when ETL is daily and 500 GB+

Why this works — concept by concept:

  • Bytes scanned is the bill — every Athena cost optimisation maps to "reduce bytes scanned per query." Reading the audit table by scan volume surfaces the top offenders directly.
  • Partition pruning is the biggest lever — a single missed pruning rewrite is often 10–50x scan inflation. Hunt these first.
  • Result cache turns repeat queries free — Athena caches by query string; identical dashboards re-running every minute pay one query, not 60.
  • Self-host crossover — once a workload is steady and high-volume, the audit will surface a candidate to move off Athena to self-hosted Trino — but never before the utilisation justifies it.
  • Cost — the audit itself is cheap (system tables, no scan charges). The fixes are deployed as code review nudges + table maintenance jobs.

SQL
Topic — optimization
Query optimization problems (SQL)

Practice →


Cheat sheet — Trino vs Presto vs Athena recipes

  • Federated join across two sources. SELECT ... FROM cat1.s1.t1 JOIN cat2.s2.t2 ON ... — the engine fans out reads through both connectors, joins in the worker layer, and returns rows. Always filter both sides selectively before the join to avoid shipping the world.
  • Predicate pushdown check. Run EXPLAIN (TYPE DISTRIBUTED) SELECT .... Look for predicate pushdown = (...) on each TableScan node. Any predicate not listed there is being applied by the engine, not the source.
  • Athena partition projection. TBLPROPERTIES ('projection.enabled' = 'true', 'projection.day.type' = 'date', ...) — skips Glue catalog lookups for partitions, dramatically faster on time-partitioned tables.
  • Iceberg time-travel. SELECT * FROM iceberg.lake.events FOR VERSION AS OF 12345 or FOR TIMESTAMP AS OF TIMESTAMP '2026-06-01 00:00' — read a historical snapshot without restoring it.
  • Trino fault-tolerant execution. SET SESSION retry_policy = 'TASK' — long ETL queries spill intermediate data to a shared exchange and survive worker churn. Trade 5–15% wall-clock for spot resilience.
  • PrestoDB RaptorX caching. Hierarchical cache on data files + metadata. Configure cache.enabled=true on the connector to warm repeat scans on the same partitions.
  • Cross-source join performance. Pre-aggregate the lakehouse side in a CTE, then JOIN to a small filtered JDBC side. JDBC pulls are sequential — keep them small.
  • Partition pruning safety. Never wrap a partition column in a function on the left side of a predicate. Use event_ts >= TIMESTAMP '...', not CAST(event_ts AS DATE) >= DATE '...'.
  • EXPLAIN ANALYZE for runtime stats. EXPLAIN ANALYZE SELECT ... runs the query and annotates each operator with rows / bytes / wall time. The senior debugging primitive.
  • Catalog organisation. Name catalogs by source semantics: iceberg_lake, postgres_crm, mysql_billing. Don't name them by storage tier — the next migration will leak the wrong name into every dashboard.
  • Engine version pinning in dbt. Set target.engine_version = 'trino-435' (or equivalent) so model SQL is regression-tested against the actual deployed engine.
  • Athena vs Trino choice. Athena for < 20% utilisation; Trino for > 50%; both viable in the middle. Compute the crossover before you migrate.
  • Starburst Enterprise fit. Add governance, lineage, materialised views, and Warp Speed caching when those are a hard requirement; do not adopt Starburst purely for "performance."

Frequently asked questions

What's the difference between Trino and PrestoDB?

Trino and PrestoDB are two forks of the same project — Facebook Presto (2012). In 2019, the original maintainers left Facebook and forked the project as PrestoSQL, later renamed Trino in 2020 after a trademark dispute. Facebook (now Meta) kept the original code base under the PrestoDB name and donated it to the Linux Foundation. Both are Apache 2.0 licensed. The practical differences in 2026 are release cadence (Trino monthly, PrestoDB quarterly), connector breadth (Trino is wider), execution features (Trino has dynamic filtering and Project Tardigrade fault-tolerant execution), and ecosystem mindshare (most new tools and SaaS distributions target Trino first).

Is Athena Trino or Presto?

Both, depending on the engine version. Athena launched in 2016 on PrestoDB and remained on PrestoDB through engine version 2. In 2023 AWS introduced engine version 3, which is built on Trino. Workgroups can be pinned to v2 or v3; new workgroups default to v3. The SQL surface is mostly compatible across versions, but certain functions (LISTAGG, MERGE on Iceberg) and connector behaviours differ — always check the engine version in your workgroup settings before deploying a new query.

Can Trino query multiple data sources in one SQL statement?

Yes — that is the entire point of a federated SQL engine. A single Trino SELECT can read from Iceberg tables on S3, a Postgres OLTP database, a MySQL billing schema, a Kafka topic, and an Elasticsearch index in one statement: SELECT ... FROM iceberg.lake.events e JOIN postgres.crm.customers c ON ... JOIN kafka.events.clicks k ON .... The engine plans each source-side read through that source's connector (with predicate pushdown), reads them in parallel into worker memory, and joins them on the workers. The classic gotcha is that cross-source joins ship rows over the network — always filter selectively before the join so each side is small.

Is Starburst the same as Trino?

No — Starburst Enterprise is a commercial distribution built on top of Trino, not the same product. It bundles Trino with added enterprise features: governance, row- and column-level security, materialised views, Warp Speed caching, query result cache, vendor support, and a managed control plane. Starburst Galaxy is the SaaS multi-tenant version; Starburst Enterprise is self-hosted. The Trino engine is identical or near-identical to open-source Trino at any given release; the value of Starburst is the enterprise wrapper, not a different engine.

When should I pick Athena over Trino?

Pick Athena when your workload is spiky, ad-hoc, AWS-anchored, or below ~30% cluster utilisation if you were to self-host. The per-query pricing ($5/TB scanned) dominates anything you'd run on a 24/7 cluster at low utilisation. Pick self-hosted Trino when utilisation is steady and high (above ~50% of a sized cluster) — the fixed compute cost amortises and the per-query Athena charges become more expensive. The middle band (20–50% utilisation) is operational preference: Athena for simplicity, Trino for control over JVM, connectors, and feature flags. Above 30-minute query runtimes, Trino (with fault-tolerant execution) wins because Athena imposes per-query timeouts.

How does Trino handle joins across federated connectors?

Trino reads each side of the join through its own connector — with predicate, projection, and (where supported) aggregate pushdown — into worker memory. The two sides are then shuffled by the join key and combined with a hash join on the workers. Dynamic filtering further optimises the larger side: once the smaller (build) side is materialised, Trino broadcasts the set of join keys to the larger (probe) side's scan as a runtime filter, so the probe scan skips files / row groups that cannot match. The cost model is: O(rows read from each source after pushdown) for I/O, plus O(rows after dynamic filtering) for the join. The interview gotcha is that cross-source joins ship rows — JDBC pulls are sequential and dominate runtime, so pre-aggregate the JDBC side or filter it tightly before the join.

Practice on PipeCode

Pipecode.ai is Leetcode for Data Engineering — every federated SQL recipe above ships with hands-on practice rooms where you write the cross-source join, the predicate-pushdown rewrite, and the partition-pruning fix against real graded inputs. PipeCode pairs every reading with 450+ DE-focused problems and a real-time scoring engine, so you never have to wonder whether your Trino query, your PrestoDB rewrite, or your Athena cost optimisation actually holds up under interview pressure.

Practice SQL now →
JOIN drills →

Top comments (0)