Query one Apache Iceberg table from Trino, Spark, and DuckDB

#architecture #database #dataengineering #tutorial

Query one Apache Iceberg table from Trino, Spark, and DuckDB

Iceberg is useful because it lets multiple engines read the same table without copying data into engine-specific storage. The table stays in object storage. The engines differ only in how they reach the metadata.

That is the part most posts skip. They explain Trino, Spark, or DuckDB in isolation, then leave you to infer how the same table behaves across all three. If you are working on a real lakehouse, that missing comparison is the whole problem.

The short version:

Trino usually reads Iceberg through a catalog.
Spark reads Iceberg through a catalog plus Spark Iceberg extensions.
DuckDB can read a single Iceberg table directly from metadata, or attach a REST catalog when you need more than read-only inspection.

The shared model

Iceberg does not hide table state in a warehouse directory tree. It stores metadata files that point at the current snapshot, and the snapshot points at the data files. That lets Trino, Spark, and DuckDB all answer the same query against the same table as long as they agree on the table metadata.

flowchart LR
  Cat[(Catalog)] --> Meta[(Iceberg metadata)]
  Meta --> Files[(Iceberg data files)]
  Trino[Trino] --> Cat
  Spark[Spark] --> Cat
  DuckDB[DuckDB] --> Meta

Trino

Trino is the cleanest option for shared SQL on an existing catalog. Once the Iceberg connector is configured, the query is just normal SQL.

-- Trino 482
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;

The upside is straightforward: Trino is a strong interactive query layer. The downside is also straightforward: it still needs catalog and storage setup, so it is not the lightest option for a local one-off read.

Spark

Spark is the right call when the query is part of a broader data job. Configure the Iceberg extension and catalog, then run the same kind of SQL.

# Spark 4.1.2 + Apache Iceberg 1.11.0
spark-sql \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.11.0 \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.lakehouse.type=rest \
  --conf spark.sql.catalog.lakehouse.uri=https://catalog.example.com

-- Spark SQL in Spark 4.1.2 with Iceberg 1.11.0
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;

The upside is that Spark can do the rest of the job too: transforms, batch writes, and downstream table builds. The downside is the extra session config. For a simple read, Spark is heavier than it needs to be.

DuckDB

DuckDB is the fastest way to inspect an Iceberg table locally. The Iceberg extension can read individual tables directly from metadata.

-- DuckDB current Iceberg extension
INSTALL iceberg;
LOAD iceberg;

SELECT order_id, customer_id, total_amount
FROM iceberg_scan('s3://warehouse/analytics/orders/metadata/v1.metadata.json')
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;

That direct-read path is read-only, which is the part to remember. If you want writes or a catalog-backed workflow, attach an Iceberg REST catalog instead.

-- DuckDB current Iceberg extension
CREATE SECRET iceberg_secret (
    TYPE iceberg,
    CLIENT_ID 'admin',
    CLIENT_SECRET 'password',
    OAUTH2_SERVER_URI 'https://catalog.example.com/v1/oauth/tokens'
);

ATTACH 'warehouse' AS lakehouse (
    TYPE iceberg,
    SECRET iceberg_secret,
    ENDPOINT 'https://catalog.example.com'
);

SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;

The trade-off

Iceberg solves the data-layout problem. It does not erase the operational cost of using multiple engines. Trino, Spark, and DuckDB all have different strengths:

Trino is usually the best shared SQL layer.
Spark is usually the best distributed processing layer.
DuckDB is usually the best local inspection layer.

That is why the real architecture decision is not “which SQL syntax is nicer?” It is “how many places do we want to manage catalog freshness, snapshot state, and permissions?”

If you want the open-lakehouse version of that answer, the single-platform lakehouse idea is worth a look. If you want the blunt trade-off against a warehouse-centric stack, read Databasin vs Snowflake.

What to remember

You do not need three copies of the same Iceberg table to use three engines. You need one table, one snapshot model, and a clear understanding of which engine owns which job.

Read the full article for the longer breakdown: Query one Apache Iceberg table from Trino, Spark, and DuckDB.