Query one Apache Iceberg table from Trino, Spark, and DuckDB
Iceberg is useful because it lets multiple engines read the same table without copying data into engine-specific storage. The table stays in object storage. The engines differ only in how they reach the metadata.
That is the part most posts skip. They explain Trino, Spark, or DuckDB in isolation, then leave you to infer how the same table behaves across all three. If you are working on a real lakehouse, that missing comparison is the whole problem.
The short version:
- Trino usually reads Iceberg through a catalog.
- Spark reads Iceberg through a catalog plus Spark Iceberg extensions.
- DuckDB can read a single Iceberg table directly from metadata, or attach a REST catalog when you need more than read-only inspection.
The shared model
Iceberg does not hide table state in a warehouse directory tree. It stores metadata files that point at the current snapshot, and the snapshot points at the data files. That lets Trino, Spark, and DuckDB all answer the same query against the same table as long as they agree on the table metadata.
flowchart LR
Cat[(Catalog)] --> Meta[(Iceberg metadata)]
Meta --> Files[(Iceberg data files)]
Trino[Trino] --> Cat
Spark[Spark] --> Cat
DuckDB[DuckDB] --> Meta
Trino
Trino is the cleanest option for shared SQL on an existing catalog. Once the Iceberg connector is configured, the query is just normal SQL.
-- Trino 482
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
The upside is straightforward: Trino is a strong interactive query layer. The downside is also straightforward: it still needs catalog and storage setup, so it is not the lightest option for a local one-off read.
Spark
Spark is the right call when the query is part of a broader data job. Configure the Iceberg extension and catalog, then run the same kind of SQL.
# Spark 4.1.2 + Apache Iceberg 1.11.0
spark-sql \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.11.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.lakehouse=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.lakehouse.type=rest \
--conf spark.sql.catalog.lakehouse.uri=https://catalog.example.com
-- Spark SQL in Spark 4.1.2 with Iceberg 1.11.0
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
The upside is that Spark can do the rest of the job too: transforms, batch writes, and downstream table builds. The downside is the extra session config. For a simple read, Spark is heavier than it needs to be.
DuckDB
DuckDB is the fastest way to inspect an Iceberg table locally. The Iceberg extension can read individual tables directly from metadata.
-- DuckDB current Iceberg extension
INSTALL iceberg;
LOAD iceberg;
SELECT order_id, customer_id, total_amount
FROM iceberg_scan('s3://warehouse/analytics/orders/metadata/v1.metadata.json')
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
That direct-read path is read-only, which is the part to remember. If you want writes or a catalog-backed workflow, attach an Iceberg REST catalog instead.
-- DuckDB current Iceberg extension
CREATE SECRET iceberg_secret (
TYPE iceberg,
CLIENT_ID 'admin',
CLIENT_SECRET 'password',
OAUTH2_SERVER_URI 'https://catalog.example.com/v1/oauth/tokens'
);
ATTACH 'warehouse' AS lakehouse (
TYPE iceberg,
SECRET iceberg_secret,
ENDPOINT 'https://catalog.example.com'
);
SELECT order_id, customer_id, total_amount
FROM lakehouse.analytics.orders
WHERE order_date >= DATE '2026-01-01'
ORDER BY total_amount DESC
LIMIT 10;
The trade-off
Iceberg solves the data-layout problem. It does not erase the operational cost of using multiple engines. Trino, Spark, and DuckDB all have different strengths:
- Trino is usually the best shared SQL layer.
- Spark is usually the best distributed processing layer.
- DuckDB is usually the best local inspection layer.
That is why the real architecture decision is not “which SQL syntax is nicer?” It is “how many places do we want to manage catalog freshness, snapshot state, and permissions?”
If you want the open-lakehouse version of that answer, the single-platform lakehouse idea is worth a look. If you want the blunt trade-off against a warehouse-centric stack, read Databasin vs Snowflake.
What to remember
You do not need three copies of the same Iceberg table to use three engines. You need one table, one snapshot model, and a clear understanding of which engine owns which job.
Read the full article for the longer breakdown: Query one Apache Iceberg table from Trino, Spark, and DuckDB.
Top comments (0)