ChunTing Wu

Trino & Iceberg Made Easy: A Ready-to-Use Playground

Earlier I briefly introduced Apache Iceberg and built an out-of-the-box experiment environment.

https://github.com/wirelessr/flink-iceberg-playground

The purpose of that article was to experience the integration of Flink SQL and Iceberg, so that I could quickly move on to the next stage of Flink development. Surprisingly, the article was really popular, and I even received a few stars on my GitHub. It made me realize that many people, like me, find these big data stacks tough and want a tutorial.

So, let's welcome another big data star, Trino.

Trino is a popular query engine that provides a single point of access to multiple heterogeneous data sources and lets you query them all with SQL. This is a bit abstract, so let's take an example.

Even though every data source has its own query syntax, Trino integrates them all and lets users interact through a single SQL dialect.
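
For instance, a single Trino query can join data across sources. Here is a minimal sketch, assuming two hypothetical catalogs named mongodb and mysql have already been configured (all schema, table, and column names are made up for illustration):

-- Federated query: join a MongoDB collection with a MySQL table
-- (catalog, schema, and table names here are hypothetical)
SELECT o.order_id, u.name
FROM mongodb.shop.orders AS o
JOIN mysql.crm.users AS u
  ON o.user_id = u.id;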

In addition to common databases, Trino also supports querying Iceberg. So let's experience the integration of Trino and Iceberg; I will provide an experiment environment as well.

Experiment environment introduction

First, here is the link to the experiment environment.

https://github.com/wirelessr/trino-iceberg-playground

Before we get started, let's take a quick look at Trino's basic structure, which has three layers: catalog, schema, and table. Using MongoDB as an example, a catalog corresponds to a database cluster, a schema refers to a database within that cluster, and a table maps to a collection.
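
In a query, these three layers show up as a fully qualified name in the form catalog.schema.table. For example, assuming a hypothetical MongoDB catalog named mongodb with a database mydb containing a collection users:

-- catalog.schema.table: the "users" collection in the "mydb" database
-- (all names here are hypothetical)
SELECT * FROM mongodb.mydb.users LIMIT 10;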

However, Trino's notion of a catalog is different from Iceberg's. Iceberg's catalog is just one of Iceberg's own settings, not what Trino recognizes as a catalog.

Sounds confusing? Let's look at a concrete example, example.properties.

# Use the Iceberg connector for this Trino catalog
connector.name=iceberg
# Iceberg's own catalog is Nessie, reachable at the given URI
iceberg.catalog.type=nessie
iceberg.nessie-catalog.uri=http://catalog:19120/api/v1
iceberg.nessie-catalog.default-warehouse-dir=s3://warehouse
# Table data lives on S3-compatible object storage
fs.native-s3.enabled=true
s3.endpoint=http://storage:9000
s3.region=us-east-1
s3.path-style-access=true
s3.aws-access-key=admin
s3.aws-secret-key=password

This is a Trino catalog setup; the file name determines the catalog name, so this file defines a catalog called example. We can see this catalog is linked to an Iceberg source. Below that, we specify Iceberg's own catalog type as nessie and provide the Nessie settings.
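
Once Trino is up with this file in place, the catalog can be inspected like any other. For example, from the Trino CLI:

-- List all configured catalogs; "example" should appear
SHOW CATALOGS;

-- List the schemas under the example catalog
SHOW SCHEMAS FROM example;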

By the way, I wanted to keep using the previous Flink SQL and Iceberg experiment, but I found out that Trino doesn't support Iceberg's DynamoDB catalog. Therefore, I had to create a new environment.

So far we have created a catalog called example; next, we will build the schema and tables under it.

To initialize this tree structure, we override the command in docker-compose.yaml so that the Trino container runs post-init.sh on startup.

#!/bin/bash

# Launch Trino's original startup script in the background
nohup /usr/lib/trino/bin/run-trino &

# Give Trino some time to finish initializing
sleep 10

# Feed the initialization SQL to the Trino CLI
trino < /tmp/post-init.sql

# Keep the container alive until it receives a kill signal
tail -f /dev/null

In the script, we first launch Trino's original startup script in the background, then wait 10 seconds for Trino to initialize, and then feed in the SQL we want to execute. Finally, the script blocks so the container stays alive until it receives a kill signal.
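
The actual post-init.sql is in the repository; as a rough sketch, it would contain statements along these lines (the schema and table names below are assumptions for illustration):

-- Hypothetical initialization SQL for the example catalog
-- Create a schema under the "example" catalog defined earlier
CREATE SCHEMA IF NOT EXISTS example.db;

-- Create an Iceberg table under that schema
CREATE TABLE IF NOT EXISTS example.db.t (
    id BIGINT,
    name VARCHAR
);

-- Verify everything works with a simple query
SELECT * FROM example.db.t;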

Conclusion

In the data engineering world, where new technical stacks are constantly being created, it is hard enough to try each one once, let alone compare the tools.

So I hope these experiment environments give people who want to play around with these tools a quick experience without the tedious setup work.

As for comparing tools, it would be great if someone had a more complete benchmark report to share with me. For example, I'd love to know: at this point in time (May 2024), which of the three lakehouse formats is the best choice? Or how do Trino and Presto compare?

I have an answer in mind, but I'd like more objective numbers, so feel free to discuss.
