Getting Started with Flink SQL, Apache Iceberg and DynamoDB Catalog

Due to a technology transformation we have been planning recently, we started to investigate Apache Iceberg. In addition, the data processing engine we use in house is Apache Flink, so it made sense to look for an experimental environment that integrates Flink and Iceberg.

The purpose of this experiment is simply to get hands-on experience using Flink SQL to operate on Iceberg, but I didn't realize it would be so difficult to set up the whole environment.

Iceberg's official documentation actually describes the Flink integration at length, but it is based on Hadoop and Hive. For a data engineer who has never worked with the legacy Hadoop ecosystem, Hadoop is like an abyss, and the official documents obviously don't help.

So I referred to Dremio's article, fixed a few fatal flaws in it, and finally built an experiment environment where I could experience the full Flink-Iceberg integration. Moreover, it's completely local and free of charge.

Before introducing the environment, let's take a quick look at Iceberg.

A short introduction to Iceberg

Apache Iceberg is one of the three major lakehouse table formats; the other two are Apache Hudi and Delta Lake.

A lakehouse, in a nutshell, provides an interface layer on top of object storage that can be operated through SQL, so that it holds more structured data than a data lake while offering the ease of use of a data warehouse.

In addition, a lakehouse provides integrations with the major data processing engines, so you can enjoy the low cost of object storage while keeping the data easy to use.

Here's a diagram that I found very impressive.


Comparing the three lakehouse formats is not the point of this article; there is a detailed benchmark in Kyle Weller's article.

Experiment environment

After reading the above introduction, we understand that in order to perform structured data operations on object storage, we need a mechanism to store a schema, which Iceberg calls a catalog.

Iceberg provides a number of catalog implementations, including the Hive metastore mentioned at the beginning and DynamoDB, which is the one used in this article.
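
To make the catalog concept concrete, here is a minimal sketch of how an Iceberg catalog is declared in Flink SQL. All names and URIs below are placeholders, not this playground's actual settings:

```sql
-- General shape of an Iceberg catalog declaration in Flink SQL.
-- Built-in catalogs (hive, hadoop) are selected via 'catalog-type';
-- other implementations, such as the DynamoDB catalog used later,
-- are loaded via 'catalog-impl' instead.
CREATE CATALOG my_catalog WITH (
  'type' = 'iceberg',
  'catalog-type' = 'hive',             -- or 'catalog-impl' = '<class name>'
  'uri' = 'thrift://metastore:9083',   -- placeholder metastore URI
  'warehouse' = 's3://warehouse/path'  -- placeholder warehouse location
);
```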

The entire playground is completely free, and at its core are the following three components.

  1. DynamoDB: the catalog store; the local version is used to avoid any expense.
  2. MinIO: where the actual Iceberg data is stored, again standing in for S3 to avoid its costs.
  3. Flink: the core of the playground.

The entire experiment environment is available on GitHub, so you can clone it and follow along.

https://github.com/wirelessr/flink-iceberg-playground

The journey of experimentation

This environment is extremely easy to use: all you have to do is docker-compose up -d and you're good to go.

But to get to this point, I read a lot of articles and found that none of them actually worked; they were either buggy or provided only snippets, so I set out to make one myself.

I especially want to mention Dremio's article, which inspired me a lot. The only difference is that it uses Nessie as the catalog, and since I'm not familiar with Nessie at all, I used DynamoDB instead.

On the other hand, information about DynamoDbCatalog is scarce on the Internet; there isn't even a list of which parameters need to be modified. I finally figured it out by reviewing the following source code.

https://github.com/apache/iceberg/blob/f21199d002c6ad008a6772783b4fdad622c09b59/aws/src/main/java/org/apache/iceberg/aws/AwsProperties.java#L113

Changing DynamoDB's table name matters if the environment ever goes to production, but even this parameter (dynamodb.table-name) had to be dug out of the code.

In addition, because the experiment environment uses MinIO as an alternative to S3, there is another critical parameter besides the endpoint: path-style access must be enabled. Again, this parameter (s3.path-style-access) is something I found by reading the source code; it is completely undocumented.

https://github.com/apache/iceberg/blob/f21199d002c6ad008a6772783b4fdad622c09b59/aws/src/main/java/org/apache/iceberg/aws/s3/S3FileIOProperties.java#L155
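
Putting these findings together, a catalog declaration for this setup looks roughly like the sketch below. The catalog name, table name, bucket, and the service hostnames are illustrative assumptions; match them to your own docker-compose.yml. The dynamodb.endpoint override for pointing at local DynamoDB is also my reading of the same AwsProperties source:

```sql
-- A sketch, not the repo's exact config: a DynamoDB-backed Iceberg
-- catalog pointing at local DynamoDB and MinIO. Hostnames
-- (dynamodb-local, minio), bucket, and table name are assumptions.
CREATE CATALOG dynamo_catalog WITH (
  'type' = 'iceberg',
  'catalog-impl' = 'org.apache.iceberg.aws.dynamodb.DynamoDbCatalog',
  'io-impl' = 'org.apache.iceberg.aws.s3.S3FileIO',
  'warehouse' = 's3://warehouse',
  'dynamodb.table-name' = 'iceberg_catalog',  -- found in AwsProperties
  'dynamodb.endpoint' = 'http://dynamodb-local:8000',
  's3.endpoint' = 'http://minio:9000',
  's3.path-style-access' = 'true'             -- found in S3FileIOProperties
);
```

Path-style access is needed because MinIO serves buckets as URL paths by default, rather than as virtual-host subdomains the way S3 does.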

Finally, the experiment environment is complete. It includes a Docker image packaged with Flink and all the required dependencies; the image is a bit large and takes a while to build, so I also provide a ready-made one.

https://hub.docker.com/r/wirelessr/flink-iceberg

Just replace the image used by sql-client in docker-compose.yml.
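
With the container running, you can attach to the SQL client and run a quick smoke test. The following sketch reuses the catalog name from the example above, plus illustrative database and table names:

```sql
-- Quick smoke test from the Flink SQL client (names are illustrative).
USE CATALOG dynamo_catalog;
CREATE DATABASE IF NOT EXISTS demo;
CREATE TABLE IF NOT EXISTS demo.events (
  id  BIGINT,
  msg STRING
);
INSERT INTO demo.events VALUES (1, 'hello iceberg');
SELECT * FROM demo.events;
```

If everything is wired up correctly, the inserted row comes back from the SELECT, and the corresponding data and metadata files appear in the MinIO bucket.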
