Dr. Malte Polley

Integration of Apache Iceberg in S3, Glue, Athena, Matillion, and Snowflake – Part 1

TL;DR

If you aim to establish transactional object control in your Data Lake, you should definitely consider Apache Iceberg. However, the path is somewhat rocky and involves manual effort; AWS Glue, in particular, still needs to offer a broader feature set. Once you overcome these hurdles, working with Iceberg in Snowflake, Glue, and Matillion is as comfortable as working with other file formats. This article is less of a technical guide and more of a narrative about my journey from S3 to Glue. This part focuses on the infrastructure deployment, while the second part will cover the Matillion pipeline and the Snowflake integration.

Transactional Data Lakes – Why and How?

When working with a Data Lake, the question of how to handle changes and data updates inevitably arises. In Europe, explicit deletion concepts are required, not least because of the GDPR. In a database, a data point can be identified relatively easily and then deleted or anonymized. In a Data Lake, that same data point is buried inside immutable files, and this is exactly the gap that transactional table formats close.

Apache Iceberg – Why and How?

There are three frameworks that meet the requirements of a transactional Data Lake: Apache Hudi, Apache Iceberg, and Delta Lake. AWS has published a blog post offering a purely technical comparison: Choosing an open table format for your transactional data lake on AWS.

From my perspective, the decisive aspect was that Apache Iceberg has the most extensive integration with Amazon Athena: Athena supports both read and write operations on Iceberg tables. In combination with Glue, it is therefore possible to update data via SQL from within AWS without maintaining complex systems.

Automation is Key – But Not Given at the Crucial Point

The following section outlines my project process.

Data Integration: From S3 to Glue to Apache Iceberg

When we started, I quickly realized that we couldn't use CSV or JSON files for Apache Iceberg; it expects ORC, Avro, or Parquet instead. In Matillion, I often use Pandas, so migrating from CSV to Parquet wasn't a big problem. My DataFrame only needed a different write method and a few changed parameters.

import pandas as pd

# Build a DataFrame from the API response and tag it with the load date.
df = pd.DataFrame.from_dict(response_dict["data"])
df["Upload Date"] = file_date
df.columns = [replace_no_ascii(unicode_string=col) for col in df.columns]

# Write Parquet straight to S3 (requires the s3fs package).
df.to_parquet(
    f"s3://{s3_bucket}/{directory}/{year}/{file_date}.snappy.parquet",
    engine="auto",
    compression="snappy",
    # Explicit credentials are only needed for the cross-account write.
    storage_options={
        "key": access_key,
        "secret": secret_access_key,
        "token": sessions_token,
    },
)

The storage options are only needed if you want to write the data cross-account.

In the next step, I thought Glue could easily create Iceberg tables from Parquet files – wrong. So, what can Glue do here? You can build Iceberg tables from existing Glue tables. But here's the problem: you have to provide the table definition, even though the crawler has already created it for the "simple" files. This is a significant lack of automation when I want to build my data pipelines with Infrastructure as Code (IaC).
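To make the pain point concrete, here is a rough boto3 sketch (not code from my pipeline) of what providing that definition looks like: you read the schema the crawler already derived and hand it over again when registering the Iceberg table. The database, table, and bucket names are borrowed from the Athena example further down, and the OpenTableFormatInput block is the newer Glue API option for Iceberg tables, so treat the details as assumptions.

# Rough sketch: reuse the crawler-generated schema when registering an
# Iceberg table through the Glue API. All names are placeholders.
import boto3

glue = boto3.client("glue")

# The crawler has already derived the schema for the "simple" Parquet files ...
crawled = glue.get_table(DatabaseName="glue-database-iceberg", Name="datalake")
columns = crawled["Table"]["StorageDescriptor"]["Columns"]

# ... yet it has to be handed over again for the Iceberg table.
glue.create_table(
    DatabaseName="glue-database-iceberg",
    TableInput={
        "Name": "datalake_api_iceberg",
        "StorageDescriptor": {
            "Columns": columns,
            "Location": "s3://mybucket/datalake-iceberg/",
        },
        "TableType": "EXTERNAL_TABLE",
    },
    # Newer Glue API option for open table formats; check your boto3 version.
    OpenTableFormatInput={
        "IcebergInput": {"MetadataOperation": "CREATE", "Version": "2"}
    },
)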

I then looked into the Python library PyIceberg. It supports AWS, but it cannot use S3 itself as a catalog, only Glue: Creating an Iceberg table on S3 using PyIceberg and Glue catalog.
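For completeness, this is roughly what that route looks like: a minimal sketch with hypothetical table names and schema, assuming PyIceberg is installed with its Glue and S3 extras.

# Minimal PyIceberg sketch (hypothetical names): the Glue Data Catalog serves
# as the Iceberg catalog, S3 only holds the data and metadata files.
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType

catalog = load_catalog("glue", **{"type": "glue"})

schema = Schema(
    NestedField(field_id=1, name="id", field_type=StringType(), required=False),
    NestedField(field_id=2, name="upload_date", field_type=StringType(), required=False),
)

catalog.create_table(
    identifier="glue-database-iceberg.datalake_api_iceberg",
    schema=schema,
    location="s3://mybucket/datalake-iceberg/",
)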

The attempt to let Glue read Iceberg tables directly also failed. Unfortunately, our ELT tool Matillion cannot directly handle Apache Iceberg as of July 2024.

Data Integration: From S3 to Glue to Apache Iceberg – With the Help of Athena

Technically, I always thought the journey had to continue with the Glue tables my crawler had already created. I then discovered that Athena can not only create Apache Iceberg tables via plain DDL, but can also build them from existing tables with a CTAS (CREATE TABLE AS SELECT) statement, without me having to provide a table definition.

Thus, you can indeed create Iceberg tables in AWS via custom resources without table definitions and using "IaC":

CREATE TABLE "glue-database-rexx"."datalake_api_iceberg" 
WITH (table_type = 'ICEBERG', format = 'PARQUET', location = 's3://mybucket/datalake-iceberg/', is_external = false) 
AS SELECT * FROM "glue-database-iceberg"."datalake";

The only downside: you create a copy of the existing Parquet files. So a cleanup process and a data INSERT mechanism are needed. You will find both in part two of this series.

Closing Words

Infrastructure as Code (IaC) for Apache Iceberg is possible, but only with detours. In summary, you can use IaC to create the S3 buckets, the Glue crawlers, the database, and the Athena query that runs the initial statement to create the Apache Iceberg tables. The path is not very explicit, which is why I wrote it down here. Once I understood this, I could codify my journey with the Cloud Development Kit.
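If you are curious what the "Athena query via custom resource" piece can look like in code, here is a hedged sketch: a few lines of boto3 that a CDK custom resource or a small Lambda could run to fire the CTAS statement from above. The query result output location is an assumption.

# Sketch of the one-off call a CDK custom resource could make to run the CTAS
# statement shown above. Output location and database names are assumptions.
import boto3

athena = boto3.client("athena")

CTAS = """
CREATE TABLE "glue-database-rexx"."datalake_api_iceberg"
WITH (table_type = 'ICEBERG', format = 'PARQUET',
      location = 's3://mybucket/datalake-iceberg/', is_external = false)
AS SELECT * FROM "glue-database-iceberg"."datalake"
"""

athena.start_query_execution(
    QueryString=CTAS,
    QueryExecutionContext={"Database": "glue-database-rexx"},
    ResultConfiguration={"OutputLocation": "s3://mybucket/athena-query-results/"},
)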
