Organizing Two Patterns for Integrating an AWS Lakehouse (Apache Iceberg) with Snowflake
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In recent years, it has become increasingly common to build lakehouse architectures centered around Apache Iceberg.
Before the rise of lakehouse architecture, it was common to design systems where data was consolidated into a specific platform, such as:
- an architecture centered around Amazon Redshift on AWS
- an architecture centered around internal tables in Snowflake
However, with the advent of Apache Iceberg, this assumption is rapidly changing.
Now that data on Amazon S3 can be directly accessed from multiple engines, what matters is no longer simply product selection.
Instead, the architecture design itself has become the central focus: where the data resides, who owns the write responsibility, and who holds governance authority.
In this article, focusing on the coexistence of AWS and Snowflake, I will organize the following:
- two patterns based on S3 × Iceberg
- connectivity with Power BI Service
- future prospects including AI utilization
Why AWS × Snowflake Coexistence Is Necessary
The greatest value of Apache Iceberg lies in its ability to separate data from the query engine.
For example, while keeping the physical data stored in S3, the same data can be accessed from multiple tools such as Athena, AWS Glue / Spark, Redshift, and Snowflake.
In other words, while consolidating the physical data into a single location, it has become possible to choose the most suitable analytics platform depending on the use case.
As a result, the architectural discussion has shifted from “which product should we use?” to “where should data sovereignty reside, and where should analytical ownership be placed?”
The benefits of using Snowflake here include:
- user-friendly UI/UX
- powerful SQL analytics capabilities
- integration with Cortex AI
In particular, I believe that an architecture where AWS serves as the data sovereignty layer while Snowflake is utilized as the analytics layer is highly compatible.
Two Patterns for Snowflake × S3 Integration
Here, I will organize the two commonly used patterns when integrating Snowflake with S3.
Pattern 1: Glue Catalog Integration
In this pattern, Iceberg tables stored on S3 are referenced from Snowflake through AWS Glue Data Catalog.
The advantages of this architecture are:
- relatively simple configuration
- S3 becomes the Single Source of Truth
- because Snowflake cannot write, AWS retains sovereignty over data management (this can also be considered a disadvantage)
- since user access is consolidated into Snowflake, access control can be centralized there
In other words, Snowflake focuses solely on the role of data analytics (query engine), while AWS retains authority over data management.
Setup Procedure
Step 1: Create an External Volume (S3 Access Configuration)
Run the following on the Snowflake side:
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<S3 path of the catalog data>'
      STORAGE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
Step 2: Create Glue Catalog Integration
Run the following on the Snowflake side:
CREATE OR REPLACE CATALOG INTEGRATION glue_catalog_int -- arbitrary name
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
  GLUE_CATALOG_ID = '<Glue catalog ID to use>'
  GLUE_REGION = 'ap-northeast-1'
  ENABLED = TRUE;
Step 3: Retrieve Required Information for AWS Trust Policy
DESC EXTERNAL VOLUME sample_iceberg_volume;
→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
DESC CATALOG INTEGRATION glue_catalog_int;
→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID
Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.
Please configure the Trust Policy for the role used on the AWS side.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<GLUE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
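Once the trust policy is in place, connectivity can be checked from the Snowflake side. As a hedged sketch (to my knowledge, `SYSTEM$VERIFY_EXTERNAL_VOLUME` is the built-in check for this):

```sql
-- Verify that Snowflake can assume the IAM role and reach the S3 location.
-- The result is a JSON document; look for "success" : true in the output.
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('sample_iceberg_volume');
```

If the check fails, recheck that both External IDs were added to the trust policy, since they differ between the external volume and the catalog integration.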
Step 4: Create Database
Run the following on the Snowflake side:
CREATE DATABASE IF NOT EXISTS icebergdb;
Step 5: Create Table
Run the following on the Snowflake side:
CREATE OR REPLACE ICEBERG TABLE icebergdb.public.yellow_tripdata -- arbitrary name
EXTERNAL_VOLUME = 'sample_iceberg_volume'
CATALOG = 'glue_catalog_int'
CATALOG_TABLE_NAME = 'yellow_tripdata'
AUTO_REFRESH = TRUE;
Step 6: Verify Data
Run the following on the Snowflake side:
SELECT * FROM icebergdb.public.yellow_tripdata LIMIT 5;
We were able to read the data from S3!
Pattern 2: Catalog-Linked Database (Iceberg)
This pattern connects to the Glue catalog through its Iceberg REST catalog interface and manages Iceberg tables directly from the Snowflake side.
The advantages of this architecture are:
- read and write operations from Snowflake are possible
- the physical data remains stored in S3
- Snowflake users can perform SQL-based updates and analytics
- easier integration with Power BI and Cortex AI
In other words, the biggest feature is that analysis and updates can be performed from Snowflake while keeping the physical data in S3.
However, governance must be considered from both the AWS side and the Snowflake side.
Setup Procedure
Step 1: Create an External Volume (S3 Access Configuration)
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<S3 path of the catalog data>'
      STORAGE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
Step 2: Create Glue Iceberg REST Catalog Integration
CREATE OR REPLACE CATALOG INTEGRATION glue_rest_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  REST_CONFIG = (
    CATALOG_URI = '<Glue Iceberg REST endpoint URI>'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<AWS account ID>'
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = '<ARN of the IAM role to use>'
    SIGV4_SIGNING_REGION = 'ap-northeast-1'
  )
  ENABLED = TRUE;
Step 3: Retrieve Required Information for AWS Trust Policy
DESC EXTERNAL VOLUME sample_iceberg_volume;
→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
DESC CATALOG INTEGRATION glue_rest_catalog_int;
→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID
Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.
Please configure the Trust Policy for the role used on the AWS side.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<GLUE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
Step 4: Create Catalog Linked Database (Read/Write Enabled)
CREATE DATABASE my_iceberg_linked_db
LINKED_CATALOG = (
CATALOG = 'glue_rest_catalog_int',
ALLOWED_WRITE_OPERATIONS = ALL
)
EXTERNAL_VOLUME = 'sample_iceberg_volume';
Step 5: Tables Are Automatically Discovered (Synced Every 30 Seconds)
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata" LIMIT 5;
We were able to read the data from S3!
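To see what the linked database has discovered from the Glue REST catalog, the standard metadata commands work; for example:

```sql
-- List the namespaces and Iceberg tables auto-discovered
-- in the catalog-linked database.
SHOW SCHEMAS IN DATABASE my_iceberg_linked_db;
SHOW ICEBERG TABLES IN DATABASE my_iceberg_linked_db;
```

Note that schema and table names come from the Glue catalog, which is why the qualified names in the queries above are quoted in lowercase.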
Step 6: Write Test
INSERT INTO my_iceberg_linked_db."icebergdb"."yellow_tripdata"
(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid,
payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
improvement_surcharge, total_amount, congestion_surcharge, airport_fee)
VALUES
(1, '2025-01-01 13:00:00', '2025-01-01 13:30:00', 2,
3.5, 1, 'N', 100, 200,
1, 15.0, 2.5, 0.5, 3.0, 0,
1.0, 22.0, 2.5, 0);
Step 7: Verify Write
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata"
WHERE tpep_pickup_datetime = '2025-01-01 13:00:00';
We were able to write data into S3!
On the Athena side, the same inserted row can also be confirmed.
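As a hedged sketch of that cross-engine check (assuming the Glue database is named `icebergdb` and an Athena workgroup with Iceberg support), the same row can be queried from Athena:

```sql
-- Run in Athena: the row written from Snowflake is visible here too,
-- because both engines read the same Iceberg table on S3.
SELECT vendorid, tpep_pickup_datetime, total_amount
FROM icebergdb.yellow_tripdata
WHERE tpep_pickup_datetime = TIMESTAMP '2025-01-01 13:00:00';
```

This is the core promise of the lakehouse: one physical copy of the data, multiple query engines.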
Which Pattern Should You Choose?
For practical use, the following breakdown works well as a guide.
| Use Case | Recommended Pattern |
|---|---|
| AWS-led ETL workloads, with Snowflake primarily used for read/query access | Pattern 1 |
| BI / AI / SQL updates driven primarily from Snowflake | Pattern 2 |
| Governance needs to be centralized on AWS | Pattern 1 |
| Operations are mainly led by Snowflake users | Pattern 2 |
Integration with Power BI Service
Traditionally, when referencing an S3-based lakehouse from Power BI Service, many architectures used Redshift as an intermediary.
In such cases, securely connecting directly from Power BI Service often required provisioning an EC2 instance and configuring an on-premises data gateway.
This introduces additional operational costs such as EC2 management.
Now, how about Snowflake?
Power BI provides a native connector for Snowflake, allowing Power BI Service to authenticate and connect directly. This removes the relay servers and on-premises data gateways often required in Redshift-based architectures, along with the operational cost of running them.
In addition, since the semantically organized Gold layer on the Snowflake side can be directly connected to Power BI, this also improves usability for BI users.
Thinking in Terms of Medallion Architecture
Personally, I consider the following architecture to be highly practical for real-world use.
- Bronze: S3 + Iceberg
- Silver: S3 + Iceberg (with Snowflake integration as needed)
- Gold: S3 + Iceberg → Snowflake integration
By especially aligning the Gold layer with Snowflake, it becomes easier to provide a semantic layer that is easy for BI users and business departments to consume.
Depending on the use case, the Silver layer can also be utilized for more detailed analysis.
In other words, this enables a separation of responsibilities between data management and data analytics.
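As a sketch of what the Gold-layer hand-off can look like (the `gold` schema and view name are hypothetical and would need to be created first; the source table is the Iceberg table from the earlier examples), a BI-facing semantic view on the Snowflake side might be:

```sql
-- Hypothetical Gold-layer view over the Iceberg table,
-- exposing a BI-friendly daily summary to Power BI users.
CREATE OR REPLACE VIEW gold.daily_trip_summary AS
SELECT
    DATE(tpep_pickup_datetime) AS trip_date,
    COUNT(*)                   AS trips,
    SUM(total_amount)          AS revenue,
    AVG(trip_distance)         AS avg_distance
FROM icebergdb.public.yellow_tripdata
GROUP BY DATE(tpep_pickup_datetime);
```

Keeping the aggregation logic in a view like this gives BI users a stable, semantically named interface while the underlying Iceberg data remains owned by AWS.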
Thinking in Terms of Separation of Ownership
This is the most important point in this article.
| Domain | Ownership |
|---|---|
| Physical data | AWS |
| Governance | AWS / Snowflake |
| Analytics | Snowflake |
| AI interaction | Snowflake |
What matters is not the product itself, but how ownership is separated.
By clearly defining this separation, it becomes easier to organize the scope of responsibilities across data engineering, BI, and AI utilization, which also provides benefits from an organizational management perspective.
The Change Brought by Snowflake Cortex AI
AI utilization will become even more important going forward.
As with other data platforms, AI adoption is progressing rapidly in Snowflake as well.
By leveraging Snowflake Cortex AI, it becomes possible to query Iceberg tables on S3 using natural language.
In other words, the data platform is evolving from “a platform for writing SQL” into “a platform for conversation.”
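Natural-language querying itself is the domain of Cortex Analyst, but as a simpler hedged sketch (assuming Cortex and the `mistral-large` model are available in your region, and using the table from the earlier examples), an LLM can be invoked inline over Iceberg data in plain SQL:

```sql
-- Ask an LLM to comment on a figure computed from the Iceberg table.
-- Assumes SNOWFLAKE.CORTEX.COMPLETE and the mistral-large model
-- are available in your account's region.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    'Briefly describe this daily taxi revenue figure: '
        || SUM(total_amount)::STRING
) AS ai_summary
FROM icebergdb.public.yellow_tripdata
WHERE tpep_pickup_datetime >= '2025-01-01';
```

Because the data never leaves S3, the same AI-ready tables remain accessible to every other engine in the architecture.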
AI utilization is expected to continue evolving in many aspects.
One key point will be preparing data that is easier for AI to use—in other words, AI-ready data.
Conclusion
In this article, I organized two patterns for integrating an AWS lakehouse (Apache Iceberg) with Snowflake.
In recent data utilization scenarios, it is increasingly common not to rely on AWS alone, but to integrate with platforms such as Databricks and Snowflake, as introduced here.
As mentioned earlier, what matters is not the product itself, but how ownership is separated.
Depending on which service takes responsibility for data, governance, and analytics, both the architecture and configuration will change.
In any case, what is truly important is not the product itself, but the perspective of how to design a data platform that users will continue to use over time.
Going forward, it will become even more important to design architectures not only from the perspective of where data is stored, but also from the viewpoint of who owns responsibility for each layer and how that responsibility connects to user value.
I hope this article will be helpful for those considering a combination of AWS and Snowflake.