Organizing Two Patterns for Integrating an AWS Lakehouse (Apache Iceberg) with Snowflake
Introduction
I'm Aki, an AWS Community Builder (@jitepengin).
In recent years, it has become increasingly common to build lakehouse architectures centered around Apache Iceberg.
Before the rise of lakehouse architecture, it was common to design systems where data was consolidated into a specific platform, such as:
- an architecture centered around Amazon Redshift on AWS
- an architecture centered around internal tables in Snowflake
However, with the advent of Apache Iceberg, this assumption is rapidly changing.
Now that data on Amazon S3 can be directly accessed from multiple engines, what matters is no longer simply product selection.
Instead, the architecture design itself has become the central focus: where the data resides, who owns the write responsibility, and who holds governance authority.
In this article, focusing on the coexistence of AWS and Snowflake, I will organize the following:
- two patterns based on S3 × Iceberg
- connectivity with Power BI Service
- future prospects including AI utilization
Why AWS × Snowflake Coexistence Is Necessary
The greatest value of Apache Iceberg lies in its ability to separate data from the query engine.
For example, while keeping the physical data stored in S3, the same data can be accessed from multiple tools such as Athena, AWS Glue / Spark, Redshift, and Snowflake.
In other words, while consolidating the physical data into a single location, it has become possible to choose the most suitable analytics platform depending on the use case.
As a result, the architectural discussion has shifted from “which product should we use?” to “where should data sovereignty reside, and where should analytical ownership be placed?”
The benefits of using Snowflake here include:
- user-friendly UI/UX
- powerful SQL analytics capabilities
- integration with Cortex AI
In particular, I believe that an architecture where AWS serves as the data sovereignty layer while Snowflake is utilized as the analytics layer is highly compatible.
Two Patterns for Snowflake × S3 Integration
Here, I will organize the two commonly used patterns when integrating Snowflake with S3.
Pattern 1: Glue Catalog Integration
In this pattern, Iceberg tables stored on S3 are referenced from Snowflake through AWS Glue Data Catalog.
The advantages of this architecture are:
- relatively simple configuration
- S3 becomes the Single Source of Truth
- because Snowflake cannot write, AWS retains sovereignty over data management (this can also be considered a disadvantage)
- since user access is consolidated into Snowflake, access control can be centralized there
In other words, Snowflake focuses solely on the role of data analytics (query engine), while AWS retains authority over data management.
Setup Procedure
Step 1: Create an External Volume (S3 Access Configuration)
Run the following on the Snowflake side:
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<S3 path of the catalog data>'
      STORAGE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
Step 2: Create Glue Catalog Integration
Run the following on the Snowflake side:
CREATE OR REPLACE CATALOG INTEGRATION glue_catalog_int -- arbitrary name
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
  GLUE_CATALOG_ID = '<Glue catalog ID to use>'
  GLUE_REGION = 'ap-northeast-1'
  ENABLED = TRUE;
Step 3: Retrieve Required Information for AWS Trust Policy
DESC EXTERNAL VOLUME sample_iceberg_volume;
→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
DESC CATALOG INTEGRATION glue_catalog_int;
→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID
Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.
Please configure the Trust Policy for the role used on the AWS side.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<GLUE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
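Once the trust policy is in place, connectivity can be checked from the Snowflake side. As a hedged sketch (to my knowledge, `SYSTEM$VERIFY_EXTERNAL_VOLUME` is the built-in check for this):

```sql
-- Verify that Snowflake can assume the IAM role and reach the S3 location.
-- The result is a JSON document; look for "success" : true in the output.
SELECT SYSTEM$VERIFY_EXTERNAL_VOLUME('sample_iceberg_volume');
```

If the check fails, recheck that both External IDs were added to the trust policy, since they differ between the external volume and the catalog integration.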
Step 4: Create Database
Run the following on the Snowflake side:
CREATE DATABASE IF NOT EXISTS icebergdb;
Step 5: Create Table
Run the following on the Snowflake side:
CREATE OR REPLACE ICEBERG TABLE icebergdb.public.yellow_tripdata -- arbitrary name
EXTERNAL_VOLUME = 'sample_iceberg_volume'
CATALOG = 'glue_catalog_int'
CATALOG_TABLE_NAME = 'yellow_tripdata'
AUTO_REFRESH = TRUE;
Step 6: Verify Data
Run the following on the Snowflake side:
SELECT * FROM icebergdb.public.yellow_tripdata LIMIT 5;
We were able to read the data from S3!
Pattern 2: Catalog-Linked Database (Iceberg)
This pattern connects to the Glue catalog through its Iceberg REST catalog interface and manages Iceberg tables directly from the Snowflake side.
The advantages of this architecture are:
- read and write operations from Snowflake are possible
- the physical data remains stored in S3
- Snowflake users can perform SQL-based updates and analytics
- easier integration with Power BI and Cortex AI
In other words, the biggest feature is that analysis and updates can be performed from Snowflake while keeping the physical data in S3.
However, governance must be considered from both the AWS side and the Snowflake side.
Setup Procedure
Step 1: Create an External Volume (S3 Access Configuration)
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<S3 path of the catalog data>'
      STORAGE_AWS_ROLE_ARN = '<ARN of the IAM role to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
Step 2: Create Glue Iceberg REST Catalog Integration
CREATE OR REPLACE CATALOG INTEGRATION glue_rest_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  REST_CONFIG = (
    CATALOG_URI = '<Glue Iceberg REST endpoint URI>'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<AWS account ID>'
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = '<ARN of the IAM role to use>'
    SIGV4_SIGNING_REGION = 'ap-northeast-1'
  )
  ENABLED = TRUE;
Step 3: Retrieve Required Information for AWS Trust Policy
DESC EXTERNAL VOLUME sample_iceberg_volume;
→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID
DESC CATALOG INTEGRATION glue_rest_catalog_int;
→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID
Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.
Please configure the Trust Policy for the role used on the AWS side.
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<STORAGE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>"
        }
      }
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "<GLUE_AWS_IAM_USER_ARN>"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>"
        }
      }
    }
  ]
}
Step 4: Create Catalog Linked Database (Read/Write Enabled)
CREATE DATABASE my_iceberg_linked_db
LINKED_CATALOG = (
CATALOG = 'glue_rest_catalog_int',
ALLOWED_WRITE_OPERATIONS = ALL
)
EXTERNAL_VOLUME = 'sample_iceberg_volume';
Step 5: Tables Are Automatically Discovered (Synced Every 30 Seconds)
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata" LIMIT 5;
We were able to read the data from S3!
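To see what the linked database has discovered from the Glue REST catalog, the standard metadata commands work; for example:

```sql
-- List the namespaces and Iceberg tables auto-discovered
-- in the catalog-linked database.
SHOW SCHEMAS IN DATABASE my_iceberg_linked_db;
SHOW ICEBERG TABLES IN DATABASE my_iceberg_linked_db;
```

Note that schema and table names come from the Glue catalog, which is why the qualified names in the queries above are quoted in lowercase.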
Step 6: Write Test
INSERT INTO my_iceberg_linked_db."icebergdb"."yellow_tripdata"
(vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid,
payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
improvement_surcharge, total_amount, congestion_surcharge, airport_fee)
VALUES
(1, '2025-01-01 13:00:00', '2025-01-01 13:30:00', 2,
3.5, 1, 'N', 100, 200,
1, 15.0, 2.5, 0.5, 3.0, 0,
1.0, 22.0, 2.5, 0);
Step 7: Verify Write
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata"
WHERE tpep_pickup_datetime = '2025-01-01 13:00:00';
We were able to write data into S3!
On the Athena side, the same inserted row can also be confirmed.
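As a hedged sketch of that cross-engine check (assuming the Glue database is named `icebergdb` and an Athena workgroup with Iceberg support), the same row can be queried from Athena:

```sql
-- Run in Athena: the row written from Snowflake is visible here too,
-- because both engines read the same Iceberg table on S3.
SELECT vendorid, tpep_pickup_datetime, total_amount
FROM icebergdb.yellow_tripdata
WHERE tpep_pickup_datetime = TIMESTAMP '2025-01-01 13:00:00';
```

This is the core promise of the lakehouse: one physical copy of the data, multiple query engines.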
Which Pattern Should You Choose?
For practical use, the following breakdown works well as a guide.
| Use Case | Recommended Pattern |
|---|---|
| AWS-led ETL workloads, with Snowflake primarily used for read/query access | Pattern 1 |
| BI / AI / SQL updates driven primarily from Snowflake | Pattern 2 |
| Governance needs to be centralized on AWS | Pattern 1 |
| Operations are mainly led by Snowflake users | Pattern 2 |
Integration with Power BI Service
Traditionally, when referencing an S3-based lakehouse from Power BI Service, many architectures used Redshift as an intermediary.
In such cases, securely connecting directly from Power BI Service often required provisioning an EC2 instance and configuring an on-premises data gateway.
This introduces additional operational costs such as EC2 management.
Now, how about Snowflake?
Power BI provides a native connector for Snowflake, allowing Power BI Service to authenticate and connect directly. This removes the relay servers and on-premises data gateways often required in Redshift-based architectures, along with the operational cost of running them.
In addition, since the semantically organized Gold layer on the Snowflake side can be directly connected to Power BI, this also improves usability for BI users.
Thinking in Terms of Medallion Architecture
Personally, I consider the following architecture to be highly practical for real-world use.
- Bronze: S3 + Iceberg
- Silver: S3 + Iceberg (with Snowflake integration as needed)
- Gold: S3 + Iceberg → Snowflake integration
By especially aligning the Gold layer with Snowflake, it becomes easier to provide a semantic layer that is easy for BI users and business departments to consume.
Depending on the use case, the Silver layer can also be utilized for more detailed analysis.
In other words, this enables a separation of responsibilities between data management and data analytics.
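As a sketch of what the Gold-layer hand-off can look like (the `gold` schema and view name are hypothetical and would need to be created first; the source table is the Iceberg table from the earlier examples), a BI-facing semantic view on the Snowflake side might be:

```sql
-- Hypothetical Gold-layer view over the Iceberg table,
-- exposing a BI-friendly daily summary to Power BI users.
CREATE OR REPLACE VIEW gold.daily_trip_summary AS
SELECT
    DATE(tpep_pickup_datetime) AS trip_date,
    COUNT(*)                   AS trips,
    SUM(total_amount)          AS revenue,
    AVG(trip_distance)         AS avg_distance
FROM icebergdb.public.yellow_tripdata
GROUP BY DATE(tpep_pickup_datetime);
```

Keeping the aggregation logic in a view like this gives BI users a stable, semantically named interface while the underlying Iceberg data remains owned by AWS.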
Thinking in Terms of Separation of Ownership
This is the most important point in this article.
| Domain | Ownership |
|---|---|
| Physical data | AWS |
| Governance | AWS / Snowflake |
| Analytics | Snowflake |
| AI interaction | Snowflake |
What matters is not the product itself, but how ownership is separated.
By clearly defining this separation, it becomes easier to organize the scope of responsibilities across data engineering, BI, and AI utilization, which also provides benefits from an organizational management perspective.
The Change Brought by Snowflake Cortex AI
AI utilization will become even more important going forward.
As with other data platforms, AI adoption is progressing rapidly in Snowflake as well.
By leveraging Snowflake Cortex AI, it becomes possible to query Iceberg tables on S3 using natural language.
In other words, the data platform is evolving from “a platform for writing SQL” into “a platform for conversation.”
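Natural-language querying itself is the domain of Cortex Analyst, but as a simpler hedged sketch (assuming Cortex and the `mistral-large` model are available in your region, and using the table from the earlier examples), an LLM can be invoked inline over Iceberg data in plain SQL:

```sql
-- Ask an LLM to comment on a figure computed from the Iceberg table.
-- Assumes SNOWFLAKE.CORTEX.COMPLETE and the mistral-large model
-- are available in your account's region.
SELECT SNOWFLAKE.CORTEX.COMPLETE(
    'mistral-large',
    'Briefly describe this daily taxi revenue figure: '
        || SUM(total_amount)::STRING
) AS ai_summary
FROM icebergdb.public.yellow_tripdata
WHERE tpep_pickup_datetime >= '2025-01-01';
```

Because the data never leaves S3, the same AI-ready tables remain accessible to every other engine in the architecture.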
AI utilization is expected to continue evolving in many aspects.
One key point will be preparing data that is easier for AI to use—in other words, AI-ready data.
Conclusion
In this article, I organized two patterns for integrating an AWS lakehouse (Apache Iceberg) with Snowflake.
In recent data utilization scenarios, it is increasingly common not to rely on AWS alone, but to integrate with platforms such as Databricks and Snowflake, as introduced here.
As mentioned earlier, what matters is not the product itself, but how ownership is separated.
Depending on which service takes responsibility for data, governance, and analytics, both the architecture and configuration will change.
In any case, what is truly important is not the product itself, but the perspective of how to design a data platform that users will continue to use over time.
Going forward, it will become even more important to design architectures not only from the perspective of where data is stored, but also from the viewpoint of who owns responsibility for each layer and how that responsibility connects to user value.
I hope this article will be helpful for those considering a combination of AWS and Snowflake.