DEV Community

Aki for AWS Community Builders

Rethinking Lakehouse Architecture Through Data Ownership: AWS vs. Snowflake

Original Japanese article: データの主導権から考えるAWSとSnowflakeのレイクハウスアーキテクチャ

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).
When designing a data platform, discussions about whether to lean toward AWS or Snowflake are still very common.

However, with the rise of Apache Iceberg, data and platforms can now be decoupled. Because of this shift, I believe we need to reconsider the question itself.

Rather than asking:

Should we build around AWS or Snowflake?

A more fundamental question is:

Who owns the data?

In this article, I'd like to define what I mean by data ownership and explore the architectural trade-offs of AWS-centric and Snowflake-centric lakehouse designs.


Why Data Ownership Matters

Apache Iceberg has made it possible to separate data from the platform that accesses it.

Today, an Iceberg table stored on Amazon S3 can be accessed from Athena, Snowflake, Spark, and many other engines. As a result, choosing a product is becoming less important than deciding who is responsible for managing the data.

Before diving into architectural patterns, let's first examine why this shift matters.

Defining Ownership Across Three Layers

In this article, I define data ownership through the following three layers:

| Layer | Question | Example |
| --- | --- | --- |
| Catalog Ownership | Who owns the metadata? | Glue Data Catalog / Snowflake Open Catalog |
| Write Ownership | Who can update or delete data? | Glue ETL / Snowflake DML |
| Governance Ownership | Who controls access policies? | Lake Formation / Snowflake Horizon |

Only when these three layers are consistently controlled by the same authority can we truly say that ownership exists.

Conversely, when ownership is distributed or unclear, complexity tends to emerge in architecture, operations, and security.
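
One way to make the three-layer definition concrete is to record, for each layer, which platforms currently exercise control, and then flag anything that is missing, shared, or split. The sketch below is only an illustration: the layer names follow the definition above, while `find_ownership_issues` and the sample dictionaries are hypothetical.

```python
LAYERS = ("catalog", "write", "governance")


def find_ownership_issues(controllers: dict[str, set[str]]) -> list[str]:
    """Return findings for layers whose ownership is missing, shared, or split."""
    issues = []
    for layer in LAYERS:
        owners = controllers.get(layer, set())
        if not owners:
            issues.append(f"{layer}: no owner defined")
        elif len(owners) > 1:
            issues.append(f"{layer}: shared by {', '.join(sorted(owners))}")
    # Even if every layer has exactly one owner, the owners may still differ.
    distinct = set().union(*(controllers.get(layer, set()) for layer in LAYERS))
    if not issues and len(distinct) > 1:
        issues.append(f"layers split across {', '.join(sorted(distinct))}")
    return issues


# Hypothetical inputs for illustration.
aws_centric = {"catalog": {"AWS"}, "write": {"AWS"}, "governance": {"AWS"}}
mixed = {"catalog": {"AWS"}, "write": {"AWS", "Snowflake"}, "governance": {"Snowflake"}}

print(find_ownership_issues(aws_centric))
print(find_ownership_issues(mixed))
```

An empty result corresponds to the case where all three layers are controlled by the same authority.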


The Reality of Vendor Lock-In

Even in the Iceberg era, platform dependencies have not disappeared—they have simply changed form.

  • Catalog dependency: Tables managed by Snowflake Open Catalog still rely operationally on a Snowflake-managed service, although external engines can access them through the REST Catalog API.
  • Write-engine dependency: Snowflake-managed Iceberg tables are primarily updated through Snowflake, though Horizon Catalog now supports external writes from engines such as Spark. The choice of write engine remains closely tied to catalog design.
  • Governance dependency: Lake Formation's fine-grained permissions are fundamentally tied to the AWS ecosystem.

Therefore, saying that "Iceberg eliminates vendor lock-in" is only partially true.

What Iceberg removes is storage-format lock-in.

Dependencies around catalog management, governance, and operational processes still remain. In practice, migrating a data platform still means moving or rebuilding governance policies, access controls, metadata management, and any platform-specific features in use.


Extensibility and Strategic Flexibility

Data platforms are never finished.

The rapid evolution of AI technologies and the continuous changes in the Modern Data Stack mean that architectures must adapt over time.

Common examples include:

  • Adding or changing analytics tools
    Athena may be sufficient initially, but business users may later request Snowflake access.

  • Introducing AI workloads
    Integration with SageMaker or Snowflake Cortex AI may become necessary.

  • Cost optimization initiatives
    As query volumes grow, Snowflake compute costs may become significant, leading teams to move batch processing to EMR.

  • Stronger governance requirements
    Column masking or row-level security may need to be introduced later.

When ownership across the three layers is clearly defined from the beginning, these changes become easier to evaluate and implement.

Without that clarity, every change raises new questions about where responsibilities and controls should reside.


What Changed After Iceberg?

Historically, data and platforms were tightly coupled.

| | Snowflake-Centric | AWS-Centric |
| --- | --- | --- |
| Data Location | Inside Snowflake | Inside S3 |
| Management Ownership | Snowflake owns everything | AWS owns everything |
| Access from Other Engines | Not possible | Snowflake could not access directly |

Iceberg fundamentally changed this model.

Iceberg Tables on S3
        ↓
Shared by Multiple Engines

Athena / Glue / Snowflake / Spark / Redshift ...

Iceberg adds a metadata layer on top of data files (typically Parquet) stored in object storage, enabling ACID transactions and schema evolution independent of any specific compute engine.

A catalog tracks metadata such as schemas and active data files, allowing multiple engines to safely access the same table.
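
The reason a catalog lets several engines write safely is that a commit is an atomic swap of the table's current-metadata pointer: a commit succeeds only if the pointer still references the metadata the writer started from. The toy model below illustrates that idea in plain Python. `ToyCatalog`, its methods, and the file names are hypothetical and do not correspond to any real catalog API.

```python
import threading
from typing import Optional


class CommitConflict(Exception):
    """Raised when another writer committed first."""


class ToyCatalog:
    """Hypothetical in-memory catalog: table name -> current metadata file."""

    def __init__(self) -> None:
        self._pointers: dict[str, str] = {}
        self._lock = threading.Lock()

    def current(self, table: str) -> Optional[str]:
        return self._pointers.get(table)

    def commit(self, table: str, expected: Optional[str], new: str) -> None:
        """Swap the pointer only if it still equals the value the writer read."""
        with self._lock:
            if self._pointers.get(table) != expected:
                raise CommitConflict(f"{table}: pointer moved, refresh and retry")
            self._pointers[table] = new


catalog = ToyCatalog()
catalog.commit("sales", expected=None, new="v1.metadata.json")

# Two engines read the same starting point, then both try to commit.
base = catalog.current("sales")
catalog.commit("sales", expected=base, new="v2-engine-a.metadata.json")
try:
    catalog.commit("sales", expected=base, new="v2-engine-b.metadata.json")
except CommitConflict as err:
    print("engine B must retry:", err)
```

If two separate catalogs each keep their own pointer for the same table, this check no longer protects anything, which is why catalog ownership matters so much.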

Data files are now shareable.

However, ownership of the catalog, write operations, and governance still depends on architectural decisions.

In other words, deciding who manages the catalog effectively determines who owns the data.


Major Iceberg Catalog Options

| Catalog | Type | Characteristics |
| --- | --- | --- |
| AWS Glue Data Catalog | AWS-managed | Supports REST Catalog API and integrates with Lake Formation governance |
| Snowflake Open Catalog | Snowflake-managed (based on Apache Polaris) | REST Catalog compliant and accessible from Spark, Trino, and others |
| Snowflake Horizon Catalog | Snowflake service | Exposes Snowflake-managed Iceberg tables through APIs; differs from Open Catalog because it is not a standalone metadata store |

Snowflake-Centric Architecture

Characteristics

In this approach, Snowflake becomes the center of catalog management, governance, and analytics, while data files remain in external object storage such as S3.

This model prioritizes simplicity and a streamlined analytics experience.

Ownership Model

| Layer | Owner |
| --- | --- |
| Catalog Ownership | Snowflake Open Catalog or Horizon Catalog |
| Write Ownership | Primarily Snowflake DML |
| Governance Ownership | Snowflake Horizon |

Although data files remain on S3, external engines can access Snowflake-managed Iceberg tables through two mechanisms:

  • Via Open Catalog: Snowflake-managed Iceberg tables are synced to Open Catalog and exposed through the REST Catalog API. In this sync scenario, external engines have read-only access. (Note: when Open Catalog itself is used as an internal catalog, read/write access is supported.)
  • Via Horizon Catalog: Tables are exposed directly through the Horizon Iceberg REST Catalog API without syncing to Open Catalog. External engines can both read and write, and existing Snowflake users and roles can be used for access control.

Benefits

  • Governance policies such as column masking and row-level security can be applied to Iceberg tables in the same way as native Snowflake tables. When external engines access tables through Horizon Catalog, the same policies are enforced at read time. One important constraint: external engines cannot write to tables that have masking policies or tags applied.
  • Rich ecosystem support for BI tools such as Power BI makes Snowflake a convenient analytics front end.
  • External engines can access Iceberg tables through Open Catalog or Horizon Catalog while reusing Snowflake users and roles as the unit of access control.

Drawbacks

  • Snowflake warehouse compute costs can be significant for write-heavy workloads. When external engines such as Spark write through Horizon Catalog, Snowflake warehouses are not used — but Horizon Catalog API calls are billed at 0.5 credits per million requests, so cost planning is still required.
  • Coordination is needed when AWS services such as Glue ETL also write to the same datasets. Clearly defining who holds catalog ownership is essential.
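
As a rough worked example of the Horizon Catalog API charge mentioned above, the rate of 0.5 credits per million requests can be turned into a monthly estimate. The request volume and the price per credit below are made-up inputs (the actual credit price depends on your contract), so treat this as an illustration of the arithmetic, not a quote.

```python
CREDITS_PER_MILLION_REQUESTS = 0.5  # rate cited in this article


def horizon_api_credits(requests: int) -> float:
    """Credits consumed by a given number of Horizon Catalog API requests."""
    return requests / 1_000_000 * CREDITS_PER_MILLION_REQUESTS


# Hypothetical workload: 40 million catalog requests per month.
monthly_requests = 40_000_000
assumed_price_per_credit_usd = 3.00  # assumption, not a published price

credits = horizon_api_credits(monthly_requests)
print(f"{credits:.1f} credits, about ${credits * assumed_price_per_credit_usd:.2f}")
# prints "20.0 credits, about $60.00"
```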

Even with Iceberg, many enterprises ultimately converge on a Snowflake-centric operating model because governance, metadata, and write operations all remain concentrated within Snowflake.

In such cases, Iceberg provides openness in theory, but ownership remains firmly within the Snowflake ecosystem.


AWS-Centric Architecture

Characteristics

This architecture uses S3 for storage, Glue Data Catalog for metadata, and AWS-native services for ETL, analytics, and governance.

Its primary advantages are flexibility and service interoperability.

Ownership Model

| Layer | Owner |
| --- | --- |
| Catalog Ownership | AWS Glue Data Catalog |
| Write Ownership | Glue ETL / EMR |
| Governance Ownership | Lake Formation |

Because Glue Data Catalog supports the Iceberg REST Catalog API, external engines such as Snowflake and Databricks can access the same tables.

This enables AWS to retain ownership while allowing Snowflake to serve as an analytics front end.
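
To give a feel for what REST-based access looks like from a client, here is a minimal sketch of connection properties in the style used by PyIceberg's REST catalog. The endpoint layout and property names reflect my understanding of the Glue Iceberg REST endpoint and should be checked against current AWS and PyIceberg documentation. The helper `glue_rest_catalog_properties`, the account ID, and the table name are hypothetical.

```python
def glue_rest_catalog_properties(region: str, account_id: str) -> dict[str, str]:
    """Build REST catalog properties for the Glue Iceberg REST endpoint (assumed layout)."""
    return {
        "type": "rest",
        "uri": f"https://glue.{region}.amazonaws.com/iceberg",
        "warehouse": account_id,        # assumption: the catalog is identified by AWS account ID
        "rest.sigv4-enabled": "true",   # requests are signed with AWS SigV4
        "rest.signing-name": "glue",
        "rest.signing-region": region,
    }


props = glue_rest_catalog_properties("us-east-1", "123456789012")
print(props["uri"])

# With PyIceberg installed and AWS credentials configured, the properties
# would be used roughly like this (not executed here):
#   from pyiceberg.catalog import load_catalog
#   catalog = load_catalog("glue_rest", **props)
#   table = catalog.load_table("analytics.orders")  # hypothetical namespace and table
```

Whatever the client, the request is still authorized on the AWS side, so Lake Formation remains the place where access is decided.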

Benefits

  • Tight integration across Athena, Glue, EMR, and Redshift with a shared catalog.
  • Fine-grained column- and row-level governance through Lake Formation, applicable to Iceberg tables.
  • Ability to optimize compute engines for different workloads — EMR for large-scale batch, Athena for interactive queries.

Drawbacks

  • Increased architectural and operational complexity due to the number of AWS services involved.
  • Additional design considerations for multi-cloud environments, as the catalog remains AWS-dependent.

Lake Formation is powerful, but troubleshooting permission issues can become challenging. Identifying why a specific user cannot access a specific table or row often takes considerable time, requiring mature operational practices and careful permission design.


Combining AWS and Snowflake

A realistic approach is not choosing one platform over the other, but assigning clear responsibilities to each.

The key is defining ownership boundaries upfront.

AWS Owns the Data, Snowflake Powers Analytics

This is one of the most common patterns.

The goal is to maintain data ownership within AWS while leveraging Snowflake's analytics capabilities and its rich ecosystem of BI connectors.

┌──────────────────────────────────────────────────┐
│                       AWS                        │
│  S3 (Iceberg data files)                         │
│  Glue Data Catalog (Catalog Ownership)           │
│  Lake Formation (Governance Ownership)           │
│  Glue / EMR (Write Ownership)                    │
└──────────────────────┬───────────────────────────┘
                       │ Iceberg REST Catalog API
        ┌──────────────┼───────────────────┐
        ▼              ▼                   ▼
     Athena          Glue              Snowflake
  (Interactive)     (ETL)            (Analytics)

In this model:

| Layer | Owner |
| --- | --- |
| Catalog Ownership | AWS |
| Write Ownership | AWS |
| Governance Ownership | AWS |

Snowflake acts primarily as an analytical interface.

Two variations exist:

1. Glue Catalog Integration (Read-Only)

Snowflake accesses AWS-managed Iceberg tables through External Iceberg Tables. Write ownership and governance remain entirely with AWS. Lake Formation can be used as the single source of truth for access control.

2. Catalog-Linked Database (Read/Write)

Snowflake can update Iceberg tables through the Iceberg REST Catalog API while the data remains stored on S3. This approach is attractive when analysts and AI workloads primarily operate in Snowflake.

However, governance responsibilities become shared between AWS and Snowflake. Both Lake Formation and Snowflake-side access controls must be configured carefully — a misconfiguration in either can become a security gap. If the read-only pattern (option 1) is sufficient, consolidating governance in Lake Formation is simpler.

For step-by-step implementation details of these patterns — including how to set up External Volumes, Catalog Integrations, and Catalog-Linked Databases — see this companion article:

AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns


Comparison: Three Architectural Patterns

| Dimension | Snowflake-Centric | AWS-Centric | Hybrid |
| --- | --- | --- | --- |
| Catalog Ownership | Snowflake | AWS | AWS |
| Write Ownership | Snowflake | AWS | AWS |
| Governance Ownership | Snowflake Horizon | Lake Formation | AWS (variation 1) / AWS + Snowflake (variation 2) |
| Compute Cost | Tends to be higher | Optimizable by workload | Optimizable by workload |
| Operational Complexity | Low to medium | Medium to high | High |
| Multi-Engine Flexibility | Medium (via REST API) | High | High |

Choosing the Right Pattern

Based on the patterns above, here is a simplified decision guide:

Snowflake-centric tends to fit when:

  • Analytics is BI-driven or led by non-engineers
  • Development speed and a smooth analytics experience matter more than processing very large data volumes
  • Centralized governance through Snowflake Horizon is preferred

AWS-centric tends to fit when:

  • Data volumes are large and ETL is the dominant workload
  • A dedicated data engineering team is already working within the AWS ecosystem
  • Fine-grained access control through Lake Formation is a requirement

Hybrid tends to fit when:

  • Different teams use different tools (e.g., engineers on AWS, analysts on Snowflake)
  • Future extensibility for AI, ML, or multi-engine workloads is a priority
  • AWS must retain data ownership, but Snowflake's query performance is still needed

What Happens When Ownership Is Unclear

A common anti-pattern is building a platform that "works" without explicitly defining ownership.

Typical symptoms include:

  • Nobody knows who is responsible for schema changes. When both Glue and Snowflake have schema owners, it becomes unclear which definition is authoritative.
  • Data written from Snowflake is not visible in Athena. When two catalogs attempt to manage the same table, one may lose track of the latest snapshot, causing metadata inconsistencies.
  • Governance rules drift between Lake Formation and Snowflake Horizon. Maintaining access policies in two places creates risk — a gap in either becomes a security vulnerability.
  • Incident response slows down. When multiple engines can write, identifying what happened and where becomes difficult, delaying recovery.
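
Governance drift of the kind described in the third symptom can be detected mechanically if grants from both systems are exported into a common shape. The sketch below assumes such an export already exists. The `(principal, table, privilege)` tuples and the sample grants are hypothetical, and neither Lake Formation nor Snowflake produces this format directly.

```python
Grant = tuple[str, str, str]  # (principal, table, privilege)


def diff_grants(lake_formation: set[Grant], snowflake: set[Grant]) -> dict[str, list[Grant]]:
    """Report grants that exist on only one side."""
    return {
        "only_in_lake_formation": sorted(lake_formation - snowflake),
        "only_in_snowflake": sorted(snowflake - lake_formation),
    }


# Hypothetical exported grants for illustration.
lf_grants = {("analyst", "sales.orders", "SELECT"), ("etl", "sales.orders", "INSERT")}
sf_grants = {("analyst", "sales.orders", "SELECT"), ("intern", "sales.orders", "SELECT")}

for side, grants in diff_grants(lf_grants, sf_grants).items():
    for grant in grants:
        print(side, grant)
```

A check like this only reports differences. Deciding which side is authoritative is still an ownership question.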

These issues often evolve from technical challenges into organizational problems:

  • Teams blame each other over unclear responsibilities.
  • Audits become difficult because nobody can fully explain who has access to what.
  • Incident recovery is delayed due to unclear decision-making authority.

A running system is not necessarily a well-designed system.

Ownership becomes increasingly difficult to fix after the platform has already grown.


"AWS or Snowflake?" Is a Secondary Question

In practice, organizations often begin by debating whether to standardize on AWS or Snowflake.

In the Iceberg era, I believe that is the wrong starting point.

The first questions should be:

  • Who owns the catalog?
  • Who owns writes?
  • Who owns governance?

Once these three ownership layers are defined, the platform choice naturally follows.

  • Want all three owned by Snowflake? → Snowflake-centric architecture.
  • Want all three owned by AWS? → AWS-centric architecture.
  • Want AWS to own data while Snowflake provides analytics? → Hybrid architecture.
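
The mapping above is simple enough to write down as a function. This is a deliberately simplified sketch: `choose_pattern` and its string values are hypothetical, and it does not model the shared-governance variation of the hybrid pattern.

```python
def choose_pattern(catalog: str, writes: str, governance: str,
                   snowflake_analytics: bool = False) -> str:
    """Map the three ownership answers to one of the article's patterns."""
    owners = {catalog, writes, governance}
    if owners == {"snowflake"}:
        return "Snowflake-centric"
    if owners == {"aws"}:
        # AWS owns the data; the pattern is hybrid when Snowflake is the analytics layer.
        return "Hybrid" if snowflake_analytics else "AWS-centric"
    return "Undefined: revisit ownership boundaries"


print(choose_pattern("snowflake", "snowflake", "snowflake"))
print(choose_pattern("aws", "aws", "aws"))
print(choose_pattern("aws", "aws", "aws", snowflake_analytics=True))
print(choose_pattern("aws", "snowflake", "aws"))
```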

Iceberg has dramatically increased flexibility around where data lives.

As flexibility increases, architects must become more deliberate about defining responsibility.

Starting with product selection often leads to contradictions later. When a setup in which Snowflake is the query interface, Glue handles writes, and Lake Formation controls governance emerges by accident rather than by intentional design, it is a classic symptom of ownership that was distributed and unclear from the start.

The hardest challenge is no longer connectivity.

It is ownership.


Conclusion

Apache Iceberg has significantly reduced storage-level vendor lock-in.

However, catalog ownership, write ownership, and governance ownership still require deliberate architectural decisions.

A useful decision-making sequence is:

  1. Decide who owns the catalog. (Glue / Snowflake Open Catalog / Snowflake Horizon)
  2. Decide who owns writes. (AWS-native services / Snowflake)
  3. Decide who owns governance. (Lake Formation / Snowflake Horizon / both)

Once those three decisions are made, choosing between AWS and Snowflake becomes much easier. From there, you can design the architecture that best fits your requirements.

Ultimately, the hardest part of a modern lakehouse architecture is often not the technology itself. It is agreeing on ownership boundaries — deciding which team manages the catalog, who is responsible for data updates, and where governance policies are enforced.

Technology evolves. The challenge of people and processes remains.

I hope this article helps anyone evaluating lakehouse architectures built on AWS, Snowflake, and Apache Iceberg.
