From Glue to Horizon: Our Real Journey Building an Iceberg Lakehouse on Snowflake

We set out to build an open lakehouse: Iceberg tables on AWS S3, Spark/Glue for pipelines, Snowflake for analytics compute. What could go wrong? Everything—from query performance implosions to uncloneable dynamic tables. Here's the unfiltered journey, including why we pivoted to Snowflake-managed Iceberg via Horizon Catalog and abandoned automated dynamic tables for explicit, observable incremental processing.

TL;DR

  • Phase 1 failed: Glue-generated Iceberg files (32-64MB, far below Snowflake's recommended 256-512MB) caused 5-10x slower Snowflake queries
  • Phase 2 wins: Snowflake-managed Iceberg auto-compacts to 256-512MB, 2-5x faster, ~35% cost savings
  • ⚠️ Dynamic Tables gotcha: Cannot clone, opaque refresh timing—unusable for production DevOps
  • Solution: Explicit Streams + Tasks on log tables—boring, debuggable, production-ready
  • 🎯 Key decision: Data team owns Bronze→Silver only; business owns Gold (saved endless remodeling debates)
  • 💰 Reality check: "Cloud-neutral" = readable across engines, not free migration

Our Team Context (Yours Will Differ)

Before diving into architecture decisions, here's who we are:

  • Skills: Strong SQL/dbt, limited Spark/Scala experience
  • Priorities: Ship fast, avoid operational black boxes
  • Constraint: No dedicated DevOps for Glue cluster tuning

This shaped every decision below. A team fluent in Spark would have made different trade-offs.


🎯 The Goal

Unify data from SAP HANA (change data capture), Salesforce Data Cloud, and raw event streams into a single cloud-neutral lakehouse—no proprietary lock-in, full cross-tool interoperability.

Why Iceberg?

We evaluated Delta Lake, Apache Hudi, and Apache Iceberg a couple of years ago. Here's the comprehensive comparison that drove our decision:

| Criterion | Iceberg | Delta Lake | Hudi |
| --- | --- | --- | --- |
| Partition Evolution | ✅ Change without rewrite | ❌ Requires full table rewrite | ❌ Not supported |
| Primary Keys | ✅ Native support | ❌ Not supported | ✅ Supported |
| Automated Compaction | ✅ MAINTAIN ICEBERG TABLE (Snowflake-managed) | ⚠️ Manual (OPTIMIZE tuning) | ⚠️ Manual/semi-auto |
| Schema Evolution | ✅ Full (add/drop/rename/reorder) | ✅ Add/drop columns | ⚠️ Limited |
| Engine Compatibility | ✅ Spark, Trino, Flink, Snowflake, Dremio | ⚠️ Spark-first, limited others | ⚠️ Spark-first |
| Platform Support | ✅ AWS Glue/Athena, Azure, GCP, Snowflake | ✅ AWS, Azure (Fabric native), Databricks | ⚠️ AWS, limited Azure |
| File Format Flexibility | ✅ Parquet, ORC, Avro | ⚠️ Parquet only | ✅ Parquet, ORC, Avro |
| Community Governance | ✅ Apache Foundation (vendor-neutral) | ⚠️ Databricks-controlled | ✅ Apache Foundation |

Why We Chose Iceberg (and Ruled Out Delta/Hudi)

Iceberg's winning factors for our context:

  1. Partition evolution without rewrites: Our SAP data's partitioning strategy evolved over time (daily → monthly as data matured). Iceberg lets us change partition specs without rewriting billions of rows; Delta requires a full table rewrite, a multi-day, multi-TB operation we couldn't afford. (See the Spark SQL sketch after this list.)

  2. Snowflake-managed support: Only Iceberg offers native managed tables in Snowflake Horizon Catalog with automatic compaction. Delta/Hudi would lock us into external table limitations with the Phase 1 performance issues.

  3. Automated compaction: Snowflake's MAINTAIN ICEBERG TABLE handles file optimization automatically. Delta requires manual compaction tuning in Spark—expertise our SQL-first team doesn't have.

  4. Primary key enforcement: Iceberg supports primary keys natively, critical for our SAP source data integrity (customer IDs, order numbers). Delta lacks this—you must enforce it in application logic.

  5. Vendor-neutral governance: Apache Foundation stewardship means no single vendor controls the spec. Delta's governance is tied to Databricks' business interests.
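
For readers who haven't seen partition evolution in practice, here's a minimal Spark SQL sketch of the spec change described in point 1. Table and column names are hypothetical, and it assumes an Iceberg-enabled Spark session with the Iceberg SQL extensions configured:

```sql
-- Hypothetical table, partitioned daily when first created
CREATE TABLE glue_catalog.silver.sales_events (
  event_id BIGINT,
  event_ts TIMESTAMP,
  amount   DECIMAL(18, 2)
) USING iceberg
PARTITIONED BY (days(event_ts));

-- Later: move to monthly partitions. Only the partition spec changes;
-- existing data files stay in place, and new writes use the new spec.
ALTER TABLE glue_catalog.silver.sales_events
  REPLACE PARTITION FIELD days(event_ts) WITH months(event_ts);
```

Queries keep working across both specs because Iceberg tracks the partition spec per data file.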

Bottom line: Iceberg was the safe bet for multi-engine flexibility without vendor lock-in. If you're Azure-only with Fabric, Delta is pragmatic. If you're Databricks-native, Delta is the path of least resistance. But for AWS + Snowflake + future optionality, Iceberg was the only choice.

Why Snowflake for Compute?

Three differentiators that closed the decision:

1. Separation of Storage & Compute

Scale workloads independently to meet business demands while enabling detailed chargeback per team or domain—without disrupting other workloads. → Deep dive: Understanding Snowflake Virtual Warehouses

2. Automatic Caching + Smart Pruning (RELY Operators)

Sub-second query performance on petabyte-scale data through intelligent result/metadata caching and constraint-based optimization. → Deep dive: RELY Constraint for Star-Schema Queries
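
To make the RELY idea concrete, here's a minimal sketch on a hypothetical star schema. Snowflake does not enforce these constraints; RELY asserts the data already complies, which lets the optimizer eliminate redundant joins:

```sql
-- Declare keys on the dimension and fact with RELY
ALTER TABLE dim_customer
  ADD CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id) RELY;

ALTER TABLE fact_sales_order
  ADD CONSTRAINT fk_sales_customer
  FOREIGN KEY (customer_id) REFERENCES dim_customer (customer_id) RELY;

-- If a query touches only fact columns, the optimizer can now
-- skip the join to dim_customer entirely
SELECT SUM(net_amount)
FROM fact_sales_order f
JOIN dim_customer c ON f.customer_id = c.customer_id;
```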

3. Compute Billing Precision

POC result: Our thousands of BI dashboards were charged only for calculation time—not data transfer to BI tools. Estimated 40–60% cost savings vs. competitors that bill full query duration.


🏛️ Medallion Architecture

Our lakehouse follows the classic three-layer medallion model, tailored for SAP source systems:

┌─────────────────────────────────────────────────────────────┐
│                    DATA SOURCES                             │
│   SAP HANA (CDC)  │  Salesforce Data Cloud  │  Raw Streams │
└────────────┬──────────────────┬─────────────────────┬───────┘
             │                  │                     │
             ▼                  ▼                     ▼
┌─────────────────────────────────────────────────────────────┐
│  🥉 BRONZE LAYER — AWS Glue Spark → Iceberg writes to S3   │
│  Raw ingestion, no transformation, full fidelity            │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  🥈 SILVER LAYER — Iceberg Tables (AWS-hosted, S3)         │
│  Matches source table structures                            │
│  e.g., KNA1 (Customers), MARA (Materials), VBAK (Orders)   │
└─────────────────────────┬───────────────────────────────────┘
                          │  Stream on log_table
                          ▼
┌─────────────────────────────────────────────────────────────┐
│  🥇 GOLD LAYER — Snowflake-Managed Iceberg (Horizon Cat.)  │
│  Proper dimensions & facts with business names             │
│  e.g., DIM_CUSTOMER, FACT_SALES_ORDER, DIM_PRODUCT         │
└─────────────────────────┬───────────────────────────────────┘
                          │
                          ▼
              BI Tools / ML / Fabric Export

Sounds ideal? Early reality: Snowflake choked on our Glue-generated files.


The Architecture Decision Nobody Talks About: Layer Ownership

Here's what we learned the hard way: Don't own Gold if you don't have to.

Why this boundary matters:

  • Silver = Source truth: Matches SAP table structures (KNA1, MARA, VBAK). Data engineering controls quality, structure, and change tracking.
  • Gold = Business semantics: DIM_CUSTOMER, FACT_SALES_ORDER. Business teams decide how to model, aggregate, and interpret.
  • Boundary = Contract: Silver provides clean, change-tracked source data with explicit SLAs; business teams own downstream transformations.

This Decision Saved Us From:

❌ Endless "why did the customer count change?" debates (business definition shifts, not data quality issues)

❌ Remodeling dimensions every quarter when business logic evolves

❌ Being the bottleneck for every dashboard request

❌ Owning interpretations of business rules we don't fully understand

Your team may differ, but define the ownership boundary early or you'll own every downstream interpretation forever. The tools (dbt, Iceberg, Snowflake) don't enforce this—you must.


Phase 1: Glue-Managed Iceberg + Snowflake External Tables ❌

The Performance Trap

We started simple: Spark jobs in Glue created Iceberg tables stored in Glue Catalog; Snowflake linked them as external Iceberg tables.
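
For reference, the Phase 1 wiring looked roughly like this. A sketch with hypothetical names; the IAM role, account ID, and external volume setup are elided:

```sql
-- Point Snowflake at the Glue Data Catalog
CREATE CATALOG INTEGRATION glue_cat_int
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'silver'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::<account-id>:role/<glue-access-role>'
  GLUE_CATALOG_ID = '<account-id>'
  GLUE_REGION = 'us-east-1'
  ENABLED = TRUE;

-- Externally managed Iceberg table: Glue/Spark owns the files,
-- Snowflake only reads them
CREATE ICEBERG TABLE silver.kna1
  EXTERNAL_VOLUME = 's3_iceberg_vol'
  CATALOG = 'glue_cat_int'
  CATALOG_TABLE_NAME = 'kna1';
```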

What Broke: File Size Mismatch

Snowflake's Iceberg scanner is optimized for specific file characteristics per official recommendations:

| Parameter | Snowflake Recommendation | What Glue Produced |
| --- | --- | --- |
| File size | 256 – 512 MB | 32 – 64 MB (many small files) |
| Row group size | 16 – 256 MB | < 16 MB |
| Row groups per file | Multiple (for parallelism) | Often 1 |

The result: Full table scans instead of pruned reads, query times 5–10x slower than expected.

┌──────────────────────────────────────────────────────┐
│              PHASE 1 ARCHITECTURE                    │
│                                                      │
│  AWS Glue Spark ──writes──▶ Iceberg (Glue Catalog)  │
│                                    │                 │
│                             S3 Parquet files         │
│                          (small, fragmented)         │
│                                    │                 │
│  Snowflake ◀──external table──────┘                 │
│  (slow scans, no auto-compaction)                    │
└──────────────────────────────────────────────────────┘

🔀 Decision Point: Invest in Spark Tuning or Snowflake-Managed?

Option A: Hire Spark expertise, tune file compaction settings (see the tuning sketch after this list)

  • Timeline: 3-6 months
  • Ongoing cost: Maintain Spark expertise, monitor file sizes
  • Risk: Our team lacks Spark internals experience
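
For context, the tuning itself is a handful of real Iceberg table properties plus a recurring compaction job; a sketch in Spark SQL with a hypothetical table name. Setting the targets is easy; running and monitoring the rewrites across hundreds of tables is the ongoing burden:

```sql
-- Ask Iceberg writers to target large files and row groups
ALTER TABLE glue_catalog.silver.vbak SET TBLPROPERTIES (
  'write.target-file-size-bytes'       = '536870912',  -- ~512 MB files
  'write.parquet.row-group-size-bytes' = '134217728'   -- ~128 MB row groups
);

-- Periodically compact the small files already on S3
CALL glue_catalog.system.rewrite_data_files(table => 'silver.vbak');
```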

Option B: Snowflake-managed Iceberg via Horizon Catalog

  • Timeline: 2 weeks
  • Ongoing cost: Snowflake MAINTAIN ICEBERG TABLE compute
  • Upside: Handles compaction automatically, team stays in SQL/dbt comfort zone

We chose Option B: Our team's strength is SQL/dbt, not Spark internals. Let Snowflake handle the file lifecycle.


Phase 2: Snowflake-Managed Iceberg + Horizon Catalog ✅

The Pivot

Snowflake-managed Iceberg tables put Snowflake in charge of the table lifecycle on your S3 bucket—Horizon Catalog governs metadata, access, and interoperability.
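
Creating a Snowflake-managed Iceberg table is a one-time DDL. A minimal sketch with hypothetical names; the external volume must already point at your S3 bucket:

```sql
-- Snowflake owns metadata and file lifecycle; data files live in our bucket
CREATE ICEBERG TABLE gold.dim_customer (
  customer_id   NUMBER,
  customer_name STRING,
  country       STRING
)
  CATALOG = 'SNOWFLAKE'
  EXTERNAL_VOLUME = 's3_iceberg_vol'
  BASE_LOCATION = 'gold/dim_customer';
```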

┌──────────────────────────────────────────────────────────┐
│               PHASE 2 ARCHITECTURE                       │
│                                                          │
│  AWS Glue Spark ──writes──▶ Snowflake Horizon Catalog   │
│                                    │                     │
│                    Horizon manages Iceberg metadata      │
│                    Auto-compaction to 256-512MB files    │
│                                    │                     │
│                             S3 (your bucket)             │
│                          Optimal Parquet layout          │
│                                    │                     │
│  Snowflake ◀──native read─────────┘                     │
│  (2–5x faster, full pruning, cloneable*)                 │
│                                                          │
│  BI Tools ◀── Snowflake compute                         │
│  Glue/Spark ◀── Iceberg open format (bidirectional)     │
└──────────────────────────────────────────────────────────┘

Wins

  • 2–5x query speedup vs. Phase 1 external tables
  • Automatic compaction to optimal file sizes via MAINTAIN ICEBERG TABLE
  • Bidirectional access: Glue/Spark can write, Snowflake reads natively; BI tools use Snowflake compute
  • Open format preserved: Gold layer exportable to Fabric/Polaris later

Trade-off: Snowflake compute is billed for maintenance runs—but total ops cost is lower than Phase 1's slow queries burning warehouse credits.


⚠️ Why We Abandoned Dynamic Tables for Production

Dynamic Tables on Iceberg sounded perfect—zero-code pipelines, automatic refresh. We hit two walls that forced us back to explicit patterns:

Problem 1 — No Cloning Support:

-- This FAILS silently on dynamic Iceberg tables
CREATE DATABASE prod_clone CLONE prod_db;
-- Dynamic Iceberg tables are simply skipped in the clone

DB/schema clones skip dynamic Iceberg tables entirely. This is a DevOps killer—no dev/test environment parity, no blue-green deploys.

Problem 2 — Opaque Refresh Timing:

Incremental refresh latency was unpredictable and nearly impossible to debug for SLA enforcement. Monitoring refresh lag and debugging failures was guesswork with no visibility into what triggered refreshes or why they were delayed.

🔀 Trade-off: Zero-Code vs. Zero-Surprise

Dynamic Tables promise automation but hide:

  • When refreshes actually run
  • What triggered the refresh
  • How to debug failures in production at 3am

Streams + Tasks = More code, but debuggable in production.

Architecture decision-makers: Optimize for production support, not dev convenience.

Resolution: Abandoned dynamic tables for Silver → Gold. Back to explicit Streams + Tasks.

Our Solution: Log Table Pattern

┌─────────────────────────────────────────────────────┐
│          SILVER → GOLD PIPELINE (Current)           │
│                                                     │
│  Silver Layer                                       │
│     │                                               │
│     ├──▶ silver.log_table (tracks all changes)      │
│                  │                                  │
│                  ▼                                  │
│     CREATE STREAM log_changes                       │
│     ON TABLE silver.log_table;          ◀── stream  │
│                  │                                  │
│                  ▼                                  │
│     CREATE TASK gold_refresh            ◀── task    │
│       SCHEDULE = '5 MINUTE'                         │
│       AS MERGE INTO gold.customer_360               │
│          USING log_changes ...;                     │
│                  │                                  │
│                  ▼                                  │
│  Gold Layer (Snowflake-Managed Iceberg)             │
└─────────────────────────────────────────────────────┘
-- The explicit, cloneable, debuggable pattern
CREATE STREAM log_changes ON TABLE silver.log_table;

CREATE TASK gold_refresh
  WAREHOUSE = compute_xs
  SCHEDULE = '5 MINUTE'
  WHEN SYSTEM$STREAM_HAS_DATA('log_changes')
  AS
  MERGE INTO gold.customer_360 tgt
  USING (
    SELECT * FROM log_changes WHERE metadata$action = 'INSERT'
  ) src
  ON tgt.customer_id = src.customer_id
  WHEN MATCHED THEN UPDATE SET ...
  WHEN NOT MATCHED THEN INSERT ...;

Why This Wins:

  • Full cloning support (dev/prod parity restored)
  • Transparent costs — every execution logged in TASK_HISTORY (see the query below)
  • Debuggable — stream offset visible, failures isolated
  • Fabric-friendly — Gold can be exported as Iceberg/Parquet later
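
The transparency claim is concrete; here's the kind of check we run when a refresh looks late, using the task from the example above:

```sql
-- Last 20 runs of the gold refresh task: state, errors, and duration
SELECT scheduled_time,
       state,
       error_message,
       DATEDIFF('second', query_start_time, completed_time) AS run_seconds
FROM TABLE(INFORMATION_SCHEMA.TASK_HISTORY(
  TASK_NAME    => 'GOLD_REFRESH',
  RESULT_LIMIT => 20
));
```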


🌐 Cross-Platform Reality: Why We Can (and Can't) Pivot

Current State: AWS + Snowflake, Iceberg on S3

  • ✅ Open format preserved: Can read Iceberg from Spark, Trino, Athena
  • ✅ Gold exportable: Stream to Parquet → Fabric mirroring works
  • ⚠️ Fabric constraint: OneLake wants Delta + same Azure region for zero-copy
  • ⚠️ Horizon lock-in: Snowflake-managed Iceberg metadata tied to Horizon Catalog

What "Cloud-Neutral" Actually Means

Not: "Deploy anywhere tomorrow with zero effort"

Actually: "Readable by multiple engines, movable with effort"

If We Had to Migrate to Fabric:

| Layer | Migration Effort | Estimated Timeline |
| --- | --- | --- |
| Bronze → Silver ingestion | Rewrite Glue jobs to Delta | 2-3 weeks |
| Silver → Gold dbt models | Port to Fabric SQL (syntax diffs) | 1-2 months |
| Gold Iceberg tables | Export as Parquet, re-create in Fabric Warehouse | 1-2 weeks |
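
The Gold export step is the least scary row. A minimal sketch of the Parquet export path, assuming a pre-created external stage (names hypothetical):

```sql
-- Unload a Gold table to Parquet on a stage that Fabric can ingest from
COPY INTO @fabric_export_stage/dim_customer/
FROM gold.dim_customer
FILE_FORMAT = (TYPE = PARQUET)
HEADER = TRUE
OVERWRITE = TRUE;
```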
┌────────────────────────────────────────────────────────────┐
│                  CATALOG LANDSCAPE                         │
│                                                            │
│  Our Setup: AWS us-east-1                                  │
│  ┌────────────────────────────────────────────────────┐   │
│  │  Snowflake Horizon Catalog (current)               │   │
│  │  + AWS Glue Catalog (bronze/silver ingestion)      │   │
│  └────────────────────────────────────────────────────┘   │
│                                                            │
│  Future Options:                                           │
│  ┌─────────────────┐    ┌──────────────────────────────┐  │
│  │ Polaris Catalog │    │ Microsoft Fabric OneLake     │  │
│  │ (Snowflake SaaS)│    │ (wants Delta; Azure-region   │  │
│  │ Maturing fast   │    │  only for zero-copy)         │  │
│  └─────────────────┘    └──────────────────────────────┘  │
│                                                            │
│  Gold Streams → Parquet export → Fabric compatible ✅      │
└────────────────────────────────────────────────────────────┘

Our Decision:

We prioritized Snowflake ecosystem depth over day-1 multi-cloud portability.

Iceberg gave us an exit path, not a free exit. For teams needing Fabric OneLake zero-copy, starting with Delta on Azure is the pragmatic choice.


💰 Cost Reality Check

Phase 1 (Glue + External Iceberg):

  • Glue: compute costs to write and maintain Iceberg tables
  • Snowflake query costs: 3x higher due to full scans

Phase 2 (Snowflake-managed Iceberg):

  • MAINTAIN ICEBERG TABLE: billed as warehouse credits (see the metering query after this list)
  • Query costs: 60% reduction (pruning + caching works correctly)
  • Net savings: 35% monthly vs. Phase 1
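
To keep these numbers honest month to month, we watch credits by service type in account usage (note that ACCOUNT_USAGE views lag by up to a few hours):

```sql
-- Credits by service type over the last 30 days; warehouse compute and
-- serverless maintenance show up as separate service types
SELECT service_type,
       SUM(credits_used) AS credits
FROM SNOWFLAKE.ACCOUNT_USAGE.METERING_HISTORY
WHERE start_time >= DATEADD(day, -30, CURRENT_TIMESTAMP())
GROUP BY service_type
ORDER BY credits DESC;
```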

Takeaway: Optimize for query performance where your users actually spend time, not just ingestion costs.


🔑 Lessons for Your Lakehouse

1. File sizes first

Target 256-512MB Parquet files, or let Snowflake's MAINTAIN ICEBERG TABLE handle it automatically. Small files kill Snowflake performance.

2. Clone-test early

Dynamic Iceberg tables are powerful but not clone-safe for databases/schemas. Test your DevOps workflow before committing to production.
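
A cheap version of that test, sketched with hypothetical names; any dynamic Iceberg tables silently dropped from the clone surface as a count mismatch:

```sql
CREATE DATABASE dev_clone CLONE prod_db;

-- Compare table counts between source and clone
SELECT 'prod'  AS env, COUNT(*) AS table_count
FROM prod_db.INFORMATION_SCHEMA.TABLES
UNION ALL
SELECT 'clone' AS env, COUNT(*) AS table_count
FROM dev_clone.INFORMATION_SCHEMA.TABLES;
```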

3. Horizon bidirectional is a game-changer

If your team uses both Spark and Snowflake, Horizon gives you the best of both without choosing sides. Write with Spark, read with Snowflake—all on the same Iceberg tables.

4. Streams > automation black boxes

Explicit Streams + Tasks always win in production ops: observable at 3am, debuggable from logs, cloneable for dev/test.

5. Phase your migration

Don't try to migrate all layers at once. Start with Gold (highest query frequency), validate performance, then move Silver.

6. Define ownership boundaries early

Bronze→Silver→Gold isn't just technical layers—it's organizational boundaries. Decide who owns what before the first production table.


When to Use What: Decision Matrix

| Your Situation | Recommendation | Why |
| --- | --- | --- |
| Team strong in Spark, need multi-engine | Glue-managed Iceberg + external tables | Keep expertise where it is |
| Team SQL-first, Snowflake primary engine | Snowflake-managed Iceberg | Let Snowflake handle file lifecycle |
| Must support Fabric OneLake zero-copy | Delta Lake on Azure | Iceberg works but not zero-copy on Fabric |
| Need dev/prod clones + SLA guarantees | Streams + Tasks (avoid Dynamic Tables) | Observable, debuggable |
| Bronze/Silver only (like us) | dbt incremental + Developer Toolkit | Explicit watermark control (see the dbt sketch below) |
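
For that last row, a minimal dbt incremental model with an explicit watermark. Model, source, and column names are hypothetical:

```sql
-- models/silver/kna1.sql
{{ config(materialized='incremental', unique_key='customer_id') }}

SELECT customer_id, customer_name, _loaded_at
FROM {{ source('bronze', 'kna1_raw') }}
{% if is_incremental() %}
  -- Explicit watermark: only rows newer than what the target already holds
  WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```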

Bottom line: Cloud-neutral means readable across tools, not free migration. Choose the platform that matches your team's strengths, use open formats for portability insurance.


What's Next for Us

  • Investigate Polaris Catalog (or wait for Horizon Catalog to catch up) for true multi-vendor Iceberg metadata management
  • Evaluate cost/performance of streaming directly from Kafka → Snowflake Iceberg
  • Explore Iceberg v3 features for enhanced BCDR and CDC capabilities

🆕 Update: Iceberg v3 Features (March 2026) That Could Change Your Decision

Since our original evaluation, Snowflake released Apache Iceberg v3 support in public preview (March 2026) with capabilities that address several of our pain points and unlock new architectural patterns.

What's New in Iceberg v3

1. Cross-Region Replication for Snowflake-Managed Tables

  • What it enables: BCDR failover and replication groups for Iceberg tables across regions
  • Why it matters: Previously, our DR strategy required complex Parquet exports. Now, Snowflake-managed Iceberg tables can replicate with full consistency (including row lineage and deletion vectors)
  • Architecture impact: We can now deploy active-passive DR without custom tooling
  • 📖 Replication Config Docs

2. Catalog-Linked Databases

  • What it enables: Connect to remote Iceberg catalogs (AWS Glue, Polaris, etc.) with automatic namespace discovery and read/write support
  • Why it matters: Our Phase 1 external table limitations are eliminated—we can now write back to Glue Catalog-managed tables from Snowflake
  • Architecture impact: True bidirectional catalog federation; could unblock our "Glue as ingestion, Snowflake as analytics" hybrid
  • 📖 CREATE Catalog-Linked Database Docs

Which Features Would Have Changed Our Decision?

Would have stayed with our choice (Iceberg + Horizon):

  • Cross-region replication validates our bet on Snowflake-managed Iceberg (Delta still doesn't have this)
  • Catalog-linked databases eliminate the Phase 1 external table pain without abandoning Glue ingestion

Bottom Line: Iceberg v3 Strengthens the "Safe Bet"

For SQL-first teams on Snowflake, Iceberg v3 eliminates the last major operational friction points we encountered. The combination of Horizon Catalog + v3 features delivers the "cloud-neutral with vendor optimization" balance we were seeking.


What's your biggest Iceberg-on-Snowflake headache? Drop it below 👇

#Snowflake #ApacheIceberg #Lakehouse #DataEngineering #HorizonCatalog #DataArchitecture #AWS #OpenLakehouse
