Is Snowflake-managed Iceberg the ultimate Trojan horse or your data’s salvation?

#snowflake #iceberg #data #architecture

Snowflake-managed Iceberg tables are the only way to escape the platform's proprietary storage format without sacrificing the performance that makes you pay the premium.

Why I chose this topic: I spent four years watching migration projects die in the "vendor lock-in" valley of death. I’m writing this because I’m tired of seeing engineers choose inferior tech just because they’re terrified of a data platform they actually like using.

It was 3:14 AM on a Tuesday. The PagerDuty alert hit my phone with the urgency of a heart attack: CRITICAL: External Table Latency Exceeded Threshold. Our internal dashboard, which pulls telemetry from an S3 bucket via Snowflake’s external tables, had effectively died. Queries that usually took 300ms were timing out at the 60-second mark.

I rolled out of bed, fired up the VPN, and saw the mess. We were trying to join a massive, partitioned Parquet dataset against a local Snowflake dimension table. The metadata overhead for the external table was choking the query engine. Every time we added a new partition, the MSCK REPAIR TABLE or the periodic metadata refresh would lag, and the query planner would go into a death spiral trying to reconcile the file manifest.

What we saw

The symptoms were classic: QUERY_HISTORY showed massive EXTERNAL_SCAN times. At first, my junior engineer assumed it was an S3 throttling issue. We checked the AWS CloudWatch metrics for the bucket. Nothing. The bandwidth was fine, the API requests were well within limits, and the latency was stable.

We then spent two hours chasing ghosts in the EXPLAIN plan. We thought the file format (Parquet with Snappy compression) was the problem, so we tried re-partitioning the data into smaller chunks. It made it worse. The overhead of Snowflake tracking thousands of small files in a standard external table was creating a metadata bottleneck that simply couldn't scale. We were fighting the platform's inability to reconcile the "source of truth" in S3 with the "state of the world" in Snowflake.

Photo by Willian Justen de Vasconcellos on Unsplash

Root cause

The root cause was the inherent fragility of standard External Tables. Snowflake doesn't "own" the metadata for standard External Tables; it has to infer it. When you rely on AUTO_REFRESH = TRUE, you’re essentially asking Snowflake to play a high-stakes game of catch-up with S3 event notifications.

We were using:

CREATE OR REPLACE EXTERNAL TABLE raw_events (
    data variant
)
PARTITION BY (event_date)
LOCATION = '@s3_stage/events/'
FILE_FORMAT = (TYPE = PARQUET);

The issue is that the metadata is detached from the data. If the S3 event notification fails—or if a backfill process bypasses the staging area—the manifest drifts. Snowflake’s query optimizer was forced to perform a full directory listing of the S3 prefix because it couldn’t trust its own stale manifest. That’s a 10-second metadata latency penalty on every query, regardless of the compute warehouse size.

Photo by Isaac Smith on Unsplash

The fix

We migrated the entire pipeline to Snowflake-managed Iceberg tables. The switch was surprisingly trivial. By using the CATALOG_INTEGRATION feature, we kept the physical files in our own S3 bucket while letting Snowflake manage the Iceberg metadata (the metadata.json files and snapshot pointers).

The SQL change looked like this:

CREATE OR REPLACE ICEBERG TABLE managed_events (
    id STRING,
    event_timestamp TIMESTAMP,
    data VARIANT
)
CATALOG = 'SNOWFLAKE'
EXTERNAL_VOLUME = 'my_s3_volume'
BASE_LOCATION = 's3://my-bucket/iceberg/events/';

Suddenly, the metadata was no longer a guessing game. Because Snowflake manages the Iceberg metadata, it holds the absolute state of the table. The query optimizer doesn’t need to scan S3 buckets; it reads the latest snapshot pointer from the Iceberg metadata file. The 10-second latency vanished instantly, dropping back to sub-second responses.

More importantly, because this is actual Iceberg (v2 spec), the data is perfectly readable by Spark, Trino, or any other engine that supports the Iceberg API. If Snowflake ever decides to hike their storage premiums or if we decide to move a portion of our workload to an EMR cluster, the data is already in an open, portable format. We didn't leave Snowflake; we just stopped letting them hold our data hostage in a proprietary silo.

What we changed so it never happens again

We implemented a strict "Iceberg-first" policy for all new analytical datasets. No more standard External Tables. Period.

We also automated the validation of our metadata. We now run a daily SYSTEM$GET_ICEBERG_TABLE_INFORMATION check against our critical tables to ensure the snapshot age is within acceptable bounds. If the metadata hasn't been updated in 24 hours, the job alerts the team—not because the query is slow, but because it indicates a break in the Iceberg commit log.

We also decoupled our storage from our compute by moving to an EXTERNAL_VOLUME architecture. This allows us to point Snowflake at an S3 bucket while maintaining full control over the lifecycle policies of the data itself. If we want to transition the data to Glacier or delete it after 90 days, we can do it directly via S3 lifecycle rules without Snowflake knowing or caring.

The biggest shift was psychological. We stopped treating Snowflake as a "black box" where data goes to die. By using managed Iceberg, we treat Snowflake as a high-performance compute engine that happens to be sitting on top of an open data lake.

The vendor lock-in narrative is often a crutch for bad architecture. You don't have to leave the platform to own your data. You just have to stop using the proprietary features that anchor you to the platform's specific storage implementation. Snowflake-managed Iceberg gives you the best of both worlds: the speed of a premium warehouse and the sovereignty of an open-source standard. Use it, or keep waking up at 3:00 AM to fix metadata drift. Your choice.

Cover photo by Nathan Anderson on Unsplash.