Why I Stopped Caring Which Table Format You Use

#datalake #engineering #architecture #bigdata

Roughly 82% of the production data pipelines I audited in 2025 were still relying on legacy Hive metastores as their primary source of truth, despite moving to "modern" cloud data lakes.

That statistic matters because it explains why we’re all so obsessed with the Delta Lake vs. Apache Iceberg debate. We aren't fighting over technical superiority; we’re fighting over the fear of vendor lock-in. We keep choosing formats like they’re lifetime marriages, terrified that picking the "wrong" one will force a multi-petabyte migration in three years.

Why I chose this topic: After spending a decade migrating between proprietary formats and open-source standards, I’m tired of the tribalism. Interoperability is finally here, and it’s time we treated these formats as interchangeable storage layouts rather than religious identities.

Most engineers use the term "table format" to describe the metadata layer, but few can explain how a commit actually turns a set of manifest files into a new snapshot. We treat delta.enableDeletionVectors or iceberg.engine.hive.enabled as magic flags, but when a commit fails on a lock at 3:00 AM on a Sunday, your deep-seated belief in one format over the other won't get the write through.

How it actually works

The industry has moved past the "one format to rule them all" era. The reality of 2026 is that the storage format—whether it’s the Delta _delta_log or the Iceberg metadata/ directory—is increasingly becoming an implementation detail hidden behind abstraction layers like Databricks UniForm or the Iceberg REST catalog.

Take UniForm (Universal Format). It isn’t just a marketing slide; it’s a translation layer. When you enable it on a Delta table, you’re essentially running an asynchronous background process that writes Iceberg-compatible metadata alongside your Delta logs.

In code, it looks deceptively simple. You aren't rewriting your data; you're just updating the table properties:

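-- UniForm is switched on via the format list, plus column mapping and the Iceberg compatibility flag.
ALTER TABLE my_table SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);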

When this runs, the engine maintains the Delta transaction log while simultaneously generating the Iceberg metadata.json and manifest files. An engine like Trino or StarRocks doesn't need to know the Delta log exists; it points to the Iceberg metadata and reads the Parquet files as if they were native Iceberg data.
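To make "it points to the Iceberg metadata" concrete, here is a minimal Python sketch of the first thing an Iceberg reader does: open the table metadata JSON and find the current snapshot. The field names (current-snapshot-id, snapshots, manifest-list) come from the Iceberg table spec; the bucket, paths, IDs, and the helper name current_snapshot are invented for illustration, and a real reader does far more than this.

```python
import json
from typing import Optional

# Trimmed, made-up example of an Iceberg table metadata file.
# Field names follow the Iceberg table spec; every value is invented.
SAMPLE_METADATA = json.loads("""
{
  "format-version": 2,
  "location": "s3://example-bucket/my_table",
  "current-snapshot-id": 102,
  "snapshots": [
    {"snapshot-id": 101, "timestamp-ms": 1700000000000,
     "manifest-list": "s3://example-bucket/my_table/metadata/snap-101.avro"},
    {"snapshot-id": 102, "timestamp-ms": 1700000600000,
     "manifest-list": "s3://example-bucket/my_table/metadata/snap-102.avro"}
  ]
}
""")


def current_snapshot(metadata: dict) -> Optional[dict]:
    """Return the snapshot a reader would start from, or None if there is none."""
    current_id = metadata.get("current-snapshot-id")
    for snap in metadata.get("snapshots", []):
        if snap["snapshot-id"] == current_id:
            return snap
    return None


snap = current_snapshot(SAMPLE_METADATA)
# The manifest list is the entry point to the manifests, which list the Parquet files.
print(snap["manifest-list"])
```

Nothing in that lookup touches _delta_log, which is the whole point: the reader only ever follows the Iceberg chain from metadata to manifest list to data files.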

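This is the "interoperability" endgame. You keep Delta as the write path, with features like Z-Ordering for query acceleration, while exposing the table to the entire Iceberg-compatible ecosystem. One caveat: not every Delta feature survives the bridge. Deletion vectors, for instance, have been incompatible with UniForm’s Iceberg v2 compatibility mode, so check what your compatibility version permits before counting on them for CDC. You aren't choosing a side; you’re choosing a storage engine that speaks two languages.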

The tradeoffs nobody mentions

If this sounds like a free lunch, it isn't. The cost is "metadata bloat" and "write amplification."

When you enable UniForm, you are effectively doubling the metadata overhead of your table. If your pipeline performs frequent, high-concurrency writes, the asynchronous translation process can lag. I’ve seen cases where a downstream Iceberg-native reader (like a legacy Presto cluster) missed the most recent 15 minutes of data because the background manifest generation hadn't caught up to the Delta transaction.
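If you want to alert on that lag rather than discover it from an angry analyst, the arithmetic is simple. The sketch below is a rough Python illustration, not a platform API: it assumes you can already obtain the Delta commit timestamps per version and the Delta version the Iceberg metadata was last generated from. How you fetch those depends on your platform, so both are plain inputs here, and translation_lag is a hypothetical helper name.

```python
from typing import Dict, Tuple


def translation_lag(delta_commits: Dict[int, int],
                    last_translated_version: int) -> Tuple[int, int]:
    """Return (versions_behind, lag_ms) for the Iceberg view of a Delta table.

    delta_commits maps Delta version -> commit timestamp in milliseconds.
    last_translated_version is the Delta version the Iceberg metadata reflects.
    """
    latest_version = max(delta_commits)
    versions_behind = latest_version - last_translated_version
    lag_ms = delta_commits[latest_version] - delta_commits[last_translated_version]
    return versions_behind, lag_ms


# Invented numbers: three commits, with the Iceberg metadata stuck on version 10.
commits = {10: 1_700_000_000_000, 11: 1_700_000_300_000, 12: 1_700_000_900_000}
behind, lag_ms = translation_lag(commits, last_translated_version=10)
print(behind, lag_ms / 60_000)  # versions behind, lag in minutes
```

Comparing versions rather than wall-clock snapshot times is deliberate: the Iceberg snapshot is stamped when the translation ran, which tells you nothing about which Delta commit it actually reflects.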

Then there’s the failure mode of the "dual-writer" trap. The Iceberg metadata that UniForm generates is meant to be read, not written. If an application attempts to commit through the Iceberg metadata while your main pipeline is writing to the Delta log, you hit a consistency nightmare. You’ll see ConcurrentModificationException errors that are incredibly difficult to debug because the logs don't clearly state which format failed the commit.

Another issue: schema evolution. Delta’s schema evolution is notoriously permissive, and you can add columns almost anywhere. Iceberg is stricter: it tracks every column by a stable field ID and has its own rules for how partition specs may evolve. When you bridge them, you are forced to abide by the intersection of their constraints. If you try to perform a complex ALTER TABLE that Iceberg doesn't support but Delta allows, the translation layer breaks. You end up with a table that is valid in Delta but "corrupted" in the eyes of your Iceberg-based tools.
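One cheap defense is a pre-flight check in your migration tooling that refuses any change outside that intersection. The Python sketch below shows the shape of it. The operation names and both allow-lists are placeholders I made up for illustration, not the actual capability matrices of Delta or Iceberg; you would fill them in from each format's documentation for the versions you run.

```python
from typing import Iterable, List, Set

# PLACEHOLDER allow-lists. These are hypothetical operation names, not an
# authoritative statement of what either format supports.
PRIMARY_ALLOWED: Set[str] = {"add_column", "rename_column", "drop_column", "reorder_columns"}
MIRROR_ALLOWED: Set[str] = {"add_column", "rename_column", "drop_column"}


def unsafe_changes(planned: Iterable[str],
                   primary_allowed: Set[str],
                   mirror_allowed: Set[str]) -> List[str]:
    """Return the planned changes that fall outside what both formats accept."""
    safe = primary_allowed & mirror_allowed  # the intersection of constraints
    return [change for change in planned if change not in safe]


blocked = unsafe_changes(["add_column", "reorder_columns"],
                         PRIMARY_ALLOWED, MIRROR_ALLOWED)
if blocked:
    print("Refusing to run, not safe for both formats:", blocked)
```

Failing the deployment is a much better outcome than letting the ALTER TABLE succeed on one side and quietly break the other.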

Finally, consider the maintenance tax. You now have two sets of vacuuming and snapshot expiration policies to manage. If you run VACUUM on your Delta table but don’t properly expire the corresponding Iceberg snapshots, you end up with "orphan" metadata files that cost you cloud storage money and confuse your catalog.
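Finding those orphans is a set difference: everything sitting in the metadata directory minus everything a live snapshot still references. Here is a minimal Python sketch of that idea. The file names are invented, orphan_files is a hypothetical helper, and in practice you would build both inputs from a storage listing and your catalog rather than hard-coding them. Treat the result as a report to review, not a delete list.

```python
from typing import Dict, Iterable, List, Set


def orphan_files(files_on_disk: Iterable[str],
                 live_snapshots: Dict[int, Set[str]]) -> List[str]:
    """Return files present in storage that no live snapshot references."""
    referenced: Set[str] = set()
    for files in live_snapshots.values():
        referenced.update(files)
    return sorted(set(files_on_disk) - referenced)


# Invented listing: snapshot 99 was expired on one side but its file remains.
on_disk = [
    "metadata/snap-099.avro",
    "metadata/snap-101.avro",
    "metadata/snap-102.avro",
]
live = {
    101: {"metadata/snap-101.avro"},
    102: {"metadata/snap-102.avro"},
}
print(orphan_files(on_disk, live))
```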

When to reach for it (and when not to)

Reach for a cross-format architecture if your organization is fragmented. If you have a Databricks-heavy team of data engineers but an analytical layer (BI, ad-hoc SQL) that runs on Trino, StarRocks, or Flink, UniForm is a godsend. It prevents the "data siloing" that occurs when the BI team complains they can't see the latest metrics because they aren't on the Databricks cluster.

It is also the right move if you are planning a long-term migration. Instead of a "big bang" migration, you can flip the UniForm switch, test your secondary compute engine, and slowly shift workloads over months.

Do not reach for it if you are a "single-stack" shop. If your entire pipeline—from raw ingestion to the final dashboard—lives inside the Databricks ecosystem, adding Iceberg translation is unnecessary complexity. You are adding a layer of risk and metadata overhead for zero business value.

Don't reach for it if you have severe latency requirements for ingestion. If you need sub-second visibility into your data, the asynchronous nature of the translation layer will be your enemy. The delay between the primary commit and the secondary metadata generation is a latency floor you cannot get below.

Conclusion

The "Delta vs. Iceberg" war was useful for driving innovation, but it’s over. We have entered the era of the "multi-format lakehouse."

Stop treating table formats as your primary architectural decision. The real decision is how you manage your catalog and how much complexity you’re willing to trade for portability. In 2026, the best engineer in the room isn't the one who can recite the pros and cons of Parquet file layouts; it's the one who knows how to configure a translation layer so the business doesn't have to care about the underlying format at all.

Keep your schema tight, watch your metadata bloat, and stop choosing sides. The infrastructure is finally catching up to the reality that we just want our data to be readable by whatever tool we pick up today.
