In the modern data landscape, we often hear the phrase "single source of truth." But as many data engineers know, the reality behind that phrase is often a complex web of data copies, inconsistent metrics, and redundant governance.
The problem isn’t processing data. It’s where truth lives.
For over a decade, we’ve been solving the wrong problem in data engineering.
We’ve optimized compute. We’ve scaled storage. We’ve built faster pipelines.
And yet—
we still don’t trust our data.
This blog is not about introducing another analytics engine or tool. It’s about challenging a design flaw we’ve collectively normalized—and showing how modern lakehouse architectures are finally fixing it.
The Problem We Stopped Questioning
Let’s start with an uncomfortable truth. Most organizations today have:
- The same dataset copied multiple times
- The same metric producing different results
- Governance logic re-implemented across systems
And yet, we confidently say:
“We have a single source of truth.”
But do we really?
In reality, what we have is:
- A warehouse copy
- A lake copy
- A serving copy
Each slightly different. Each “correct” in its own context.
From Data Copies to Data Contracts
Why did we end up here? Because historically:
- Compute engines couldn’t agree on formats
- Storage systems lacked transactional guarantees
- Governance was tied to specific platforms
So we did what engineers do best:
👉 We built pipelines. Lots of them.
Pipelines became the glue holding together fragmented truth. But pipelines don’t scale truth — they multiply it.
Why “Querying the Lake” Wasn’t Enough
The industry tried to fix this with data lakes. We said:
“Let’s store everything in one place.”
“Let’s allow multiple engines to query it.”
And yes, we achieved:
- Faster access
- Fewer ETL jobs
- Flexible analytics
But we missed something critical:
Ownership didn’t change.
The lake became accessible—but not authoritative.
The Real Shift: From Engines → Tables
Here’s the mindset shift that changes everything:
The unit of ownership is not the engine. It’s the table.
Historically:
- Engines owned data
- Pipelines moved data between engines
Now:
- Tables become shared, governed, authoritative assets
This is where open table formats like Apache Iceberg, Delta Lake, and Apache Hudi come in.
What Makes Open Tables Different?
Open table formats bring database-like guarantees to object storage:
- ACID transactions
- Schema evolution
- Time travel
- Snapshot isolation
- Concurrent reads and writes
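To make snapshot isolation and time travel concrete, here is a toy Python model of how a snapshot-based table format works in principle. `ToyTable` and `Snapshot` are illustrative stand-ins, not the real Iceberg, Delta, or Hudi APIs: the point is that writers never mutate old data, so every historical version of the table remains readable.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    rows: tuple  # immutable view of the table's data at commit time

class ToyTable:
    """Toy model of a snapshot-based table (not a real table-format API)."""
    def __init__(self):
        self.current = Snapshot(0, ())   # empty initial snapshot
        self.snapshots = [self.current]  # full commit history

    def append(self, new_rows):
        # Writers never mutate old data: each commit creates a new snapshot.
        snap = Snapshot(self.current.snapshot_id + 1,
                        self.current.rows + tuple(new_rows))
        self.snapshots.append(snap)
        self.current = snap

    def read(self, snapshot_id=None):
        # Time travel: read any historical snapshot by id.
        if snapshot_id is None:
            return self.current.rows
        return next(s.rows for s in self.snapshots
                    if s.snapshot_id == snapshot_id)

t = ToyTable()
t.append([("order-1", 100)])
t.append([("order-2", 250)])
print(len(t.read()))   # current snapshot: 2 rows
print(len(t.read(1)))  # time travel to snapshot 1: 1 row
```

Because readers pin to a snapshot, a long-running query sees a consistent view even while new commits land, which is what makes concurrent reads and writes safe.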
But the real magic is this:
👉 Multiple engines can read and write to the same table reliably
Engines like:
- Apache Spark
- Trino
- Amazon Athena
- Snowflake
…can all operate on the same dataset.
No duplication. No translation layers.
Writing Back to the Lake
This is where things truly evolve.
In traditional architectures:
- Transformations write to new systems
- Each system maintains its own “truth”
In a modern lakehouse:
- Transformations write back to shared tables
- These tables become data products
👉 The lake is no longer just storage. It’s the system of record.
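The write-back pattern above can be sketched in a few lines. Everything here is hypothetical (the table names, the `write_back` helper, the dict standing in for a catalog); the idea is that a transformation's output lands in the same shared table layer it read from, instead of in a private serving store.

```python
# Toy 'table layer': one shared store keyed by table name.
# All names here are illustrative, not a real catalog API.
lake = {
    "raw.orders": [
        {"order_id": 1, "region": "EU", "amount": 100},
        {"order_id": 2, "region": "EU", "amount": 250},
        {"order_id": 3, "region": "US", "amount": 80},
    ]
}

def write_back(table_name, rows):
    # The transformation's output lands in the same table layer,
    # becoming a governed data product rather than a private copy.
    lake[table_name] = rows

def revenue_by_region():
    totals = {}
    for row in lake["raw.orders"]:
        totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
    return [{"region": r, "revenue": v} for r, v in sorted(totals.items())]

write_back("marts.revenue_by_region", revenue_by_region())
print(lake["marts.revenue_by_region"])
# Every engine now reads the same derived table; no serving copy exists.
```

The derived table is now a data product with one owner and one definition, rather than a number each downstream system recomputes for itself.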
Where Does Control Actually Live?
Another critical shift:
Governance moves from engines → to the table layer
Instead of duplicating policies across systems, you define them once:
- Access controls
- Column-level security
- Schema ownership
- Audit trails
Technologies like AWS Lake Formation enable centralized governance across engines.
Now:
- Engines come and go
- Policies stay consistent
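A minimal sketch of "define once, enforce everywhere," assuming a hypothetical policy store and a `governed_read` helper (neither is a real Lake Formation API): the masking rule lives at the table layer, so it applies identically no matter which engine issues the query.

```python
# Toy policy layer: rules are defined once, at the table layer,
# and every engine's read path goes through the same check.
# Policy shape and function names are illustrative only.
POLICIES = {
    "customers": {"masked_columns": {"email", "ssn"}},
}

def governed_read(table, rows, requester_role):
    masked = POLICIES.get(table, {}).get("masked_columns", set())
    if requester_role == "admin":
        masked = set()  # admins see raw values
    return [
        {col: ("***" if col in masked else val) for col, val in row.items()}
        for row in rows
    ]

rows = [{"id": 1, "email": "a@example.com", "plan": "pro"}]
print(governed_read("customers", rows, "analyst"))
# email is masked for analysts, regardless of which engine issued the query
```

Swap an engine in or out and nothing about the policy changes, because the policy was never the engine's to begin with.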
The Reference Architecture: Multi-Engine Lakehouse
A modern architecture looks like this:
- Storage Layer: Object storage (e.g., Amazon S3)
- Table Layer: Open table formats (Iceberg/Delta/Hudi)
- Compute Layer: Multiple engines (Spark, Trino, Athena, etc.)
- Governance Layer: Centralized policy enforcement
- Consumption Layer: BI, ML, APIs
And here’s what’s missing:
❌ No “gold copy”
❌ No duplicated datasets
❌ No pipeline sprawl
The Pivot: Two Architectures, Same Problem
Traditional Approach
- Data is copied
- Pipelines everywhere
- Multiple versions of truth
Modern Approach
- Data is shared
- Minimal pipelines
- One consistent truth
Same problem.
Two very different philosophies.
Common Anti-Patterns (That Still Exist)
Even with open tables, teams often fall into traps:
- Treating tables like CSV files
- Having no ownership model
- Allowing every engine to write freely
- Ignoring cost and compaction strategies
Technology alone doesn’t solve the problem. Design discipline does.
Design Trade-offs You Must Consider
This architecture isn’t “free”.
You need to think about:
- Concurrent writers → Conflict resolution strategies
- Compaction ownership → Who maintains table performance?
- Performance tuning → Partitioning, indexing
- Failure domains → What breaks, and where?
These are platform decisions—not just engineering ones.
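The concurrent-writer trade-off is worth seeing in miniature. This is a toy optimistic-concurrency protocol (the `ToyCommitLog` class is illustrative, not any format's real commit path): each writer commits against the table version it read, and a stale base version means the commit is rejected and must be retried.

```python
class CommitConflict(Exception):
    pass

class ToyCommitLog:
    """Toy optimistic-concurrency commit protocol (illustrative only)."""
    def __init__(self):
        self.version = 0

    def commit(self, expected_version):
        # A writer commits against the version it read. If another writer
        # got there first, the expected version is stale and the commit fails.
        if expected_version != self.version:
            raise CommitConflict(
                f"expected v{expected_version}, table is at v{self.version}")
        self.version += 1
        return self.version

log = ToyCommitLog()
seen_a = log.version   # writer A reads the table at v0
seen_b = log.version   # writer B also reads the table at v0

log.commit(seen_a)     # A wins: table moves to v1
try:
    log.commit(seen_b)  # B's base version is stale
except CommitConflict:
    # Typical strategy: re-read the table state and retry the commit.
    log.commit(log.version)

print(log.version)  # 2
```

Who is responsible for that retry loop, and for the compaction that keeps retries cheap, is exactly the kind of platform decision the list above is pointing at.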
Rethinking Data Ownership
This shift is bigger than technology.
It’s an organizational change.
You move from:
- Pipeline ownership → Data product ownership
- System silos → Shared contracts
- Tool-centric thinking → Agreement-centric thinking
Key Takeaways
- Stop copying data
- Start sharing truth
- Design tables as products
- Let engines be interchangeable
And most importantly:
“The most scalable analytics platforms are built around agreements, not tools.”
Final Thought
We’ve spent years optimizing how fast we process data.
Now it’s time to ask a better question:
Where does truth live, and who owns it?
Because until that’s solved,
no amount of compute will fix your data platform.


