DEV Community

Kushal Nagrani

Open Tables, Shared Truth: Architecting a Multi-Engine Lakehouse

In the modern data landscape, we often hear the phrase "single source of truth." But as many data engineers know, the reality behind that phrase is often a complex web of data copies, inconsistent metrics, and redundant governance.

The problem isn’t processing data. It’s where truth lives.

For over a decade, we’ve been solving the wrong problem in data engineering.

We’ve optimized compute. We’ve scaled storage. We’ve built faster pipelines.

And yet—
we still don’t trust our data.

This blog is not about introducing another analytics engine or tool. It’s about challenging a design flaw we’ve collectively normalized—and showing how modern lakehouse architectures are finally fixing it.

The Problem We Stopped Questioning

Let’s start with an uncomfortable truth. Most organizations today have:

  • The same dataset copied multiple times
  • The same metric producing different results
  • Governance logic re-implemented across systems

And yet, we confidently say:

“We have a single source of truth.”

But do we really?

In reality, what we have is:

  • A warehouse copy
  • A lake copy
  • A serving copy

Each slightly different. Each “correct” in its own context.

From Data Copies to Data Contracts

Why did we end up here? Because historically:

  • Compute engines couldn’t agree on formats
  • Storage systems lacked transactional guarantees
  • Governance was tied to specific platforms

So we did what engineers do best:
👉 We built pipelines. Lots of them.

Pipelines became the glue holding together fragmented truth. But pipelines don’t scale truth — they multiply it.

Why “Querying the Lake” Wasn’t Enough

The industry tried to fix this with data lakes. We said:

“Let’s store everything in one place”
“Let’s allow multiple engines to query it”

And yes, we achieved:

  • Faster access
  • Fewer ETL jobs
  • Flexible analytics

But we missed something critical:

Ownership didn’t change.

The lake became accessible—but not authoritative.

The Real Shift: From Engines → Tables

Here’s the mindset shift that changes everything:

The unit of ownership is not the engine. It’s the table.

Historically:

  • Engines owned data
  • Pipelines moved data between engines

Now:

  • Tables become shared, governed, authoritative assets

This is where open table formats like Apache Iceberg, Delta Lake, and Apache Hudi come in.

What Makes Open Tables Different?

Open table formats bring database-like guarantees to object storage:

  • ACID transactions
  • Schema evolution
  • Time travel
  • Snapshot isolation
  • Concurrent reads and writes
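All of these guarantees rest on one core idea: the table is an append-only log of immutable snapshots, and readers pin a snapshot rather than reading whatever files happen to exist. Here’s a toy sketch of that idea in plain Python — not any real Iceberg/Delta/Hudi code, just the mental model:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """One committed version of the table: an immutable set of data files."""
    snapshot_id: int
    data_files: tuple


class ToyTable:
    """Toy model of an open table format's metadata log.

    Writers append new snapshots; readers pin one snapshot, so they see a
    consistent view even while writes keep landing (snapshot isolation),
    and old snapshots remain queryable (time travel).
    """

    def __init__(self):
        self.snapshots = []

    def commit(self, new_files):
        prev = self.snapshots[-1].data_files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots), prev + tuple(new_files))
        self.snapshots.append(snap)
        return snap.snapshot_id

    def read(self, snapshot_id=None):
        idx = snapshot_id if snapshot_id is not None else -1
        return self.snapshots[idx].data_files


table = ToyTable()
v0 = table.commit(["file-a.parquet"])   # first commit
v1 = table.commit(["file-b.parquet"])   # a later write commits v1

assert table.read(v0) == ("file-a.parquet",)                  # time travel
assert table.read() == ("file-a.parquet", "file-b.parquet")   # latest view
```

A reader that pinned `v0` keeps seeing exactly one file no matter how many commits land afterwards — that’s the isolation piece that plain files on object storage never gave us.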

But the real magic is this:

👉 Multiple engines can read and write to the same table reliably

Engines like:

  • Apache Spark
  • Trino
  • Amazon Athena
  • Snowflake

…can all operate on the same dataset.

No duplication. No translation layers.
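In practice, “the same table” means every engine resolves the table through the same catalog and reads the same files in object storage. A hedged sketch of what that looks like for Spark with an Iceberg catalog — the catalog name `lake` and the bucket are placeholders, and other engines (Trino, Athena) would point their own catalog configuration at the same warehouse:

```python
# Hypothetical Spark configuration registering an Iceberg catalog named
# "lake" backed by an object-storage warehouse. Any engine configured
# against the same catalog/warehouse resolves "lake.sales.orders" to the
# same metadata and the same data files -- no copies, no sync jobs.
spark_conf = {
    "spark.sql.catalog.lake": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.lake.type": "hadoop",  # or a Hive/Glue catalog
    "spark.sql.catalog.lake.warehouse": "s3://example-bucket/warehouse",
}
```

The important part isn’t the exact keys — it’s that the table’s identity lives in the catalog and storage layer, not inside any one engine.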

Writing Back to the Lake

This is where things truly evolve.

In traditional architectures:

  • Transformations write to new systems
  • Each system maintains its own “truth”

In a modern lakehouse:

  • Transformations write back to shared tables
  • These tables become data products

👉 The lake is no longer just storage. It’s the system of record.
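What does “writing back” look like? Typically an upsert into the shared table rather than a load into a separate serving store. A hypothetical example — the `lake.sales.*` table names are placeholders, and the SQL below is the kind of `MERGE` statement engines like Spark or Trino can run against an open table:

```python
# A hypothetical write-back transformation: results are MERGEd into the
# shared table itself, which then serves every downstream consumer.
# Table and column names are illustrative only.
merge_sql = """
MERGE INTO lake.sales.daily_revenue AS t
USING (
    SELECT order_date, SUM(amount) AS revenue
    FROM lake.sales.orders
    GROUP BY order_date
) AS s
ON t.order_date = s.order_date
WHEN MATCHED THEN UPDATE SET t.revenue = s.revenue
WHEN NOT MATCHED THEN INSERT (order_date, revenue)
    VALUES (s.order_date, s.revenue)
"""
```

Because the target is a transactional table in the lake, BI dashboards, ML feature pipelines, and ad-hoc queries all read the same `daily_revenue` — there is no “serving copy” to drift out of sync.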

Where Does Control Actually Live?

Another critical shift:

Governance moves from engines → to the table layer

Instead of duplicating policies across systems, you define them once:

  • Access controls
  • Column-level security
  • Schema ownership
  • Audit trails

Technologies like AWS Lake Formation enable centralized governance across engines.
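As a concrete sketch of “define it once”: with Lake Formation, a column-level grant is a single permission entry at the table layer. The payload below follows the shape of boto3’s `lakeformation.grant_permissions` call; the account ID, role, database, and column names are all placeholders:

```python
# Hypothetical Lake Formation grant defined once at the table layer.
# Every integrated engine (Athena, Redshift Spectrum, EMR Spark, ...)
# enforces it -- no per-engine policy duplication.
grant = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/analysts"
    },
    "Resource": {
        "TableWithColumns": {            # column-level security
            "DatabaseName": "sales",
            "Name": "orders",
            "ColumnNames": ["order_date", "amount"],  # PII columns omitted
        }
    },
    "Permissions": ["SELECT"],
}
# boto3.client("lakeformation").grant_permissions(**grant)
```

Revoking or changing access is the same: one change at the governance layer, picked up by every engine that queries the table.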

Now:

  • Engines come and go
  • Policies stay consistent

The Reference Architecture: Multi-Engine Lakehouse

A modern architecture looks like this:

  1. Storage Layer: Object storage (e.g., Amazon S3)
  2. Table Layer: Open table formats (Iceberg/Delta/Hudi)
  3. Compute Layer: Multiple engines (Spark, Trino, Athena, etc.)
  4. Governance Layer: Centralized policy enforcement
  5. Consumption Layer: BI, ML, APIs

And notice what’s absent:
❌ No “gold copy”
❌ No duplicated datasets
❌ No pipeline sprawl

The Pivot: Two Architectures, Same Problem

Traditional Approach

  • Data is copied
  • Pipelines everywhere
  • Multiple versions of truth

Modern Approach

  • Data is shared
  • Minimal pipelines
  • One consistent truth

Same problem.
Two very different philosophies.

Common Anti-Patterns (That Still Exist)

Even with open tables, teams often fall into traps:

  • Treating tables like CSV files
  • Having no ownership model
  • Allowing every engine to write freely
  • Ignoring cost and compaction strategies

Technology alone doesn’t solve the problem. Design discipline does.

Design Trade-offs You Must Consider

This architecture isn’t “free”.

You need to think about:

  • Concurrent writers → Conflict resolution strategies
  • Compaction ownership → Who maintains table performance?
  • Performance tuning → Partitioning, indexing
  • Failure domains → What breaks, and where?

These are platform decisions—not just engineering ones.
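The concurrent-writers trade-off deserves one concrete illustration. Open table formats generally use optimistic concurrency: a writer prepares changes against the snapshot it read, and the commit succeeds only if that snapshot is still current; otherwise it re-reads and retries. A toy Python simulation of that loop (not real Iceberg/Delta code):

```python
class OptimisticTable:
    """Toy optimistic-concurrency commit protocol."""

    def __init__(self):
        self.version = 0

    def try_commit(self, expected_version):
        # Commit succeeds only if no one else committed since we read.
        if expected_version != self.version:
            return False  # conflict: another writer won the race
        self.version += 1
        return True


def write_with_retry(table, max_retries=3):
    """A well-behaved writer: on conflict, re-read and retry the commit."""
    for _ in range(max_retries):
        base = table.version          # read the current snapshot
        # ... prepare new data files against `base` ...
        if table.try_commit(base):
            return table.version
    raise RuntimeError("gave up after repeated conflicts")


t = OptimisticTable()
base = t.version
assert t.try_commit(base)        # writer A wins the race
assert not t.try_commit(base)    # writer B's stale commit is rejected
assert write_with_retry(t) == 2  # ...but succeeds after re-reading
```

The design question for your platform is who absorbs those retries (streaming jobs? batch jobs? both?) and whether hot tables need a single designated writer instead.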

Rethinking Data Ownership

This shift is bigger than technology.

It’s an organizational change.

You move from:

  • Pipeline ownership → Data product ownership
  • System silos → Shared contracts
  • Tool-centric thinking → Agreement-centric thinking

Key Takeaways

  • Stop copying data
  • Start sharing truth
  • Design tables as products
  • Let engines be interchangeable

And most importantly:

“The most scalable analytics platforms are built around agreements, not tools.”

Final Thought

We’ve spent years optimizing how fast we process data.

Now it’s time to ask a better question:

Where does truth live, and who owns it?

Because until that’s solved,
no amount of compute will fix your data platform.
