Designing a Cross-Cloud Data Plane with Apache Iceberg

Andrew Kalik

Most organizations don’t deliberately choose to build multi-cloud data platforms.

They arrive there gradually — through acquisitions, organizational boundaries, and the reality that different teams and workloads gravitate toward different platforms. Over time, AWS and GCP both become part of the picture, whether that was the original plan or not.

The challenge isn’t the presence of multiple clouds.

The challenge is what happens to data once multiple clouds are part of the picture.

Rather than focusing on specific tools or implementations, this post shares a mental model for reasoning about cross-cloud data platforms, one that prioritizes cost discipline, simplicity, and long-term flexibility.


Why Multi-Cloud Is Often Unavoidable

Multi-cloud is rarely ideological.

It usually emerges from:

  • Independent teams choosing platforms that fit their needs
  • Mergers and acquisitions that bring existing cloud footprints
  • Organizational boundaries that resist forced standardization
  • Analytics and AI capabilities evolving at different speeds

In practice:

Most organizations are already multi-cloud long before they design for it.

Trying to undo that reality often leads to brittle mandates and slow delivery. A more durable approach is to design around multi-cloud instead of fighting it.


The Real Cost of Multi-Cloud Is Data Duplication

Where most multi-cloud data architectures struggle is not orchestration or tooling — it’s duplication.

The same dataset is often:

  • Ingested separately into AWS and GCP
  • Transformed independently in each environment
  • Stored in different formats
  • Reprocessed for analytics, applications, and AI

Each duplication multiplies:

  • Storage cost
  • Compute cost
  • Pipeline complexity
  • Operational risk

At scale, this becomes compounding waste.

Multi-cloud becomes expensive primarily when data is duplicated.

The alternative is to process data once and reuse it everywhere.


A Three-Plane Model for Cross-Cloud Data Platforms

To make this practical, it helps to step back and use a simple mental model that separates responsibilities into three planes, each with a distinct purpose.


The Data Plane: The Source of Truth

The data plane defines what the data is.

It includes:

  • How data is stored
  • How tables are structured
  • How schemas evolve
  • How versions and snapshots are managed

This plane should be:

  • Durable
  • Engine-agnostic
  • Slowly changing
  • Written once and reused many times

Apache Iceberg fits naturally here. It provides a stable, open table contract that works across object storage and compute engines, without binding data to a specific cloud or execution model.

The data plane is not optimized for speed — it is optimized for correctness and reuse.

This is what enables a true single source of truth — not by centralizing platforms, but by standardizing how data is defined and evolved.
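To make this concrete, here is a minimal sketch of opening the same Iceberg table from compute in either cloud through a shared catalog, using pyiceberg. The REST catalog endpoint and the analytics.events table are illustrative placeholders, not a prescribed setup:

```python
# Minimal sketch: open the same Iceberg table from AWS or GCP compute.
# The catalog URI and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",  # hypothetical shared catalog
    },
)

# The table contract (schema, partitioning, snapshots) lives in metadata
# on object storage; the engine reading it is interchangeable.
table = catalog.load_table("analytics.events")
print(table.schema())
print(table.current_snapshot())
```

Because the contract lives in table metadata rather than in any engine, the same call works regardless of where the compute runs.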


The Control Plane: Coordination Without Ownership

The control plane defines when and why work happens.

It includes:

  • Orchestration
  • Eventing
  • Scheduling
  • Governance hooks
  • Policy enforcement

Each cloud can have its own control plane. AWS and GCP do not need to share orchestration logic or operational workflows.

The critical constraint is this:

Control planes coordinate access to data, but they do not own it.

This keeps orchestration stateless, replaceable, and cloud-native, while the data plane remains stable.
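One way to picture this constraint is a stateless handler that passes references to data, never the data itself. Everything below, including the event shape and the submit_job hook, is a hypothetical sketch rather than a specific cloud API:

```python
# Hypothetical sketch of a stateless control-plane handler
# (think Lambda or Cloud Function). It coordinates work by passing
# a table reference and a snapshot pin -- it never copies rows.
import json

def submit_job(spec: str) -> dict:
    # Placeholder for a cloud-native invocation (Step Functions,
    # Cloud Run jobs, etc.); here it simply echoes the spec.
    return {"submitted": json.loads(spec)}

def handle_event(event: dict) -> dict:
    job_spec = {
        "table": event["table"],              # e.g. "analytics.events"
        "snapshot_id": event["snapshot_id"],  # pins a consistent Iceberg read
        "action": "refresh_aggregates",       # illustrative action name
    }
    return submit_job(json.dumps(job_spec))
```

Because the handler holds no data and no state, either cloud can swap out its control plane without touching the tables.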


The Consumption Plane: Execution and Experience

The consumption plane defines how data is used.

It includes:

  • Analytics and querying
  • Applications
  • Feature extraction
  • Machine learning and AI workloads

This plane is intentionally:

  • Ephemeral
  • Cost-variable
  • Optimized for workload needs
  • Free to evolve independently

Serverless execution fits naturally here. Compute spins up only when needed, processes a slice of data, and shuts down.
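A minimal sketch of that lifecycle, assuming pyiceberg and the same hypothetical catalog as earlier, might look like this:

```python
# Sketch of an ephemeral consumer: start, scan one slice of the shared
# table, return a result, disappear. Filter and fields are illustrative.
from pyiceberg.catalog import load_catalog

def handler(event: dict) -> int:
    catalog = load_catalog(
        "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
    )
    table = catalog.load_table("analytics.events")

    # Read only the slice this invocation needs; predicate and column
    # pushdown keep the serverless footprint small.
    scan = table.scan(
        row_filter=f"event_date = '{event['date']}'",
        selected_fields=("user_id",),
    )
    return scan.to_arrow().num_rows  # compute vanishes after returning
```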

Compute should be temporary. Data should not be.


Apache Iceberg as a Shared Cross-Cloud Data Plane

By using Apache Iceberg as the data plane, AWS and GCP can evolve independently while relying on the same underlying data contract.

Iceberg allows:

  • Data to be processed once
  • Schemas to evolve without rewrites
  • Snapshots to support consistent reads
  • Multiple consumers across clouds
  • Object storage to remain the system of record

The clouds don’t need shared pipelines.

They need shared tables.
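Schema evolution is a good illustration of why shared tables are enough: in Iceberg, adding a column is a metadata-only commit, so no data files are rewritten and existing readers are unaffected. A minimal pyiceberg sketch, with an illustrative column name:

```python
# Sketch: evolve the shared table's schema in place. This commits new
# metadata; no existing data files are rewritten.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog(
    "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
)
table = catalog.load_table("analytics.events")

with table.update_schema() as update:
    update.add_column("campaign_id", StringType(), doc="illustrative new column")
```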


Processing Data Once Is the Biggest Cost-Reduction Lever

Without a shared data plane:

  • Each cloud processes the same raw data
  • Each environment runs its own transformations
  • Each platform retrains AI models independently
  • Compute cost scales with the number of clouds

With a shared data plane:

  • Data is transformed once
  • Snapshots are reused across consumers
  • Incremental processing minimizes rework
  • Serverless compute stays small and targeted

Processing a dataset once and reusing it across analytics, applications, and AI workloads is one of the most effective ways to reduce cost in cross-cloud data platforms.

Every additional cloud, engine, or workload that reuses that same processed dataset benefits from this decision without incurring proportional cost.

This is architectural efficiency, not after-the-fact optimization.
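As a sketch of what that reuse can look like in practice, Iceberg's Spark source supports incremental reads between snapshots, so a downstream job touches only newly appended rows instead of rescanning the table. The snapshot IDs and table names below are placeholders, and this incremental mode covers append-only changes:

```python
# Sketch: incremental processing with Spark's Iceberg source -- read only
# the rows appended between two snapshots. IDs and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-refresh").getOrCreate()

new_rows = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5179299526185056830")  # last processed (exclusive)
    .option("end-snapshot-id", "5694547430718026745")    # current snapshot
    .load("shared.analytics.events")
)

# Fold only the new rows into a downstream table.
new_rows.writeTo("shared.analytics.daily_aggregates").append()
```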


Where AI Fits

AI makes architectural efficiency non-negotiable, because the cost of duplicated data shows up fastest in training, retraining, and experimentation.

AI does not require a separate plane.

It spans all three:

  • The data plane provides training data and historical snapshots
  • The control plane governs training and retraining
  • The consumption plane handles inference and interaction

Training on the same data multiple times across clouds is expensive and unnecessary.

A shared data plane reduces that pressure by design.
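For example, pinning training reads to a specific Iceberg snapshot makes a training set reproducible from either cloud; the snapshot ID below is a placeholder, recorded alongside the model artifact:

```python
# Sketch: reproducible training data via Iceberg time travel. Any engine,
# in any cloud, reading this snapshot sees an identical table state.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
)
table = catalog.load_table("analytics.events")

TRAINING_SNAPSHOT_ID = 5179299526185056830  # placeholder; store with the model

training_df = table.scan(snapshot_id=TRAINING_SNAPSHOT_ID).to_pandas()
```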


Tradeoffs and Reality Checks

This approach does not eliminate complexity entirely.

Teams still need to manage:

  • Catalog consistency
  • Identity and access boundaries
  • Feature differences across execution engines
  • Cross-cloud networking considerations

These are governance and coordination problems — not data duplication problems — and they scale far better than parallel pipelines.


When This Pattern Makes Sense

Strong fit

  • Organizations operating in AWS and GCP
  • Shared analytical and AI datasets
  • Cost-sensitive platforms
  • Serverless-first execution models

Less ideal

  • Ultra-low-latency streaming
  • Workloads tightly coupled to proprietary execution features
  • Single-cloud environments with no external consumers


Looking Ahead: Cloud Interconnect as the Final Enabler

One of the most exciting developments for cross-cloud data architectures is the continued maturation of private cloud interconnect between AWS and GCP.

Interconnect transforms cross-cloud connectivity from a workaround into a first-class architectural feature. It provides a private, predictable network path that avoids the public internet entirely, improving not just performance, but security and control.

As interconnect becomes more accessible:

  • Cross-cloud data access becomes more deliberate and auditable
  • Serverless consumption across clouds becomes more practical
  • Data no longer needs to be duplicated simply to feel “close” or “safe”

This is where the three-plane model fully comes together. A shared data plane backed by Iceberg, independent control planes per cloud, and ephemeral consumption planes can operate across platforms with confidence.

Instead of copying data defensively, teams can design for access intentionally — reducing cost, tightening security boundaries, and simplifying how data moves between clouds.

It’s one of the clearest signals that cross-cloud data architectures are moving from workaround to first-class design.

Interconnect doesn’t change the need for good architecture.

It rewards it.


Closing Thoughts

Multi-cloud does not require identical architectures.

It requires:

  • A shared data plane
  • Independent control planes
  • Ephemeral, serverless consumption

By treating Apache Iceberg as the data contract, teams can avoid duplicating data, minimize compute cost, and support analytics and AI across AWS and GCP without rebuilding their platform for each cloud.

In practice, the most resilient architectures make the fewest assumptions about where compute runs — and the strongest assumptions about how data is defined.
