Designing a Cross-Cloud Data Plane with Apache Iceberg

Andrew Kalik

Most organizations don’t deliberately choose to build multi-cloud data platforms.

They arrive there gradually — through acquisitions, organizational boundaries, and the reality that different teams and workloads gravitate toward different platforms. Over time, AWS and GCP both become part of the picture, whether that was the original plan or not.

The challenge isn’t the presence of multiple clouds.

The challenge is what happens to data once multiple clouds are part of the picture.

Rather than focusing on specific tools or implementations, this post shares a mental model for reasoning about cross-cloud data platforms, one that prioritizes cost discipline, simplicity, and long-term flexibility.


Why Multi-Cloud Is Often Unavoidable

Multi-cloud is rarely ideological.

It usually emerges from:

  • Independent teams choosing platforms that fit their needs
  • Mergers and acquisitions that bring existing cloud footprints
  • Organizational boundaries that resist forced standardization
  • Analytics and AI capabilities evolving at different speeds

In practice:

Most organizations are already multi-cloud long before they design for it.

Trying to undo that reality often leads to brittle mandates and slow delivery. A more durable approach is to design around multi-cloud instead of fighting it.


The Real Cost of Multi-Cloud Is Data Duplication

Where most multi-cloud data architectures struggle is not orchestration or tooling — it’s duplication.

The same dataset is often:

  • Ingested separately into AWS and GCP
  • Transformed independently in each environment
  • Stored in different formats
  • Reprocessed for analytics, applications, and AI

Each duplication multiplies:

  • Storage cost
  • Compute cost
  • Pipeline complexity
  • Operational risk

At scale, this becomes compounding waste.

Multi-cloud becomes expensive primarily when data is duplicated.

The alternative is to process data once and reuse it everywhere.


A Three-Plane Model for Cross-Cloud Data Platforms

To make this practical, it helps to step back and use a simple mental model that separates responsibilities into three planes, each with a distinct purpose.


The Data Plane: The Source of Truth

The data plane defines what the data is.

It includes:

  • How data is stored
  • How tables are structured
  • How schemas evolve
  • How versions and snapshots are managed

This plane should be:

  • Durable
  • Engine-agnostic
  • Slowly changing
  • Written once and reused many times

Apache Iceberg fits naturally here. It provides a stable, open table contract that works across object storage and compute engines, without binding data to a specific cloud or execution model.

The data plane is not optimized for speed — it is optimized for correctness and reuse.

This is what enables a true single source of truth — not by centralizing platforms, but by standardizing how data is defined and evolved.
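To make this concrete, here is a minimal sketch of opening the same Iceberg table from compute in either cloud through a shared catalog, using pyiceberg. The REST catalog endpoint and the analytics.events table are illustrative placeholders, not a prescribed setup:

```python
# Minimal sketch: open the same Iceberg table from AWS or GCP compute.
# The catalog URI and table name are hypothetical placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared",
    **{
        "type": "rest",
        "uri": "https://catalog.example.com",  # hypothetical shared catalog
    },
)

# The table contract (schema, partitioning, snapshots) lives in metadata
# on object storage; the engine reading it is interchangeable.
table = catalog.load_table("analytics.events")
print(table.schema())
print(table.current_snapshot())
```

Because the contract lives in table metadata rather than in any engine, the same call works regardless of where the compute runs.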


The Control Plane: Coordination Without Ownership

The control plane defines when and why work happens.

It includes:

  • Orchestration
  • Eventing
  • Scheduling
  • Governance hooks
  • Policy enforcement

Each cloud can have its own control plane. AWS and GCP do not need to share orchestration logic or operational workflows.

The critical constraint is this:

Control planes coordinate access to data, but they do not own it.

This keeps orchestration stateless, replaceable, and cloud-native, while the data plane remains stable.
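One way to picture this constraint is a stateless handler that passes references to data, never the data itself. Everything below, including the event shape and the submit_job hook, is a hypothetical sketch rather than a specific cloud API:

```python
# Hypothetical sketch of a stateless control-plane handler
# (think Lambda or Cloud Function). It coordinates work by passing
# a table reference and a snapshot pin -- it never copies rows.
import json

def submit_job(spec: str) -> dict:
    # Placeholder for a cloud-native invocation (Step Functions,
    # Cloud Run jobs, etc.); here it simply echoes the spec.
    return {"submitted": json.loads(spec)}

def handle_event(event: dict) -> dict:
    job_spec = {
        "table": event["table"],              # e.g. "analytics.events"
        "snapshot_id": event["snapshot_id"],  # pins a consistent Iceberg read
        "action": "refresh_aggregates",       # illustrative action name
    }
    return submit_job(json.dumps(job_spec))
```

Because the handler holds no data and no state, either cloud can swap out its control plane without touching the tables.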


The Consumption Plane: Execution and Experience

The consumption plane defines how data is used.

It includes:

  • Analytics and querying
  • Applications
  • Feature extraction
  • Machine learning and AI workloads

This plane is intentionally:

  • Ephemeral
  • Cost-variable
  • Optimized for workload needs
  • Free to evolve independently

Serverless execution fits naturally here. Compute spins up only when needed, processes a slice of data, and shuts down.
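A minimal sketch of that lifecycle, assuming pyiceberg and the same hypothetical catalog as earlier, might look like this:

```python
# Sketch of an ephemeral consumer: start, scan one slice of the shared
# table, return a result, disappear. Filter and fields are illustrative.
from pyiceberg.catalog import load_catalog

def handler(event: dict) -> int:
    catalog = load_catalog(
        "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
    )
    table = catalog.load_table("analytics.events")

    # Read only the slice this invocation needs; predicate and column
    # pushdown keep the serverless footprint small.
    scan = table.scan(
        row_filter=f"event_date = '{event['date']}'",
        selected_fields=("user_id",),
    )
    return scan.to_arrow().num_rows  # compute vanishes after returning
```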

Compute should be temporary. Data should not be.


Apache Iceberg as a Shared Cross-Cloud Data Plane

By using Apache Iceberg as the data plane, AWS and GCP can evolve independently while relying on the same underlying data contract.

Iceberg allows:

  • Data to be processed once
  • Schemas to evolve without rewrites
  • Snapshots to support consistent reads
  • Multiple consumers across clouds
  • Object storage to remain the system of record

The clouds don’t need shared pipelines.

They need shared tables.
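Schema evolution is a good illustration of why shared tables are enough: in Iceberg, adding a column is a metadata-only commit, so no data files are rewritten and existing readers are unaffected. A minimal pyiceberg sketch, with an illustrative column name:

```python
# Sketch: evolve the shared table's schema in place. This commits new
# metadata; no existing data files are rewritten.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType

catalog = load_catalog(
    "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
)
table = catalog.load_table("analytics.events")

with table.update_schema() as update:
    update.add_column("campaign_id", StringType(), doc="illustrative new column")
```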


Processing Data Once Is the Biggest Cost-Reduction Lever

Without a shared data plane:

  • Each cloud processes the same raw data
  • Each environment runs its own transformations
  • Each platform retrains AI models independently
  • Compute cost scales with the number of clouds

With a shared data plane:

  • Data is transformed once
  • Snapshots are reused across consumers
  • Incremental processing minimizes rework
  • Serverless compute stays small and targeted

Processing a dataset once and reusing it across analytics, applications, and AI workloads is one of the most effective ways to reduce cost in cross-cloud data platforms.

Every additional cloud, engine, or workload that reuses that same processed dataset benefits from this decision without incurring proportional cost.

This is architectural efficiency, not after-the-fact optimization.
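As a sketch of what that reuse can look like in practice, Iceberg's Spark source supports incremental reads between snapshots, so a downstream job touches only newly appended rows instead of rescanning the table. The snapshot IDs and table names below are placeholders, and this incremental mode covers append-only changes:

```python
# Sketch: incremental processing with Spark's Iceberg source -- read only
# the rows appended between two snapshots. IDs and names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-refresh").getOrCreate()

new_rows = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "5179299526185056830")  # last processed (exclusive)
    .option("end-snapshot-id", "5694547430718026745")    # current snapshot
    .load("shared.analytics.events")
)

# Fold only the new rows into a downstream table.
new_rows.writeTo("shared.analytics.daily_aggregates").append()
```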


Where AI Fits

AI makes architectural efficiency non-negotiable, because the cost of duplicated data shows up fastest in training, retraining, and experimentation.

AI does not require a separate plane.

It spans all three:

  • The data plane provides training data and historical snapshots
  • The control plane governs training and retraining
  • The consumption plane handles inference and interaction

Training on the same data multiple times across clouds is expensive and unnecessary.

A shared data plane reduces that pressure by design.
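For example, pinning training reads to a specific Iceberg snapshot makes a training set reproducible from either cloud; the snapshot ID below is a placeholder, recorded alongside the model artifact:

```python
# Sketch: reproducible training data via Iceberg time travel. Any engine,
# in any cloud, reading this snapshot sees an identical table state.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "shared", **{"type": "rest", "uri": "https://catalog.example.com"}
)
table = catalog.load_table("analytics.events")

TRAINING_SNAPSHOT_ID = 5179299526185056830  # placeholder; store with the model

training_df = table.scan(snapshot_id=TRAINING_SNAPSHOT_ID).to_pandas()
```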


Tradeoffs and Reality Checks

This approach does not eliminate complexity entirely.

Teams still need to manage:

  • Catalog consistency
  • Identity and access boundaries
  • Feature differences across execution engines
  • Cross-cloud networking considerations

These are governance and coordination problems — not data duplication problems — and they scale far better than parallel pipelines.


When This Pattern Makes Sense

Strong fit

  • Organizations operating in AWS and GCP
  • Shared analytical and AI datasets
  • Cost-sensitive platforms
  • Serverless-first execution models

Less ideal

  • Ultra-low-latency streaming
  • Workloads tightly coupled to proprietary execution features
  • Single-cloud environments with no external consumers


Looking Ahead: Cloud Interconnect as the Final Enabler

One of the most exciting developments for cross-cloud data architectures is the continued maturation of private cloud interconnect between AWS and GCP.

Interconnect transforms cross-cloud connectivity from a workaround into a first-class architectural feature. It provides a private, predictable network path that avoids the public internet entirely, improving not just performance, but security and control.

As interconnect becomes more accessible:

  • Cross-cloud data access becomes more deliberate and auditable
  • Serverless consumption across clouds becomes more practical
  • Data no longer needs to be duplicated simply to feel “close” or “safe”

This is where the three-plane model fully comes together. A shared data plane backed by Iceberg, independent control planes per cloud, and ephemeral consumption planes can operate across platforms with confidence.

Instead of copying data defensively, teams can design for access intentionally — reducing cost, tightening security boundaries, and simplifying how data moves between clouds.

It’s one of the clearest signals that cross-cloud data architectures are moving from workaround to first-class design.

Interconnect doesn’t change the need for good architecture.

It rewards it.


Closing Thoughts

Multi-cloud does not require identical architectures.

It requires:

  • A shared data plane
  • Independent control planes
  • Ephemeral, serverless consumption

By treating Apache Iceberg as the data contract, teams can avoid duplicating data, minimize compute cost, and support analytics and AI across AWS and GCP without rebuilding their platform for each cloud.

In practice, the most resilient architectures make the fewest assumptions about where compute runs — and the strongest assumptions about how data is defined.
