brandon0405


Your AWS Data Can Now Power Google AI — No Migration Required: Inside Google Cloud's Cross-Cloud Lakehouse

Google Cloud NEXT '26 Challenge Submission

This is a submission for the Google Cloud NEXT Writing Challenge


The Problem Every Data Engineer Knows Too Well

Picture this: your company has years of carefully curated data sitting in Amazon S3. It powers your dashboards, your pipelines, your ML models. Then someone in leadership asks: "Can we run this through Google's AI?"

And you already know what that means. Migration. Weeks of ETL work. Egress fees that make your stomach drop. Data duplication across clouds. Governance nightmares. The risk of breaking something that's already working.

For years, the cloud data world operated on an unspoken rule: pick your cloud and stay there. Moving between providers wasn't impossible — it was just painful enough that most teams didn't bother.

At Google Cloud NEXT '26, that rule changed. Google announced the Cross-Cloud Lakehouse, built on Apache Iceberg, and it might be the most underrated announcement of the entire event — especially if you care about where AI is actually heading.


What Is the Cross-Cloud Lakehouse?

Before diving into the announcements, some quick context: as of April 20, 2026, Google renamed BigLake to Google Cloud Lakehouse, and BigLake Metastore is now the Lakehouse Runtime Catalog. If you've worked with BigLake before, it's the same APIs and CLI commands — just a new name that better reflects what it actually does.

The Cross-Cloud Lakehouse extends Google Cloud Lakehouse to let you query data in AWS (and Azure, coming later this year) directly from Google Cloud using BigQuery, Dataproc, and Apache Spark — without migrating your data or building complex ETL pipelines.

It works in two layers:

Metadata layer: Your remote Apache Iceberg catalogs (like Databricks Unity Catalog or AWS Glue) connect to Google Cloud Lakehouse, which discovers your data without copying any files and authenticates securely through Workload Identity Federation, so no long-lived access keys are needed.

Transport layer: Google integrates Cross-Cloud Interconnect (CCI) directly into the data plane. By combining CCI's dedicated private networking with Apache Iceberg REST Catalog, cross-cloud queries run with low latency and without the massive egress fees you'd normally pay routing traffic over the public internet.

The result: your agents and analysts can query data in AWS S3 as if it were sitting right there in Google Cloud.


The 4 Announcements That Actually Matter

Google announced its next-generation Cross-Cloud Lakehouse with four concrete breakthroughs. Let me break each one down beyond the keynote surface level.

1. Fully Managed Iceberg Storage with Real Interoperability

This one matters more than it sounds. Previously, if you used Apache Spark for ETL into Iceberg REST Catalog tables, you couldn't write through BigQuery or use its storage management features. You had to choose one or the other.

Now there's true read/write interoperability between BigQuery and Managed Service for Apache Spark, along with other Iceberg-compatible engines like Trino and Flink, plus third-party platforms like Databricks and Snowflake (in Preview). One copy of your data, multiple engines, no compromises.

2. Cross-Cloud Caching: The Feature Nobody's Talking About

This is the one that makes cross-cloud economically viable. Google introduced an intelligent cache that stores cross-cloud data on the first read, slashing egress fees and dramatically accelerating follow-on queries for your AWS and Azure data.

In plain terms: the first time you query your S3 data from BigQuery, it gets cached on Google's side. Every subsequent query is fast and cheap. The penalty for cross-cloud access shrinks to nearly nothing after that first read.
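The economics here are easy to model. Below is a toy Python sketch of that caching logic, using invented prices (the 9¢/GB egress rate and the zero-cost cached read are illustrative assumptions, not published pricing from Google or AWS), purely to show how the cost curve collapses after the first read:

```python
# Toy model of cross-cloud caching economics. All rates are invented
# illustrative assumptions, not Google's or AWS's published pricing.
EGRESS_CENTS_PER_GB = 9  # hypothetical egress fee on a cache miss

def query_cost_cents(scan_gb: int, cached: bool) -> int:
    """Cost (in cents) of one query scanning `scan_gb` of S3-resident data."""
    return 0 if cached else scan_gb * EGRESS_CENTS_PER_GB

def monthly_cost_cents(scan_gb: int, queries: int, caching: bool) -> int:
    """Total cost for `queries` identical queries in a month."""
    if not caching:
        return queries * query_cost_cents(scan_gb, cached=False)
    # With caching: the first read pays egress, the rest hit the cache.
    first = query_cost_cents(scan_gb, cached=False)
    rest = (queries - 1) * query_cost_cents(scan_gb, cached=True)
    return first + rest

# 500 GB scanned by 100 identical queries in a month:
print(monthly_cost_cents(500, 100, caching=False))  # 450000 cents ($4,500)
print(monthly_cost_cents(500, 100, caching=True))   # 4500 cents ($45)
```

With the cache, egress is paid once instead of a hundred times. The real ratio will depend on cache hit rates, invalidation when tables change, and actual pricing, none of which this toy model captures.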

3. Lightning Engine for Apache Spark: Up to 4.5x Faster

Google's Lightning Engine is a real-time, serverless Spark engine that delivers up to 4.5x faster performance than open-source Spark alternatives, and up to 2x better price-performance over the leading proprietary competitor for large datasets.

Flipkart, Lowe's, and Meesho are already accelerating their Apache Spark workloads with it. That's not a beta experiment — that's production scale.

4. An Estimated 117% ROI in Under 6 Months

Google's own analysis puts the estimated ROI of this agentic-first lakehouse approach at 117%, with payback in under six months. Spotify is already unlocking innovation with it.

Take vendor-published ROI numbers with appropriate skepticism — but the underlying logic is sound. If you eliminate data movement costs, reduce ETL complexity, and let multiple engines share a single data copy, the math does work in your favor.
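For readers who want to see the mechanics behind such a claim, here's a minimal Python sketch of the ROI and payback arithmetic. The dollar amounts are invented for illustration; only the formulas are standard, and none of this reproduces Google's actual model:

```python
# Illustrative ROI and payback arithmetic. The dollar figures are made
# up; the 117% figure cited above comes from Google's own analysis.
def roi_percent(total_benefit: float, total_cost: float) -> float:
    """ROI = (benefit - cost) / cost, expressed as a percentage."""
    return (total_benefit - total_cost) * 100 / total_cost

def payback_months(upfront_cost: float, monthly_savings: float) -> float:
    """Months until cumulative savings cover the upfront cost."""
    return upfront_cost / monthly_savings

# Hypothetical: $100k to adopt, $217k of total benefit over the period,
# $20k/month in eliminated egress and ETL costs.
print(roi_percent(217_000, 100_000))    # 117.0
print(payback_months(100_000, 20_000))  # 5.0
```

The takeaway isn't the specific numbers; it's that payback under six months only requires monthly savings to be a modest fraction of the adoption cost, which is plausible if egress and ETL really do go away.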


The Angle Most People Are Missing: This Is About AI Agents, Not Just Data

Here's where I think the real story is, and why this announcement deserves more attention than it's getting.

Everyone at NEXT '26 is talking about Gemini Enterprise Agent Platform, Agent Studio, and agentic workflows. But there's a foundational problem with all of that: an AI agent is only as intelligent as the data it can access.

If your agent hits a cross-cloud wall — high latency, expensive egress, proprietary catalog lock-in — its autonomy is broken. It can't reason across your full data estate. It can only see the slice of data you've managed to centralize, which in most enterprises is a fraction of the whole.

The Cross-Cloud Lakehouse isn't a data feature. It's the infrastructure layer that makes truly capable AI agents possible in multi-cloud enterprises — which is to say, virtually every real enterprise.


What It Looks Like in Practice

You don't need a Google Cloud account to follow this. Here's what a cross-cloud query actually looks like once you've set up federation:

```sql
SELECT user_id, action, COUNT(*) AS total_actions
FROM `your-project.federated_aws_catalog.your_namespace.your_table`
WHERE event_date >= '2026-04-01'
GROUP BY 1, 2;
```

That's standard BigQuery SQL. The table in that query physically lives in Amazon S3. No data moved. No migration. No special connector to manage. Google Cloud Lakehouse handles the metadata translation and secure data access transparently.

You can also read Cross-Cloud Lakehouse data directly from Apache Spark clusters without managing separate AWS credentials or S3 connectors — Lakehouse automatically provides temporary, scoped S3 credentials to Spark through the Iceberg REST Catalog interface.
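For the curious, here's roughly what that wiring looks like on the Spark side. This is a hedged sketch: the `spark.sql.catalog.*` property names are standard Apache Iceberg settings for Spark, but the catalog name and endpoint URI below are hypothetical placeholders, so check Google's documentation for the real values:

```python
# Sketch of Spark configuration for an Iceberg REST catalog. The
# spark.sql.catalog.* property names are standard Apache Iceberg Spark
# settings; the catalog name and endpoint URI are hypothetical
# placeholders, not confirmed Google endpoints.
CATALOG = "lakehouse"  # arbitrary local name you choose for the catalog

spark_conf = {
    f"spark.sql.catalog.{CATALOG}": "org.apache.iceberg.spark.SparkCatalog",
    f"spark.sql.catalog.{CATALOG}.type": "rest",
    f"spark.sql.catalog.{CATALOG}.uri": "https://example.googleapis.com/iceberg/v1/restcatalog",
}

# Applied to a SparkSession, a cross-cloud read then looks like plain SQL:
#   spark.sql(f"SELECT * FROM {CATALOG}.your_namespace.your_table")
# The REST catalog vends temporary, scoped S3 credentials, so the Spark
# job never holds long-lived AWS keys.
for key, value in spark_conf.items():
    print(f"{key}={value}")
```

The interesting part is what's absent: no `fs.s3a` access keys, no bundled S3 connector configuration. Credential vending through the catalog is what makes that possible.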


My Take

As a final-year IT Engineering student, I've spent a fair amount of time experimenting with both AWS and Google Cloud under their free tiers — and honestly, the experience has been mostly positive. Both platforms are incredibly powerful for learning, and the free tier generosity means you can build real things without spending a cent. That said, I learned the hard way that "free tier" requires constant attention. I once left several AWS services running after a learning project, forgot about them, and woke up to a bill of over $120. AWS did refund me after I explained the situation, but that moment of panic stuck with me. The anxiety of not knowing exactly what's running, and what it's costing, is real — especially when you're a student with no budget.

That's why the cost angle of this Cross-Cloud Lakehouse announcement stands out to me the most. The cross-cloud caching feature in particular — where data from S3 is cached on Google's side after the first read, dramatically reducing egress fees on subsequent queries — is the kind of thing that changes the math for smaller teams and learners, not just enterprise giants. Egress fees are one of the most frustrating hidden costs in cloud, and the fact that Google is tackling that directly instead of just promising "seamless interoperability" is meaningful.

My honest skepticism? I'd want to see real benchmarks from teams outside of Google's own case studies before trusting this in a production environment. The 117% ROI figure is compelling on paper, but vendor-published numbers always deserve scrutiny. I'd also want to understand the failure modes — what happens when the cross-cloud connection has latency spikes, or when the cache goes stale? For learning and experimentation, this looks genuinely exciting. For production at scale, I'd want at least six months of community battle-testing first.


What's Coming Next (And Why It Matters)

The ecosystem is already bigger than most people realize: the Cross-Cloud Lakehouse supports bi-directional federation with Databricks, Oracle Autonomous Database, Snowflake, SAP, Salesforce Data360, ServiceNow, Workday, and more. Azure support arrives later this year.

Catalog federation is also launching in Preview for AWS Glue, Databricks, SAP, and Snowflake — with Confluent Tableflow support coming soon.

This isn't Google building a feature. This is Google positioning itself as the analytical brain of a multi-cloud world — a place where your data stays wherever it is, but your AI runs on Google's infrastructure.

Whether that bet pays off depends on adoption and performance at scale, which we'll know more about in the months ahead. But the architectural direction is clear, and the announcement at NEXT '26 was the starting gun.

The real test will be whether enterprises trust Google enough to let it be the query layer over their AWS data. What do you think?

