This is a submission for the Google Cloud NEXT Writing Challenge
Let me ask you something uncomfortable.
Your AI agent just decided it needs to run some code. Maybe it's executing a tool call. Maybe it's spinning up a subprocess to parse a file. Maybe it's doing something you didn't fully anticipate when you wrote the system prompt.
What's actually stopping it from doing rm -rf /?
I've been building with AI agents for a while now. And every time I give one tool access (real tool access, not a toy demo) I feel this low-level dread that I can't quite shake. You're handing an LLM the keys to your infrastructure and saying "please be careful." That's not security. That's hope.
At Google Cloud NEXT '26, buried under 260 announcements and a tsunami of Gemini news, Google quietly dropped something I think is the most important infrastructure announcement for anyone actually trying to ship agents to production. Almost nobody is talking about it.
The Problem Nobody Wants to Admit
Here's the uncomfortable truth about the current state of agentic AI:
We're giving agents power we have no safe way to contain.
When an agent executes a tool call (running shell commands, writing files, calling APIs, spawning subprocesses) that code runs somewhere. In most setups today, that somewhere is a container you're also using for other things. Or worse, directly on a VM. The isolation story is basically: "we trust the model not to do something bad."
That's fine for demos. It is catastrophically not fine for production.
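To make that concrete, here's a sketch of what "no isolation" usually looks like in practice: a tool executor that hands model output straight to a host shell. This is my own illustration, not code from any real framework; TOOL_CALL stands in for whatever the model emitted.

```bash
#!/usr/bin/env bash
# Naive tool execution, as found in many agent demos: the string the
# model produced is run directly on the host, with the full permissions
# of the agent process. TOOL_CALL is a stand-in for arbitrary model output.
TOOL_CALL="echo hello from the host"

# Nothing validates, contains, or audits the command before it runs.
# Swap TOOL_CALL for "rm -rf /" and this script would happily comply.
bash -c "$TOOL_CALL"
```

The fix isn't a cleverer allowlist around that `bash -c`; it's making the environment the command runs in disposable and isolated, which is exactly the layer Agent Sandbox targets.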
The problem compounds fast:
- Multi-tenant agentic apps: one user's agent could theoretically interfere with another's execution environment
- Autonomous agents: the longer the agent runs without human oversight, the more code it executes, the larger the blast radius of any mistake
- Third-party tools: you're not just running your own code. Agents call external tools, MCP servers, APIs that return arbitrary payloads
Every serious team I've talked to has some version of this conversation: "We love the agent prototype. We're terrified to put it in front of real users."
The gap between "it works in a notebook" and "I'd stake my job on this in prod" is almost entirely a trust and isolation problem.
What Google Actually Announced (And Why It Matters)
At NEXT '26, Google announced GKE Agent Sandbox, now Generally Available.
Here's the one-line version: isolated, disposable execution environments purpose-built for AI agents to run untrusted code and tool calls, without touching anything they shouldn't.
But the details are where it gets interesting.
The isolation layer: GKE Agent Sandbox is built on gVisor, kernel-level sandboxing that intercepts system calls before they reach the host. This is the same technology Google uses to secure Gemini itself. Not a new experiment. Battle-tested infrastructure repurposed for agent workloads.
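If you've never poked at gVisor, there's a quick way to see that interception for yourself (my aside, assuming you have Docker plus gVisor's runsc runtime installed locally): the container no longer talks to the host kernel, it talks to gVisor's user-space kernel, which announces itself in the container's boot log.

```bash
# Under the default runc runtime, dmesg shows the host kernel's log.
# Under gVisor's runsc runtime, the container sees an emulated kernel
# whose boot log opens with a "Starting gVisor..." banner instead.
docker run --rm --runtime=runsc alpine dmesg | head -n 3
```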
The scale numbers:
- 300 sandboxes launched per second, per cluster
- Sub-second time to first instruction
- Each sandbox is stateful and single-replica, purpose-built for agent runtimes, not retrofitted from generic container tooling
The cost angle: Running on Google Axion N4A instances, GKE Agent Sandbox delivers up to 30% better price-performance than the next leading hyperscaler for these workloads. Google also claims it's the only native agent sandbox service among major hyperscale cloud providers right now.
Real-world proof: Lovable, the AI app builder where users generate 200,000+ new projects daily, runs its AI-generated code in GKE Agent Sandboxes precisely because of the fast startup and secure isolation at scale. That's not a benchmark. That's production traffic.
My First Look: What Setup Actually Feels Like
I'll be honest with you: I'm not a GCP power user. I've deployed things to Google Cloud before, but GKE isn't where I spend most of my time. So I went in as someone who needed to be convinced, not someone already bought in.
The GKE Agent Sandbox quickstart gets you to a running sandbox faster than I expected. The core concept to understand is that Agent Sandbox introduces a new node pool type. You're not modifying your existing cluster, you're adding sandbox-capable nodes alongside it.
A minimal setup looks something like this:
```bash
# Create a GKE cluster, then add a gVisor-sandboxed node pool alongside it
gcloud container clusters create agent-cluster \
    --location=us-central1

gcloud container node-pools create agent-sandbox-pool \
    --cluster=agent-cluster \
    --location=us-central1 \
    --machine-type=n4a-standard-4 \
    --sandbox type=gvisor

# Verify sandbox nodes are ready
kubectl get nodes -l sandbox.gke.io/runtime=gvisor
```
And a Pod spec that opts into sandbox isolation:
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: agent-sandbox-pod
spec:
  runtimeClassName: gvisor
  containers:
  - name: agent-runtime
    image: your-agent-image:latest
```
The runtimeClassName: gvisor is doing the heavy lifting. That's the line that says "run this in kernel-level isolation, not the default container runtime."
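One sanity check worth knowing (my addition, not from the quickstart): GKE provisions the gvisor RuntimeClass for you once a sandbox-enabled node pool exists, so you can confirm it's there before scheduling Pods against it.

```bash
# The gvisor RuntimeClass is created automatically on clusters with a
# sandbox-enabled node pool. If this errors, a Pod that requests
# runtimeClassName: gvisor will sit Pending with a scheduling failure.
kubectl get runtimeclass gvisor
```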
What impressed me: The mental model is clean. You don't redesign your agent. You change where it runs. The sandbox is the infrastructure layer, not the application layer. That's the right abstraction.
What I'm still figuring out: Performance characteristics for longer-running agents. The sub-second startup is great for short, sharp tool calls. I want to understand the overhead profile for agents that hold state across a 10-minute autonomous task. The docs are sparse on this specific pattern and it's an area I'd want to benchmark before committing to it for anything complex.
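Pending better docs, you can get a rough number yourself. This is my own crude sketch, assuming the agent-sandbox-pod spec from earlier is saved as pod.yaml; it measures Kubernetes scheduling and gVisor startup together, so treat it as an upper bound on sandbox cold start, not a precise benchmark.

```bash
# Rough time-to-ready for a sandboxed Pod: wall-clock time from
# kubectl apply to the Pod reporting Ready.
start=$(date +%s%N)
kubectl apply -f pod.yaml
kubectl wait --for=condition=Ready pod/agent-sandbox-pod --timeout=120s
end=$(date +%s%N)
echo "time to ready: $(( (end - start) / 1000000 )) ms"
```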
What's genuinely missing: Better local development tooling. Right now the sandbox story is cloud-first. If you want to develop and test against sandbox behavior locally before pushing to GKE, you're installing gVisor's runsc runtime and wiring it into Docker yourself. For a GA product aimed at production agent workloads, I'd expect a docker run --runtime=runsc equivalent that just works out of the box in a dev environment.
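For reference, here's roughly what that manual setup looks like today. This is my sketch of gVisor's own install flow, assuming Linux on x86_64 or arm64 with Docker; the release bucket and binaries are gVisor's, not part of the GKE product.

```bash
# Fetch the runsc binary and containerd shim from gVisor's release bucket
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
wget ${URL}/runsc ${URL}/containerd-shim-runsc-v1
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin

# Register runsc as a Docker runtime (edits /etc/docker/daemon.json)
sudo /usr/local/bin/runsc install
sudo systemctl restart docker

# From here on, any container can opt into gVisor isolation:
docker run --rm --runtime=runsc hello-world
```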
Why This Beats the Hype
Everyone at NEXT '26 is talking about Gemini. The Gemini Enterprise Agent Platform. The Agent Development Kit. The 260 announcements. The keynote moments.
And those things matter! ADK is genuinely useful. The agent governance story is real progress.
But here's the question nobody is asking loudly enough:
Where do your agents actually run?
Not conceptually. Not architecturally. Physically: what process, on what machine, with what isolation, is executing the tool calls your agent makes 50 times per conversation?
You can have the most sophisticated agent framework in the world. You can have perfect prompt engineering. You can have great evals. And if the execution environment is a shared container with no isolation, you have a security hole you're one bad tool call away from falling through.
GKE Agent Sandbox is unsexy. It won't make it into the keynote highlight reel. It doesn't have a demo that looks good on a conference screen.
It's also the thing that makes the difference between "we have a cool agent demo" and "we have agents in production that we'd stake our reputation on."
The analogy I keep coming back to: in the early days of web hosting, everyone focused on the application. The features, the UI, the database schema. And then a few people started asking hard questions about process isolation, about what happens when one customer's app crashes, about whether shared hosting was actually safe. The answer to those questions became the entire cloud industry.
We're at that moment for agents. The application layer is getting a lot of attention. The isolation layer is getting almost none. GKE Agent Sandbox is Google's answer to that gap, and right now it's the only hyperscaler answer on the market.
My Honest Take
GKE Agent Sandbox is solving the right problem. The gVisor foundation is solid. This isn't a new security model, it's proven technology applied to a new workload pattern. The numbers (300 sandboxes/second, sub-second cold start) are impressive if they hold up in real heterogeneous workloads, not just benchmarks.
My critique: Google announced this quietly, and that's a mistake.
If you're an enterprise team evaluating whether to actually ship an agentic product (not prototype one, actually ship it) execution isolation is probably your #1 blocker after accuracy. GKE Agent Sandbox should have been a top-3 NEXT announcement, not announcement #37 in a blog post.
The other thing I'll say: the "industry's only native sandbox among hyperscalers" claim won't last long. AWS and Azure are both watching this. The fact that Google is first matters, but the window to build ecosystem lock-in is probably 12-18 months.
If you're building agents seriously, go look at this now.
What's the Cloud NEXT '26 announcement you think is most underrated? Drop it in the comments, genuinely curious what else got buried in the 260-item list.
