DEV Community

Shir Meir Lador for Google AI

Originally published at cloud.google.com

Agent Factory Recap: Supercharging Agents on GKE with Agent Sandbox and Pod Snapshots

In the latest episode of the Agent Factory, Mofi Rahman and I had the pleasure of hosting Brandon Royal, the PM working on agentic workloads on GKE. We dove deep into how to choose the right agent runtime, the power of GKE for agents, and the security measures agents need in order to run code safely.

This post guides you through the key ideas from our conversation. Use it to quickly recap topics or dive deeper into specific segments with links and timestamps.

Why GKE for Agents?

Timestamp: 01:49

We kicked off our discussion by tackling a fundamental question: why choose GKE as your agent runtime when serverless options like Cloud Run or fully managed solutions like Agent Engine exist?

Brandon explained that the decision often boils down to control versus convenience. While serverless options are perfectly adequate for basic agents, the flexibility and governance capabilities of Kubernetes and GKE become indispensable in high-scale scenarios involving hundreds or thousands of agents. GKE truly shines when you need granular control over your agent deployments.

ADK on GKE

Timestamp: 06:58

We've discussed the Agent Development Kit (ADK) in previous episodes, and Mofi highlighted how seamlessly it integrates with GKE, demonstrating it with an agent he built. ADK provides the framework for building the agent's logic, traces, and tools, while GKE provides the robust hosting environment. You can containerize your ADK agent, push it to Google Artifact Registry, and deploy it to GKE in minutes, transforming a local prototype into a globally accessible service.
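To make the "containerize, push, deploy" flow a bit more concrete, here is a minimal sketch of the last step: generating a Kubernetes Deployment manifest for an agent image that has already been pushed to Artifact Registry. This is not the manifest from Mofi's demo. The function name, the agent name `adk-agent`, the port, and the image path (project `my-project`, repository `agents`) are all placeholder assumptions.

```python
import json


def build_agent_deployment(name: str, image: str, replicas: int = 2, port: int = 8080) -> dict:
    """Return an apps/v1 Deployment manifest for a containerized agent.

    Every value passed in here is illustrative; substitute your own.
    """
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the pod template labels below.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                        }
                    ]
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_agent_deployment(
        name="adk-agent",
        # Hypothetical image path in the Artifact Registry format
        # LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG
        image="us-central1-docker.pkg.dev/my-project/agents/adk-agent:v1",
    )
    print(json.dumps(manifest, indent=2))
```

Because kubectl accepts JSON as well as YAML, the output can be piped straight into `kubectl apply -f -`. Exposing the agent outside the cluster would additionally need a Service or Gateway, which this sketch leaves out.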

The Sandbox problem

Timestamp: 15:20

As agents become more sophisticated and capable of writing and executing code, a critical security concern emerges: the risk of untrusted, LLM-generated code. Brandon emphasized that while code execution is vital for high-performance agents and deterministic behavior, it also introduces significant risks in multi-tenant systems. This led us to the concept of a "sandbox."

What is a Sandbox?

Timestamp: 19:18

For those less familiar with security engineering, Brandon clarified that a sandbox provides kernel and network isolation. Mofi further elaborated, explaining that agents often need to execute scripts (e.g., Python for data analysis). Without a sandbox, a hallucinating or prompt-injected model could potentially delete databases or steal secrets if allowed to run code directly on the main server. A sandbox creates a safe, isolated environment where such code can run without harming other systems.

Agent Sandbox on GKE Demo

Timestamp: 20:25

So, how do we build this "high fence" on Kubernetes? Brandon introduced the Agent Sandbox on Kubernetes, which leverages technologies like gVisor, an application kernel sandbox. When an agent needs to execute code, GKE dynamically provisions a completely isolated pod. This pod operates with its own kernel, network, and file system, effectively trapping any malicious code within the gVisor bubble.
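For a rough idea of what a gVisor-isolated pod looks like at the manifest level, here is a sketch that builds a plain Pod spec requesting the `gvisor` runtime class, which is how GKE Sandbox node pools expose gVisor. This is an assumption-laden illustration of the underlying mechanism, not the Agent Sandbox API shown in the episode. The function name, pod name, and container image are hypothetical, and using "sandbox-executor" as a label value is our guess based on what was visible in the demo.

```python
import json
from typing import List


def build_sandbox_pod(name: str, image: str, command: List[str]) -> dict:
    """Return a v1 Pod manifest that asks to run under gVisor.

    Assumes the cluster has a node pool with GKE Sandbox enabled,
    which is what makes the "gvisor" RuntimeClass available.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "labels": {"app": "sandbox-executor"}},
        "spec": {
            # Schedule the pod onto gVisor-backed nodes.
            "runtimeClassName": "gvisor",
            # One-shot execution: do not restart the untrusted code.
            "restartPolicy": "Never",
            # Keep Kubernetes API credentials out of the sandbox.
            "automountServiceAccountToken": False,
            "containers": [
                {
                    "name": "executor",
                    "image": image,
                    "command": command,
                    "securityContext": {"allowPrivilegeEscalation": False},
                }
            ],
        },
    }


if __name__ == "__main__":
    pod = build_sandbox_pod(
        name="sandbox-executor-demo",  # hypothetical name
        image="python:3.12-slim",
        command=["python", "-c", "print('hello from the sandbox')"],
    )
    print(json.dumps(pod, indent=2))
```

Disabling the service account token and privilege escalation are defensive defaults we added for illustration; the episode does not say which settings the Agent Sandbox applies.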

Mofi walked us through a compelling demo of the Agent Sandbox in action. We observed an ADK agent being given a task requiring code execution. As the agent initiated code execution, GKE dynamically provisioned a new pod, visibly labeled as "sandbox-executor," demonstrating the real-time isolation. Brandon highlighted that this pod is configured with strict network policies, further enhancing security.
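The episode doesn't spell out the exact network policies, but one common way to get "strict" behavior is a default-deny NetworkPolicy scoped to the sandbox pods. The sketch below builds one. The policy name and the `app: sandbox-executor` label selector are assumptions, and the policy only takes effect on clusters where network policy enforcement is enabled.

```python
import json
from typing import Dict


def build_deny_all_policy(name: str, pod_labels: Dict[str, str]) -> dict:
    """Return a NetworkPolicy that blocks all ingress and egress
    for pods matching pod_labels.

    Listing both policy types while defining no ingress or egress
    rules means no traffic is allowed for the selected pods.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": name},
        "spec": {
            "podSelector": {"matchLabels": pod_labels},
            "policyTypes": ["Ingress", "Egress"],
        },
    }


if __name__ == "__main__":
    policy = build_deny_all_policy(
        name="sandbox-deny-all",  # hypothetical name
        pod_labels={"app": "sandbox-executor"},  # assumed label
    )
    print(json.dumps(policy, indent=2))
```

In practice you would likely add narrow egress rules on top of this, for example to let the sandbox return results to the agent, rather than leaving it fully cut off.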

The Future: Pod Snapshots

Timestamp: 29:39

While the Agent Sandbox offers incredible security, the latency of spinning up a new pod for every task is a concern. Mofi demoed the game-changing solution: Pod Snapshots. This technology allows us to save the state of running sandboxes and then near-instantly restore them when an agent needs them. Brandon noted that this reduces startup times from minutes to seconds, revolutionizing real-time agentic workflows on GKE.

Conclusion

It's incredible to see how GKE isn't just hosting agents; it's actively protecting them and making them faster.

Your turn to build

Ready to put these concepts into practice? Dive into the full episode to see the demos in action and explore how GKE can supercharge your agentic workloads.

Learn how to deploy an ADK agent to Google Kubernetes Engine and how to get your agent to run code safely using the GKE Agent Sandbox.

Top comments (6)

Ben Halpern

Nice stuff

Jon Gottfried

I've actually been trying to build this kind of workflow myself - tried a bunch of different tools out there for the sandbox VM management - curious if it'd be viable to use this kind of thing for long-running VMs or if it's more for ephemeral code execution?

Archit Mittal

Pod Snapshots solving the cold-start latency problem is the piece that makes this production-viable. The security story for sandboxed execution has been solid for a while (gVisor, Firecracker, etc.), but the performance penalty of spinning up isolated environments per-task made it impractical for anything interactive.

Going from minutes to seconds for sandbox restore changes the economics completely. In the automation workflows I build, the agents frequently need to execute generated Python scripts for data transformation — and the difference between a 2-second sandbox spin-up and a 60-second one is the difference between a viable product and an unusable one.

The question I'd push on: how does this handle state persistence across sandbox invocations? If an agent is doing iterative work (run code, check output, modify, run again), does each iteration need a fresh sandbox, or can you checkpoint and resume the same one? That iteration loop is where most real agent work happens, and the latency multiplier across 5-10 iterations would still be significant even with snapshots.

Also curious about the cost model for snapshot storage at scale — if you're running hundreds of agents each with their own snapshot, that's a non-trivial amount of state to persist.

Agent Work

Cool article. Agent Sandbox looks neat for GKE. I'm building a marketplace for AI tasks on Solana called AgentWork. We let users rent compute for specific tasks, which is kinda like what these agents are doing but with blockchain. Not sure how they'd fit in, but interesting stuff.
