The AI industry has crossed an inflection point. We stopped asking "can the model answer my question?" and started asking "can the system complete my goal?" That shift from inference to agency changes everything about how we build, deploy, and scale AI in the cloud.
Google Kubernetes Engine (GKE) has quietly become the platform of choice for teams running production AI workloads. Its elastic compute, GPU node pools, and rich ecosystem of observability tools make it uniquely suited not just for model serving but for the orchestration challenges that agentic AI introduces.
This blog walks through the full landscape: what kinds of AI systems exist today, how agentic architectures differ, and what it actually looks like to run them reliably on GKE.
The AI Taxonomy: From Reactive to Autonomous
Before diving into infrastructure, it's worth establishing what we mean by the different modes of AI deployment. Not all AI is "agentic," and the architecture you choose should match the behavior you need.
Reactive / Inference
Stateless prompt-response. One request, one LLM call, one answer. The model has no memory between turns. Examples: text classifiers, summarizers, one-shot code generators.
Conversational AI
Multi-turn dialog with session state. The model remembers context within a conversation window. Examples: customer support bots, document Q&A, coding assistants.
Retrieval-Augmented (RAG)
The model can query external knowledge at runtime before generating a response. This introduces a retrieval step: vector DBs, semantic search, tool calls to databases.
Agentic AI
The model plans, takes actions, observes results, and loops until a goal is reached. It can call tools, spawn subagents, and make decisions across many steps autonomously.
Multi-Agent Systems
A network of specialized agents collaborating: an orchestrator decomposes a task and delegates to researcher, writer, executor agents that work in parallel or sequence.
Each step up this stack introduces new infrastructure requirements: more state to manage, longer-lived processes, more concurrent workloads, harder failure modes, and deeper observability needs.
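The multi-agent mode at the top of this taxonomy can be sketched in a few lines of Python. This is an illustrative skeleton only: the specialist agents here are plain functions, whereas in a real deployment each would be its own containerized workload.

```python
# Orchestrator pattern sketch: decompose a task, delegate subtasks to
# specialist agents, then assemble the results. Agents are stand-in
# functions here; in production each would run as a separate GKE workload.

SPECIALISTS = {
    "research": lambda task: f"notes on {task}",
    "write":    lambda notes: f"draft based on {notes}",
}

def orchestrate(task):
    notes = SPECIALISTS["research"](task)   # researcher agent gathers material
    draft = SPECIALISTS["write"](notes)     # writer agent produces output
    return draft
```

In practice the orchestrator would dispatch these calls over the network (or via a queue) and could run independent subtasks in parallel; the sequential form above shows only the decompose-and-delegate shape.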
Why GKE for AI Workloads?
Kubernetes is table stakes for any modern distributed system. But GKE specifically brings several features that make it exceptional for AI:
GKE Capabilities for AI
GPU and TPU Node Pools
To handle the heavy lifting of agentic AI, GKE offers specialized accelerator node pools. This infrastructure lets you dynamically attach high-end compute resources, such as NVIDIA A100, H100, or L4 GPUs and Google TPUs, exactly when your agents need them.
Workload Identity & Secret Management
Agentic systems touch many external APIs (databases, external services, third-party tools). Workload Identity Federation lets pods authenticate to Google Cloud services without storing long-lived credentials.
Horizontal Pod Autoscaling with Custom Metrics
Scale agent runner replicas based on queue depth (Pub/Sub backlog, Redis list length) rather than CPU. This allows demand-driven scaling that matches agent workload patterns precisely.
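The scaling math behind this is the standard Kubernetes HPA formula, applied to an external metric such as Pub/Sub backlog. The real configuration is a YAML `HorizontalPodAutoscaler` with a custom-metrics adapter; the sketch below just shows the arithmetic the controller performs.

```python
import math

def desired_replicas(current_replicas, queue_depth, target_per_replica):
    """Kubernetes HPA scaling formula applied to a queue-depth metric:
    desired = ceil(current * currentMetric / targetMetric).
    Here currentMetric is the per-replica share of the queue backlog."""
    current_metric = queue_depth / current_replicas
    return math.ceil(current_replicas * current_metric / target_per_replica)
```

For example, 3 replicas facing a backlog of 300 messages with a target of 50 messages per replica scale out to 6 replicas; a backlog of 10 scales the same deployment down toward 1.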
GKE Autopilot & Standard Modes
Autopilot mode handles node management entirely, ideal for teams wanting to focus on agent logic. Standard mode gives full control when you need custom kernel modules or specialized hardware affinity rules.
Cloud Run on GKE for Burst Workloads
Short-lived tool execution steps in an agent pipeline can be offloaded to Cloud Run, which scales to zero between invocations, avoiding the overhead of always-on Kubernetes pods for infrequent tasks.
Anatomy of an Agentic AI System
An agentic AI system isn't a single process; it's a distributed workflow. Understanding its components is essential before mapping it onto Kubernetes primitives.
"An agent is an LLM that can observe the world, decide what to do next, and take actions - in a loop, until a goal is satisfied."
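That observe-decide-act loop can be made concrete in a few lines of Python. Everything here is a stand-in: `call_llm` mocks a real model endpoint, and the tool registry holds a single fake tool, but the control flow is the essential shape of an agent.

```python
# Minimal agentic loop: plan, act, observe, repeat until the goal is met.
# call_llm is a stand-in for a real model endpoint; a production agent
# would prompt an LLM with the goal and the observation history.

def call_llm(goal, history):
    # Hypothetical planner: finish once a search has been performed.
    if any(step["tool"] == "search" for step in history):
        return {"action": "finish", "answer": "summary of findings"}
    return {"action": "search", "input": goal}

TOOLS = {"search": lambda q: f"top results for {q!r}"}

def run_agent(goal, max_steps=10):
    history = []
    for _ in range(max_steps):            # hard step budget: agents must terminate
        decision = call_llm(goal, history)
        if decision["action"] == "finish":
            return decision["answer"]
        observation = TOOLS[decision["action"]](decision["input"])   # act
        history.append({"tool": decision["action"], "result": observation})  # observe
    raise RuntimeError("step budget exhausted")
```

Note the `max_steps` budget: an unbounded loop is the first operational hazard of agentic systems, and every production runner needs an equivalent circuit breaker.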
Popular Agentic Frameworks on GKE
Several frameworks have emerged to help teams build agentic systems without reinventing the orchestration wheel. Each has a different philosophy and maps to GKE differently.
Agent Development Kit (ADK)
Google's native framework for building multi-agent systems on Vertex AI. First-class GKE support, tight Gemini integration, built-in evaluation tools. Best choice for teams already on Google Cloud.
LangGraph
Graph-based agent orchestration with explicit state machines. Excellent for complex branching workflows. Containerizes cleanly. LangSmith provides tracing that integrates with GKE logging pipelines.
CrewAI
Defines agents as role-playing entities (Researcher, Writer, Editor) with goals and backstories. Simple to model complex human workflows. Ideal for content, analysis, and research pipelines.
Google ADK on GKE: A Native Fit
The Google Agent Development Kit (ADK) treats Kubernetes as a first-class deployment target. Because ADK agents package naturally as containers, GKE becomes less a generic hosting environment and more a purpose-built runtime for autonomous systems.
Observability: The Hard Part
Agentic systems fail in non-obvious ways. An agent might produce a response - but the response could be hallucinated, based on a failed tool call, or the result of an unintended plan branch. Standard HTTP error monitoring doesn't catch this.
The recommended observability stack for GKE-based agentic systems:
Observability Stack
OpenTelemetry Instrumentation
Instrument each agent with OpenTelemetry. Emit spans for every LLM call, tool invocation, and planning step. Export to Google Cloud Trace for full distributed trace visualization.
Structured Logging to Cloud Logging
Log each reasoning step as a structured JSON event: task ID, agent ID, step number, prompt hash, tool name, tool result summary, token counts. Query across traces in BigQuery for post-hoc analysis.
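A sketch of what one such event might look like, using only the standard library. The field names are assumptions for illustration, not a fixed schema; the key properties are one JSON object per line on stdout (which the Cloud Logging agent picks up) and a prompt hash rather than raw prompt text.

```python
import hashlib
import json
import time

def log_step(task_id, agent_id, step, prompt, tool, result, tokens):
    """Emit one reasoning step as a single-line structured JSON event,
    suitable for Cloud Logging ingestion and later BigQuery analysis."""
    event = {
        "timestamp": time.time(),
        "task_id": task_id,
        "agent_id": agent_id,
        "step": step,
        # Hash the prompt so logs stay queryable without storing raw prompts.
        "prompt_hash": hashlib.sha256(prompt.encode()).hexdigest()[:16],
        "tool": tool,
        "result_summary": result[:200],   # truncate large tool outputs
        "token_count": tokens,
    }
    print(json.dumps(event))              # stdout -> Cloud Logging agent
    return event
```

Hashing the prompt keeps personally identifiable or proprietary prompt content out of the log pipeline while still letting you group events by identical prompts.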
Custom Metrics via Cloud Monitoring
Track agent-specific metrics: tasks completed per minute, average steps per task, tool call success rate, LLM latency P50/P95/P99, and hallucination rate from your eval pipeline.
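The latency percentiles are simple to compute from raw samples before exporting them as custom metrics; a minimal stdlib sketch (the export step to Cloud Monitoring is omitted):

```python
from statistics import quantiles

def latency_percentiles(samples_ms):
    """Compute P50/P95/P99 from raw LLM latency samples (milliseconds).
    statistics.quantiles with n=100 yields the 1st..99th percentile cut
    points; we pick out the three we export as custom metrics."""
    pts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": pts[49], "p95": pts[94], "p99": pts[98]}
```

In production you would feed these values to the Cloud Monitoring API on a fixed interval, or better, export raw samples as a distribution metric and let Monitoring compute the percentiles.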
LLM-specific Tracing (LangSmith / Vertex AI Eval)
Leverage LangSmith or Vertex AI's built-in evaluation capabilities to capture complete prompt–response interactions along with semantic quality metrics. These insights can then be fed back into your continuous improvement cycle.
Security Considerations for Agentic AI on GKE
Agents with tool use are a new attack surface. An agent that can execute code, send emails, or write to a database is a powerful actor - and must be treated like one.
Prompt Injection
Malicious content in retrieved documents can instruct the agent to deviate from its goal. Sanitize all retrieved content before insertion into prompts. Use system-level guardrails in your LLM configuration.
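A deliberately naive sketch of the sanitization step: scan retrieved passages for instruction-like phrases before they reach the prompt. The pattern list here is illustrative and far from exhaustive; real defenses layer pattern checks with model-side system guardrails and content classifiers.

```python
import re

# Naive guardrail: flag retrieved passages that look like instructions
# aimed at the model. This pattern list is a toy example, not a real
# defense; treat it as the shape of the check, not the check itself.
INJECTION_PATTERNS = re.compile(
    r"(ignore (all )?(previous|prior) instructions|you are now|system prompt)",
    re.IGNORECASE,
)

def sanitize_retrieved(text):
    if INJECTION_PATTERNS.search(text):
        return "[REDACTED: possible prompt injection]"
    return text
```

Redacting (rather than silently dropping) the passage also gives your observability pipeline a signal to alert on.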
Privilege Escalation
Each agent should operate with the minimum IAM permissions needed for its specific tools. Use Workload Identity with role-specific service accounts, never a single all-powerful SA shared by all agents.
Human-in-the-Loop Gates
For irreversible actions (sending emails, deploying code, database writes), require a human approval step before execution. Implement approval workflows via Pub/Sub pause + Cloud Tasks callback.
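The gate itself is a small state machine. This in-process sketch stands in for the real mechanism described above, where the pending action would sit in Pub/Sub and the approval would arrive as a Cloud Tasks callback; the action names are hypothetical.

```python
# Approval gate sketch: irreversible actions are parked for a human
# decision instead of executing immediately. In production the pending
# queue lives in Pub/Sub and approval arrives via a Cloud Tasks callback.

IRREVERSIBLE = {"send_email", "deploy", "db_write"}   # illustrative action names

class ApprovalGate:
    def __init__(self):
        self.pending = {}

    def request(self, action_id, action, payload):
        if action not in IRREVERSIBLE:
            return "executed"                    # safe actions run immediately
        self.pending[action_id] = (action, payload)
        return "awaiting_approval"               # agent pauses here

    def approve(self, action_id):
        action, payload = self.pending.pop(action_id)
        return f"executed {action}"              # human signed off; now act
```

The important property is that the agent's loop blocks on `awaiting_approval` and cannot reach the side-effecting code path without an explicit `approve` call.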
Network Policies
Use GKE Network Policies to restrict which agent pods can talk to which services. A researcher agent has no reason to reach the database writer service directly - enforce this in the cluster, not just in code.
What's Next: The Agentic Platform
The direction of travel is clear. GKE is evolving from an application runtime into an agentic platform - a place where autonomous AI systems can be deployed, composed, monitored, and governed with the same rigor we apply to microservices today.
Several emerging capabilities are worth tracking:
Agent-to-Agent Communication (A2A Protocol) - Google's emerging standard for cross-agent RPC, allowing agents built with different frameworks to interoperate. GKE provides the network fabric for this via internal load balancers and service mesh.
Model Context Protocol (MCP) on Kubernetes - MCP is becoming the standard way for agents to discover and call tools. Running MCP servers as sidecar containers or standalone Deployments in GKE makes tool registries cluster-native.
Vertex AI Agent Engine - Google's fully managed orchestration layer for agents that sits above GKE, handling session management, tool routing, and evaluation out of the box. The boundary between GKE and managed agent infrastructure will continue to blur.
"Kubernetes wasn't built for AI. But it turns out the problems of distributed systems - scale, failure, state, observability - are exactly the problems agentic AI inherits."
Core Reference Documentation
https://docs.cloud.google.com/kubernetes-engine/docs/integrations/ai-infra
https://docs.cloud.google.com/agent-builder/agent-development-kit/overview
Hands-on Tutorials
https://codelabs.developers.google.com/devsite/codelabs/build-agents-with-adk-foundation