DEV Community: Saurabh Mishra

Transforming Kong into an AI Gateway on GCP: Managing LLM Tokens, MCP, and Agentic Traffic

Saurabh Mishra — Tue, 28 Jul 2026 02:59:23 +0000

Most platform teams didn't set out to build "AI infrastructure." They already had Kong running in front of hundreds of REST services on GKE, handling auth, rate limiting, and observability the way an API gateway is supposed to. Then product teams started shipping LLM-backed features, then RAG pipelines, then autonomous agents that call tools on their own and suddenly the gateway layer that worked fine for CRUD traffic started showing cracks. A single chat request can cost 50x another one depending on the model and prompt length. A single "user request" might now fan out into a dozen tool calls made by an agent with no human in the loop. None of that maps cleanly onto request-count rate limits or static routing rules

This is the problem Kong's AI Gateway capabilities are built to solve, and GCP is a natural place to run it: GKE for the gateway itself, Cloud Load Balancing at the edge, Vertex AI and Model Garden as first-class model backends alongside OpenAI, Anthropic, and others, and Memorystore for Redis backing distributed rate-limit state. Below is a practical walkthrough of turning an existing Kong deployment into a real AI gateway on GCP covering token-aware traffic management, Model Context Protocol (MCP) support, and the harder problem of governing agentic traffic that doesn't behave like normal API calls.

Why a Regular API Gateway Isn't Enough for LLM Traffic

Traditional gateways are built around a few assumptions that break down with generative AI workloads:

Cost is not proportional to request count. A request to a small model with a short prompt might cost fractions of a cent; a long-context request to a frontier model can cost dollars. Rate limiting on requests-per-minute doesn't protect your budget.

Backends aren't interchangeable. Routing a request to GPT versus Gemini versus Claude isn't just a load-balancing decision ,it changes cost, latency, and even output quality.

The "client" is increasingly a machine, not a human. Agents chain multiple calls, retry autonomously, and can enter loops that a human would never trigger by clicking a UI.

The protocol itself is evolving. MCP now standardizes how models discover and call tools, and gateways need to speak it natively rather than treating it as an opaque HTTP payload.

Kong's AI Gateway is essentially the same Kong Gateway core, extended with a family of AI-specific plugins — ai-proxy, ai-proxy-advanced, ai-rate-limiting-advanced, ai-semantic-cache, ai-prompt-compressor, ai-mcp-proxy, and others — that make the gateway fluent in tokens, prompts, and tool calls instead of just headers and paths.

Step 1: Get Kong AI Gateway Running on GKE

If you're already running Kong Gateway (OSS or Enterprise) on GKE, you don't need a new platform — you need the AI Gateway plugin set, which ships as part of Kong Gateway 3.6+ and is more fully featured from 3.12 onward (which added first-class MCP support).

A typical GCP-native footprint looks like this:

GKE hosts the Kong data plane nodes (and control plane, if self-managed rather than using Konnect).
Cloud Load Balancing (external HTTPS LB or Gateway API) terminates TLS and fronts the Kong proxy service.
Memorystore for Redis backs distributed counters for the AI rate-limiting plugins, so token budgets are enforced consistently across replicas instead of per-pod.
Secret Manager holds provider API keys (OpenAI, Anthropic, Vertex AI service account credentials) and is mounted into Kong via the CSI driver rather than baked into kong.yaml.
Cloud Monitoring / Managed Prometheus scrapes the metrics Kong's AI plugins emit — including the token-usage and MCP-specific metrics added via the Prometheus plugin extensions.

Deployment itself is standard Kong-on-Kubernetes: the Kong Ingress Controller or KongClusterPlugin/KongPlugin CRDs manage plugin configuration declaratively, which plays well with GitOps if you're already running Config Sync or Argo CD on GKE.

Step 2: Route to Vertex AI and Other Providers with ai-proxy

The ai-proxy plugin turns a Kong Route into an LLM endpoint. You define the provider, the model, and any transformation rules, and Kong normalizes the request/response shape so your applications can call a consistent internal API regardless of which upstream model actually serves it:

plugins:
  - name: ai-proxy
    config:
      route_type: llm/v1/chat
      auth:
        header_name: Authorization
        header_value: Bearer ${{ env "VERTEX_AI_TOKEN" }}
      model:
        provider: gemini
        name: gemini-3.5-flash
        options:
          max_tokens: 1024
          temperature: 0.7

For multi-provider setups,routing cheap, latency-insensitive traffic to a Vertex AI model while sending premium requests to Claude or GPT ai-proxy-advanced adds weighted load balancing and semantic routing across multiple LLM targets behind a single Kong Route. This is also where cost-optimization patterns live: pairing ai-proxy-advanced with ai-semantic-cache (to skip redundant LLM calls entirely for near-duplicate prompts) and ai-prompt-compressor (which uses LLMLingua-style compression to shrink prompts before they hit the paid API) can meaningfully cut spend without touching application code.

Step 3: Make Token Usage a First-Class Rate-Limiting Dimension

This is the core shift from "API gateway" to "AI gateway": rate limiting by request count is close to meaningless for LLM traffic. The ai-rate-limiting-advanced plugin reads the actual token usage returned by the LLM provider in each response and enforces budgets on that basis — prompt tokens, completion tokens, or total tokens, per consumer or consumer group, over a sliding window:

plugins:
  - name: ai-rate-limiting-advanced
    consumer_group: standard-tier
    config:
      policies:
        - limits:
            - limit: 20000
              window_size: 60
              window_type: sliding
          identifier: consumer-group
          tokens_count_strategy: total_tokens
          strategy: redis
          redis:
            cluster_nodes:
              - ip: ${MEMORYSTORE_HOST}
                port: 6379
      llm_format: openai

Backing this with Memorystore for Redis (rather than local in-memory counters) is what makes the limit hold across every GKE pod in the deployment, not just per-replica. Kong also supports cost-based rather than token-based budgets — assigning a dollar cost per input/output token for each model and capping consumer spend per hour, which is the more direct lever for finance and platform teams who think in terms of a budget line rather than a token count.

A pattern worth calling out: tiering by consumer group lets you reject an unauthorized model choice — a standard-tier user requesting a premium model — at the gateway with a clean 400 before any upstream provider call is made, rather than silently downgrading the request or letting it through and eating the cost.

Step 4: Bring MCP Traffic Under Gateway Control

Model Context Protocol has quickly become the default way agents discover and invoke tools, but it introduces its own operational surface: persistent sessions, JSON-RPC method calls instead of simple REST verbs, and a mix of MCP clients, hosts, and servers that don't map onto a traditional request/response gateway model. Kong's ai-mcp-proxy plugin (introduced with Kong Gateway 3.12) is built specifically to sit in that path.

The plugin supports a few distinct modes:

Passthrough, where Kong fronts an existing MCP server as-is, applying auth, rate limiting, and logging without altering the protocol.
Conversion, where Kong turns an existing REST API into an MCP-compatible tool by defining a tool schema in configuration — no code changes to the backend service required:

plugins:
  - name: ai-mcp-proxy
    config:
      mode: conversion-listener
      tools:
        - description: Look up order status by ID
          method: GET
          path: /api/orders/{order_id}/status

On top of the proxy itself, Kong adds the pieces enterprise MCP deployments actually need in production: an ai-mcp-oauth2 plugin implementing OAuth 2.1 so Kong acts as the OAuth Resource Server for MCP sessions (aligning with the MCP spec's June 2025 authorization update), Consumer and Consumer Group ACLs to restrict which tools a given caller can invoke, and MCP-specific logging that captures session IDs, JSON-RPC methods, payloads, and latencies — because "who called what tool, with what arguments, and did it succeed" is the audit trail security teams will ask for the moment an agent does something unexpected.

Running this on GCP, the MCP session-affinity requirements (MCP connections are typically longer-lived than a single HTTP request) are a reasonable fit for GKE's session affinity settings on the Kong Service, and Cloud Trace integration via OpenTelemetry gives you the span-level view of a tool call as it moves from agent → Kong → MCP server → downstream API.

Step 5: Govern Agentic Traffic, Not Just Individual Calls

The hardest part of this transformation isn't any single plugin — it's that agentic traffic breaks the assumption that one inbound request equals one unit of work. An agent might make ten LLM calls and five tool calls to satisfy one user prompt, retry autonomously on failure, and occasionally loop. A few practical patterns help here:

Budget at the session or workflow level, not just the request level. Tag agent-originated traffic with a consumer group and apply the token/cost budgets described above across the group's sliding window, so a runaway agent loop hits a 429 instead of a surprise bill.
Use the AI Prompt Guard / Content Safety plugins as a policy boundary. Kong's AI Gateway includes plugins for prompt injection detection and integration with providers like Azure AI Content Safety to audit messages before they reach the upstream model — worth enabling specifically on agent-facing routes where inputs may originate from untrusted tool output rather than a human.
Scope MCP tool access tightly per agent identity. The Consumer/Consumer Group ACL model for MCP tools means a given agent credential can be restricted to exactly the tools it needs, which matters more for autonomous agents than for human-driven API clients, since there's no human in the loop to notice an overreaching call before it executes.
Treat observability as the actual safety net. Because agent behavior is inherently less predictable than a fixed application flow, the OpenTelemetry span attributes Kong emits for token usage and tool-call metadata are what let you reconstruct, after the fact, exactly what an agent did and why — which is often the more realistic control than trying to prevent every bad outcome upfront.

Putting It Together

None of these pieces are exotic individually rate limiting, OAuth, request logging, and reverse proxying are things Kong has done for years. What's changed is the unit of measurement: tokens instead of requests, tool calls instead of endpoints, and agent sessions instead of single client calls. Running this on GCP means the AI-specific Kong plugins sit on infrastructure that's already tuned for it Memorystore for the distributed counters that make token budgets actually hold, Vertex AI as a native backend alongside third-party model providers, and GKE plus Cloud Monitoring for the session handling and observability that MCP and agentic traffic both demand.

The teams that get the most value out of this tend to start narrow: put one high-traffic LLM route behind ai-proxy with token-based rate limiting first, prove out the Redis-backed budget enforcement, then layer in MCP proxying for a single tool integration before opening the gateway up to broader agentic workflows. The plugin architecture makes that incremental path realistic you're not re-platforming, you're extending a gateway you already trust to understand a new kind of traffic.

Zero-Trust Data Sovereignty: Enforcing Localized MCP Tool Policies in Multi-Region Stacks

Saurabh Mishra — Mon, 27 Jul 2026 05:24:28 +0000

Agentic AI has quietly broken a lot of assumptions that data governance teams spent the last decade building. When an application was a monolith talking to a regional database, "data residency" meant picking the right Cloud SQL instance and locking down an Organization Policy. When that application becomes an agent with a tool-calling loop — pulling a document from Google Drive, hitting a CRM connector, invoking a code execution sandbox, calling out to a partner API every one of those tool calls is a new data egress path, and potentially a new jurisdiction.

The Model Context Protocol (MCP) has become the de facto way agents discover and invoke tools. That's good news for interoperability. It's a harder problem for sovereignty, because MCP servers are, by design, pluggable and location-agnostic — a tool can live anywhere, and the agent doesn't inherently know or care where "anywhere" is. If you're running a multi-region stack on Google Cloud and you have regulatory obligations (GDPR, India's DPDP Act, sector rules like HIPAA or financial services residency mandates), you can't treat MCP tool invocation as a black box. It has to be a policy-enforced boundary, the same way network egress or IAM already is.

This post walks through how to think about and build zero-trust enforcement of MCP tool policies across regions on Google Cloud, so that "which tool can this agent call, with which data, from which region" is answered by infrastructure, not by prompt instructions.

Architectural Overview

When an agent requests a tool call (e.g., querying customer records via a database MCP server), the payload passes through an inline reverse proxy on Cloud Run.

The proxy evaluates:Agent Location & Identity (from Google IAM Workload Identity / OIDC).Target MCP Server Region (where the data reside).

Payload Sensitivity (evaluated via Google Cloud Sensitive Data Protection / Cloud DLP).

Why prompt-level restrictions aren't a control

A common first instinct is to tell the model, in its system prompt, "only use the EU-hosted search tool for EU customer data." That's a UX nicety, not a security control. Prompts are not a trust boundary: they can be overridden by injected content, ambiguous context, or simply model error. If a tool call is the thing that moves regulated data across a border, the decision to allow or block that call needs to be enforced by something that isn't the model itself — a policy engine sitting between the agent and the MCP tool registry.

This is the same lesson zero-trust architecture already taught us for east-west traffic: never trust the caller's stated intent, always verify against an external policy at the point of access.

The shape of the problem in a multi-region MCP stack

A realistic deployment looks like this: an orchestration layer (say, a Vertex AI Agent Builder or a self-hosted agent runtime on GKE) fans out tool calls to a set of MCP servers — some first-party, some third-party, some internal microservices wrapped in an MCP adapter. In a multi-region Google Cloud footprint, those MCP servers might be deployed in europe-west3, asia-south1, and us-central1, fronted by regional endpoints for latency and residency reasons.

Three things need to be true simultaneously:

Data classification travels with the request. A tool call carrying EU personal data needs to carry that fact as metadata, not as something inferred later.

Tool identity and location are resolvable at call time. The policy engine needs to know not just which tool is being called, but which region that MCP server instance is actually running in — and that this hasn't been spoofed.

Policy evaluation happens before egress, not after logging. Audit logs are necessary but not sufficient; the enforcement point has to be able to deny the call.

Architecture: a policy-enforcement sidecar in front of MCP

The pattern that works well on Google Cloud is to put a lightweight policy-decision point (PDP) in front of every MCP tool registry lookup, rather than baking region logic into the agent itself.

Tag data at ingestion, not at the tool boundary. Use Sensitive Data Protection (formerly DLP) to classify inbound context — documents, chat history, retrieved records — and attach classification labels (e.g., residency: eu, pii: high) to the request object the agent carries through its reasoning loop. This is the same principle as Assured Workloads' data boundary controls, just applied to the agent's working context instead of storage.
Register MCP tools with location metadata in a central catalog. Rather than letting agents discover MCP servers via arbitrary URLs, run an internal MCP tool registry (a Cloud Run or GKE service) that records, for every registered server: its deployed region, the data categories it's certified to handle, and its VPC Service Controls perimeter membership. Treat this registry as the single source of truth — agents query it, they don't hardcode endpoints.
Enforce with VPC Service Controls perimeters per region/data class, not per project. Group MCP servers into VPC-SC perimeters that mirror your data classification tiers rather than your org chart. An EU-resident perimeter contains only MCP servers physically deployed in EU regions and approved for EU personal data; cross-perimeter calls are denied by default, and any legitimate cross-border tool call goes through an explicit bridge with its own audit trail. This gives you the "default deny" posture that zero-trust requires, enforced at the network layer where it can't be argued around by application logic.
Use IAM Conditions and Organization Policy as the last-mile gate. Even inside an allowed perimeter, use IAM Conditions on the service accounts your agent runtime assumes to restrict which MCP tool identities can be invoked based on request attributes (time, classification tag, requesting workload identity). Pair this with the resourcelocations Organization Policy constraint so that any new MCP server or supporting resource (a Cloud SQL backing store, a Cloud Storage bucket used by a tool) can't be created outside its intended region in the first place — prevention, not just detection.
Route through a policy-decision point that speaks MCP. Concretely, this can be a small Envoy or custom Cloud Run proxy sitting in the call path between your agent runtime and MCP servers, which:

Reads the classification tags on the incoming request.
Looks up the target tool's region and certification in the registry.
Evaluates a policy (OPA/Rego is a natural fit) that says, effectively: "a request tagged residency: eu may only be routed to a tool tagged region: eu with certified: [eu_pii]."
Allows, denies, or — for genuinely ambiguous cases — routes to a human-in-the-loop approval step before the call proceeds.

This is the piece that turns "the model was told to behave" into "the infrastructure cannot do otherwise."

Handling model-side ambiguity

Even with the PDP in place, there's a softer problem: what happens when the agent's own reasoning conflates data from two residency domains in a single tool call — for example, summarizing an EU customer record alongside a US one before calling an external tool? This is where classification needs to propagate at the field level, not just the request level, so the PDP can catch a call that mixes tags rather than only checking the dominant one. It's also a strong argument for keeping cross-region context merges out of tool-call payloads entirely, and instead having the agent make separate, region-scoped calls that get synthesized after the fact in a control-plane component you trust.

Observability: prove it, don't just do it

Regulators and auditors want evidence, not architecture diagrams. Two things are worth wiring up from day one:

Cloud Audit Logs on every PDP decision, including denials, exported to a region-appropriate BigQuery dataset so the audit trail itself doesn't become a residency violation.
A drift detector that periodically re-verifies the registry's claimed region metadata for each MCP server against reality (e.g., via the actual serving location of the Cloud Run/GKE workload), since a misconfigured deployment silently moving a tool to the wrong region is a more likely failure mode than a malicious one.

The trade-offs worth naming honestly

This pattern adds latency (an extra hop through the PDP) and operational surface area (a registry and a policy engine to maintain). For low-stakes, single-region deployments, it's probably overkill plain VPC-SC perimeters and IAM may be enough. It earns its cost specifically when you have (a) genuine multi-jurisdictional data, (b) third-party or dynamically-discovered MCP tools you don't fully control, and (c) regulatory obligations with real audit teeth. If your MCP tool set is small, static, and entirely first-party, a simpler static allow-list per region may get you 90% of the benefit for a fraction of the complexity.

Closing thought

MCP made tool use composable and dynamic, which is exactly what makes it powerful and exactly why it can't be trusted to self-police on sovereignty. The fix isn't a smarter prompt; it's treating "which tool, which region, which data class" as a first-class policy decision enforced at the network and IAM layer, the same way Google Cloud already asks you to treat storage residency. Get that boundary right once, in infrastructure, and every agent built on top of it inherits the guarantee for free

The Connected Agent: Scaling Antigravity 2.0 with Google Cloud Data Services and Model Context Protocol

Saurabh Mishra — Wed, 01 Jul 2026 01:58:30 +0000

Artificial Intelligence is rapidly evolving from chatbots to autonomous agents capable of reasoning, planning, and taking action. But an AI agent is only as useful as the data and tools it can access.

This is where Google's Antigravity 2.0 changes the game.

Introduced as Google's next-generation agent development platform, Antigravity 2.0 enables developers to build multi-agent systems, orchestrate long-running workflows, and seamlessly integrate enterprise tools. When combined with Model Context Protocol (MCP) ** and **Google Cloud Data Services, it provides a scalable architecture for building production-ready AI applications.

In this article, we'll explore how these technologies work together and why they represent a modern blueprint for enterprise AI.

From Agent Manager to Agent Platform

The original Antigravity, released in November 2025, was a smart coding assistant wrapped around a familiar editor. Version 2.0 is a different category of product entirely. Instead of centering the code editor, it centers the agent itself, shipping simultaneously as a standalone desktop command center, a CLI (agy), an SDK, and a managed agents tier inside the Gemini API.

Underneath all of it sits Gemini 3.5 Flash, tuned specifically for agentic workflows and reportedly running several times faster than the previous generation while holding long context. That speed matters more than it sounds like it should when you're running multiple agents in parallel, each one waiting on a database schema lookup or a query result, latency compounds fast. A model that responds in milliseconds instead of seconds is the difference between a fluid multi-agent workflow and a stalled one.

The architecture reflects this shift toward orchestration. A manager agent breaks an incoming task into subtasks. Specialized sub-agents then work in parallel one writing code, one running terminal commands, another driving a real embedded Chromium browser to click through the UI it just built and catch what's broken. It's less "autocomplete" and more "team of engineers," each with a narrow job and a shared plan.

None of that matters much, though, if the team can't see your data.

Why AI Agents Need More Than an LLM

Consider this user request:

Summarize yesterday's sales, identify delayed shipments, notify affected customers, and generate an executive report.

A traditional chatbot would struggle because the information lives across multiple systems.

The agent needs to:

Query BigQuery for sales analytics.
Retrieve customer orders from Cloud SQL.
Check shipping status through an external API.
Search policy documents stored in Cloud Storage.
Send notifications.
Remember previous conversations.

Writing custom integrations for every application quickly becomes difficult to maintain.

Instead, modern AI systems separate reasoning from tool execution.

Meet Antigravity 2.0

Antigravity 2.0 is Google's platform for building intelligent agents that can reason, collaborate, and execute complex workflows.

Instead of relying on a single AI assistant, Antigravity 2.0 enables teams to orchestrate multiple specialized agents that work together.

Some of its key capabilities include:

🤖 Multi-agent orchestration
🧠 Long-running reasoning
🔄 Dynamic task decomposition
🛠 Native MCP tool integration
💻 Antigravity CLI and SDK
☁️ Deep integration with Google Cloud
📊 Enterprise-ready deployment patterns

Rather than directly accessing databases or APIs, Antigravity agents invoke MCP tools to retrieve data or perform actions securely.

What is Model Context Protocol (MCP)?

Instead of building custom integrations for every database or API, each capability is exposed as an MCP server.

The agent discovers available tools and invokes them dynamically.

User
│
▼
Antigravity 2.0
│
Discovers MCP Tools
│
───────────────
BigQuery Tool
Cloud SQL Tool
AlloyDB Tool
Storage Tool
GitHub Tool
Slack Tool
───────────────

The result is a modular architecture where agents remain lightweight while integrations evolve independently.

Bringing Google Cloud Data Services into the Picture

The real strength of Antigravity 2.0 comes from combining intelligent orchestration with trusted enterprise data.

📊 BigQuery

BigQuery gives agents access to analytical data at scale.

Example prompt:

"Which region had the highest revenue growth this month?"

The workflow is simple:

Antigravity selects the BigQuery MCP tool.
SQL is executed.
Results are summarized using Gemini.
The user receives insights instead of raw tables.

⚡ AlloyDB

AlloyDB is ideal for AI applications that require both operational data and semantic search.

Use cases include:

Vector search
RAG applications
Customer support
Product recommendations

Agents can combine structured queries with semantic retrieval to generate highly contextual responses.

🗄 Cloud SQL

Most enterprise applications already rely on relational databases.

Instead of migrating data, organizations can expose Cloud SQL securely through MCP.

Existing business applications immediately become AI-ready.

📁 Cloud Storage

Knowledge doesn't always live in databases.

Contracts, reports, PDFs, manuals, and images often reside in Cloud Storage.

An MCP server can retrieve relevant documents and provide them as context to the agent.

🔥 Firestore

Firestore stores:

User preferences
Conversation history
Application state
Session data

This allows Antigravity agents to personalize every interaction.

⚡ Memorystore (Redis)

Redis helps improve both performance and cost.

Typical use cases include:

Semantic cache
Conversation memory
Shared agent memory
Rate limiting
Session storage

Caching reduces latency and minimizes unnecessary LLM requests.

Multi-Agent Workflow in Action

Imagine a customer support assistant built with Antigravity 2.0.

A customer asks:

"My package hasn't arrived. What's happening, and am I eligible for compensation?"

Rather than relying on one agent, Antigravity orchestrates several specialized agents.

Multi-Agent Workflow in Action

Imagine a customer support assistant built with Antigravity 2.0.

A customer asks:

"My package hasn't arrived. What's happening, and am I eligible for compensation?"

Rather than relying on one agent, Antigravity orchestrates several specialized agents.

📦 Data Agent

Queries Cloud SQL to retrieve the order.

🚚 Logistics Agent

Calls the shipping provider's API.

📚 Knowledge Agent

Searches Cloud Storage for compensation policies.

📈 Analytics Agent

Queries BigQuery for historical delivery performance.

🧠 Memory Agent

Retrieves previous conversations from Firestore and Redis.

The orchestrator combines these outputs into a single response that is accurate, contextual, and personalized.

📦 Data Agent

Queries Cloud SQL to retrieve the order.

🚚 Logistics Agent

Calls the shipping provider's API.

📚 Knowledge Agent

Searches Cloud Storage for compensation policies.

📈 Analytics Agent

Queries BigQuery for historical delivery performance.

🧠 Memory Agent

Retrieves previous conversations from Firestore and Redis.

The orchestrator combines these outputs into a single response that is accurate, contextual, and personalized.

Security by Design

Enterprise AI requires strong governance.

Google Cloud provides the building blocks:

IAM
Service Accounts
Secret Manager
Cloud Audit Logs
VPC Service Controls
Private Service Connect
Customer-managed encryption keys (CMEK)

Since MCP servers expose only approved tools, organizations can apply least-privilege access and maintain strict security boundaries.

Why This Architecture Matters

Combining Antigravity 2.0 with MCP creates several advantages:

✅ Standardized integrations

✅ Reusable enterprise tools

✅ Modular architecture

✅ Better observability

✅ Easier governance

✅ Lower maintenance costs

✅ Faster AI development

As new business systems are introduced, developers simply deploy additional MCP servers instead of modifying the agents themselves.

Best Practices

If you're building production AI agents, consider these recommendations:

Keep agents focused on reasoning rather than direct data access.
Build small, reusable MCP tools with clear responsibilities.
Secure every MCP server with IAM and least-privilege permissions.
Cache expensive queries with Memorystore.
Monitor agents using Cloud Logging and OpenTelemetry.
Store credentials in Secret Manager.
Version MCP tools to maintain compatibility.
Add approval workflows before executing sensitive business operations.

Final Thoughts

Antigravity 2.0 marks an important step toward enterprise-ready agentic AI. Instead of building isolated chatbots, developers can create collaborative AI systems that reason, retrieve trusted business data, and automate complex workflows.

When paired with Model Context Protocol (MCP) and Google Cloud Data Services, Antigravity 2.0 enables secure, modular, and scalable AI architectures that are easier to build, govern, and extend.

The future of AI isn't just smarter models ,it's intelligent agents working together with the right tools, the right data, and the right architecture.

Closing the Trust Gap: Automating GKE Incident Response with Antigravity 2.0, GKE MCP, and Artifacts

Saurabh Mishra — Mon, 29 Jun 2026 16:12:25 +0000

Anatomy of the Trust Gap

Before we can talk about the solution, we need to talk honestly about how the trust gap forms. It isn't a technology failure it's an epistemological one. When an automated system takes an action or makes a recommendation, on-call engineers need to answer three questions almost simultaneously:

Is the diagnosis correct? Does the system understand what's actually wrong, or is it pattern-matching superficially?

Is the proposed action safe? Will following this recommendation make things better, worse, or sidestep the real issue entirely?

Can I explain this decision later? If I follow the automation and it goes wrong, will I be able to reconstruct why — for a postmortem, for my team, for myself?

Legacy runbook automation systems answer none of these questions well. They tell you what to do, not why. They surface alerts, not reasoning. And when they're wrong — which they are, reliably, in the tail cases that matter most — engineers stop trusting them for everything, including the cases where they'd be right.

Engineers are rightfully terrified of "runaway automation"—brittle bash scripts or over-eager webhooks that misinterpret a symptom, delete the wrong stateful pod, or trigger an accidental cascading failure across a cluster. Because of this, we default to waking up exhausted humans at 3:00 AM to manually sift through kubectl logs.

With the emergence of agentic AI ecosystems, we finally have a way to close this gap. By pairing Google Antigravity 2.0—Google's standalone agent orchestration platform—with GKE's native infrastructure and Artifacts, teams can build an automated, transparent, and strictly governed incident response pipeline

The Tech Stack: GKE, MCP, and Antigravity

To automate incident resolution safely, an AI agent cannot treat a cluster like a black box. It needs deep, contextual access to the environment without compromising security.

This workflow relies on three core components:
Google Kubernetes Engine (GKE): The underlying managed environment running containerized workloads.

GKE Model Context Protocol (MCP) Server: Introduced to standardize how AI agents interact with Kubernetes, the MCP server exposes standardized capabilities for monitoring, analyzing, and modifying cluster resources.

Google Antigravity 2.0: Operating via the Gemini Enterprise Agent Platform, Antigravity functions as the central orchestrator. It connects to the GKE MCP server using enterprise-grade IAM credentials and Workload Identity, executing automated reasoning loops to triage and fix issues

Bridging the Gap with Artifact-Driven SRE

The secret to trust is transparency. Google Antigravity does not blindly run destructive scripts in the background. Instead, its core design centers on Artifacts—structured, immutable deliverables created by the agent to communicate its thinking, progress, and verification milestones to human users. When applied to GKE Site Reliability Engineering (SRE), Antigravity uses an Artifact-Driven Remediation framework:

Implementation Plans: Before modifying any cluster state, the agent generates a rich Markdown specification detailing the exact API changes it intends to make (e.g., cordoning a node, scaling down a corrupted deployment).

Task Lists: A structured checklist showing the step-by-step diagnostic operations the agent is executing in real time.

Walkthroughs: Once a fix is applied, the agent generates an interactive post-mortem artifact summarizing the changes and verifying cluster health with real data logs

Step-by-Step: The Automated Incident Loop

Let's look at how Antigravity handles a common, painful production issue: a microservice experiencing a memory leak that triggers an Out-Of-Memory (OOM) killer loop, choking out co-located pods on a GKE node.

GKE-Specific Diagnostic Patterns: What We've Learned

Twelve months of running Antigravity in production across seventeen GKE clusters has generated a substantial library of incident patterns. The following are the most common root cause categories our diagnostic engine has learned to identify with high confidence, along with the signal signatures that distinguish them:

Node pool autoscaler contention
Symptoms: pods stuck in Pending despite headroom in existing nodes; cluster autoscaler logs showing scale-up events followed by immediate scale-down; kube_node_status_condition flipping. Common in environments where both HPA and VPA are enabled without coordination, creating competing scaling pressure. Antigravity's diagnostic rule for this pattern has 0.89 average confidence based on the last 6 months of production data.

Workload Identity credential expiry
Symptoms: application pods returning 403s to GCP APIs; token-refresh-timeout errors in container logs; incident opened by latency or error rate alert rather than infrastructure alert. Tricky to diagnose because the failure is in application layer but the root cause is in the identity infrastructure. Signal correlation across Kubernetes events and Cloud Logging together is what makes this diagnosable.

Resource quota saturation at namespace level
Symptoms: new pod creation failing with exceeded quota despite ample node resources; affects all deployments in a namespace simultaneously. Engineers frequently misdiagnose this as a node shortage because node-level metrics look healthy. Antigravity's namespace quota check is the first hypothesis evaluated for any pod-creation failure — it rules in or out in under a second.

Affinity/anti-affinity scheduling deadlocks
Symptoms: 0/N nodes are available: N node(s) didn't match pod anti-affinity in scheduler events; happens after cluster topology changes (node pool resize, zonal failures). Difficult to reason about in the moment because the conflict is between pod specs that were each valid when written. The Artifact for these incidents includes a specific note explaining which pods are in conflict and why.

The Antigravity Pipeline: From Signal to Artifact

What You'll Need to Build This

Antigravity is an internal platform, but the architectural pattern is reproducible. If you're building toward something similar, here's an honest assessment of what's required:

Observability foundations that are actually good
Antigravity is only as smart as its inputs. If your Prometheus metrics are inconsistently labeled, your GKE event retention is too short, or your structured logging is incomplete, the diagnostic engine will produce low-confidence outputs that engineers learn not to trust — and you're back to square one. Invest in observability before investing in automation.

A runbook of failure modes, not just runbooks
The diagnostic patterns that power Antigravity's hypothesis engine came from three months of retrofitting existing incident postmortems into structured, parameterized failure signatures. This work is not glamorous. It also cannot be skipped. LLMs like Claude are remarkably good at synthesizing structured context into legible narrative — they are not (yet) good at doing root cause analysis from raw, unstructured signal streams.

A hard commitment to the human gate
The temptation to auto-approve "low risk" actions will be constant and will come from leadership as well as engineers who get tired of approving the same PDB patches. Resist it. The trust in Antigravity was built precisely because nothing executes without human approval engineers know that if they make a mistake, they made it, and they can learn from it. Eroding the gate erodes the trust model.

Genuine uncertainty representation
Build the uncertainty_notes requirement into your Artifact schema as a non-nullable, non-empty field. Prompt your LLM to fill it honestly. Review generated Artifacts in postmortems not just for cases where the system was wrong, but for cases where it was right but overconfident. Calibration matters as much as accuracy.

Restoring Peace of Mind to On-Call Teams

When an engineer opens their laptop after a resolved incident, they aren't looking at a black box or a string of cryptic logs. They are greeted by structured, historical evidence.

Through the Antigravity 2.0 Desktop Sidebar or CLI, the engineering team has an asynchronous paper trail of the entire event. The trust gap disappears because the system behaves predictably, logs its intentions transparently before acting, and provides concrete receipts of success.

By pairing the declarative, rock-solid infrastructure of GKE with the precise, artifact-backed reasoning of Google Antigravity, organizations can safely transition from reactive fire-fighting to autonomous, self-healing infrastructure.

Building a Multi-Agent Security Framework for Kubernetes: Autonomous Detection, Investigation, and Remediation

Saurabh Mishra — Thu, 04 Jun 2026 06:49:28 +0000

Kubernetes is the industry standard for scaling cloud-native workloads While it offers tremendous scalability and flexibility, securing Kubernetes environments remains a significant challenge. Organizations often rely on a collection of disconnected security tools to handle vulnerability scanning, runtime monitoring, compliance validation, and incident response.

As clusters grow in complexity, security teams face increasing alert fatigue, delayed response times, and difficulties correlating security events across multiple layers of the platform.

Recent advancements in Agentic AI present an opportunity to rethink Kubernetes security. Instead of relying solely on static rules and isolated security products, organizations can deploy a collaborative network of AI-powered security agents that continuously monitor, investigate, and remediate threats.

This blog explores how a Multi-Agent Security Framework can transform Kubernetes security operations through autonomous detection, investigation, and remediation.

The Problem with Traditional Kubernetes Security

Modern Kubernetes environments generate security signals from multiple sources:

Runtime security tools
Container vulnerability scanners
Admission controllers
Network monitoring systems
Compliance platforms
Cloud security posture management tools

Each system produces valuable information, but most operate independently.

Consider a common scenario:

A container begins executing suspicious commands.

A runtime security platform detects the behavior and raises an alert. However, determining whether the threat is critical requires additional context:

Is the pod exposed externally?
Does the workload have excessive privileges?
Can it access sensitive namespaces?
Is lateral movement possible?
Does it violate organizational policies?

Answering these questions often requires multiple tools and human intervention.

This is where multi-agent systems become valuable.

What is a Multi-Agent Security Framework?

A Multi-Agent Security Framework consists of specialized AI agents, each responsible for a specific security domain. These agents collaborate to investigate incidents, exchange findings, and coordinate remediation actions.

Instead of a single "security copilot," organizations deploy a team of specialized autonomous agents.

Core Design Principles
Domain specialization
Collaborative investigation
Continuous monitoring
Autonomous reasoning
Human-in-the-loop governance

Pillars

Autonomous Detection

Continuous, multi-signal threat sensing across network, runtime, supply chain, and access layers — without polling delays.

Autonomous Investigation
Agents correlate signals, query cluster context, and build an evidence graph so responders arrive with answers, not questions.

Autonomous Remediation
Graduated, confidence-gated responses — from policy updates to pod quarantine — executed in seconds, not minutes.

Architecture : The agent topology

The framework is structured in three tiers. Specialist agents handle domain-specific sensing. An Orchestrator Agent handles correlation and response coordination. A shared Intelligence Plane built on NATS and a graph-based context store is the connective tissue between them

Every agent is a Kubernetes Deployment with its own ServiceAccount, scoped strictly to the permissions it needs. The Intelligence Plane is the only shared resource and access to it is controlled via mTLS with workload identities, preventing any agent from spoofing events.

Autonomous detection

Detection agents run continuously, producing structured ThreatEvent objects the moment they observe anomalous behavior. Unlike scheduled scans, they operate as event-driven loops reacting to signals within milliseconds of occurrence.

Detection Layer

What each specialist agent watches

Network Sentinel: eBPF-based flow telemetry, cross-namespace connection attempts, DNS query anomalies, unexpected egress to external IPs, port scanning signatures, and flows that violate declared NetworkPolicies.

Runtime Guardian: Syscall sequence deviations from per-workload baselines, unexpected binary executions, writes to /proc or /sys, capability changes, and privileged container escalation patterns detected via Falco or Tetragon rules.

Supply Chain Verifier: Image signature verification at admission time, SBOM cross-referencing against CVE databases, detection of images from unregistered registries, and OPA policy violations before any pod schedules.

RBAC Auditor: New ClusterRoleBindings with wildcard verbs, service accounts gaining elevated privileges, new tokens issued to sensitive namespaces, and drift from the last known-good RBAC snapshot.

Autonomous investigation
Detection tells you something happened. Investigation tells you what, to what extent, and how. This phase is where most human security hours are spent and where autonomous agents can deliver the biggest leverage.

Investigation Layer

What the Forensic Investigator Agent does

Evidence graph construction: Builds a directed graph of all entities involved — pods, service accounts, nodes, secrets, external IPs — and the relationships between them at the time of the incident.

Blast radius mapping: Determines which other namespaces, secrets, and workloads could have been reached from the compromised entity, given the RBAC and network topology at the time.

Timeline reconstruction: Assembles a chronological sequence of events from audit logs, ThreatEvents, and deployment history to identify patient zero and the attack progression.

Cross-agent signal correlation: Queries all specialist agents for their observations about the involved entities within a configurable lookback window (default: 30 minutes before first signal).

Autonomous remediation

Remediation is where autonomy earns its keep and where it demands the most discipline. The Remediation Executor Agent applies a graduated response model: response severity scales with confidence score, and actions affecting the control plane always require human approval.

Remediation Layer The graduated response tiers

Tier 1 — Observe (confidence < 0.6): Log the event, enrich with context, send an informational alert. No cluster state changes. Human reviews asynchronously.

Tier 2 — Restrict (confidence 0.6–0.8): Apply targeted NetworkPolicy to block the suspicious traffic flow. Annotate the pod with quarantine metadata. Page the on-call engineer with full context.

Tier 3 — Isolate (confidence 0.8–0.95): Evict the affected pod, revoke associated ServiceAccount tokens, and update NetworkPolicy to block the pod's IP range. Incident ticket auto-created with InvestigationReport attached.

Tier 4 — Escalate (confidence ≥ 0.95 or control-plane impact): Page security lead immediately. Stage proposed remediation actions for one-click human approval. Do not auto-execute.

Agent Roster

All six agents at a glance

Network Sentinel
eBPF-powered traffic analysis across all namespaces. Detects lateral movement, DNS tunneling, and NetworkPolicy violations in real time. Auto-updates deny rules on confirmed threats.

eBPF
NetworkPolicy
DNS Analysis

Runtime Guardian
Builds behavioral baselines per workload via Falco/Tetragon. Flags syscall deviations, shell spawns, and privilege escalations indicative of container escape attempts.

Falco
Tetragon
Syscall Audit

Supply Chain Verifier
Hooks the admission webhook to validate image signatures (Cosign), SBOMs, and OPA policies before any workload schedules. Blocks untrusted images silently and instantly.

Cosign
SBOM
OPA Gatekeeper

RBAC Auditor
Continuously diffs live RBAC state against a least-privilege baseline. Catches permission creep, wildcard bindings, and unexpected new ClusterRoleBindings before they're exploited.

RBAC
Policy-as-Code
Drift Detection

Forensic Investigator
Automatically triggered on incident promotion. Queries all agents for corroborating telemetry, builds an evidence graph, maps blast radius, and reconstructs the attack timeline.

Evidence Graph
Blast Radius
Timeline

Orchestrator + Remediation Executor
Correlates signals from all detection agents, scores incidents, and dispatches the Executor. The Executor applies graduated responses observe, restrict, isolate, or escalate with full rollback support.

Correlation
Threat Scoring
Graduated Response

What makes this safe to run in production

Autonomous remediation in production is only safe if the framework is built for it from the start. These principles are non-negotiable.

How Google Cloud powers each pillar

If you're running on Google Kubernetes Engine, you don't have to build every piece of this framework from scratch. Google Cloud provides a suite of managed services that map directly onto the detection, investigation, and remediation layers each deeply integrated with GKE's control plane.

How Google Cloud services map to each agent

Security at cluster scale requires coordination

No single tool and no single human team can watch every plane of a production Kubernetes cluster simultaneously. Multi-agent frameworks aren't a future concept they're the practical answer to a present problem.

Untrusted Code, Trusted Cluster Scaling Secure AI Agent Workspaces with GKE Agent Sandbox

Saurabh Mishra — Sun, 31 May 2026 04:03:42 +0000

How gVisor-powered sandbox isolates AI-generated code at the kernel level and why it changes everything for multi-tenant agentic systems.

In this article we are going discuss on below points

The problem with AI agents writing code
What is GKE Agent Sandbox?
How gVisor intercepts the kernel
Architecture deep dive
Setting it up: step by step
Production patterns
Conclusion

There's a moment every engineer running AI agents eventually faces: an LLM generates a perfectly plausible subprocess.run() call, pipes it to bash -c, and realise that one prompt injection away from a full container escape. The code looks reasonable. The agent trusts itself. And cluster's blast radius just became everyone's problem.

This is the defining security problem of the agentic era. Language models don't just generate text anymore they write, execute, and iterate on code in tight feedback loops. The capabilities that make them useful (unrestricted Python, shell access, file I/O) are exactly the capabilities that make them dangerous in a shared cluster.

Google's answer — GKE Agent Sandbox

GKE Agent Sandbox is built for agentic workloads that require high-level scale, extensibility, and security. Key benefits include:

Kernel-level isolation: Provides strong, kernel-level isolation for untrusted, LLM-generated code by using built-in GKE features like GKE Sandbox. Agent Sandbox also supports the open source Kata Containers software.

Sub-second provisioning: Offers an out-of-the-box mechanism to provide sandboxes significantly faster than standard Kubernetes Pod scheduling allows (typically <1s).

Cloud-native extensibility: Leverages the power of the Kubernetes paradigm and the managed infrastructure of GKE.

By providing a declarative, standardized API, GKE Agent Sandbox offers a single-container experience that provides isolation and persistence characteristics similar to a virtual machine (VM), built entirely on Kubernetes primitives

The problem with AI agents writing code

Agentic AI systems whether you're building with LangGraph, AutoGen, Claude's tool-use API, or rolling your own share a common architectural pattern: the model generates code, a runtime executes it, results flow back to the model, and the loop continues. At each iteration, the model has broader context about what worked and what didn't. This is enormously powerful for automating complex tasks.

It also creates an attack surface that traditional Kubernetes security was never designed to handle.

Container escape

LLM-generated code exploits known kernel vulnerabilities or misconfigured capabilities to break out of the container boundary.

Prompt injection via code output

Malicious content in retrieved data embeds instructions that manipulate the agent into executing attacker-controlled payloads.

Lateral network movement

An agent with network access can enumerate internal services, extract credentials, and pivot across your cluster — all through legitimate-looking Python requests.

Filesystem exfiltration

Without mount restrictions, agents can read service account tokens, Kubernetes secrets mounted as volumes, and host path data.

Standard container security — securityContext, network policies, Pod Security Admission provides defence in depth but doesn't address the fundamental issue: containers share the host kernel. If the kernel has a vulnerability, a sufficiently motivated attacker (or sufficiently capable LLM) can exploit it regardless of namespace isolation.

What is GKE Agent Sandbox?
GKE Agent Sandbox is a Google-managed node pool configuration that applies gVisor-based container sandboxing specifically tuned for agentic AI workloads.

At its core, it combines three things:

gVisor runtime (runsc) as the default OCI runtime
Every container in the sandbox node pool runs under runsc instead of the standard runc. This intercepts all syscalls through a user-space kernel implementation called Sentry.

Agent-specific resource isolation profiles
Pre-configured seccomp and AppArmor profiles optimised for Python/Node.js/container-in-container workloads that AI agents commonly generate. No manual tuning of syscall allowlists required for standard use cases.

Integrated observability via Cloud Monitoring
Syscall audit logs, sandbox violation events, and resource consumption metrics flow automatically into Cloud Monitoring — giving you behavioural baselines for agent workloads without custom instrumentation.

How gVisor intercepts the kernel

Understanding what gVisor actually does is essential for reasoning about its security guarantees. The mental model most engineers have of containers — "a process with namespaces and cgroups" — breaks down when thinking about gVisor.

In a standard container, your application's open(), read(), execve(), and socket() calls go directly to the host Linux kernel via the system call interface. The kernel has to handle them, which means a kernel vulnerability is reachable from inside the container.

With gVisor, those same syscalls are intercepted by Sentry a Go implementation of the Linux kernel that runs entirely in user space. Sentry implements the Linux ABI from scratch. When your agent code calls execve(), it's Sentry that handles it, not the host kernel. Sentry then makes a much smaller set of calls to the actual host kernel (through a restricted interface called the "platform") to handle things like memory mapping and scheduling.

End-to-End Architectural Blueprint

To isolate untrusted code execution while maintaining a highly responsive management plane, the architecture splits the cluster into two distinct, specialized node pools.

Standard Node Pool (The Brain)- This pool runs your trusted, long-lived orchestration services. Because this code is written and audited by your team, it runs on the standard Linux host kernel for maximum performance and native access to internal cluster resources.Agent Controller: The core engine managing the life cycle of AI agent tasks, spin-up times, and state tracking.Tool Router: Mediates external API calls and manages what capabilities (e.g., web search, database querying) are exposed to the agent.Result Collector: Aggregates outputs, logs, and state changes from the runtime pods.State & Storage (Postgres/Redis): Highly available data layers tracking session memory and agent state.

Agent Sandbox Node Pool (The Muscle) - This pool is dedicated entirely to executing untrusted code generated by AI models. It uses the runtimeClassName: gvisor configuration to enforce strict kernel-level isolation.Code Executor Pods ($N$ Pods): Ephemeral, rapid-churn pods designed to spin up, run a specific snippet of generated code, and terminate.The Sentry (User-Space Kernel): gVisor’s core component. Instead of letting a Python agent talk directly to the host Linux kernel via standard system calls (syscall()), the Sentry intercepts them. It implements a core suite of Linux kernel primitives in user-space, shielding the host bare-metal or VM infrastructure from container escape vulnerabilities.

Workload Identity & RBAC Separation

By separating Kubernetes Service Accounts (KSAs) and mapping them to distinct Google Cloud IAM Service Accounts, we eliminate the risk of privilege escalation if an agent is compromised.

Observability and Behavioral Analysis

Because sandbox runtimes are naturally adversarial, observability shifts from standard application performance monitoring (APM) to real-time behavioral and security auditing

Syscall Audit Logs: gVisor provides structural logs of intercepted system calls via its internal logging mechanisms. Unusual system calls (e.g., attempts to call forbidden network protocols or direct raw socket manipulations) are immediately streamed to Cloud Logging.

Violation Events: Any attempt by a sandboxed container to bypass the Sentry or execute an invalid operation triggers an immediate containment event, surfaced directly in Google Cloud Security Command Center.

Cloud Monitoring: Aggregates container-level metrics (CPU, Memory, Churn rate). Crucial for detecting malicious infinite loops or resource-exhaustion (DDoS) attempts disguised as AI agent tasks.

Cloud Trace: End-to-end distributed tracing maps exactly how long a request spends routing through the Tool Router versus how long it spends executing inside the gVisor sandbox, allowing you to fine-tune the performance overhead introduced by user-space context switching.

Setting it up: step by step

Here's a complete walkthrough from a fresh GKE cluster to a running sandboxed agent workload. This assumes you have gcloud, kubectl, and Terraform configured for project.

Production patterns

Pattern 1: Warm pool with pre-forked executors

Cold-starting a new pod for every code execution adds latency. The standard pattern is to maintain a pool of warm executor pods that listen for work over a task queue (Pub/Sub or Redis Streams). The controller dispatches code snippets to idle executors; completed executors reset their environment and return to the pool. A garbage collection sidecar restarts pods that have been warm too long to prevent state accumulation.

Pattern 2: Execution budget enforcement

AI agents can get into infinite loops. Beyond Kubernetes resource limits, apply an application-level timeout using Python's signal.alarm or Go's context cancellation. A 30-second wall-clock timeout with a 10-second CPU-time budget covers almost all legitimate agent code execution patterns while preventing runaway loops from consuming pool capacity.

Pattern 3: Network egress allow-listing per agent type

Different agent personas have different legitimate network needs. A data analysis agent needs access to BigQuery and GCS. A web research agent needs HTTP egress to public internet. A code review agent needs neither. Model this with separate NetworkPolicies per agent label, and use PodSpec labels to bind agents to the right policy at scheduling time.

Conclusion

The agentic era is here, and it runs on code execution. Whether you're building autonomous research assistants, DevOps automation agents, or data pipeline orchestrators,eventually going to need a principled answer to the question: what happens when the model writes something it shouldn't?

GKE Agent Sandbox doesn't make the threat go away. Prompt injection is still a model-level problem. Lateral movement still requires complementary network controls. Secrets management still requires RBAC discipline. But the sandbox answers a specific, hard question — what if agent-generated code exploits a kernel vulnerability or escalates privileges? — with a credible, production-tested answer: it runs against Sentry, not your host kernel.

For most teams running agentic workloads on GKE, the operational cost is low (a single node pool configuration), the performance cost is acceptable (single-digit percentages for typical agent workload patterns), and the security benefit is significant (kernel-level isolation with full Kubernetes observability).

That's the architectural question GKE Agent Sandbox is designed to answer. Build agentic systems with the assumption that the code will sometimes be wrong, sometimes be manipulated, and occasionally be malicious and design your execution environment accordingly.

References and Documentation

https://docs.cloud.google.com/kubernetes-engine/docs/how-to/agent-sandbox

https://docs.cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox

Running Agentic AI at Scale on Google Kubernetes Engine

Saurabh Mishra — Wed, 08 Apr 2026 04:15:15 +0000

The AI industry crossed an inflection point. We stopped asking "can the model answer my question?" and started asking "can the system complete my goal?" That shift from inference to agency changes everything about how we build, deploy, and scale AI in the cloud.

Google Kubernetes Engine (GKE) has quietly become the platform of choice for teams running production AI workloads. Its elastic compute, GPU node pools, and rich ecosystem of observability tools make it uniquely suited not just for model serving but for the orchestration challenges that agentic AI introduces.

This blog walks through the full landscape: what kinds of AI systems exist today, how agentic architectures differ, and what it actually looks like to run them reliably on GKE.

The AI Taxonomy: From Reactive to Autonomous

Before diving into infrastructure, it's worth establishing what we mean by the different modes of AI deployment. Not all AI is "agentic," and the architecture you choose should match the behavior you need

Reactive / Inference

Stateless prompt-response. One request, one LLM call, one answer. The model has no memory between turns. Examples: text classifiers, summarizers, one-shot code generators.

Conversational AI

Multi-turn dialog with session state. The model remembers context within a conversation window. Examples: customer support bots, document Q&A, coding assistants.

Retrieval-Augmented (RAG)

The model can query external knowledge at runtime before generating a response. Introduces a retrieval step vector DBs, semantic search, tool calls to databases.

Agentic AI

The model plans, takes actions, observes results, and loops until a goal is reached. It can call tools, spawn subagents, and make decisions across many steps autonomously.

Multi-Agent Systems

A network of specialized agents collaborating: an orchestrator decomposes a task and delegates to researcher, writer, executor agents that work in parallel or sequence.
Each mode up the stack introduces new infrastructure requirements: more state to manage, longer-lived processes, more concurrent workloads, harder failure modes, and deeper observability needs.

Why GKE for AI Workloads?

Kubernetes is table stakes for any modern distributed system. But GKE specifically brings several features that make it exceptional for AI:

GKE Capabilities for AI

GPU and TPU Node Pools

To handle the heavy lifting of Agentic AI, GKE offers specialized Accelerator Node Pools. This infrastructure allows you to dynamically attach high-end compute resources such as NVIDIA A100, H100, or L4 GPUs and Google TPUs exactly when your agents need them.

Workload Identity & Secret Management

Agentic systems touch many external APIs (databases, external services, third-party tools). Workload Identity Federation lets pods authenticate to Google Cloud services without storing long-lived credentials.

Horizontal Pod Autoscaling with Custom Metrics

Scale agent runner replicas based on queue depth (Pub/Sub backlog, Redis list length) rather than CPU. This allows demand-driven scaling that matches agent workload patterns precisely.

GKE Autopilot & Standard Modes

Autopilot mode handles node management entirely, ideal for teams wanting to focus on agent logic. Standard mode gives full control when you need custom kernel modules or specialized hardware affinity rules.

Cloud Run on GKE for Burst Workloads

Short-lived tool execution steps in an agent pipeline can be offloaded to Cloud Run, which scales to zero between invocations avoiding the overhead of always-on Kubernetes pods for infrequent task

Anatomy of an Agentic AI System

An agentic AI system isn't a single process ,it's a distributed workflow. Understanding its components is essential before mapping it onto Kubernetes primitives.
"An agent is an LLM that can observe the world, decide what to do next, and take actions - in a loop, until a goal is satisfied."

Popular Agentic Frameworks on GKE

Several frameworks have emerged to help teams build agentic systems without reinventing the orchestration wheel. Each has a different philosophy and maps to GKE differently.

Agent Development Kit (ADK)

Google's native framework for building multi-agent systems on Vertex AI. First-class GKE support, tight Gemini integration, built-in evaluation tools. Best choice for teams already on Google Cloud.

LangGraph

Graph-based agent orchestration with explicit state machines. Excellent for complex branching workflows. Containerizes cleanly. LangSmith provides tracing that integrates with GKE logging pipelines

CrewAI

Defines agents as role-playing entities (Researcher, Writer, Editor) with goals and backstories. Simple to model complex human workflows. Ideal for content, analysis, and research pipelines.

Google ADK on GKE >> Native Fit

The Google Agent Development Kit (ADK) is architected to treat Kubernetes as its primary "home," creating a seamless integration where the framework and the platform operate as one. Because ADK is built with a Kubernetes-native philosophy, it transforms GKE from a simple hosting environment into a specialized runtime for autonomous systems.

Observability: The Hard Part

Agentic systems fail in non-obvious ways. An agent might produce a response - but the response could be hallucinated, based on a failed tool call, or the result of an unintended plan branch. Standard HTTP error monitoring doesn't catch this.

The recommended observability stack for GKE-based agentic systems:

Observability Stack

OpenTelemetry Instrumentation

Instrument each agent with OpenTelemetry. Emit spans for every LLM call, tool invocation, and planning step. Export to Google Cloud Trace for full distributed trace visualization.

Structured Logging to Cloud Logging

Log each reasoning step as a structured JSON event: task ID, agent ID, step number, prompt hash, tool name, tool result summary, token counts. Query across traces in BigQuery for post-hoc analysis.
Custom Metrics via Cloud Monitoring

Track agent-specific metrics: tasks completed per minute, average steps per task, tool call success rate, LLM latency P50/P95/P99, and hallucination rate from your eval pipeline.

LLM-specific Tracing (LangSmith / Vertex AI Eval)

Leverage LangSmith or Vertex AI's built-in evaluation capabilities to capture complete prompt–response interactions along with semantic quality metrics. These insights can then be fed back into your continuous improvement cycle.

Security Considerations for Agentic AI on GKE

Agents with tool use are a new attack surface. An agent that can execute code, send emails, or write to a database is a powerful actor - and must be treated like one.

Prompt Injection

Malicious content in retrieved documents can instruct the agent to deviate from its goal. Sanitize all retrieved content before insertion into prompts. Use system-level guardrails in your LLM configuration.

Privilege Escalation

Each agent should operate with the minimum IAM permissions needed for its specific tools. Use Workload Identity with role-specific service accounts never a single all-powerful SA for all agents.

Human-in-the-Loop Gates

For irreversible actions (sending emails, deploying code, database writes), require a human approval step before execution. Implement approval workflows via Pub/Sub pause + Cloud Tasks callback.

Network Policies

Use GKE Network Policies to restrict which agent pods can talk to which services. A researcher agent has no reason to reach the database writer service directly - enforce this in the cluster, not just in code.

What's Next: The Agentic Platform

The direction of travel is clear. GKE is evolving from an application runtime into an agentic platform - a place where autonomous AI systems can be deployed, composed, monitored, and governed with the same rigor we apply to microservices today.
Several emerging capabilities are worth tracking:

Agent-to-Agent Communication (A2A Protocol) - Google's emerging standard for cross-agent RPC, allowing agents built with different frameworks to interoperate. GKE provides the network fabric for this via internal load balancers and service mesh.

Model Context Protocol (MCP) on Kubernetes - MCP is becoming the standard way for agents to discover and call tools. Running MCP servers as sidecar containers or standalone Deployments in GKE makes tool registries cluster-native.

Vertex AI Agent Engine - Google's fully managed orchestration layer for agents that sits above GKE, handling session management, tool routing, and evaluation out of the box. The boundary between GKE and managed agent infrastructure will continue to blur.

"Kubernetes wasn't built for AI. But it turns out the problems of distributed systems - scale, failure, state, observability - are exactly the problems agentic AI inherits."

Core Reference Documentation

https://docs.cloud.google.com/kubernetes-engine/docs/integrations/ai-infra

https://github.com/GoogleCloudPlatform/accelerated-platforms/blob/main/docs/platforms/gke/base/use-cases/inference-ref-arch/README.md

https://docs.cloud.google.com/agent-builder/agent-development-kit/overview

Hands-on Tutorials

https://codelabs.developers.google.com/devsite/codelabs/build-agents-with-adk-foundation

https://cloud.google.com/blog/topics/developers-practitioners/build-a-multi-agent-system-for-expert-content-with-google-adk-mcp-and-cloud-run-part-1

Hooking up CrewAI with Google Gemini for Multi-Agent Automation Systems

Saurabh Mishra — Mon, 16 Feb 2026 15:55:24 +0000

Google’s AI ecosystem is vast and powerful, featuring Google Gemini models (accessible via API) and Google AI Studio (a brilliant web IDE for experimenting with and deploying generative AI apps). But what happens when you combine that raw reasoning capability with an autonomous orchestration framework?

CrewAI

CrewAI is an open-source Python framework that lets you build and orchestrate multiple AI agents that collaborate to accomplish complex tasks like a virtual team of specialists. It organizes agents, assigns them roles and lets them delegate and share tasks.

Why Gemini + CrewAI?

CrewAI allows you to define agents with highly specific roles, goals and backstories. Under the hood, it uses LiteLLM (or LangChain wrappers) to route calls to the language model of your choice.

By hooking CrewAI into Google’s Gemini models (like gemini-2.5-flash or other models), we get:

Lightning-fast reasoning required for agentic loops.
Massive context windows for analyzing huge codebases, logs, or documentation.
Natively integrated Google Search grounding, perfect for agents that need to research complex code, real-time data, or modern architecture patterns.

Step 1: **Setup and Authentication**
To get started, we need to configure CrewAI to use Gemini models.

Get Gemini API Key:

Go to Google AI Studio or the Google Cloud console.
Create an API key for Gemini.
Save this API key , we’ll need it to authenticate your LLM in CrewAI.
Install Dependencies: Install the required packages

pip install crewai
python3.11 -m pip install langchain-google-genai

NOTE: langchain-google-genai requires Python 3.9+

Step 2: **The Scenario & Initializing the Brain**
Let’s build a highly relevant, real-world scenario: An Automated Cloud Infrastructure Design Team. We will create a two-agent crew:

A Principal Cloud Architect to design the system.
A Lead DevSecOps Engineer to tear it apart and review it for vulnerabilities.
First, let’s set up our script and initialize the Gemini “brain” using LangChain’s wrapper.

import os
from crewai import Agent, Task, Crew, Process
from langchain_google_genai import ChatGoogleGenerativeAI

# ==========================================
# 1. Configuration & Setup
# ==========================================
# Replace 'YOUR_API_KEY' with your actual Gemini API key, 
# or set it in your environment variables before running the script.
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "YOUR_API_KEY")

# Initialize the Gemini model
# Using gemini-2.5-flash for complex reasoning and architecture design
gemini_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",
    temperature=0.4 # Slightly creative, but grounded in technical reality
)

Step 3: **Defining the Agents**
Agents need a clear identity to function properly. In CrewAI, we define their role, goal, and backstory to give the LLM strict boundaries and deep, specialized context.

# ==========================================
# 2. Define the Agents
# ==========================================
cloud_architect = Agent(
    role='Principal Cloud Architect',
    goal='Design highly scalable, resilient, and cost-effective cloud infrastructures based on user requirements.',
    backstory=(
        "You are a seasoned cloud architect with 15+ years of experience across AWS, GCP, and Azure. "
        "You excel at designing modern microservices, serverless architectures, and event-driven systems. "
        "Your primary focus is ensuring the system can handle massive scale while keeping latency low."
    ),
    verbose=True,
    allow_delegation=False,
    llm=gemini_llm
)

devsecops_engineer = Agent(
    role='Lead DevSecOps Engineer',
    goal='Rigorously review cloud architectures to identify vulnerabilities, ensure compliance, and enforce zero-trust security.',
    backstory=(
        "You are a paranoid but brilliant cybersecurity veteran. You specialize in cloud security posture management, "
        "IAM least-privilege policies, network isolation, and data encryption. You view every architecture through "
        "the lens of a potential attacker and fix flaws before deployment."
    ),
    verbose=True,
    allow_delegation=False,
    llm=gemini_llm
)

Step 4: **Defining the Tasks**
Agents are useless without clear instructions. Tasks in CrewAI define what needs to be done, the expected output, and who is responsible for executing it.

# ==========================================
# 3. Define the Tasks
# ==========================================
project_scenario = (
    "A global e-commerce platform transitioning from a monolith to microservices. "
    "It requires secure user authentication, a high-throughput inventory management system, "
    "and seamless integration with third-party payment gateways. It anticipates massive traffic spikes during holiday sales."
)

design_task = Task(
    description=(
        f"Analyze the following project scenario: '{project_scenario}'.\n"
        "Create a comprehensive cloud architecture design. You must specify the cloud provider (or multi-cloud), "
        "compute resources, databases, caching layers, message queues, and content delivery networks. "
        "Justify why you chose these specific services."
    ),
    expected_output="A detailed Architectural Design Document outlining services, data flow, and scaling strategies.",
    agent=cloud_architect
)

security_review_task = Task(
    description=(
        "Critically review the Architectural Design Document produced by the Principal Cloud Architect. "
        "Identify at least 3 potential security vulnerabilities or single points of failure. "
        "Provide concrete, actionable remediations for each vulnerability (e.g., adding WAF, adjusting VPC peering, enforcing KMS encryption)."
    ),
    expected_output="A Security Audit Report listing vulnerabilities found, risk severity, and mandatory architecture modifications.",
    agent=devsecops_engineer
)

Step 5: **Form the Crew and Execute!
**

# ==========================================
# 4. Form the Crew and Execute
# ==========================================
cloud_engineering_crew = Crew(
    agents=[cloud_architect, devsecops_engineer],
    tasks=[design_task, security_review_task],
    process=Process.sequential, # The DevSecOps engineer waits for the Architect
    verbose=True
)

if __name__ == "__main__":
    print("Booting up the Automated Cloud Infrastructure Design Team...")
    print("Initiating CrewAI sequence. Please wait while the agents collaborate...\n")

    # Kickoff the process
    result = cloud_engineering_crew.kickoff()

    print("\n" + "="*50)
    print("FINAL DEVSECOPS REVIEW & SECURED ARCHITECTURE")
    print("="*50 + "\n")
    print(result)

Complete code:-

import os
from crewai import Agent, Task, Crew, Process
from langchain_google_genai import ChatGoogleGenerativeAI

# ==========================================
# 1. Configuration & Setup
# ==========================================
# Replace 'YOUR_API_KEY' with your actual Gemini API key, 
# or set it in your environment variables before running the script.
os.environ["GOOGLE_API_KEY"] = os.getenv("GOOGLE_API_KEY", "YOUR_API_KEY")

# Initialize the Gemini model
# Using gemini-2.5-flash for complex reasoning and architecture design
gemini_llm = ChatGoogleGenerativeAI(
    model="gemini-2.5-flash",

    temperature=0.4 # Slightly creative, but grounded in technical reality
)

# ==========================================
# 2. Define the Agents
# ==========================================
cloud_architect = Agent(
    role='Principal Cloud Architect',
    goal='Design highly scalable, resilient, and cost-effective cloud infrastructures based on user requirements.',
    backstory=(
        "You are a seasoned cloud architect with 15+ years of experience across AWS, GCP, and Azure. "
        "You excel at designing modern microservices, serverless architectures, and event-driven systems. "
        "Your primary focus is ensuring the system can handle massive scale while keeping latency low."
    ),
    verbose=True,
    allow_delegation=False,
    llm=gemini_llm
)

devsecops_engineer = Agent(
    role='Lead DevSecOps Engineer',
    goal='Rigorously review cloud architectures to identify vulnerabilities, ensure compliance, and enforce zero-trust security.',
    backstory=(
        "You are a paranoid but brilliant cybersecurity veteran. You specialize in cloud security posture management, "
        "IAM least-privilege policies, network isolation, and data encryption. You view every architecture through "
        "the lens of a potential attacker and fix flaws before deployment."
    ),
    verbose=True,
    allow_delegation=False,
    llm=gemini_llm
)

# ==========================================
# 3. Define the Tasks
# ==========================================
# The scenario we want them to work on
project_scenario = (
    "A global e-commerce platform transitioning from a monolith to microservices. "
    "It requires secure user authentication, a high-throughput inventory management system, "
    "and seamless integration with third-party payment gateways. It anticipates massive traffic spikes during holiday sales."
)

design_task = Task(
    description=(
        f"Analyze the following project scenario: '{project_scenario}'.\n"
        "Create a comprehensive cloud architecture design. You must specify the cloud provider (or multi-cloud), "
        "compute resources, databases, caching layers, message queues, and content delivery networks. "
        "Justify why you chose these specific services."
    ),
    expected_output="A detailed Architectural Design Document outlining services, data flow, and scaling strategies.",
    agent=cloud_architect
)

security_review_task = Task(
    description=(
        "Critically review the Architectural Design Document produced by the Principal Cloud Architect. "
        "Identify at least 3 potential security vulnerabilities or single points of failure. "
        "Provide concrete, actionable remediations for each vulnerability (e.g., adding WAF, adjusting VPC peering, enforcing KMS encryption)."
    ),
    expected_output="A Security Audit Report listing vulnerabilities found, risk severity, and mandatory architecture modifications.",
    agent=devsecops_engineer
)

# ==========================================
# 4. Form the Crew and Execute
# ==========================================
cloud_engineering_crew = Crew(
    agents=[cloud_architect, devsecops_engineer],
    tasks=[design_task, security_review_task],
    process=Process.sequential, # The DevSecOps engineer waits for the Architect to finish
    verbose=True
)

if __name__ == "__main__":
    print("🚀 Booting up the Automated Cloud Infrastructure Design Team...")
    print("Initiating CrewAI sequence. Please wait while the agents collaborate...\n")

    # Kickoff the process
    result = cloud_engineering_crew.kickoff()

    print("\n" + "="*50)
    print("FINAL DEVSECOPS REVIEW & SECURED ARCHITECTURE")
    print("="*50 + "\n")
    print(result)

Results:-
Run this script in terminal and watch Gemini stream its thought process

Integrate Other Google Tools (Optional)
Want to take this to the enterprise level? CrewAI supports robust integrations with Google’s Workspace apps via its enterprise platform/tools ecosystem

Google Drive

You can allow agents to upload/download files to Drive — useful for storing outputs.

Google Docs
Create, read, and edit Google Docs documents.

Google Sheets
Create, read, and update Google Sheets spreadsheets and manage worksheet data.

To enable these, you connect your Google account via OAuth in CrewAI’s integrations dashboard then grant permissions.

**
Documentation References**

https://docs.crewai.com/en/introduction

https://ai.google.dev/gemini-api/docs/crewai-example

https://developers.googleblog.com/building-agents-google-gemini-open-source-frameworks/