The AI ecosystem is rapidly shifting from ephemeral, single-turn chatbots to autonomous, distributed software agents that execute complex operations over hours, days, or weeks. For site reliability engineers (SREs) and platform architects, this shift introduces massive challenges: state drift, network dropouts, untrusted code execution, and unmanageable infrastructure costs.
To bridge this production readiness gap, Google has open-sourced Agent eXecutor (AX) under the Apache 2.0 license. Written in Go, AX is a Kubernetes-native, distributed runtime standard built specifically to schedule, isolate, persist, and scale long-running agentic workloads across enterprise data planes.
Here is a deep dive into the architecture of AX and why it represents the infrastructure blueprint for production-grade AI.
1. The Core Architecture: Durable Execution and Resumption
Existing orchestration frameworks excel at prototyping agent logic but often break down when the underlying infrastructure fails. If a container restarts or a network call times out mid-task, the agent's state is lost.
AX treats agents as stateful, resilient microservices. It provides out-of-the-box durability through two architectural pillars:
                     ┌──────────────────────────────┐
                     │          AX Router           │
                     └──────────────┬───────────────┘
                                    │ (Resumable Streams)
                                    ▼
                     ┌──────────────────────────────┐
                     │        AX Controller         │
                     │  (Single-Writer, Event Log)  │
                     └──────────────┬───────────────┘
            ┌───────────────────────┼───────────────────────┐
            ▼                       ▼                       ▼
    ┌────────────────┐      ┌────────────────┐      ┌────────────────┐
    │ Isolated Worker│      │ Isolated Worker│      │   Native MCP   │
    │    (Agent)     │      │    (Skill)     │      │     Server     │
    └────────────────┘      └────────────────┘      └────────────────┘
The Event Log & Snapshotting
AX intercepts all context modifications, tool calls, and LLM completions, committing them to a high-throughput durable event log managed by a Single-Writer architecture. If an agent crashes or is descheduled by Kubernetes, a new worker spins up, replays the event log, and resumes execution seamlessly without repeating expensive LLM calls or duplicating external API mutations.
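The replay behavior described above can be illustrated with a small in-memory sketch. This is not AX's API: `Event`, `EventLog`, `Step`, and `Execute` are hypothetical names, and a slice stands in for the durable log. The point it shows is that steps already recorded in the log are answered from their stored results, so only the unfinished steps trigger real work.

```go
package main

import "fmt"

// Event is one committed step and its recorded result (hypothetical type).
type Event struct {
	Step   int
	Kind   string // e.g. "llm" or "tool"
	Result string
}

// EventLog is an in-memory stand-in for a durable, append-only log.
type EventLog struct {
	events []Event
}

// Append commits the result of a step that was actually executed.
func (l *EventLog) Append(kind, result string) Event {
	e := Event{Step: len(l.events), Kind: kind, Result: result}
	l.events = append(l.events, e)
	return e
}

// Step describes one unit of agent work; Run performs the side effect.
type Step struct {
	Kind string
	Run  func() string
}

// Execute runs the plan against the log. Steps already present in the log are
// replayed from their recorded result; only the remaining steps call Run.
// It also returns how many steps were executed live.
func Execute(l *EventLog, plan []Step) (results []string, executed int) {
	for i, s := range plan {
		if i < len(l.events) {
			results = append(results, l.events[i].Result) // replay, no side effect
			continue
		}
		e := l.Append(s.Kind, s.Run())
		results = append(results, e.Result)
		executed++
	}
	return results, executed
}

func main() {
	calls := 0
	plan := []Step{
		{Kind: "llm", Run: func() string { calls++; return "plan: fetch invoice" }},
		{Kind: "tool", Run: func() string { calls++; return "invoice fetched" }},
		{Kind: "llm", Run: func() string { calls++; return "summary written" }},
	}

	l := &EventLog{}
	// The first worker completes two steps and then "crashes".
	Execute(l, plan[:2])

	// A replacement worker replays the log and runs only the third step.
	results, executed := Execute(l, plan)
	fmt.Println(results, executed, calls)
}
```

In this sketch the replacement worker executes one step live, and the side-effecting functions run three times in total rather than five. A real implementation would also need the log write and the side effect to be coordinated (for example with idempotency keys), which this example leaves out.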
Connection Recovery & Resumable Streams
In long-running workflows, client-to-agent disconnects are inevitable. AX routes client communications via resumable streams. If the connection drops, the client simply reconnects to the AX Controller, which automatically backfills all events missed during the outage window.
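One common way to implement this kind of backfill is an offset cursor: the client remembers the position of the last event it saw and presents it on reconnect. The sketch below assumes that design; `Stream`, `Publish`, and `ReadFrom` are illustrative names, not part of AX.

```go
package main

import "fmt"

// Stream is a hypothetical per-session event buffer kept on the server side.
// An event's offset is its index in the slice.
type Stream struct {
	events []string
}

// Publish appends an event and returns its offset.
func (s *Stream) Publish(ev string) int {
	s.events = append(s.events, ev)
	return len(s.events) - 1
}

// ReadFrom returns every event at or after the given offset, plus the cursor
// the client should present on its next call.
func (s *Stream) ReadFrom(offset int) ([]string, int) {
	if offset < 0 {
		offset = 0
	}
	if offset > len(s.events) {
		offset = len(s.events)
	}
	// Copy so callers cannot mutate the server-side buffer.
	missed := append([]string(nil), s.events[offset:]...)
	return missed, len(s.events)
}

func main() {
	s := &Stream{}
	s.Publish("step 1 started")
	s.Publish("tool call: search")

	// The client reads what exists so far and remembers the cursor.
	_, cursor := s.ReadFrom(0)

	// The connection drops; the agent keeps working.
	s.Publish("tool result: 3 hits")
	s.Publish("step 1 finished")

	// On reconnect the client presents its cursor and receives the backfill.
	missed, cursor := s.ReadFrom(cursor)
	fmt.Println(missed, cursor)
}
```

Because the cursor lives with the client, the server does not need to track per-connection state; it only has to retain events long enough to cover the longest outage it wants to tolerate.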
2. Native Model Context Protocol (MCP) Support
Instead of forcing developers into a proprietary ecosystem, Google has built AX with native support for the Model Context Protocol (MCP).
AX treats MCP servers as dynamically discoverable, sandboxed actors. The central AX Controller abstracts away the operational complexity of managing multi-tenant tool lifecycles. When an agent requests a tool call, the AX Controller looks the tool up in the tool registry, checks the request against the tool's schema, dispatches it to the MCP server over a secure channel, and records the interaction in the central audit log.
This decoupling keeps tools portable: any standard enterprise database, file system, or internal API exposed via an MCP server can serve as an operational tool inside an AX runtime environment.
3. Kubernetes Native Scaling via Agent Substrate
Standard Kubernetes deployments are highly optimized for thousands of static, long-running REST APIs or gRPC services. However, an enterprise agent workflow can generate millions of short-lived, bursty, sub-second tool calls that can quickly overwhelm a standard Kubernetes control plane.
To handle this architectural strain, Google paired AX with Agent Substrate, a complementary open-source control plane layer for Kubernetes designed for ultra-scale agent infrastructure density.
| Feature | Standard Kubernetes (K8s) | Kubernetes with AX & Agent Substrate |
|---|---|---|
| Control Plane Target | Thousands of long-running services | Millions of highly active agent sessions |
| Idle Capacity Management | Pods remain warm, drawing continuous compute resources | Pod Snapshots suspend idle workloads to a cold state |
| Scaling Architecture | Horizontal Pod Autoscaler (reacts in seconds to minutes) | Fast allocation (300 sandboxes/sec at <200 ms latency) |
| Workload Isolation | Containers share the node's kernel | Strict sandboxing via gVisor / Kata Containers |
By leveraging Pod Snapshots, Agent Substrate allows AX to freeze an agent's memory state and CPU context when it pauses for human feedback or goes idle. The agent's resource footprint drops to near zero, freeing up cluster compute. As soon as a callback or event triggers the agent, it resumes from the snapshot with sub-second initialization time.
4. Advanced Debugging: Trajectory Branching
Debugging a failed state deep within a non-deterministic agentic loop is notoriously difficult. To address this, AX exposes a debugging primitive called Trajectory Branching.
Because AX records every execution step in its event log, developers can branch an agentic execution path from any historical checkpoint. If an agent hits a logic exception at step 45 of an operation, you can spin up an alternative trajectory branch from the checkpoint taken after step 44, hot-patch the agent's prompts or underlying code, and re-run from that exact snapshot without re-executing steps 1 through 44.
Getting Started
Because AX is framework-agnostic, you can build your agents using your preferred framework (LangGraph, AutoGen, or custom Go/Python codebases) and hand execution management off to the AX runtime.
The AX CLI is written in Go and can be installed directly from the public GitHub repository:
    go install github.com/google/ax/cmd/ax@latest
    ax --help