Delafosse Olivier

Posted on May 27 • Originally published at coreprose.com

Inside Google’s Agent Executor: Open Runtime for Production AI Agents

#ai #machinelearning #llm #programming

Originally published on CoreProse KB-incidents

Most agent frameworks excel at demos, not at running stateful, tool-calling agents 24/7 under enterprise SLOs. Production failures usually come from hallucinations, PII leaks, and behavioral drift that never appeared in the prototype. [1]

Google’s Gemini Enterprise Agent Platform, Agent Runtime, and Agent Governance Stack directly address these issues: long-running state, fleet governance, and security that fits a microservice estate rather than a notebook. [10]

An open-source “Agent Executor” aligned with this stack would give teams a shared runtime for tools, state, governance hooks, and observability—so Agent Ops is not rebuilt from scratch in every project. [3][5]

1. Why Production AI Agents Need a Dedicated Executor Runtime

Most open frameworks optimize for:

Rapid prototyping
Simple tool chains
Quick UI wiring

In production, agents fail unless you add strong testing and runtime guardrails beyond basic orchestration. [1][5]

Once an agent is customer-facing, teams must handle:

SLOs, incidents, and on-call
Scaling, caching, rate limits, token budgets
IAM, secrets, network boundaries
Rollbacks, experiments, and change control [4]

This operational discipline—Agent Ops—surrounds a stateful, LLM-powered service calling APIs, retrieval, and multi-step workflows with many failure modes. [4]

Google’s Gemini Enterprise Agent Platform reflects this with:

Long-running Agent Runtime (up to seven days of state)
Agent Governance Stack for identity, registry, and policies
Code-first orchestration, tools, and data access (e.g., Sales Intelligence Agent) [10][11]

An open Agent Executor would encode these patterns into a composable runtime, matching Google’s “prototype to enterprise” guidance. [3][10]

2. Core Architecture of a Google-Style Agent Executor Runtime

A reliable agent stack must align models, orchestration, memory, tools, and observability. [5] A design flaw in any one layer can cause latency spikes, broken workflows, or opaque errors.

A Google-style Agent Executor would coordinate:

Model layer: Gemini APIs, routing/fallback, cost-aware selection
Orchestration: planning loops, branching, retries (LangGraph- or ADK-like) [5][11]
Memory & retrieval: history, RAG, durable state
Tools/actions: typed APIs with IAM and rate limits [4][5]
Observability: traces, metrics, logs, evaluation hooks [2][8]

Stable contracts between layers let teams swap backends without rewriting agent logic.
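
To make the idea of a stable contract concrete, here is a minimal sketch for the model layer only. All names (`ModelBackend`, `FallbackRouter`, `FlakyModel`, `EchoModel`) are hypothetical and are not part of any Google or Gemini API; the stand-in backends simply simulate a failing primary model and a working fallback.

```python
from typing import Protocol


class ModelBackend(Protocol):
    """Hypothetical contract: any model backend exposes generate()."""

    def generate(self, prompt: str) -> str: ...


class FlakyModel:
    """Stand-in for a primary model that is currently unavailable."""

    def generate(self, prompt: str) -> str:
        raise TimeoutError("primary model timed out")


class EchoModel:
    """Stand-in for a cheaper fallback model."""

    def generate(self, prompt: str) -> str:
        return f"echo: {prompt}"


class FallbackRouter:
    """Tries each backend in order and returns the first successful reply."""

    def __init__(self, backends: list[ModelBackend]) -> None:
        self.backends = backends

    def generate(self, prompt: str) -> str:
        errors: list[Exception] = []
        for backend in self.backends:
            try:
                return backend.generate(prompt)
            except Exception as exc:  # a real router would catch narrower errors
                errors.append(exc)
        raise RuntimeError(f"all backends failed: {errors}")


router = FallbackRouter([FlakyModel(), EchoModel()])
reply = router.generate("hello")
```

Because the router itself satisfies the same `generate()` contract, agent logic written against `ModelBackend` would not need to change when a backend is swapped or a fallback is added.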

Long-running agents and checkpointing

Agent Runtime supports workflows with state retained for days, using checkpoint-and-resume so failures or human approvals do not trigger full recomputation. [10]

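# Sketch only: load_state, save_state, planner, executor, and reducer are
# placeholders for your own storage and orchestration components.
def run_step(session_id, input_event):
    state = load_state(session_id)           # resume from the last checkpoint
    plan = planner.step(state, input_event)  # decide the next action
    result = executor.execute(plan)          # run the tool call or model step
    new_state = reducer(state, result)       # fold the result into session state
    save_state(session_id, new_state)        # durable checkpoint before returning
    return result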

Patterns such as delegated approvals—agents pausing for human sign-off while consuming zero compute—should be first-class APIs, not ad-hoc glue. [10]
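
One way such a delegated-approval API might look is sketched below. It assumes an in-memory dictionary as the checkpoint store and invented function names (`request_approval`, `resume`); a real runtime would persist to a durable database and would not necessarily expose this shape.

```python
import json

# Hypothetical in-memory checkpoint store; a real runtime would use a durable database.
CHECKPOINTS: dict[str, str] = {}


def save_checkpoint(session_id: str, state: dict) -> None:
    CHECKPOINTS[session_id] = json.dumps(state)


def load_checkpoint(session_id: str) -> dict:
    return json.loads(CHECKPOINTS[session_id])


def request_approval(session_id: str, action: str) -> dict:
    """Persist the pending action and return; nothing keeps running while waiting."""
    state = {"status": "waiting_approval", "pending_action": action, "log": []}
    save_checkpoint(session_id, state)
    return state


def resume(session_id: str, approved: bool) -> dict:
    """Called when the human decision arrives, possibly days later."""
    state = load_checkpoint(session_id)
    if state["status"] != "waiting_approval":
        raise ValueError("session is not waiting for approval")
    action = state.pop("pending_action")
    if approved:
        state["log"].append(f"executed: {action}")
        state["status"] = "done"
    else:
        state["log"].append(f"rejected: {action}")
        state["status"] = "cancelled"
    save_checkpoint(session_id, state)
    return state


request_approval("s1", "refund order 123")
final_state = resume("s1", approved=True)
```

The state is serialized to JSON between the two calls, so the process that resumes the session does not have to be the one that paused it.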

Self-improving memory

Advanced stacks move beyond flat context windows using: [2]

Vector search for semantic recall
Graph databases for relationships
Background jobs to extract insights and resolve conflicts

An Executor should provide:

Pluggable vector + graph backends
Built-in conflict resolution strategies
Automatic insight extraction from interaction logs [2]
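
As an illustration of a pluggable backend with a built-in conflict strategy, the sketch below uses a toy in-memory store. The class names, the last-write-wins rule, and the hand-written two-dimensional vectors are all assumptions made for the example; production embeddings would come from a model and the strategy would be configurable.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    key: str             # what the fact is about, e.g. "user.plan"
    value: str
    vector: list[float]  # embedding; hand-written here, model-generated in practice
    timestamp: int


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Toy backend: last-write-wins per key, cosine similarity for recall."""

    def __init__(self) -> None:
        self.items: dict[str, Memory] = {}

    def upsert(self, memory: Memory) -> None:
        current = self.items.get(memory.key)
        # Conflict resolution: keep the newer fact when two writes share a key.
        if current is None or memory.timestamp >= current.timestamp:
            self.items[memory.key] = memory

    def search(self, query: list[float], top_k: int = 1) -> list[Memory]:
        ranked = sorted(
            self.items.values(),
            key=lambda m: cosine(query, m.vector),
            reverse=True,
        )
        return ranked[:top_k]


memory_store = InMemoryVectorStore()
memory_store.upsert(Memory("user.plan", "free tier", [1.0, 0.0], timestamp=1))
memory_store.upsert(Memory("user.plan", "enterprise", [1.0, 0.1], timestamp=2))
memory_store.upsert(Memory("user.city", "Lyon", [0.0, 1.0], timestamp=1))
best = memory_store.search([1.0, 0.0])[0]
```

Here the second write to `user.plan` replaces the first, so recall returns the newer fact rather than two contradictory ones.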

Orchestration across frameworks and protocols

Modern systems mix:

LangGraph graphs
A2A multi-agent protocols
MCP-based tools [2]

The runtime must unify these, coordinating planning loops and tool calls. Google’s code-first multi-agent patterns in Go and ADK can be generalized into reusable lifecycle hooks, tool schemas, and routing. [11]

Here, the Executor is the contract that makes heterogeneous frameworks behave as one operable system. [2][5]
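
A rough sketch of that contract for tools is a single registry that every framework adapter writes into. The `ToolSpec` and `ToolRegistry` names are hypothetical, and the `source` field is only a label: the example does not call the LangGraph, A2A, or MCP SDKs, it assumes each tool has already been wrapped as a plain callable.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    source: str                  # label only: "mcp", "langgraph", "a2a", ...
    handler: Callable[..., Any]  # already-adapted plain callable
    required_args: tuple[str, ...] = ()


class ToolRegistry:
    """One lookup and validation path, regardless of where a tool came from."""

    def __init__(self) -> None:
        self._tools: dict[str, ToolSpec] = {}

    def register(self, spec: ToolSpec) -> None:
        if spec.name in self._tools:
            raise ValueError(f"duplicate tool name: {spec.name}")
        self._tools[spec.name] = spec

    def call(self, name: str, **kwargs: Any) -> Any:
        spec = self._tools[name]
        missing = [arg for arg in spec.required_args if arg not in kwargs]
        if missing:
            raise TypeError(f"{name} missing arguments: {missing}")
        return spec.handler(**kwargs)


registry = ToolRegistry()
registry.register(ToolSpec("add", "mcp", lambda a, b: a + b, ("a", "b")))
registry.register(ToolSpec("shout", "langgraph", lambda text: text.upper(), ("text",)))
total = registry.call("add", a=2, b=3)
```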

3. Security, Governance, and Observability as First-Class Concerns

Most serious incidents involve:

Prompt injection
Data exfiltration
PII exposure [1]

Static policy documents cannot stop a malicious input or a compromised tool at request time; the runtime itself must enforce the defenses.

Isolation and sandboxing

Google’s GKE Agent Sandbox uses gVisor to run each agent in a hardened, per-request sandbox with sub-second cold starts. [7] A robust Executor should integrate:

Per-session sandboxes (Kubernetes/gVisor-like) [7]
Fine-grained IAM for tools and data [10]
Secrets management and scoped credentials [4]
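
Fine-grained IAM for tools can be pictured as a deny-by-default scope check that runs before every tool call. The sketch below is illustrative only: the scope strings, tool names, and `authorize` function are invented for the example and do not reflect any specific IAM product.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentIdentity:
    agent_id: str
    scopes: frozenset[str] = field(default_factory=frozenset)


# Hypothetical mapping from tool name to the scope it requires.
TOOL_SCOPES = {
    "crm.read_account": "crm:read",
    "crm.update_account": "crm:write",
}


class PermissionDenied(Exception):
    pass


def authorize(identity: AgentIdentity, tool_name: str) -> None:
    """Deny by default: unknown tools and missing scopes both fail."""
    required = TOOL_SCOPES.get(tool_name)
    if required is None or required not in identity.scopes:
        raise PermissionDenied(f"{identity.agent_id} may not call {tool_name}")


support_agent = AgentIdentity("support-bot", frozenset({"crm:read"}))
authorize(support_agent, "crm.read_account")  # allowed, returns None
```

Calling `authorize(support_agent, "crm.update_account")` raises `PermissionDenied`, as does any tool name missing from the mapping.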

Guardrails and adversarial testing

Production agents need active defenses wired into the request pipeline, for example: [2][9]

LlamaFirewall for input/output/tool guardrails
Arcade for OAuth2-protected tools with approvals
Apex for adversarial prompt-injection testing in CI and live traffic

Every request should pass through a standard guardrail chain owned by the Executor. [2]
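
A guardrail chain of this kind can be sketched as an ordered list of functions that each either pass text along (possibly rewritten) or raise. The checks below are deliberately naive stand-ins, a keyword match and an email regex; they do not use or imitate the LlamaFirewall, Arcade, or Apex APIs.

```python
import re
from typing import Callable

Guardrail = Callable[[str], str]  # returns (possibly rewritten) text or raises


class GuardrailViolation(Exception):
    pass


def block_injection(text: str) -> str:
    # Naive keyword check for illustration; real detectors are far more robust.
    if "ignore previous instructions" in text.lower():
        raise GuardrailViolation("possible prompt injection")
    return text


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def run_chain(text: str, chain: list[Guardrail]) -> str:
    """Apply each guardrail in order; any guardrail may raise to block the request."""
    for guard in chain:
        text = guard(text)
    return text


INPUT_CHAIN = [block_injection, redact_emails]
clean = run_chain("Contact jane.doe@example.com about the invoice", INPUT_CHAIN)
```

Keeping the chain as data owned by the runtime means individual agents cannot silently skip a check, and new guardrails can be added in one place.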

Observability beyond logs

Agent monitoring needs reasoning-level visibility: [8]

Decision traces and rationales
Tool calls and parameters
Behavioral metrics over time

Platforms like LangSmith and IntellAgent already capture traces and behavior to detect drift. [2][8] One team, for instance, saw support agents offering excessive discounts; traces revealed a retrieval config change that over-weighted old sales playbooks. Monitoring surfaced the issue within hours. [2][8]
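
To show what reasoning-level visibility could mean in data terms, here is a small sketch of a trace that records decisions with rationales and tool calls with parameters. The `Trace` and `TraceEvent` structures are assumptions for this example and are not the LangSmith or IntellAgent data model.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class TraceEvent:
    kind: str    # "decision" or "tool_call"
    name: str
    detail: dict
    ts: float = field(default_factory=time.time)


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    events: list[TraceEvent] = field(default_factory=list)

    def decision(self, name: str, rationale: str) -> None:
        self.events.append(TraceEvent("decision", name, {"rationale": rationale}))

    def tool_call(self, name: str, **params) -> None:
        self.events.append(TraceEvent("tool_call", name, {"params": params}))

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = Trace()
trace.decision("offer_discount", "customer cited a competitor quote")
trace.tool_call("pricing.lookup", sku="A-100", discount_pct=15)

# A behavioral metric can then be derived from the recorded parameters.
discounts = [
    e.detail["params"]["discount_pct"] for e in trace.events if e.kind == "tool_call"
]
```

Aggregating a field such as `discount_pct` over time is the kind of behavioral metric that would surface a drift like the discount incident described above.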

Google’s Agent Governance Stack adds: [9][10]

Fleet policies and agent identities
Unified security dashboards
Audits, anomaly detection, and Responsible AI guardrails

In a serious Executor, security and observability form the spine of the runtime, not optional extras. [1][2][10]

4. Performance, Cost Management, and Infrastructure Integration

Agent Ops overlaps directly with infrastructure and FinOps concerns: [4]

Scaling across clusters
Rate-limit handling
Token and compute spend control

These should be standardized in the runtime instead of reinvented per agent.

Infra-aware runtime

Typical production environments already use: [4]

ECS or Kubernetes/GKE for containers
Redis for caches and embeddings
OpenSearch or Postgres for search/vector
DynamoDB (or similar) for session memory

An Executor should expose storage interfaces so existing Redis/Postgres/OpenSearch/Dynamo stacks plug in without custom glue. [4][5]
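
A storage interface of this sort might look like the sketch below. `SessionStore`, `InMemorySessionStore`, and `record_turn` are hypothetical names; the in-memory class is a test double, and a Redis, Postgres, or DynamoDB adapter would be a separate class exposing the same two methods.

```python
import copy
from typing import Optional, Protocol


class SessionStore(Protocol):
    """Hypothetical contract the runtime depends on; backends implement it."""

    def get(self, session_id: str) -> Optional[dict]: ...

    def put(self, session_id: str, state: dict) -> None: ...


class InMemorySessionStore:
    """Test double with the same methods a database-backed adapter would expose."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def get(self, session_id: str) -> Optional[dict]:
        stored = self._data.get(session_id)
        return copy.deepcopy(stored) if stored is not None else None

    def put(self, session_id: str, state: dict) -> None:
        # Deep copy so later mutation by the caller cannot change stored state.
        self._data[session_id] = copy.deepcopy(state)


def record_turn(store: SessionStore, session_id: str, message: str) -> dict:
    """Agent logic sees only the SessionStore contract, never the backend."""
    state = store.get(session_id) or {"turns": []}
    state["turns"].append(message)
    store.put(session_id, state)
    return state


session_store = InMemorySessionStore()
record_turn(session_store, "s1", "hello")
session_state = record_turn(session_store, "s1", "what is my balance?")
```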

GKE Agent Sandbox shows gVisor isolation co-existing with sub-second cold starts, enabling per-request sandboxes for latency-sensitive workloads. [7]

Deployment patterns

Realistic deployments include: [2]

Docker + FastAPI services
GPU scaling on Runpod
On-prem inference via Ollama
Managed execution with AWS Bedrock AgentCore (infra + tracking)

A Google-aligned Executor can standardize: [10]

Request tracking and correlation IDs
Latency histograms and SLOs
Cost attribution per user, agent, or tool
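
The sketch below shows one possible shape for request tracking with correlation IDs, latency recording, and per-agent cost attribution. The price constant is invented for the example, the `tracked_request` helper is hypothetical, and latencies are kept as raw samples; bucketing them into a histogram is left out.

```python
import time
import uuid
from collections import defaultdict
from contextlib import contextmanager

# Hypothetical price per 1,000 tokens; real pricing depends on the model and provider.
PRICE_PER_1K_TOKENS = 0.002

cost_by_agent: dict[str, float] = defaultdict(float)
latencies_ms: dict[str, list[float]] = defaultdict(list)


@contextmanager
def tracked_request(agent: str):
    """Attach a correlation ID, then record latency and cost when the request ends."""
    ctx = {"correlation_id": uuid.uuid4().hex, "agent": agent, "tokens": 0}
    start = time.perf_counter()
    try:
        yield ctx
    finally:
        latencies_ms[agent].append((time.perf_counter() - start) * 1000)
        cost_by_agent[agent] += ctx["tokens"] / 1000 * PRICE_PER_1K_TOKENS


with tracked_request("sales-agent") as ctx:
    ctx["tokens"] += 1500  # tokens reported by the model call

with tracked_request("sales-agent") as ctx:
    ctx["tokens"] += 500
```

The same context object could carry a user or tool identifier, so that cost is attributed along whichever dimension the team bills or budgets by.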

Cost and reliability trade-offs

Misconfigurations—like recursive tools or huge contexts—can: [5][9]

Explode token costs
Cause timeouts and brittle workflows

A full-stack Executor can enforce: [4][5]

Global token and API budgets
Per-tool concurrency/backoff
SLO-aware degradation (cheaper models, skipping non-critical tools)
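
As a sketch of how a global token budget and a simple degradation rule might interact, consider the following. The `TokenBudget` class, the "less than half the budget left" threshold, and the model names are placeholders chosen for illustration, not a recommended policy.

```python
class BudgetExceeded(Exception):
    pass


class TokenBudget:
    """Global token budget shared by all calls in one session (hypothetical)."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    def remaining(self) -> int:
        return self.limit - self.used

    def charge(self, tokens: int) -> None:
        if tokens > self.remaining():
            raise BudgetExceeded(f"need {tokens}, only {self.remaining()} left")
        self.used += tokens


def choose_model(budget: TokenBudget, estimated_tokens: int) -> str:
    """Degrade to a cheaper model when the call would leave under half the budget."""
    if estimated_tokens > budget.remaining():
        raise BudgetExceeded("request does not fit in the remaining budget")
    if budget.remaining() - estimated_tokens < budget.limit / 2:
        return "small-model"  # placeholder names, not real model IDs
    return "large-model"


budget = TokenBudget(limit=10_000)
first = choose_model(budget, 2_000)   # leaves 8,000 of 10,000: large model
budget.charge(2_000)
second = choose_model(budget, 4_000)  # would leave 4,000 of 10,000: small model
```

Per-tool concurrency limits and backoff would sit alongside this check; they are omitted here to keep the example short.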

Performance and cost become part of the runtime contract with infra. [4][7][10]

5. Implementation Roadmap and Ecosystem Positioning

Most frameworks still provide shallow security, weak compliance mapping, and minimal observability, pushing enterprises to bolt on their own guardrails. [1] An open-source Agent Executor can be the production backbone these frameworks plug into.

From reference stack to runtime

A comprehensive production stack—self-improving memory, adversarial testing, multi-environment deploys—already exists as a reference tutorial. [2] An Executor could unify this into:

A standard lifecycle (plan → act → observe → evaluate)
Built-in evaluation and behavioral tests
First-class hooks for security and governance services [2][3]
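
The plan → act → observe → evaluate lifecycle with hooks could be sketched roughly as follows. The `Lifecycle` class and its hook signature are assumptions for this example; the planner is a trivial string normalization, standing in for an LLM-driven planning step.

```python
from typing import Callable

Hook = Callable[[str, dict], None]  # receives (stage name, stage payload)


class Lifecycle:
    """Runs plan -> act -> observe -> evaluate and notifies hooks at each stage."""

    STAGES = ("plan", "act", "observe", "evaluate")

    def __init__(self, hooks: list[Hook]) -> None:
        self.hooks = hooks

    def _emit(self, stage: str, payload: dict) -> None:
        for hook in self.hooks:
            hook(stage, payload)

    def run(
        self,
        task: str,
        tool: Callable[[str], str],
        check: Callable[[str], bool],
    ) -> dict:
        plan = {"task": task, "tool_input": task.strip().lower()}
        self._emit("plan", plan)
        output = tool(plan["tool_input"])
        self._emit("act", {"output": output})
        observation = {"output": output, "length": len(output)}
        self._emit("observe", observation)
        verdict = {"passed": check(output)}
        self._emit("evaluate", verdict)
        return {**observation, **verdict}


seen: list[str] = []
lifecycle = Lifecycle(hooks=[lambda stage, payload: seen.append(stage)])
outcome = lifecycle.run(
    "  Summarize Q3  ",
    tool=lambda s: f"summary of {s}",
    check=lambda o: "q3" in o,
)
```

Security and governance services would attach as hooks, which lets them observe or veto every stage without the agent code knowing they exist.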

Google’s prototype-to-production guide calls out evaluation, governance, and Gemini integration as core; these map directly to Executor features. [3][10]

Codifying expert practices

Specialist AI agent firms repeatedly implement: [6]

Reasoning loops and multi-agent patterns
Memory hierarchies and validation layers
Permission models and evaluation hooks

Encoding these as primitives lets smaller teams benefit without reinventing them.

Production-focused literature emphasizes: [5][9][11]

Multi-agent orchestration
Scalable memory architectures
Framework trade-offs (LangChain vs LangGraph)
Cost optimization and guardrails in real deployments

Google’s four-step framework for startups recommends starting with single-agent workflows, then introducing multi-agent patterns as maturity grows. [3][10] An open Agent Executor, aligned with this path, can turn today’s prototype-heavy ecosystem into one where robust, governed, and observable agents are the default.
