Svetlana Perekrestova

Posted on • Originally published at sperva.hashnode.dev

Why 'Brownfield' Deployments Break Agent Architectures — Lessons from Google

Last week I attended Google AI Agents Reloaded Live Labs Benelux and ended up winning in the "Agents for Good" track. 🎉

Between lab sessions and back-to-back talks, I filled several pages of notes — and a lot of it challenged or refined assumptions I had about building agents. This isn't a recap of "what is an AI agent." It's the non-obvious stuff: where the real traps are, what the data actually says, and which design decisions have outsized consequences in production.


First: the honest framing for 2026

The production track opened with this:

"If 2025 was the year of the agent, 2026 is the year of making it work."

Followed immediately by a slide that just read: "Reality check: it's all brownfield."

No one is deploying agents into pristine infrastructure built from scratch. They go into existing systems, legacy APIs, organizational processes, and teams that were never designed for autonomous AI. That constraint changes almost every architectural decision — and it's the lens through which everything else in this article should be read.


The A2A Protocol — and why the streaming part matters

Google's Agent-to-Agent (A2A) protocol is a standardized, framework-agnostic way for agents to discover each other, communicate, and delegate tasks. The spec is clean, but the interesting design decisions are in the details.

Discovery via Agent Cards

Remote agents advertise their capabilities at a well-known endpoint:

```
/.well-known/agent-card.json
```

This is the agent's capability manifest — it declares name, description, URL, version, skills, and supported content types. A client agent fetches this before sending any task. It's roughly the AI equivalent of an OpenAPI spec, but for autonomous agents rather than REST endpoints.

An Agent Card definition in code looks like:

```python
agent_card = AgentCard(
    name='Currency Agent',
    description='Helps with exchange rates for currencies',
    url=f'http://{host}:{port}/',
    version='1.0.0',
    default_input_modes=CurrencyAgent.SUPPORTED_CONTENT_TYPES,
    default_output_modes=CurrencyAgent.SUPPORTED_CONTENT_TYPES,
    skills=[skill],
)
```

The communication model: Messages, Tasks, Artifacts

All communication happens via JSON-RPC over HTTP(S). The core data structures:

  • Message — has a role (user or agent) and one or more Parts (text, file, or JSON)

  • Task — returned by the server agent with an id and status for async tracking

  • Artifact — the final output payload, also structured as Parts

The Agent Executor sits inside the server-side agent and handles incoming messages, runs the internal reasoning loop, and emits responses back to the client.
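To make the structures concrete, here is a sketch of the JSON-RPC envelope a client agent sends to submit a Message. This is illustrative, not an official client: the field names and the `message/send` method follow my reading of the public A2A spec, which has changed between versions, so verify against the spec revision you target.

```python
import json

def build_send_message_request(request_id: str, text: str) -> str:
    """Build the JSON-RPC 2.0 envelope for submitting a Message to a server agent."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "message/send",   # A2A method name per the spec version I saw
        "params": {
            "message": {
                "role": "user",     # role is "user" or "agent"
                "parts": [          # one or more Parts: text, file, or JSON
                    {"kind": "text", "text": text}
                ],
            }
        },
    }
    return json.dumps(payload)

request = build_send_message_request("req-1", "What is the EUR/USD rate today?")
```

The server agent responds with a Task carrying an `id` and `status` for async tracking, and eventually emits Artifacts, which are structured as Parts just like Messages.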

Polling is explicitly an anti-pattern

This was called out directly in the session: polling task.status over HTTP for long-running tasks is inefficient. You're hammering an endpoint waiting for a status change when the server could simply tell you when it's done.

The right mechanism is SSE (Server-Sent Events) — the server agent pushes updates (initial task acknowledgment, intermediate messages, final artifacts) over a persistent HTTPS connection. You declare support in the agent card with streaming: true.

For multi-step agentic workflows where the client (or an orchestrating parent agent) needs to react to intermediate outputs, this isn't optional. Polling at scale compounds into latency and unnecessary infrastructure load.
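A production client would lean on an A2A SDK's streaming support rather than hand-rolling this, but the wire format underneath is plain SSE. As an illustration of what the client is actually consuming, here is a minimal parser for the `data:` payloads of a decoded SSE stream:

```python
def parse_sse_data(stream_text: str) -> list[str]:
    """Collect the `data:` payloads from a decoded SSE stream.

    Per the SSE format, events are terminated by a blank line and
    multi-line data fields are joined with newlines.
    """
    events, buffer = [], []
    for line in stream_text.splitlines():
        if line.startswith("data:"):
            buffer.append(line[5:].lstrip())
        elif line == "" and buffer:   # blank line terminates the current event
            events.append("\n".join(buffer))
            buffer = []
    if buffer:                        # stream ended without a trailing blank line
        events.append("\n".join(buffer))
    return events
```

Each payload in an A2A stream would then be parsed as JSON: a task acknowledgment, an intermediate message, or a final artifact.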

A2A vs MCP — complementary, not competing

This distinction came up several times across talks and labs:

  • MCP (Model Context Protocol): connects an agent to tools — APIs, functions, data sources. It defines the agent↔tool interface. Primitives are Tools, Resources, and Prompts.

  • A2A: connects agents to agents. Full task delegation between autonomous agents that each have their own reasoning loops, tools, and memory.

A well-designed multi-agent system uses both: MCP for tool access within an agent, A2A for coordination across agent boundaries. The live demo showed a reference implementation worth looking at if you're planning a multi-agent architecture.


Evaluating agents properly — this section deserves its own article

The evaluation talk (by Naz Bayrak from Google) was the most practically useful session of the day. The field underinvests here, and it shows.

You need to evaluate two dimensions, not one

Every agent run produces:

  1. Final response — did the agent achieve the goal? Is the output correct and useful?

  2. Trajectory — what path did it take? Which tools were called, in what order?

Evaluating only the output misses an entire category of bugs: agents that arrive at correct-looking answers via wrong reasoning, agents that take five steps when two suffice, agents that call the wrong tool but recover via hallucination. These bugs surface reliably under slightly different inputs — and you won't know they exist until they do.

The three evaluation methods and their actual tradeoffs

| Method | Strengths | Weaknesses |
| --- | --- | --- |
| Human evaluation | Captures nuance, human factors, trust signals | Subjective, slow, expensive, doesn't scale |
| LLM-as-a-Judge | Scalable, consistent, automated | Bounded by the judge model's capability ceiling; misses complex intermediate steps |
| Automated metrics | Objective, deterministic, fits in CI | Can't measure creativity or complex reasoning; susceptible to gaming |

None of these is sufficient alone. The pattern that works in production is layered: automated metrics catch regressions at scale, LLM-as-Judge handles qualitative assessment, human evaluation validates the trust signals that automation can't quantify.

Six trajectory metrics — and which to use when

ADK provides six distinct strategies for comparing an agent's actual trajectory against a "golden run":

  • Exact match — perfect replication required (strictest)

  • In-order match — correct steps in the correct order, extra steps allowed

  • Any-order match — correct steps in any order, extra steps allowed

  • Precision — how relevant/correct are the predicted actions?

  • Recall — how many required actions were actually captured?

  • Single-tool use — did the agent use a specific tool at least once?

Choosing wrong matters. Any-order match makes sense when steps are genuinely parallelizable. In-order match is right when sequence is semantically meaningful — a lookup must precede a write, a validation must precede a commit. Precision vs. recall is the classic tradeoff: optimize for precision when false positives are costly, recall when missing required steps is the bigger risk.
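To make the strategies concrete, here are toy re-implementations over lists of tool-call names. These are illustrative sketches of the matching logic, not the ADK's actual evaluator API:

```python
def exact_match(actual: list[str], golden: list[str]) -> bool:
    """Strictest: the trajectory must replicate the golden run exactly."""
    return actual == golden

def in_order_match(actual: list[str], golden: list[str]) -> bool:
    """Golden steps appear in order within actual; extra steps allowed."""
    it = iter(actual)  # membership tests consume the iterator, enforcing order
    return all(step in it for step in golden)

def any_order_match(actual: list[str], golden: list[str]) -> bool:
    """All golden steps appear somewhere in actual; order and extras ignored."""
    return set(golden) <= set(actual)

def precision(actual: list[str], golden: list[str]) -> float:
    """Fraction of predicted tool calls that were actually required."""
    return sum(1 for s in actual if s in golden) / len(actual) if actual else 0.0

def recall(actual: list[str], golden: list[str]) -> float:
    """Fraction of required tool calls the agent actually made."""
    return sum(1 for s in golden if s in actual) / len(golden) if golden else 0.0
```

Note how `in_order_match(["write", "lookup"], ["lookup", "write"])` fails while the any-order variant passes: exactly the lookup-before-write distinction described above.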

LLM-as-a-Judge: how to structure the rubric

The pattern that works:

```
You are an impartial AI quality analyst. Rate the following response
on a scale of 1-5 for [criterion].
Does the response [specific check]?
Explain your reasoning.
```

Expected output — structured JSON, not prose:

```json
{
  "groundedness_score": 5,
  "reasoning": "The response accurately reflects the source document and makes no unsupported claims."
}
```

Ground truth (a reference answer) is recommended but not required. Without it, scores are less reliable — include it whenever you have it.
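Because the judge's verdict feeds automated pipelines, it is worth validating defensively: judge models occasionally emit prose instead of JSON, or out-of-range scores. A minimal sketch, assuming the key name from the example above and a 1-5 scale:

```python
import json

def parse_judge_verdict(raw: str, score_key: str = "groundedness_score"):
    """Parse an LLM judge's verdict, failing closed on malformed output.

    Returning None keeps broken verdicts out of your aggregate metrics
    instead of silently counting them as some default score.
    """
    try:
        verdict = json.loads(raw)
        score = verdict[score_key]
    except (json.JSONDecodeError, KeyError, TypeError):
        return None
    if not isinstance(score, int) or not 1 <= score <= 5:
        return None
    return {"score": score, "reasoning": verdict.get("reasoning", "")}
```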

Two case studies worth unpacking

Case Study 1: Multi-agent retail pricing system

Challenge: validate a system where three agents (Data, Action, Forecasting) had to collaborate correctly. The final output alone wasn't enough to trust — if the orchestrator called agents in the wrong order, outputs could look plausible while the underlying process was broken.

Three-layer evaluation:

  1. Trajectory testing (pytest-automated): confirmed the orchestrator called the right agents in the right order. Tests verified that data queries started with transfer_to_agent('DataAgent') — catching internal process errors even when final outputs looked fine.

  2. LLM-as-Judge with custom rubric: scored final responses on "Business Clarity" and "Conciseness" — qualitative, business-relevant criteria that no automated metric captures.

  3. Rubric decomposition for stress testing: for complex multi-constraint prompts, one LLM generated a checklist of all constraints the response should satisfy; a second LLM gave binary Yes/No answers per item. Explainable, detailed failure analysis rather than a single opaque score.
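The two LLM calls in step 3 are elided here, but the aggregation that makes the result explainable can be sketched in a few lines (the constraint names are hypothetical):

```python
def aggregate_checklist(verdicts: dict[str, bool]) -> dict:
    """Turn per-constraint Yes/No judge verdicts into an explainable score."""
    failed = [constraint for constraint, ok in verdicts.items() if not ok]
    return {
        "pass_rate": (len(verdicts) - len(failed)) / len(verdicts) if verdicts else 0.0,
        "failed_constraints": failed,  # the explainable part: *which* checks failed
    }

report = aggregate_checklist({
    "mentions the updated price": True,
    "stays under 100 words": False,
})
```

The failure list is the point: instead of one opaque 3/5, you learn exactly which constraint the response violated.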

Key learning: For multi-agent back-end systems, the process is the product. Trajectory correctness must be validated before output quality — a correct-looking answer via a wrong path is a bug, not a success.

Case Study 2: Customer-facing software assistant

Challenge: technical accuracy wasn't enough — the agent needed to be genuinely helpful and feel trustworthy to real users.

Strategy:

  • AI-simulated conversations: a "Simulated IT Pro" LLM dynamically generated realistic multi-turn dialogues, creating a large test dataset without expensive human annotation.

  • Expert Evaluator LLM: scored transcripts on helpfulness and task adherence, providing a quantitative baseline.

  • "Vibes-based" human testing: domain experts interacted directly with the agent and provided qualitative feedback on whether guidance felt right in real-world context — the thing automated systems can't capture.

  • Structured human evaluation: 1-5 scoring forms with free-form notes, identifying nuance and domain-specific errors the LLM judge missed.

Key learning: Automation provides scale; it doesn't provide trust. For user-facing agents, qualitative domain-expert signal isn't optional — it's the thing you're ultimately optimizing for.

Evaluation is a loop, not a gate

The four-stage continuous evaluation cycle:

  1. Code & Build — validate logic before anything runs in production

  2. Quality & Behavior Eval — pre-deployment testing of intelligence and behavior

  3. Release & Live — real users generate real data

  4. Observe, Analyze, Capture — production insights update golden datasets for the next iteration

Production interactions are your richest source of new test cases. Teams that close this loop improve continuously. Teams that treat evaluation as a pre-ship gate plateau.


Agentic memory — full context is not the answer

This was the most empirically grounded section of the whole conference, and it directly challenged a common default.

The accuracy data

| Approach | Accuracy |
| --- | --- |
| No memory | 10.8% |
| Full context (all history) | 55.4% |
| Competing memory offering | 63.8% |
| Reflective Memory Bank (RMB) | 74.6% |

Full context underperforms selective memory. There are two reasons:

  1. Cost: full context processes all history every request; selective memory retrieves only the relevant subset.

  2. Context rot: as irrelevant history accumulates in the context window, LLM output quality degrades systematically — the "lost in the middle" effect. The model's attention is diluted across noise.

The implication runs counter to the intuition that "more context = better": giving the model less but more relevant information outperforms giving it everything.

Two distinct memory layers

Sessions (short-term memory)

Agent Engine is inherently stateless — it doesn't store anything between calls. Sessions is the layer that adds statefulness. It stores conversation history, agent actions, and state within a single session, and eliminates the need to manage your own conversation history database.

Memory Bank (long-term memory)

Memory Bank persists facts across multiple sessions, linked to a specific userId. After each session ends, an LLM automatically extracts key facts from the conversation and stores them. Retrieval is similarity-based (semantic search) or by userId lookup. The extraction happens server-side, invisible to the user.

Sessions and Memory Bank are independent — you can use either or both.

How Reflective Memory Management actually works

RMM (from Google Research, arXiv:2503.08026) is the underlying mechanism:

  • Prospective Reflection: after a session ends, the system decomposes the dialogue by topic, summarizes key facts, and stores or merges them into the bank

  • Retrospective Reflection: before responding to a new query, retrieves potentially relevant topic summaries

  • Adaptive Reranking: a learnable module (trained via RL on LLM citation behavior) refines the Top-K retrieved memories to the Top-M most relevant — this is the key differentiator from naive vector similarity search
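The retrieve-then-rerank shape can be sketched in a few lines. The real reranker is a learned module trained via RL; here it is stubbed as a scoring callback, and all names and vectors are illustrative:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Plain cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def retrieve_then_rerank(query_vec, memories, rerank_fn, k=4, m=2):
    """Stage 1: cheap vector similarity narrows the bank to Top-K candidates.
    Stage 2: a (here stubbed) learned reranker refines Top-K down to Top-M."""
    top_k = sorted(memories,
                   key=lambda mem: cosine(query_vec, mem["vec"]),
                   reverse=True)[:k]
    return sorted(top_k, key=rerank_fn, reverse=True)[:m]
```

The design point is that the two stages optimize different things: similarity search is fast but shallow, while the reranker encodes which memories the model actually found useful.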

Wiring it up

With ADK, PreloadMemoryTool handles the full lifecycle — finding memories at session start, injecting them into context, and telling Memory Bank to learn from the session when it ends:

```python
agent = adk.Agent(
    model=MODEL_NAME,
    name="helpful_assistant",
    instruction="""You are a helpful assistant with perfect memory.
        - Use the context to personalize responses
        - Naturally reference past conversations when relevant
        - Build upon previous knowledge about the user""",
    tools=[adk.tools.preload_memory_tool.PreloadMemoryTool()],
)

runner = adk.Runner(
    agent=agent,
    app_name=app_name,
    session_service=VertexAiSessionService(
        project=PROJECT_ID, location=LOCATION, agent_engine_id=agent_engine_id
    ),
    memory_service=VertexAiMemoryBankService(
        project=PROJECT_ID, location=LOCATION, agent_engine_id=agent_engine_id
    ),
)
```

For non-ADK frameworks (LangGraph, CrewAI), Memory Bank exposes a direct API with generate_memories and retrieve_memories — same capabilities, more manual wiring.


Agents in production: what changes at scale

The "agentic drift" problem

As agent autonomy increases, SRE and ops teams face a new operational discipline: agentic drift — when agent behavior gradually diverges from intended behavior in production without clear error signals. Agents may continue producing plausible-looking outputs while quietly drifting from their intended goals. This is harder to detect than traditional software failures, where errors are usually explicit.

Per the session's framing, AI can self-resolve the bulk of routine incidents; the roughly 5% of complex, novel outages it cannot handle become the critical escalation path for human operators.

AI improves individuals but increases delivery instability

Survey data from the conference: 85% of respondents report AI has increased their individual productivity (13% extremely, 31% moderately, 41% slightly). But the measured organizational-level effects showed software delivery instability as the highest-impact negative outcome of AI adoption.

Individual speed gains don't automatically propagate to team or organizational outcomes. The gap between "I'm faster" and "we ship better" is where governance and process come in — and it's not a gap most teams are deliberately closing.

Two strategic frames worth keeping

The most useful mental model from the whole event:

  • Agent as Actor (done by agents): agents build and operate things autonomously — dev automation, code generation, orchestration.

  • Agent as Artifact (done to agents): agents are platform components that must be deployed, scaled, secured, governed, observed, and optimized.

Most teams invest heavily in the first and underinvest in the second. Production failures tend to originate in the second.


The design principles I'm keeping

Evaluate trajectories, not just outputs. A correct answer reached by an incorrect path is a bug waiting to surface on the next slightly different input. Trajectory evaluation is not optional — it's the signal that output evaluation misses.

Memory architecture is a first-class design decision. The choice between in-context history, Sessions, and Memory Bank has measurable accuracy and cost implications. The roughly 19-point accuracy gap between full context (55.4%) and RMB (74.6%) is large enough to matter in production. It's not an implementation detail you optimize later.

Streaming over polling. SSE-based push vs. HTTP polling is the difference between a responsive and a broken UX for any long-running agent task. Design for it upfront.

Evaluation is a loop. Pre-deployment evals catch known issues. Production observation discovers the unknown unknowns. The feedback loop between production traces and golden datasets is where agents actually improve over time — and it needs to be built deliberately, not added retroactively.

Design for brownfield. The clean-room version of an agentic system is a prototype. The real one integrates with what already exists. Every architectural decision should be stress-tested against that constraint.


Based on notes from Google AI Agents Reloaded Live Labs Benelux. A2A demo code: github.com/MKand/agenticprotocols. Memory research: arXiv:2503.08026.
