
Fabricio Quagliariello

Beyond the Notebook: 4 Architectural Patterns for Production-Ready AI Agents

This is a submission for the Google AI Agents Writing Challenge: Learning Reflections


Introduction

The gap between a "Hello World" agent running in a Jupyter Notebook and a reliable, production-grade system is not a step—it's a chasm (and it is not an easy one to cross).

I recently had the privilege of participating in the 5-Day AI Agents Intensive Course with Google and Kaggle. After completing the coursework and finalizing the capstone project, I realized that beyond the many things we enjoyed in this course (insightful white papers, carefully designed notebooks, and exceptional expert panels in the live sessions), the real treasure wasn't just learning the ADK syntax—it was the architectural patterns subtly embedded within the lessons.

As an architect building production systems for over 20 years, including multi-agent workflows and enterprise integrations, I've seen firsthand where theoretical agents break under real-world constraints.

We are moving from an era of "prompt engineering" to "agent architecture" where "context engineering" is key. As with any other emerging architectural paradigm, this shift demands blueprints that ensure reliability, efficiency, and ethical safety. Without them, we risk agents that silently degrade, violate user privacy, or execute irreversible actions without oversight.

Drawing from the course and my own experience as an AI Architect, I have distilled the curriculum into four essential patterns that transform fragile prototypes into robust production systems:

The 4 Core Patterns

  1. Outside-In Evaluation Hierarchy: Shifting focus from the final answer to the decision-making trajectory.
  2. Dual-Layer Memory Architecture: Balancing ephemeral session context with persistent, consolidated knowledge.
  3. Protocol-First Interoperability: Decoupling agents from tools using standardized protocols like MCP and A2A.
  4. Long-Running Operations & Resumability: Managing state for asynchronous tasks and human-in-the-loop workflows.

Throughout this analysis, I'll apply a 6-point framework grounded in the principles of Principia Agentica—ensuring these patterns respect human sovereignty, fiduciary responsibility, and meaningful human control.

The Analysis Framework

  1. The Production Problem: Why naive approaches fail at scale.
  2. The Architectural Solution: The specific design pattern taught in the course.
  3. Key Implementation Details: Concrete code-level insights from the ADK notebooks.
  4. Production Considerations: Real-world deployment implications (latency, cost, scale).
  5. Connection to Ethical Design: How the pattern supports human sovereignty, fiduciary responsibility, or ethical agent architecture. I will include a "failure scenario" where I'll try to illustrate what could happen without the ethical safeguard.
  6. Key Takeaways: A distilled summary of each pattern's production principle, implementation guidance, and ethical anchor—designed to serve as a quick reference for architects moving from prototype to production.

Let's do this!


Pattern 1: Outside-In Evaluation Hierarchy (Trajectory as Truth)

In traditional software, if the output is correct, the test passes. In agentic AI, a correct answer derived from a hallucination or a dangerous logic path is a ticking time bomb.

1. The Production Problem

Naive evaluation strategies often fail in production due to the non-deterministic nature of LLMs. We face two specific traps:

  • The "Lucky Guess" Trap: Imagine an agent asked to "Get the weather in Tokyo." A bad agent might hallucinate "It is sunny in Tokyo" without calling the weather tool. If it happens to be sunny, a traditional assert result == expected test passes. This hides a critical failure in logic that will break as soon as the weather changes.
  • The "Silent Failure" of Efficiency: An agent might solve a user request but take 25 steps to do what should have taken 3. This bloats token costs and latency—a failure mode that boolean output checks completely miss.

2. The Architectural Solution

Day 4 of the course introduced the concept of Glass Box Evaluation. We move away from simple output verification to a hierarchical approach:

  1. Level 1: Black Box (End-to-End): Did the user get the right result?
  2. Level 2: Glass Box (Trajectory): Did the agent use the correct tools in the correct order?
  3. Level 3: Component (Unit): Did the individual tools perform as expected?

This shift treats the trajectory (Thought → Action → Observation) as the unit of truth. By evaluating the trajectory, we ensure the agent isn't just "getting lucky," but is actually reasoning correctly.

(Diagram: the outside-in evaluation hierarchy — Black Box, Glass Box, Component)
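
Before diving into the ADK specifics, here is a minimal, framework-agnostic sketch of what Level 2 adds over Level 1 (plain Python with illustrative names, not ADK APIs):

# Illustrative only: plain Python, not ADK APIs.
def black_box_check(actual_answer: str, expected_answer: str) -> bool:
    # Level 1: only the final output is inspected, so a lucky hallucination can pass.
    return expected_answer.lower() in actual_answer.lower()

def glass_box_check(actual_calls: list[dict], expected_calls: list[dict]) -> bool:
    # Level 2: the trajectory must match -- right tools, right order, right arguments.
    if len(actual_calls) != len(expected_calls):
        return False
    return all(
        a["tool_name"] == e["tool_name"] and a["tool_input"] == e["tool_input"]
        for a, e in zip(actual_calls, expected_calls)
    )

# An agent that answers "It is sunny in Tokyo" without ever calling the weather
# tool can pass black_box_check, but it will fail glass_box_check.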

3. Implementation Details: Field Notes from the ADK

The ADK provides specific primitives to capture and score these trajectories without writing custom parsers for every test.

From adk web to evalset.json
Instead of manually writing test cases, the ADK encourages a "Capture and Replay" workflow. During development (using adk web), when you spot a successful interaction, you can persist that session state. This generates an evalset.json that captures not just the input/output, but the expected tool calls.

// Conceptual structure of an ADK evalset entry
// Traditional test: just input/output
// ADK evalset contains evalcases with invocations: input (queries) + expected_tool_use + reference (output)
{
  "name": "ask_GOOGLE_price", // a given name of the evaluation set
  "data": [ // evaluation cases are included here
    {
      "query": "What is the stock price of GOOG?", // user input
      "reference": "The price is $175...", // expected semantic output
      "expected_tool_use": [ // expected trajectory
        {
          "tool_name": "get_stock_price",
          "tool_input": { // arguments passed to the tool
            "symbol": "GOOG"
          }
        }
      ]
    }
    // other evaluation cases ...
  ],
  "initial_session": {
    "state": {},
    "app_name": "hello_world",
    "user_id": "user_..." // the specific id of the user
  }
}


This JSON represents an EvalSet containing one EvalCase. Each EvalCase has a name, data (which is a list of invocations), and an optional initial_session. Each invocation within the data list includes a query, expected_tool_use, expected_intermediate_agent_responses, and a reference response.

The EvalSet object itself also includes eval_set_id, name, description, eval_cases, and creation_timestamp fields.

Configuring the Judge

In the test_config.json, we can move beyond simple string matching. The course demonstrated configuring LLM-as-a-Judge evaluators.

  • Naive Approach: Uses an exact match evaluator (brittle, fails on phrasing differences).
  • Architectural Approach: Uses TrajectoryEvaluator alongside SemanticSimilarity. The ADK allows us to define "Golden Sets" where the reasoning path is the standard, allowing the LLM judge to penalize agents that skip steps or hallucinate data, even if the final text looks plausible.

Core Configuration Components

To configure an LLM-as-a-Judge effectively, you must construct a specific input payload with four components:

  1. The Agent's Output: The actual response generated by the agent you are testing.
  2. The Original Prompt: The specific instruction or query the user provided.
  3. The "Golden" Answer: A reference answer or ground truth to serve as a benchmark.
  4. A Detailed Evaluation Rubric: Specific criteria (e.g., "Rate helpfulness on a scale of 1-5") and requirements for the judge to explain its reasoning.

ADK Default Evaluators

The ADK Evaluation Framework includes several default evaluators, accessible via the MetricEvaluatorRegistry:

  • RougeEvaluator: Uses the ROUGE-1 metric to score similarity between an agent's final response and a golden response.
  • FinalResponseMatchV2Evaluator: Uses an LLM-as-a-judge approach to determine if an agent's response is valid.
  • TrajectoryEvaluator: Assesses the accuracy of an agent's tool use trajectories by comparing the sequence of tool calls against expected calls. It supports various match types (EXACT, IN_ORDER, ANY_ORDER).
  • SafetyEvaluatorV1: Assesses the safety (harmlessness) of an agent's response, delegating to Vertex Gen AI Eval SDK.
  • HallucinationsV1Evaluator: Checks if a model response contains any false, contradictory, or unsupported claims by segmenting the response into sentences and validating each against the provided context.
  • RubricBasedFinalResponseQualityV1Evaluator: Assesses the quality of an agent's final response against user-defined rubrics, using an LLM as a judge.
  • RubricBasedToolUseV1Evaluator: Assesses the quality of an agent's tool usage against user-defined rubrics, employing an LLM as a judge.

These evaluators can be configured using EvalConfig objects, which specify the criteria and thresholds for assessment.
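For example, when running adk eval from the command line, the criteria typically live in a test_config.json. Here is a minimal sketch; the criterion names (tool_trajectory_avg_score, response_match_score) are taken from the ADK evaluation docs, so verify them against your installed version:

import json

# Hedged sketch: criterion names and thresholds are assumptions based on the
# ADK evaluation docs; check your installed version before relying on them.
eval_criteria = {
    "criteria": {
        "tool_trajectory_avg_score": 1.0,   # expected tool calls must match exactly
        "response_match_score": 0.8,        # ROUGE-based similarity threshold for the final answer
    }
}

with open("test_config.json", "w") as f:
    json.dump(eval_criteria, f, indent=2)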

Bias Mitigation Strategies

A major challenge is handling bias, such as the tendency for models to give average scores or prefer the first option presented:

  • Pairwise Comparison (A/B Testing): Instead of asking for an absolute score, configure the judge to compare two different responses (Answer A vs. Answer B) and force a choice. This yields a "win rate," which is often a more reliable signal.
  • Swapping Operation: To counter position bias, invoke the judge twice, swapping the order of the candidates. If the results are inconsistent, the result can be labeled as a "tie".
  • Rule Augmentation: Embed specific evaluation principles, references, and rubrics directly into the judge's system prompt.

Advanced Configuration: Agent-as-a-Judge

There's a distinction between standard LLM-as-a-Judge (which evaluates final text outputs) and Agent-as-a-Judge:

  • Standard LLM-as-a-Judge: Best for evaluating the final response (e.g., "Is this summary accurate?").
  • Agent-as-a-Judge: Necessary when you need to evaluate the process, not just the result. You configure the judge to ingest the agent's full execution trace (including internal thoughts, tool calls, and tool arguments). This allows the judge to assess intermediate steps, such as whether the correct tool was chosen or if the plan was logically structured.

Evaluation Architectures

You can use several architectural approaches when configuring your judge:

  • Point-wise: The judge evaluates a single candidate in isolation.
  • Pair-wise / List-wise: The judge compares two or more candidates simultaneously to produce a ranking.
  • Multi-Agent Collaboration: For high-stakes evaluation, you can configure multiple LLM judges to debate or vote (e.g., "Peer Rank" algorithms) to produce a final consensus, rather than relying on a single model.

Example Configuration

For a pairwise comparison judge, instruct the judge to return its verdict in a structured JSON format:

{
  "winner": "A", // or "B" or "tie"
  "rationale": "Answer A provided more specific delivery details..."
}

This structured output allows you to programmatically parse the judge's decision and calculate metrics like win/loss rates at scale.
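A minimal sketch of that parsing step, combined with the swapping operation described earlier (judge_once is a placeholder for however you invoke your judge model; only the parsing and swap logic is the point):

import json

def pairwise_verdict(judge_once, prompt: str, answer_a: str, answer_b: str) -> str:
    # Run the judge twice, swapping candidate order to counter position bias.
    first = json.loads(judge_once(prompt, answer_a, answer_b))["winner"]    # "A", "B", or "tie"
    second = json.loads(judge_once(prompt, answer_b, answer_a))["winner"]   # candidates swapped

    # Map the swapped run back to the original labels.
    unswapped = {"A": "B", "B": "A", "tie": "tie"}[second]

    # Inconsistent verdicts across the swap indicate position bias: call it a tie.
    return first if first == unswapped else "tie"

def win_rate(verdicts: list[str], candidate: str = "A") -> float:
    # Win rate over decisive (non-tie) comparisons.
    decisive = [v for v in verdicts if v != "tie"]
    return sum(v == candidate for v in decisive) / len(decisive) if decisive else 0.0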

Analogy

You can think of configuring an LLM-as-a-Judge like setting up a blind taste test. If you just hand a judge a cake and ask "Is this good?", they might be polite and say "Yes." But if you provide them with a Golden Answer (a cake baked by a master chef) and use Pairwise Comparison (ask "Which of these two is better?"), you force them to make a critical distinction, resulting in far more accurate and actionable feedback.

4. Production Considerations

Moving this pattern from a notebook to a live system requires handling scale and cost.

  • Dynamic Sampling: You cannot trace and judge every single production interaction with an LLM—it’s too expensive. A robust pattern is 100/10 sampling: capture 100% of traces that result in user errors or negative feedback, but only sample 10% of successful sessions to monitor for latency drift (P99) and token bloat. (A minimal sketch of this sampling rule follows this list.)
  • The Evaluation Flywheel: Evaluation isn't a one-time gate before launch. Production traces (captured via OpenTelemetry) must be fed back into the development cycle. Every time an agent fails in production, that specific trajectory should be anonymized and added to the evalset.json as a regression test.
  • Latency Impact: Trajectory logging must be asynchronous. The user should receive their response immediately, while the trace data is pushed to the observability store (like LangSmith or a custom SQL db) in a background thread to avoid degrading the user experience.
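
Here is the sampling rule from the first bullet as a minimal sketch (the session_failed flag is illustrative; wire it to your own error and feedback signals):

import random

def should_trace(session_failed: bool, success_sample_rate: float = 0.10) -> bool:
    # Failures and negative feedback: always capture the full trace.
    if session_failed:
        return True
    # Successful sessions: sample ~10% to watch for latency drift and token bloat.
    return random.random() < success_sample_rate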

5. Ethical Connection

"The Trajectory is the Truth" is the technical implementation of Fiduciary Responsibility. We cannot claim an agent is acting in the user's best interest if we only validate the result (the "what") while ignoring the process (the "how"). We must ensure the agent isn't achieving the right ends through manipulative, inefficient, or unethical means.

Concrete Failure Scenario:

Consider a hiring agent that filters job candidates. Without trajectory validation, it could discriminate based on protected characteristics (age, gender, ethnicity) during the filtering process, yet pass all output tests by producing a "diverse" final shortlist through cherry-picking. The bias hides in the how—which resumes were read, which criteria were weighted, which candidates were never considered. Output validation alone cannot detect this algorithmic discrimination. Only trajectory evaluation exposes the unethical reasoning path.

Key Takeaways

  • Production Principle: Trust the reasoning process, not just the output. Trajectory validation is the difference between lucky guesses and reliable intelligence.
  • Implementation: Use ADK's TrajectoryEvaluator with EvalSet objects to capture expected tool calls alongside expected outputs. Configure LLM-as-a-Judge with Golden Sets and pairwise comparison to avoid evaluation bias.
  • Ethical Anchor: This pattern operationalizes Fiduciary Responsibility—we validate that the agent serves the user's interests through sound reasoning, not through shortcuts, hallucinations, or hidden bias.

Validating the how is critical, but what happens when the reasoning path spans not just one conversation turn, but weeks or months? An agent that reasons correctly in the moment can still fail catastrophically if it forgets what it learned yesterday. This brings us to our second pattern: managing the agent's memory architecture.

Pattern 2: Dual-Layer Memory Architecture (Session vs. Memory)

1. The Production Problem

Although models like Gemini 1.5 have introduced massive context windows, treating context as infinite is an architectural anti-pattern.

Consider a Travel Agent Bot: In Session 1, the user mentions a "shellfish allergy." By Session 10, months later, that critical fact is buried under thousands of tokens of hotel searches and flight comparisons.

This might lead to two very concrete failures:

  • Context Rot: As the context window fills with noise, the model's ability to attend to specific, older instructions (like the allergy) degrades.
  • Cost Spiral: Re-sending the entire history of every past interaction for every new query creates a linear cost increase that makes the system economically unviable at scale.

2. The Architectural Solution

We must distinguish between the Workbench and the Filing Cabinet.

  • The Session (Workbench): An ephemeral, mutable space for the current task. It holds the immediate "Hot Path" context. To keep it performant, we apply Context Compaction—automatically summarizing or truncating older turns while keeping the most recent ones raw.
  • The Memory (Filing Cabinet): A persistent layer for consolidated facts. This requires an ETL (Extract, Transform, Load) pipeline where the agent Extracts facts from the session, Consolidates them (deduplicating against existing knowledge), and Stores them for semantic retrieval later.

3. Implementation Details: Code Insights

The ADK moves memory management from manual implementation to configuration.

Session Hygiene via Compaction
In the ADK, we don't manually trim strings. We configure the agent to handle its own hygiene using EventsCompactionConfig.

from google.adk.agents.base_agent import BaseAgent
from google.adk.apps.app import App, EventsCompactionConfig
from google.adk.apps.llm_event_summarizer import LlmEventSummarizer # Assuming this is your summarizer

# Define a simple BaseAgent for the example
class MyAgent(BaseAgent):
    name: str = "my_agent"
    description: str = "A simple agent."

    def call(self, context, content):
        pass

# Create an instance of LlmEventSummarizer or your custom summarizer
my_summarizer = LlmEventSummarizer()

# Create an EventsCompactionConfig
events_compaction_config_instance = EventsCompactionConfig(
    summarizer=my_summarizer,
    compaction_interval=5,
    overlap_size=2
)

# Create an App instance with the EventsCompactionConfig
my_app = App(
    name="my_application",
    root_agent=MyAgent(),
    events_compaction_config=events_compaction_config_instance
)

print(my_app.model_dump_json(indent=2))


Persistence: From RAM to DB
In notebooks, we often use InMemorySessionService. This is dangerous for production because a container restart wipes the conversation. The architectural shift is moving to DatabaseSessionService (backed by SQL or Firestore) which persists the Session object state, allowing users to resume conversations across devices.

The Memory Consolidation Pipeline
Day 3b introduced the framework for moving from raw storage to intelligent consolidation. This is where the "Filing Cabinet" becomes smart. The workflow is an LLM-driven ETL pipeline with four stages:

  1. Ingestion: The system receives raw session history.
  2. Extraction & Filtering: An LLM analyzes the conversation to extract meaningful facts, guided by developer-defined Memory Topics:
    The LLM extracts only facts matching these topics.

    # Conceptual configuration (Vertex AI Memory Bank, Day 5)
    memory_topics = [
      "user_preferences",      # "Prefers window seats"
      "dietary_restrictions",  # "Allergic to shellfish"
      "project_context"        # "Leading Q4 marketing campaign"
    ]
    
    
  3. Consolidation (The "Transform" Phase): The LLM retrieves existing memories and decides:

    • CREATE: Novel information → new memory entry.
    • UPDATE: New info refines existing memory → merge (e.g., "Likes marketing" becomes "Leading Q4 marketing project").
    • DELETE: New info contradicts old → invalidate (e.g., Dietary restrictions change).
  4. Storage: Consolidated memories persist to a vector database for semantic retrieval.

Note: While Day 3b uses InMemoryMemoryService to teach the API, it stores raw events without consolidation. For production-grade consolidation, we look to the Vertex AI Memory Bank integration introduced in Day 5.

Retrieval Strategies: Proactive vs. Reactive
The course highlighted two distinct patterns for getting data out of the Filing Cabinet:

  1. Proactive (preload_memory): Injects relevant user facts into the system prompt before the model generates a response. Best for high-frequency preferences (e.g., "User always prefers aisle seats").
  2. Reactive (load_memory): Gives the agent a tool to search the database. The agent decides if it needs to look something up. Best for obscure facts, since it only spends tokens when retrieval is actually needed. (A hedged sketch of both retrieval styles follows this list.)
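
A heavily hedged sketch of both retrieval styles: the import paths and tool names below are assumptions based on the ADK memory documentation and may differ between ADK versions, so treat them as placeholders to verify:

from google.adk.agents import LlmAgent
# Assumed import paths; some ADK versions expose these as preload_memory_tool /
# load_memory_tool instead. Verify against your installed version.
from google.adk.tools import preload_memory, load_memory

# Proactive: relevant memories are injected into the prompt on every turn.
concierge = LlmAgent(
    name="travel_concierge",
    model="gemini-2.0-flash",
    instruction="Plan trips using what you already know about this user.",
    tools=[preload_memory],
)

# Reactive: the agent decides when to search long-term memory.
researcher = LlmAgent(
    name="travel_researcher",
    model="gemini-2.0-flash",
    instruction="If you need a past user detail, search memory before asking again.",
    tools=[load_memory],
)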

4. Production Considerations

  • Asynchronous Consolidation: Moving data from the Workbench to the Filing Cabinet is expensive. In production, this ETL process should happen asynchronously. Do not make the user wait for the agent to "file its paperwork." Trigger the memory extraction logic in a background job after the session concludes.
  • Semantic Search: Keyword search is insufficient for the Filing Cabinet. Production memory requires vector embeddings. If a user asks for "romantic dining," the system must be able to retrieve a past note about "candlelight dinners," even if the word "romantic" wasn't used.
  • The "Context Stuffing" Trade-off: While preload_memory reduces latency (no extra tool roundtrip), it increases input token costs on every turn. load_memory is cheaper on average but adds latency when retrieval is needed.

5. Ethical Design Note

This architecture embodies Privacy by Design. By distinguishing between the transient session and persistent memory, we can implement rigorous "forgetting" protocols.

(Diagram: the dual-layer memory architecture — session vs. long-term memory)

We scrub Personally Identifiable Information (PII) from the session log before it undergoes consolidation into long-term memory, ensuring we act as fiduciaries of user data rather than creating an unmanageable surveillance log.
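
A minimal sketch of such a scrubber, run on session text before the consolidation (ETL) step; production systems would typically delegate this to a dedicated service such as Cloud DLP rather than hand-rolled regexes:

import re

PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "phone": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "credit_card": re.compile(r"\b(?:\d[ -]*?){13,16}\b"),
}

def scrub_pii(text: str) -> str:
    # Replace anything that looks like PII before it can be consolidated
    # into long-term memory.
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED_{label.upper()}]", text)
    return text

# Applied per session event before the memory ETL pipeline sees it:
# fact = extract_fact(scrub_pii(event_text))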

Concrete Failure Scenario:

Imagine a healthcare agent that remembers a patient mentioned their HIV status in Session 1. Without a dual-layer architecture, this fact sits in plain text in the session log forever, accessible to any system with database read permissions. If the system is breached, or if a support engineer needs to debug a session, the patient's private health information is exposed. Worse, without consolidation logic, the system doesn't know to delete this information if the patient later says "I was misdiagnosed—I don't have HIV." The agent treats every utterance as equally permanent, creating a privacy nightmare where sensitive data proliferates uncontrollably across logs and backups.

Key Takeaways

  • Production Principle: Context is expensive, but privacy is priceless. Design memory systems that distinguish between what an agent needs now (hot session) and what it needs forever (consolidated memory).
  • Implementation: Use EventsCompactionConfig for session hygiene and implement a PII scrubber in your ETL pipeline before consolidation. Leverage Vertex AI Memory Bank for production-grade semantic memory with built-in privacy controls.
  • Ethical Anchor: This pattern operationalizes Privacy by Design—we build forgetfulness and data minimization into the architecture, treating user data as a liability to protect, not an asset to hoard.

With robust evaluation validating our agent's reasoning and a dual-layer memory preserving context over time, we might assume our system is production-ready. But there's a hidden fragility: these capabilities are only as good as the tools and data sources the agent can access. When every integration is a bespoke API wrapper, scaling becomes a maintenance nightmare. This brings us to the third pattern: decoupling agents from their dependencies through standardized protocols.

Pattern 3: Protocol-First Interoperability (MCP & A2A)

1. The Production Problem

We are facing an "N×M Integration Trap."

Imagine building a Customer Support Agent. It needs to check GitHub for bugs, message Slack for alerts, and update Jira tickets. Without a standard protocol, you write three custom API wrappers. When GitHub changes an endpoint, your agent breaks.

Now, multiply this across an enterprise. You have 10 different agents needing access to 20 different data sources. You are suddenly maintaining 200 brittle integration points. Furthermore, these agents become isolated silos—the Sales Agent has no way to dynamically discover or ask the Engineering Agent for help because they speak different "languages."

2. The Architectural Solution

The solution is to invert the dependency. Instead of the agent knowing about the specific tool implementation, we adopt a Protocol-First Architecture.

  • Model Context Protocol (MCP): For Tools and Data. It decouples the agent (client) from the tool (server). The agent doesn't need to know how to query a Postgres DB; it just needs to know the MCP interface to ask for data.
  • Agent2Agent (A2A): For Peers and Delegation. It allows for high-level goal delegation. An agent doesn't execute a task; it hands off a goal to another agent via a standardized handshake.
  • Runtime Discovery: Instead of hardcoding tools, agents query an MCP Server or an Agent Card at runtime to discover capabilities dynamically.

(Diagram: protocol-first interoperability — MCP for tools, A2A for peer agents)

3. Implementation Details: Code Examples from the ADK

The ADK abstracts the heavy lifting of these protocols.

Connecting Data via MCP
We don't write API wrappers. We instantiate an McpToolset. The ADK handles the handshake, lists the available tools, and injects their schemas into the context window automatically.

The Model Context Protocol (MCP) is used to connect an agent to external tools and data sources without writing custom API clients. In ADK, we use McpToolset to wrap an MCP server configuration.

Example: Connecting an agent to the "Everything" MCP server:

from google.adk.agents import LlmAgent
from google.adk.tools import McpToolset
from google.adk.tools.mcp_tool.mcp_session_manager import StdioConnectionParams
from google.adk.tools.mcp_tool.mcp_session_manager import StdioServerParameters
from google.adk.runners import Runner # Assuming Runner is defined elsewhere

# 1. Define the connection to the MCP Server
# Here we use 'npx' to run a Node-based MCP server directly
mcp_toolset = McpToolset(
    connection_params=StdioConnectionParams(
        server_params=StdioServerParameters(
            command="npx",
            args=["-y", "@modelcontextprotocol/server-everything"]
        ),
        timeout=10.0 # Optional: specify a timeout for connection establishment
    ),
    # Optionally filter to specific tools provided by the server
    tool_filter=["getTinyImage"]
)

# 2. Add the MCP tools to your Agent
agent = LlmAgent(
    name="image_agent",
    model="gemini-2.0-flash",
    instruction="You can generate tiny images using the tools provided.",
    # The toolset exposes the MCP capabilities as standard ADK tools
    tools=[mcp_toolset] # tools expects a list of ToolUnion
)

# 3. Run the agent
# The agent can now call 'getTinyImage' as if it were a local Python function
runner = Runner(agent=agent, ...) # Fill in Runner details to run


Delegating via A2A (Agent-to-Agent)

The Agent2Agent (A2A) protocol is used to enable collaboration between different autonomous agents, potentially running on different servers or frameworks.

A. Exposing an Agent (to_a2a)
This converts a local ADK agent into an A2A-compliant server that publishes an Agent Card.

To make an agent discoverable, we wrap it using the to_a2a() utility. This generates an Agent Card—a standardized manifest hosted at .well-known/agent-card.json.

from google.adk.agents import LlmAgent
from google.adk.a2a.utils.agent_to_a2a import to_a2a
from google.adk.tools.tool_context import ToolContext
from google.genai import types
import random

# Define the tools
def roll_die(sides: int, tool_context: ToolContext) -> int:
  """Roll a die and return the rolled result.

  Args:
    sides: The integer number of sides the die has.
    tool_context: the tool context
  Returns:
    An integer of the result of rolling the die.
  """
  result = random.randint(1, sides)
  if not 'rolls' in tool_context.state:
    tool_context.state['rolls'] = []

  tool_context.state['rolls'] = tool_context.state['rolls'] + [result]
  return result

async def check_prime(nums: list[int]) -> str:
  """Check if a given list of numbers are prime.

  Args:
    nums: The list of numbers to check.

  Returns:
    A str indicating which number is prime.
  """
  primes = set()
  for number in nums:
    number = int(number)
    if number <= 1:
      continue
    is_prime = True
    for i in range(2, int(number**0.5) + 1):
      if number % i == 0:
        is_prime = False
        break
    if is_prime:
      primes.add(number)
  return (
      'No prime numbers found.'
      if not primes
      else f"{', '.join(str(num) for num in primes)} are prime numbers."
  )

# 1. Define your local agent with relevant tools and instructions
# This example uses the 'hello_world' agent's logic for rolling dice and checking primes.
root_agent = LlmAgent(
    model='gemini-2.0-flash',
    name='hello_world_agent',
    description=(
        'hello world agent that can roll a die of 8 sides and check prime'
        ' numbers.'
    ),
    instruction="""
      You roll dice and answer questions about the outcome of the dice rolls.
      When you are asked to roll a die, you must call the roll_die tool with the number of sides.
      When checking prime numbers, call the check_prime tool with a list of integers.
    """,
    tools=[
        roll_die,
        check_prime,
    ],
    generate_content_config=types.GenerateContentConfig(
        safety_settings=[
            types.SafetySetting(
                category=types.HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT,
                threshold=types.HarmBlockThreshold.OFF,
            ),
        ]
    ),
)

# 2. Convert to A2A application
# This automatically generates the Agent Card and sets up the HTTP endpoints
a2a_app = to_a2a(root_agent, host="localhost", port=8001)

# To run this application, save it as a Python file (e.g., `my_a2a_agent.py`)
# and execute it using uvicorn:
# uvicorn my_a2a_agent:a2a_app --host localhost --port 8001

The Agent Card (Discovery):

The Agent Card is a standardized JSON file that acts as a "business card" for an agent, allowing other agents to discover its capabilities, security requirements, and endpoints.

{
  "name": "hello_world_agent",
  "description": "hello world agent that can roll a die of 8 sides and check prime numbers. You roll dice and answer questions about the outcome of the dice rolls. When you are asked to roll a die, you must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
  "doc_url": null,
  "url": "http://localhost:8001/",
  "version": "0.0.1",
  "capabilities": {},
  "skills": [
    {
      "id": "hello_world_agent",
      "name": "model",
      "description": "hello world agent that can roll a die of 8 sides and check prime numbers. I roll dice and answer questions about the outcome of the dice rolls. When I am asked to roll a die, I must call the roll_die tool with the number of sides. When checking prime numbers, call the check_prime tool with a list of integers.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm"
      ]
    },
    {
      "id": "hello_world_agent-roll_die",
      "name": "roll_die",
      "description": "Roll a die and return the rolled result.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm",
        "tools"
      ]
    },
    {
      "id": "hello_world_agent-check_prime",
      "name": "check_prime",
      "description": "Check if a given list of numbers are prime.",
      "examples": null,
      "input_modes": null,
      "output_modes": null,
      "tags": [
        "llm",
        "tools"
      ]
    }
  ],
  "default_input_modes": [
    "text/plain"
  ],
  "default_output_modes": [
    "text/plain"
  ],
  "supports_authenticated_extended_card": false,
  "provider": null,
  "security_schemes": null
}


B. Consuming a Remote Agent (RemoteA2aAgent)

To consume this, the parent agent simply points to the URL. The ADK treats the remote agent exactly like a local tool.

This allows a local agent to delegate tasks to a remote agent by reading its Agent Card.

from google.adk.agents import LlmAgent
from google.adk.agents.remote_a2a_agent import AGENT_CARD_WELL_KNOWN_PATH
from google.adk.agents.remote_a2a_agent import RemoteA2aAgent

# 1. Define the remote agent interface
# Points to the agent card at its well-known path on the running A2A server
prime_agent = RemoteA2aAgent(
    name="remote_prime_agent",
    description="Agent that handles checking if numbers are prime.",
    agent_card=f"http://localhost:8001/{AGENT_CARD_WELL_KNOWN_PATH}"
)

# 2. Use the remote agent as a sub-agent
root_agent = LlmAgent(
    name="coordinator",
    model="gemini-2.0-flash", # Explicitly define the model
    instruction="""
      You are a coordinator agent.
      Your primary task is to delegate any requests related to prime number checking to the 'remote_prime_agent'.
      Do not attempt to check prime numbers yourself.
      Ensure to pass the numbers to be checked to the 'remote_prime_agent' correctly.
      Clarify the results from the 'remote_prime_agent' to the user.
      """,
    sub_agents=[prime_agent]
)

# You can then use this root_agent with a Runner, for example:
# from google.adk.runners import Runner
# runner = Runner(agent=root_agent)
# async for event in runner.run_async(user_id="test_user", session_id="test_session", new_message="Is 13 a prime number?"):
#     print(event)


While both protocols connect AI systems, they operate at different levels of abstraction.

When to use which?

  • Use MCP when you need deterministic execution of specific functions (stateless).
  • Use A2A when you need to offload a fuzzy goal that requires reasoning and state management (stateful).
| Feature | MCP (Model Context Protocol) | A2A (Agent2Agent Protocol) |
| --- | --- | --- |
| Primary Domain | Tools & Resources. | Autonomous Agents. |
| Interaction | "Do this specific thing". Stateless execution of functions (e.g., "query database," "fetch file"). | "Achieve this complex goal". Stateful, multi-turn collaboration where the remote agent plans and reasons. |
| Abstraction | Low-level plumbing. Connects LLMs to data sources and APIs (like a USB-C port for AI). | High-level collaboration. Connects intelligent agents to other intelligent agents to delegate responsibility. |
| Standard | Standardizes tool definitions, prompts, and resource reading. | Standardizes agent discovery (Agent Card), task lifecycles, and asynchronous communication. |
| Analogy | Using a specific wrench or diagnostic scanner. | Asking a specialized mechanic to fix a car engine. |

How they work together:
An application might use A2A to orchestrate high-level collaboration between a "Manager Agent" and a "Coder Agent."

The "Coder Agent," in turn, uses MCP internally to connect to GitHub tools and a local file system to execute the work.

4. Production Considerations

Moving protocols from stdio (local process) to HTTP (production network) introduces critical security challenges.

  • The "Confused Deputy" Problem: Protocols decouple execution, but they also expose risks. A malicious user might trick a privileged agent (the deputy) into using an MCP file-system tool to read sensitive configs. Production architectures must enforce Least Privilege by placing MCP servers behind API Gateways that enforce policy checks before the tool is executed.
  • Discovery vs. Latency: Dynamic discovery adds a round-trip latency cost at startup (handshaking). In production, we often cache tool definitions (static binding) for performance, while keeping the execution dynamic.
  • Governance: To prevent "Tool Sprawl" where agents connect to unverified servers, enterprises need a Centralized Registry—an allowlist of approved MCP servers and Agent Cards that acts as the single source of truth for capabilities. (A minimal allowlist sketch follows this list.)
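
A minimal sketch of that registry check (server names and the lookup helper are illustrative, not ADK APIs):

# Centralized allowlist: the single source of truth for approved MCP servers.
APPROVED_MCP_SERVERS = {
    "github-tools": "https://mcp.internal.example.com/github",
    "jira-tools": "https://mcp.internal.example.com/jira",
}

def resolve_mcp_server(server_name: str) -> str:
    # Refuse to connect to anything outside the approved registry.
    if server_name not in APPROVED_MCP_SERVERS:
        raise PermissionError(
            f"MCP server '{server_name}' is not in the approved registry."
        )
    return APPROVED_MCP_SERVERS[server_name]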

5. Ethical Design Note

Protocol-first architectures are the technical foundation for Human Sovereignty and Data Portability.

Standardizing the interface (MCP) helps us prevent vendor lock-in, among many other advantages. A user can swap out a "Google Drive" data source for a "Local Hard Drive" source without breaking the agent, ensuring the user—not the platform—controls where the data lives and how it is accessed.

This abstraction acts as a bulwark against algorithmic lock-in, ensuring that an agent's reasoning capabilities are decoupled from proprietary tool implementations, preserving the user's freedom to migrate their digital ecosystem without losing their intelligent assistants.

Concrete Failure Scenario:

Imagine a small business builds a customer service agent tightly coupled to Salesforce's proprietary API. Over three years, the agent accumulates thousands of lines of custom integration code. When Salesforce raises prices 300%, the business wants to migrate to HubSpot—but their agent is fundamentally Salesforce-shaped. Every tool, every data query, every workflow assumption is hardcoded. Migration means rebuilding the agent from scratch, which the business can't afford. They're trapped. This is algorithmic lock-in—not just vendor lock-in of data, but vendor lock-in of intelligence. Without protocol-first design, the agent becomes a hostage to the platform, and the user loses sovereignty over their own automation.

Key Takeaways

  • Production Principle: Agents should depend on interfaces, not implementations. Protocol-first design (MCP for tools, A2A for peers) inverts the dependency and prevents the N×M integration trap.
  • Implementation: Use McpToolset to connect agents to data sources via the Model Context Protocol. Use RemoteA2aAgent and to_a2a() for agent-to-agent delegation. Cache tool definitions at startup for performance, but keep execution dynamic.
  • Ethical Anchor: This pattern operationalizes Human Sovereignty and Data Portability—users control where their data lives and which tools their agents use, free from vendor lock-in or algorithmic hostage-taking.

We now have agents that reason correctly, remember what matters, and connect to any tool or peer through standard protocols. But there's one final constraint that threatens to unravel everything: the assumption that every interaction completes in a single request-response cycle. Real business workflows don't work that way. Approvals take hours. External APIs time out. Humans need time to think. This is where our fourth pattern becomes essential: teaching agents to pause, persist, and resume across the boundaries of time itself.

Pattern 4: Long-Running Operations & Resumability

This is perhaps the most critical pattern for integrating agents into real-world business logic where human approval is non-negotiable.

1. The Production Problem

Naive agents fall into the "Stateless Trap."

Imagine a Procurement Agent tasked with ordering 1,000 servers.

The workflow is:

  1. Analyze quotes
  2. Propose the best option
  3. Wait for CFO approval
  4. Place the order

Here's a mermaid sequence diagram illustrating the procurement workflow:

(Sequence diagram: analyze quotes → propose best option → wait for CFO approval → place order)

This diagram shows the sequential flow from analyzing quotes through to placing the order, with the critical approval step from the CFO in the middle.

If the CFO takes 2 hours to review the proposal, a standard HTTP request will time out in seconds. When the CFO finally clicks "Approve," the agent has lost its memory. It doesn't know which vendor it selected, the quote ID, or why it made that recommendation. It essentially has to start over.

2. The Architectural Solution

The solution is a Pause, Persist, Resume architecture.

  • Event-Driven Interruption: The agent doesn't just "wait." It emits a specific system event (adk_request_confirmation) and halts execution immediately, releasing compute resources.
  • State Persistence: The agent's full state (conversation history, tool parameters, reasoning scratchpad) is serialized and stored in a database, keyed by an invocation_id.
  • The Anchor (invocation_id): This ID becomes the critical "bookmark." When the human acts, the system rehydrates the agent using this ID, allowing it to resume exactly where it left off—inside the tool call—rather than restarting the conversation.

(Diagram: the pause, persist, resume architecture)

3. Implementation Details: Code Insights

The ADK provides the ToolContext and App primitives to handle this complexity without writing custom state machines.

The Three-State Tool Pattern
Inside your tool definition, you must handle three scenarios:

  1. Automatic approval (low stakes)
  2. Initial request (pause)
  3. Resumption (action)

from google.adk.tools.tool_context import ToolContext

def place_order(num_units: int, tool_context: ToolContext) -> dict:
    # Scenario 1: Small orders auto-approve
    if num_units <= 5:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}

    # Scenario 2: First call - request approval (PAUSE)
    # The tool checks if confirmation exists. If not, it requests it and halts.
    if not tool_context.tool_confirmation:
        tool_context.request_confirmation(
            hint=f"Large order: {num_units} units. Approve?",
            payload={"num_units": num_units}
        )
        return {"status": "pending"}

    # Scenario 3: Resume - check decision (ACTION)
    # The tool runs again, but this time confirmation exists.
    if tool_context.tool_confirmation.confirmed:
        return {"status": "approved", "order_id": f"ORD-{num_units}"}
    else:
        return {"status": "rejected"}

  1. Automatic Approval (Scenario 1): The initial if num_units <= 5: block handles immediate, non-long-running scenarios, which is a common pattern for tools that can quickly resolve simple requests.
  2. Initial Request (Pause - Scenario 2): The if not tool_context.tool_confirmation: block leverages tool_context.request_confirmation() to signal that the tool requires external input to proceed. The return of {"status": "pending"} indicates that the operation is not yet complete.
  3. Resumption (Action - Scenario 3): The final if tool_context.tool_confirmation.confirmed: block demonstrates how the tool re-executes, this time finding tool_context.tool_confirmation present, indicating that the external input has been provided. The tool then acts based on the confirmed status. The Human-in-the-Loop Workflow Samples also highlight how the application constructs a types.FunctionResponse with the updated status and sends it back to the agent to resume its task.

The Application Wrapper
To enable persistence, we wrap the agent in an App with ResumabilityConfig. This tells the ADK to automatically handle state serialization.

from google.adk.apps import App, ResumabilityConfig

app = App(
    root_agent=procurement_agent,
    resumability_config=ResumabilityConfig(is_resumable=True)
)


The Workflow Loop
The runner loop must detect the pause and, crucially, use the same invocation_id to resume.

# 1. Initial Execution
events = []
async for event in runner.run_async(...):
    events.append(event)

# 2. Detect Pause & Get ID
approval_info = check_for_approval(events)

if approval_info:
    # ... Wait for user input (hours/days) ...
    user_decision = get_user_decision() # True/False

    # 3. Resume with INTENT
    # We pass the original invocation_id to rehydrate state
    async for event in runner.run_async(
        invocation_id=approval_info["invocation_id"],
        new_message=create_approval_response(user_decision)
    ):
        # Agent continues execution from inside place_order()
        pass


This workflow shows the mechanism for resuming an agent's execution:

  • Initial Execution: The first runner.run_async() call initiates the agent's interaction, which eventually leads to the place_order tool returning a "pending" status.
  • Detecting Pause & Getting ID: Detect the "pending" state and extract the invocation_id. See the Invocation Context and State Management section of the code wiki for how InvocationContext tracks an agent's state and supports resumable operations. (A sketch of a hypothetical check_for_approval helper follows this list.)
  • Resuming with Intent: The crucial part is calling runner.run_async() again with the same invocation_id. This tells the ADK to rehydrate the session state and resume the execution from where it left off, providing the new message (the approval decision) as input. This behavior is used in the Human-in-the-Loop Workflow Samples, where the runner orchestrates agent execution and handles multi-agent coordination.
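
For completeness, here is a sketch of the hypothetical check_for_approval helper used in the loop above; it is not an ADK API, and the event field names (get_function_calls, invocation_id) should be verified against your ADK version:

def check_for_approval(events) -> dict | None:
    # Scan the emitted events for the adk_request_confirmation call and pull
    # out the invocation_id needed to resume later.
    for event in events:
        for call in (event.get_function_calls() or []):
            if call.name == "adk_request_confirmation":
                return {
                    "invocation_id": event.invocation_id,
                    "approval_call_id": call.id,
                    "payload": call.args,
                }
    return None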

4. Production Considerations

  • Persistence Strategy: InMemorySessionService is insufficient for production resumability because a server restart kills pending approvals. You must use a persistent store like Redis or PostgreSQL to save the serialized agent state.
  • UI Signaling: The adk_request_confirmation event should trigger a real-time notification (via WebSockets) to the user's frontend, rendering an "Approve/Reject" card.
  • Time-To-Live (TTL): Pending approvals shouldn't live forever. Implement a TTL policy (e.g., 24 hours) after which the state is garbage collected and the order is auto-rejected to prevent stale context rehydration. (A minimal expiry sketch follows this list.)
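
A minimal expiry sketch (the pending_approvals records and their field names are illustrative; in practice this would run as a scheduled job against your Redis or PostgreSQL store):

from datetime import datetime, timedelta, timezone

APPROVAL_TTL = timedelta(hours=24)

def expire_stale_approvals(pending_approvals: list[dict]) -> list[dict]:
    # Auto-reject anything older than the TTL so stale context is never rehydrated.
    now = datetime.now(timezone.utc)
    still_pending = []
    for record in pending_approvals:
        if now - record["requested_at"] > APPROVAL_TTL:
            record["status"] = "auto_rejected"
        else:
            still_pending.append(record)
    return still_pending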

5. Ethical Design Note

This pattern is the technical implementation of Meaningful Human Control.

It ensures high-stakes actions (Agency) remain subservient to human authorization (Sovereignty), preventing "rogue actions" where an agent executes irreversible decisions (like spending budget) without explicit oversight.

Concrete Failure Scenario:

Imagine a financial trading agent receives a signal to liquidate a portfolio position. Without resumability, the agent operates in a stateless, atomic transaction: detect signal → execute trade. There's no pause for human review. If the signal is based on a data glitch (a "flash crash"), or if market conditions have changed in the seconds between signal and execution, the agent completes an irreversible $10M trade that wipes out a quarter's earnings. The human operator sees the confirmation after the damage is done. Worse, if the system crashes mid-execution, the agent loses context and might try to execute the same trade twice, compounding the disaster. Without Meaningful Human Control embedded in the architecture, the agent becomes a runaway train.

Key Takeaways

  • Production Principle: High-stakes actions require human-in-the-loop workflows. Design agents that can pause, wait for approval, and resume execution without losing context—spanning hours or days, not just seconds.
  • Implementation: Use ToolContext.request_confirmation() for tools that need approval. Configure ResumabilityConfig in your App to enable state persistence. Use the invocation_id to resume execution from the exact point of interruption. Store state in Redis or PostgreSQL, never in-memory.
  • Ethical Anchor: This pattern operationalizes Meaningful Human Control—we architecturally prevent agents from executing irreversible, high-stakes actions without explicit human authorization, preserving human sovereignty over consequential decisions.

Conclusion

The Google & Kaggle Intensive was a masterclass not just in coding, but in thinking.

Building agents is not just about chaining prompts; it is about designing resilient systems that can handle the messiness of the real world.

  • Evaluation ensures we trust the process, not just the result.
  • Dual-Layer Memory solves the economic and context limits of LLMs.
  • Protocol-First (MCP) prevents integration spaghetti and silos.
  • Resumability allows agents to participate in human-speed workflows safely.

Where to Start: A Prioritization Guide

If you're moving your first agent from prototype to production, consider implementing these patterns in order:

  1. Start with Pattern 1 (Evaluation). Without trajectory validation, you're flying blind. Capture a handful of golden trajectories from your adk web sessions, configure a TrajectoryEvaluator, and establish your evaluation baseline before writing another line of agent code.
  2. Add Pattern 4 (Resumability) early if your agent performs any action that requires human approval or waits on external systems (payment processing, legal review, third-party APIs). The cost of refactoring a stateless agent into a resumable one later is enormous. Build with invocation_id and ToolContext.request_confirmation() from day one.
  3. Implement Pattern 2 (Dual-Layer Memory) when your agent starts handling multi-turn conversations or personalization. If you see users repeating themselves across sessions ("I'm allergic to shellfish" → 3 months later → "I'm allergic to shellfish"), or if your context costs are climbing, it's time for the Workbench/Filing Cabinet split.
  4. Adopt Pattern 3 (Protocol-First Interoperability) when you need to integrate your second data source or agent. The first integration is always bespoke; the second is where you refactor to MCP/A2A or accept technical debt forever. Don't wait until you have ten brittle integrations to wish you'd used protocols.

The Architect's Responsibility

As we move forward, our job as architects is to ensure these systems are not just smart, but reliable, efficient, and ethical.

We are not just building tools—we are defining the interface between human intention and machine action. Every architectural decision we make either preserves or erodes human sovereignty, privacy, and meaningful control.

When you choose to validate trajectories, you're not just improving test coverage—you're building fiduciary responsibility into the system.

When you separate session from memory, you're not just optimizing token costs—you're designing for privacy by default.

When you adopt MCP and A2A, you're not just reducing integration complexity—you're preserving user freedom from algorithmic lock-in.

When you implement resumability, you're not just handling timeouts—you're enforcing meaningful human control over consequential actions.

These patterns are not neutral technical choices. They are ethical choices encoded in architecture.

Let's build responsibly.
