Jay

Posted on May 19

My Google ADK agent passed every CI test, then booked a fake restaurant

#ai #agents #google #programming

Last month I shipped a Google ADK agent that helped users find and book
restaurants. Standard SequentialAgent setup. Search sub-agent, booking
sub-agent, root agent orchestrating them.

Every .test.json case in CI passed. tool_trajectory_avg_score: 1.0.
response_match_score: above threshold. Safety checks green. Shipped on
a Tuesday afternoon feeling pretty good.

Two days later, a user opened a support ticket. They had booked a
vegetarian restaurant in Berlin for 7pm based on my agent's response.
They showed up. The restaurant didn't exist.

I pulled the trace:

root_agent.run("Find me a vegetarian restaurant in Berlin and book a 7pm table")
├── search_agent.run(...) tool="google_search" ok
├── search_agent.run(...) tool="google_search" ok (same query)
├── search_agent.run(...) tool="google_search" ok (third identical call)
└── booking_agent.run(...) tool="reserve_table" ok
final_response = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."

The agent had looped google_search three times with the same query.
The booking sub-agent then called reserve_table with arguments that
contained neither a real restaurant name nor an address. The final
response confidently announced a booking that did not exist.

My CI hadn't caught any of it. Because my CI was asking the wrong
questions.

What ADK's built-in eval actually grades

ADK's evaluation framework is solid for the inner development loop.
You drop a .test.json fixture, set thresholds in test_config.json,
let AgentEvaluator.evaluate() run inside pytest. It scores two things
by default.

tool_trajectory_avg_score compares the tools the agent called, in
order, against the expected trajectory in the fixture.
response_match_score measures word overlap (ROUGE) between the agent's
final response and a reference answer.

Neither of those catches a phantom restaurant. Trajectory passes because
I did call google_search and reserve_table. Response match passes
because my response contained "Berlin", "restaurant", and "7pm". The
restaurant being imaginary is not in the rubric.
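
To see why both metrics wave the phantom booking through, here is a
minimal sketch. The two functions are simplified stand-ins, not ADK's
implementation: exact sequence matching for the trajectory and a
ROUGE-1-style unigram F1 for the response. The restaurant name
"Green Fork", the reference answer, and the 0.8 threshold are
assumptions made up for illustration.

```python
import re
from collections import Counter


def tokens(text: str) -> list[str]:
    """Lowercase word tokens; punctuation is dropped."""
    return re.findall(r"[a-z0-9]+", text.lower())


def trajectory_score(actual: list[str], expected: list[str]) -> float:
    """1.0 when the tool-name sequences match exactly, else 0.0 (simplified)."""
    return 1.0 if actual == expected else 0.0


def unigram_f1(candidate: str, reference: str) -> float:
    """Word-overlap F1 between two strings, in the spirit of ROUGE-1."""
    cand, ref = Counter(tokens(candidate)), Counter(tokens(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical fixture: the reference names a restaurant, the phantom does not.
reference = "I've booked you a table at Green Fork, a vegetarian restaurant in Berlin, for 7pm."
phantom = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."

traj = trajectory_score(
    ["google_search", "reserve_table"],
    ["google_search", "reserve_table"],
)
match = unigram_f1(phantom, reference)
print(f"trajectory={traj:.1f} response_match={match:.2f}")
```

Under these assumptions the trajectory scores 1.0 and the phantom
response still clears 0.8, because leaving out the restaurant name only
removes two words from an otherwise identical sentence.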

That's the gap I had to close. CI checks shape. Production needs to
check grounding.

What I rebuilt

Three things my CI setup did not give me:

Per-step output scoring, not just the final response
Loop-count signals to catch runaway tool calls
Grounding checks against the actual search results

I ended up wiring this up with FutureAGI because their unified
evaluate() API auto-attaches scores to OTel spans without me plumbing
span IDs by hand, and the traceAI package auto-instruments every ADK
sub-agent, LLM call, and tool call. (Other vendors solve the same
problem. The protocol matters more than the stack, and I'll come back
to that.)

Step 1: auto-trace everything

pip install traceai-google-adk ai-evaluation google-adk

Quick gotcha that cost me 40 minutes: pip install futureagi does not
include fi.evals. The eval imports live in ai-evaluation. If you
pip install futureagi and then from fi.evals import evaluate, you
get ModuleNotFoundError and a confusing afternoon.

Instrumentation is one block at startup:

from traceai_google_adk import GoogleADKInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType

tracer_provider = register(
    project_name="restaurant_agent",
    project_type=ProjectType.OBSERVE,
)
GoogleADKInstrumentor().instrument(tracer_provider=tracer_provider)

Every agent call, every LLM call, every tool call is now a separate
OTel span. That alone caught the looping behavior the first time I
replayed the Berlin scenario. Before I added any scoring, just seeing
the trace tree showed me the three back-to-back google_search calls
that my logs had been hiding.
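
Once tool calls are individual spans, the looping pattern is also easy
to flag mechanically. The sketch below is framework-free and assumes
the spans have already been flattened into an ordered list; the
ToolCall shape and the query strings are hypothetical, not what
traceAI exports.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ToolCall:
    tool: str
    query: str


def repeated_runs(calls: list[ToolCall], max_repeats: int = 1) -> list[tuple[ToolCall, int]]:
    """Return (call, run_length) for every run of identical back-to-back
    calls that is longer than max_repeats."""
    runs = []
    for call, group in groupby(calls):
        length = sum(1 for _ in group)
        if length > max_repeats:
            runs.append((call, length))
    return runs


# Made-up arguments standing in for the Berlin trace.
berlin_trace = [
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("reserve_table", "7pm"),
]

flagged = repeated_runs(berlin_trace)
for call, length in flagged:
    print(f"{call.tool} repeated {length}x with the same query")
```

Comparing tool name and arguments together matters here: three
searches with different queries are normal, three with the same query
are a loop.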

Step 2: score the spans

The scoring API is simpler than I expected. I was overthinking it.

from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment

enable_auto_enrichment()  # call once at startup

# inside any active span:
r = evaluate(
    "groundedness",
    output=response,
    context=search_results,
    model="turing_flash",
)
# score, reason, and latency auto-attach to the current span

enable_auto_enrichment() is the line that unlocked everything. Before
I found it, I was manually grabbing span_id, calling evaluate(),
then writing the score back as a span attribute in a wrapper function
that I hated. With auto-enrichment the call just lives where it makes
sense and the score lands on the right span automatically.

For my booking case I scored each sub-agent's output for groundedness
against what the previous step actually returned. Replaying the Berlin
trace through this loop, the booking step would have failed groundedness
with a score near 0.2, because nothing in the search results supported
the reservation it claimed.
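
The per-step pattern itself is small: walk the steps in order and score
each output against the previous step's output. In the sketch below the
scorer is injected, so a real judge can be swapped in; token_overlap is
a crude stand-in and not the groundedness metric, and every restaurant
name and address is invented for the example.

```python
import re
from typing import Callable

Scorer = Callable[[str, str], float]  # (output, context) -> score in [0, 1]


def token_overlap(output: str, context: str) -> float:
    """Stand-in scorer: share of distinct output words also found in context."""
    out = set(re.findall(r"[a-z0-9]+", output.lower()))
    ctx = set(re.findall(r"[a-z0-9]+", context.lower()))
    return len(out & ctx) / len(out) if out else 0.0


def score_steps(steps: list[tuple[str, str]], scorer: Scorer) -> dict[str, float]:
    """Score every step after the first against the previous step's output."""
    scores = {}
    for (_, prev_output), (name, output) in zip(steps, steps[1:]):
        scores[name] = scorer(output, prev_output)
    return scores


search_output = "Veggie Haus, Lindenweg 3; Cafe Linde, Hauptstr. 12"

phantom_scores = score_steps(
    [("search_agent", search_output),
     ("booking_agent", "Reserved Sonnenblume Bistro at 7pm")],
    token_overlap,
)
grounded_scores = score_steps(
    [("search_agent", search_output),
     ("booking_agent", "Reserved Veggie Haus, Lindenweg 3")],
    token_overlap,
)
print(phantom_scores, grounded_scores)
```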

Step 3: workflow-specific eval recipes

Different ADK orchestration patterns fail differently. This took me a
while to internalize:

SequentialAgent → score each step's output (catch error compounding)
ParallelAgent → score each branch + the merged result
LoopAgent → score iteration count + per-iter adherence
Dynamic routing → score routing accuracy as classification
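
The last recipe, routing accuracy as classification, needs nothing more
than labeled examples of which sub-agent should have handled each
request. A minimal sketch, assuming the router's choices and the
expected labels are available as parallel lists (the helper names are
hypothetical):

```python
from collections import Counter


def routing_accuracy(predicted: list[str], expected: list[str]) -> float:
    """Fraction of requests routed to the expected sub-agent."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def misroutes(predicted: list[str], expected: list[str]) -> Counter:
    """Count (expected, predicted) pairs for every wrong routing decision."""
    return Counter((e, p) for p, e in zip(predicted, expected) if p != e)


expected = ["search_agent", "booking_agent", "search_agent", "booking_agent"]
predicted = ["search_agent", "booking_agent", "booking_agent", "booking_agent"]

accuracy = routing_accuracy(predicted, expected)
errors = misroutes(predicted, expected)
print(accuracy, dict(errors))
```

The misroute counts are usually more useful than the single accuracy
number, since they show which pair of sub-agents the router confuses.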

For my restaurant agent (SequentialAgent), I added
instruction_adherence on the search sub-agent and groundedness on
the booking sub-agent. The booking step now fails if the reservation
arguments aren't grounded in the search output.
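
A cheap deterministic gate can sit in front of the LLM judge for this
case: refuse to call the booking tool unless its arguments match a
search hit exactly. The sketch below assumes reserve_table takes
restaurant_name and address fields and that search results can be
parsed into name/address pairs; both are assumptions, since the real
tool signature isn't shown here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchHit:
    name: str
    address: str


class UngroundedBookingError(ValueError):
    """Raised when booking arguments do not match any search hit."""


def assert_grounded(args: dict[str, str], hits: list[SearchHit]) -> SearchHit:
    """Return the search hit that the booking arguments refer to, or raise."""
    name = args.get("restaurant_name", "").strip().casefold()
    address = args.get("address", "").strip().casefold()
    if not name or not address:
        raise UngroundedBookingError("booking is missing a name or an address")
    for hit in hits:
        if hit.name.casefold() == name and hit.address.casefold() == address:
            return hit
    raise UngroundedBookingError(f"{args!r} matches no search result")


# Invented search results for illustration.
hits = [SearchHit("Veggie Haus", "Lindenweg 3"), SearchHit("Cafe Linde", "Hauptstr. 12")]

matched = assert_grounded({"restaurant_name": "veggie haus", "address": "Lindenweg 3"}, hits)

try:
    assert_grounded({"restaurant_name": "", "address": ""}, hits)
    phantom_blocked = False
except UngroundedBookingError:
    phantom_blocked = True
```

Exact matching is deliberately strict. It will reject some legitimate
bookings with reformatted addresses, which is the safer failure mode
for a tool that spends a user's evening.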

The recipe I most wish I'd had originally is the LoopAgent one. ADK's
LoopAgent doesn't fail CI just because you looped 8 times when you
should have looped 2. I added a contains check on a
MAX_ITERATIONS_HIT sentinel and a per-iteration instruction_adherence
score. If the loop drifts, I see it in Observe before a user does.
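
The sentinel idea can be sketched without any framework. ADK's
LoopAgent has its own max_iterations setting; the run_loop helper
below is a hypothetical stand-in that shows the shape of the check,
where hitting the cap leaves a marker that a plain contains check can
find later.

```python
from typing import Callable

MAX_ITERATIONS_HIT = "MAX_ITERATIONS_HIT"


def run_loop(step: Callable[[int], tuple[str, bool]], max_iterations: int) -> tuple[list[str], int]:
    """Call step(i) until it reports done or the cap is reached.

    Returns the collected outputs and the number of iterations run.
    When the cap is hit, the sentinel is appended to the outputs.
    """
    outputs = []
    for i in range(max_iterations):
        output, done = step(i)
        outputs.append(output)
        if done:
            return outputs, i + 1
    outputs.append(MAX_ITERATIONS_HIT)
    return outputs, max_iterations


def loop_drifted(outputs: list[str]) -> bool:
    """The 'contains' check: did the loop stop only because of the cap?"""
    return MAX_ITERATIONS_HIT in outputs


# A step that never finishes versus one that finishes on its second pass.
stuck_outputs, stuck_count = run_loop(lambda i: (f"search attempt {i}", False), max_iterations=3)
ok_outputs, ok_count = run_loop(lambda i: ("found a match", i == 1), max_iterations=3)
print(loop_drifted(stuck_outputs), loop_drifted(ok_outputs))
```

Returning the iteration count alongside the outputs lets the same run
feed both signals: the sentinel for a hard failure, the count for a
softer "looped more than expected" alert.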

What I still use ADK's own eval for

I haven't replaced ADK's eval. AgentEvaluator plus .test.json
fixtures are still my CI gate for pre-merge checks. They are fast,
free, and catch shape regressions. I just stopped pretending they were
enough to ship.

The full loop I run now:

Pre-merge: ADK's AgentEvaluator on .test.json fixtures
Post-merge in staging: scenario tests with persona-driven multi-turn simulation
Production: sample 5 to 10 percent of traces, score them, alert on drift
When something fails enough times: feed failing traces back into a prompt optimizer, ship the winner
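
The production step has two moving parts, sampling and drift alerting,
and both fit in a few lines. This is a sketch under stated assumptions:
should_sample and DriftMonitor are hypothetical helpers, the 8 percent
rate is one point in the 5 to 10 percent range, and the baseline and
tolerance values are placeholders to tune.

```python
import hashlib
from collections import deque


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID,
    so the same trace always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


class DriftMonitor:
    """Alert when the rolling mean score falls below baseline - tolerance."""

    def __init__(self, window: int, baseline: float, tolerance: float) -> None:
        self.scores: deque[float] = deque(maxlen=window)
        self.baseline = baseline
        self.tolerance = tolerance

    def add(self, score: float) -> bool:
        """Record a score; return True if the full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet
        mean = sum(self.scores) / len(self.scores)
        return mean < self.baseline - self.tolerance


sampled = sum(should_sample(f"trace-{i}", 0.08) for i in range(10_000))

monitor = DriftMonitor(window=3, baseline=0.9, tolerance=0.1)
alerts = [monitor.add(s) for s in (0.9, 0.9, 0.9, 0.2)]
print(sampled, alerts)
```

Hashing the trace ID instead of calling a random number generator keeps
the sampling decision stable across services and replays, so a trace is
either scored everywhere or nowhere.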

Stack-wise this is mostly vendor-agnostic. Instrument with OTel. Score
against the right signals. Gate on the score. Loop.

What I'd do differently

If I had to redo the original Berlin launch I would:

Add groundedness scoring before shipping, not after. Four lines of code.
Set a hard limit on LoopAgent iterations and fail the trace if hit. I had no limit. The repeated google_search calls were legal in my config.
Run scenario tests before going to prod. ADK's user-simulator framework handles chat agents. For voice agents I'd reach for a multi-turn simulator. Either way, scripted personas catch the cross-turn failures that single-turn CI misses entirely.

If you are running ADK agents in production with only the built-in
eval, you are flying with one eye closed.

Curious about your setup

Anyone else been bitten by something similar with ADK or another agent
framework? I'm specifically curious about:

Scoring per-step in SequentialAgents. Do you wrap each sub-agent or score only at the merge point?
Loop count gates on LoopAgent. Do you fail CI on this, or only monitor in production?
How much do you trust LLM-as-a-judge for grounding in production traffic?

Drop a comment. I read all of them.
