Last month I shipped a Google ADK agent that helped users find and book
restaurants. Standard SequentialAgent setup. Search sub-agent, booking
sub-agent, root agent orchestrating them.
Every .test.json case in CI passed. tool_trajectory_avg_score: 1.0.
response_match_score: above threshold. Safety checks green. Shipped on
a Tuesday afternoon feeling pretty good.
Two days later, a user opened a support ticket. They had booked a
vegetarian restaurant in Berlin for 7pm based on my agent's response.
They showed up. The restaurant didn't exist.
I pulled the trace:
root_agent.run("Find me a vegetarian restaurant in Berlin and book a 7pm table")
├── search_agent.run(...) tool="google_search" ok
├── search_agent.run(...) tool="google_search" ok (same query)
├── search_agent.run(...) tool="google_search" ok (third identical call)
└── booking_agent.run(...) tool="reserve_table" ok
final_response = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."
The agent had looped google_search three times with the same query.
The booking sub-agent then called reserve_table with arguments that
contained neither a real restaurant name nor an address. The final
response confidently announced a booking that did not exist.
My CI hadn't caught any of it. Because my CI was asking the wrong
questions.
What ADK's built-in eval actually grades
ADK's evaluation framework is solid for the inner development loop.
You drop a .test.json fixture, set thresholds in test_config.json,
let AgentEvaluator.evaluate() run inside pytest. It scores two things
by default.
tool_trajectory_avg_score checks whether you called the right tools in
roughly the right order. response_match_score measures word overlap
(ROUGE-1) between the agent's final response and a reference answer.
Neither of those catches a phantom restaurant. Trajectory passes because
I did call google_search and reserve_table. Response match passes
because my response contained "Berlin", "restaurant", and "7pm". The
restaurant being imaginary is not in the rubric.
That's the gap I had to close. CI checks shape. Production needs to
check grounding.
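To see why an overlap metric lets a phantom booking through, here is a toy unigram-overlap score. It is a simplification for illustration, not ADK's actual implementation, and the restaurant name "Green Fork" in the reference answer is made up.

```python
import re


def tokens(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_recall(candidate: str, reference: str) -> float:
    """Fraction of distinct reference words that also appear in the candidate."""
    ref = set(tokens(reference))
    cand = set(tokens(candidate))
    return len(ref & cand) / len(ref) if ref else 0.0


# Hypothetical reference answer; "Green Fork" is an invented name.
reference = (
    "I've booked you a table at Green Fork, "
    "a vegetarian restaurant in Berlin, for 7pm."
)
# The response from the trace, which names no restaurant at all.
phantom = "I've booked you a table at a vegetarian restaurant in Berlin for 7pm."

# 13 of the 15 distinct reference words match; only the name is missing.
score = unigram_recall(phantom, reference)
print(round(score, 2))
```

The only words the phantom response misses are the two that matter, and they barely move the score.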
What I rebuilt
Three things my CI setup did not give me:
- Per-step output scoring, not just the final response
- Loop-count signals to catch runaway tool calls
- Grounding checks against the actual search results
I ended up wiring this up with FutureAGI for two reasons: its unified
evaluate() API attaches scores to OTel spans without me plumbing span
IDs by hand, and its traceAI package auto-instruments every ADK
sub-agent, LLM call, and tool call. (Other vendors solve the same
problem. The protocol matters more than the stack; I'll come back to
that.)
Step 1: auto-trace everything
pip install traceai-google-adk ai-evaluation google-adk
Quick gotcha that cost me 40 minutes: pip install futureagi does not
include fi.evals. The eval imports live in ai-evaluation. If you
pip install futureagi and then from fi.evals import evaluate, you
get ModuleNotFoundError and a confusing afternoon.
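One way to turn that confusing afternoon into a one-line error is a startup check that maps a module name to the pip package that provides it. This is a stdlib-only sketch; require_module is a hypothetical helper, and the module-to-package mapping is taken from the note above.

```python
import importlib.util


def require_module(module_name: str, pip_package: str) -> None:
    """Raise a readable error if `module_name` cannot be imported."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised when a parent package (for example `fi`) is itself missing.
        spec = None
    if spec is None:
        raise ModuleNotFoundError(
            f"{module_name!r} not found. Try: pip install {pip_package}"
        )


# Example call at application startup:
# require_module("fi.evals", "ai-evaluation")
```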
Instrumentation is one block at startup:
from traceai_google_adk import GoogleADKInstrumentor
from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
tracer_provider = register(
    project_name="restaurant_agent",
    project_type=ProjectType.OBSERVE,
)
GoogleADKInstrumentor().instrument(tracer_provider=tracer_provider)
Every agent call, every LLM call, every tool call is now a separate
OTel span. That alone caught the looping behavior the first time I
replayed the Berlin scenario. Before I added any scoring, just seeing
the trace tree showed me the three back-to-back google_search calls
that my logs had been hiding.
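Once tool calls are individual spans, the same pattern can be flagged mechanically instead of by eye. A minimal sketch, assuming the tool calls have already been pulled out of the trace into a list; ToolCall, repeated_runs, and the query string are illustrative, not part of ADK or traceAI.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: str


def repeated_runs(
    calls: list[ToolCall], max_repeats: int = 2
) -> list[tuple[ToolCall, int]]:
    """Return (call, run_length) for every run of identical consecutive
    calls that is longer than `max_repeats`."""
    flagged = []
    for call, run in groupby(calls):  # groups equal neighbours together
        length = sum(1 for _ in run)
        if length > max_repeats:
            flagged.append((call, length))
    return flagged


# Hypothetical reconstruction of the Berlin trace; the real arguments
# are elided in the trace shown earlier.
berlin_trace = [
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("google_search", "vegetarian restaurant Berlin"),
    ToolCall("reserve_table", ""),
]

flagged = repeated_runs(berlin_trace)
for call, length in flagged:
    print(f"{call.tool} repeated {length} times with identical arguments")
```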
Step 2: score the spans
The scoring API is simpler than I expected. I was overthinking it.
from fi.evals import evaluate
from fi.evals.otel import enable_auto_enrichment
enable_auto_enrichment() # call once at startup
# inside any active span:
r = evaluate(
    "groundedness",
    output=response,
    context=search_results,
    model="turing_flash",
)
# score, reason, and latency auto-attach to the current span
enable_auto_enrichment() is the line that unlocked everything. Before
I found it, I was manually grabbing span_id, calling evaluate(),
then writing the score back as a span attribute in a wrapper function
that I hated. With auto-enrichment the call just lives where it makes
sense and the score lands on the right span automatically.
For my booking case I scored each sub-agent's output for groundedness
against what the previous step actually returned. Replaying the Berlin
trace through this loop, the booking step would have failed groundedness
with a score near 0.2 because the final response named a restaurant
that wasn't anywhere in the search results.
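A model-based groundedness score can be paired with a cheap deterministic pre-check on the tool arguments themselves. The sketch below is not the groundedness evaluator; it only tests whether the fields passed to reserve_table appear verbatim in the search output. The field names and sample data are assumptions.

```python
def is_grounded(
    booking_args: dict[str, str],
    search_results: list[str],
    required_fields: tuple[str, ...] = ("restaurant_name", "address"),
) -> tuple[bool, list[str]]:
    """Check that each required field is non-empty and appears verbatim
    (case-insensitive) somewhere in the search results."""
    haystack = " ".join(search_results).lower()
    problems = []
    for field in required_fields:
        value = booking_args.get(field, "").strip()
        if not value:
            problems.append(f"{field}: missing")
        elif value.lower() not in haystack:
            problems.append(f"{field}: not found in search results")
    return (not problems, problems)


# Invented search output and booking arguments, for illustration only.
search_results = ["Green Fork, Example Str. 1, Berlin. Vegetarian."]

phantom = is_grounded({"time": "19:00"}, search_results)
invented = is_grounded(
    {"restaurant_name": "Blue Leaf", "address": "Example Str. 1"}, search_results
)
grounded = is_grounded(
    {"restaurant_name": "Green Fork", "address": "Example Str. 1"}, search_results
)
print(phantom, invented, grounded, sep="\n")
```

A substring match is brittle against paraphrase, so this works best as a fast first gate in front of the scored check, not as a replacement for it.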
Step 3: workflow-specific eval recipes
Different ADK orchestration patterns fail differently. This took me a
while to internalize:
SequentialAgent → score each step's output (catch error compounding)
ParallelAgent → score each branch + the merged result
LoopAgent → score iteration count + per-iter adherence
Dynamic routing → score routing accuracy as classification
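The last row, routing accuracy as classification, reduces to comparing expected and actual routes over a labeled set. A minimal sketch with hypothetical agent names and labels:

```python
from collections import Counter


def routing_report(
    pairs: list[tuple[str, str]],
) -> tuple[float, Counter]:
    """Given (expected_route, actual_route) pairs, return the accuracy
    and a count of each distinct misroute."""
    if not pairs:
        return 0.0, Counter()
    misroutes = Counter((exp, act) for exp, act in pairs if exp != act)
    accuracy = 1 - sum(misroutes.values()) / len(pairs)
    return accuracy, misroutes


# Hypothetical labeled routing decisions.
pairs = [
    ("search_agent", "search_agent"),
    ("booking_agent", "booking_agent"),
    ("booking_agent", "search_agent"),  # misroute
    ("search_agent", "search_agent"),
]
accuracy, misroutes = routing_report(pairs)
print(accuracy, dict(misroutes))
```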
For my restaurant agent (SequentialAgent), I added
instruction_adherence on the planning sub-agent and groundedness on
the booking sub-agent. The booking step now fails if the reservation
arguments aren't grounded in the search output.
The recipe I most wish I'd had originally is the LoopAgent one. ADK's
LoopAgent doesn't fail CI just because you looped 8 times when you
should have looped 2. I added a contains check on a
MAX_ITERATIONS_HIT sentinel and a per-iteration instruction_adherence
score. If the loop drifts, I see it in Observe before a user does.
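The sentinel idea can be sketched independently of ADK: cap the loop, and when the cap is hit, prefix the output with a marker that a plain contains check can find. run_bounded and the step functions below are hypothetical; in ADK itself the cap would be set on the LoopAgent rather than in a wrapper like this.

```python
from typing import Callable

MAX_ITERATIONS_HIT = "MAX_ITERATIONS_HIT"


def run_bounded(
    step: Callable[[int], tuple[bool, str]], max_iterations: int
) -> tuple[str, int]:
    """Call step(i) until it reports done or the cap is reached.

    Returns (final_output, iterations_used). If the cap is reached,
    the output is prefixed with the MAX_ITERATIONS_HIT sentinel.
    """
    output = ""
    for i in range(1, max_iterations + 1):
        done, output = step(i)
        if done:
            return output, i
    return f"{MAX_ITERATIONS_HIT}: {output}", max_iterations


def never_done(i: int) -> tuple[bool, str]:
    return False, f"draft {i}"


def done_on_second(i: int) -> tuple[bool, str]:
    return i == 2, f"draft {i}"


runaway_output, runaway_iters = run_bounded(never_done, max_iterations=3)
ok_output, ok_iters = run_bounded(done_on_second, max_iterations=3)

# The "contains" check from the text:
print(MAX_ITERATIONS_HIT in runaway_output, MAX_ITERATIONS_HIT in ok_output)
```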
What I still use ADK's own eval for
I haven't replaced ADK's eval. AgentEvaluator plus .test.json
fixtures are still my CI gate for pre-merge checks. They are fast,
free, and catch shape regressions. I just stopped pretending they were
enough to ship.
The full loop I run now:
- Pre-merge: ADK's AgentEvaluator on .test.json fixtures
- Post-merge in staging: scenario tests with persona-driven multi-turn simulation
- Production: sample 5 to 10 percent of traces, score them, alert on drift
- When something fails enough times: feed failing traces back into a prompt optimizer, ship the winner
Stack-wise this is mostly vendor-agnostic. Instrument with OTel. Score
against the right signals. Gate on the score. Loop.
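The production step (sample, score, alert on drift) does not need vendor tooling to prototype. A stdlib-only sketch, assuming trace IDs are strings and scores are floats in [0, 1]; the function names, baseline, and tolerance are illustrative choices, not values from any library.

```python
import hashlib
from statistics import mean


def should_sample(trace_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the ID,
    so the same trace always gets the same decision."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < rate


def drifted(
    recent_scores: list[float], baseline: float, tolerance: float = 0.1
) -> bool:
    """True when the mean of recent scores falls more than `tolerance`
    below the baseline. An empty window never alerts."""
    return bool(recent_scores) and mean(recent_scores) < baseline - tolerance


# Sample about 10 percent of 10,000 synthetic trace IDs.
sampled = [tid for tid in (f"trace-{n}" for n in range(10_000))
           if should_sample(tid, 0.10)]
print(len(sampled))
print(drifted([0.5, 0.6], baseline=0.9))
```

Hashing the trace ID instead of calling a random number generator keeps the decision stable across services, so every span of a sampled trace is scored.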
What I'd do differently
If I had to redo the original Berlin launch I would:
- Add groundedness scoring before shipping, not after. Four lines of code.
- Set a hard limit on LoopAgent iterations and fail the trace if hit. I had no limit. The infinite loop on google_search was legal in my config.
- Run scenario tests before going to prod. ADK's user-simulator framework handles chat agents. For voice agents I'd reach for a multi-turn simulator. Either way, scripted personas catch the cross-turn failures that single-turn CI misses entirely.
If you are running ADK agents in production with only the built-in
eval, you are flying with one eye closed.
Curious about your setup
Anyone else been bitten by something similar with ADK or another agent
framework? I'm specifically curious about:
- Scoring per-step in SequentialAgents. Do you wrap each sub-agent or score only at the merge point?
- Loop count gates on LoopAgent. Do you fail CI on this, or only monitor in production?
- How much do you trust LLM-as-a-judge for grounding in production traffic?
Drop a comment, I read all of them.