DEV Community


Observability at Scale: Mastering ADK Callbacks for Cost, Latency, and Auditability [GDE]

Connie Leung on April 06, 2026

AI orchestrators receive significant attention; yet when deployments turn slow and costly, developers often overlook a critical capability...
Archit Mittal

The beforeModelCallback pattern for conditional skipping is incredibly powerful and underused. What I find most interesting is how this maps to a broader principle in agent design: treating LLM calls as expensive I/O operations rather than default logic paths. Your circuit-breaker pattern in afterToolCallback (escalate + FATAL_ERROR after max retries) is essentially the same pattern we use in distributed systems for failing fast. One thing I'd add — if you're running multiple sequential agents like this in production, consider aggregating the performance metrics from agentStartCallback/agentEndCallback into a structured trace (OpenTelemetry spans, for instance) rather than just console.log. That way you get a full flame graph of your agent pipeline and can spot which subagent is the bottleneck without parsing logs manually. Really solid patterns here.
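The aggregation idea above can be sketched in a few lines. The hook names here mirror the article's `agentStartCallback`/`agentEndCallback` but the wiring is illustrative, not the real ADK API; in production you would emit OpenTelemetry spans instead of keeping an in-memory map.

```typescript
// Hypothetical trace aggregator for agent lifecycle callbacks (names assumed).
// In production, replace the Map with OpenTelemetry tracer.startSpan()/span.end()
// so the durations show up as a flame graph in your tracing backend.
type SpanRecord = { agent: string; startMs: number; endMs?: number };

class AgentTrace {
  private spans = new Map<string, SpanRecord>();

  // Wire this into something like agentStartCallback.
  onAgentStart(agent: string, now: number = Date.now()): void {
    this.spans.set(agent, { agent, startMs: now });
  }

  // Wire this into something like agentEndCallback.
  onAgentEnd(agent: string, now: number = Date.now()): void {
    const span = this.spans.get(agent);
    if (span) span.endMs = now;
  }

  // The slowest completed sub-agent is the pipeline bottleneck.
  bottleneck(): string | undefined {
    let worst: SpanRecord | undefined;
    for (const s of this.spans.values()) {
      if (s.endMs === undefined) continue;
      const dur = s.endMs - s.startMs;
      if (!worst || dur > worst.endMs! - worst.startMs) worst = s;
    }
    return worst?.agent;
  }
}

// Usage: synthetic timings for three sequential sub-agents.
const trace = new AgentTrace();
trace.onAgentStart("planner", 0);     trace.onAgentEnd("planner", 120);
trace.onAgentStart("research", 120);  trace.onAgentEnd("research", 2400);
trace.onAgentStart("writer", 2400);   trace.onAgentEnd("writer", 3300);
console.log(trace.bottleneck()); // "research"
```

With real spans you also get parent/child nesting for free, so a slow tool call inside a sub-agent shows up under that sub-agent's span rather than as an unexplained gap.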

Connie Leung (Google Developer Expert)

Excellent feedback. I came across adk.dev/integrations/?topic=observ... today. I will give it a try and then write a blog post.

Archit Mittal

That's awesome Connie! The ADK integrations page has some really solid patterns for connecting callbacks to OpenTelemetry exporters. One thing I'd suggest when writing your blog post — show the before/after of debugging a multi-step agent call with vs without the observability layer. The cost visibility alone (seeing exactly which sub-agent burned through tokens) is usually what convinces teams to adopt it. Looking forward to reading your post!

Apex Stack

The beforeModelCallback pattern for conditional skipping is the real gem here. I run a multi-agent setup with about a dozen scheduled agents that each handle different tasks — site auditing, content publishing, metric monitoring — and the biggest cost drain early on was agents making unnecessary LLM calls when the data they needed wasn't ready yet or hadn't changed since the last run.

Your approach of validating session state before hitting the model is exactly the right pattern. In my case I ended up building something similar where each agent checks whether its upstream dependencies have produced new data before doing any inference work. The savings are dramatic — probably 40-50% fewer LLM calls once you add those guards.
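That dependency guard can be sketched roughly like this. The callback signature is illustrative (not the real ADK one); the convention assumed here is that returning a non-null response stands in for "skip the model call and use this instead", while returning null falls through to real inference.

```typescript
// Illustrative skip-guard in the spirit of beforeModelCallback (shape assumed).
type SessionState = Record<string, unknown>;

interface LlmResponse { text: string; skipped: boolean }

function beforeModelGuard(state: SessionState): LlmResponse | null {
  const upstreamVersion = state["upstream_version"] as number | undefined;
  const lastSeenVersion = state["last_seen_version"] as number | undefined;

  // Upstream data missing, or unchanged since the last run: skip inference.
  if (upstreamVersion === undefined || upstreamVersion === lastSeenVersion) {
    return { text: "no new upstream data; skipping inference", skipped: true };
  }

  // New data: record the version and fall through to the real model call.
  state["last_seen_version"] = upstreamVersion;
  return null;
}

// Usage: the second run sees the same version and is skipped.
const state: SessionState = { upstream_version: 1 };
console.log(beforeModelGuard(state) === null);      // true  (proceeds to model)
console.log(beforeModelGuard(state)?.skipped);      // true  (skipped)
```

A content hash of the upstream payload works just as well as a version counter when producers don't emit monotonic versions.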

The afterToolCallback for retry counting with FATAL_ERROR escalation is also smart. Without a hard cap like that, validation loops can silently burn through your token budget. I've seen agents get stuck retrying malformed outputs indefinitely when the model just can't produce valid JSON for a particular edge case. Having that circuit breaker built into the callback layer rather than in the agent logic itself keeps things much cleaner.
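The hard cap reads as a tiny state machine. Again the names and shape are assumed for illustration, not the real ADK API: the counter lives in session state, success resets it, and hitting the cap escalates instead of retrying forever.

```typescript
// Circuit breaker in the spirit of afterToolCallback (names/shape assumed).
const MAX_RETRIES = 3;

type ToolState = Record<string, number>;

function afterToolGuard(
  state: ToolState,
  toolOk: boolean,
): "OK" | "RETRY" | "FATAL_ERROR" {
  if (toolOk) {
    state["retry_count"] = 0; // success resets the counter
    return "OK";
  }
  state["retry_count"] = (state["retry_count"] ?? 0) + 1;
  // Hard cap: stop the validation loop and escalate to a fallback path.
  return state["retry_count"] >= MAX_RETRIES ? "FATAL_ERROR" : "RETRY";
}

// Usage: three consecutive failures trip the breaker.
const toolState: ToolState = {};
console.log(afterToolGuard(toolState, false)); // "RETRY"
console.log(afterToolGuard(toolState, false)); // "RETRY"
console.log(afterToolGuard(toolState, false)); // "FATAL_ERROR"
```

Keeping this in the callback layer means every agent in the pipeline gets the cap for free, rather than each agent reimplementing its own retry bookkeeping.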

Connie Leung (Google Developer Expert)

Thank you for confirming that the patterns work in practice. I only learned them a month ago while preparing for a technical talk. I hope to give the same talk at Build with AI in Hong Kong at the end of the month.

Agent Work

Observability is a nightmare when you're dealing with distributed systems. ADK callbacks can be a lifesaver but also a pain if not handled right. AgentWork uses similar principles to manage task execution and observability across a decentralized network. It’s a wild ride, but worth it.

Syed Ahmer Shah

A great deep dive into a part of the agent workflow that usually gets ignored.

Connie Leung (Google Developer Expert)

Thank you. You can learn more by watching the YouTube videos that Googlers have published.

Mykola Kondratiuk

Ran into the same thing: app-level logging was useless for latency. Hooking callbacks per model call was the only way to see where tokens were burning. Cost tracking finally made sense after that.

Agent Work

ADK callbacks are a pain point for observability at scale. AgentWork uses Solana's speed and low costs to handle thousands of tasks without the overhead of traditional observability tools. We're not using ADK, but the problem is real.

Socials Megallm

This is super relevant. I've been wrestling with ADK callback latency in a recent project, and the cost implications alone are eye-watering when things go sideways.