- Book: LLM Observability Pocket Guide: Picking the Right Tracing & Evals Tools for Your Team
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
A user clicks the button. The spinner appears, they wait, and eight seconds later the answer arrives. It works. Nobody is happy.
Eight seconds is a number that comes up often when teams ship their first agent. It is not a benchmark; it is the rough shape of a turn that runs one tool call, gets a result back, and writes a sentence about it. People feel it the way they feel a slow page. It registers as a vibe of "this is sluggish" before it registers as a number. The first instinct is to blame the model and reach for a smaller one. That instinct is often wrong, because the model is rarely the only thing in the eight seconds.
The fix starts with seeing the eight seconds as four phases instead of one. Once you can name the phases, you can budget them. With a budget, you know which one to attack.
The four phases of a tool-use turn
Treat a single agent turn as a pipeline with four stages. The share each phase takes will be different in your system. The four phases are not.
- Decide. Send the conversation history plus the tool list. The model reads it and emits a tool_use block. This is a full non-streaming request: you wait for the complete response before you can act on the tool_use block.
- Execute. Your code receives the tool call, runs it, and gets a result. A database query, an HTTP call, a vector search. Pure backend latency.
- Context. Append the tool result to the running message list, re-serialise, ship the whole thing back to the provider. This is the phase nobody instruments because it feels like nothing. But the prompt is bigger now, the network round trip costs the same, and the provider re-reads what it just emitted.
- Respond. The model reads the tool result and writes the user-facing answer. Streaming or not, this is where the prose comes out.
Of those four, only Execute is fully under your control. Decide and Respond are model time, and Context is a serialisation step that grows with the conversation. Breaking the turn apart tells you which phase to attack first.
Instrumenting each phase with OpenTelemetry
The OpenTelemetry GenAI semantic conventions give you a vocabulary for this. The span names below follow the spirit of those conventions, with one parent span per turn and child spans for each phase. Because the names stay stable, a Grafana panel built on them keeps working when you swap SDKs.
from opentelemetry import trace

tracer = trace.get_tracer("agent")

MODEL = "claude-sonnet-4-5"


def run_turn(client, messages, tools, tool_runner):
    with tracer.start_as_current_span("agent.turn") as turn:  # turn-level totals land on this span
        # Decide: one non-streaming call that should end in a tool_use block.
        with tracer.start_as_current_span("model.input") as s:
            s.set_attribute("gen_ai.request.model", MODEL)
            s.set_attribute("gen_ai.message.count", len(messages))
            s.set_attribute("gen_ai.message.bytes", len(str(messages)))
            decision = client.messages.create(
                model=MODEL,
                messages=messages,
                tools=tools,
                max_tokens=1024,
            )
            s.set_attribute(
                "gen_ai.usage.input_tokens", decision.usage.input_tokens
            )
            s.set_attribute(
                "gen_ai.usage.output_tokens", decision.usage.output_tokens
            )

        tool_blocks = [b for b in decision.content if b.type == "tool_use"]
        if not tool_blocks:
            return decision

        # Execute: run each requested tool and collect its result.
        tool_results = []
        for block in tool_blocks:
            with tracer.start_as_current_span("tool.execute") as ts:
                ts.set_attribute("tool.name", block.name)
                result = tool_runner(block.name, block.input)
                ts.set_attribute("tool.result.bytes", len(str(result)))
            tool_results.append(
                {
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": result,
                }
            )

        # Context: the message list grows before it goes back to the provider.
        messages = messages + [
            {"role": "assistant", "content": decision.content},
            {"role": "user", "content": tool_results},
        ]

        # Respond: the model reads the tool results and writes the answer.
        with tracer.start_as_current_span("model.output") as s:
            s.set_attribute("gen_ai.message.bytes", len(str(messages)))
            final = client.messages.create(
                model=MODEL,
                messages=messages,
                tools=tools,
                max_tokens=1024,
            )
            s.set_attribute(
                "gen_ai.usage.input_tokens", final.usage.input_tokens
            )
            s.set_attribute(
                "gen_ai.usage.output_tokens", final.usage.output_tokens
            )

        return final
That is most of the wrapper. The child span durations are your phase budget. Add a few lines around it to record turn-level totals (start time, end time, span count, total tokens), and you have a per-turn record you can pivot on.
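A minimal sketch of those totals, stamped on the agent.turn span (bound as turn in the wrapper) just before the final return. The span's own start and end timestamps already cover duration, so only the counts need explicit attributes; the gen_ai.* names follow the conventions, while agent.turn.tool_calls is an invented name used here for illustration:

    # inside run_turn, just before `return final`
    turn.set_attribute("gen_ai.request.model", MODEL)
    turn.set_attribute(
        "gen_ai.usage.input_tokens",
        decision.usage.input_tokens + final.usage.input_tokens,
    )
    turn.set_attribute(
        "gen_ai.usage.output_tokens",
        decision.usage.output_tokens + final.usage.output_tokens,
    )
    turn.set_attribute("agent.turn.tool_calls", len(tool_blocks))  # invented name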
The phase you cannot see in code is Context. It hides inside model.output because the round trip carries the bigger message list. You can isolate it by recording the prompt size on each model span: if gen_ai.message.bytes doubles between turn 1 and turn 5, the Context cost is showing up as input-token growth on the second model call. Once you have that attribute, the Context phase becomes a query you can run.
Attacking each phase
The order below is the budget many teams converge on after a few rounds of trace-reading. Walk it top to bottom.
Decide and Respond: prompt caching first
Both model phases are dominated by how many tokens the model has to re-read on every call. For agents, that is the system prompt plus the tool list plus the conversation history, and most of it is identical from turn to turn. Anthropic's prompt caching docs describe how to mark a cache breakpoint after the static parts of the request. A cache hit costs a fraction of a fresh read on both tokens and time-to-first-token.
The structural rule: put the volatile parts of the request after the cache breakpoint, not before. Tool definitions and the system prompt are static for a release. The conversation history grows, but its prefix is stable. If your prompt looks like "system, tools, conversation, fresh user message", the breakpoint goes between conversation and the fresh message. Everything before it is reusable. If it looks like "system, fresh user message, tools, conversation", the cache cannot do anything for you.
Verify the win in the trace, not on a benchmark. The cache_read_input_tokens field on the response tells you how many tokens were served warm. If that number is zero, your breakpoint is in the wrong place.
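A sketch of the breakpoint with the Anthropic SDK, assuming a SYSTEM_PROMPT constant that is static for a release; treat the exact places cache_control may appear as something to confirm against the prompt caching docs rather than against this sketch:

    decision = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        system=[
            {
                "type": "text",
                "text": SYSTEM_PROMPT,  # static for a release
                # Breakpoint: the prefix up to here (tools + system) is
                # eligible to be served from cache on the next call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        tools=tools,        # static for a release, covered by the breakpoint above
        messages=messages,  # conversation history first, fresh user message last
    )

    # Verify in the trace, not on a benchmark: zero here means the
    # breakpoint is in the wrong place.
    warm = decision.usage.cache_read_input_tokens

To cover the conversation prefix as well, the caching docs describe adding a further breakpoint on the last stable message in the history.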
Execute: parallel tool calls when the model emits multiple
Modern agent SDKs let the model emit more than one tool_use block in a single decision. If your loop runs them serially, you pay the longest tool's latency, then the second-longest, then the third. Run them concurrently and you pay the longest one only.
import asyncio


async def run_tools_parallel(tool_blocks, tool_runner):
    # Fan the independent tool calls out concurrently; the turn pays for
    # the slowest one instead of the sum of all of them.
    async def one(block):
        with tracer.start_as_current_span("tool.execute") as ts:
            ts.set_attribute("tool.name", block.name)
            result = await tool_runner(block.name, block.input)
            ts.set_attribute("tool.result.bytes", len(str(result)))
            return {
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            }

    return await asyncio.gather(*(one(b) for b in tool_blocks))
Two guardrails worth naming. Parallelism only helps when the tools are independent: if tool B reads a row that tool A writes, keep them serial. And anything that writes to an external system needs idempotency before you fan it out concurrently with retries.
Context: shrink the tool results
A tool result is part of every subsequent turn's input. A 40-kilobyte JSON dump from a tool call gets re-sent every time the conversation continues. The bigger your tool results, the more Context you are paying for, on every turn after the first.
The lever here is: do not return raw payloads to the model. Return the slice the model needs to make the next decision. A search_documents tool that returns ten results with their full bodies should return ten titles plus a server-side handle, and the model should call a fetch_document(handle) tool when it actually wants the body. The total token cost is lower and the Context phase shrinks. The model also decides more accurately on a clean list of titles than on ten fully expanded bodies.
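A sketch of that split, with invented names (search_index, DOC_STORE) standing in for whatever your search backend and server-side storage actually are:

    # Invented helpers: search_index is your real search backend,
    # DOC_STORE is any server-side place to park full bodies.
    DOC_STORE: dict[str, str] = {}

    def search_documents(query: str) -> list[dict]:
        hits = search_index(query, limit=10)
        for hit in hits:
            DOC_STORE[hit.id] = hit.body  # body stays server-side
        return [{"handle": h.id, "title": h.title} for h in hits]

    def fetch_document(handle: str) -> str:
        return DOC_STORE[handle]  # full body, only when the model asks for it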
Respond: stream the final answer
Streaming does not change total Respond time. It changes when the user sees the first character, and that is what they grade the product on, not the wall-clock total. Say a Respond phase takes 3.5 seconds: unstreamed, that is 3.5 seconds of blank screen. Streamed, the first sentence appears as soon as the model has written it, and the user starts reading.
Most agent loops disable streaming for the model.input phase because you need the full tool_use block before you can do anything with it. That is correct. They then forget to re-enable streaming on the model.output phase, where it costs you nothing and makes the product feel different. Audit your loop: if stream=True is missing on the final call, you are blocking the user from seeing the first sentence for the entire generation window.
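A sketch of the final call with streaming on, using the SDK's streaming helper; emit_to_user is an invented stand-in for however you push text to the client, and the helper API is worth checking against your SDK version:

    with tracer.start_as_current_span("model.output") as s:
        s.set_attribute("gen_ai.message.bytes", len(str(messages)))
        with client.messages.stream(
            model=MODEL,
            messages=messages,
            tools=tools,
            max_tokens=1024,
        ) as stream:
            for text in stream.text_stream:
                emit_to_user(text)  # invented: push each chunk to the UI as it arrives
            final = stream.get_final_message()
        s.set_attribute("gen_ai.usage.output_tokens", final.usage.output_tokens)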
What the dashboard looks like afterwards
Five panels per agent service:
- p50 / p95 of agent.turn duration.
- Stacked bar of phase durations (model.input, tool.execute, model.output).
- Cache hit rate from cache_read_input_tokens over total input tokens.
- p95 tool.result.bytes per tool name.
- Time-to-first-token on streaming responses.
The first three are what you watch every day. The last two are what you reach for when panel one starts climbing. After the four fixes, you want to see the turn get faster and the per-call model cost get cheaper at the same time. If the turn got faster but model cost went up, you bought speed by spending tokens. Take that to finance before you ship it.
Next move
Open a trace from yesterday's worst turn. Mark the four phase boundaries on it. Whatever phase is widest is the one to fix first.
If this was useful
The LLM Observability Pocket Guide walks through the GenAI semantic conventions the wrapper above leans on, the cache-token attributes you need on every model span, and the dashboards that turn the four-phase budget into something you can alert on. The chapter on agent traces pairs directly with the wrapper in this post.