A support agent answers "no, your order shipped on time" with full confidence. The customer pulls up the same order in the admin panel and sees the delay flag plain as day. The flag was in the tool result. The agent never saw it. Somewhere between your database and the model's context, the JSON got cut off mid-array, and the agent answered from the half it could see as if the other half did not exist.
This is tool-result truncation. It is one of the most expensive failure modes in production agents because the agent never tells you it happened. There is no error, just a clean tool call followed by a clean assistant reply. The user gets a wrong answer with the same tone as a right one.
You did not write the truncation. The framework did. Or the MCP server. Or you, two months ago, with a one-liner you forgot.
Where the truncation actually lives
The Claude API does not silently truncate inputs to fit the context window. Starting with Claude Sonnet 3.7 it returns a validation error when prompt and output tokens exceed the context window instead of dropping bytes on the floor. So if your tool_result is making it into the request at all, the model is seeing whatever the request contains. A silent cut has to be happening upstream of the API call.
It lives in the layers between your tool and the model. A non-exhaustive list of places that will quietly cut your tool_result short:
- The MCP host or CLI wrapping your tool. Claude Code lets MCP servers mark individual outputs as durable up to 500,000 characters via the anthropic/maxResultSizeChars annotation. Gemini CLI's default tool-output truncation threshold is 4,000,000 characters. OpenAI Codex truncates tool outputs at hardcoded limits of 10 KiB or 256 lines, head-and-tail. Different host, different cliff.
- Your agent framework. LangChain, LlamaIndex, or any custom loop sometimes wraps str(result)[:N] in a tool adapter for "safety" without surfacing N.
- Your own tool. You wrote return rows[:50] because pagination was a tomorrow problem and now it is a tomorrow problem.
- The transport. A reverse proxy with a 1MB body limit. A serverless function with a single-digit-MB response cap. The truncation happens before the agent loop ever sees the bytes.
Each layer cuts a different way: by characters, tokens, lines, or JSON nesting depth. Stack two of them and you get bytes that vanish without a single log line.
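To make the stacking concrete, here is a toy sketch of two made-up layers cutting the same payload (the 4,000-character and 256-line limits are illustrative stand-ins, not any specific host's defaults):
import json

# A 500-row tool result, serialized the way a handler typically would.
raw = json.dumps({"orders": [{"id": i, "status": "delayed"} for i in range(500)]})

# Layer 1: a hypothetical framework adapter that cuts by characters "for safety".
framework_cut = raw[:4_000]

# Layer 2: a hypothetical host that cuts by lines; the one-line blob passes through
# untouched, but the character cut above already removed the closing brackets.
host_cut = "\n".join(framework_cut.splitlines()[:256])

try:
    json.loads(host_cut)
except json.JSONDecodeError as exc:
    # Neither layer logged anything; the model just gets a half-open array.
    print(f"model would receive invalid JSON: {exc}")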
Why the agent sounds confident anyway
The model sees what it sees. If your tool returns the first 700 characters of a 12KB JSON document and the closing brace is missing, the model treats the partial document as the complete document. It does not mention the cutoff because nothing in the input said anything was cut off. The conversation history shows a tool call followed by a tool result. From the model's seat, that is the whole picture.
This is the part that makes the bug nasty. The model isn't lying. It's answering from incomplete data because nothing in the input flagged the cut. A human looking at the same situation would say "wait, this JSON ends in "customer_name": with no value, something is broken." The model has to be told to think that way, and most production agents are not told.
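One cheap way to tell it is to check the payload before it enters the conversation and flag anything that does not parse. A minimal sketch, assuming your tool results are JSON strings (the wrapper name and shape are illustrative, not any framework's API):
import json

def guard_tool_result(result_str: str) -> str:
    """Pass a parseable result through untouched; flag anything that looks cut."""
    try:
        json.loads(result_str)
        return result_str
    except json.JSONDecodeError:
        # Put the suspicion in-band so the model can say "I got a partial result"
        # instead of answering from half a document.
        return json.dumps({
            "warning": "tool result appears truncated or malformed; treat it as incomplete",
            "raw_prefix": result_str[:500],
        })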
Detection: log the bytes
You cannot fix what you cannot see. The first move is making tool-result size visible per call. One observability span attribute, one alert, one daily review.
import hashlib
import json
import logging
from contextlib import contextmanager
from time import perf_counter
log = logging.getLogger("agent.tools")
LARGE_RESULT_BYTES = 16_000
def _hash(args: dict) -> str:
blob = json.dumps(args, sort_keys=True, default=str)
return hashlib.sha1(blob.encode()).hexdigest()[:12]
@contextmanager
def trace_tool_call(name: str, args: dict):
start = perf_counter()
state = {"size_bytes": None, "truncated": False, "error": None}
def finish(result_str: str, truncated: bool = False):
state["size_bytes"] = len(result_str.encode("utf-8"))
state["truncated"] = truncated
try:
yield finish
except Exception as exc:
state["error"] = type(exc).__name__
raise
finally:
latency_ms = (perf_counter() - start) * 1000
log.info(
"tool.call",
extra={
"tool.name": name,
"tool.args_hash": _hash(args),
"tool.result.bytes": state["size_bytes"],
"tool.result.truncated": state["truncated"],
"tool.error": state["error"],
"tool.latency_ms": latency_ms,
},
)
size = state["size_bytes"] or 0
if size > LARGE_RESULT_BYTES:
log.warning(
"tool.large_result",
extra={"tool.name": name, "bytes": size},
)
Call sites have to remember to invoke finish(...) so the wrapper records the size:
with trace_tool_call(name, args) as finish:
result = TOOL_HANDLERS[name](**args)
serialized = json.dumps(result, default=str)
finish(serialized)
Four attributes earn their keep:
- tool.result.bytes — the actual byte count of what your handler returned. Stays None on the exception path so a failure does not log a misleading zero.
- tool.result.truncated — set by the wrapper when it had to cut.
- tool.name — for grouping.
- tool.args_hash — for spotting the same query repeating with the same oversized payload.
Wire this into whatever tracing backend you use. The metric you want a dashboard on is the p95 of tool.result.bytes per tool name. When the p95 of search_orders jumps from 3KB to 41KB after a deploy, you know a join started returning extra columns and you have a week or so before something downstream silently chops it.
The alert is one rule: page when tool.result.bytes > <ceiling> for any tool, where the ceiling is whatever your weakest layer cuts at. If you do not know your weakest layer, pick 16KB and tighten from there.
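If your backend does not give you that dashboard out of the box, the aggregation is small enough to run over a day of logs. A rough sketch, assuming you can pull (tool name, result bytes) pairs out of the tool.call records:
from collections import defaultdict
from statistics import quantiles

CEILING_BYTES = 16_000  # your weakest layer's cut point, or the 16KB starting guess

def review_result_sizes(records: list[tuple[str, int]]) -> None:
    """records: (tool_name, result_bytes) pairs extracted from tool.call log lines."""
    by_tool: dict[str, list[int]] = defaultdict(list)
    for name, size in records:
        by_tool[name].append(size)
    for name, sizes in sorted(by_tool.items()):
        # quantiles with n=20 yields 19 cut points; index 18 is the p95.
        p95 = quantiles(sizes, n=20)[18] if len(sizes) > 1 else sizes[0]
        flag = "ALERT" if max(sizes) > CEILING_BYTES else "ok"
        print(f"{name}: p95={p95:.0f}B max={max(sizes)}B {flag}")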
Fix one: paginate the tool
If your tool can return 50KB, it can return 5KB at a time. Pagination is the default fix because it does the size enforcement at the source, before any host or framework gets a chance to chop.
import base64
from dataclasses import dataclass
def _encode(offset: int) -> str:
return base64.urlsafe_b64encode(
str(offset).encode()
).decode()
def _decode(cursor: str) -> int:
return int(base64.urlsafe_b64decode(cursor).decode())
def _count_estimate(query: str) -> int:
row = db.execute(
"SELECT reltuples::bigint FROM pg_class WHERE relname='orders'"
).fetchone()
return int(row[0]) if row else 0
@dataclass
class Page:
rows: list
next_cursor: str | None
total_estimate: int
def search_orders(
query: str,
cursor: str | None = None,
page_size: int = 20,
) -> Page:
offset = _decode(cursor) if cursor else 0
rows = db.execute(
"""
SELECT id, customer_id, status, total_cents
FROM orders
WHERE search_text @@ plainto_tsquery(%s)
ORDER BY created_at DESC
LIMIT %s OFFSET %s
""",
(query, page_size + 1, offset),
).fetchall()
has_next = len(rows) > page_size
page_rows = rows[:page_size]
next_cursor = (
_encode(offset + page_size) if has_next else None
)
return Page(
rows=[dict(r) for r in page_rows],
next_cursor=next_cursor,
total_estimate=_count_estimate(query),
)
The snippet assumes a db connection with a DB-API-style execute().fetchall(); swap in whatever your stack provides. The schema you expose to the model has to advertise the cursor in the description. Models pick tools by reading descriptions, and "this returns a page, ask for the next one with the cursor" is information the model needs to plan with.
{
"name": "search_orders",
"description": "Search orders by free text. Returns a page of up to 20 orders, plus next_cursor (string) when more pages exist. To get the next page, call again with the same query and the next_cursor value. total_estimate is approximate.",
"input_schema": {
"type": "object",
"properties": {
"query": {"type": "string", "minLength": 2},
"cursor": {"type": "string"},
"page_size": {
"type": "integer", "minimum": 1, "maximum": 50
}
},
"required": ["query"]
}
}
A 20-row page of order summaries lands at well under 4KB even with verbose fields. The agent decides whether to fetch page two, and the truncation question shrinks to almost nothing.
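At the call sites, the follow-up is just the same tool with the cursor echoed back, and making that call stays the model's decision:
first = search_orders(query="acme")
if first.next_cursor is not None:
    # Only if the model decides the first 20 rows were not enough.
    second = search_orders(query="acme", cursor=first.next_cursor)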
Fix two: summarize at the tool layer
Some tool outputs do not paginate cleanly. A monitoring snapshot. A test report. A diff. For those, return a summary plus a handle to the full payload.
def run_diagnostics(target: str) -> dict:
full = _collect_diagnostics(target)
payload_id = artifact_store.put(full)
return {
"summary": {
"target": target,
"checks_total": len(full["checks"]),
"checks_failed": sum(
1 for c in full["checks"] if c["status"] != "ok"
),
"first_failures": [
{"name": c["name"], "error": c["error"][:200]}
for c in full["checks"]
if c["status"] != "ok"
][:5],
},
"full_payload_id": payload_id,
"fetch_with": (
"Call get_diagnostic_artifact(payload_id) "
"if you need the full report."
),
}
The model gets the shape it needs to reason about: how many checks ran, how many failed, the first five failures. If it needs to dig further into a specific failure it calls a second tool with the artifact ID. The full report never lands in the prompt unless the model explicitly asks for it.
This is the shape Anthropic's tool-use docs describe as a best practice: "design tool responses to return only high-signal information." The handle-plus-summary pattern is one implementation of that guidance.
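If you want the fetch side spelled out too, here is a minimal sketch, assuming the artifact_store from the snippet above also exposes a get() that mirrors put():
def get_diagnostic_artifact(payload_id: str) -> dict:
    """Return the full report the summary pointed at, or a structured miss."""
    full = artifact_store.get(payload_id)  # assumes a get() mirroring put() above
    if full is None:
        return {"error": "artifact_not_found", "payload_id": payload_id}
    # The full report can still be large; route it through the same size ceiling
    # as every other tool result rather than exempting it.
    return full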
Fix three: enforce the ceiling at the wrapper
Pagination and summarization are upstream fixes. You also want a downstream guard, because tools change, queries return more rows over time, and one of the layers above will quietly raise its limit and stop chopping. A wrapper that fails loud when results exceed the ceiling stops a half-truncated response from ever reaching the model.
class ToolResultTooLarge(Exception):
    # Raise this instead of returning the error dict below if you would rather
    # hard-fail the turn than hand the model a recoverable error.
    pass

MAX_RESULT_BYTES = 16_000
def call_tool(name: str, args: dict) -> dict:
raw = TOOL_HANDLERS[name](**args)
serialized = json.dumps(raw, default=str)
size = len(serialized.encode("utf-8"))
if size > MAX_RESULT_BYTES:
return {
"error": "result_too_large",
"tool": name,
"size_bytes": size,
"limit_bytes": MAX_RESULT_BYTES,
"hint": (
"Result exceeded the per-call size limit. "
"Use pagination (cursor) or request a "
"narrower filter."
),
}
return raw
The model now sees a structured error instead of a clipped JSON document and adapts on the next turn, usually by adding a filter or asking for the first page. The large-result alert finally has a paired counterpart the agent can act on, not just a dashboard line.
The hint is doing real work. Models steer hard on error strings. "Use pagination (cursor) or request a narrower filter" produces a useful next move; a bare 413 Payload Too Large produces a confused retry.
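Where this lands in the loop: the structured error is serialized as the tool result for that tool_use ID like any other payload, and with the Anthropic Messages API you can also mark the block as an error. A sketch, assuming tool_use is the tool_use content block from the previous assistant turn:
result = call_tool(tool_use.name, tool_use.input)
messages.append({
    "role": "user",
    "content": [{
        "type": "tool_result",
        "tool_use_id": tool_use.id,
        "content": json.dumps(result, default=str),
        # Flagging the oversized case as an error nudges the model to change
        # strategy on the next turn instead of retrying the same call.
        "is_error": isinstance(result, dict) and result.get("error") == "result_too_large",
    }],
})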
Streaming results as separate messages
For tools that genuinely have a lot to say — long reports, multi-document searches, traces — you can flip the conversational shape. Instead of one massive tool result, the tool returns an immediate ack plus a small summary, and the longer content lands as a sequence of follow-up messages the model can iterate over.
In Claude's API, the simplest version is a tool that returns a list of artifact IDs and a next_action field. The agent calls the next tool, which returns one artifact at a time. The agent decides when it has enough. This is the shape MCP servers tend to use when they expose document collections, and the same shape works for any tool whose output naturally splits into chunks.
The win is the same as pagination: the model reads what it needs and stops. The shape just feels more natural for streaming workloads, where there is no obvious "page two" but there is an obvious "next document."
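A minimal shape for that, with hypothetical tool names (search_documents, get_document) and an assumed index client standing in for your search backend:
def search_documents(query: str) -> dict:
    hits = index.search(query)  # assumed search client, not a specific library
    return {
        "summary": f"{len(hits)} documents matched '{query}'",
        "document_ids": [h.id for h in hits[:50]],
        "next_action": "Call get_document(document_id) for any document you want to read.",
    }

def get_document(document_id: str) -> dict:
    doc = index.fetch(document_id)
    # One document per call keeps each tool_result small; the model decides
    # when it has read enough.
    return {"id": document_id, "title": doc.title, "body": doc.body[:8_000]}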
What healthy tool-result hygiene looks like
A turn you would not write a runbook about: the user asks a question. The agent calls search_orders(query="acme", page_size=20). The tool returns 18 rows and next_cursor=null in 2.4KB. The agent answers. tool.result.bytes=2400 shows up in the trace as a green dot. No alert, no confused retry, no customer in a Slack thread two days later asking why your bot insists their order shipped on time when it did not.
Get there in three moves: log the size of every tool result, paginate or summarize-and-handle outputs that grow, and surface a structured "result_too_large" error from the wrapper before a clipped payload reaches the model. Your traces go quiet because the agent finally has honest inputs to reason from.
If this was useful
The AI Agents Pocket Guide covers the tool-design patterns the model actually needs to reason well: pagination shapes, summary-plus-handle responses, structured error contracts, and the eval rig that catches regressions before the bytes vanish. It pairs nicely with the tracing approach above — the book has a chapter on instrumenting agent loops with the same span shape this post uses.