Here's a thing I watched my coding agent do last month and couldn't unsee:
Me: Which commit added the user-agent header?
Agent: runs git log -p ... ingests 56 KB of diff ... "It was commit 1f8808c."
Correct answer. Catastrophic method. To find one commit hash, it poured the entire patch history of the file into its context window — thousands of lines it read once, used one line of, and then carried around for the rest of the session like a backpack full of bricks.
This isn't a memory problem. The model didn't forget anything. The opposite — it remembered too much of the wrong thing. And every follow-up question after that got a little slower, a little dumber, because the useful signal was now buried under a diff nobody needed.
The real problem: tool output is a firehose, context is a teacup
People talk about "context rot" like the model degrades on its own. In my experience it almost never does that unprompted. What actually happens is you let a tool dump raw output straight into the window. A few usual suspects:
- git log / git diff on an active file: tens of KB, easily.
- npm test or pytest -v: hundreds of lines, 3 of which are the failure.
- find / tree on node_modules: you know how this ends.
- Fetching an API doc page: 40 KB of HTML nav and footer to get one code sample.
The agent reads all of it to extract one fact. Then it keeps all of it. The context window is the most expensive, most limited resource in the whole loop, and the default behavior treats it like /dev/stdout.
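To make the signal-to-noise gap concrete, here is a minimal sketch of boiling test-runner output down to the handful of lines that matter. The function name digest_test_output and the FAILED/ERROR markers are assumptions for illustration; adjust the pattern to whatever your runner actually prints.

```python
import re

# Hypothetical failure markers; real runners may print something different.
FAILURE_PATTERN = re.compile(r"\b(FAILED|ERROR)\b")


def digest_test_output(raw: str, max_lines: int = 5) -> str:
    """Reduce raw test output to a one-line count plus the first few failure lines."""
    lines = raw.splitlines()
    failures = [ln.strip() for ln in lines if FAILURE_PATTERN.search(ln)]
    header = f"{len(lines)} lines of output, {len(failures)} matching FAILED/ERROR"
    # Only the header and at most max_lines failure lines are returned.
    return "\n".join([header, *failures[:max_lines]])
```

Fed a couple hundred passing lines and a single failure, this returns two lines: the count and the failing test.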
What I tried first (and why it wasn't enough)
My first instinct was discipline: tell the agent in CLAUDE.md to "be careful with large outputs, use head, grep before you cat." That helps for about ten minutes. The model is eager and helpful; the moment a task needs the full diff it reaches for the full diff, and the instruction loses to the immediate goal. Prompt-level rules don't survive contact with a real task. I needed the firehose pointed somewhere else structurally, not a polite request to drink less.
The fix: run the command in a sandbox, only let the summary in
The pattern that actually worked: the command runs in a subprocess outside the context window. The raw output gets indexed there. Only what I explicitly print — a summary, a count, the three lines that matter — crosses back into the model's context.
Concretely, instead of the agent running git log -p directly, it runs the query in a sandbox and prints a digest:
# runs OUTSIDE the context window; only the print() crosses back in
import subprocess

# The full patch history stays inside this process
log = subprocess.run(
    ["git", "log", "-p", "--", "publish_devto.py"],
    capture_output=True, text=True,
).stdout
hits = [ln for ln in log.splitlines() if "User-Agent" in ln]
print(f"{len(hits)} lines touch User-Agent; introduced in:")
# -S lists only commits that change how many times the string occurs
print(subprocess.run(
    ["git", "log", "-S", "User-Agent", "--format=%h %s", "--", "publish_devto.py"],
    capture_output=True, text=True,
).stdout)
The 56 KB of diff never enters my agent's context. Two lines of answer do. Same correct result, ~1% of the token cost, and — this is the part that compounds — the window stays clean for the next twenty questions.
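The same idea generalizes beyond git. Below is a rough sketch of a reusable wrapper, not the exact tooling described here: run_and_digest is a hypothetical name, and plain substring matching is a simplifying assumption. It runs any command in a subprocess and returns only a size summary plus the first few matching lines.

```python
import subprocess
import sys


def run_and_digest(cmd: list[str], needle: str, max_hits: int = 3) -> str:
    """Run cmd in a subprocess and return a short digest instead of the raw output."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    lines = result.stdout.splitlines()
    hits = [ln for ln in lines if needle in ln]
    summary = (
        f"exit={result.returncode}, {len(result.stdout)} chars, "
        f"{len(lines)} lines, {len(hits)} containing {needle!r}"
    )
    # The raw stdout is discarded when this function returns.
    return "\n".join([summary, *hits[:max_hits]])


if __name__ == "__main__":
    # Demo: a noisy command whose output has exactly one interesting line.
    noisy = "for i in range(1000): print('line', i)\nprint('User-Agent: x')"
    print(run_and_digest([sys.executable, "-c", noisy], "User-Agent"))
```

The demo prints two lines: the summary and the single matching line. The other thousand lines of output never leave the function.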
I formalized this in the project's instructions as a routing rule. Roughly:
## Tool routing
- Output likely > ~50 lines? → run it in the sandbox, print a summary.
- Need the raw bytes to EDIT a file? → read it directly (the edit needs it).
- Just analyzing/exploring? → sandbox it, summary only.
- Web docs? → fetch + index out-of-context, then search the index.
The mental model is a hierarchy: gather in the sandbox, search the index, only surface conclusions. The window holds reasoning, not raw data.
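If you would rather have the routing rule as code than as prose instructions, it might look something like the sketch below. Everything in it is an assumption for illustration: the Route enum, the purpose labels ("edit", "analyze", "web_docs"), and the choice to let an edit override the size threshold.

```python
from enum import Enum


class Route(Enum):
    DIRECT = "read directly into context"
    SANDBOX = "run in sandbox, surface a summary"
    INDEX = "fetch and index out of context, then search"


LINE_THRESHOLD = 50  # the "~50 lines" from the rule; tune to taste


def choose_route(purpose: str, expected_lines: int) -> Route:
    """Pick where a tool call's output should go.

    purpose uses hypothetical labels: 'edit', 'analyze', 'web_docs'.
    """
    if purpose == "web_docs":
        return Route.INDEX
    if purpose == "edit":
        # Assumption: an edit needs the raw bytes even when the file is large.
        return Route.DIRECT
    if purpose == "analyze" or expected_lines > LINE_THRESHOLD:
        return Route.SANDBOX
    return Route.DIRECT
```

With these rules, choose_route("edit", 2000) returns Route.DIRECT and choose_route("analyze", 10) returns Route.SANDBOX, which mirrors the edit-versus-explore split in the list above.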
The one rule that makes or breaks it
Here's the distinction that took me embarrassingly long to get right: reading to edit and reading to understand are different operations.
If the agent is going to edit a file, it genuinely needs the bytes in context — the edit tool diffs against them. Sandboxing that just forces a second read later. But if it's reading a 2,000-line file to summarize what a module does, or scanning test output to find the failure, the raw content is pure ballast. Only the conclusion has value.
So the rule isn't "never read big things." It's "don't carry data you've already finished using." Edit-reads stay in context. Analyze-reads get summarized and dropped.
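As an illustration of an analyze-read, here is a small sketch that turns a Python module into an outline of its top-level definitions using the standard-library ast module. The name outline_module is hypothetical, and an outline is only one possible kind of summary; the point is that the 2,000 lines stay in the sandbox and a dozen lines cross over.

```python
import ast


def outline_module(source: str) -> str:
    """Summarize Python source as a list of its top-level functions and classes."""
    tree = ast.parse(source)
    entries = []
    for node in tree.body:  # top-level statements only, no nested definitions
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            entries.append(f"def {node.name} (line {node.lineno})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"class {node.name} (line {node.lineno})")
    header = f"{len(source.splitlines())} lines, {len(entries)} top-level definitions"
    return "\n".join([header, *entries])
```

If the agent later decides to edit one of those functions, that is the moment to read the file directly; the outline's line numbers tell it where to look.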
What changed
After wiring this up, the difference wasn't subtle. Sessions that used to get vague and forgetful around the one-hour mark now stay sharp, because the window isn't 70% stale diffs and test logs. When I check the stats, the sandbox is routinely keeping 80–95% of raw tool output out of context on exploration-heavy tasks.
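For anyone who wants to track a similar number, one simple way to compute it is sketched below. This is an assumed bookkeeping scheme, not necessarily how the stats above were gathered: ContextStats is a hypothetical helper that compares bytes produced in the sandbox with bytes actually surfaced to the model.

```python
from dataclasses import dataclass


@dataclass
class ContextStats:
    """Running totals of raw tool output versus what reached the context."""

    raw_bytes: int = 0
    surfaced_bytes: int = 0

    def record(self, raw: str, surfaced: str) -> None:
        """Add one tool call: its full output and the digest that was printed."""
        self.raw_bytes += len(raw.encode("utf-8"))
        self.surfaced_bytes += len(surfaced.encode("utf-8"))

    def kept_out_ratio(self) -> float:
        """Fraction of raw output that never entered the context window."""
        if self.raw_bytes == 0:
            return 0.0
        return 1 - self.surfaced_bytes / self.raw_bytes
```

Recording a 1,000-byte output with a 100-byte digest gives a ratio of 0.9, meaning 90% of the raw output stayed out of the window.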
The lesson I keep relearning: your agent isn't getting dumber. You're feeding it garbage and then asking it to think clearly with a mouth full of garbage. Point the firehose at a bucket, hand the model a cup, and a lot of "the AI got confused" problems quietly disappear.
If you've watched your agent read a 50 KB file to answer a yes/no question, you already know exactly what I'm talking about. The fix is structural, not motivational.