Our AI code validator agent took 608 seconds to report results from a test suite that runs in 96 seconds. The agent wasn't stupid. The tool output was.
Every developer tool we use (test runners, linters, compilers, build systems) was designed for humans reading a terminal. When an AI agent reads that same output through a context window, things break in ways you don't expect. This is one example of that problem, and a pattern for fixing it.
The Symptom
We run a TypeScript monorepo with ~12,000 tests across four packages. After each feature, a code-validator agent runs tests and reports pass/fail with coverage. Simple job.
| Agent Task | Actual Test Time | Agent Time | Overhead |
|---|---|---|---|
| Backend (3,683 tests) | 24s | 224s | 9.3x |
| Frontend (7,450 tests) | 96s | 608s | 6.3x |
The agent was spending 6-9x longer understanding the results than the tests took to run.
What The Agent Actually Did
We parsed the agent transcripts (every tool call, every reasoning step). Here's the backend agent's actual sequence:
1. npm run test:coverage → 419KB output, truncated at 235KB
2. grep "Tests" /tmp/output.log → matched console.log JSON, not summary
3. npm run test:coverage → re-ran entire suite. Truncated again.
4. tail -20 /tmp/output.log → got coverage table row, not summary
5. grep -E "passed|failed" → matched 47 lines of noise
6. npm run test:coverage → third complete re-run
... repeated 6 times total ...
12 tool calls. 6 complete test re-runs. 224 seconds. To answer a yes/no question.
The frontend agent was worse: 28 tool calls, 5 test re-runs, 13 different grep/tail/head combinations trying to parse a coverage text table. It even reported a false failure — incorrectly flagging coverage as below threshold because it parsed the wrong line.
Why? Because vitest produces this:
✓ src/services/__tests__/userService.test.ts (12 tests) 45ms
✓ src/services/__tests__/authService.test.ts (8 tests) 23ms
... 1,386 more files ...
Test Files 1389 passed (1389)
Tests 3683 passed (3683)
Duration 24.1s
----------|---------|----------|---------|---------|
File | % Stmts | % Branch | % Funcs | % Lines |
----------|---------|----------|---------|---------|
... 141 rows ...
419KB of human-readable output. The answer, five numbers, is at the bottom. The context window truncates from the bottom. The agent never sees it.
You wouldn't send 419KB of raw HTML to a mobile app and tell it to regex out the data. But that's exactly what we were doing with our agents.
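To see why the position of the summary matters, here is a minimal sketch that fakes a long runner output and caps it at a byte limit. Everything in it is hypothetical: `cap_output` stands in for whatever truncation your agent harness applies, and the file names and sizes are made up and scaled down for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical stand-in for the harness's output cap: keep only the first N bytes.
cap_output() { head -c "$1"; }

# Fake runner output: 2,000 per-file lines (about 120KB), then the summary last.
fake_test_output() {
  local i
  for i in $(seq 1 2000); do
    printf ' ✓ src/services/__tests__/file%04d.test.ts (8 tests) 23ms\n' "$i"
  done
  printf ' Tests  3683 passed (3683)\n'
}

# Prints "yes" if the summary line survives a cap of $1 bytes, "no" otherwise.
summary_visible() {
  local seen
  seen=$(fake_test_output | cap_output "$1" | grep -c 'Tests  3683 passed' || true)
  if [ "$seen" -gt 0 ]; then echo yes; else echo no; fi
}
```

With a cap smaller than the output, `summary_visible 65536` reports that the summary is gone. No amount of grepping the capped text can bring it back, which is why the agent kept re-running the suite.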
The Fix
We stopped asking "how do we make the agent parse this better" and asked "can we give the agent a command that just outputs the answer?"
RESULT_FILE=$(mktemp)
trap 'rm -f "$RESULT_FILE"' EXIT
START_TIME=$(date +%s)

# JSON reporter writes structured data to file. Everything else → /dev/null.
(cd "$PKG_DIR" && npx vitest run \
  --reporter=json \
  --outputFile="$RESULT_FILE" \
) > /dev/null 2>&1

# Wall-clock seconds for the whole run
WALL_TIME=$(( $(date +%s) - START_TIME ))

# Extract exactly what the agent needs
PASSED_TESTS=$(jq '.numPassedTests' "$RESULT_FILE")
FAILED_TESTS=$(jq '.numFailedTests' "$RESULT_FILE")
SUCCESS=$(jq '.success' "$RESULT_FILE")

echo "RESULT=$( [ "$SUCCESS" = "true" ] && echo "PASS" || echo "FAIL" )"
echo "TESTS=$PASSED_TESTS passed, $FAILED_TESTS failed"
echo "WALL_TIME=${WALL_TIME}s"

# On failure only: list each failed file and its failed tests (capped at 30 lines)
if [ "$SUCCESS" != "true" ]; then
  jq -r '.testResults[] | select(.status == "failed") |
    "FILE: \(.name)\n\([.assertionResults[] |
      select(.status == "failed") | " - " + .fullName] | join("\n"))"
  ' "$RESULT_FILE" | head -30
fi
Three decisions:
- `--reporter=json`: vitest writes structured JSON to a file
- `> /dev/null 2>&1`: 419KB of terminal noise disappears
- `jq`: extracts five numbers from structured data
The agent now sees this:
=== VALIDATION: test:backend ===
RESULT=PASS
SUITES=1389 passed, 0 failed (1389 total)
TESTS=3683 passed, 0 failed (3683 total)
WALL_TIME=40s
Five lines. One tool call. No parsing, no ambiguity, no re-runs.
The Pattern Is Everywhere
This isn't a vitest problem. It's a tool output problem. Every developer tool your agent touches has the same issue:
- Linters: ESLint's default output is human-friendly. `eslint --format json` gives your agent structured violations with file paths, line numbers, and severity, with no text scraping needed.
- Type checkers: `tsc --noEmit` prints errors as human-readable text. A 5-line wrapper that counts errors and captures file paths turns it into a structured report.
- Build tools: `docker build` streams layers of progress output. The agent only needs: did it succeed, what's the image size, how long did it take.
- Infrastructure: `terraform plan` produces pages of human-readable diff. `terraform plan -json` gives your agent a structured changeset it can reason about.
The pattern is always the same: the tool already has structured output (JSON, machine-readable flags), but the default is designed for a terminal. Switch the format, discard the noise, extract the answer.
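For a tool without a JSON mode, the wrapper can stay small. Here is a minimal sketch for the type-checker case, assuming tsc's plain diagnostic line format `path(line,col): error TS1234: message` (what you get with `--pretty false`). `summarize_tsc` is a hypothetical name, not part of any tool.

```bash
#!/usr/bin/env bash

# Reads tsc's plain-text diagnostics on stdin and prints a short report.
# Assumed line format: path(line,col): error TS1234: message
summarize_tsc() {
  local diagnostics errors files file_count
  # Keep only error diagnostics; everything else is noise for the agent.
  diagnostics=$(grep -E '(^|: )error TS[0-9]+:' || true)
  if [ -z "$diagnostics" ]; then
    echo "RESULT=PASS"
    echo "ERRORS=0"
    return 0
  fi
  errors=$(printf '%s\n' "$diagnostics" | wc -l | tr -d ' ')
  # Unique file paths, taken from diagnostics that carry a (line,col) location.
  files=$(printf '%s\n' "$diagnostics" \
    | sed -n -E 's/^(.+)\([0-9]+,[0-9]+\): error TS[0-9]+:.*$/\1/p' | sort -u)
  file_count=$(printf '%s\n' "$files" | grep -c . || true)
  echo "RESULT=FAIL"
  echo "ERRORS=$errors"
  echo "FILES=$file_count"
  # Cap the file list so a badly broken build cannot flood the context window.
  if [ -n "$files" ]; then
    printf '%s\n' "$files" | head -20 | sed 's/^/FILE: /'
  fi
}
```

Usage would look like `npx tsc --noEmit --pretty false 2>&1 | summarize_tsc`. Because the function only sees text, it cannot read tsc's exit code; a stricter wrapper would check that as well.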
The Results
| Metric | Before | After |
|---|---|---|
| Backend: tool calls | 12 | 1 |
| Backend: agent time | 224s | 42s |
| Frontend: tool calls | 28 | 1 |
| Frontend: agent time | 608s | 66s |
| False failures | 2 | 0 |
| Test re-runs per agent | 5-6 | 0 |
Same tests. Same agent. Same model. Same prompts.
The Takeaway
The industry is pouring effort into prompt engineering, model selection, and agent frameworks. Meanwhile, half the agent's context window is filled with ANSI color codes, progress bars, and output that was never meant for machine consumption. The context window is a scarce resource: treat it like memory, not a terminal screen.
When your agent is slow, don't start with the prompt. Start with what the tools are sending back. Audit every command your agent runs. If the output is more than a screenful, the agent is probably struggling with it. Most tools already support structured output: JSON flags, machine-readable formats, quiet modes. Use them. And where they don't exist, a simple wrapper script that filters noise and extracts the answer will do more for your agent's performance than any prompt rewrite.
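A rough way to start that audit is to measure how much each command prints before the agent ever sees it. The sketch below is an assumption-laden starting point: `audit_output` is a hypothetical helper, and the 40-line `MAX_LINES` default is just our stand-in for "a screenful".

```bash
#!/usr/bin/env bash

# Assumed threshold for "more than a screenful"; override via the environment.
MAX_LINES=${MAX_LINES:-40}

# Runs the given command, captures stdout and stderr together,
# and prints one line describing how much output it produced.
audit_output() {
  local tmp lines bytes verdict
  tmp=$(mktemp)
  "$@" > "$tmp" 2>&1 || true   # the command's own exit status is not the point here
  lines=$(wc -l < "$tmp" | tr -d ' ')
  bytes=$(wc -c < "$tmp" | tr -d ' ')
  rm -f "$tmp"
  if [ "$lines" -gt "$MAX_LINES" ]; then verdict=NOISY; else verdict=OK; fi
  echo "LINES=$lines BYTES=$bytes VERDICT=$verdict"
}
```

Running something like `audit_output npm run test:coverage` for each command in the agent's toolbox gives a quick list of which ones need a structured-output flag or a wrapper.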
The fastest agent isn't the one with the best reasoning. It's the one that doesn't have to reason about the data format at all.
Based on a real optimization on a production TypeScript monorepo with ~12,000 vitest tests. The pattern — structured output, noise suppression, answer extraction — applies to any tool your agents touch.