DEV Community

Penloom Studio


Stop Scrolling Your Agent's Logs. Debug It Like a Program.

Your coding agent just finished a 40-step run and the result is wrong. You do what everyone does: open the transcript and start scrolling. Twenty minutes later you have a vibe ("it went off the rails somewhere around the middle?") and no fix.

Scrolling is not debugging. An agent run is a program execution — a weird one, but still an execution — and the same discipline that works on programs works on agents: reproduce it, trace it, then write a check that can actually fail. Here's the workflow I use on real agent pipelines, with the three failure patterns that eat most of the time.

Step 1: Pin the run before you touch anything

You can't debug what you can't reproduce, and agent runs are built to not reproduce: the repo moved, the prompt was edited inline, the model sampled differently. So before investigating, freeze the inputs into a tiny harness:

#!/usr/bin/env bash
# repro.sh — pin everything the run depends on
set -euo pipefail

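: "${FAILING_COMMIT:?set FAILING_COMMIT to the commit the failing run started from}"

git stash --include-untracked          # set local changes aside so the checkout is clean
git checkout "$FAILING_COMMIT"         # exact repo state the failing run started from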

claude -p "$(cat task-that-failed.md)" \
  --model claude-sonnet-4-6 \
  --max-turns 25 \
  --output-format json > "run-$(date +%s).json"

Task text in a file, not your shell history. Model pinned to an exact ID, not "whatever the default is this week." Output captured as JSON, not eyeballed in a terminal that's about to be closed.
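
One way to make "pinned" checkable is to store a small manifest next to each captured run, so two runs can be compared input by input. A minimal sketch, assuming a manifest shape of my own invention: `RunManifest`, `build_manifest`, and `same_inputs` are hypothetical names, not part of any CLI.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RunManifest:
    """Hypothetical record of everything the harness pins."""
    commit: str       # repo state the run started from
    model: str        # exact model ID passed to the CLI
    max_turns: int    # turn limit passed to the CLI
    task_sha256: str  # hash of the task text, so inline edits are detectable


def build_manifest(task_text: str, commit: str, model: str, max_turns: int) -> RunManifest:
    digest = hashlib.sha256(task_text.encode("utf-8")).hexdigest()
    return RunManifest(commit=commit, model=model, max_turns=max_turns, task_sha256=digest)


def same_inputs(a: RunManifest, b: RunManifest) -> bool:
    """Two runs are comparable only if every pinned input matches."""
    return a == b
```

If `same_inputs` is false for two runs you are comparing, any difference in their output says nothing about the agent yet.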

If the failure doesn't reproduce twice in a row, that's not a dead end — that's your first finding. Nondeterministic failures are usually environment leaks (a dirty worktree, a cache, a network call inside the task), and the harness just told you the bug is there, not in the agent's reasoning.
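
The "twice in a row" rule can be written down as a tiny classifier. This is a sketch under assumptions: `run_once` is a hypothetical callable that runs the pinned harness and returns `True` when the run failed the same way as the original, and the three labels are my own.

```python
from typing import Callable


def classify(run_once: Callable[[], bool], attempts: int = 2) -> str:
    """Run the pinned harness several times and label the result.

    run_once returns True when the run FAILED the same way as the original.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    failures = [run_once() for _ in range(attempts)]
    if all(failures):
        return "reproducible"    # safe to move on to tracing
    if any(failures):
        return "flaky"           # suspect the environment before the agent
    return "not-reproduced"      # something the harness does not pin has changed


if __name__ == "__main__":
    outcomes = iter([True, False])
    print(classify(lambda: next(outcomes)))  # prints "flaky"
```

Only the "reproducible" label justifies spending time on the trace; the other two point back at the harness.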

Step 2: Trace it — turn the wall of text into a tree

A 40-step transcript is unreadable as prose but trivial as a tree: which step, what input, what output, how long, how many tokens. That's what tracing gives you, and you don't have to build it. Langfuse (~30k stars, open source, self-hostable) is the reference tool here; OpenLLMetry does the same as pure OpenTelemetry instrumentation if you'd rather ship spans to a backend you already own.

Instrumenting your own agent code is a decorator, not a rewrite:

from langfuse import observe  # Langfuse Python SDK v3; v2 exposed it as langfuse.decorators.observe

@observe()
def plan_step(task: str) -> str:
    ...

@observe()
def apply_edit(file: str, patch: str) -> bool:
    ...

Every call becomes a span; the run becomes a collapsible tree with inputs and outputs attached. Now "somewhere around the middle" becomes: step 14, the file-read tool returned an empty string, and every step after it reasoned confidently about a file that was never loaded.

That pattern — one bad tool result early, confident nonsense after — is the single most common agent failure I see. The model isn't hallucinating out of nowhere at step 30; it's faithfully extending a lie it was told at step 14. Find the first bad span and stop reading there. Fixes almost always belong at the first divergence, not the last symptom.
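
"Find the first bad span" is mechanical once the spans are exported. A minimal sketch, assuming a flat list of spans with a step number and a text output; `SpanRecord` and `looks_bad` are hypothetical and do not mirror Langfuse's export schema, and the "empty output" rule is only one possible validity check.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class SpanRecord:
    step: int     # position of the call within the run
    name: str     # tool or function name
    output: str   # what the call returned, as text


def looks_bad(span: SpanRecord) -> bool:
    """Hypothetical validity rule: a call that returns nothing is suspect."""
    return span.output.strip() == ""


def first_divergence(
    spans: Iterable[SpanRecord],
    is_bad: Callable[[SpanRecord], bool] = looks_bad,
) -> Optional[SpanRecord]:
    """Return the earliest suspect span, or None if every span passes."""
    for span in sorted(spans, key=lambda s: s.step):
        if is_bad(span):
            return span   # stop here: later spans are downstream of this one
    return None
```

The early return is the point of the sketch: it never looks at anything after the first failure, which is the same discipline the paragraph above asks of the reader.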

Step 3: Write a check that can fail

Here's the trap: you fix step 14, re-run, skim the output, and it "looks right." That's a vibe, not a verification — the same kind of skim that missed the bug the first time.

Turn "looks right" into an assertion. promptfoo makes this declarative:

# promptfooconfig.yaml
prompts:
  - file://task-that-failed.md
providers:
  - anthropic:messages:claude-sonnet-4-6
tests:
  - assert:
      - type: contains        # the fix actually landed
        value: "logger.error"
      - type: not-contains    # the regression stays dead
        value: "console.log"
      - type: javascript      # output is valid patch syntax, not prose about a patch
        value: "output.includes('--- a/') && output.includes('+++ b/')"

Or keep it in pytest if that's where your CI lives — the point isn't the tool, it's that the check is mechanical and falsifiable. A check that can't fail is decoration.
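
For the pytest route, the same three assertions might look like the sketch below. The helper name `patch_violations` and the sample patch are hypothetical. One deliberate difference from a plain substring check: a unified diff that removes `console.log` still contains that string on a `-` line, so this version inspects only the lines the patch adds.

```python
def patch_violations(output: str) -> list[str]:
    """Return every reason the agent's output fails the check (empty list = pass)."""
    problems = []
    # Lines the patch adds: start with "+" but are not the "+++ b/..." header.
    added = [
        line for line in output.splitlines()
        if line.startswith("+") and not line.startswith("+++")
    ]
    if not any("logger.error" in line for line in added):
        problems.append("fix missing: no added line contains logger.error")
    if any("console.log" in line for line in added):
        problems.append("regression: an added line contains console.log")
    if "--- a/" not in output or "+++ b/" not in output:
        problems.append("not a unified diff: missing ---/+++ headers")
    return problems


# Hypothetical sample output, for illustration only.
GOOD_PATCH = """\
--- a/src/app.js
+++ b/src/app.js
@@ -1 +1 @@
-console.log(err)
+logger.error(err)
"""


def test_good_patch_passes():
    assert patch_violations(GOOD_PATCH) == []


def test_check_can_fail():
    # Prose about a patch must not pass as a patch.
    assert patch_violations("I updated the logging for you.") != []
```

The second test matters as much as the first: it shows the check rejects something, which is what separates it from decoration.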

Two hard-won rules for these checks:

Grade fresh evidence only. I once burned an hour "diagnosing" a failure using QA artifacts from the previous run — old screenshots graded against new output produce confident, detailed, completely false diagnoses. Now every verification script starts by deleting its own evidence directory:

rm -rf qa/ && mkdir qa/   # stale evidence is worse than no evidence
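
If the verification script is written in Python, the same habit might look like the sketch below, with an extra guard that flags any evidence file older than the current run. Both function names are hypothetical. The guard compares modification times, so it assumes the clock and the filesystem timestamps are reasonably precise.

```python
import shutil
from pathlib import Path


def fresh_evidence_dir(path: Path) -> Path:
    """Delete and recreate the evidence directory before grading anything."""
    shutil.rmtree(path, ignore_errors=True)
    # Raises FileExistsError if the old directory could not be removed,
    # which is the loud failure we want.
    path.mkdir(parents=True)
    return path


def stale_files(path: Path, run_started: float) -> list[Path]:
    """Return evidence files last modified before this run began."""
    return sorted(
        p for p in path.rglob("*")
        if p.is_file() and p.stat().st_mtime < run_started
    )
```

A script would record `time.time()` before launching the agent, call `fresh_evidence_dir(Path("qa"))`, and refuse to grade if `stale_files` returns anything.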

Check the real surface. If the agent's job was "the page renders," asserting that the HTML file exists checks the wrong surface. Tools like proofshot exist precisely for this — record the browser, capture the screenshot, bundle the errors — because "the file is on disk" and "the thing works" are different claims, and agents are excellent at satisfying the first while failing the second.

The 10-minute version

When an agent run goes wrong:

  1. Pin it: task in a file, model pinned, repo state frozen, output captured. If it won't reproduce, the bug is most likely in your environment, and that's a finding.
  2. Trace it: decorator-level instrumentation, read the tree, find the first bad span. Ignore everything downstream of it; that's echo, not cause.
  3. Assert it: encode "fixed" as a check that can fail, grade only fresh evidence, and check the surface the user actually touches.

None of this is exotic. It's the debugging you already know, applied to a runtime that happens to talk back. The teams getting consistent value out of coding agents aren't the ones with magic prompts — they're the ones who stopped treating agent output as something you read and started treating it as something you test.

The agent isn't lying to you. You just couldn't see what it saw. Give yourself eyes first; the fix is usually one bad span away.
