When you hand off a multi-hour task to an AI coding agent and come back to the results, the right question isn't "did it finish?" — it's "did it stay within scope?" Agents running Claude Code, Codex, or OpenCode regularly do more than instructed: touching files outside the task boundary, introducing abstractions nobody requested, reorganizing directory structures that were working fine. The damage is usually invisible until it's compounding across three or four subsequent sessions.
This tutorial walks through a concrete post-run audit process — git diff review, scope compliance scoring, and per-tool-call trace inspection — that you can run after any agent session. The steps work with any agent on any codebase. No proprietary tooling required.
TL;DR: After any autonomous agent run, do three things: (1) run `git diff HEAD --stat` to map every file the agent touched, (2) score scope compliance by categorizing those changes as in-scope or out-of-scope, and (3) inspect the agent's tool-call traces to understand the specific actions behind each change. This audit takes 5–10 minutes per session and prevents the compounding drift that turns a well-structured codebase into something nobody wants to touch.
Why Do Agents Drift — and Why Don't You Notice Until It's Too Late?
The incident that kicked off the Agent Oversight Monitor thread on r/ChatGPTPromptGenius was blunt and recognizable: "I set up a Codex agent last week... came back two hours later and it had reorganized my entire project directory. Didn't ask. Didn't flag it." The agent completed the assigned task. It also restructured everything else, silently, without surfacing a single permission prompt.
This isn't a configuration failure — it's the default behavior of agents optimizing for task completion without a minimal-footprint constraint. Reorganizing adjacent code, introducing helper functions "for reuse," and cleaning up what they perceive as inconsistencies is well within an agent's operating logic when given broad file system access. Nothing in the standard workflow asks "what did you touch that you weren't supposed to touch?"
In a thread on r/PromptEngineering, developers described "watching their clean codebase slowly become spaghetti after just 3-4 prompts." Not from any single catastrophic session, but from accumulated small deviations — each one reasonable in isolation, each one building on the last. Session 1 adds an unnecessary abstraction. Session 2 builds on it. Session 3 introduces a workaround for the abstraction. Session 4 is debugging purgatory.
As the BugBoard agent audit checklist frames it, excessive agent agency is something to "find and fix before it becomes an incident." The audit process below is how you find it.
Prerequisites
Required:
- A project under git version control, with at least one commit before the agent session started
- Any AI coding agent: Claude Code, Codex, OpenCode, or similar
- `jq` installed for JSONL inspection (`brew install jq` on macOS, `apt install jq` on Debian/Ubuntu)
Optional — recommended for multi-agent or overnight runs:
- Lazyagent — a terminal TUI for observing and auditing agent runs, with inline diffs per tool call
- Grass (`npm install -g @grass-ai/ide`) — for reviewing diffs and session output from your phone after a long run, without needing to open a terminal
The 5-Step Post-Run Audit
Step 1: Map the Full Change Surface with git diff
Scope compliance — the percentage of agent actions that stayed within the assigned task — starts with knowing exactly what changed. Before looking at the content of any change, look at the complete list of changed files.
# Every changed file and how many lines changed
git diff HEAD --stat
# Changed files without line counts — easier to scan
git diff HEAD --name-only
# Changed files with change type (modified/added/deleted/renamed)
git diff HEAD --name-status
A typical output might look like this:
src/auth/token.ts | 23 ++++---
src/utils/helpers.ts | 187 +++++++++++++++++++++++++++++++
tests/auth.test.ts | 14 ++--
config/webpack.config.js | 42 +++++++++-
README.md | 8 +-
5 files changed, 261 insertions(+), 17 deletions(-)
You asked the agent to update the token refresh logic in `src/auth/token.ts`. It changed five files, including a 187-line new utility file, a webpack config, and the README. That discrepancy between what you asked for and what the file list shows is your drift signal.
Step 2: Categorize Changes as In-Scope or Out-of-Scope
Go through the changed file list and assign each file to one of three categories:
- In-scope: Directly required by the task brief
- Adjacent: Related but not directly requested (e.g., updating tests for code you changed)
- Out-of-scope: Not related to the task — the agent added this autonomously
# Inspect a specific file's changes in detail
git diff HEAD -- src/utils/helpers.ts
# See only the summary for one file
git diff HEAD --stat -- src/utils/helpers.ts
For the example above:
- `src/auth/token.ts` → In-scope (the actual task)
- `tests/auth.test.ts` → Adjacent (reasonable to update tests for changed code)
- `src/utils/helpers.ts` (187 new lines) → Out-of-scope — a new utility file you didn't request
- `config/webpack.config.js` → Out-of-scope — config changes not in the brief
- `README.md` → Out-of-scope — documentation not requested
Write these down. You need the counts for the next step.
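The categorization can also be scripted once you know your in-scope paths. A minimal sketch — the prefix lists below are assumptions matching this article's worked example, not a general rule; adapt them to your task brief:

```shell
#!/bin/bash
# categorize.sh — hypothetical helper; prefix lists match the worked example
IN_SCOPE_PREFIXES="src/auth/"
ADJACENT_PREFIXES="tests/"

categorize() {
  f=$1
  for p in $IN_SCOPE_PREFIXES; do
    case $f in "$p"*) echo "in-scope"; return ;; esac
  done
  for p in $ADJACENT_PREFIXES; do
    case $f in "$p"*) echo "adjacent"; return ;; esac
  done
  echo "out-of-scope"
}

# Label every file the agent touched
git diff HEAD --name-only | while read -r f; do
  printf '%-13s %s\n' "$(categorize "$f")" "$f"
done
```

Prefix matching is deliberately crude — a rename that moves a file out of `src/auth/` will show as out-of-scope, which is usually the behavior you want.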
Step 3: Compute Your Scope Compliance Score
The community-built Agent Oversight Monitor defines scope compliance as "what percentage of actions stayed within the assigned task." Turn your file categorization into a number:
scope_compliance = (in_scope + adjacent) / total_changed_files × 100
For the example above:
1 in-scope + 1 adjacent = 2 relevant files out of 5 total
scope_compliance = 40%
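The same arithmetic as a shell one-liner, using the counts from the worked example:

```shell
# 1 in-scope + 1 adjacent, 5 changed files total (worked example above)
in_scope=1; adjacent=1; total=5
score=$(( (in_scope + adjacent) * 100 / total ))
echo "scope compliance: ${score}%"   # prints "scope compliance: 40%"
```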
Thresholds:
- ≥ 80%: Acceptable. Review out-of-scope changes individually before committing.
- 50–80%: Yellow. The agent drifted significantly. Inspect each out-of-scope change carefully; revert if the changes aren't beneficial.
- < 50%: Red. The session was off-task more than on-task. Revert out-of-scope changes before running another session.
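A small helper can turn the score into a verdict. This is a sketch using the thresholds above; the function name and labels are this article's, not a standard:

```shell
# Map a scope-compliance percentage onto the article's thresholds
verdict() {
  if [ "$1" -ge 80 ]; then echo "acceptable"
  elif [ "$1" -ge 50 ]; then echo "yellow: significant drift"
  else echo "red: revert out-of-scope changes first"
  fi
}

verdict 40   # prints "red: revert out-of-scope changes first"
```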
# Count total changed files
git diff HEAD --name-only | wc -l
# Inspect each out-of-scope file individually
git diff HEAD -- config/webpack.config.js
For tracking this metric systematically across sessions:
#!/bin/bash
# scope-audit.sh
# Usage: ./scope-audit.sh <in-scope-file1> <in-scope-file2> ...
# Pass the files you explicitly asked the agent to modify
IN_SCOPE_COUNT=$#
TOTAL_COUNT=$(git diff HEAD --name-only | wc -l | tr -d ' ')
SCORE=$(echo "scale=0; $IN_SCOPE_COUNT * 100 / $TOTAL_COUNT" | bc)
echo "Changed files:"
git diff HEAD --name-only | sed 's/^/ /'
echo ""
echo "In-scope: $IN_SCOPE_COUNT / $TOTAL_COUNT"
echo "Scope compliance: ${SCORE}%"
Step 4: Inspect Per-Tool-Call Traces
Scope compliance tells you what changed. Tool-call traces tell you why — the exact sequence of agent actions that produced each change. This is where you find hallucinated function calls, unauthorized bash commands, and the specific moments where the agent went off-script.
For Claude Code sessions:
Claude Code stores session transcripts as JSONL files at `~/.claude/projects/<encoded-path>/<session-id>.jsonl`. Each line is a JSON event. Extract the tool calls:
# Locate recent session files for the current project
# (Claude Code encodes the project path by replacing "/" and "." with "-")
SESSION_DIR=~/.claude/projects/$(pwd | tr '/.' '--')
ls -lt "$SESSION_DIR"/*.jsonl | head -5
# Extract all tool calls from the most recent session
# (tool_use blocks sit inside the assistant message's content array)
LATEST=$(ls -t "$SESSION_DIR"/*.jsonl | head -1)
jq -r '.message.content[]? | select(.type? == "tool_use") | "\(.name): \(.input | tostring | .[0:120])"' "$LATEST"
This gives you a readable trace of every tool the agent invoked — file reads, bash commands, file writes — in execution order. Look for:
- Tool calls that reference files outside your in-scope list
- Bash commands that weren't part of the task (package installs, config modifications, directory restructuring)
- File writes to paths you didn't anticipate
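That scan can be partially mechanized. A hypothetical filter over the trace lines produced by the jq command above, keeping only lines that mention a project path outside your in-scope prefixes (the prefixes here come from this article's worked example):

```shell
# Keep trace lines referencing project paths that are not in-scope
flag_out_of_scope() {
  grep -E '(src|tests|config)/' | grep -v -e 'src/auth/' -e 'tests/'
}

# Example trace lines (hypothetical):
printf 'Write: src/auth/token.ts\nWrite: config/webpack.config.js\n' \
  | flag_out_of_scope
# prints only "Write: config/webpack.config.js"
```

This is path-based only — it won't flag suspicious bash commands that don't mention a file path, so keep the manual read of the trace for those.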
For Lazyagent:
Lazyagent is a terminal TUI built specifically to observe and audit agent runs. It shows inline diffs per tool call — so you see exactly what each individual action changed, not just the aggregate diff. For multi-agent runs, it shows parent-child relationships between agents, making it possible to trace what a spawned subagent did versus what the parent delegated.
Start Lazyagent alongside your agent session and review the tool-call timeline when the run completes:
lazyagent
Reviewing 400-line aggregate diffs is significantly harder than reviewing each tool call's diff individually. If you're running overnight sessions or parallel agents, Lazyagent's per-action granularity is worth the setup.
Step 5: Apply the Post-Run Checklist
Run through this checklist after every session longer than 30 minutes, or any session where the agent had broad file system access. As production agent deployment guides increasingly recommend, treat this as your audit log for every agent-executed operation — something you can trace back to when debugging unexpected behavior later.
Post-Run Audit Checklist:
- [ ] `git diff HEAD --stat` reviewed — full file change surface mapped
- [ ] Each changed file categorized (in-scope / adjacent / out-of-scope)
- [ ] Scope compliance score computed
- [ ] Out-of-scope changes reviewed individually — accepted, reverted, or flagged
- [ ] Tool-call trace inspected for unexpected bash commands or file accesses
- [ ] New files (additions) reviewed for necessity — especially new utility modules
- [ ] Config or dependency changes reviewed (package.json, webpack, CI/CD, env files)
- [ ] Commit message updated to reflect what the agent actually changed, not just what you asked it to do
That last item matters more than it sounds. If your commit message says "update token refresh logic" but the agent also modified your webpack config, that mismatch will confuse you — or a teammate — when you're bisecting a regression three weeks from now.
How to Verify the Audit Caught Something Real
A scope compliance score tells you that something happened outside the task boundary. These steps confirm the codebase is in the state you intended after any reversions:
# After reverting out-of-scope changes, confirm only intended files remain modified
git diff HEAD --name-only
# Run your test suite against the post-revert state
npm test # or your test runner equivalent
# Verify no phantom changes remain
git status
If reverting out-of-scope changes breaks in-scope functionality, that's a more serious signal: the agent built implicit dependencies between the task work and the unauthorized changes. The safest path is to revert everything (`git checkout -- .`), re-run the session with a tighter scope prompt, and use approval gates to prevent the original drift pattern from recurring.
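When only some changes need to go, a selective revert is gentler than `git checkout -- .`. One subtlety: `git checkout HEAD -- <file>` can only restore files that exist in HEAD, so files the agent created must be deleted instead. A sketch (the function name is hypothetical; the commented paths are from this article's worked example):

```shell
# Revert a list of out-of-scope files, handling agent-created files too
revert_out_of_scope() {
  for f in "$@"; do
    if git cat-file -e "HEAD:$f" 2>/dev/null; then
      git checkout HEAD -- "$f"   # file existed before the session: restore it
    else
      rm -f "$f"                  # new file: checkout can't restore it, delete it
    fi
  done
}

# Example:
# revert_out_of_scope src/utils/helpers.ts config/webpack.config.js README.md
```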
How Grass Makes This Workflow Better
The audit steps above work from any terminal. But if you're running agents overnight, on a remote VM, or across multiple parallel sessions, one of the biggest friction points is getting back to your laptop to run the audit at all. You wake up, your coffee is brewing, and you want to know what the agent did — without opening a terminal and chaining together git commands.
Grass is a machine built for AI coding agents — an always-on cloud VM where Claude Code and OpenCode run persistently, accessible from your laptop, your phone, or an automation. Its built-in diff viewer changes the post-run audit workflow in a specific way: you don't need a terminal or a git diff command to see what the agent touched. The diff is surfaced directly in the mobile app, file by file, with syntax highlighting and line numbers, the moment the session completes.
After an overnight Claude Code run, the audit workflow with Grass looks like this:
- Open the Grass mobile app
- Tap into the completed session
- Tap "Diffs" in the session header
- Scroll through the per-file diff view — additions in teal, deletions in red, file status badges for modified / new / deleted / renamed
- Any file that looks out-of-scope is visible immediately — no terminal, no SSH, no `git diff`
The diff viewer shows `git diff HEAD` output parsed into per-file views, accessible from anywhere on a phone screen. For a deeper walkthrough of reviewing agent code changes from your phone, see How to Review Your Agent's Code Changes from Your Phone.
For catching drift before it happens during a session, Grass also forwards Claude Code's permission prompts to your phone as native modals. When the agent wants to run a bash command or write to an unexpected file path, you get an approve/deny prompt in real time. That's a complementary layer to the post-run audit — pre-execution gating versus post-execution review — and they address different failure modes. You can read more about how these gates work at What is an agent approval gate?
For long overnight runs specifically, Grass keeps the session alive even if your laptop closes or your network drops — the agent runs on the cloud VM, not on your machine. When you check in the next morning, the session is there, the diff is ready, and the audit takes the same 5 minutes whether the run lasted one hour or eight. See How to Monitor a Long-Running Coding Agent Overnight for the full workflow.
Try it: `npm install -g @grass-ai/ide`, then `grass start` in your project directory. Scan the QR code, run a Claude Code session, and check the Diffs tab when it completes. Free tier: 10 hours, no credit card required at codeongrass.com.
Troubleshooting
`git diff HEAD` shows nothing, but the agent clearly made changes.
The agent may have committed during the session. Run `git log --oneline -10` to see recent commits, then audit across all agent commits: `git diff <pre-session-commit>..HEAD --stat`.
Scope compliance score is low, but the changes look correct.
The metric counts files, not intent. A low score on a large refactor where the agent legitimately touched many files is different from a low score on a focused bug fix. Use the score as a trigger for manual inspection, not as the final verdict.
The session JSONL is missing or empty.
Claude Code writes JSONL transcripts for sessions started through the SDK (which tools like Grass use). For sessions run directly via the `claude` CLI in interactive mode, the transcript location may differ. Check `~/.claude/projects/` for directories that match your project path.
Lazyagent doesn't show the session I want to audit.
Lazyagent captures tool calls during a live session — it's not a retrospective log viewer. It needs to be running alongside the agent to capture the timeline. For retrospective analysis, use the JSONL approach in Step 4.
Reverting out-of-scope changes breaks in-scope functionality.
The agent created implicit dependencies between the task work and the unauthorized changes. Revert everything with `git checkout -- .`, then re-run the session with a tighter scope prompt. Consider using approval gates to gate write operations behind explicit approval — which prevents the unauthorized files from being written in the first place.
FAQ
How often should I run a post-run audit on AI coding agent sessions?
After every session longer than 30 minutes, or any session where the agent had write access to more than one directory. For short focused tasks — under 15 minutes, clearly bounded scope — a quick `git diff HEAD --stat` scan is usually sufficient without the full checklist.
What scope compliance score is acceptable for an AI coding agent?
A score of 80% or higher means the agent stayed mostly on task — review any out-of-scope changes individually before accepting them. Between 50–80%, the agent drifted significantly and each out-of-scope change warrants careful review. Below 50%, the session was off-task more than on-task; revert out-of-scope changes before your next session to avoid compounding drift.
How do I review per-tool-call traces from Claude Code?
Claude Code stores session transcripts as JSONL files at `~/.claude/projects/<encoded-cwd>/<session-id>.jsonl`. Extract tool calls with jq — note that the tool_use blocks sit inside the assistant message's content array: `jq -r '.message.content[]? | select(.type? == "tool_use") | "\(.name): \(.input | tostring)"' <session>.jsonl`. Lazyagent provides an interactive TUI alternative that shows inline diffs per tool call during or after a session.
Can I prevent agent drift at the start of a session rather than auditing after?
Yes — pre-execution constraints help significantly. Structuring your workflow so that all write operations require explicit human approval prevents out-of-scope writes before they happen. Combining pre-execution gates with post-run audits gives you two independent checks: gates prevent unauthorized actions, audits catch actions that were authorized but shouldn't have been.
What's the difference between scope creep and agent hallucination in a codebase?
Scope creep is when the agent takes real, correct actions outside the task brief — useful code in the wrong place. Hallucination in this context is when the agent creates functions, imports, or API calls that don't exist in your codebase and then references them — code that looks plausible but is broken. The post-run audit catches both: scope creep shows up in the file change surface in Step 2, hallucinations surface when you run tests or inspect tool-call traces for references to non-existent paths.
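One way to catch the hallucination case mechanically is to check that every relative import the agent wrote resolves to a real file. A rough sketch for TypeScript-style imports — the resolution rules here are deliberately simplified (only `.ts` and `index.ts`), and the function name is hypothetical:

```shell
# Does a relative import written in file $1 resolve on disk?
check_import() {
  # $1 = importing file, $2 = import path as written (e.g. ./helpers)
  d=$(dirname "$1")
  for candidate in "$d/$2.ts" "$d/$2/index.ts" "$d/$2"; do
    [ -f "$candidate" ] && return 0
  done
  echo "unresolved import: $2 (in $1)"
}

# Example:
# check_import src/auth/token.ts ./helpers
```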
Next Steps
- Run `git diff HEAD --stat` on your most recent agent session right now. If you've run multiple sessions without auditing, use `git log --oneline -20` to find the pre-agent commit and audit from there.
- Compute the scope compliance score. If it's below 80%, revert out-of-scope changes before your next session.
- For overnight or remote runs, set up Grass to surface the diff on your phone the moment the session completes — no terminal required: codeongrass.com.
- Add the audit checklist to your agent workflow documentation so it becomes a standard step, not an incident response.
Agent drift is easiest to contain at session boundaries. Once it compounds across three or four sessions, you're no longer running a checklist — you're doing codebase archaeology.
Originally published at codeongrass.com