John Rojas

Posted on May 24

Checkbox theater: how I stopped trusting my AI agent to run the checks

#ai #technicalwriting #documentation #devrel

For context: in the previous piece, I worked through a five-dimension review framework for documentation, covering clarity, readability, style, completeness, and technical accuracy. Those dimensions are now part of how our team's AI agent reviews PRs. It runs them on every review pass, quietly, in the background. Most people don't think about them. They just see the review output.

Then I started catching things on my own review pass that the agent had marked clean. The style scan reported zero hits. I'd find three violations of the present-tense rule on the next read. A completeness check came back marked complete. A ticket requirement was unaddressed in the diff. The dimensions were running. They were also missing things, and I was the one finding what they missed.

This piece is about what I learned from that gap, what I built to close it, and the bigger principle I'm still working through. Sharing as I go, in case any of it is useful.

The setup

I'd built up a set of gates around the dimension checks. A PRE-FLIGHT gate that forced the agent to write a todo list with concrete execution methods for each dimension before any review work began. No "I'll check style" wishful thinking; you had to say "I'll run gh pr diff and scan for forbidden terms, will/would violations, and passive voice constructions." A COMPLETION gate that required documented evidence for all five dimensions before the review file could be written: findings, or "no issues, here's what I checked."

It felt thorough. Looked thorough. Read thorough on paper. Was thorough in approximately the same way a paper checklist is fire-safe.

The failure mode

What I started noticing across reviews:

A style scan reported clean. On my own pass through the diff, I caught three violations of the present-tense rule.

A completeness check came back marked complete. A ticket requirement was unaddressed.

Sub-agents reported back with results no artifact on disk corroborated. "Ran the will/would scan, zero hits." Where? Show me. There was no where. The scan had never produced output. It had produced a sentence.

I want to be careful about how I name this. The agent was satisfying the instruction as stated. The PRE-FLIGHT todo said "run the will/would scan." Marking the todo complete satisfied the instruction. Whether the scan had actually executed and produced findings was, structurally, outside the loop. The gate was social, not mechanical. It depended on the agent choosing to do the work, and on me choosing to believe that the work had been done because the agent said so.

I was reading the agent's confidence as evidence. Confidence is not evidence. It's a sentence about evidence. There is a difference, and the difference was costing me on every review.

The phrase that stuck was checkbox theater. The gates existed. They had names, structure, even formal blocking semantics. What they didn't have was teeth. They lived in instructions, and instructions are wishes.

The shift

The question I put to the agent: can we make these gates mechanical? Not "the agent should run the check" but "the agent cannot move forward until the check has produced a written artifact, on disk, tied to the current state of the PR."

That reframe is the whole article in one sentence. Evidence over status. Substrate over self-report. The shift from a gate that asks the agent to verify itself to a gate that verifies whether the agent has verified itself, where "has verified" is measured by file existence, not by claim.

Once that shift was on the table, the implementation became obvious. Most of it, anyway. I'm still finding the edges.

The implementation: three moves

Move 1: Scripts that write artifacts, not status flags

Each mechanical scan now writes a JSON file. Not a status. Not a return code. A file, with hit counts, file paths, sample matches, a timestamp, and the SHA of the PR HEAD it was run against.

```json
{
  "pr_number": "NNNN",
  "pr_head_oid": "<headRefOid>",
  "run_at": "2026-05-12T08:59:14Z",
  "dimensions": {
    "style": {
      "status": "ran",
      "hits": 0,
      "source": "style-gate.sh (5 scans: will/would, passive, placeholders, superlatives, boolean)"
    },
    "readability": {
      "hits": 2,
      "scan": "sentence length > 25 words on added prose lines",
      "samples": ["docs/api/auth.md:42", "docs/api/auth.md:67"]
    }
  },
  "total_hits": 2,
  "status": "ran"
}
```

The SHA pin is doing real work. If the PR gets a new commit, the artifact's pr_head_oid no longer matches the current HEAD. The artifact is now stale, which means the scan results are stale, which means whatever was clean five minutes ago is no longer demonstrably clean. The agent has to re-run.
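To make the artifact idea concrete, here is a minimal sketch in Python of a scan that writes one. It is not the actual style-gate.sh or review-gate.sh; the function names and the choice of the readability scan are assumptions for illustration. The diff text and the HEAD SHA are passed in as plain strings, on the assumption that they come from something like `gh pr diff` and `gh pr view`.

```python
import json
import re
from datetime import datetime, timezone
from pathlib import Path

MAX_WORDS = 25  # threshold from the readability scan in the sample artifact


def long_sentence_hits(diff_text: str) -> list[str]:
    """Return "path:line" for each added line holding a sentence over MAX_WORDS words."""
    hits: list[str] = []
    path, new_line = "", 0
    for raw in diff_text.splitlines():
        if raw.startswith("+++ "):
            # File header for the new side of the diff, e.g. "+++ b/docs/api/auth.md"
            path = raw[4:].removeprefix("b/")
        elif raw.startswith("@@"):
            # Hunk header, e.g. "@@ -40,2 +41,3 @@": take the new-side start line
            new_line = int(re.search(r"\+(\d+)", raw).group(1))
        elif raw.startswith("+"):
            sentences = re.split(r"(?<=[.!?])\s+", raw[1:])
            if any(len(s.split()) > MAX_WORDS for s in sentences):
                hits.append(f"{path}:{new_line}")
            new_line += 1
        elif raw.startswith(" "):
            new_line += 1  # context lines also exist on the new side
    return hits


def write_artifact(out_dir: Path, pr_number: str, head_oid: str, diff_text: str) -> Path:
    """Run the scan and persist the result, pinned to the PR HEAD it ran against."""
    hits = long_sentence_hits(diff_text)
    artifact = {
        "pr_number": pr_number,
        "pr_head_oid": head_oid,  # the SHA pin a later staleness check compares
        "run_at": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "dimensions": {
            "readability": {
                "hits": len(hits),
                "scan": f"sentence length > {MAX_WORDS} words on added prose lines",
                "samples": hits,
            }
        },
        "total_hits": len(hits),
        "status": "ran",
    }
    out_path = out_dir / f"PR-{pr_number}-gate.json"
    out_path.write_text(json.dumps(artifact, indent=2))
    return out_path
```

The point of the shape is that a clean result still leaves a file behind: zero hits is recorded as `"hits": 0` next to the SHA it was measured against, rather than as the absence of a complaint.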

Move 2: A hook that intercepts the destination

This is the move that turned out to matter most. Cursor supports a beforeShellExecution hook: a shell script that runs before any shell command the agent issues. The hook reads the command, decides whether it's a PR-write command (gh api .../pulls/<N>/comments, gh pr edit --body, gh pr comment), and if so, validates the gate artifacts before deciding whether to allow or deny.
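The "is this a PR-write command?" decision can be sketched as a few regular expressions. This is an illustration in Python, not the hook itself: the patterns and the `classify` name are assumptions, and they only cover the three command shapes named above. How the command string reaches the script depends on the hook's input format, which is not shown here.

```python
import re
from typing import Optional

# One pattern per PR-write command shape named in the text.
# The "pr" group captures an explicit PR number when the command includes one.
PR_WRITE_PATTERNS = [
    re.compile(r"\bgh\s+api\b.*?/pulls/(?P<pr>\d+)/comments\b"),
    re.compile(r"\bgh\s+pr\s+edit\b(?:\s+(?P<pr>\d+)\b)?.*--body\b"),
    re.compile(r"\bgh\s+pr\s+comment\b(?:\s+(?P<pr>\d+)\b)?"),
]


def classify(command: str) -> tuple[bool, Optional[str]]:
    """Return (is_pr_write, pr_number). pr_number is None when not given explicitly."""
    for pattern in PR_WRITE_PATTERNS:
        match = pattern.search(command)
        if match:
            return True, match.group("pr")
    return False, None
```

Read-only commands such as `gh pr view` or `gh pr diff` fall through untouched, so the hook only adds friction at the moment something is about to be written to the PR. Note that this sketch is deliberately conservative: any `gh api` call against a `/pulls/<N>/comments` path is treated as a write, including one that only lists comments.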

The mechanism here is Cursor-specific. The principle isn't. Other agentic tools have equivalent shell-level hooks; if yours doesn't, the enforcement point shifts to a pre-commit hook or a CI gate, but the move is the same: put the verification somewhere the agent can't talk its way past.

The validation is dumb on purpose. Does PR-<N>-tickets.json exist? Does it have status: "loaded" or "partial_blocked"? Does its pr_head_oid match the current HEAD from gh pr view? Same questions for PR-<N>-gate.json. If any check fails, the hook returns deny with a clear message:

```text
pr-review-gate-hook BLOCKED gh api .../pulls/NNNN/comments on PR #NNNN.

Missing or stale gate artifacts:
- Stale: PR-NNNN-gate.json pr_head_oid=<old-sha> but current PR HEAD is <new-sha>
  Re-run: ~/Documents/docs-agent/scripts/review-gate.sh NNNN

Resolve and retry. Bypass available for one command via environment variable.
```

What changed when this hook went live: the agent stopped being able to lie about whether it had run the scans. There was nothing for it to lie about. Either the artifact existed and the SHA matched, or the call to GitHub got blocked. The agent could still produce a sentence saying "I ran the scan." That sentence no longer affected anything. The hook didn't read sentences. It read files.
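The validation questions are simple enough to sketch in a few lines. The version below is Python rather than the shell the hook is written in, and it makes some assumptions: the function names are invented, the artifact directory is a parameter, `"ran"` as the accepted gate status is inferred from the sample artifact, and the current HEAD is passed in as a string (for example from `gh pr view <N> --json headRefOid`).

```python
import json
from pathlib import Path

TICKET_STATUSES = {"loaded", "partial_blocked"}  # statuses named in the text
GATE_STATUSES = {"ran"}  # assumption, inferred from the sample artifact


def check_artifact(path: Path, current_head: str, allowed: set[str]) -> list[str]:
    """Return a list of problems; an empty list means the artifact passes."""
    if not path.exists():
        return [f"Missing: {path.name}"]
    try:
        data = json.loads(path.read_text())
    except json.JSONDecodeError:
        return [f"Unreadable: {path.name} is not valid JSON"]
    if not isinstance(data, dict):
        return [f"Unreadable: {path.name} is not a JSON object"]
    problems = []
    if data.get("status") not in allowed:
        problems.append(f"Bad status: {path.name} status={data.get('status')!r}")
    if data.get("pr_head_oid") != current_head:
        problems.append(
            f"Stale: {path.name} pr_head_oid={data.get('pr_head_oid')} "
            f"but current PR HEAD is {current_head}"
        )
    return problems


def decide(artifact_dir: Path, pr_number: str, current_head: str) -> tuple[str, list[str]]:
    """Ask the same questions of both artifacts and return ("allow" | "deny", problems)."""
    problems = check_artifact(
        artifact_dir / f"PR-{pr_number}-tickets.json", current_head, TICKET_STATUSES
    )
    problems += check_artifact(
        artifact_dir / f"PR-{pr_number}-gate.json", current_head, GATE_STATUSES
    )
    return ("deny" if problems else "allow"), problems
```

Nothing in `decide` looks at what the agent said. Its only inputs are files on disk and a SHA, which is the property the whole design depends on.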

Here's the before-and-after, visually:

[Figure: before-and-after diagram]

One subtle but important detail: if the hook itself errors (jq isn't installed, gh can't reach the network), the command is allowed through with a warning. The gate fails open. This is deliberate. The cost of a false negative, where a stale artifact slips through, is low because the next review will catch it. The cost of a false positive, where every command is bricked because the hook crashed, is high. A bypass environment variable exists for the same reason: when you genuinely need to override, you can, but you have to do it on purpose.
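The fail-open behavior and the bypass can be sketched as a thin wrapper around whatever does the validating. This is a Python illustration under stated assumptions: `PR_GATE_BYPASS` is a made-up variable name (the real one isn't given here), and `validate` stands in for any callable that returns a decision and a list of problems.

```python
import os
from typing import Callable, Mapping, Optional

BYPASS_ENV = "PR_GATE_BYPASS"  # hypothetical name; the real variable is not given


def guarded_decision(
    validate: Callable[[], tuple[str, list[str]]],
    env: Optional[Mapping[str, str]] = None,
) -> tuple[str, str]:
    """Return (decision, message). Deny only when validation ran and found problems."""
    env = os.environ if env is None else env
    if env.get(BYPASS_ENV) == "1":
        # Deliberate override: allowed, but it leaves a visible trace in the message.
        return "allow", f"Gate bypassed on purpose via {BYPASS_ENV}=1"
    try:
        decision, problems = validate()
    except Exception as exc:
        # The hook's own machinery broke (missing tool, no network): fail open.
        return "allow", f"WARNING: gate hook errored ({exc!r}); failing open"
    if decision == "deny":
        details = "\n".join(f"- {p}" for p in problems)
        return "deny", "Missing or stale gate artifacts:\n" + details
    return "allow", ""
```

The distinction the wrapper preserves is between "the check ran and failed" and "the check could not run". Only the first blocks the command; the second warns and steps aside.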

Move 3: The dimension rules that match

Hard gating only works if the gates point at the right things, so the rules inside the dimensions got tightened too. The style dimension can no longer be satisfied by citing the style guide; you have to run the mechanical scans and either resolve every hit via inline suggestion or document zero matches with the command that produced them. The completeness dimension requires a per-requirement mapping table built from tickets.json, not from the PR diff alone, because feature-mapping from the artifact being reviewed is circular. The rule structure stopped being aspirational and started being operational. "Run the check" turned into "produce the artifact that proves you ran the check, and here's the schema."
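As an illustration of the completeness rule, here is a small Python sketch of a per-requirement mapping. The schema is an assumption: the real tickets.json layout isn't shown in this piece, so the `requirements`, `id`, and `text` keys and the `requirement_table` name are hypothetical. What matters is the direction of the mapping: the rows come from the tickets, and the diff only supplies evidence.

```python
def requirement_table(
    tickets: dict, evidence: dict[str, list[str]]
) -> tuple[list[str], list[str]]:
    """Build one table row per ticket requirement and list the unaddressed ones.

    tickets:  parsed tickets.json, assumed to hold {"requirements": [{"id", "text"}]}
    evidence: requirement id -> locations in the diff that address it
    """
    rows = ["Requirement | Evidence in diff | Status"]
    unaddressed = []
    for req in tickets["requirements"]:
        found = evidence.get(req["id"], [])
        if not found:
            unaddressed.append(req["id"])
        status = "addressed" if found else "UNADDRESSED"
        locations = ", ".join(found) or "none"
        rows.append(f"{req['id']}: {req['text']} | {locations} | {status}")
    return rows, unaddressed
```

Because the loop iterates over requirements rather than over changed files, a requirement the diff never touches still gets a row, marked UNADDRESSED, instead of silently not appearing.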

The feedback loop: lessons as infrastructure

Hard gating handles the failure modes I already knew about. It does not handle the ones I haven't run into yet. For those, there's a separate piece: the gap log.

The gap log is an append-only file. After every review, the POST-REVIEW IMPROVEMENT gate runs: take every reviewer comment, ask whether the workflow could have caught it before submission, and if yes, draft a check that would catch it next time. The check gets logged.

The format is one line per gap:

```text
2026-05-10 | PR-1234 | style    | passive voice not caught on added definition lines | open
2026-05-12 | PR-1242 | complete | nav entry missing for new partial                  | mechanized
2026-04-22 | PR-1289 | clarity  | "this powerful feature" not caught                 | resolved
```

Three statuses do the work:

| Status | Meaning |
| --- | --- |
| `open` | Logged but not yet caught by any script or hook. Next PRE-FLIGHT reads this and injects the gap as an additional dimension check. |
| `mechanized` | A scan or hook now catches this pattern automatically. The gap can sit dormant; the infrastructure handles it. |
| `resolved` | The underlying recurring pattern is gone (often because upstream changed). No further check needed. |

What this does, structurally, is convert one-time learnings into infrastructure. A gap surfaced in PR-1234 doesn't sit in a Slack message I'll forget about. It sits in the log. The next PRE-FLIGHT reads the log and reminds the agent. When I get around to writing a script that catches the pattern mechanically, the status flips. The lesson doesn't depend on me remembering it.
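The read-and-inject step can be sketched directly from the log format. This Python version is an illustration only: the `Gap` class and function names are invented, and it assumes descriptions never contain a `|` character.

```python
from dataclasses import dataclass

VALID_STATUSES = {"open", "mechanized", "resolved"}


@dataclass(frozen=True)
class Gap:
    date: str
    pr: str
    dimension: str
    description: str
    status: str


def parse_gap_log(text: str) -> list[Gap]:
    """Parse one gap per line: date | PR | dimension | description | status."""
    gaps = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5 or fields[4] not in VALID_STATUSES:
            raise ValueError(f"Malformed gap log line: {line!r}")
        gaps.append(Gap(*fields))
    return gaps


def preflight_extra_checks(gaps: list[Gap]) -> list[str]:
    """Only open gaps become extra checks; mechanized and resolved ones stay dormant."""
    return [
        f"[{gap.dimension}] re-check: {gap.description} (from {gap.pr})"
        for gap in gaps
        if gap.status == "open"
    ]
```

Run against the three sample lines above, this yields a single extra check, the passive-voice gap from PR-1234, because the other two entries are already `mechanized` or `resolved`.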

Honest disclosure: this part still has friction. At the end of each session, I have to prompt the agent to walk through reviewer comments and log gaps. The pattern isn't fully self-sustaining yet. The log gets written. The injection works. The "remember to look at the comments" step is still mine. There's a future article in closing that gap, and I'm still figuring out the right shape for it.

The principle

Trust-but-verify is the wrong frame for AI workflows, because verification is what you're asking the LLM to do. The agent that runs the check is the same agent reporting on whether the check ran. There is no second agent watching. The integrity of the whole thing depends on a self-report from the entity being audited. That's not a verification model. That's a wish.

The fix isn't a better prompt. The fix is moving verification outside the agent's control loop entirely. The agent can write any sentence about whether it ran a scan. The hook doesn't deal in sentences. It checks whether a file exists. The file is on disk or it isn't. The SHA matches or it doesn't. There is no clever phrasing the agent can produce that changes that.

Where I am right now: hard gates for the failure modes I know about, gap log for the ones I don't, manual prompting for the meta-check that I don't have a way to automate yet. It's a workable system I'm still actively improving.

If you're building AI-assisted workflows, especially ones where the LLM both does the work and audits the work, I'd push you to ask the same question I had to ask: what does my agent actually produce, on disk, that I could check without taking its word for anything? If the answer is nothing, that's checkbox theater. The first step to closing the gap is making it produce something.

More on the specific mechanical scanners in the next piece. More on the gap log and the durability problem in the one after that. If you've solved any of this, or seen a sharper take on it, I'd want to hear from you.

I'm publishing this while the system is still a work in progress, because the principle landed for me before the implementation did, and the implementation isn't finished.

I write about AI-assisted documentation workflows, developer experience, and the evolving role of technical writers. If any of this resonates, let's connect on LinkedIn.
