Takayuki Kawazoe

I shipped a regex to catch AI agents committing only half the work. Then I ripped it back out.

The agent committed the test file, pushed it, and reported success. The implementation file the test was supposed to exercise was not in the diff. Exit code 0. PR open. CI green. From every downstream system, the run looked clean. The only reason I caught it was that someone glanced at the PR diff before merging and said "wait, where's the actual change."

That moment, six months ago, is when I decided I needed something stronger than "the agent said it was done." This post is the story of what I built, why I kept patching it for half a year, and why I finally deleted the whole thing in a single commit: 911005fd, 742 lines deleted, 39 added, net -703, plus around 280 unit tests removed. The thing I deleted was a regex-based "diff shape guard" that tried to detect partial implementations by comparing the agent's diff against the file paths mentioned in the ticket. The thing I kept was an existing behavior-based check — verify_commands — that turns out to subsume everything the regex was trying to do, and a lot more.

I'm building an AI dev harness called Codens — happy to talk about it but it's not the point of this post. The point is the lesson: shape validation of AI output is a trap that compounds every time you try to fix it.

Background: what "partial implementation" looks like

When you hand a ticket to a coding agent, the failure modes that aren't "the agent crashed" are the ones that bite you. Exit 0, commits made, work incomplete in a specific way:

  • Test file written, implementation file missing.
  • Implementation written, migration missing.
  • Function signature updated, call sites in three other modules still pass the old shape.
  • Route handler added, OpenAPI spec the SDK regenerates from unchanged.

The agent is locally consistent in all of these — from inside its context window it did the natural next step. From outside, the PR is half a feature.

Our workflow launches Claude Code on a VPS, hands it a Notion ticket, lets it create a branch, implement, commit, push. Then a verify step runs the ticket's verify_commands — a per-ticket shell snippet that says "task is done iff these commands pass." If verify fails, a fix_verify claude_code step gets the verify output piped into its prompt and tries again. The preset matters here because the whole article hinges on the relationship between develop and verify:

// backend/src/infrastructure/workflow/presets/notion_compatible.json (excerpt)
{
  "id": "develop",
  "type": "claude_code",
  "config": { "model": "opus", "max_turns": 30, "timeout_seconds": 1800 },
  "on_success": "verify",
  "on_failure": "notify_failure"
},
{
  "id": "verify",
  "type": "run_tests",
  "config": {
    "test_command": "{verify_commands}",
    "timeout_seconds": 300
  },
  "on_success": "notify_success",
  "on_failure": "fix_verify"
},
{
  "id": "fix_verify",
  "type": "claude_code",
  "config": {
    "model": "opus",
    "prompt_override": "Verification failed. Fix the issues.\nOutput:\n{previous_step_output}",
    "max_retries": 2
  },
  "on_success": "verify",
  "on_failure": "notify_failure"
}

verify_commands is the behavior-based check: did the thing the ticket described actually happen, judged by running real tests against real code. Slow, expensive (seconds to minutes), semantically strong.

What I added on top of it was a shape-based check: did the diff at least touch the files the ticket mentioned. Cheap, fast, runs immediately after develop. The intuition was "if the ticket says modify src/notion_sync.py and the agent's diff doesn't touch it, fail fast — don't even bother running verify."

The intuition was wrong in a way that took me half a year to fully accept.

What the shape guard actually was

The mechanism had three pieces, all in backend/src/utils/expected_files.py. The first was a regex that pulled file-shaped substrings out of the ticket body:

# backend/src/utils/expected_files.py (DELETED)
# Matches relative file paths with known source-code extensions.
# Accepts optional leading directory segments (lowercase, hyphens, underscores,
# digits) followed by a filename that may begin with uppercase or digits.
_FILE_PATH_RE = re.compile(
    r"(?:[a-z][a-z0-9_-]*/)*[a-zA-Z0-9][a-zA-Z0-9_.-]*\.(?:py|tsx?|md|sh|ya?ml)"
)

The extension allowlist is restrictive on purpose — .docx, .png, .csv would have generated noise without ever being something the agent should be modifying.
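To make that concrete, here's roughly how the extractor behaved — the two ticket bodies below are made up for illustration, run against the _FILE_PATH_RE defined above:

# Illustration only — hypothetical ticket bodies against _FILE_PATH_RE (defined above).
_FILE_PATH_RE.findall("Modify src/notion_sync.py; mockups are in design.docx")
# -> ['src/notion_sync.py']   (.docx is not in the allowlist, so no noise)

_FILE_PATH_RE.findall("When the agent fails three times in a row, send a Slack alert")
# -> []   (no paths mentioned, so the guard skips — see Reason 1 below)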

The second piece was the cover-check, which used a deliberately loose suffix match so that a ticket saying "modify notion_sync.py" would still match a diff that included src/infrastructure/notion_sync.py:

# (DELETED) _diff_covers_expected
def _diff_covers_expected(
    diff_files: List[str],
    expected_files: List[str],
) -> Tuple[bool, List[str]]:
    """Return (all_covered, missing_files)."""
    if not expected_files:
        return True, []
    missing = [
        ef
        for ef in expected_files
        if not any(
            df == ef
            or df.endswith("/" + ef)
            or ef.endswith("/" + df)
            for df in diff_files
        )
    ]
    return len(missing) == 0, missing
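The loose matching is easiest to see with a couple of made-up inputs:

# Illustration only — hypothetical inputs to _diff_covers_expected.
_diff_covers_expected(
    diff_files=["src/infrastructure/notion_sync.py", "tests/test_notion_sync.py"],
    expected_files=["notion_sync.py"],
)
# -> (True, [])   suffix match: a diff path ends with "/notion_sync.py"

_diff_covers_expected(
    diff_files=["tests/test_notion_sync.py"],
    expected_files=["src/notion_sync.py"],
)
# -> (False, ["src/notion_sync.py"])   the test-only commit the guard was built to catch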

The third piece was the call site inside claude_code.py, sitting right after develop finished and right before the no-op detection:

# backend/src/infrastructure/workflow/steps/claude_code.py (DELETED block)
if exit_code == 0 and repos_committed_set and step_type != "resolve_conflicts":
    _guard_body = (
        context.get_var("notion_body", "")
        or context.get_var("task_goal", "")
        or ""
    )
    _guard_expected = _extract_expected_files(_guard_body)
    if _guard_expected:
        _guard_diff_raw = result.get("diff_files")
        if _guard_diff_raw is not None:
            _guard_diff = [
                f.strip() for f in _guard_diff_raw.split(",") if f.strip()
            ]
        else:
            # VPS did not supply diff_files — fall back to the agent's
            # marker block, then to scanning stdout for path-shaped substrings.
            _guard_diff = (
                _parse_modified_files_marker(output)
                or _extract_expected_files(output)
            )
        _guard_ok, _guard_missing = _diff_covers_expected(
            _guard_diff, _guard_expected
        )
        if not _guard_ok:
            return StepResult.error(
                "Expected files missing from diff: "
                + ", ".join(_guard_missing)
            )

The first time it caught a real bug — a ticket that mentioned src/notion_sync.py::_normalize_uv_venv where the agent had only created the test file — it felt like the right investment. A cheap regex was catching a class of error that would otherwise have wasted minutes of verify_commands time and produced a misleadingly-green run. That single early win paid for the rest of the half-year.

It also paid for what I now think were four bad bets, each of which I patched my way through before finally giving up.

Reason 1 we ripped it out: half the tickets had no file paths to extract

The first crack was statistical. After a month in production, roughly half of all tickets did not mention a single concrete file path. They were tickets like "The reaction bar feels sluggish on mobile, tighten it" or "When the agent fails three times in a row, send a Slack alert." Normal tickets — they describe the outcome, not the implementation surface area. The guard's response, by deliberate design, was to skip:

# (DELETED docstring)
"""Returns an empty list when no paths are found, which causes the shape guard
to skip (false-positive avoidance for tickets that don't mention specific files)."""

That skip was the only safe response — firing on expected_files=[] would have failed every outcome-style ticket. But the consequence is that the guard only protected runs where someone had bothered to write file paths into the description. The other half ran with no protection at all. I had built a safety net with holes the size of normal human ticket-writing, and the math on that was bad before I even hit the next three problems.

Reason 2: the regex caught file paths in "References" and "Notes" sections too

The second crack appeared when teams started writing more structured tickets — the kind with sections for "Implementation," "References," "Test Plan," "Notes." The regex pulled file paths out of all of them. So a ticket that said "implement X in foo.py, and as background see how bar.py does it" would set expected_files = ["foo.py", "bar.py"]. The agent would correctly modify foo.py, leave the reference file alone, commit, and the guard would fail the run because bar.py wasn't in the diff.

The patch was a section-aware preprocessor. Strip out the headings I didn't want the regex to scan, then run the regex on what's left:

# (DELETED) _EXCLUDED_SECTION_TITLES
_EXCLUDED_SECTION_TITLES = (
    "verification checklist",
    "verify",
    "verification",
    "test plan",
    "tests",
    "テスト",  # "tests"
    "検証",    # "verification"
    "checklist",
    "references",
    "参考",    # "references"
    "notes",
    "備考",    # "notes"
)

Plus a markdown heading walker that propagated exclusion to subsections:

# (DELETED) _strip_excluded_sections
_HEADING_RE = re.compile(r"^(#{1,6})\s+(.+)$", re.MULTILINE)

def _strip_excluded_sections(text: str) -> str:
    if not text:
        return text
    headings = list(_HEADING_RE.finditer(text))
    if not headings:
        return text

    excluded: List[bool] = [False] * len(headings)
    excluded_depth: int | None = None
    for i, match in enumerate(headings):
        depth = len(match.group(1))
        title = match.group(2).strip().lower()
        if excluded_depth is not None and depth <= excluded_depth:
            excluded_depth = None
        if excluded_depth is not None:
            excluded[i] = True
        elif title in _EXCLUDED_SECTION_TITLES:
            excluded[i] = True
            excluded_depth = depth
    # ... rebuild kept text from non-excluded chunks ...
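Assuming the elided rebuild step joins the non-excluded chunks back together, the intended behavior on a hypothetical ticket looked like this:

# Illustration only — hypothetical ticket body; assumes the elided rebuild step
# returns the non-excluded sections joined back together.
body = """
## Implementation
Add debounce to src/components/ReactionBar.tsx.

## References
See how src/components/VoteBar.tsx handles rapid taps.
"""
_extract_expected_files(_strip_excluded_sections(body))
# -> ['src/components/ReactionBar.tsx']
# "References" is stripped, so VoteBar.tsx no longer poisons expected_files.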

This worked in the narrow sense. The next week, a ticket came in with a "## Background" section. Then "## 関連 PR" (Related PR). Then "## 過去の議論" (Prior Discussion). Then English variants the original list missed: "## Prior Art," "## Context," "## Linked Issues."

Every new convention any engineer adopted for organizing tickets now required me to add a string to _EXCLUDED_SECTION_TITLES, in two languages — or to go educate the engineer about which section names my regex understood. The ticket-writing humans did not exist to make my regex work. They were communicating intent to an agent that had no trouble understanding what "Background" meant. Only the regex did. Every new exclusion added complexity while reducing scope. Each patch was net-negative on actual protection.

Reason 3: the agent's stdout fallback broke when the stdout got truncated

The shape guard had two ways to know what the agent had touched. The first was result["diff_files"], a comma-separated list of paths from the VPS, derived from git diff --name-only after the agent finished. The second was a fallback that scanned the agent's stdout for path-shaped substrings, used when the VPS hadn't yet been upgraded to return diff_files:

# (DELETED fallback)
_guard_diff = (
    _parse_modified_files_marker(output)
    or _extract_expected_files(output)
)

The fallback worked until the stdout got long enough to hit our truncator. Claude Code can produce thousands of lines of tool-use output. We truncate captured output to a fixed byte budget. When that truncator landed mid-path — src/utils/expe... instead of src/utils/expected_files.py — the regex stopped matching, the fallback returned an incomplete list, and the cover-check failed against a perfectly good diff.
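The failure is mechanical once you see it — the truncated tail no longer contains anything path-shaped (strings below are made up):

# Illustration only — what truncation does to the stdout fallback.
truncated_stdout = "... Writing tests. Modified src/utils/expe"   # cut mid-path

_extract_expected_files(truncated_stdout)
# -> []   ("expe" has no allowlisted extension, so nothing matches)

_diff_covers_expected([], expected_files=["src/utils/expected_files.py"])
# -> (False, ["src/utils/expected_files.py"])   guard fails a run whose diff was fine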

The patch was to ask the agent to emit a deterministic marker block at the end of its output:

[PURPLE-MODIFIED-FILES]
src/utils/expected_files.py
src/infrastructure/workflow/steps/claude_code.py
[/PURPLE-MODIFIED-FILES]

Parsed by _parse_modified_files_marker, which survived the cleanup because it's still useful for diagnostics:

# backend/src/utils/expected_files.py (kept)
_MODIFIED_FILES_MARKER_RE = re.compile(
    r"\[PURPLE-MODIFIED-FILES\](.*?)\[/PURPLE-MODIFIED-FILES\]",
    re.DOTALL | re.IGNORECASE,
)
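The parser around that regex is small; a sketch of what it presumably does — my reconstruction, not the verbatim source:

# Sketch — reconstruction of _parse_modified_files_marker, not the verbatim source.
from typing import List

def _parse_modified_files_marker(output: str) -> List[str]:
    match = _MODIFIED_FILES_MARKER_RE.search(output or "")
    if not match:
        return []
    # One path per line inside the marker block; ignore blank lines.
    return [line.strip() for line in match.group(1).splitlines() if line.strip()]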

The marker survives truncation because it's emitted near the end of the run. It also handles paths the regex couldn't — Next.js dynamic routes like app/[locale]/page.tsx were tripping _FILE_PATH_RE because of the bracket segments.
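The bracket problem is easy to reproduce — the regex's path grammar simply has no slot for [locale]:

# Illustration only — made-up ticket text against _FILE_PATH_RE.
_FILE_PATH_RE.findall("Update app/[locale]/page.tsx to read the cookie")
# -> ['page.tsx']   the bracket segment breaks the match, so the path gets mangled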

But the agent doesn't emit the marker reliably. When max_turns is exhausted mid-tool-use, when the run hits a timeout, when stop_reason ends as tool_use rather than a final assistant message — the summary block at the end never gets written. So now I had a fallback for the fallback: the original regex scan applied to the truncated stdout I'd just established was unreliable. Three tiers of input source, each with its own failure modes, and the guard's behavior was a function of which tier happened to be active for any given run. I was spending more time reasoning about the guard's failure modes than the guard was saving me from.

Reason 4: the lockfile + no-op marker collision broke the design assumption

This is the one that finally killed it.

Some of our verify_commands include npm install to make sure dependencies are present before running tests. There's also a marker convention for "this task is already done":

# backend/src/infrastructure/workflow/steps/claude_code.py (kept)
NOOP_VERIFIED_MARKER = "[PURPLE-NO-OP-VERIFIED-OK]"

If the agent emits that marker, the workflow treats the run as a verified no-op — no commits expected, no diff expected, success.
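The check on the workflow side is — presumably; this is a sketch, not the verbatim source — as simple as a substring test:

# Sketch — assumed shape of the marker check, not the verbatim source.
def _has_already_done_marker(self, output: str) -> bool:
    return NOOP_VERIFIED_MARKER in (output or "")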

The collision: when the agent runs npm install as part of checking whether the task is already done, npm regenerates package-lock.json. The agent's session honestly concludes "nothing to do here" and emits [PURPLE-NO-OP-VERIFIED-OK]. But the VPS commit hook commits the regenerated lockfile because it's the one mechanical change that happened during the session.

Post-run state: repos_committed_set = {"frontend"} because of the lockfile, [PURPLE-NO-OP-VERIFIED-OK] is in the output, ticket says "fix the ReactionBar timing." The shape guard fires, looks for ReactionBar.tsx in the diff, finds only package-lock.json, stops the run with Expected files missing from diff: ReactionBar.tsx.

Nothing has gone wrong. The agent correctly determined the work was already done. The lockfile is an unrelated side effect of the verification step. The guard is producing a false positive because its assumption — "if there are commits, the diff should match the ticket" — doesn't survive the case where commits exist for reasons orthogonal to the ticket.

The patch was to short-circuit the guard when the no-op marker is present:

# backend/src/infrastructure/workflow/steps/claude_code.py (current)
if exit_code == 0 and repos_committed_set and step_type != "resolve_conflicts":
    # When [PURPLE-NO-OP-VERIFIED-OK] is present but diff_files is unavailable
    # (Option B above cannot clear repos_committed_set), re-route to the no-op
    # path so the verified-no-op handling takes effect correctly.
    if self._has_already_done_marker(output):
        logger.info(
            "ClaudeCodeStep: [PURPLE-NO-OP-VERIFIED-OK] marker present for "
            "run=%s step=%s — re-routing to no-op path",
            context.run_id,
            context.step_id,
        )
        repos_committed_set = set()
        result["repos_committed"] = ""

Adding that branch is the moment I lost faith in the design. The shape guard, whose job was to be an external observer of the agent's output, was now branching on the agent's self-report. The architectural premise had been "we don't trust the agent to tell us if the work is done — we look at the diff." Trusting the marker for the no-op case was the exact kind of trust I'd set out to avoid by writing the guard in the first place. I was paying a maintenance tax on a thing that no longer had its original justification.

The behavior-based alternative was already in the workflow

Once I stepped back and asked what verify_commands actually catches, the answer was: everything the shape guard was trying to catch, plus more.

  • Wrote the test, didn't write the implementation: pytest fails on import error or assertion.
  • Wrote the implementation, didn't write the migration: the app fails to start with a schema mismatch.
  • Updated the function signature, didn't update the callers: mypy catches it. Or the test does.
  • Wrote the route, didn't update the OpenAPI spec: the SDK regen step in verify_commands fails.

The run_tests.py step is small in the current code. The interesting bits are the variable interpolation, the merging of project-level defaults with task-specific verify, and the subshell wrapping that prevents cd leakage between chunks:

# backend/src/infrastructure/workflow/steps/run_tests.py (current)
test_command = config.get("test_command", "npm run test")
try:
    test_command = test_command.format(**context.variables)
except (KeyError, ValueError):
    pass
# Build final command: prepend project-level defaults before task-specific commands.
# Each chunk runs inside its own subshell `(...)` so a `cd dir && ...` step
# in the project default does not leak its pwd into the task-level verify and
# cause the next `cd dir` to fail. The agent service runs the final string
# under /bin/sh (dash on Debian/Ubuntu), where a failed `cd` returns exit 2 —
# which RunTestsStep would otherwise surface as the misleading
# "Tests failed (exit code: 2)".
default_verify = context.get_var("default_verify_commands")
task_verify = context.get_var("verify_commands")
if default_verify or task_verify:
    parts = [f"({p})" for p in [default_verify, task_verify] if p]
    test_command = " && ".join(parts)

Two details worth flagging: the subshell wrapping and the && chaining. The subshell exists because dash (the default /bin/sh on Debian/Ubuntu) propagates cd failures as exit 2, which used to surface as a confusing "Tests failed (exit code: 2)" error that had nothing to do with the actual test outcome. Wrapping each chunk in ( ... ) keeps cd scoped to its own chunk. The && chaining means project defaults run first and task verify only runs if defaults pass — the right shape for "format → lint → typecheck → test."
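Concretely, with a project default of cd frontend && npm run lint and a task verify of cd backend && pytest -q (both values made up), the string handed to the VPS ends up as:

# Illustration only — composition with made-up default_verify / verify_commands.
default_verify = "cd frontend && npm run lint"
task_verify = "cd backend && pytest -q"
" && ".join(f"({p})" for p in [default_verify, task_verify] if p)
# -> '(cd frontend && npm run lint) && (cd backend && pytest -q)'
# The first cd is trapped inside its own subshell, so the second chunk still
# starts from the original working directory.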

When verify fails, the surfaced error includes the tail of the captured output, not just the bare exit code:

output = result.get("output", "") or ""
_ERROR_TAIL_BYTES = 1500
tail = output[-_ERROR_TAIL_BYTES:] if len(output) > _ERROR_TAIL_BYTES else output
exit_code = result.get("exit_code", "unknown")
if tail.strip():
    msg = f"Tests failed (exit code: {exit_code}). Last output:\n{tail}"
else:
    msg = f"Tests failed (exit code: {exit_code}); agent reported no output"
return StepResult.error(
    msg,
    {
        "test_output": output,
        "previous_step_output": output,
    },
)

previous_step_output is the variable that fix_verify's prompt_override interpolates. When verify fails, the next claude_code step receives the captured output tail directly in its prompt — "Verification failed. Fix the issues. Output:" followed by the tail shown above. The agent reads what failed, edits, the workflow loops back to verify, and it runs again. Two retries before giving up.

The shape guard was checking "did the diff include this filename." Verify is checking "does the codebase, after the agent's commits, satisfy the executable contract." The latter is strictly stronger. Anything shape caught, verify catches and hands the agent something concrete to fix in the next loop. Anything shape missed — outcome-style tickets, agent edits to unrelated files, lockfile drift — verify handles correctly because it doesn't care what the diff looks like, only whether the code passes.

The single case where verify is weaker is "the ticket has no verify_commands written," and the right response to that is to make verify_commands a required ticket field, not to maintain a 130-line regex module as a fallback safety net.
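If you want to enforce that at intake time, it doesn't need to be more than a refusal — a hypothetical sketch, not something in the current codebase:

# Hypothetical sketch — reject tickets that arrive without an executable contract.
def validate_ticket(ticket: dict) -> None:
    if not (ticket.get("verify_commands") or "").strip():
        raise ValueError(
            "Ticket has no verify_commands; write the commands that prove the "
            "work is done before handing it to an agent."
        )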

What was removed, what was kept

The deletion commit is 911005fd, footprint across five files: 39 lines added, 742 deleted. Net 703 lines gone. Around 280 unit tests deleted — regex variants for _extract_expected_files, heading-scope tests for _strip_excluded_sections, suffix-match patterns for _diff_covers_expected, two integration tests in test_claude_code_step.py that exercised the guard's fail and pass paths.

What survived in expected_files.py is just the marker parser, with a docstring rewrite to reflect that it's no longer a fallback for shape detection:

# backend/src/utils/expected_files.py (current, post-cleanup)
"""Modified-files marker parsing utilities (used for no-op diagnostics)."""

NOOP_VERIFIED_MARKER and _has_already_done_marker also stayed. They let the agent self-report a no-op in a way that's safely scoped: the worst case of trusting the marker incorrectly is that we treat a real-work run as a no-op, catchable by other guards downstream and by verify itself if there's something to verify. Self-reports are fine when the trust scope is fail-safe — when the failure mode of trusting wrongly is benign. They are not fine when the trust scope is "this guard runs partial-implementation detection," because then the failure mode of trusting wrongly is letting partial implementations through.

The question I'd put in front of anyone considering an agent self-report marker: what's the worst thing that happens if the agent emits the marker dishonestly or by mistake? If the answer is "the workflow takes a safer path than it would have," ship it. If the answer is "the workflow skips a check that was supposed to catch real bugs," don't.

The general pattern: shape vs behavior validation of AI output

Every shape check is a proxy for some behavior check you actually care about. "The diff contains the filename" is a proxy for "the agent did the work the ticket described." "The output ends with [DONE]" is a proxy for "the agent reached a clean stopping condition." "The PR title matches feat:|fix:|chore:" is a proxy for "the change is conventionally classified."

Proxies are cheap and look like good engineering when you write them — fast, deterministic, no LLM in the loop, easy to test. But the moment you start patching them against false positives, you are in a race against the unbounded variety of human and agent expression, and you will lose. Each patch shrinks the proxy's scope while expanding its surface area. Eventually you're maintaining hundreds of lines that catch a smaller set of bugs than the underlying behavior check would catch on its own.

The lesson: if a behavior check exists and runs in your pipeline already, do not put a shape proxy in front of it. The shape proxy gives you false confidence ("the cheap check passed, so the run is probably fine") and false alarms ("the cheap check failed, so the run is probably bad") in roughly equal measure, while the behavior check has already done a strictly more accurate job. The proxy's only honest selling point is latency — fail fast before the expensive thing runs. For AI agent output that selling point almost never pays off, because the proxies are too leaky to safely fail-fast on.

If you find yourself naming a function _strip_excluded_sections on day three of building a check, take that as the signal that the check itself is in trouble.

A side note worth keeping

The [PURPLE-MODIFIED-FILES] marker is still useful for diagnostics even though we no longer use it as input to a guard. When a run does something surprising, having the agent's own list of files-it-thinks-it-touched in the captured output saves time during post-mortems. The pattern I now look for: a signal that's bad as a control input can be fine as an observability input, and the right move is to keep the data and delete the control logic.

Closing

If you've shipped a similar diff-shape guard for AI agent output and later stripped it back out, I'd love to hear what tipped you over. My suspicion is that the false-positive curve looks roughly the same shape for everyone — early easy wins, then a plateau where every patch shrinks the cohort of tickets the guard protects, until the cohort is small enough that the maintenance cost dominates. The version of this question I keep asking is: is there a shape check that doesn't have this maintenance arc, or is shape validation of unstructured agent output just structurally bad and I should stop reaching for it?

The marker that did survive — [PURPLE-NO-OP-VERIFIED-OK] — survived because its trust scope is narrow and fail-safe. If you've shipped agent self-report markers with a wider scope that didn't decay into the same tar pit, I'd want to hear how.
