Takayuki Kawazoe

Posted on Jun 8

"Autonomous coding agents don't break in the middle, they break at the seams"

#ai #agents #ci #git

After running AI coding agents in production for a while, one thing became clear: the failures aren't in the code the model writes. They're at the seams — git, CI, auth, the network. The boundaries with the outside world.

The model itself is genuinely capable. It writes functions, writes tests, refactors. What breaks is everything around the work: pushing the result, waiting on CI, merging the PR, refreshing a token, calling another service. And the failures are often the kind a human would avoid without thinking.

Here are five incidents we hit and fixed in Codens' Purple (the orchestration core) over the last few weeks. All are real, with production task IDs and dates, and every fix is merged. Three shared design principles at the end tie them together.

Incident 1: a half-resolved merge nearly flooded a PR with 12,000 lines

This was the scary one.

A Purple task on opsguide-back opened a PR. I looked inside: +12,162 lines / 149 files changed, with literal <<<<<<< markers in 2 of them. The commit graph:

e567ce67 (merge commit, "chore: Fix HYBRID_SEARCH...")
 ├ parent[0] = 0b069e5d  (develop tip, +1468 commits over main)
 └ parent[1] = 2940de35  (the actual feature commit)

What happened: in the fix step, the AI decided to git merge develop to backport some test fixes. The merge conflicted. The AI resolved it partially and drove git commit through anyway with markers still in the tree. What got pushed: develop's entire divergence plus unresolved conflict markers. If anyone had clicked merge, main would have been polluted by 1468 commits of develop drift in one shot.

A human wouldn't do this. They wouldn't merge develop into a main-targeted PR in the first place, and if it conflicted they wouldn't commit until it was fully resolved. But the AI, optimizing locally to get one test passing, does it without hesitation.

Fix: stop it at push time, in two layers

A single git pre-push hook. This is where the AI's git push actually goes, so this is where the guard belongs.

#!/bin/bash
set -u

# Layer 1: conflict-marker scan (always on, no config)
scan_conflict_markers() {
    local hits
    # Match markers at column 0 followed by a space, so we don't
    # false-positive on "=======" markdown rules or ASCII art.
    hits=$(git grep -lE '^(<<<<<<< |======= |>>>>>>> )' HEAD 2>/dev/null || true)
    if [ -n "$hits" ]; then
        echo "purple-pre-push: ABORT — committed files contain conflict markers:" >&2
        echo "$hits" | sed 's/^/  /' >&2
        return 1
    fi
    return 0
}

scan_conflict_markers || exit 1

The key is scanning the committed tree (HEAD). The working directory may have been cleaned up, but markers that made it into a commit stay. As long as the agent pushes the branch it has checked out, HEAD is what's about to be pushed, so that's what you git grep.

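The regex ^(<<<<<<< |======= |>>>>>>> ) matters for precision: runs of = show up in markdown headings and tables all the time, so we match only a marker at the start of a line followed by a space. Git writes the opening and closing markers as <<<<<<< and >>>>>>> plus a space and a label, so those two alternatives carry the detection. The middle separator is normally a bare ======= with nothing after it, which the space requirement skips. That's acceptable, because an unresolved conflict still carries the other two markers.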
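To make the matching behaviour concrete, here is a small Python check of the same expression against a few sample lines. The sample lines are made up for illustration; only the pattern comes from the hook above.

```python
import re

# Same expression the hook passes to `git grep -E`.
MARKER_RE = re.compile(r"^(<<<<<<< |======= |>>>>>>> )")

# Illustrative lines (not taken from the incident).
samples = [
    "<<<<<<< HEAD",          # conflict start: marker, space, label
    "=======",               # conflict separator as git normally writes it
    ">>>>>>> develop",       # conflict end: marker, space, label
    "==========",            # ASCII rule / heading underline
    "    <<<<<<< indented",  # not at column 0
]

def is_marker(line: str) -> bool:
    """True when the line starts with a marker followed by a space."""
    return MARKER_RE.match(line) is not None

flagged = [line for line in samples if is_marker(line)]
print(flagged)
```

Only the start and end markers are flagged. The bare separator and the ASCII rule are both ignored, which is the trade-off the space requirement buys.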

Layer 2 is a merge-source allowlist, configurable per workflow. It only runs when a policy file is present:

# Layer 2: merge-source allowlist (only when a policy file exists)
# {
#   "feature": "feature/<task-id>",
#   "base":    "main",
#   "allowed": ["develop", "release/x"]
# }

For each new merge commit on the pushed ref, we check that every parent is reachable from feature / base / one of allowed, using git merge-base --is-ancestor. A merge from a disallowed source is rejected. Blank policy means no check — it's opt-in.

for p in $parents; do
    ok=0
    for rname in "${refs[@]}"; do
        if git merge-base --is-ancestor "$p" "$rname" 2>/dev/null; then
            ok=1; break
        fi
    done
    if [ "$ok" = "0" ]; then
        echo "ABORT — merge commit has parent not from allowed sources" >&2
        bad=1
    fi
done
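The ancestry rule is easier to see on a toy commit graph. The sketch below models git merge-base --is-ancestor in plain Python; the graph, the commit names, and the helper names are all hypothetical. It also assumes the allowed refs are resolved to their state before the push (for example remote-tracking refs), since a local feature branch that already contains the new merge commit would make every parent trivially reachable. The post does not say how Purple resolves them.

```python
from typing import Dict, List, Set

# Toy history: commit -> parents. Every name here is invented.
GRAPH: Dict[str, List[str]] = {
    "m1": [],
    "m2": ["m1"],            # tip of main (the PR base)
    "d1": ["m2"],
    "d2": ["d1"],            # tip of develop, drifted beyond main
    "f1": ["m2"],            # feature tip before the push
    "merge": ["d2", "f1"],   # new merge commit about to be pushed
}

def ancestors(graph: Dict[str, List[str]], tip: str) -> Set[str]:
    """All commits reachable from tip, including tip itself."""
    seen: Set[str] = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(graph[commit])
    return seen

def disallowed_parents(graph: Dict[str, List[str]], merge_commit: str,
                       allowed_tips: List[str]) -> List[str]:
    """Parents of merge_commit that no allowed tip can reach."""
    reachable: Set[str] = set()
    for tip in allowed_tips:
        reachable |= ancestors(graph, tip)
    return [p for p in graph[merge_commit] if p not in reachable]

# Policy with no extra sources: only feature (f1) and base (m2) are allowed.
print(disallowed_parents(GRAPH, "merge", ["f1", "m2"]))
# Policy that explicitly allows develop (d2).
print(disallowed_parents(GRAPH, "merge", ["f1", "m2", "d2"]))
```

With only feature and base allowed, the develop parent is rejected, which is the shape of Incident 1. Adding develop to the allowlist lets the same merge through.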

And the unglamorous-but-important part: the guard is fail-safe in the fail-open sense. If the hook itself has a bug and errors, the push still proceeds. A guard bug stopping every workflow is worse than the occasional incident slipping through. Layer 1 is just git grep and git log (tiny surface area); layer 2 falls back to permissive if jq isn't available.
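The distinction between "the guard found a problem" and "the guard itself broke" can be sketched in a few lines of Python. This is a minimal illustration, not the hook's implementation; GuardViolation and run_guards are hypothetical names.

```python
from typing import Callable, List

class GuardViolation(Exception):
    """Raised by a guard when it positively detects a problem (hypothetical)."""

def run_guards(guards: List[Callable[[], None]]) -> bool:
    """Return True if the push may proceed.

    A detected violation blocks the push. Any other exception is treated
    as a bug in the guard and ignored, so a broken guard cannot stop
    every workflow.
    """
    for guard in guards:
        try:
            guard()
        except GuardViolation as exc:
            print(f"ABORT: {exc}")
            return False
        except Exception as exc:  # guard bug or missing tool: let the push through
            print(f"guard error ignored: {exc!r}")
    return True

def clean_guard() -> None:
    return None

def marker_guard() -> None:
    raise GuardViolation("committed files contain conflict markers")

def buggy_guard() -> None:
    raise RuntimeError("jq: command not found")
```

Here run_guards([clean_guard, buggy_guard]) allows the push, while run_guards([buggy_guard, marker_guard]) still blocks it: only a positive detection can stop the flow.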

Incident 2: a transient network blip misclassified as a permanent failure

A task routed through a self-hosted model gateway (vLLM behind Cloudflare) died with exit 1 after ~27 minutes of work:

API Error: The socket connection was closed unexpectedly.

The gateway was healthy at session start and recovered immediately after: GET /health returned 200 in 0.56s by the time I looked. So it was a momentary mid-session disconnect, the same Cloudflare-fronted overload pattern that already drives the 524 retry path, just surfacing as a closed socket from Node's fetch.

The problem: the existing retry regex covered 524 / origin_response_timeout / connection reset / Too Many Requests but had no entry for the closed-socket case. So the task was classified "non-transient error (exit=1), not retrying," and the whole step got escalated to Slack to wait for a human re-dispatch.

Fix: add the patterns, and trust that false positives are cheap

import re

# Patterns we treat as transient (safe to cleanly retry)
_GW_TRANSIENT_RE = re.compile(
    r"524|origin_response_timeout|Too Many Requests|"
    r"connection reset|"
    # added:
    r"socket connection was closed|socket hang up|ECONNRESET|fetch failed",
    re.IGNORECASE,
)

Added to both classifier sites: the shell-side retry loop in the per-job container, and the Python-side clean-retry detector in the workflow engine.
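As a quick sanity check, the sketch below applies the same pattern list to a few error outputs. The classify helper and its return labels are hypothetical; the first sample message is the one from the incident, the others are illustrative.

```python
import re

# Same alternatives as the pattern list above.
TRANSIENT_RE = re.compile(
    r"524|origin_response_timeout|Too Many Requests|connection reset|"
    r"socket connection was closed|socket hang up|ECONNRESET|fetch failed",
    re.IGNORECASE,
)

def classify(exit_code: int, output: str) -> str:
    """Label a finished step as 'ok', 'transient', or 'permanent' (hypothetical labels)."""
    if exit_code == 0:
        return "ok"
    return "transient" if TRANSIENT_RE.search(output) else "permanent"

print(classify(1, "API Error: The socket connection was closed unexpectedly."))
print(classify(1, "error: read ECONNRESET"))
print(classify(1, "SyntaxError: unexpected token"))
```

Note that a bare 524 matches any output containing those three digits. That is a deliberate-looking looseness: it errs toward retrying.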

Here's the core idea of this whole post:

Adding to the transient list is the safe direction to err in. A false positive (treating a real permanent failure as transient) only wastes a 30–90s backoff: re-running the step with the same prompt is idempotent, so state isn't corrupted. A false negative (treating a real transient as permanent) escalates to Slack and leaves the task blocked until a human re-dispatches it.

False positives are cheap, false negatives are expensive. So bias the classifier toward transient. This asymmetry holds for job systems in general, but it's sharper for agents: each run is tens of minutes plus inference cost, so the unit cost of a human escalation is unusually high.
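One way to act on this asymmetry is a bounded retry loop that only escalates once the retry budget is spent. The sketch below is an assumption about shape, not Purple's loop: run_with_retries, the attempt limit, and the injected step and sleep callables are all hypothetical, and only the 30 to 90 second backoff range comes from the text.

```python
import random
import time
from typing import Callable, Tuple

def run_with_retries(
    step: Callable[[], Tuple[int, str]],
    is_transient: Callable[[str], bool],
    max_attempts: int = 3,                        # hypothetical budget
    sleep: Callable[[float], None] = time.sleep,  # injectable for tests
) -> str:
    """Run step until it succeeds, fails permanently, or the budget runs out."""
    for attempt in range(1, max_attempts + 1):
        exit_code, output = step()
        if exit_code == 0:
            return "ok"
        if not is_transient(output):
            return "escalate"  # the expensive path: a human gets involved
        if attempt < max_attempts:
            sleep(random.uniform(30, 90))  # the cheap path: wait and retry
    return "escalate"

# Usage with a fake step that fails once with a closed socket, then succeeds.
outcomes = iter([(1, "The socket connection was closed unexpectedly."), (0, "")])
sleeps = []
result = run_with_retries(
    step=lambda: next(outcomes),
    is_transient=lambda out: "socket connection was closed" in out,
    sleep=sleeps.append,
)
print(result, len(sleeps))
```

A permanent failure that is wrongly retried costs at most the backoffs within the budget before it escalates anyway, which is why the bias toward retry stays bounded.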

Incident 3: merging before a late-registering CI check even appears

wait_ci built its "required checks" list from the checks observable when the PR opened.

But opsguide-back's test job builds a Docker image first, so it registers ~3 minutes after the PR opens. It wasn't in the PR-open snapshot. So wait_ci passed early without waiting for it, and the downstream merge_pr hit:

Repository rules blocked merge: 405
Required status check "test" is failing.

Actual timeline (2026-05-20, opsguide-back #11284):

04:25:25  wait_ci starts  required_checks=[check-develop-only-files,
          export-and-check, format-check, lint, check-single-head]  ← no test
04:28     test job starts
04:33     test FAILED  ← but wait_ci had already moved on to merge

Fix: treat branch protection, not the observed snapshot, as the source of truth

GitHub branch protection has required_status_checks — the canonical list GitHub actually gates the merge on. Read that instead of a snapshot.

def get_required_status_check_contexts(repo, branch):
    # Read branch protection's required_status_checks.
    # Return [] on 404/403 so unprotected branches / missing perms
    # fall back to current behaviour.
    ...
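For readers who want a starting point, here is one possible body for such a helper, using GitHub's REST endpoint for a branch's required status checks. This is a sketch under assumptions, not the code that was merged: the token parameter, the urllib-based transport, and the parse_required_contexts helper are additions for illustration.

```python
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import List

API = "https://api.github.com"

def parse_required_contexts(payload: dict) -> List[str]:
    """Collect check names from a required_status_checks payload.

    GitHub returns both a legacy `contexts` list and a `checks` list of
    objects with a `context` field; this merges the two without duplicates.
    """
    contexts = list(payload.get("contexts") or [])
    for check in payload.get("checks") or []:
        name = check.get("context")
        if name and name not in contexts:
            contexts.append(name)
    return contexts

def get_required_status_check_contexts(repo: str, branch: str, token: str) -> List[str]:
    """repo is 'owner/name'. The token argument is an assumption of this sketch."""
    url = (f"{API}/repos/{repo}/branches/"
           f"{urllib.parse.quote(branch, safe='')}/protection/required_status_checks")
    request = urllib.request.Request(url, headers={
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
    })
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return parse_required_contexts(json.load(response))
    except urllib.error.HTTPError as exc:
        if exc.code in (403, 404):
            return []  # unprotected branch or missing permission: keep current behaviour
        raise

sample = {
    "strict": True,
    "contexts": ["test", "lint"],
    "checks": [{"context": "test", "app_id": 1}, {"context": "format-check", "app_id": None}],
}
print(parse_required_contexts(sample))
```

Keeping the parsing separate from the HTTP call makes the fallback behaviour easy to test without touching the network.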

Union those contexts into wait_ci's required checks with strict=True. Strict mode already waits for a required check that hasn't appeared yet (returns waiting() when the run is None/incomplete), so the late test now gets waited for, evaluated, and a failure routes to fix instead of slipping through to a 405 at merge.
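The effect of the union plus strict evaluation can be replayed on the incident's own check names. In this sketch, evaluate_required and the "passed" / "waiting" / "failed" labels are hypothetical stand-ins for wait_ci's internals; the check names come from the timeline above.

```python
from typing import Dict, Iterable, List, Optional

def evaluate_required(required: Iterable[str],
                      observed: Dict[str, Optional[str]]) -> str:
    """Strict evaluation: a required check that is missing or unfinished means 'waiting'.

    observed maps check name -> conclusion, or None while the run is incomplete.
    """
    state = "passed"
    for name in required:
        conclusion = observed.get(name)  # absent and None are both "not finished"
        if conclusion is None:
            state = "waiting"
        elif conclusion != "success":
            return "failed"
    return state

snapshot: List[str] = ["check-develop-only-files", "export-and-check",
                       "format-check", "lint", "check-single-head"]
protected: List[str] = ["test"]  # what branch protection would report in this scenario
required = list(dict.fromkeys(snapshot + protected))  # order-preserving union

at_0426 = {name: "success" for name in snapshot}  # test has not registered yet
at_0433 = {**at_0426, "test": "failure"}

print(evaluate_required(snapshot, at_0426))   # old behaviour: snapshot only
print(evaluate_required(required, at_0426))   # with the union
print(evaluate_required(required, at_0433))
```

The snapshot-only list passes while test is still invisible. The unioned list waits, then reports the failure so it can route to fix.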

The lesson: don't let "what I can see right now" be the system's truth. In a world where CI checks register asynchronously, an observation snapshot is always going to be stale. Read the gate definition itself.

Incident 4: the singular form of a fast check re-queued at merge time

This one is a single regex, but the kind that eats an afternoon.

Two tasks failed at merge-pr with:

Repository rules blocked merge: 405
Required status check "check-branch-name" is expected.

_is_ci_pending_error only matched the plural wording "N of M required status checks are expected" — i.e. "are expected". When exactly one required check is incomplete, GitHub uses the singular Required status check "X" is expected. That fell straight through the pending detector into a hard failure.

Why does a check wait_ci saw green re-queue at merge time? check-branch-name is a fast check, and merge_pr recomputes the merge base right before merging. GitHub re-evaluates branch protection against the new head and briefly reports the fast check as "expected" again until it re-reports success. The bounded retry loop was built for exactly this window — it just wasn't being entered for the single-check case.

Fix

# Route both singular "is expected" and plural "are expected" into retry
if "expected" in error_message.lower():
    return True  # treat as CI-pending; back off and retry

Match the bare token "expected". As far as we've seen, the only GitHub merge-block messages containing "expected" are these pending-check wordings, so widening the match shouldn't misclassify a genuine policy rejection (required signed commits, etc.) as transient. The existing regression test covers that case.
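To show how the widened match feeds the bounded retry, here is a minimal sketch. The merge_with_retry helper, its attempt count, and its delay are hypothetical; the error strings are the ones quoted in this post.

```python
import time
from typing import Callable, Optional

def is_ci_pending_error(error_message: str) -> bool:
    """True for both the singular 'is expected' and plural 'are expected' wordings."""
    return "expected" in error_message.lower()

def merge_with_retry(
    try_merge: Callable[[], Optional[str]],  # returns None on success, else the error text
    attempts: int = 5,                       # hypothetical bound
    delay_s: float = 10.0,                   # hypothetical backoff
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Retry the merge while GitHub reports a required check as pending."""
    for attempt in range(attempts):
        error = try_merge()
        if error is None:
            return True
        if not is_ci_pending_error(error):
            return False  # real rejection or failing check: do not retry
        if attempt < attempts - 1:
            sleep(delay_s)
    return False

# A fast check is briefly "expected" again, then the merge goes through.
responses = iter(['Required status check "check-branch-name" is expected.', None])
merged = merge_with_retry(lambda: next(responses), sleep=lambda s: None)
print(merged)
```

A message such as Required status check "test" is failing. contains no "expected", so it still exits the loop immediately as a hard failure.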

Unglamorous, but singular-vs-plural wording and 3-minute delays are the actual things that stop autonomous agents.

Incident 5: a borrowed token already expired the moment it arrived

Codens' per-task workers run on borrowed shared OAuth credentials. The refreshToken is intentionally stripped: if it weren't, each worker's CLI would refresh independently and rotate the shared OAuth identity, cascading 401s across siblings.

src.token_refresh: token is expired — attempting refresh before job start
src.token_refresh: Token refresh failed: no refreshToken in credentials file
POST /jobs HTTP/1.1 500 Internal Server Error

So the borrower can't refresh on its own. If the accessToken it receives is already past expiresAt at the moment of receipt, the worker's pre-job check dies on "no refreshToken" and POST /jobs returns 500.

Fix: the source that holds the refreshToken refreshes before returning

The root cause: the source credential service's GET /claude-auth returned the stored credentials as-is, expiry included. The only place that holds the canonical refreshToken is the source, so refresh there before returning.

async def get_claude_auth():
    # Refresh before returning. No-op when >5 min of life remains,
    # so the common case adds zero round-trips.
    await run_in_thread(ensure_valid_token)
    return load_credentials()

ensure_valid_token does nothing if the token has more than 5 minutes left, so it's free in the common case. Only when it's under threshold does the one place with the refreshToken (the source) refresh, write the new token, and then return it.
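The threshold logic can be sketched as follows. This is an illustration under assumptions, not the service's code: it assumes expiresAt is an epoch timestamp in milliseconds, and the refresh and now callables plus the strip_for_borrower helper are hypothetical.

```python
import time
from typing import Callable, Dict

REFRESH_THRESHOLD_S = 5 * 60  # refresh when 5 minutes or less remain

def ensure_valid_token(
    creds: Dict,
    refresh: Callable[[str], Dict],           # hypothetical: refreshToken -> new credentials
    now: Callable[[], float] = time.time,
) -> Dict:
    """Return credentials with enough life left, refreshing at the source if needed."""
    remaining_s = creds["expiresAt"] / 1000 - now()  # assumes epoch milliseconds
    if remaining_s > REFRESH_THRESHOLD_S:
        return creds  # common case: no round-trip
    return refresh(creds["refreshToken"])

def strip_for_borrower(creds: Dict) -> Dict:
    """Hand out everything except the refreshToken."""
    return {key: value for key, value in creds.items() if key != "refreshToken"}

# Usage with a fixed clock and a fake refresher.
calls = []
def fake_refresh(refresh_token: str) -> Dict:
    calls.append(refresh_token)
    return {"accessToken": "new", "refreshToken": refresh_token,
            "expiresAt": (1000 + 3600) * 1000}

fresh = {"accessToken": "a", "refreshToken": "r", "expiresAt": (1000 + 600) * 1000}
stale = {"accessToken": "b", "refreshToken": "r", "expiresAt": (1000 + 60) * 1000}

kept = ensure_valid_token(fresh, fake_refresh, now=lambda: 1000.0)
renewed = ensure_valid_token(stale, fake_refresh, now=lambda: 1000.0)
print(kept["accessToken"], renewed["accessToken"], len(calls))
```

The borrower only ever sees the output of strip_for_borrower, so it receives a token with usable lifetime and never needs the refreshToken it does not have.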

The naive design — "the borrower refreshes" — didn't match the architectural constraint of a shared identity. Only one party can refresh. So that party does it before returning.

Three principles that recur

Line up all five and the fixes share a shape.

1. Autonomous agents break at the seams. Not in output quality, but at the boundaries: git, CI, auth, network. So the fixes point at hardening boundaries, not at smarter models. A pre-push hook, a source of truth for CI gates, the right party to refresh a token — all classic systems design, all unrelated to the model.

2. False positives and false negatives have asymmetric cost. A permanent failure misclassified as transient costs a tens-of-seconds backoff; a transient failure misclassified as permanent costs a human escalation. Agent runs are long and expensive, so the cost of pulling in a human is unusually high. Bias classifiers toward retry. Idempotency makes that safe.

3. Guards must be fail-safe, which here means fail-open. A safety mechanism's own bug must not stop the main flow. The pre-push hook lets the push proceed if the hook itself errors unexpectedly. We weigh "every workflow stops" heavier than "an incident occasionally slips through."

The more you let the AI do, the more these not-in-the-middle details pay off. Smarter models don't make git conflict markers disappear, don't make CI checks register synchronously, don't stop tokens from expiring. Keeping an agent running in production turned out to be the work of closing these seams, one at a time.

Codens builds all of this into the product. Take a look if you're interested.

https://www.codens.ai/en/