codex fixing codex: a consensus loop that argues, judges, and merges its own PRs

#ai #llm #opensource #programming

Last Friday I wrote here about consensus-loop, the agent loop we built and open-sourced that doesn't just suggest code but actually writes it, has agents review it, and merges its own PRs (that post is here). A few people asked what we actually point it at day to day. So here's the experiment I keep coming back to: we aimed the same loop at a fork of the codex CLI and let it fix codex. codex fixing codex.

This is the version with the repo links, so you can decide for yourself whether it's real instead of taking my word for it.

The setup: take a public fork of the open-source codex CLI, and point our own consensus loop at it. The loop's job is to close small upstream bugs in that fork end to end, with no one typing the patch. The whole thing is dogfood. The fork has zero stars, zero forks, no outside users. I'm saying that up front so the rest reads as "here's a mechanism," not "here's a product."

The repo is public: github.com/ChronoAIProject/codex. It's a fork of openai/codex. Nothing below requires you to trust me; every claim is a clickable issue or PR.

And if you'd rather watch than read, we've been livestreaming the loop running this end to end: the stream is here.

How a bug moves through the loop

1. Intake. A real upstream codex bug gets mirrored into the fork as an issue. The title carries the pointer, e.g. "Upstream openai/codex#29131: Unrecognized slash command prevents message from being sent." The issue body states a selection rubric: small-to-medium mechanical bugs, bounded to identifiable files, owned by this repo. It explicitly avoids auth, app-server, desktop, iOS, broad sandbox policy. So the loop is not trying to be a heroic maintainer; it's picking fights it can finish.
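The rubric reads like a predicate over candidate issues. Here is a minimal sketch of that idea. The Candidate fields, the size buckets, and the area tags are all hypothetical names for illustration; the post doesn't show the real intake format.

```python
from dataclasses import dataclass, field

# Areas the issue body says the loop explicitly avoids (tag spellings are assumed).
EXCLUDED_AREAS = {"auth", "app-server", "desktop", "ios", "sandbox-policy"}
ALLOWED_SIZES = {"small", "medium"}


@dataclass
class Candidate:
    # Hypothetical shape of a mirrored upstream issue.
    title: str
    size: str
    files: list[str] = field(default_factory=list)
    areas: set[str] = field(default_factory=set)
    owned_by_repo: bool = True


def passes_rubric(c: Candidate) -> bool:
    """Small-to-medium, bounded to identifiable files, owned by the repo, outside excluded areas."""
    return (
        c.size in ALLOWED_SIZES
        and len(c.files) > 0
        and c.owned_by_repo
        and not (c.areas & EXCLUDED_AREAS)
    )
```

A filter like this is deliberately boring: anything it can't classify as small and bounded simply never enters the queue.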

2. Solvers argue. Several solver agents take a pass and post their proposals as issue comments. They have different priors:

a minimal solver that wants the smallest change that satisfies the repro,
a structural solver that wants a clean boundary,
a delete-solver that argues for removing code rather than adding it.

They genuinely disagree. On issue #34 the minimal solver proposed a "pre-dispatch validation" tweak, the structural solver proposed a "batch validation boundary," and the delete-solver abstained from deletion. You can read all three.

3. A judge arbitrates rounds. A meta-judge reads the solver outputs. If they're split, it doesn't pick a winner — it posts something like "Design consensus needs one narrower round" and sends it back. Issue #34 went three rounds. The final comment is titled "Round-3 meta-judge arbitration" and spells out the decision:

"the minimal and structural solvers now agree on the same concrete implementation boundary, and the delete solver abstains from deletion while accepting that same boundary."

It even records what got rejected: a new ToolCallBatch module ("a new single-caller codex-core abstraction is not required for correctness"). That's the part I find genuinely useful — the judge writes down the road not taken.
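To make the arbitration rule concrete, here is a rough sketch of "don't pick a winner, ask for a narrower round." It is not the loop's actual code. Proposal, Verdict, and arbitrate are hypothetical names, and modeling an abstention as a missing boundary is my simplification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proposal:
    solver: str              # e.g. "minimal", "structural", "delete"
    boundary: Optional[str]  # proposed implementation boundary; None means the solver abstains


@dataclass
class Verdict:
    consensus: bool
    boundary: Optional[str]
    note: str


def arbitrate(proposals: list[Proposal]) -> Verdict:
    """Declare consensus only when every non-abstaining solver names the same boundary."""
    boundaries = {p.boundary for p in proposals if p.boundary is not None}
    if len(boundaries) == 1:
        return Verdict(True, boundaries.pop(), "consensus reached")
    # Split, or nobody proposed anything: send it back instead of choosing a side.
    return Verdict(False, None, "Design consensus needs one narrower round")


round_1 = [
    Proposal("minimal", "pre-dispatch validation"),
    Proposal("structural", "batch validation boundary"),
    Proposal("delete", None),
]
first = arbitrate(round_1)  # split, so the judge asks for another round
```

The real judge compares prose proposals rather than matching strings, so treat the equality check as a placeholder for whatever agreement test the judge actually applies.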

4. Implement, test, merge. Once consensus is reached the loop opens a branch (refactor/iter34-issue-34), writes the patch, runs the guarded build/test, and opens a PR against the consensus-rnd/issues branch. For #34 that's PR #37, which touched codex-rs/core/src/session/turn.rs and codex-rs/core/src/stream_events_utils.rs and added a regression test under codex-rs/core/tests/. Then it merges itself and posts back on the issue: ✅ Auto-merged via PR #37.

The state lives in GitHub. Issues are the work queue, solver comments are the debate transcript, the judge comment is the decision record, the PR is the artifact, and labels track lifecycle: crnd:lifecycle:managed, crnd:phase:design-solving → crnd:phase:consensus-reached → crnd:phase:merged, plus crnd:human:auto meaning the controller may proceed without a maintainer. Every loop-authored PR body ends with ⟦AI:AUTO-LOOP⟧. That marker, not a human, is the thing telling you who wrote it.
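The label lifecycle is effectively a tiny state machine stored on the issue. A sketch of how a controller might advance it is below. The label strings come from the repo. The function names are hypothetical, and so are two rules: that an issue carries exactly one phase label, and that auto-proceeding requires both the managed and auto labels.

```python
PHASES = [
    "crnd:phase:design-solving",
    "crnd:phase:consensus-reached",
    "crnd:phase:merged",
]


def next_phase(labels: set[str]) -> set[str]:
    """Return a new label set with the single phase label advanced by one step."""
    current = [p for p in PHASES if p in labels]
    if len(current) != 1:
        raise ValueError(f"expected exactly one phase label, got {current}")
    idx = PHASES.index(current[0])
    if idx == len(PHASES) - 1:
        raise ValueError("issue is already merged")
    return (labels - {current[0]}) | {PHASES[idx + 1]}


def may_auto_proceed(labels: set[str]) -> bool:
    """Assumed rule: proceed without a maintainer only on managed, auto-approved issues."""
    return {"crnd:lifecycle:managed", "crnd:human:auto"} <= labels
```

Keeping the state in labels means anyone can audit or override it from the GitHub UI, with no separate database to trust.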

A real fix, end to end

Issue #34 mirrors a real codex concurrency bug: when one model response contained several parallel tool calls, a valid apply_patch sibling could start side effects before a malformed sibling in the same response was rejected. The judge framed it as "fail-fast validation for side-effecting batches": confirm the whole batch is well-formed before launching anything that writes.

The merged fix (PR #37) stages tool calls and only flushes them to the run queue at ResponseEvent::Completed, after the whole response batch is known good. It shipped with a regression test: a valid sibling followed by a malformed one in the same response, asserting that the valid one does not execute. The PR ran the targeted test with "just test -p codex-core" and reported it green. That's a real bug with a real, reviewable patch, produced by a debate I didn't participate in.
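The stage-then-flush idea is easier to see in a few lines. What follows is a Python illustration of the pattern only, not the Rust patch. ToolCall, StagedBatch, and the well_formed flag are hypothetical stand-ins, and how the real code reports a rejected batch is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    call_id: str
    name: str
    well_formed: bool  # stand-in for real argument parsing/validation


@dataclass
class StagedBatch:
    staged: list[ToolCall] = field(default_factory=list)
    rejected: bool = False

    def on_tool_call(self, call: ToolCall) -> None:
        """Stage the call instead of running it; remember if any sibling is malformed."""
        if not call.well_formed:
            self.rejected = True
        self.staged.append(call)

    def on_completed(self, run_queue: list[ToolCall]) -> bool:
        """At end of response: flush everything if the batch is good, else flush nothing."""
        ok = not self.rejected
        if ok:
            run_queue.extend(self.staged)
        self.staged.clear()
        return ok


# Mirrors the regression scenario: a valid sibling followed by a malformed one.
queue: list[ToolCall] = []
batch = StagedBatch()
batch.on_tool_call(ToolCall("c1", "apply_patch", True))
batch.on_tool_call(ToolCall("c2", "apply_patch", False))
flushed = batch.on_completed(queue)  # nothing reaches the run queue
```

The trade-off is latency: nothing side-effecting starts until the whole response has arrived, which is the price of never running half of a bad batch.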

Where it's honest about doing nothing

PR #16 is the one I'd point a skeptic at. The loop took issue #15 (an apply_patch bug), tried to reproduce it against the current checkout, and couldn't. Instead of inventing a fix to look productive, the PR body says:

"No production fallback was added; the regression passed, so the native tool-call path did not prove an executable lookup bug in this checkout."

So it added a PATH-isolated regression test to lock the behavior and stopped. That's the correct engineering call, and it's also the kind of result that looks like a no-op until you read the reasoning. A loop that knows when not to patch is more interesting to me than one that always produces a diff.
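For readers who haven't written one, a PATH-isolated test has roughly this shape: point PATH at an empty directory so no external executable can resolve, then check that the in-process path still works. This is a generic sketch under my own assumptions, not the test from PR #16. Both isolated_path and native_apply_patch are hypothetical, and the sketch assumes a POSIX-style lookup.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def isolated_path():
    """Temporarily point PATH at an empty directory so no external executable resolves."""
    old = os.environ.get("PATH")
    with tempfile.TemporaryDirectory() as empty:
        os.environ["PATH"] = empty
        try:
            yield empty
        finally:
            # Restore the caller's environment exactly as it was.
            if old is None:
                os.environ.pop("PATH", None)
            else:
                os.environ["PATH"] = old


def native_apply_patch(text: str, patch: tuple[str, str]) -> str:
    """Hypothetical in-process handler: applies one (find, replace) pair without spawning anything."""
    old, new = patch
    return text.replace(old, new, 1)


with isolated_path():
    # No binary is reachable, yet the in-process handler still succeeds.
    found = shutil.which("apply_patch")
    result = native_apply_patch("hello world", ("world", "codex"))
```

If the handler secretly depended on finding an executable, a test like this would fail; when it passes, there is no lookup bug to fix, which is the situation the PR body describes.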

The honest boundaries

It's a fork, dogfood, no users. Nothing here has been proposed upstream, and this is not an OpenAI thing — it's us pointing our loop at our own fork of their open-source CLI.

The bugs are small by design. Status-dot contrast, UTF-8 BOM handling in apply-patch, deduplicating tool calls by call_id, the slash-command fix. Bounded mechanical stuff. "AI maintains a codebase" would be a lie; "a loop closes small bounded bugs end to end" is what actually happened: about 16 merged PRs so far.

Humans are still in it. Someone mirrors the upstream issues and sets the rubric, and we open every PR to read it. To quote our own status: we still open them half expecting garbage. The code is auto; the attention isn't.

The judge is sometimes ceremony. On easy bugs the three solvers basically agree and the judge rubber-stamps. The 3-round arbitration on #34 is the one case where the disagreement was load-bearing. I don't yet have clean evidence the judge beats a single good agent on the easy 80%.

Repo's public if you want to dig: github.com/ChronoAIProject/codex. Start with issue #34 and PR #37.