<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: nwnwnw413</title>
    <description>The latest articles on DEV Community by nwnwnw413 (@nwnwnw413).</description>
    <link>https://dev.to/nwnwnw413</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3896877%2Ff54f1836-b31d-400d-b3a9-6abf04929627.png</url>
      <title>DEV Community: nwnwnw413</title>
      <link>https://dev.to/nwnwnw413</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/nwnwnw413"/>
    <language>en</language>
    <item>
      <title>codex fixing codex: a consensus loop that argues, judges, and merges its own PRs</title>
      <dc:creator>nwnwnw413</dc:creator>
      <pubDate>Mon, 22 Jun 2026 06:44:24 +0000</pubDate>
      <link>https://dev.to/nwnwnw413/codex-fixing-codex-a-consensus-loop-that-argues-judges-and-merges-its-own-prs-11bh</link>
      <guid>https://dev.to/nwnwnw413/codex-fixing-codex-a-consensus-loop-that-argues-judges-and-merges-its-own-prs-11bh</guid>
      <description>&lt;p&gt;Last Friday I wrote here about consensus-loop, the agent loop we built and open-sourced that doesn't just suggest code but actually writes it, has agents review it, and merges its own PRs (&lt;a href="https://dev.to/nwnwnw413/our-agent-loops-have-been-shipping-production-features-for-weeks-heres-the-tool-3ekn"&gt;that post is here&lt;/a&gt;). A few people asked what we actually point it at day to day. So here's the experiment I keep coming back to: we aimed the same loop at a fork of the codex CLI and let it fix codex. &lt;strong&gt;codex fixing codex&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is the version with the repo links, so you can decide for yourself whether it's real instead of taking my word for it.&lt;/p&gt;

&lt;p&gt;The setup: take a public fork of the open-source codex CLI, and point our own consensus loop at it. The loop's job is to close small upstream bugs in that fork end to end, with no one typing the patch. The whole thing is dogfood. The fork has zero stars, zero forks, no outside users. I'm saying that up front so the rest reads as "here's a mechanism," not "here's a product."&lt;/p&gt;

&lt;p&gt;The repo is public: github.com/ChronoAIProject/codex. It's a fork of openai/codex. Nothing below requires you to trust me; every claim is a clickable issue or PR.&lt;/p&gt;

&lt;p&gt;And if you'd rather watch than read, we've been livestreaming the loop running this end to end: &lt;a href="https://www.youtube.com/watch?v=EMH3fwYd5Lw" rel="noopener noreferrer"&gt;the stream is here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How a bug moves through the loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Intake.&lt;/strong&gt; A real upstream codex bug gets mirrored into the fork as an issue. The title carries the pointer, e.g. "Upstream openai/codex#29131: Unrecognized slash command prevents message from being sent." The issue body states a selection rubric: small-to-medium mechanical bugs, bounded to identifiable files, owned by this repo. It explicitly avoids auth, app-server, desktop, iOS, broad sandbox policy. So the loop is not trying to be a heroic maintainer; it's picking fights it can finish.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Solvers argue.&lt;/strong&gt; Several solver agents take a pass and post their proposals as issue comments. They have different priors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a minimal solver that wants the smallest change that satisfies the repro,&lt;/li&gt;
&lt;li&gt;a structural solver that wants a clean boundary,&lt;/li&gt;
&lt;li&gt;a delete-solver that argues for removing code rather than adding it.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They genuinely disagree. On issue #34 the minimal solver proposed a "pre-dispatch validation" tweak, the structural solver proposed a "batch validation boundary," and the delete-solver abstained from deletion. You can read all three.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. A judge arbitrates rounds.&lt;/strong&gt; A meta-judge reads the solver outputs. If they're split, it doesn't pick a winner — it posts something like "Design consensus needs one narrower round" and sends it back. Issue #34 went three rounds. The final comment is titled "Round-3 meta-judge arbitration" and spells out the decision:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"the minimal and structural solvers now agree on the same concrete implementation boundary, and the delete solver abstains from deletion while accepting that same boundary."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It even records what got rejected: a new ToolCallBatch module ("a new single-caller codex-core abstraction is not required for correctness"). That's the part I find genuinely useful — the judge writes down the road not taken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Implement, test, merge.&lt;/strong&gt; Once consensus is reached the loop opens a branch (refactor/iter34-issue-34), writes the patch, runs the guarded build/test, and opens a PR against the consensus-rnd/issues branch. For #34 that's PR #37, which touched codex-rs/core/src/session/turn.rs and codex-rs/core/src/stream_events_utils.rs and added a regression test under codex-rs/core/tests/. Then it merges itself and posts back on the issue: ✅ Auto-merged via PR #37.&lt;/p&gt;

&lt;p&gt;The state lives in GitHub. Issues are the work queue, solver comments are the debate transcript, the judge comment is the decision record, the PR is the artifact, and labels track lifecycle: crnd:lifecycle:managed, crnd:phase:design-solving → crnd:phase:consensus-reached → crnd:phase:merged, plus crnd:human:auto meaning the controller may proceed without a maintainer. Every loop-authored PR body ends with ⟦AI:AUTO-LOOP⟧. That marker, not a human, is the thing telling you who wrote it.&lt;/p&gt;
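&lt;p&gt;To make the label lifecycle concrete, here is a minimal sketch of it as a tiny state machine. The label strings are the ones quoted above; the function names and the rule that an issue carries exactly one phase label are assumptions made for illustration, not the loop's actual code.&lt;/p&gt;

```python
# Hypothetical sketch: label names come from the post, the transition logic is assumed.
PHASES = [
    "crnd:phase:design-solving",
    "crnd:phase:consensus-reached",
    "crnd:phase:merged",
]


def next_phase(labels):
    """Return the labels after advancing the issue by one phase.

    Assumes exactly one crnd:phase:* label is present; other labels
    (for example crnd:human:auto) are carried over untouched.
    """
    current = [label for label in labels if label in PHASES]
    if len(current) != 1:
        raise ValueError(f"expected exactly one phase label, got {current}")
    index = PHASES.index(current[0])
    if index == len(PHASES) - 1:
        raise ValueError("issue is already merged")
    return [PHASES[index + 1] if label == current[0] else label for label in labels]


def may_proceed_unattended(labels):
    """crnd:human:auto means the controller may proceed without a maintainer."""
    return "crnd:human:auto" in labels
```

&lt;p&gt;Keeping the phase in a label rather than in the controller's memory is what lets a restarted controller pick up where it left off: the issue itself says which step comes next.&lt;/p&gt;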

&lt;h2&gt;
  
  
  A real fix, end to end
&lt;/h2&gt;

&lt;p&gt;Issue #34 mirrors a real codex concurrency bug: when one model response contained several parallel tool calls, a valid apply_patch sibling could start side effects before a malformed sibling in the same response was rejected. The judge framed it as "fail-fast validation for side-effecting batches" — accept the whole batch as well-formed before launching anything that writes.&lt;/p&gt;

&lt;p&gt;The merged fix (PR #37) stages tool calls and only flushes them to the run queue at ResponseEvent::Completed, after the whole response batch is known good. It shipped with a regression test: a valid sibling followed by a malformed one in the same response, asserting that the valid one does not execute. The PR ran &lt;code&gt;just test -p codex-core&lt;/code&gt; on the targeted test and reported it green. That's a real bug with a real, reviewable patch, produced by a debate I didn't participate in.&lt;/p&gt;
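&lt;p&gt;The stage-then-flush idea is easier to see in a few lines. This is a Python sketch of the pattern only, not the Rust patch in PR #37: the &lt;code&gt;StagedBatch&lt;/code&gt; class and its method names are hypothetical, and "malformed" is assumed here to mean arguments that fail to parse as JSON.&lt;/p&gt;

```python
import json


class StagedBatch:
    """Collects tool calls from one response and releases them all-or-nothing."""

    def __init__(self):
        self._staged = []
        self._rejected = False

    def on_tool_call(self, name, raw_arguments):
        # Validate now, but never execute here: a later sibling may still be malformed.
        try:
            arguments = json.loads(raw_arguments)
        except json.JSONDecodeError:
            self._rejected = True
            return
        self._staged.append((name, arguments))

    def on_completed(self, run_queue):
        # Flush only at completion, and only if every call in the batch was valid.
        if not self._rejected:
            run_queue.extend(self._staged)
        # Reset so the next response starts with a clean batch.
        self._staged = []
        self._rejected = False
```

&lt;p&gt;The regression scenario described above maps onto it directly: stage a valid call, then a malformed sibling, signal completion, and the run queue stays empty.&lt;/p&gt;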

&lt;h2&gt;
  
  
  Where it's honest about doing nothing
&lt;/h2&gt;

&lt;p&gt;PR #16 is the one I'd point a skeptic at. The loop took issue #15 (an apply_patch bug), tried to reproduce it against the current checkout, and couldn't. Instead of inventing a fix to look productive, the PR body says:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"No production fallback was added; the regression passed, so the native tool-call path did not prove an executable lookup bug in this checkout."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;So it added a PATH-isolated regression test to lock the behavior and stopped. That's the correct engineering call, and it's also the kind of result that looks like a no-op until you read the reasoning. A loop that knows when not to patch is more interesting to me than one that always produces a diff.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest boundaries
&lt;/h2&gt;

&lt;p&gt;It's a fork, dogfood, no users. Nothing here has been proposed upstream, and this is not an OpenAI thing — it's us pointing our loop at our own fork of their open-source CLI.&lt;/p&gt;

&lt;p&gt;The bugs are small by design. Status-dot contrast, UTF-8 BOM handling in apply-patch, dedup tool calls by call_id, the slash-command fix. Bounded mechanical stuff. "AI maintains a codebase" would be a lie; "a loop closes small bounded bugs end to end" is what actually happened, ~16 merged PRs so far.&lt;/p&gt;

&lt;p&gt;Humans are still in it. Someone mirrors the upstream issues and sets the rubric, and we open every PR to read it. To quote our own status: we still open them half expecting garbage. The code is auto; the attention isn't.&lt;/p&gt;

&lt;p&gt;The judge is sometimes ceremony. On easy bugs the three solvers basically agree and the judge rubber-stamps. The 3-round arbitration on #34 is the one case where the disagreement was load-bearing. I don't yet have clean evidence the judge beats a single good agent on the easy 80%.&lt;/p&gt;

&lt;p&gt;Repo's public if you want to dig: &lt;a href="https://github.com/ChronoAIProject/codex"&gt;github.com/ChronoAIProject/codex&lt;/a&gt;. Start with issue #34 and PR #37.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>llm</category>
      <category>programming</category>
    </item>
    <item>
      <title>Our agent loops have been shipping production features for weeks. Here's the tool.</title>
      <dc:creator>nwnwnw413</dc:creator>
      <pubDate>Fri, 19 Jun 2026 14:10:24 +0000</pubDate>
      <link>https://dev.to/nwnwnw413/our-agent-loops-have-been-shipping-production-features-for-weeks-heres-the-tool-3ekn</link>
      <guid>https://dev.to/nwnwnw413/our-agent-loops-have-been-shipping-production-features-for-weeks-heres-the-tool-3ekn</guid>
      <description>&lt;p&gt;Everyone's saying the same thing right now: stop prompting your coding agent, start designing the loop that prompts it for you, and let it do the work. We agree. We've just been doing it long enough that it isn't a prediction anymore — autonomous loops have been running our R&amp;amp;D on four production repos for weeks.&lt;/p&gt;

&lt;p&gt;Here's a concrete one. On &lt;a href="https://github.com/ChronoAIProject/NyxID" rel="noopener noreferrer"&gt;NyxID&lt;/a&gt;, our open-source gateway, a loop took a load-balancing feature from a GitHub issue to a merged PR last week: about 1,400 lines of Rust, and the merge metadata records &lt;code&gt;human_touch_count = 0&lt;/code&gt;, meaning no human edited the diff. A person still scoped the issue and clicked merge — but the code came out of the loop and survived review without anyone rewriting it. (&lt;a href="https://github.com/ChronoAIProject/NyxID/pull/975" rel="noopener noreferrer"&gt;PR #975&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;That's the part everyone's excited about, and it's real. It's also not the hard part, and not the reason we trust the thing enough to leave it running.&lt;/p&gt;

&lt;h2&gt;
  
  
  The hard part is trust, not autonomy
&lt;/h2&gt;

&lt;p&gt;The failure mode of an autonomous loop isn't that it does nothing. It's that it does something confidently wrong: writes plausible code that doesn't hold up, papers over a failing test, claims a result it can't support, and runs until your budget is gone. A single model is sure of itself even when it shouldn't be, and a naive loop inherits all of that confidence with none of the brakes. That's the real reason most "agent runs for 10 hours" demos stay demos.&lt;/p&gt;

&lt;p&gt;So the thing we actually built &lt;code&gt;consensus-loop&lt;/code&gt; around isn't "make the agent run." It's "make the agent trustworthy enough that you can walk away." The way you get there is to stop letting one confident model decide alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  How it works
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;consensus-loop&lt;/code&gt; is a skill you inject into a host you already use: Claude Code, Codex, Cursor, or Gemini. You point it at a repo, hand it one &lt;code&gt;host.env&lt;/code&gt; file with that repo's facts, and it takes over the development loop from there.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# the entire host-side contract is a handful of facts&lt;/span&gt;
&lt;span class="nv"&gt;REPO_ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/path/to/your/repo
&lt;span class="nv"&gt;GH_REPO_SLUG&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your-org/your-repo
&lt;span class="nv"&gt;BUILD_CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"cargo build"&lt;/span&gt;
&lt;span class="nv"&gt;TEST_CMD&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"cargo test"&lt;/span&gt;
&lt;span class="nv"&gt;INTEGRATION_BRANCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;consensus-rnd-integration
&lt;span class="nv"&gt;REVIEW_BASE_BRANCH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One detail worth pulling out, because it's most of why the consensus means anything: the loop runs across two different systems. The host you install into — Claude Code, in our setup — is the controller. It routes, posts to GitHub, commits, and merges, but it does none of the thinking. The thinking runs on separate Codex workers it spawns in isolated git worktrees. Claude Code drives; Codex reasons. The agent steering the loop isn't the one doing the work, and the work itself is split across independent Codex workers that can't see each other.&lt;/p&gt;

&lt;p&gt;Four mechanisms make up the loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Three Codex solvers argue in isolation.&lt;/strong&gt; One is biased toward the smallest possible change, one toward structural correctness, one toward deleting code. They each draft a plan without seeing the others' work, so they don't quietly converge on the same wrong answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A judge converges them.&lt;/strong&gt; A fourth role reads all three plans and runs a truth table. If all three propose the same shape of fix, that's consensus and it proceeds. If they disagree, the judge writes a sharper question and sends it back for another round.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It implements, then an independent reviewer tries to reject it.&lt;/strong&gt; Separate review passes check architecture, quality, and tests, and they're told to err toward "rework" when in doubt, not toward "ship."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It gives up on purpose.&lt;/strong&gt; If three or more rounds pass with no progress and no new framing, the default is to drop the task rather than burn tokens grinding on something unsolvable.&lt;/li&gt;
&lt;/ul&gt;
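&lt;p&gt;The judge's truth table and the stop rule from the list above can be sketched as one small decision function. Everything named here is hypothetical (&lt;code&gt;judge_round&lt;/code&gt;, the string labels for a fix's "shape", the &lt;code&gt;stalled_rounds&lt;/code&gt; counter), and treating an abstaining solver as non-blocking is an assumption borrowed from how issue #34 played out rather than a documented rule.&lt;/p&gt;

```python
def judge_round(proposals, stalled_rounds, max_stalled_rounds=3):
    """Decide what happens after one round of solver proposals.

    proposals maps a solver name to a short label for the shape of its fix,
    or to None when that solver abstains. Returns "proceed", "another_round",
    or "drop".
    """
    shapes = {shape for shape in proposals.values() if shape is not None}
    if len(shapes) == 1:
        return "proceed"  # every non-abstaining solver proposes the same shape
    if stalled_rounds >= max_stalled_rounds:
        return "drop"  # stop rule: no progress for three or more rounds
    return "another_round"  # the real judge would write a sharper question here
```

&lt;p&gt;For example, &lt;code&gt;judge_round({"minimal": "a", "structural": "b", "delete": None}, 1)&lt;/code&gt; returns &lt;code&gt;"another_round"&lt;/code&gt;. The hard part in practice is deciding when two plans count as the same shape, which this sketch pushes onto whoever assigns the labels.&lt;/p&gt;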

&lt;p&gt;There's no algorithmic novelty here, and we won't pretend otherwise. Underneath, this is multi-agent debate, an LLM judge, and self-consistency — patterns you already know. What's hard, and what took us weeks of debugging on real repos, is the reliability engineering around the loop: the daemons that keep it alive, the leases that stop two instances from fighting, the release gates, and the stop rules. The idea is cheap. Making it trustworthy is not.&lt;/p&gt;

&lt;p&gt;If you just want to try the consensus idea on a single hard decision without any of the daemon machinery, there's a lightweight skill called &lt;code&gt;sshx&lt;/code&gt; that spins up a few isolated workers to give you multiple angles and nothing else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why we trust it: it knows what it doesn't know
&lt;/h2&gt;

&lt;p&gt;We'll be straight about the conflict of interest: all the repos below are ours, and &lt;code&gt;consensus-loop&lt;/code&gt; has zero external adoption so far. This is our own tape, not third-party validation. Everything is a public issue or PR you can open.&lt;/p&gt;

&lt;p&gt;The NyxID feature up top is the loop doing the work. These are the loop deciding &lt;em&gt;not&lt;/em&gt; to — which is the behavior that makes the first kind safe to rely on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It stopped instead of fabricating.&lt;/strong&gt; On &lt;a href="https://github.com/aevatarAI/aevatar" rel="noopener noreferrer"&gt;aevatar&lt;/a&gt;, the solvers reached consensus, but at implementation time the worker didn't have the real external evidence to make the change safely. Rather than invent the missing piece to produce &lt;em&gt;something&lt;/em&gt;, it stopped, changed nothing, and surfaced what it didn't know. A clean stop, not a confident wrong diff. (&lt;a href="https://github.com/aevatarAI/aevatar/issues/2181" rel="noopener noreferrer"&gt;#2181&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It called a human when it should have.&lt;/strong&gt; On &lt;a href="https://github.com/ChronoAIProject/Ornn" rel="noopener noreferrer"&gt;Ornn&lt;/a&gt;, a large feature wouldn't converge after several rounds. The loop didn't force the merge. It opened an escalation, left the half-finished work for review, and flagged it as needing a person. (&lt;a href="https://github.com/ChronoAIProject/Ornn/issues/1061" rel="noopener noreferrer"&gt;#1061&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It refused to take credit it couldn't back.&lt;/strong&gt; On &lt;a href="https://github.com/the-omega-institute/newmath" rel="noopener noreferrer"&gt;newmath&lt;/a&gt; — also ours, written by the same maintainer — the loop ran an experiment and measured a real result, 0.998 on a gap-detection benchmark against a 0.463 baseline. Then it went to claim a &lt;em&gt;separate&lt;/em&gt; result, that the model also predicted better, and the statistical gate didn't pass: identical error on both arms. So it marked that claim false and logged why. On a repo where we'd have loved the win, it didn't take a result it couldn't support. (&lt;a href="https://github.com/the-omega-institute/newmath/issues/1687" rel="noopener noreferrer"&gt;#1687&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Three of those four are the loop choosing not to act. That's the point. An autonomous loop you can actually leave running isn't one that always produces — it's one that produces when it's sure and stops when it isn't.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we've put into it
&lt;/h2&gt;

&lt;p&gt;We've spent the past couple of months building this loop, tuning it, and running it for real on our own repos — using it in production and improving it at the same time. That's 155 billion tokens and 1.6 million model calls of actually living inside the thing, not a weekend prototype. We trade tokens for time, on purpose.&lt;/p&gt;

&lt;p&gt;Loop engineering is having its moment right now, and we don't think the versions that actually work should stay locked inside a handful of companies' private repos. So we're putting ours in the open. Come run loops with us — point it at your repo, break it, tell us where it falls over, and let's find out what these things can really do.&lt;/p&gt;

&lt;h2&gt;
  
  
  It catches and fixes its own mistakes
&lt;/h2&gt;

&lt;p&gt;It's still early-stage, and a lot of what the loop does is repair itself before anything reaches you. When a test fails or a reviewer rejects the work, it doesn't ship the break: it feeds the error back in, fixes it, and re-checks. The NyxID feature up top went through that four times before it passed. And when it genuinely can't recover on its own, it stops and says so rather than guessing past it.&lt;/p&gt;
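&lt;p&gt;As a rough illustration of that repair cycle, here is a bounded retry loop that feeds each failure back into the next attempt and stops cleanly when it runs out of tries. The function and parameter names are hypothetical and the default cap of five attempts is an arbitrary assumption; the real loop's limits and review passes are more involved.&lt;/p&gt;

```python
def repair_loop(attempt_fix, run_checks, max_attempts=5):
    """Retry a change until checks pass, or stop and report the last failure.

    attempt_fix(feedback) applies a change, given the previous failure text or None.
    run_checks() returns a (passed, failure_text) pair.
    """
    feedback = None
    for attempt in range(1, max_attempts + 1):
        attempt_fix(feedback)
        passed, feedback = run_checks()
        if passed:
            return {"status": "passed", "attempts": attempt}
    # Out of attempts: surface the failure instead of guessing past it.
    return {"status": "stopped", "attempts": max_attempts, "reason": feedback}
```

&lt;p&gt;The explicit "stopped" result matters as much as the retries: the caller gets the last failure text back and can escalate to a person instead of receiving a diff that was never green.&lt;/p&gt;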

&lt;h2&gt;
  
  
  Take it
&lt;/h2&gt;

&lt;p&gt;It's open source, MIT-licensed. We don't sell it and we're not trying to. We built the loop because we needed it, we run it on our own products every day, and we're giving it to you to run on yours. Inject it into your host, write one &lt;code&gt;host.env&lt;/code&gt;, and point it at a repo.&lt;/p&gt;

&lt;p&gt;Go break it: &lt;strong&gt;&lt;a href="https://github.com/ChronoAIProject/consensus-rnd" rel="noopener noreferrer"&gt;https://github.com/ChronoAIProject/consensus-rnd&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>showdev</category>
      <category>opensource</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
