
Dmitry Bondarchuk

I Let Agents Write My Code. They Got Stuck in a Loop and Argued With Each Other

A follow-up to building a local AI pipeline that reviews its own code


I built vexdo — a CLI pipeline that automates the full dev cycle: spec → Codex implementation → reviewer → arbiter → PR. The dream: close my laptop, come back to a reviewed PR. No manual copy-pasting between tools, no being the glue.

Then I migrated from local Codex to Codex Cloud. Then I swapped the reviewer from Claude to GitHub Copilot CLI. Then I went to make a coffee and came back to find my pipeline had sent Codex the same feedback four times in a row.

This post is about that, and the other ways things broke. Not the happy path — the other one.


Quick recap: what vexdo does

spec.yaml → Codex (implements) → Reviewer (finds issues) → Arbiter (fix / submit / escalate) → PR
                 ↑___________________________|
                         fix loop

v1 ran locally and synchronously. v2 runs Codex Cloud so I can kick off a task and close my laptop. The reviewer is now GitHub Copilot CLI. The arbiter is still Claude.

Simple enough in theory. Here's what went wrong.


The infinite loop that took me an embarrassing amount of time to notice

This one genuinely hurt.

My spec had a contradiction I didn't catch when writing it. One section described the expected behavior. Another section described the system architecture. They disagreed on where a certain piece of logic should live.

Codex, being a dutiful implementer, read the behavior requirement and made change A. The reviewer flagged it: "this violates the architecture described in section 3." Fair enough. The arbiter sent it back for a fix.

Next iteration: Codex, now armed with the reviewer's feedback, made change B instead. The reviewer flagged it: "this doesn't implement the behavior described in section 1."

Codex made change A again.

I watched this unfold across four iterations before I admitted to myself what was happening. The agents weren't broken. They were doing exactly what they were told. The spec was broken, and nobody in the loop had the job of noticing that — because I hadn't given anyone that job.

The fix: a stuck detector

I added a fourth agent call — a loop detector that runs after each review. It gets the full iteration history: every reviewer output, every piece of feedback, every resulting diff. Its only job is to answer one question: are we making progress, or are we going in circles?

const prompt =
  `You are reviewing the history of a code review loop.\n` +
  `Below are the last ${history.length} iterations: reviewer findings and the resulting diffs.\n\n` +
  `${formatHistory(history)}\n\n` +
  `Is the loop making progress toward resolution, or is it cycling?\n` +
  `If cycling: briefly describe the contradiction causing it.\n` +
  `Respond with JSON: { "status": "progress" | "stuck", "reason": string }`;

When it returns stuck, the pipeline escalates to me with the reason. In the spec-contradiction case the output was something like: "Reviewer alternates between flagging architecture violation and spec violation. Likely spec inconsistency between sections 1 and 3."

That's exactly the signal I needed. I fixed the spec in two minutes. The task ran clean on the next attempt.
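For reference, this is roughly how I handle the detector's reply — defensively, because models occasionally wrap the JSON in prose or a code fence. parseDetectorVerdict is an illustrative name, not vexdo's actual API:

```javascript
// Sketch: defensively parse the detector's JSON reply. Extract the first
// {...} object from the raw text and fall back to "progress" (i.e., keep
// looping) rather than crash the pipeline on an unparseable answer.
function parseDetectorVerdict(raw) {
  const fallback = { status: "progress", reason: "unparseable detector reply" };
  const match = raw.match(/\{[\s\S]*\}/);
  if (!match) return fallback;
  try {
    const parsed = JSON.parse(match[0]);
    if (parsed.status === "stuck" || parsed.status === "progress") {
      return { status: parsed.status, reason: String(parsed.reason ?? "") };
    }
  } catch (_) {
    // fall through to the safe default
  }
  return fallback;
}
```

Defaulting to "progress" on garbage is deliberate: a flaky detector should never be the thing that halts an otherwise healthy run.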

One more API call per iteration. Absolutely worth it.


The arbiter that treated every nit as a showstopper

My arbiter's job is to decide: fix, submit, or escalate. In v1, it was prompt-tuned to be thorough — if there are any issues in the review, send it back for fixes.

Sounds responsible. Was not.

The Copilot reviewer, being an agent with opinions, would find real issues — and also flag a variable name it preferred, a missing blank line, inconsistent comment style. Nits. These came back as review comments. The arbiter, seeing review comments, dutifully returned fix.

So tasks that were functionally correct would bounce through 2-3 extra cycles chasing aesthetics. Each cycle is 8-10 minutes of Codex Cloud execution. A task that should have been one pass took four. The diff after iteration 4 was identical to the diff after iteration 1 except for a renamed variable and a blank line.

The fix: severity-aware arbitration

The reviewer was already tagging severity — I just wasn't using it in the arbiter decision. One prompt update:

Severity guide:
- critical / high: always fix before submitting
- medium: fix if it's a behavior issue; use judgment for style
- low / nit: do NOT send back for fix; note in PR description instead

Only return "fix" if there are unresolved critical or high severity issues.
If the only remaining issues are low/nit, return "submit".

Task cycle count dropped immediately.

The thing I kept reminding myself: the arbiter is a policy, not just a judge. Left to its own devices, it defaults to "fix everything," which is technically correct and practically a treadmill. You have to encode the actual policy — what counts as blocking, what doesn't — or you'll spend a lot of Codex Cloud credits on blank lines.
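Written out as code, the policy is tiny. This is a sketch of the decision rule the prompt encodes — not the arbiter's actual implementation, and the "kind" field is my assumption about how behavior vs. style mediums get distinguished:

```javascript
// Sketch of the severity policy as a pure function.
const BLOCKING = new Set(["critical", "high"]);

function decideVerdict(issues) {
  // issues: [{ severity: "critical"|"high"|"medium"|"low"|"nit",
  //            kind?: "behavior"|"style" }]
  const blocking = issues.filter(
    (i) =>
      BLOCKING.has(i.severity) ||
      (i.severity === "medium" && i.kind === "behavior")
  );
  // low/nit (and stylistic mediums) get noted in the PR description instead
  return blocking.length > 0 ? "fix" : "submit";
}
```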


The cloud stuff that also broke (quickly)

Since you'll hit these too:

Exit codes lie. codex cloud status returns non-zero when a task is still pending. Not an error — just "not done yet." My polling loop treated every poll as a failure and gave up immediately. Fix: parse stdout first, only throw if the output is unrecognizable.

The status values aren't what the docs imply. I was matching completed. The actual output contains [READY]. Also in rotation: [PENDING], [RUNNING], [IN_PROGRESS]. Add them all, and map READY to completed.
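A sketch of both fixes combined — parse stdout and normalize the bracketed tokens, and only throw when the output is unrecognizable. The STATUS_MAP values are just what I've observed, and parseCloudStatus is an illustrative helper, not part of the Codex CLI:

```javascript
// Sketch: normalize `codex cloud status` output from stdout.
// The bracketed tokens are observed behavior, not a documented contract.
const STATUS_MAP = {
  "[PENDING]": "pending",
  "[RUNNING]": "running",
  "[IN_PROGRESS]": "running",
  "[READY]": "completed",
};

function parseCloudStatus(stdout) {
  for (const [token, normalized] of Object.entries(STATUS_MAP)) {
    if (stdout.includes(token)) return normalized;
  }
  // Unknown output is the only case worth treating as a real failure.
  throw new Error(`unrecognized codex cloud status output: ${stdout.slice(0, 120)}`);
}
```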

There's no CLI resume command. The web UI lets you continue a Codex session with follow-up instructions. The CLI doesn't expose this. I simulate it by submitting a new task with the original spec plus feedback appended, with a header so it's recognizable in the UI:

[REVIEW FEEDBACK — FIX REQUESTED]
Task: Implement key pairs validation
Iteration: 2

<original spec>

Issues to fix:
<arbiter feedback>
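Assembling that resubmission is plain string concatenation; buildResubmission is a hypothetical helper showing the shape, not vexdo's actual function:

```javascript
// Sketch: simulate "resume" by submitting a new task — original spec
// plus the arbiter's feedback under a header that's easy to spot in the UI.
function buildResubmission({ taskTitle, iteration, spec, feedback }) {
  return [
    "[REVIEW FEEDBACK — FIX REQUESTED]",
    `Task: ${taskTitle}`,
    `Iteration: ${iteration}`,
    "",
    spec,
    "",
    "Issues to fix:",
    feedback,
  ].join("\n");
}
```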

The less funny thing I've been sitting with

All of the above are patchable. Annoying to find, quick to fix.

The bigger issue isn't a bug: my codebase wasn't ready for agents to work in.

I realized this gradually, then all at once. I put together a scoring framework — an "agent-ready codebase" checklist — and ran my codebase through it. The result was humbling.

The framework

1. Repository structure & modularity. Can you clearly identify domain logic, application services, adapters, infrastructure, and tests? Are module boundaries clean, or is there a "shared dump" folder where things go to be forgotten? Hidden coupling is invisible to you and actively confusing to agents.

2. Locality of changes. For a typical feature, how many files does a change touch? Which modules get pulled in? "God files" and scattered logic mean agents produce large, sprawling diffs — which makes the reviewer's job harder and increases the surface area for things to slip through.

3. Naming & intent clarity. Are functions and modules named by use-case, or generically? Can you infer side effects from names? An agent reading processData() has to guess. An agent reading validateAndPersistUserPayment() doesn't.

4. Contracts & boundaries. Are API boundaries validated — schemas, types, runtime validation? Are there contract tests? Is the public API clearly separated from internals? Without this, agents make changes that technically compile but violate implicit assumptions at integration points.

5. Test quality & reliability. Are tests deterministic? Behavior-focused? Do they cover edge cases? Can you easily add a regression test when something breaks? Flaky tests are worse than no tests in an automated pipeline — they inject false negatives into the review loop and you can't tell whether the failure is real.

6. Verification pipeline. Is there a single command that verifies correctness — lint, types, tests? Can you run partial checks scoped to changed files? If the answer is "kind of, it's complicated," agents will struggle to self-verify. And if they can't self-verify, you end up doing it.

7. Review comment verifiability. Can typical review comments be validated automatically — via lint, type checker, tests? Or are most comments subjective judgment calls? The higher the ratio of automatable-to-subjective feedback, the more effective an automated reviewer becomes. A codebase full of subjective review surface generates noise that the arbiter has to wade through on every cycle.

8. Risk segmentation. Can you identify high-risk areas — auth, billing, migrations, infrastructure? Is this encoded somewhere: path conventions, annotations, docs? Without it, agents treat all code as equally safe to modify. That's fine until they're modifying the billing module with the same confidence they'd modify a utility function.

9. Documentation for agents. Is there an ARCHITECTURE.md? A CONTRIBUTING.md? An AGENTS.md (or equivalent) that explains how to run the service, how to test changes, how to add a feature? Agents can infer a lot from code — but they shouldn't have to infer things that could just be written down. Every missing doc is a guess the agent makes on your behalf.

10. Dev environment & reproducibility. Can you bootstrap the service reliably from a clean clone? Are there hidden dependencies — secrets, external services that need to be running, manual steps nobody wrote down? Every hidden dependency is a potential point of silent failure when the agent tries to verify its own work.

My score: 52/100

That number explains a lot of friction. When a Codex change touches six files across three modules, the reviewer has more surface area to miss things. When tests are flaky, the verification step is unreliable. When architectural rules live only in my head, no agent can enforce them — which made the stuck loop I described earlier almost predictable in hindsight.


A brief word about the "code quality matters less now" take

I keep seeing this framing: in the era of agentic systems, code quality matters less because the AI will figure it out. Sloppy structure, vague names, tangled modules — the model is smart enough to work around it.

I think this is exactly backwards, and I'm saying this as someone who just spent several evenings watching agents thrash inside a mediocre codebase.

The agent can't ask you what you meant. It can't read the git history and infer the original design intent. It reads what's there. Ambiguous structure → ambiguous behavior. Hidden coupling → unexpected side effects. Vague names → hallucinated assumptions. No AGENTS.md → the agent guesses how your service is supposed to work and proceeds with confidence.

Code quality doesn't matter less when agents are writing and reviewing your code. It matters more, because the human who could previously fill in the gaps isn't filling them in anymore. The code is the only source of truth the agent has. It better be a good one.

A score of 52/100 means I'm running agents on a codebase that's half-ready for them. Getting that number up is now higher on my list than any pipeline feature.


What the pipeline looks like now

spec.yaml
  → codex cloud exec --branch <feature-branch>
  → [poll until READY]
  → codex cloud apply → git commit → git push
  → copilot review (with full iteration history)
  → stuck detector (iteration > 1)
  → arbiter (severity-aware)
  → if fix: loop with feedback header
  → if submit: open PR
  → if escalate: surface to human with reason

Fix iterations stack as commits on the branch. Each commit message is generated by Copilot — a prompt built around conventional commit rules (type(scope): description), first line of output taken as the message, with a fallback to chore: apply codex changes if Copilot times out or returns nothing. Squash-merge when done. The history is readable: you get an actual meaningful commit message at each iteration, not a string of messages that just say "iteration 2".
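A sketch of that fallback logic. The regex is my approximation of type(scope): description, not Copilot's contract, and commitMessageFrom is an illustrative name:

```javascript
// Sketch: take the first line of Copilot's output as the commit message,
// but only if it looks like a conventional commit; otherwise fall back.
const CONVENTIONAL =
  /^(feat|fix|chore|docs|refactor|test|style|perf|ci|build)(\([^)]+\))?!?: .+/;

function commitMessageFrom(copilotOutput) {
  const firstLine = (copilotOutput ?? "").split("\n")[0].trim();
  return CONVENTIONAL.test(firstLine)
    ? firstLine
    : "chore: apply codex changes";
}
```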


Where I'm going with this

My actual goal is a system where agents write and review code autonomously, and I step in rarely — for escalations, ambiguous specs, and the cases that genuinely need human judgment.

Right now I'm in the loop more than I want to be. Some of that is tooling immaturity. Some of it is the 52/100. Some of it is that spec-writing is still entirely manual — and as I learned, a bad spec defeats even a well-tuned pipeline.

Here's what's on the roadmap, roughly grouped by problem area:

Review and verification

Verification ladder. Right now the arbiter makes a judgment call about whether something is "done." I want to replace that with structured must_haves in the task YAML — a list of requirements that get verified against the diff at four tiers: static (file/export presence), command (tests pass), behavioral (observable output), or human (escalate). Submit is only allowed when every must-have passes. No more "looks good to me" from the arbiter.
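To make that concrete, here's a hypothetical shape for those must_haves — the field names and checks are illustrative, not a finalized schema:

```yaml
# Hypothetical must_haves section in a task YAML.
must_haves:
  - tier: static        # cheap: file/export presence
    check: "src/validation/keyPairs.ts exports validateKeyPairs"
  - tier: command       # deterministic: a command must exit 0
    check: "npm test -- --filter keyPairs"
  - tier: behavioral    # observable output of the running service
    check: "POST /keys with a mismatched pair returns 422"
  - tier: human         # cannot be automated: escalate
    check: "error copy reads sensibly in the UI"
```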

Better stuck detection. The current loop detector catches cycles at the review level. I want to add diff-level detection: if Codex produces the same diff twice, fire a diagnostic retry with a targeted prompt. On a second identical diff, escalate with a structured breakdown showing exactly which review comments went unaddressed. Less "something seems wrong," more "here is precisely what didn't change and why."

Context and memory

This is the area I'm most excited about. Right now each Codex submission is stateless — it knows the spec and the feedback, nothing else. Over a multi-step task, that's a problem.

Fresh context injection. Before each Codex submission, prepend summaries of completed steps and a decisions register to the prompt. Prevents Codex from re-implementing utilities already built by earlier steps. Capped at 2000 tokens so it doesn't eat the context window.
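A sketch of the cap, using a rough 4-characters-per-token heuristic rather than a real tokenizer — the constants and the keep-most-recent policy are both my assumptions:

```javascript
// Sketch: cap injected context at ~2000 tokens, preferring the most
// recent step summaries when the budget runs out.
const MAX_CONTEXT_TOKENS = 2000;
const CHARS_PER_TOKEN = 4; // crude heuristic, not a real tokenizer

function capContext(summaries) {
  // summaries: array of step-summary strings, oldest first
  const budget = MAX_CONTEXT_TOKENS * CHARS_PER_TOKEN;
  const kept = [];
  let used = 0;
  for (const s of [...summaries].reverse()) {
    if (used + s.length > budget) break;
    kept.unshift(s); // preserve chronological order in the output
    used += s.length;
  }
  return kept.join("\n\n");
}
```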

Decisions register. A .vexdo/decisions.md file — an append-only table of architectural decisions made during execution: which validation library was chosen, what the storage strategy is, naming conventions adopted. The arbiter populates it automatically. Every subsequent step prompt gets it injected. The goal: agents that build on prior decisions instead of relitigating them.
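The register itself can be as simple as an append-only markdown table. This row formatter is a guess at a workable column layout, not vexdo's actual format:

```javascript
// Sketch: render one decision as a markdown table row for
// .vexdo/decisions.md, escaping pipes so the table stays intact.
function decisionRow({ step, topic, decision }) {
  const clean = (s) => String(s).replace(/\|/g, "\\|").trim();
  return `| ${clean(step)} | ${clean(topic)} | ${clean(decision)} |`;
}
```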

Scout agent. A focused Claude call before each Codex submission that scans the target service's codebase and returns relevant existing files, reuse hints, and conventions to follow. Non-fatal: if Scout fails, execution continues without it. But when it works, Codex stops reinventing things that already exist.

Adaptive replanning. After each step completes, a lightweight Claude call checks whether remaining step specs are still accurate given what was actually built. Proposes updates for me to confirm before the next step runs. Multi-step plans rarely survive contact with reality unchanged — this is the mechanism for adjusting without rewriting everything manually.

Resilience

Continue-here protocol. Right now if the process crashes mid-task, you start over. I'm adding a .vexdo/continue.md checkpoint written at every major phase transition — codex submitted, codex done, review iteration, arbiter done. vexdo start --resume reads the checkpoint and picks up from exactly where it left off. This matters more than it sounds once tasks are running for 30+ minutes across multiple iterations.
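The checkpoint can be a tiny key-value file. This render/parse pair sketches the shape — the format and phase names here are my assumptions, not a finalized spec:

```javascript
// Sketch: serialize and restore the continue-here checkpoint.
function renderCheckpoint(state) {
  // state: { phase, taskId, iteration }
  return [
    "# vexdo checkpoint",
    `phase: ${state.phase}`,
    `taskId: ${state.taskId}`,
    `iteration: ${state.iteration}`,
  ].join("\n");
}

function parseCheckpoint(text) {
  const get = (key) => {
    const m = text.match(new RegExp(`^${key}: (.*)$`, "m"));
    return m ? m[1] : null;
  };
  return {
    phase: get("phase"),
    taskId: get("taskId"),
    iteration: Number(get("iteration")),
  };
}
```

--resume then just reads the file, parses it, and jumps to the matching phase.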

Observability and interaction

Cost and token tracking. Every Claude API call will capture token usage and estimated cost. Per-step and total costs shown in vexdo status. Optional budget ceiling in .vexdo.yml that pauses execution before overspending. Right now I have no idea what a task costs until I check my API bill.
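A minimal sketch of that tracker. The per-million-token prices are placeholders, not real pricing, and the API is illustrative:

```javascript
// Sketch: running token/cost accumulator for the pipeline's API calls.
function makeCostTracker({ inputPerMTok = 3.0, outputPerMTok = 15.0 } = {}) {
  const steps = [];
  return {
    record(step, inputTokens, outputTokens) {
      const cost =
        (inputTokens / 1e6) * inputPerMTok +
        (outputTokens / 1e6) * outputPerMTok;
      steps.push({ step, inputTokens, outputTokens, cost });
      return cost;
    },
    total() {
      return steps.reduce((sum, s) => sum + s.cost, 0);
    },
  };
}
```

A budget ceiling is then just a check of total() before each submission.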

UAT script generation. After all steps complete, Vexdo writes .vexdo/uat.md — a human test script derived from step must-haves and arbiter summaries. vexdo submit warns if UAT items are unchecked (override with --skip-uat). The dream of fully autonomous code is great; the reality is that some things still need a human to click through the UI once.

Discuss command. vexdo discuss <task-id> opens an interactive Claude session with full task context pre-loaded — what was built, what decisions were made, what's still pending. Ask questions, queue spec updates for pending steps, steer execution from a second terminal while start is running. The CLI as a conversation partner, not just an executor.


Getting the codebase score above 80 will get me closer to the goal. So will all of the above. The common thread: the more context agents have, the less they guess. The less they guess, the fewer loops. The fewer loops, the closer I get to actually closing my laptop and coming back to a PR that's ready.

One problem at a time.


vexdo is open source — github.com/vexdo/vexdo-cli. If you're building something similar or have hit these problems differently, I'd like to hear about it.
