How I got tired of being the glue between AI tools and automated the whole thing with vexdo
I've been using AI coding assistants for about a year now. Claude Code for planning and spec writing, Codex for the actual implementation, Copilot for inline suggestions. The results were genuinely good — but the process was exhausting.
Every task looked like this:
- Write a spec with Claude Code
- Copy-paste it into Codex
- Wait for Codex to finish
- Open a PR
- Manually request a Copilot review
- Read the review comments
- Decide which ones matter
- Copy the important ones back into Codex
- Wait again
- Repeat until it looks good
I was the glue. Every step required me to context-switch, copy text between tools, and make judgment calls. The AI was doing the creative work, but I was doing all the plumbing.
So I built vexdo — a local CLI pipeline that automates the entire cycle: spec → implementation → review → fixes → PR. Human intervention only when something goes wrong.
Disclaimer for my boss: this is exclusively a personal project. I have never used any of this for work tasks. Not once. 😄
Fair warning before we go further: this is an experiment, not a production-ready tool. It works, I use it, but it's very much a proof of concept. The goal was to validate the architecture and see where it breaks — not to ship a polished product. If you're looking for something battle-tested, this isn't it yet. If you're curious about the pattern and want to hack on it, read on.
Here's what I learned building it.
The core idea: review before the PR
Most AI coding tools open a PR and then review it. This makes sense in a human workflow — you open a PR, someone reviews it, you fix things.
But in an automated pipeline, this creates a mess. You end up with PRs full of back-and-forth commits like:
```
feat: add /events endpoint
fix: add input validation (per review)
fix: fix validation again (per review)
fix: actually fix it this time
```
vexdo flips this. Review happens on the local diff, before a PR is ever opened. The pipeline iterates until the code is clean, then opens exactly one PR — already reviewed, already fixed.
The git history stays clean. The PR is meaningful. You only get notified when there's something that actually needs a human decision.
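The loop itself is small. Here's a minimal sketch of the idea in TypeScript — the names are illustrative, not vexdo's actual internals:

```typescript
// Sketch of the review-before-PR loop. All names are illustrative,
// not vexdo's real API.
type Decision =
  | { kind: "fix"; feedback: string }
  | { kind: "submit" }
  | { kind: "escalate"; reason: string };

function reviewLoop(
  implement: (feedback?: string) => string, // produce/update code, return local diff
  review: (diff: string) => string[],       // reviewer: findings for this diff
  arbitrate: (findings: string[]) => Decision,
  maxIterations = 3
): Decision {
  let feedback: string | undefined;
  for (let i = 0; i < maxIterations; i++) {
    const diff = implement(feedback);
    const decision = arbitrate(review(diff));
    if (decision.kind !== "fix") return decision; // submit or escalate ends the loop
    feedback = decision.feedback;                 // otherwise feed the fixes back in
  }
  // Ran out of iterations without a clean review — a human takes over.
  return { kind: "escalate", reason: "max iterations reached" };
}
```

Only after this returns `submit` does a PR get created, which is why the branch history stays a single coherent change.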
Spec-driven development as the foundation
The whole thing only works if the agent knows what "done" looks like. That's where spec-driven development comes in.
Every task in vexdo is a YAML file with a structured spec:
```yaml
id: task-001
title: "Add POST /events endpoint"
steps:
  - service: backend
    spec: |
      Implement a REST endpoint POST /events.
      Acceptance criteria:
      - Validates incoming payload against EventSchema
      - Returns 201 with created event on success
      - Returns 400 with validation errors on failure
      - Unit tests cover happy path and validation errors
      Architectural constraints:
      - Use existing auth middleware, do not reimplement
      - Do not modify existing endpoint interfaces
      Critical if:
      - No input validation
      - Breaking change to existing API
      - No tests
```
The "Acceptance criteria" and "Critical if" sections aren't just documentation — they're the ground truth the reviewer and arbiter use to evaluate the code. No spec, no review. No review, no PR.
I write these specs collaboratively with Claude Code before handing anything to Codex. This 10-minute investment saves hours of back-and-forth later.
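For illustration, the shape of such a task and the "no spec, no review" rule might look like this in TypeScript (hypothetical types — vexdo's actual schema may differ):

```typescript
// Hypothetical types mirroring the YAML task file above.
interface TaskStep {
  service: string;
  depends_on?: string[];
  spec: string; // free text: acceptance criteria, constraints, "critical if" rules
}

interface Task {
  id: string;
  title: string;
  steps: TaskStep[];
}

// "No spec, no review": refuse to run a step whose spec is empty,
// because the reviewer and arbiter would have no ground truth to compare against.
function assertReviewable(task: Task): void {
  for (const step of task.steps) {
    if (!step.spec.trim()) {
      throw new Error(`step "${step.service}" has no spec — refusing to review`);
    }
  }
}
```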
Why Codex for implementation (and not Claude Code)
The obvious question: why not use Claude Code for the coding step too? It's clearly the better model for complex coding tasks.
Cost.
Claude Code is great but expensive for automated, unattended runs. When you're running a pipeline that might do 3 iterations of "write code → review → fix" per task, the token cost adds up fast — especially if you're running multiple tasks per day.
Codex hits a much more comfortable price point for the implementation step. It's not as capable as Claude Code on hard problems, but for well-scoped tasks with a clear spec, it does the job at a fraction of the cost.
The split I landed on: Codex does the implementation (cheap, runs autonomously, good enough for scoped tasks), Claude Haiku does the review and arbitration (also cheap, but here accuracy matters more than raw coding ability). Claude Code stays in my workflow for the part it's genuinely irreplaceable at — writing the spec interactively with me before the pipeline starts.
One implementation detail worth noting: Codex runs with --full-auto flag and doesn't commit anything. All its changes sit as unstaged modifications. The review loop captures them via git diff HEAD — staged and unstaged together. This means the entire set of changes Codex made is visible to the reviewer in one clean diff, not scattered across intermediate commits.
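That capture step is essentially a one-liner around git. A sketch, assuming Node's `child_process` (vexdo's actual implementation may differ):

```typescript
// Sketch: collect everything Codex changed as a single diff.
// Because Codex never commits, `git diff HEAD` sees staged and
// unstaged modifications to tracked files together.
import { execFileSync } from "node:child_process";

function captureDiff(repoPath: string): string {
  return execFileSync("git", ["diff", "HEAD"], {
    cwd: repoPath,
    encoding: "utf8",
  });
}
```

One git caveat worth knowing: brand-new untracked files don't appear in `git diff HEAD` until they're staged (or registered with `git add -N`).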
If cost isn't a constraint for you, swapping Codex for Claude Code in the pipeline would probably improve results. The architecture supports it — it's just a config change.
The review loop: two Claude calls, not one
Here's where it gets interesting. I don't use a single AI call to review the code. I use two, with deliberately isolated contexts.
Call 1 — The Reviewer
The reviewer sees: the spec + the git diff. Nothing else.
It returns a structured list of findings, using four severity levels:
```json
[
  {
    "severity": "critical",
    "file": "src/routes/events.ts",
    "line": 23,
    "comment": "No validation on req.body before passing to createEvent()",
    "suggestion": "Add schema validation using existing validateBody middleware"
  },
  {
    "severity": "important",
    "file": "src/routes/events.ts",
    "line": 31,
    "comment": "Error from createEvent() not caught — unhandled rejection will crash the process",
    "suggestion": "Wrap in try/catch and return 500 with a generic message"
  },
  {
    "severity": "minor",
    "file": "src/routes/events.ts",
    "line": 45,
    "comment": "Inconsistent error message format compared to other endpoints",
    "suggestion": "Use errorResponse() helper for consistency"
  }
]
```
The four severity levels are strictly defined:
- critical — breaks an acceptance criterion or architectural constraint
- important — likely to cause bugs directly related to what the spec requires
- minor — code quality issue, but doesn't block the spec
- noise — style or preference, spec-neutral
The reviewer's job is purely technical: does this code satisfy the spec?
Call 2 — The Arbiter
The arbiter sees: the spec + the diff + the reviewer's findings. It does not see the history of how the spec was written. This isolation is intentional — it prevents the arbiter from being too lenient because it "knows" the original intent.
The arbiter returns a decision:
```json
{
  "decision": "fix",
  "reasoning": "Critical validation issue and unhandled error path must be resolved before merge",
  "feedback_for_codex": "Add input validation to POST /events handler. Use the existing validateBody(EventSchema) middleware pattern from POST /users. The validation should happen before any database calls. Also wrap the createEvent() call in try/catch and return 500 for unexpected errors.",
  "summary": "2 issues require fixing: missing validation, unhandled error path"
}
```
Three possible decisions:
- fix — send feedback_for_codex to Codex and iterate
- submit — no critical or important spec violations, open the PR
- escalate — something needs a human
The submit threshold matters: the arbiter submits only when there are no critical or important findings that reflect real spec violations. Minor and noise issues don't block submission. The spec is the bar, not stylistic perfection.
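As a sketch, that gate is only a few lines. The types are illustrative; `isSpecViolation` stands in for the arbiter's judgment that a finding reflects a real spec violation:

```typescript
type Severity = "critical" | "important" | "minor" | "noise";

interface ArbitratedFinding {
  severity: Severity;
  isSpecViolation: boolean; // arbiter's call: does this finding break the spec?
}

// Submit only when no critical/important finding is a real spec violation.
// Minor and noise findings never block submission — the spec is the bar.
function canSubmit(findings: ArbitratedFinding[]): boolean {
  return !findings.some(
    (f) =>
      f.isSpecViolation &&
      (f.severity === "critical" || f.severity === "important")
  );
}
```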
When escalation triggers
The arbiter escalates in three distinct cases:
- Explicit conflict — a reviewer comment contradicts the spec, or there's genuine architectural ambiguity the arbiter shouldn't resolve autonomously
- Max iterations reached — the arbiter kept requesting fixes but ran out of iterations (configurable, default 3)
- Missing fix instructions — the arbiter decided "fix" but didn't produce feedback_for_codex (a guardrail against bad model outputs)
In all three cases, you get the full context: the spec, the diff, every review comment with severity and location, the arbiter's reasoning, and a summary. Enough to make a decision without re-reading the code from scratch.
Why Claude as arbiter works surprisingly well
When I first designed this, I was skeptical that an LLM could reliably classify review comments. In practice, it works much better than expected — for a specific reason.
The arbiter isn't making subjective judgments. It's doing a structured comparison: does this review comment point to a violation of the acceptance criteria or architectural constraints in the spec?
If yes → critical or important, needs fixing.
If no → minor or noise, can be ignored.
If the review comment contradicts the spec → escalate.
The spec acts as an objective grounding document. The arbiter doesn't need to have opinions about code quality in the abstract — it just needs to read and compare two documents. LLMs are very good at this.
The key prompt constraint I found most important: the arbiter must not try to resolve conflicts between the reviewer and the spec. When there's a conflict, it escalates. This keeps humans in the loop for the decisions that actually matter.
Multi-repo support without a monorepo
Most of my projects have multiple services in separate repositories. I didn't want to force a monorepo structure just to use an automation tool.
vexdo uses a simple project layout on your local machine:
```
projectRoot/
  .vexdo.yml        ← project config
  tasks/
    backlog/
    in_progress/
    review/
    done/
    blocked/
  service1/         ← git repo
  service2/         ← git repo
  service3/         ← git repo
```
The .vexdo.yml config maps service names to paths:
```yaml
version: 1
services:
  - name: backend
    path: ./backend
  - name: frontend
    path: ./frontend
review:
  model: claude-haiku-4-5-20251001
  max_iterations: 3
  auto_submit: false
codex:
  model: gpt-4o
```
Each service is its own git repo. vexdo treats projectRoot as a workspace, not a repo.
Multi-step tasks with dependencies work like this:
```yaml
steps:
  - service: contracts
    spec: "Add EventType to shared schema"
  - service: backend
    depends_on: [contracts]
    spec: "Implement handler for new EventType"
  - service: frontend
    depends_on: [backend]
    spec: "Display new EventType in event list"
```
Steps without depends_on can run in parallel (not implemented yet, but the architecture supports it). Steps with depends_on run sequentially — the backend step doesn't start until contracts is reviewed, fixed, and submitted.
Each step gets its own branch (vexdo/task-001/backend), its own review loop, and its own PR. If the frontend step fails review 3 times and escalates, you get the full context of what happened in all previous steps, so you can make an informed decision.
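Resolving that execution order from depends_on is a small scheduling problem. A sketch with a hypothetical helper (sequential execution only, as vexdo currently works):

```typescript
interface StepDef {
  service: string;
  depends_on?: string[];
}

// Repeatedly pick a step whose dependencies are all done; run one at a time.
// (Parallel execution of independent steps would start from the same logic:
// everything "ready" in a given pass could run concurrently.)
function executionOrder(steps: StepDef[]): string[] {
  const done = new Set<string>();
  const order: string[] = [];
  while (order.length < steps.length) {
    const ready = steps.find(
      (s) =>
        !done.has(s.service) &&
        (s.depends_on ?? []).every((d) => done.has(d))
    );
    if (!ready) {
      throw new Error("dependency cycle or unknown service in depends_on");
    }
    done.add(ready.service);
    order.push(ready.service);
  }
  return order;
}
```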
What the workflow looks like in practice
```shell
# Initialize a new project
cd ~/projects/my-project
vexdo init

# Write a task spec (I do this with Claude Code interactively)
vim tasks/backlog/task-001.yml

# Hand it off
vexdo start tasks/backlog/task-001.yml
```
Then vexdo takes over. The output is a flat stream of progress markers — something like:
```
Step 1/2: backend: Add POST /events endpoint
  → Creating branch vexdo/task-001/backend
  → Running codex implementation for service backend
  → Starting review loop for service backend

  Iteration 1/3
    → Collecting git diff for service backend
    → Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
      Review: 1 critical 1 important 1 minor
      - critical (src/routes/events.ts:23): No validation on req.body
      - important (src/routes/events.ts:31): Unhandled rejection from createEvent()
      - minor (src/routes/events.ts:45): Inconsistent error message format
    → Requesting arbiter decision (model: claude-haiku-4-5-20251001)
    → Arbiter decision: fix (2 issues require fixing)
    → Applying arbiter feedback with codex

  Iteration 2/3
    → Collecting git diff for service backend
    → Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
      Review: 0 critical 0 important 1 minor
    → Requesting arbiter decision (model: claude-haiku-4-5-20251001)
    → Arbiter decision: submit (no critical issues)

Step 2/2: frontend: Add POST /events endpoint
  → Creating branch vexdo/task-001/frontend
  → Running codex implementation for service frontend
  → Starting review loop for service frontend

  Iteration 1/3
    → Collecting git diff for service frontend
    → Requesting reviewer analysis (model: claude-haiku-4-5-20251001)
      Review: 0 critical 0 important 0 minor
    → Requesting arbiter decision (model: claude-haiku-4-5-20251001)
    → Arbiter decision: submit (no issues found)

✓ Task ready for PR. Run 'vexdo submit' to create PR.
```
I come back to two clean PRs, each with a review summary attached. I read the summary, look at the diff, hit merge. The whole thing took 8 minutes and I didn't touch it.
The iteration logs are preserved in .vexdo/logs/{taskId}/ — one diff, one review JSON, and one arbiter JSON per iteration per service. vexdo logs task-001 shows a summary; vexdo logs task-001 --full dumps everything including diffs.
What happens on escalation
When the arbiter escalates, vexdo prints the full context — spec, all review comments with locations, arbiter reasoning — and exits with a non-zero code.
The task file moves to tasks/blocked/. Importantly, the state is preserved — .vexdo/state.json stays on disk with status: escalated. The branches are preserved too. This means you can inspect exactly what happened, fix the spec or the code manually, and decide how to proceed.
The recovery path is still manual: run vexdo abort to clear the state, then restart with an updated spec. Automated recovery from escalation is on the roadmap.
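For a feel of what's on disk, a hypothetical state.json for an escalated task might look like this — field names invented for illustration, not the exact schema:

```json
{
  "taskId": "task-001",
  "status": "escalated",
  "currentStep": 2,
  "completedSteps": ["contracts", "backend"],
  "branches": {
    "contracts": "vexdo/task-001/contracts",
    "backend": "vexdo/task-001/backend"
  },
  "iterationsUsed": 3
}
```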
What doesn't work (yet)
I want to be honest about the limitations.
Codex has a complexity ceiling. For well-scoped tasks — add an endpoint, update a client, add a utility function — it's great. For tasks that require deep understanding of implicit system invariants, it struggles. The spec helps a lot, but it's not magic.
The arbiter can be too lenient. If your spec is vague, the arbiter will be too. "Add proper error handling" is not a spec. "Return 400 with { error: string } for validation failures, 500 with a generic message for unexpected errors" is a spec.
No automatic rollback. If step 3 of a 4-step task escalates, the previous steps are already complete (branches and, if auto_submit: true, PRs are already created). You need to handle rollback manually. This is on the roadmap.
State recovery is basic. If the process crashes mid-task, vexdo start --resume picks up from the last completed step. But if it crashes mid-Codex-run, you need to clean up the unstaged changes manually before resuming.
Only GitHub. The PR creation is wired to the gh CLI. GitLab, Gitea, and others aren't supported.
Getting started
```shell
npm install -g @vexdo/cli

# In your project
vexdo init

# Set your API key
export ANTHROPIC_API_KEY=your-key-here

# Make sure you have the codex and gh CLIs installed
# Then write a task and run it
vexdo start tasks/backlog/your-task.yml
```
The repo is at https://github.com/vexdo/vexdo-cli. Contributions welcome — especially around the state recovery story and parallel step execution.
The bigger idea
What I built is less about vexdo specifically and more about a pattern: AI agents work best when they have structured evaluation criteria and a clear escalation path to humans.
The spec is the evaluation criteria. The arbiter is the evaluator. Escalation is the safety valve.
Without the spec, you get an agent that does something but you're not sure if it's right. Without the arbiter, you get a flood of review comments with no prioritization. Without escalation, you get an agent that either loops forever or merges bad code.
All three together create something that actually feels autonomous — not because it never needs you, but because it knows when it needs you.
vexdo is open source under MIT. If you build something with it or find a bug, open an issue — I read them all.