TL;DR
- I run three client projects plus an OSS repo. Agentic coding got good enough that for a lot of tasks I just hand over a goal and a list of acceptance criteria and let it run. The catch: Opus 4.7's reliability made me nervous enough that I started ending almost every task with a manual round of "now have Codex look at this."
- That ritual happened often enough that I automated it. Galley is the result: a local runtime where Claude Code executes a task inside a git worktree, then a supervisor (Claude, or Codex when I want a different reviewer) checks the run evidence against the acceptance criteria and either bounces it back for another attempt or approves it and opens a PR.
- The parts I'm happiest with aren't the loop itself. They're the boring scaffolding around it: the tool is installed and configured by a Skill, acceptance-criteria test skeletons get written into the worktree before the first attempt, there's a per-repo quality.yaml I keep growing, and a second model on review duty catches things the model that wrote the code does not.
- Galley now does a noticeable chunk of its own development. It's an early preview, MIT-licensed, on GitHub.
The setup that made me build this
For the last while my day job has been three concurrent product codebases plus maintaining an open-source workflow framework. None of them are huge, but the context-switching tax is real, and I'd been leaning harder and harder on Claude Code to carry whole tasks rather than babysitting them line by line.
At some point a threshold got crossed. Not "AI writes all my code now," more like: for a task of a certain size and shape, I no longer needed to plan the implementation. I needed to write down what done looks like (the goal, the acceptance criteria, the paths it's allowed to touch), and that was genuinely enough. The model would go figure out the how. That's a nice feeling the first few times it works.
Then Opus 4.7 happened, and the feeling got complicated. I won't relitigate it; plenty of people have. The short version for me was: the ambition was still there, the output still looked plausible, but I stopped trusting "looks plausible." So I started doing something I'd done occasionally before, but now every single time: after Claude Code finished, I'd open Codex, hand it the diff and the requirements, and ask it to find what was wrong. It usually found something. Different model, different blind spots: Codex would flag an edge case Claude glossed over, and Claude would have written cleaner structure than Codex would have. The combination of the two was reliably better than either alone.
So now every task had a manual final step that I did by hand, with copy-paste, in a separate terminal, every time.
"Third time, automate it"
I have a rule I mostly stick to: the first time you do a thing manually, fine. The second time, grumble. The third time, you build the tool. The cross-model review had blown well past three.
But the more I sketched it, the more it stopped being "a script that pipes a diff into Codex" and turned into something with a shape: if a supervisor model is going to approve work, it needs the work in a reviewable form, not just a diff but the command plan, the executor's own report, git status, the structured result. If it's going to reject work, the rejection needs to come back to the executor as a new attempt with the feedback attached, not as a Slack message to me. And if it can do all that, it can open the PR itself, and I can do final tweaks from PR comments instead of from my editor.
What Galley actually does
It's a local Go binary plus a daemon. You point it at a repo, hand it a task, and it runs:
```
your repo ──task YAML──▶ galley daemon (local)
                              │
                              ├─▶ executor: Claude Code, inside a git worktree
                              │      writes the code, returns a structured result
                              │
                              └─▶ supervisor: Claude (default) or Codex
                                     reads the run evidence, issues a verdict,
                                     opens the PR on accept
```
The task itself moves through a file-backed queue, and the supervisor's verdict decides where it lands:
```
draft task YAML
      |  galley task queue
      v
tasks/queued/ → daemon claims it → tasks/running/
      |
      |  Claude Code executes in a git worktree
      v
supervisor review (Claude or Codex)
      |
      +-- accepted -------------------→ tasks/done/ (+ open PR if enabled)
      +-- needs_revision -------------→ retry, while loop budget remains
      +-- needs_supervisor_review ----→ tasks/failed/ (escalate to me)
      +-- hard_stop ------------------→ tasks/failed/ (no retry)
```
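The routing at the bottom of that diagram is simple enough to sketch. Here's a minimal Go sketch with invented names (not Galley's actual internals), just to make the verdict-to-queue mapping explicit:

```go
// Illustrative sketch only: route a supervisor verdict to the next queue
// directory, retrying while the task still has loop budget left.
func routeVerdict(verdict string, attemptsLeft int) (nextDir string, retry bool) {
	switch verdict {
	case "accepted":
		return "tasks/done", false // PR opens here if enabled
	case "needs_revision":
		if attemptsLeft > 0 {
			return "tasks/running", true // another attempt, with the feedback attached
		}
		return "tasks/failed", false // loop budget exhausted: escalate
	default: // needs_supervisor_review, hard_stop
		return "tasks/failed", false
	}
}
```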
Everything runs locally, every change stays as git-visible diffs, and every attempt writes its evidence to disk: command_plan.json, run_result.json, the supervisor's verdict, git_status.json, diff.patch. When the loop escalates to me, I'm not guessing; I'm reading the file the supervisor read.
The task file is the trusted input. It's where the goal, the acceptance criteria (each with an ID the executor has to report back against), the allowed and forbidden paths, the loop budget, and the PR behavior live. The model never gets to redefine its own success criteria mid-run. It gets to satisfy them or fail them.
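To make that concrete, here's roughly what such a task file could look like. The field names are illustrative guesses (the skill drafts and validates the real one, and the actual schema may differ); what matters is the shape: a goal, criteria with stable IDs, a path allowlist, a loop budget, and the PR behavior.

```yaml
# Illustrative only -- field names are assumptions, not Galley's real schema.
goal: "run_agent callers can override the execution timeout per call"
acceptance_criteria:
  - id: AC1
    description: "a per-call timeout overrides the configured default"
  - id: AC2
    description: "callers that pass no timeout keep the current behavior"
allowed_paths:
  - internal/agent/
forbidden_paths:
  - cmd/
loop_budget: 3
open_pr: true
```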
The decisions I'd actually defend
The loop is the obvious part. Here's the stuff that took longer to get right and that I think matters more.
A Skill installs the tool
This is the bit that still feels a little science-fictional to me. Galley ships an Agent Skill, distributed as a Claude Code plugin and a Codex marketplace entry. You install the skill first. Then you ask it, in plain language, to set up the repo. It installs the galley binary, inspects the repository, drafts a quality.yaml and an environment.yaml, explains the execution settings to you, writes a valid task YAML, validates it, and queues it. It only queues after you say yes.
I've shipped CLIs before. The onboarding was always a README and a prayer. Here the onboarding is an agent, and "explain what this config field means and pick a sensible default for my repo" is just a thing it does. That splits your docs in two: the README is for the human who wants to understand the system, and the skill's reference files are for the agent that has to operate it correctly without you in the loop. They overlap less than you'd think, and I'm still figuring out where each line belongs.
Acceptance-criteria test skeletons go in first
This one's a direct response to the 4.7 trust problem. There's an optional preflight step. Before the first executor attempt, Galley runs a built-in test-creator pass that writes test skeletons into the worktree, one per acceptance criterion, and records the mapping back onto the running task: for each AC, the skeleton's path, the behavior it's meant to pin down, and where it plugs into the codebase. In a Go repo a skeleton is about what you'd expect:
```go
func TestAC1_RunAgentOverridesTimeoutPerCall(t *testing.T) {
	t.Skip("AC1: run_agent callers can override execution timeout per call")
	// executor fills this in
}
```
Those skeleton paths are validated against the task's allowed paths, so the test-creator can't scatter files wherever it likes. And the executor can't get an "accepted" verdict while those tests are still skipped and the required checks haven't run green; the supervisor downgrades that to needs_supervisor_review. The effect is small but real: the implementation has to converge on something the AC-shaped tests accept. A model that's drifting toward a clever-but-wrong solution runs into the skeleton and has to reckon with it. It's harder to wander off when there's already a fence where the spec said the fence should be. (It's off by default, since some tasks genuinely shouldn't have it, but for "implement feature X with these three behaviors," I turn it on.)
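If you want a concrete picture of the "still skipped" check, it can be as mechanical as scanning the recorded skeleton files for their skip markers. A minimal sketch, assuming the skeleton paths were recorded on the running task; the function and names are mine, not Galley's actual code:

```go
import (
	"bytes"
	"os"
)

// Hypothetical helper: refuse an "accepted" verdict while any recorded
// skeleton still carries its t.Skip marker.
func downgradeIfSkeletonsSkipped(verdict string, skeletonPaths []string) (string, error) {
	if verdict != "accepted" {
		return verdict, nil
	}
	for _, path := range skeletonPaths {
		src, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		if bytes.Contains(src, []byte("t.Skip(")) {
			return "needs_supervisor_review", nil
		}
	}
	return verdict, nil
}
```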
quality.yaml is a thing I grow
Each repo gets a quality profile: which checks are required, which review dimensions must pass, what evidence the supervisor should expect, what severity of finding blocks acceptance. It starts small. Then every time a run produces something technically-passing-but-wrong-for-this-codebase, I add a line. Over time the profile becomes the codebase's accumulated opinion about what "good" means here, and both the executor and the supervisor get handed that opinion at the start of every task. Implementations stop drifting because the definition of "done well" stopped being implicit.
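For a flavor of what a young profile might contain, here's a hedged sketch for a Go repo. The keys are my illustration of the categories above, not Galley's actual schema:

```yaml
# Illustrative shape only -- check the skill's reference files for the real keys.
required_checks:
  - go test ./...
  - go vet ./...
review_dimensions:
  - correctness
  - error_handling
  - test_coverage
expected_evidence:
  - run_result.json
  - diff.patch
blocking_severity: major   # findings at or above this severity block acceptance
```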
Claude writes, a second model signs off
Supervisor review defaults to Claude. But I can flip it to Codex per task, and for anything I'd have manually double-checked before, I do. Same-model review (Claude checking Claude) is fine and catches plenty. A different model catches a different category of mistake, and because a rejection comes back as another attempt with the feedback attached, the executor gets to fix what the reviewer flagged instead of just failing the task. So a long unattended run doesn't drift the way an unreviewed one does: every accepted step has had a second model poke at it, and the diff you end up with has those corrections baked in.
The deterministic / non-deterministic seam
Building this kind of tool, the genuinely fiddly part isn't the AI calls — it's the boundary between the parts that must be exact and the parts that get to improvise. Two places I had to draw a hard line:
- The executor has to return a structured JSON result, every time, or the retry-and-review loop has nothing to stand on; the supervisor can't evaluate a free-form essay. So Galley installs a small guard plugin into the executor's Claude Code that enforces the output format. The creative work is non-deterministic; the envelope it arrives in is not (there's a sketch of what that envelope might look like just below).
- Opus 4.7's defaults made me uneasy enough that I replaced the executor's system prompt outright with one derived from Codex-style prompting and from my own claude-code-workflows OSS project, and did the same kind of swap for the supervisor prompts. The task YAML literally has `prompt_mode: replace` for this. I'd rather pin the behavior than hope for it.
Neither of these is glamorous. Both are the difference between a demo and something I leave running while I'm in a meeting.
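For a sense of what that fixed envelope might carry, here's a hedged Go sketch. The struct and field names are hypothetical; the real format is whatever the guard plugin enforces.

```go
// Hypothetical shape of the executor's structured result. The creative work
// varies run to run; this envelope does not.
type RunResult struct {
	TaskID   string     `json:"task_id"`
	Status   string     `json:"status"` // e.g. "completed", "blocked"
	Summary  string     `json:"summary"`
	Criteria []ACReport `json:"acceptance_criteria"`
}

type ACReport struct {
	ID        string `json:"id"`       // matches the ID from the task YAML
	Satisfied bool   `json:"satisfied"`
	Evidence  string `json:"evidence"` // test names, commands, file paths
}
```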
Galley building Galley
The thing I didn't plan for: once it worked, the obvious next move was to have Galley develop Galley.
A few recent fixes, all queued through the skill and executed by Galley itself:
- The PR body was rendering every acceptance criterion as `not_satisfied` even when the supervisor had accepted them with evidence, which is confusing for anyone reading the PR.
- `galley task show`, the command I run constantly, was loudly reporting `latest_claude_status: failed` on tasks that had actually been accepted with a PR open; true as raw history, wrong as the headline.
- The PR-comment trigger only recognized `/galley rerun ...` and `/galley requeue ...`, when what I actually wanted was to type `/galley fix the failing test` and have it pick that up as the request.
- New worktrees were being branched off whatever the source repo's HEAD happened to be instead of the configured base branch, so one PR's commits could leak into the next.
Some of those I caught by using the thing. Some — like the test-skeleton preflight — came from sitting down with a pile of runs/ evidence and asking what would have stopped the bad run earlier. Either way, the loop closes: I notice a gap, I write it up as a task with acceptance criteria, Galley implements it, a supervisor checks it, a PR shows up, I tweak it from a comment.
It's not fully autonomous and I'm not pretending it is. I review the task drafts. I read the escalations. I still tweak PRs. But the ratio of "me describing what I want" to "me typing the code" has tilted further than I expected, and the safety rails (evidence on disk, AC-shaped tests, a growing quality profile, a second model's sign-off) are what let me actually trust the tilt.
Where it is
Galley is an early preview. It's MIT-licensed and on GitHub at shinpr/galley. It's Claude-first today: the executor path targets Claude Code, supervisor review defaults to Claude, and Codex slots in as the alternate supervisor. It's built for trusted local repositories: task YAML is trusted input, quality checks run locally, PR comments can request a requeue but can't rewrite your gates, and only the PR author (who is also a repo owner or collaborator) can drive it from comments. It also leans on git, a git worktree per task, and gh for the PR path.
To try it, add the plugin and let the Galley skill do the setup: it installs the galley CLI if it isn't already on your PATH, inspects the repo, drafts the quality.yaml and environment.yaml profiles and the task YAML, and queues only after you approve. That "install a skill, have it install the tool" loop is the part that still feels new to me.
Claude Code

```
/plugin marketplace add shinpr/galley
/plugin install galley@galley-tools
/reload-plugins
/galley:galley Set up Galley for this repository.
```

Codex

```
codex plugin marketplace add shinpr/galley
```

Then invoke the skill with `$galley`:

```
$galley Set up Galley for this repository.
```
If you'd rather install the CLI yourself first, it's a one-liner: `curl -fsSL https://raw.githubusercontent.com/shinpr/galley/main/scripts/install.sh | sh`. Either way you end up describing tasks to the skill in plain language from there.
I'm curious how other people have dealt with the same trust gap. Have you put a second model in the review loop? Did a cross-model pairing actually buy you something, or was Claude-reviewing-Claude enough? And if you've found a better answer than "evidence on disk plus tests that go in before the code," I'd genuinely like to hear it. If you build something around this, or break it in an interesting way, even better.