Frank Brsrk

I open-sourced a 4-agent adversarial code review team. Any coding agent can call it as an MCP server. Built in heym.

I shipped an open-source workflow this week: a 4-agent adversarial code review team that runs on heym and exposes itself as an MCP server. Any coding agent (Cursor, Claude Code, Codex, custom Python, Antigravity) can call into it for a structured second-opinion review on its own output. MIT licensed. Fork it.

The workflow is open source. It calls Ejentum's harness API for the cognitive scaffolds (free tier for experimentation, paid tier for ongoing use). Calling it "open" and ignoring that dependency would be dishonest, so I'm naming it up front.

That sounds small. Look at where the field has landed.

Git is the agent control loop now

Karpathy's autoresearch uses Git as its whole control loop, committing changes and rolling back the ones that don't work. Claude Code's GitHub Action takes an issue and opens a PR. Codex Cloud is built on the same idea. The agent's job is now to produce a thing you can review the way you'd review a colleague's work. A branch. A diff. A pull request.

Nobody had to design this. Git was already the artefact senior engineers used to evaluate work they didn't write. The agents just walked into a 20-year-old workflow we'd already gotten good at.

So who reviews the agent's PR?

Right now: the human does. Which works at human throughput. Doesn't work at agent throughput.

The natural next step: agents review agents. The catch is that most "agent reviews agent" implementations are one LLM with a clever prompt pretending to be three reviewers. The model can rubber-stamp itself. The "concerns" are theatrical. The reviewer is the same brain that wrote the code.

But before I show you what I built, the obvious objection: don't CodeRabbit, Greptile, Qodo, and Ellipsis already do this? They review code with AI. The answer: they're vertical SaaS bots reviewing human PRs on GitHub. They don't expose themselves as primitives that other agents can call programmatically. This is the open layer beneath them: a peer-review primitive any coding agent invokes when it needs a critical second look on its own output. Different audience, different problem.

So back to the question. You need a workflow that structurally resists faking review. Here's what that looks like.

How the workflow refuses to rubber-stamp

Four nodes on the heym canvas. One architect agent. Three specialists.

The architect has no Ejentum harness and no HTTP tool. It cannot author concerns. It can ONLY delegate, classify, and integrate. Every concern in the final verdict must come from a specialist's evidence; the architect synthesizes but never invents.

Each Ejentum harness is a cognitive scaffold injected into the model's context before it generates: a named failure pattern to avoid, a procedure to follow, suppression vectors that block the shortcut. Different harness, different posture.

The three specialists each carry a different one:

  • The reasoner, with the reasoning harness, decomposes review angles.
  • The implementer, with the code harness, writes verification tests against the diff.
  • The reviewer, with the anti-deception harness, probes for framing tension and demands positive evidence before accepting "this looks fine."
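The harness-per-specialist idea can be sketched in a few lines. This is illustrative only: the post doesn't show Ejentum's actual API shape, so the function and field names here are invented, but the core move (scaffold text injected into context before the task, one distinct harness per specialist) is the one described above.

```python
# Hypothetical sketch of harness-per-specialist. The real Ejentum harness
# API is not shown in the post; names and shapes here are illustrative.

SPECIALISTS = {
    "reasoner":    {"harness": "reasoning",      "role": "decompose review angles"},
    "implementer": {"harness": "code",           "role": "write verification tests against the diff"},
    "reviewer":    {"harness": "anti-deception", "role": "probe framing, demand positive evidence"},
}

def build_context(specialist: str, scaffold_text: str, diff: str) -> str:
    """A harness is a scaffold injected BEFORE generation: the named failure
    pattern, procedure, and suppression vectors come first, then the task."""
    role = SPECIALISTS[specialist]["role"]
    return f"{scaffold_text}\n\nYour role: {role}.\n\nDiff under review:\n{diff}"
```

The point of the data structure is the lock: a specialist's harness is fixed at the workflow level, not chosen per-call by the model.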

Each specialist is locked to one Ejentum mode and runs on a model from a different lab (Anthropic, Google, Alibaba, Zhipu), so the reviewers don't share RLHF priors or training distributions. That reduces correlated failure modes; it doesn't eliminate them.

The architect outputs a structured verdict: VERDICT (approve | request_changes | discuss), CHANGE_CLASSIFICATION, FRAMING_NOTES (the reviewer's concern verbatim), CONCERNS (each sourced from a specialist with severity), REVIEW_FOCUS (the reasoner's top angles).
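Written out as data, the verdict's structural rule is checkable: every concern must name the specialist it came from. The exact JSON keys in the repo may differ; this sketch uses the field names listed above and a small validator for the "architect never invents" constraint.

```python
# Illustrative shape of the architect's structured verdict plus a validator
# enforcing the structural rule. Exact keys in the repo may differ.

VALID_VERDICTS = {"approve", "request_changes", "discuss"}
SPECIALISTS = {"reasoner", "implementer", "reviewer"}
SEVERITIES = {"low", "medium", "high"}

def validate_verdict(v: dict) -> list:
    """The architect synthesizes but never invents: every concern must be
    sourced from a specialist and carry a severity."""
    errors = []
    if v.get("VERDICT") not in VALID_VERDICTS:
        errors.append("VERDICT must be approve | request_changes | discuss")
    for c in v.get("CONCERNS", []):
        if c.get("source") not in SPECIALISTS:
            errors.append(f"unsourced concern: {c.get('summary', '?')}")
        if c.get("severity") not in SEVERITIES:
            errors.append(f"concern missing severity: {c.get('summary', '?')}")
    return errors

# A well-formed verdict passes; an architect-authored concern is rejected:
ok = {
    "VERDICT": "request_changes",
    "CONCERNS": [{"source": "reviewer", "severity": "high",
                  "summary": "refactor framing hides a behavior change"}],
}
bad = {"VERDICT": "approve",
       "CONCERNS": [{"source": "architect", "severity": "high", "summary": "x"}]}
```

The validator is the workflow's constraint made executable: if a concern can't name its specialist, it doesn't belong in the verdict.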

When the test suite runs the workflow on a "quick refactor" PR that swaps raise UserNotFound(id) for return user or default, three things happen: the implementer writes a test asserting the original raise behavior, the reviewer flags the framing tension ("refactor" framing is misleading; raises-become-returns-default is a behavior change), and the architect's verdict is request_changes with severity high. None of those concerns came from the architect; the architecture surfaced them through the specialists. The remaining failure modes (architect synthesis bias, correlated cross-lab pretraining, specialist tunnel vision) are real, and a well-designed adversarial review acknowledges them rather than pretending structural separation alone is sufficient.
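That behavior change, reduced to runnable form. The exception name follows the example above; the lookup table, default value, and surrounding code are invented for illustration.

```python
# The "quick refactor" the team catches. UserNotFound follows the example
# in the post; USERS and DEFAULT are invented scaffolding.

class UserNotFound(Exception):
    pass

USERS = {1: "alice"}
DEFAULT = "guest"

def get_user_original(uid):
    user = USERS.get(uid)
    if user is None:
        raise UserNotFound(uid)   # callers may depend on catching this
    return user

def get_user_refactored(uid):
    user = USERS.get(uid)
    return user or DEFAULT        # silently swallows the missing-user case

def missing_user_raises(fn) -> bool:
    """The implementer's verification test: a missing user must raise."""
    try:
        fn(999)
    except UserNotFound:
        return True
    return False                  # fn returned a default instead of raising
```

Run the implementer's test against both versions: the original passes, the "refactor" fails, and that failing test is exactly the positive evidence the reviewer demands.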

The architect's full system prompt is at github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym. If the structural separation is the load-bearing claim, you should be able to read the prompt yourself and decide whether the constraint actually holds. I'd rather you do that than take my word.

heym is the multiplier

heym is closest to n8n with first-class agent primitives. Self-hosted via Docker. Native multi-agent orchestration (isOrchestrator: true and subAgentLabels on the agent node), canvas node tools, native MCP client, and crucially: each heym workflow can be exposed as its own MCP server.

Which means this 4-agent code review team isn't just a workflow. It's a callable primitive. Drop the MCP into Cursor, Claude Code, an autoresearch loop, a Codex Cloud job, or a custom Python pipeline. The agent finishes its work, calls the team for a code review, gets back a structured verdict, and decides what to do with it.
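From the caller's side, "callable primitive" is a small decision loop. The transport is abstracted into a review_fn parameter, a hypothetical stand-in for whatever MCP client your agent framework provides; the verdict fields follow the structure described earlier, but the key names are illustrative.

```python
# Sketch of an agent consuming the review team's verdict. review_fn is a
# hypothetical stand-in for an MCP tool call into the heym workflow.

def agent_finish(diff: str, review_fn, apply_fn, escalate_fn):
    """Finish a unit of work: get a second-opinion verdict, then act on it."""
    verdict = review_fn(diff)
    concerns = verdict.get("CONCERNS", [])
    if verdict["VERDICT"] == "approve":
        return apply_fn(diff)                      # open the PR / commit
    if verdict["VERDICT"] == "request_changes":
        high = [c for c in concerns if c["severity"] == "high"]
        return escalate_fn(diff, high)             # loop back with sourced concerns
    return escalate_fn(diff, concerns)             # "discuss": hand to the human
```

The agent, not the review team, decides what a verdict means for its own loop: commit, retry, or escalate.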

That's the layer the field hasn't filled yet. Vertical bots like CodeRabbit do human PR review on GitHub; nobody had built the open primitive for the agent layer. So I did.

Open source

The workflow JSON, system prompts, verification tests, and a setup walkthrough are at github.com/ejentum/agent-teams/tree/main/adversarial-code-review/heym. MIT.

For one-click import, use the heym template marketplace: heym.run/templates/adversarial-code-review.

You need:

  • A heym instance, v0.0.13+ (self-hosted Docker).
  • An Ejentum API key (free tier: 100 calls; the Ki tier at 5,000 calls/month for ongoing use).
  • LLM credentials in heym for whichever model families you want each specialist running on.

Import the JSON, set credentials, walk through the README. Roughly 15 minutes from clone to first working review if heym is already running; longer if you're standing up the heym Docker stack from zero.

What heym is, in three sentences (for readers who haven't seen it)

heym is "an AI-native automation platform built from the ground up around LLMs, agents, and intelligent tooling" (their own description). The closest analog is n8n with native agent primitives baked in. Self-hosted via Docker, repo at github.com/heymrun/heym, shipping fast over the past month.

Two heym features this workflow leans on: canvas node tools (any node on the canvas can be wired into an Agent's Tool input, with individual fields marked as agent-fillable at runtime) and native multi-agent orchestration (one agent calls named sub-agents and sub-workflows visually). Without those primitives, you'd be hand-coding orchestration; with them, the entire 4-agent setup is a canvas you can read at a glance.

Where this is going

This is the first team in agent-teams/. The pattern (orchestrator + N specialists with cognitive harnesses) generalizes to other tasks where multi-cognitive analysis genuinely beats single-agent output:

  • Refactor planner (reasoning + code + anti-deception)
  • Security audit triage (anti-deception + code + reasoning)
  • Production debug forensic (reasoning + code + memory)
  • Strategic decision audit (reasoning + anti-deception + memory)

Each follows the same structural rule: the architect has no harness, every concern is sourced from a specialist's evidence. The architecture encodes the multi-cognitive value into the workflow shape rather than leaving it to prompt theater.
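The pattern is small enough to write as data. Team names follow the list above; the config shape is illustrative, but it encodes the two invariants: the architect carries no harness, and every specialist carries exactly one.

```python
# The orchestrator + N harnessed specialists pattern as data. Team names
# follow the roadmap above; the config shape is illustrative.

TEAMS = {
    "adversarial-code-review":  ["reasoning", "code", "anti-deception"],
    "refactor-planner":         ["reasoning", "code", "anti-deception"],
    "security-audit-triage":    ["anti-deception", "code", "reasoning"],
    "production-debug-forensic": ["reasoning", "code", "memory"],
    "strategic-decision-audit": ["reasoning", "anti-deception", "memory"],
}

def build_team(name: str) -> dict:
    """Architect has no harness and may not author concerns; each
    specialist is locked to exactly one harness."""
    return {
        "architect": {"harness": None, "may_author_concerns": False},
        "specialists": [{"harness": h} for h in TEAMS[name]],
    }
```

Swapping the harness list is the whole customization surface; the structural rule stays fixed.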

If you build a team using this pattern, drop a folder in agent-teams/ with your workflow + system prompts and I'll merge.

What this is not

Not a hosted SaaS. You run heym on your own Docker. The Ejentum harness calls go through Ejentum's API; the rest is on your infrastructure.

Not a replacement for human PR review. It's a prefilter. The architect verdict gives the human a structured starting point: classification, sourced concerns, severity, falsifying tests. The human still makes the merge call.

Not a benchmark of "AI code review accuracy." It's a workflow template. Run it on your own diffs; calibrate to your own taste.

Open source, MIT, repo at github.com/ejentum/agent-teams. One-click import: heym.run/templates/adversarial-code-review.
More at ejentum.com. Questions: info@ejentum.com.
