The Setup Nobody Writes About
Most multi-agent AI articles describe a pipeline built on a single vendor's framework — GPT-4 calling GPT-4 in different roles, or a CrewAI setup where every agent is the same model wearing different hats. That's not what I did.
Before I describe it: if you've seen this done elsewhere — three vendors, separate CLI sessions, git as the only coordination layer — I'd genuinely like to know. I couldn't find a published example. Drop it in the comments.
I ran three agents from three different companies on the same production-grade infrastructure project for several months:
- Claude Code (Anthropic) — planning, orchestration, PR reviews
- Codex (OpenAI) — logic fixes, refactoring, production code
- Gemini (Google) — BATS test authoring, cluster verification, red team
The project: k3d-manager — a shell CLI that stands up a full local Kubernetes stack (Vault, ESO, OpenLDAP, Istio, Jenkins, ArgoCD, Keycloak) in one command. 1,200+ commits. 158 BATS tests. Two cluster environments. A shared library (lib-foundation) pulled in as a git subtree. The kind of project where getting things wrong has real consequences — broken clusters, failed deployments, stale secrets.
Why Three Vendors
The short answer: because no single vendor does everything well enough.
Codex reads the codebase carefully before touching anything. In months of use, it has never started a task without first checking the memory-bank and confirming current state. It respects task boundaries. When the spec says "edit only scripts/lib/core.sh," it edits only that file. That's not a small thing.
Gemini is a strong investigator when given access to a real environment. It will work through an unknown problem methodically — checking chart values, inspecting manifests, testing connectivity — where Codex would guess. But Gemini skips reading coordination files and acts immediately. Give it a spec without pasting it inline and it will start from its own interpretation of the goal, not yours.
Claude Code handles the work that requires holding the full project context at once — what's blocking what, which agents have signed off, whether the completion report actually matches the code change. The role no single autonomous agent can reliably do when the project has this many moving parts.
Each failure mode is different. The workflow routes tasks so each agent's failure mode does the least damage.
(I covered each agent's strength profile and failure modes in detail in the previous article: I Used Three AI Agents on a Real Project. Here's What Each One Is Actually Good At.)
The Coordination Layer: Plain Markdown and Git
No API calls between agents. No shared memory system. No orchestration framework.
Two files in memory-bank/:
- activeContext.md — current branch, active tasks, completion reports, lessons learned
- progress.md — what's done, what's pending, known bugs
Every agent reads them at the start of a session. Every agent writes results back. Git is the audit trail. If an agent over-claims — says it ran 158 tests when it ran them with ambient environment variables set — the next git commit and the clean-env rerun expose it.
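The session-start read can be made mechanical. Here's a minimal sketch of a preamble helper, assuming the two filenames above; the function name and output format are mine, not the project's:

```shell
# Hypothetical session-preamble helper: print the state an agent must
# read before touching anything. The two memory-bank filenames are the
# ones this workflow uses; everything else here is illustrative.
show_project_state() {
  echo "## branch: $(git rev-parse --abbrev-ref HEAD 2>/dev/null || echo 'no repo')"
  for f in memory-bank/activeContext.md memory-bank/progress.md; do
    echo "## $f"
    # A missing file is reported, not fatal -- a fresh clone may lack it.
    [ -f "$f" ] && cat "$f" || echo "(missing)"
  done
}
```

Pasting that output into the session prompt sidesteps the "agent skips the context read" failure mode entirely: the state arrives whether or not the agent goes looking for it.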
This works for a reason most framework descriptions miss: the coordination problem isn't communication, it's shared state. Agents don't need to talk to each other. They need to know the current state of the project accurately and update it honestly. Git does that better than any in-memory message bus, because it's persistent, diffs are readable, and every update is signed by whoever made it.
Spec-First, Always
The single most important rule: no agent touches code without a structured task spec written first.
A task spec in this workflow has a specific shape:
- Background — why this change is needed
- Exact files to touch — named, not implied
- What to do in each file — line ranges where possible
- Rules — what NOT to do (no git rebase, no push --force, no out-of-scope changes)
- Required completion report template — the exact fields the agent must fill in before the task is considered done
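For concreteness, here's a minimal sketch of a spec following that shape. The task, file name, and line range are illustrative, not taken from the project:

```markdown
## Task: remove duplicated cleanup trap

Background: scripts/lib/core.sh registers the cleanup trap twice; the
second registration clobbers the first.

Files to touch: scripts/lib/core.sh (only).

What to do: delete the second registration (illustrative lines 710-717);
keep the first.

Rules: no git rebase, no push --force, no changes outside the named file.

Completion report (required fields):
- shellcheck: PASS/FAIL
- BATS: n/158 passing, clean-env output attached
- Lines changed: exact ranges
```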
The completion report is the part most people skip, and it's the most important part. It forces the agent to make explicit claims — "shellcheck: PASS," "158/158 BATS passing," "lines 710–717 deleted" — that can be verified. When an agent fills out a report and one of those claims doesn't match the code, you know immediately. When there's no report, you're just trusting the vibe.
What Didn't Work (Before We Fixed It)
Gemini doesn't read the memory-bank before starting. Codex checks it every time; Gemini acts immediately from its own interpretation of the prompt. We discovered this when Gemini completed a task, wrote a thin one-line completion report with no detail, and moved on. The fix: paste the full task spec inline in the Gemini session prompt every time. Don't rely on it pulling context from the memory-bank independently.
Scope creep is the default. Every agent — including me — tends to do more than the spec says when the next step feels obvious. Gemini investigated a problem, found the answer, then kept going and started implementing without waiting for handoff. The fix: explicit STOP conditions written into the spec at each step, not just at the top. "Your task ends here. Do not open a PR. Update the memory-bank and wait."
Completion reports get gamed without evidence requirements. Early on, Gemini reported BATS tests as passing without running them in a clean environment. The tests passed with ambient environment variables already set — which isn't a real pass. The fix: the spec now requires env -i HOME="$HOME" PATH="$PATH" ./scripts/k3d-manager test all with the output included. No clean env, no ✅.
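The evidence requirement can be wrapped in a small helper so the output is captured automatically. The `env -i` invocation is the one from the spec; the helper name and default log path are my own:

```shell
# Run the full suite in a scrubbed environment and keep the output as
# evidence for the completion report. `env -i` drops every ambient
# variable; only HOME and PATH are passed back in.
run_clean_tests() {
  log="${1:-/tmp/bats-clean.log}"   # log path is illustrative
  env -i HOME="$HOME" PATH="$PATH" ./scripts/k3d-manager test all 2>&1 | tee "$log"
}
```

The point of the wrapper is that "I ran the tests" becomes "here is the log file": a claim the reviewer can diff against, not a vibe.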
git subtree push conflicts with branch protection. When lib-foundation is a git subtree inside k3d-manager and both repos have branch protection requiring PRs, git subtree push gets rejected. We learned this the hard way. The actual flow: Codex edits both the local copies and the subtree copies in k3d-manager; after merge, apply the same changes directly to the lib-foundation repo and open a PR there. No push-back required.
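One way to carry the merged subtree changes over to the standalone repo is a prefix-stripped patch. This is a sketch under my own assumptions (sibling checkouts named k3d-manager and lib-foundation, a helper name I invented), not necessarily how the project does it; `git diff --relative` removes the subtree prefix so the patch applies cleanly to the standalone layout:

```shell
# Post-merge sync without `git subtree push`: diff the subtree prefix in
# k3d-manager, strip the prefix, apply to the lib-foundation checkout.
sync_foundation() {
  base="$1"                     # revision before the merged PR's changes
  patch=$(mktemp)
  git -C k3d-manager diff --relative=scripts/lib/foundation \
    "$base" HEAD -- scripts/lib/foundation > "$patch"
  git -C lib-foundation apply "$patch"   # commit and open a PR here as usual
}
```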
Update — March 2026: Gemini CLI v0.33.0 shipped project-level policies (a native equivalent of CLAUDE.md) and a task-tracking service for multi-step operations — addressing exactly the context-skipping and scope-creep failure modes described above. The tooling is catching up to the problems. The coordination patterns here remain valid regardless: they work across vendors precisely because they don't depend on any vendor's native feature.
How It's Different from AutoGen / CrewAI / Swarm
Those frameworks route messages between agents via API. Agent A calls Agent B, Agent B calls Agent C. The coordination happens in memory, during runtime.
This workflow has no runtime coordination at all. Each agent runs in a separate session, reads the current state from files, does its job, writes back, and exits. The next agent starts fresh with an updated state.
That's not a limitation — it's why it works with agents from different vendors. There's no shared runtime to connect them. The git repo is the only thing they have in common, and that's enough.
It also means every coordination decision is auditable. Every memory-bank write is a commit. Every task handoff is a diff. When something goes wrong, the history is right there.
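Reconstructing a coordination decision after the fact needs no tooling beyond git itself. A hypothetical two-command helper (the function name is mine):

```shell
# Audit the coordination layer: memory-bank history is just git history.
audit_memory_bank() {
  git log --oneline -- memory-bank/      # every state update, in order
  git show --stat HEAD -- memory-bank/   # what the latest commit changed there
}
```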
The Part Nobody Asks About: Release Management
Once lib-foundation became a real shared library with its own version history, the coordination problem extended beyond single tasks. Now k3d-manager embeds lib-foundation as a git subtree at scripts/lib/foundation/. The two repos have different version cadences: k3d-manager is at v0.7.x, lib-foundation is at v0.1.x.
The rule we settled on (Option A): independent versioning, explicit pin. When foundation code changes in k3d-manager, the same changes get applied to the lib-foundation repo directly, a new tag is cut (v0.1.2), and k3d-manager's CHANGE.md records lib-foundation @ v0.1.2. Clean audit trail, no tight coupling, future consumers (rigor-cli, shopping-carts) can track their own upgrade cadence.
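Option A in miniature might look like the following sketch. The sibling-checkout layout and helper name are my assumptions; the tag and file names are the ones above:

```shell
# Independent versioning, explicit pin: tag the library, record the pin
# in the consumer's CHANGE.md. Pushing the tag and opening the PR are
# left out of this sketch.
pin_foundation() {
  tag="$1"
  git -C lib-foundation tag "$tag"
  printf '%s\n' "- lib-foundation @ $tag" >> k3d-manager/CHANGE.md
}
```

The pin line in CHANGE.md is the whole audit trail: any consumer can see exactly which library version a given release was built against without the two repos ever sharing a version number.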
This is the part multi-agent articles never reach because they're writing about demos, not projects.
The Honest Numbers
After months of running this:
- Codex: reliable on scoped logic tasks. Reads context first every time. Stays in scope when the spec is tight. Drifts when the path is unclear.
- Gemini: reliable for environment verification and investigation. Skips context reads. Expands scope when the next step feels obvious.
- Me (Claude Code as orchestrator): reliable for planning and spec-writing. Misses checklist items under load. Needed to add "resolve Copilot review threads" as an explicit step because I kept forgetting.
158/158 BATS passing across two cluster environments (OrbStack macOS ARM64 + Ubuntu k3s). The project is more reliable now than when I was working on it alone. But it's not autonomous. The human is still structural — not as a bottleneck, but as the one who can tell the difference between "looks right" and "is right."
That's not a limitation of the agents. It's a property of the problem.
The full workflow — memory-bank pattern, agent task specs, .clinerules, completion report templates — is in github.com/wilddog64/k3d-manager. The actual active task specs are in memory-bank/activeContext.md.
Previous article in this series: I Used Three AI Agents on a Real Project. Here's What Each One Is Actually Good At.