Everyone Is Building the Same Thing
Browse multi-agent AI articles and a pattern emerges fast. LangGraph with GPT-4o playing three roles. CrewAI where the "researcher," "writer," and "editor" are all the same model. AutoGen orchestrating GPT-4 with GPT-4. The LangGraph "swarm" article making the rounds this week: one model, multiple system prompts, branded as emergent coordination.
These are not multi-agent systems. They are one agent with a role-switching UI.
That distinction matters for a reason nobody writes about: monoculture. When all your agents share the same model, they share the same blind spots.
A Bug That Proves the Point
This week I was working on lib-foundation — a shared Bash library used by k3d-manager. A function called _deploy_cluster_resolve_provider had been working correctly for months, or so we thought. It contains a TTY check:
if [[ -t 0 && -t 1 ]]; then
  provider="$(_deploy_cluster_prompt_provider)"
fi
The logic: if stdin and stdout are both TTYs, show an interactive provider prompt. Otherwise default silently to k3d. Sensible.
The bug: the function was being called via command substitution:
provider="$(_deploy_cluster_resolve_provider "$platform" "$provider_cli" "$force_k3s")"
Command substitution $() creates a subshell where stdout is a pipe, not a TTY. So [[ -t 1 ]] is always false. The interactive prompt never fires. On every interactive Linux session with no provider set, the function silently defaults to k3d — bypassing the prompt entirely.
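The failure is easy to reproduce in isolation. The sketch below uses a hypothetical `check_tty` helper (not from lib-foundation) to show that inside `$()` the stdout TTY test fails regardless of how interactive the surrounding session is:

```shell
#!/usr/bin/env bash
# check_tty stands in for any function that branches on [[ -t 1 ]].
check_tty() {
  if [[ -t 1 ]]; then
    echo "interactive"
  else
    echo "non-interactive"
  fi
}

# Direct call: the result depends on whether the session has a terminal.
check_tty

# Command substitution: stdout is now a pipe feeding the capture,
# so [[ -t 1 ]] is false no matter where this script runs.
result="$(check_tty)"
echo "captured: $result"   # always "captured: non-interactive"
```

Run this in any terminal: the direct call may say `interactive`, but the captured call never does.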
Codex wrote the original function. Claude reviewed it across multiple sessions. Neither caught it.
Copilot (GitHub's code review bot, GPT-4o based) caught it — flagged as P1 on the pull request.
Why Different Models Catch Different Things
This isn't a story about Copilot being smarter than Claude or Codex. It's a story about attention patterns.
Each model is trained on different data, fine-tuned on different tasks, and evaluated against different benchmarks. That produces genuinely different blind spots — not random noise, but systematic gaps that vary by model family.
Codex is optimized for code generation within a spec. It reads context carefully, stays in scope, and produces correct implementations of what it's asked to do. It is less likely to question the call site.
Claude handles architectural reasoning and cross-file coherence well. It tracks what's blocking what, and spots when a change creates a downstream inconsistency. It is less likely to scrutinize low-level shell behavior in a function body it didn't write.
Copilot is trained heavily on pull request review — it has seen millions of diffs where reviewers flagged exactly this class of issue: a function whose behavior changes when the execution context changes. It caught the TTY bug because that's the shape of problem it's been optimized to find.
The fix required applying the same change in two places: lib-foundation and k3d-manager's own core.sh. Both had the identical bug. Two models reviewed it. A third model from a different vendor found it.
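The patch itself isn't reproduced here, but one plausible shape for this class of fix (a sketch, not the actual k3d-manager change) is to test stderr instead of stdout, since `$()` captures only stdout and leaves stderr attached to the terminal. The prompt stub below is hypothetical:

```shell
#!/usr/bin/env bash
# Stub standing in for the real interactive prompt.
_deploy_cluster_prompt_provider() {
  local choice
  read -r -p "Provider [k3d/k3s]: " choice   # read -p prompts on stderr
  printf '%s\n' "${choice:-k3d}"
}

_deploy_cluster_resolve_provider() {
  local provider="k3d"   # silent non-interactive default
  # -t 2 survives command substitution; -t 1 does not.
  if [[ -t 0 && -t 2 ]]; then
    provider="$(_deploy_cluster_prompt_provider)"
  fi
  printf '%s\n' "$provider"
}

provider="$(_deploy_cluster_resolve_provider)"
echo "provider=$provider"   # "k3d" when run non-interactively
```

The key property: the function's behavior no longer flips just because a caller wraps it in `$()`.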
The Monoculture Math
If you run three GPT-4o agents on a codebase, you have three instances of the same blind spots. The bug that GPT-4o misses on the first pass, it will miss on the second and third. You've added parallelism, not diversity.
The marketing language around "swarms" and "multi-agent" implies diversity of perspective. It doesn't deliver it when the model is the same.
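The arithmetic is stark. If review passes were independent, reviewers that each miss a given bug 40% of the time would jointly miss it only about 6% of the time across three passes; three copies of the same model miss it 40% of the time, pass after pass. A toy calculation (the 0.4 miss rate is illustrative, not measured):

```shell
#!/usr/bin/env bash
# Toy numbers: assume each reviewer misses a given bug 40% of the time.
miss=0.4
same_model="$miss"   # perfectly correlated: extra passes add nothing
diverse="$(awk -v m="$miss" 'BEGIN { printf "%.3f", m*m*m }')"  # independent
echo "miss probability, 3x same model:    $same_model"
echo "miss probability, 3 diverse models: $diverse"   # 0.064
```

Real models are neither perfectly correlated nor fully independent, but the gap between those two endpoints is what vendor diversity is buying.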
This is not hypothetical. These are the same dynamics that caused monoculture failures in other engineering domains:
- Dependent libraries sharing the same vulnerability because they all descended from the same upstream
- Cloud outages cascading because multiple "independent" systems shared the same availability zone
- Financial models all failing in the same direction because they were all trained on the same historical data
AI model monoculture is a newer version of the same pattern.
What Cross-Vendor Actually Buys You
In this workflow:
- Codex (OpenAI) writes production code — scoped, disciplined, stays in spec
- Gemini (Google) runs verification on real clusters — investigates, red-teams, finds environment-specific failures
- Claude (Anthropic) holds architectural context — tracks state, writes specs, reviews agent output
- Copilot (GitHub/OpenAI, but trained specifically on PR review patterns) reviews pull requests
Each vendor's failure modes are different. Codex drifts when a spec is underspecified. Gemini skips context files and expands scope. Claude misses low-level shell semantics under load. Copilot flags things without enough context to know if they're actually bugs.
The workflow routes tasks to minimize each agent's failure mode — and uses a different vendor's strengths to catch what the previous vendor missed. That's not a theoretical benefit. It's what caught the TTY bug.
The Confirmation Gate Pattern
One more thing worth noting from this week. That LangGraph "swarm" article had one genuinely good idea buried in the framework noise: the Human-in-the-Loop approval gate. A blocking call that suspends execution, asks for explicit confirmation, and only resumes on a valid response.
We added that pattern to the k3dm-mcp roadmap — not as a LangGraph node, but as a first-class MCP tool:
{ "action": "destroy_cluster", "target": "dev-cluster", "blast_radius": "destroy" }
→ { "status": "awaiting_confirmation", "token": "<one-time-token>", "ttl": 60 }
→ resume only on: { "token": "<one-time-token>", "confirm": true }
One-time token with a 60-second TTL stored server-side. An agent can't self-approve — it has to surface the confirmation to a human and wait. The token expires before the next step if nobody responds.
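A minimal Bash sketch of that gate (names and structure are illustrative, not the actual k3dm-mcp implementation):

```shell
#!/usr/bin/env bash
# Illustrative confirmation-gate sketch -- not the k3dm-mcp code.
declare -A GATE_EXPIRY    # one-time token -> expiry (epoch seconds)
GATE_TTL=60

# A destructive request mints a token instead of acting. Note that it
# sets a global rather than echoing its result: echoing from inside $()
# would run the function in a subshell and silently lose the stored
# token -- the same class of context bug discussed above.
gate_request() {
  GATE_TOKEN="$(od -An -N16 -tx1 /dev/urandom | tr -d ' \n')"
  GATE_EXPIRY["$GATE_TOKEN"]=$(( $(date +%s) + GATE_TTL ))
}

# The action runs only with a valid, unexpired token; the token is
# consumed on first use, so it cannot be replayed or self-approved twice.
gate_confirm() {
  local token="$1" now
  now="$(date +%s)"
  if [[ -n "${GATE_EXPIRY[$token]:-}" ]] && (( now < GATE_EXPIRY[$token] )); then
    unset "GATE_EXPIRY[$token]"
    return 0
  fi
  return 1
}

gate_request
gate_confirm "$GATE_TOKEN" && echo "confirmed: destructive action may proceed"
gate_confirm "$GATE_TOKEN" || echo "replay rejected"
```

The server-side store and the one-time-use unset are the load-bearing parts: the agent only ever sees the token, never the expiry table.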
This is how you borrow a good pattern from a framework-heavy article without adopting the framework. Find the structural insight, strip the boilerplate.
The Actual State of Multi-Agent in 2026
Real multi-agent work — the kind where the project has real consequences and runs over months — looks less like a LangGraph graph and more like a team with distinct roles, explicit handoffs, and a shared audit trail.
The coordination layer doesn't have to be sophisticated. Ours is two markdown files and a git repo. What it has to be is honest: accurate shared state that every agent reads before starting and updates before exiting.
And the agents have to actually be different. Different models, different vendors, different training focuses. Not the same model wearing different hats.
That's the thing the framework demos consistently skip. It's also the thing that determines whether your multi-agent system finds bugs or just shuffles them around.
This is part of an ongoing series on running production infrastructure with cross-vendor AI agents. The coordination patterns, task specs, and memory-bank format are all in github.com/wilddog64/k3d-manager.