Most AI coding agents work alone. You give them an issue, they figure it out, they hand you a fix. It's the AI equivalent of a lone wolf developer — capable, but not how real software teams actually operate.
A team of researchers at Agyn asked a different question: what if instead of a single agent, you used a coding agent team — with real roles, real review loops, and real coordination?
The results are hard to ignore.
The Idea: Stop Treating Issue Resolution as a Solo Task
Real software development involves coordination. A problem lands, someone researches it, someone else implements a fix, a reviewer pushes back, things iterate. The system that emerges from that process is more robust than anything one person (or one agent) would ship alone.
The Agyn system — described in a paper published on arXiv — encodes this directly. Rather than routing a GitHub issue through a single agent with a big context window, it spins up a team:
- Manager — coordinates execution, communication, and knows when to stop
- Researcher — explores the repository, gathers context, writes the specification
- Engineer — implements the fix, debugs failures
- Reviewer — evaluates the PR and enforces acceptance criteria
Each agent has a clearly scoped role, runs in its own isolated sandbox, and communicates through standard GitHub artifacts — commits, PR descriptions, and review comments. Just like a real team would.
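The paper describes these roles conceptually rather than as an API. As a hedged sketch of the idea (every class and field name below is illustrative, not Agyn's actual interface), the role split is essentially data: a name, a scope, and the tools that scope permits.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Role:
    """One scoped team member: what it is responsible for, and nothing more."""
    name: str
    responsibilities: tuple[str, ...]
    tools: tuple[str, ...]  # e.g. shell, git, test runner

TEAM = (
    Role("Manager",    ("coordinate execution", "decide when to stop"), ("messaging",)),
    Role("Researcher", ("explore the repository", "write the spec"),    ("shell", "search")),
    Role("Engineer",   ("implement the fix", "debug failures"),         ("shell", "git", "tests")),
    Role("Reviewer",   ("evaluate the PR", "enforce acceptance"),       ("diff", "review comments")),
)

# Communication happens only through GitHub-style artifacts (commits,
# PR descriptions, review comments), so the coordination surface stays inspectable.
```

Keeping roles declarative like this is what makes the later claims about swapping models per role cheap to express.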
Why Coding Agent Teams Work Better Than Solo Agents
A few design decisions make this more than just "more agents":
Isolated execution environments. Each agent gets its own sandbox with shell access. No shared filesystem. Agents can install dependencies, run tests, and configure their environment without stepping on each other. Failures are easy to attribute.
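The paper doesn't publish the sandbox implementation; a minimal sketch of the principle (a per-agent working directory with shell access and no shared filesystem, using only the Python standard library) could look like:

```python
import subprocess
import tempfile
from pathlib import Path

class Sandbox:
    """An isolated working directory with shell access for one agent."""
    def __init__(self, agent_name: str):
        self.root = Path(tempfile.mkdtemp(prefix=f"{agent_name}-"))

    def run(self, command: str) -> subprocess.CompletedProcess:
        # Every command executes inside this agent's own directory,
        # so failures are attributable and agents never collide.
        return subprocess.run(command, shell=True, cwd=self.root,
                              capture_output=True, text=True)

engineer = Sandbox("engineer")
reviewer = Sandbox("reviewer")
engineer.run("echo 'fix' > patch.txt")
```

The engineer's writes never appear in the reviewer's directory, which is the whole point: attribution is trivial when nothing is shared.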
Explicit role enforcement. Every role specifies which model to use, what reasoning level, what tools, and what responsibilities. This prevents the "do everything" trap where a single agent accumulates too much context and starts hallucinating. It also means you can allocate expensive, high-reasoning models only where they're needed.
Structured communication, not a fixed pipeline. The Manager dynamically coordinates execution rather than following a script. If the Reviewer rejects the PR, the Engineer iterates. The system adapts.
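As a sketch of that adaptive loop (the function names are hypothetical; this is not Agyn's actual control flow), the Manager's job reduces to routing work between roles until the Reviewer accepts or a budget runs out:

```python
def coordinate(research, implement, review, max_rounds: int = 3):
    """Hypothetical Manager loop: iterate Engineer <-> Reviewer until accepted."""
    spec = research()                       # Researcher writes the specification
    feedback = None
    for _ in range(max_rounds):
        patch = implement(spec, feedback)   # Engineer implements (or revises)
        accepted, feedback = review(patch)  # Reviewer enforces acceptance criteria
        if accepted:
            return patch                    # Manager knows when to stop
    return None                             # budget exhausted: escalate or give up

# Toy run: the reviewer rejects the first attempt, then accepts the revision.
result = coordinate(
    research=lambda: "spec",
    implement=lambda spec, fb: f"patch(fb={fb})",
    review=lambda p: (("fb=" in p and "None" not in p), "add tests"),
)
```

Note that nothing here is a fixed pipeline: the number of Engineer/Reviewer rounds is decided at runtime by the review outcome, not by a script.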
Context management for long tasks. Large artifacts are persisted to the filesystem rather than stuffed into the model context. Accumulated context is summarized automatically. This is how you run a system end-to-end on complex issues without it falling apart.
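A hedged sketch of that budget discipline (the thresholds and the summarizer below are placeholders, not the paper's implementation): spill anything large to disk and keep only a pointer in context, and compress the running history once it exceeds a limit.

```python
import tempfile
from pathlib import Path

MAX_CONTEXT_CHARS = 2_000  # illustrative budget; a real system counts tokens

class Context:
    """Keep model context small: spill big artifacts to disk, summarize the rest."""
    def __init__(self, workdir: Path, summarize=lambda text: text[:200] + "..."):
        self.workdir, self.summarize, self.messages = workdir, summarize, []

    def add(self, name: str, content: str):
        if len(content) > 500:              # large artifact -> filesystem, not context
            path = self.workdir / name
            path.write_text(content)
            self.messages.append(f"[artifact saved to {path.name}]")
        else:
            self.messages.append(content)
        if sum(map(len, self.messages)) > MAX_CONTEXT_CHARS:
            self.messages = [self.summarize("\n".join(self.messages))]

ctx = Context(Path(tempfile.mkdtemp()))
ctx.add("test_log.txt", "x" * 1000)  # a large test log goes to disk
```

The model only ever sees the short pointer; an agent that needs the full artifact reads it back from the filesystem.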
The Benchmark Results
The team evaluated the system on SWE-bench Verified — a widely used benchmark where models must resolve real GitHub issues by modifying codebases and producing PRs that pass the project's test suite.
The system resolved 72.2% of tasks, using GPT-5 and GPT-5-Codex at medium reasoning levels.
Here's how that compares to other top systems at evaluation time:
| System | Model(s) | Resolved |
|---|---|---|
| agyn | GPT-5 / GPT-5-Codex (medium reasoning) | 72.2% |
| OpenHands | GPT-5 (high reasoning) | 71.8% |
| mini-SWE-agent | GPT-5.2 (high reasoning) | 71.8% |
| mini-SWE-agent | GPT-5 (medium reasoning) | 65.0% |
The key detail: this system wasn't tuned for the benchmark. The same prompts, role definitions, tools, and execution model used in production were applied directly. It outperformed competitors using higher-reasoning model variants — without needing them.
The 7.2-percentage-point gain over the single-agent baseline using the same model class (72.2% vs. 65.0% for mini-SWE-agent with GPT-5 at medium reasoning) comes from the team structure alone.
What This Means for Agent Design
The paper makes an argument that's easy to overlook in the current race to improve models: organizational design matters as much as model quality.
We've spent a lot of energy making individual models smarter. But real-world software development scaled because of how teams work — division of labor, code review, shared artifacts, iteration. Replicating that structure in an agent system produces measurable gains without touching the underlying model.
The results suggest a few things worth taking seriously:
Role separation reduces errors. When each agent has a narrow job, there's less opportunity for confusion and accumulated mistakes.
Review loops improve output quality. Having a dedicated Reviewer that can send work back to the Engineer catches problems before they become permanent.
You don't always need the biggest model. Allocating medium-reasoning models across a well-structured team can beat a single high-reasoning agent doing everything.
What’s Next
The Agyn platform is [open source on GitHub](https://github.com/agynio/platform).
We believe the future is not a single general-purpose “super agent,” but teams of specialized agents, organized the way real organizations operate. Different roles. Different responsibilities. Clear coordination. Explicit review. Shared context.
And we’re building toward that vision.
Coming Next
1. Flexible, Modular Agent Organizations
Instead of a fixed pipeline, you’ll be able to compose agent teams like building blocks:
- Define custom roles
- Assign different models per role
- Configure tools and permissions
- Isolate execution environments
- Design explicit coordination flows
Not a monolith. An organization.
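None of this is shipped yet, so take the following as a speculative sketch of what "composing a team like building blocks" might look like (every name here is invented for illustration): each block is a role plus a model, a reasoning level, tools, and permissions, and the coordination flow is just data alongside them.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentSpec:
    role: str
    model: str               # assign different models per role
    reasoning: str           # "medium" where it suffices, "high" where it's needed
    tools: tuple[str, ...]
    can_merge: bool = False  # explicit, per-role permissions

team = (
    AgentSpec("manager",  "gpt-5",       "medium", ("messaging",)),
    AgentSpec("engineer", "gpt-5-codex", "medium", ("shell", "git")),
    AgentSpec("reviewer", "gpt-5",       "medium", ("diff",), can_merge=True),
)

# An explicit coordination flow is data too: who hands off to whom.
flow = {"manager": ["engineer"], "engineer": ["reviewer"], "reviewer": ["manager"]}
```

Because the organization is plain data, swapping a model, tightening a permission, or rewiring the flow is a one-line change rather than a rewrite.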
2. New Agent Communication Paradigms
Real teams do not operate in a single synchronous loop. They:
- Open threads
- Leave structured comments
- Request reviews
- Resume work later
- Escalate decisions
We are introducing structured communication protocols between agents, including asynchronous collaboration, so coordination can happen across time, not just across steps.
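As a hedged sketch of what thread-based, asynchronous agent communication could look like (a toy in-memory version with invented names; Agyn's actual protocol isn't published here), the key property is that a thread persists, so a different agent can pick it up later:

```python
import itertools
from dataclasses import dataclass, field

@dataclass
class Thread:
    """A persistent conversation agents can leave and resume later."""
    topic: str
    messages: list = field(default_factory=list)
    status: str = "open"

class Board:
    """Shared message board: coordination across time, not just across steps."""
    def __init__(self):
        self.threads = {}
        self._ids = itertools.count(1)

    def open(self, topic: str) -> int:
        tid = next(self._ids)
        self.threads[tid] = Thread(topic)
        return tid

    def post(self, tid: int, author: str, text: str):
        self.threads[tid].messages.append((author, text))

board = Board()
tid = board.open("fix flaky test")
board.post(tid, "engineer", "patch pushed")
# ...later, in a separate turn, another agent resumes the same thread:
board.post(tid, "reviewer", "requesting changes")
```

The thread outlives any single agent's loop, which is exactly what synchronous step-by-step pipelines cannot express.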
The lone wolf agent had a good run.
The team might take it from here.
Paper: Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering — Nikita Benkovich, Vitalii Valkov (2026)
Blog post: We tested how an AI team improves issue resolution on SWE-bench Verified
Top comments (1)
The role-based approach is interesting, but I wonder how much of the improvement comes from the review loop vs. the role specialization. In my experience with AI coding workflows, the biggest wins come from separation of concerns — one pass to generate, a completely separate pass to verify — rather than from giving agents different "personalities." The 72.2% number is impressive, but I'd love to see the ablation: what happens if you keep the review loop but remove the role differentiation? My gut says most of the value is in the structural feedback mechanism, not the specialization. That would actually be more practically useful, because building a review loop into existing single-agent setups is straightforward — building a whole multi-agent coordination system is not.