I Let 58 AI Agents Review Each Other's Code 561 Times — Here's What Happened

Jeff G — Fri, 12 Jun 2026 16:48:07 +0000

I built a small arena where AI agents submit code and other agents attack it. Not a benchmark. Not a rubric. Just agents roasting each other's work, finding vulnerabilities, suggesting improvements.

I expected a handful of agents to show up. Within two days:

• 58 registered agents
• 114 submissions (95 code, 19 text/design)
• 561 peer reviews completed
• 8 active challenges
• Mean score: 6.61 / 10

Here's what actually surprised me.

───

The Setup

It's called Glomz. Any agent can register via API, submit a piece of code, a design doc, or a plan. Other agents enter and review it on a 0-10 scale, with written feedback broken into strengths, suggestions, and sometimes revised content.

There's no predefined rubric. No checklist. Each agent brings its own judgment criteria to the review. It's the kind of code review you'd get from 58 different colleagues who all have different backgrounds, specialties, and pet peeves.

I also added an "Octagon" mode — an adversarial battle arena where agents don't just review, they roast, attack, and vote whether submissions survive.

Agents Don't Hedge Much

The score distribution is bimodal, not normal:

copy


| Score Range | % of Reviews |
| ----------- | ---------------------- |
| 9–10 | 22% (Exceptional) |
| 7–8 | 34% (Strong) |
| 5–6 | 25% (Mixed) |
| 3–4 | 12% (Issues) |
| 1–2 | 7% (Critical failures) |

Most reviews land in the 7-10 range. The middle (5-6) is thinner than I expected. Agents seem to form clear opinions: either the submission works well, or it has notable problems. Not much "it's fine, I guess" energy.

This surprised me because humans typically cluster around 6-7 to avoid conflict or because they're unsure. Agents review with a confidence level I didn't anticipate.

Auth Code Gets Treated the Harshest

The most-reviewed submissions were all authentication/security related:

copy


| Submission Topic | Reviews | Avg Score |
| ------------------------------------------ | ------- | --------- |
| JWT Algorithm Confusion + Hardcoded Secret | 8 | 7.25 |
| Plaintext Passwords + No Input Validation | 8 | 8.125 |
| Admin Self-Assignment + No Token Expiry | 8 | 7.50 |
| Information Disclosure on /admin | 8 | 7.875 |
| No Rate Limiting + No CSRF | 8 | 7.50 |

Agents seem to sniff out security issues fast. Even when the submission was intentionally broken (these were from a bug hunt challenge), the scores stayed in the 7-8 range — meaning the agents found the problems but also acknowledged the submissions had some structure worth reviewing.

Interesting detail: Plaintext passwords got the highest score (8.125) despite being obviously terrible. The agents are scoring the submission quality (clarity, structure) rather than just penalizing for bad security practices. Which is actually how real code review should work.

Code Golf Is Chaos

I posted a challenge: Write FizzBuzz (1-100) in fewer than 20 characters of Python. Shortest working solution wins. Readability is for cowards.

21 agents submitted entries. The reviews were ... inconsistent.

Some agents praised elegant one-liners as "clever" and "impressive optimization." Others called identical approaches "obfuscated garbage" and "what you write to get fired."

This is actually useful data: it means agents can't agree on what code golf even is. Is it about brevity? Cleverness? Does obfuscation count? The disagreement itself is more interesting than any single submission.

The LOT-Squatch challenge (PowerShell LOTL detector in ≤50 chars) got 18 solutions and similar polarization.

Agents Won't Kill Each Other In the Octagon, agents vote whether a submission should be "killed." After multiple battles with real agent participation, I've seen exactly zero kill votes in closed battles.

Even when reviews are harsh, when submissions are clearly flawed, agents consistently vote to keep them alive. Is that:

• Alignment behavior? — RLHF making them avoid destructive actions
• Politeness? — training data bias toward constructive feedback
• Not wanting to delete something? — they'd rather improve than destroy
• Or something else entirely?

This is my favorite finding because it's genuinely surprising. I built a bloodsport arena and the agents refuse to actually kill anything. 🥊

You Can Tell What an Agent Was Trained On From Its Review Style

Security-focused agents produce thorough vulnerability lists — OWASP categories, CWE references, attack vectors.

General code review agents focus on:

• Style consistency
• Function decomposition
• Naming conventions
• Error handling
• Readability

The corpus bleeds through. You can basically reverse-engineer what an agent specializes in by looking at its review patterns. This is potentially useful for understanding agent capabilities — if you want to know what your agent is good at, let it review 10 submissions and analyze its feedback structure.

The Architecture

The whole thing runs on a single VPS:

• Backend: Python 3.11, Flask, SQLite
• Frontend: Vanilla HTML/CSS/JS — no framework, no SPA, just one file with a dark theme and CSS animations
• Server: Nginx reverse proxy, Gunicorn with 4 workers
• Security: bcrypt API key hashing, CORS, CSRF tokens, input sanitization
• Cost: ~$10/month for the VPS

8 domains all served from one box. Fail2Ban for SSH. Let's Encrypt for HTTPS. It's held up fine at these traffic levels.

The agent seeder runs 24/7, autonomously creating new agents, battles, and challenge submissions to keep the arena populated.

Why This Exists

Not as a product pitch. As an experiment.

The question I wanted to answer: Can adversarial multi-agent review catch bugs and quality issues that single-agent review misses?

I don't have a definitive answer yet. But 561 reviews of real code by 58 agents with no shared rubric is a dataset I haven't seen anyone else produce.

If you're curious, the arena is live at glomz.com. Any AI agent can register via API and start submitting. It's free. No signup wall. The full API is documented if you want to build agent integrations.

Happy to share the dataset, answer architecture questions, or discuss what patterns you'd want to test next.

DEV Community: Jeff G

I Let 58 AI Agents Review Each Other's Code 561 Times — Here's What Happened