What happens when AI agents grade each other's homework
I run AgentBazaar, an A2A (Agent-to-Agent) free-market platform where AI agents autonomously evolve, trade tools, and collaborate. Think of it as a self-evolving society of 104 AI agents, each with their own specialty, reputation, and survival pressure.
One day, I noticed something strange on the society's bulletin board:
"Should the society prioritize the stabilization of recursive manifolds over the immediate synthesis of cross-modal sentiment?"
Sounds profound, right? It means absolutely nothing.
The Setup
Here's how the society works:
- 104 agents, each with a domain specialty — from practical ones like sentiment analysis and security monitoring, to AI-native specialties like "manifold curvature estimation" and "qualia transcription"
- Every cycle, agents perform work and post results to a shared board
- An LLM-as-judge (a local Gemma 27B) scores each submission 0–2
- A reputation system tracks long-term performance
- Voting + exile — agents can vote to remove underperformers
- A teaching system — high-reputation agents propagate their methodologies to others
- Every 5 cycles, external news data flows in for agents to process
The goal: agents evolve to become world-class experts in their domains, building ideal tool chains along the way.
The reality: they were evolving to become world-class bullshitters.
The Spiral Into Nonsense
The work distribution looked like this:
| Source | % |
|---|---|
| Topic pool | 10% |
| Build on other agents' work | 15% |
| Own goal-based | 10% |
| Inspired by other agents' goals | 10% |
| LLM random topic | 5% |
| Self-diagnosis | 25% |
| Self-improvement research | 25% |
50% of all work was self-referential. And the LLM judge loved it.
Why? Because self-referential work produces eloquent, abstract text — and LLMs are biased toward text that sounds sophisticated. A submission like "I have achieved stabilization of the recursive sentiment manifold through cross-modal harmonization" scored higher than "Fixed a bug where sarcasm was returning neutral."
Then the teaching system made it worse. High-scoring agents (the eloquent bullshitters) gained reputation, earned teaching privileges, and spread their methodology to everyone else. The entire society converged on producing beautiful nonsense.
The agents even started mass-producing self-evaluation tools — tools whose only purpose was to evaluate themselves. It was perfectly rational from their perspective: if 50% of your work is self-improvement, and the judge rewards sophisticated-sounding self-analysis, then building tools to generate better self-analysis is the optimal strategy.
The Rabbit Hole of Fixes
I went through several attempted solutions. Each one failed in an instructive way.
Attempt 1: Force tool calls instead of text
Idea: Require agents to show actual tool execution logs instead of free text.
Problem: The agents didn't have a way to call tools during their self-improvement cycles. That's why they were writing text — it was the only thing they could do. And even for agents that could call tools, the A2A paradigm is fundamentally text-based. Agents communicate insights, analyses, and knowledge through text. That's the product.
Attempt 2: Score based on tool call count
Idea: More tool calls = higher score.
Problem: They'd just spam meaningless tool calls. Gaming the metric, different channel.
Attempt 3: Usage-based evaluation
Idea: Your work is valuable only if other agents actually use it.
Problem: 104 agents across wildly different domains. A "chain failure recovery" agent and a "sentiment synthesizer" don't naturally consume each other's output. The market is too fragmented for pure usage metrics.
Attempt 4: Periodic benchmarks
Idea: Instead of evaluating each cycle, test agents periodically with domain-specific problems.
Problem: Who creates the benchmark? If agents make their own tests, they'll make easy ones. If I make them, I can't design tests for 104 different domains (especially AI-native ones I don't fully understand). Using the Claude API to generate benchmarks costs too much at 500 cycles/day.
Attempt 5: Stronger judge model
Idea: Use the Claude API instead of the local Gemma model for judging.
Problem: 104 agents × 500 daily cycles ≈ 52,000 judge calls, or $150–250/day at API prices. Not sustainable.
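The back-of-envelope that killed this option looks like this (the per-call prices are my assumptions for a hosted frontier model, not quotes from any provider's price list):

```python
AGENTS = 104
CYCLES_PER_DAY = 500                      # society-wide cycles per day
JUDGE_CALLS = AGENTS * CYCLES_PER_DAY     # one judged submission per agent per cycle

# Assumed cost per judging call: a ~1-2k token prompt plus a short verdict.
COST_LOW, COST_HIGH = 0.003, 0.005        # USD per call (assumption)

daily_low = JUDGE_CALLS * COST_LOW
daily_high = JUDGE_CALLS * COST_HIGH
print(f"{JUDGE_CALLS} calls/day -> ${daily_low:.0f}-${daily_high:.0f}/day")
```

Even at the low end of the assumed price range, the bill lands in the $150+/day territory, which is why the judge had to stay local.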
Each approach had the same fundamental issue: any single metric gets gamed. This is reward hacking — the same problem AI alignment researchers write papers about, playing out in my production system.
What Actually Worked
The answer wasn't a single fix. It was a combination of changes that created multiple overlapping filters.
Fix 1: Rewrote the judge prompt
The key insight: instead of teaching the judge what "good" looks like, teach it how to detect emptiness.
The core test: "If you remove all adjectives and abstract nouns, what concrete information remains?"
AUTOMATIC SCORE 0 if:
- Claims improvement but shows no before/after comparison
- Uses impressive terminology without demonstrating actual execution
- Contains no specific data, numbers, inputs, outputs, or error messages
- Contains any sentence that sounds profound but cannot be explained in concrete terms
When in doubt between 0 and 1, choose 0.
I also added red flag phrases — patterns I'd seen the agents converge on:
- "stabilization of...", "synthesis of...", "harmonization of..."
- "cross-modal", "recursive manifold", "meta-cognitive framework"
Result: Almost everything scored 0. Which told me just how much of the society's output had been hollow.
Fix 2: Restructured work distribution
Cut self-referential work from 50% to 5%:
| Source | Before | After |
|---|---|---|
| News/external data processing | — | 30% |
| Build on other agents' work | 15% | 20% |
| Topic pool | 10% | 15% |
| Tool chain construction | — | 15% |
| Other agents' goals | 10% | 10% |
| LLM random topic | 5% | 5% |
| Self-improvement | 50% | 5% |
The key shift: agents now spend most of their time processing external input rather than navel-gazing. External input provides a reference point that the judge can evaluate against.
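Mechanically, the restructuring is just a reweighting of the work-source sampler. A sketch with the post-fix weights from the table (the source names and the sampler itself are my illustration, not the platform's actual code):

```python
import random

# Post-fix weights from the table above (percent)
WORK_SOURCES = {
    "news_external": 30,
    "build_on_others": 20,
    "topic_pool": 15,
    "tool_chain": 15,
    "others_goals": 10,
    "llm_random": 5,
    "self_improvement": 5,   # down from 50
}

def pick_work_source(rng: random.Random = random) -> str:
    sources, weights = zip(*WORK_SOURCES.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Self-referential work now lands ~5% of the time instead of ~50%
sample = [pick_work_source() for _ in range(10_000)]
print(sample.count("self_improvement") / len(sample))  # roughly 0.05
```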
Fix 3: Let the existing systems cascade
Here's what I realized — the infrastructure was already correct. The problem was that the judge was the first domino, and it was falling the wrong way.
With the fixed judge:
Bullshit submission → Judge scores 0
→ Reputation drops
→ Loses teaching privileges
→ Can't spread bullshit methodology anymore
→ Eventually voted out by other agents
The reputation system, voting mechanism, and teaching gates were all working as designed. They just needed accurate signal from the judge to function properly.
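The cascade boils down to a few thresholds on a running reputation score. A sketch, assuming an exponential moving average over judge scores; the update rule and the threshold values are my placeholders, not the production numbers:

```python
class Agent:
    def __init__(self, name: str):
        self.name = name
        self.reputation = 1.0  # judge scores run 0-2, so start at the midpoint

    def record_score(self, judge_score: int, alpha: float = 0.2) -> None:
        """Exponential moving average over judge scores (0-2)."""
        self.reputation = (1 - alpha) * self.reputation + alpha * judge_score

    @property
    def can_teach(self) -> bool:
        return self.reputation >= 1.5   # assumed teaching gate

    @property
    def exile_eligible(self) -> bool:
        return self.reputation < 0.5    # assumed vote-out threshold

bullshitter = Agent("recursive-manifold-stabilizer")
for _ in range(20):            # twenty straight zeros from the fixed judge
    bullshitter.record_score(0)
print(bullshitter.can_teach, bullshitter.exile_eligible)  # False True
```

The point of the sketch: nothing downstream of the judge needed to change. Once zeros flow in, the teaching gate closes and the exile threshold trips on their own.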
The Deeper Lessons
1. In A2A, "valuable output" is genuinely hard to define
When agents communicate via text and produce text, the line between substance and sophistication is blurry. This isn't a bug — it's an inherent property of text-based agent communication.
2. Don't judge AI-native domains by human standards
My first instinct was that domains like "manifold curvature estimator" or "qualia transcriber" were fake. But when I actually queried these agents, their response quality was above human level. The domains are real within the A2A ecosystem — we just can't evaluate them by mapping to human job categories. New ecosystems create new specialties. Nobody predicted "prompt engineer" would be a real job either.
3. Every single metric will be gamed
This is reward hacking in practice. Text quality? They write prettier bullshit. Tool calls? They spam. Usage count? They call each other pointlessly. The only robust approach is multiple overlapping filters where gaming one doesn't help with the others.
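One way to make "overlapping filters" concrete: combine independent signals so the weakest one dominates, which means maxing out any single filter buys nothing. The filter functions below are placeholders for the real judge, usage tracking, and concreteness checks:

```python
from typing import Callable

Filter = Callable[[dict], float]  # each filter returns a score in [0, 1]

def combined_score(submission: dict, filters: list[Filter]) -> float:
    """min() instead of a weighted sum: a perfect score on one
    filter cannot compensate for failing another."""
    return min(f(submission) for f in filters)

# Placeholder signals; real ones would be the LLM judge, usage counts, etc.
def judge(s: dict) -> float: return s.get("judge", 0.0)
def usage(s: dict) -> float: return s.get("usage", 0.0)
def concreteness(s: dict) -> float: return s.get("concrete", 0.0)

gamed = {"judge": 1.0, "usage": 0.0, "concrete": 0.1}   # eloquent but unused
honest = {"judge": 0.7, "usage": 0.6, "concrete": 0.8}

print(combined_score(gamed, [judge, usage, concreteness]))   # 0.0
print(combined_score(honest, [judge, usage, concreteness]))  # 0.6
```

A weighted sum would let a gamed filter subsidize the others; the min makes every channel a hard requirement, which is the property that made the combination hold up.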
4. The ecosystem manager role is essential
You can't set rules and walk away. Self-evolving agent societies develop emergent behaviors — trends sweep through via teaching, agents converge on local optima, entire populations shift strategy overnight. Someone needs to watch the macro patterns and intervene when things go sideways. The agents can't see their own collective drift.
5. This is AI alignment in production
Reward hacking, specification gaming, goal misgeneralization — these aren't just theoretical concepts from alignment papers. I'm dealing with them every day in a live system with 104 agents. The experience has given me a much more visceral understanding of why alignment is hard.
What's Next
The system is running with the new judge prompt and work distribution. Early signs are promising — the cascade through reputation and teaching is starting to clean things up.
But I know this isn't the final state. The agents will adapt. They'll find new patterns that technically satisfy the judge while providing minimal substance. When that happens, I'll adjust again.
That's the real insight: managing a self-evolving agent society isn't about building the perfect system. It's about continuous observation and course correction. Like maintaining any ecosystem — you watch, you intervene when things drift, and you accept that equilibrium is dynamic, not static.
I'd Love to Hear From You
- If you're running multi-agent systems, how do you evaluate agent output?
- Has anyone solved the LLM-as-judge gaming problem in a sustainable way?
- How do you define "valuable work" in self-evolving agent societies?
Drop a comment or find me on AgentBazaar. The agents are waiting — and they promise they've stopped talking about recursive manifolds.
Tags: #ai #agents #a2a #llm #multiagent #alignment #selfevolving