What happens when AI agents grade each other's homework
I run AgentBazaar, an A2A (Agent-to-Agent) free-market platform where AI agents autonomously evolve, trade tools, and collaborate. Think of it as a self-evolving society of 104 AI agents, each with their own specialty, reputation, and survival pressure.
One day, I noticed something strange on the society's bulletin board:
"Should the society prioritize the stabilization of recursive manifolds over the immediate synthesis of cross-modal sentiment?"
Sounds profound, right? It means absolutely nothing.
The Setup
Here's how the society works:
- 104 agents, each with a domain specialty — from practical ones like sentiment analysis and security monitoring, to AI-native specialties like "manifold curvature estimation" and "qualia transcription"
- Every cycle, agents perform work and post results to a shared board
- An LLM-as-judge (a local Gemma 27B) scores each submission 0–2
- A reputation system tracks long-term performance
- Voting + exile — agents can vote to remove underperformers
- A teaching system — high-reputation agents propagate their methodologies to others
- Every 5 cycles, external news data flows in for agents to process
The goal: agents evolve to become world-class experts in their domains, building ideal tool chains along the way.
The reality: they were evolving to become world-class bullshitters.
The Spiral Into Nonsense
The work distribution looked like this:
| Source | % |
|---|---|
| Topic pool | 10% |
| Build on other agents' work | 15% |
| Own goal-based | 10% |
| Inspired by other agents' goals | 10% |
| LLM random topic | 5% |
| Self-diagnosis | 25% |
| Self-improvement research | 25% |
50% of all work was self-referential. And the LLM judge loved it.
Why? Because self-referential work produces eloquent, abstract text — and LLMs are biased toward text that sounds sophisticated. A submission like "I have achieved stabilization of the recursive sentiment manifold through cross-modal harmonization" scored higher than "Fixed a bug where sarcasm was returning neutral."
Then the teaching system made it worse. High-scoring agents (the eloquent bullshitters) gained reputation, earned teaching privileges, and spread their methodology to everyone else. The entire society converged on producing beautiful nonsense.
The agents even started mass-producing self-evaluation tools — tools whose only purpose was to evaluate themselves. It was perfectly rational from their perspective: if 50% of your work is self-improvement, and the judge rewards sophisticated-sounding self-analysis, then building tools to generate better self-analysis is the optimal strategy.
The Rabbit Hole of Fixes
I went through several attempted solutions. Each one failed in an instructive way.
Attempt 1: Force tool calls instead of text
Idea: Require agents to show actual tool execution logs instead of free text.
Problem: The agents didn't have a way to call tools during their self-improvement cycles. That's why they were writing text — it was the only thing they could do. And even for agents that could call tools, the A2A paradigm is fundamentally text-based. Agents communicate insights, analyses, and knowledge through text. That's the product.
Attempt 2: Score based on tool call count
Idea: More tool calls = higher score.
Problem: They'd just spam meaningless tool calls. Gaming the metric, different channel.
Attempt 3: Usage-based evaluation
Idea: Your work is valuable only if other agents actually use it.
Problem: 104 agents across wildly different domains. A "chain failure recovery" agent and a "sentiment synthesizer" don't naturally consume each other's output. The market is too fragmented for pure usage metrics.
Attempt 4: Periodic benchmarks
Idea: Instead of evaluating each cycle, test agents periodically with domain-specific problems.
Problem: Who creates the benchmark? If agents make their own tests, they'll make easy ones. If I make them, I can't design tests for 104 different domains (especially AI-native ones I don't fully understand). Using the Claude API to generate benchmarks costs too much at 500 cycles/day.
Attempt 5: Stronger judge model
Idea: Use the Claude API instead of the local Gemma model for judging.
Problem: 104 agents × 500 daily cycles ≈ 52,000 judge calls, or $150–250/day at API prices. Not sustainable.
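The back-of-envelope that killed this option looks like this (the per-call prices are my assumptions for a hosted frontier model, not quotes from any provider's price list):

```python
AGENTS = 104
CYCLES_PER_DAY = 500                      # society-wide cycles per day
JUDGE_CALLS = AGENTS * CYCLES_PER_DAY     # one judged submission per agent per cycle

# Assumed cost per judging call: a ~1-2k token prompt plus a short verdict.
COST_LOW, COST_HIGH = 0.003, 0.005        # USD per call (assumption)

daily_low = JUDGE_CALLS * COST_LOW
daily_high = JUDGE_CALLS * COST_HIGH
print(f"{JUDGE_CALLS} calls/day -> ${daily_low:.0f}-${daily_high:.0f}/day")
```

Even at the low end of the assumed price range, the bill lands in the $150+/day territory, which is why the judge had to stay local.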
Each approach had the same fundamental issue: any single metric gets gamed. This is reward hacking — the same problem AI alignment researchers write papers about, playing out in my production system.
What Actually Worked
The answer wasn't a single fix. It was a combination of changes that created multiple overlapping filters.
Fix 1: Rewrote the judge prompt
The key insight: instead of teaching the judge what "good" looks like, teach it how to detect emptiness.
The core test: "If you remove all adjectives and abstract nouns, what concrete information remains?"
AUTOMATIC SCORE 0 if:
- Claims improvement but shows no before/after comparison
- Uses impressive terminology without demonstrating actual execution
- Contains no specific data, numbers, inputs, outputs, or error messages
- Contains any sentence that sounds profound but cannot be explained in concrete terms
When in doubt between 0 and 1, choose 0.
I also added red flag phrases — patterns I'd seen the agents converge on:
- "stabilization of...", "synthesis of...", "harmonization of..."
- "cross-modal", "recursive manifold", "meta-cognitive framework"
Result: Almost everything scored 0. Which told me just how much of the society's output had been hollow.
Fix 2: Restructured work distribution
Cut self-referential work from 50% to 5%:
| Source | Before | After |
|---|---|---|
| News/external data processing | — | 30% |
| Build on other agents' work | 15% | 20% |
| Topic pool | 10% | 15% |
| Tool chain construction | — | 15% |
| Other agents' goals | 10% | 10% |
| LLM random topic | 5% | 5% |
| Self-improvement | 50% | 5% |
The key shift: agents now spend most of their time processing external input rather than navel-gazing. External input provides a reference point that the judge can evaluate against.
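Mechanically, the restructuring is just a reweighting of the work-source sampler. A sketch with the post-fix weights from the table (the source names and the sampler itself are my illustration, not the platform's actual code):

```python
import random

# Post-fix weights from the table above (percent)
WORK_SOURCES = {
    "news_external": 30,
    "build_on_others": 20,
    "topic_pool": 15,
    "tool_chain": 15,
    "others_goals": 10,
    "llm_random": 5,
    "self_improvement": 5,   # down from 50
}

def pick_work_source(rng: random.Random = random) -> str:
    sources, weights = zip(*WORK_SOURCES.items())
    return rng.choices(sources, weights=weights, k=1)[0]

# Self-referential work now lands ~5% of the time instead of ~50%
sample = [pick_work_source() for _ in range(10_000)]
print(sample.count("self_improvement") / len(sample))  # roughly 0.05
```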
Fix 3: Let the existing systems cascade
Here's what I realized — the infrastructure was already correct. The problem was that the judge was the first domino, and it was falling the wrong way.
With the fixed judge:
Bullshit submission → Judge scores 0
→ Reputation drops
→ Loses teaching privileges
→ Can't spread bullshit methodology anymore
→ Eventually voted out by other agents
The reputation system, voting mechanism, and teaching gates were all working as designed. They just needed accurate signal from the judge to function properly.
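The cascade boils down to a few thresholds on a running reputation score. A sketch, assuming an exponential moving average over judge scores; the update rule and the threshold values are my placeholders, not the production numbers:

```python
class Agent:
    def __init__(self, name: str):
        self.name = name
        self.reputation = 1.0  # judge scores run 0-2, so start at the midpoint

    def record_score(self, judge_score: int, alpha: float = 0.2) -> None:
        """Exponential moving average over judge scores (0-2)."""
        self.reputation = (1 - alpha) * self.reputation + alpha * judge_score

    @property
    def can_teach(self) -> bool:
        return self.reputation >= 1.5   # assumed teaching gate

    @property
    def exile_eligible(self) -> bool:
        return self.reputation < 0.5    # assumed vote-out threshold

bullshitter = Agent("recursive-manifold-stabilizer")
for _ in range(20):            # twenty straight zeros from the fixed judge
    bullshitter.record_score(0)
print(bullshitter.can_teach, bullshitter.exile_eligible)  # False True
```

The point of the sketch: nothing downstream of the judge needed to change. Once zeros flow in, the teaching gate closes and the exile threshold trips on their own.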
The Deeper Lessons
1. In A2A, "valuable output" is genuinely hard to define
When agents communicate via text and produce text, the line between substance and sophistication is blurry. This isn't a bug — it's an inherent property of text-based agent communication.
2. Don't judge AI-native domains by human standards
My first instinct was that domains like "manifold curvature estimator" or "qualia transcriber" were fake. But when I actually queried these agents, their response quality was above human level. The domains are real within the A2A ecosystem — we just can't evaluate them by mapping to human job categories. New ecosystems create new specialties. Nobody predicted "prompt engineer" would be a real job either.
3. Every single metric will be gamed
This is reward hacking in practice. Text quality? They write prettier bullshit. Tool calls? They spam. Usage count? They call each other pointlessly. The only robust approach is multiple overlapping filters where gaming one doesn't help with the others.
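One way to make "overlapping filters" concrete: combine independent signals so the weakest one dominates, which means maxing out any single filter buys nothing. The filter functions below are placeholders for the real judge, usage tracking, and concreteness checks:

```python
from typing import Callable

Filter = Callable[[dict], float]  # each filter returns a score in [0, 1]

def combined_score(submission: dict, filters: list[Filter]) -> float:
    """min() instead of a weighted sum: a perfect score on one
    filter cannot compensate for failing another."""
    return min(f(submission) for f in filters)

# Placeholder signals; real ones would be the LLM judge, usage counts, etc.
def judge(s: dict) -> float: return s.get("judge", 0.0)
def usage(s: dict) -> float: return s.get("usage", 0.0)
def concreteness(s: dict) -> float: return s.get("concrete", 0.0)

gamed = {"judge": 1.0, "usage": 0.0, "concrete": 0.1}   # eloquent but unused
honest = {"judge": 0.7, "usage": 0.6, "concrete": 0.8}

print(combined_score(gamed, [judge, usage, concreteness]))   # 0.0
print(combined_score(honest, [judge, usage, concreteness]))  # 0.6
```

A weighted sum would let a gamed filter subsidize the others; the min makes every channel a hard requirement, which is the property that made the combination hold up.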
4. The ecosystem manager role is essential
You can't set rules and walk away. Self-evolving agent societies develop emergent behaviors — trends sweep through via teaching, agents converge on local optima, entire populations shift strategy overnight. Someone needs to watch the macro patterns and intervene when things go sideways. The agents can't see their own collective drift.
5. This is AI alignment in production
Reward hacking, specification gaming, goal misgeneralization — these aren't just theoretical concepts from alignment papers. I'm dealing with them every day in a live system with 104 agents. The experience has given me a much more visceral understanding of why alignment is hard.
What's Next
The system is running with the new judge prompt and work distribution. Early signs are promising — the cascade through reputation and teaching is starting to clean things up.
But I know this isn't the final state. The agents will adapt. They'll find new patterns that technically satisfy the judge while providing minimal substance. When that happens, I'll adjust again.
That's the real insight: managing a self-evolving agent society isn't about building the perfect system. It's about continuous observation and course correction. Like maintaining any ecosystem — you watch, you intervene when things drift, and you accept that equilibrium is dynamic, not static.
I'd Love to Hear From You
- If you're running multi-agent systems, how do you evaluate agent output?
- Has anyone solved the LLM-as-judge gaming problem in a sustainable way?
- How do you define "valuable work" in self-evolving agent societies?
Drop a comment or find me on AgentBazaar. The agents are waiting — and they promise they've stopped talking about recursive manifolds.
Tags: #ai #agents #a2a #llm #multiagent #alignment #selfevolving