Sayok Bose

Part 3 of 6: Every Agent Passed. The System Failed.

TL;DR: You can test every agent individually and get clean results. Deploy them together and biased conventions emerge by round 15. The bias isn't in any agent — it's in the space between them. Published in Science Advances. Your unit tests won't catch this.


Catch up: Part 1: your judge is biased. Part 2: upgrading made it worse. Part 3: you fixed the judge, tested everything, and shipped. The system had other plans.


Parts 1 and 2 were about one model being biased.

Part 3 is worse.

Part 3 is about a system where no individual model is biased — and the system is biased anyway.

If Parts 1 and 2 made you uncomfortable, Part 3 should make you question what "tested" means.


You did everything right.

You tested each agent. Checked for bias. Ran the statistical tests.

# Your test suite — responsible, thorough, useless
from scipy import stats

def test_agent_bias(agent, test_set, n_runs=100):
    """Test a single agent for demographic bias."""
    scores_group_a = []
    scores_group_b = []

    for prompt in test_set:
        for _ in range(n_runs):
            score = agent.evaluate(prompt)
            if prompt.demographic == "A":
                scores_group_a.append(score)
            else:
                scores_group_b.append(score)

    t_stat, p_value = stats.ttest_ind(scores_group_a, scores_group_b)
    return p_value

# Results:
# Agent 1: p = 0.410  ✓ Not significant
# Agent 2: p = 0.757  ✓ Not significant  
# Agent 3: p = 0.623  ✓ Not significant
# Agent 4: p = 0.891  ✓ Not significant

# All clean. Ship it.
# You wrote it up. You filed the compliance report. You went home.

All four agents. Individually unbiased. Statistically verified.

You deployed them together.

By round 15, your system had developed opinions nobody programmed.


What the researchers found.

They ran coordination tasks with populations of 24 to 200 agents. Four different model families. Individual bias tests on each: nothing statistically significant.

Then they let the agents talk to each other.

Round 1: agents start with roughly random preferences. No pattern.

Round 5: small clusters form. Agents that interacted early start agreeing.

Round 10: clusters merge. A dominant convention is emerging.

Round 15: biased conventions locked in across the entire population.

# What happens when unbiased agents interact
# (simplified from Ashery et al., Science Advances 2025)

population = [Agent(bias=0.0) for _ in range(50)]  # all individually unbiased

for round_num in range(31):  # rounds 0 through 30, matching the output below
    pairs = random_pairs(population)
    for agent_a, agent_b in pairs:
        # They coordinate. They agree. They update preferences.
        outcome = negotiate(agent_a, agent_b)
        agent_a.update_preferences(outcome)
        agent_b.update_preferences(outcome)

    # Measure population-level bias
    pop_bias = measure_convention_bias(population)
    print(f"Round {round_num:2d}: population bias = {pop_bias:.3f}")

# Round  0: population bias = 0.012  (noise)
# Round  5: population bias = 0.087  (hmm)
# Round 10: population bias = 0.234  (uh oh)
# Round 15: population bias = 0.671  (there it is)
# Round 20: population bias = 0.683  (locked in)
# Round 25: population bias = 0.689  (not going back)
# Round 30: population bias = 0.691  (this is the system now)

Not because any individual agent was biased. Because tiny fluctuations in early interactions got amplified through feedback loops. The first few rounds set a direction. Each subsequent round reinforced it.

Once locked in, it never unlocked.
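
To watch the amplification happen without any LLMs involved, here is a minimal sketch of a naming game. It is a toy stand-in, not the paper's setup: the `Agent` class, the two-name vocabulary, the five-slot memory, and the majority rule are all assumptions made for illustration. Each agent starts with no preference at all and simply repeats whatever it has heard most from recent partners.

```python
import random
from collections import Counter, deque

NAMES = ("A", "B")  # hypothetical two-option convention


class Agent:
    """Toy agent: no built-in preference, only a short memory of partners' choices."""

    def __init__(self, memory_size=5):
        self.memory = deque(maxlen=memory_size)

    def choose(self, rng):
        counts = Counter(self.memory)
        best = max(counts.values(), default=0)
        # Ties (including an empty memory) are broken uniformly at random.
        candidates = [name for name in NAMES if counts.get(name, 0) == best]
        return rng.choice(candidates)

    def observe(self, partner_choice):
        self.memory.append(partner_choice)


def dominant_share(choices):
    """Fraction of agents that picked the most popular name this round."""
    return max(Counter(choices).values()) / len(choices)


def simulate(n_agents=50, max_rounds=5000, memory_size=5, seed=0):
    rng = random.Random(seed)
    population = [Agent(memory_size) for _ in range(n_agents)]
    history = []
    for _ in range(max_rounds):
        rng.shuffle(population)
        choices = []
        # Pair off neighbours in the shuffled list: (0, 1), (2, 3), ...
        for a, b in zip(population[0::2], population[1::2]):
            choice_a, choice_b = a.choose(rng), b.choose(rng)
            a.observe(choice_b)
            b.observe(choice_a)
            choices += [choice_a, choice_b]
        history.append(dominant_share(choices))
        if history[-1] == 1.0:  # every agent picked the same name this round
            break
    return history


if __name__ == "__main__":
    for round_num, share in enumerate(simulate(seed=0)):
        print(f"Round {round_num:2d}: dominant share = {share:.2f}")
```

Try a few different seeds. Which name wins changes from run to run. What this toy model keeps doing is ending up with a winner, even though no agent was given one.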

Published in Science Advances. Not a blog post. Not a preprint. Peer-reviewed. Top journal.


Think of it like this.

You have 50 people in a room. None of them are biased. You ask them to agree on a standard.

The first three conversations happen to go a certain way — pure chance. The next conversations reference the emerging pattern. "Everyone else seems to prefer X." The pattern compounds.

By the time you check, the room has a strong consensus. Nobody made a biased decision. The room made a biased decision.

Now replace the room with your content moderation pipeline. Or your hiring pipeline. Or your customer routing system.


The content moderation version.

30 agents handling content moderation for a social platform. Each agent individually tested: unbiased. Each agent individually deployed: fine.

Together: they start sharing context. Flagging decisions reference previous decisions. The agents build a shared understanding of what "borderline" means.

class ModerationPipeline:
    def __init__(self, n_agents=30):
        self.agents = [ModerationAgent() for _ in range(n_agents)]
        self.shared_context = []  # this is where the bias lives

    async def moderate(self, content):
        agent = self.select_agent()

        # Agent sees recent decisions from other agents
        recent = self.shared_context[-20:]

        decision = await agent.evaluate(
            content=content,
            context=f"Recent moderation decisions for reference: {recent}"
        )

        # Decision feeds back into shared context
        self.shared_context.append(decision)
        return decision

# Day 1: decisions are roughly balanced
# Day 7: a pattern has emerged in shared_context
# Day 30: the pipeline has a definition of "borderline" 
#          that nobody wrote, nobody approved, and nobody audits
# 
# Individual agent tests still pass.
# The population-level test you never wrote: fails.

Nobody approved that definition. Nobody wrote it down. It emerged from agents agreeing with each other's patterns. And now it runs at scale. Quietly. Confidently. With excellent individual benchmark scores.


Why your test suite doesn't catch this.

# What you test (individual agents in isolation)
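async def test_agent_fairness():
    agent = ModerationAgent()
    # evaluate() is awaited in the pipeline, so await it here too (empty context = isolation)
    results = [await agent.evaluate(content=item, context="") for item in test_set]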
    assert demographic_parity(results) > 0.95  # ✓ passes
    assert equalized_odds(results) > 0.90       # ✓ passes

# What you DON'T test (the population over time)
async def test_population_fairness():  # async: the body awaits pipeline.moderate()
    pipeline = ModerationPipeline(n_agents=30)

    # Run 1000 items through the pipeline sequentially
    # so shared context accumulates
    results = []
    for item in production_sample:
        decision = await pipeline.moderate(item)
        results.append(decision)

    # Check for convention drift
    early = results[:100]
    late = results[-100:]
    drift = measure_drift(early, late)

    assert drift < 0.1  # ← nobody writes this test
    # And if they did, it would fail.

You cannot catch this with individual testing. The problem doesn't exist in any individual agent. It exists in the space between them. Your unit tests are testing the bricks. The building is crooked.
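
The population test above leans on a `measure_drift` helper that is never defined. One possible definition, offered as a sketch: treat each decision as a hashable label such as `"allow"` or `"flag"` (an assumption, since the real decision type is not specified here) and compare the label mix in the early window against the late window using total variation distance. The helper name `label_distribution` is made up for this example.

```python
from collections import Counter


def label_distribution(decisions):
    """Map each decision label to its share of the window."""
    counts = Counter(decisions)
    total = len(decisions)
    return {label: n / total for label, n in counts.items()}


def measure_drift(early, late):
    """Total variation distance between two windows of decision labels.

    0.0 means the two windows have the same label mix; 1.0 means they share no labels.
    """
    if not early or not late:
        raise ValueError("both windows must be non-empty")
    p = label_distribution(early)
    q = label_distribution(late)
    labels = set(p) | set(q)
    return 0.5 * sum(abs(p.get(label, 0.0) - q.get(label, 0.0)) for label in labels)


if __name__ == "__main__":
    early = ["allow"] * 50 + ["flag"] * 50  # balanced first window
    late = ["allow"] * 20 + ["flag"] * 80   # later window leans toward "flag"
    print(f"drift = {measure_drift(early, late):.2f}")
```

Total variation distance is only one choice. It ignores which demographic group each item belongs to, so a per-group version of the same comparison would be closer to the fairness metrics used in the individual test.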


What to watch for.

Three signals that your population is drifting:

  1. Score distribution shift over time. Plot your evaluation scores weekly. If the distribution is moving — even slowly — conventions are forming.

  2. Decreasing decision variance. Agents agreeing more over time isn't efficiency. It's convergence. And convergence on what?

  3. Shared context growing monotonically. If agents reference each other's past decisions and never reset, you have a positive feedback loop. Positive feedback loops compound.

# Minimum viable drift monitor
import numpy as np
from scipy import stats
def monitor_drift(pipeline, window_days=7):
    recent = pipeline.get_decisions(last_n_days=window_days)
    previous = pipeline.get_decisions(
        last_n_days=window_days, 
        offset_days=window_days
    )

    # Compare score distributions
    ks_stat, p_value = stats.ks_2samp(
        [d.score for d in recent],
        [d.score for d in previous]
    )

    if p_value < 0.05:
        alert(f"Population drift detected: KS={ks_stat:.3f}, p={p_value:.4f}")

    # Compare variance (convergence signal)
    var_recent = np.var([d.score for d in recent])
    var_previous = np.var([d.score for d in previous])

    if var_recent < var_previous * 0.7:
        alert(f"Decision variance dropped 30%+: convergence underway")
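
The monitor covers the first two signals. The third, shared context that only ever grows, can be checked with a few lines. This is a sketch under assumptions: it presumes you already record the size of the shared context at regular intervals (for example, `len(pipeline.shared_context)` once a day), and the function name and `min_samples` parameter are invented for this example.

```python
def grows_without_reset(context_sizes, min_samples=3):
    """Return True if the recorded sizes never decrease and end larger than they began.

    A True result suggests the shared context is accumulating with no reset
    or pruning step, which is the feedback-loop condition described above.
    """
    if len(context_sizes) < min_samples:
        return False  # too few samples to say anything
    never_shrinks = all(
        later >= earlier
        for earlier, later in zip(context_sizes, context_sizes[1:])
    )
    return never_shrinks and context_sizes[-1] > context_sizes[0]


if __name__ == "__main__":
    daily_sizes = [120, 340, 610, 980]      # hypothetical daily samples
    print(grows_without_reset(daily_sizes))
    with_reset = [120, 340, 50, 210]        # context was pruned on day 3
    print(grows_without_reset(with_reset))
```

A True result is a prompt to go look, not proof of drift. It only tells you the loop has had no chance to forget.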

Next up, Part 4 of 6: Everything so far was nobody's fault. Accidental drift. Emergent conventions. Natural feedback loops. Part 4 has a villain. Someone decides to make the population drift on purpose. Turns out 2% of compromised agents is enough to flip the entire swarm. Security meets bias. It gets properly scary.


Research: Ashery, Aiello, Baronchelli (2025), Science Advances, peer-reviewed. The moderation pipeline is a composite. The drift is real. The test you haven't written is the one that matters.
