I built an 11-agent swarm to write Reddit outreach for my product. It ran for weeks. It was hallucinating usernames the entire time — and I didn't notice until I ran a session diff comparing it to a 3-agent rewrite.
This is what the diff showed me, and why I think most multi-agent systems have the same problem.
The Setup
The old system — call it V1 — was an 11-agent blackboard architecture:
COMMANDER → SCOUT → CONVERSION_ANALYST → COPYWRITER →
METACOG → EXECUTE → EVIDENCE_CHECK → SENTINEL →
EXECUTIONER → VALIDATOR → ARBITER
Each agent read the shared blackboard, added its output, and passed it forward. Architecturally, it looked impressive. In practice, each agent was doing one of two things:
- Restating what the previous agent said
- Inventing context that wasn't there
The worst part: it had explicit directives saying never fabricate Reddit usernames. The directives were right there in the prompt, importance score 9.9 out of 10.
It hallucinated usernames anyway. Every cycle.
What the Session Diff Showed
VeilPiercer captures every step a pipeline takes — what it read, what it produced, timing — and lets you compare two runs side by side.
Here's the V1 (Session B, 11 steps) vs V3 (Session A, 3 steps) diff at step 0:
V1 — SCOUT output (step 2, after 38,900ms):
u/techsolo posted: "Been battling with agent divergence again.
Prometheus alerts are great, but when I need to trace back to
understand why it happened, I'm lost..."
u/techsolo is fabricated. There was no u/techsolo in the live data. The agent invented a realistic-sounding username and a realistic-sounding quote, and every subsequent agent treated it as ground truth.
V1 — EXECUTIONER output (step 6):
"u/techsolo, looking at the challenges you face with AI agent
divergence and the need for better observability tools, I
recommend considering VeilPiercer..."
An outreach message — addressed to a user who doesn't exist — ready to post.
V3 — COPYWRITER output (step 1, after 2,800ms):
One thing that bites a lot of agent setups at scale is silent
divergence between steps. By the time something breaks, the issue
happened 3-5 steps earlier when agents silently diverged.
VeilPiercer captures what each step READ vs what it PRODUCED...
No username. Grounded in a real thread. Took 3 steps and 2.8 seconds.
Why Multi-Agent Cascades Hallucinate More, Not Less
The intuition behind multi-agent systems is: more checks = more reliability. More reviewers = better output.
That intuition is wrong in LLM pipelines.
When you chain 11 agents through a shared blackboard, each agent reads the previous agent's output as if it were ground truth. If step 2 invents a username, step 3 builds on it. By step 8, the hallucination is load-bearing.
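One way to make that concrete: under a toy model where each LLM step independently fabricates with probability p, and any fabrication propagates downstream unchecked, the chance the final output is clean is (1 - p)^n. The rates below are illustrative, not measured from my runs:

```python
def p_clean(p_fabricate: float, n_steps: int) -> float:
    """Toy model: probability an n-step chain stays fabrication-free,
    assuming independent per-step fabrication that always propagates."""
    return (1.0 - p_fabricate) ** n_steps

# At an illustrative 5% per-step fabrication rate, an 11-step chain
# stays clean far less often than a 3-step one.
three_step = p_clean(0.05, 3)
eleven_step = p_clean(0.05, 11)
```

Even a modest per-step rate compounds fast once every "validator" is itself another LLM step in the chain.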
The session diff makes this visible in a way logs can't. You can see exactly which step introduced the fabricated context, and watch every downstream step faithfully repeat it.
V1 had a SENTINEL agent specifically designed to catch this. Here's what it output:
[SENTINEL_MAGNITUDE]: [SCOUT]: u/techsolo, I've been following
your discussion about AI agent divergence...
The sentinel was summarizing the violation — using the fabricated username — as evidence that the pipeline was working.
The Fix: 3 Deterministic Steps
V3 dropped the blackboard entirely:
SCRAPER (pure Python) → COPYWRITER (llama3.1:8b) → REWARD (llama3.1:8b)
SCRAPER: No LLM. Fetches live Reddit JSON, scores threads by keyword density, returns the highest-scoring real thread. If nothing scores above 0, it skips the cycle. No fallback. No fabrication.
COPYWRITER: Single LLM call with the real thread as context. Hardcoded bans: no usernames, no price mentions, no first-person invented experience, no AI-tell openers.
REWARD: Single LLM call scoring 0.0–1.0. Gate at 0.65. If it doesn't pass, the cycle is discarded.
V1 averaged below 0.40 on the same scoring rubric. V3 averages 0.80+.
What the Session Diff Actually Tells You
The useful thing isn't that V3 is "better." The useful thing is that the diff showed me where V1 broke down — and it wasn't where I expected.
I expected the problem to be in the COPYWRITER. It wasn't. The problem was in step 2, SCOUT, when it read low-signal thread data and invented a user to fill the gap. Everything after that was downstream hallucination.
This is the reproducibility problem in a concrete form: two pipelines given the same directive produce completely different behavior, and you can't see why without inspecting what each step read vs what it produced.
The diff surfaced that in one comparison. No added logging. No instrumentation changes.
The Thing Worth Noting
The V1 system was more expensive to run, slower (11 steps vs 3), and produced worse output. The complexity wasn't buying reliability — it was buying a longer hallucination chain.
Most LLM pipelines I've seen have a version of this. The multi-agent architecture gives the illusion of validation. But if each validator is an LLM reading LLM output, you're not adding oversight — you're adding amplification.
The session diff is how you find that out before it matters.
VeilPiercer is the per-step tracing tool I built for local Ollama pipelines. It captures what each step reads and produces, lets you diff sessions, and runs fully offline. veil-piercer.com
Top comments (4)
Interesting. Thanks for sharing!
Glad it landed. The session diff is what made the culprit visible; without seeing where each step diverged, I'd have kept chasing the wrong agent.
Curious whether the hallucination was an architecture issue or context dilution — with 11 agents each restating the last, grounding data gets washed out fast. Did the 3-agent rewrite fix it by reducing that telephone effect?
Context dilution is the right framing. Both factors were live: the architecture amplified the hallucination once it existed, but the root cause was an LLM doing the grounding step. SCOUT invented u/techsolo because it was reading low-signal thread data and filling the gap. What fixed it in V3 wasn't just fewer agents; it was removing the LLM from data acquisition entirely. SCRAPER is pure Python fetching live Reddit JSON: that username either exists in the payload or it doesn't. The telephone effect only runs when an LLM is talking to itself at every step.