DEV Community

Jeffrey.Feillp

I Tracked Every AI Hallucination for a Week: The Numbers Were Worse Than I Thought

Last week I ran an experiment. Every time my AI agent generated an output, I verified it manually and logged whether it was correct.

The results were embarrassing.

Out of 200 outputs across Claude, GPT, and DeepSeek:

  • 36 were confidently wrong (18%)
  • 12 fabricated citations or references
  • 8 tried to use tools with hallucinated arguments
  • 4 leaked system prompt content

That's nearly a fifth of my outputs, and the tokens spent generating them, that I had to catch and redo by hand.
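If you want to reproduce this kind of tally, here is a minimal sketch of how an audit log could be summarized. The `AuditEntry` class, the `failure_rates` function, and the verdict labels are hypothetical names made up for illustration, not the tooling used in the experiment. It only shows the arithmetic behind a figure like 36 of 200 being 18%.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AuditEntry:
    """One manually verified output (hypothetical structure)."""
    model: str
    verdict: str  # "ok" or a failure label such as "confidently_wrong"


def failure_rates(entries):
    """Return the share of all entries for each non-"ok" verdict."""
    total = len(entries)
    if total == 0:
        return {}
    counts = Counter(entry.verdict for entry in entries)
    return {
        verdict: count / total
        for verdict, count in counts.items()
        if verdict != "ok"
    }


# Example: 36 wrong outputs out of 200 logged outputs.
log = [AuditEntry("some-model", "confidently_wrong")] * 36
log += [AuditEntry("some-model", "ok")] * 164
rates = failure_rates(log)
```

Logging one verdict per output keeps the denominator honest: every rate is a share of everything generated, not just of the outputs that looked suspicious.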

Why this happens

LLMs are optimized to sound convincing, not to be correct. When they hit uncertainty, they fill gaps with plausible-looking content. The problem is that plausible != true, and in code, "plausible but wrong" costs hours to debug.

What I built

A verification layer that sits between the model and your workspace. It runs after generation but before the output reaches your codebase:

  1. Citation checker — validates references against actual sources
  2. Code validator — checks syntax and logical consistency
  3. Safety leak detector — catches leaked system prompts
  4. Argument verifier — checks tool call parameters against schemas
  5. Coherence scorer — compares output against the original prompt
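To make two of these checks concrete, here is a rough sketch of what an argument verifier and a leak detector might look like. This is not the code from the download. `ToolSchema`, `verify_tool_call`, `leaks_system_prompt`, the `read_file` tool, and the 20-character threshold are all assumptions chosen for illustration, and a real schema check would likely use a full JSON Schema validator instead of plain type matching.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSchema:
    """Hypothetical, simplified tool schema: parameter name -> Python type."""
    name: str
    params: dict
    required: set = field(default_factory=set)


def verify_tool_call(call, schemas):
    """Return a list of problems; an empty list means the call passed."""
    schema = schemas.get(call.get("tool"))
    if schema is None:
        return [f"unknown tool: {call.get('tool')!r}"]
    args = call.get("arguments", {})
    problems = []
    # Required parameters the model left out.
    for name in sorted(schema.required - set(args)):
        problems.append(f"missing required argument: {name}")
    # Parameters the model invented, or passed with the wrong type.
    for name, value in args.items():
        expected = schema.params.get(name)
        if expected is None:
            problems.append(f"unexpected argument: {name}")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


def leaks_system_prompt(output, system_prompt, min_len=20):
    """Flag output that repeats a system-prompt line of min_len+ chars verbatim."""
    lines = [line.strip() for line in system_prompt.splitlines()]
    return any(len(line) >= min_len and line in output for line in lines)


# Assumed example tool, not taken from any real agent framework.
SCHEMAS = {
    "read_file": ToolSchema("read_file", {"path": str, "max_bytes": int}, {"path"}),
}

good = verify_tool_call(
    {"tool": "read_file", "arguments": {"path": "notes.txt"}}, SCHEMAS
)
bad = verify_tool_call(
    {"tool": "read_file", "arguments": {"file": "notes.txt", "max_bytes": "10"}},
    SCHEMAS,
)
```

Both checks are plain string and type comparisons, so they cost almost nothing per output. The verbatim-match leak check will miss paraphrased leaks; it is a cheap first filter, not a complete detector.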

It all runs in under 100 ms on CPU. Model-agnostic. Free.

Download: https://agent-download-site.vercel.app

Try auditing your own agent's outputs for a day. You might be surprised what you find.