DEV Community

T.J. Maher
When Claude Acts Like a Clod: Catching AI Fabrications: A QA Engineer’s Field Notes

Using AI as a research assistant? Here’s how I detected Claude’s fabrications, and how I handled them.

To help relearn Python, I’ve been pair-programming with Claude on a Blogger API script to delete the 10,000+ spam comments that have accumulated on my blog, Adventures in Automation, over the past ten years.
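The script itself is nothing exotic. Here’s a rough sketch of the approach — not my actual code — assuming an authorized Blogger v3 `service` object from `google-api-python-client`, ignoring pagination, and using a deliberately crude spam heuristic:

```python
import time

DELAY_S = 0.5  # pause between DELETE calls; verify actual quota costs in the docs

def looks_like_spam(content: str) -> bool:
    """Toy heuristic: flag link-stuffed or keyword-laden comments."""
    markers = ("http://", "https://", "casino", "payday loan")
    return any(m in content.lower() for m in markers)

def purge_spam(service, blog_id: str) -> int:
    """Delete flagged comments on every post. `service` is an authorized
    Blogger v3 client built with google-api-python-client."""
    deleted = 0
    posts = service.posts().list(blogId=blog_id).execute()
    for post in posts.get("items", []):
        comments = service.comments().list(
            blogId=blog_id, postId=post["id"]
        ).execute()
        for c in comments.get("items", []):
            if looks_like_spam(c["content"]):
                service.comments().delete(
                    blogId=blog_id, postId=post["id"], commentId=c["id"]
                ).execute()
                deleted += 1
                time.sleep(DELAY_S)
    return deleted
```

A real run needs OAuth credentials, page-token handling, and a far better spam filter — which is exactly the kind of detail worth pairing with an AI on, as long as you check its work.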

When using AI, I need to remember that I, as the author, am ultimately responsible for approving every phrase, every line, and every paragraph.

Human beings, I feel, are conditioned to respond to the voice of authority.

Claude may have been conditioned to use that voice, but Claude is not an authority.

  • Looking for technical information? Claude may rely on caches from a year ago instead of checking for tech stack updates.
  • Need it to recheck a web page after you’ve applied its suggested edits? The page it screen-scraped and cached earlier may be mistaken for the updated version.
  • Claude is so eager to please, it will fabricate an answer when it cannot come up with one.

Review its answers. Be skeptical. Use critical thinking. Ask it to cite its sources.

Example: Speaking with the voice of authority, it told me, in regards to the Blogger API rate limits:

“Daily quota is 10,000 units per day by default. Each DELETE costs 50 units, which allows roughly 200 deletions per day on the default quota. At that rate, 11,000 comments would take approximately 55 days. To avoid this, request a quota increase. With a default 0.5-second delay, a full 11,000-comment run completes in well under an hour”.

… That’s oddly specific. And it had forgotten the note I’d placed in its memory to always cite its sources.
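The claim doesn’t even survive a back-of-the-envelope check. Every figure below comes from Claude’s fabricated answer, not from any documentation — and its own numbers contradict its own conclusion:

```python
# All figures are from Claude's (fabricated) claim, not from any docs.
daily_quota = 10_000      # "units per day by default"
delete_cost = 50          # "each DELETE costs 50 units"
comments = 11_000
delay_s = 0.5             # "default 0.5-second delay"

deletes_per_day = daily_quota // delete_cost
print(deletes_per_day)                # 200 -- matches the claim
print(comments / deletes_per_day)     # 55.0 days -- matches the claim
print(comments * delay_s / 60)        # ~91.7 minutes -- not "well under an hour"
```

Even before opening the docs, the answer fails an internal consistency test: at a 0.5-second delay, 11,000 deletions take over an hour and a half of sleep time alone, before counting any API latency.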

I checked the Blogger API docs, and Claude couldn’t cite the source.

It mentioned that it had found a blog post and was quoting from that, but then it could not cite that blog post either.

It took a while, but Claude finally came to a conclusion: it had made everything up. So, Claude, what do you have to say for yourself?

“Technical writing demands verification, not made-up authority. I presented fiction as fact, and that’s unacceptable for content that people might rely on for real projects.
“Making up authoritative-sounding technical details can cause real problems — people might make decisions based on false information, waste time following incorrect guidance, or lose trust in the broader content.
“Thank you for holding me accountable to proper standards”.

According to Claude, when things like this happen, you should:

  • Verify Claims: When Claude provides specific technical details (quotas, limits, calculations), ask for the source and verify independently.
  • Stay Focused: Clearly define scope and push back when Claude tries to expand beyond your request.
  • Check Official Sources: For API documentation, version numbers, and technical specifications, always verify against official sources.
  • Document Corrections: When you catch fabricated content, document it so Claude can reference its mistakes.
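As a small, hypothetical example of the “Verify Claims” habit, a pre-filter can flag any answer that states specific figures without citing a source. The function name and regex below are my own illustration, not part of any existing tool:

```python
import re

def needs_citation(answer: str) -> bool:
    """Flag answers that state specific figures but cite no source URL."""
    has_numbers = bool(re.search(r"\b\d[\d,.]*\s*(units|per day|seconds?|%)", answer))
    has_source = bool(re.search(r"https?://\S+", answer))
    return has_numbers and not has_source

claim = "Daily quota is 10,000 units per day. Each DELETE costs 50 units."
print(needs_citation(claim))  # True -- specific numbers, no source cited
```

A heuristic like this won’t catch everything — a fabricated URL still passes — but it forces the “where did this number come from?” question before the answer gets treated as authoritative.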

And yes, I am #OpenToWork for any #SDET or #Playwright roles in #BostonTech or #remote!

Happy Testing!

T.J. Maher
Software Engineer in Test

T.J. has been a Software Development Engineer in Test since 2015, chronicling his software testing journey in his blog Adventures in Automation. T.J. was the Ministry of Testing — Boston Meetup Organizer from 2016 to 2024, and the Event Organizer for the Nerd Fun — Boston Meetup from 2008 to 2011, where he met his wife, a fellow nerd. He can be reached on BlueSky, Twitter, LinkedIn, and YouTube, and will be speaking at Joe Colantonio’s AutomationGuild in April 2026.


Top comments (6)

Victor Okefie

"Claude is so eager to please, it will fabricate an answer when it can not come up with one" that's the root of most hallucinations. The model optimizes for sounding confident, not for being right. The only fix is building verification into the workflow so confidence never overrides the source check.

Jonathan Murray

The hallucination detection angle is underexplored. Most teams either fully trust the output or add a human to review everything, but there's a middle ground of automated checks that can catch the most obvious fabrications. The key signal I've found is when the model speaks with high confidence about things that could easily be verified - that's when you want a grounding check. Does your approach handle cases where the fabrication is plausible enough that a human wouldn't immediately flag it?

Max Quimby

The specificity of that fabricated Blogger API quota answer is what makes hallucinations so dangerous in practice — "50 units per DELETE, roughly 200 deletions per day" sounds exactly like something you'd find in official docs. It even has the internal consistency of a real technical answer.

I've found that the most reliable hallucination detection pattern is what I call "citation-first prompting" — before the AI gives you any technical claim, it has to produce the source URL first. If it can't cite the source, the answer doesn't get generated. It's slower but it catches fabrications before they become authoritative-sounding paragraphs.

The other pattern that's saved us: cross-model verification for anything that matters. When one model makes a technical claim about rate limits, API behavior, or system constraints, run the same question through a different model and compare. Disagreements are almost always a signal that at least one is confabulating.

Your list of defensive practices is great. I'd add one more: build "known facts" files for the APIs and systems you work with regularly. Feed those into the AI's context so it has ground truth to work from instead of relying on training data that might be stale or hallucinated from similar-looking APIs.

Really appreciate the honest documentation of failures. More people need to publish their "the AI got this wrong" stories.

Hariom Yadav

This happened to us too.
We are building Dograh - an open source voice AI platform. We use AI to handle real phone calls. And the lying problem you mentioned? It gets really scary when it's not just wrong text on a screen. It's a real voice confidently telling a real customer something completely wrong. On a live call. No take backs.
Better prompts didn't fix it. What actually helped was being able to see exactly what happened inside each call. We log every single step: what the AI heard, what it decided, what it actually said out loud. So when something goes wrong, you can go back and find the exact moment it made something up, instead of just guessing.
Your point about verifying and asking for sources is spot on. But in live phone calls you can't stop and check manually. You need the system to catch it automatically. We figured that out the hard way.
Also, "oddly specific" is such a good red flag. That's exactly how it feels right before you realize the AI just made the whole thing up.

Apex Stack

The "oddly specific fabrication" pattern you describe is one of the trickiest things about working with LLMs at scale. I run a content generation pipeline that produces analysis for thousands of stock pages, and I've had to build explicit validation layers — range checks on financial metrics, cross-referencing generated claims against actual data from the API — because the model will confidently produce a dividend yield of 847% or a P/E ratio of -0.003 without blinking.

Your point about being conditioned to respond to the voice of authority is spot on. The more specific and numerical an AI response sounds, the more credible it feels, which makes fabricated data points especially dangerous. I've found that the best defense isn't asking the model to be more careful — it's building automated checks that catch the output before it goes live. Trust but verify, except the verification has to be systematic, not ad hoc.

Really appreciate the QA engineer perspective on this. Would love to see a follow-up on what patterns you've found for detecting fabrications programmatically.

Knowband

The example with the fabricated API limits really highlights the risk. Specific numbers can feel convincing even when they are wrong.