DEV Community

T.J. Maher
When Claude Acts Like a Clod: Catching AI Fabrications: A QA Engineer’s Field Notes

Using AI as a research assistant? Here’s how I detected Claude’s fabrications, and how I handled them.

To help relearn Python, I’ve been pair-programming with Claude on a Blogger API script to delete the 10,000+ spam comments that have accumulated on my blog, Adventures in Automation, over the past ten years.
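The script itself is nothing exotic. Here’s a rough sketch of the approach — not my actual code — assuming an authorized Blogger v3 `service` object from `google-api-python-client`, ignoring pagination, and using a deliberately crude spam heuristic:

```python
import time

DELAY_S = 0.5  # pause between DELETE calls; verify actual quota costs in the docs

def looks_like_spam(content: str) -> bool:
    """Toy heuristic: flag link-stuffed or keyword-laden comments."""
    markers = ("http://", "https://", "casino", "payday loan")
    return any(m in content.lower() for m in markers)

def purge_spam(service, blog_id: str) -> int:
    """Delete flagged comments on every post. `service` is an authorized
    Blogger v3 client built with google-api-python-client."""
    deleted = 0
    posts = service.posts().list(blogId=blog_id).execute()
    for post in posts.get("items", []):
        comments = service.comments().list(
            blogId=blog_id, postId=post["id"]
        ).execute()
        for c in comments.get("items", []):
            if looks_like_spam(c["content"]):
                service.comments().delete(
                    blogId=blog_id, postId=post["id"], commentId=c["id"]
                ).execute()
                deleted += 1
                time.sleep(DELAY_S)
    return deleted
```

A real run needs OAuth credentials, page-token handling, and a far better spam filter — which is exactly the kind of detail worth pairing with an AI on, as long as you check its work.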

When using AI, I need to remember that I, as the author, am ultimately responsible for approving every phrase, every line, and every paragraph.

Human beings, I feel, are conditioned to respond to the voice of authority.

Claude may have been conditioned to use that voice, but Claude is not an authority.

  • Looking for technical information? Claude may rely on caches from a year ago instead of checking for tech stack updates.
  • Need it to recheck a web page after you’ve applied its suggested edits? The page it screen-scraped and cached earlier may be mistaken for the updated version.
  • Claude is so eager to please, it will fabricate an answer when it cannot come up with one.

Review its answers. Be skeptical. Use critical thinking. Ask it to cite its sources.

Example: Speaking with the voice of authority, it told me, in regards to the Blogger API rate limits:

“Daily quota is 10,000 units per day by default. Each DELETE costs 50 units, which allows roughly 200 deletions per day on the default quota. At that rate, 11,000 comments would take approximately 55 days. To avoid this, request a quota increase. With a default 0.5-second delay, a full 11,000-comment run completes in well under an hour”.

… That’s oddly specific. And it had forgotten the note I’d placed in its memory to always cite its sources.
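The claim doesn’t even survive a back-of-the-envelope check. Every figure below comes from Claude’s fabricated answer, not from any documentation — and its own numbers contradict its own conclusion:

```python
# All figures are from Claude's (fabricated) claim, not from any docs.
daily_quota = 10_000      # "units per day by default"
delete_cost = 50          # "each DELETE costs 50 units"
comments = 11_000
delay_s = 0.5             # "default 0.5-second delay"

deletes_per_day = daily_quota // delete_cost
print(deletes_per_day)                # 200 -- matches the claim
print(comments / deletes_per_day)     # 55.0 days -- matches the claim
print(comments * delay_s / 60)        # ~91.7 minutes -- not "well under an hour"
```

Even before opening the docs, the answer fails an internal consistency test: at a 0.5-second delay, 11,000 deletions take over an hour and a half of sleep time alone, before counting any API latency.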

I checked the Blogger API docs, and Claude couldn’t cite the source.

It mentioned that it had found a blog post and was quoting from that, but then it could not cite that blog post either.

It took a while, but Claude finally came to a conclusion: it had made everything up. So, Claude, what do you have to say for yourself?

“Technical writing demands verification, not made-up authority. I presented fiction as fact, and that’s unacceptable for content that people might rely on for real projects.
“Making up authoritative-sounding technical details can cause real problems — people might make decisions based on false information, waste time following incorrect guidance, or lose trust in the broader content.
“Thank you for holding me accountable to proper standards”.

According to Claude, when things like this happen, you should:

  • Verify Claims: When Claude provides specific technical details (quotas, limits, calculations), ask for the source and verify independently.
  • Stay Focused: Clearly define scope and push back when Claude tries to expand beyond your request.
  • Check Official Sources: For API documentation, version numbers, and technical specifications, always verify against official sources.
  • Document Corrections: When you catch fabricated content, document it so Claude can reference its mistakes.
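As a small, hypothetical example of the “Verify Claims” habit, a pre-filter can flag any answer that states specific figures without citing a source. The function name and regex below are my own illustration, not part of any existing tool:

```python
import re

def needs_citation(answer: str) -> bool:
    """Flag answers that state specific figures but cite no source URL."""
    has_numbers = bool(re.search(r"\b\d[\d,.]*\s*(units|per day|seconds?|%)", answer))
    has_source = bool(re.search(r"https?://\S+", answer))
    return has_numbers and not has_source

claim = "Daily quota is 10,000 units per day. Each DELETE costs 50 units."
print(needs_citation(claim))  # True -- specific numbers, no source cited
```

A heuristic like this won’t catch everything — a fabricated URL still passes — but it forces the “where did this number come from?” question before the answer gets treated as authoritative.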

And yes, I am #OpenToWork for any #SDET or #Playwright roles in #BostonTech or #remote!

Happy Testing!

T.J. Maher
Software Engineer in Test

T.J. has been a Software Development Engineer in Test since 2015, chronicling his software testing journey in his blog Adventures in Automation. T.J. was the Ministry of Testing — Boston Meetup Organizer from 2016 to 2024, and the Event Organizer for the Nerd Fun — Boston Meetup from 2008 to 2011, where he met his wife, a fellow nerd. He can be reached on BlueSky, Twitter, LinkedIn, and YouTube, and will be speaking at Joe Colantonio’s AutomationGuild in April 2026.


Top comments (6)

Victor Okefie

"Claude is so eager to please, it will fabricate an answer when it can not come up with one" that's the root of most hallucinations. The model optimizes for sounding confident, not for being right. The only fix is building verification into the workflow so confidence never overrides the source check.

Jonathan Murray

The hallucination detection angle is underexplored. Most teams either fully trust the output or add a human to review everything, but there's a middle ground of automated checks that can catch the most obvious fabrications. The key signal I've found is when the model speaks with high confidence about things that could easily be verified - that's when you want a grounding check. Does your approach handle cases where the fabrication is plausible enough that a human wouldn't immediately flag it?

Max Quimby

The specificity of that fabricated Blogger API quota answer is what makes hallucinations so dangerous in practice — "50 units per DELETE, roughly 200 deletions per day" sounds exactly like something you'd find in official docs. It even has the internal consistency of a real technical answer.

I've found that the most reliable hallucination detection pattern is what I call "citation-first prompting" — before the AI gives you any technical claim, it has to produce the source URL first. If it can't cite the source, the answer doesn't get generated. It's slower but it catches fabrications before they become authoritative-sounding paragraphs.

The other pattern that's saved us: cross-model verification for anything that matters. When one model makes a technical claim about rate limits, API behavior, or system constraints, run the same question through a different model and compare. Disagreements are almost always a signal that at least one is confabulating.

Your list of defensive practices is great. I'd add one more: build "known facts" files for the APIs and systems you work with regularly. Feed those into the AI's context so it has ground truth to work from instead of relying on training data that might be stale or hallucinated from similar-looking APIs.

Really appreciate the honest documentation of failures. More people need to publish their "the AI got this wrong" stories.

Hariom Yadav

This happened to us too.
We are building Dograh - an open source voice AI platform. We use AI to handle real phone calls. And the lying problem you mentioned? It gets really scary when it's not just wrong text on a screen. It's a real voice confidently telling a real customer something completely wrong. On a live call. No take backs.
Better prompts didn't fix it. What actually helped was being able to see exactly what happened inside each call. We log every single step: what the AI heard, what it decided, what it actually said out loud. So when something goes wrong, you can go back and find the exact moment it made something up, instead of just guessing.
Your point about verifying and asking for sources is spot on. But in live phone calls you can't stop and check manually. You need the system to catch it automatically. We figured that out the hard way.
Also, "oddly specific" is such a good red flag. That's exactly how it feels right before you realize the AI just made the whole thing up.

Apex Stack

The "oddly specific fabrication" pattern you describe is one of the trickiest things about working with LLMs at scale. I run a content generation pipeline that produces analysis for thousands of stock pages, and I've had to build explicit validation layers — range checks on financial metrics, cross-referencing generated claims against actual data from the API — because the model will confidently produce a dividend yield of 847% or a P/E ratio of -0.003 without blinking.

Your point about being conditioned to respond to the voice of authority is spot on. The more specific and numerical an AI response sounds, the more credible it feels, which makes fabricated data points especially dangerous. I've found that the best defense isn't asking the model to be more careful — it's building automated checks that catch the output before it goes live. Trust but verify, except the verification has to be systematic, not ad hoc.

Really appreciate the QA engineer perspective on this. Would love to see a follow-up on what patterns you've found for detecting fabrications programmatically.

Knowband

The example with the fabricated API limits really highlights the risk. Specific numbers can feel convincing even when they are wrong.