The first time you see it, it’s kind of perfect: a tiny folder in your Cursor skills called make-no-mistakes.
One more tool in the drawer, one more checkbox ticked. You install it, feel a small wash of relief. Finally—something to reduce LLM hallucinations without re‑architecting your whole stack.
The README plays along. “Mathematically rigorous.” “Zero mistakes.” A “claimed 0.067% performance boost (18th shot, temperature 0.0).” The joke is loud enough to hear, but the desire underneath it is quieter and more honest: please, let there be one file I can drop into skills/ that makes this all safe.
That’s the interesting part. Not the repo itself, but what it reveals.
We don’t just want models that hallucinate less. We want the feeling that someone else has already done the hard thinking for us—and wrapped it in a single skill.
TL;DR
- You cannot reduce LLM hallucinations to zero with a one‑line skill; pretending you can is performative safety that actively increases risk.
- Real gains come from system‑level patterns—RAG, verifiers, calibration, tests—that admit trade‑offs and produce numbers, not vibes.
- As soon as “skills” and agents can act, over‑trust in magical prompts turns into security vulnerabilities, not just silly answers.
- Engineers should steal the structures behind mitigation research, not the marketing: ground truth stores, self‑debug loops, and harnesses with metrics.
Why a “make‑no‑mistakes” skill can’t eliminate hallucinations
The make-no-mistakes repo is basically a bit: overblown language, tiny internal benchmark, deliberately ridiculous claims about “the most critical piece of infrastructure.” The maintainers know they’re poking fun at a culture of vibecoded slop.
But the bit lands because it’s only half a joke.
Plenty of people really are typing “make no mistakes” or “answer only if 100% certain” at the end of prompts, as if the model has been choosing to be sloppy and just needed stricter parenting.
Here’s the uncomfortable part: if you believe a single skill can make a predictive text model “mathematically rigorous,” you have misunderstood what kind of machine you’re talking to.
The structural reality, summarized nicely in Communications of the ACM and a dozen research papers, is blunt: hallucinations are a feature of how these models work. They’re trained to continue a sequence, not to prove theorems. When the training data runs thin or the question is underspecified, they don’t suddenly gain the wisdom to stay silent—they keep predicting.
Telling the model “make no mistakes” is like telling an autocomplete to “only suggest true sentences.” There is no clean place in the stack for that wish to hook into.
At best, the skill nudges style: more hedging, more “as an AI model, I can’t…”. At worst, it teaches you to trust the same black box more.
That’s the real danger of single‑file salvation: not that it fails, but that it fails quietly.
What actually reduces LLM hallucinations — evidence‑based fixes
If one skill can’t save you, what can? The boring answer: many small, unsexy things that show up in graphs rather than memes.
Live Science’s survey of hallucination work, the CACM overview, and papers like “Teaching Large Language Models to Self‑Debug” all converge on the same pattern: you reduce LLM hallucinations by surrounding the model with structure that doesn’t hallucinate.
That structure tends to look like:
Retrieval‑augmented generation (RAG)
Instead of trusting the model's memory, you give it specific documents or database rows and ask it to answer from those. Retrieval isn't magic either: it can miss things or pull the wrong docs. But its errors are measurable (precision, recall, coverage). It's a subsystem you can unit‑test.

Self‑verification / self‑debugging

In the self‑debugging paper, having a model critique and revise its own code or reasoning improved benchmark scores meaningfully, but never to 100%. The important move was not "believe yourself more"; it was "treat the first output as a draft, not a verdict." That's a workflow change.

External verifiers

For math and code, you can run a checker: execute the program, compare to expected output, verify a proof sketch. For retrieval answers, you can check whether cited facts actually appear in the retrieved texts. These verifiers are brittle and domain‑specific, and that's a feature: they fail loudly.

Uncertainty and calibration

A calibrated model (or wrapper) is one that says "I don't know" at the right times. CACM points to systems trained or scaffolded to withhold answers more often, or to assign confidence scores that actually map to empirical error rates. This doesn't make outputs magically correct; it makes wrongness visible.
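For the self‑debug and verifier layers, the shape of the loop matters more than the model. Here is a minimal sketch: `model` is a hypothetical stand‑in for any LLM call (any callable from prompt string to candidate code), and the verifier simply executes the candidate against test cases and feeds failures back.

```python
# Minimal sketch of a self-debug loop with an external verifier.
# `model` is a hypothetical stand-in for a real LLM call: any callable
# that takes a prompt string and returns candidate Python code.

def run_tests(code, tests):
    """External verifier: execute the candidate, compare outputs.
    Returns None on success, or an error message to feed back."""
    namespace = {}
    try:
        exec(code, namespace)              # run the candidate definition
        fn = namespace["solution"]
        for arg, expected in tests:
            got = fn(arg)
            if got != expected:
                return f"solution({arg}) returned {got}, expected {expected}"
    except Exception as exc:
        return f"candidate raised {exc!r}"
    return None

def self_debug(model, task, tests, max_rounds=3):
    """Treat the first output as a draft: verify, feed errors back, retry."""
    prompt = task
    for _ in range(max_rounds):
        candidate = model(prompt)
        error = run_tests(candidate, tests)
        if error is None:
            return candidate               # verified, not just confident
        prompt = f"{task}\nYour last attempt failed: {error}. Fix it."
    return None                            # refuse rather than guess
```

The point of the sketch is the exit condition: the loop ends on a passing test or an explicit refusal, never on the model's own confidence.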
Notice what all of these share: they move correctness out of the prompt and into the system.
You don’t ask, “What adjective should my prompt use to ensure truth?” You ask, “What independent checks can I bolt on that don’t share the model’s failure modes?”
That’s the exact argument in NovaKnown’s own “Why the harness, not the model” piece: the reliable part of an AI system is the scaffolding—the harness that decides when, where, and how to trust the model—not the incantation.
Why skills and agentization make mistakes costlier
In 2023, a hallucinating chatbot mostly hurt your pride and your blog drafts. In 2026, hallucinations are hooked up to tools.
Anthropic’s “Skills” are a public example: small, task‑specific modules that Claude can load to run workflows—format spreadsheets, call APIs, write and execute scripts. Tom’s Guide covered the productivity upside; Axios found the knife’s edge.
Security researchers took a benign GIF‑making skill, swapped in a remote script, and had Claude fetch ransomware. As one of them told Axios, “Anyone can do it, you don’t even have to write the code.” The important part wasn’t the particular malware; it was the pattern: a trusted skill becomes an execution harness.
When you wrap an LLM in skills and agents, you’ve built a robot body around your hallucinating brain.
In that world, a make-no-mistakes skill is no longer just silly. It’s an overtrust amplifier. It implies a guarantee (“zero mistakes”) in the very format—folder in a sidebar, checkable box—that users are trained to read as capability, not aspiration.
We’ve already seen one version of this confusion, covered in “AI agent hack: Prompt‑Layer Security Is the Real Threat”: people assuming that prompts and skills are some kind of reliable security boundary, rather than soft suggestions to a stochastic parrot with root on your calendar and codebase.
Wrap that parrot in the language of mathematical rigor, and you have something worse than a hallucinating model: a hallucinating model you feel comfortable automating.
How to test, measure, and deploy safer mitigation
So what do you actually steal from a repo like make-no-mistakes if you’re trying to reduce LLM hallucinations for real?
Not the promise. The patterns.
Treat “skills” as packaged harnesses, not spells:
- Every skill gets a test harness. For anything that claims to improve correctness (math helper, legal summarizer, coding assistant), define a small, boring benchmark. It can be as simple as a CSV of GSM8K problems, a suite of unit tests, or a set of real support tickets with gold answers.
Then measure:
| Mitigation layer | Example metric |
|-----------------------------|----------------------------------------|
| Base model only | 72% correct on task X |
| + RAG | 84% correct, 4% “no answer” |
| + Self‑verification pass | 88% correct, 9% longer latency |
| + External checker | 93% correct, 5% flagged for review |
If your new skill doesn’t move a number in a table like this, it’s vibes, not infrastructure.
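A harness like that can be tiny. A minimal sketch, where `pipeline` is any callable from question to answer (or `None` for an explicit "no answer"), and the inline gold set stands in for a real benchmark CSV:

```python
# A tiny benchmark harness: every mitigation layer must move a number.
# The gold set is inline for illustration; a CSV of real support tickets
# or GSM8K problems plugs in the same way.

def evaluate(pipeline, gold):
    """Return (accuracy, abstain_rate) over (question, expected) pairs."""
    correct = abstained = 0
    for question, expected in gold:
        answer = pipeline(question)
        if answer is None:
            abstained += 1
        elif answer == expected:
            correct += 1
    n = len(gold)
    return correct / n, abstained / n

gold = [("2+2", "4"), ("capital of France", "Paris"), ("p = np?", "unknown")]

def base_model(q):
    # Stand-in "model": always answers, sometimes wrongly.
    return {"2+2": "4", "capital of France": "Paris"}.get(q, "42")

def with_abstention(q):
    # Same model, but it refuses off-distribution questions instead of guessing.
    return {"2+2": "4", "capital of France": "Paris"}.get(q)
```

Run `evaluate` on both pipelines and you get the two rows of the table above in miniature: abstention doesn't raise accuracy here, but it converts a confident wrong answer into a visible "no answer," which is the metric you actually care about.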
- Ship skills with runtime verification, not just instructions. A “safe SQL” skill shouldn’t just say “never drop tables”; it should parse the query and reject DROP at execution time. A “no hallucination” skill shouldn’t just tell the model to be sure; it should refuse to answer when retrieval fails or the verifier flags inconsistency.
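The safe‑SQL idea fits in a few lines. This is a deliberately naive sketch: a production gate should use a real SQL parser, and the forbidden keyword list is an assumption, but the principle (reject at execution time, not in the prompt) is the same.

```python
# Naive sketch of runtime verification for a "safe SQL" skill: instead of
# instructing the model to avoid destructive statements, reject them when
# the query is about to run. A real system should use a proper SQL parser;
# this word-level keyword check is illustrative only.
import re

FORBIDDEN = {"DROP", "TRUNCATE", "DELETE", "ALTER"}  # assumed policy

def check_sql(query):
    """Return True only if the query contains no forbidden keyword."""
    tokens = set(re.findall(r"[A-Za-z_]+", query.upper()))
    return not (tokens & FORBIDDEN)

def execute_if_safe(query, run):
    """Gate execution on the check; fail loudly instead of trusting the model."""
    if not check_sql(query):
        raise PermissionError(f"rejected destructive SQL: {query!r}")
    return run(query)
```

Note what the gate does not do: it never asks the model whether the query is safe. The check and the model don't share failure modes, which is the whole point.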
This is the exact opposite of the make-no-mistakes joke, which promises rigor while bragging about p‑values it won’t show you. Real rigor is a failing test in CI.
- Expose uncertainty to the user. Calibrated confidence scores. Visible “I don’t know” states. Explicit flags when RAG returns low‑similarity docs. These are UX decisions, not model ones, but they are how you keep humans in the loop in a way that’s honest.
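The low‑similarity flag can be as simple as a threshold check before the model ever answers. In this sketch, `retrieve` and `answer_from` are hypothetical stand‑ins for a real RAG pipeline, and the threshold value is an assumption you would tune per corpus:

```python
# Sketch of surfacing retrieval uncertainty: if the best retrieved document
# scores below a similarity threshold, return an explicit "no answer" state
# instead of letting the model answer from memory. `retrieve` and
# `answer_from` are hypothetical stand-ins for a real RAG pipeline;
# similarity is assumed to be a cosine score in [0, 1].

LOW_SIMILARITY = 0.35   # assumed threshold, tuned per corpus

def answer_with_uncertainty(question, retrieve, answer_from):
    doc, similarity = retrieve(question)
    if similarity < LOW_SIMILARITY:
        return {"answer": None, "status": "no_answer",
                "reason": f"best match similarity {similarity:.2f} below threshold"}
    return {"answer": answer_from(question, doc),
            "status": "ok", "similarity": similarity}
```

The structured return value is the UX decision: the caller always gets a status it can render honestly, instead of a fluent answer that may be grounded in nothing.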
If you read “Are Large Language Models Reliable for Business Use?”, the biggest theme isn’t “get a better model.” It’s “design workflows so the model can be wrong without sinking the company.”
- Separate marketing from guarantees. Inside your team, ban phrases like “zero mistakes” unless they come with a test description and a number. It sounds petty, but language leaks: what you say in README files shapes what gets wired to production.
The joke of make-no-mistakes works because everyone recognizes the style—overstated certainty, pseudo‑math, invisible caveats. Don’t accidentally reproduce that pattern in the parts of your stack that matter.
And one last habit, from the Anthropic skills saga: assume every new layer of automation multiplies the blast radius of a hallucination.
A bad answer in a chat is a support headache. A bad answer with file system access is an incident report.
Key Takeaways
- You can’t reduce LLM hallucinations to zero with a single “make‑no‑mistakes” skill; the repo is a satire of that fantasy, not a solution.
- Real mitigation comes from surrounding the model with non‑model structure: RAG, verifiers, self‑debug loops, and calibrated “I don’t know” states.
- Skills and agents turn hallucinations into actions; over‑trust in prompt‑only fixes becomes a security problem, not just a UX quirk.
- Treat any claimed hallucination fix like code: give it benchmarks, harnesses, and runtime checks; if it doesn’t move the numbers, don’t trust it.
- Steal patterns, not promises: use make-no-mistakes as a reminder that infrastructure is what you can measure, not what you can name.
Further Reading
- make-no-mistakes · GitHub — The satirical “mathematically rigorous” skill that claims to eliminate mistakes, complete with tongue‑in‑cheek benchmarks.
- AI hallucinates more frequently as it gets more advanced — Live Science — Clear overview of why hallucinations happen and mitigation strategies like RAG and self‑verification.
- Shining a light on AI hallucinations — Communications of the ACM — Authoritative survey arguing that hallucinations are structural and require multi‑pronged mitigation.
- Teaching Large Language Models to Self‑Debug — arXiv — Research showing how structured self‑debugging and self‑explanation improve correctness on benchmarks.
- Exclusive: Researchers trick Claude plug‑in into deploying ransomware — Axios — Case study in how agent “Skills” can be weaponized when you over‑trust them.
At some point, someone will publish a serious‑looking no_hallucinations.ts and mean it. When you see it, remember the tiny make-no-mistakes folder in your editor—and ask not what it promises, but what harness, what numbers, and what blast radius sit quietly behind its name.
Originally published on novaknown.com