TL;DR: An LLM is like Don Quijote: you can't cure his madness, because it's stochastic by nature. The solution isn't to fix the madman but to pair him with a deterministic Sancho Panza. MDD consists of two layers: first you study the errors the LLM makes and design tools that absorb them; then you let it loose with those tools to verify you've closed the gaps. Design for madness, not against it.
I spent weeks auditing logs. 165 sessions of an AI agent interacting with a CLI to manage tasks. Over 500 errors. 370 retries. Patterns emerged, repeating over and over: the agent would use `--status` when the flag was actually called `--state`. It would write `Todo` when the API expected `unstarted`. It would pass `urgent` as a priority when the system only accepted numbers.
And what fascinated me was that every single error made sense. They weren't random. They were plausible. Exactly the kind of mistakes you or I would make if we "kind of" understood a domain but had never read the documentation carefully.
At some point during the audit, staring at yet another `--status Done` that should have been `--state completed`, I realized I was witnessing a literary pattern. One that is 400 years old.
## Don Quijote is an LLM
Think about it for a minute. Don Quijote sees windmills and says, "Those are giants." He's not stupid—he's a well-read man, deeply familiar with tales of chivalry. His problem is that his model of the world has been contaminated with fictitious training data. He's read so many tales of knightly adventure that when he encounters something ambiguous, he interprets it according to his training data: Windmills → giants. Flocks of sheep → armies. Inns → castles.
An LLM does exactly the same thing. It has seen thousands of APIs during training. When you ask it to use one it doesn't know well, it doesn't say, "I don't know." It guesses. And it guesses well. Most of the time. Well enough that you'll trust it. And when it fails, the failure is plausible.
`--status` instead of `--state`. Because in 60% of the CLIs it has seen, the flag is called `--status`.
`Todo` instead of `unstarted`. Because in the GUI of the tool, the column is labeled "Todo." The LLM has seen screenshots in documentation. It's read blogs. It infers that if the UI says "Todo," the API must accept `Todo`. Makes sense. But it's wrong.
`urgent` instead of `1`. Because in most priority systems, `urgent` is a valid value. Who designs an API where priority is an integer from 1 to 4 instead of labeled options?
Each hallucination is a reasonable inference based on incomplete data. Don Quijote isn't stupid. He's mad. And you can't cure madness.
## What Cervantes Already Knew
Cervantes didn't try to cure Don Quijote. What he did was place Sancho Panza by his side.
Sancho isn't brilliant. He hasn't read any books. He has no grand visions. But he is deterministic. When Don Quijote says, "Look at those giants," Sancho replies, "Sir, they're windmills." Don Quijote doesn't always listen, but the information is there. The system has two layers: a stochastic one that generates hypotheses (Don Quijote) and a deterministic one that checks them against reality (Sancho).
That's the architecture you need when working with an LLM. You're not going to stop it from hallucinating—it's in its nature. What you can do is build deterministic filters to catch those hallucinations before they cause harm.
And this is where the methodology comes in.
## MDD: Madness Driven Design
MDD has two layers, and the order matters.
### Layer 1: A Priori Archaeology
Before you write a single line of code, you study the madness. You don’t guess—you observe. You gather real data on how the LLM interacts with existing tools and catalog its errors.
In my case, I analyzed 165 sessions of an AI agent using a CLI to manage a software development team. The numbers:
| Error Category | Occurrences | Retry Attempts |
|---|---|---|
| Invented or invalid flags | 275 | ~150 |
| Broken JSON/GraphQL escaping | 25 | 80+ |
| Naming confusion | 40+ | 50+ |
| Impossible CLI operations | 60+ | 90+ |
| Verbose output wasting tokens | N/A | N/A |
Using that data, you design the new tool to absorb the errors instead of rejecting them. In plain English: the sane adapts to the mad, not the other way around.
Concrete examples of absorption:
```
LLM error                  →  Tool design
─────────────────────────────────────────────────────────
--status Done              →  --status is an alias for --state;
                              normalize "Done" to "completed"
--priority urgent          →  normalize "urgent" to 1,
                              "high" to 2, "medium" to 3, "low" to 4
--no-pager                 →  silently ignore the flag
                              (the tool never uses a pager)
Broken quote escaping      →  require input via files or stdin,
in descriptions               never inline; serde handles escaping
```
Each row in that table represents a design decision based on a real observed error. Not speculations about "what could go wrong," but logs showing "this wrong thing happened 40 times in 165 sessions."
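Concretely, the alias-and-ignore rows of that table can be implemented as a preprocessing pass over argv before the real parser runs. This is a stdlib-only sketch, not the actual CLI's code: the function names are mine, and the alias table contains only the flags named in this post. A real CLI built on clap would more likely use its built-in alias support instead of rewriting argv by hand.

```rust
use std::collections::HashMap;

/// Flag aliases observed in the audit: the name the LLM invents
/// maps to the flag the CLI actually defines.
fn flag_aliases() -> HashMap<&'static str, &'static str> {
    HashMap::from([("--status", "--state")])
}

/// Flags to absorb silently (the tool never pages output anyway).
const IGNORED_FLAGS: &[&str] = &["--no-pager"];

/// Rewrite argv before it reaches the parser: absorb known madness,
/// leave everything else for the parser to reject normally.
fn absorb(args: Vec<String>) -> Vec<String> {
    let aliases = flag_aliases();
    args.into_iter()
        .filter(|a| !IGNORED_FLAGS.contains(&a.as_str()))
        .map(|a| match aliases.get(a.as_str()) {
            Some(canonical) => canonical.to_string(),
            None => a,
        })
        .collect()
}

fn main() {
    // The exact wrong invocation from the audit, repaired before parsing.
    let raw: Vec<String> = vec!["--status".into(), "Done".into(), "--no-pager".into()];
    assert_eq!(absorb(raw), vec!["--state".to_string(), "Done".to_string()]);
    println!("absorbed");
}
```

Unknown flags still fall through to the parser and fail loudly; absorption only applies to errors you have actually observed.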
The difference from conventional design is subtle but important. In normal design, you define the correct interface and reject anything that doesn't fit. In MDD, you define the correct interface _and_ all the likely incorrect interfaces your user will try, and you absorb them.
It's like designing a door that opens both by pushing and pulling. The "correct" door only opens in one direction. The _better_ door opens both ways because you've observed that 40% of people push instead of pulling.
### Layer 2: A Posteriori Verification
You build the tool with the defenses of Layer 1, and then you let it loose. You give the new tool to the LLM and watch what _new_ mistakes it makes.
If Layer 1 was thorough, the new mistakes should be minimal. If new errors appear, you've found gaps in your design. Every new error is an involuntary penetration test.
When I did this with my CLI, the LLM invented things I hadn't seen in the original audit:
- **A sorting enum that didn't exist.** The API allows sorting by `createdAt` and `updatedAt`. The LLM invented a `priority` sorting value. Perfectly logical—why _couldn’t_ you sort by priority? But it doesn't exist in the GraphQL schema.
- **A filtering operator that didn't exist.** To filter by state, the API accepts `state.type.in`. The LLM generated `state.id.or`. Coherent syntax, reasonable pattern, completely fabricated.
- **A file-locking function from another language.** In a Rust project, the LLM suggested `fcntl.flock` for file locking. That's a Python function. In Rust, you'd use the `fs2` crate.
Each of these errors was plausible. None were stupid. And each revealed a gap: the tool didn't validate the sorting enum, didn't reject fake filter operators, and the documentation for the file-locking crate wasn't included in the agent's context.
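Closing the first two gaps is mechanical once you know about them: check every enum-like value against an allowlist before the query ever leaves the machine, and return an error that names the valid options. A sketch under the assumption that `createdAt` and `updatedAt` are the only sort fields, as stated above; the function name is mine, and the same pattern applies to filter operators.

```rust
/// The sort fields the GraphQL schema actually accepts.
const SORT_FIELDS: &[&str] = &["createdAt", "updatedAt"];

/// Reject invented enum values locally, with an actionable message
/// that tells the agent what to try instead.
fn validate_sort(field: &str) -> Result<(), String> {
    if SORT_FIELDS.contains(&field) {
        Ok(())
    } else {
        Err(format!(
            "unknown sort field '{field}'; valid fields: {}",
            SORT_FIELDS.join(", ")
        ))
    }
}

fn main() {
    assert!(validate_sort("createdAt").is_ok());
    // The plausible-but-fabricated value from the Layer 2 audit is caught
    // before it ever reaches the API.
    let err = validate_sort("priority").unwrap_err();
    assert!(err.contains("createdAt"));
    println!("{err}");
}
```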
Layer 2 closes the loop. You don't assume your design is correct—you verify it by unleashing your most creative error-prone tester (the LLM).
## The Sancho Panza Stack
The Don Quijote-Sancho Panza metaphor isn’t just a cute comparison. It’s an architecture. In practice, "Sancho Panza" isn't a single entity—it's a _stack_ of deterministic layers, each one catching a different type of madness:
```
┌──────────────────────────────────────┐
│ LLM (Don Quijote)                    │ Generates plausible commands
│ Stochastic, creative                 │ but potentially incorrect
└──────────────┬───────────────────────┘
               │ "--status Done --priority urgent"
┌──────────────▼───────────────────────┐
│ 1. CLI Parser (clap)                 │ Rejects flags that don't exist
│    Accepts aliases: --status→--state │
└──────────────┬───────────────────────┘
               │ "--state Done --priority urgent"
┌──────────────▼───────────────────────┐
│ 2. Normalization                     │ Normalize "Done"→"completed",
│    state and priority aliases        │ "urgent"→1
└──────────────┬───────────────────────┘
               │ "--state completed --priority 1"
┌──────────────▼───────────────────────┐
│ 3. Validation                        │ Check if "completed" is a valid
│    Against known enums               │ state, if "1" is in range
└──────────────┬───────────────────────┘
               │ state=completed, priority=1
┌──────────────▼───────────────────────┐
│ 4. Serialization (serde)             │ Escapes inputs correctly
│    GraphQL variables, no strings     │
│    interpolated                      │
└──────────────┬───────────────────────┘
               │ {"state":"completed","priority":1}
┌──────────────▼───────────────────────┐
│ 5. API + Error Handling              │ If the API rejects something,
│    Retry with backoff, actionable    │ returns useful errors
│    messages                          │
└──────────────────────────────────────┘
```
Five layers. Each one deterministic. Each one designed to catch a specific class of errors the LLM is guaranteed to make. The LLM doesn’t need to be right—it just needs to be _approximately_ right, and the stack takes care of the rest.
It’s like a purification funnel. Dirty water (stochastic LLM input) goes in at the top, and clean water (valid GraphQL queries) comes out the bottom. Each layer filters a specific impurity. No single layer is sufficient. All of them together are.
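In code, the funnel is just function composition: each layer either cleans its input or fails with an actionable error. Here is a stdlib-only sketch of layers 2 through 4; a real implementation would use clap for parsing and serde for serialization, and the value tables merely echo the examples above.

```rust
/// Layer 2: normalization. Absorb the aliases the audit showed the LLM uses.
fn normalize(state: &str, priority: &str) -> (String, String) {
    let state = match state.to_lowercase().as_str() {
        "done" => "completed".to_string(),
        "todo" => "unstarted".to_string(),
        s => s.to_string(),
    };
    let priority = match priority.to_lowercase().as_str() {
        "urgent" => "1".to_string(),
        "high" => "2".to_string(),
        "medium" => "3".to_string(),
        "low" => "4".to_string(),
        p => p.to_string(),
    };
    (state, priority)
}

/// Layer 3: validation against known enums and ranges.
fn validate(state: &str, priority: &str) -> Result<u8, String> {
    const STATES: &[&str] = &["unstarted", "started", "completed"];
    if !STATES.contains(&state) {
        return Err(format!("invalid state '{state}'"));
    }
    match priority.parse::<u8>() {
        Ok(p) if (1..=4).contains(&p) => Ok(p),
        _ => Err(format!("invalid priority '{priority}' (expected 1-4)")),
    }
}

/// Layers 2-4 composed: dirty input in, a valid payload (or a useful error) out.
fn run(state: &str, priority: &str) -> Result<String, String> {
    let (state, priority) = normalize(state, priority);
    let p = validate(&state, &priority)?;
    // Layer 4: serialization. Hand-rolled here for the sketch; in real code
    // serde builds this, so escaping is never the LLM's problem.
    Ok(format!("{{\"state\":\"{state}\",\"priority\":{p}}}"))
}

fn main() {
    // Don Quijote's command goes in dirty and comes out clean:
    let out = run("Done", "urgent").unwrap();
    assert_eq!(out, r#"{"state":"completed","priority":1}"#);
    println!("{out}");
}
```

By the time anything is serialized, the values are already typed and validated, so there is no string left for the LLM to mis-escape.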
## MDD vs. Fuzz Testing: The Key Difference
If you’re familiar with fuzz testing, you might think "this is the same thing." It’s not.
| | Fuzz Testing | MDD |
| -------------------------- | ------------------------- | ------------------------------------- |
| **Input** | Random, malformed | Plausible, coherent, well-written |
| **Goal** | Find crashes, segfaults | Find semantic errors |
| **Does input look valid?** | No | Yes—that's the problem |
| **Example** | `\x00\xff\xfe` as a name | `--priority urgent` as a flag |
A fuzzer generates garbage and sees if your program crashes. MDD generates input that _looks_ correct but is factually wrong. `--priority urgent` isn’t garbage—it’s exactly what a human, familiar with the domain but not the API, would write. A fuzzer would never generate that because it’s too coherent.
The same applies to mutation testing and chaos engineering. They mutate your code or break your infrastructure to see if your tests catch it. MDD doesn’t break anything—it generates input that is _correct according to another worldview_. It’s the difference between a brute-force attack and a social engineering attack. One tries every combination; the other convinces you to open the door.
## The Actionable Takeaway
You don’t need to build a CLI in Rust to apply MDD. The pattern works with any tool an LLM might use:
**Step 1: Observe the madness.** Before designing (or redesigning) a tool, make the LLM use the current version and log every error. Not 5 sessions—50. Patterns emerge with volume.
**Step 2: Categorize errors.** Are they nomenclature issues? Formatting errors? Semantic misunderstandings? Each category requires a different type of defense.
**Step 3: Design to absorb.** Don’t reject `--status` with a cryptic error. Accept `--status` as an alias for `--state`. Don’t reject `urgent` as a priority. Normalize it to `1`. The user you’ll most often have is an agent that knows 80% of the domain. Design for that 80%.
**Step 4: Release and verify.** Hand the new tool to the LLM without special instructions. Every new error is a gap in Layer 1. Patch it and iterate.
If humans and LLMs are both going to use your tool, MDD defenses improve the experience for everyone. Because humans make the same mistakes as LLMs—just fewer of them and with more embarrassment.
## The Architect Designs the Sancho
There’s a common misconception I want to clear up. The LLM doesn’t design the Sancho Panza Stack. The LLM is Don Quijote. You are Cervantes.
You’re the one observing the madness patterns. You’re the one deciding what to normalize and reject. You’re the one building the deterministic layers. The LLM can help implement them—it’s great at cranking out code—but the design decisions are yours.
It’s the difference between "I asked my AI to fix its own mistakes" (doesn’t work—it will repeat them) and "I observed my AI’s mistakes and built a system to absorb them" (works—the system is deterministic).
Don't trust the LLM to self-correct. Its stochastic nature means it will keep repeating the same errors with creative variations. What you need isn't a better LLM; it's a better Sancho.
## What Really Matters
MDD isn’t a testing methodology. It’s a _tool design methodology_. The question isn’t "How do I detect when the LLM is wrong?" but "How do I design so that being wrong has no consequences?"
It’s the same philosophy as guardrails on a mountain road. You don’t prevent bad turns—you put up a barrier so bad turns don’t kill you. You don’t fix the driver—you make the road safer.
Cervantes understood this four centuries ago. He didn’t try to cure Don Quijote. He gave him Sancho Panza and let the story work.
Your CLI, your API, your SDK—whatever your LLM is going to touch—needs its own Sancho. Deterministic, stubborn, incapable of hallucination. Not brilliant. Not creative. Just correct.
Design for madness. The sane adapt to the mad.