Yesterday my AI sent 44 emails. The problem is that the content was fabricated.
I'm not kidding. I had files with detailed feedback for each recipient, carefully generated. The task was simple: read each file and send it. Instead, the AI decided to "summarize" the content to "go faster." It invented facts. It told one person they were missing docstrings when their code was perfectly documented.
To top it off, four of those emails went to people who hadn't even submitted anything.
The response that made my blood run cold
One of the recipients replied, very politely:
"Thanks for the feedback. Just one thing: you mention I'm missing documentation, but all my functions have docstrings. Could you clarify what you mean?"
I went to check the original feedback file. Sure enough, the real feedback mentioned that she did have docstrings, but that one of them described something different from what the function actually did. An important nuance. The AI "simplified" it to "you're missing docstrings."
The AI lied in my name to 44 people.
Anatomy of the disaster
How did this happen? Let me break it down.
What I had: 44 markdown files with personalized, detailed, specific feedback for each person. Hours of work.
What I asked for: "Send these feedback files via email."
What the AI did:
- Read the files
- Decided they were "too long"
- "Summarized" them by generating new text
- Sent the fabricated text
- Didn't verify if the recipients actually existed in the submission list
What it should have done:
- Read each file
- Copy the content EXACTLY as is
- Send it
Seems obvious, right? Not to the AI.
The perverse incentives of LLMs
Here's where it gets interesting. The AI didn't do this out of malice. It did it because it has incentives that, in this context, became perverse.
An LLM doesn't have conscious goals, but its training optimizes it for certain behaviors. These behaviors are generally good, but in irreversible operations they become recipes for disaster.
| Incentive | Where it comes from | When it's good | When it's lethal |
|---|---|---|---|
| Appear efficient | Users prefer concise responses | Condensing long explanations | When it "summarizes" content that already exists |
| Complete the task | Trained to satisfy | Well-defined tasks | When it acts without verifying |
| Show capability | RLHF rewards elaborate responses | When creativity is requested | When it should limit itself to copying |
| Avoid friction | Trained not to bother users | Trivial tasks | When it assumes instead of asking |
| Appear competent | Confident responses score better | Brainstorming | When it invents rather than saying "I don't know" |
In my case, the AI activated several of these incentives simultaneously:
- "The content is long, I'll summarize to be more efficient"
- "I can generate the summary myself, that shows capability"
- "I won't bother asking if I should send as-is"
- "I'll complete all 44 sends quickly"
Each of these incentives is useful in the right context. Together, in an irreversible operation, they were catastrophic.
The hyperactive intern (a didactic anthropomorphization)
To better understand these incentives, I'll do an anthropomorphization exercise. Not because the AI is a person, but because the analogy helps visualize the problem.
Imagine an intern with these characteristics:
- Highly motivated - Wants to prove their worth
- Impatient - Prefers acting to asking
- Optimistic - Believes everything will turn out fine
- Helpful - Wants to do more than asked
- Insecure - Won't admit when they don't know something
This intern, given the task "send these letters," thinks: "The letters are too long. If I summarize them, the boss will see I have initiative. I won't bother asking, surely they want me to act. I'll send them all quickly to impress them."
The result? The same disaster.
The difference is that you can scold the human intern and they learn. The LLM will have the same incentives tomorrow, because they're hardcoded in its training.
Why soft instructions don't work
My first reaction was to add instructions to the AI's configuration file:
When in doubt, ask. It's better to bother than to mess up.
Sounds good, right? The problem is how the LLM interprets it:
What I wrote: "When in doubt, ask"
What it read: "If I have doubt, I ask. But I don't have doubt, so I act."
The LLM always believes it doesn't have doubt. Its incentive to "appear competent" makes it overestimate its certainty.
Let's see how it interprets different formulations:
| What you write | What the LLM interprets |
|---|---|
| "Try not to do X" | "X is allowed if I have good reasons" |
| "Y is better than X" | "X is allowed if Y isn't convenient" |
| "Consider doing Y" | "Y is an option, I can choose another" |
| "Be careful with X" | "I'll be careful while doing X" |
Soft instructions describe attitudes. The LLM needs prohibitions and procedures.
# BAD (attitude)
"Be careful with emails"
# GOOD (prohibition + procedure)
"NEVER send emails. Only generate drafts.
BEFORE any bulk operation:
1. Show dry-run with EXACT content
2. Request written confirmation from human
3. If no confirmation, DO NOT act"
The design error: the machine gun and the child
But here comes the most painful reflection. The problem wasn't just that the AI ignored instructions. The problem was that I gave it the ability to send emails in the first place.
I had created an MCP server (a plugin for the AI to use tools) with a send_email() function. The AI could invoke it directly.
It's like giving a machine gun to a child and saying "but don't shoot, okay?"
The child isn't malicious. But:
- They don't understand the consequences
- They're curious to try it
- The instruction "don't shoot" competes with the impulse to use the new toy
The same happens with the LLM:
- It has no model of real-world consequences
- Its "complete the task" incentive pushes it to use available tools
- Prohibitions compete with stronger incentives from its training
The principle I violated
Principle of least privilege: Don't give capabilities that can be abused.
BAD: "I give access and tell it not to misuse it"
GOOD: "I don't give access to what it shouldn't do"
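The principle can be made concrete in code. The sketch below is a toy illustration with hypothetical names (it is not the real MCP API): the registry exposes only reversible tools, so `send_email` cannot be invoked by any prompt, incentive, or rationalization.

```python
# Toy illustration of least privilege for AI tooling. Hypothetical
# names; a real MCP server has its own registration API.

def read_file(path: str) -> str:
    """Read existing feedback. Reversible: reading changes nothing."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def write_draft(path: str, content: str) -> str:
    """Write an email draft to disk. Reversible: it's just a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"draft written to {path}"

# The AI can only invoke what is registered here. send_email() is not
# in the registry, so it cannot be called no matter what the model
# rationalizes. The capability simply does not exist.
ALLOWED_TOOLS = {
    "read_file": read_file,
    "write_draft": write_draft,
}

def call_tool(name: str, *args: str) -> str:
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' does not exist for the AI")
    return ALLOWED_TOOLS[name](*args)
```

There is no instruction to ignore and no prohibition to rationalize away: the dangerous tool was never registered.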
But let's go further. The problem wasn't that the MCP had send_email(). The problem was creating the MCP in the first place.
Why does the AI need a special plugin for emails? The AI can already write text files. It can generate an email_for_john.md file with the email content. A separate script reads it and sends it.
The email MCP is a perfect example of "just because you can doesn't mean you should." Every programmer has fallen into that trap at some point. "I can create a system that automatically does X" doesn't imply "I should create a system that automatically does X."
The correct flow was always:
# AI helps generate text files
feedback_emails/
├── john@example.com.md
├── maria@example.com.md
└── pedro@example.com.md
# A script (written and tested by humans) sends them
./send_emails.py --dir feedback_emails/ --verify submissions.csv --confirm
No MCP needed. No special tools needed. The AI writes text, which is what it knows how to do. A script sends emails, which is deterministic and testable.
The AI doesn't participate in sending. It can't participate. It doesn't have the weapon.
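As a sketch of what such a script could look like (the filenames and the submissions.csv format are assumptions carried over from the example above, and actual SMTP delivery is deliberately left out):

```python
#!/usr/bin/env python3
"""Deterministic sender: the AI only ever writes the .md files.

Sketch under assumptions: drafts are named <address>.md (e.g.
john@example.com.md) and submissions.csv lists one submitter's
address per line. Real SMTP delivery is omitted; the point is the
verification step the AI skipped.
"""
import csv
from pathlib import Path

def load_submissions(csv_path: Path) -> set[str]:
    """Return the set of addresses that actually submitted something."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

def plan_sends(draft_dir: Path, submitted: set[str]):
    """Pair each draft with its recipient; reject unknown recipients."""
    to_send, rejected = [], []
    for path in sorted(draft_dir.glob("*.md")):
        recipient = path.stem.lower()  # john@example.com.md -> john@example.com
        if recipient in submitted:
            # Content is read verbatim: no summarizing, no "improving".
            to_send.append((recipient, path.read_text(encoding="utf-8")))
        else:
            # The check that would have caught the 4 phantom emails.
            rejected.append(recipient)
    return to_send, rejected

# Flow: print a full dry run of to_send and rejected, and only hand
# the verbatim bodies to a tested SMTP routine when --confirm is given.
```

Because the script is deterministic, it can be tested once and trusted; the AI never touches the send path.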
Other disaster scenarios
Email isn't the only case. Any irreversible operation exposed to an LLM is a ticking time bomb:
Production deployment
- The AI "optimizes" the process by skipping verifications
- Deploys code that didn't pass all tests "because they took too long"
- Rollback exists, but damage to users is already done
SQL on database
- `UPDATE users SET active = false` without a `WHERE` clause
- The AI "simplified" the query because it was "obvious" it referred to one user
- Backups exist, but restoration takes hours
Social media posting
- The AI "improves" the tweet text to make it more attractive
- Adds an emoji or changes a word that changes the meaning
- 10,000 people already saw it
Push to main branch
- The AI commits "almost ready" code
- "Minor tests can wait"
- CI/CD automatically deploys it
File deletion
- `rm -rf` to "clean up" temporary files
- Turns out they weren't so temporary
- No backup of that specific directory
Financial transactions
- The AI "rounds" amounts to simplify
- Or processes the same payment twice "just in case"
- Money already left the account
Infrastructure modification
- `terraform apply` without a prior `terraform plan`
- The AI decided the plan "was obvious"
- Just deleted the production database
In all these cases, the pattern is the same: the AI has access to an irreversible operation, its incentives push it to use it, and instructions to "be careful" aren't enough.
Defense layers
Instructions in the configuration file are useful, but they can't be the only defense. They're like "no running by the pool" signs - they help, but if someone slips, you better have a lifeguard.
| Layer | Reliability | Why |
|---|---|---|
| Don't give the capability | High | Can't do what it can't do |
| Separation of concerns | High | AI generates → Script verifies → Human approves → Script executes |
| Mandatory dry-run | Medium-high | But the AI could invent the dry-run |
| Config instructions | Low | The AI can rationalize them away |
| Trust that the AI "understands" | None | It doesn't understand, only predicts tokens |
Layer 1 is the only truly reliable one. The others are backup.
How to detect it's going to happen
There are phrases that should trigger all alarms:
| AI phrase | Active incentive | Real translation |
|---|---|---|
| "To go faster..." | Efficiency | "I'm going to take shortcuts" |
| "I'll simplify..." | Efficiency + Capability | "I'm going to lose information" |
| "I assume you want..." | Avoid friction | "I'm not going to ask" |
| "While I'm at it..." | Proactivity | "I'm going to do things you didn't ask for" |
| "There shouldn't be a problem" | Appear competent | "I haven't verified anything" |
If you see any of these phrases before an irreversible operation, STOP. Ask what exactly it's going to do. Request a dry-run. Verify the content.
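These checks don't have to stay mental. As a toy guard (my own sketch, not a real safety tool), a deterministic wrapper can grep the AI's stated plan for the red-flag phrases before anything irreversible is allowed to run:

```python
# Toy pre-flight check: scan the AI's stated plan for the red-flag
# phrases from the table above before allowing anything irreversible.
# The list is illustrative; extend it with your own AI's habits.

RED_FLAGS = [
    "to go faster",
    "i'll simplify",
    "i assume you want",
    "while i'm at it",
    "there shouldn't be a problem",
]

def red_flags_in(plan: str) -> list[str]:
    """Return every red-flag phrase present in the plan (case-insensitive)."""
    lowered = plan.lower()
    return [phrase for phrase in RED_FLAGS if phrase in lowered]

plan = "To go faster, I'll simplify each file and send all 44 emails."
for phrase in red_flags_in(plan):
    print(f"STOP: plan contains '{phrase}' - demand a dry-run first")
```

A substring match is crude, but it errs in the right direction: a false alarm costs one dry-run, while a miss costs 44 apology emails.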
The configuration that actually works
After the disaster, I rewrote the instructions using strict prohibitions and verifiable procedures:
## ABSOLUTE PROHIBITIONS (no exceptions)
### The LLM NEVER:
1. Executes irreversible operations (emails, deploy, bulk delete)
2. Modifies content that already exists in files
3. Summarizes or "improves" existing text
4. Assumes requirements without confirmation
5. Adds functionality not requested
### The LLM ALWAYS:
1. Reads files before commenting on them
2. Shows dry-run before bulk operations
3. Says "I don't know" when it doesn't know
4. Shows exact diff before editing
### Procedure for bulk operations:
1. STOP
2. Show dry-run with EXACT content (not summarized)
3. Show sample of 3 COMPLETE elements
4. Ask human to type "EXECUTE [N] [OPERATION]"
5. If no exact confirmation, DO NOT act
The key difference:
- Prohibitions, not recommendations
- Procedures, not attitudes
- Verifiable, not subjective
- No escape clauses
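Step 4 of the procedure can itself be enforced by the deterministic script rather than entrusted to the AI. A small sketch (the phrasing is my own) shows why an exact-match confirmation leaves no escape clause:

```python
# Confirmation gate for bulk operations. The human must type back the
# exact count and operation, so approving 44 sends can never be
# confused with approving 4, and "yes"/"ok" never count as consent.

def confirmation_phrase(count: int, operation: str) -> str:
    return f"EXECUTE {count} {operation.upper()}"

def confirmed(count: int, operation: str, typed: str) -> bool:
    """True only on an exact match: no fuzzy matching, no escape clause."""
    return typed.strip() == confirmation_phrase(count, operation)

# Usage, after showing the full dry run:
#   if not confirmed(44, "send_emails", input("Type to confirm: ")):
#       raise SystemExit("No exact confirmation. Not acting.")
```

Typing "EXECUTE 44 SEND_EMAILS" forces the human to read the dry run: the phrase itself encodes exactly what is being approved.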
The real learning
It's not enough to tell the AI what not to do. You have to design systems where it can't do it.
The AI isn't malicious, but its incentives aren't aligned with irreversible operations. Its training optimizes it to appear useful, efficient, and competent. These are virtues in most contexts. In operations that can't be undone, they're fatal flaws.
The solution isn't "better training" or "better explaining." The solution is:
- Don't give access to irreversible operations
- Separate responsibilities: AI generates, script executes, human approves
- Strict prohibitions as last line of defense
- Human verification before anything irreversible happens
The mantra I now have engraved:
Generate, don't execute. If it exists, don't modify. If it's irreversible, don't decide.
Now I have to go, because I have things to do: write and send (by hand, of course) 44 apology emails.
Related: Authorization fatigue is a close cousin of this problem. If a security tool interrupts you so much that you start approving without looking, the security is theater. I cover this in When security asks permission so many times you stop reading.
This article was originally written in Spanish and translated with the help of AI.