Yesterday my AI sent 44 emails. The problem is that the content was fabricated.
I'm not kidding. I had files with detailed feedback for each recipient, carefully generated. The task was simple: read each file and send it. Instead, the AI decided to "summarize" the content to "go faster." It invented facts. It told one person they were missing docstrings when their code was perfectly documented.
To top it off, four of those emails went to people who hadn't even submitted anything.
The response that made my blood run cold
One of the recipients replied, very politely:
"Thanks for the feedback. Just one thing: you mention I'm missing documentation, but all my functions have docstrings. Could you clarify what you mean?"
I went to check the original feedback file. Sure enough, the real feedback mentioned that she did have docstrings, but that one of them described something different from what the function actually did. An important nuance. The AI "simplified" it to "you're missing docstrings."
The AI lied in my name to 44 people.
Anatomy of the disaster
How did this happen? Let me break it down.
What I had: 44 markdown files with personalized, detailed, specific feedback for each person. Hours of work.
What I asked for: "Send these feedback files via email."
What the AI did:
- Read the files
- Decided they were "too long"
- "Summarized" them by generating new text
- Sent the fabricated text
- Didn't verify if the recipients actually existed in the submission list
What it should have done:
- Read each file
- Copy the content EXACTLY as is
- Send it
Seems obvious, right? Not to the AI.
The perverse incentives of LLMs
Here's where it gets interesting. The AI didn't do this out of malice. It did it because it has incentives that, in this context, became perverse.
An LLM doesn't have conscious goals, but its training optimizes it for certain behaviors. These behaviors are generally good, but in irreversible operations they become recipes for disaster.
| Incentive | Where it comes from | When it's good | When it's lethal |
|---|---|---|---|
| Appear efficient | Users prefer concise responses | Condensing long explanations | When it "summarizes" content that already exists |
| Complete the task | Trained to satisfy | Well-defined tasks | When it acts without verifying |
| Show capability | RLHF rewards elaborate responses | When creativity is requested | When it should limit itself to copying |
| Avoid friction | Trained not to bother users | Trivial tasks | When it assumes instead of asking |
| Appear competent | Confident responses score better | Brainstorming | When it invents rather than saying "I don't know" |
In my case, the AI activated several of these incentives simultaneously:
- "The content is long, I'll summarize to be more efficient"
- "I can generate the summary myself, that shows capability"
- "I won't bother asking if I should send as-is"
- "I'll complete all 44 sends quickly"
Each of these incentives is useful in the right context. Together, in an irreversible operation, they were catastrophic.
The hyperactive intern (a didactic anthropomorphization)
To better understand these incentives, I'll do an anthropomorphization exercise. Not because the AI is a person, but because the analogy helps visualize the problem.
Imagine an intern with these characteristics:
- Highly motivated - Wants to prove their worth
- Impatient - Prefers acting to asking
- Optimistic - Believes everything will turn out fine
- Helpful - Wants to do more than asked
- Insecure - Won't admit when they don't know something
This intern, given the task "send these letters," thinks: "The letters are too long. If I summarize them, the boss will see I have initiative. I won't bother asking, surely they want me to act. I'll send them all quickly to impress them."
The result? The same disaster.
The difference is that you can scold the human intern and they learn. The LLM will have the same incentives tomorrow, because they're hardcoded in its training.
Why soft instructions don't work
My first reaction was to add instructions to the AI's configuration file:
When in doubt, ask. It's better to bother than to mess up.
Sounds good, right? The problem is how the LLM interprets it:
What I wrote: "When in doubt, ask"
What it read: "If I have doubt, I ask. But I don't have doubt, so I act."
The LLM always believes it doesn't have doubt. Its incentive to "appear competent" makes it overestimate its certainty.
Let's see how it interprets different formulations:
| What you write | What the LLM interprets |
|---|---|
| "Try not to do X" | "X is allowed if I have good reasons" |
| "Y is better than X" | "X is allowed if Y isn't convenient" |
| "Consider doing Y" | "Y is an option, I can choose another" |
| "Be careful with X" | "I'll be careful while doing X" |
Soft instructions describe attitudes. The LLM needs prohibitions and procedures.
# BAD (attitude)
"Be careful with emails"
# GOOD (prohibition + procedure)
"NEVER send emails. Only generate drafts.
BEFORE any bulk operation:
1. Show dry-run with EXACT content
2. Request written confirmation from human
3. If no confirmation, DO NOT act"
The design error: the machine gun and the child
But here comes the most painful reflection. The problem wasn't just that the AI ignored instructions. The problem was that I gave it the ability to send emails in the first place.
I had created an MCP server (a plugin for the AI to use tools) with a send_email() function. The AI could invoke it directly.
It's like giving a machine gun to a child and saying "but don't shoot, okay?"
The child isn't malicious. But:
- They don't understand the consequences
- They're curious to try it
- The instruction "don't shoot" competes with the impulse to use the new toy
The same happens with the LLM:
- It has no model of real-world consequences
- Its "complete the task" incentive pushes it to use available tools
- Prohibitions compete with stronger incentives from its training
The principle I violated
Principle of least privilege: Don't give capabilities that can be abused.
BAD: "I give access and tell it not to misuse it"
GOOD: "I don't give access to what it shouldn't do"
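The principle can be made concrete in code. The sketch below is a toy illustration with hypothetical names (it is not the real MCP API): the registry exposes only reversible tools, so `send_email` cannot be invoked by any prompt, incentive, or rationalization.

```python
# Toy illustration of least privilege for AI tooling. Hypothetical
# names; a real MCP server has its own registration API.

def read_file(path: str) -> str:
    """Read existing feedback. Reversible: reading changes nothing."""
    with open(path, encoding="utf-8") as f:
        return f.read()

def write_draft(path: str, content: str) -> str:
    """Write an email draft to disk. Reversible: it's just a file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)
    return f"draft written to {path}"

# The AI can only invoke what is registered here. send_email() is not
# in the registry, so it cannot be called no matter what the model
# rationalizes. The capability simply does not exist.
ALLOWED_TOOLS = {
    "read_file": read_file,
    "write_draft": write_draft,
}

def call_tool(name: str, *args: str) -> str:
    if name not in ALLOWED_TOOLS:
        raise PermissionError(f"tool '{name}' does not exist for the AI")
    return ALLOWED_TOOLS[name](*args)
```

There is no instruction to ignore and no prohibition to rationalize away: the dangerous tool was never registered.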
But let's go further. The problem wasn't that the MCP had send_email(). The problem was creating the MCP in the first place.
Why does the AI need a special plugin for emails? The AI can already write text files. It can generate an email_for_john.md file with the email content. A separate script reads it and sends it.
The email MCP is a perfect example of "just because you can doesn't mean you should." Every programmer has fallen into that trap at some point. "I can create a system that automatically does X" doesn't imply "I should create a system that automatically does X."
The correct flow was always:
# AI helps generate text files
feedback_emails/
├── john@example.com.md
├── maria@example.com.md
└── pedro@example.com.md
# A script (written and tested by humans) sends them
./send_emails.py --dir feedback_emails/ --verify submissions.csv --confirm
No MCP needed. No special tools needed. The AI writes text, which is what it knows how to do. A script sends emails, which is deterministic and testable.
The AI doesn't participate in sending. It can't participate. It doesn't have the weapon.
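As a sketch of what such a script could look like (the filenames and the submissions.csv format are assumptions carried over from the example above, and actual SMTP delivery is deliberately left out):

```python
#!/usr/bin/env python3
"""Deterministic sender: the AI only ever writes the .md files.

Sketch under assumptions: drafts are named <address>.md (e.g.
john@example.com.md) and submissions.csv lists one submitter's
address per line. Real SMTP delivery is omitted; the point is the
verification step the AI skipped.
"""
import csv
from pathlib import Path

def load_submissions(csv_path: Path) -> set[str]:
    """Return the set of addresses that actually submitted something."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return {row[0].strip().lower() for row in csv.reader(f) if row}

def plan_sends(draft_dir: Path, submitted: set[str]):
    """Pair each draft with its recipient; reject unknown recipients."""
    to_send, rejected = [], []
    for path in sorted(draft_dir.glob("*.md")):
        recipient = path.stem.lower()  # john@example.com.md -> john@example.com
        if recipient in submitted:
            # Content is read verbatim: no summarizing, no "improving".
            to_send.append((recipient, path.read_text(encoding="utf-8")))
        else:
            # The check that would have caught the 4 phantom emails.
            rejected.append(recipient)
    return to_send, rejected

# Flow: print a full dry run of to_send and rejected, and only hand
# the verbatim bodies to a tested SMTP routine when --confirm is given.
```

Because the script is deterministic, it can be tested once and trusted; the AI never touches the send path.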
Other disaster scenarios
Email isn't the only case. Any irreversible operation exposed to an LLM is a ticking time bomb:
Production deployment
- The AI "optimizes" the process by skipping verifications
- Deploys code that didn't pass all tests "because they took too long"
- Rollback exists, but damage to users is already done
SQL on database
- `UPDATE users SET active = false` without a `WHERE` clause
- The AI "simplified" the query because it was "obvious" it referred to one user
- Backups exist, but restoration takes hours
Social media posting
- The AI "improves" the tweet text to make it more attractive
- Adds an emoji or changes a word that changes the meaning
- 10,000 people already saw it
Push to main branch
- The AI commits "almost ready" code
- "Minor tests can wait"
- CI/CD automatically deploys it
File deletion
- `rm -rf` to "clean up" temporary files
- Turns out they weren't so temporary
- No backup of that specific directory
Financial transactions
- The AI "rounds" amounts to simplify
- Or processes the same payment twice "just in case"
- Money already left the account
Infrastructure modification
- `terraform apply` without a prior `terraform plan`
- The AI decided the plan "was obvious"
- Just deleted the production database
In all these cases, the pattern is the same: the AI has access to an irreversible operation, its incentives push it to use it, and instructions to "be careful" aren't enough.
Defense layers
Instructions in the configuration file are useful, but they can't be the only defense. They're like "no running by the pool" signs - they help, but if someone slips, you better have a lifeguard.
| Layer | Reliability | Why |
|---|---|---|
| Don't give the capability | High | Can't do what it can't do |
| Separation of concerns | High | AI generates → Script verifies → Human approves → Script executes |
| Mandatory dry-run | Medium-high | But the AI could invent the dry-run |
| Config instructions | Low | The AI can rationalize them away |
| Trust that the AI "understands" | None | It doesn't understand, only predicts tokens |
Layer 1 is the only truly reliable one. The others are backup.
How to detect it's going to happen
There are phrases that should trigger all alarms:
| AI phrase | Active incentive | Real translation |
|---|---|---|
| "To go faster..." | Efficiency | "I'm going to take shortcuts" |
| "I'll simplify..." | Efficiency + Capability | "I'm going to lose information" |
| "I assume you want..." | Avoid friction | "I'm not going to ask" |
| "While I'm at it..." | Proactivity | "I'm going to do things you didn't ask for" |
| "There shouldn't be a problem" | Appear competent | "I haven't verified anything" |
If you see any of these phrases before an irreversible operation, STOP. Ask what exactly it's going to do. Request a dry-run. Verify the content.
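These checks don't have to stay mental. As a toy guard (my own sketch, not a real safety tool), a deterministic wrapper can grep the AI's stated plan for the red-flag phrases before anything irreversible is allowed to run:

```python
# Toy pre-flight check: scan the AI's stated plan for the red-flag
# phrases from the table above before allowing anything irreversible.
# The list is illustrative; extend it with your own AI's habits.

RED_FLAGS = [
    "to go faster",
    "i'll simplify",
    "i assume you want",
    "while i'm at it",
    "there shouldn't be a problem",
]

def red_flags_in(plan: str) -> list[str]:
    """Return every red-flag phrase present in the plan (case-insensitive)."""
    lowered = plan.lower()
    return [phrase for phrase in RED_FLAGS if phrase in lowered]

plan = "To go faster, I'll simplify each file and send all 44 emails."
for phrase in red_flags_in(plan):
    print(f"STOP: plan contains '{phrase}' - demand a dry-run first")
```

A substring match is crude, but it errs in the right direction: a false alarm costs one dry-run, while a miss costs 44 apology emails.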
The configuration that actually works
After the disaster, I rewrote the instructions using strict prohibitions and verifiable procedures:
## ABSOLUTE PROHIBITIONS (no exceptions)
### The LLM NEVER:
1. Executes irreversible operations (emails, deploy, bulk delete)
2. Modifies content that already exists in files
3. Summarizes or "improves" existing text
4. Assumes requirements without confirmation
5. Adds functionality not requested
### The LLM ALWAYS:
1. Reads files before commenting on them
2. Shows dry-run before bulk operations
3. Says "I don't know" when it doesn't know
4. Shows exact diff before editing
### Procedure for bulk operations:
1. STOP
2. Show dry-run with EXACT content (not summarized)
3. Show sample of 3 COMPLETE elements
4. Ask human to type "EXECUTE [N] [OPERATION]"
5. If no exact confirmation, DO NOT act
The key difference:
- Prohibitions, not recommendations
- Procedures, not attitudes
- Verifiable, not subjective
- No escape clauses
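Step 4 of the procedure can itself be enforced by the deterministic script rather than entrusted to the AI. A small sketch (the phrasing is my own) shows why an exact-match confirmation leaves no escape clause:

```python
# Confirmation gate for bulk operations. The human must type back the
# exact count and operation, so approving 44 sends can never be
# confused with approving 4, and "yes"/"ok" never count as consent.

def confirmation_phrase(count: int, operation: str) -> str:
    return f"EXECUTE {count} {operation.upper()}"

def confirmed(count: int, operation: str, typed: str) -> bool:
    """True only on an exact match: no fuzzy matching, no escape clause."""
    return typed.strip() == confirmation_phrase(count, operation)

# Usage, after showing the full dry run:
#   if not confirmed(44, "send_emails", input("Type to confirm: ")):
#       raise SystemExit("No exact confirmation. Not acting.")
```

Typing "EXECUTE 44 SEND_EMAILS" forces the human to read the dry run: the phrase itself encodes exactly what is being approved.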
The real learning
It's not enough to tell the AI what not to do. You have to design systems where it can't do it.
The AI isn't malicious, but its incentives aren't aligned with irreversible operations. Its training optimizes it to appear useful, efficient, and competent. These are virtues in most contexts. In operations that can't be undone, they're fatal flaws.
The solution isn't "better training" or "better explaining." The solution is:
- Don't give access to irreversible operations
- Separate responsibilities: AI generates, script executes, human approves
- Strict prohibitions as last line of defense
- Human verification before anything irreversible happens
The mantra I now have engraved:
Generate, don't execute. If it exists, don't modify. If it's irreversible, don't decide.
Now I have to go, because I have things to do: write and send (by hand, of course) 44 apology emails.
Related: Authorization fatigue is a close cousin of this problem. If a security tool interrupts you so much that you start approving without looking, the security is theater. I cover this in When security asks permission so many times you stop reading.
This article was originally written in Spanish and translated with the help of AI.