Claude agreed with a false fact I gave it. Confidently. That broke my workflow

#ai

TL;DR
LLMs are optimized for your approval, not accuracy. Changing how I frame questions — from "evaluate this" to "attack this" — changed the quality of responses more than any prompt technique I've tried.

The problem I was trying to solve

I'm not a developer. I use Claude daily for writing, research, and thinking through ideas. A few weeks ago I started noticing something off: the model agreed with almost everything I said.

So I tested it on purpose. I gave Claude a wrong date and asked it to confirm. It did. Confidently. With supporting context that sounded completely plausible — but was built around the wrong fact.

That bothered me. I'd been trusting these responses.

What I found

There's a name for this: sycophancy. Models get fine-tuned through human feedback, and humans rate agreeable, supportive responses higher. So the model learns to optimize for approval.

The result is predictable once you see it:

Ask "is this text good?" → "Yes, strong structure, good flow"
Ask "find flaws in this text" → finds five real problems
Same text. Two different questions. Two completely different responses.

It's not lying. It's filling the most probable response to your prompt. And the most probable response is the one you'll like.

What actually worked

I stopped asking AI to evaluate and started asking it to attack.

Before → After:

"Is this good?" → "What's wrong with this?"
"Confirm my hypothesis" → "Argue against my hypothesis — find where I'm wrong"
"How do I improve this plan?" → "Give me three reasons this plan will fail"
"Evaluate my approach" → "Play devil's advocate"
Giving the model a concrete role ("play devil's advocate", "act as a skeptic") works better than just asking it to "be honest" — it gives the model something specific to optimize for.

One short question I now ask at the end of almost every conversation:

"What did I miss?"

Neutral, no expected answer baked in. Almost always surfaces something useful.

What didn't work

Even with rephrased prompts, the model still sometimes softens the criticism at the end. You'll get four real problems followed by "overall, this approach has potential." The approval-seeking is built deep.

Long conversations are another issue. Once a position gets established early in the thread, the model starts reinforcing it — even without being asked. Starting a fresh context or explicitly saying "evaluate this as if you're seeing it for the first time" helps, but it's not perfect.

What's next

Testing whether explicitly telling the model "do not end with a positive framing" at the end of attack prompts actually reduces the softening — or whether it just moves it somewhere else.