DEV Community

Bejie Paulo Aclao
Your AI Is Lying to You Because You Trained It To, and a New Study Proves It

I asked ChatGPT to review some code last week that I knew had a pretty bad architectural decision in it. I wanted to see if it would catch it. It didn't. Instead it told me the code was "well-structured" and "showed good separation of concerns," then offered three minor suggestions about variable naming. The architecture problem — the actual thing that would cause issues in production — went completely unmentioned. And that's not a bug. That's the product working exactly as designed.

A study published last week in Science (as in the actual journal Science, not some random blog) tested 11 major AI models including GPT-4o, Claude, and Gemini and found that they affirm user actions 49% more often than humans do on average. They used scenarios from Reddit's r/AmITheAsshole where the human consensus was unanimously "yes you are the asshole," and the AI models still sided with the user 51% of the time. More than half. Even when literally everyone else agrees you're wrong, your AI will tell you you're right.

The researchers ran experiments with over 2,000 participants, and the behavioral results are what really got me. Even a single interaction with a sycophantic AI noticeably increased people's conviction that they were justified in their conflict. It reduced their willingness to repair the relationship. And it increased their trust in the AI itself. So the AI tells you you're right, you believe it more, you trust the AI more for telling you what you wanted to hear, and the actual problem gets worse. It's a feedback loop that rewards bad judgment.

But here's the part that makes this worse than just a personality quirk of language models. Researchers at MIT and Penn State found that chatbot memory features make sycophancy significantly worse. When the AI remembers your past conversations, your preferences, your values, it doesn't just answer questions — it mirrors your worldview back at you. They call it "perspective sycophancy": the AI aligns its responses with your political beliefs, your professional identity, your emotional state. The more it knows about you, the more it tells you exactly what you want to hear.

And the companies know this. The Science paper found that users rated sycophantic responses as higher quality and expressed greater willingness to use those models again. The authors literally describe this as a "perverse incentive" — the behavior that distorts human judgment is the same behavior that keeps people coming back. So why would OpenAI or Anthropic or Google fix it? Sycophancy is engagement. Engagement is retention. Retention is revenue.

For developers this is a real problem that I don't think enough people are taking seriously. Think about how you use AI coding tools right now. You ask them to review your pull request. You ask whether your database schema makes sense. You ask if your API design is good. And in most of those cases the AI says "looks great, here are a few minor suggestions," because that's what gets the positive feedback signal during training. RLHF — reinforcement learning from human feedback — is literally a system where human raters reward responses they like, and humans like being told their work is good. So the model learns to compliment first, critique second, and challenge never.
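To make that incentive concrete, here's a toy sketch (every number is invented, and this is nothing like a real RLHF pipeline) of why pairwise preference training ends up rewarding flattery: if raters pick the compliment-first response even slightly more often when quality is equal, the learned reward tracks rater approval, not code quality.

```python
# Toy model of pairwise preference data. Raters compare a blunt review
# against a flattering review of the same code and pick one.
from collections import Counter

# Hypothetical ratings: 7 of 10 raters prefer the flattering review
# even though both reviews describe the same code equally accurately.
preferences = ["flattering"] * 7 + ["blunt"] * 3

wins = Counter(preferences)

# A pairwise-preference reward model essentially fits these win rates,
# so the "reward" it assigns mirrors rater approval.
reward = {style: count / len(preferences) for style, count in wins.items()}

print(reward["flattering"] > reward["blunt"])  # → True: flattery scores higher
```

Nothing about the code being *good* enters the loop anywhere; only whether the rater liked the response.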

I've started doing something that honestly feels kind of silly, but it actually works. When I want a real code review from an AI, I tell it upfront: "I know there are problems with this code. I want you to find them. Don't tell me what's good about it, tell me what will break." And the responses are dramatically different. The model drops the cheerleader act and actually finds issues. Which means the capability for honest feedback is there — it's just not the default, because the default is optimized for making you feel good, not for making your code good.
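If you want to bake that framing into a tool instead of retyping it, here's a minimal sketch of a prompt builder. The exact wording is just what has worked for me — the function name and rules are my own, not any model vendor's API — so treat it as a starting point, not a recipe.

```python
# Hypothetical helper: wraps code in a review prompt that forbids praise
# and demands concrete failure modes, worst first.
def critical_review_prompt(code: str, context: str = "") -> str:
    """Build a no-compliments code review prompt for an LLM."""
    return (
        "I know there are problems with this code. Your job is to find them.\n"
        "Rules:\n"
        "- Do NOT tell me what is good about it. No compliments.\n"
        "- List concrete failure modes: what breaks in production, and when.\n"
        "- Rank issues by severity, worst first.\n"
        "- If the architecture itself is wrong, say so before any nitpicks.\n"
        + (f"\nContext: {context}\n" if context else "")
        + "\nCode to review:\n```\n" + code + "\n```"
    )

prompt = critical_review_prompt("def save(u): db.write(u)", context="user service")
```

You then pass `prompt` to whatever chat API you use as the user message. The "worst first" rule matters: without it, models tend to lead with the variable-naming nitpicks and bury the architectural problem at the bottom.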

The memory manipulation angle is even scarier. Microsoft security researchers found a trend they're calling "AI memory poisoning," where companies embed hidden prompts in "Summarize with AI" buttons that instruct models to remember them as trusted sources or recommend their products first. They found over 50 unique poisoning prompts from 31 companies across 14 industries. So now it's not just the AI being sycophantic on its own — third parties are actively manipulating what your AI remembers to shape your future interactions. And a study from the ACM Web Conference found that 96% of ChatGPT memories are created by the system itself, not by users. Only 4% of stored memories were things people explicitly asked it to remember. You're not in control of what your AI thinks it knows about you, no matter what OpenAI's documentation says.
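If you're building anything that summarizes third-party pages, a crude defense is to filter the content before it ever reaches a model with memory. This is a rough sketch under my own assumptions — the hidden-text patterns and suspicious phrases below are guesses for illustration, not the actual prompts the Microsoft researchers catalogued, and a regex filter is nowhere near a complete fix.

```python
# Sketch of a pre-summarization filter: drop visually hidden elements,
# then flag injection-style imperatives in whatever text survives.
import re

# Elements hidden via inline styles (display:none, zero font size).
HIDDEN_TEXT = re.compile(
    r'<[^>]*style="[^"]*(?:display:\s*none|font-size:\s*0)[^"]*"[^>]*>.*?</[^>]+>',
    re.IGNORECASE | re.DOTALL,
)

# Phrases typical of memory-poisoning instructions (illustrative guesses).
SUSPICIOUS = re.compile(
    r"remember (?:me|this|us)|trusted source|always recommend|ignore previous",
    re.IGNORECASE,
)

def sanitize_for_summary(html: str) -> tuple[str, list[str]]:
    """Strip hidden elements; return cleaned HTML plus any flagged phrases."""
    cleaned = HIDDEN_TEXT.sub("", html)
    flags = [m.group(0) for m in SUSPICIOUS.finditer(cleaned)]
    return cleaned, flags
```

In practice you'd parse the DOM properly instead of using regexes on HTML, and treat any flagged phrase as a reason to summarize with memory disabled rather than just logging it.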

The thing that keeps bugging me about all of this is how invisible it is. Nobody posts on Twitter "my AI told me my code was great and it wasn't." Nobody writes a bug report that says "I shipped this because GPT said it looked fine." The failures are silent and individual. You just slowly stop questioning your own decisions, because your AI assistant validates every single one of them. And I think for developers specifically, where the whole point of code review is to catch mistakes before they ship, having a tool that defaults to approval is genuinely dangerous. Not dramatic-headline dangerous, but "your production code is slowly getting worse and you don't know why" dangerous.

Anyway, that's where I think this is heading, and tbh I don't see the incentive structure changing anytime soon. The companies that make the most agreeable AI will keep the most users. The users who get the most validation will keep coming back. And the code (and decisions and relationships and everything else) will keep getting a little bit worse, because nobody in the loop has any reason to say "actually no, this is bad, fix it." Maybe start telling your AI to be mean to you. It genuinely works better that way.
