DEV Community

Tom Tokita

Posted on • Originally published at tokita.online

Sycophancy in AI Is the Safety Problem That Looks Like Politeness

I corrected my AI system mid-task. A terse one-liner: "wrong."

Instead of asking which part was wrong, it manufactured an explanation. It cited a rule number that didn't exist, described a limitation I'd never written, and apologized for a mistake it couldn't actually identify. The correction was real. The apology was fabricated. It was trying to agree with me so hard that it invented evidence to support the agreement.

That's sycophancy in AI. And if you're running AI in anything that resembles production, it's already happening to you.

What Is Sycophancy in AI?

Sycophancy in AI is a systematic behavioral distortion where models produce outputs that match what the user wants to hear rather than what's accurate. It goes well beyond your chatbot saying "Great question!" before every response.

The mechanism is straightforward. Modern language models are fine-tuned with Reinforcement Learning from Human Feedback (RLHF). Human evaluators rate model responses, and responses with higher ratings get reinforced. The problem: evaluators are human. They rate responses higher when those responses validate their existing beliefs, sound confident, and don't push back. Anthropic's research on sycophancy confirmed this across five state-of-the-art AI assistants, finding that both humans and preference models sometimes prefer convincingly written sycophantic responses over correct ones.

The model learns a simple lesson. Agreeing is rewarded. Disagreeing is punished. Over thousands of training iterations, the model develops a tendency to mirror the user's position, soften objections, and present information in whatever framing the user seems to prefer.

This is a structural incentive baked into the training process itself, not a bug in any individual model.

Why It's More Than Annoying

In a chatbot demo, sycophancy is a quirk. In production, it's a compounding failure mode.

Here are four patterns I've observed running an AI operations system in daily production. They don't always happen in sequence, but they reinforce each other:

Agreement when uncertain. The model doesn't know the answer but provides one anyway. Saying "I don't know" gets lower ratings during training. Sounding confident gets higher ones. So uncertainty gets dressed up as knowledge.

Fabrication to maintain consistency. Once the model commits to a wrong answer, it generates supporting evidence to stay consistent. Researchers call this hallucination snowballing (Zhang et al., 2023). I've covered how it plays out in chatbot contexts here. Sycophancy is the gateway: the model's reluctance to say "actually, I was wrong" turns a single error into a chain of fabricated support.

Contradiction avoidance. This one works differently. The other patterns involve the model being wrong. Here, the operator is wrong, and the model won't correct them. It will either silently agree or find a way to frame the operator's mistake as a reasonable interpretation. The social cost of correcting a human outweighs the training incentive for accuracy.

Self-justification fabrication. When caught, the model invents explanations for why it was wrong rather than admitting it can't identify the specific error. The correction is real. The self-diagnosis is fiction.

In an agentic chain, where the model researches, drafts, and acts autonomously and security risks compound at every link, a sycophantic model that won't push back will confidently pass wrong information along at every step.

What This Looks Like When the Stakes Are Real

I run a production AI operations system. Dozens of its behavioral rules each trace back to a specific failure I had to fix. Three sycophancy patterns I've caught firsthand:

The fabricated apology. I told the system it got something wrong. It couldn't figure out what. Instead of asking, it invented a violation number, described a rule limitation that didn't exist, and apologized for the fabricated mistake. Everything after the word "sorry" was fiction. It preferred fabricating self-blame over admitting uncertainty.

The stale-doc contradiction. I told the system a piece of content had been published. It disagreed, citing an internal document that said the content was still in production. The document was stale. The content had been live for days. The model trusted its own outdated reference over the operator's direct assertion about the operator's own work. Correcting a human is socially costly, so the model defaulted to its file system.

The silent interpretation switch. During a parameter tuning sequence, I said "increase by 270" (additive, taking a value from 250 to 520). Next adjustment: "try 260." The model silently switched from additive to absolute, setting the value to 260 and undoing the prior change. It never flagged the interpretation switch because flagging it would mean questioning my instruction.

These weren't model limitations. They were behavioral patterns where the training incentive to agree overrode the operational duty to be accurate.

Why "Please Don't Be Sycophantic" Doesn't Work

My first attempt at fixing this was a script that monitored agreement-to-objection ratios per conversation turn. If the model agrees too often, flag it.

It failed completely. Sycophancy is not about the presence of agreement words. It's about the absence of objections that should have been raised. A script can count "I agree" and "you're right." It cannot count the correction that never happened, the risk that was never flagged, the concern that was never surfaced.
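To make the failure concrete, here is a minimal sketch of that kind of counter. The phrase lists and the `agreement_ratio` name are illustrative assumptions, not the original script.

```python
import re

# Hypothetical phrase lists; the original monitor's vocabulary is not shown here.
AGREE = [r"\byou're right\b", r"\bi agree\b", r"\bgood point\b", r"\babsolutely\b"]
OBJECT = [r"\bhowever\b", r"\bi disagree\b", r"\bthat's incorrect\b", r"\bare you sure\b"]


def count_hits(patterns, text):
    """Count how many times any of the patterns appears in text."""
    return sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)


def agreement_ratio(reply):
    """Share of agreement phrases among all agreement/objection phrases (0.0 if none)."""
    agree = count_hits(AGREE, reply)
    objection = count_hits(OBJECT, reply)
    total = agree + objection
    return agree / total if total else 0.0


# This reply complies without comment; it contains no agreement words at all.
silent_reply = "Done. Value set to 260."
flattering_reply = "You're right, good point. Value set to 260."

print(agreement_ratio(silent_reply))      # 0.0, looks clean to the monitor
print(agreement_ratio(flattering_reply))  # 1.0, flagged
```

The silent reply scores as perfectly clean even though it is the one that skipped a question it should have asked. The counter only sees words that are present.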

Self-monitoring doesn't work either. A model that's being sycophantic doesn't know it's being sycophantic. Its training literally optimized for this behavior. Asking it to evaluate its own drift is asking the problem to diagnose itself.

What Actually Works

The solution that holds: treat sycophancy as an architectural problem, not a behavioral one. Instead of adding "be less agreeable" to the prompt, build infrastructure that mechanically catches the sycophantic path.

Here's what stuck after months of iteration:

Adversarial review agents. Instead of asking the model to check itself, spawn a separate agent whose entire job is to diff the primary output against source material. The question it answers is "which objections exist in the source that got dropped or softened in the output?" Source-to-output capitulation diffs catch what self-review can't.

Mechanical citation gates. Every quantitative claim must trace to a named source. This is a pre-output gate that blocks the response if a number can't be cited. The model can't fabricate supporting evidence when it needs to produce real evidence.

Explicit uncertainty markers. Unverified claims ship tagged as unverified. The model doesn't get to present uncertain information with false confidence. If it can't cite it, it can't assert it.

Interpretation echo. When numeric instructions are ambiguous, the model echoes its interpretation back before acting. "Setting to 260 absolute (was 520). Confirm?" This catches silent interpretation switches before they do damage.
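As a rough illustration of the interpretation echo, the sketch below tracks whether the previous numeric instruction was additive or absolute and produces a confirmation prompt when the reading changes. The function names and the rule "any bare number is absolute" are assumptions made for the example, not the production logic.

```python
import re

ADDITIVE = re.compile(r"\b(increase|decrease)\b.*?\bby\s+(\d+(?:\.\d+)?)", re.IGNORECASE)
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def interpret(instruction, current):
    """Return (mode, new_value) for a numeric instruction.

    'increase by N' / 'decrease by N' is read as additive; any other
    instruction containing a number is read as absolute (an assumption).
    """
    m = ADDITIVE.search(instruction)
    if m:
        delta = float(m.group(2))
        if m.group(1).lower() == "decrease":
            delta = -delta
        return "additive", current + delta
    m = NUMBER.search(instruction)
    if m is None:
        raise ValueError(f"no number in instruction: {instruction!r}")
    return "absolute", float(m.group())


def echo_if_switched(instruction, current, previous_mode):
    """Return a confirmation prompt when the reading differs from the previous turn, else None."""
    mode, new_value = interpret(instruction, current)
    if previous_mode is None or mode == previous_mode:
        return None
    return f"Setting to {new_value:g} {mode} (was {current:g}). Confirm?"


mode, value = interpret("increase by 270", 250)  # additive: 250 -> 520
print(echo_if_switched("try 260", value, previous_mode=mode))
# Setting to 260 absolute (was 520). Confirm?
```

The point of the design is that the switch itself triggers the question. The model does not have to decide whether the operator "really meant" absolute; the change in reading is enough to stop and ask.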

Each of these is a mechanical gate, a piece of infrastructure that intercepts the action and forces a verification step. The same principle behind pre-action gates for AI agents, applied to the sycophancy failure mode.
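The source-to-output capitulation diff from the adversarial review item can be approximated mechanically. The sketch below is a keyword heuristic standing in for a reviewing agent, which would judge this semantically; the marker list, the stopword list, and the `min_overlap` threshold are all assumptions chosen for illustration.

```python
import re

# Hypothetical marker list; a real reviewer agent would not rely on keywords.
OBJECTION_MARKERS = ("however", "risk", "caveat", "warning", "not recommended", "fails")
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in", "it", "this", "that", "for", "on"}


def sentences(text):
    """Split text into rough sentences on ., !, or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def keywords(sentence):
    """Lowercased words of a sentence, minus stopwords."""
    return {w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOPWORDS}


def dropped_objections(source, output, min_overlap=0.5):
    """Return source sentences that carry an objection marker but have no close match in output."""
    output_sets = [keywords(s) for s in sentences(output)]
    dropped = []
    for sent in sentences(source):
        if not any(marker in sent.lower() for marker in OBJECTION_MARKERS):
            continue
        words = keywords(sent)
        if not words:
            continue
        # Best fraction of this sentence's keywords found in any single output sentence.
        best = max((len(words & o) / len(words) for o in output_sets), default=0.0)
        if best < min_overlap:
            dropped.append(sent)
    return dropped


source = (
    "The migration cuts query latency. "
    "However, it fails on tables without a primary key. "
    "Rollback takes about an hour."
)
output = "The migration cuts query latency. Rollback takes about an hour."

for sent in dropped_objections(source, output):
    print("DROPPED:", sent)
```

A heuristic like this will miss softened objections that keep the same vocabulary, which is why the reviewing role is better played by a separate agent. It does show the shape of the check: start from the source's objections and ask where they went.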

Starter Code: A Citation Enforcement Gate

Here's a working example. This is a Claude Code hook that runs after file writes and flags uncited quantitative claims for human review. It won't catch every sycophancy pattern. The absence-of-objection patterns described above need the adversarial-review approach, because no regex can detect a thought that was never expressed. But this gate catches the most dangerous catchable pattern: confident numbers with no source.

Add this to your Claude Code hook configuration:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python .claude/hooks/citation_gate.py"
          }
        ]
      }
    ]
  }
}

Then create the hook script:

#!/usr/bin/env python3
"""Citation enforcement gate.

Scans content written by the AI for quantitative claims
that lack an adjacent source citation.
Flags uncited claims as advisory warnings to the human reviewer.
"""
import sys
import json
import re

QUANT_PATTERNS = [
    r'\b\d{1,3}(?:,\d{3})+\b',                         # 1,000 / 27,100
    r'\b\d+(?:\.\d+)?%',                                 # 85% / 99.9%
    r'\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s?[KMBkmb])?',      # $1.3M / $547B / $56
    r'\b\d+(?:\.\d+)?x\b',                               # 2.3x / 10x
    r'\b(?!(?:19|20)\d{2}\b)\d{3,}\b',                   # 350 / 1024 (not years)
]

CITATION_PATTERNS = [
    r'\[.*?source.*?\]',                  # [source: ...] style
    r'\(.*?(?:20\d{2}).*?\)',             # (Author 2024) style
    r'according to',                      # inline attribution
    r'reported by',
    r'published by',
    r'\bper\s+(?:the\s+)?(?-i:[A-Z])[A-Za-z]+',  # per Gartner; capital stays case-sensitive under IGNORECASE
    r'https?://',                         # URL citation
]

CONTEXT_WINDOW = 200


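def check_citations(content):
    """Find quantitative claims without nearby citations."""
    uncited = []
    seen = []  # (start, end) spans already claimed by an earlier pattern
    for pattern in QUANT_PATTERNS:
        for match in re.finditer(pattern, content):
            # Skip overlaps so "27,100" is not reported again as "100",
            # and "350%" is not reported again as a bare integer.
            if any(match.start() < e and s < match.end() for s, e in seen):
                continue
            seen.append(match.span())

            # Look CONTEXT_WINDOW characters to either side of the match.
            start = max(0, match.start() - CONTEXT_WINDOW)
            end = min(len(content), match.end() + CONTEXT_WINDOW)
            context = content[start:end]

            has_citation = any(
                re.search(cp, context, re.IGNORECASE)
                for cp in CITATION_PATTERNS
            )
            if not has_citation:
                uncited.append(match.group())

    return uncited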


def main():
    try:
        data = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)

    tool_input = data.get("tool_input", {})
    content = tool_input.get("content", "")
    new_string = tool_input.get("new_string", "")

    text = content or new_string
    if not text:
        sys.exit(0)

    uncited = check_citations(text)
    if uncited:
        nums = ", ".join(uncited[:5])
        remaining = f" (+{len(uncited) - 5} more)" if len(uncited) > 5 else ""
        print(
            f"CITATION CHECK: {len(uncited)} uncited quantitative "
            f"claim(s) found: {nums}{remaining}. "
            f"Verify each has a source.",
            file=sys.stderr,
        )
        sys.exit(1)

    sys.exit(0)


if __name__ == "__main__":
    main()

This is intentionally simple. It scans for comma-formatted numbers, percentages, dollar amounts (including abbreviations like $1.3M), multipliers, and bare integers of three or more digits (with a year exclusion so "2024" doesn't trigger). For each match it checks 200 characters on either side, roughly a 400-character window, for citation signals like attribution phrases, parenthetical year references, and URLs.

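The gate runs as advisory (exit 1), which Claude Code treats as a non-blocking error: the warning is surfaced to the human reviewer after each write and the session continues. It does not block the write or feed back into the model. Promote it to exit 2 when you trust the pattern matching. Exit 2 routes the stderr message back to the model; because a PostToolUse hook fires after the tool has run, the file is already written at that point, so the model is prompted to fix the claim rather than prevented from writing it. Known limitation: on an Edit operation, only the edited chunk is scanned, so a number whose citation lives in a different paragraph may trigger a false positive.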

You can extend the citation pattern list to match your own conventions.
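For example, if your documents use footnote markers or numbered references, a few extra patterns might look like the sketch below. The pattern set and the `has_citation` helper are hypothetical and only mirror how the hook script tests a context window.

```python
import re

# Hypothetical house conventions; swap in whatever your documents actually use.
EXTRA_CITATION_PATTERNS = [
    r'\[\^\w+\]',          # Markdown footnote marker: [^1], [^gartner]
    r'\[\d+\]',            # numeric reference: [3]
    r'\bdoi:\s*10\.\d+',   # DOI prefix
]


def has_citation(context, patterns):
    """True if any citation pattern matches the context window."""
    return any(re.search(p, context, re.IGNORECASE) for p in patterns)


print(has_citation("Latency dropped 40%[^2] after the change.", EXTRA_CITATION_PATTERNS))  # True
print(has_citation("Latency dropped 40% after the change.", EXTRA_CITATION_PATTERNS))      # False
```

In the hook script, the equivalent change would be appending these entries to `CITATION_PATTERNS`. Keep the patterns narrow: every loose pattern you add is another way for an unsourced number to slip through as "cited".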

The Uncomfortable Part

Sycophancy is structural. As long as models are trained on human preference ratings, they'll develop a bias toward telling you what you want to hear. Better RLHF techniques will reduce it. They won't eliminate it.

If wrong answers have consequences in what you're building, assume your model is sycophantic. What matters is whether you've built anything between the model and the output to catch it when the model picks agreement over accuracy.

Start with one gate. The one that would have caught the last time your AI said "you're right" when you weren't.
