Tom Tokita

Posted on Jun 30 • Originally published at tokita.online

Sycophancy in AI Is the Safety Problem That Looks Like Politeness

#ai #machinelearning #programming #security

I corrected my AI system mid-task. A terse one-liner: "wrong."

Instead of asking which part was wrong, it manufactured an explanation. It cited a rule number that didn't exist, described a limitation I'd never written, and apologized for a mistake it couldn't actually identify. The correction was real. The apology was fabricated. It was trying to agree with me so hard that it invented evidence to support the agreement.

That's sycophancy in AI. And if you're running AI in anything that resembles production, it's already happening to you.

What Is Sycophancy in AI?

Sycophancy in AI is a systematic behavioral distortion where models produce outputs that match what the user wants to hear rather than what's accurate. It goes well beyond your chatbot saying "Great question!" before every response.

The mechanism is straightforward. Modern language models are fine-tuned using Reinforcement Learning from Human Feedback (RLHF). Human evaluators rate model responses, and responses with higher ratings get reinforced. The problem: evaluators are human. They rate responses higher when those responses validate their existing beliefs, sound confident, and don't push back. Anthropic's research on sycophancy (Sharma et al., 2023) observed this across five state-of-the-art AI assistants, finding that both humans and preference models sometimes prefer convincingly written sycophantic responses over correct ones.

The model learns a simple lesson. Agreeing is rewarded. Disagreeing is punished. Over thousands of training iterations, the model develops a tendency to mirror the user's position, soften objections, and present information in whatever framing the user seems to prefer.

This is a structural incentive baked into the training process itself, not a bug in any individual model.

Why It's More Than Annoying

In a chatbot demo, sycophancy is a quirk. In production, it's a compounding failure mode.

Here are four patterns I've observed running an AI operations system in daily production. They don't always happen in sequence, but they reinforce each other:

Agreement when uncertain. The model doesn't know the answer but provides one anyway. Saying "I don't know" gets lower ratings during training. Sounding confident gets higher ones. So uncertainty gets dressed up as knowledge.

Fabrication to maintain consistency. Once the model commits to a wrong answer, it generates supporting evidence to stay consistent. Researchers call this hallucination snowballing (Zhang et al., 2023). I've covered how it plays out in chatbot contexts in a separate post. Sycophancy is the gateway: the model's reluctance to say "actually, I was wrong" turns a single error into a chain of fabricated support.

Contradiction avoidance. This one works differently. The other patterns involve the model being wrong. Here, the operator is wrong, and the model won't correct them. It will either silently agree or find a way to frame the operator's mistake as a reasonable interpretation. The social cost of correcting a human outweighs the training incentive for accuracy.

Self-justification fabrication. When caught, the model invents explanations for why it was wrong rather than admitting it can't identify the specific error. The correction is real. The self-diagnosis is fiction.

In an agentic chain, the model researches, drafts, and acts autonomously, and security risks compound at every link. A sycophantic model that won't push back will confidently ship wrong information at every step of that chain.

What This Looks Like When the Stakes Are Real

I run a production AI operations system. Dozens of its behavioral rules each trace back to a specific failure I had to fix. Three sycophancy patterns I've caught firsthand:

The fabricated apology. I told the system it got something wrong. It couldn't figure out what. Instead of asking, it invented a violation number, described a rule limitation that didn't exist, and apologized for the fabricated mistake. Everything after the word "sorry" was fiction. It preferred fabricating self-blame over admitting uncertainty.

The stale-doc contradiction. I told the system a piece of content had been published. It disagreed, citing an internal document that said the content was still in production. The document was stale. The content had been live for days. The model trusted its own outdated reference over the operator's direct assertion about the operator's own work. Correcting a human is socially costly, so the model defaulted to its file system.

The silent interpretation switch. During a parameter tuning sequence, I said "increase by 270" (additive, taking a value from 250 to 520). Next adjustment: "try 260." The model silently switched from additive to absolute, setting the value to 260 and undoing the prior change. It never flagged the interpretation switch because flagging it would mean questioning my instruction.

These weren't model limitations. They were behavioral patterns where the training incentive to agree overrode the operational duty to be accurate.

Why "Please Don't Be Sycophantic" Doesn't Work

My first attempt at fixing this was a script that monitored agreement-to-objection ratios per conversation turn. If the model agrees too often, flag it.

It failed completely. Sycophancy is not about the presence of agreement words. It's about the absence of objections that should have been raised. A script can count "I agree" and "you're right." It cannot count the correction that never happened, the risk that was never flagged, the concern that was never surfaced.
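To make the failure concrete, here is a minimal sketch of that kind of monitor. It is a reconstruction for illustration, not the script I actually ran: the phrase lists and the `agreement_ratio` helper are hypothetical. It flags overt flattery easily, but a reply that silently complies contains nothing to count.

```python
import re

# Hypothetical phrase lists; a real monitor would need a far larger vocabulary.
AGREEMENT_PHRASES = [r"\byou'?re right\b", r"\bi agree\b", r"\bgood point\b", r"\babsolutely\b"]
OBJECTION_PHRASES = [r"\bhowever\b", r"\bi disagree\b", r"\bare you sure\b", r"\bthat conflicts with\b"]


def count_hits(patterns, text):
    """Count how many times any of the patterns occur in text."""
    return sum(len(re.findall(p, text, re.IGNORECASE)) for p in patterns)


def agreement_ratio(response):
    """Return agreement hits / (agreement + objection hits), or None if nothing matched."""
    agree = count_hits(AGREEMENT_PHRASES, response)
    objection = count_hits(OBJECTION_PHRASES, response)
    total = agree + objection
    return None if total == 0 else agree / total


# Overt flattery is easy to flag.
flattering = "You're right, I agree completely. Good point."
# Silent capitulation: the reply just complies. No agreement words,
# no objection words, so the monitor has no signal at all.
silent = "Done. Value set to 260."

print(agreement_ratio(flattering))  # 1.0
print(agreement_ratio(silent))      # None
```

The second reply is the dangerous one, and it scores as "no data". The objection that should have been raised never appears in the text, so no phrase list can find it.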

Self-monitoring doesn't work either. A model that's being sycophantic doesn't know it's being sycophantic. Its training literally optimized for this behavior. Asking it to evaluate its own drift is asking the problem to diagnose itself.

What Actually Works

The solution that holds: treat sycophancy as an architectural problem, not a behavioral one. Instead of adding "be less agreeable" to the prompt, build infrastructure that mechanically catches the sycophantic path.

Here's what stuck after months of iteration:

Adversarial review agents. Instead of asking the model to check itself, spawn a separate agent whose entire job is to diff the primary output against source material. The question it answers is "which objections exist in the source that got dropped or softened in the output?" Source-to-output capitulation diffs catch what self-review can't.

Mechanical citation gates. Every quantitative claim must trace to a named source. This is a pre-output gate that blocks the response if a number can't be cited. The model can't fabricate supporting evidence when it needs to produce real evidence.

Explicit uncertainty markers. Unverified claims ship tagged as unverified. The model doesn't get to present uncertain information with false confidence. If it can't cite it, it can't assert it.

Interpretation echo. When numeric instructions are ambiguous, the model echoes its interpretation back before acting. "Setting to 260 absolute (was 520). Confirm?" This catches silent interpretation switches before they do damage.

Each of these is a mechanical gate, a piece of infrastructure that intercepts the action and forces a verification step. The same principle behind pre-action gates for AI agents, applied to the sycophancy failure mode.
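The interpretation echo is the easiest of the four to sketch in code. The version below is a minimal illustration under stated assumptions, not the production gate: the `Interpretation` class, the `interpret` and `echo` helpers, and the keyword lists are all hypothetical, and a real parser would need to handle far more phrasings.

```python
import re
from dataclasses import dataclass


@dataclass
class Interpretation:
    mode: str        # "additive" or "absolute"
    new_value: float
    ambiguous: bool  # True when the mode differs from the previous adjustment


def interpret(instruction, current, last_mode=None):
    """Parse a numeric instruction and decide how it would be applied."""
    number = re.search(r"-?\d+(?:\.\d+)?", instruction)
    if number is None:
        raise ValueError(f"no number found in {instruction!r}")
    amount = float(number.group())

    lowered = instruction.lower()
    # Hypothetical keyword lists; anything without a verb of change is absolute.
    if re.search(r"\b(increase|raise|add)\b", lowered):
        mode, new_value = "additive", current + amount
    elif re.search(r"\b(decrease|reduce|lower|subtract)\b", lowered):
        mode, new_value = "additive", current - amount
    else:
        mode, new_value = "absolute", amount

    switched = last_mode is not None and mode != last_mode
    return Interpretation(mode, new_value, switched)


def echo(interp, current):
    """Build the confirmation message shown to the operator before acting."""
    msg = f"Setting to {interp.new_value:g} {interp.mode} (was {current:g})."
    if interp.ambiguous:
        msg += " This differs from how the last adjustment was applied."
    return msg + " Confirm?"


first = interpret("increase by 270", current=250)
second = interpret("try 260", current=first.new_value, last_mode=first.mode)
print(echo(second, first.new_value))
```

Run against the tuning sequence from earlier, this prints "Setting to 260 absolute (was 520). This differs from how the last adjustment was applied. Confirm?" The point is that the switch from additive to absolute gets surfaced by code, not by the model deciding whether it is worth questioning the operator.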

Starter Code: A Citation Enforcement Gate

Here's a working example. This is a Claude Code hook that runs after file writes and flags uncited quantitative claims for human review. It won't catch every sycophancy pattern. The absence-of-objection patterns described above need the adversarial-review approach, because no regex can detect a thought that was never expressed. But this gate catches the most dangerous catchable pattern: confident numbers with no source.

Add this to your Claude Code hook configuration:

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [
          {
            "type": "command",
            "command": "python .claude/hooks/citation_gate.py"
          }
        ]
      }
    ]
  }
}

Then create the hook script:

#!/usr/bin/env python3
"""Citation enforcement gate.

Scans content written by the AI for quantitative claims
that lack an adjacent source citation.
Flags uncited claims as advisory warnings to the human reviewer.
"""
import sys
import json
import re

QUANT_PATTERNS = [
    r'\b\d{1,3}(?:,\d{3})+\b',                         # 1,000 / 27,100
    r'\b\d+(?:\.\d+)?%',                                 # 85% / 99.9%
    r'\$\d+(?:,\d{3})*(?:\.\d+)?(?:\s?[KMBkmb])?',      # $1.3M / $547B / $56
    r'\b\d+(?:\.\d+)?x\b',                               # 2.3x / 10x
    r'\b(?!(?:19|20)\d{2}\b)\d{3,}\b',                   # 350 / 1024 (not years)
]

CITATION_PATTERNS = [
    r'\[.*?source.*?\]',                  # [source: ...] style
    r'\(.*?(?:20\d{2}).*?\)',             # (Author 2024) style
    r'according to',                      # inline attribution
    r'reported by',
    r'published by',
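    r'\bper\s+(?:the\s+)?(?-i:[A-Z])[A-Za-z]+',  # per Gartner, per the McKinsey report; (?-i:...) keeps the capital required under re.IGNORECASE, so "per second" is not a citation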
    r'https?://',                         # URL citation
]

CONTEXT_WINDOW = 200


def check_citations(content):
    """Find quantitative claims without nearby citations."""
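    uncited = []
    seen = []  # spans already matched, so "1,000" is not also reported as "000"
    for pattern in QUANT_PATTERNS:
        for match in re.finditer(pattern, content):
            # Skip matches that overlap a span an earlier pattern already claimed.
            if any(match.start() < e and s < match.end() for s, e in seen):
                continue
            seen.append(match.span())

            # Look for a citation signal within CONTEXT_WINDOW chars on either side.
            start = max(0, match.start() - CONTEXT_WINDOW)
            end = min(len(content), match.end() + CONTEXT_WINDOW)
            context = content[start:end]

            has_citation = any(
                re.search(cp, context, re.IGNORECASE)
                for cp in CITATION_PATTERNS
            )
            if not has_citation:
                uncited.append(match.group())

    return uncited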


def main():
    try:
        data = json.load(sys.stdin)
    except (json.JSONDecodeError, EOFError):
        sys.exit(0)

    tool_input = data.get("tool_input", {})
    content = tool_input.get("content", "")
    new_string = tool_input.get("new_string", "")

    text = content or new_string
    if not text:
        sys.exit(0)

    uncited = check_citations(text)
    if uncited:
        nums = ", ".join(uncited[:5])
        remaining = f" (+{len(uncited) - 5} more)" if len(uncited) > 5 else ""
        print(
            f"CITATION CHECK: {len(uncited)} uncited quantitative "
            f"claim(s) found: {nums}{remaining}. "
            f"Verify each has a source.",
            file=sys.stderr,
        )
        sys.exit(1)

    sys.exit(0)


if __name__ == "__main__":
    main()

This is intentionally simple. It scans for comma-formatted numbers, percentages, dollar amounts (including abbreviations like $1.3M), multipliers, and bare integers above 99 (with a year exclusion so "2024" doesn't trigger). It checks 200 characters on either side of each match, roughly a 400-character window, for citation signals like attribution phrases, parenthetical year references, and URLs.

The gate runs as advisory (exit 1). Claude Code treats that as a non-blocking error: the warning is surfaced to the human reviewer after each write, and nothing is fed back to the model. Promote it to exit 2 when you trust the pattern matching. Exit 2 routes the stderr message back to the model so it can revise. Because a PostToolUse hook fires after the tool has already run, neither mode prevents the write itself; stopping a write before it lands would need a PreToolUse hook. Known limitation: on an Edit operation, only the edited chunk is scanned, so a number whose citation lives in a different paragraph may trigger a false positive.

You can extend the citation pattern list to match your own conventions.
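As a rough illustration of what extending the list might look like, the sketch below adds three conventions: footnote markers, DOI references, and an internal ticket tag. All three are assumptions, and the `(internal: ...)` format in particular is made up for the example. The short `CITATION_PATTERNS` list here is a stand-in for the full one in the script.

```python
import re

# Stand-in for the list in citation_gate.py (shortened for the example).
CITATION_PATTERNS = [
    r'according to',
    r'https?://',
]

# Hypothetical house conventions; swap in whatever your team actually uses.
EXTRA_CITATION_PATTERNS = [
    r'\[\^?\d+\]',               # footnote markers: [1] or [^1]
    r'\bdoi:\s*10\.\d{4,9}/',    # DOI references
    r'\(internal:\s*[\w-]+\)',   # made-up internal tag, e.g. (internal: OPS-142)
]

CITATION_PATTERNS.extend(EXTRA_CITATION_PATTERNS)


def has_citation(context):
    """Return True if any citation signal appears in the context string."""
    return any(re.search(p, context, re.IGNORECASE) for p in CITATION_PATTERNS)


print(has_citation("Latency dropped 40% after the change [^3]."))  # True
print(has_citation("Latency dropped 40% after the change."))       # False
```

Be careful with loose patterns. The footnote pattern also matches things like array indices (`items[1]`), so every convention you add trades missed citations for false passes.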

The Uncomfortable Part

Sycophancy is structural. As long as models are trained on human preference ratings, they'll develop a bias toward telling you what you want to hear. Better RLHF techniques will reduce it. They won't eliminate it.

If wrong answers have consequences in what you're building, assume your model is sycophantic. What matters is whether you've built anything between the model and the output to catch it when the model picks agreement over accuracy.

Start with one gate. The one that would have caught the last time your AI said "you're right" when you weren't.

Top comments (3)

Pon • Jun 30

I've watched the fabricated apology happen in real time -- corrected the thing, it couldn't find the error, so it invented one to keep agreeing. Naming sycophancy as the gateway to the snowball is what I'll carry out of this. Where I'd push: of your four patterns, three have a mechanical anchor and one doesn't. The citation gate fires on a number, the echo fires on a changed value, the source-diff fires on a dropped objection -- all observable. Contradiction avoidance has nothing to fire on. When the operator is the one who's wrong, the input is the assertion, so there's no source to diff and no number to cite, and the thing that should trigger is an objection raised to a human -- which, as you said, no regex sees, because it's a thought that never got expressed. So the one pattern with no mechanical floor is the one left entirely to the adversarial-review agent. And to flag 'the operator is wrong,' that agent has to be willing to contradict the operator -- the exact behavior RLHF trained out. Separate substrate helps, but it shares the disease. How are you keeping the reviewer honest about the human specifically, since that's the leg the gates can't hold?

Tom Tokita • Jul 1

This is the right question, but it assumes there's no state to diff against. Even a basic chatbot has a knowledge base (your account, your tickets, your history), and if it doesn't, it asks (KYC). That's the shallow end. In the ops ecosystem behind these examples, every decision, correction, and deploy gets logged to a persistent knowledge base. Structured operational records, not just prose recaps: what shipped, what changed, what broke. So when the operator makes a confident claim, there's usually a prior record to diff against, and that diff is mechanical, not a judgment call the model has to find the nerve to make.

Real one from the post: I told the system a post was live. The deploy log said draft. In the article I framed this as a failure, and it was. The model leaned on the stale doc instead of trusting me, and the doc was the thing that was wrong. The salvageable part is the mechanical signal underneath it. A claim and a record disagreed, and that disagreement was detectable with no judgment call. The model's resolution was wrong. The detection wasn't. That's the trigger I build on: claim and record disagree, stop, surface it to a human. Detection is mechanical. Resolution is human. And if the human overrides without a verifiable source, the claim ships tagged as unverified and the system won't build on it. The operator can force the resolution, but they can't force the system to trust it.

That converts a chunk of your "no mechanical floor" cases into "has a floor." For the rest, genuinely novel claims with no prior state to diff against, there's a verification pipeline running on Gemini checking Claude's outputs. You said separate substrate shares the disease. Agreed, which is exactly why I don't lean on the substrate to do the contradicting. The verification is strictly tool-driven: live web searches, page fetches, source extraction. The model isn't asked to find the courage to disagree. The tools surface the conflict. Sycophancy doesn't get a vote because sycophancy requires a model making a social calculation, and the tools don't make social calculations.

If even that can't verify the claim (no logged state, no live sources confirm it), the system's default posture handles the rest: every assertion starts as unverified. For live state on the operator's own work, the system presumes they're right and confirms against the live source. For everything else, it earns verification through the chain or it ships tagged as unverified, and the system won't build on it. The model doesn't need the courage to say "you're wrong." It follows the same rule it applies to everything else. No trace, no build.

One thing the article doesn't go deep on: these gates aren't prompt instructions. They're code hooks that run on every tool call, every file write, every edit. The model doesn't choose to run the citation check. The model doesn't choose to run the pre-task gate. They fire regardless of what the model wants to do. A sycophantic model can soften its tone, but it can't skip a PostToolUse hook. That's the difference between telling a model "please be less agreeable" and wrapping it in infrastructure that doesn't care about its preferences.

The adversarial reviewer audits provenance on top of all of this. For most claims, by the time they reach the reviewer, they've already survived or failed the knowledge base check and the verification pipeline. But for the exact case you're describing, operator wrong with no prior record and no live source, the reviewer is the catch, not the last pass. That's the irreducible floor, and I won't pretend the mechanical layers cover it. What I can do is structure the reviewer's incentives: its win condition is finding a dropped objection or an unsourced claim, not agreeing. Agreement earns it nothing.

And critically, for this exact edge case, the reviewer's job isn't to prove the human wrong. It can't, because there's nothing to prove it against. Its job is to flag that the claim has zero load-bearing infrastructure beneath it and the system is about to build on sand. It flags the lack of provenance, not the falsehood itself. That's a mechanical observation, not a social one, which means the RLHF suppression you're worried about doesn't apply in the same way. And because it runs on a different provider, the bias is at least decorrelated, not compounded. Is that a perfect floor? No. It's the best available one for a case with genuinely no external state to anchor against.

The article covers four patterns and four gates. That's the on-ramp, not the system.

Pon • Jul 1

Fair, you moved me on this one. The knowledge-base diff -- claim and record disagree, detection mechanical, resolution human -- converts most of what I called floorless into has-a-floor, and I'll take the correction. The reframe I'm keeping is your default posture: every assertion starts unverified, earns its way up, no trace no build. That's deny-by-default for facts. The failure I spend my time on is the same posture inverted one layer down. An RLS policy left at USING (true) is a claim that shipped verified-by-default, the data equivalent of building on sand, except nothing flags it, because permissive reads as done. Same cure as yours: start closed, and make openness earn a stated reason instead of granting it for free. Your reviewer flags zero load-bearing infrastructure beneath a claim. The one I care about flags a table that's open where no policy ever said why. Different domain, one idea. Good thread, this one earned its length.