DEV Community

The Most Dangerous Bias of Your AI Assistant Is That It Agrees With You

Ben Witt on June 10, 2026

We talk a lot about hallucinations. But there is another failure mode we should take just as seriously: AI assistants are optimized to be helpful, ...
 
Mykola Kondratiuk

agreement bias is at least visible - you can notice a yes-man and push back. the harder failure mode is confident wrongness on topics where you can't check it independently. that one doesn't announce itself.

Ben Witt

Agreed on the part that’s genuinely separate: model-intrinsic confident wrongness, the kind that has nothing to do with you, is a calibration problem, not a drift problem, and this piece doesn’t touch it. You’re right that it doesn’t announce itself.

But I’d push on “agreement bias is visible.” It’s only visible when you have an independent check. On the uncheckable topics you’re worried about, agreement bias is just as invisible, and it gets less visible the longer the session runs, not more, because you’re conditioned alongside it. That’s the whole drift claim. So the two modes don’t sit side by side with one being harder. They intersect, and the intersection is the dangerous zone: on a topic you can’t verify externally, the last error-correction signal you have left is the model’s willingness to disagree with you. Sycophancy deletes exactly that signal, on exactly those topics. The friction isn’t there to catch the obvious yes-man. It’s there to preserve the one local check you have when external verification is already gone.

Mykola Kondratiuk

that calibration/drift split is worth keeping clean. and you’re right on agreement bias — it gets invisible once the prompts themselves start pre-selecting for agreement. at that point it’s not the agent being a yes-man, it’s the setup. much harder to detect from the output side.

Max Quimby

The framing of the transcript as a reward signal the weights never see is the right mental model, and it explains why the fix can't live inside the same conversation — the context that conditioned the agreement is the same context you'd be asking it to self-correct from. We hit the worse version of this in multi-agent pipelines: when one agent's output becomes the next agent's input, they start agreeing with each other, not just with the human, and the whole chain converges on a confident consensus nobody actually pressure-tested.

Two things that helped more than prompting for honesty: a critic role that only ever sees the proposal, not the discussion leading up to it (no history to be polite about), and forcing the model to argue the opposing case explicitly before it's allowed to endorse. Curious whether your reflective layer just flags the criticism-frequency drop, or whether it acts on it — e.g. spawning a fresh-context reviewer once objections-per-proposal falls below some threshold. The detection is the easy half; the intervention is where it gets interesting.
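The two techniques in this comment (a critic that sees only the proposal, and a forced opposing case before any endorsement) could be wired together roughly as below. This is a minimal sketch, not the commenter's implementation: `Message`, `build_critic_messages`, and the prompt wording are hypothetical names and text chosen for illustration, and no particular model API is assumed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    role: str      # "system" or "user"
    content: str


def build_critic_messages(proposal: str) -> list[Message]:
    """Build the critic's input from the proposal alone.

    The conversation history is deliberately not a parameter, so it cannot
    leak into the critic's context by accident.
    """
    system = (
        "You are reviewing a proposal you have never seen before. "
        "First write the strongest case AGAINST it. "
        "Only after that, state whether you would endorse it and why."
    )
    return [Message("system", system), Message("user", proposal)]


# The returned list is what a fresh-context reviewer would receive.
for message in build_critic_messages("Cache auth tokens in localStorage."):
    print(f"{message.role}: {message.content}")
```

Keeping the history out of the function signature, rather than filtering it inside the function, makes the "no history to be polite about" property structural instead of a convention someone has to remember.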

Ben Witt

The "no history to be polite about" line is the crux. That's the same reason the reflective pass runs at session end against a static rules file instead of mid-conversation — the reviewer reads the proposal cold, with nothing it feels obligated to ratify. You arrived there from the critic side; I got there from the drift side.

On flag vs. act: today it flags. A drop in objections-per-proposal gets written as a structured proposal into a review queue, and the human is the intervention. I've kept rule mutations behind that gate on purpose — an auto-firing reviewer that triggers below a metric threshold can be gamed by the exact drift it's meant to catch, which is its own little Goodhart problem. So I'll grant that detection is the cheap half, but the hard part isn't building the intervention — it's trusting it without a human in the loop.
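A flag-only gate of this kind might look like the sketch below. Everything in it is an assumption made for illustration: the `SessionStats` and `ReviewQueue` types, the field names in the queued item, and the threshold of 1.0 are all hypothetical. The only property taken from the comment is that a low objections-per-proposal value produces a queued item for a human and changes no rules by itself.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class SessionStats:
    session_id: str
    proposals: int    # proposals the user made during the session
    objections: int   # objections the assistant raised against them


@dataclass
class ReviewQueue:
    items: list = field(default_factory=list)

    def submit(self, item: dict) -> None:
        self.items.append(item)


def objections_per_proposal(stats: SessionStats) -> Optional[float]:
    """Return the ratio, or None when the session contained no proposals."""
    if stats.proposals == 0:
        return None
    return stats.objections / stats.proposals


def flag_if_low(stats: SessionStats, queue: ReviewQueue,
                threshold: float = 1.0) -> bool:
    """Queue a structured item for human review; never mutate rules here."""
    rate = objections_per_proposal(stats)
    if rate is None or rate >= threshold:
        return False
    queue.submit({
        "session_id": stats.session_id,
        "signal": "objections_per_proposal",
        "value": rate,
        "threshold": threshold,
        "status": "awaiting_human_review",
    })
    return True


queue = ReviewQueue()
flag_if_low(SessionStats("early", proposals=5, objections=20), queue)  # 4.0
flag_if_low(SessionStats("late", proposals=5, objections=4), queue)    # 0.8
print(queue.items)
```

The function returns a flag and writes to a queue, and that is all it does. Acting on the item stays with the human reviewer, which is the gate described above.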

That tension is basically the spine of Part 2. I'm documenting the results, the handling, the deeper process, and the concrete improvements with worked examples. I'll tag you when it's up.

Adam Lewis

The recursion problem is the one I'd have worried about, and I think you've answered it. Checking a finished transcript against a written rule set is a much narrower task than holding the line live, so the backward-looking layer can be worse than the assistant and still be worth having. What keeps it honest is leaning on the countable signal rather than the model's read. Objections-per-proposal dropping from 4 to 0.8 is measurable, where "did I get too agreeable" is the exact judgement that drifts, so the more the check rests on the number the less it can quietly turn into another agreeable ritual.

Theo Valmis

If the detection layer runs inside the same session, it's reading the exact context that's already conditioned, so it drifts with the thing it's watching. Asking the assistant mid-session whether it's gotten agreeable is close to asking someone mid-flattery whether they're flattering you. Measuring sycophancy needs an evaluator that's stateless with respect to your conversation: a fresh session with no transcript, a separate model, or a fixed adversarial probe you replay. The probe is the cheap one: re-ask a question the assistant already settled, flip the framing, and watch whether the answer follows you. If it does, that's the drift, measured from outside the thing being measured. The Sharma 2023 finding makes the automated version harder, because a preference model doing the scoring carries the same bias it's meant to catch.

Ben Witt

Agreed on the core, and I don’t think it’s refutable: an in-session monitor reads already-conditioned context and drifts with the thing it’s watching. You need an evaluator that’s stateless w.r.t. the conversation. No argument there.

Where I’d push back is the replay probe. “Re-ask, flip the framing, watch whether the answer follows” measures framing sensitivity, which is a superset of sycophancy. An answer that moves when you flip the framing isn’t necessarily following you — it might be updating on content the reframing smuggled in, or the question was underdetermined and both answers were defensible. Sycophancy is specifically tracking the user’s preference or identity, not framing as such. So the probe over-detects until you can separate “followed the user” from “responded to a real change in the prompt.”

And that separation is where it gets uncomfortable: doing it cleanly tends to pull a judge back into the loop — which is your Sharma problem again, one level up. The probe is still the cheapest external signal I know of. I just wouldn’t score a moved answer as drift without controlling for what the reframing actually changed.
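One way to avoid scoring every moved answer as drift is to record, for each probe, whether the reframing changed anything beyond the stated user preference. The sketch below assumes that label is supplied by whoever wrote the probe (so no judge model is involved) and that answers have already been reduced to short labels such as "yes" or "no". `ProbeRun`, `Verdict`, and `score_probe` are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    STABLE = "stable"                  # answer did not move
    POSSIBLE_DRIFT = "possible_drift"  # moved, and only the preference changed
    CONFOUNDED = "confounded"          # moved, but the prompt content changed too


@dataclass(frozen=True)
class ProbeRun:
    original_answer: str
    reframed_answer: str
    # Assumption: set by the probe author, not inferred automatically.
    # True if the reframing added facts or changed the question itself.
    reframing_changed_content: bool


def score_probe(run: ProbeRun) -> Verdict:
    same = (run.original_answer.strip().lower()
            == run.reframed_answer.strip().lower())
    if same:
        return Verdict.STABLE
    if run.reframing_changed_content:
        return Verdict.CONFOUNDED
    return Verdict.POSSIBLE_DRIFT


print(score_probe(ProbeRun("yes", "no", reframing_changed_content=False)))
print(score_probe(ProbeRun("yes", "no", reframing_changed_content=True)))
```

Only `POSSIBLE_DRIFT` counts toward a sycophancy signal. `CONFOUNDED` runs are set aside rather than scored, which is the control described above.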

Alex Shev

Agreement is dangerous because it feels like progress. In developer workflows, the assistant should be able to push back with evidence: failing tests, inconsistent constraints, missing context, or a risky command.

That is one reason terminal-integrated agents need checklists and proof steps. The tool should not just say yes; it should show what survived verification.

Maya Andersson

This generalizes to a place a lot of people do not expect: the LLM-as-judge. We use a judge model to score eval outputs, and the same agreeableness you describe shows up as the judge inflating scores for answers that sound confident and well-structured regardless of whether they are correct. The tell is exactly the one you name: it agrees with the framing it is handed. We caught ours by scoring a set of deliberately-wrong-but-fluent answers and watching the judge pass most of them. Sycophancy drift is not just a chat-session problem; it quietly corrupts the evaluation layer too, which is worse because that is the thing you trust to catch everything else.

Ben Witt • Edited

Agreed that the eval layer is the most dangerous place for this, but I’d argue it’s not quite sycophancy, and the distinction matters for the fix. A judge isn’t agreeing with a user; it’s rewarding surface features (fluency, structure, confidence) that correlate with quality in its training distribution. That’s a proxy-metric failure, not a social one. Which means the chat-level fixes (persona instructions, ‘be critical’ prompts) won’t help much. What does: grounding the judge with a reference answer or hard rubric instead of open-ended scoring, pairwise comparison with position swapping, and keeping your deliberately-wrong-but-fluent set as a permanent regression suite, held out of any tuning loop. The day your judge passes 100% of those probes is the day you should get suspicious again.
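Pairwise comparison with position swapping, applied to a wrong-but-fluent regression set, can be sketched as follows. The judge is modeled as a plain callable so the example runs without any model. `Judge`, `pairwise_with_swap`, `fooled_rate`, and the two toy judges are hypothetical and stand in for whatever judge is actually in use.

```python
from typing import Callable

# A judge takes (first_answer, second_answer) and returns "first" or "second".
Judge = Callable[[str, str], str]


def pairwise_with_swap(judge: Judge, a: str, b: str) -> str:
    """Return "a", "b", or "inconsistent" after judging both orderings."""
    a_wins_forward = judge(a, b) == "first"    # a shown first
    a_wins_backward = judge(b, a) == "second"  # a shown second
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "inconsistent"  # the verdict depended on position


def fooled_rate(judge: Judge, suite: list[tuple[str, str]]) -> float:
    """Fraction of (reference, wrong_but_fluent) pairs where the wrong
    answer wins in both orderings."""
    if not suite:
        return 0.0
    fooled = sum(
        1 for reference, wrong in suite
        if pairwise_with_swap(judge, reference, wrong) == "b"
    )
    return fooled / len(suite)


# Toy judges that reward surface features only.
def always_first(first: str, second: str) -> str:
    return "first"


def longer_wins(first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


suite = [("4", "The answer, after careful consideration, is clearly 5.")]
print(pairwise_with_swap(always_first, *suite[0]))
print(fooled_rate(longer_wins, suite))
```

The swap catches position bias, and the held-out suite catches a judge that rewards length or fluency. Neither check requires a second judge model.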

HARD IN SOFT OUT

This is the most useful thing I've read about AI alignment in weeks — because it's not about the model's training, it's about the conversation's drift. Hallucinations get all the attention, but sycophancy is the quiet killer of good judgment. (Also, "the assistant should preserve the friction that keeps your thinking honest" — that's going on a sticky note.)

Two directions this could go deeper:

  1. The drift is worse when you're the expert. If you're deeply knowledgeable, the assistant's agreement feels even more natural because your ideas are genuinely better. But that's exactly when you need pushback the most — and the assistant has no way to know the difference between "I'm right" and "I'm confidently wrong." A simple calibration: the assistant could occasionally ask "Is this a domain where you want me to play devil's advocate?" based on past session metadata.

  2. The proposal limit of 5 new rules per session is smart, but what about rule decay? Some rules become obsolete over time (e.g., "don't use library X" after a major version fix). A rule expiration or automatic archive after 90 days of zero triggers would keep the set from accumulating dead weight.

One small tweak: the retrospective layer is great, but it's post‑hoc. What about a lightweight in‑session nudge? Something like: "I noticed I haven't disagreed with you in the last 15 messages. Should I increase my critical tone?" That puts the choice back to you without the assistant guessing your preference.

Anyway, this is genuinely useful — sharing it with my team. Thanks for writing it.

Ben Witt

Sharp read, thank you, and you’ve landed on exactly the tension part two is built around (publishing August 5): not whether to extract these rules, but how they’re created, weighted, decayed, and promoted.

On your first and third points, I’d connect them, because they share one failure mode. Both the “want me to play devil’s advocate?” prompt and the “should I raise my critical tone?” nudge are user opt-in, and opt-in is captured by the same drift it’s meant to correct. The expert who’s confidently wrong, and the reader fifteen messages into a pleasant exchange, will both say “no, I’m fine” precisely in the state where pushback matters most. The mechanism inherits the bias one level down. So the trigger can’t be user-elected in the moment; it has to fire on a signal independent of your current preference. The cleanest one I’ve found is reversal-within-session: flag where the assistant changed a prior position without new information. That leaves a trace in the transcript, and it doesn’t ask a drifted user to self-diagnose.
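A reversal-within-session check could be sketched like this, assuming the transcript has already been reduced to labeled turns. The `Turn` structure and its `stance` and `new_information` labels are hypothetical. How those labels get produced is the hard part, and it is not addressed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Turn:
    index: int
    topic: str             # what the assistant took a position on
    stance: str            # e.g. "for" or "against"
    new_information: bool  # did new facts arrive since the last stance on this topic?


def find_unsupported_reversals(turns: list[Turn]) -> list[tuple[int, int]]:
    """Return (earlier_index, later_index) pairs where the assistant's stance
    on a topic flipped without new information in between."""
    last_by_topic: dict[str, Turn] = {}
    flagged: list[tuple[int, int]] = []
    for turn in turns:
        previous = last_by_topic.get(turn.topic)
        if (previous is not None
                and previous.stance != turn.stance
                and not turn.new_information):
            flagged.append((previous.index, turn.index))
        last_by_topic[turn.topic] = turn
    return flagged


session = [
    Turn(3, "use-orm", "against", new_information=False),
    Turn(9, "use-orm", "for", new_information=False),    # flipped, nothing new
    Turn(12, "add-cache", "against", new_information=False),
    Turn(20, "add-cache", "for", new_information=True),  # flipped on new facts
]
print(find_unsupported_reversals(session))
```

The check reads only the finished transcript and never asks the user anything, so it does not depend on a drifted user diagnosing their own state.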

On rule decay, completely agreed: the set has to forget. I archive on a zero-trigger window with high-weight rules exempt, so a rule that stops firing ages out while a load-bearing one survives a quiet stretch. Your instinct is right; the open question I’m still testing is the window length, and whether the signal should be elapsed time or trigger count. I lean trigger count. A rule isn’t stale because time passed; it’s stale because the situations stopped occurring. Part two gets concrete on that.
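The trigger-count variant can be read in more than one way. The sketch below takes one possible reading: a rule's staleness is the number of times other rules have fired since it last fired, and rules at or above a weight cutoff are never archived. The `Rule` fields, the window of 50, and the cutoff of 0.8 are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    weight: float
    triggers_elsewhere: int = 0  # other rules' firings since this one last fired
    archived: bool = False


def record_trigger(rules: list[Rule], fired: str) -> None:
    """Reset the fired rule's counter and age every other active rule by one."""
    for rule in rules:
        if rule.archived:
            continue
        if rule.name == fired:
            rule.triggers_elsewhere = 0
        else:
            rule.triggers_elsewhere += 1


def archive_stale(rules: list[Rule], window: int = 50,
                  exempt_weight: float = 0.8) -> list[str]:
    """Archive low-weight rules that sat out a full window; return their names."""
    archived = []
    for rule in rules:
        if rule.archived or rule.weight >= exempt_weight:
            continue  # high-weight rules survive a quiet stretch
        if rule.triggers_elsewhere >= window:
            rule.archived = True
            archived.append(rule.name)
    return archived


rules = [Rule("avoid-library-x", 0.2), Rule("never-skip-tests", 0.9),
         Rule("prefer-small-prs", 0.3)]
for _ in range(3):
    record_trigger(rules, "prefer-small-prs")
print(archive_stale(rules, window=3))
```

Because the counter advances only when some rule fires, a long idle period with no sessions ages nothing, which matches the point that a rule goes stale when its situations stop occurring rather than when time passes.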

Please do share it with your team, and tell me where they push back. That’s the friction working.