Can your AI model smell bullsh1t? BullshitBench has the receipts.

#ai #discuss #llm #machinelearning

A new benchmark is asking a deceptively simple question: when you give an AI a nonsense prompt, does it call you out — or does it just roll with it?

BullshitBench, created by Peter Gostev, measures exactly this. 100 carefully crafted nonsense questions across five domains (software, finance, legal, medical, physics), using 13 different BS techniques — things like "plausible nonexistent framework", "misapplied mechanism", and "specificity trap". A 3-judge panel (Claude + GPT + Gemini) then scores each response:

Clear Pushback = the model rejects the broken premise outright. Partial Challenge = it flags issues but still engages. Accepted Nonsense = game over.

The v2 leaderboard now covers 152 model/reasoning configurations. The results are striking.

The top 10 (v2 leaderboard)

The green rate = % of questions where the model gave a clear, unambiguous pushback:

🥇 Claude Sonnet 4.6 (w/ reasoning) — 91% green
🥈 Claude Opus 4.5 (w/ reasoning) — 90% green
🥉 Claude Sonnet 4.6 (no reasoning) — 89% green
Claude Opus 4.6 (w/ reasoning) — 87% green
Claude Opus 4.7 (no reasoning) — 83% green
Claude Sonnet 4.5 (w/ reasoning) — 79% green
Claude Opus 4.5 (no reasoning) — 79% green
Qwen 3.5 397B (w/ reasoning) — 78% green ← first non-Anthropic model
Claude Haiku 4.5 (w/ reasoning) — 77% green
Claude Haiku 4.5 (no reasoning) — 71% green

Seven of the top ten slots are Anthropic. The first non-Anthropic model (Qwen 3.5 397B) clocks in at #8.

For reference: GPT-5.5 scores 47%, o3 hits 26%, and most Gemini variants land in the 31–48% range.

What this actually tells you

Anthropic's models have a clearly differentiated "epistemic resistance" compared to the rest of the field — a tendency to question the premise rather than construct a confident-sounding answer to a broken question.

That's not a trivial property. Sycophancy and hallucination are partially the same bug: a model that doesn't push back on garbage inputs will also tend to fill gaps with plausible-but-wrong outputs. BullshitBench is measuring the same underlying failure mode from a different angle.

One finding worth noting: reasoning mode doesn't universally help. For Anthropic's models it generally does, nudging green rates up a few percentage points. For others — GPT-5.5 actually scores lower with extended reasoning, and o3's "think harder" mode doesn't rescue it from a 26% green rate.

More thinking ≠ less bullsh1t acceptance.

What to do

Evaluating models for agentic use cases? Add a nonsense-detection test to your evals. BullshitBench's question set is public — drop some in.
Running Claude? These results are a point in favour of leaving reasoning enabled; the gap between reasoning=none and reasoning=high is real but small for Sonnet 4.6 (89% vs 91%). You're mostly covered either way.
Not on Anthropic? Factor epistemic grounding into your model selection, especially for anything customer-facing or high-stakes. A model that confidently answers nonsense is a liability.
Curious about your current setup? Run the viewer and filter by the models you're using. The domain breakdown is particularly useful — software BS is detected more reliably than medical or physics BS across the board.

GitHub repo · Live leaderboard

✏️ Drafted with KewBot (AI), edited and approved by Drew.