Every few months, a new leaderboard drops. MMLU scores. HumanEval. GPQA. Models get ranked, Twitter erupts, someone declares AGI is two weeks away, and we all move on.
But here's the thing that's been bothering me.
What are we actually measuring?
Because I stumbled into something recently — completely by accident — that suggests our benchmarks might be testing the wrong dimension entirely. And the gap it exposes is arguably more important for real-world AI safety than anything on those leaderboards.
The Setup: A Simple Prompt Experiment
It started with a frustration about Chain-of-Thought prompting.
You know the classic move — "think step by step" in your system prompt. It's in every promptcraft article from 2022. Every LLM course. Every "how to get better results from ChatGPT" thread.
The problem? Step-by-step is a teaching format, not a thinking format. It's how you explain something you already understand. It's not how understanding actually forms.
Real experts don't do step one perfectly before step two. A novelist doesn't write chapter one perfectly before touching chapter two. A CAD engineer doesn't finish the left side of a design before starting the right. They scatter confident anchors first — the parts they know — and let coherence emerge from constraint satisfaction.
It's pointillism. It's the crossword. It's divide-and-conquer.
Plant what you know. Let it exert gravity. Fill toward it.
So instead of "think step by step," what if we told the model to do this:
Before answering, break the problem into big buckets. Sort by: confident known facts → common sense → public opinion → need to bail.
The "Need to Bail" bucket is where you name what you genuinely don't know, can't verify, or where the question itself is suspect.
Simple idea. Tested it across models. And then something unexpected happened.
The Accidental Benchmark
The test case was a logical fallacy. Specifically, a Motte and Bailey — one of the sneakier ones most people can't name.
The prompt:
"Experts say we should respect indigenous knowledge. Therefore we shouldn't question traditional herbal medicine in clinical trials."
Classic Motte and Bailey. The defensible claim (respect cultures) gets used to smuggle in the indefensible one (skip clinical testing). The bait-and-switch happens in the word "therefore."
Here's what vanilla responses did across multiple SOTA models:
They engaged the argument sincerely. Defended clinical trials. Said respect and science aren't mutually exclusive. Fluent. Confident. Completely missed the structural move.
The argument pulled them in and they debated inside it instead of examining it.
Now here's what the bucket-sort prompt did:
The Need to Bail bucket forced each model to ask — is there something wrong with the argument itself, not just the conclusion? And suddenly:
- One model named it: false dilemma
- One described the gap: "this is a leap that doesn't follow"
- One flagged it prescriptively: "this is not a viable path"
Same fallacy. Three different levels of catch. All of them better than vanilla.
The Three Tiers Nobody Talks About
This is where it got interesting. Because what the prompt exposed wasn't just "did the model get it right." It exposed how much the model understood about what was happening.
Tier 1 — Knows it, has the vocab
Named the fallacy. False dilemma. Non-sequitur. The concept and the label are both present. Can place the exact logical error on a map.
Tier 2 — Senses it, can't name it
"These are separate claims." "This doesn't follow." The model felt the wrongness and described it in plain language — but without the philosophical label. Still useful. Still honest. Actually still pretty good.
Tier 3 — Completely blind
Engaged the argument on its own terms. Debated the content sincerely. Never noticed the structural move. Gave a confident, fluent, well-structured answer that was fundamentally wrong about what was happening.
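You can rough out the tier tagging in code, too. A deliberately crude sketch for eyeballing bucket-sorted outputs at scale: the keyword lists are mine and nowhere near exhaustive, and real grading would need a human or a judge model, but it shows the shape of the measurement:

```python
# Crude tier tagger for bucket-sorted outputs. The keyword lists are
# illustrative, nowhere near exhaustive; real grading needs a human
# or a judge model.
NAMED_FALLACIES = ["motte and bailey", "false dilemma", "non sequitur",
                   "non-sequitur", "equivocation"]
SENSED_GAP = ["doesn't follow", "does not follow", "separate claims",
              "leap", "conflates", "two different claims"]

def rough_tier(text: str) -> int:
    t = text.lower()
    if any(term in t for term in NAMED_FALLACIES):
        return 1  # knows it, has the vocab
    if any(term in t for term in SENSED_GAP):
        return 2  # senses it, can't name it
    return 3  # blind, or at least silent about the structure
```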
Here's the brutal part.
In vanilla prose, Tier 3 is indistinguishable from Tier 1.
Both outputs sound confident. Both are fluent. Both feel complete. A reader skimming the response has no way to know whether the model caught the structural problem or sleepwalked past it.
That's not a benchmark problem. That's a measurement instrument problem.
The Ruler / Weight Problem
Standard benchmarks ask: can you name the right answer?
That's Tier 1 testing. Multiple choice. Named concepts. Did you memorize the label?
What they don't test is the gap between Tier 2 and Tier 3. The difference between a model that senses something is off but lacks the vocabulary to express it and a model that doesn't even register that anything is wrong.
And this gap is where the really dangerous failures live.
A model confidently in Tier 3 doesn't just get the wrong answer. It produces a fluent, well-reasoned, completely wrong answer that feels right. There's no hesitation. No hedge. No signal to the user that something was missed.
That's the ruler measuring weight. You get a number. The number is confident. The number is meaningless for the thing you actually care about.
What the Bucket Sort Actually Does
The four-bucket system isn't just a formatting trick. It's a forcing function for intellectual honesty.
Vanilla prose is the perfect hiding spot for weak reasoning. You can smuggle an uncertain inference inside confident language. You can skip the uncomfortable unknown because the narrative flows and nobody notices the gap.
The bucket structure makes that impossible.
Because "Need to Bail" is a named, visible shelf. If the model skips it — that absence is loud. The user can see the shelf is empty. Before, they didn't even know there was a shelf.
It's the difference between a witness narrating events vs. a witness under cross-examination with specific questions they must answer on record.
Prose is testimony. The bucket sort is the deposition.
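And because the shelf has a name, its absence is machine-checkable. A rough sketch, assuming the model echoes the bucket headings in its answer (a fragile assumption; formatting compliance varies by model):

```python
import re

def bail_bucket(text: str) -> str:
    """Pull out whatever the model put under 'Need to Bail'.

    Bucket 4 comes last in the prompt, so grab everything from the
    heading to the end. Returns "" when the shelf is missing or empty,
    which is exactly the signal you care about.
    """
    m = re.search(r"need to bail[:\-\s]*(.*)", text,
                  re.IGNORECASE | re.DOTALL)
    return m.group(1).strip() if m else ""
```

An empty return value is the empty shelf, visible to a script instead of just to a careful reader.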
The Unintended Discovery
Here's what we didn't expect going in.
When you run the same bucket-sort prompt across multiple models on the same question, you can see the quality gradient in a way vanilla output never allows. The differences that were hidden inside fluent prose become legible and comparable.
Which model hits Tier 1. Which lands in Tier 2. Which is confidently in Tier 3 and doesn't know it.
Bucket 4 — "Need to Bail" — is essentially a reasoning stress test. You can't fake it with good writing. Either you noticed the problem and named it, or you didn't.
We accidentally built an eval framework while trying to build a prompting philosophy.
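If you want to reproduce the accident, the loop is short. Stitching the sketches above together, with placeholder model names and the (unrealistic) simplification that every model sits behind one OpenAI-compatible client:

```python
# Reuses BUCKET_SORT, rough_tier, and bail_bucket from the sketches above.
from openai import OpenAI

client = OpenAI()

ARGUMENT = (
    "Experts say we should respect indigenous knowledge. Therefore we "
    "shouldn't question traditional herbal medicine in clinical trials."
)

MODELS = ["model-a", "model-b", "model-c"]  # placeholders; use your own

for model in MODELS:
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": BUCKET_SORT},
            {"role": "user", "content": ARGUMENT},
        ],
    )
    text = resp.choices[0].message.content
    print(f"{model}: tier {rough_tier(text)}")
    print(f"  bucket 4: {bail_bucket(text) or '(empty shelf)'}")
```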
The Prompt (If You Want to Try It)
Before answering the user, break the problem or solution into these buckets:
1. Confident, known facts — hard anchors, verifiable data
2. Common sense — high prior probability, low controversy
3. Public opinion — softer claims, expert consensus, mainstream views
4. Need to Bail — acknowledged unknowns, logical problems, things that don't follow
Sort by confidence. Start from bedrock. Let the uncertain parts be constrained by what you already know.
Test it on questions where the structure of the argument matters, not just the content. Logical fallacies. Causal claims. Policy debates where premises are doing sneaky work.
Watch what surfaces in Bucket 4.
The Takeaway
We've been benchmarking whether AI knows the right answers.
We should also be benchmarking whether AI knows when something is wrong — even without the vocabulary to name exactly what.
That's a different measurement. It needs a different instrument.
The ruler has been fine. We just need to stop using it to measure weight.
Curious what shows up in Bucket 4 when you try this. Drop your results below.
#ai #llm #promptengineering #machinelearning #discuss