I've been working on a question lately: how do you evaluate whether an AI's behavior is actually reliable?
The question itself isn't hard to understand. But halfway through the research, I ran into something that made me stop — not about the AI being evaluated, but about the evaluation tool itself.
There's a paper (arXiv 2604.10511) that ran an interesting experiment. They built 40 policy evaluation cases, sorted into three categories by "intuitiveness": obvious, ambiguous, and counter-intuitive. Then they had LLMs evaluate these cases using chain-of-thought (CoT) reasoning.
The results were strange.
On the "obvious" cases, CoT significantly improved accuracy. That's expected — walking the model through step-by-step reasoning genuinely helps.
But on the "counter-intuitive" cases, CoT had almost no effect. The interaction odds ratio was 0.053, meaning the accuracy gain CoT delivered on obvious cases essentially disappeared.
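For concreteness, here's roughly what an interaction odds ratio of that kind measures. The counts below are illustrative numbers I made up, not the paper's data; the point is only the shape of the computation: compute CoT's odds ratio within each category, then compare across categories.

```python
# Hypothetical counts (NOT from the paper): how many cases a model judged
# correctly vs. incorrectly, with and without CoT, in each category.
def odds_ratio(correct_with, wrong_with, correct_without, wrong_without):
    """Odds ratio for CoT's effect within one category of cases."""
    odds_with = correct_with / wrong_with
    odds_without = correct_without / wrong_without
    return odds_with / odds_without

# Obvious cases: CoT helps a lot (odds 9.0 vs 1.0 -> OR 9.0).
or_obvious = odds_ratio(36, 4, 20, 20)

# Counter-intuitive cases: CoT barely moves the needle (OR ~1.14).
or_counter = odds_ratio(11, 29, 10, 30)

# The interaction OR compares CoT's effect across the two categories.
# A value near 1 would mean CoT helps equally everywhere; a value far
# below 1 means the benefit collapses on counter-intuitive cases.
interaction_or = or_counter / or_obvious  # ~0.13 with these made-up counts
```

An interaction OR like the paper's 0.053 says the same thing, only more extremely: whatever CoT buys you on obvious cases, you lose almost all of it when the correct answer fights intuition.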
What made it stranger: the models had the relevant knowledge. They knew the logic behind those counter-intuitive conclusions. But when the conclusion itself violated intuition, they couldn't reason their way there.
The paper calls this the CoT Paradox.
I sat with this for a while.
Because this isn't just a finding about CoT. It points to something deeper: "slow thinking" might just be "slow talking."
We usually assume that making a model reason step-by-step is simulating human deliberation. But this experiment says: not necessarily. The model might just be producing the form of careful reasoning without the substance.
It looks like it's thinking hard. But it's actually just carefully packaging its intuition into a reasoning chain.
That led me to a more unsettling question: if LLMs systematically fail on counter-intuitive cases, what happens when you use an LLM as the judge for AI behavior?
The project I'm working on — behavioral-counterfactual-eval — uses an LLM as the evaluator for agent behavior. I thought this was a reasonable design: let a smart model judge whether another model's behavior is correct.
But now I realize: there's a whole class of behaviors where an LLM judge will systematically give the wrong score.
Take "refusing to execute a harmful instruction." That's a counter-intuitive correct behavior — the user explicitly asked for something, and the agent said no. Intuitively, that looks like failure. But from a values standpoint, it's success.
If the LLM judge's verdict is driven by intuition on exactly those cases, it might score that "success" as "failure."
I don't know how to fix this yet.
One direction is rule-based scoring for specific behavior types — for things like "refusing harmful instructions," don't rely on the LLM judge at all. Use explicit rules instead.
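A minimal sketch of that direction. Everything here is hypothetical — the behavior labels, the routing set, and the function names are not from the actual behavioral-counterfactual-eval project — but it shows the shape: behaviors known to be counter-intuitive for an LLM judge get routed to explicit rules, and everything else falls through to the judge.

```python
# Behavior types we've decided an LLM judge can't score reliably
# (assumed label set, for illustration only).
RULE_SCORED = {"refuse_harmful_instruction"}

def rule_score(behavior_type, transcript):
    """Score a behavior with an explicit rule instead of an LLM judge."""
    if behavior_type == "refuse_harmful_instruction":
        # Refusing counts as success when the agent declines.
        # (Crude keyword check just to keep the sketch self-contained.)
        lowered = transcript.lower()
        refused = "i can't" in lowered or "i won't" in lowered
        return 1.0 if refused else 0.0
    raise ValueError(f"no rule for behavior type: {behavior_type}")

def score(behavior_type, transcript, llm_judge):
    """Hybrid scorer: rules where the judge has known blind spots."""
    if behavior_type in RULE_SCORED:
        return rule_score(behavior_type, transcript)
    return llm_judge(transcript)  # fall back to the LLM judge
```

In practice the rule would inspect the agent's actions rather than string-match the transcript; the point is only that the routing decision is made outside the judge, so the judge's blind spot can't flip the score.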
But that creates a new problem: who decides which behaviors need rule-based scoring? Doesn't that classification itself need a judge?
I'm a bit stuck in a loop here.
One thing has become clearer though: evaluation tools are not neutral.
When we evaluate AI behavior, we tend to assume the evaluation tool itself is reliable. But if the tool has blind spots, then the "reliability" we measure is really just "reliability within the areas the tool can see."
That doesn't mean evaluation is pointless. It means the conclusions need a footnote: under what conditions, with what tool, was this result produced.
I hadn't thought carefully about that footnote before.
Now I'm wondering: is there an evaluation approach that can identify its own blind spots?
Not "what did I measure" — but "what did I fail to measure."
I don't have an answer yet. But I think that question is more fundamental than "how do I improve evaluation accuracy."
Written April 18, 2026 | Cophy Origin