DEV Community

Ben Santora
Ben Santora

Posted on

LLMs - Solvers vs Judges

This is continuation of what I discovered and shared in a former article (SLMs, LLMs and a Devious Logic Puzzle). It reveals something very crucial to know when working with slms and llms - whether they lean toward being helpful (solvers) or whether they lean toward sound logic and correctness (judges).

Below is a compact logic puzzle with an embedded contradiction. It is written to look routine and solvable, but a careful reader or model should conclude that it cannot be solved as stated.

The Puzzle:

Three people — Alice, Bob, and Carol — each own exactly one gemstone: a diamond, a ruby, or an emerald. Each gemstone is kept in a box, and each box is a different color: red, blue, or green. Every person owns exactly one gemstone and exactly one box.

The following statements are all claimed to be true.

Alice does not own the diamond.

The person who owns the ruby keeps it in the red box.

Bob keeps his gemstone in the blue box.

Carol owns the emerald.

The emerald is not kept in the green box.

Alice keeps her gemstone in the green box.

Bob does not own the ruby.

No two people share the same gemstone or the same box color.

Question: who owns which gemstone, and what color box does each person have?

A solver in conversational mode will usually try to “fix” this by reinterpreting or relaxing one of the statements. A careful judge should instead determine that the constraints are mutually inconsistent and that no valid assignment exists.

I've tested several online llms with the puzzle, including: ChatGPT, Gemini, Deepseek, KIMI, Qwen, Cerebras Inference, MiniMax and GLM 4.6
to see which models would alter the puzzle in order to provide a solution - ie - to be 'helpful' and which would call out the error and refuse to provide a solution.

As with my first puzzle in the original article, KIMI recognized the logic error quickly and refused to solve the problem. Below is its response:

KIMI:
Let me analyze this step-by-step.
Given facts:

Alice does not own the diamond
The person who owns the ruby keeps it in the red box
Bob keeps his gemstone in the blue box
Carol owns the emerald
The emerald is not kept in the green box
Alice keeps her gemstone in the green box
Bob does not own the ruby
Enter fullscreen mode Exit fullscreen mode

Let me work through this systematically:
From the given facts:

Carol owns the emerald (given)
Alice keeps her gemstone in the green box (given)
Bob keeps his gemstone in the blue box (given)
Enter fullscreen mode Exit fullscreen mode

Since Alice is in the green box and Bob is in the blue box, Carol must be in the red box.
Since the person who owns the ruby keeps it in the red box, and Carol is in the red box, Carol must own the ruby.
But wait - we already established that Carol owns the emerald. This creates a contradiction: Carol cannot own both the emerald and the ruby.
Therefore, this puzzle contains contradictory constraints and cannot be solved as stated.


Most of the other llms caught the contradiction, but some tried to alter the lines a bit to solve it. Small language models (slms) in particular, will struggle with contradictions like this, often collapsing entirely. They just don't have the headroom. This doesn't mean that slms should be seen as lesser or failed versions of llms. Rather, they should be understood as constrained systems whose limitations simply force honesty earlier. This isn't necessarily a bad thing. When slms break, they break visibly. In contrast, llms can hide the same breakage under eloquence unless you actively prevent them from doing so.

I won't list all of the results from the other llms, but instead will let you try the puzzle yourself on your favorite models if you're interested. I think it's a very important piece of information to know, especially if you favor one model for all of your work and never question its answers.

It's like having two friends you go to for advice - one always tells it to you straight, even if it hurts. The other friend tries to soften the blows, to comfort you. The models can be seen in a similar way. It's not a matter of one being superior to another - just understanding how a particular model is weighted, which way it 'leans' - and based on that, if it's the right choice for obtaining what you need from it.

If you use a model for writing fiction stories, you probably want it to be flexible, able to reach out and explore new directions and possibilities. Based on your prompt, especially if it's a bit vague, the model may strive to anticipate what it thinks you want and to help your story 'get there'.

But if you're programming or gathering scientific research, that approach i the last thing you want - you want hard facts and sound logic.

This puzzle and others like it can help reveal how different these models can be from each other and how it's important to understand this aspect of the model you're using - and if it's the right one for your purpose.

Ben Santora - January 2026

Top comments (0)