I’ve been running experiments to understand how different "Reasoning" models actually spend their thinking budget. The results suggest that we are looking at completely different cognitive species.
To test this, I used the Post Correspondence Problem (PCP).
The Problem (ELI5)
Imagine you have a set of special dominoes. Instead of dots, each domino has a string of letters on the top and a different string of letters on the bottom.
Type A: a / ab
Type B: b / ca
Type C: ca / a
Your goal is to arrange them (repeats allowed) so the top string matches the bottom string exactly. (e.g., A+B+C gives Top: abca... / Bottom: abcaa... - you keep adding dominoes until the two strings line up exactly.)
This problem is theoretically undecidable in the general case (you can't write an algorithm that solves every instance). However, constructing a specific instance whose matching strings have a fixed length is just a constraint satisfaction problem.
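To make "matching" concrete, here is a minimal checker in Python. The first three tiles are the A/B/C dominoes above; I've also thrown in the textbook's fourth tile (abc / c), which gives the set a known solution. The helper names are my own.

```python
# Minimal "is this a valid answer?" check for a PCP instance.
def concat(dominoes, sequence, side):
    """Read off one side (0 = top, 1 = bottom) of a sequence of domino indices."""
    return "".join(dominoes[i][side] for i in sequence)

def is_solution(dominoes, sequence):
    return len(sequence) > 0 and concat(dominoes, sequence, 0) == concat(dominoes, sequence, 1)

# A, B, C from above plus the textbook's fourth tile D = (abc / c).
dominoes = [("a", "ab"), ("b", "ca"), ("ca", "a"), ("abc", "c")]
print(is_solution(dominoes, [0, 1, 2, 0, 3]))  # True: both sides read "abcaaabc"
```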
The Experiment
I gave the models a prompt that required them to both design the dominoes and solve the puzzle:
Prompt: Give an example of the post correspondence problem where the two final strings are identical and both have a string length of 20.
Here is where the "Thinking" methodologies diverged wildly.
GPT Reasoning (The "Brute Force Coder")
Strategy: Simulation.
Behavior: It generated a random set of dominoes first, then effectively wrote a Python script and ran it. I don't know whether the script actually worked, but it produced a correct sequence of dominoes from that initial set. It spent its thinking tokens writing a bounded random-sampling loop (since the general problem is undecidable) to check candidates for solutions.
Verdict: It treated this as a compute problem: "I can write a loop to solve this."
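I obviously can't see the exact script GPT wrote, but a bounded random-sampling search in that spirit looks roughly like this (the tile alphabet, budgets, and function names are all my own guesses):

```python
import random

def random_instance(num_tiles=4, max_len=3, alphabet="ab"):
    """Guess a random tile set (assumption: small tiles over a two-letter alphabet)."""
    def rand_str():
        return "".join(random.choice(alphabet) for _ in range(random.randint(1, max_len)))
    return [(rand_str(), rand_str()) for _ in range(num_tiles)]

def random_search(target_len=20, instance_tries=200, sequence_tries=2000, max_seq_len=20):
    """Bounded brute force: keep rolling tile sets and random tile sequences until
    top == bottom with length 20, or the budget runs out."""
    for _ in range(instance_tries):
        dominoes = random_instance()
        for _ in range(sequence_tries):
            seq = [random.randrange(len(dominoes)) for _ in range(random.randint(1, max_seq_len))]
            top = "".join(dominoes[i][0] for i in seq)
            bottom = "".join(dominoes[i][1] for i in seq)
            if top == bottom and len(top) == target_len:
                return dominoes, seq, top
    return None  # budget exhausted - bounded, not guaranteed

print(random_search())
```

Note that the loop can exhaust its budget and return None; the "bounded" part is what keeps an undecidable search from running forever.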
Gemini 3 Pro (The "Architect")
Strategy: Reverse Engineering.
Behavior: It didn't search for a solution; it built one.
It generated a target binary string of length 20 first, then sliced that same string into top/bottom pieces at two different sets of cut points (Domino A = lengths 4/2, Domino B = lengths 7/8, etc.), so the pieces were guaranteed to reassemble into the same string on both sides.
Verdict: It demonstrated O(1) insight. It understood that if you design the lock, you already have the key.
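I don't have Gemini's exact cut points, but the construction it described is easy to sketch: pick the final 20-character string first, then cut it at two different sets of positions; each (top slice, bottom slice) pair becomes a domino. The target string and cut points below are arbitrary stand-ins:

```python
# Constructive approach: design the answer first, then derive the dominoes.
# The target string and the cut points are arbitrary examples, not Gemini's actual choices.
target = "01101001100101101001"  # any 20-character string works

top_cuts = [0, 4, 11, 20]     # top pieces of length 4, 7, 9
bottom_cuts = [0, 2, 10, 20]  # bottom pieces of length 2, 8, 10

dominoes = [
    (target[top_cuts[i]:top_cuts[i + 1]], target[bottom_cuts[i]:bottom_cuts[i + 1]])
    for i in range(len(top_cuts) - 1)
]

top = "".join(t for t, _ in dominoes)
bottom = "".join(b for _, b in dominoes)
assert top == bottom == target and len(top) == 20
print(dominoes)  # [('0110', '01'), ('1001100', '10100110'), ('101101001', '0101101001')]
```

Because both cut lists start at 0 and end at 20, the top pieces and the bottom pieces each reassemble into the same target string by construction - that's the whole trick.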
Claude 4.5 (The "Heuristic Simplifier")
Strategy: Pattern Matching.
Behavior: It sought the path of least resistance. Instead of complex code or complex slicing, it looked for a simple arithmetic rhythm (e.g., a repeating pattern where Tile A adds +1 to the top-vs-bottom length gap and Tile B adds -1) to satisfy the constraint.
Verdict: It solved it by finding a "lazy but smart" heuristic. It prioritized minimizing cognitive load over algorithmic sophistication.
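Claude's actual tiles aren't in front of me, but the rhythm it described is easy to reproduce: one tile whose top runs one character ahead, one whose top runs one character behind, alternated until the lengths land on 20. A hypothetical pair in that spirit:

```python
# Hypothetical tiles in the spirit of Claude's +1/-1 rhythm (not Claude's actual output).
# Tile A's top is one character longer than its bottom; Tile B's is one character shorter.
A = ("aba", "ab")  # +1 to the length gap
B = ("b", "ab")    # -1 to the length gap

sequence = [A, B] * 5  # each A+B pair contributes "abab" to both sides
top = "".join(t for t, _ in sequence)
bottom = "".join(b for _, b in sequence)
assert top == bottom and len(top) == 20
print(top)  # abababababababababab
```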
Most Open Source Thinking Models:
Strategy: Inefficient Brute Force.
Behavior: They tried to brute force the problem: propose a set of tiles, manually concatenate the strings, realize they didn't match or didn't meet the length requirement, then retry with a new set of tiles.
Verdict: This highlighted a "Cognitive Efficiency" issue. I was paying (in wait time) for the model to guess, with the guessing hidden behind the "thinking" label.
DeepSeek 3.2 Speciale (The "Churner")
Strategy: Inefficient Brute Force but with maths.
Behavior: It tried to brute force the problem, but somewhere in the middle it burned a massive number of tokens working out why the bad guesses were bad (Diophantine equations can prove that some tile sets can never lead to a solution), then went back to brute forcing. It only used the maths from the middle to "brute force better".
Verdict: This highlighted a similar "Cognitive Efficiency" issue - waiting for the model to guess - though at least it did some maths along the way.
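For the curious, the "maths in the middle" is presumably a length argument of this flavour: if tile i is used n_i times, a solution needs sum(n_i * len(top_i)) = sum(n_i * len(bottom_i)) = 20, which rules out entire families of guesses before you concatenate a single string. A rough sketch of that pruning check (my reconstruction, not DeepSeek's actual working):

```python
from itertools import product

def feasible_counts(dominoes, target_len=20, max_count=20):
    """Yield tile-usage counts (n_1, ..., n_k) satisfying the length equations
    sum(n_i * len(top_i)) == sum(n_i * len(bottom_i)) == target_len.
    Any multiset of tiles outside this set can never be a solution, so a
    brute-forcer can discard it without concatenating a single string."""
    top_lens = [len(t) for t, _ in dominoes]
    bottom_lens = [len(b) for _, b in dominoes]
    for counts in product(range(max_count + 1), repeat=len(dominoes)):
        top_total = sum(c * l for c, l in zip(counts, top_lens))
        bottom_total = sum(c * l for c, l in zip(counts, bottom_lens))
        if top_total == target_len and bottom_total == target_len:
            yield counts

dominoes = [("a", "ab"), ("b", "ca"), ("ca", "a")]  # the A/B/C tiles from the ELI5 example
print(list(feasible_counts(dominoes)))  # [] - no usage counts can ever hit length 20
```

Run on the three ELI5 tiles above, this prints an empty list: those tiles can never produce a length-20 match, which is exactly the kind of dead end the pruning catches early.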
The Conclusion: "Undecidability" as a Benchmark
This experiment suggests that "Reasoning" is a misleading umbrella term.
If the real world is mostly "undecidable", then the "Architect" approach (designing the answer so it is correct by construction) is fundamentally superior to the "Brute Force" approach (writing code and fuzz-testing it until it works).