Aryan Iyappan

96.2% vs 6.9% — I Watched 5 Frontier LLMs Fail at Sudoku While an Energy-Based Model Solved It in 0.24s


Yann LeCun chairs the technical board. The live demo makes his case better than any paper ever could — and the architecture implications go far deeper than puzzles.


Last week I spent an hour on a website that made me question everything I thought I knew about AI reasoning. It wasn't a research paper. It wasn't a benchmark leaderboard. It was a Sudoku solver.

I loaded a hard puzzle and clicked "Compare All Models." Six AI systems raced to solve it simultaneously.

KONA EBM vs LLMs comparison — EBM solved correctly in 0.24s while LLMs are still running or producing errors

| Model | Result | Time | Cost |
|---|---|---|---|
| KONA 1.0 (EBM) | ✅ Correct: valid grid | 0.24s | Free |
| LLM 1 | ⏳ Timed out: never finished | 20s+ | |
| LLM 2 | ⏳ Timed out: lost in reasoning | 50s+ | |
| LLM 3 | ⏳ Timed out: reasoning diverged | 50s+ | |
| LLM 4 (Gemini 3 Pro) | ❌ API Error: returned 404 | | |
| LLM 5 | ❌ Wrong: 14 duplicate digits | 2.93s | $0.0005 |

The LLMs didn't just lose. They failed in ways that reveal structural limitations in how every frontier model today approaches reasoning.

LLM 2 and LLM 3 stuck in infinite chain-of-thought. LLM 4 errored out. LLM 5 wrong with 14 duplicates

LLMs 2 and 3 spent their entire runtime producing verbose chain-of-thought. "Let me analyze this row by row, column by column. Column 0 has 6, 9, 2, 5 filled, leaving 1, 3, 4, 7, 8. Box 3 in the middle-left has..." They generated hundreds of tokens of reasoning that looked correct at every step — identifying the right constraints, analyzing the right cells, considering the right candidates. And they never converged. The reasoning drifted. The constraints multiplied. There was no mechanism to notice that the chain of thought had become incoherent.

LLM 5 finished. It output a complete 9×9 grid with all cells filled and numbers in plausible positions. It had 14 duplicate digits across rows, columns, and boxes. The model produced output that was locally coherent — each token followed reasonably from the previous token — but globally broken.
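
To make "globally broken" concrete, here is a minimal sketch of the kind of check that exposes such a grid. It walks the 27 units (9 rows, 9 columns, 9 boxes) and counts surplus digits. The names `units` and `count_duplicates` are my own, and the counting rule (each extra copy of a digit inside a unit counts once) is an assumption; I don't know exactly how the demo arrives at its figure of 14.

```python
from collections import Counter

def units():
    """Yield the 27 Sudoku units (9 rows, 9 columns, 9 boxes) as lists of (row, col)."""
    for i in range(9):
        yield [(i, c) for c in range(9)]   # row i
        yield [(r, i) for r in range(9)]   # column i
    for br in range(0, 9, 3):
        for bc in range(0, 9, 3):
            yield [(br + r, bc + c) for r in range(3) for c in range(3)]

def count_duplicates(grid):
    """Count surplus digits: every extra copy of a digit within a unit adds one."""
    total = 0
    for unit in units():
        counts = Counter(grid[r][c] for r, c in unit if grid[r][c] != 0)
        total += sum(n - 1 for n in counts.values())
    return total

# A known-valid grid built from a standard shifting pattern.
solved = [[(3 * (r % 3) + r // 3 + c) % 9 + 1 for c in range(9)] for r in range(9)]
print(count_duplicates(solved))   # 0

# Overwrite a single cell with its neighbour's digit.
broken = [row[:] for row in solved]
broken[0][0] = broken[0][1]
print(count_duplicates(broken))   # 3: one clash each in the row, the column, and the box
```

One wrong cell already violates three units at once, which is why a grid can look plausible cell by cell and still rack up a double-digit duplicate count.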

The demo is at sudoku.logicalintelligence.com. Test it yourself before reading further — the experience of watching frontier models fail in real time while an unknown architecture succeeds changes how you think about AI reasoning.


The Architecture Problem Nobody Is Talking About

The company behind this is Logical Intelligence. Their product: Energy-Based Reasoning Models. Their flagship model: KONA. Their orchestrator: Aleph. And the name on their Technical Research Board is Yann LeCun — Founding Chair.

Logical Intelligence homepage

The team is serious. Michael Freedman — Fields Medalist, one of the most decorated mathematicians alive — is Chief Science Officer. Eve Bodnia is Founder and CEO. Boris Hanin (Princeton ORFE) is a technical advisor. This is not a side project.

LeCun has argued for years that autoregressive LLMs cannot reach human-level reasoning. Most of the industry tuned him out because LLMs kept getting better at benchmarks. But this demo exposes exactly the limitations he's been warning about, and it's live, public, and testable by anyone with a browser.

To understand why the Sudoku demo is so revealing, you need to understand the structural difference between how LLMs and Energy-Based Models approach reasoning.

How LLMs Reason — And Why They Hit a Wall

Every frontier model today — GPT-5.2, Claude 4, Gemini 3, DeepSeek V3.2 — shares the same fundamental architecture: autoregressive generation. They produce tokens one at a time, left to right. Each token is a hard commitment. There's no native mechanism to go back and revise an earlier token when a later constraint invalidates it.

When an LLM "reasons," it produces a chain of thought — a sequence of tokens walking through the problem step by step. This works well for language because natural language is forgiving. Small inconsistencies don't break meaning. You can say something slightly contradictory and still be understood.

But reasoning has three structural problems when you try to do it autoregressively:

1. No undo. If step 7 contradicts step 2, you can't fix step 2. You'd need to regenerate every token from step 2 onward. In practice, the model doesn't even detect the contradiction — it keeps generating tokens that drift further from global consistency.

2. Locally scored. LLMs are trained to predict the next token given the previous ones. This optimizes for local coherence, not global correctness. A reasoning chain can be perfectly coherent at every step and still reach a fundamentally wrong conclusion. The training signal has no concept of "the final answer must satisfy all 27 constraints simultaneously."

3. Discrete token space. Reasoning traces are sequences of discrete tokens. You can't make small, gradient-based edits to improve consistency. Improvement requires discrete search, reranking, or resampling — all of which are expensive, lossy, and unreliable for long chains.

Here's exactly how an LLM approaches a constraint satisfaction problem:

Carbon: LLM autoregressive approach — token by token, commit without undo, no global consistency check
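
As a toy analogy only (this is not how any LLM works internally), the sketch below fills a grid in reading order and commits to the first digit that doesn't clash with what is already on the board. Every individual step is locally valid, and there is no way to revise an earlier choice. `fits` and `greedy_fill` are hypothetical names of mine.

```python
def fits(grid, r, c, d):
    """True if digit d does not clash with filled cells in row r, column c, or their box."""
    if d in grid[r]:
        return False
    if any(grid[i][c] == d for i in range(9)):
        return False
    br, bc = 3 * (r // 3), 3 * (c // 3)
    return all(grid[br + i][bc + j] != d for i in range(3) for j in range(3))

def greedy_fill(puzzle):
    """Fill empty cells (0) left to right, top to bottom, never revising a choice.

    Returns (grid, stuck), where stuck is the first cell with no legal digit,
    or None if every cell was filled.
    """
    grid = [row[:] for row in puzzle]
    for r in range(9):
        for c in range(9):
            if grid[r][c] != 0:
                continue
            for d in range(1, 10):
                if fits(grid, r, c, d):
                    grid[r][c] = d     # hard commitment: this is never undone
                    break
            else:
                return grid, (r, c)    # dead end caused by earlier commitments
    return grid, None

empty = [[0] * 9 for _ in range(9)]
grid, stuck = greedy_fill(empty)
print(grid[0])      # [1, 2, 3, 4, 5, 6, 7, 8, 9]
print(grid[1][:6])  # [4, 5, 6, 1, 2, 3]
print(stuck)        # (1, 6): the row already holds 1-6 and the box already holds 7-9
```

An empty grid has plenty of valid completions, yet the commit-and-continue strategy paints itself into a corner on the second row. Fixing it would mean going back and changing cells that were each fine when they were written.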

The Sudoku demo demonstrates all three failures simultaneously. The LLMs produced reasoning that looked correct at each step. They identified the right constraints. They analyzed the right cells. They considered the right candidates. But they couldn't maintain global consistency across 81 cells and 27 constraints. The chain of thought drifted. Errors accumulated silently. There was no mechanism to notice, locate, or correct them.

This is not a "bad prompt" problem. It's not a "needs more training" problem. It's an architecture problem. The model was never designed to verify global consistency because its fundamental operation — predict the next token — has no concept of "the whole."

How Energy-Based Models Reason — And Why It's Different

EBMs work on a fundamentally different principle. Instead of generating tokens left to right, they learn an energy function — a scalar score that evaluates how consistent any state is with all constraints simultaneously.

Low energy = valid. High energy = something is broken.

The crucial property that makes this architecture categorically different: the energy function works on partial solutions. You can evaluate a half-completed grid and get actionable feedback on which specific constraints are being violated, and where the violation was introduced. This turns reasoning from a sampling problem (generate tokens, hope they're correct) into an optimization problem (minimize energy by satisfying constraints).

Carbon: Energy-Based Model approach — evaluate globally, gradient descent toward consistency, guaranteed valid output
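
For contrast, here is a minimal sketch of reasoning as energy minimization. To be clear about the assumptions: this is a hand-written energy and a discrete greedy descent over row swaps, standing in for the idea. It is not KONA's learned energy function or its continuous latent-space refinement, and unlike a trained model it can stall in a local minimum. `energy` and `descend` are my own names.

```python
import random

def energy(grid):
    """Surplus digits over all columns and 3x3 boxes of a fully filled grid.

    Rows are kept duplicate-free by construction in descend(), so they add nothing.
    """
    total = 0
    for i in range(9):
        col = [grid[r][i] for r in range(9)]
        br, bc = 3 * (i // 3), 3 * (i % 3)
        box = [grid[br + r][bc + c] for r in range(3) for c in range(3)]
        total += (9 - len(set(col))) + (9 - len(set(box)))
    return total

def descend(puzzle, steps=20000, seed=0):
    """Greedy descent: keep a swap of two non-given cells in a row only if the
    global energy does not go up. Assumes the givens have no duplicate in any row."""
    rng = random.Random(seed)
    fixed = [[v != 0 for v in row] for row in puzzle]
    grid = []
    for row in puzzle:
        missing = [d for d in range(1, 10) if d not in row]
        rng.shuffle(missing)
        fill = iter(missing)
        grid.append([v if v else next(fill) for v in row])
    e = energy(grid)
    for _ in range(steps):
        if e == 0:
            break                      # every constraint is satisfied
        r = rng.randrange(9)
        free = [c for c in range(9) if not fixed[r][c]]
        if len(free) < 2:
            continue
        a, b = rng.sample(free, 2)
        grid[r][a], grid[r][b] = grid[r][b], grid[r][a]
        new_e = energy(grid)
        if new_e <= e:
            e = new_e                  # accept: the whole grid is no worse
        else:
            grid[r][a], grid[r][b] = grid[r][b], grid[r][a]  # revise: undo the swap
    return grid, e
```

The point of the design is that every move is judged against the whole grid at once, and any non-given cell can be revised at any time. A final energy of 0 means all 27 constraints hold; a nonzero energy tells you the result is not yet a solution instead of letting a broken grid pass as finished.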

Logical Intelligence's three core theses — stated directly in their technical blog — are worth reading in full:

Thesis 1: LLMs are fundamentally limited as reasoning models due to their reliance on discrete tokens generated left-to-right. This is a serious impediment for scaling up AI reasoning.

Thesis 2: Energy-Based Reasoning Models overcome the main difficulties inherent in using LLM-based reasoning models. The key advantage: EBRMs provide a score you can apply to intermediate states that tells you whether you're staying consistent with global constraints and helps you pinpoint what is broken so you can repair it.

Thesis 3: Scaling AI reasoning requires using EBRMs for reasoning and LLMs for coordination, especially when translating to and from natural language instruction.

KONA, their flagship model, embodies all three theses. It is non-autoregressive at the trace level — it generates complete reasoning traces simultaneously and conditions directly on the problem constraints. It reasons in a continuous latent space using dense vector tokens rather than discrete ones, which enables gradient-based refinement. It is globally scored — the energy function evaluates end-to-end trace quality directly, so long-horizon coherence is trained and optimized natively.

The performance gap isn't subtle. On their benchmark of hard Sudoku puzzles:

| Model | Accuracy | Avg Latency |
|---|---|---|
| KONA 1.0 (EBM) | 96.2% | 313 ms |
| GPT-5.2 | 6.9% | varies |
| Claude models | "close but still wrong" | "think for a long time" |
| DeepSeek V3.2 | "fast but many duplicates" | "finishes quickly" |

From their published benchmark results: "Claude models often think for a long time and then produce something close but still wrong. GPT-5.2 does best among the LLMs at 6.9%, which is still a 93% failure rate on a puzzle that any patient human can solve. DeepSeek V3.2 tends to finish quickly but with many duplicates across rows, columns, and boxes — failing to solve the puzzle."


The Bigger Picture: Aleph and PutnamBench

This isn't just about Sudoku. Logical Intelligence's orchestrator Aleph coordinates KONA, LLMs, and other tools in a compound system. Aleph recently achieved a near-perfect score on PutnamBench — a formal reasoning benchmark based on the William Lowell Putnam Mathematical Competition, widely considered the most prestigious university-level mathematics competition in North America.

These aren't grid puzzles. These are problems that most math PhDs cannot solve — problems where correctness is formally verifiable, meaning you cannot fake it with plausible-sounding output. The reasoning must be logically sound or it doesn't count at all.

Aleph has also topped other formal reasoning benchmarks, including MiniF2F and ProofNet. And in a separate result, Aleph formalized and disproved an open problem in planar unit distance graph theory — a problem in the Erdős unit distance problem family that had resisted resolution for years. The team published the formal Lean 4 proof alongside the result.

This is not an LLM getting lucky on a benchmark. This is an architecture designed from first principles to produce verifiably correct reasoning, and on these benchmarks it is doing so on problems that defeat most human competitors.


LeCun's Bet — And Why It Matters

LeCun's involvement isn't ceremonial. He has been the most prominent advocate for energy-based models in AI for over a decade. His position has been remarkably consistent across that entire period: autoregressive generation is a dead end for reasoning. The path to human-level AI requires architectures that can reason about constraints globally, not just generate sequences that are locally plausible.

He laid out this argument in detail in his 2022 position paper "A Path Towards Autonomous Machine Intelligence," where he proposed a cognitive architecture built around a "world model" — essentially an energy-based model that learns to predict and evaluate states of the world — coordinated with other modules for perception, memory, and action. The Logical Intelligence architecture (EBRMs for reasoning, LLMs for language, Aleph for orchestration) is the closest practical implementation of that vision that has been publicly demonstrated.

Most of the AI industry bet against this view. The reasoning was simple and, at the time, defensible: LLMs keep improving on benchmarks, so clearly the architecture is fine. Just scale it up. More parameters, more data, more GPUs. Hundreds of billions of dollars have been committed to this scaling hypothesis.

The Sudoku demo — and PutnamBench, and the formal verification results — suggest a different answer. Scaling doesn't fix the architecture. It produces longer, more articulate chains of wrong reasoning. The failure mode is not insufficient capability; it's a fundamental mismatch between the architecture and the task.

The vision isn't "EBMs replace LLMs." It's:

┌─────────────────────────────────────┐
│        HUMAN INTERACTION LAYER       │
│    (LLMs — language understanding,   │
│     generation, coordination)        │
├─────────────────────────────────────┤
│       ORCHESTRATION LAYER            │
│    (Aleph — planning, tool use,      │
│     result verification)             │
├─────────────────────────────────────┤
│        REASONING CORE                │
│    (EBRMs — constraint satisfaction, │
│     formal verification, planning)   │
└─────────────────────────────────────┘

LLMs generate candidates and handle human interaction — what they're architecturally suited for. EBRMs evaluate those candidates against formal constraints and guarantee correctness — what they're architecturally suited for. Orchestration coordinates between them, calling the right tool for the right sub-problem. Each component does what it was designed to do, rather than forcing one architecture to handle everything.
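
The division of labor can be sketched as a propose-and-verify loop. Everything here is hypothetical: `orchestrate`, `propose`, and `energy` are placeholder names of mine, not Aleph's API, and the toy constraint (three distinct numbers that sum to 10) only stands in for a real formal check.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def orchestrate(propose: Callable[[], Iterable[T]],
                energy: Callable[[T], float],
                budget: int = 10) -> Optional[T]:
    """Return the first proposed candidate with zero energy, or None if the budget runs out."""
    for attempt, candidate in enumerate(propose()):
        if attempt >= budget:
            break
        if energy(candidate) == 0:     # the verifier, not the proposer, decides
            return candidate
    return None

def toy_energy(xs):
    """0 only when the numbers are distinct and sum to 10."""
    return abs(sum(xs) - 10) + (len(xs) - len(set(xs)))

def toy_propose():
    """Stand-in for a generator of plausible-looking candidates."""
    return iter([(1, 1, 8), (2, 3, 4), (1, 2, 7)])

print(orchestrate(toy_propose, toy_energy))            # (1, 2, 7)
print(orchestrate(toy_propose, toy_energy, budget=2))  # None
```

The first two candidates look reasonable and fail a constraint; only the third passes. When the budget runs out the loop returns nothing instead of a plausible wrong answer.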

This mirrors how human cognition works, at least at a high level. We have intuitive, language-based thinking (Kahneman's System 1) and deliberate, constraint-aware reasoning (System 2). We don't try to solve math problems by generating plausible-sounding words and hoping the constraints work out. We use different cognitive systems for different kinds of thinking. The compound AI architecture does the same.


Why This Matters for Developers

If you're building AI systems today, you've almost certainly encountered the "looks right but is wrong" failure mode. An LLM generates code that compiles and passes a quick review — but has a subtle logic error that only manifests under specific conditions. Or it produces a deployment plan that sounds perfectly reasonable but violates a security constraint you forgot to state explicitly.

This is not a prompt engineering problem. The model has no mechanism for verifying global consistency because it wasn't designed to. Future iterations with more parameters will not fix this — they will produce the same class of error with higher confidence and more articulate justifications.

What makes Logical Intelligence worth paying attention to:

  1. It's not a research paper — it's a live product. Load the demo at sudoku.logicalintelligence.com and test any frontier LLM against KONA yourself. The gap is not incremental. It is categorical.

  2. The team is top-tier. Yann LeCun (Turing Award winner, former Chief AI Scientist at Meta), Michael Freedman (Fields Medalist), Eve Bodnia (Founder and CEO). The combined mathematical and engineering depth is extraordinary for a company this early.

  3. The cost differential exposes the architectural inefficiency. The EBM solved the puzzle in 0.24 seconds running in-browser for free. The only LLM that returned a grid took about 12x longer (2.93s), cost real money, and produced the wrong answer. At scale (millions of reasoning queries per day) this difference compounds.

  4. Constraint satisfaction is everywhere. Code verification, medical diagnosis, legal reasoning, chip design, structural engineering, financial compliance, robotics, scheduling — these are all fundamentally constraint satisfaction problems where being wrong is expensive. Any domain where constraints are non-negotiable and errors cascade is a domain where autoregressive generation is structurally insufficient.

The architecture conversation in AI is about to become harder to ignore. LLMs are remarkable at what they do: language. But reasoning in the formal sense, meaning maintaining global consistency while navigating constraint spaces, may require a fundamentally different approach. Logical Intelligence is building one. And the fact that Yann LeCun is betting his institutional credibility on it should make everyone pay closer attention than the industry currently does.


Try It Yourself

The demo is public and free: sudoku.logicalintelligence.com


Are energy-based models the next architecture shift, or a niche solution that LLMs will eventually absorb? I'm genuinely curious — especially from people building reasoning systems in production. Drop your take in the comments.
