The Hierarchical Reasoning Model: Can a 27M-Parameter Network Outthink Chain-of-Thought?

#ai #deeplearning #machinelearning #python

A recent arXiv paper (2506.21734) describes a small recurrent architecture, the Hierarchical Reasoning Model (HRM), that its authors report can solve complex Sudoku puzzles, navigate large mazes, and score competitively on the Abstraction and Reasoning Corpus (ARC-AGI), all with roughly 27 million parameters and no Chain-of-Thought prompting. That combination of claims is unusual enough to be worth unpacking carefully, especially since an independent audit by the ARC Prize team paints a more nuanced picture of what actually drives the results.

What Problem Is HRM Trying to Solve?

Standard transformer-based language models reason by generating tokens one at a time. When a task requires deep search or backtracking — think solving a hard Sudoku or finding an optimal path through a 30×30 maze — the model must either externalize every intermediate step as text (Chain-of-Thought) or fail. CoT works surprisingly well, but it has real costs: it consumes output tokens proportional to reasoning depth, it is brittle when the chain goes wrong early, and it requires the model to have learned the right "reasoning vocabulary" during training.

HRM takes a different approach: instead of reasoning through tokens, it reasons within its hidden states across multiple recurrent cycles. The idea is that a model can "think" without writing anything down, as long as it has enough internal computational depth.

The Dual-Stack Architecture

HRM is built around two coupled recurrent modules that operate at different timescales:

High-Level Module (H): This is the slow planner. It updates once per outer cycle and is responsible for abstract strategy — setting the direction for the current phase of computation. Think of it as the part of the network that decides what to work on next.

Low-Level Module (L): This is the fast worker. For every single update of H, L iterates T times, performing rapid, detailed computations. It handles the fine-grained search and refinement within the direction set by H.

The two modules are coupled: L's final state at the end of an inner loop feeds into H's next update, and H's updated state then becomes the context for L's next inner loop. This creates a hierarchical convergence dynamic. Each H update disrupts L's convergence, forcing L to start a new computational phase rather than settling into a fixed point too early.
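The nested timing described above can be sketched in a few lines of plain Python. This is a toy illustration only: the scalar states, the update functions `low_level_step` and `high_level_step`, and their coefficients are all made up for this example and stand in for the transformer blocks. The sketch models the coupling by passing H's state as context to every L step, which is one reading of the description above.

```python
import math

def low_level_step(z_l, z_h, x):
    # Hypothetical fast update: L refines its state using H's context and the input.
    return math.tanh(0.5 * z_l + 0.3 * z_h + x)

def high_level_step(z_h, z_l):
    # Hypothetical slow update: H absorbs the result of L's finished inner loop.
    return math.tanh(0.8 * z_h + 0.6 * z_l)

def hrm_forward(x, n_cycles=3, t_inner=4):
    """Run n_cycles outer cycles, each with t_inner L steps and one H step."""
    z_h, z_l = 0.0, 0.0
    l_updates = h_updates = 0
    for _ in range(n_cycles):
        for _ in range(t_inner):
            z_l = low_level_step(z_l, z_h, x)  # L runs T times per H update
            l_updates += 1
        z_h = high_level_step(z_h, z_l)        # H updates once per outer cycle
        h_updates += 1
    return z_h, l_updates, h_updates

z_h, l_updates, h_updates = hrm_forward(x=0.2)
print(l_updates, h_updates)  # 12 3
```

With 3 outer cycles and 4 inner steps, L updates 12 times while H updates 3 times, which is the two-timescale structure in miniature.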

Both modules are built on standard transformer blocks with full self-attention and rotary positional encoding, so the architecture is not exotic at the component level. What is unusual is the recurrent outer loop and the way gradients are computed.

Training Without Backpropagation Through Time

Recurrent networks are notoriously difficult to train because backpropagation through time (BPTT) requires storing the entire computation history, which grows linearly with the number of recurrent steps. For a model that runs hundreds of inner-loop iterations, this is prohibitive.

HRM sidesteps this by using a one-step gradient approximation derived from the Implicit Function Theorem. Rather than unrolling the full recurrent computation, the gradient is estimated at the fixed point of the inner loop. Memory usage therefore stays constant, O(1) in the number of recurrent steps, instead of growing linearly as it does under BPTT. That is a meaningful practical advantage.
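A scalar toy problem makes the idea concrete. The sketch below assumes a made-up recurrence z = tanh(w·z + x), not HRM's actual update. It compares the exact gradient of the fixed point with respect to `w` (from the Implicit Function Theorem) against a one-step estimate that differentiates only the last application of the update and treats the incoming state as a constant. Reading the one-step approximation as "drop the inverse term" is an interpretation for illustration, not a claim about the paper's exact implementation.

```python
import math

def f(z, w, x):
    # Hypothetical recurrent update with a single scalar state.
    return math.tanh(w * z + x)

def fixed_point(w, x, steps=200):
    # Iterate the update until it has (numerically) converged.
    z = 0.0
    for _ in range(steps):
        z = f(z, w, x)
    return z

w, x = 0.5, 0.3
z_star = fixed_point(w, x)

# At the fixed point z* = tanh(w*z* + x), so tanh'(u) = 1 - z*^2.
sech2 = 1.0 - z_star ** 2
df_dw = sech2 * z_star   # partial derivative of f with respect to w
df_dz = sech2 * w        # partial derivative of f with respect to z

# Implicit Function Theorem: dz*/dw = df_dw / (1 - df_dz).
exact_grad = df_dw / (1.0 - df_dz)

# One-step estimate: keep only the last step, no history stored.
one_step_grad = df_dw

# Finite-difference check of the exact gradient.
eps = 1e-6
fd_grad = (fixed_point(w + eps, x) - fixed_point(w - eps, x)) / (2 * eps)

print(exact_grad, fd_grad, one_step_grad)
```

In this toy setting the one-step estimate has the right sign but a smaller magnitude than the exact gradient. What it buys is that nothing from earlier iterations needs to be kept in memory.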

The model also uses Adaptive Computation Time (ACT), a halting mechanism trained via Q-learning. Instead of always running a fixed number of cycles, the model learns to stop early on easy inputs and run longer on hard ones. This lets it trade compute for accuracy at inference time.
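The halting decision can be sketched as a loop that compares two Q-values after each cycle. Everything here is a simplified assumption: `segment_fn`, `q_head`, and the `min_segments` and `max_segments` parameters are hypothetical names, the "state" is just a remaining-error number, and the Q-learning that would train the halting head is omitted entirely.

```python
def run_with_halting(segment_fn, q_head, state, max_segments=8, min_segments=1):
    """Run compute segments until the Q-head prefers halting or the budget runs out."""
    used = 0
    for used in range(1, max_segments + 1):
        state = segment_fn(state)
        q_halt, q_continue = q_head(state)
        if used >= min_segments and q_halt > q_continue:
            break  # halting is estimated to be worth more than continuing
    return state, used

# Toy stand-ins: each segment halves the remaining error, and the
# (untrained, hand-written) Q-head prefers halting once the error is small.
def toy_segment(err):
    return err / 2.0

def toy_q_head(err):
    return 1.0 - err, err  # (q_halt, q_continue)

_, easy_steps = run_with_halting(toy_segment, toy_q_head, state=0.6)
_, hard_steps = run_with_halting(toy_segment, toy_q_head, state=8.0)
_, capped_steps = run_with_halting(toy_segment, toy_q_head, state=1024.0)
print(easy_steps, hard_steps, capped_steps)  # 1 5 8
```

The easy input halts after one segment, the harder one takes five, and the hardest runs into the `max_segments` cap. That is the compute-for-accuracy trade in its simplest form.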

What the Numbers Actually Show

The paper reports near-perfect accuracy on Sudoku-Extreme and optimal pathfinding in 30×30 mazes, tasks where CoT-based models often fail entirely. On ARC-AGI-1, the paper claims 41% accuracy.

However, the ARC Prize team ran an independent verification on its semi-private evaluation sets and found a more modest 32% Pass@2 on ARC-AGI-1 and only 2% Pass@2 on ARC-AGI-2. Both numbers matter: the ARC-AGI-1 result falls short of the paper's reported figure, and the 2% score on ARC-AGI-2, a harder and less-saturated benchmark, suggests the model has not learned a general reasoning capability.
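Pass@2 means a task counts as solved if either of two submitted attempts exactly matches the expected output grid. A minimal sketch of that scoring rule follows. It assumes one test output per task, which is a simplification, and the task IDs and grids are invented for the example.

```python
def pass_at_2(attempts_by_task, solutions):
    """Fraction of tasks where one of the first two attempts equals the target grid."""
    solved = 0
    for task_id, target in solutions.items():
        attempts = attempts_by_task.get(task_id, [])[:2]  # only two tries count
        if any(attempt == target for attempt in attempts):
            solved += 1
    return solved / len(solutions)

# Made-up data: four tiny "tasks" with 1x2 grids as targets.
solutions = {
    "t1": [[1, 0]],
    "t2": [[2, 2]],
    "t3": [[0, 3]],
    "t4": [[4, 4]],
}
attempts_by_task = {
    "t1": [[[1, 0]], [[0, 1]]],             # solved on the first attempt
    "t2": [[[0, 0]], [[2, 2]]],             # solved on the second attempt
    "t3": [[[3, 0]], [[3, 3]], [[0, 3]]],   # correct only on a third try: not counted
    # "t4" has no attempts at all
}

score = pass_at_2(attempts_by_task, solutions)
print(score)  # 0.5
```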

More importantly, the ARC Prize analysis found that the hierarchical architecture itself contributed relatively little to the performance. A standard transformer baseline using the same training pipeline — including the same outer-loop refinement and data augmentation — achieved nearly identical results. The actual drivers of performance appear to be:

Outer-loop refinement: The model feeds its current prediction back into itself and refines it over repeated passes, with the halting mechanism deciding when to stop. This is a form of iterative compute scaling that does not depend on the hierarchical architecture.
Task augmentation: The training data is heavily augmented with rotations, flips, and recolorings of ARC tasks, which helps the model generalize within the distribution of ARC-style puzzles.
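The augmentation step can be sketched in plain Python. The key detail is that the same transform must be applied to both the input and the output grid of a pair, so the underlying rule is preserved. The helper names and the choice to represent grids as lists of lists of color integers are assumptions made for this example, not the paper's code.

```python
def rotate90(grid):
    # Rotate a grid 90 degrees clockwise.
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    # Mirror each row left to right.
    return [row[::-1] for row in grid]

def recolor(grid, mapping):
    # Swap colors according to mapping; colors not in the mapping are kept.
    return [[mapping.get(cell, cell) for cell in row] for row in grid]

def augment_pair(inp, out, rotations=0, flip=False, mapping=None):
    """Apply one identical transform to both grids of an input/output pair."""
    def apply(grid):
        for _ in range(rotations % 4):
            grid = rotate90(grid)
        if flip:
            grid = flip_horizontal(grid)
        if mapping:
            grid = recolor(grid, mapping)
        return grid
    return apply(inp), apply(out)

inp = [[1, 2],
       [3, 4]]
out = [[4, 3],
       [2, 1]]

aug_inp, aug_out = augment_pair(inp, out, rotations=1, mapping={1: 5})
print(aug_inp)  # [[3, 5], [4, 2]]
print(aug_out)  # [[2, 4], [5, 3]]
```

Combining four rotations, an optional flip, and color permutations yields many variants per task, which is why this kind of augmentation can multiply a small training set considerably.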

This does not make HRM uninteresting: the memory-efficient training method and the ACT mechanism are genuinely useful contributions. But it does mean the headline claim (a 27M-parameter model outperforming CoT) should be read carefully. The performance comes largely from a well-engineered training and inference pipeline, not purely from the hierarchical recurrent design.

Why Latent Reasoning Is Still Worth Watching

Even if the ARC Prize analysis deflates some of the architectural claims, the broader direction HRM represents is worth following. There is a real question in the field about whether the right way to scale reasoning is to generate more tokens (longer CoT, more output compute) or to build models that do more computation per token in their hidden states (for example, through deeper recurrence).

HRM is one of several recent architectures — alongside Mamba-based state-space models and hybrid attention-recurrence designs — that explore the second path. The practical appeal is clear: if a model can reason deeply without externalizing every step, it could be faster, cheaper, and less sensitive to early errors in the reasoning chain.

The challenge, as the HRM analysis illustrates, is that it is hard to isolate the contribution of the architecture from the contribution of the training and inference setup. Rigorous ablations and independent evaluations are essential before drawing strong conclusions.

Key Takeaways

HRM uses a dual-stack recurrent design (slow planner + fast worker) to perform latent reasoning without generating explicit reasoning tokens.
Its memory-efficient training via the Implicit Function Theorem and adaptive halting via ACT are practical contributions independent of the benchmark results.
Independent evaluation by the ARC Prize team found lower scores than the paper reports, and attributed most of the performance to outer-loop refinement and data augmentation rather than the hierarchical architecture.
The broader question of whether latent (hidden-state) reasoning can compete with token-level Chain-of-Thought remains open and active.