The Return of Recursion: How 5M-Parameter Models Are Outperforming Frontier LLMs on Reasoning in 2026
TL;DR Summary
- Tiny recursive models with 5-7 million parameters are achieving state-of-the-art on deterministic reasoning tasks that frontier LLMs score 0% on — including Sudoku-Extreme, ARC-AGI puzzles, and maze navigation
- The key innovation: reasoning in latent space instead of generating "thinking tokens" as Chain-of-Thought does, with reported speedups of up to 100x (HRM) and token reductions of up to 75.6% (RecursiveMAS)
- Probabilistic TRM (7M params) achieves 98.75% on Sudoku-Extreme using Gaussian noise to escape local optima, while DeepSeek-R1 scores 0.0%
- RecursiveMAS applies recursion to multi-agent systems — agents communicate via latent representations ("telepathically"), cutting tokens by 75.6% and improving accuracy by 8.3%
- Attractor Models solve for fixed points instead of iterating a fixed number of steps; a 770M-parameter Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens
Direct Answer Block
Recursive models revive a pre-transformer AI concept — iterative reasoning — but with modern training methods that avoid the vanishing gradient problems that killed RNNs. Instead of generating Chain-of-Thought tokens (slow, expensive), they refine representations in hidden latent space through loops. A 5M-parameter TRM achieves 87.4% on Sudoku-Extreme where DeepSeek-R1 scores 0%, while probabilistic extensions push this to 98.75% — at less than 0.0001x the cost of frontier LLMs.
Introduction
The AI industry has spent three years scaling transformers: more parameters, more data, more compute. Chain-of-Thought reasoning made them smarter but also slower and more expensive. Every reasoning step is a token, every token costs money, and long chains hit context limits. Meanwhile, a parallel research thread has been quietly reviving recursion, and the results are startling. Models with 5 million parameters are solving puzzles that billion-parameter systems fail completely, with reported speedups of up to 100x and, in multi-agent settings, up to 75.6% fewer tokens. Here's how recursive architectures work, why they're making a comeback, and where they fit in production.
Why did the AI industry abandon recursion — and why are 5M-parameter models now beating frontier LLMs at reasoning?
The story of recursion in AI is a story of training instability. Recurrent Neural Networks (RNNs) were the dominant architecture before transformers. They processed sequences iteratively — refining a hidden state through repeated passes — which is, conceptually, exactly what recursive reasoning models do today.
The problem was vanishing and exploding gradients. When you backpropagate through a recursive loop, gradients either shrink to zero (vanishing) or blow up to infinity (exploding) as the number of iterations grows. Training became unstable. The transformer's solution — process everything in parallel with attention, no recurrence — eliminated the gradient problem and enabled the scaling revolution of 2018-2025.
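To make the gradient problem concrete, here is a minimal sketch. It assumes a scalar linear recurrence h_t = w * h_{t-1}, a toy stand-in chosen for illustration rather than a real RNN. Backpropagating through the loop multiplies the gradient by w once per step, so after T steps the gradient is w^T. The function name and the values of w are illustrative assumptions.

```python
def gradient_through_loop(w: float, steps: int) -> float:
    """Gradient of h_T with respect to h_0 for the toy recurrence h_t = w * h_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # chain rule: one factor of w per recurrence step
    return grad


for w in (0.9, 1.0, 1.1):
    print(w, gradient_through_loop(w, steps=100))
```

With w = 0.9 the gradient after 100 steps falls below 1e-4, and with w = 1.1 it exceeds 1e4. Real networks replace the scalar w with a Jacobian matrix, but the same compounding applies.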
But attention has its own scaling problem: quadratic compute cost. Each token attends to every other token. Chain-of-Thought makes this worse, because every reasoning step generates a new token that must attend to every previous token. The total cost of a reasoning chain therefore grows quadratically with its length.
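A back-of-the-envelope count shows how this cost grows with chain length. The sketch below assumes causal attention in which the i-th generated token attends to the prompt plus every token generated so far, itself included. It counts query-key pairs only and ignores everything else in the model; the function name is hypothetical.

```python
def attention_pairs(prompt_tokens: int, generated_tokens: int) -> int:
    """Total query-key pairs computed while generating a reasoning chain."""
    total = 0
    for i in range(1, generated_tokens + 1):
        total += prompt_tokens + i  # token i sees the prompt and i generated tokens
    return total


short_chain = attention_pairs(prompt_tokens=0, generated_tokens=1000)
long_chain = attention_pairs(prompt_tokens=0, generated_tokens=2000)
print(short_chain, long_chain, long_chain / short_chain)
```

Doubling the chain from 1,000 to 2,000 generated tokens roughly quadruples the pair count when the prompt is short, which is what quadratic growth means in practice.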
"Autoregressive LLMs hit a reasoning wall — Chain-of-Thought forces models to externalize intermediate thoughts token by token, becoming slow and memory-intensive as sequences grow." — AlphaSignal summary of the recursive architecture revival
The recursive models being published in 2026 solve the gradient problem that killed RNNs through modern training innovations:
- TRM and HRM use weight-sharing across recursion steps, keeping the parameter count tiny (5-27M) and making gradient flow manageable
- Attractor Models use implicit differentiation: gradients are computed directly at the fixed point rather than by backpropagating through every solver iteration, which keeps training memory constant in effective depth
- Probabilistic TRM injects Gaussian noise at each step and uses a learned Q-head for early stopping, avoiding convergence to suboptimal solutions
The result: recursion is back, and it works. The arXiv:2605.19943 paper on Probabilistic TRM demonstrates 91.2% accuracy on Pencil Puzzle Bench vs 55.1% for frontier LLMs — "at less than 0.0001x the cost, using only 7M parameters."
How does recursive latent reasoning differ from Chain-of-Thought — and why does it deliver 100x speedups?
The fundamental difference is where reasoning happens:
Chain-of-Thought (autoregressive):
- Model generates reasoning steps as text tokens: "Step 1: Let me think about this... Step 2: If X then Y... Step 3: Therefore..."
- Each token must be generated, then fed back as input for the next token
- Token generation is sequential — cannot parallelize
- All intermediate tokens count toward context length and API costs
- Each token invokes the full forward pass of a massive model
Recursive Latent Reasoning:
- Model refines a hidden representation through a loop — no tokens emitted until the final answer
- The loop runs in the model's latent space (hidden states, not text)
- Iteration count is determined by convergence or a fixed budget, not by token generation speed
- No intermediate tokens = no context bloat, no token costs for reasoning steps
- The loop uses a tiny model (5-27M params), not a massive transformer
The 100x speedup claim comes from this architectural difference. Each Chain-of-Thought step requires a full forward pass through a billion-parameter model and generates a token. Each recursive latent step requires a forward pass through a model with a few million parameters and produces no token. The HRM paper (cited in the AlphaSignal newsletter) reported up to 100x speedup on deterministic reasoning tasks compared to autoregressive CoT approaches.
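The latent loop can be sketched in a few lines. This is a toy illustration under stated assumptions, not the TRM or HRM implementation: the weights are random rather than trained, the two-layer tanh network and the way the input is added to the latent state are choices made for this sketch, and all names are hypothetical. What it does show is the structural point: the same small set of weights is reused at every step, and nothing is emitted until the loop ends.

```python
import math
import random


def make_layer(dim, rng):
    """Random dim x dim weight matrix (a stand-in for trained weights)."""
    return [[rng.gauss(0.0, 0.3) for _ in range(dim)] for _ in range(dim)]


def apply_layer(weights, vec):
    """One dense layer with a tanh nonlinearity."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]


def latent_refine(x, layers, steps):
    """Refine a latent state for `steps` iterations, reusing the same layers."""
    z = [0.0] * len(x)
    for _ in range(steps):
        h = [xi + zi for xi, zi in zip(x, z)]  # condition on input and current latent
        for weights in layers:                 # identical weights at every step
            h = apply_layer(weights, h)
        z = h
    return z  # only this final state would be decoded into an answer


rng = random.Random(0)
dim = 8
layers = [make_layer(dim, rng), make_layer(dim, rng)]
x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
param_count = sum(len(row) for weights in layers for row in weights)

for steps in (1, 4, 16):
    z = latent_refine(x, layers, steps)
    print(steps, param_count, [round(v, 3) for v in z[:3]])
```

The parameter count stays at 128 no matter how many steps run; only compute grows with depth.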
The token reduction is even more dramatic. RecursiveMAS — which applies recursive principles to multi-agent systems — achieved 75.6% token reduction by round 3 (arXiv:2604.25917). Agents pass continuous latent representations to each other instead of text messages. Only the final answer is converted to text.
What are HRM, TRM, Probabilistic TRM, RecursiveMAS, and Attractor Models — and how do they compare?
Five distinct recursive approaches have emerged. Here's how they compare:
| Model | Parameters | Key Innovation | Best Result | vs Frontier LLMs |
|---|---|---|---|---|
| HRM (Hierarchical Reasoning Model) | 27M | H-L dual-module loop: slow abstract planning + fast detailed computation | ARC-AGI SOTA with 1,000 examples | 100x speedup on deterministic tasks |
| TRM (Tiny Recursive Model) | 5-7M | Single 2-layer weight-sharing network; increase recursion steps, not layers | 87.4% Sudoku-Extreme | DeepSeek-R1: 0.0% |
| Probabilistic TRM | 7M | Gaussian noise at each step enables diverse exploration; Q-head selects best | 98.75% Sudoku-Extreme | 0.0001x cost, 91.2% vs 55.1% frontier on puzzles |
| RecursiveMAS | Multi-agent | Agents communicate via latent states ("telepathically"); recursive collaboration loop | 8.3% accuracy gain, 2.4x speedup, 75.6% fewer tokens | Matches or exceeds on code + medical reasoning |
| Attractor Models | 27M; 770M in the Transformer comparison | Implicit differentiation solves for fixed points; equilibrium internalization | 91.4% Sudoku-Extreme; 770M model beats a 1.3B Transformer | Claude/GPT o3 fail completely on maze tasks |
HRM (arXiv:2603.22871 — March 2026)
The oldest of the modern recursive models. Uses two modules: H (high-level) for slow abstract planning and L (low-level) for fast detailed computation, coupled in a recursive loop. Inspired by human cognition — the dual-process theory where System 2 (slow, deliberate) plans and System 1 (fast, automatic) executes. Achieved state-of-the-art on ARC-AGI puzzles with only 1,000 training examples.
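One way to picture the coupled H-L loop is as two nested loops running at different speeds. The sketch below assumes the low-level state takes several steps for every high-level update. The scalar update rules are placeholders invented for this example (HRM's actual modules are learned networks), and every name here is hypothetical.

```python
def low_update(z_low, z_high, x):
    """Placeholder fast update: pulls z_low toward the midpoint of x and z_high."""
    return 0.5 * z_low + 0.25 * (x + z_high)


def high_update(z_high, z_low):
    """Placeholder slow update: moves z_high halfway toward z_low."""
    return 0.5 * z_high + 0.5 * z_low


def hierarchical_loop(x, high_cycles, low_steps):
    """Run `low_steps` fast updates inside each of `high_cycles` slow updates."""
    z_high, z_low = 0.0, 0.0
    low_calls = high_calls = 0
    for _ in range(high_cycles):
        for _ in range(low_steps):
            z_low = low_update(z_low, z_high, x)   # fast, detailed computation
            low_calls += 1
        z_high = high_update(z_high, z_low)        # slow, abstract planning
        high_calls += 1
    return z_high, low_calls, high_calls


result, low_calls, high_calls = hierarchical_loop(x=3.0, high_cycles=40, low_steps=4)
print(result, low_calls, high_calls)
```

In this toy the two states settle toward the input value, which is only a property of the placeholder rules. The part meant to carry over is the schedule: many low-level steps per high-level step.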
TRM (published at ICLR 2026 Latent & Implicit Thinking Workshop)
Strips HRM to its essence: a single 2-layer weight-sharing network. The key insight: increase recursion steps, not layers. Adding recursion depth improves generalization more than adding parameters. The 5M-parameter TRM hit 87.4% on Sudoku-Extreme, a task where DeepSeek-R1 scored 0.0%. The TRM+Mamba-2 hybrid from arXiv:2602.12078 improved pass@2 on ARC-AGI by 2.0% while maintaining parameter parity.
Probabilistic TRM (arXiv:2605.19943 — May 2026)
TRM's deterministic recursion can converge to suboptimal solutions with no escape mechanism. PTRM addresses this by injecting Gaussian noise at each recursion step, creating parallel trajectories that explore diverse solution basins. A learned Q-head (originally used for early stopping in TRM) selects the best trajectory. The improvement: Sudoku-Extreme from 87.4% to 98.75%, and Pencil Puzzle Bench from 62.6% to 91.2%, against 55.1% for frontier LLMs.
RecursiveMAS (arXiv:2604.25917 — April 2026)
Applies recursion to the multi-agent paradigm. Instead of agents exchanging text messages (expensive, verbose), they pass continuous latent representations through a lightweight RecursiveLink module — described as "telepathic" communication. The system is trained with an inner-outer loop algorithm for whole-system co-optimization. Results: 8.3% accuracy gain across 9 benchmarks, 1.2x-2.4x inference speedup, 34.6-75.6% token reduction.
Attractor Models (arXiv:2605.12466 — May 2026)
The most mathematically novel approach. Instead of iterating a fixed number of times, Attractor Models solve for a fixed point and differentiate through it with implicit differentiation. The model proposes output embeddings, then an attractor module refines them by solving for equilibrium, so training memory stays constant regardless of effective depth. The most remarkable finding is equilibrium internalization: after training, the model's initial output is already near equilibrium, allowing the solver to be removed at inference with little degradation. A 770M Attractor Model outperforms a 1.3B Transformer trained on twice as many tokens.
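The fixed-point idea can be shown on a scalar toy problem. The sketch below is not the Attractor Models architecture: it assumes a one-dimensional map f(z, x) = tanh(w*z + x) with w = 0.5, chosen because it is a contraction and so plain iteration converges. It then computes dz*/dx from the fixed point alone using the implicit function theorem, dz*/dx = (df/dx) / (1 - df/dz), and compares that against a finite-difference estimate. All names are hypothetical.

```python
import math


def solve_fixed_point(x, w=0.5, tol=1e-12, max_iter=1000):
    """Iterate z <- tanh(w*z + x) until it stops changing."""
    z = 0.0
    for _ in range(max_iter):
        z_next = math.tanh(w * z + x)
        if abs(z_next - z) < tol:
            return z_next
        z = z_next
    return z


def implicit_gradient(x, w=0.5):
    """dz*/dx computed at the fixed point, without storing solver iterations."""
    z_star = solve_fixed_point(x, w)
    slope = 1.0 - z_star ** 2         # tanh'(w*z* + x), because tanh(w*z* + x) = z*
    return slope / (1.0 - w * slope)  # (df/dx) / (1 - df/dz)


x = 0.3
z_star = solve_fixed_point(x)
h = 1e-5
finite_diff = (solve_fixed_point(x + h) - solve_fixed_point(x - h)) / (2 * h)
print(z_star, implicit_gradient(x), finite_diff)
```

The gradient needs only the converged value z*, not the intermediate iterates, which is why memory does not grow with the number of solver steps.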
How does Probabilistic TRM use Gaussian noise to break out of local optima and achieve 98.75% on Sudoku-Extreme?
Deterministic recursion has a fundamental weakness: it follows the same path every time. If that path leads to a suboptimal solution, there's no escape — the recursion converges to a local minimum and stays there.
Probabilistic TRM introduces stochastic exploration as a test-time compute scaling strategy:
- Inject Gaussian noise at each deep recursion step — small perturbations that nudge the latent state into neighboring regions of the solution space
- Run multiple parallel trajectories — each with different noise realizations, exploring different solution basins
- Use the Q-head for selection — the same Q-head originally designed for early stopping in TRM now scores each trajectory's quality
- Select the best trajectory — highest Q-head score wins
The key insight: this requires no retraining. The original TRM's Q-head — trained for early stopping — naturally generalizes to trajectory selection. The noise injection is applied at inference time only. The PTRM paper shows accuracy gains across multiple benchmarks without any task-specific augmentations.
"PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head. Without requiring retraining or task-specific augmentations, PTRM enables substantial accuracy gains." — arXiv:2605.19943 abstract
The practical implication: for deterministic reasoning tasks (puzzles, logic, math proofs), you can take an existing tiny recursive model and improve its accuracy simply by adding noise at inference time and running a few parallel trajectories. On the reported benchmarks the gain was roughly 11 to 29 percentage points (87.4% to 98.75% on Sudoku-Extreme, 62.6% to 91.2% on Pencil Puzzle Bench). No retraining is needed.
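The noise-then-select procedure can be sketched on a one-dimensional toy problem. Everything here is an assumption made for illustration: the "model" is gradient descent on a double-well energy with a local minimum near z = 1 and a lower one near z = -1, the "Q-head" is a hand-written score (negative energy) rather than a learned head, and the clamp to [-2, 2] exists only to keep the toy numerically stable. It is not the PTRM implementation.

```python
import random


def energy(z):
    """Double-well energy: local minimum near z = 1, lower minimum near z = -1."""
    return (z * z - 1.0) ** 2 + 0.3 * z


def energy_grad(z):
    return 4.0 * z * (z * z - 1.0) + 0.3


def q_score(z):
    """Stand-in for a learned Q-head: higher means a better final state."""
    return -energy(z)


def run_trajectory(z0, steps, noise_std, rng, lr=0.05):
    """Refinement loop with Gaussian noise injected at every step."""
    z = z0
    for _ in range(steps):
        z = z - lr * energy_grad(z) + rng.gauss(0.0, noise_std)
        z = max(-2.0, min(2.0, z))  # clamp: a safety choice for this sketch only
    return z


def noisy_best_of_n(z0, steps, noise_std, n_traj, seed=0):
    """Run n_traj noisy trajectories and keep the one the score prefers."""
    rng = random.Random(seed)
    finals = [run_trajectory(z0, steps, noise_std, rng) for _ in range(n_traj)]
    return max(finals, key=q_score), finals


deterministic = run_trajectory(0.5, 200, 0.0, random.Random(0))
best, finals = noisy_best_of_n(0.5, 200, noise_std=0.3, n_traj=16)
print(deterministic, q_score(deterministic))
print(best, q_score(best))
```

With noise_std=0.0 the loop is deterministic and settles in the basin near z = 1. With noise, trajectories can cross the barrier into the lower-energy basin near z = -1, and the scoring function picks whichever endpoint scores best.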
How does RecursiveMAS apply recursion to multi-agent systems — and why does "telepathic" latent communication reduce tokens by 75%?
Standard multi-agent systems work like a chat room: Agent A generates text, Agent B reads it and generates text, Agent C reads both and generates text. Every message consumes tokens, adds latency, and accumulates error as text summaries lose information.
RecursiveMAS changes the communication channel: agents pass continuous latent representations — floating-point vectors in the model's hidden space — through a lightweight RecursiveLink module. The module is a small learned network that transforms one agent's latent state into a format the next agent can process.
This is described as "telepathic" communication because:
- Less information is lost to text compression (a latent vector can preserve more information than a text summary)
- No tokens are consumed (the communication is continuous, not discrete)
- Communication is parallelizable (multiple agent pairs can exchange latent states simultaneously)
- The RecursiveLink module is optimized end-to-end with the agents, so the latent format evolves to be maximally useful
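A rough picture of latent hand-off between agents is sketched below. It is a hypothetical stand-in rather than the paper's RecursiveLink: the link is assumed to be a residual linear adapter, the agents are a toy tanh function, and the weights are hand-set instead of learned. The point it illustrates is that only float vectors move between agents across rounds, with no text in between.

```python
import math


def recursive_link(latent, weights, bias):
    """Residual linear adapter: output = latent + (W @ latent + b)."""
    projected = [
        sum(w * v for w, v in zip(row, latent)) + b
        for row, b in zip(weights, bias)
    ]
    return [v + p for v, p in zip(latent, projected)]


def toy_agent(latent):
    """Placeholder for an agent's internal computation on a latent state."""
    return [math.tanh(v) for v in latent]


def collaborate(initial, links, rounds):
    """Pass a latent state through each link and agent for several rounds."""
    state = initial
    for _ in range(rounds):
        for weights, bias in links:
            state = toy_agent(recursive_link(state, weights, bias))
    return state  # only this final state would be decoded to text


dim = 4
identity_link = ([[0.0] * dim for _ in range(dim)], [0.0] * dim)
mixing_link = (
    [[0.1 if i != j else 0.0 for j in range(dim)] for i in range(dim)],
    [0.0] * dim,
)
final_state = collaborate([0.5, -0.2, 0.1, 0.9], [identity_link, mixing_link], rounds=3)
print(final_state)
```

With all-zero weights the adapter passes the latent through unchanged, which makes the residual form a convenient starting point before any training.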
The results from arXiv:2604.25917:
- Up to 75.6% token reduction, reached by round 3 (vs text-based MAS)
- Up to 2.4x end-to-end speedup (latent communication is faster than text generation)
- 8.3% accuracy improvement across 9 benchmarks (latent states preserve more information than text summaries)
The framework was evaluated under 4 representative agent collaboration patterns across 9 benchmarks spanning mathematics, science, medicine, search, and code generation. The latent approach consistently outperformed text-based alternatives across all patterns.
The inner-outer loop training algorithm deserves attention: the outer loop optimizes the whole multi-agent system, while the inner loop handles per-agent recursion. Shared gradient-based credit assignment propagates across recursion rounds — meaning later agents can influence the training of earlier agents, and vice versa.
Where do recursive models fit in production — and when should you still use frontier LLMs instead?
Recursive models are specialized reasoning engines, not general-purpose language models. The deployment boundary is clear:
Use recursive models for:
- Deterministic logic tasks: Sudoku, constraint satisfaction, puzzle solving, theorem proving
- Pattern recognition: ARC-AGI puzzles, Raven's matrices, abstract reasoning
- Latency-critical applications: Robotics, embodied AI, real-time systems where 100ms matters
- Cost-sensitive tasks: Running 7M-parameter models locally vs API calls to billion-parameter models
- Data-scarce domains: Scientific exploration where training examples are limited (HRM achieved SOTA with 1,000 examples)
Use frontier LLMs for:
- Language understanding and generation: Creative writing, summarization, translation, conversation
- General knowledge tasks: Question answering, fact retrieval, explanation
- Code generation: Real-world software engineering (note: RecursiveMAS showed gains on code benchmarks, but general SWE requires LLM capabilities recursive models don't have)
- Multi-modal tasks: Images, audio, video understanding — recursive models are currently text/latent only
The hybrid future:
The AlphaSignal newsletter describes the optimal architecture as hybrid systems, with recursive models acting as specialized reasoning engines inside LLM-powered applications. An LLM handles the interface (understanding user intent, generating explanations, formatting output), then delegates deterministic reasoning tasks to a recursive sub-component that returns results in milliseconds rather than seconds.
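A hybrid system of this kind might be wired up as a simple dispatcher. The sketch below is hypothetical throughout: `solve_with_recursive_model` and `call_llm` are placeholder stubs rather than real APIs, and the task type is assumed to be known already (in practice the LLM or a classifier would have to infer it).

```python
from typing import Callable, Dict


def solve_with_recursive_model(payload: str) -> str:
    """Placeholder for a call into a small recursive solver (hypothetical)."""
    return f"[solver result for {payload}]"


def call_llm(prompt: str) -> str:
    """Placeholder for a frontier LLM call (hypothetical)."""
    return f"[LLM response to {prompt}]"


# Task types that are delegated to the specialized solver.
SOLVERS: Dict[str, Callable[[str], str]] = {
    "sudoku": solve_with_recursive_model,
    "maze": solve_with_recursive_model,
}


def handle_request(task_type: str, payload: str) -> str:
    """Route structured puzzles to the solver; send everything else to the LLM."""
    solver = SOLVERS.get(task_type)
    if solver is None:
        return call_llm(payload)
    raw_result = solver(payload)
    # The LLM still owns the user-facing explanation and formatting.
    return call_llm(f"Explain this result to the user: {raw_result}")


print(handle_request("chat", "Summarize this article"))
print(handle_request("sudoku", "puzzle grid"))
```

Keeping the solver behind a registry means new deterministic task types can be added without touching the LLM-facing code path.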
The Attractor Models paper suggests another direction: equilibrium internalization. If models can learn to internalize reasoning to the point where the solver can be removed at inference, then recursive training becomes a way to produce standard feed-forward models that have internalized deeper reasoning — no recursion needed at inference time.
Frequently Asked Questions
Q: Can I run a recursive model on my laptop?
Yes. These models have 5-27 million parameters, orders of magnitude fewer than even a "small" LLM (1B+). A 7M-parameter TRM or PTRM runs easily on consumer hardware. Recursive inference does require multiple forward passes, but even 50 passes through a 7M model is computationally trivial compared to one pass through a 70B LLM.
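The comparison is easy to check with rough arithmetic. The sketch below assumes that the cost of a forward pass scales with the number of parameters touched and that weights are stored at 2 bytes each (16-bit precision). Both are simplifying assumptions, and the function names are hypothetical.

```python
def forward_pass_cost(params: int, passes: int) -> int:
    """Parameters touched across all passes (a crude proxy for compute)."""
    return params * passes


def weight_memory_mb(params: int, bytes_per_param: int = 2) -> float:
    """Memory needed to hold the weights, in megabytes."""
    return params * bytes_per_param / 1e6


tiny = forward_pass_cost(7_000_000, passes=50)
large = forward_pass_cost(70_000_000_000, passes=1)
print(large / tiny)
print(weight_memory_mb(7_000_000), weight_memory_mb(70_000_000_000))
```

Under these assumptions, 50 passes through the 7M model touch 200 times fewer parameters than one pass through the 70B model, and the 7M model's weights fit in about 14 MB.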
Q: Are recursive models a replacement for transformers?
No. They're complementary. Recursive models excel at deterministic reasoning and pattern recognition. LLMs excel at language, creativity, and general knowledge. The most promising direction is hybrid systems where recursive models serve as reasoning engines inside LLM-based applications.
Q: Why can't I just use CoT with my existing LLM?
You can, and for many tasks CoT works well. But for specific classes of problems (Sudoku, mazes, ARC-AGI), CoT fails because the problem requires exploring a solution space iteratively, not generating a linear chain of reasoning. DeepSeek-R1, for example, scores 0.0% on Sudoku-Extreme. Recursive models are designed specifically for iterative solution-space exploration.
Q: How do recursive models handle tasks they weren't trained on?
Generalization is where they shine. Because recursive models have so few parameters (5-27M), they have little capacity to memorize and are pushed toward learning general reasoning strategies. TRM achieved 45% on ARC-AGI-1 with 5M parameters, while frontier LLMs with orders of magnitude more parameters struggle. The weight-sharing across recursion steps acts as a strong regularizer.
Q: What's the difference between HRM and TRM?
HRM uses two separate modules (H for abstract planning, L for detailed computation) in a coupled loop. TRM simplifies this to a single weight-sharing network. TRM is smaller (5-7M vs 27M), simpler, and achieved competitive results. Probabilistic TRM builds on TRM. Attractor Models are a different approach — solving for fixed points rather than iterating.
Q: Is the recursive architecture revival connected to the "Titans" and "deep thinking" trends?
Yes — they're parallel developments. Titans (Google, 2025) introduced neural memory modules for long context. Deep thinking approaches extend reasoning through iterative refinement. The recursive architecture revival is the most radical version: tiny models that replace autoregressive token generation with latent-space iteration entirely, rather than augmenting it.
Glossary
- Recursive latent reasoning: Iterative refinement of hidden representations in a model's latent space without emitting intermediate tokens — the core mechanism behind TRM, HRM, and related architectures
- Chain-of-Thought (CoT): An autoregressive reasoning method where models generate intermediate reasoning steps as text tokens — effective but slow and token-expensive
- Weight-sharing: Using the same parameters across multiple recursion steps, keeping model size tiny while enabling deep computation through iteration count
- Probabilistic recursion: Injecting Gaussian noise at each recursion step to explore diverse solution basins, then selecting the best trajectory — improves accuracy without retraining
- Equilibrium internalization (Attractor Models): A phenomenon where fixed-point training causes the model's initial output to already be near equilibrium, allowing the solver to be removed at inference
- Test-time compute scaling: Improving model accuracy by spending more computation at inference (more iterations, more trajectories) rather than during training
Author
Ramsis Hammadi, AI/ML engineer specializing in GenAI, LLM engineering, and automation.