[Day 7] Does Giving an AI More "Thinking Time" Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX
Intro
Day 7!
Reddit kept surfacing this new project called OpenMythos in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.
Tools used: my home AI machine (DGX Spark) + OpenMythos (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.
The question: does giving an AI more "thinking time" (= more recurrent loops at inference) actually make it smarter?
Today's setup
The hype
On 2026-04-07, Anthropic announced Claude Mythos. Press coverage highlights zero-day discovery capabilities — reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD — but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (Project Glasswing — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.
Twelve days later, Kye Gomez (Swarms) released OpenMythos, a PyTorch reconstruction of the suspected architecture. The repo is explicit upfront:
"an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"
So OpenMythos is not Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with MoE FFNs and MLA/GQA attention, capable of being trained from scratch on standard text data. No leaked weights, no distillation.
Reddit's "ASI is near" framing skips this critical distinction. The interesting question, once you set the hype aside, is whether the architectural idea — recurrent depth — actually works.
Note for this article: OpenMythos is not Claude Mythos — it's a theoretical reconstruction inspired by looped-transformer research. The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."
Three perspectives on looped transformers
Browsing the literature, I found three different studies giving different pictures of how looped transformers behave:
| Source | Scale | Claim |
|---|---|---|
| Saunshi et al. 2025 (ICLR, research paper) | tens of M params, synthetic | Loops work: k layers looped L times approximately matches kL-layer fixed-depth, on addition / p-hop induction / math |
| Geiping et al. 2025 (Huginn, research paper) | 3.5B params, 800B tokens | Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime |
| Micheal Bee 2026-04 (Medium, independent experiment blog) | 17M params, 12 GPU-hours on RTX 5070 Ti | Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape |
Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task — multi-digit addition.
What I'd hoped to see
- Does training-time accuracy phase-transition (grok) at some step? (Saunshi 3-stage prediction)
- Does test-time loop count matter? At what point does it stop helping?
- Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)
Headline finding
- Loops help, but only within a narrow window centered on the training loop count. With training-time max_loop_iters=4, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in both directions: fewer loops underthink, more loops overthink.
- Bee's "T=2 fixed-point" reproduced. Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, then climbs slowly to ~0.99 by T=4 and stays flat through T=32.
- Striking per-seed grokking variance. Same hyperparameters, four seeds: seeds 1 and 3 reach ≥87% on every digit count by step 4,000; seed 2 takes 10,000; seed 0 stalls at <10% on 5-digit addition until step 16,000, then jumps to 100%.
- No depth extrapolation in this setup. Saunshi's claim that training at T=4 should generalize to deeper T at inference does not reproduce here — instead, T>4 hurts.
🌀 What is a "looped" transformer?
A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers — increasing parameter count.
A looped transformer reuses the same parameters across multiple iterations. The model has a Prelude → Recurrent Block × T → Coda structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.
Input tokens
↓
[Prelude P] — standard layers, run once
↓
[Recurrent Block R] — one block looped T times
↑_______↓ h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
↓
[Coda C] — standard layers, run once
↓
Output logits
At each loop iteration t, the hidden state updates via the LTI injection rule, and the encoded input e (Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection parameters are constrained so that spectral radius ρ(A) < 1, which prevents divergence over many loops (Parcae stability framework).
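To see why the constraint ρ(A) < 1 matters, here is a minimal scalar sketch of the injection rule. It is not OpenMythos code: the `Transformer(h_t, e)` term is dropped, `A` and `B` are replaced by plain floats `a` and `b`, and the function name `run_injection` is made up for illustration. In the scalar case the spectral radius is simply |a|.

```python
def run_injection(a, b, e, n_loops, h0=0.0):
    """Iterate h_{t+1} = a*h_t + b*e (scalar stand-in, Transformer term omitted)."""
    h = h0
    trajectory = []
    for _ in range(n_loops):
        h = a * h + b * e  # re-inject the encoded input e at every loop
        trajectory.append(h)
    return trajectory


# |a| < 1: the state settles at the fixed point b*e / (1 - a).
stable = run_injection(a=0.5, b=1.0, e=1.0, n_loops=32)
fixed_point = 1.0 * 1.0 / (1.0 - 0.5)

# |a| > 1: the same rule grows without bound as loops are added.
unstable = run_injection(a=1.5, b=1.0, e=1.0, n_loops=32)

print(stable[-1], fixed_point)
print(unstable[-1])
```

With |a| < 1 the linear part alone converges to a value set by the injected input, whatever the starting state. The real recurrent block adds a nonlinear term on top, so this only illustrates the stability argument, not the model's actual dynamics.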
The key claim: more loops at inference = deeper reasoning, without adding parameters. This is conceptually analogous to chain-of-thought scaling — except the "thinking" happens in continuous latent space rather than discrete token space.
🔧 Experimental setup
I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.
OpenMythos tiny (3.4M params)
↓
Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
↓
Experiment A: greedy autoregressive accuracy
loops ∈ {1, 2, 4, 8, 16, 32} × digits ∈ {2, 3, 4, 5}
↓
Experiment B: cosine similarity between consecutive hidden states
⇒ does the recurrent block reach a fixed-point?
↓
Compare against Saunshi / Huginn / Bee
Model config
MythosConfig(
    vocab_size=16,          # digits 0-9 + '+', '=', pad, eos
    dim=256,
    n_heads=8,
    n_kv_heads=2,           # GQA
    max_seq_len=32,
    max_loop_iters=4,       # training depth; inference varies
    prelude_layers=1,
    coda_layers=1,
    attn_type="gqa",
    n_experts=4,            # MoE FFN inside recurrent block
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=512,
    lora_rank=8,            # depth-wise LoRA per loop step
)
Total parameters: 3,386,658 (~3.4M).
Data
On-the-fly synthetic addition. Operands are uniformly sampled from [10^(d-1), 10^d - 1] for digit count d ∈ {2, 3, 4, 5}. Sequence format "A+B=R$", where R = str(A+B)[::-1] (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally).
Loss is applied only at positions following the = token (i.e., on the answer tokens).
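The data format can be sketched in a few lines of plain Python. This is an illustration of the description above, not the actual training script: the function names `format_example` and `make_example` are hypothetical, and the mask convention (1 on every token after `=`, including the `$` end marker) is an assumption about how "answer tokens" is interpreted.

```python
import random


def format_example(a, b):
    """Build "A+B=R$" where R is the sum written least-significant digit first."""
    answer = str(a + b)[::-1]
    return f"{a}+{b}={answer}$"


def make_example(d, rng):
    """Sample one d-digit problem and a per-token loss mask."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)  # uniform, inclusive bounds
    seq = format_example(a, b)
    eq = seq.index("=")
    # 1 marks tokens that count toward the loss (assumed: everything after '=').
    mask = [1 if i > eq else 0 for i in range(len(seq))]
    return seq, mask


print(format_example(57, 68))  # 57 + 68 = 125, reversed: "57+68=521$"
seq, mask = make_example(5, random.Random(0))
print(seq, mask)
```

Writing the answer in reverse means the first generated digit is the ones place, so each carry is produced before the digit that needs it.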
Training
- Optimizer: AdamW, betas (0.9, 0.95), wd 0.1
- LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5
- Grad clip: 1.0
- Batch size: 128
- Max steps: 30000
- dtype: fp32
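For reference, the learning-rate schedule listed above can be written as a small pure-Python function. The list only gives the peak, warmup length, and floor, so two details here are assumptions: the warmup is linear from zero, and the cosine decay spans the remaining steps up to `max_steps`. The name `lr_at` is made up for this sketch.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=2000, max_steps=30000):
    """Learning rate at a given step: linear warmup, then cosine decay to min_lr."""
    if step < warmup:
        return max_lr * step / warmup  # assumed: linear ramp starting at 0
    progress = min((step - warmup) / (max_steps - warmup), 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 1000, 2000, 16000, 30000):
    print(s, lr_at(s))
```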
Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as complex64 buffers, and model.to(bfloat16) silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine — the bottleneck is not memory but parallel scheduling.
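Why losing the imaginary part is fatal can be shown without PyTorch. The sketch below treats one pair of features as a single complex number and rotates it by a position-dependent angle, which is the usual complex-valued formulation of RoPE. Whether OpenMythos applies its buffers in exactly this way is an assumption, and `rotate` is a hypothetical helper.

```python
import cmath


def rotate(x, theta, keep_imag=True):
    """Rotate one feature pair (packed as a complex number) by angle theta."""
    freq = cmath.exp(1j * theta)  # cos(theta) + i*sin(theta)
    if not keep_imag:
        # What remains after a complex -> real cast: only cos(theta).
        freq = complex(freq.real, 0.0)
    return x * freq


x = complex(1.0, 2.0)
theta = 1.0

good = rotate(x, theta)                   # proper rotation
bad = rotate(x, theta, keep_imag=False)   # imaginary part discarded

print(abs(x), abs(good), abs(bad))
print(cmath.phase(x), cmath.phase(good), cmath.phase(bad))
```

With the full complex factor the vector keeps its length and its angle shifts by `theta`, which is how position gets encoded. With only the real part left, the vector is merely rescaled by cos(theta) and its angle does not move, so the positional rotation attention relies on is gone.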
Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), but wall-clock time for all four is approximately equivalent to one solo run.
📊 Results
Experiment A: accuracy heatmap
Mean fully-correct rate across 4 seeds, 500 eval samples per condition:
| Inference loops | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|
| 1 | 0.38 ± 0.12 | 0.19 ± 0.09 | 0.09 ± 0.07 | 0.02 ± 0.02 |
| 2 | 0.53 ± 0.17 | 0.50 ± 0.12 | 0.16 ± 0.08 | 0.21 ± 0.16 |
| 4 (train) | 1.00 | 1.00 | 1.00 | 1.00 |
| 8 | 0.98 ± 0.01 | 0.98 ± 0.01 | 0.94 ± 0.03 | 0.86 ± 0.08 |
| 16 | 0.91 ± 0.04 | 0.91 ± 0.05 | 0.75 ± 0.10 | 0.56 ± 0.16 |
| 32 | 0.62 ± 0.12 | 0.65 ± 0.13 | 0.45 ± 0.13 | 0.26 ± 0.17 |
Observations:
- Peak is exactly at training-time loop count (T=4), 100% across all digit counts.
- One step of inference-time extrapolation (T=8) is near-peak but already shows degradation at d=5 (86%).
- Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.
- Under-looping (T=1, T=2) hurts more at higher digit counts, consistent with depth being needed to chain carries.
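The metric in the table is an exact-match rate: a sample only counts if every generated answer token is right. A minimal sketch of that scoring and of the cross-seed summary follows. The function names are hypothetical, the toy strings are illustrative inputs rather than model outputs, and using the sample standard deviation for the "±" column is an assumption, since the text does not say which spread statistic was used.

```python
from statistics import mean, stdev


def fully_correct_rate(predictions, targets):
    """Fraction of samples whose generated answer matches the target exactly."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def summarize(per_seed_rates):
    """Mean and spread across seeds (assumed: sample standard deviation)."""
    return mean(per_seed_rates), stdev(per_seed_rates)


# Toy inputs: reversed answers with the '$' end marker, one digit wrong in the second.
preds = ["521$", "99$", "0011$"]
golds = ["521$", "98$", "0011$"]
print(fully_correct_rate(preds, golds))

# Four seeds that all score 1.0 give a spread of 0, matching the bare "1.00" cells.
print(summarize([1.0, 1.0, 1.0, 1.0]))
```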
Experiment B: fixed-point analysis
Mean cosine similarity between consecutive hidden states cos(h_t, h_{t-1}) over answer positions, averaged across 4 seeds, 200 samples per digit:
| t | d=2 | d=3 | d=4 | d=5 |
|---|---|---|---|---|
| 1 | 0.711 | 0.726 | 0.745 | 0.744 |
| 2 | 0.961 | 0.967 | 0.957 | 0.946 |
| 3 | 0.985 | 0.986 | 0.977 | 0.971 |
| 4 | 0.993 | 0.992 | 0.986 | 0.983 |
| 8 | 0.999 | 0.999 | 0.998 | 0.996 |
| 16 | 0.9995 | 0.9996 | 0.9992 | 0.998 |
| 32 | 0.9995 | 0.9996 | 0.999 | 0.998 |
Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95-0.97 at T=2 (vs. Bee's near-1.0), reaches ~0.98-0.99 by T=4, and is essentially flat (≥ 0.996) from T=8 through T=32.
The contrast with accuracy is telling: the hidden state is nearly static (by cosine similarity) from T=4 onwards, yet accuracy collapses at T=16-32. Two non-exclusive interpretations: (a) overthinking, where late loops drift away from a converged solution; (b) distribution shift, where training used T=4, so T>>4 is simply an out-of-distribution use of the model. Note that cosine similarity ≈ 1 does not prove the hidden state is doing nothing: small logit-relevant deltas may still accumulate.
Digit-count dependence on fixed-point timing is small (d=5 lags d=2 by ~0.01 in cosine sim). "Harder problems take more loops to converge" is not observed here — they converge at the same rate but the converged state is just less accurate at higher digit counts.
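The fixed-point measurement itself is simple. The sketch below computes cos(h_t, h_{t-1}) over a list of per-loop hidden states in plain Python; the actual evaluation presumably does this on tensors, and the names `cosine` and `consecutive_cosines` and the 2-dimensional toy states are illustrative assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consecutive_cosines(states):
    """states[0] is h_0; returns [cos(h_1, h_0), cos(h_2, h_1), ...]."""
    return [cosine(states[t], states[t - 1]) for t in range(1, len(states))]


# Toy trajectory: a large first move, then a small one.
states = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.01]]
sims = consecutive_cosines(states)
delta = [b - a for a, b in zip(states[1], states[2])]

print(sims)
print(delta)
```

The second similarity is within 1e-4 of 1.0 even though the state still moved by 0.01 in one coordinate. That is the caveat raised above in concrete form: a near-1 cosine says the direction barely changed, not that nothing changed.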
Bonus: training dynamics
The most striking thing in the training curves is seed-dependent grokking timing. Four runs of identical hyperparameters:
- seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000
- seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000
- seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from <10% to 99% in 2,000 steps
- seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000
This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of 4 in step count (step 4,000 vs. step 16,000) with nothing changed but the random seed. The largest seed gap (seed 0 vs. seed 1) is ~12,000 steps, roughly 1 hour of wall-clock on this DGX.
If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3" — which would be wrong. The architecture can solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.
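The seed-gap figures can be checked directly from the steps quoted above. This is only arithmetic on numbers already in the text; the dictionary name `grok_step` is made up, and the implied step rate follows from taking "roughly 1 hour" at face value.

```python
# Step at which each seed reaches high accuracy on all digit counts (from the text).
grok_step = {0: 16_000, 1: 4_000, 2: 10_000, 3: 4_000}

earliest = min(grok_step.values())
latest = max(grok_step.values())

ratio = latest / earliest  # slowest seed vs. fastest seed
gap = latest - earliest    # extra steps the slowest seed needed

# If that gap is about one hour, each seed runs at roughly this many steps per second
# while four seeds share the GPU.
implied_steps_per_sec = gap / 3600

print(ratio, gap, round(implied_steps_per_sec, 2))
```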
💡 What this means for the three perspectives
Where my data point lands
My single-DGX small-scale result lands somewhere between Bee and a partial refutation of Saunshi:
- Bee's fixed-point at small T is reproduced. Hidden state effectively stops evolving by T=4 (cosine sim ≥ 0.98) and certainly by T=8 (cosine sim ≥ 0.996).
- Saunshi's depth-extrapolation does NOT reproduce. Inference at T > train_T does not improve accuracy — it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%. The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.
- Huginn's limited-gain finding is consistent in direction, but the effect is stronger at this small scale: extra inference loops give negative ROI rather than diminishing positive ROI.
- New observation: seed-dependent grokking, with a gap of up to 12K steps between seeds. This is an under-emphasized variable in the public looped-transformer discourse: single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may be substantially under- or over-estimating typical behavior.
Reconciliation attempt
Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction — they may be measuring different facets of the same phenomenon at different scales:
- Saunshi: shows loops can work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).
- Huginn: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.
- Bee: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.
These three findings are compatible with a unified picture: loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.
What I'd watch next
- Increase loop count during training (here I used 4) and see if the inference-time scaling extends further
- Try ACT halting more aggressively to see how the model self-regulates loop depth per token
- Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class
🛠️ Technical details
Reproducing this experiment
git clone https://github.com/kyegomez/OpenMythos
cd OpenMythos
pip install -e .
# Data, training, evaluation scripts (this Day 7 folder):
python scripts/train.py --seed 0 --max_steps 30000
python scripts/eval_accuracy.py --seeds 0 1 2 3
python scripts/eval_fixedpoint.py --seeds 0 1 2 3
python scripts/plot.py
The training and evaluation scripts are at https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts.
What went wrong (and was fixed)
- bf16 broke complex RoPE buffer: switched to fp32; fine at 3.4M parameters
- Initial training-time max_loop_iters too small: kept at 4 per Saunshi's recipe; future experiments could vary this
- Greedy generation is slow at high loop counts: each batch repeats n_loops forward passes through the recurrent block; for loops=32 this is non-trivial
Hyperparameter choices: why these
- dim=256, expert_dim=512, 1 prelude / 1 coda layer: smallest config that still exhibits looping behavior; matches Saunshi's scale
- n_experts=4: enough to demonstrate MoE routing without bloating params
- lora_rank=8: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing
- max_seq_len=32: tight bound; d=5 addition fits in ~18 chars
References
- OpenMythos GitHub (Kye Gomez)
- Claude Mythos Preview (Anthropic, 2026-04-07)
- Project Glasswing
- Reasoning with Latent Thoughts (Saunshi et al., ICLR 2025)
- Scaling up Test-Time Compute with Latent Reasoning (Geiping et al., Huginn)
- Testing the OpenMythos Hypothesis (Micheal Bee)
- Parcae — Scaling Laws for Stable Looped Language Models
- Loop, Think, & Generalize (Implicit Reasoning in Recurrent-Depth Transformers)
Tomorrow: Day 8
A follow-up to Day 7, pushing looped thinking one step further into something harder…!


