PEPPERCORN

Posted on May 19

[Day 7] Does Giving an AI More 'Thinking Time' Really Make It Smarter? Training an OpenMythos-Style Mini Model on DGX

#localllm #ai #dgxspark #transformers

Intro

Day 7!

Reddit kept surfacing this new project called OpenMythos in my feed with "12 days to replicate frontier AI, ASI is near" headlines, and I got curious enough to dig in.

Tools used: my home AI machine (DGX Spark) + OpenMythos (PyTorch reconstruction of the rumored Claude Mythos architecture) + synthetic multi-digit addition.

The question: does giving an AI more "thinking time" (= more recurrent loops at inference) actually make it smarter?

Today's setup

The hype

On 2026-04-07, Anthropic announced Claude Mythos. Press coverage highlights zero-day discovery capabilities — reportedly 271 zero-days in Firefox and a 27-year-old bug in OpenBSD — but the model's architecture and weights remain unreleased. Anthropic kept Mythos itself behind a limited-access coalition (Project Glasswing — AWS, Apple, Microsoft, Google, CrowdStrike, Palo Alto, ~40 organizations) rather than releasing it publicly.

Twelve days later, Kye Gomez (Swarms) released OpenMythos, a PyTorch reconstruction of the suspected architecture. The repo is explicit upfront:

"an independent, community-driven theoretical reconstruction based solely on publicly available research and speculation. It is not affiliated with, endorsed by, or connected to Anthropic"

So OpenMythos is not Mythos. It's a hypothesis-in-code: a Recurrent-Depth Transformer (RDT) with MoE FFNs and MLA/GQA attention, capable of being trained from scratch on standard text data. No leaked weights, no distillation.

Reddit's "ASI is near" framing skips this critical distinction. The interesting question, once you set the hype aside, is whether the architectural idea — recurrent depth — actually works.

Note for this article: OpenMythos is not Claude Mythos — it's a theoretical reconstruction inspired by looped-transformer research. The experiments below are not "Claude Mythos capability tests" but rather "how does a looped / recurrent-depth structure behave on a small synthetic task."

Three perspectives on looped transformers

Browsing the literature, I found three different studies giving different pictures of how looped transformers behave:

Source	Scale	Claim
Saunshi et al. 2025 (ICLR, research paper)	tens of M params, synthetic	Loops work: k layers looped L times approximately matches kL-layer fixed-depth, on addition / p-hop induction / math
Geiping et al. 2025 (Huginn, research paper)	3.5B params, 800B tokens	Task-dependent: at scale on natural-language benchmarks, gains can be marginal (T=4 → T=32 only +1.82 points on GSM8K), though effects vary by task and compute regime
Micheal Bee 2026-04 (Medium, independent experiment blog)	17M params, 12 GPU-hours on RTX 5070 Ti	Loops plateau at T=2 in this small-scale setup: hidden state reaches a fixed-point that subsequent iterations cannot escape

Theory, large-scale empirics, and an independent solo replication give different pictures. I wanted to add a fourth data point from my own DGX Spark on a clean, controlled task — multi-digit addition.

What I'd hoped to see

Does training-time accuracy phase-transition (grok) at some step? (Saunshi 3-stage prediction)
Does test-time loop count matter? At what point does it stop helping?
Does the hidden state actually keep evolving across loops, or does it hit a fixed-point early? (the Bee question)

Headline finding

Loops help, but only within a narrow window centered on the training loop count. With training-time max_loop_iters=4, accuracy peaks at exactly T=4 (100% across all digit counts) and decays in both directions — fewer loops underthink, more loops overthink.
Bee's "T=2 fixed-point" reproduced in spirit. Cosine similarity between consecutive hidden states jumps from ~0.72 to ~0.95 at T=2, then climbs slowly to ~0.99 by T=4 and stays flat through T=32.
Striking per-seed grokking variance. Same hyperparameters, four seeds: seeds 1 and 3 solve 5-digit addition by step 4,000; seed 2 takes 10,000; seed 0 stalls at <10% until step 16,000, then jumps to 100%.
No depth extrapolation in this setup. Saunshi's claim that training at T=4 should generalize to deeper T at inference does not reproduce here — instead, T>4 hurts.

🌀 What is a "looped" transformer?

A standard transformer (GPT-4, Llama, most local LLMs) routes input tokens through a stack of distinct layers, each used exactly once per forward pass. To make it "think deeper," you stack more layers — increasing parameter count.

A looped transformer reuses the same parameters across multiple iterations. The model has a Prelude → Recurrent Block × T → Coda structure: a few standard layers up front, then one block iterated T times with input injection at every step, then a few more standard layers.

Input tokens
   ↓
[Prelude P]          — standard layers, run once
   ↓
[Recurrent Block R]  — one block looped T times
   ↑_______↓          h_{t+1} = A·h_t + B·e + Transformer(h_t, e)
   ↓
[Coda C]             — standard layers, run once
   ↓
Output logits

At each loop iteration t, the hidden state updates via the LTI injection rule, and the encoded input e (Prelude output) is re-injected to keep the original signal alive across arbitrary depth. The injection parameters are constrained so that spectral radius ρ(A) < 1, which prevents divergence over many loops (Parcae stability framework).
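To make the stability constraint concrete, here is a minimal sketch of only the linear part of the update, h_{t+1} = A·h_t + B·e. It drops the Transformer(h_t, e) term, treats A and B as diagonal, and starts from h_0 = 0. All three are simplifying assumptions of mine, not how OpenMythos parameterizes things, and `lti_step` / `run_loops` are hypothetical names. With a diagonal A the spectral radius is just the largest |a_i|, so ρ(A) < 1 means every coordinate contracts toward the fixed point b_i·e_i / (1 - a_i).

```python
def lti_step(h, e, a, b):
    """One linear injection step with diagonal A and B: h' = a*h + b*e (elementwise)."""
    return [ai * hi + bi * ei for hi, ei, ai, bi in zip(h, e, a, b)]


def run_loops(e, a, b, n_loops):
    """Iterate the injection rule from h_0 = 0 and return every intermediate state."""
    h = [0.0] * len(e)
    states = []
    for _ in range(n_loops):
        h = lti_step(h, e, a, b)
        states.append(h)
    return states


e = [1.0, -2.0, 0.5]   # stand-in for the Prelude output (made-up values)
a = [0.5, 0.8, 0.2]    # diagonal of A; spectral radius = max(|a_i|) = 0.8 < 1
b = [1.0, 1.0, 1.0]    # diagonal of B

states = run_loops(e, a, b, n_loops=64)
fixed_point = [bi * ei / (1.0 - ai) for ei, ai, bi in zip(e, a, b)]

# With rho(A) < 1 the state settles onto the fixed point.
max_gap = max(abs(x - y) for x, y in zip(states[-1], fixed_point))
print(f"gap to fixed point after 64 loops: {max_gap:.2e}")
```

Setting any entry of `a` to 1.0 or above makes that coordinate grow without bound as the loop count increases, which is the divergence the constraint is there to rule out.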

The key claim: more loops at inference = deeper reasoning, without adding parameters. This is conceptually analogous to chain-of-thought scaling — except the "thinking" happens in continuous latent space rather than discrete token space.

🔧 Experimental setup

I trained a deliberately tiny OpenMythos variant on multi-digit addition. The model is small enough to run 4 seeds in parallel on a single GPU but large enough to exhibit the looped-transformer phenomena.

OpenMythos tiny (3.4M params)
  ↓
Train 4 seeds in parallel, 30k steps each, fp32 on DGX Spark (GB10)
  ↓
Experiment A: greedy autoregressive accuracy
              loops ∈ {1, 2, 4, 8, 16, 32}  ×  digits ∈ {2, 3, 4, 5}
  ↓
Experiment B: cosine similarity between consecutive hidden states
              ⇒ does the recurrent block reach a fixed-point?
  ↓
Compare against Saunshi / Huginn / Bee

Model config

MythosConfig(
    vocab_size=16,         # digits 0-9 + '+', '=', pad, eos
    dim=256,
    n_heads=8,
    n_kv_heads=2,          # GQA
    max_seq_len=32,
    max_loop_iters=4,      # training depth; inference varies
    prelude_layers=1,
    coda_layers=1,
    attn_type="gqa",
    n_experts=4,           # MoE FFN inside recurrent block
    n_shared_experts=1,
    n_experts_per_tok=2,
    expert_dim=512,
    lora_rank=8,           # depth-wise LoRA per loop step
)

Total parameters: 3,386,658 (~3.4M).

Data

On-the-fly synthetic addition. Operands are uniformly sampled from [10^(d-1), 10^d - 1] for digit count d ∈ {2, 3, 4, 5}. Sequence format "A+B=R$", where R = str(A+B)[::-1] (reverse-order answer, following Saunshi's convention so left-to-right autoregressive generation can carry digits naturally).

Loss is applied only at positions following the = token (i.e., on the answer tokens).
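For reference, the data format and the answer-only loss mask can be sketched in a few lines of plain Python. This is an illustration of the description above, not the actual script from the repo; `make_example` and `loss_mask` are hypothetical names, and the mask here is indexed over target characters (aligning it with shifted inputs is left to the training loop).

```python
import random


def make_example(d, rng):
    """Build one "A+B=R$" string with d-digit operands and a reversed sum."""
    lo, hi = 10 ** (d - 1), 10 ** d - 1
    a, b = rng.randint(lo, hi), rng.randint(lo, hi)
    return f"{a}+{b}={str(a + b)[::-1]}$"


def loss_mask(seq):
    """1 for positions after '=' (answer digits and the '$' terminator), else 0."""
    eq = seq.index("=")
    return [1 if i > eq else 0 for i in range(len(seq))]


rng = random.Random(0)
for d in (2, 3, 4, 5):
    s = make_example(d, rng)
    print(d, s, loss_mask(s))
```

Reversing the sum means the least-significant digit comes first, so each generated digit only depends on operand digits and a carry that has already been decided.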

Training

Optimizer: AdamW, betas (0.9, 0.95), wd 0.1
LR: max 3e-4, warmup 2000 steps, cosine decay to 1e-5
Grad clip: 1.0
Batch size: 128
Max steps: 30000
dtype: fp32
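The learning-rate schedule above can be written as a small function. This is a sketch under two assumptions the list does not spell out: the warmup is linear, and the cosine decay runs from the end of warmup to max steps. `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=1e-5, warmup=2000, max_steps=30000):
    """Linear warmup to max_lr, then cosine decay to min_lr at max_steps."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (max_steps - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 1999, 2000, 16000, 30000):
    print(step, f"{lr_at(step):.2e}")
```

A function like this can be plugged into a per-step scheduler by dividing its output by the optimizer's base learning rate.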

Initially I tried bf16 to use the GB10 efficiently, but OpenMythos stores RoPE frequencies as complex64 buffers, and model.to(torch.bfloat16) silently drops the imaginary part, breaking attention. For a 3.4M-param model on 128 GB of unified memory, fp32 is fine: the bottleneck is not memory but parallel scheduling.

Four seeds {0, 1, 2, 3} run in parallel on the same GPU. Per-seed throughput drops to ~12K tok/s (vs ~50K solo), so aggregate throughput stays at roughly 48K tok/s and the parallel batch finishes in about the time four back-to-back solo runs would take.

📊 Results

Experiment A: accuracy heatmap

Mean fully-correct rate across 4 seeds, 500 eval samples per condition:

Inference loops	d=2	d=3	d=4	d=5
1	0.38 ± 0.12	0.19 ± 0.09	0.09 ± 0.07	0.02 ± 0.02
2	0.53 ± 0.17	0.50 ± 0.12	0.16 ± 0.08	0.21 ± 0.16
4 (train)	1.00	1.00	1.00	1.00
8	0.98 ± 0.01	0.98 ± 0.01	0.94 ± 0.03	0.86 ± 0.08
16	0.91 ± 0.04	0.91 ± 0.05	0.75 ± 0.10	0.56 ± 0.16
32	0.62 ± 0.12	0.65 ± 0.13	0.45 ± 0.13	0.26 ± 0.17

Observations:

Peak is exactly at training-time loop count (T=4), 100% across all digit counts.
One doubling past the training depth (T=8) is near-peak but already shows degradation at d=5 (86%).
Beyond T=8, accuracy collapses monotonically. At T=32, even 2-digit addition drops to 62%.
Under-looping (T=1, T=2) hurts more at higher digit counts, consistent with depth being needed to chain carries.
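For clarity on what each cell in the table means, here is a sketch of the metric: a sample counts only if the whole generated answer matches, and each cell is a mean ± standard deviation over seeds. The function names are hypothetical, the strings and rates below are made up for illustration, and I use the sample standard deviation here as an assumption.

```python
from statistics import mean, stdev


def exact_match_rate(predictions, targets):
    """Fraction of samples whose whole generated answer equals the target string."""
    hits = sum(p == t for p, t in zip(predictions, targets))
    return hits / len(targets)


def summarize(per_seed_rates):
    """Mean and sample standard deviation over seeds, as in 'mean ± std'."""
    return mean(per_seed_rates), stdev(per_seed_rates)


# Made-up toy strings, only to show the metric: one wrong digit fails the sample.
preds = ["64$", "321$", "0011$"]
targets = ["64$", "321$", "0012$"]
print(exact_match_rate(preds, targets))

# Hypothetical per-seed rates for one (loops, digits) cell, not the measured ones.
m, s = summarize([0.25, 0.5, 0.5, 0.75])
print(f"{m:.2f} ± {s:.2f}")
```

Exact match is a strict metric: longer answers have more chances for a single wrong digit, which is one reason the d=5 column degrades fastest.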

Experiment B: fixed-point analysis

Mean cosine similarity between consecutive hidden states cos(h_t, h_{t-1}) over answer positions, averaged across 4 seeds, 200 samples per digit:

t	d=2	d=3	d=4	d=5
1	0.711	0.726	0.745	0.744
2	0.961	0.967	0.957	0.946
3	0.985	0.986	0.977	0.971
4	0.993	0.992	0.986	0.983
8	0.999	0.999	0.998	0.996
16	0.9995	0.9996	0.9992	0.998
32	0.9995	0.9996	0.999	0.998

Bee's T=2 fixed-point claim is reproduced in spirit but not literally: cosine similarity jumps to ~0.95 at T=2 (vs. Bee's near-1.0), then asymptotes to ~0.99 by T=4 and stays flat through T=32.

The contrast with accuracy is telling: the hidden state is effectively static (by cosine similarity) from T=4 onwards, yet accuracy collapses at T=16-32. Two non-exclusive interpretations: (a) overthinking, where late loops drift away from a converged solution; (b) distribution shift, where training used T=4, so T>>4 is simply an out-of-distribution use of the model. Note that cosine similarity ≈ 1 doesn't prove the hidden state is doing nothing: small logit-relevant deltas may still accumulate.

Fixed-point timing depends only weakly on digit count (d=5 lags d=2 by ~0.01 in cosine sim). "Harder problems take more loops to converge" is not observed here: all digit counts converge at the same rate, but the converged state is just less accurate at higher digit counts.
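The metric itself is simple, and a toy example shows why a consecutive cosine near 1 can still hide accumulated drift. The sketch below is not the evaluation script and uses no measured hidden states: it builds an artificial 2-D trajectory that rotates by the same tiny angle every loop (a worst case where every small step points the same way), with the angle chosen so each consecutive cosine is 0.9995. Function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def consecutive_cosines(states):
    """cos(h_t, h_{t-1}) for every adjacent pair in a list of hidden states."""
    return [cosine(states[t], states[t - 1]) for t in range(1, len(states))]


# Toy trajectory: a unit vector rotated by the same small angle every loop,
# covering loops 4 through 32.
theta = math.acos(0.9995)
states = [[math.cos(t * theta), math.sin(t * theta)] for t in range(4, 33)]

per_step = consecutive_cosines(states)
drift = cosine(states[0], states[-1])  # state at loop 4 vs state at loop 32
print(f"min consecutive cosine: {min(per_step):.4f}")
print(f"cos(h_4, h_32): {drift:.3f}")
```

In this toy case every step looks almost static, yet the loop-32 state has a cosine of only about 0.63 with the loop-4 state. Real hidden states need not drift this coherently, so a useful extra measurement would be cos(h_T, h_4) directly.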

Bonus: training dynamics

The most striking thing in the training curves is seed-dependent grokking timing. Four runs of identical hyperparameters:

seed 1: loss → 0 by step 3,000, all digits ≥88% by step 4,000
seed 3: loss → 0 by step 4,000, all digits ≥87% by step 4,000
seed 2: stuck at loss ~0.35 plateau until step 8,000, then collapses to 0 by step 10,000; d=4/5 jump from <10% to 99% in 2,000 steps
seed 0: stuck at loss ~0.30 plateau until step 15,000, then collapses; d=4 groks at step 12,000-14,000, d=5 groks at step 16,000

This is textbook Saunshi-style three-stage grokking (memorization → in-distribution → systematic), with the third-stage trigger varying by a factor of 4x in step count depending only on the random seed. The largest seed gap (seed 0 vs. seed 1) is ~12,000 steps, roughly 1 hour of wall-clock on this DGX.
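The "4x" and "~12,000 steps" figures follow directly from the per-seed milestones in the list above. Here is a small sketch of how I would pull a grokking step out of an eval log and compare seeds; `first_step_reaching` is a hypothetical helper, the toy curve is invented, and only the milestone dictionary uses numbers quoted in this post.

```python
def first_step_reaching(curve, threshold):
    """Return the first logged step whose accuracy is at or above threshold, else None."""
    for step, acc in curve:
        if acc >= threshold:
            return step
    return None


# Hypothetical eval log for one seed: (step, d=5 accuracy). Not the measured curve.
toy_curve = [(2000, 0.01), (4000, 0.03), (8000, 0.08), (10000, 0.99), (12000, 1.0)]
print(first_step_reaching(toy_curve, 0.9))

# Approximate d=5 milestones as quoted in the list above.
milestones = {"seed 0": 16000, "seed 1": 4000, "seed 2": 10000, "seed 3": 4000}
earliest, latest = min(milestones.values()), max(milestones.values())
print(f"ratio: {latest / earliest:.0f}x, gap: {latest - earliest} steps")
```

The threshold is a judgment call; with coarse eval intervals the detected step is only accurate to within one logging interval.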

If you trained a single seed and stopped early, you might conclude "OpenMythos can't generalize beyond d=3" — which would be wrong. The architecture can solve all 4 digit buckets; some random seeds just need much longer to find the systematic-generalization solution.

💡 What this means for the three perspectives

Where my data point lands

My single-DGX small-scale result largely agrees with Bee and partially contradicts Saunshi:

Bee's fixed-point at small T is reproduced. Hidden state effectively stops evolving by T=4 (cosine sim ≥ 0.99) and certainly by T=8.
Saunshi's depth-extrapolation does NOT reproduce. Inference at T > train_T does not improve accuracy — it harms it. T=8 is already at 86% on d=5 (vs. 100% at T=4), and T=32 collapses to 26%. The "train at depth k, infer at depth k·L" recipe assumes the recurrent block has learned to keep refining; in my setup it has not.
Huginn's limited-gain finding is consistent at small scale, and the effect is stronger here: extra inference loops give negative ROI rather than diminishing positive ROI.
New observation: seed-dependent grokking with up to 12K-step variance. This is an under-emphasized variable in the public looped-transformer discourse — single-seed studies (Bee's solo replication, individual rows in Saunshi's tables) may be substantially under- or over-estimating typical behavior.

Reconciliation attempt

Theory (Saunshi), large-scale empirics (Huginn), and independent replication (Bee) may not actually be in contradiction — they may be measuring different facets of the same phenomenon at different scales:

Saunshi: shows loops can work on the right kind of problem (algorithmic, depth-bounded reasoning) at the right kind of scale (small synthetic).
Huginn: shows that loops trained at 3.5B / 800B token scale on natural-language data give only marginal gains on a benchmark (GSM8K) that already favors CoT.
Bee: shows that within a particular small-scale training recipe, the recurrent block's hidden state stops evolving very early in inference.

These three findings are compatible with a unified picture: loops carry compute, but only up to a depth bounded by the task's algorithmic complexity and the model's expressive capacity. Beyond that depth, the hidden state stops moving meaningfully, and additional loops are computation without information.

What I'd watch next

Increase loop count during training (here I used 4) and see if the inference-time scaling extends further
Try ACT halting more aggressively to see how the model self-regulates loop depth per token
Add task heterogeneity (mix p-hop induction or parity) to test whether the fixed-point timing varies by problem class

🛠️ Technical details

Reproducing this experiment

git clone https://github.com/kyegomez/OpenMythos
cd OpenMythos
pip install -e .

# Data, training, evaluation scripts (these live in this Day 7 folder, so run them from there):
python scripts/train.py --seed 0 --max_steps 30000
python scripts/eval_accuracy.py --seeds 0 1 2 3
python scripts/eval_fixedpoint.py --seeds 0 1 2 3
python scripts/plot.py

The training and evaluation scripts are at https://github.com/SAETAG/dgx-100-experiments/tree/main/days/day07-openmythos-loop-debate/scripts.

What went wrong (and was fixed)

bf16 broke complex RoPE buffer: switched to fp32; fine at 3.4M parameters
Training-time max_loop_iters possibly too small: kept at 4 per Saunshi's recipe (not changed in this run); future experiments could vary this
Greedy generation is slow at high loop counts: each batch repeats n_loops forward passes through the recurrent block; for loops=32 this is non-trivial
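A back-of-envelope count shows why the high-loop evaluations drag. This sketch assumes one full forward pass per generated token with no early exit from the loop, which is my simplification; `recurrent_block_calls` is a hypothetical helper.

```python
def recurrent_block_calls(answer_tokens, n_loops):
    """Recurrent-block applications needed to greedily decode one answer."""
    return answer_tokens * n_loops


# A d=5 sum has at most 6 digits, plus the '$' terminator.
answer_tokens = 6 + 1
for n_loops in (1, 2, 4, 8, 16, 32):
    print(n_loops, recurrent_block_calls(answer_tokens, n_loops))
```

Under these assumptions the cost grows linearly in the loop count, so loops=32 needs eight times the recurrent-block work of the training depth for the same answer.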

Hyperparameter choices: why these

dim=256, expert_dim=512, 1 prelude / 1 coda layer: smallest config that still exhibits looping behavior; matches Saunshi's scale
n_experts=4: enough to demonstrate MoE routing without bloating params
lora_rank=8: depth-wise LoRA lets each loop iteration adapt slightly without breaking weight-sharing
max_seq_len=32: tight bound; a d=5 example (two 5-digit operands, '+', '=', a 5- or 6-digit sum, '$') is 18 to 19 chars

Tomorrow: Day 8

A follow-up to Day 7, pushing looped thinking one step further into something harder…!

#100ExperimentsWithDGX #LocalLLM
