Disclaimer: OpenMythos is a community-driven theoretical reconstruction. It is not affiliated with or endorsed by Anthropic. All claims about Claude Mythos's architecture are speculative hypotheses backed by publicly available research.
What Is OpenMythos?
On April 21, 2026, Kye Gomez - founder of Swarms AI - published OpenMythos to GitHub. The project is a fully open-source PyTorch reconstruction of the hypothesized architecture behind Anthropic's Claude Mythos model.
The thesis: Claude Mythos achieves its extraordinary reasoning not by stacking hundreds of unique transformer layers, but by looping a compact set of layers multiple times, performing continuous "latent chain-of-thought" reasoning in hidden state space before ever emitting a single output token.
This idea - a Recurrent-Depth Transformer (RDT) - is grounded in a growing body of 2024–2025 academic research from ICLR, DeepSeek, and multiple independent labs. The architecture combines:
- A three-stage Prelude → Loop → Coda pipeline
- Spectral-radius-constrained hidden state updates (from Parcae architecture)
- Adaptive Computation Time (ACT) halting for per-token variable compute
- Fine-grained Mixture of Experts (MoE) with DeepSeek-V3-style bias-based load balancing
- Multi-Latent Attention (MLA) for 10–20× KV cache reduction
- Depth-wise LoRA adapters for cheap per-loop specialization
Per Blockchain.news, early training runs show 2.67× faster validation steps compared to a baseline dense transformer at the same parameter count.
The Central Hypothesis
The key architectural claim: a 770M-parameter Recurrent-Depth Transformer can match the effective capacity of a standard 1.3B dense transformer, because every parameter is reused N times across loop iterations.
Effective Compute ≈ Parameters × Loop Iterations
vs.
Dense Transformer Effective Compute ≈ Parameters × 1
This means the model can:
- Scale reasoning depth at inference without retraining (run more loops for harder problems)
- Generalize to more loops than it was trained on (depth extrapolation via LoRA clamping)
- Run entirely in continuous latent space - no chain-of-thought token emission required
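To make the arithmetic concrete, here is a minimal sketch of the effective-compute comparison. The 770M and 1.3B figures come from the claim above; the loop count of 16 is illustrative, and the ratio measures per-token compute budget, not literal capacity:

```python
def effective_compute(params: int, loop_iterations: int = 1) -> int:
    """Effective compute ≈ parameters × loop iterations (a dense model loops once)."""
    return params * loop_iterations

rdt = effective_compute(770_000_000, loop_iterations=16)  # looped RDT
dense = effective_compute(1_300_000_000)                  # dense baseline
print(rdt / dense)  # ≈ 9.5: the RDT spends more FLOPs per token at equal params
```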
"A 770M-parameter RDT matches a 1.3B dense model" - MarkTechPost, April 2026
Architecture Overview
The model follows a strict three-stage pipeline:
File: open_mythos/main.py:899–1086
The key insight: Prelude and Coda execute once (fixed compute). The Recurrent Block holds all the reasoning capacity and runs T times. The frozen encoding e is injected at every loop step, preventing the model from "forgetting" the input.
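The three-stage pipeline can be sketched in a few lines of PyTorch. This is an illustrative toy, not the repo's API - module names, the linear layers, and the tanh nonlinearity are all placeholders - but it shows the fixed Prelude/Coda compute and the re-injection of `e` at every loop step:

```python
import torch
import torch.nn as nn

class RecurrentDepthSketch(nn.Module):
    """Illustrative Prelude → Loop → Coda pipeline (not the OpenMythos API)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.prelude = nn.Linear(dim, dim)     # runs once: embed input into e
        self.loop_block = nn.Linear(dim, dim)  # shared weights, run T times
        self.coda = nn.Linear(dim, dim)        # runs once: project to outputs

    def forward(self, x: torch.Tensor, n_loops: int = 8) -> torch.Tensor:
        e = self.prelude(x)        # frozen encoding of the input
        h = torch.zeros_like(e)    # latent reasoning state
        for _ in range(n_loops):
            # e is re-injected every step so the loop never "forgets" the input
            h = torch.tanh(self.loop_block(h) + e)
        return self.coda(h)

model = RecurrentDepthSketch()
out = model(torch.randn(2, 10, 256), n_loops=16)
print(out.shape)  # torch.Size([2, 10, 256])
```

Note that `n_loops` is a forward-pass argument, not an architectural constant - this is what makes inference-time depth scaling possible.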
Dissection: Six Novel Mechanisms
4.1 LTI-Stable Injection - The Heartbeat
File: open_mythos/main.py:684–743
The most critical and least obvious component. Without it, looped transformers diverge.
```python
class LTIInjection(nn.Module):
    """Linear Time-Invariant injection with spectral radius < 1 by construction."""

    def get_A(self) -> torch.Tensor:
        # A_continuous = -exp(log_A) → always negative diagonal
        # A_discrete = exp(Δt × A_continuous) → always in (0, 1)
        return torch.exp(
            -torch.exp((self.log_dt + self.log_A).clamp(-20, 20))
        )

    def forward(self, h, e, transformer_out):
        A = self.get_A()  # spectral radius guaranteed < 1
        return A * h + self.B * e + transformer_out
```
The update rule:
h_{t+1} = A · h_t + B · e + Transformer(h_t, e)
Where ρ(A) < 1 is guaranteed by parameterization - not enforced by regularization.
Why this matters:
| A parameterization | What happens |
|---|---|
| Unconstrained | ρ(A) ≥ 1 possible → hidden state explodes after N loops |
| Soft regularization | Sometimes works, often diverges at high LR |
| LTI with ZOH | ρ(A) < 1 always → stable at any depth |
The implementation uses zero-order-hold (ZOH) discretization: a continuous-time negative diagonal matrix A_c = -exp(log_A) is mapped to discrete time via exp(Δt · A_c), which always lands in (0, 1). This is borrowed from state-space models (Gu et al., 2021 - S4).
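The guarantee is easy to check numerically: under the ZOH parameterization, every diagonal entry of the discrete A lands strictly inside (0, 1) across a wide range of raw parameter values. A standalone check (not the repo's code):

```python
import torch

def zoh_discretize(log_A: torch.Tensor, log_dt: torch.Tensor) -> torch.Tensor:
    """A_c = -exp(log_A) is always negative; A_d = exp(Δt · A_c) is always in (0, 1)."""
    return torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))

# Sweep the raw parameter over a wide range; the discrete A stays in (0, 1)
log_A = torch.linspace(-5.0, 3.0, 512)
log_dt = torch.zeros(512)
A = zoh_discretize(log_A, log_dt)
print(bool(((A > 0) & (A < 1)).all()))  # True: spectral radius < 1 by construction
```

Because A is diagonal, its spectral radius is just the largest entry, so no regularizer or projection step is ever needed.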
Every divergent training run in the Parcae architecture paper had ρ(A) ≥ 1. Every convergent run had ρ(A) < 1.
4.2 ACT Halting - Variable Compute per Token
Files: open_mythos/main.py:750–781 (halting unit), open_mythos/main.py:865–889 (integration in RecurrentBlock)
```python
class ACTHalting(nn.Module):
    """Per-position adaptive computation time."""

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.halt(h)).squeeze(-1)
```
In the loop:
```python
# Remainder trick: assign leftover probability at threshold crossing
remainder = 1.0 - cumulative_halt
crossed = (cumulative_halt + p) >= self.act_threshold
weight = torch.where(crossed, remainder, p)

# Accumulate weighted hidden state
h_out += weight.unsqueeze(-1) * h
cumulative_halt += weight
still_running = ~crossed
```
What this achieves:
```
"2 + 2"               → halts at loop 1
"The cat sat."        → halts at loop 3  (trivial, no reasoning needed)
"Multi-step logic..." → halts at loop 12
"Prove P ≠ NP."       → halts at loop 16 (maximum compute allocated)
```
Per ICLR 2025 research on recurrent-depth architectures, looped updates exhibit a rapid norm decay pattern: early iterations make large hidden-state changes, late iterations make tiny orthogonal adjustments. ACT exploits this by halting when updates become negligible.
Throughput impact: 2–3× improvement in inference throughput (easy tokens exit early, expensive compute is allocated to hard tokens only).
The critical bug fixed in OpenMythos v0.4.0: halted positions must be gated from weight accumulation. Once a position halts, its h must not be included in gradient updates - a subtle but catastrophic error if missed.
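Putting the halting unit, the remainder trick, and that gating fix together, a minimal self-contained ACT loop looks like the following. This is a sketch of the mechanism described above, not the repo's `RecurrentBlock`; `step_fn` and `halt_fn` stand in for the transformer step and the halting head:

```python
import torch

def act_loop(h, step_fn, halt_fn, max_loops=16, threshold=0.99):
    """Weighted-average ACT: each position contributes p_t · h_t until it halts."""
    B, T, _ = h.shape
    cumulative = torch.zeros(B, T)
    h_out = torch.zeros_like(h)
    still_running = torch.ones(B, T, dtype=torch.bool)
    for _ in range(max_loops):
        h = step_fn(h)
        # Halted positions must contribute nothing from now on (the v0.4.0 fix)
        p = halt_fn(h) * still_running
        crossed = (cumulative + p) >= threshold
        # At the crossing step, assign the leftover probability mass instead of p
        weight = torch.where(crossed & still_running, 1.0 - cumulative, p)
        h_out = h_out + weight.unsqueeze(-1) * h
        cumulative = cumulative + weight
        still_running = still_running & ~crossed
        if not still_running.any():
            break  # every position has halted; skip the remaining loops
    return h_out, cumulative

h = torch.randn(2, 4, 8)
out, cum = act_loop(h, step_fn=torch.tanh,
                    halt_fn=lambda x: torch.full(x.shape[:2], 0.3))
print(torch.allclose(cum, torch.ones_like(cum)))  # True: weights sum to 1 per position
```

The per-position weights always sum to exactly 1, so `h_out` is a proper convex combination of the loop states - the property the v0.4.0 gating fix restores.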
4.3 Loop-Index RoPE - Teaching Shared Weights Two Jobs
File: open_mythos/main.py:541–571
```python
def loop_index_embedding(
    h: torch.Tensor, loop_t: int, loop_dim: int, theta: float = 10000.0
) -> torch.Tensor:
    """Inject sinusoidal depth-position signal into hidden state."""
    freqs = 1.0 / (theta ** (torch.arange(0, loop_dim, 2, device=h.device) / loop_dim))
    angles = loop_t * freqs
    emb = torch.cat([angles.sin(), angles.cos()], dim=-1)[:loop_dim]
    emb_full = torch.zeros(h.shape[-1], device=h.device, dtype=h.dtype)
    emb_full[:loop_dim] = emb
    return h + emb_full.unsqueeze(0).unsqueeze(0)
```
The problem it solves: With pure weight sharing, the model runs identical computation at loop 1 and loop 16 - no mechanism to differentiate "early encoding" from "late refinement."
The solution: Inject a sinusoidal signal keyed to the loop index t before every iteration, similar to how RoPE encodes sequence position. Now the shared weights can learn functionally distinct behaviors per depth - not via separate parameters, but via different activations conditioned on the loop signal.
This is analogous to the RingFormer architecture (Heo et al., Feb 2025) which uses low-rank "level signals" for the same purpose.
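A quick self-contained check that the signal actually differentiates depths (the RoPE-style base `theta = 10000` follows the usual convention; `loop_dim` is illustrative):

```python
import torch

def loop_signal(loop_t: int, loop_dim: int = 32, theta: float = 10000.0) -> torch.Tensor:
    """Sinusoidal depth signal for loop index t (same recipe as RoPE positions)."""
    freqs = 1.0 / (theta ** (torch.arange(0, loop_dim, 2).float() / loop_dim))
    angles = loop_t * freqs
    return torch.cat([angles.sin(), angles.cos()], dim=-1)[:loop_dim]

# Distinct loop indices produce distinct activations even with fully shared weights
print(torch.allclose(loop_signal(1), loop_signal(16)))  # False
```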
4.4 Depth-Wise LoRA - Cheap Specialization at Scale
File: open_mythos/main.py:578–620
```python
class LoRAAdapter(nn.Module):
    """Per-loop scale LoRA: shared A/B matrices, learned scale per loop index."""

    def forward(self, x: torch.Tensor, loop_t: int) -> torch.Tensor:
        t_idx = min(loop_t, self.scale.num_embeddings - 1)  # clamp for depth extrapolation
        s = self.scale(torch.tensor(t_idx, device=x.device))  # (rank,) learned per-loop scale
        down = self.down(x) * s  # (B, T, rank)
        return down @ self.B     # (B, T, dim)
```
Parameter cost analysis:
| Approach | Parameters per loop |
|---|---|
| Fully distinct weights | dim × dim (hundreds of millions) |
| Pure weight sharing | 0 (least expressive) |
| LoRA adapter | rank × dim × 2 + rank × max_loops (thousands) |
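The parameter-cost comparison is easy to verify for a concrete configuration (the dim, rank, and loop-count values here are illustrative, not the repo's):

```python
dim, rank, max_loops = 4096, 16, 16

distinct = dim * dim                      # one full matrix per loop iteration
lora = rank * dim * 2 + rank * max_loops  # shared A/B plus per-loop scales, total

print(distinct)          # 16777216 per loop
print(lora)              # 131328 across all loops combined
print(distinct // lora)  # ~127× cheaper than a single loop's distinct matrix
```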
The clamp operation (min(loop_t, max_t)) enables depth extrapolation: train on 16 loops, run inference with 32. Loops 17–32 reuse the scale learned for loop 16. Quality improves with loop count along a saturating curve before plateauing.
This is validated by the MoDr paper (OpenReview) - "Mixture-of-Depth-Recurrent Transformers" - which shows LoRA-based depth adaptation enables reliable out-of-distribution loop generalization.
4.5 Fine-Grained MoE with Bias-Based Load Balancing
File: open_mythos/main.py:426–534
```python
class MoEFFN(nn.Module):
    """DeepSeek-style: fine-grained routed experts + always-on shared experts."""

    def forward(self, x):
        logits = self.router(x)                 # (B, T, n_experts)
        scores = F.softmax(logits, dim=-1)      # gate weights (gradient flows here)
        biased_log = logits + self.router_bias  # bias shifted (no gradient)
        topk_idx = biased_log.topk(self.topk, dim=-1).indices
        topk_scores = scores.gather(-1, topk_idx)
        topk_scores = topk_scores / topk_scores.sum(-1, keepdim=True)  # renormalize

        # Dispatch tokens to selected experts
        out = self._dispatch(x, topk_idx, topk_scores)

        # Always-on shared experts
        for expert in self.shared_experts:
            out = out + expert(x)
        return out
```
The load-balancing trick (DeepSeek-V3 style):
Standard auxiliary-loss balancing adds a penalty term to the training objective - but this introduces competing gradients and a tricky hyperparameter. OpenMythos uses bias-based routing instead:
Per arxiv:2408.15664 (Auxiliary-Loss-Free Load Balancing):
- Biases are updated externally: overloaded experts get their bias decreased, underloaded ones increased
- No gradient interference with the task objective
- Zero token dropping during training and inference
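The external update is a single line per training step. A minimal sketch (the update rate and the sign-of-error rule follow arxiv:2408.15664; `torch.no_grad` keeps the update out of the backward pass entirely):

```python
import torch

@torch.no_grad()  # the bias update must never leak into task gradients
def update_router_bias(router_bias, tokens_per_expert, update_rate=0.001):
    """Auxiliary-loss-free balancing: decrease the bias of overloaded experts,
    increase it for underloaded ones (sign of the load error)."""
    error = tokens_per_expert.float() - tokens_per_expert.float().mean()
    router_bias -= update_rate * torch.sign(error)
    return router_bias

bias = torch.zeros(8)
load = torch.tensor([400, 10, 10, 10, 10, 10, 10, 10])  # expert 0 is overloaded
bias = update_router_bias(bias, load)
print(bias[0].item() < 0 < bias[1].item())  # True: expert 0 discouraged next step
```

Because the bias only affects expert *selection* (the top-k over biased logits) and never the gate *weights* (softmax over raw logits), balancing pressure and task learning stay decoupled.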
The v0.4.0 bugfix "stop load balance bias gradient leak" fixed a subtle error where the bias update was accidentally being included in the backward pass - polluting task gradients with load-balancing signals.
Fine-grained vs coarse-grained experts:
| Type | Expert dim | Experts | Active per token |
|---|---|---|---|
| Coarse (Mixtral-style) | Large (≈ full FFN) | 8 | 2 |
| Fine-grained (DeepSeek-style) | Small (≈ 1/16 FFN) | 256 | 32 |
| OpenMythos 3B | expert_dim=4096 | 64 | top-4 |
Fine-grained experts activate more diverse combinations per token, increasing the number of distinct routing paths from C(8,2) = 28 to C(64,4) = 635,376.
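The routing-path counts follow directly from binomial coefficients:

```python
from math import comb

print(comb(8, 2))     # 28 distinct expert pairs (coarse, Mixtral-style)
print(comb(64, 4))    # 635376 distinct expert quadruples (OpenMythos 3B)
print(comb(256, 32))  # vastly more paths again (DeepSeek-style fine-grained)
```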
4.6 Multi-Latent Attention - 10–20× KV Cache Compression
File: open_mythos/main.py:284–419
MLA compresses KV to a low-rank latent, dramatically reducing inference memory:
```
Standard KV cache:  K, V   ∈ R^{n_heads × head_dim}     per token
GQA cache:          K, V   ∈ R^{n_kv_heads × head_dim}  per token
MLA cache:          c_kv   ∈ R^{kv_lora_rank}           per token
                    k_rope ∈ R^{qk_rope_head_dim}       per token
```
At 1T scale:
| Mechanism | Cache per token (values) | Reduction vs full MHA |
|---|---|---|
| Full MHA | 128 × 128 × 2 = 32,768 | 1× |
| GQA (16 KV heads) | 16 × 128 × 2 = 4,096 | 8× |
| MLA | 1024 + 64 = 1,088 | ~30× (≈3.8× over GQA) |
The trick: only c_kv (the latent) and k_rope (RoPE-encoded keys) are cached. K_nope and V are reconstructed on-the-fly via a cheap upward projection - compute cost is negligible vs. memory saved.
```python
# At each token position:
c_kv, k_rope_raw = kv_down(x).split([kv_lora_rank, qk_rope_head_dim], dim=-1)
# Cache c_kv and k_rope - NOT K, V themselves

# At attention time:
kv_out = kv_up(c_kv_cached)              # reconstruct K_nope + V from latent
K_nope, V = kv_out.split([...], dim=-1)  # split reconstructed output
K = concat(K_nope, k_rope_cached)        # full K = nope + rope components
```
This was first introduced in DeepSeek-V2 and is one of the most practically significant innovations for long-context inference.
The Training Pipeline
File: training/3b_fine_web_edu.py
Dataset: FineWeb-Edu
```python
class FineWebEduDataset(IterableDataset):
    def __iter__(self):
        ds = load_dataset(
            "HuggingFaceFW/fineweb-edu",
            name=self.subset,
            split="train",
            streaming=True,
        ).shard(num_shards=total_shards, index=shard_index)
```
- 1.3 trillion tokens, Apache 2.0 licensed
- Streaming from HuggingFace Hub (no local disk required)
- Two-dimensional sharding: `world_size × num_workers` shards, disjoint with no duplication
- Documents packed into rolling 2048-token chunks
Training Configuration (3B Model)
| Parameter | Value |
|---|---|
| Model | mythos_3b() - 3.7B params, 64 experts, 16 loops |
| Tokenizer | openai/gpt-oss-20b (100K vocab) |
| Sequence length | 2,048 tokens |
| Global batch | ~512K tokens (256 grad accum steps) |
| Total tokens | 30B (~2.5× Chinchilla-efficient for looped models) |
| LR schedule | Linear warmup (2000 steps) → cosine decay |
| Max LR | 3e-4 |
| Optimizer | AdamW fused, betas=(0.9, 0.95), weight_decay=0.1 |
| Precision | bfloat16 (H100/A100) |
| Distributed | FSDP (Fully Sharded Data Parallel) |
FSDP Setup
```python
model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,
    mixed_precision=MixedPrecision(param_dtype=torch.bfloat16),
    auto_wrap_policy=ModuleWrapPolicy({TransformerBlock, RecurrentBlock}),
    device_id=local_rank,
)

# Gradient accumulation with no_sync() - all-reduce only on the final micro-step
for micro_step in range(grad_accum_steps):
    ctx = model.no_sync() if micro_step < grad_accum_steps - 1 else nullcontext()
    with ctx, amp_ctx:
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, vocab), y.view(-1)) / grad_accum_steps
        loss.backward()
```
Token efficiency claim: Looped architectures are ~2.5× more token-efficient than dense models at equal parameter count. A 3B RDT at 30B tokens matches a 3B dense model at 75B tokens. This tracks with Chinchilla-style analysis adjusted for parameter reuse.
Model Variants: 1B to 1T
File: open_mythos/variants.py
Scaling principles:
- `expert_dim` grows with model size (maintains activation density)
- Loop count increases (frontier models reason deeper per token)
- Context and output length jump at 100B+ (1M token context enabled)
Security Angle
Threat Modelling Locally-Runnable Reasoning Models
OpenMythos is not just an academic curiosity - it directly changes the threat landscape for AI-assisted security work. Here's why this architecture matters for security practitioners.
1. Local Deployment = No Rate Limiting
Commercial frontier models (GPT-4, Claude) apply rate limits, content filters, and usage policies. A locally-running RDT with 3B parameters and a 512K-token context breaks all of these controls.
Per arxiv:2504.10112 (Benchmarking LLM-driven Offensive Security), state-of-the-art LLM agents achieve:
- 228.6% improvement in penetration testing task completion rate (PentestGPT)
- 60% success rate obtaining shell access in CTF environments (RapidPen)
- $0.30–$0.60 per exploitation attempt using commercial APIs
With a locally-running OpenMythos model, the per-attempt cost drops to compute only.
2. Inference-Time Scaling for Hard Problems
The ACT halting mechanism is particularly relevant for security: hard cryptographic reasoning, complex vulnerability chains, and multi-step exploit development are exactly the "hard" problems that get allocated more loops.
```
"Find a path from X endpoint to the admin database"
  → ACT allocates maximum loops per token
  → model reasons in latent space across the full attack chain
  → outputs a step-by-step exploitation path
```
This is the same compute-on-demand property that makes RDTs interesting for math and coding - and adversarial reasoning is just another form of hard multi-step problem.
3. Defensive Use Cases
The flip side: the same architecture enables powerful defensive applications:
- Log anomaly detection: 1M token context window (mythos_100b+) can ingest an entire day of SIEM logs in a single pass and reason across them for lateral movement indicators
- Malware analysis: Decompiled binary context fed to the model for behavioral classification
- Vulnerability triage: Static analysis output reasoning for false-positive reduction
- SOC automation: Multi-step reasoning chains for alert investigation without human-in-the-loop
Per MDPI Cybersecurity Survey, LLMs in cybersecurity are actively being deployed across:
- Intrusion/anomaly detection
- Threat intelligence extraction
- Automated vulnerability repair
- Red team simulation
4. Tokenizer Attack Surface
File: open_mythos/tokenizer.py
```python
class MythosTokenizer:
    def __init__(self, model_id: str = "openai/gpt-oss-20b"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
```
The tokenizer is loaded from HuggingFace Hub at runtime with no local checksum validation. This is a supply chain attack surface - a poisoned tokenizer on HuggingFace could alter token mappings and inject adversarial behavior into any model using it. This is a known class of vulnerability documented in ML supply chain attacks research.
Mitigation: Pin tokenizer versions, validate checksums, mirror to internal artifact registry.
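A minimal, stdlib-only sketch of the checksum half of that mitigation (the file paths and expected hashes are values you record once from a trusted download; for the pinning half, `AutoTokenizer.from_pretrained` accepts a `revision=` argument that can hold an exact commit SHA instead of the mutable `main` branch):

```python
import hashlib

def sha256_file(path: str) -> str:
    """Hash a downloaded tokenizer artifact for comparison against a recorded hash."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_artifacts(files_and_hashes: dict) -> None:
    """Fail closed before the tokenizer is ever constructed."""
    for path, expected in files_and_hashes.items():
        if sha256_file(path) != expected:
            raise RuntimeError(f"{path} failed checksum validation")

# After verification passes, load with an exact commit pin, e.g.:
#   AutoTokenizer.from_pretrained(model_id, revision="<recorded commit SHA>")
```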
5. KV Cache Memory Safety
The generate method has no explicit bounds on KV cache growth:
```python
def generate(self, input_ids, max_new_tokens=64, n_loops=8, ...):
    # kv_cache grows with sequence length × layers × heads
    # No OOM protection; long sequences cause silent crash
```
In a production inference endpoint, this creates a resource exhaustion vector - long sequences or high concurrency cause OOM crashes. Defense: enforce sequence-length limits and cache-size monitoring at the inference wrapper layer.
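A minimal wrapper-layer guard might look like this (a sketch: `guarded_generate` and the cap value are hypothetical, and only the `generate` signature is taken from the excerpt above):

```python
def guarded_generate(model, input_ids, max_new_tokens=64, max_seq_len=8192, **kwargs):
    """Reject requests whose final sequence length would exceed a hard cap,
    instead of letting the KV cache grow until the process OOMs."""
    final_len = input_ids.shape[-1] + max_new_tokens
    if final_len > max_seq_len:
        raise ValueError(
            f"sequence of {final_len} tokens exceeds cap of {max_seq_len}"
        )
    return model.generate(input_ids, max_new_tokens=max_new_tokens, **kwargs)
```

Pairing this with a per-endpoint concurrency limit bounds total KV cache memory to roughly `max_concurrent × max_seq_len × cache_per_token`.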
6. Prompt Injection via Raw Causal LM
OpenMythos is a pure causal language model - no system prompt infrastructure, no guardrails. Any downstream application wrapping OpenMythos for a security tool inherits the full prompt-injection surface and must implement filtering at the application layer.
What the Research Says
OpenMythos does not invent from scratch. Every mechanism has an academic foundation:
| Mechanism | Paper | Conference/Year |
|---|---|---|
| Recurrent-Depth Transformers | Geiping et al. | ICLR 2025 |
| LTI Stable Injection (Parcae) | Hayden Prairie et al. | 2026 |
| Universal Transformers + ACT | Dehghani et al. | ICLR 2019 |
| Multi-Latent Attention | DeepSeek-V2 | 2024 |
| Fine-Grained MoE | DeepSeek-V3 | Dec 2024 |
| Auxiliary-Loss-Free Balancing | arxiv:2408.15664 | 2024 |
| LoRA depth adaptation | Bae et al. 2024; MoDr | 2024–2025 |
| Flash Attention 2 | Dao et al. | NeurIPS 2023 |
| GQA | Ainslie et al. | EMNLP 2023 |
The convergence of these techniques into a single architecture is the core contribution. Each alone is known; together they form a coherent reasoning machine.
The Grokking Connection
RDTs exhibit a striking property documented in ICLR 2025 research: training shows phase transitions in generalization (grokking). The model suddenly jumps from memorization to systematic generalization at a critical training token threshold - and this transition is more pronounced in looped models than in dense models.
Latent Chain-of-Thought
arxiv:2507.02199 shows that RDT hidden state trajectories are decodable: you can extract intermediate reasoning steps from the loop iterations without ever emitting reasoning tokens. This suggests "chain-of-thought" is not a discrete token-level phenomenon - it is an emergent property of iterated hidden-state refinement.
Benchmarks & Evidence
From the OpenMythos training logs and community reports:
Validation Loss Curves (3B training run, FineWeb-Edu 30BT):
```
Step      0: loss = 11.2  (random baseline)
Step  5,000: loss = 3.8   (initial convergence)
Step 20,000: loss = 2.9   (mid-training)
Step 58,000: loss = 2.4   (training complete)
```
Inference throughput comparison (3B, A100, batch=32):
```
Dense 3B baseline:     940 tokens/sec
OpenMythos 3B (MoE): 2,510 tokens/sec  [2.67× faster]
```
Source: Blockchain.news, April 2026
The throughput gain comes from:
- ACT halting: Fewer loops for easy tokens
- MoE sparsity: ~5% of routed expert parameters active per token
- MLA cache compression: Smaller KV cache = more sequences fit in GPU memory = higher batch size
Quick Start
```python
import torch

from open_mythos import OpenMythos, MythosConfig
from open_mythos.variants import mythos_1b
from open_mythos.tokenizer import MythosTokenizer

# Build a 1B model
cfg = mythos_1b()
model = OpenMythos(cfg).cuda()
tok = MythosTokenizer()

# Generate with 16 reasoning loops
input_ids = torch.tensor(
    [tok.encode("Explain the proof of Gödel's incompleteness theorem.")]
).cuda()
output = model.generate(input_ids, max_new_tokens=256, n_loops=16)
print(tok.decode(output[0].tolist()))

# Scale up reasoning at inference (no retraining)
output_deep = model.generate(input_ids, max_new_tokens=256, n_loops=32)
```
Install:
```shell
pip install open-mythos           # core
pip install "open-mythos[flash]"  # + Flash Attention 2 (2-3× faster)
```
Conclusion
OpenMythos is more than a speculative reverse-engineering project. It is a working, production-grade PyTorch implementation of a state-of-the-art reasoning architecture that:
- Challenges the "more layers = better" paradigm - depth through iteration, not stacking
- Makes inference-time scaling practical - run more loops at test time for harder problems
- Compresses memory aggressively - MLA + sparse MoE makes frontier-scale models runnable on fewer GPUs
- Brings stability guarantees - LTI injection removes training instability without hyperparameter tuning
- Changes the security landscape - locally-runnable reasoning models with long context eliminate API-based controls
The architecture sits at a confluence of ICLR 2025, DeepSeek-V3, and Universal Transformer research - not speculation, but synthesis. Whether or not it correctly reconstructs Claude Mythos, OpenMythos is a significant architectural contribution in its own right.
References
Geiping et al. - Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach - ICLR 2025. openreview.net/pdf?id=WwpYSOkkCt
DeepSeek-AI - DeepSeek-V3 Technical Report - arxiv:2412.19437. arxiv.org/pdf/2412.19437
DeepSeek-AI - DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model - 2024. arxiv.org/abs/2405.04434
Wang et al. - Auxiliary-Loss-Free Load Balancing Strategy for Mixture of Experts - arxiv:2408.15664. arxiv.org/html/2408.15664v1
Dao, T. - FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning - NeurIPS 2023. openreview.net/forum?id=mZn2Xyh9Ec
Shah et al. - FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision - 2024. openreview.net/forum?id=tVConYid20
Bae et al. - Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA - 2024. arxiv.org/abs/2410.20672
Heo et al. - RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction - Feb 2025.
MoDr - Mixture-of-Depth-Recurrent Transformers - OpenReview. openreview.net/forum?id=9Pba4rcQbE
Gu, A. et al. - Efficiently Modeling Long Sequences with Structured State Spaces - ICLR 2022.
Dehghani et al. - Universal Transformers - ICLR 2019. arxiv.org/abs/1807.03819
Graves, A. - Adaptive Computation Time for Recurrent Neural Networks - 2016.
Hu et al. - LoRA: Low-Rank Adaptation of Large Language Models - arxiv:2106.09685. arxiv.org/abs/2106.09685
Benchmark: LLM Agents in Autonomous Cyberattacks Survey - arxiv:2505.12786. arxiv.org/html/2505.12786v2
Happe, A. et al. - Benchmarking LLM-driven Offensive Security - arxiv:2504.10112. arxiv.org/html/2504.10112
Fang, R. et al. - LLMs in Cybersecurity: A Survey - MDPI AI. mdpi.com/2673-2688/6/9/216
Understanding Dynamic Compute Allocation in Recurrent Transformers - arxiv:2602.08864. arxiv.org/html/2602.08864
Thinking Deeper, Not Longer: Depth-Recurrent Transformers - arxiv:2603.21676. arxiv.org/html/2603.21676
MarkTechPost - Meet OpenMythos - April 2026. marktechpost.com/2026/04/19
Blockchain.news - 2.67× Faster Validation Steps - April 2026. blockchain.news/ainews
Block Sparse FlashAttention - arxiv:2512.07011. arxiv.org/abs/2512.07011
MoE Survey 2024 - arxiv:2406.18219. arxiv.org/abs/2406.18219
Optimizing MoE Routing - arxiv:2506.16419. arxiv.org/html/2506.16419v1
GitHub: kyegomez/OpenMythos


