<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Clint</title>
    <description>The latest articles on DEV Community by Clint (@clintjosy).</description>
    <link>https://dev.to/clintjosy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890922%2F410a06d1-1613-46c5-a78b-23600e882992.png</url>
      <title>DEV Community: Clint</title>
      <link>https://dev.to/clintjosy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clintjosy"/>
    <language>en</language>
    <item>
      <title>OpenMythos Teardown: Dissecting the Open-Source Reconstruction of Claude Mythos</title>
      <dc:creator>Clint</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:07:01 +0000</pubDate>
      <link>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</link>
      <guid>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; OpenMythos is a community-driven theoretical reconstruction. It is not affiliated with or endorsed by Anthropic. All claims about Claude Mythos's architecture are speculative hypotheses grounded in publicly available research.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;What Is OpenMythos?&lt;/h2&gt;

&lt;p&gt;On April 21, 2026, &lt;a href="https://github.com/kyegomez" rel="noopener noreferrer"&gt;Kye Gomez&lt;/a&gt; - founder of Swarms AI - published &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; to GitHub. The project is a fully open-source PyTorch reconstruction of the hypothesized architecture behind Anthropic's &lt;strong&gt;Claude Mythos&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;The thesis: Claude Mythos achieves its extraordinary reasoning &lt;strong&gt;not&lt;/strong&gt; by stacking hundreds of unique transformer layers, but by &lt;strong&gt;looping a compact set of layers multiple times&lt;/strong&gt;, performing continuous "latent chain-of-thought" reasoning in hidden state space before ever emitting a single output token.&lt;/p&gt;

&lt;p&gt;This idea - a &lt;strong&gt;Recurrent-Depth Transformer (RDT)&lt;/strong&gt; - is grounded in a growing body of 2024–2025 academic research from ICLR, DeepSeek, and multiple independent labs. The architecture combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A three-stage &lt;strong&gt;Prelude → Loop → Coda&lt;/strong&gt; pipeline&lt;/li&gt;
&lt;li&gt;Spectral-radius-constrained hidden state updates (from &lt;strong&gt;Parcae&lt;/strong&gt; architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Computation Time (ACT)&lt;/strong&gt; halting for per-token variable compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Mixture of Experts (MoE)&lt;/strong&gt; with DeepSeek-V3-style bias-based load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Latent Attention (MLA)&lt;/strong&gt; for 10–20× KV cache reduction&lt;/li&gt;
&lt;li&gt;Depth-wise &lt;strong&gt;LoRA adapters&lt;/strong&gt; for cheap per-loop specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news&lt;/a&gt;, early training runs show &lt;strong&gt;2.67× faster validation steps&lt;/strong&gt; compared to a baseline dense transformer at the same parameter count.&lt;/p&gt;

&lt;h2&gt;The Central Hypothesis&lt;/h2&gt;

&lt;p&gt;The key architectural claim: a 770M-parameter Recurrent-Depth Transformer can match the effective capacity of a standard 1.3B dense transformer, because every parameter is reused N times across loop iterations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Effective Compute ≈ Parameters × Loop Iterations

vs.

Dense Transformer Effective Compute ≈ Parameters × 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the model can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale reasoning depth at inference&lt;/strong&gt; without retraining (run more loops for harder problems)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalize to more loops than it was trained on&lt;/strong&gt; (depth extrapolation via LoRA clamping)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run entirely in continuous latent space&lt;/strong&gt; - no chain-of-thought token emission required&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"A 770M-parameter RDT matches a 1.3B dense model" - &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;MarkTechPost, April 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Architecture Overview&lt;/h2&gt;

&lt;p&gt;The model follows a strict three-stage pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" alt="Openmythos Architecture" width="572" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:899–1086&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Prelude&lt;/strong&gt; and &lt;strong&gt;Coda&lt;/strong&gt; execute once (fixed compute). The &lt;strong&gt;Recurrent Block&lt;/strong&gt; holds all the reasoning capacity and runs T times. The frozen encoding &lt;code&gt;e&lt;/code&gt; is injected at every loop step, preventing the model from "forgetting" the input.&lt;/p&gt;
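&lt;p&gt;The pipeline can be sketched in a few lines of PyTorch. This is a toy illustration only - the module names, layer choices, and dimensions here are hypothetical, not the OpenMythos source - but it shows how fixed-compute Prelude and Coda bracket a weight-shared loop that re-injects &lt;code&gt;e&lt;/code&gt; at every step:&lt;/p&gt;

```python
import torch
import torch.nn as nn

class ToyRDT(nn.Module):
    """Toy Prelude -> Loop -> Coda pipeline with one shared loop block."""

    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.prelude = nn.Embedding(vocab_size, dim)  # runs once: tokens to e
        self.loop_block = nn.Linear(dim, dim)         # shared weights, run T times
        self.coda = nn.Linear(dim, vocab_size)        # runs once: h to logits

    def forward(self, tokens: torch.Tensor, num_loops: int) -> torch.Tensor:
        e = self.prelude(tokens)            # frozen encoding, fixed compute
        h = torch.zeros_like(e)
        for _ in range(num_loops):
            # e is re-injected every iteration, so the loop never "forgets" the input
            h = torch.tanh(self.loop_block(h) + e)
        return self.coda(h)                 # fixed compute

model = ToyRDT(vocab_size=100, dim=32)
tokens = torch.randint(0, 100, (2, 5))
logits = model(tokens, num_loops=16)  # deeper reasoning = more loops, same parameters
```

&lt;p&gt;Note that &lt;code&gt;num_loops&lt;/code&gt; is an inference-time argument: the same parameters can be looped 4 or 32 times without retraining.&lt;/p&gt;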

&lt;h2&gt;Dissection: Six Novel Mechanisms&lt;/h2&gt;

&lt;h3&gt;4.1 LTI-Stable Injection - The Heartbeat&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:684–743&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The most critical and least obvious component. Without it, looped transformers diverge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LTIInjection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Linear Time-Invariant injection with spectral radius &amp;lt; 1 by construction.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# A_continuous = -exp(log_A)  → always negative diagonal
&lt;/span&gt;        &lt;span class="c1"&gt;# A_discrete   = exp(Δt × A_continuous)  → always in (0, 1)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_dt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_A&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# spectral radius guaranteed &amp;lt; 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The update rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;h&lt;span class="p"&gt;_{&lt;/span&gt;t+1&lt;span class="p"&gt;}&lt;/span&gt; = A · h&lt;span class="p"&gt;_&lt;/span&gt;t  +  B · e  +  Transformer(h&lt;span class="p"&gt;_&lt;/span&gt;t, e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;ρ(A) &amp;lt; 1&lt;/code&gt; is guaranteed by parameterization - not enforced by regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;A parameterization&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unconstrained&lt;/td&gt;
&lt;td&gt;ρ(A) ≥ 1 possible → hidden state explodes after N loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft regularization&lt;/td&gt;
&lt;td&gt;Sometimes works, often diverges at high LR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI with ZOH&lt;/td&gt;
&lt;td&gt;ρ(A) &amp;lt; 1 always → stable at any depth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The implementation uses &lt;strong&gt;zero-order-hold (ZOH) discretization&lt;/strong&gt;: a continuous-time negative diagonal matrix &lt;code&gt;A_c = -exp(log_A)&lt;/code&gt; is mapped to discrete time via &lt;code&gt;exp(Δt · A_c)&lt;/code&gt;, which always lands in &lt;code&gt;(0, 1)&lt;/code&gt;. This is borrowed from state-space models (Gu et al., 2021 - S4).&lt;/p&gt;
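&lt;p&gt;A quick numeric check of that guarantee - a standalone sketch of the parameterization above, not the repo code. No matter what values the unconstrained parameters take, the discretized &lt;code&gt;A&lt;/code&gt; stays in &lt;code&gt;[0, 1)&lt;/code&gt; in floating point (strictly &lt;code&gt;(0, 1)&lt;/code&gt; in exact arithmetic), so repeated application contracts rather than explodes:&lt;/p&gt;

```python
import torch

torch.manual_seed(0)

# ZOH parameterization: A_c = -exp(log_A) is strictly negative for any real
# log_A, so A_d = exp(dt * A_c) always lands below 1 (and above 0 up to
# floating-point underflow), with no regularizer needed.
log_A = torch.randn(1000)    # arbitrary unconstrained parameter values
log_dt = torch.randn(1000)   # arbitrary log-timesteps
A_d = torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))

# Because every diagonal entry of A_d is below 1, the homogeneous part of the
# update h = A_d * h contracts: after many loops the state cannot blow up.
h = torch.ones(1000)
for _ in range(100):
    h = A_d * h   # decays toward zero instead of diverging
```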

&lt;blockquote&gt;
&lt;p&gt;Every divergent training run in the Parcae architecture paper had ρ(A) ≥ 1. Every convergent run had ρ(A) &amp;lt; 1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;4.2 ACT Halting - Variable Compute per Token&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Files:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:750–781&lt;/code&gt; (halting unit), &lt;code&gt;open_mythos/main.py:865–889&lt;/code&gt; (integration in RecurrentBlock)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ACTHalting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-position adaptive computation time.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;halt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remainder trick: assign leftover probability at threshold crossing
&lt;/span&gt;&lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cumulative_halt&lt;/span&gt;
&lt;span class="n"&gt;crossed&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cumulative_halt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;act_threshold&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Accumulate weighted hidden state
&lt;/span&gt;&lt;span class="n"&gt;h_out&lt;/span&gt;            &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;span class="n"&gt;cumulative_halt&lt;/span&gt;  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;span class="n"&gt;still_running&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The cat sat."        → halts at loop 3   (trivial, no reasoning needed)
"Prove P ≠ NP."       → halts at loop 16  (maximum compute allocated)
"2 + 2"               → halts at loop 1
"Multi-step logic..."  → halts at loop 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research on recurrent-depth architectures&lt;/a&gt;, looped updates exhibit a &lt;strong&gt;rapid norm decay&lt;/strong&gt; pattern: early iterations make large hidden-state changes, late iterations make tiny orthogonal adjustments. ACT exploits this by halting when updates become negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput impact:&lt;/strong&gt; 2–3× improvement in inference throughput (easy tokens exit early, expensive compute is allocated to hard tokens only).&lt;/p&gt;

&lt;p&gt;The critical bug fixed in OpenMythos v0.4.0: &lt;strong&gt;halted positions must be gated out of weight accumulation&lt;/strong&gt;. Once a position crosses the halting threshold, its &lt;code&gt;h&lt;/code&gt; must stop contributing to the accumulated output and to gradient updates - a subtle but catastrophic error if missed.&lt;/p&gt;
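&lt;p&gt;A toy version of that gate with hand-picked tensors (illustrative, not the repo's code) - the point is that a &lt;code&gt;still_running&lt;/code&gt; mask zeroes the weight of already-halted positions before anything is accumulated:&lt;/p&gt;

```python
import torch

# Three positions; the middle one already halted in a previous loop.
act_threshold = 0.99
cumulative_halt = torch.tensor([0.30, 1.00, 0.70])
still_running = cumulative_halt.lt(act_threshold)    # [True, False, True]

p = torch.tensor([0.20, 0.50, 0.40])                 # halting probs this loop
remainder = 1.0 - cumulative_halt
crossed = (cumulative_halt + p).ge(act_threshold)    # [False, True, True]

# Gate BEFORE accumulating: halted positions get weight 0, so their h no
# longer leaks into the output sum or the gradient.
weight = torch.where(crossed, remainder, p) * still_running.float()
print(weight)   # tensor([0.2000, 0.0000, 0.3000])
```

&lt;p&gt;Without the &lt;code&gt;still_running&lt;/code&gt; factor, position 1 would receive its remainder a second time, corrupting both the output average and the gradients.&lt;/p&gt;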

&lt;h3&gt;4.3 Loop-Index RoPE - Teaching Shared Weights Two Jobs&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:541–571&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loop_index_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Inject sinusoidal depth-position signal into hidden state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;angles&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;
    &lt;span class="n"&gt;emb&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; With pure weight sharing, the model runs identical computation at loop 1 and loop 16 - no mechanism to differentiate "early encoding" from "late refinement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Inject a sinusoidal signal keyed to the loop index &lt;code&gt;t&lt;/code&gt; before every iteration, similar to how RoPE encodes &lt;em&gt;sequence position&lt;/em&gt;. Now the shared weights can learn functionally distinct behaviors per depth - not via separate parameters, but via different activations conditioned on the loop signal.&lt;/p&gt;

&lt;p&gt;This is analogous to the &lt;strong&gt;RingFormer&lt;/strong&gt; architecture (&lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;Heo et al., Feb 2025&lt;/a&gt;) which uses low-rank "level signals" for the same purpose.&lt;/p&gt;

&lt;h3&gt;4.4 Depth-Wise LoRA - Cheap Specialization at Scale&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:578–620&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoRAAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-loop scale LoRA: shared A/B matrices, learned scale per loop index.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;t_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_embeddings&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# clamp for depth extrapolation
&lt;/span&gt;        &lt;span class="n"&gt;s&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# (rank,) - learned per-loop scale
&lt;/span&gt;        &lt;span class="n"&gt;down&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;   &lt;span class="c1"&gt;# (B, T, rank)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;        &lt;span class="c1"&gt;# (B, T, dim)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameter cost analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Parameters per loop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully distinct weights&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim × dim&lt;/code&gt; (hundreds of millions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure weight sharing&lt;/td&gt;
&lt;td&gt;0 (least expressive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA adapter&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rank × dim × 2 + rank × max_loops&lt;/code&gt; (thousands)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;clamp&lt;/code&gt; operation (&lt;code&gt;min(loop_t, max_t)&lt;/code&gt;) enables &lt;strong&gt;depth extrapolation&lt;/strong&gt;: train on 16 loops, run inference with 32 loops. Loops 17–32 reuse the scale learned for loop 16. Quality improves sharply with loop count at first, then plateaus.&lt;/p&gt;
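&lt;p&gt;The clamp itself is a one-liner. A standalone sketch (hypothetical sizes, mirroring the &lt;code&gt;LoRAAdapter&lt;/code&gt; snippet above) showing that every loop index past the trained depth falls back to the last learned scale:&lt;/p&gt;

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
max_loops, rank = 16, 4
scale = nn.Embedding(max_loops, rank)  # one learned scale vector per loop index

def scale_for(loop_t: int) -> torch.Tensor:
    # Depth extrapolation: any loop index past the trained depth reuses the
    # last learned scale instead of indexing out of bounds.
    t_idx = min(loop_t, scale.num_embeddings - 1)
    return scale(torch.tensor(t_idx))

# Trained depths 0..15 each get their own scale vector; depths 16, 31, 100
# all fall back to the scale learned for loop 15.
```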

&lt;blockquote&gt;
&lt;p&gt;This is validated by the &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;MoDr paper (OpenReview)&lt;/a&gt; - "Mixture-of-Depth-Recurrent Transformers" - which shows LoRA-based depth adaptation enables reliable out-of-distribution loop generalization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;4.5 Fine-Grained MoE with Bias-Based Load Balancing&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:426–534&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MoEFFN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;DeepSeek-style: fine-grained routed experts + always-on shared experts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                          &lt;span class="c1"&gt;# (B, T, n_experts)
&lt;/span&gt;        &lt;span class="n"&gt;scores&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# gate weights (gradient flows here)
&lt;/span&gt;        &lt;span class="n"&gt;biased_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router_bias&lt;/span&gt;               &lt;span class="c1"&gt;# bias shifted (no gradient)
&lt;/span&gt;        &lt;span class="n"&gt;topk_idx&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;biased_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# renormalize
&lt;/span&gt;
        &lt;span class="c1"&gt;# Dispatch tokens to selected experts
&lt;/span&gt;        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Always-on shared experts
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;expert&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared_experts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;expert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The load-balancing trick (DeepSeek-V3 style):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard auxiliary-loss balancing adds a penalty term to the training objective - but this introduces competing gradients and a tricky hyperparameter. OpenMythos uses &lt;strong&gt;bias-based routing&lt;/strong&gt; instead:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" alt="Routing Decision" width="528" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt; (Auxiliary-Loss-Free Load Balancing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Biases are updated externally: overloaded experts get their bias decreased, underloaded ones increased&lt;/li&gt;
&lt;li&gt;No gradient interference with the task objective&lt;/li&gt;
&lt;li&gt;Zero token dropping during training and inference&lt;/li&gt;
&lt;/ul&gt;
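&lt;p&gt;The update rule reduces to a few lines. A minimal sketch in the spirit of the paper - not the exact OpenMythos code; &lt;code&gt;gamma&lt;/code&gt; and the load-counting details are assumptions:&lt;/p&gt;

```python
import torch

@torch.no_grad()  # the bias is a routing control signal, never a trained parameter
def update_router_bias(bias, topk_idx, num_experts, gamma=1e-3):
    """Auxiliary-loss-free balancing in the spirit of arxiv:2408.15664."""
    # Count tokens routed to each expert this step
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    # Overloaded experts (above mean load) get bias decreased, underloaded increased
    bias -= gamma * torch.sign(load - load.mean())
    return bias

bias = torch.zeros(8)
topk_idx = torch.tensor([[0, 1], [0, 2], [0, 3]])  # expert 0 hoards tokens
bias = update_router_bias(bias, topk_idx, num_experts=8)
```

&lt;p&gt;Keeping the update outside the autograd graph (&lt;code&gt;torch.no_grad()&lt;/code&gt;) is exactly the discipline the "bias gradient leak" fix enforces.&lt;/p&gt;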

&lt;p&gt;The v0.4.0 bugfix "stop load balance bias gradient leak" fixed a subtle error where the bias update was accidentally being included in the backward pass - polluting task gradients with load-balancing signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained vs coarse-grained experts:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Expert dim&lt;/th&gt;
&lt;th&gt;Experts&lt;/th&gt;
&lt;th&gt;Active per token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coarse (Mixtral-style)&lt;/td&gt;
&lt;td&gt;Large (≈ full FFN)&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-grained (DeepSeek-style)&lt;/td&gt;
&lt;td&gt;Small (≈ 1/16 FFN)&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenMythos 3B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;expert_dim=4096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;top-4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fine-grained experts activate more diverse combinations per token, increasing effective routing paths from &lt;code&gt;C(8,2)=28&lt;/code&gt; to &lt;code&gt;C(64,4)≈635,376&lt;/code&gt;.&lt;/p&gt;
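&lt;p&gt;The path counts are a one-liner to verify:&lt;/p&gt;

```python
from math import comb

assert comb(8, 2) == 28            # coarse: Mixtral-style routing combinations
assert comb(64, 4) == 635_376      # OpenMythos 3B
print(comb(256, 32))               # fine-grained DeepSeek-style: astronomically many
```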

&lt;h3&gt;
  
  
  4.6 Multi-head Latent Attention (MLA) - 10–20× KV Cache Compression
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:284–419&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;MLA compresses KV to a low-rank latent, dramatically reducing inference memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Standard KV Cache: K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;    per token
GQA Cache:         K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt; per token
MLA Cache:         c&lt;span class="p"&gt;_&lt;/span&gt;kv ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;lora&lt;span class="p"&gt;_&lt;/span&gt;rank&lt;span class="p"&gt;}&lt;/span&gt;           per token
                   k&lt;span class="p"&gt;_&lt;/span&gt;rope ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;qk&lt;span class="p"&gt;_&lt;/span&gt;rope&lt;span class="p"&gt;_&lt;/span&gt;head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;     per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 1T scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Cache per token&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full MHA&lt;/td&gt;
&lt;td&gt;&lt;code&gt;128 × 128 × 2 = 32,768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA (16 KV heads)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16 × 128 × 2 = 4,096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1024 + (128 × 64) = 9,216&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3.6× vs full MHA&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
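&lt;p&gt;The table's entries follow directly from the assumed 1T-scale config (128 heads of dim 128, &lt;code&gt;kv_lora_rank=1024&lt;/code&gt;, 64-dim RoPE keys) - a quick check:&lt;/p&gt;

```python
n_heads, head_dim = 128, 128
kv_lora_rank, qk_rope_head_dim = 1024, 64

mha = n_heads * head_dim * 2                      # full K and V per token
gqa = 16 * head_dim * 2                           # 16 KV heads
mla = kv_lora_rank + n_heads * qk_rope_head_dim   # latent + RoPE keys

print(mha, gqa, mla)                 # 32768 4096 9216
print(f"MLA vs full MHA: {mha / mla:.1f}x")
```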

&lt;p&gt;The trick: only &lt;code&gt;c_kv&lt;/code&gt; (the latent) and &lt;code&gt;k_rope&lt;/code&gt; (RoPE-encoded keys) are cached. &lt;code&gt;K_nope&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; are reconstructed on-the-fly via a cheap upward projection - compute cost is negligible vs. memory saved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# At each token position:
&lt;/span&gt;&lt;span class="n"&gt;c_kv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;kv_lora_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qk_rope_head_dim&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Cache c_kv and k_rope - NOT K, V themselves
&lt;/span&gt;
&lt;span class="c1"&gt;# At attention time:
&lt;/span&gt;&lt;span class="n"&gt;kv_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_up&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_kv_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# reconstruct K_nope + V from latent
&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([...],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# split reconstructed output
&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# full K = nope + rope components
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was first introduced in &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2&lt;/a&gt; and is one of the most practically significant innovations for long-context inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;training/3b_fine_web_edu.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Dataset: FineWeb-Edu&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FineWebEduDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IterableDataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__iter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceFW/fineweb-edu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_shards&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_shards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.3 trillion tokens&lt;/strong&gt;, ODC-By 1.0 licensed&lt;/li&gt;
&lt;li&gt;Streaming from HuggingFace Hub (no local disk required)&lt;/li&gt;
&lt;li&gt;Two-dimensional sharding: &lt;code&gt;world_size × num_workers&lt;/code&gt; - disjoint, no duplication&lt;/li&gt;
&lt;li&gt;Documents packed into rolling 2048-token chunks&lt;/li&gt;
&lt;/ul&gt;
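&lt;p&gt;The two-dimensional sharding above flattens to a single disjoint shard index; a sketch (variable names are illustrative, not the exact ones in &lt;code&gt;3b_fine_web_edu.py&lt;/code&gt;):&lt;/p&gt;

```python
def flat_shard(rank, world_size, worker_id, num_workers):
    """Map (DDP rank, DataLoader worker) to one of world_size x num_workers shards."""
    total_shards = world_size * num_workers
    shard_index = rank * num_workers + worker_id
    return total_shards, shard_index

# 8 GPUs x 4 workers -> 32 disjoint shards, so no document is seen twice
assert flat_shard(0, 8, 0, 4) == (32, 0)
assert flat_shard(7, 8, 3, 4) == (32, 31)
```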

&lt;h3&gt;
  
  
  Training Configuration (3B Model)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;mythos_3b() - 3.7B params, 64 experts, 16 loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokenizer&lt;/td&gt;
&lt;td&gt;openai/gpt-oss-20b (100K vocab)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequence length&lt;/td&gt;
&lt;td&gt;2,048 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global batch&lt;/td&gt;
&lt;td&gt;~512K tokens (256 grad accum steps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;30B (~2.5× Chinchilla-efficient for looped models)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LR schedule&lt;/td&gt;
&lt;td&gt;Linear warmup (2000 steps) → cosine decay&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max LR&lt;/td&gt;
&lt;td&gt;3e-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimizer&lt;/td&gt;
&lt;td&gt;AdamW fused, betas=(0.9, 0.95), weight_decay=0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;bfloat16 (H100/A100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed&lt;/td&gt;
&lt;td&gt;FSDP (Fully Sharded Data Parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
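&lt;p&gt;The schedule in the table is a standard warmup-then-cosine recipe; a minimal sketch, assuming the run's ~58,000 total steps and a 10% LR floor (the floor value is an assumption):&lt;/p&gt;

```python
import math

def lr_at(step, max_lr=3e-4, warmup=2000, total=58_000, min_ratio=0.1):
    """Linear warmup to max_lr, then cosine decay to min_ratio * max_lr."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = (step - warmup) / (total - warmup)       # 0 -> 1 over decay phase
    cosine = 0.5 * (1 + math.cos(math.pi * progress))   # 1 -> 0
    return max_lr * (min_ratio + (1 - min_ratio) * cosine)

assert abs(lr_at(1999) - 3e-4) < 1e-12   # warmup ends exactly at max LR
assert lr_at(58_000) < lr_at(20_000)     # monotone decay afterwards
```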

&lt;p&gt;FSDP Setup&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FSDP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sharding_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ShardingStrategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FULL_SHARD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mixed_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MixedPrecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_wrap_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ModuleWrapPolicy&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;TransformerBlock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RecurrentBlock&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;local_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Gradient accumulation with no_sync() - all-reduce only on final micro-step
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;nullcontext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amp_ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token efficiency claim:&lt;/strong&gt; Looped architectures are ~2.5× more token-efficient than dense models at equal parameter count. A 3B RDT at 30B tokens matches a 3B dense model at 75B tokens. This tracks with &lt;a href="https://arxiv.org/abs/2203.15556" rel="noopener noreferrer"&gt;Chinchilla-style analysis&lt;/a&gt; adjusted for parameter reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Variants: 1B to 1T
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/variants.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" alt="Model Variants" width="538" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scaling principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;expert_dim&lt;/code&gt; grows with model size (maintain activation density)&lt;/li&gt;
&lt;li&gt;Loop count increases (frontier models reason deeper per token)&lt;/li&gt;
&lt;li&gt;Context and output length jump at 100B+ (1M token context enabled)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security Angle
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threat Modelling Locally-Runnable Reasoning Models
&lt;/h3&gt;

&lt;p&gt;OpenMythos is not just an academic curiosity - it directly changes the threat landscape for AI-assisted security work. Here's why this architecture matters for security practitioners.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Local Deployment = No Rate Limiting
&lt;/h4&gt;

&lt;p&gt;Commercial frontier models (GPT-4, Claude) apply rate limits, content filters, and usage policies. A locally running RDT with 3B parameters and a 512K-token context is subject to none of these controls.&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv:2504.10112&lt;/a&gt; (Benchmarking LLM-driven Offensive Security), state-of-the-art LLM agents achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;228.6% improvement&lt;/strong&gt; in penetration testing task completion rate (PentestGPT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60% success rate&lt;/strong&gt; obtaining shell access in CTF environments (RapidPen)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.30–$0.60 per exploitation attempt&lt;/strong&gt; using commercial APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a locally-running OpenMythos model, the per-attempt cost drops to compute only.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Inference-Time Scaling for Hard Problems
&lt;/h4&gt;

&lt;p&gt;The ACT halting mechanism is particularly relevant for security: hard cryptographic reasoning, complex vulnerability chains, and multi-step exploit development are exactly the "hard" problems that get allocated more loops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Find a path from X endpoint to the admin database"
     → ACT allocates maximum loops per token
     → model reasons in latent space across the full attack chain
     → outputs a step-by-step exploitation path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same compute-on-demand property that makes RDTs interesting for math and coding - and adversarial reasoning is just another form of hard multi-step problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Defensive Use Cases
&lt;/h4&gt;

&lt;p&gt;The flip side: the same architecture enables powerful defensive applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log anomaly detection:&lt;/strong&gt; 1M token context window (mythos_100b+) can ingest an entire day of SIEM logs in a single pass and reason across them for lateral movement indicators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Malware analysis:&lt;/strong&gt; Decompiled binary context fed to the model for behavioral classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability triage:&lt;/strong&gt; Static analysis output reasoning for false-positive reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC automation:&lt;/strong&gt; Multi-step reasoning chains for alert investigation without human-in-the-loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;MDPI Cybersecurity Survey&lt;/a&gt;, LLMs in cybersecurity are actively being deployed across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intrusion/anomaly detection&lt;/li&gt;
&lt;li&gt;Threat intelligence extraction&lt;/li&gt;
&lt;li&gt;Automated vulnerability repair&lt;/li&gt;
&lt;li&gt;Red team simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Tokenizer Attack Surface
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/tokenizer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-oss-20b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tokenizer is loaded from HuggingFace Hub at runtime with no local checksum validation. This is a &lt;strong&gt;supply chain attack surface&lt;/strong&gt; - a poisoned tokenizer on HuggingFace could alter token mappings and inject adversarial behavior into any model using it. This is a known class of vulnerability documented in &lt;a href="https://arxiv.org/abs/2401.00001" rel="noopener noreferrer"&gt;ML supply chain attacks research&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Pin tokenizer versions, validate checksums, mirror to internal artifact registry.&lt;/p&gt;
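&lt;p&gt;A minimal sketch of the checksum half of that mitigation (pure stdlib; the pinned hash would come from your own mirror, and version pinning itself is done by passing a commit SHA via &lt;code&gt;from_pretrained&lt;/code&gt;'s &lt;code&gt;revision&lt;/code&gt; argument):&lt;/p&gt;

```python
import hashlib
import tempfile
from pathlib import Path

def verify_artifact(path, expected_sha256):
    """Refuse to load a tokenizer file whose hash doesn't match the pin."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    return digest == expected_sha256

# Demo against a throwaway file standing in for a downloaded tokenizer.json
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'{"version": "pinned"}')
pinned = hashlib.sha256(b'{"version": "pinned"}').hexdigest()
assert verify_artifact(f.name, pinned)
assert not verify_artifact(f.name, "0" * 64)
```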

&lt;h4&gt;
  
  
  5. KV Cache Memory Safety
&lt;/h4&gt;

&lt;p&gt;The generate method has no explicit bounds on KV cache growth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# kv_cache grows with sequence length × layers × heads
&lt;/span&gt;    &lt;span class="c1"&gt;# No OOM protection; long sequences cause silent crash
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a production inference endpoint, this creates a &lt;strong&gt;resource exhaustion vector&lt;/strong&gt; - long sequences or high concurrency can cause OOM crashes. Defense: enforce sequence-length limits and cache-size monitoring at the inference wrapper layer.&lt;/p&gt;
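&lt;p&gt;That wrapper-layer defense is a few lines; &lt;code&gt;BoundedGenerate&lt;/code&gt; below is an illustrative sketch, and the 4,096-token limit is a placeholder:&lt;/p&gt;

```python
class BoundedGenerate:
    """Reject requests before the KV cache can grow past a hard token budget."""

    def __init__(self, model, max_total_tokens=4096):
        self.model = model
        self.max_total_tokens = max_total_tokens

    def generate(self, input_ids, max_new_tokens=64, **kwargs):
        total = input_ids.shape[-1] + max_new_tokens
        if total > self.max_total_tokens:
            raise ValueError(
                f"request would grow the KV cache to {total} tokens "
                f"(limit {self.max_total_tokens})"
            )
        return self.model.generate(input_ids, max_new_tokens=max_new_tokens, **kwargs)
```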

&lt;h4&gt;
  
  
  6. Prompt Injection via Raw Causal LM
&lt;/h4&gt;

&lt;p&gt;OpenMythos is a pure causal language model - no system prompt infrastructure, no guardrails. Any downstream application wrapping OpenMythos for a security tool inherits the full prompt-injection surface and must implement filtering at the application layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Research Says
&lt;/h2&gt;

&lt;p&gt;OpenMythos invents nothing from scratch. Every mechanism has an academic foundation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;Conference/Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recurrent-Depth Transformers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;Geiping et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ICLR 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI Stable Injection (Parcae)&lt;/td&gt;
&lt;td&gt;Hayden Prairie et al.&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Universal Transformers + ACT&lt;/td&gt;
&lt;td&gt;Dehghani et al.&lt;/td&gt;
&lt;td&gt;ICLR 2019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-head Latent Attention&lt;/td&gt;
&lt;td&gt;DeepSeek-V2&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Grained MoE&lt;/td&gt;
&lt;td&gt;DeepSeek-V3&lt;/td&gt;
&lt;td&gt;Dec 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auxiliary-Loss-Free Balancing&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA depth adaptation&lt;/td&gt;
&lt;td&gt;Bae et al. 2024; MoDr&lt;/td&gt;
&lt;td&gt;2024–2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;Dao et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;NeurIPS 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA&lt;/td&gt;
&lt;td&gt;Ainslie et al.&lt;/td&gt;
&lt;td&gt;EMNLP 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The convergence of these techniques into a single architecture is the core contribution. Each alone is known; together they form a coherent reasoning machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Grokking Connection
&lt;/h3&gt;

&lt;p&gt;RDTs exhibit a striking property documented in &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research&lt;/a&gt;: training shows &lt;strong&gt;phase transitions in generalization&lt;/strong&gt; (grokking). The model suddenly jumps from memorization to systematic generalization at a critical training token threshold - and this transition is more pronounced in looped models than in dense models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Chain-of-Thought
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/html/2507.02199v1" rel="noopener noreferrer"&gt;arxiv:2507.02199&lt;/a&gt; shows that RDT hidden state trajectories are decodable: you can extract intermediate reasoning steps from the loop iterations without ever emitting reasoning tokens. This suggests "chain-of-thought" is not a discrete token-level phenomenon - it is an emergent property of iterated hidden-state refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks &amp;amp; Evidence
&lt;/h2&gt;

&lt;p&gt;From the OpenMythos training logs and community reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation Loss Curves (3B training run, FineWeb-Edu 30BT):
Step 0:      loss=11.2  (random baseline)
Step 5,000:  loss=3.8   (initial convergence)
Step 20,000: loss=2.9   (mid-training)
Step 58,000: loss=2.4   (training complete)

Inference throughput comparison (3B, A100, batch=32):
Dense 3B baseline:   940 tokens/sec
OpenMythos 3B (MoE): 2,510 tokens/sec  [2.67× faster]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news, April 2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The throughput gain comes from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ACT halting:&lt;/strong&gt; Fewer loops for easy tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE sparsity:&lt;/strong&gt; ~5% of routed expert parameters active per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLA cache compression:&lt;/strong&gt; Smaller KV cache = more sequences fit in GPU memory = higher batch size&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MythosConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.variants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mythos_1b&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.tokenizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MythosTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Build a 1B model
&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mythos_1b&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate with 16 reasoning loops
&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the proof of Gödel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s incompleteness theorem.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up reasoning at inference (no retraining)
&lt;/span&gt;&lt;span class="n"&gt;output_deep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;open-mythos            &lt;span class="c"&gt;# core&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"open-mythos[flash]"&lt;/span&gt;   &lt;span class="c"&gt;# + Flash Attention 2 (2-3× faster)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenMythos is more than a speculative reverse-engineering project. It is a working, production-grade PyTorch implementation of a state-of-the-art reasoning architecture that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Challenges the "more layers = better" paradigm&lt;/strong&gt; - depth through iteration, not stacking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes inference-time scaling practical&lt;/strong&gt; - run more loops at test time for harder problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compresses memory aggressively&lt;/strong&gt; - MLA + sparse MoE makes frontier-scale models runnable on fewer GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brings stability guarantees&lt;/strong&gt; - LTI injection removes training instability without hyperparameter tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changes the security landscape&lt;/strong&gt; - locally-runnable reasoning models with long context sidestep API-based safety controls entirely&lt;/li&gt;
&lt;/ol&gt;
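
&lt;p&gt;Points 1 and 2 - depth through iteration, scaled up at inference - can be sketched in a few lines of plain PyTorch. This is an illustrative toy, not OpenMythos's actual code: &lt;code&gt;LoopedBlock&lt;/code&gt; and every hyperparameter below are invented for the sketch.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """A single weight-tied transformer block applied n_loops times."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn   = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(),
            nn.Linear(4 * d_model, d_model))
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, h, n_loops=4):
        # Depth through iteration: the SAME weights run on every loop,
        # so effective depth grows while parameter count stays fixed.
        for _ in range(n_loops):
            a, _ = self.attn(h, h, h, need_weights=False)
            h = self.norm1(h + a)
            h = self.norm2(h + self.ffn(h))
        return h

block   = LoopedBlock()
x       = torch.randn(1, 8, 256)
shallow = block(x, n_loops=4)   # cheap pass
deep    = block(x, n_loops=16)  # more test-time compute, zero new parameters
print(shallow.shape, deep.shape)  # both torch.Size([1, 8, 256])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Dialing the loop count up at inference is the knob the &lt;code&gt;generate(..., n_loops=...)&lt;/code&gt; call above exposes: harder problems get more iterations of the same block, with no retraining.&lt;/p&gt;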

&lt;p&gt;The architecture sits at the confluence of the ICLR 2025 recurrent-depth work, DeepSeek-V3, and Universal Transformer research - not speculation, but synthesis. Whether or not it correctly reconstructs Claude Mythos, OpenMythos is a significant architectural contribution in its own right.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Geiping et al. - &lt;em&gt;Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach&lt;/em&gt; - ICLR 2025. &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;openreview.net/pdf?id=WwpYSOkkCt&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V3 Technical Report&lt;/em&gt; - arXiv:2412.19437. &lt;a href="https://arxiv.org/pdf/2412.19437" rel="noopener noreferrer"&gt;arxiv.org/pdf/2412.19437&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;arxiv.org/abs/2405.04434&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wang et al. - &lt;em&gt;Auxiliary-Loss-Free Load Balancing Strategy for Mixture of Experts&lt;/em&gt; - arXiv:2408.15664. &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv.org/html/2408.15664v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dao, T. - &lt;em&gt;FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning&lt;/em&gt; - NeurIPS 2023. &lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;openreview.net/forum?id=mZn2Xyh9Ec&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shah et al. - &lt;em&gt;FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision&lt;/em&gt; - 2024. &lt;a href="https://openreview.net/forum?id=tVConYid20" rel="noopener noreferrer"&gt;openreview.net/forum?id=tVConYid20&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bae et al. - &lt;em&gt;Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2410.20672" rel="noopener noreferrer"&gt;arxiv.org/abs/2410.20672&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Heo et al. - &lt;em&gt;RingFormer: A Ring-Enhanced Graph Transformer for Organic Solar Cell Property Prediction&lt;/em&gt; - Feb 2025.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MoDr - &lt;em&gt;Mixture-of-Depth-Recurrent Transformers&lt;/em&gt; - OpenReview. &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;openreview.net/forum?id=9Pba4rcQbE&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gu, A. et al. - &lt;em&gt;Efficiently Modeling Long Sequences with Structured State Spaces&lt;/em&gt; - ICLR 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dehghani et al. - &lt;em&gt;Universal Transformers&lt;/em&gt; - ICLR 2019. &lt;a href="https://arxiv.org/abs/1807.03819" rel="noopener noreferrer"&gt;arxiv.org/abs/1807.03819&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graves, A. - &lt;em&gt;Adaptive Computation Time for Recurrent Neural Networks&lt;/em&gt; - 2016.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hu et al. - &lt;em&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/em&gt; - arXiv:2106.09685. &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;arxiv.org/abs/2106.09685&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;LLM Agents in Autonomous Cyberattacks Survey&lt;/em&gt; - arXiv:2505.12786. &lt;a href="https://arxiv.org/html/2505.12786v2" rel="noopener noreferrer"&gt;arxiv.org/html/2505.12786v2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Happe, A. et al. - &lt;em&gt;Benchmarking LLM-driven Offensive Security&lt;/em&gt; - arXiv:2504.10112. &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv.org/html/2504.10112&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fang, R. et al. - &lt;em&gt;LLMs in Cybersecurity: A Survey&lt;/em&gt; - MDPI AI. &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;mdpi.com/2673-2688/6/9/216&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Understanding Dynamic Compute Allocation in Recurrent Transformers&lt;/em&gt; - arXiv:2602.08864. &lt;a href="https://arxiv.org/html/2602.08864" rel="noopener noreferrer"&gt;arxiv.org/html/2602.08864&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Thinking Deeper, Not Longer: Depth-Recurrent Transformers&lt;/em&gt; - arXiv:2603.21676. &lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;arxiv.org/html/2603.21676&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MarkTechPost - &lt;em&gt;Meet OpenMythos&lt;/em&gt; - April 2026. &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;marktechpost.com/2026/04/19&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blockchain.news - &lt;em&gt;2.67× Faster Validation Steps&lt;/em&gt; - April 2026. &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;blockchain.news/ainews&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Block Sparse FlashAttention&lt;/em&gt; - arXiv:2512.07011. &lt;a href="https://arxiv.org/abs/2512.07011" rel="noopener noreferrer"&gt;arxiv.org/abs/2512.07011&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;MoE Survey 2024&lt;/em&gt; - arXiv:2406.18219. &lt;a href="https://arxiv.org/abs/2406.18219" rel="noopener noreferrer"&gt;arxiv.org/abs/2406.18219&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Optimizing MoE Routing&lt;/em&gt; - arXiv:2506.16419. &lt;a href="https://arxiv.org/html/2506.16419v1" rel="noopener noreferrer"&gt;arxiv.org/html/2506.16419v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;kyegomez/OpenMythos&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>security</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
