<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Clint</title>
    <description>The latest articles on DEV Community by Clint (@clintjosy).</description>
    <link>https://dev.to/clintjosy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890922%2F410a06d1-1613-46c5-a78b-23600e882992.png</url>
      <title>DEV Community: Clint</title>
      <link>https://dev.to/clintjosy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clintjosy"/>
    <language>en</language>
    <item>
      <title>Your AI, Your Rules: Running a Local LLM with GPU Acceleration on Proxmox</title>
      <dc:creator>Clint</dc:creator>
      <pubDate>Fri, 01 May 2026 16:26:51 +0000</pubDate>
      <link>https://dev.to/clintjosy/your-ai-your-rules-running-a-local-llm-with-gpu-acceleration-on-proxmox-1plh</link>
      <guid>https://dev.to/clintjosy/your-ai-your-rules-running-a-local-llm-with-gpu-acceleration-on-proxmox-1plh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Bother Running Local LLMs?
&lt;/h2&gt;

&lt;p&gt;Before we get into the how, let's address the obvious question: &lt;strong&gt;why not just use Claude, GPT, or Gemini?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is - for many tasks, you should. But local LLMs make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy matters.&lt;/strong&gt; Code, internal documents, proprietary configs - none of it leaves your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale.&lt;/strong&gt; API calls add up fast when you're running a coding agent all day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency control.&lt;/strong&gt; No network round-trips, no rate limits, no API downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability.&lt;/strong&gt; Works on a plane, in a data center, behind a firewall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation.&lt;/strong&gt; Swap models freely, tune inference parameters, benchmark to your heart's content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide documents a real setup - not a toy demo - built specifically to run &lt;strong&gt;Claude Code&lt;/strong&gt; and &lt;strong&gt;pi.dev&lt;/strong&gt; against a local model, transparently, with no API key required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Host&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proxmox VE 8.x, kernel 6.17.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12-core (AMD/Intel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40 GB allocated to LLM container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 2000 Ada Generation Laptop (8 GB VRAM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;120 GB root + &lt;code&gt;/mnt/models&lt;/code&gt; for model files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tailscale mesh for remote access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The GPU is the critical piece - even a modest 8 GB card dramatically changes what's possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folvtjf8vnsyad8lljgbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folvtjf8vnsyad8lljgbw.png" alt="Architecture" width="800" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two services, two ports, one model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Port 11434&lt;/strong&gt; - llama.cpp native OpenAI-compatible API (for pi.dev, curl, anything OpenAI-compatible)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port 4000&lt;/strong&gt; - thin Python proxy translating Anthropic Messages API to OpenAI format (for Claude Code)&lt;/li&gt;
&lt;/ul&gt;
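
&lt;p&gt;Once both services are up (built in the parts below), you can sanity-check the OpenAI-compatible side from any machine on the LAN - a minimal sketch using the same address as the configs later in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal smoke test for the llama.cpp OpenAI-compatible endpoint.
# Assumes the llama-server from Part 3 is reachable at this address.
import requests

resp = requests.post(
    "http://192.168.100.103:11434/v1/chat/completions",
    json={
        "model": "local",  # llama.cpp serves whichever model it loaded
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;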




&lt;h2&gt;
  
  
  Part 1: The Container - Proxmox LXC Setup
&lt;/h2&gt;

&lt;p&gt;Use an &lt;strong&gt;LXC container&lt;/strong&gt; rather than a full VM because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Near-native CPU performance (no hypervisor overhead)&lt;/li&gt;
&lt;li&gt;Shared host kernel means GPU passthrough works with the host's NVIDIA driver&lt;/li&gt;
&lt;li&gt;Faster to snapshot, clone, and manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Container Config
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;/etc/pve/lxc/103.conf&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="err"&gt;arch:&lt;/span&gt; &lt;span class="err"&gt;amd64&lt;/span&gt;
&lt;span class="err"&gt;cores:&lt;/span&gt; &lt;span class="err"&gt;12&lt;/span&gt;
&lt;span class="err"&gt;features:&lt;/span&gt; &lt;span class="py"&gt;nesting&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="err"&gt;hostname:&lt;/span&gt; &lt;span class="err"&gt;llm-server&lt;/span&gt;
&lt;span class="err"&gt;memory:&lt;/span&gt; &lt;span class="err"&gt;40000&lt;/span&gt;
&lt;span class="err"&gt;net0:&lt;/span&gt; &lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;eth0,bridge=vmbr0,hwaddr=BC:24:11:F0:15:BA,type=veth&lt;/span&gt;
&lt;span class="err"&gt;ostype:&lt;/span&gt; &lt;span class="err"&gt;debian&lt;/span&gt;
&lt;span class="err"&gt;rootfs:&lt;/span&gt; &lt;span class="err"&gt;local-lvm:vm-103-disk-0,&lt;/span&gt;&lt;span class="py"&gt;size&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120G&lt;/span&gt;
&lt;span class="err"&gt;swap:&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;
&lt;span class="err"&gt;dev0:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia0,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev1:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidiactl,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev2:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev3:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm-tools,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;dev0&lt;/code&gt; through &lt;code&gt;dev3&lt;/code&gt; lines are the magic - Proxmox's native device passthrough syntax. No &lt;code&gt;lxc.mount.entry&lt;/code&gt; hacks required in newer PVE versions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical gotcha:&lt;/strong&gt; If you ever set &lt;code&gt;chattr +i&lt;/code&gt; on &lt;code&gt;/etc/resolv.conf&lt;/code&gt; inside the container (e.g., to prevent Tailscale from overwriting DNS), it will break Proxmox's pre-start hook which atomically updates the DNS config. The container won't start. Fix it from the host:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;pct mount 103
chattr &lt;span class="nt"&gt;-i&lt;/span&gt; /var/lib/lxc/103/rootfs/etc/resolv.conf
pct unmount 103
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2: The Model - Picking the Right One
&lt;/h2&gt;

&lt;p&gt;Model selection for a GPU-constrained system is non-obvious. The key insight is &lt;strong&gt;MoE vs Dense architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Active Params/Token&lt;/th&gt;
&lt;th&gt;CPU Speed&lt;/th&gt;
&lt;th&gt;GPU Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.6 27B&lt;/td&gt;
&lt;td&gt;27B&lt;/td&gt;
&lt;td&gt;~3.5 tok/s&lt;/td&gt;
&lt;td&gt;High - all layers benefit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.6 35B-A3B&lt;/td&gt;
&lt;td&gt;~3B&lt;/td&gt;
&lt;td&gt;~18 tok/s&lt;/td&gt;
&lt;td&gt;Lower - sparse routing already fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;~16 tok/s&lt;/td&gt;
&lt;td&gt;Medium - GPU boosts active layers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The counter-intuitive result: &lt;strong&gt;a 35B MoE model runs 5x faster than a 27B dense model on CPU&lt;/strong&gt; because MoE only activates a small fraction of weights per token. Don't assume smaller parameter count means faster inference.&lt;/p&gt;
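
&lt;p&gt;A back-of-envelope model makes the gap concrete: CPU decoding is roughly memory-bandwidth-bound, so what matters is bytes touched per token. The bandwidth figure below is an assumption for illustration, not a measurement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Decode speed on CPU is roughly bandwidth / bytes_touched_per_token.
bandwidth_gb_s = 60      # assumed dual-channel DDR5 ballpark
bytes_per_param = 0.5    # ~4-bit quantization

for name, active_params in [("dense 27B", 27e9), ("MoE 35B-A3B", 3e9)]:
    tok_s = bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)
    print(f"{name}: ~{tok_s:.1f} tok/s upper bound")

# dense 27B: ~4.4 tok/s upper bound
# MoE 35B-A3B: ~40.0 tok/s upper bound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real throughput lands below these bounds (routing, attention, cache misses), but the ratio matches the ~5x gap measured above.&lt;/p&gt;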

&lt;p&gt;I chose &lt;strong&gt;Gemma 4 26B-A4B Q4_K_XL&lt;/strong&gt; (15.9 GB) for its:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong instruction following and coding ability&lt;/li&gt;
&lt;li&gt;Multimodal capability (vision via mmproj)&lt;/li&gt;
&lt;li&gt;262K token context window&lt;/li&gt;
&lt;li&gt;4B active parameters (MoE) - fast despite large parameter count&lt;/li&gt;
&lt;li&gt;Available from &lt;a href="https://huggingface.co/unsloth/gemma-4-26B-it-GGUF" rel="noopener noreferrer"&gt;unsloth/gemma-4-26B-it-GGUF&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 3: CPU-Only First - Getting llama.cpp Running
&lt;/h2&gt;

&lt;p&gt;Always start CPU-only. It's simpler, debuggable, and gives you a baseline to measure GPU gains against.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build llama.cpp
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; git cmake build-essential libopenblas-dev

git clone https://github.com/ggml-org/llama.cpp /opt/llm/llama.cpp
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/llm/llama.cpp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF .. &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Systemd Service
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;/etc/systemd/system/llama-server.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;LLM Inference Server (llama.cpp)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/llm/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-m /mnt/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-t 11 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-c 32768 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--batch-size 512 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--ubatch-size 128 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-ngl 0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-fa on &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--cache-type-k q4_0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--cache-type-v q4_0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--no-mmap &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--reasoning off &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--port 11434&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Flags Explained
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t 11&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use 11 of 12 cores - leave 1 for the OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-c 32768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32K context - Claude Code's system prompt alone is ~24K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--batch-size 512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Larger batches = higher throughput during prompt processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cache-type-k/v q4_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-bit KV cache - 75% smaller than f16, minimal quality loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--no-mmap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load model fully into RAM - avoids slow first requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable Gemma's thinking mode - outputs go to &lt;code&gt;reasoning_content&lt;/code&gt; by default, which breaks OpenAI clients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-fa on&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Flash Attention - ~30% faster, same quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
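
&lt;p&gt;The KV-cache saving is easy to verify with rough arithmetic - the layer and head counts below are illustrative placeholders, not the actual Gemma 4 config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough KV-cache sizing at 32K context: f16 vs q4_0 cache types.
layers, kv_heads, head_dim, ctx = 30, 8, 128, 32768  # assumed dims

def kv_cache_gb(bytes_per_value: float) -&amp;gt; float:
    # K and V each store layers x ctx x kv_heads x head_dim values
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_value / 1e9

print(f"f16 : {kv_cache_gb(2.0):.1f} GB")     # ~4.0 GB
print(f"q4_0: {kv_cache_gb(0.5625):.1f} GB")  # ~1.1 GB (4.5 bits incl. scales)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;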

&lt;p&gt;&lt;strong&gt;CPU baseline: ~16-18 tok/s&lt;/strong&gt; (MoE model, 4B active params)&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: GPU Passthrough - The Hard Part
&lt;/h2&gt;

&lt;p&gt;This is where most guides give up or give wrong advice. Here's what actually works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the Right Driver on the Host
&lt;/h3&gt;

&lt;p&gt;The host kernel was 6.17.x (PVE custom kernel). Debian's packaged NVIDIA driver (550) doesn't support kernels past ~6.8. The solution is NVIDIA's official CUDA repo for Debian 13.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Proxmox host&lt;/span&gt;
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cuda-keyring_1.1-1_all.deb
apt-get update
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-driver-595 nvidia-open-kernel-dkms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs driver &lt;strong&gt;595.71.05&lt;/strong&gt; with DKMS support - it automatically builds kernel modules for all installed kernels including the PVE 6.17.x series.&lt;/p&gt;

&lt;p&gt;Verify with &lt;code&gt;nvidia-smi&lt;/code&gt; on the host. It should show your GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add GPU Devices to the Container Config
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;/etc/pve/lxc/103.conf&lt;/code&gt;, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="err"&gt;dev0:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia0,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev1:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidiactl,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev2:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev3:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm-tools,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pct stop 103 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pct start 103
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Verify GPU Visibility Inside the Container
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /dev/nvidia&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 195,   0 /dev/nvidia0&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 195, 255 /dev/nvidiactl&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 505,   0 /dev/nvidia-uvm&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 505,   1 /dev/nvidia-uvm-tools&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The devices are visible - but &lt;code&gt;nvidia-smi&lt;/code&gt; won't work yet. You need userspace libraries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: CUDA Inside LXC - Making the GPU Work
&lt;/h2&gt;

&lt;p&gt;The container needs NVIDIA userspace libraries that &lt;strong&gt;exactly match the host driver version&lt;/strong&gt; (595.71.05). Version mismatch causes &lt;code&gt;nvidia-smi: Failed to initialize NVML&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Matching Userspace Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container - add the CUDA repo for Debian 12&lt;/span&gt;
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cuda-keyring_1.1-1_all.deb
apt-get update

&lt;span class="c"&gt;# Install userspace libraries pinned to host driver version&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libnvidia-ml1&lt;span class="o"&gt;=&lt;/span&gt;595.71.05-1 nvidia-driver-cuda&lt;span class="o"&gt;=&lt;/span&gt;595.71.05-1

&lt;span class="c"&gt;# Verify&lt;/span&gt;
nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05   Driver Version: 595.71.05      CUDA Version: 13.2               |
+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA RTX 2000 Ada Gene...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3             11W /   39W |       0MiB /   8188MiB |      0%      Default |
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now install the CUDA toolkit for building llama.cpp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; cuda-toolkit-12-6
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt;
nvcc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rebuild llama.cpp with CUDA
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/llm/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build

&lt;span class="c"&gt;# sm_89 = Ada Lovelace (RTX 2000 Ada, RTX 4xxx series)&lt;/span&gt;
&lt;span class="c"&gt;# Use sm_86 for Ampere (RTX 3xxx), sm_75 for Turing (RTX 2xxx)&lt;/span&gt;
&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt; cmake &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;89 &lt;span class="se"&gt;\&lt;/span&gt;
  ..

make &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify CUDA is linked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ldd build/bin/llama-server | &lt;span class="nb"&gt;grep &lt;/span&gt;cuda
&lt;span class="c"&gt;# libggml-cuda.so.0 =&amp;gt; .../libggml-cuda.so.0&lt;/span&gt;
&lt;span class="c"&gt;# libcudart.so.12  =&amp;gt; .../libcudart.so.12&lt;/span&gt;
&lt;span class="c"&gt;# libcublas.so.12  =&amp;gt; .../libcublas.so.12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finding the Optimal GPU Layer Count
&lt;/h3&gt;

&lt;p&gt;With 8 GB VRAM and a 15.9 GB model, you can't fit everything on GPU. The math (sketched in code below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model has &lt;strong&gt;30 transformer layers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Available VRAM after system overhead: ~7.5 GB&lt;/li&gt;
&lt;li&gt;Per-layer cost: ~530 MB of weights (15.9 GB / 30 layers), plus a share of the KV cache&lt;/li&gt;
&lt;li&gt;Safe layer count: &lt;strong&gt;12 layers&lt;/strong&gt; (leaves ~700 MB free for compute buffers)&lt;/li&gt;
&lt;/ul&gt;
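
&lt;p&gt;A minimal sketch of that budget math - the reserve figure is an assumption; tune it to what &lt;code&gt;nvidia-smi&lt;/code&gt; reports on your card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# VRAM budget for -ngl: how many layers fit in 8 GB (illustrative).
model_gb, n_layers = 15.9, 30
vram_gb, reserve_gb = 8.0, 1.2   # reserve: CUDA context, KV cache, buffers

per_layer_gb = model_gb / n_layers               # ~0.53 GB of weights/layer
ngl = int((vram_gb - reserve_gb) / per_layer_gb)
print(f"~{per_layer_gb * 1024:.0f} MB/layer -&amp;gt; start around -ngl {ngl}")
# ~543 MB/layer -&amp;gt; start around -ngl 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;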

&lt;p&gt;Start low and increase until you hit OOM, then back off one step. Update the service with &lt;code&gt;-ngl 12&lt;/code&gt; and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GPU-hybrid result: ~21 tok/s&lt;/strong&gt; (vs 16-18 tok/s CPU-only). The GPU hits 60%+ SM utilization during inference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: The Proxy - Bridging Claude Code to Your LLM
&lt;/h2&gt;

&lt;p&gt;Here's the problem nobody warns you about: &lt;strong&gt;Claude Code uses the Anthropic Messages API format&lt;/strong&gt;, while llama.cpp serves the &lt;strong&gt;OpenAI Chat Completions format&lt;/strong&gt;. They're incompatible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9as4caqobw8eldls2gl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9as4caqobw8eldls2gl4.png" alt="Route" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The proxy handles both sync and streaming (SSE) responses - Claude Code uses streaming for the interactive terminal experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Translation Points
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anthropic&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;system&lt;/code&gt; (string or content array)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;messages[0]&lt;/code&gt; with &lt;code&gt;role: system&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;content&lt;/code&gt; (array of blocks)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;content&lt;/code&gt; (plain string)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content[0].text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;choices[0].message.content&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSE &lt;code&gt;content_block_delta&lt;/code&gt; events&lt;/td&gt;
&lt;td&gt;SSE &lt;code&gt;choices[0].delta.content&lt;/code&gt; chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stop_reason: end_turn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;finish_reason: stop&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
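
&lt;p&gt;Condensed to its core, the translation looks roughly like this - a minimal non-streaming sketch of the mapping above, not the full proxy (which also handles SSE streaming and tool use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal non-streaming sketch of the Anthropic -&amp;gt; OpenAI translation.
# Illustrative only: the real proxy also handles SSE streaming and tool use.
import aiohttp
from aiohttp import web

LLAMA_URL = "http://127.0.0.1:11434/v1/chat/completions"

def to_openai(body: dict) -&amp;gt; dict:
    """Anthropic Messages request -&amp;gt; OpenAI Chat Completions request."""
    messages = []
    system = body.get("system")
    if isinstance(system, list):    # system may be a content-block array
        system = "".join(block.get("text", "") for block in system)
    if system:
        messages.append({"role": "system", "content": system})
    for m in body.get("messages", []):
        content = m["content"]
        if isinstance(content, list):   # flatten content blocks to a string
            content = "".join(b.get("text", "") for b in content
                              if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {"model": body.get("model", "local"), "messages": messages,
            "max_tokens": body.get("max_tokens", 1024), "stream": False}

def to_anthropic(resp: dict) -&amp;gt; dict:
    """OpenAI response -&amp;gt; Anthropic Messages response."""
    choice = resp["choices"][0]
    stop = "end_turn" if choice.get("finish_reason") == "stop" else "max_tokens"
    return {"id": resp.get("id", "msg_local"), "type": "message",
            "role": "assistant", "model": resp.get("model", "local"),
            "content": [{"type": "text",
                         "text": choice["message"]["content"] or ""}],
            "stop_reason": stop,
            "usage": {"input_tokens": resp["usage"]["prompt_tokens"],
                      "output_tokens": resp["usage"]["completion_tokens"]}}

async def handle(request: web.Request) -&amp;gt; web.Response:
    body = await request.json()
    async with aiohttp.ClientSession() as session:
        async with session.post(LLAMA_URL, json=to_openai(body)) as r:
            return web.json_response(to_anthropic(await r.json()))

app = web.Application()
app.router.add_post("/v1/messages", handle)   # Anthropic Messages endpoint

if __name__ == "__main__":
    web.run_app(app, port=4000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;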

&lt;p&gt;The full proxy is ~150 lines of Python using &lt;code&gt;aiohttp&lt;/code&gt;. Run it as a systemd service on port 4000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/anthropic-proxy.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Anthropic API Proxy for llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target llama-server.service&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /opt/anthropic-proxy.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Claude Code Config
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;~/.claude/settings.json&lt;/code&gt; on the client machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.100.103:4000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-no-key-required"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"hi"&lt;/span&gt;
&lt;span class="c"&gt;# Hello! How can I help you today?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why not LiteLLM?&lt;/strong&gt; I tried it. LiteLLM 1.83 introduced a &lt;code&gt;ResponsesAPIResponse&lt;/code&gt; type internally that fails validation when converting back to &lt;code&gt;AnthropicResponse&lt;/code&gt;. Requests hang silently with no error returned to the client. The 150-line custom proxy was faster to write and debug than fighting the library.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 7: pi.dev Integration
&lt;/h2&gt;

&lt;p&gt;pi.dev speaks OpenAI format natively - no proxy needed, connect directly to port 11434.&lt;/p&gt;

&lt;p&gt;File: &lt;code&gt;~/.pi/agent/models.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.100.103:11434/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-dummy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"compat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"supportsDeveloperRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"supportsReasoningEffort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gemma 4 26B (Local GPU)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheRead"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheWrite"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;compat&lt;/code&gt; block is important - llama.cpp doesn't understand the &lt;code&gt;developer&lt;/code&gt; role (used by pi for reasoning-capable models) or the &lt;code&gt;reasoning_effort&lt;/code&gt; parameter. Setting both to &lt;code&gt;false&lt;/code&gt; makes pi send standard &lt;code&gt;system&lt;/code&gt; messages instead.&lt;/p&gt;

&lt;p&gt;Open pi and type &lt;code&gt;/model&lt;/code&gt; to select your local model. The file reloads automatically - no restart needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (Dense 27B)&lt;/td&gt;
&lt;td&gt;~3.5 tok/s&lt;/td&gt;
&lt;td&gt;Wrong model choice - dense is slow on CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (MoE 35B)&lt;/td&gt;
&lt;td&gt;~18 tok/s&lt;/td&gt;
&lt;td&gt;Switched to MoE - massive improvement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (Gemma 4 26B MoE)&lt;/td&gt;
&lt;td&gt;~16 tok/s&lt;/td&gt;
&lt;td&gt;Better model quality, similar speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU hybrid (12/30 layers)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~21 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30% improvement, GPU at 60%+ utilization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing (prefill)&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;td&gt;GPU accelerates context loading significantly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Typical response times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"hi" - ~1 second&lt;/li&gt;
&lt;li&gt;100-token code explanation - ~5 seconds&lt;/li&gt;
&lt;li&gt;500-token code generation - ~25 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPU contributes most to &lt;strong&gt;prompt processing&lt;/strong&gt; speed - loading a large codebase context into the KV cache is noticeably faster with GPU layers active.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotchas and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;chattr +i&lt;/code&gt; on resolv.conf breaks container startup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Proxmox's pre-start hook atomically renames a temp file to &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. The immutable flag blocks this. The container fails silently - only visible in &lt;code&gt;lxc-start -l DEBUG&lt;/code&gt; logs as &lt;code&gt;close (rename) atomic file failed: Operation not permitted&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Driver version must match exactly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Userspace libraries inside the container must match the host kernel module version exactly. A mismatch causes &lt;code&gt;nvidia-smi: Failed to initialize NVML&lt;/code&gt;. Pin the version explicitly: &lt;code&gt;apt-get install libnvidia-ml1=595.71.05-1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Kernel 6.17 + Debian driver 550 = build failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Debian's packaged NVIDIA driver 550 has no DKMS support for kernels past ~6.8. The fix is NVIDIA's official CUDA repo for Debian 13 (debian13), which ships driver 595 with working DKMS for modern kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MoE vs Dense - the counterintuitive performance flip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 35B MoE model genuinely outperforms a 27B dense model on CPU because sparse activation means only ~3-4B parameters are computed per token. Never assume smaller parameter count means faster inference - check the architecture first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Gemma 4 thinks by default&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 ships with its internal chain-of-thought ("thinking") mode enabled by default. With streaming, the client receives &lt;code&gt;reasoning_content&lt;/code&gt; but an empty &lt;code&gt;content&lt;/code&gt; until thinking completes. For chat interfaces that expect immediate tokens, add &lt;code&gt;--reasoning off&lt;/code&gt;. For code-accuracy tasks, leaving it enabled is worth the latency cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. LiteLLM 1.83 hangs silently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The latest LiteLLM uses a new &lt;code&gt;ResponsesAPIResponse&lt;/code&gt; type that fails Pydantic validation when serializing to &lt;code&gt;AnthropicResponse&lt;/code&gt;. The request completes internally but the response is never sent to the client. No error, no timeout - just silence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Context window must exceed Claude Code's system prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's built-in system prompt is approximately 24K tokens. A context window below 32K triggers an immediate &lt;code&gt;exceed_context_size_error&lt;/code&gt; before any user message is processed. Set &lt;code&gt;-c 32768&lt;/code&gt; as the minimum.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short term:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-enable thinking mode selectively by passing &lt;code&gt;budget_tokens&lt;/code&gt; per request&lt;/li&gt;
&lt;li&gt;Add a &lt;code&gt;/v1/models&lt;/code&gt; endpoint to the proxy for model auto-discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For a dedicated thinking model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-Distill-Qwen-14B&lt;/strong&gt; (~8 GB Q4) - fits almost entirely in 8 GB VRAM, estimated 30-40 tok/s, purpose-built for reasoning tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For bigger hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 3090 or 4090 (24 GB VRAM) - entire Gemma 4-26B fits on GPU, estimated 60-80 tok/s&lt;/li&gt;
&lt;li&gt;A dual-GPU setup with NVLink enables running 70B models entirely on GPU&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Stack, Summarized
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model:    Gemma 4 26B-A4B Q4_K_XL (15.9 GB, MoE)
Engine:   llama.cpp with CUDA (sm_89, Ada Lovelace)
GPU:      RTX 2000 Ada 8 GB - 12/30 layers on GPU
Speed:    ~21 tok/s generation, ~40 tok/s prefill
Proxy:    Python aiohttp - Anthropic &amp;lt;-&amp;gt; OpenAI translation
Clients:  Claude Code (port 4000), pi.dev (port 11434)
Access:   LAN (192.168.100.103) + Tailscale
Cost:     $0 per query after hardware
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire setup took about 8 hours of real iteration. Most of that time went to three of the gotchas above - the &lt;code&gt;chattr&lt;/code&gt; trap, the kernel/driver mismatch, and the LiteLLM silent hang. Hopefully this guide saves you all of it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on Proxmox 8.x · llama.cpp · NVIDIA driver 595.71.05 · CUDA 12.6 · Gemma 4 26B · May 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>selfhosted</category>
      <category>proxmox</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>OpenMythos Teardown: Dissecting the Open-Source Reconstruction of Claude Mythos</title>
      <dc:creator>Clint</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:07:01 +0000</pubDate>
      <link>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</link>
      <guid>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; OpenMythos is a community-driven theoretical reconstruction. It is not affiliated with or endorsed by Anthropic. All claims about Claude Mythos's architecture are speculative hypotheses backed by publicly available research.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Is OpenMythos?
&lt;/h2&gt;

&lt;p&gt;On April 21, 2026, &lt;a href="https://github.com/kyegomez" rel="noopener noreferrer"&gt;Kye Gomez&lt;/a&gt; - founder of Swarms AI - published &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; to GitHub. The project is a fully open-source PyTorch reconstruction of the hypothesized architecture behind Anthropic's &lt;strong&gt;Claude Mythos&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;The thesis: Claude Mythos achieves its extraordinary reasoning &lt;strong&gt;not&lt;/strong&gt; by stacking hundreds of unique transformer layers, but by &lt;strong&gt;looping a compact set of layers multiple times&lt;/strong&gt;, performing continuous "latent chain-of-thought" reasoning in hidden state space before ever emitting a single output token.&lt;/p&gt;

&lt;p&gt;This idea - a &lt;strong&gt;Recurrent-Depth Transformer (RDT)&lt;/strong&gt; - is grounded in a growing body of 2024–2025 academic research from ICLR, DeepSeek, and multiple independent labs. The architecture combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A three-stage &lt;strong&gt;Prelude → Loop → Coda&lt;/strong&gt; pipeline&lt;/li&gt;
&lt;li&gt;Spectral-radius-constrained hidden state updates (from &lt;strong&gt;Parcae&lt;/strong&gt; architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Computation Time (ACT)&lt;/strong&gt; halting for per-token variable compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Mixture of Experts (MoE)&lt;/strong&gt; with DeepSeek-V3-style bias-based load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Latent Attention (MLA)&lt;/strong&gt; for 10–20× KV cache reduction&lt;/li&gt;
&lt;li&gt;Depth-wise &lt;strong&gt;LoRA adapters&lt;/strong&gt; for cheap per-loop specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news&lt;/a&gt;, early training runs show &lt;strong&gt;2.67× faster validation steps&lt;/strong&gt; compared to a baseline dense transformer at the same parameter count.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Central Hypothesis
&lt;/h2&gt;

&lt;p&gt;The key architectural claim: a 770M-parameter Recurrent-Depth Transformer can match the effective capacity of a standard 1.3B dense transformer, because every parameter is reused N times across loop iterations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Effective Compute ≈ Parameters × Loop Iterations

vs.

Dense Transformer Effective Compute ≈ Parameters × 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the model can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale reasoning depth at inference&lt;/strong&gt; without retraining (run more loops for harder problems)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalize to more loops than it was trained on&lt;/strong&gt; (depth extrapolation via LoRA clamping)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run entirely in continuous latent space&lt;/strong&gt; - no chain-of-thought token emission required&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"A 770M-parameter RDT matches a 1.3B dense model" - &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;MarkTechPost, April 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The model follows a strict three-stage pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" alt="Openmythos Architecture" width="572" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:899–1086&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Prelude&lt;/strong&gt; and &lt;strong&gt;Coda&lt;/strong&gt; execute once (fixed compute). The &lt;strong&gt;Recurrent Block&lt;/strong&gt; holds all the reasoning capacity and runs T times. The frozen encoding &lt;code&gt;e&lt;/code&gt; is injected at every loop step, preventing the model from "forgetting" the input.&lt;/p&gt;
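
&lt;p&gt;A toy version of that control flow - illustrative shapes only, with none of the MoE, MLA, ACT, or LoRA machinery, and not the OpenMythos implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy Prelude -&amp;gt; Loop -&amp;gt; Coda pipeline (illustrative, standalone).
import torch
import torch.nn as nn

class TinyRDT(nn.Module):
    def __init__(self, dim: int = 64, loops: int = 8):
        super().__init__()
        self.prelude = nn.Linear(dim, dim)   # runs once: input -&amp;gt; encoding e
        self.core = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)  # one shared looped block
        self.coda = nn.Linear(dim, dim)      # runs once: latent -&amp;gt; output
        self.loops = loops

    def forward(self, x: torch.Tensor) -&amp;gt; torch.Tensor:
        e = self.prelude(x)                  # frozen encoding of the input
        h = torch.zeros_like(e)              # fresh latent reasoning state
        for _ in range(self.loops):          # same weights reused T times
            h = self.core(h + e)             # re-inject e at every loop step
        return self.coda(h)

out = TinyRDT()(torch.randn(2, 16, 64))      # (batch, seq, dim)
print(out.shape)                             # torch.Size([2, 16, 64])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;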

&lt;h2&gt;
  
  
  Dissection: Six Novel Mechanisms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 LTI-Stable Injection - The Heartbeat
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:684–743&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The most critical and least obvious component. Without it, looped transformers diverge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LTIInjection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Linear Time-Invariant injection with spectral radius &amp;lt; 1 by construction.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# A_continuous = -exp(log_A)  → always negative diagonal
&lt;/span&gt;        &lt;span class="c1"&gt;# A_discrete   = exp(Δt × A_continuous)  → always in (0, 1)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_dt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_A&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# spectral radius guaranteed &amp;lt; 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The update rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;h&lt;span class="p"&gt;_{&lt;/span&gt;t+1&lt;span class="p"&gt;}&lt;/span&gt; = A · h&lt;span class="p"&gt;_&lt;/span&gt;t  +  B · e  +  Transformer(h&lt;span class="p"&gt;_&lt;/span&gt;t, e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;ρ(A) &amp;lt; 1&lt;/code&gt; is guaranteed by parameterization - not enforced by regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;A parameterization&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unconstrained&lt;/td&gt;
&lt;td&gt;ρ(A) ≥ 1 possible → hidden state explodes after N loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft regularization&lt;/td&gt;
&lt;td&gt;Sometimes works, often diverges at high LR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI with ZOH&lt;/td&gt;
&lt;td&gt;ρ(A) &amp;lt; 1 always → stable at any depth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The implementation uses &lt;strong&gt;zero-order-hold (ZOH) discretization&lt;/strong&gt;: a continuous-time negative diagonal matrix &lt;code&gt;A_c = -exp(log_A)&lt;/code&gt; is mapped to discrete time via &lt;code&gt;exp(Δt · A_c)&lt;/code&gt;, which always lands in &lt;code&gt;(0, 1)&lt;/code&gt;. This is borrowed from state-space models (Gu et al., 2021 - S4).&lt;/p&gt;
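
&lt;p&gt;A quick numeric check of why this cannot diverge - for any real-valued parameters, &lt;code&gt;exp(-exp(x))&lt;/code&gt; lands strictly in (0, 1). A standalone sketch mirroring &lt;code&gt;get_A&lt;/code&gt; above, not repo code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# ZOH parameterization pins the discrete A inside (0, 1) by construction.
import torch

log_A = torch.randn(8)    # unconstrained learnable parameters
log_dt = torch.randn(8)
A = torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))
assert (A &amp;gt; 0).all() and (A &amp;lt; 1).all()   # spectral radius &amp;lt; 1, always

h = torch.randn(8)
for _ in range(1000):     # iterate the homogeneous update h = A * h
    h = A * h
print(h.abs().max())      # decays toward zero - no blow-up at any depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;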

&lt;blockquote&gt;
&lt;p&gt;Every divergent training run in the Parcae architecture paper had ρ(A) ≥ 1. Every convergent run had ρ(A) &amp;lt; 1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.2 ACT Halting - Variable Compute per Token
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Files:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:750–781&lt;/code&gt; (halting unit), &lt;code&gt;open_mythos/main.py:865–889&lt;/code&gt; (integration in RecurrentBlock)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ACTHalting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-position adaptive computation time.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;halt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remainder trick: assign leftover probability at threshold crossing
&lt;/span&gt;&lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cumulative_halt&lt;/span&gt;
&lt;span class="n"&gt;crossed&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cumulative_halt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;act_threshold&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Accumulate weighted hidden state
&lt;/span&gt;&lt;span class="n"&gt;h_out&lt;/span&gt;            &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;span class="n"&gt;cumulative_halt&lt;/span&gt;  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;span class="n"&gt;still_running&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The cat sat."        → halts at loop 3   (trivial, no reasoning needed)
"Prove P ≠ NP."       → halts at loop 16  (maximum compute allocated)
"2 + 2"               → halts at loop 1
"Multi-step logic..."  → halts at loop 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research on recurrent-depth architectures&lt;/a&gt;, looped updates exhibit a &lt;strong&gt;rapid norm decay&lt;/strong&gt; pattern: early iterations make large hidden-state changes, late iterations make tiny orthogonal adjustments. ACT exploits this by halting when updates become negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput impact:&lt;/strong&gt; 2–3× improvement in inference throughput (easy tokens exit early, expensive compute is allocated to hard tokens only).&lt;/p&gt;

&lt;p&gt;The critical bug fixed in OpenMythos v0.4.0: &lt;strong&gt;halted positions must be gated from weight accumulation&lt;/strong&gt;. Once a position halts, its &lt;code&gt;h&lt;/code&gt; must not be included in gradient updates - a subtle but catastrophic error if missed.&lt;/p&gt;
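
&lt;p&gt;A self-contained sketch of the corrected loop with that gating applied - the halting unit and recurrent step are stand-in stubs, not the repo's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

B, T, D, max_loops, threshold = 2, 8, 32, 16, 0.99
h       = torch.randn(B, T, D)
halt_fn = lambda h: torch.sigmoid(h.mean(-1))  # stub for ACTHalting
step_fn = lambda h: h + 0.1 * torch.tanh(h)    # stub for the recurrent block

h_out           = torch.zeros_like(h)
cumulative_halt = torch.zeros(B, T)
still_running   = torch.ones(B, T, dtype=torch.bool)

for _ in range(max_loops):
    h = step_fn(h)
    p = halt_fn(h) * still_running                 # the fix: halted positions contribute 0
    crossed = (cumulative_halt + p) &gt;= threshold
    weight  = torch.where(crossed, 1.0 - cumulative_halt, p)
    h_out  += weight.unsqueeze(-1) * h
    cumulative_halt += weight
    still_running  &amp;= ~crossed
    if not still_running.any():                    # every position has halted
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;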

&lt;h3&gt;
  
  
  4.3 Loop-Index RoPE - Teaching Shared Weights Two Jobs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:541–571&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loop_index_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Inject sinusoidal depth-position signal into hidden state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;angles&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;
    &lt;span class="n"&gt;emb&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; With pure weight sharing, the model runs identical computation at loop 1 and loop 16 - no mechanism to differentiate "early encoding" from "late refinement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Inject a sinusoidal signal keyed to the loop index &lt;code&gt;t&lt;/code&gt; before every iteration, similar to how RoPE encodes &lt;em&gt;sequence position&lt;/em&gt;. Now the shared weights can learn functionally distinct behaviors per depth - not via separate parameters, but via different activations conditioned on the loop signal.&lt;/p&gt;

&lt;p&gt;This is analogous to the &lt;strong&gt;RingFormer&lt;/strong&gt; architecture (&lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;Heo et al., Feb 2025&lt;/a&gt;) which uses low-rank "level signals" for the same purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Depth-Wise LoRA - Cheap Specialization at Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:578–620&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoRAAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-loop scale LoRA: shared A/B matrices, learned scale per loop index.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;t_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_embeddings&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# clamp for depth extrapolation
&lt;/span&gt;        &lt;span class="n"&gt;s&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# (rank,) - learned per-loop scale
&lt;/span&gt;        &lt;span class="n"&gt;down&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;   &lt;span class="c1"&gt;# (B, T, rank)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;        &lt;span class="c1"&gt;# (B, T, dim)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameter cost analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Parameters per loop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully distinct weights&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim × dim&lt;/code&gt; (hundreds of millions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure weight sharing&lt;/td&gt;
&lt;td&gt;0 (least expressive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA adapter&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rank × dim × 2 + rank × max_loops&lt;/code&gt; (thousands)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;clamp&lt;/code&gt; operation (&lt;code&gt;min(loop_t, max_t)&lt;/code&gt;) enables &lt;strong&gt;depth extrapolation&lt;/strong&gt;: train on 16 loops, run inference with 32 loops. Loops 17–32 reuse the scale learned for loop 16. Quality improves sharply over the first additional loops, then plateaus.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is validated by the &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;MoDr paper (OpenReview)&lt;/a&gt; - "Mixture-of-Depth-Recurrent Transformers" - which shows LoRA-based depth adaptation enables reliable out-of-distribution loop generalization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.5 Fine-Grained MoE with Bias-Based Load Balancing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:426–534&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MoEFFN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;DeepSeek-style: fine-grained routed experts + always-on shared experts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                          &lt;span class="c1"&gt;# (B, T, n_experts)
&lt;/span&gt;        &lt;span class="n"&gt;scores&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# gate weights (gradient flows here)
&lt;/span&gt;        &lt;span class="n"&gt;biased_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router_bias&lt;/span&gt;               &lt;span class="c1"&gt;# bias shifted (no gradient)
&lt;/span&gt;        &lt;span class="n"&gt;topk_idx&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;biased_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# renormalize
&lt;/span&gt;
        &lt;span class="c1"&gt;# Dispatch tokens to selected experts
&lt;/span&gt;        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Always-on shared experts
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;expert&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared_experts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;expert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The load-balancing trick (DeepSeek-V3 style):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard auxiliary-loss balancing adds a penalty term to the training objective - but this introduces competing gradients and a tricky hyperparameter. OpenMythos uses &lt;strong&gt;bias-based routing&lt;/strong&gt; instead:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" alt="Routing Decision" width="528" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt; (Auxiliary-Loss-Free Load Balancing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Biases are updated externally: overloaded experts get their bias decreased, underloaded ones increased (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;No gradient interference with the task objective&lt;/li&gt;
&lt;li&gt;Zero token dropping during training and inference&lt;/li&gt;
&lt;/ul&gt;
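
&lt;p&gt;A sketch of that out-of-band update - the update rate &lt;code&gt;u&lt;/code&gt; and the load statistic are illustrative, not the repo's exact values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

@torch.no_grad()  # runs outside autograd, so no gradient ever touches the bias
def update_router_bias(router_bias: torch.Tensor, topk_idx: torch.Tensor, u: float = 1e-3):
    # call once per optimizer step with the batch's top-k expert indices
    n_experts = router_bias.numel()
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    load = load / load.sum()                           # fraction of routed tokens per expert
    router_bias += u * torch.sign(load.mean() - load)  # overloaded: bias down; underloaded: up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;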

&lt;p&gt;The v0.4.0 bugfix "stop load balance bias gradient leak" fixed a subtle error where the bias update was accidentally being included in the backward pass - polluting task gradients with load-balancing signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained vs coarse-grained experts:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Expert dim&lt;/th&gt;
&lt;th&gt;Experts&lt;/th&gt;
&lt;th&gt;Active per token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coarse (Mixtral-style)&lt;/td&gt;
&lt;td&gt;Large (≈ full FFN)&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-grained (DeepSeek-style)&lt;/td&gt;
&lt;td&gt;Small (≈ 1/16 FFN)&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenMythos 3B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;expert_dim=4096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;top-4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fine-grained experts activate more diverse combinations per token, increasing effective routing paths from &lt;code&gt;C(8,2)=28&lt;/code&gt; to &lt;code&gt;C(64,4)≈635,376&lt;/code&gt;.&lt;/p&gt;
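
&lt;p&gt;The counts are easy to verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

math.comb(8, 2)   # 28      - coarse: choose 2 of 8 experts
math.comb(64, 4)  # 635376  - fine-grained: choose 4 of 64 experts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;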

&lt;h3&gt;
  
  
  4.6 Multi-Latent Attention - 10–20× KV Cache Compression
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:284–419&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;MLA compresses KV to a low-rank latent, dramatically reducing inference memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Standard KV Cache: K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;    per token
GQA Cache:         K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt; per token
MLA Cache:         c&lt;span class="p"&gt;_&lt;/span&gt;kv ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;lora&lt;span class="p"&gt;_&lt;/span&gt;rank&lt;span class="p"&gt;}&lt;/span&gt;           per token
                   k&lt;span class="p"&gt;_&lt;/span&gt;rope ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;qk&lt;span class="p"&gt;_&lt;/span&gt;rope&lt;span class="p"&gt;_&lt;/span&gt;head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;     per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 1T scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Cache per token&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full MHA&lt;/td&gt;
&lt;td&gt;&lt;code&gt;128 × 128 × 2 = 32,768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA (16 KV heads)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16 × 128 × 2 = 4,096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1024 + 128 = 1,152&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28× (3.6× over GQA)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trick: only &lt;code&gt;c_kv&lt;/code&gt; (the latent) and &lt;code&gt;k_rope&lt;/code&gt; (RoPE-encoded keys) are cached. &lt;code&gt;K_nope&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; are reconstructed on-the-fly via a cheap upward projection - compute cost is negligible vs. memory saved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# At each token position:
&lt;/span&gt;&lt;span class="n"&gt;c_kv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;kv_lora_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qk_rope_head_dim&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Cache c_kv and k_rope - NOT K, V themselves
&lt;/span&gt;
&lt;span class="c1"&gt;# At attention time:
&lt;/span&gt;&lt;span class="n"&gt;kv_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_up&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_kv_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# reconstruct K_nope + V from latent
&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([...],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# split reconstructed output
&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# full K = nope + rope components
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was first introduced in &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2&lt;/a&gt; and is one of the most practically significant innovations for long-context inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;training/3b_fine_web_edu.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Dataset: FineWeb-Edu&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FineWebEduDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IterableDataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__iter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceFW/fineweb-edu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_shards&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_shards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.3 trillion tokens&lt;/strong&gt;, Apache 2.0 licensed&lt;/li&gt;
&lt;li&gt;Streaming from HuggingFace Hub (no local disk required)&lt;/li&gt;
&lt;li&gt;Two-dimensional sharding: &lt;code&gt;world_size × num_workers&lt;/code&gt; - disjoint, no duplication&lt;/li&gt;
&lt;li&gt;Documents packed into rolling 2048-token chunks (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
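
&lt;p&gt;A minimal sketch of that packing step - function and field names are illustrative, and the repo's loader differs in detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def packed_chunks(stream, tokenizer, seq_len=2048):
    """Pack streamed documents into rolling (seq_len + 1)-token chunks."""
    buf = []
    for doc in stream:
        buf.extend(tokenizer.encode(doc["text"]))
        buf.append(tokenizer.eos_token_id)  # document boundary
        while len(buf) &gt; seq_len:
            yield buf[: seq_len + 1]        # +1 token for shifted next-token targets
            buf = buf[seq_len:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;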

&lt;h3&gt;
  
  
  Training Configuration (3B Model)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;mythos_3b() - 3.7B params, 64 experts, 16 loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokenizer&lt;/td&gt;
&lt;td&gt;openai/gpt-oss-20b (100K vocab)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequence length&lt;/td&gt;
&lt;td&gt;2,048 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global batch&lt;/td&gt;
&lt;td&gt;~512K tokens (256 grad accum steps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;30B (~2.5× Chinchilla-efficient for looped models)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LR schedule&lt;/td&gt;
&lt;td&gt;Linear warmup (2000 steps) → cosine decay (sketched after this table)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max LR&lt;/td&gt;
&lt;td&gt;3e-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimizer&lt;/td&gt;
&lt;td&gt;AdamW fused, betas=(0.9, 0.95), weight_decay=0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;bfloat16 (H100/A100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed&lt;/td&gt;
&lt;td&gt;FSDP (Fully Sharded Data Parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

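&lt;p&gt;The schedule above as a &lt;code&gt;LambdaLR&lt;/code&gt; sketch - step counts are taken from the training log in the benchmarks section, and the repo may implement this differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps, max_lr = 2000, 58_000, 3e-4

def lr_lambda(step: int) -&gt; float:
    if step &lt; warmup_steps:
        return step / warmup_steps                               # linear warmup to max LR
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))                   # cosine decay to 0

model     = torch.nn.Linear(8, 8)  # stand-in; the real run wraps OpenMythos in FSDP below
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)  # fused=True on GPU
scheduler = LambdaLR(optimizer, lr_lambda)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
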
&lt;p&gt;FSDP Setup&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FSDP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sharding_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ShardingStrategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FULL_SHARD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mixed_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MixedPrecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_wrap_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ModuleWrapPolicy&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;TransformerBlock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RecurrentBlock&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;local_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

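amp_ctx = torch.autocast("cuda", dtype=torch.bfloat16)  # assumed: the excerpt references amp_ctx without defining it
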
&lt;span class="c1"&gt;# Gradient accumulation with no_sync() - all-reduce only on final micro-step
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;nullcontext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amp_ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token efficiency claim:&lt;/strong&gt; Looped architectures are ~2.5× more token-efficient than dense models at equal parameter count. A 3B RDT at 30B tokens matches a 3B dense model at 75B tokens. This tracks with &lt;a href="https://arxiv.org/abs/2203.15556" rel="noopener noreferrer"&gt;Chinchilla-style analysis&lt;/a&gt; adjusted for parameter reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Variants: 1B to 1T
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/variants.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" alt="Model Variants" width="538" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scaling principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;expert_dim&lt;/code&gt; grows with model size (maintain activation density)&lt;/li&gt;
&lt;li&gt;Loop count increases (frontier models reason deeper per token)&lt;/li&gt;
&lt;li&gt;Context and output length jump at 100B+ (1M token context enabled)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security Angle
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threat Modelling Locally-Runnable Reasoning Models
&lt;/h3&gt;

&lt;p&gt;OpenMythos is not just an academic curiosity - it directly changes the threat landscape for AI-assisted security work. Here's why this architecture matters for security practitioners.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Local Deployment = No Rate Limiting
&lt;/h4&gt;

&lt;p&gt;Commercial frontier models (GPT-4, Claude) apply rate limits, content filters, and usage policies. A locally-running RDT with 3B parameters and a 512K-token context is subject to none of these controls.&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv:2504.10112&lt;/a&gt; (Benchmarking LLM-driven Offensive Security), state-of-the-art LLM agents achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;228.6% improvement&lt;/strong&gt; in penetration testing task completion rate (PentestGPT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60% success rate&lt;/strong&gt; obtaining shell access in CTF environments (RapidPen)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.30–$0.60 per exploitation attempt&lt;/strong&gt; using commercial APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a locally-running OpenMythos model, the per-attempt cost drops to compute only.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Inference-Time Scaling for Hard Problems
&lt;/h4&gt;

&lt;p&gt;The ACT halting mechanism is particularly relevant for security: hard cryptographic reasoning, complex vulnerability chains, and multi-step exploit development are exactly the "hard" problems that get allocated more loops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Find a path from X endpoint to the admin database"
     → ACT allocates maximum loops per token
     → model reasons in latent space across the full attack chain
     → outputs a step-by-step exploitation path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same compute-on-demand property that makes RDTs interesting for math and coding - and adversarial reasoning is just another form of hard multi-step problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Defensive Use Cases
&lt;/h4&gt;

&lt;p&gt;The flip side: the same architecture enables powerful defensive applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log anomaly detection:&lt;/strong&gt; 1M token context window (mythos_100b+) can ingest an entire day of SIEM logs in a single pass and reason across them for lateral movement indicators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Malware analysis:&lt;/strong&gt; Decompiled binary context fed to the model for behavioral classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability triage:&lt;/strong&gt; Static analysis output reasoning for false-positive reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC automation:&lt;/strong&gt; Multi-step reasoning chains for alert investigation without human-in-the-loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;MDPI Cybersecurity Survey&lt;/a&gt;, LLMs in cybersecurity are actively being deployed across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intrusion/anomaly detection&lt;/li&gt;
&lt;li&gt;Threat intelligence extraction&lt;/li&gt;
&lt;li&gt;Automated vulnerability repair&lt;/li&gt;
&lt;li&gt;Red team simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Tokenizer Attack Surface
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/tokenizer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-oss-20b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tokenizer is loaded from HuggingFace Hub at runtime with no local checksum validation. This is a &lt;strong&gt;supply chain attack surface&lt;/strong&gt; - a poisoned tokenizer on HuggingFace could alter token mappings and inject adversarial behavior into any model using it. This is a known class of vulnerability documented in &lt;a href="https://arxiv.org/abs/2401.00001" rel="noopener noreferrer"&gt;ML supply chain attacks research&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Pin tokenizer versions, validate checksums, mirror to internal artifact registry.&lt;/p&gt;
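
&lt;p&gt;A sketch of that mitigation - the revision and checksum values are placeholders you would record at vetting time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

MODEL_ID        = "openai/gpt-oss-20b"
REVISION        = "&lt;pinned-commit-hash&gt;"  # placeholder: an exact commit, not a moving branch
EXPECTED_SHA256 = "&lt;recorded-digest&gt;"     # placeholder: hash recorded when the file was vetted

path = hf_hub_download(MODEL_ID, "tokenizer.json", revision=REVISION)
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == EXPECTED_SHA256, "tokenizer file drifted from the vetted checksum"

tok = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;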

&lt;h4&gt;
  
  
  5. KV Cache Memory Safety
&lt;/h4&gt;

&lt;p&gt;The generate method has no explicit bounds on KV cache growth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# kv_cache grows with sequence length × layers × heads
&lt;/span&gt;    &lt;span class="c1"&gt;# No OOM protection; long sequences cause silent crash
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a production inference endpoint, this creates a &lt;strong&gt;resource exhaustion vector&lt;/strong&gt; - long sequences or high concurrency cause OOM crashes. Defense: implement sequence length limits and cache size monitoring at the inference wrapper layer, as sketched below.&lt;/p&gt;
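
&lt;p&gt;A sketch of such a guard - the token budget is an assumption; derive it from measured KV-cache bytes per token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_TOTAL_TOKENS = 8192  # assumed budget for the available VRAM

def safe_generate(model, input_ids, max_new_tokens=64, **kwargs):
    total = input_ids.shape[-1] + max_new_tokens
    if total &gt; MAX_TOTAL_TOKENS:
        raise ValueError(f"{total} tokens would exceed the KV-cache budget of {MAX_TOTAL_TOKENS}")
    return model.generate(input_ids, max_new_tokens=max_new_tokens, **kwargs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;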

&lt;h4&gt;
  
  
  6. Prompt Injection via Raw Causal LM
&lt;/h4&gt;

&lt;p&gt;OpenMythos is a pure causal language model - no system prompt infrastructure, no guardrails. Any downstream application wrapping OpenMythos for a security tool inherits the full prompt-injection surface and must implement filtering at the application layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Research Says
&lt;/h2&gt;

&lt;p&gt;OpenMythos does not invent from scratch. Every mechanism has an academic foundation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;Conference/Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recurrent-Depth Transformers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;Geiping et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ICLR 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI Stable Injection (Parcae)&lt;/td&gt;
&lt;td&gt;Hayden Prairie et al.&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Universal Transformers + ACT&lt;/td&gt;
&lt;td&gt;Dehghani et al.&lt;/td&gt;
&lt;td&gt;ICLR 2019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Latent Attention&lt;/td&gt;
&lt;td&gt;DeepSeek-V2&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Grained MoE&lt;/td&gt;
&lt;td&gt;DeepSeek-V3&lt;/td&gt;
&lt;td&gt;Dec 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auxiliary-Loss-Free Balancing&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA depth adaptation&lt;/td&gt;
&lt;td&gt;Bae et al. 2024; MoDr&lt;/td&gt;
&lt;td&gt;2024–2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;Dao et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ICLR 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA&lt;/td&gt;
&lt;td&gt;Ainslie et al.&lt;/td&gt;
&lt;td&gt;EMNLP 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The convergence of these techniques into a single architecture is the core contribution. Each alone is known; together they form a coherent reasoning machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Grokking Connection
&lt;/h3&gt;

&lt;p&gt;RDTs exhibit a striking property documented in &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research&lt;/a&gt;: training shows &lt;strong&gt;phase transitions in generalization&lt;/strong&gt; (grokking). The model suddenly jumps from memorization to systematic generalization at a critical training token threshold - and this transition is more pronounced in looped models than in dense models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Chain-of-Thought
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/html/2507.02199v1" rel="noopener noreferrer"&gt;arxiv:2507.02199&lt;/a&gt; shows that RDT hidden state trajectories are decodable: you can extract intermediate reasoning steps from the loop iterations without ever emitting reasoning tokens. This suggests "chain-of-thought" is not a discrete token-level phenomenon - it is an emergent property of iterated hidden-state refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks &amp;amp; Evidence
&lt;/h2&gt;

&lt;p&gt;From the OpenMythos training logs and community reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation Loss Curves (3B training run, FineWeb-Edu 30BT):
Step 0:      loss=11.2  (random baseline)
Step 5,000:  loss=3.8   (initial convergence)
Step 20,000: loss=2.9   (mid-training)
Step 58,000: loss=2.4   (training complete)

Inference throughput comparison (3B, A100, batch=32):
Dense 3B baseline:   940 tokens/sec
OpenMythos 3B (MoE): 2,510 tokens/sec  [2.67× faster]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news, April 2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The throughput gain comes from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ACT halting:&lt;/strong&gt; Fewer loops for easy tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE sparsity:&lt;/strong&gt; ~5% of routed expert parameters active per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLA cache compression:&lt;/strong&gt; Smaller KV cache = more sequences fit in GPU memory = higher batch size&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MythosConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.variants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mythos_1b&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.tokenizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MythosTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Build a 1B model
&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mythos_1b&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate with 16 reasoning loops
&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the proof of Gödel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s incompleteness theorem.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up reasoning at inference (no retraining)
&lt;/span&gt;&lt;span class="n"&gt;output_deep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;open-mythos            &lt;span class="c"&gt;# core&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"open-mythos[flash]"&lt;/span&gt;   &lt;span class="c"&gt;# + Flash Attention 2 (2-3× faster)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenMythos is more than a speculative reverse-engineering project. It is a working, production-grade PyTorch implementation of a state-of-the-art reasoning architecture that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Challenges the "more layers = better" paradigm&lt;/strong&gt; - depth through iteration, not stacking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes inference-time scaling practical&lt;/strong&gt; - run more loops at test time for harder problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compresses memory aggressively&lt;/strong&gt; - MLA + sparse MoE makes frontier-scale models runnable on fewer GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brings stability guarantees&lt;/strong&gt; - LTI injection removes training instability without hyperparameter tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changes the security landscape&lt;/strong&gt; - locally-runnable reasoning models with long context eliminate API-based controls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architecture sits at a confluence of ICLR 2025, DeepSeek-V3, and Universal Transformer research - not speculation, but synthesis. Whether or not it correctly reconstructs Claude Mythos, OpenMythos is a significant architectural contribution in its own right.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Geiping et al. - &lt;em&gt;Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach&lt;/em&gt; - ICLR 2025. &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;openreview.net/pdf?id=WwpYSOkkCt&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V3 Technical Report&lt;/em&gt; - arxiv:2412.19437. &lt;a href="https://arxiv.org/pdf/2412.19437" rel="noopener noreferrer"&gt;arxiv.org/pdf/2412.19437&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;arxiv.org/abs/2405.04434&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wang et al. - &lt;em&gt;Auxiliary-Loss-Free Load Balancing Strategy for Mixture of Experts&lt;/em&gt; - arxiv:2408.15664. &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv.org/html/2408.15664v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dao, T. - &lt;em&gt;FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning&lt;/em&gt; - ICLR 2024. &lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;openreview.net/forum?id=mZn2Xyh9Ec&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shah et al. - &lt;em&gt;FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision&lt;/em&gt; - 2024. &lt;a href="https://openreview.net/forum?id=tVConYid20" rel="noopener noreferrer"&gt;openreview.net/forum?id=tVConYid20&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bae et al. - &lt;em&gt;Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2410.20672" rel="noopener noreferrer"&gt;arxiv.org/abs/2410.20672&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Heo et al. - &lt;em&gt;RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals&lt;/em&gt; - Feb 2025.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MoDr - &lt;em&gt;Mixture-of-Depth-Recurrent Transformers&lt;/em&gt; - OpenReview. &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;openreview.net/forum?id=9Pba4rcQbE&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gu, A. et al. - &lt;em&gt;Efficiently Modeling Long Sequences with Structured State Spaces&lt;/em&gt; - ICLR 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dehghani et al. - &lt;em&gt;Universal Transformers&lt;/em&gt; - ICLR 2019. &lt;a href="https://arxiv.org/abs/1807.03819" rel="noopener noreferrer"&gt;arxiv.org/abs/1807.03819&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graves, A. - &lt;em&gt;Adaptive Computation Time for Recurrent Neural Networks&lt;/em&gt; - 2016.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hu et al. - &lt;em&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/em&gt; - arxiv:2106.09685. &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;arxiv.org/abs/2106.09685&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benchmark: &lt;em&gt;LLM Agents in Autonomous Cyberattacks Survey&lt;/em&gt; - arxiv:2505.12786. &lt;a href="https://arxiv.org/html/2505.12786v2" rel="noopener noreferrer"&gt;arxiv.org/html/2505.12786v2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Happe, A. et al. - &lt;em&gt;Benchmarking LLM-driven Offensive Security&lt;/em&gt; - arxiv:2504.10112. &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv.org/html/2504.10112&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fang, R. et al. - &lt;em&gt;LLMs in Cybersecurity: A Survey&lt;/em&gt; - MDPI AI. &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;mdpi.com/2673-2688/6/9/216&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Understanding Dynamic Compute Allocation in Recurrent Transformers&lt;/em&gt; - arxiv:2602.08864. &lt;a href="https://arxiv.org/html/2602.08864" rel="noopener noreferrer"&gt;arxiv.org/html/2602.08864&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Thinking Deeper, Not Longer: Depth-Recurrent Transformers&lt;/em&gt; - arxiv:2603.21676. &lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;arxiv.org/html/2603.21676&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MarkTechPost - &lt;em&gt;Meet OpenMythos&lt;/em&gt; - April 2026. &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;marktechpost.com/2026/04/19&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blockchain.news - &lt;em&gt;2.67× Faster Validation Steps&lt;/em&gt; - April 2026. &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;blockchain.news/ainews&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Block Sparse FlashAttention&lt;/em&gt; - arxiv:2512.07011. &lt;a href="https://arxiv.org/abs/2512.07011" rel="noopener noreferrer"&gt;arxiv.org/abs/2512.07011&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;MoE Survey 2024&lt;/em&gt; - arxiv:2406.18219. &lt;a href="https://arxiv.org/abs/2406.18219" rel="noopener noreferrer"&gt;arxiv.org/abs/2406.18219&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Optimizing MoE Routing&lt;/em&gt; - arxiv:2506.16419. &lt;a href="https://arxiv.org/html/2506.16419v1" rel="noopener noreferrer"&gt;arxiv.org/html/2506.16419v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;kyegomez/OpenMythos&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>security</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
