<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Clint</title>
    <description>The latest articles on DEV Community by Clint (@clintjosy).</description>
    <link>https://dev.to/clintjosy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3890922%2F410a06d1-1613-46c5-a78b-23600e882992.png</url>
      <title>DEV Community: Clint</title>
      <link>https://dev.to/clintjosy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/clintjosy"/>
    <language>en</language>
    <item>
      <title>Your AI, Your Rules: Running a Local LLM with GPU Acceleration on Proxmox</title>
      <dc:creator>Clint</dc:creator>
      <pubDate>Fri, 01 May 2026 16:26:51 +0000</pubDate>
      <link>https://dev.to/clintjosy/your-ai-your-rules-running-a-local-llm-with-gpu-acceleration-on-proxmox-1plh</link>
      <guid>https://dev.to/clintjosy/your-ai-your-rules-running-a-local-llm-with-gpu-acceleration-on-proxmox-1plh</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;From 3 tok/s frustration to 21 tok/s GPU-hybrid inference - a real engineer's guide to self-hosted AI that actually works.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why Bother Running Local LLMs?
&lt;/h2&gt;

&lt;p&gt;Before we get into the how, let's address the obvious question: &lt;strong&gt;why not just use Claude, GPT, or Gemini?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The honest answer is - for many tasks, you should. But local LLMs make sense when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Privacy matters.&lt;/strong&gt; Code, internal documents, proprietary configs - none of it leaves your machine.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost at scale.&lt;/strong&gt; API calls add up fast when you're running a coding agent all day.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency control.&lt;/strong&gt; No network round-trips, no rate limits, no API downtime.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline capability.&lt;/strong&gt; Works on a plane, in a data center, behind a firewall.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experimentation.&lt;/strong&gt; Swap models freely, tune inference parameters, benchmark to your heart's content.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This guide documents a real setup - not a toy demo - built specifically to run &lt;strong&gt;Claude Code&lt;/strong&gt; and &lt;strong&gt;pi.dev&lt;/strong&gt; against a local model, transparently, with no API key required.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Hardware Stack
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Host&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Proxmox VE 8.x, kernel 6.17.x&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;CPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12-core (AMD/Intel)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;40 GB allocated to LLM container&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GPU&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;NVIDIA RTX 2000 Ada Generation Laptop (8 GB VRAM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Storage&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;120 GB root + &lt;code&gt;/mnt/models&lt;/code&gt; for model files&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tailscale mesh for remote access&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The GPU is the critical piece - even a modest 8 GB card dramatically changes what's possible.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folvtjf8vnsyad8lljgbw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Folvtjf8vnsyad8lljgbw.png" alt="Architecture" width="800" height="686"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Two services, two ports, one model:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Port 11434&lt;/strong&gt; - llama.cpp native OpenAI-compatible API (for pi.dev, curl, anything OpenAI-compatible)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port 4000&lt;/strong&gt; - thin Python proxy translating Anthropic Messages API to OpenAI format (for Claude Code)&lt;/li&gt;
&lt;/ul&gt;
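
&lt;p&gt;Once both services are up (built in the parts below), you can sanity-check the OpenAI-compatible side from any machine on the LAN - a minimal sketch using the same address as the configs later in this guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal smoke test for the llama.cpp OpenAI-compatible endpoint.
# Assumes the llama-server from Part 3 is reachable at this address.
import requests

resp = requests.post(
    "http://192.168.100.103:11434/v1/chat/completions",
    json={
        "model": "local",  # llama.cpp serves whichever model it loaded
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;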




&lt;h2&gt;
  
  
  Part 1: The Container - Proxmox LXC Setup
&lt;/h2&gt;

&lt;p&gt;Use an &lt;strong&gt;LXC container&lt;/strong&gt; rather than a full VM because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Near-native CPU performance (no hypervisor overhead)&lt;/li&gt;
&lt;li&gt;Shared host kernel means GPU passthrough works with the host's NVIDIA driver&lt;/li&gt;
&lt;li&gt;Faster to snapshot, clone, and manage&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Container Config
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;/etc/pve/lxc/103.conf&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="err"&gt;arch:&lt;/span&gt; &lt;span class="err"&gt;amd64&lt;/span&gt;
&lt;span class="err"&gt;cores:&lt;/span&gt; &lt;span class="err"&gt;12&lt;/span&gt;
&lt;span class="err"&gt;features:&lt;/span&gt; &lt;span class="py"&gt;nesting&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;1&lt;/span&gt;
&lt;span class="err"&gt;hostname:&lt;/span&gt; &lt;span class="err"&gt;llm-server&lt;/span&gt;
&lt;span class="err"&gt;memory:&lt;/span&gt; &lt;span class="err"&gt;40000&lt;/span&gt;
&lt;span class="err"&gt;net0:&lt;/span&gt; &lt;span class="py"&gt;name&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;eth0,bridge=vmbr0,hwaddr=BC:24:11:F0:15:BA,type=veth&lt;/span&gt;
&lt;span class="err"&gt;ostype:&lt;/span&gt; &lt;span class="err"&gt;debian&lt;/span&gt;
&lt;span class="err"&gt;rootfs:&lt;/span&gt; &lt;span class="err"&gt;local-lvm:vm-103-disk-0,&lt;/span&gt;&lt;span class="py"&gt;size&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;120G&lt;/span&gt;
&lt;span class="err"&gt;swap:&lt;/span&gt; &lt;span class="err"&gt;0&lt;/span&gt;
&lt;span class="err"&gt;dev0:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia0,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev1:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidiactl,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev2:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev3:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm-tools,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;dev0&lt;/code&gt; through &lt;code&gt;dev3&lt;/code&gt; lines are the magic - Proxmox's native device passthrough syntax. No &lt;code&gt;lxc.mount.entry&lt;/code&gt; hacks required in newer PVE versions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Critical gotcha:&lt;/strong&gt; If you ever set &lt;code&gt;chattr +i&lt;/code&gt; on &lt;code&gt;/etc/resolv.conf&lt;/code&gt; inside the container (e.g., to prevent Tailscale from overwriting DNS), it will break Proxmox's pre-start hook which atomically updates the DNS config. The container won't start. Fix it from the host:&lt;/p&gt;


&lt;pre class="highlight shell"&gt;&lt;code&gt;pct mount 103
chattr &lt;span class="nt"&gt;-i&lt;/span&gt; /var/lib/lxc/103/rootfs/etc/resolv.conf
pct unmount 103
&lt;/code&gt;&lt;/pre&gt;

&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 2: The Model - Picking the Right One
&lt;/h2&gt;

&lt;p&gt;Model selection for a GPU-constrained system is non-obvious. The key insight is &lt;strong&gt;MoE vs Dense architecture&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Example&lt;/th&gt;
&lt;th&gt;Active Params/Token&lt;/th&gt;
&lt;th&gt;CPU Speed&lt;/th&gt;
&lt;th&gt;GPU Benefit&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.6 27B&lt;/td&gt;
&lt;td&gt;27B&lt;/td&gt;
&lt;td&gt;~3.5 tok/s&lt;/td&gt;
&lt;td&gt;High - all layers benefit&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.6 35B-A3B&lt;/td&gt;
&lt;td&gt;~3B&lt;/td&gt;
&lt;td&gt;~18 tok/s&lt;/td&gt;
&lt;td&gt;Lower - sparse routing already fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Gemma 4 26B-A4B&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;~16 tok/s&lt;/td&gt;
&lt;td&gt;Medium - GPU boosts active layers&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The counter-intuitive result: &lt;strong&gt;a 35B MoE model runs 5x faster than a 27B dense model on CPU&lt;/strong&gt; because MoE only activates a small fraction of weights per token. Don't assume smaller parameter count means faster inference.&lt;/p&gt;
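
&lt;p&gt;A back-of-envelope model makes the gap concrete: CPU decoding is roughly memory-bandwidth-bound, so what matters is bytes touched per token. The bandwidth figure below is an assumption for illustration, not a measurement:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Decode speed on CPU is roughly bandwidth / bytes_touched_per_token.
bandwidth_gb_s = 60      # assumed dual-channel DDR5 ballpark
bytes_per_param = 0.5    # ~4-bit quantization

for name, active_params in [("dense 27B", 27e9), ("MoE 35B-A3B", 3e9)]:
    tok_s = bandwidth_gb_s * 1e9 / (active_params * bytes_per_param)
    print(f"{name}: ~{tok_s:.1f} tok/s upper bound")

# dense 27B: ~4.4 tok/s upper bound
# MoE 35B-A3B: ~40.0 tok/s upper bound
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Real throughput lands below these bounds (routing, attention, cache misses), but the ratio matches the ~5x gap measured above.&lt;/p&gt;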

&lt;p&gt;I chose &lt;strong&gt;Gemma 4 26B-A4B Q4_K_XL&lt;/strong&gt; (15.9 GB) for its:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong instruction following and coding ability&lt;/li&gt;
&lt;li&gt;Multimodal capability (vision via mmproj)&lt;/li&gt;
&lt;li&gt;262K token context window&lt;/li&gt;
&lt;li&gt;4B active parameters (MoE) - fast despite large parameter count&lt;/li&gt;
&lt;li&gt;Available from &lt;a href="https://huggingface.co/unsloth/gemma-4-26B-it-GGUF" rel="noopener noreferrer"&gt;unsloth/gemma-4-26B-it-GGUF&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Part 3: CPU-Only First - Getting llama.cpp Running
&lt;/h2&gt;

&lt;p&gt;Always start CPU-only. It's simpler, debuggable, and gives you a baseline to measure GPU gains against.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build llama.cpp
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; git cmake build-essential libopenblas-dev

git clone https://github.com/ggml-org/llama.cpp /opt/llm/llama.cpp
&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/llm/llama.cpp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF .. &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; make &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Systemd Service
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;/etc/systemd/system/llama-server.service&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;LLM Inference Server (llama.cpp)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/opt/llm/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-m /mnt/models/gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-t 11 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-c 32768 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--batch-size 512 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--ubatch-size 128 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-ngl 0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;-fa on &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--cache-type-k q4_0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--cache-type-v q4_0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--no-mmap &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--reasoning off &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;  &lt;span class="s"&gt;--port 11434&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;10&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Key Flags Explained
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Flag&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-t 11&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Use 11 of 12 cores - leave 1 for the OS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-c 32768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;32K context - Claude Code's system prompt alone is ~24K tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--batch-size 512&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Larger batches = higher throughput during prompt processing&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--cache-type-k/v q4_0&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;4-bit KV cache - 75% smaller than f16, minimal quality loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--no-mmap&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Load model fully into RAM - avoids slow first requests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Disable Gemma's thinking mode - outputs go to &lt;code&gt;reasoning_content&lt;/code&gt; by default, which breaks OpenAI clients&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;-fa on&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Flash Attention - ~30% faster, same quality&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
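
&lt;p&gt;The KV-cache saving is easy to verify with rough arithmetic - the layer and head counts below are illustrative placeholders, not the actual Gemma 4 config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rough KV-cache sizing at 32K context: f16 vs q4_0 cache types.
layers, kv_heads, head_dim, ctx = 30, 8, 128, 32768  # assumed dims

def kv_cache_gb(bytes_per_value: float) -&amp;gt; float:
    # K and V each store layers x ctx x kv_heads x head_dim values
    return 2 * layers * ctx * kv_heads * head_dim * bytes_per_value / 1e9

print(f"f16 : {kv_cache_gb(2.0):.1f} GB")     # ~4.0 GB
print(f"q4_0: {kv_cache_gb(0.5625):.1f} GB")  # ~1.1 GB (4.5 bits incl. scales)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;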

&lt;p&gt;&lt;strong&gt;CPU baseline: ~16-18 tok/s&lt;/strong&gt; (MoE model, 4B active params)&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 4: GPU Passthrough - The Hard Part
&lt;/h2&gt;

&lt;p&gt;This is where most guides give up or give wrong advice. Here's what actually works.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install the Right Driver on the Host
&lt;/h3&gt;

&lt;p&gt;The host kernel was 6.17.x (PVE custom kernel). Debian's packaged NVIDIA driver (550) doesn't support kernels past ~6.8. The solution is NVIDIA's official CUDA repo for Debian 13.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# On the Proxmox host&lt;/span&gt;
wget https://developer.download.nvidia.com/compute/cuda/repos/debian13/x86_64/cuda-keyring_1.1-1_all.deb
dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cuda-keyring_1.1-1_all.deb
apt-get update
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; nvidia-driver-595 nvidia-open-kernel-dkms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs driver &lt;strong&gt;595.71.05&lt;/strong&gt; with DKMS support - it automatically builds kernel modules for all installed kernels including the PVE 6.17.x series.&lt;/p&gt;

&lt;p&gt;Verify with &lt;code&gt;nvidia-smi&lt;/code&gt; on the host. It should show your GPU.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Add GPU Devices to the Container Config
&lt;/h3&gt;

&lt;p&gt;In &lt;code&gt;/etc/pve/lxc/103.conf&lt;/code&gt;, add:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="err"&gt;dev0:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia0,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev1:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidiactl,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev2:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;span class="err"&gt;dev3:&lt;/span&gt; &lt;span class="err"&gt;/dev/nvidia-uvm-tools,&lt;/span&gt;&lt;span class="py"&gt;uid&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;0,gid=0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Restart the container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pct stop 103 &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pct start 103
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Verify GPU Visibility Inside the Container
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /dev/nvidia&lt;span class="k"&gt;*&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 195,   0 /dev/nvidia0&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 195, 255 /dev/nvidiactl&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 505,   0 /dev/nvidia-uvm&lt;/span&gt;
&lt;span class="c"&gt;# crw-rw---- 1 root root 505,   1 /dev/nvidia-uvm-tools&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The devices are visible - but &lt;code&gt;nvidia-smi&lt;/code&gt; won't work yet. You need userspace libraries.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 5: CUDA Inside LXC - Making the GPU Work
&lt;/h2&gt;

&lt;p&gt;The container needs NVIDIA userspace libraries that &lt;strong&gt;exactly match the host driver version&lt;/strong&gt; (595.71.05). Version mismatch causes &lt;code&gt;nvidia-smi: Failed to initialize NVML&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Install Matching Userspace Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Inside the container - add the CUDA repo for Debian 12&lt;/span&gt;
wget https://developer.download.nvidia.com/compute/cuda/repos/debian12/x86_64/cuda-keyring_1.1-1_all.deb
dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; cuda-keyring_1.1-1_all.deb
apt-get update

&lt;span class="c"&gt;# Install userspace libraries pinned to host driver version&lt;/span&gt;
apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libnvidia-ml1&lt;span class="o"&gt;=&lt;/span&gt;595.71.05-1 nvidia-driver-cuda&lt;span class="o"&gt;=&lt;/span&gt;595.71.05-1

&lt;span class="c"&gt;# Verify&lt;/span&gt;
nvidia-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="go"&gt;+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 595.71.05   Driver Version: 595.71.05      CUDA Version: 13.2               |
+-----------------------------------------+------------------------+----------------------+
|   0  NVIDIA RTX 2000 Ada Gene...    On  |   00000000:01:00.0 Off |                  N/A |
| N/A   53C    P3             11W /   39W |       0MiB /   8188MiB |      0%      Default |
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now install the CUDA toolkit for building llama.cpp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;apt-get &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="nt"&gt;--no-install-recommends&lt;/span&gt; cuda-toolkit-12-6
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt;
nvcc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Rebuild llama.cpp with CUDA
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; /opt/llm/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build

&lt;span class="c"&gt;# sm_89 = Ada Lovelace (RTX 2000 Ada, RTX 4xxx series)&lt;/span&gt;
&lt;span class="c"&gt;# Use sm_86 for Ampere (RTX 3xxx), sm_75 for Turing (RTX 2xxx)&lt;/span&gt;
&lt;span class="nv"&gt;PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda/bin:&lt;span class="nv"&gt;$PATH&lt;/span&gt; cmake &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;89 &lt;span class="se"&gt;\&lt;/span&gt;
  ..

make &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify CUDA is linked:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ldd build/bin/llama-server | &lt;span class="nb"&gt;grep &lt;/span&gt;cuda
&lt;span class="c"&gt;# libggml-cuda.so.0 =&amp;gt; .../libggml-cuda.so.0&lt;/span&gt;
&lt;span class="c"&gt;# libcudart.so.12  =&amp;gt; .../libcudart.so.12&lt;/span&gt;
&lt;span class="c"&gt;# libcublas.so.12  =&amp;gt; .../libcublas.so.12&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Finding the Optimal GPU Layer Count
&lt;/h3&gt;

&lt;p&gt;With 8 GB VRAM and a 15.9 GB model, you can't fit everything on GPU. The math (sketched in code below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model has &lt;strong&gt;30 transformer layers&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Available VRAM after system overhead: ~7.5 GB&lt;/li&gt;
&lt;li&gt;Per-layer cost: ~530 MB of weights (15.9 GB / 30 layers), plus a share of the KV cache&lt;/li&gt;
&lt;li&gt;Safe layer count: &lt;strong&gt;12 layers&lt;/strong&gt; (leaves ~700 MB free for compute buffers)&lt;/li&gt;
&lt;/ul&gt;
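
&lt;p&gt;A minimal sketch of that budget math - the reserve figure is an assumption; tune it to what &lt;code&gt;nvidia-smi&lt;/code&gt; reports on your card:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# VRAM budget for -ngl: how many layers fit in 8 GB (illustrative).
model_gb, n_layers = 15.9, 30
vram_gb, reserve_gb = 8.0, 1.2   # reserve: CUDA context, KV cache, buffers

per_layer_gb = model_gb / n_layers               # ~0.53 GB of weights/layer
ngl = int((vram_gb - reserve_gb) / per_layer_gb)
print(f"~{per_layer_gb * 1024:.0f} MB/layer -&amp;gt; start around -ngl {ngl}")
# ~543 MB/layer -&amp;gt; start around -ngl 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;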

&lt;p&gt;Start low and increase until you hit OOM, then back off one step. Update the service with &lt;code&gt;-ngl 12&lt;/code&gt; and restart:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;systemctl restart llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GPU-hybrid result: ~21 tok/s&lt;/strong&gt; (vs 16-18 tok/s CPU-only). The GPU hits 60%+ SM utilization during inference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Part 6: The Proxy - Bridging Claude Code to Your LLM
&lt;/h2&gt;

&lt;p&gt;Here's the problem nobody warns you about: &lt;strong&gt;Claude Code uses the Anthropic Messages API format&lt;/strong&gt;, while llama.cpp serves the &lt;strong&gt;OpenAI Chat Completions format&lt;/strong&gt;. They're incompatible.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9as4caqobw8eldls2gl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9as4caqobw8eldls2gl4.png" alt="Route" width="800" height="312"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The proxy handles both sync and streaming (SSE) responses - Claude Code uses streaming for the interactive terminal experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Translation Points
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Anthropic&lt;/th&gt;
&lt;th&gt;OpenAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;system&lt;/code&gt; (string or content array)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;messages[0]&lt;/code&gt; with &lt;code&gt;role: system&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;content&lt;/code&gt; (array of blocks)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;content&lt;/code&gt; (plain string)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;content[0].text&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;choices[0].message.content&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSE &lt;code&gt;content_block_delta&lt;/code&gt; events&lt;/td&gt;
&lt;td&gt;SSE &lt;code&gt;choices[0].delta.content&lt;/code&gt; chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;stop_reason: end_turn&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;finish_reason: stop&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
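
&lt;p&gt;Condensed to its core, the translation looks roughly like this - a minimal non-streaming sketch of the mapping above, not the full proxy (which also handles SSE streaming and tool use):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal non-streaming sketch of the Anthropic -&amp;gt; OpenAI translation.
# Illustrative only: the real proxy also handles SSE streaming and tool use.
import aiohttp
from aiohttp import web

LLAMA_URL = "http://127.0.0.1:11434/v1/chat/completions"

def to_openai(body: dict) -&amp;gt; dict:
    """Anthropic Messages request -&amp;gt; OpenAI Chat Completions request."""
    messages = []
    system = body.get("system")
    if isinstance(system, list):    # system may be a content-block array
        system = "".join(block.get("text", "") for block in system)
    if system:
        messages.append({"role": "system", "content": system})
    for m in body.get("messages", []):
        content = m["content"]
        if isinstance(content, list):   # flatten content blocks to a string
            content = "".join(b.get("text", "") for b in content
                              if b.get("type") == "text")
        messages.append({"role": m["role"], "content": content})
    return {"model": body.get("model", "local"), "messages": messages,
            "max_tokens": body.get("max_tokens", 1024), "stream": False}

def to_anthropic(resp: dict) -&amp;gt; dict:
    """OpenAI response -&amp;gt; Anthropic Messages response."""
    choice = resp["choices"][0]
    stop = "end_turn" if choice.get("finish_reason") == "stop" else "max_tokens"
    return {"id": resp.get("id", "msg_local"), "type": "message",
            "role": "assistant", "model": resp.get("model", "local"),
            "content": [{"type": "text",
                         "text": choice["message"]["content"] or ""}],
            "stop_reason": stop,
            "usage": {"input_tokens": resp["usage"]["prompt_tokens"],
                      "output_tokens": resp["usage"]["completion_tokens"]}}

async def handle(request: web.Request) -&amp;gt; web.Response:
    body = await request.json()
    async with aiohttp.ClientSession() as session:
        async with session.post(LLAMA_URL, json=to_openai(body)) as r:
            return web.json_response(to_anthropic(await r.json()))

app = web.Application()
app.router.add_post("/v1/messages", handle)   # Anthropic Messages endpoint

if __name__ == "__main__":
    web.run_app(app, port=4000)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;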

&lt;p&gt;The full proxy is ~150 lines of Python using &lt;code&gt;aiohttp&lt;/code&gt;. Run it as a systemd service on port 4000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="c"&gt;# /etc/systemd/system/anthropic-proxy.service
&lt;/span&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Anthropic API Proxy for llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target llama-server.service&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;root&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/usr/bin/python3 /opt/anthropic-proxy.py&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;5&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Claude Code Config
&lt;/h3&gt;

&lt;p&gt;File: &lt;code&gt;~/.claude/settings.json&lt;/code&gt; on the client machine&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_BASE_URL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.100.103:4000"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"ANTHROPIC_API_KEY"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-no-key-required"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"hi"&lt;/span&gt;
&lt;span class="c"&gt;# Hello! How can I help you today?&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Why not LiteLLM?&lt;/strong&gt; I tried it. LiteLLM 1.83 introduced a &lt;code&gt;ResponsesAPIResponse&lt;/code&gt; type internally that fails validation when converting back to &lt;code&gt;AnthropicResponse&lt;/code&gt;. Requests hang silently with no error returned to the client. The 150-line custom proxy was faster to write and debug than fighting the library.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Part 7: pi.dev Integration
&lt;/h2&gt;

&lt;p&gt;pi.dev speaks OpenAI format natively - no proxy needed, connect directly to port 11434.&lt;/p&gt;

&lt;p&gt;File: &lt;code&gt;~/.pi/agent/models.json&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"providers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"baseUrl"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://192.168.100.103:11434/v1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"openai-completions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"apiKey"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sk-dummy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"compat"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"supportsDeveloperRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"supportsReasoningEffort"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Gemma 4 26B (Local GPU)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"reasoning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"contextWindow"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;32768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"maxTokens"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;8192&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"cost"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"input"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"output"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheRead"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"cacheWrite"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;compat&lt;/code&gt; block is important - llama.cpp doesn't understand the &lt;code&gt;developer&lt;/code&gt; role (used by pi for reasoning-capable models) or the &lt;code&gt;reasoning_effort&lt;/code&gt; parameter. Setting both to &lt;code&gt;false&lt;/code&gt; makes pi send standard &lt;code&gt;system&lt;/code&gt; messages instead.&lt;/p&gt;

&lt;p&gt;Open pi and type &lt;code&gt;/model&lt;/code&gt; to select your local model. The file reloads automatically - no restart needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Results
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Stage&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (Dense 27B)&lt;/td&gt;
&lt;td&gt;~3.5 tok/s&lt;/td&gt;
&lt;td&gt;Wrong model choice - dense is slow on CPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (MoE 35B)&lt;/td&gt;
&lt;td&gt;~18 tok/s&lt;/td&gt;
&lt;td&gt;Switched to MoE - massive improvement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CPU-only (Gemma 4 26B MoE)&lt;/td&gt;
&lt;td&gt;~16 tok/s&lt;/td&gt;
&lt;td&gt;Better model quality, similar speed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU hybrid (12/30 layers)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~21 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;30% improvement, GPU at 60%+ utilization&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt processing (prefill)&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;td&gt;GPU accelerates context loading significantly&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Typical response times:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"hi" - ~1 second&lt;/li&gt;
&lt;li&gt;100-token code explanation - ~5 seconds&lt;/li&gt;
&lt;li&gt;500-token code generation - ~25 seconds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The GPU contributes most to &lt;strong&gt;prompt processing&lt;/strong&gt; speed - loading a large codebase context into the KV cache is noticeably faster with GPU layers active.&lt;/p&gt;




&lt;h2&gt;
  
  
  Gotchas and Lessons Learned
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. &lt;code&gt;chattr +i&lt;/code&gt; on resolv.conf breaks container startup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Proxmox's pre-start hook atomically renames a temp file to &lt;code&gt;/etc/resolv.conf&lt;/code&gt;. The immutable flag blocks this. The container fails silently - only visible in &lt;code&gt;lxc-start -l DEBUG&lt;/code&gt; logs as &lt;code&gt;close (rename) atomic file failed: Operation not permitted&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Driver version must match exactly&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Userspace libraries inside the container must match the host kernel module version exactly. A mismatch causes &lt;code&gt;nvidia-smi: Failed to initialize NVML&lt;/code&gt;. Pin the version explicitly: &lt;code&gt;apt-get install libnvidia-ml1=595.71.05-1&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Kernel 6.17 + Debian driver 550 = build failure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Debian's packaged NVIDIA driver 550 has no DKMS support for kernels past ~6.8. The fix is NVIDIA's official CUDA repo for Debian 13 (debian13), which ships driver 595 with working DKMS for modern kernels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. MoE vs Dense - the counterintuitive performance flip&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A 35B MoE model genuinely outperforms a 27B dense model on CPU because sparse activation means only ~3-4B parameters are computed per token. Never assume smaller parameter count means faster inference - check the architecture first.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Gemma 4 thinks by default&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gemma 4 ships with its internal chain-of-thought ("thinking") mode enabled by default. With streaming, the client receives &lt;code&gt;reasoning_content&lt;/code&gt; but an empty &lt;code&gt;content&lt;/code&gt; until thinking completes. For chat interfaces that expect immediate tokens, add &lt;code&gt;--reasoning off&lt;/code&gt;. For code-accuracy tasks, leaving it enabled is worth the latency cost.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. LiteLLM 1.83 hangs silently&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The latest LiteLLM uses a new &lt;code&gt;ResponsesAPIResponse&lt;/code&gt; type that fails Pydantic validation when serializing to &lt;code&gt;AnthropicResponse&lt;/code&gt;. The request completes internally but the response is never sent to the client. No error, no timeout - just silence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;7. Context window must exceed Claude Code's system prompt&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Claude Code's built-in system prompt is approximately 24K tokens. A context window below 32K triggers an immediate &lt;code&gt;exceed_context_size_error&lt;/code&gt; before any user message is processed. Set &lt;code&gt;-c 32768&lt;/code&gt; as the minimum.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Short term:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Re-enable thinking mode selectively by passing &lt;code&gt;budget_tokens&lt;/code&gt; per request&lt;/li&gt;
&lt;li&gt;Add a &lt;code&gt;/v1/models&lt;/code&gt; endpoint to the proxy for model auto-discovery&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For a dedicated thinking model:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-Distill-Qwen-14B&lt;/strong&gt; (~8 GB Q4) - fits almost entirely in 8 GB VRAM, estimated 30-40 tok/s, purpose-built for reasoning tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For bigger hardware:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 3090 or 4090 (24 GB VRAM) - entire Gemma 4-26B fits on GPU, estimated 60-80 tok/s&lt;/li&gt;
&lt;li&gt;A dual-GPU setup with NVLink enables running 70B models entirely on GPU&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Stack, Summarized
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Model:    Gemma 4 26B-A4B Q4_K_XL (15.9 GB, MoE)
Engine:   llama.cpp with CUDA (sm_89, Ada Lovelace)
GPU:      RTX 2000 Ada 8 GB - 12/30 layers on GPU
Speed:    ~21 tok/s generation, ~40 tok/s prefill
Proxy:    Python aiohttp - Anthropic &amp;lt;-&amp;gt; OpenAI translation
Clients:  Claude Code (port 4000), pi.dev (port 11434)
Access:   LAN (192.168.100.103) + Tailscale
Cost:     $0 per query after hardware
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire setup took about 8 hours of real iteration. Most of that time went to three of the gotchas above - the &lt;code&gt;chattr&lt;/code&gt; trap, the kernel/driver mismatch, and the LiteLLM silent hang. Hopefully this guide saves you all of it.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on Proxmox 8.x · llama.cpp · NVIDIA driver 595.71.05 · CUDA 12.6 · Gemma 4 26B · May 2026&lt;/em&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>selfhosted</category>
      <category>proxmox</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>OpenMythos Teardown: Dissecting the Open-Source Reconstruction of Claude Mythos</title>
      <dc:creator>Clint</dc:creator>
      <pubDate>Thu, 23 Apr 2026 07:07:01 +0000</pubDate>
      <link>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</link>
      <guid>https://dev.to/clintjosy/openmythos-teardown-dissecting-the-open-source-reconstruction-of-claude-mythos-9e5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; OpenMythos is a community-driven theoretical reconstruction. It is not affiliated with or endorsed by Anthropic. All claims about Claude Mythos's architecture are speculative hypotheses backed by publicly available research.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Is OpenMythos?
&lt;/h2&gt;

&lt;p&gt;On April 21, 2026, &lt;a href="https://github.com/kyegomez" rel="noopener noreferrer"&gt;Kye Gomez&lt;/a&gt; - founder of Swarms AI - published &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;OpenMythos&lt;/a&gt; to GitHub. The project is a fully open-source PyTorch reconstruction of the hypothesized architecture behind Anthropic's &lt;strong&gt;Claude Mythos&lt;/strong&gt; model.&lt;/p&gt;

&lt;p&gt;The thesis: Claude Mythos achieves its extraordinary reasoning &lt;strong&gt;not&lt;/strong&gt; by stacking hundreds of unique transformer layers, but by &lt;strong&gt;looping a compact set of layers multiple times&lt;/strong&gt;, performing continuous "latent chain-of-thought" reasoning in hidden state space before ever emitting a single output token.&lt;/p&gt;

&lt;p&gt;This idea - a &lt;strong&gt;Recurrent-Depth Transformer (RDT)&lt;/strong&gt; - is grounded in a growing body of 2024–2025 academic research from ICLR, DeepSeek, and multiple independent labs. The architecture combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A three-stage &lt;strong&gt;Prelude → Loop → Coda&lt;/strong&gt; pipeline&lt;/li&gt;
&lt;li&gt;Spectral-radius-constrained hidden state updates (from &lt;strong&gt;Parcae&lt;/strong&gt; architecture)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Adaptive Computation Time (ACT)&lt;/strong&gt; halting for per-token variable compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-grained Mixture of Experts (MoE)&lt;/strong&gt; with DeepSeek-V3-style bias-based load balancing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Latent Attention (MLA)&lt;/strong&gt; for 10–20× KV cache reduction&lt;/li&gt;
&lt;li&gt;Depth-wise &lt;strong&gt;LoRA adapters&lt;/strong&gt; for cheap per-loop specialization&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news&lt;/a&gt;, early training runs show &lt;strong&gt;2.67× faster validation steps&lt;/strong&gt; compared to a baseline dense transformer at the same parameter count.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Central Hypothesis
&lt;/h2&gt;

&lt;p&gt;The key architectural claim: a 770M-parameter Recurrent-Depth Transformer can match the effective capacity of a standard 1.3B dense transformer, because every parameter is reused N times across loop iterations.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Effective Compute ≈ Parameters × Loop Iterations

vs.

Dense Transformer Effective Compute ≈ Parameters × 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means the model can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Scale reasoning depth at inference&lt;/strong&gt; without retraining (run more loops for harder problems)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generalize to more loops than it was trained on&lt;/strong&gt; (depth extrapolation via LoRA clamping)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run entirely in continuous latent space&lt;/strong&gt; - no chain-of-thought token emission required&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"A 770M-parameter RDT matches a 1.3B dense model" - &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;MarkTechPost, April 2026&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Architecture Overview
&lt;/h2&gt;

&lt;p&gt;The model follows a strict three-stage pipeline:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdfrd18a6ttya19o70pru.png" alt="Openmythos Architecture" width="572" height="820"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:899–1086&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;Prelude&lt;/strong&gt; and &lt;strong&gt;Coda&lt;/strong&gt; execute once (fixed compute). The &lt;strong&gt;Recurrent Block&lt;/strong&gt; holds all the reasoning capacity and runs T times. The frozen encoding &lt;code&gt;e&lt;/code&gt; is injected at every loop step, preventing the model from "forgetting" the input.&lt;/p&gt;
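
&lt;p&gt;A toy version of that control flow - illustrative shapes only, with none of the MoE, MLA, ACT, or LoRA machinery, and not the OpenMythos implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy Prelude -&amp;gt; Loop -&amp;gt; Coda pipeline (illustrative, standalone).
import torch
import torch.nn as nn

class TinyRDT(nn.Module):
    def __init__(self, dim: int = 64, loops: int = 8):
        super().__init__()
        self.prelude = nn.Linear(dim, dim)   # runs once: input -&amp;gt; encoding e
        self.core = nn.TransformerEncoderLayer(
            d_model=dim, nhead=4, batch_first=True)  # one shared looped block
        self.coda = nn.Linear(dim, dim)      # runs once: latent -&amp;gt; output
        self.loops = loops

    def forward(self, x: torch.Tensor) -&amp;gt; torch.Tensor:
        e = self.prelude(x)                  # frozen encoding of the input
        h = torch.zeros_like(e)              # fresh latent reasoning state
        for _ in range(self.loops):          # same weights reused T times
            h = self.core(h + e)             # re-inject e at every loop step
        return self.coda(h)

out = TinyRDT()(torch.randn(2, 16, 64))      # (batch, seq, dim)
print(out.shape)                             # torch.Size([2, 16, 64])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;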

&lt;h2&gt;
  
  
  Dissection: Six Novel Mechanisms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 LTI-Stable Injection - The Heartbeat
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:684–743&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The most critical and least obvious component. Without it, looped transformers diverge.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LTIInjection&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Linear Time-Invariant injection with spectral radius &amp;lt; 1 by construction.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# A_continuous = -exp(log_A)  → always negative diagonal
&lt;/span&gt;        &lt;span class="c1"&gt;# A_discrete   = exp(Δt × A_continuous)  → always in (0, 1)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exp&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_dt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log_A&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;clamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_A&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;   &lt;span class="c1"&gt;# spectral radius guaranteed &amp;lt; 1
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;transformer_out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The update rule:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;h&lt;span class="p"&gt;_{&lt;/span&gt;t+1&lt;span class="p"&gt;}&lt;/span&gt; = A · h&lt;span class="p"&gt;_&lt;/span&gt;t  +  B · e  +  Transformer(h&lt;span class="p"&gt;_&lt;/span&gt;t, e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;ρ(A) &amp;lt; 1&lt;/code&gt; is guaranteed by parameterization - not enforced by regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why this matters:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;A parameterization&lt;/th&gt;
&lt;th&gt;What happens&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Unconstrained&lt;/td&gt;
&lt;td&gt;ρ(A) ≥ 1 possible → hidden state explodes after N loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Soft regularization&lt;/td&gt;
&lt;td&gt;Sometimes works, often diverges at high LR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI with ZOH&lt;/td&gt;
&lt;td&gt;ρ(A) &amp;lt; 1 always → stable at any depth&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The implementation uses &lt;strong&gt;zero-order-hold (ZOH) discretization&lt;/strong&gt;: a continuous-time negative diagonal matrix &lt;code&gt;A_c = -exp(log_A)&lt;/code&gt; is mapped to discrete time via &lt;code&gt;exp(Δt · A_c)&lt;/code&gt;, which always lands in &lt;code&gt;(0, 1)&lt;/code&gt;. This is borrowed from state-space models (Gu et al., 2021 - S4).&lt;/p&gt;
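
&lt;p&gt;A quick numeric check of why this cannot diverge - for any real-valued parameters, &lt;code&gt;exp(-exp(x))&lt;/code&gt; lands strictly in (0, 1). A standalone sketch mirroring &lt;code&gt;get_A&lt;/code&gt; above, not repo code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# ZOH parameterization pins the discrete A inside (0, 1) by construction.
import torch

log_A = torch.randn(8)    # unconstrained learnable parameters
log_dt = torch.randn(8)
A = torch.exp(-torch.exp((log_dt + log_A).clamp(-20, 20)))
assert (A &amp;gt; 0).all() and (A &amp;lt; 1).all()   # spectral radius &amp;lt; 1, always

h = torch.randn(8)
for _ in range(1000):     # iterate the homogeneous update h = A * h
    h = A * h
print(h.abs().max())      # decays toward zero - no blow-up at any depth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;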

&lt;blockquote&gt;
&lt;p&gt;Every divergent training run in the Parcae architecture paper had ρ(A) ≥ 1. Every convergent run had ρ(A) &amp;lt; 1.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.2 ACT Halting - Variable Compute per Token
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Files:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:750–781&lt;/code&gt; (halting unit), &lt;code&gt;open_mythos/main.py:865–889&lt;/code&gt; (integration in RecurrentBlock)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ACTHalting&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-position adaptive computation time.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sigmoid&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;halt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;squeeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Remainder trick: assign leftover probability at threshold crossing
&lt;/span&gt;&lt;span class="n"&gt;remainder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;cumulative_halt&lt;/span&gt;
&lt;span class="n"&gt;crossed&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cumulative_halt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;act_threshold&lt;/span&gt;
&lt;span class="n"&gt;weight&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remainder&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Accumulate weighted hidden state
&lt;/span&gt;&lt;span class="n"&gt;h_out&lt;/span&gt;            &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;
&lt;span class="n"&gt;cumulative_halt&lt;/span&gt;  &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
&lt;span class="n"&gt;still_running&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;~&lt;/span&gt;&lt;span class="n"&gt;crossed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What this achieves:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"The cat sat."        → halts at loop 3   (trivial, no reasoning needed)
"Prove P ≠ NP."       → halts at loop 16  (maximum compute allocated)
"2 + 2"               → halts at loop 1
"Multi-step logic..."  → halts at loop 12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Per &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research on recurrent-depth architectures&lt;/a&gt;, looped updates exhibit a &lt;strong&gt;rapid norm decay&lt;/strong&gt; pattern: early iterations make large hidden-state changes, late iterations make tiny orthogonal adjustments. ACT exploits this by halting when updates become negligible.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Throughput impact:&lt;/strong&gt; 2–3× improvement in inference throughput (easy tokens exit early, expensive compute is allocated to hard tokens only).&lt;/p&gt;

&lt;p&gt;The critical bug fixed in OpenMythos v0.4.0: &lt;strong&gt;halted positions must be gated from weight accumulation&lt;/strong&gt;. Once a position halts, its &lt;code&gt;h&lt;/code&gt; must not be included in gradient updates - a subtle but catastrophic error if missed.&lt;/p&gt;
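
&lt;p&gt;A self-contained sketch of the corrected loop with that gating applied - the halting unit and recurrent step are stand-in stubs, not the repo's code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

B, T, D, max_loops, threshold = 2, 8, 32, 16, 0.99
h       = torch.randn(B, T, D)
halt_fn = lambda h: torch.sigmoid(h.mean(-1))  # stub for ACTHalting
step_fn = lambda h: h + 0.1 * torch.tanh(h)    # stub for the recurrent block

h_out           = torch.zeros_like(h)
cumulative_halt = torch.zeros(B, T)
still_running   = torch.ones(B, T, dtype=torch.bool)

for _ in range(max_loops):
    h = step_fn(h)
    p = halt_fn(h) * still_running                 # the fix: halted positions contribute 0
    crossed = (cumulative_halt + p) &gt;= threshold
    weight  = torch.where(crossed, 1.0 - cumulative_halt, p)
    h_out  += weight.unsqueeze(-1) * h
    cumulative_halt += weight
    still_running  &amp;= ~crossed
    if not still_running.any():                    # every position has halted
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;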

&lt;h3&gt;
  
  
  4.3 Loop-Index RoPE - Teaching Shared Weights Two Jobs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:541–571&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;loop_index_embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Inject sinusoidal depth-position signal into hidden state.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;freqs&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;theta&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;arange&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;angles&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;freqs&lt;/span&gt;
    &lt;span class="n"&gt;emb&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sin&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;angles&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cos&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;                &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;zeros&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;[:&lt;/span&gt;&lt;span class="n"&gt;loop_dim&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;emb&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;emb_full&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;unsqueeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The problem it solves:&lt;/strong&gt; With pure weight sharing, the model runs identical computation at loop 1 and loop 16 - no mechanism to differentiate "early encoding" from "late refinement."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The solution:&lt;/strong&gt; Inject a sinusoidal signal keyed to the loop index &lt;code&gt;t&lt;/code&gt; before every iteration, similar to how RoPE encodes &lt;em&gt;sequence position&lt;/em&gt;. Now the shared weights can learn functionally distinct behaviors per depth - not via separate parameters, but via different activations conditioned on the loop signal.&lt;/p&gt;

&lt;p&gt;This is analogous to the &lt;strong&gt;RingFormer&lt;/strong&gt; architecture (&lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;Heo et al., Feb 2025&lt;/a&gt;) which uses low-rank "level signals" for the same purpose.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.4 Depth-Wise LoRA - Cheap Specialization at Scale
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:578–620&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;LoRAAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Per-loop scale LoRA: shared A/B matrices, learned scale per loop index.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Tensor&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;t_idx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;loop_t&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;num_embeddings&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# clamp for depth extrapolation
&lt;/span&gt;        &lt;span class="n"&gt;s&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# (rank,) - learned per-loop scale
&lt;/span&gt;        &lt;span class="n"&gt;down&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;   &lt;span class="c1"&gt;# (B, T, rank)
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;down&lt;/span&gt; &lt;span class="o"&gt;@&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt;        &lt;span class="c1"&gt;# (B, T, dim)
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Parameter cost analysis:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Parameters per loop&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fully distinct weights&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;dim × dim&lt;/code&gt; (hundreds of millions)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Pure weight sharing&lt;/td&gt;
&lt;td&gt;0 (least expressive)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA adapter&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;rank × dim × 2 + rank × max_loops&lt;/code&gt; (thousands)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;code&gt;clamp&lt;/code&gt; operation (&lt;code&gt;min(loop_t, max_t)&lt;/code&gt;) enables &lt;strong&gt;depth extrapolation&lt;/strong&gt;: train on 16 loops, run inference with 32 loops. Loops 17–32 reuse the scale learned for loop 16. Quality improves sharply over the first additional loops, then plateaus.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is validated by the &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;MoDr paper (OpenReview)&lt;/a&gt; - "Mixture-of-Depth-Recurrent Transformers" - which shows LoRA-based depth adaptation enables reliable out-of-distribution loop generalization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.5 Fine-Grained MoE with Bias-Based Load Balancing
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:426–534&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MoEFFN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;DeepSeek-style: fine-grained routed experts + always-on shared experts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                          &lt;span class="c1"&gt;# (B, T, n_experts)
&lt;/span&gt;        &lt;span class="n"&gt;scores&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# gate weights (gradient flows here)
&lt;/span&gt;        &lt;span class="n"&gt;biased_log&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;router_bias&lt;/span&gt;               &lt;span class="c1"&gt;# bias shifted (no gradient)
&lt;/span&gt;        &lt;span class="n"&gt;topk_idx&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;biased_log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;topk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;indices&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keepdim&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# renormalize
&lt;/span&gt;
        &lt;span class="c1"&gt;# Dispatch tokens to selected experts
&lt;/span&gt;        &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_dispatch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_idx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topk_scores&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Always-on shared experts
&lt;/span&gt;        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;expert&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared_experts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;expert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The load-balancing trick (DeepSeek-V3 style):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard auxiliary-loss balancing adds a penalty term to the training objective - but this introduces competing gradients and a tricky hyperparameter. OpenMythos uses &lt;strong&gt;bias-based routing&lt;/strong&gt; instead:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzfrh0jtlnf8bx4m8abd.png" alt="Routing Decision" width="528" height="204"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt; (Auxiliary-Loss-Free Load Balancing):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Biases are updated externally: overloaded experts get their bias decreased, underloaded ones increased (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;No gradient interference with the task objective&lt;/li&gt;
&lt;li&gt;Zero token dropping during training and inference&lt;/li&gt;
&lt;/ul&gt;
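
&lt;p&gt;A sketch of that out-of-band update - the update rate &lt;code&gt;u&lt;/code&gt; and the load statistic are illustrative, not the repo's exact values:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import torch

@torch.no_grad()  # runs outside autograd, so no gradient ever touches the bias
def update_router_bias(router_bias: torch.Tensor, topk_idx: torch.Tensor, u: float = 1e-3):
    # call once per optimizer step with the batch's top-k expert indices
    n_experts = router_bias.numel()
    load = torch.bincount(topk_idx.flatten(), minlength=n_experts).float()
    load = load / load.sum()                           # fraction of routed tokens per expert
    router_bias += u * torch.sign(load.mean() - load)  # overloaded: bias down; underloaded: up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;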

&lt;p&gt;The v0.4.0 bugfix "stop load balance bias gradient leak" fixed a subtle error where the bias update was accidentally being included in the backward pass - polluting task gradients with load-balancing signals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-grained vs coarse-grained experts:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Expert dim&lt;/th&gt;
&lt;th&gt;Experts&lt;/th&gt;
&lt;th&gt;Active per token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Coarse (Mixtral-style)&lt;/td&gt;
&lt;td&gt;Large (≈ full FFN)&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-grained (DeepSeek-style)&lt;/td&gt;
&lt;td&gt;Small (≈ 1/16 FFN)&lt;/td&gt;
&lt;td&gt;256&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;OpenMythos 3B&lt;/td&gt;
&lt;td&gt;&lt;code&gt;expert_dim=4096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;top-4&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Fine-grained experts activate more diverse combinations per token, increasing effective routing paths from &lt;code&gt;C(8,2)=28&lt;/code&gt; to &lt;code&gt;C(64,4)≈635,376&lt;/code&gt;.&lt;/p&gt;
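
&lt;p&gt;The counts are easy to verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math

math.comb(8, 2)   # 28      - coarse: choose 2 of 8 experts
math.comb(64, 4)  # 635376  - fine-grained: choose 4 of 64 experts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;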

&lt;h3&gt;
  
  
  4.6 Multi-Latent Attention - 10–20× KV Cache Compression
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/main.py:284–419&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;MLA compresses KV to a low-rank latent, dramatically reducing inference memory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight tex"&gt;&lt;code&gt;Standard KV Cache: K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;    per token
GQA Cache:         K, V ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;n&lt;span class="p"&gt;_&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;heads × head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt; per token
MLA Cache:         c&lt;span class="p"&gt;_&lt;/span&gt;kv ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;kv&lt;span class="p"&gt;_&lt;/span&gt;lora&lt;span class="p"&gt;_&lt;/span&gt;rank&lt;span class="p"&gt;}&lt;/span&gt;           per token
                   k&lt;span class="p"&gt;_&lt;/span&gt;rope ∈ R&lt;span class="p"&gt;^{&lt;/span&gt;qk&lt;span class="p"&gt;_&lt;/span&gt;rope&lt;span class="p"&gt;_&lt;/span&gt;head&lt;span class="p"&gt;_&lt;/span&gt;dim&lt;span class="p"&gt;}&lt;/span&gt;     per token
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 1T scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Cache per token&lt;/th&gt;
&lt;th&gt;Ratio&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Full MHA&lt;/td&gt;
&lt;td&gt;&lt;code&gt;128 × 128 × 2 = 32,768&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;1×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA (16 KV heads)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;16 × 128 × 2 = 4,096&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;8×&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MLA&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;1024 + 128 = 1,152&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28× (3.6× over GQA)&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The trick: only &lt;code&gt;c_kv&lt;/code&gt; (the latent) and &lt;code&gt;k_rope&lt;/code&gt; (RoPE-encoded keys) are cached. &lt;code&gt;K_nope&lt;/code&gt; and &lt;code&gt;V&lt;/code&gt; are reconstructed on-the-fly via a cheap upward projection - compute cost is negligible vs. memory saved.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# At each token position:
&lt;/span&gt;&lt;span class="n"&gt;c_kv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_down&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;kv_lora_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;qk_rope_head_dim&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Cache c_kv and k_rope - NOT K, V themselves
&lt;/span&gt;
&lt;span class="c1"&gt;# At attention time:
&lt;/span&gt;&lt;span class="n"&gt;kv_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;kv_up&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c_kv_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# reconstruct K_nope + V from latent
&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;V&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kv_out&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;([...],&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# split reconstructed output
&lt;/span&gt;&lt;span class="n"&gt;K&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;K_nope&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k_rope_cached&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;         &lt;span class="c1"&gt;# full K = nope + rope components
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This was first introduced in &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;DeepSeek-V2&lt;/a&gt; and is one of the most practically significant innovations for long-context inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Training Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;training/3b_fine_web_edu.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Dataset: FineWeb-Edu&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;FineWebEduDataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IterableDataset&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__iter__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceFW/fineweb-edu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;subset&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;streaming&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;shard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_shards&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;total_shards&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;shard_index&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;1.3 trillion tokens&lt;/strong&gt;, Apache 2.0 licensed&lt;/li&gt;
&lt;li&gt;Streaming from HuggingFace Hub (no local disk required)&lt;/li&gt;
&lt;li&gt;Two-dimensional sharding: &lt;code&gt;world_size × num_workers&lt;/code&gt; - disjoint, no duplication&lt;/li&gt;
&lt;li&gt;Documents packed into rolling 2048-token chunks (see the sketch after this list)&lt;/li&gt;
&lt;/ul&gt;
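
&lt;p&gt;A minimal sketch of that packing step - function and field names are illustrative, and the repo's loader differs in detail:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def packed_chunks(stream, tokenizer, seq_len=2048):
    """Pack streamed documents into rolling (seq_len + 1)-token chunks."""
    buf = []
    for doc in stream:
        buf.extend(tokenizer.encode(doc["text"]))
        buf.append(tokenizer.eos_token_id)  # document boundary
        while len(buf) &gt; seq_len:
            yield buf[: seq_len + 1]        # +1 token for shifted next-token targets
            buf = buf[seq_len:]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;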

&lt;h3&gt;
  
  
  Training Configuration (3B Model)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Parameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Model&lt;/td&gt;
&lt;td&gt;mythos_3b() - 3.7B params, 64 experts, 16 loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tokenizer&lt;/td&gt;
&lt;td&gt;openai/gpt-oss-20b (100K vocab)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Sequence length&lt;/td&gt;
&lt;td&gt;2,048 tokens&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Global batch&lt;/td&gt;
&lt;td&gt;~512K tokens (256 grad accum steps)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Total tokens&lt;/td&gt;
&lt;td&gt;30B (~2.5× Chinchilla-efficient for looped models)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LR schedule&lt;/td&gt;
&lt;td&gt;Linear warmup (2000 steps) → cosine decay (sketched after this table)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max LR&lt;/td&gt;
&lt;td&gt;3e-4&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optimizer&lt;/td&gt;
&lt;td&gt;AdamW fused, betas=(0.9, 0.95), weight_decay=0.1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;bfloat16 (H100/A100)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Distributed&lt;/td&gt;
&lt;td&gt;FSDP (Fully Sharded Data Parallel)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

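&lt;p&gt;The schedule above as a &lt;code&gt;LambdaLR&lt;/code&gt; sketch - step counts are taken from the training log in the benchmarks section, and the repo may implement this differently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import math
import torch
from torch.optim.lr_scheduler import LambdaLR

warmup_steps, total_steps, max_lr = 2000, 58_000, 3e-4

def lr_lambda(step: int) -&gt; float:
    if step &lt; warmup_steps:
        return step / warmup_steps                               # linear warmup to max LR
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * t))                   # cosine decay to 0

model     = torch.nn.Linear(8, 8)  # stand-in; the real run wraps OpenMythos in FSDP below
optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr,
                              betas=(0.9, 0.95), weight_decay=0.1)  # fused=True on GPU
scheduler = LambdaLR(optimizer, lr_lambda)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
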
&lt;p&gt;FSDP Setup&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FSDP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;sharding_strategy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ShardingStrategy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;FULL_SHARD&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;mixed_precision&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;MixedPrecision&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;param_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bfloat16&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;auto_wrap_policy&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;ModuleWrapPolicy&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;TransformerBlock&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;RecurrentBlock&lt;/span&gt;&lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="n"&gt;device_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;local_rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

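amp_ctx = torch.autocast("cuda", dtype=torch.bfloat16)  # assumed: the excerpt references amp_ctx without defining it
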
&lt;span class="c1"&gt;# Gradient accumulation with no_sync() - all-reduce only on final micro-step
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_sync&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;micro_step&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;nullcontext&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;amp_ctx&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;loss&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;F&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cross_entropy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;vocab&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;y&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;view&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;grad_accum_steps&lt;/span&gt;
    &lt;span class="n"&gt;loss&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;backward&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Token efficiency claim:&lt;/strong&gt; Looped architectures are ~2.5× more token-efficient than dense models at equal parameter count. A 3B RDT at 30B tokens matches a 3B dense model at 75B tokens. This tracks with &lt;a href="https://arxiv.org/abs/2203.15556" rel="noopener noreferrer"&gt;Chinchilla-style analysis&lt;/a&gt; adjusted for parameter reuse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Variants: 1B to 1T
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/variants.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxrm73ytwd67x10o5fv0o.png" alt="Model Variants" width="538" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Scaling principles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;expert_dim&lt;/code&gt; grows with model size (maintain activation density)&lt;/li&gt;
&lt;li&gt;Loop count increases (frontier models reason deeper per token)&lt;/li&gt;
&lt;li&gt;Context and output length jump at 100B+ (1M token context enabled)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Security Angle
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Threat Modelling Locally-Runnable Reasoning Models
&lt;/h3&gt;

&lt;p&gt;OpenMythos is not just an academic curiosity - it directly changes the threat landscape for AI-assisted security work. Here's why this architecture matters for security practitioners.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Local Deployment = No Rate Limiting
&lt;/h4&gt;

&lt;p&gt;Commercial frontier models (GPT-4, Claude) apply rate limits, content filters, and usage policies. A locally-running RDT with 3B parameters and a 512K-token context is subject to none of these controls.&lt;/p&gt;

&lt;p&gt;Per &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv:2504.10112&lt;/a&gt; (Benchmarking LLM-driven Offensive Security), state-of-the-art LLM agents achieve:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;228.6% improvement&lt;/strong&gt; in penetration testing task completion rate (PentestGPT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;60% success rate&lt;/strong&gt; obtaining shell access in CTF environments (RapidPen)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;$0.30–$0.60 per exploitation attempt&lt;/strong&gt; using commercial APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With a locally-running OpenMythos model, the per-attempt cost drops to compute only.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Inference-Time Scaling for Hard Problems
&lt;/h4&gt;

&lt;p&gt;The ACT halting mechanism is particularly relevant for security: hard cryptographic reasoning, complex vulnerability chains, and multi-step exploit development are exactly the "hard" problems that get allocated more loops.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"Find a path from X endpoint to the admin database"
     → ACT allocates maximum loops per token
     → model reasons in latent space across the full attack chain
     → outputs a step-by-step exploitation path
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the same compute-on-demand property that makes RDTs interesting for math and coding - and adversarial reasoning is just another form of hard multi-step problem.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Defensive Use Cases
&lt;/h4&gt;

&lt;p&gt;The flip side: the same architecture enables powerful defensive applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Log anomaly detection:&lt;/strong&gt; 1M token context window (mythos_100b+) can ingest an entire day of SIEM logs in a single pass and reason across them for lateral movement indicators&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Malware analysis:&lt;/strong&gt; Decompiled binary context fed to the model for behavioral classification&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulnerability triage:&lt;/strong&gt; Static analysis output reasoning for false-positive reduction&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SOC automation:&lt;/strong&gt; Multi-step reasoning chains for alert investigation without human-in-the-loop&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Per &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;MDPI Cybersecurity Survey&lt;/a&gt;, LLMs in cybersecurity are actively being deployed across:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Intrusion/anomaly detection&lt;/li&gt;
&lt;li&gt;Threat intelligence extraction&lt;/li&gt;
&lt;li&gt;Automated vulnerability repair&lt;/li&gt;
&lt;li&gt;Red team simulation&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  4. Tokenizer Attack Surface
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;open_mythos/tokenizer.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-oss-20b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tokenizer is loaded from HuggingFace Hub at runtime with no local checksum validation. This is a &lt;strong&gt;supply chain attack surface&lt;/strong&gt; - a poisoned tokenizer on HuggingFace could alter token mappings and inject adversarial behavior into any model using it. This is a known class of vulnerability documented in &lt;a href="https://arxiv.org/abs/2401.00001" rel="noopener noreferrer"&gt;ML supply chain attacks research&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mitigation:&lt;/strong&gt; Pin tokenizer versions, validate checksums, mirror to internal artifact registry.&lt;/p&gt;
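
&lt;p&gt;A sketch of that mitigation - the revision and checksum values are placeholders you would record at vetting time:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

MODEL_ID        = "openai/gpt-oss-20b"
REVISION        = "&lt;pinned-commit-hash&gt;"  # placeholder: an exact commit, not a moving branch
EXPECTED_SHA256 = "&lt;recorded-digest&gt;"     # placeholder: hash recorded when the file was vetted

path = hf_hub_download(MODEL_ID, "tokenizer.json", revision=REVISION)
with open(path, "rb") as f:
    digest = hashlib.sha256(f.read()).hexdigest()
assert digest == EXPECTED_SHA256, "tokenizer file drifted from the vetted checksum"

tok = AutoTokenizer.from_pretrained(MODEL_ID, revision=REVISION)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;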

&lt;h4&gt;
  
  
  5. KV Cache Memory Safety
&lt;/h4&gt;

&lt;p&gt;The generate method has no explicit bounds on KV cache growth:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...):&lt;/span&gt;
    &lt;span class="c1"&gt;# kv_cache grows with sequence length × layers × heads
&lt;/span&gt;    &lt;span class="c1"&gt;# No OOM protection; long sequences cause silent crash
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a production inference endpoint, this creates a &lt;strong&gt;resource exhaustion vector&lt;/strong&gt; - long sequences or high concurrency cause OOM crashes. Defense: implement sequence length limits and cache size monitoring at the inference wrapper layer, as sketched below.&lt;/p&gt;
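
&lt;p&gt;A sketch of such a guard - the token budget is an assumption; derive it from measured KV-cache bytes per token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;MAX_TOTAL_TOKENS = 8192  # assumed budget for the available VRAM

def safe_generate(model, input_ids, max_new_tokens=64, **kwargs):
    total = input_ids.shape[-1] + max_new_tokens
    if total &gt; MAX_TOTAL_TOKENS:
        raise ValueError(f"{total} tokens would exceed the KV-cache budget of {MAX_TOTAL_TOKENS}")
    return model.generate(input_ids, max_new_tokens=max_new_tokens, **kwargs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;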

&lt;h4&gt;
  
  
  6. Prompt Injection via Raw Causal LM
&lt;/h4&gt;

&lt;p&gt;OpenMythos is a pure causal language model - no system prompt infrastructure, no guardrails. Any downstream application wrapping OpenMythos for a security tool inherits the full prompt-injection surface and must implement filtering at the application layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Research Says
&lt;/h2&gt;

&lt;p&gt;OpenMythos does not invent from scratch. Every mechanism has an academic foundation:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Mechanism&lt;/th&gt;
&lt;th&gt;Paper&lt;/th&gt;
&lt;th&gt;Conference/Year&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Recurrent-Depth Transformers&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;Geiping et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ICLR 2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LTI Stable Injection (Parcae)&lt;/td&gt;
&lt;td&gt;Hayden Prairie et al.&lt;/td&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Universal Transformers + ACT&lt;/td&gt;
&lt;td&gt;Dehghani et al.&lt;/td&gt;
&lt;td&gt;ICLR 2019&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-Latent Attention&lt;/td&gt;
&lt;td&gt;DeepSeek-V2&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-Grained MoE&lt;/td&gt;
&lt;td&gt;DeepSeek-V3&lt;/td&gt;
&lt;td&gt;Dec 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auxiliary-Loss-Free Balancing&lt;/td&gt;
&lt;td&gt;&lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv:2408.15664&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA depth adaptation&lt;/td&gt;
&lt;td&gt;Bae et al. 2024; MoDr&lt;/td&gt;
&lt;td&gt;2024–2025&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Flash Attention 2&lt;/td&gt;
&lt;td&gt;&lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;Dao et al.&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;ICLR 2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GQA&lt;/td&gt;
&lt;td&gt;Ainslie et al.&lt;/td&gt;
&lt;td&gt;EMNLP 2023&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The convergence of these techniques into a single architecture is the core contribution. Each alone is known; together they form a coherent reasoning machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Grokking Connection
&lt;/h3&gt;

&lt;p&gt;RDTs exhibit a striking property documented in &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;ICLR 2025 research&lt;/a&gt;: training shows &lt;strong&gt;phase transitions in generalization&lt;/strong&gt; (grokking). The model suddenly jumps from memorization to systematic generalization at a critical training token threshold - and this transition is more pronounced in looped models than in dense models.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Chain-of-Thought
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://arxiv.org/html/2507.02199v1" rel="noopener noreferrer"&gt;arxiv:2507.02199&lt;/a&gt; shows that RDT hidden state trajectories are decodable: you can extract intermediate reasoning steps from the loop iterations without ever emitting reasoning tokens. This suggests "chain-of-thought" is not a discrete token-level phenomenon - it is an emergent property of iterated hidden-state refinement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks &amp;amp; Evidence
&lt;/h2&gt;

&lt;p&gt;From the OpenMythos training logs and community reports:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Validation Loss Curves (3B training run, FineWeb-Edu 30BT):
Step 0:      loss=11.2  (random baseline)
Step 5,000:  loss=3.8   (initial convergence)
Step 20,000: loss=2.9   (mid-training)
Step 58,000: loss=2.4   (training complete)

Inference throughput comparison (3B, A100, batch=32):
Dense 3B baseline:   940 tokens/sec
OpenMythos 3B (MoE): 2,510 tokens/sec  [2.67× faster]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Source: &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;Blockchain.news, April 2026&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The throughput gain comes from:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ACT halting:&lt;/strong&gt; Fewer loops for easy tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MoE sparsity:&lt;/strong&gt; ~5% of routed expert parameters active per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLA cache compression:&lt;/strong&gt; Smaller KV cache = more sequences fit in GPU memory = higher batch size&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Quick Start
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MythosConfig&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.variants&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;mythos_1b&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;open_mythos.tokenizer&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;MythosTokenizer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="c1"&gt;# Build a 1B model
&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;mythos_1b&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenMythos&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MythosTokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Generate with 16 reasoning loops
&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tensor&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain the proof of Gödel&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s incompleteness theorem.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]).&lt;/span&gt;&lt;span class="nf"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;output&lt;/span&gt;    &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;()))&lt;/span&gt;

&lt;span class="c1"&gt;# Scale up reasoning at inference (no retraining)
&lt;/span&gt;&lt;span class="n"&gt;output_deep&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_ids&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_loops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Install:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;open-mythos            &lt;span class="c"&gt;# core&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="s2"&gt;"open-mythos[flash]"&lt;/span&gt;   &lt;span class="c"&gt;# + Flash Attention 2 (2-3× faster)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;OpenMythos is more than a speculative reverse-engineering project. It is a working, production-grade PyTorch implementation of a state-of-the-art reasoning architecture that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Challenges the "more layers = better" paradigm&lt;/strong&gt; - depth through iteration, not stacking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Makes inference-time scaling practical&lt;/strong&gt; - run more loops at test time for harder problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compresses memory aggressively&lt;/strong&gt; - MLA + sparse MoE makes frontier-scale models runnable on fewer GPUs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Brings stability guarantees&lt;/strong&gt; - LTI injection removes training instability without hyperparameter tuning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Changes the security landscape&lt;/strong&gt; - locally-runnable reasoning models with long context eliminate API-based controls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The architecture sits at a confluence of ICLR 2025, DeepSeek-V3, and Universal Transformer research - not speculation, but synthesis. Whether or not it correctly reconstructs Claude Mythos, OpenMythos is a significant architectural contribution in its own right.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Geiping et al. - &lt;em&gt;Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach&lt;/em&gt; - ICLR 2025. &lt;a href="https://openreview.net/pdf?id=WwpYSOkkCt" rel="noopener noreferrer"&gt;openreview.net/pdf?id=WwpYSOkkCt&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V3 Technical Report&lt;/em&gt; - arxiv:2412.19437. &lt;a href="https://arxiv.org/pdf/2412.19437" rel="noopener noreferrer"&gt;arxiv.org/pdf/2412.19437&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;DeepSeek-AI - &lt;em&gt;DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2405.04434" rel="noopener noreferrer"&gt;arxiv.org/abs/2405.04434&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wang et al. - &lt;em&gt;Auxiliary-Loss-Free Load Balancing Strategy for Mixture of Experts&lt;/em&gt; - arxiv:2408.15664. &lt;a href="https://arxiv.org/html/2408.15664v1" rel="noopener noreferrer"&gt;arxiv.org/html/2408.15664v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dao, T. - &lt;em&gt;FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning&lt;/em&gt; - ICLR 2024. &lt;a href="https://openreview.net/forum?id=mZn2Xyh9Ec" rel="noopener noreferrer"&gt;openreview.net/forum?id=mZn2Xyh9Ec&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shah et al. - &lt;em&gt;FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision&lt;/em&gt; - 2024. &lt;a href="https://openreview.net/forum?id=tVConYid20" rel="noopener noreferrer"&gt;openreview.net/forum?id=tVConYid20&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Bae et al. - &lt;em&gt;Relaxed Recursive Transformers: Effective Parameter Sharing with Layer-wise LoRA&lt;/em&gt; - 2024. &lt;a href="https://arxiv.org/abs/2410.20672" rel="noopener noreferrer"&gt;arxiv.org/abs/2410.20672&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Heo et al. - &lt;em&gt;RingFormer: Rethinking Recurrent Transformer with Adaptive Level Signals&lt;/em&gt; - Feb 2025.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MoDr - &lt;em&gt;Mixture-of-Depth-Recurrent Transformers&lt;/em&gt; - OpenReview. &lt;a href="https://openreview.net/forum?id=9Pba4rcQbE" rel="noopener noreferrer"&gt;openreview.net/forum?id=9Pba4rcQbE&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Gu, A. et al. - &lt;em&gt;Efficiently Modeling Long Sequences with Structured State Spaces&lt;/em&gt; - ICLR 2022.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dehghani et al. - &lt;em&gt;Universal Transformers&lt;/em&gt; - ICLR 2019. &lt;a href="https://arxiv.org/abs/1807.03819" rel="noopener noreferrer"&gt;arxiv.org/abs/1807.03819&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Graves, A. - &lt;em&gt;Adaptive Computation Time for Recurrent Neural Networks&lt;/em&gt; - 2016.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hu et al. - &lt;em&gt;LoRA: Low-Rank Adaptation of Large Language Models&lt;/em&gt; - arxiv:2106.09685. &lt;a href="https://arxiv.org/abs/2106.09685" rel="noopener noreferrer"&gt;arxiv.org/abs/2106.09685&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Benchmark: &lt;em&gt;LLM Agents in Autonomous Cyberattacks Survey&lt;/em&gt; - arxiv:2505.12786. &lt;a href="https://arxiv.org/html/2505.12786v2" rel="noopener noreferrer"&gt;arxiv.org/html/2505.12786v2&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Happe, A. et al. - &lt;em&gt;Benchmarking LLM-driven Offensive Security&lt;/em&gt; - arxiv:2504.10112. &lt;a href="https://arxiv.org/html/2504.10112" rel="noopener noreferrer"&gt;arxiv.org/html/2504.10112&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Fang, R. et al. - &lt;em&gt;LLMs in Cybersecurity: A Survey&lt;/em&gt; - MDPI AI. &lt;a href="https://www.mdpi.com/2673-2688/6/9/216" rel="noopener noreferrer"&gt;mdpi.com/2673-2688/6/9/216&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Understanding Dynamic Compute Allocation in Recurrent Transformers&lt;/em&gt; - arxiv:2602.08864. &lt;a href="https://arxiv.org/html/2602.08864" rel="noopener noreferrer"&gt;arxiv.org/html/2602.08864&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Thinking Deeper, Not Longer: Depth-Recurrent Transformers&lt;/em&gt; - arxiv:2603.21676. &lt;a href="https://arxiv.org/html/2603.21676" rel="noopener noreferrer"&gt;arxiv.org/html/2603.21676&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;MarkTechPost - &lt;em&gt;Meet OpenMythos&lt;/em&gt; - April 2026. &lt;a href="https://www.marktechpost.com/2026/04/19/meet-openmythos-an-open-source-pytorch-reconstruction-of-claude-mythos-where-770m-parameters-match-a-1-3b-transformer/" rel="noopener noreferrer"&gt;marktechpost.com/2026/04/19&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Blockchain.news - &lt;em&gt;2.67× Faster Validation Steps&lt;/em&gt; - April 2026. &lt;a href="https://blockchain.news/ainews/openmythos-breakthrough-looped-transformer-moe-rebuild-of-claude-mythos-shows-2-67x-faster-validation-steps" rel="noopener noreferrer"&gt;blockchain.news/ainews&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Block Sparse FlashAttention&lt;/em&gt; - arxiv:2512.07011. &lt;a href="https://arxiv.org/abs/2512.07011" rel="noopener noreferrer"&gt;arxiv.org/abs/2512.07011&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;MoE Survey 2024&lt;/em&gt; - arxiv:2406.18219. &lt;a href="https://arxiv.org/abs/2406.18219" rel="noopener noreferrer"&gt;arxiv.org/abs/2406.18219&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;Optimizing MoE Routing&lt;/em&gt; - arxiv:2506.16419. &lt;a href="https://arxiv.org/html/2506.16419v1" rel="noopener noreferrer"&gt;arxiv.org/html/2506.16419v1&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GitHub: &lt;a href="https://github.com/kyegomez/OpenMythos" rel="noopener noreferrer"&gt;kyegomez/OpenMythos&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>security</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
