Sergio Solis
The Frozen Context Pattern: Adding State to Deep Equilibrium Models

DEQ models converge only if their update function is a contraction. We found a design pattern that lets you inject arbitrary external state — Mamba memory, attention, anything — without touching the Lipschitz bound.
tags: rust, machinelearning, ai, webgpu


We're building AIDEEN, an open-source AI engine in Rust that runs on consumer GPUs via WebGPU. The core is a Deep Equilibrium Model (DEQ) — a single parameter block iterated until it converges to a fixed point.

DEQs have a hard constraint: the update function must be a contraction. Every component you add risks widening the Lipschitz bound and breaking convergence. We ran into this wall trying to add Mamba-style temporal memory, then attention. Both broke convergence the same way. Both were fixed the same way.

Here's the pattern.

The Constraint

A DEQ finds h* by iterating:

```
h^(k+1) = f(h^(k); x)    until    |h^(k+1) - h^(k)| < epsilon
```

Convergence requires L(f) < 1, where L(f) is the Lipschitz constant of f with respect to h. We enforce this via spectral normalization on every weight matrix, applied every 4 gradient steps. This works cleanly as long as every term in the Jacobian df/dh flows through those normalized weights.
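
As a sketch of what that enforcement step can look like (a minimal power-iteration estimate of the spectral norm; the function names and the 0.9 target are illustrative assumptions, not AIDEEN's actual code):

```python
import numpy as np

def spectral_norm_estimate(W, n_iters=20, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v

def spectrally_normalize(W, target=0.9):
    """Rescale W so its spectral norm is at most `target` < 1."""
    sigma = spectral_norm_estimate(W)
    return W * (target / sigma) if sigma > target else W

# The map h -> W h is a contraction iff ||W||_2 < 1.
W = np.random.default_rng(1).standard_normal((8, 8))
W = spectrally_normalize(W)
```

Power iteration is the standard trick here because a full SVD every few gradient steps would be wasteful; a handful of matrix-vector products gives a tight enough estimate.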

The moment you add something that couples back to h inside the loop, the Jacobian grows cross-terms that blow up the bound.

What Happens When You Ignore It

We tried adding a Mamba-style SSM inside the loop:

```
h^(k+1) = f(h^(k), M^(k); x)
M^(k+1) = g(h^(k), M^(k))
```

Now h depends on M depends on h. The combined Jacobian has cross-terms. In practice:

  1. Spectral norm of the combined system exceeded 1.0
  2. Picard iteration hit the cap without converging
  3. The model oscillated indefinitely

No amount of damping fixed it. The feedback loop is structural — you can't regularize your way out of it.

Same thing happened when we naively put slot attention inside the iteration: the attention weights depend on h, V depends on h, the output feeds back into h. Same instability.
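
A toy scalar version makes the failure concrete. Each map below is individually contractive, but the cross-terms push the joint Jacobian's spectral radius above 1, so Picard iteration diverges (the constants are illustrative, not taken from AIDEEN):

```python
# h-update: |df/dh| = 0.8 < 1 on its own; cross-term df/dM = 0.6.
# M-update: cross-term dg/dh = 0.6.
a, b, c = 0.8, 0.6, 0.6

h, M = 1.0, 1.0
for k in range(50):
    h, M = a * h + b * M, c * h   # coupled update, no frozen context

# The joint Jacobian [[0.8, 0.6], [0.6, 0.0]] has spectral radius
# (0.8 + sqrt(0.64 + 1.44)) / 2 ≈ 1.12 > 1, so |h| blows up.
print(abs(h))
```

Drop the cross-terms (b = c = 0) and the same loop converges geometrically to 0, which is exactly what freezing the context buys you.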

The Pattern: Frozen Context

The fix is the same in both cases. Any external component — Mamba state, attention, anything — can safely enter the DEQ if it follows this structure:

  1. Compute it once, before the Picard loop starts, from the previous converged state
  2. Freeze it — treat it as a constant during iteration (stop-gradient)
  3. Inject it additively into the loop body
  4. Update it after convergence, using h*

```python
# ── Prelude (before the loop) ───────────────────────────
ctx_A = component_A(prev_state_A)   # frozen — computed once
ctx_B = component_B(prev_state_B)   # frozen — computed once

# ── Picard loop ─────────────────────────────────────────
for k in range(max_iters):
    h_next = f(h_curr, x_t) + ctx_A + ctx_B   # ctx never changes
    if converged(h_next, h_curr): break
    h_curr = h_next
h_star = h_next   # converged fixed point

# ── Post-convergence updates ────────────────────────────
state_A = update_A(h_star, prev_state_A)
state_B = update_B(h_star, prev_state_B)
```

Why this works: ctx_A and ctx_B are constants with respect to h. The Jacobian df/dh contains no cross-terms from them. Spectral normalization of the h-dependent path alone is sufficient.

The frozen terms still shift h* — they participate in the final fixed point. They just don't affect the convergence guarantee.
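
A one-dimensional example shows both properties at once: adding a frozen constant moves the fixed point, but the contraction factor, and hence the convergence guarantee, is unchanged (numbers are illustrative):

```python
def solve(f, h=0.0, tol=1e-10, max_iters=100):
    """Picard iteration to a fixed point of a scalar contraction."""
    for k in range(max_iters):
        h_next = f(h)
        if abs(h_next - h) < tol:
            return h_next, k + 1
        h = h_next
    raise RuntimeError("did not converge")

base = lambda h: 0.5 * h + 1.0        # L = 0.5, fixed point h* = 2
ctx = 3.0                              # frozen constant: d(ctx)/dh = 0
with_ctx = lambda h: base(h) + ctx     # still L = 0.5, h* = 4 / (1 - 0.5) = 8

h1, k1 = solve(base)
h2, k2 = solve(with_ctx)
print(h1, h2)   # ≈ 2.0 and ≈ 8.0, both converging at rate 0.5 per step
```

The frozen term shifts h* from 2 to 8, but the error still halves every iteration in both cases.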

*(Figure: AIDEEN architecture, with the Mamba state updated outside the DEQ loop.)*

Applied to Mamba

For temporal memory across tokens:

```python
# Prelude: M_{t-1} → hist_ctx (frozen)
hist_ctx = gate(W_hist * M_prev)   # stop-gradient

# Loop: hist_ctx is read-only
h_next = RMSNorm(attn_signal + slot_bias + hist_ctx)

# Post-convergence: update M
M_t = a * M_prev + (1 - a) * x_proj(h_star)
```

The Mamba state carries temporal information token-to-token. The DEQ sees it as a fixed bias — shifts the fixed point but doesn't affect contractivity.
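
Putting the three phases together, one token step can be sketched end to end like this (linear projections, a tanh nonlinearity, and all shapes are illustrative assumptions, not AIDEEN's real components):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)           # spectral normalization: L < 1
W_hist = rng.standard_normal((d, d)) * 0.1
alpha = 0.9                                # EMA decay for the temporal state

def token_step(x_t, M_prev, max_iters=50, tol=1e-6):
    # Prelude: frozen context from the *previous* state; no h involved.
    hist_ctx = np.tanh(W_hist @ M_prev)

    # Picard loop: hist_ctx is a constant, only W @ h depends on h.
    h = np.zeros(d)
    for k in range(max_iters):
        h_next = np.tanh(W @ h + x_t) + hist_ctx
        if np.linalg.norm(h_next - h) < tol:
            h = h_next
            break
        h = h_next

    # Post-convergence: update the state exactly once, from h*.
    M_t = alpha * M_prev + (1 - alpha) * h
    return h, M_t

M = np.zeros(d)
for t in range(5):                         # run a few tokens
    h_star, M = token_step(rng.standard_normal(d), M)
```

Because `hist_ctx` is constant inside the loop, the contraction bound depends only on `W`, which is what the spectral normalization controls.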

Applied to Slot Attention

Same pattern: Q, K, and V are projected from the previous converged state, and the attention output is computed once and frozen:

```python
# Prelude: compute attention from H_prev
Q, K, V = project(H_prev)
attn_ctx = softmax(Q @ K.T / sqrt(d)) @ V   # frozen

# Loop: attn_ctx is read-only
h_next = RMSNorm(signal + attn_ctx + hist_ctx + slot_bias)
```

Cross-slot attention runs once per token at full cost, but zero times per Picard iteration. The DEQ refines h against a fixed attention context rather than chasing a moving one.
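
A minimal concrete version of the prelude step (projection matrices, slot count, and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, d = 4, 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def frozen_attn_ctx(H_prev):
    """Attention computed once from the previous converged slots.

    The result is treated as a constant inside the Picard loop,
    so d(attn_ctx)/dh = 0 for the current iteration's h.
    """
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                     # shape (n_slots, d)

H_prev = rng.standard_normal((n_slots, d))
attn_ctx = frozen_attn_ctx(H_prev)   # once per token, before the loop
```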

In the GPU Shader

The boundary is explicit in our WGSL shaders. Inside the Picard loop:

```wgsl
// hist_ctx was computed in the prelude from M_{t-1}.
// ∂hist_ctx/∂h = 0 — no contribution to the Lipschitz bound.
let hist_ctx    = Scratch[hist_ctx_base + slot * d_model + d];  // READ ONLY
let attn_ctx    = Scratch[attn_base    + slot * d_model + d];  // READ ONLY
let attn_signal = Scratch[signal_base  + slot * d_model + d];  // h-dependent

let final_h = attn_signal + attn_ctx + hist_ctx + slot_bias;
H_next[...] = final_h;   // h updated — external state untouched
```

After convergence:

```wgsl
// M_t written exactly once, here.
let m_new = alpha * M_prev + (1.0 - alpha) * x_proj(h_star);
H_curr[carry_base + slot * d_model + d] = m_new;
```

The ∂/∂h = 0 annotation is load-bearing. It's what lets spectral normalization work on the h-dependent path alone.

Does It Hold?

After adopting this pattern for both Mamba and attention:

| Metric | Value |
| --- | --- |
| Picard convergence rate | 100% (0 unconverged tokens) |
| Average iterations | 5-6 per token (cap: 20) |
| Contractivity | < 0.85 throughout training |
| Training stability | 12+ hours continuous on AMD Radeon 780M (2 GB VRAM) |

The history signal contributes a stable context (hist/inj ratio ~ 0.25) — temporal information flows into the DEQ without destabilizing it.

The General Rule

Any component can be added to a DEQ using this pattern:

  • Eligible: anything whose output can be computed from the previous converged state — recurrent memory, attention over past tokens, retrieved embeddings, external signals
  • Not eligible: anything that must see the current h to compute its output and feeds back into f — that creates the circular dependency

The frozen context pattern lets DEQ models grow in expressivity without paying in convergence stability. The Lipschitz constraint stays local to the h-dependent core.

Try It

AIDEEN is open source (MIT license), written entirely in Rust with WGSL GPU compute shaders. No Python, no CUDA — runs on any GPU with Vulkan/Metal/DX12/WebGPU support.

```shell
git clone https://github.com/SergioAriel/aideen.git
cd aideen
cargo build --release --workspace
cargo test --workspace --exclude aideen-block --exclude aideen-engine --exclude aideen-node
```

We're currently training our first full model and will publish DEQ vs. transformer benchmarks soon.


We're two developers building this from scratch. If you're interested in DEQ architectures, Rust ML, or WebGPU compute — contributions welcome.
