Sergio Solis
The Frozen Context Pattern: Adding State to Deep Equilibrium Models

DEQ models converge only if their update function is a contraction. We found a design pattern that lets you inject arbitrary external state — Mamba memory, attention, anything — without touching the Lipschitz bound.
tags: rust, machinelearning, ai, webgpu


We're building AIDEEN, an open-source AI engine in Rust that runs on consumer GPUs via WebGPU. The core is a Deep Equilibrium Model (DEQ) — a single parameter block iterated until it converges to a fixed point.

DEQs have a hard constraint: the update function must be a contraction. Every component you add risks widening the Lipschitz bound and breaking convergence. We ran into this wall trying to add Mamba-style temporal memory, then attention. Both broke convergence the same way. Both were fixed the same way.

Here's the pattern.

The Constraint

A DEQ finds h* by iterating:

```
h^(k+1) = f(h^(k); x)    until    |h^(k+1) - h^(k)| < epsilon
```

Convergence requires L(f) < 1, where L(f) is the Lipschitz constant of f with respect to h. We enforce this via spectral normalization on every weight matrix, applied every 4 gradient steps. This works cleanly as long as every term in the Jacobian df/dh flows through those normalized weights.
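
As a sketch of what that enforcement step can look like (a minimal power-iteration estimate of the spectral norm; the function names and the 0.9 target are illustrative assumptions, not AIDEEN's actual code):

```python
import numpy as np

def spectral_norm_estimate(W, n_iters=20, seed=0):
    """Estimate the largest singular value of W by power iteration."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(W.shape[0])
    for _ in range(n_iters):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    return u @ W @ v

def spectrally_normalize(W, target=0.9):
    """Rescale W so its spectral norm is at most `target` < 1."""
    sigma = spectral_norm_estimate(W)
    return W * (target / sigma) if sigma > target else W

# The map h -> W h is a contraction iff ||W||_2 < 1.
W = np.random.default_rng(1).standard_normal((8, 8))
W = spectrally_normalize(W)
```

Power iteration is the standard trick here because a full SVD every few gradient steps would be wasteful; a handful of matrix-vector products gives a tight enough estimate.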

The moment you add something that couples back to h inside the loop, the Jacobian grows cross-terms that blow up the bound.

What Happens When You Ignore It

We tried adding a Mamba-style SSM inside the loop:

```
h^(k+1) = f(h^(k), M^(k); x)
M^(k+1) = g(h^(k), M^(k))
```

Now h depends on M depends on h. The combined Jacobian has cross-terms. In practice:

  1. Spectral norm of the combined system exceeded 1.0
  2. Picard iteration hit the cap without converging
  3. The model oscillated indefinitely

No amount of damping fixed it. The feedback loop is structural — you can't regularize your way out of it.

Same thing happened when we naively put slot attention inside the iteration: the attention weights depend on h, V depends on h, the output feeds back into h. Same instability.
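
A toy scalar version makes the failure concrete. Each map below is individually contractive, but the cross-terms push the joint Jacobian's spectral radius above 1, so Picard iteration diverges (the constants are illustrative, not taken from AIDEEN):

```python
# h-update: |df/dh| = 0.8 < 1 on its own; cross-term df/dM = 0.6.
# M-update: cross-term dg/dh = 0.6.
a, b, c = 0.8, 0.6, 0.6

h, M = 1.0, 1.0
for k in range(50):
    h, M = a * h + b * M, c * h   # coupled update, no frozen context

# The joint Jacobian [[0.8, 0.6], [0.6, 0.0]] has spectral radius
# (0.8 + sqrt(0.64 + 1.44)) / 2 ≈ 1.12 > 1, so |h| blows up.
print(abs(h))
```

Drop the cross-terms (b = c = 0) and the same loop converges geometrically to 0, which is exactly what freezing the context buys you.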

The Pattern: Frozen Context

The fix is the same in both cases. Any external component — Mamba state, attention, anything — can safely enter the DEQ if it follows this structure:

  1. Compute it once, before the Picard loop starts, from the previous converged state
  2. Freeze it — treat it as a constant during iteration (stop-gradient)
  3. Inject it additively into the loop body
  4. Update it after convergence, using h*

```python
# ── Prelude (before the loop) ───────────────────────────
ctx_A = component_A(prev_state_A)   # frozen — computed once
ctx_B = component_B(prev_state_B)   # frozen — computed once

# ── Picard loop ─────────────────────────────────────────
for k in range(max_iters):
    h_next = f(h_curr, x_t) + ctx_A + ctx_B   # ctx never changes
    if converged(h_next, h_curr): break
    h_curr = h_next
h_star = h_next   # converged fixed point

# ── Post-convergence updates ────────────────────────────
state_A = update_A(h_star, prev_state_A)
state_B = update_B(h_star, prev_state_B)
```

Why this works: ctx_A and ctx_B are constants with respect to h. The Jacobian df/dh contains no cross-terms from them. Spectral normalization of the h-dependent path alone is sufficient.

The frozen terms still shift h* — they participate in the final fixed point. They just don't affect the convergence guarantee.
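
A one-dimensional example shows both properties at once: adding a frozen constant moves the fixed point, but the contraction factor, and hence the convergence guarantee, is unchanged (numbers are illustrative):

```python
def solve(f, h=0.0, tol=1e-10, max_iters=100):
    """Picard iteration to a fixed point of a scalar contraction."""
    for k in range(max_iters):
        h_next = f(h)
        if abs(h_next - h) < tol:
            return h_next, k + 1
        h = h_next
    raise RuntimeError("did not converge")

base = lambda h: 0.5 * h + 1.0        # L = 0.5, fixed point h* = 2
ctx = 3.0                              # frozen constant: d(ctx)/dh = 0
with_ctx = lambda h: base(h) + ctx     # still L = 0.5, h* = 4 / (1 - 0.5) = 8

h1, k1 = solve(base)
h2, k2 = solve(with_ctx)
print(h1, h2)   # ≈ 2.0 and ≈ 8.0, both converging at rate 0.5 per step
```

The frozen term shifts h* from 2 to 8, but the error still halves every iteration in both cases.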

*(Figure: AIDEEN architecture, with the Mamba state updated outside the DEQ loop.)*

Applied to Mamba

For temporal memory across tokens:

```python
# Prelude: M_{t-1} → hist_ctx (frozen)
hist_ctx = gate(W_hist * M_prev)   # stop-gradient

# Loop: hist_ctx is read-only
h_next = RMSNorm(attn_signal + slot_bias + hist_ctx)

# Post-convergence: update M
M_t = a * M_prev + (1 - a) * x_proj(h_star)
```

The Mamba state carries temporal information token-to-token. The DEQ sees it as a fixed bias — shifts the fixed point but doesn't affect contractivity.
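
Putting the three phases together, one token step can be sketched end to end like this (linear projections, a tanh nonlinearity, and all shapes are illustrative assumptions, not AIDEEN's real components):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
W = rng.standard_normal((d, d))
W *= 0.5 / np.linalg.norm(W, 2)           # spectral normalization: L < 1
W_hist = rng.standard_normal((d, d)) * 0.1
alpha = 0.9                                # EMA decay for the temporal state

def token_step(x_t, M_prev, max_iters=50, tol=1e-6):
    # Prelude: frozen context from the *previous* state; no h involved.
    hist_ctx = np.tanh(W_hist @ M_prev)

    # Picard loop: hist_ctx is a constant, only W @ h depends on h.
    h = np.zeros(d)
    for k in range(max_iters):
        h_next = np.tanh(W @ h + x_t) + hist_ctx
        if np.linalg.norm(h_next - h) < tol:
            h = h_next
            break
        h = h_next

    # Post-convergence: update the state exactly once, from h*.
    M_t = alpha * M_prev + (1 - alpha) * h
    return h, M_t

M = np.zeros(d)
for t in range(5):                         # run a few tokens
    h_star, M = token_step(rng.standard_normal(d), M)
```

Because `hist_ctx` is constant inside the loop, the contraction bound depends only on `W`, which is what the spectral normalization controls.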

Applied to Slot Attention

Same pattern: Q, K, and V are projected from the previous converged state, and the attention output is computed once and frozen:

```python
# Prelude: compute attention from H_prev
Q, K, V = project(H_prev)
attn_ctx = softmax(Q @ K.T / sqrt(d)) @ V   # frozen

# Loop: attn_ctx is read-only
h_next = RMSNorm(signal + attn_ctx + hist_ctx + slot_bias)
```

Cross-slot attention runs once per token at full cost, but zero times per Picard iteration. The DEQ refines h against a fixed attention context rather than chasing a moving one.
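
A minimal concrete version of the prelude step (projection matrices, slot count, and dimensions are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, d = 4, 8
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

def frozen_attn_ctx(H_prev):
    """Attention computed once from the previous converged slots.

    The result is treated as a constant inside the Picard loop,
    so d(attn_ctx)/dh = 0 for the current iteration's h.
    """
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable row-wise softmax.
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V                     # shape (n_slots, d)

H_prev = rng.standard_normal((n_slots, d))
attn_ctx = frozen_attn_ctx(H_prev)   # once per token, before the loop
```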

In the GPU Shader

The boundary is explicit in our WGSL shaders. Inside the Picard loop:

```wgsl
// hist_ctx was computed in the prelude from M_{t-1}.
// ∂hist_ctx/∂h = 0 — no contribution to the Lipschitz bound.
let hist_ctx    = Scratch[hist_ctx_base + slot * d_model + d];  // READ ONLY
let attn_ctx    = Scratch[attn_base    + slot * d_model + d];  // READ ONLY
let attn_signal = Scratch[signal_base  + slot * d_model + d];  // h-dependent

let final_h = attn_signal + attn_ctx + hist_ctx + slot_bias;
H_next[...] = final_h;   // h updated — external state untouched
```

After convergence:

```wgsl
// M_t written exactly once, here.
let m_new = alpha * M_prev + (1.0 - alpha) * x_proj(h_star);
H_curr[carry_base + slot * d_model + d] = m_new;
```

The ∂/∂h = 0 annotation is load-bearing. It's what lets spectral normalization work on the h-dependent path alone.

Does It Hold?

After adopting this pattern for both Mamba and attention:

| Metric | Value |
| --- | --- |
| Picard convergence rate | 100% (0 unconverged tokens) |
| Average iterations | 5-6 per token (cap: 20) |
| Contractivity | < 0.85 throughout training |
| Training stability | 12+ hours continuous on AMD Radeon 780M (2 GB VRAM) |

The history signal contributes a stable context (hist/inj ratio ~ 0.25) — temporal information flows into the DEQ without destabilizing it.

The General Rule

Any component can be added to a DEQ using this pattern:

  • Eligible: anything whose output can be computed from the previous converged state — recurrent memory, attention over past tokens, retrieved embeddings, external signals
  • Not eligible: anything that must see the current h to compute its output and feeds back into f — that creates the circular dependency

The frozen context pattern lets DEQ models grow in expressivity without paying in convergence stability. The Lipschitz constraint stays local to the h-dependent core.

Try It

AIDEEN is open source (MIT license), written entirely in Rust with WGSL GPU compute shaders. No Python, no CUDA — runs on any GPU with Vulkan/Metal/DX12/WebGPU support.

```shell
git clone https://github.com/SergioAriel/aideen.git
cd aideen
cargo build --release --workspace
cargo test --workspace --exclude aideen-block --exclude aideen-engine --exclude aideen-node
```

We're currently training our first full model and will publish DEQ vs. transformer benchmarks soon.


We're two developers building this from scratch. If you're interested in DEQ architectures, Rust ML, or WebGPU compute — contributions welcome.
