DEQ models converge only if their update function is a contraction. We found a design pattern that lets you inject arbitrary external state — Mamba memory, attention, anything — without touching the Lipschitz bound.
tags: rust, machinelearning, ai, webgpu
We're building AIDEEN, an open-source AI engine in Rust that runs on consumer GPUs via WebGPU. The core is a Deep Equilibrium Model (DEQ) — a single parameter block iterated until it converges to a fixed point.
DEQs have a hard constraint: the update function must be a contraction. Every component you add risks widening the Lipschitz bound and breaking convergence. We ran into this wall trying to add Mamba-style temporal memory, then attention. Both broke convergence the same way. Both were fixed the same way.
Here's the pattern.
The Constraint
A DEQ finds h* by iterating:
h^(k+1) = f(h^(k); x) until |h^(k+1) - h^(k)| < epsilon
Convergence requires the Lipschitz constant L(f) < 1. We enforce this with spectral normalization on every weight matrix, applied every 4 gradient steps. This works cleanly as long as the Jacobian df/dh contains only h-dependent terms.
The moment you add something that couples back to h inside the loop, the Jacobian grows cross-terms that blow up the bound.
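To make the enforcement step concrete, here's a minimal NumPy sketch of spectral normalization via power iteration. The function names (`spectral_norm`, `spectral_normalize`) and the 0.85 target are illustrative, not AIDEEN's actual Rust implementation:

```python
import numpy as np

def spectral_norm(W, iters=20):
    """Estimate the largest singular value of W by power iteration."""
    v = np.random.default_rng(0).standard_normal(W.shape[1])
    for _ in range(iters):
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)

def spectral_normalize(W, target=0.85):
    """Rescale W so its spectral norm is at most `target` (< 1)."""
    sigma = spectral_norm(W)
    return W if sigma <= target else W * (target / sigma)

W = np.random.default_rng(1).standard_normal((64, 64))
W_hat = spectral_normalize(W)   # now safe to use inside the DEQ update
```

Rescaling by the top singular value bounds the Lipschitz constant of each linear map; composing them with 1-Lipschitz nonlinearities keeps the whole h-path contractive.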
What Happens When You Ignore It
We tried adding a Mamba-style SSM inside the loop:
h^(k+1) = f(h^(k), M^(k); x)
M^(k+1) = g(h^(k), M^(k))
Now h depends on M depends on h. The combined Jacobian has cross-terms. In practice:
- Spectral norm of the combined system exceeded 1.0
- Picard iteration hit the cap without converging
- The model oscillated indefinitely
No amount of damping fixed it. The feedback loop is structural — you can't regularize your way out of it.
The same thing happened when we naively put slot attention inside the iteration: the attention weights depend on h, V depends on h, and the output feeds back into h. Same instability.
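A toy linear version shows why the failure is structural. Each path is contractive on its own, but the h ↔ M cross-terms make the combined Jacobian expansive (all matrices here are illustrative constants, not AIDEEN's):

```python
import numpy as np

d = 8
I = np.eye(d)
# Each path is contractive in isolation (norm 0.8 < 1) ...
A = 0.8 * I   # df/dh  (spectrally normalized)
D = 0.8 * I   # dg/dM  (spectrally normalized)
# ... but the cross-terms from the h <-> M feedback are not constrained.
B = 0.7 * I   # df/dM
C = 0.7 * I   # dg/dh

J = np.block([[A, B], [C, D]])
print(np.linalg.norm(J, 2))   # 1.5: the combined system is expansive

h = m = np.ones(d)
for _ in range(50):
    h, m = A @ h + B @ m, C @ h + D @ m
print(np.linalg.norm(h))      # grows without bound; no fixed point is reached
```

Normalizing A and D individually cannot help: the spectral norm of the block matrix is 0.8 + 0.7 = 1.5, driven entirely by the cross-terms.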
The Pattern: Frozen Context
The fix is the same in both cases. Any external component — Mamba state, attention, anything — can safely enter the DEQ if it follows this structure:
- Compute it once, before the Picard loop starts, from the previous converged state
- Freeze it — treat it as a constant during iteration (stop-gradient)
- Inject it additively into the loop body
- Update it after convergence, using h*
```python
# ── Prelude (before the loop) ───────────────────────────
ctx_A = component_A(prev_state_A)   # frozen — computed once
ctx_B = component_B(prev_state_B)   # frozen — computed once

# ── Picard loop ─────────────────────────────────────────
for k in range(max_iters):
    h_next = f(h_curr, x_t) + ctx_A + ctx_B   # ctx never changes
    if converged(h_next, h_curr):
        break
    h_curr = h_next

# ── Post-convergence updates ────────────────────────────
state_A = update_A(h_star, prev_state_A)
state_B = update_B(h_star, prev_state_B)
```
Why this works: ctx_A and ctx_B are constants with respect to h. The Jacobian df/dh contains no cross-terms from them. Spectral normalization of the h-dependent path alone is sufficient.
The frozen terms still shift h* — they participate in the final fixed point. They just don't affect the convergence guarantee.
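A small numerical check of that claim, assuming a toy update f(h) = tanh(W h) + ctx (the shapes, seed, and `solve` helper are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.standard_normal((d, d))
W *= 0.9 / np.linalg.norm(W, 2)        # spectrally normalized: L(f) <= 0.9 < 1

def solve(ctx, tol=1e-8, max_iters=500):
    """Picard iteration for h = tanh(W @ h) + ctx, with ctx frozen."""
    h = np.zeros(d)
    for k in range(max_iters):
        h_next = np.tanh(W @ h) + ctx
        if np.linalg.norm(h_next - h) < tol:
            return h_next, k
        h = h_next
    raise RuntimeError("did not converge")

h_a, iters_a = solve(0.3 * np.ones(d))    # one frozen context
h_b, iters_b = solve(-0.5 * np.ones(d))   # a different frozen context

# Different contexts -> different fixed points, same convergence guarantee.
print(np.linalg.norm(h_a - h_b), iters_a, iters_b)
```

Both runs converge at the same geometric rate set by ||W||, but land on different equilibria: the constant term moves h* without entering df/dh.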
Applied to Mamba
For temporal memory across tokens:
```python
# Prelude: M_{t-1} → hist_ctx (frozen)
hist_ctx = gate(W_hist * M_prev)   # stop-gradient

# Loop: hist_ctx is read-only
h_next = RMSNorm(attn_signal + slot_bias + hist_ctx)

# Post-convergence: update M
M_t = a * M_prev + (1 - a) * x_proj(h_star)
```
The Mamba state carries temporal information token-to-token. The DEQ sees it as a fixed bias — shifts the fixed point but doesn't affect contractivity.
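Putting the three phases together, a runnable toy version of the token loop might look like the following. Here `gate` is simplified to a tanh, `x_proj` to the identity, and `W_hist`, `deq_step`, and all constants are illustrative, not AIDEEN's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)
d, alpha = 8, 0.9
W = rng.standard_normal((d, d))
W *= 0.85 / np.linalg.norm(W, 2)            # spectrally normalized core
W_hist = 0.1 * rng.standard_normal((d, d))  # history projection (illustrative)

def deq_step(x, M_prev):
    # Prelude: frozen history context, constant w.r.t. h during iteration.
    hist_ctx = np.tanh(W_hist @ M_prev)     # stand-in for gate(W_hist * M_prev)
    # Picard loop: hist_ctx is read-only.
    h = np.zeros(d)
    for _ in range(200):
        h_next = np.tanh(W @ h + x) + hist_ctx
        if np.linalg.norm(h_next - h) < 1e-8:
            break
        h = h_next
    # Post-convergence: EMA memory update from the converged h*.
    M_t = alpha * M_prev + (1 - alpha) * h_next
    return h_next, M_t

M = np.zeros(d)
for t in range(5):                          # token stream
    x_t = rng.standard_normal(d)
    h_star, M = deq_step(x_t, M)
```

Note the ordering: M is read before the loop and written only after it, so within any single token the DEQ never sees M change.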
Applied to Slot Attention
Same pattern. Q, K, V are projected from the previous converged state, and the attention weights are computed once, then frozen:
```python
# Prelude: compute attention from H_prev
Q, K, V = project(H_prev)
attn_ctx = softmax(Q @ K.T / sqrt(d)) @ V   # frozen

# Loop: attn_ctx is read-only
h_next = RMSNorm(signal + attn_ctx + hist_ctx + slot_bias)
```
Cross-slot attention runs once per token at full cost, but zero times per Picard iteration. The DEQ refines h against a fixed attention context rather than competing with a moving one.
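A minimal sketch of the prelude step, assuming single-head dot-product attention over slots (`frozen_attention`, the weight scaling, and the shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_slots, d = 4, 8

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def frozen_attention(H_prev, Wq, Wk, Wv):
    """Attention computed once, from the previous converged state only."""
    Q, K, V = H_prev @ Wq, H_prev @ Wk, H_prev @ Wv
    return softmax(Q @ K.T / np.sqrt(d)) @ V   # (n_slots, d), then frozen

Wq, Wk, Wv = (0.3 * rng.standard_normal((d, d)) for _ in range(3))
H_prev = rng.standard_normal((n_slots, d))
attn_ctx = frozen_attention(H_prev, Wq, Wk, Wv)

# Inside the Picard loop this enters only additively:
#     h_next = f(h_curr, x_t) + attn_ctx[slot]
# so d(attn_ctx)/dh = 0 and the contraction bound on f is untouched.
```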
In the GPU Shader
The boundary is explicit in our WGSL shaders. Inside the Picard loop:
```wgsl
// hist_ctx was computed in the prelude from M_{t-1}.
// ∂hist_ctx/∂h = 0 — no contribution to the Lipschitz bound.
let hist_ctx    = Scratch[hist_ctx_base + slot * d_model + d]; // READ ONLY
let attn_ctx    = Scratch[attn_base     + slot * d_model + d]; // READ ONLY
let attn_signal = Scratch[signal_base   + slot * d_model + d]; // h-dependent

let final_h = attn_signal + attn_ctx + hist_ctx + slot_bias;
H_next[...] = final_h; // h updated — external state untouched
```
After convergence:
```wgsl
// M_t written exactly once, here.
let m_new = alpha * M_prev + (1.0 - alpha) * x_proj(h_star);
H_curr[carry_base + slot * d_model + d] = m_new;
```
The ∂/∂h = 0 annotation is load-bearing. It's what lets spectral normalization work on the h-dependent path alone.
Does It Hold?
After adopting this pattern for both Mamba and attention:
| Metric | Value |
|---|---|
| Picard convergence rate | 100% (0 unconverged tokens) |
| Average iterations | 5-6 per token (cap: 20) |
| Contractivity | < 0.85 throughout training |
| Training stability | 12+ hours continuous on AMD Radeon 780M (2GB VRAM) |
The history signal contributes a stable context (hist/inj ratio ~ 0.25) — temporal information flows into the DEQ without destabilizing it.
The General Rule
Any component can be added to a DEQ using this pattern:
- Eligible: anything whose output can be computed from the previous converged state — recurrent memory, attention over past tokens, retrieved embeddings, external signals
- Not eligible: anything that must see the current h to compute its output and feeds back into f — that creates the circular dependency
The frozen context pattern lets DEQ models grow in expressivity without paying in convergence stability. The Lipschitz constraint stays local to the h-dependent core.
Try It
AIDEEN is open source (MIT license), written entirely in Rust with WGSL GPU compute shaders. No Python, no CUDA — runs on any GPU with Vulkan/Metal/DX12/WebGPU support.
```shell
git clone https://github.com/SergioAriel/aideen.git
cd aideen
cargo build --release --workspace
cargo test --workspace --exclude aideen-block --exclude aideen-engine --exclude aideen-node
```
We're currently training our first full model and will publish DEQ vs. transformer benchmarks soon.
References:
- Bai, Kolter & Koltun (2019). Deep Equilibrium Models. NeurIPS.
- Bai et al. (2021). Stabilizing Equilibrium Models by Jacobian Regularization. ICML.
- Gu & Dao (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
- Winston & Kolter (2020). Monotone Operator Equilibrium Networks. NeurIPS.
We're two developers building this from scratch. If you're interested in DEQ architectures, Rust ML, or WebGPU compute — contributions welcome.
