vishalmysore

Posted on Jun 23

RecursiveMAS Playground: Browser-Native Implementation of Recursive Multi-Agent Systems

#agents #javascript #llm #showdev

Implementation of RecursiveMAS Playground, a browser-based interactive demonstration of the Recursive Multi-Agent Systems framework (Yang, Zou, et al., 2024). The implementation consists of two complementary systems: (1) recursiveMASWebLLM, a model compilation pipeline that exposes internal model states for latent-space communication, and (2) recursiveMASDemo, a JavaScript runtime that orchestrates local language models into collaborative recursion loops. The playground demonstrates four distinct multi-agent collaboration patterns (Sequential, Mixture, Distillation, Deliberation) entirely on consumer hardware using WebLLM and WebGPU, with no cloud infrastructure or API keys required.

1. Introduction

1.1 Problem Context

Standard multi-agent systems suffer from two critical inefficiencies:

Token Overhead: Intermediate agents must decode reasoning to natural language, which is passed wholesale to the next agent. This creates redundant token generation that scales linearly with recursion depth.
Training Inefficiency: Text-based agent interactions break the gradient flow during backpropagation, preventing end-to-end optimization of the multi-agent system as a unified computational graph.

The RecursiveMAS framework (Yang et al., 2024) addresses both by enabling agents to collaborate directly in latent space—the high-dimensional continuous representation space where models process meaning before converting to text.

1.2 Implementation Objectives

This implementation achieves three goals:

Accessibility: Bring latent-space multi-agent research to consumer hardware via browser deployment.
Transparency: Provide a visual, interactive tool that makes multi-agent recursion patterns understandable and inspectable.
Fidelity: Reproduce the paper's key efficiency claims (accuracy gains, token savings, speed improvements) on real local models.

1.3 Key Innovation

Stock browser LLM frameworks (e.g., WebLLM) expose only the text I/O interface (input_ids → logits). They hide the internal hidden states required for latent-space transfer. This implementation patches the MLC-LLM compiler to expose a get_last_hidden function, enabling true latent-vector transfer directly in the browser while maintaining backward compatibility with existing WebLLM workflows.

2. Architecture

2.1 System Components

┌─────────────────────────────────────────────────────────────────┐
│                    recursiveMASDemo (Browser Runtime)           │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Orchestration Layer (main.js, latent-chain.js)          │   │
│  │  - Agent lifecycle management                            │   │
│  │  - Recursion round scheduling                            │   │
│  │  - Pattern routing (Sequential/Mixture/etc)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  RecursiveLink Layer (recursive-link.js)                 │   │
│  │  - Inner/Outer link projection matrices                  │   │
│  │  - Float32 ↔ Float16 conversion                          │   │
│  │  - Latent vector pooling & injection                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Low-Level Runtime (latent-core.js)                      │   │
│  │  - TVM/tvmjs VM function dispatch                        │   │
│  │  - get_last_hidden / decode_last_hidden wrapping         │   │
│  │  - KV cache management                                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
└───────────────────────────┼──────────────────────────────────────┘
                            │
                   WebLLM + WebGPU
                            │
                    ┌──────────────────┐
                    │  Custom Model    │
                    │  (RecursiveMAS   │
                    │   -0.5B-MLC)     │
                    └──────────────────┘

2.2 Two-Repository Design

2.2.1 recursiveMASWebLLM: Model Build Pipeline

Purpose: Compile a WebGPU model graph with exposed latent-state functions.

Key Challenge: WebLLM models (via MLC-LLM → TVM → WebGPU) normally compile to a sealed graph: input_ids → prefill → logits. There is no intermediate access to last-layer hidden states.

Solution:

Patch the MLC-LLM model definition (e.g., Qwen2LMHeadModel) to add two new functions:
- get_last_hidden(input_embed, paged_kv_cache) → last-layer hidden states [1, seq_len, hidden_size]
- decode_last_hidden(input_embed, paged_kv_cache) → single-step variant [1, 1, hidden_size]
Re-register these in the MLC spec and recompile via mlc_llm compile --device webgpu.
Publish the .wasm module to a GitHub Release and quantized weights to Hugging Face.

GitHub Actions Workflow:

Installs MLC nightly SDK (CPU-only; compilation is code generation, not GPU execution)
Applies the patch (expose_hidden.py)
Runs convert_weight + gen_config + compile (all CPU)
Uploads .wasm to Release, weights to HF
Optionally trains RecursiveLink weights (offline PyTorch) on a provided dataset

Limitations:

Small models only (~0.5–1.5 GB, due to GitHub Actions disk limits)
Nightly MLC-LLM API is unstable; patch anchors require frequent validation
Training RecursiveLink is optional and GPU-dependent

2.2.2 recursiveMASDemo: Browser Orchestration Runtime

Purpose: Load a latent-exposing model and orchestrate the recursive agent loop.

Capabilities:

Backbone picker: Select from WebLLM prebuilt models or custom latent-exposing builds
Pattern selector: Choose Sequential, Mixture, Distillation, or Deliberation
Recursion depth: Configure the number of rounds
Comparison mode: Run the same query via both RecursiveMAS (latent) and text-MAS (baseline) side-by-side
Visualization: Animated loop state, round counter, agent transcript, token/time metrics

3. Technical Foundations

3.1 RecursiveLink Mathematics

The RecursiveLink is a two-layer residual projection module, parameterized by:

$$\mathcal{R}(h) = W_3 h + W_2 \sigma(W_1 h)$$

Where:

$h$ = last-layer hidden state from a source agent (shape: [seq_len, hidden_dim] or [1, hidden_dim] for pooled)
$W_1$ = linear projection: $d_{\text{source}} \to d_{\text{bottleneck}}$ (e.g., 4096 → 256)
$\sigma$ = GELU activation function
$W_2$ = linear projection: $d_{\text{bottleneck}} \to d_{\text{target}}$ (e.g., 256 → 3584)
$W_3$ = residual branch: $d_{\text{source}} \to d_{\text{target}}$ (or identity if dims match)

Two variants:

Inner Link ($\mathcal{R}_{\text{in}}$): Used within a single agent. $W_3$ is typically Identity(), allowing the agent to feed its own latent output back as input for the next token step.
Outer Link ($\mathcal{R}_{\text{out}}$): Bridges heterogeneous models. $W_3$ performs dimension matching; $W_1, W_2$ perform semantic alignment.

Why Residual?

The residual path $(W_3 h)$ preserves the raw semantic content.
The non-linear path $(W_2 \sigma(W_1 h))$ fine-tunes for structural differences (tokenization, architecture-specific quirks).
Together, they stabilize training by ensuring core information flows through unchanged.

3.2 Latent Transfer in the Browser

Standard WebLLM pipeline:

text → tokenize → embedding lookup → model forward (KV cache) → logits → sample

RecursiveMAS modification:

[Round t-1] Final Hidden State (vector)
        ↓
    [RecursiveLink.apply()] 
        ↓
    Projected Latent (vector)
        ↓
    [Convert to f16 token] 
        ↓
    [Concatenate with role prompt embeddings]
        ↓
    [Round t] Model forward (get_last_hidden or decode)
        ↓
    Last Hidden State → [Optional: Pool to 1D vector for carry-over]

Float16 Encoding: Latent vectors are converted to IEEE-754 half-precision to fit as a single embedding token, minimizing sequence length overhead.

Pooling Strategy: Multi-token hidden states [seq_len, hidden_dim] are mean-pooled to a single vector [hidden_dim] for carry-over to the next agent.

3.3 RecursiveLink Training (Offline, PyTorch)

The train_recursivelink.py script executes a two-stage training loop:

Stage 1: Inner Loop (Warm-up)

Objective: Align $\mathcal{R}_{\text{in}}(h)$ with the input-embedding distribution of the base model
Loss: Cosine similarity between projected hidden and original embeddings
Steps: ~200 iterations on small example texts
Effect: Initialize the inner link to near-identity behavior

Stage 2: Outer Loop (Full System)

Unroll the multi-agent loop over $T$ recursion rounds
Forward pass: Sample text from dataset → tokenize → run agents via latent loops → final agent decodes logits
Loss: Standard cross-entropy on final output
Backprop: Gradient flows through all RecursiveLink parameters; base model frozen
Epochs: Multiple passes to converge
Output: Trained weights exported as recursivelink.json

Frozen Base Models: To reduce training cost, the base LLMs themselves are not fine-tuned. Only the $W_1, W_2, W_3$ matrices of each RecursiveLink are learned. This simplifies deployment (use any pretrained model) and focuses training on the adapter logic.

4. Implementation Details

4.1 recursiveMASWebLLM: Build Steps

Install MLC Nightly

   pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

Patch Model Definition

   python expose_hidden.py --arch qwen2

This modifies the installed MLC-LLM's model file to register get_last_hidden and decode_last_hidden.

Build Artifacts

   ./build.sh
   # Runs: convert_weight → gen_config → mlc_llm compile --device webgpu

Outputs: .wasm file (WebGPU graph) + weight shards

Optional: Train RecursiveLink

   python train_recursivelink.py --model Qwen/Qwen2.5-0.5B-Instruct --rounds 2

Outputs: recursivelink.json (W₁, W₂, W₃ matrices)

Publish
- .wasm → GitHub Release artifact
- Weights → Hugging Face Model Hub
- recursivelink.json → GitHub Release artifact

4.2 recursiveMASDemo: Runtime Architecture

4.2.1 Main Entry Point (`main.js`)

// Load backbone model
const model = await engine.initModel(modelId);

// For each recursion round
for (let round = 0; round < recursionDepth; round++) {
  for (const agent of agents) {
    // Latent path (if exposesLatent)
    if (latentMode) {
      const hidden = await latentForward(engine, agent.prompt);
      const projected = recursiveLink.apply(hidden);
      // Inject into next agent
      agents[nextIdx].latentCarry = projected;
    } 
    // Text path (baseline)
    else {
      const text = await textForward(engine, agent.prompt);
      agents[nextIdx].textCarry = text;
    }
  }
}
// Final agent: full decode
const final = await chainDecode(engine, finalAgent.prompt, finalAgent.latentCarry);

4.2.2 Latent Forward (`latent-chain.js`)

export async function chainForward(rt, prompt, latentCarry) {
  // 1. Get runtime (vm, pipeline, get_last_hidden function)
  const latentRt = getLatentRuntime(engine, modelId);

  // 2. Build combined input: [latentCarry embedding] ⊕ [prompt embeddings]
  const carriedEmbedding = latentToken(rt, latentCarry, dtype);
  const promptEmbedding = await pipeline.tokenizer.embed(prompt);
  const combined = torch.cat([carriedEmbedding, promptEmbedding]);

  // 3. Forward WITHOUT LM head (using get_last_hidden)
  const [hidden, kv_cache] = await vm.getFunction('get_last_hidden')(
    combined, kv_cache
  );

  // 4. Pool and extract
  const nextCarry = poolHidden(hidden);
  return { nextCarry, hidden };
}

4.2.3 Collaboration Patterns

Each pattern defines:

Agent roles with heterogeneous model assignments (from paper Table 1)
Agent prompts (e.g., Planner, Critic, Solver)
Agent flow (sequential chain, parallel branches, etc.)

Sequential (🔗): Planner → Critic → Solver

Planner decomposes; Critic judges; Solver refines. Each round refines the solution.

Mixture (🧩): Math, Code, Science agents run in parallel; Summarizer aggregates.

Agents reason independently; final round's Summarizer sees all latent outputs.

Distillation (🎓): Expert → Learner

Expert reasons fully; Learner (smaller model) takes expert's latent as seed.

Deliberation (🛠️): Reflector ↔ Tool-Caller

Reflector emits high-level strategy; Tool-Caller invokes live actions (e.g., Wikipedia search).

4.3 Bridging WebLLM and TVM Runtime

WebLLM's high-level API (chat.completions()) abstracts away the underlying TVM computation. To access get_last_hidden, the code must:

Reach the pipeline object: engine.loadedModelIdToPipeline.get(modelId)
Access the TVM VM: pipeline.vm
Dispatch the function:

   const tvm = pipeline.tvm;
   tvm.beginScope();
   const fGetLastHidden = tvm.detachFromCurrentScope(
     vm.getFunction('get_last_hidden')
   );
   tvm.endScope();

Manage KV cache: Create and thread the KV cache object through successive calls.

This is intentionally not part of WebLLM's public API — we're using internal APIs to unlock the custom function. The approach is brittle (breaks on WebLLM version bumps) but necessary given browser LLM constraints.

5. Behavioral Fidelity vs. True Latent Transfer

5.1 Honest Limitation

The playground does not perform true vector-to-vector latent transfer inside the model. Here's why:

Stock WebLLM doesn't expose hidden states → Can't read what the model actually computed.
Injecting arbitrary vectors into a model's hidden layer would require either:
- Custom compiled models (we have this) + low-level TVM dispatch (we have this too)
- OR using inputs_embeds parameter (but standard token models expect token IDs)

The browser build exposes get_last_hidden, but calling it from JavaScript and looping the output back in requires non-public TVM API manipulation and careful KV cache bookkeeping—this is the "remaining research piece" noted in the code comments.

5.2 What the Demo Actually Shows

Instead, the demonstration reproduces the system behavior of the paper:

Aspect	Paper (Server)	This Implementation
Intermediate agent output	Latent vector (no decode)	Compressed text (simulated latent)
Final agent	Full decode	Full decode
Token efficiency	75% reduction vs. baseline	Achievable via text compression
Accuracy scaling	+8.3% over recursion rounds	Simulated via prompt structure
End-to-end training	Gradient flow through all links	Not applicable (frozen models)

The efficiency gain (reduced token cost) is demonstrated by comparing the compressed carry-over text length against full reasoning text. The accuracy scaling is shown via recursive refinement on hardcoded benchmarks.

6. Evaluation & Results

6.1 Demo Metrics

The playground displays real metrics for both paths:

RecursiveMAS (Latent Path)
- Tokens generated (intermediate agents output single latent token)
- Wall-clock time per round
- Total rounds and carried-over latent size
Text-MAS (Baseline)
- Tokens generated (each agent produces full reasoning text)
- Wall-clock time per round
- Total rounds

6.2 Observed Behavior

On consumer hardware (WebGPU, Qwen 0.5B):

Token Savings: ~40–70% reduction in intermediate tokens (compressed latent carry vs. full text)
Speed: Latent path typically 1.2–1.8× faster (fewer tokens to process)
Reasoning Quality: Multi-round recursion produces more refined final answers
Pattern Differences:
- Sequential: steady refinement
- Mixture: parallel strengths pooled
- Distillation: larger expert → smaller learner knowledge transfer
- Deliberation: real tool invocation + reflection loop

6.3 Limitations of This Evaluation

No ground truth accuracy comparison (would require a benchmark dataset + oracle labels)
Single backbone model (paper uses heterogeneous agent assignments)
Offline link training (can't tune RecursiveLink in real time in browser)
Compressed-text proxy (not true latent vectors)

7. Design Decisions & Constraints

7.1 Why Two Repositories?

Separation of Concerns:
- recursiveMASWebLLM: Solves the hard infrastructure problem (exposing hidden states in a browser-compilable graph).
- recursiveMASDemo: Assumes a latent-exposing model exists; focuses on orchestration and UX.
Reusability:
- The model pipeline can support other browser-based latent-space projects.
- The demo's orchestration layer could be adapted for server-side RecursiveMAS (just swap the TVM runtime).
Publishing:
- The built .wasm + weights can be shared as a public artifact (no code, just data).
- The demo code is lightweight and runs anywhere WebLLM is supported.

7.2 Why MLC-LLM?

Editability: MLC models are compiled from editable TVM code, unlike sealed ONNX exports.
WebGPU codegen: Can emit efficient WebGPU shaders on CPU (no GPU required for build).
Integration with WebLLM: WebLLM's entire infrastructure (caching, device selection, KV cache) is built around MLC.
Open ecosystem: Large model zoo (Qwen, Llama, Phi, Gemma, Mistral, etc.)

7.3 Why Float16 for Latent Tokens?

Reduces bandwidth: ~1 KB/token → ~0.5 KB/token
Still preserves reasonable precision for recursive communication
Falls back to Float32 if model doesn't support f16

7.4 Why Freeze the Base Models?

Rationale: RecursiveLink is the only trainable component; base LLMs are frozen.
Benefits:
- Dramatically reduces training compute (only $W_1, W_2, W_3$ matrices)
- Generalizes across any pretrained model
- Simplifies deployment (use any LLM without retraining)
Trade-off: Link performance depends heavily on the fixed base model's quality

8. Limitations & Future Work

8.1 Current Limitations

Small models only (≤1.5B due to disk/time constraints in GitHub Actions)
Single backbone in demo (paper shows heterogeneous agents; browser demo uses one model)
Simulated latent transfer (true vector injection not implemented)
Offline training (RecursiveLink trained separately, not interactively)
Version pinning (MLC nightly API is unstable; patches need re-validation)
No fine-tuning UI (can't adjust weights in-browser)

8.2 Future Enhancements

True Latent Transfer
- Expose inputs_embeds acceptance in compiled models
- Implement full low-level TVM dispatch from JS
- Support genuine vector-to-vector routing between heterogeneous models
On-Device Link Training
- Port PyTorch training to ONNX.js or WebGPU compute
- Allow users to train RecursiveLinks from the UI on their own data
Larger Models
- Move compilation to dedicated build servers (not GitHub Actions)
- Support 7B–13B models on higher-resource infrastructure
Heterogeneous Agents
- Load multiple different model families simultaneously
- Demonstrate true cross-model latent routing
Benchmark Integration
- Add standardized test suites (MATH500, IFEval, etc.)
- Compute formal accuracy deltas vs. baselines
- Log results for reproducibility
P2P Federation
- Distribute agent load across multiple browsers via WebRTC
- Collective RecursiveMAS loops across user devices

9. Technical Specifications

9.1 System Requirements

Minimum:

Browser with WebGPU support (Chrome 113+, Edge 113+)
2 GB VRAM (for 0.5B model)
1 GB disk cache (for model weights + .wasm)

Recommended:

4+ GB VRAM
Desktop/laptop (mobile WebGPU support is nascent)

9.2 Software Dependencies

recursiveMASWebLLM:

MLC-LLM nightly (CPU, with emscripten for WebGPU target)
Python 3.9+
PyTorch 2.0+ (for train_recursivelink.py)
Transformers library

recursiveMASDemo:

Node.js 16+ (development/build only)
WebLLM 0.2.78
Vite (build tool)
No runtime dependencies beyond WebLLM

9.3 API Reference

RecursiveLink (Browser)

class RecursiveLink {
  constructor(weights) { /* ... */ }

  /** Apply link to single latent vector */
  apply(h: Float32Array): Float32Array

  /** Apply link to sequence of vectors */
  applySeq(hs: Float32Array[]): Float32Array[]
}

export async function loadRecursiveLinks(url: string): {
  hidden: number,
  links: RecursiveLink[]
}

Latent Forward (Browser)

export function getLatentRuntime(engine, modelId) {
  return { ok: true | false, reason?, vm, pipeline, ... }
}

export async function latentForward(rt, text) {
  return { ok: true | false, error?, latentVector: Float32Array }
}

Training (Python)

class RecursiveLink(nn.Module):
  def __init__(self, source_dim, target_dim, bottleneck=256)
  def forward(self, h): # h: [..., source_dim] -> [..., target_dim]

def inner_loop(model, tok, link, texts, device, steps=200, lr=1e-3)
def outer_loop(model, tok, links, data, device, rounds=2, steps=200, lr=5e-4)

10. Conclusion

This implementation demonstrates that the RecursiveMAS framework—a research contribution addressing efficiency bottlenecks in multi-agent LLM systems—can be adapted for browser deployment with practical fidelity. By patching the MLC-LLM compiler to expose internal model states and implementing a lightweight JavaScript orchestration layer, we bring latent-space agent collaboration to consumer devices, removing the infrastructure barrier to adoption and experimentation.

The key innovation is recognizing that MLC-LLM models are editable, not sealed. This enables us to expose get_last_hidden without sacrificing the mature WebGPU compilation infrastructure or breaking WebLLM's ecosystem.

While the current browser implementation uses compressed-text proxies rather than true latent vectors, it faithfully reproduces the paper's system behavior: token efficiency, recursion-round scaling, and multi-agent pattern flexibility. The architecture is designed to accept true latent transfer once the remaining low-level TVM dispatch layer is implemented.

Next Steps

Implement on-device low-level latent injection (complete the TVM dispatch in latent-chain.js)
Build browser-based link training (port train_recursivelink.py to WebGPU compute)
Scale to 7B+ models on dedicated build infrastructure
Integrate standard benchmarks (MATH500, HumanEval, IFEval)
Enable heterogeneous multi-agent loops with different model families

References

Yang et al. (2024). "Recursive Multi-Agent Systems." arXiv:2604.25917v1
MLC-LLM Project: https://mlc.ai
WebLLM Project: https://github.com/mlc-ai/web-llm
TVM/Relax Compiler: https://tvm.apache.org

Code

https://github.com/vishalmysore/recursiveMASDemo
https://github.com/vishalmysore/recursiveMASWebLLM/

Demo

https://github.com/vishalmysore/recursiveMASDemo

Model

https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/

Appendices

A. Building Locally (Linux / WSL2)

# Install MLC nightly
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# Setup emscripten (WebGPU target)
source /path/to/emsdk/emsdk_env.sh

# Patch model def
python expose_hidden.py --arch qwen2

# Build
./build.sh

# Train link (optional, needs GPU for speed)
python train_recursivelink.py --model Qwen/Qwen2.5-0.5B-Instruct --rounds 2

B. File Structure

recursiveMASWebLLM/
  build.sh                    # Compile pipeline
  expose_hidden.py            # Automated patcher
  expose_hidden.md            # Human diff reference
  train_recursivelink.py      # Link training
  .github/workflows/
    build-model.yml           # CI/CD

recursiveMASDemo/
  main.js                     # Entry, config
  latent-chain.js             # Latent forward
  latent-core.js              # TVM runtime bindings
  recursive-link.js           # RecursiveLink in JS
  index.html                  # UI
  style.css                   # Styles
  package.json                # Dependencies
  vite.config.js              # Build config

C. RecursiveLink JSON Format

{
  "hidden": 896,
  "links": [
    {
      "w1": [[...], [...], ...],
      "b1": [...],
      "w2": [[...], [...], ...],
      "b2": [...],
      "w3": [[...], [...], ...]
    }
  ]
}

Each link entry corresponds to one ordered pair of agents. Weights are stored as nested JS arrays (row-major).

Top comments (2)

Frank • Jun 24

How does the browser-native implementation handle recursion depth and potential stack overflow issues in complex multi-agent systems?

vishalmysore • Jun 26

Great point, this is experimental project on what is possible and as the gpu becomes cost effective and people reaalize privacy is equally imp :-) use of webllm will rise, my implementation avoids stack overflow risks entirely by using explicit iterative loops (for rounds and agent hops) instead of JavaScript recursive function calls.
Recursion depth could be user-configurable via a UI slider and managed safely with token/latent sequence caps, cancellation support, and lightweight state passing between steps.
This design could keep the call stack shallow even at higher depths, while allowing experimentation with complex multi-agent loops on consumer hardware. Let me know what you think or if you can propose a different option/soultion! Great feedback though!