vishalmysore

RecursiveMAS Playground: Browser-Native Implementation of Recursive Multi-Agent Systems

This post presents RecursiveMAS Playground, a browser-based interactive demonstration of the Recursive Multi-Agent Systems framework (Yang, Zou, et al., 2024). The implementation consists of two complementary systems: (1) recursiveMASWebLLM, a model compilation pipeline that exposes internal model states for latent-space communication, and (2) recursiveMASDemo, a JavaScript runtime that orchestrates local language models into collaborative recursion loops. The playground demonstrates four multi-agent collaboration patterns (Sequential, Mixture, Distillation, Deliberation) entirely on consumer hardware using WebLLM and WebGPU, with no cloud infrastructure or API keys required.

1. Introduction

1.1 Problem Context

Standard multi-agent systems suffer from two critical inefficiencies:

  1. Token Overhead: Intermediate agents must decode reasoning to natural language, which is passed wholesale to the next agent. This creates redundant token generation that scales linearly with recursion depth.

  2. Training Inefficiency: Text-based agent interactions break the gradient flow during backpropagation, preventing end-to-end optimization of the multi-agent system as a unified computational graph.

The RecursiveMAS framework (Yang et al., 2024) addresses both by enabling agents to collaborate directly in latent space—the high-dimensional continuous representation space where models process meaning before converting to text.

1.2 Implementation Objectives

This implementation targets three goals:

  1. Accessibility: Bring latent-space multi-agent research to consumer hardware via browser deployment.
  2. Transparency: Provide a visual, interactive tool that makes multi-agent recursion patterns understandable and inspectable.
  3. Fidelity: Reproduce the paper's key efficiency claims (accuracy gains, token savings, speed improvements) on real local models.

1.3 Key Innovation

Stock browser LLM frameworks (e.g., WebLLM) expose only the text I/O interface (input_ids → logits). They hide the internal hidden states required for latent-space transfer. This implementation patches the MLC-LLM model definition to expose a get_last_hidden function in the compiled graph. That makes last-layer hidden states readable in the browser, which is the prerequisite for latent-vector transfer, while maintaining backward compatibility with existing WebLLM workflows. Section 5 describes how far the end-to-end latent loop currently goes.


2. Architecture

2.1 System Components

┌─────────────────────────────────────────────────────────────────┐
│                    recursiveMASDemo (Browser Runtime)           │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Orchestration Layer (main.js, latent-chain.js)          │   │
│  │  - Agent lifecycle management                            │   │
│  │  - Recursion round scheduling                            │   │
│  │  - Pattern routing (Sequential/Mixture/etc)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  RecursiveLink Layer (recursive-link.js)                 │   │
│  │  - Inner/Outer link projection matrices                  │   │
│  │  - Float32 ↔ Float16 conversion                          │   │
│  │  - Latent vector pooling & injection                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Low-Level Runtime (latent-core.js)                      │   │
│  │  - TVM/tvmjs VM function dispatch                        │   │
│  │  - get_last_hidden / decode_last_hidden wrapping         │   │
│  │  - KV cache management                                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
└───────────────────────────┼──────────────────────────────────────┘
                            │
                   WebLLM + WebGPU
                            │
                    ┌──────────────────┐
                    │  Custom Model    │
                    │  (RecursiveMAS   │
                    │   -0.5B-MLC)     │
                    └──────────────────┘

2.2 Two-Repository Design

2.2.1 recursiveMASWebLLM: Model Build Pipeline

Purpose: Compile a WebGPU model graph with exposed latent-state functions.

Key Challenge: WebLLM models (via MLC-LLM → TVM → WebGPU) normally compile to a sealed graph: input_ids → prefill → logits. There is no intermediate access to last-layer hidden states.

Solution:

  • Patch the MLC-LLM model definition (e.g., Qwen2LMHeadModel) to add two new functions:
    • get_last_hidden(input_embed, paged_kv_cache) → last-layer hidden states [1, seq_len, hidden_size]
    • decode_last_hidden(input_embed, paged_kv_cache) → single-step variant [1, 1, hidden_size]
  • Re-register these in the MLC spec and recompile via mlc_llm compile --device webgpu.
  • Publish the .wasm module to a GitHub Release and quantized weights to Hugging Face.

GitHub Actions Workflow:

  • Installs MLC nightly SDK (CPU-only; compilation is code generation, not GPU execution)
  • Applies the patch (expose_hidden.py)
  • Runs convert_weight + gen_config + compile (all CPU)
  • Uploads .wasm to Release, weights to HF
  • Optionally trains RecursiveLink weights (offline PyTorch) on a provided dataset

Limitations:

  • Small models only (~0.5–1.5 GB, due to GitHub Actions disk limits)
  • Nightly MLC-LLM API is unstable; patch anchors require frequent validation
  • Training RecursiveLink is optional and GPU-dependent

2.2.2 recursiveMASDemo: Browser Orchestration Runtime

Purpose: Load a latent-exposing model and orchestrate the recursive agent loop.

Capabilities:

  • Backbone picker: Select from WebLLM prebuilt models or custom latent-exposing builds
  • Pattern selector: Choose Sequential, Mixture, Distillation, or Deliberation
  • Recursion depth: Configure the number of rounds
  • Comparison mode: Run the same query via both RecursiveMAS (latent) and text-MAS (baseline) side-by-side
  • Visualization: Animated loop state, round counter, agent transcript, token/time metrics

3. Technical Foundations

3.1 RecursiveLink Mathematics

The RecursiveLink is a two-layer residual projection module, parameterized by:

$$\mathcal{R}(h) = W_3 h + W_2 \sigma(W_1 h)$$

Where:

  • $h$ = last-layer hidden state from a source agent (shape: [seq_len, hidden_dim] or [1, hidden_dim] for pooled)
  • $W_1$ = linear projection: $d_{\text{source}} \to d_{\text{bottleneck}}$ (e.g., 4096 → 256)
  • $\sigma$ = GELU activation function
  • $W_2$ = linear projection: $d_{\text{bottleneck}} \to d_{\text{target}}$ (e.g., 256 → 3584)
  • $W_3$ = residual branch: $d_{\text{source}} \to d_{\text{target}}$ (or identity if dims match)

Two variants:

  1. Inner Link ($\mathcal{R}_{\text{in}}$): Used within a single agent. $W_3$ is typically Identity(), allowing the agent to feed its own latent output back as input for the next token step.

  2. Outer Link ($\mathcal{R}_{\text{out}}$): Bridges heterogeneous models. $W_3$ performs dimension matching; $W_1, W_2$ perform semantic alignment.

Why Residual?

  • The residual path $(W_3 h)$ preserves the raw semantic content.
  • The non-linear path $(W_2 \sigma(W_1 h))$ fine-tunes for structural differences (tokenization, architecture-specific quirks).
  • Together, they stabilize training by ensuring core information flows through unchanged.
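
As a rough illustration of the formula above, the following sketch applies $\mathcal{R}(h)$ to a single vector using plain typed arrays. The row-major weight layout, the helper names (matVec, gelu, applyLink), and the tanh approximation of GELU are assumptions made for this example; they are not taken from recursive-link.js. Biases are omitted to match the equation.

```javascript
// Sketch of R(h) = W3·h + W2·GELU(W1·h) on plain typed arrays.
// Assumptions: matrices are row-major Float32Arrays; all names are illustrative.

// Multiply a [rows x cols] row-major matrix by a vector of length cols.
function matVec(w, rows, cols, x) {
  const out = new Float32Array(rows);
  for (let r = 0; r < rows; r++) {
    let acc = 0;
    for (let c = 0; c < cols; c++) acc += w[r * cols + c] * x[c];
    out[r] = acc;
  }
  return out;
}

// Tanh approximation of GELU (one common variant; the training script may use another).
function gelu(x) {
  return 0.5 * x * (1 + Math.tanh(Math.sqrt(2 / Math.PI) * (x + 0.044715 * x ** 3)));
}

// w3 === null means the residual branch is the identity (needs source === target).
function applyLink({ w1, w2, w3, source, bottleneck, target }, h) {
  const z = matVec(w1, bottleneck, source, h).map((v) => gelu(v));
  const nonlinear = matVec(w2, target, bottleneck, z);
  const residual = w3 ? matVec(w3, target, source, h) : h;
  return nonlinear.map((v, i) => v + residual[i]);
}

// Usage: with W2 = 0 and an identity residual, the link passes h through unchanged.
const link = {
  source: 2, bottleneck: 1, target: 2,
  w1: new Float32Array([1, 1]),
  w2: new Float32Array([0, 0]),
  w3: null,
};
const passthrough = applyLink(link, new Float32Array([0.5, -1.5]));
```

The pass-through case mirrors the "near-identity" starting point described for the inner link: when the non-linear branch contributes nothing, the residual path alone carries the hidden state forward.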

3.2 Latent Transfer in the Browser

Standard WebLLM pipeline:

text → tokenize → embedding lookup → model forward (KV cache) → logits → sample

RecursiveMAS modification:

[Round t-1] Final Hidden State (vector)
        ↓
    [RecursiveLink.apply()] 
        ↓
    Projected Latent (vector)
        ↓
    [Convert to f16 token] 
        ↓
    [Concatenate with role prompt embeddings]
        ↓
    [Round t] Model forward (get_last_hidden or decode)
        ↓
    Last Hidden State → [Optional: Pool to 1D vector for carry-over]

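Float16 Encoding: Latent vectors are converted to IEEE-754 half-precision so that they match the f16 embedding dtype of the compiled model. Each carried vector is injected as a single embedding position, which keeps the sequence-length overhead to one token per carry.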
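
A minimal sketch of the Float32 to Float16 step is shown below. It assumes the runtime accepts raw half-precision bit patterns in a Uint16Array; the function names are illustrative, and rounding is round-half-up rather than the round-to-nearest-even that IEEE-754 hardware normally uses.

```javascript
// Sketch: pack a Float32Array latent into IEEE-754 half-precision bit patterns.
// Assumption: the consumer accepts raw f16 bits in a Uint16Array; names are illustrative.

const f32 = new Float32Array(1);
const u32 = new Uint32Array(f32.buffer); // view the same 4 bytes as an integer

function floatToHalfBits(value) {
  f32[0] = value;
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  let mant = x & 0x7fffff;

  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf or NaN
  const e = exp - 127 + 15;                                     // re-bias the exponent
  if (e >= 0x1f) return sign | 0x7c00;                          // overflow -> Inf
  if (e <= 0) {
    if (e < -10) return sign;                                   // too small -> signed zero
    mant |= 0x800000;                                           // restore the implicit 1
    const shift = 14 - e;
    return sign | ((mant >>> shift) + ((mant >>> (shift - 1)) & 1)); // subnormal, rounded
  }
  // Normal number: keep the top 10 mantissa bits and round half up.
  // A rounding carry correctly bumps the exponent field.
  return (sign | (e << 10) | (mant >>> 13)) + ((mant >>> 12) & 1);
}

function toHalf(latent) {
  const out = new Uint16Array(latent.length);
  for (let i = 0; i < latent.length; i++) out[i] = floatToHalfBits(latent[i]);
  return out;
}

// 1 -> 0x3c00, -2 -> 0xc000, 0.5 -> 0x3800, 65504 (largest finite half) -> 0x7bff
const packed = toHalf(new Float32Array([1, -2, 0.5, 65504]));
```

With the hidden size of 896 listed in Appendix C, a packed carry occupies 896 × 2 = 1,792 bytes.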

Pooling Strategy: Multi-token hidden states [seq_len, hidden_dim] are mean-pooled to a single vector [hidden_dim] for carry-over to the next agent.
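
The pooling step can be sketched as follows, assuming the hidden states arrive as one flattened, row-major Float32Array of shape [seq_len, hidden_dim]. The function name meanPool is hypothetical.

```javascript
// Sketch: mean-pool a flattened [seqLen, hiddenDim] hidden state into one [hiddenDim] vector.
// Assumption: the input is a row-major Float32Array; the function name is illustrative.
function meanPool(hidden, seqLen, hiddenDim) {
  if (seqLen <= 0 || hidden.length !== seqLen * hiddenDim) {
    throw new RangeError(`expected ${seqLen} x ${hiddenDim} values, got ${hidden.length}`);
  }
  const pooled = new Float32Array(hiddenDim);
  // Sum every position's vector, then divide by the number of positions.
  for (let t = 0; t < seqLen; t++) {
    for (let d = 0; d < hiddenDim; d++) pooled[d] += hidden[t * hiddenDim + d];
  }
  for (let d = 0; d < hiddenDim; d++) pooled[d] /= seqLen;
  return pooled;
}

// Two positions with hiddenDim = 2: rows [1, 2] and [3, 4] average to [2, 3].
const pooledExample = meanPool(new Float32Array([1, 2, 3, 4]), 2, 2);
```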

3.3 RecursiveLink Training (Offline, PyTorch)

The train_recursivelink.py script executes a two-stage training loop:

Stage 1: Inner Loop (Warm-up)

  • Objective: Align $\mathcal{R}_{\text{in}}(h)$ with the input-embedding distribution of the base model
  • Loss: Cosine distance (1 − cosine similarity) between the projected hidden states and the original embeddings
  • Steps: ~200 iterations on small example texts
  • Effect: Initialize the inner link to near-identity behavior
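
The Stage 1 objective can be pictured with a small cosine-distance helper. This is a sketch in JavaScript for consistency with the browser code, not an excerpt from train_recursivelink.py; the function name and the eps guard are assumptions.

```javascript
// Sketch of a cosine-distance loss: near 0 when vectors point the same way, 1 when orthogonal.
// Assumption: this mirrors the Stage 1 objective only in spirit; the PyTorch script may differ.
function cosineLoss(projected, target, eps = 1e-8) {
  let dot = 0, normP = 0, normT = 0;
  for (let i = 0; i < projected.length; i++) {
    dot += projected[i] * target[i];
    normP += projected[i] ** 2;
    normT += target[i] ** 2;
  }
  // eps avoids division by zero for all-zero vectors.
  return 1 - dot / (Math.sqrt(normP) * Math.sqrt(normT) + eps);
}

const aligned = cosineLoss([1, 2, 3], [2, 4, 6]); // same direction, loss close to 0
const orthogonal = cosineLoss([1, 0], [0, 1]);    // no overlap, loss of 1
```

Because the loss depends only on direction, a link that rescales the hidden state but keeps its orientation is not penalized.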

Stage 2: Outer Loop (Full System)

  • Unroll the multi-agent loop over $T$ recursion rounds
  • Forward pass: Sample text from dataset → tokenize → run agents via latent loops → final agent decodes logits
  • Loss: Standard cross-entropy on final output
  • Backprop: Gradient flows through all RecursiveLink parameters; base model frozen
  • Epochs: Multiple passes to converge
  • Output: Trained weights exported as recursivelink.json

Frozen Base Models: To reduce training cost, the base LLMs themselves are not fine-tuned. Only the $W_1, W_2, W_3$ matrices of each RecursiveLink are learned. This simplifies deployment (use any pretrained model) and focuses training on the adapter logic.


4. Implementation Details

4.1 recursiveMASWebLLM: Build Steps

  1. Install MLC Nightly
   pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

  2. Patch Model Definition
   python expose_hidden.py --arch qwen2

This modifies the installed MLC-LLM's model file to register get_last_hidden and decode_last_hidden.

  3. Build Artifacts
   ./build.sh
   # Runs: convert_weight → gen_config → mlc_llm compile --device webgpu

Outputs: .wasm file (WebGPU graph) + weight shards

  4. Optional: Train RecursiveLink
   python train_recursivelink.py --model Qwen/Qwen2.5-0.5B-Instruct --rounds 2

Outputs: recursivelink.json (W₁, W₂, W₃ matrices)

  5. Publish
    • .wasm → GitHub Release artifact
    • Weights → Hugging Face Model Hub
    • recursivelink.json → GitHub Release artifact

4.2 recursiveMASDemo: Runtime Architecture

4.2.1 Main Entry Point (main.js)

// Load backbone model
const model = await engine.initModel(modelId);
const finalAgent = agents[agents.length - 1];

// For each recursion round
for (let round = 0; round < recursionDepth; round++) {
  for (let idx = 0; idx < agents.length; idx++) {
    const agent = agents[idx];
    // Receiver of this agent's output; wraps to the first agent so the
    // last agent seeds the next round (the wrap-around is an assumption)
    const nextIdx = (idx + 1) % agents.length;

    // Latent path (if exposesLatent)
    if (latentMode) {
      const hidden = await latentForward(engine, agent.prompt);
      const projected = recursiveLink.apply(hidden);
      // Inject into next agent
      agents[nextIdx].latentCarry = projected;
    }
    // Text path (baseline)
    else {
      const text = await textForward(engine, agent.prompt);
      agents[nextIdx].textCarry = text;
    }
  }
}
// Final agent: full decode
const final = await chainDecode(engine, finalAgent.prompt, finalAgent.latentCarry);

4.2.2 Latent Forward (latent-chain.js)

export async function chainForward(rt, prompt, latentCarry) {
  // 1. Unpack the runtime built by getLatentRuntime(engine, modelId).
  //    The dtype and kvCache fields are assumed to live on rt.
  const { vm, pipeline, dtype, kvCache } = rt;

  // 2. Build combined input: [latentCarry embedding] ⊕ [prompt embeddings]
  const carriedEmbedding = latentToken(rt, latentCarry, dtype);
  const promptEmbedding = await pipeline.tokenizer.embed(prompt);
  // concatEmbeddings is a placeholder for a sequence-axis concat of the two
  // embedding tensors; it is not a WebLLM or tvmjs API
  const combined = concatEmbeddings([carriedEmbedding, promptEmbedding]);

  // 3. Forward WITHOUT LM head (using get_last_hidden)
  const [hidden, nextKvCache] = await vm.getFunction('get_last_hidden')(
    combined, kvCache
  );
  rt.kvCache = nextKvCache; // thread the updated cache into the next call

  // 4. Pool and extract
  const nextCarry = poolHidden(hidden);
  return { nextCarry, hidden };
}

4.2.3 Collaboration Patterns

Each pattern defines:

  • Agent roles with heterogeneous model assignments (from paper Table 1)
  • Agent prompts (e.g., Planner, Critic, Solver)
  • Agent flow (sequential chain, parallel branches, etc.)

Sequential (🔗): Planner → Critic → Solver

Planner decomposes; Critic judges; Solver refines. Each round refines the solution.

Mixture (🧩): Math, Code, Science agents run in parallel; Summarizer aggregates.

Agents reason independently; final round's Summarizer sees all latent outputs.

Distillation (🎓): Expert → Learner

Expert reasons fully; Learner (smaller model) takes expert's latent as seed.

Deliberation (🛠️): Reflector ↔ Tool-Caller

Reflector emits high-level strategy; Tool-Caller invokes live actions (e.g., Wikipedia search).
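
One way to picture the four patterns is as plain data plus a small scheduler. The sketch below takes the role names from the descriptions above, but the field name stages, the schedule function, and the rule that every stage runs in every round are simplifying assumptions rather than the demo's actual configuration.

```javascript
// Sketch: collaboration patterns as data. Role names come from the text above;
// the "stages" layout is an illustrative assumption.
// Each stage is a list of agents that are independent of one another.
const PATTERNS = {
  sequential:   { stages: [["Planner"], ["Critic"], ["Solver"]] },
  mixture:      { stages: [["Math", "Code", "Science"], ["Summarizer"]] },
  distillation: { stages: [["Expert"], ["Learner"]] },
  deliberation: { stages: [["Reflector"], ["Tool-Caller"]] },
};

// Expand a pattern into an ordered list of steps for the given number of rounds.
// Simplification: every stage runs in every round.
function schedule(patternName, rounds) {
  const pattern = PATTERNS[patternName];
  if (!pattern) throw new Error(`unknown pattern: ${patternName}`);
  const steps = [];
  for (let round = 0; round < rounds; round++) {
    pattern.stages.forEach((agents, stage) => steps.push({ round, stage, agents }));
  }
  return steps;
}

// Two rounds of Mixture: [Math, Code, Science] then [Summarizer], twice.
const plan = schedule("mixture", 2);
```

Keeping the flow as data means a new pattern only needs a new entry in the table, while the round loop stays unchanged.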

4.3 Bridging WebLLM and TVM Runtime

WebLLM's high-level API (chat.completions()) abstracts away the underlying TVM computation. To access get_last_hidden, the code must:

  1. Reach the pipeline object: engine.loadedModelIdToPipeline.get(modelId)
  2. Access the TVM VM: pipeline.vm
  3. Dispatch the function:
   const tvm = pipeline.tvm;
   const vm = pipeline.vm;
   tvm.beginScope();
   // Detach so the function handle outlives the scope that created it
   const fGetLastHidden = tvm.detachFromCurrentScope(
     vm.getFunction('get_last_hidden')
   );
   tvm.endScope();
  4. Manage KV cache: Create and thread the KV cache object through successive calls.

This is intentionally not part of WebLLM's public API — we're using internal APIs to unlock the custom function. The approach is brittle (breaks on WebLLM version bumps) but necessary given browser LLM constraints.
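
Because these internals can change between WebLLM releases, it may help to probe for the custom function before entering the latent path and fall back to the text path otherwise. The sketch below is a hypothetical helper, not the demo's getLatentRuntime. It relies only on the internal fields named above, assumes a missing function shows up either as a thrown error or as an empty result, and leaves out the TVM scope handling for brevity.

```javascript
// Sketch: probe for the custom function before entering the latent path.
// Assumptions: loadedModelIdToPipeline and pipeline.vm are the internal fields named above;
// a missing function may surface as a thrown error or as a null/undefined result.
function probeLatentSupport(engine, modelId, fnName = "get_last_hidden") {
  const pipeline = engine?.loadedModelIdToPipeline?.get(modelId);
  if (!pipeline) return { ok: false, reason: `model ${modelId} is not loaded` };

  const vm = pipeline.vm;
  if (!vm || typeof vm.getFunction !== "function") {
    return { ok: false, reason: "pipeline.vm is unavailable in this WebLLM build" };
  }
  try {
    const fn = vm.getFunction(fnName);
    if (!fn) return { ok: false, reason: `${fnName} is not compiled into the model` };
    return { ok: true, vm, pipeline, fn };
  } catch (err) {
    return { ok: false, reason: `${fnName} lookup failed: ${err.message}` };
  }
}

// Usage with a stub engine standing in for a real WebLLM engine.
const stubEngine = {
  loadedModelIdToPipeline: new Map([
    ["demo-model", {
      vm: {
        getFunction: (name) => {
          if (name !== "get_last_hidden") throw new Error(`Cannot find function ${name}`);
          return () => {};
        },
      },
    }],
  ]),
};
const probe = probeLatentSupport(stubEngine, "demo-model");
```

Returning a reason string instead of throwing lets the UI explain why it switched to the text baseline.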


5. Behavioral Fidelity vs. True Latent Transfer

5.1 Honest Limitation

The playground does not perform true vector-to-vector latent transfer inside the model. Here's why:

  1. Stock WebLLM doesn't expose hidden states → Can't read what the model actually computed.
  2. Injecting arbitrary vectors into a model's hidden layer would require either:
    • Custom compiled models (we have this) + low-level TVM dispatch (we have this too)
    • OR using inputs_embeds parameter (but standard token models expect token IDs)

The browser build exposes get_last_hidden, but calling it from JavaScript and looping the output back in requires non-public TVM API manipulation and careful KV cache bookkeeping—this is the "remaining research piece" noted in the code comments.

5.2 What the Demo Actually Shows

Instead, the demonstration reproduces the system behavior of the paper:

| Aspect | Paper (Server) | This Implementation |
|---|---|---|
| Intermediate agent output | Latent vector (no decode) | Compressed text (simulated latent) |
| Final agent | Full decode | Full decode |
| Token efficiency | 75% reduction vs. baseline | Achievable via text compression |
| Accuracy scaling | +8.3% over recursion rounds | Simulated via prompt structure |
| End-to-end training | Gradient flow through all links | Not applicable (frozen models) |

The efficiency gain (reduced token cost) is demonstrated by comparing the compressed carry-over text length against full reasoning text. The accuracy scaling is shown via recursive refinement on hardcoded benchmarks.
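
The token comparison reduces to a ratio of two sums. The sketch below assumes per-step token counts are already available (for example from the tokenizer); the function name and the sample numbers are hypothetical and are not measurements from the demo.

```javascript
// Sketch: fraction of intermediate tokens saved by the compressed carry.
// Assumption: the caller supplies per-step token counts for both paths.
function summarizeTokens(baselineSteps, carrySteps) {
  const sum = (xs) => xs.reduce((a, b) => a + b, 0);
  const baseline = sum(baselineSteps);
  const carry = sum(carrySteps);
  if (baseline <= 0) throw new RangeError("baseline token count must be positive");
  return { baseline, carry, savings: 1 - carry / baseline };
}

// Hypothetical run: two intermediate agents emit 120 and 80 tokens of full
// reasoning text, versus 30 and 20 tokens of compressed carry.
const report = summarizeTokens([120, 80], [30, 20]);
// report.savings is 0.75, i.e. a 75% reduction in intermediate tokens
```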


6. Evaluation & Results

6.1 Demo Metrics

The playground displays real metrics for both paths:

  • RecursiveMAS (Latent Path)

    • Tokens generated (each intermediate agent emits a compressed carry standing in for a single latent token)
    • Wall-clock time per round
    • Total rounds and carried-over latent size
  • Text-MAS (Baseline)

    • Tokens generated (each agent produces full reasoning text)
    • Wall-clock time per round
    • Total rounds

6.2 Observed Behavior

On consumer hardware (WebGPU, Qwen 0.5B):

  1. Token Savings: ~40–70% reduction in intermediate tokens (compressed latent carry vs. full text)
  2. Speed: Latent path typically 1.2–1.8× faster (fewer tokens to process)
  3. Reasoning Quality: Multi-round recursion produces more refined final answers
  4. Pattern Differences:
    • Sequential: steady refinement
    • Mixture: parallel strengths pooled
    • Distillation: larger expert → smaller learner knowledge transfer
    • Deliberation: real tool invocation + reflection loop

6.3 Limitations of This Evaluation

  • No ground truth accuracy comparison (would require a benchmark dataset + oracle labels)
  • Single backbone model (paper uses heterogeneous agent assignments)
  • Offline link training (can't tune RecursiveLink in real time in browser)
  • Compressed-text proxy (not true latent vectors)

7. Design Decisions & Constraints

7.1 Why Two Repositories?

  1. Separation of Concerns:

    • recursiveMASWebLLM: Solves the hard infrastructure problem (exposing hidden states in a browser-compilable graph).
    • recursiveMASDemo: Assumes a latent-exposing model exists; focuses on orchestration and UX.
  2. Reusability:

    • The model pipeline can support other browser-based latent-space projects.
    • The demo's orchestration layer could be adapted for server-side RecursiveMAS (just swap the TVM runtime).
  3. Publishing:

    • The built .wasm + weights can be shared as a public artifact (no code, just data).
    • The demo code is lightweight and runs anywhere WebLLM is supported.

7.2 Why MLC-LLM?

  • Editability: MLC models are compiled from editable TVM code, unlike sealed ONNX exports.
  • WebGPU codegen: Can emit efficient WebGPU shaders on CPU (no GPU required for build).
  • Integration with WebLLM: WebLLM's entire infrastructure (caching, device selection, KV cache) is built around MLC.
  • Open ecosystem: Large model zoo (Qwen, Llama, Phi, Gemma, Mistral, etc.)

7.3 Why Float16 for Latent Tokens?

  • Halves bandwidth: at hidden size 896, one latent token is 3,584 bytes (~3.5 KB) in Float32 and 1,792 bytes (~1.75 KB) in Float16
  • Still preserves reasonable precision for recursive communication
  • Falls back to Float32 if model doesn't support f16

7.4 Why Freeze the Base Models?

  • Rationale: RecursiveLink is the only trainable component; base LLMs are frozen.
  • Benefits:
    • Dramatically reduces training compute (only $W_1, W_2, W_3$ matrices)
    • Generalizes across any pretrained model
    • Simplifies deployment (use any LLM without retraining)
  • Trade-off: Link performance depends heavily on the fixed base model's quality
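
To put the compute saving in perspective, the trainable parameter count of one link can be worked out directly from its dimensions. The sketch below assumes the bias vectors b1 and b2 from the JSON format in Appendix C are trained alongside the matrices, and that an identity residual contributes no parameters; the function name is hypothetical.

```javascript
// Sketch: count trainable parameters in one RecursiveLink.
// Assumptions: biases b1 and b2 exist (as in the JSON format in Appendix C);
// identityResidual = true means W3 is fixed to the identity and has no parameters.
function linkParamCount(source, target, bottleneck, identityResidual = false) {
  const first = source * bottleneck + bottleneck;  // W1 plus b1
  const second = bottleneck * target + target;     // W2 plus b2
  const residual = identityResidual ? 0 : source * target; // W3
  return first + second + residual;
}

// Hidden size 896 with a bottleneck of 256:
const outerParams = linkParamCount(896, 896, 256);        // 1,262,720
const innerParams = linkParamCount(896, 896, 256, true);  // 459,904
```

Under these assumptions an outer link has about 1.26 million trainable parameters, a small fraction of the 0.5B-parameter backbone that stays frozen.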

8. Limitations & Future Work

8.1 Current Limitations

  1. Small models only (≤1.5B due to disk/time constraints in GitHub Actions)
  2. Single backbone in demo (paper shows heterogeneous agents; browser demo uses one model)
  3. Simulated latent transfer (true vector injection not implemented)
  4. Offline training (RecursiveLink trained separately, not interactively)
  5. Version pinning (MLC nightly API is unstable; patches need re-validation)
  6. No fine-tuning UI (can't adjust weights in-browser)

8.2 Future Enhancements

  1. True Latent Transfer

    • Expose inputs_embeds acceptance in compiled models
    • Implement full low-level TVM dispatch from JS
    • Support genuine vector-to-vector routing between heterogeneous models
  2. On-Device Link Training

    • Port PyTorch training to ONNX.js or WebGPU compute
    • Allow users to train RecursiveLinks from the UI on their own data
  3. Larger Models

    • Move compilation to dedicated build servers (not GitHub Actions)
    • Support 7B–13B models on higher-resource infrastructure
  4. Heterogeneous Agents

    • Load multiple different model families simultaneously
    • Demonstrate true cross-model latent routing
  5. Benchmark Integration

    • Add standardized test suites (MATH500, IFEval, etc.)
    • Compute formal accuracy deltas vs. baselines
    • Log results for reproducibility
  6. P2P Federation

    • Distribute agent load across multiple browsers via WebRTC
    • Collective RecursiveMAS loops across user devices

9. Technical Specifications

9.1 System Requirements

Minimum:

  • Browser with WebGPU support (Chrome 113+, Edge 113+)
  • 2 GB VRAM (for 0.5B model)
  • 1 GB disk cache (for model weights + .wasm)

Recommended:

  • 4+ GB VRAM
  • Desktop/laptop (mobile WebGPU support is nascent)
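
A page can check the WebGPU requirement before downloading any weights. The sketch below uses the standard navigator.gpu.requestAdapter() call; the function name detectWebGPU and the injectable nav parameter (which allows testing outside a browser) are assumptions for this example. It does not check available VRAM, which WebGPU does not report directly.

```javascript
// Sketch: check WebGPU availability before loading a model.
// navigator.gpu and requestAdapter() are standard WebGPU APIs; the function name
// and the injectable nav parameter are illustrative.
async function detectWebGPU(nav = globalThis.navigator) {
  if (!nav?.gpu) {
    return { ok: false, reason: "WebGPU is not exposed by this browser" };
  }
  // requestAdapter() resolves to null when no suitable adapter exists.
  const adapter = await nav.gpu.requestAdapter();
  if (!adapter) {
    return { ok: false, reason: "no suitable GPU adapter found" };
  }
  return { ok: true, adapter };
}

// Usage in a page:
// const gpu = await detectWebGPU();
// if (!gpu.ok) showMessage(gpu.reason); // showMessage is a hypothetical UI helper
```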

9.2 Software Dependencies

recursiveMASWebLLM:

  • MLC-LLM nightly (CPU, with emscripten for WebGPU target)
  • Python 3.9+
  • PyTorch 2.0+ (for train_recursivelink.py)
  • Transformers library

recursiveMASDemo:

  • Node.js 16+ (development/build only)
  • WebLLM 0.2.78
  • Vite (build tool)
  • No runtime dependencies beyond WebLLM

9.3 API Reference

RecursiveLink (Browser)

class RecursiveLink {
  constructor(weights) { /* ... */ }

  /** Apply link to single latent vector */
  apply(h: Float32Array): Float32Array

  /** Apply link to sequence of vectors */
  applySeq(hs: Float32Array[]): Float32Array[]
}

export async function loadRecursiveLinks(url: string): Promise<{
  hidden: number,
  links: RecursiveLink[]
}>

Latent Forward (Browser)

export function getLatentRuntime(engine, modelId) {
  return { ok: true | false, reason?, vm, pipeline, ... }
}

export async function latentForward(rt, text) {
  return { ok: true | false, error?, latentVector: Float32Array }
}

Training (Python)

class RecursiveLink(nn.Module):
    def __init__(self, source_dim, target_dim, bottleneck=256): ...
    def forward(self, h): ...  # h: [..., source_dim] -> [..., target_dim]

def inner_loop(model, tok, link, texts, device, steps=200, lr=1e-3): ...
def outer_loop(model, tok, links, data, device, rounds=2, steps=200, lr=5e-4): ...

10. Conclusion

This implementation demonstrates that the RecursiveMAS framework, a research contribution addressing efficiency bottlenecks in multi-agent LLM systems, can be adapted for browser deployment with practical fidelity. By patching the MLC-LLM model definition to expose internal model states and implementing a lightweight JavaScript orchestration layer, we bring latent-space agent collaboration closer to consumer devices and lower the infrastructure barrier to adoption and experimentation.

The key innovation is recognizing that MLC-LLM models are editable, not sealed. This enables us to expose get_last_hidden without sacrificing the mature WebGPU compilation infrastructure or breaking WebLLM's ecosystem.

While the current browser implementation uses compressed-text proxies rather than true latent vectors, it faithfully reproduces the paper's system behavior: token efficiency, recursion-round scaling, and multi-agent pattern flexibility. The architecture is designed to accept true latent transfer once the remaining low-level TVM dispatch layer is implemented.

Next Steps

  1. Implement on-device low-level latent injection (complete the TVM dispatch in latent-chain.js)
  2. Build browser-based link training (port train_recursivelink.py to WebGPU compute)
  3. Scale to 7B+ models on dedicated build infrastructure
  4. Integrate standard benchmarks (MATH500, HumanEval, IFEval)
  5. Enable heterogeneous multi-agent loops with different model families

References

Code

https://github.com/vishalmysore/recursiveMASDemo
https://github.com/vishalmysore/recursiveMASWebLLM/

Demo

https://github.com/vishalmysore/recursiveMASDemo

Model

https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/


Appendices

A. Building Locally (Linux / WSL2)

# Install MLC nightly
pip install --pre -U -f https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

# Set up emscripten (WebGPU target)
source /path/to/emsdk/emsdk_env.sh

# Patch model def
python expose_hidden.py --arch qwen2

# Build
./build.sh

# Train link (optional, needs GPU for speed)
python train_recursivelink.py --model Qwen/Qwen2.5-0.5B-Instruct --rounds 2

B. File Structure

recursiveMASWebLLM/
  build.sh                    # Compile pipeline
  expose_hidden.py            # Automated patcher
  expose_hidden.md            # Human diff reference
  train_recursivelink.py      # Link training
  .github/workflows/
    build-model.yml           # CI/CD

recursiveMASDemo/
  main.js                     # Entry, config
  latent-chain.js             # Latent forward
  latent-core.js              # TVM runtime bindings
  recursive-link.js           # RecursiveLink in JS
  index.html                  # UI
  style.css                   # Styles
  package.json                # Dependencies
  vite.config.js              # Build config

C. RecursiveLink JSON Format

{
  "hidden": 896,
  "links": [
    {
      "w1": [[...], [...], ...],
      "b1": [...],
      "w2": [[...], [...], ...],
      "b2": [...],
      "w3": [[...], [...], ...]
    }
  ]
}

Each link entry corresponds to one ordered pair of agents. Weights are stored as nested JSON arrays (row-major).
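
A loader for this format might validate shapes and flatten each matrix into a typed array before use. The sketch below is not the demo's loadRecursiveLinks. It assumes w1 is [bottleneck × hidden], w2 is [hidden × bottleneck], and w3 is [hidden × hidden], with the bottleneck size inferred from the length of b1; the text does not state the orientation, so treat these shapes as hypothetical.

```javascript
// Sketch: validate parsed link entries and flatten their matrices to Float32Array.
// Assumptions: w1 is [bottleneck x hidden], w2 is [hidden x bottleneck], w3 is [hidden x hidden];
// the orientation actually used by recursive-link.js is not stated in the text.
function flattenMatrix(rows, expectedRows, expectedCols, name) {
  if (rows.length !== expectedRows || rows.some((r) => r.length !== expectedCols)) {
    throw new Error(`${name}: expected ${expectedRows}x${expectedCols}`);
  }
  return Float32Array.from(rows.flat()); // row-major, matching the nested layout
}

function parseLinks(json) {
  const { hidden, links } = JSON.parse(json);
  return links.map((l, i) => {
    const bottleneck = l.b1.length; // inferred from the first bias vector
    return {
      w1: flattenMatrix(l.w1, bottleneck, hidden, `links[${i}].w1`),
      b1: Float32Array.from(l.b1),
      w2: flattenMatrix(l.w2, hidden, bottleneck, `links[${i}].w2`),
      b2: Float32Array.from(l.b2),
      w3: flattenMatrix(l.w3, hidden, hidden, `links[${i}].w3`),
    };
  });
}

// Tiny example with hidden = 2 and bottleneck = 1.
const tiny = JSON.stringify({
  hidden: 2,
  links: [{ w1: [[1, 2]], b1: [0], w2: [[3], [4]], b2: [0, 0], w3: [[1, 0], [0, 1]] }],
});
const parsed = parseLinks(tiny);
```

Failing fast on a shape mismatch is useful here because a transposed matrix would otherwise produce plausible-looking but meaningless vectors.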
