<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vishalmysore</title>
    <description>The latest articles on DEV Community by vishalmysore (@vishalmysore).</description>
    <link>https://dev.to/vishalmysore</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1386010%2F83aba423-ebfc-46df-8819-a0de1d1e8075.jpeg</url>
      <title>DEV Community: vishalmysore</title>
      <link>https://dev.to/vishalmysore</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vishalmysore"/>
    <language>en</language>
    <item>
      <title>webSLM: Fine-tuning, Compiling, and Running Domain-Specific Small Language Models Entirely in the Browser</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Mon, 29 Jun 2026 13:42:20 +0000</pubDate>
      <link>https://dev.to/vishalmysore/webslm-fine-tuning-compiling-and-running-domain-specific-small-language-models-entirely-in-the-1i5i</link>
      <guid>https://dev.to/vishalmysore/webslm-fine-tuning-compiling-and-running-domain-specific-small-language-models-entirely-in-the-1i5i</guid>
      <description>&lt;p&gt;webSLM is an end-to-end pipeline for turning a general-purpose Small Language Model (SLM) into a &lt;strong&gt;domain-specialized assistant that runs 100% in the browser&lt;/strong&gt; — no server, no API key, no inference cost, full offline capability after first load. This paper documents the complete lifecycle of a worked example, &lt;code&gt;WebSLM-Medical-0.5B&lt;/code&gt;: (1) &lt;strong&gt;LoRA fine-tuning&lt;/strong&gt; of &lt;code&gt;Qwen2.5-0.5B-Instruct&lt;/code&gt; on a small domain dataset using a free Colab T4; (2) &lt;strong&gt;compilation and 4-bit quantization&lt;/strong&gt; to a WebGPU model library via a reproducible, GPU-free GitHub Actions workflow built on MLC-LLM v0.19.0; and (3) &lt;strong&gt;in-browser execution&lt;/strong&gt; through WebLLM, including the non-obvious runtime-integration details that determine whether the model loads at all.&lt;/p&gt;

&lt;p&gt;We also report a controlled A/B validation showing what fine-tuning on a small dataset actually changes — and what it does not — and document two decoding pitfalls (greedy repetition loops and penalty-induced language drift on bilingual base models) that materially affect output quality on sub-1B models.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;The dominant narrative in language modeling has been scale. But a parallel track — Small Language Models in the 0.5B–7B range, &lt;em&gt;designed&lt;/em&gt; to be efficient rather than pruned down — has matured to the point where a properly fine-tuned 0.5B–1.5B model delivers genuinely useful behavior in a focused domain. At the same time, &lt;strong&gt;WebGPU&lt;/strong&gt; has made it possible to run these models directly inside a browser tab, on the user's own hardware.&lt;/p&gt;

&lt;p&gt;The gap webSLM closes is not a research problem; it is an &lt;strong&gt;engineering and tooling problem&lt;/strong&gt;. Getting from a Hugging Face checkpoint to a working in-browser, domain-specialized chatbot requires stitching together three independently fiddly stages — fine-tuning, MLC compilation/quantization, and WebLLM runtime wiring — each with version-sensitive environments and undocumented failure modes. This paper is the field manual for that path.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Design principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model-agnostic.&lt;/strong&gt; Any MLC-LLM-supported base works (Qwen2/2.5, Llama-3.x, Gemma-2, Phi-3.5, Mistral). Nothing in the pipeline is specific to one architecture.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training and compilation are separate steps, on purpose.&lt;/strong&gt; Fine-tuning needs a GPU (Colab); compilation is CPU-only codegen (CI). The two never run in the same environment, so neither inherits the other's dependency constraints.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducible, GPU-free builds.&lt;/strong&gt; The expensive, error-prone compilation runs in GitHub Actions with a pinned, from-source toolchain — no local Linux GPU box required.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Browser-first deployment.&lt;/strong&gt; The artifact is static files (weight shards + a &lt;code&gt;.wasm&lt;/code&gt;) on a CDN; the client is one HTML page.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  2. Background
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 SLM vs. quantized LLM
&lt;/h3&gt;

&lt;p&gt;These are frequently conflated but are different answers to the same "large models are expensive" problem. An &lt;strong&gt;SLM&lt;/strong&gt; is small &lt;em&gt;by design&lt;/em&gt; — its efficiency comes from architecture and curated training data. A &lt;strong&gt;quantized LLM&lt;/strong&gt; is a large model compressed &lt;em&gt;after&lt;/em&gt; training; it keeps the full parameter structure and capability profile of the original, just at lower numeric precision. For the browser, only the SLM path is viable: a 1.5B model uses ~1–2 GB of GPU memory, while an INT4-quantized 70B model still needs 30–40 GB. They are not in the same deployment category.&lt;/p&gt;

&lt;p&gt;Note that webSLM uses quantization &lt;strong&gt;on top of&lt;/strong&gt; an SLM: the 0.5B model is itself quantized to 4-bit (q4f16_1) for the browser. Quantization here is a deployment compression, not the source of "smallness."&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Fine-tuned SLM vs. RAG
&lt;/h3&gt;

&lt;p&gt;RAG (Retrieval-Augmented Generation) is a &lt;em&gt;system architecture&lt;/em&gt;, not a model type: it injects retrieved documents into the prompt at query time. It excels at large, dynamic knowledge bases but requires a retrieval layer (embeddings, vector store, ingestion) — infrastructure that does not exist in a pure browser deployment. Fine-tuning encodes &lt;strong&gt;behavior, style, and domain patterns into the weights&lt;/strong&gt; instead. The two are complementary; for the serverless browser case, fine-tuning is the only specialization mechanism available. (The companion demo includes an optional in-browser TF-IDF RAG path for comparison, but the model itself carries no retrieval.)&lt;/p&gt;

&lt;h3&gt;
  
  
  2.3 The runtime stack: WebLLM + MLC-LLM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebLLM&lt;/strong&gt; (&lt;a href="https://github.com/mlc-ai/web-llm" rel="noopener noreferrer"&gt;github.com/mlc-ai/web-llm&lt;/a&gt;) is the browser runtime. It executes models on &lt;strong&gt;WebGPU&lt;/strong&gt; and exposes an OpenAI-compatible JS API (&lt;code&gt;engine.chat.completions.create()&lt;/code&gt;), with streaming, fully local. It &lt;strong&gt;cannot&lt;/strong&gt; load a raw Hugging Face checkpoint.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MLC-LLM&lt;/strong&gt; (part of the Apache TVM ecosystem) is the compiler. For a browser target it produces two things WebLLM consumes:

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Quantized weight shards&lt;/strong&gt; (&lt;code&gt;params_shard_*.bin&lt;/code&gt;) plus a manifest (&lt;code&gt;ndarray-cache.json&lt;/code&gt;) and chat/sampling config (&lt;code&gt;mlc-chat-config.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;&lt;code&gt;.wasm&lt;/code&gt; model library&lt;/strong&gt; containing the compiled compute kernels for that &lt;em&gt;specific&lt;/em&gt; architecture.&lt;/li&gt;
&lt;/ol&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The critical, easily-missed consequence: &lt;strong&gt;the &lt;code&gt;.wasm&lt;/code&gt; is tied to the runtime ABI.&lt;/strong&gt; A model library compiled with MLC-LLM v0.19.0 must be loaded by the matching &lt;code&gt;@mlc-ai/web-llm&lt;/code&gt; build (0.2.79). Mismatched versions fail to instantiate the wasm.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. System architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  domain data (JSONL chat)
        │
        │   STAGE 1 — Colab T4 (GPU)
        │   finetune/train_lora.py:  LoRA SFT → merge_and_unload → push
        ▼
  merged HF checkpoint                       e.g. VishalMysore/WebSLM-Medical-0.5B
  (standard fp16 safetensors)                (qwen2, 0.5B, full weights)
        │
        │   STAGE 2 — GitHub Actions (CPU only)
        │   normalize_config → convert_weight (q4f16_1) → gen_config → compile --device webgpu
        ▼
  MLC artifacts                              VishalMysore/WebSLM-Custom-MLC
  (params_shard_*.bin + .wasm + configs)     (8 shards ~278 MB + 3.7 MB wasm)
        │
        │   STAGE 3 — Any WebGPU browser (client GPU)
        │   @mlc-ai/web-llm@0.2.79: MLCEngine(appConfig) → reload → chat.completions
        ▼
  in-browser domain assistant               webSLMDemo (GitHub Pages)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three environments, three hardware profiles, zero shared dependencies. The handoff between stages is always a plain artifact (an HF repo), never a live process.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Stage 1 — Fine-tuning on Colab (LoRA SFT)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 Data format
&lt;/h3&gt;

&lt;p&gt;Training data is &lt;strong&gt;chat/conversational JSONL&lt;/strong&gt; — one JSON object per line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"messages"&lt;/span&gt;&lt;span class="p"&gt;:[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"system"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"You are a careful medical information assistant. Provide general, educational health information in plain language, and always recommend consulting a licensed healthcare professional for diagnosis or treatment."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"What are common signs of dehydration?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"assistant"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"Common signs include increased thirst, dry mouth, dark yellow urine, reduced urination, fatigue, dizziness, and headache. Severe dehydration can cause confusion or a rapid heartbeat and needs urgent care. Drink fluids and seek medical help if symptoms are severe or persist."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two design rules carry most of the signal:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Keep the &lt;code&gt;system&lt;/code&gt; message identical across the dataset.&lt;/strong&gt; The model learns a stable persona from it. (This system prompt becomes the one you should use at inference — see §6.4 and §7.)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The &lt;code&gt;assistant&lt;/code&gt; turns are exactly what the model imitates.&lt;/strong&gt; Their length, register, and structure &lt;em&gt;are&lt;/em&gt; the behavior you are training. The worked example's answers are short, plain-language, and always close with a referral to a professional — and that is precisely the signature the fine-tune learns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The repository ships three &lt;strong&gt;illustrative starter sets&lt;/strong&gt; in &lt;code&gt;finetune/data/&lt;/code&gt; (&lt;code&gt;medical.jsonl&lt;/code&gt; ≈ 34 examples, plus &lt;code&gt;legal&lt;/code&gt; and &lt;code&gt;insurance&lt;/code&gt;). These are deliberately tiny: they prove the pipeline, they do &lt;strong&gt;not&lt;/strong&gt; fully specialize a model. Real domain quality needs hundreds to thousands of examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  4.2 The training recipe
&lt;/h3&gt;

&lt;p&gt;Fine-tuning uses &lt;strong&gt;LoRA&lt;/strong&gt; (Low-Rank Adaptation) via TRL's &lt;code&gt;SFTTrainer&lt;/code&gt;, rendering each example through the base model's chat template before training. The merged result is a standard HF checkpoint the build stage can consume directly.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hyperparameter&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen/Qwen2.5-0.5B-Instruct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;any Instruct SLM; &lt;code&gt;arch=qwen2&lt;/code&gt;, &lt;code&gt;conv=qwen2&lt;/code&gt; flow straight to build&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA rank &lt;code&gt;r&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;16&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA &lt;code&gt;alpha&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LoRA dropout&lt;/td&gt;
&lt;td&gt;0.05&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Target modules&lt;/td&gt;
&lt;td&gt;&lt;code&gt;q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;all attention + MLP projections&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Epochs&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Learning rate&lt;/td&gt;
&lt;td&gt;2e-4&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Per-device batch&lt;/td&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grad accumulation&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;effective batch ≈ 16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max sequence length&lt;/td&gt;
&lt;td&gt;1024&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;td&gt;fp16&lt;/td&gt;
&lt;td&gt;QLoRA (&lt;code&gt;--bits 4&lt;/code&gt;, bitsandbytes) available for larger bases on 16 GB GPUs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Packing&lt;/td&gt;
&lt;td&gt;off&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The core of &lt;code&gt;train_lora.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tok&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;trust_remote_code&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data_files&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;train&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;ds&lt;/span&gt;  &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ex&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)},&lt;/span&gt;
             &lt;span class="n"&gt;remove_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;column_names&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CAUSAL_LM&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="n"&gt;trainer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SFTTrainer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_dataset&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sft_cfg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="n"&gt;peft_config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;lora&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;processing_class&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# falls back to tokenizer= on older TRL
&lt;/span&gt;&lt;span class="n"&gt;trainer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;train&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After training, the adapter is &lt;strong&gt;merged&lt;/strong&gt; back into the base in fp16 and saved as a self-contained checkpoint (this is mandatory — MLC-LLM's &lt;code&gt;convert_weight&lt;/code&gt; expects a fused model, not LoRA deltas):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base&lt;/span&gt;   &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;torch_dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float16&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;merged&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PeftModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;adapter_dir&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;merge_and_unload&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;safe_serialization&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# carry the tokenizer so the dir is self-contained
&lt;/span&gt;&lt;span class="n"&gt;merged&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_to_hub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;yourname/WebSLM-Medical-0.5B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;If you trained a LoRA adapter elsewhere (axolotl, Unsloth, raw PEFT), &lt;code&gt;merge_lora.py&lt;/code&gt; does just the merge+export step. If you did a &lt;em&gt;full&lt;/em&gt; fine-tune (no LoRA) you already have a standard checkpoint — skip merging entirely.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  4.3 Running it on Colab
&lt;/h3&gt;

&lt;p&gt;The fastest path is the clone-and-run notebook &lt;code&gt;finetune/finetune_webslm_colab.ipynb&lt;/code&gt; (Runtime → &lt;strong&gt;T4 GPU&lt;/strong&gt;). It clones the repo, installs &lt;code&gt;finetune/requirements.txt&lt;/code&gt;, logs into Hugging Face, trains on the chosen domain, and &lt;strong&gt;pushes the merged model to your HF account&lt;/strong&gt;. On a T4, a few hundred examples train in well under an hour. The final cell prints exactly what to enter in the build Action.&lt;/p&gt;

&lt;p&gt;Equivalent CLI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; finetune/requirements.txt
python finetune/train_lora.py &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--base&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--data&lt;/span&gt; finetune/data/medical.jsonl &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--push-merged&lt;/span&gt; yourname/WebSLM-Medical-0.5B &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--epochs&lt;/span&gt; 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of Stage 1 is a plain fp16 HF model repo — e.g. &lt;code&gt;VishalMysore/WebSLM-Medical-0.5B&lt;/code&gt; (qwen2 architecture, single &lt;code&gt;model.safetensors&lt;/code&gt;, tokenizer, configs). Nothing browser-specific yet.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Stage 2 — Compiling to WebGPU via GitHub Actions
&lt;/h2&gt;

&lt;p&gt;This is the stage that, done by hand, costs newcomers a day or more. webSLM encodes the entire toolchain build and the three-command MLC pipeline into &lt;code&gt;.github/workflows/build-slm.yml&lt;/code&gt;, triggered manually (&lt;code&gt;workflow_dispatch&lt;/code&gt;) with a domain preset &lt;strong&gt;or&lt;/strong&gt; a Custom path pointing at your fine-tune.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.1 Why build the toolchain from source
&lt;/h3&gt;

&lt;p&gt;The workflow builds &lt;strong&gt;MLC-LLM v0.19.0 and TVM from source&lt;/strong&gt;, because the MLC nightly wheels are broken by the in-progress &lt;code&gt;apache-tvm-ffi&lt;/code&gt; migration. The compile step is &lt;strong&gt;CPU-only&lt;/strong&gt;: &lt;code&gt;mlc_llm compile --device webgpu&lt;/code&gt; is code generation via emscripten — it emits WebAssembly kernels and never needs a GPU. A full toolchain build is ~35–45 minutes; the workflow has a 350-minute ceiling and frees disk space first.&lt;/p&gt;

&lt;p&gt;Build environment (the parts that matter):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Pin / setting&lt;/th&gt;
&lt;th&gt;Reason&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLC-LLM&lt;/td&gt;
&lt;td&gt;v0.19.0, from source&lt;/td&gt;
&lt;td&gt;nightly wheels broken (tvm-ffi migration)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;TVM&lt;/td&gt;
&lt;td&gt;bundled &lt;code&gt;3rdparty/tvm&lt;/code&gt;, from source&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;USE_LLVM "llvm-config --link-static"&lt;/code&gt;, &lt;code&gt;HIDE_PRIVATE_SYMBOLS ON&lt;/code&gt;, all GPU backends OFF&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLVM&lt;/td&gt;
&lt;td&gt;system &lt;code&gt;llvm-dev&lt;/code&gt; + matching &lt;code&gt;libpolly-*-dev&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;TVM static-links LLVM incl. Polly; &lt;code&gt;llvm-dev&lt;/code&gt; doesn't ship &lt;code&gt;libPolly.a&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;emscripten&lt;/td&gt;
&lt;td&gt;3.1.56&lt;/td&gt;
&lt;td&gt;wasm toolchain matching MLC-LLM's web runtime&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Rust&lt;/td&gt;
&lt;td&gt;latest stable (≥1.85)&lt;/td&gt;
&lt;td&gt;modern deps; &lt;code&gt;tokenizers-cpp&lt;/code&gt; needs &lt;code&gt;--cap-lints=allow&lt;/code&gt; on new Rust&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A subtle linking detail handled by the workflow: &lt;code&gt;mlc_llm compile --device webgpu&lt;/code&gt; links several wasm bitcode runtimes — &lt;code&gt;mlc_wasm_runtime.bc&lt;/code&gt; (from &lt;code&gt;mlc-llm/web&lt;/code&gt;) and &lt;code&gt;wasm_runtime.bc&lt;/code&gt; / &lt;code&gt;tvmjs_support.bc&lt;/code&gt; / &lt;code&gt;webgpu_runtime.bc&lt;/code&gt; (from &lt;code&gt;tvm/web&lt;/code&gt;). The latter three are built and copied into TVM's &lt;code&gt;build/&lt;/code&gt; so its &lt;code&gt;find_lib_path&lt;/code&gt; discovers them.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 The MLC compile pipeline
&lt;/h3&gt;

&lt;p&gt;Once the toolchain exists, the actual conversion is three commands (mirrored in &lt;code&gt;build.sh&lt;/code&gt; for local/WSL2 runs), preceded by a config-normalization step:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1b. Make a freshly-merged (newer-transformers) config readable by mlc-llm v0.19.0&lt;/span&gt;
python normalize_config.py &lt;span class="s2"&gt;"hf/&lt;/span&gt;&lt;span class="nv"&gt;$NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 2. Quantize + shard the weights  (HF -&amp;gt; MLC params)&lt;/span&gt;
mlc_llm convert_weight &lt;span class="s2"&gt;"hf/&lt;/span&gt;&lt;span class="nv"&gt;$NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--quantization&lt;/span&gt; q4f16_1 &lt;span class="nt"&gt;--model-type&lt;/span&gt; qwen2 &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$W&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 3. Emit chat template + tokenizer + model metadata&lt;/span&gt;
mlc_llm gen_config &lt;span class="s2"&gt;"hf/&lt;/span&gt;&lt;span class="nv"&gt;$NAME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--quantization&lt;/span&gt; q4f16_1 &lt;span class="nt"&gt;--model-type&lt;/span&gt; qwen2 &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--conv-template&lt;/span&gt; qwen2 &lt;span class="nt"&gt;--prefill-chunk-size&lt;/span&gt; 1024 &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$W&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

&lt;span class="c"&gt;# 4. Codegen the WebGPU model library&lt;/span&gt;
mlc_llm compile &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$W&lt;/span&gt;&lt;span class="s2"&gt;/mlc-chat-config.json"&lt;/span&gt; &lt;span class="nt"&gt;--device&lt;/span&gt; webgpu &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$L&lt;/span&gt;&lt;span class="s2"&gt;/&lt;/span&gt;&lt;span class="nv"&gt;$NAME&lt;/span&gt;&lt;span class="s2"&gt;-q4f16_1-webgpu.wasm"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5.2.1 The config-normalization gotcha
&lt;/h4&gt;

&lt;p&gt;Recent &lt;code&gt;transformers&lt;/code&gt; releases changed the config schema: the RoPE base moved from a top-level &lt;code&gt;rope_theta&lt;/code&gt; into a nested &lt;code&gt;rope_parameters&lt;/code&gt;/&lt;code&gt;rope_scaling&lt;/code&gt; dict, and &lt;code&gt;torch_dtype&lt;/code&gt; became &lt;code&gt;dtype&lt;/code&gt;. MLC-LLM v0.19.0 still expects the old top-level keys, so &lt;code&gt;convert_weight&lt;/code&gt; on a freshly fine-tuned model fails with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;TypeError: QWen2Config.__init__() missing 1 required positional argument: 'rope_theta'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;normalize_config.py&lt;/code&gt; hoists &lt;code&gt;rope_theta&lt;/code&gt; back to the top level and restores &lt;code&gt;torch_dtype&lt;/code&gt; — decoupling your &lt;em&gt;training&lt;/em&gt; transformers version from the &lt;em&gt;pinned, old&lt;/em&gt; build compiler. It is idempotent and safe on already-old configs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_theta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_parameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_scaling&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;rp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_theta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_theta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rp&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rope_theta&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt; &lt;span class="k"&gt;break&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;torch_dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dtype&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the kind of failure that, undiagnosed, produces a silent crash or garbage output with no upstream documentation.&lt;/p&gt;

&lt;h4&gt;
  
  
  5.2.2 Quantization choice
&lt;/h4&gt;

&lt;p&gt;Default is &lt;strong&gt;&lt;code&gt;q4f16_1&lt;/code&gt;&lt;/strong&gt; (4-bit weights, fp16 activations) — the smallest practical format. Some models &lt;strong&gt;overflow fp16 to NaN&lt;/strong&gt; during inference (observed with TinyLlama-1.1B); for those, fall back to &lt;strong&gt;&lt;code&gt;q4f32_1&lt;/code&gt;&lt;/strong&gt; (fp32 activations), trading size for numerical stability. The selected format is part of the wasm filename and the &lt;code&gt;model_id&lt;/code&gt;, so it must match between the compiled artifact and the browser config.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.3 Inputs and outputs
&lt;/h3&gt;

&lt;p&gt;The workflow exposes domain presets (Qwen2.5-Coder-1.5B for code, Qwen2.5-Math-1.5B for math, general Qwen/Llama/Gemma/Phi bases) and a &lt;strong&gt;Custom&lt;/strong&gt; path. For a fine-tune you select &lt;strong&gt;Custom&lt;/strong&gt; and pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="py"&gt;domain&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;Custom&lt;/span&gt;
&lt;span class="py"&gt;custom_model_hf&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;VishalMysore/WebSLM-Medical-0.5B&lt;/span&gt;
&lt;span class="py"&gt;custom_arch&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;qwen2&lt;/span&gt;
&lt;span class="py"&gt;custom_conv&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;qwen2&lt;/span&gt;
&lt;span class="py"&gt;quant&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;q4f16_1&lt;/span&gt;
&lt;span class="py"&gt;custom_name&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;WebSLM-Custom        (becomes the output repo/lib name)&lt;/span&gt;
&lt;span class="py"&gt;upload_hf&lt;/span&gt;       &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;true                 (push artifacts to HF; needs HF_TOKEN + HF_NAMESPACE)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs are saved &lt;strong&gt;unconditionally&lt;/strong&gt; (so a bad HF token never loses a 45-minute build): a downloadable &lt;strong&gt;Actions artifact&lt;/strong&gt;, a &lt;strong&gt;GitHub Release&lt;/strong&gt; carrying the &lt;code&gt;.wasm&lt;/code&gt;, and — when &lt;code&gt;upload_hf&lt;/code&gt; is set — a self-contained &lt;strong&gt;Hugging Face model repo&lt;/strong&gt;. The worked example produced &lt;code&gt;VishalMysore/WebSLM-Custom-MLC&lt;/code&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;File&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;mlc-chat-config.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;2 KB&lt;/td&gt;
&lt;td&gt;chat template, sampling defaults, context window&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ndarray-cache.json&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;102 KB&lt;/td&gt;
&lt;td&gt;weight-shard manifest&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;params_shard_0…7.bin&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~278 MB total&lt;/td&gt;
&lt;td&gt;4-bit quantized weights (8 shards)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;tokenizer.json&lt;/code&gt;, &lt;code&gt;tokenizer_config.json&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~11 MB&lt;/td&gt;
&lt;td&gt;tokenizer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;libs/WebSLM-Custom-q4f16_1-webgpu.wasm&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;3.7 MB&lt;/td&gt;
&lt;td&gt;WebGPU model library&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Compiled metadata of note: &lt;code&gt;model_type: qwen2&lt;/code&gt;, &lt;code&gt;quantization: q4f16_1&lt;/code&gt;, &lt;code&gt;context_window_size: 32768&lt;/code&gt;, &lt;code&gt;vocab_size: 151936&lt;/code&gt;, &lt;code&gt;conv_template: qwen2&lt;/code&gt;, default sampling &lt;code&gt;temperature: 0.7&lt;/code&gt;, &lt;code&gt;top_p: 0.8&lt;/code&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Stage 3 — Running in the browser with WebLLM
&lt;/h2&gt;

&lt;p&gt;The client is a static page importing &lt;code&gt;@mlc-ai/web-llm&lt;/code&gt;. The model loads from its HF URLs, caches in the browser (Cache API / IndexedDB), and runs on the client GPU. Three integration details determine whether it works at all — each one produced a distinct, opaque error during bring-up.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.1 Registering a custom model — &lt;code&gt;appConfig&lt;/code&gt; goes in the constructor
&lt;/h3&gt;

&lt;p&gt;WebLLM only knows its built-in (prebuilt) models unless you give it an &lt;code&gt;appConfig&lt;/code&gt; describing yours. The &lt;strong&gt;&lt;code&gt;appConfig&lt;/code&gt; must be passed to the &lt;code&gt;MLCEngine&lt;/code&gt; constructor&lt;/strong&gt;, not to &lt;code&gt;reload()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;webllm&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://esm.run/@mlc-ai/web-llm@0.2.79&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;appConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;model_list&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://huggingface.co/VishalMysore/WebSLM-Custom-MLC&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;// FULL HF URL&lt;/span&gt;
    &lt;span class="na"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WebSLM-Custom-q4f16_1-webgpu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;                                   &lt;span class="c1"&gt;// arbitrary local name&lt;/span&gt;
    &lt;span class="na"&gt;model_lib&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://huggingface.co/VishalMysore/WebSLM-Custom-MLC/resolve/main/libs/WebSLM-Custom-q4f16_1-webgpu.wasm&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;}],&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nx"&gt;webllm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;MLCEngine&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;appConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;initProgressCallback&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;WebSLM-Custom-q4f16_1-webgpu&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;reload(modelId, chatOpts?)&lt;/code&gt;'s second argument is &lt;code&gt;ChatOptions&lt;/code&gt;, which has &lt;strong&gt;no &lt;code&gt;appConfig&lt;/code&gt; field&lt;/strong&gt; — passing &lt;code&gt;appConfig&lt;/code&gt; there is silently dropped, the engine falls back to its prebuilt list, and you get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Cannot find model record in appConfig for WebSLM-Custom-q4f16_1-webgpu.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  6.2 &lt;code&gt;model&lt;/code&gt; must be a full URL
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;model_list[].model&lt;/code&gt; field must be a complete &lt;code&gt;https://huggingface.co/{USER}/{REPO}&lt;/code&gt; URL (the four accepted forms all start with &lt;code&gt;https://&lt;/code&gt;). A bare repo id makes WebLLM's internal &lt;code&gt;new URL(...)&lt;/code&gt; throw:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Failed to construct 'URL': Invalid URL
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;model_lib&lt;/code&gt; is the full &lt;code&gt;/resolve/main/.../*.wasm&lt;/code&gt; URL; &lt;code&gt;model_id&lt;/code&gt; is a free-form local handle.&lt;/p&gt;

&lt;h3&gt;
  
  
  6.3 Pin the runtime to the wasm's ABI
&lt;/h3&gt;

&lt;p&gt;The import must be &lt;strong&gt;version-pinned&lt;/strong&gt; to the &lt;code&gt;web-llm&lt;/code&gt; build matching the MLC-LLM that produced the wasm:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nx"&gt;webllm&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://esm.run/@mlc-ai/web-llm@0.2.79&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;   &lt;span class="c1"&gt;// NOT unpinned (=latest)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;esm.run/@mlc-ai/web-llm&lt;/code&gt; with no version resolves to &lt;em&gt;latest&lt;/em&gt;, whose wasm ABI can differ from a v0.19.0-compiled library — producing instantiation failures at load. The pin table:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Built with&lt;/th&gt;
&lt;th&gt;Runtime&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MLC-LLM v0.19.0&lt;/td&gt;
&lt;td&gt;&lt;code&gt;@mlc-ai/web-llm@0.2.79&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  6.4 Inference and decoding
&lt;/h3&gt;

&lt;p&gt;WebLLM's API is OpenAI-shaped and streams:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;TRAINING_SYSTEM_PROMPT&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;   &lt;span class="c1"&gt;// match the prompt the model was TRAINED with&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;top_p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;            &lt;span class="c1"&gt;// Qwen2.5's recommended sampling&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* chunk.choices[0].delta.content */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two points are decisive for output quality on a 0.5B model (see §7.3):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use the training system prompt at inference.&lt;/strong&gt; The fine-tune's behavior is conditioned on the persona it was trained with; a different system prompt pulls it back toward generic base behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use the model's recommended sampling&lt;/strong&gt; (&lt;code&gt;temperature 0.7&lt;/code&gt;, &lt;code&gt;top_p 0.8&lt;/code&gt;). Greedy decoding and aggressive penalties both degrade small-model output in characteristic ways.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6.5 The demo application
&lt;/h3&gt;

&lt;p&gt;The companion demo (&lt;a href="https://github.com/vishalmysore/webSLMDemo" rel="noopener noreferrer"&gt;webSLMDemo&lt;/a&gt;, deployed to GitHub Pages) has two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Product demo&lt;/strong&gt; — &lt;em&gt;Base + RAG&lt;/em&gt; (left) vs. &lt;em&gt;Fine-tuned SLM&lt;/em&gt; (right). The base panel runs a general model with an in-browser &lt;strong&gt;TF-IDF retriever&lt;/strong&gt; injecting document context; the SLM panel runs the fine-tune with a domain system prompt and no retrieval. This contrasts the two specialization strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning proof&lt;/strong&gt; — a controlled A/B (next section). It loads the &lt;strong&gt;exact base&lt;/strong&gt; the fine-tune started from (&lt;code&gt;Qwen2.5-0.5B-Instruct-q4f16_1-MLC&lt;/code&gt;, a WebLLM prebuilt at the same quantization) on the left and the fine-tune on the right, feeds &lt;strong&gt;both&lt;/strong&gt; the identical training system prompt with no retrieval, and uses &lt;strong&gt;identical decoding&lt;/strong&gt; — so the only variable is the LoRA training.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Validation: what fine-tuning actually changed
&lt;/h2&gt;

&lt;p&gt;Because the proof mode holds base, prompt, and decoding identical across both panels, any difference is attributable to the LoRA fine-tuning. The following are verbatim in-browser outputs.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.1 In-distribution question (trained topic)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; &lt;em&gt;"What are common signs of dehydration?"&lt;/em&gt; (a question whose topic is in the training set.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Base — Qwen2.5-0.5B, no fine-tune:&lt;/strong&gt; "1. Sunken Eyes… 6. Confusion… 9. Dry Skin: Not having much moisture in your skin. 10. Dry Skin… 11. Dry Skin…" — rambling, and falls into a repeat loop to the token limit.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tune — WebSLM-Medical-0.5B:&lt;/strong&gt; "Common signs of dehydration include: 1. Sunken eyes 2. Dry mouth and lips … 4. Urine that is dark or less than normal … 7. Not sweating heavily or feeling weak. These symptoms can be caused by low fluid intake, heatstroke… If you notice any of these signs, it's important to seek medical attention right away."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The fine-tune reproduces the &lt;strong&gt;trained content&lt;/strong&gt; (thirst, dark urine, reduced urination, rapid heartbeat — closely mirroring its &lt;code&gt;medical.jsonl&lt;/code&gt; answer), is &lt;strong&gt;concise&lt;/strong&gt;, &lt;strong&gt;stops cleanly&lt;/strong&gt;, and &lt;strong&gt;closes with a referral&lt;/strong&gt; — exactly the trained signature. The base is verbose and unstable. Fine-tuning here improved both &lt;strong&gt;style&lt;/strong&gt; and &lt;strong&gt;generation stability&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.2 Held-out question (untrained topic)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; &lt;em&gt;"What is long covid?"&lt;/em&gt; (a topic the 34-example dataset never covered.)&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Base:&lt;/strong&gt; invents an alias ("also known as Long-Term Exposure") but gets the chronic/months-to-years framing roughly right.&lt;br&gt;
&lt;strong&gt;Fine-tune:&lt;/strong&gt; cleaner structure and plausible symptom list, but states symptoms last "an average of two to three weeks" — which is wrong (that is &lt;em&gt;acute&lt;/em&gt; COVID; long COVID lasts months by definition).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Honest finding:&lt;/strong&gt; on a held-out topic the fine-tuning signal is weak and &lt;strong&gt;both models hallucinate&lt;/strong&gt;. A 34-example LoRA imparts &lt;em&gt;style and in-distribution phrasing&lt;/em&gt;, not reliable new &lt;em&gt;knowledge&lt;/em&gt;, and certainly not factual reliability on topics outside the training distribution. This is expected and is the central caveat for small-data fine-tuning.&lt;/p&gt;

&lt;h3&gt;
  
  
  7.3 Decoding pitfalls (general to sub-1B WebLLM models)
&lt;/h3&gt;

&lt;p&gt;Decoding choice changes outputs as much as fine-tuning does on these models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Decoding&lt;/th&gt;
&lt;th&gt;Effect on the 0.5B models&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Greedy (&lt;code&gt;temperature: 0&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;Degenerate &lt;strong&gt;repetition loops&lt;/strong&gt; ("Confusion. Confusion…") that bury the trained style&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Low temp + strong &lt;code&gt;frequency_penalty&lt;/code&gt;/&lt;code&gt;presence_penalty&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Language drift&lt;/strong&gt;: penalizing repeated English tokens pushes a &lt;em&gt;bilingual&lt;/em&gt; Qwen model into Chinese tokens → gibberish loops ("答? 答答?")&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;temperature 0.7&lt;/code&gt;, &lt;code&gt;top_p 0.8&lt;/code&gt;&lt;/strong&gt; (Qwen's own recommended)&lt;/td&gt;
&lt;td&gt;Stable, coherent, no loops, no drift — the chosen setting&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The takeaway: small models are decoding-sensitive. Fairness in the A/B is preserved by applying identical decoding to both panels; quality is preserved by using the model's recommended sampling rather than greedy or penalty-heavy schemes.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Reproducibility
&lt;/h2&gt;

&lt;p&gt;Everything required to reproduce &lt;code&gt;WebSLM-Medical-0.5B&lt;/code&gt; is public and pinned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data &amp;amp; training:&lt;/strong&gt; &lt;code&gt;finetune/data/medical.jsonl&lt;/code&gt;, &lt;code&gt;finetune/train_lora.py&lt;/code&gt;, &lt;code&gt;finetune/finetune_webslm_colab.ipynb&lt;/code&gt; (Colab T4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build:&lt;/strong&gt; &lt;code&gt;.github/workflows/build-slm.yml&lt;/code&gt; (CI) or &lt;code&gt;build.sh&lt;/code&gt; (Linux/WSL2), both invoking &lt;code&gt;normalize_config.py&lt;/code&gt; then the three MLC commands.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Artifacts:&lt;/strong&gt; &lt;a href="https://huggingface.co/VishalMysore/WebSLM-Medical-0.5B" rel="noopener noreferrer"&gt;&lt;code&gt;VishalMysore/WebSLM-Medical-0.5B&lt;/code&gt;&lt;/a&gt; (merged fp16) and &lt;a href="https://huggingface.co/VishalMysore/WebSLM-Custom-MLC" rel="noopener noreferrer"&gt;&lt;code&gt;VishalMysore/WebSLM-Custom-MLC&lt;/code&gt;&lt;/a&gt; (compiled, 4-bit, + wasm).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Client:&lt;/strong&gt; the demo's &lt;code&gt;app.js&lt;/code&gt;, pinned to &lt;code&gt;@mlc-ai/web-llm@0.2.79&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Version matrix (the pins that matter)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Pin&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Base model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen/Qwen2.5-0.5B-Instruct&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;LoRA (r=16, α=32) via TRL &lt;code&gt;SFTTrainer&lt;/code&gt;, 3 epochs, fp16&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compiler&lt;/td&gt;
&lt;td&gt;MLC-LLM &lt;strong&gt;v0.19.0&lt;/strong&gt;, built from source (+ bundled TVM)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;wasm toolchain&lt;/td&gt;
&lt;td&gt;emscripten &lt;strong&gt;3.1.56&lt;/strong&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quantization&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;q4f16_1&lt;/strong&gt; (fallback &lt;code&gt;q4f32_1&lt;/code&gt; on fp16 NaN)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Browser runtime&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;@mlc-ai/web-llm&lt;/code&gt; &lt;strong&gt;0.2.79&lt;/strong&gt; (must match the compiler)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference sampling&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;temperature 0.7&lt;/code&gt;, &lt;code&gt;top_p 0.8&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Repository components
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;finetune/train_lora.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;LoRA SFT → merge → push&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;finetune/finetune_webslm_colab.ipynb&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;clone-and-run Colab trainer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;finetune/data/*.jsonl&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;starter domain datasets (illustrative)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;merge_lora.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;merge an externally-trained adapter into its base&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;normalize_config.py&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;newer-transformers → mlc-llm v0.19.0 config fix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;build.sh&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;local/WSL2 convert → gen_config → compile&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;.github/workflows/build-slm.yml&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;from-source toolchain + full compile/release/upload&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;demo/index.html&lt;/code&gt; (and webSLMDemo)&lt;/td&gt;
&lt;td&gt;self-contained WebLLM client&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  9. Limitations and responsible use
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Browser favors small models.&lt;/strong&gt; 0.5B–3.5B is the practical sweet spot; 7B+ is slow and memory-pressured on consumer GPUs because WebGPU memory is shared with the browser process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Small-data fine-tuning transfers style, not knowledge.&lt;/strong&gt; As §7.2 shows, expect on-distribution behavioral alignment, not factual reliability — and expect hallucination off-distribution.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quant/version coupling is brittle.&lt;/strong&gt; The wasm ↔ runtime pin is mandatory; fp16 quantization can NaN on some architectures.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sensitive domains require a human in the loop.&lt;/strong&gt; The medical/legal/insurance examples exist to demonstrate the pipeline. Outputs can be confidently wrong; keep disclaimers (the sample data trains one in) and validate before relying on any output.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;The distance between a capable small language model and a useful, private, offline, domain-specific browser assistant is bridged by tooling, not research. webSLM makes that bridge reproducible: fine-tune on a free Colab GPU, compile and quantize in CPU-only CI with a pinned from-source MLC-LLM toolchain, and serve the result as static files that any WebGPU browser runs locally. The worked example, &lt;code&gt;WebSLM-Medical-0.5B&lt;/code&gt;, demonstrates the full path end to end — and a controlled in-browser A/B confirms, honestly, both what a small fine-tune buys (concise, stable, on-style, in-distribution behavior) and what it does not (new knowledge or factual reliability). For focused, behavior-defined applications that must run without a server, that trade is often exactly the right one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3xk1ocf2ioqa1e5f58nm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F3xk1ocf2ioqa1e5f58nm.png" alt=" " width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo link - &lt;a href="https://vishalmysore.github.io/webSLMDemo/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/webSLMDemo/&lt;/a&gt;&lt;br&gt;
Code for Demo  - &lt;a href="https://github.com/vishalmysore/webSLMDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/webSLMDemo&lt;/a&gt;&lt;br&gt;
Model/Finetuning Code  - &lt;a href="https://github.com/vishalmysore/webSLM" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/webSLM&lt;/a&gt;&lt;br&gt;
Actual Fined tuned model - &lt;a href="https://huggingface.co/VishalMysore/WebSLM-Medical-0.5B" rel="noopener noreferrer"&gt;https://huggingface.co/VishalMysore/WebSLM-Medical-0.5B&lt;/a&gt;&lt;br&gt;
MLC WASM - &lt;a href="https://huggingface.co/VishalMysore/WebSLM-Custom-MLC" rel="noopener noreferrer"&gt;https://huggingface.co/VishalMysore/WebSLM-Custom-MLC&lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>tutorial</category>
      <category>webdev</category>
    </item>
    <item>
      <title>From SLM Fundamentals to webSLM: A Practical Path to Domain-Specific Browser AI</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Fri, 26 Jun 2026 13:26:41 +0000</pubDate>
      <link>https://dev.to/vishalmysore/from-slm-fundamentals-to-webslm-a-practical-path-to-domain-specific-browser-ai-5hla</link>
      <guid>https://dev.to/vishalmysore/from-slm-fundamentals-to-webslm-a-practical-path-to-domain-specific-browser-ai-5hla</guid>
      <description>&lt;h2&gt;
  
  
  What is an SLM, and why does it matter now?
&lt;/h2&gt;

&lt;p&gt;For most of the last few years, the dominant narrative around language models has been scale. More parameters meant better results, so GPT-4, Claude, Gemini, and their peers grew into models requiring enormous GPU clusters just to serve a single request. That story is still true at the frontier — but it is no longer the only story worth telling.&lt;/p&gt;

&lt;p&gt;A parallel track has been quietly gaining ground: Small Language Models.&lt;/p&gt;

&lt;p&gt;An SLM is a language model typically between 0.5 billion and 7 billion parameters, designed from the start to be efficient rather than simply scaled down from a larger one. The key word is &lt;em&gt;designed&lt;/em&gt;. Early efforts at small models were essentially pruned or distilled versions of larger ones, and they showed — the quality dropped noticeably. What changed around 2023–2024 was the training recipe. Researchers at Microsoft, Google, Alibaba, and others demonstrated that if you invest heavily in data quality, synthetic data pipelines, and architecture-level optimizations, a 1.3B or 3.8B parameter model can outperform much larger models on many practical benchmarks.&lt;/p&gt;

&lt;p&gt;Microsoft's Phi series is one of the most well-studied examples. Phi-2 (2.7B) was published with benchmark results showing it matched or exceeded models 5–10x its size on reasoning and coding tasks. Phi-3-mini (3.8B) later extended this further. The explanation from the research team was straightforward: the training data was aggressively curated to emphasize reasoning-dense content, synthetic problems, and educational material — essentially training the model to think efficiently rather than just memorize patterns at scale.&lt;/p&gt;

&lt;p&gt;Alibaba's Qwen2.5 series similarly demonstrated strong performance across coding, mathematics, and instruction following at the sub-2B range, making it one of the go-to base model families for edge and on-device applications.&lt;/p&gt;

&lt;p&gt;Code for this article is here &lt;a href="https://github.com/vishalmysore/webSLM" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/webSLM&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What defines an SLM
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameter count&lt;/strong&gt;: typically 0.5B–7B, though the upper range overlaps with smaller traditional LLMs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Training philosophy&lt;/strong&gt;: curated data quality over raw data volume; distillation techniques to transfer reasoning from larger teacher models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Architecture optimizations&lt;/strong&gt;: grouped-query attention, sliding window attention, efficient tokenizers, and better normalization schemes reduce memory and compute per token&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deployment target&lt;/strong&gt;: edge hardware, consumer laptops, mobile devices, embedded systems, and increasingly the browser&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Why the benchmark numbers are misleading — in a good way
&lt;/h3&gt;

&lt;p&gt;When Phi-3-mini was released, it scored competitively on MMLU (general knowledge), HumanEval (code), and GSM8K (math reasoning) against models three to four times its size. This matters because those benchmarks were designed to stress-test large models. Beating them at 3.8B suggests that many real-world tasks do not require scale — they require specificity.&lt;/p&gt;

&lt;p&gt;This is the core insight that makes SLMs interesting beyond the spec sheet: a general-purpose 70B model has to allocate capacity across all of human knowledge. A specialized 1.5B model, fine-tuned on a specific domain, can concentrate all of its capacity on what you actually care about. For domain-specific applications — insurance underwriting, legal clause extraction, medical triage support, code review in a specific stack — the fine-tuned SLM often produces better practical results than a raw large model.&lt;/p&gt;

&lt;h3&gt;
  
  
  The trade-offs worth acknowledging
&lt;/h3&gt;

&lt;p&gt;SLMs are not a universal replacement for large models. They struggle with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Long multi-step reasoning chains that require deep context retention&lt;/li&gt;
&lt;li&gt;Open-ended creative generation where diversity and surprise matter&lt;/li&gt;
&lt;li&gt;Tasks requiring truly broad world knowledge synthesized across domains&lt;/li&gt;
&lt;li&gt;Very long context windows, though recent models have improved on this&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these tasks, a large model or a retrieval-augmented system is still the better choice. But for a focused application with well-defined input/output behavior, an SLM is not just viable — it is often preferable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Typical SLM models worth knowing
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;Notable strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5-0.5B / 1.5B&lt;/td&gt;
&lt;td&gt;0.5B, 1.5B&lt;/td&gt;
&lt;td&gt;Fast, efficient, good instruction following&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3.5-mini&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;Strong reasoning and coding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-4-mini&lt;/td&gt;
&lt;td&gt;3.8B&lt;/td&gt;
&lt;td&gt;Improved math and complex reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma-2B / Gemma-3-4B&lt;/td&gt;
&lt;td&gt;2B, 4B&lt;/td&gt;
&lt;td&gt;Balanced general performance&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B&lt;/td&gt;
&lt;td&gt;7B&lt;/td&gt;
&lt;td&gt;Strong open-weight baseline&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Quantized LLMs: compression is not the same as being small
&lt;/h2&gt;

&lt;p&gt;Before comparing SLMs and quantized LLMs, it is worth being precise about what quantization actually is — because the two are frequently confused.&lt;/p&gt;

&lt;p&gt;Quantization is a post-training compression technique. It does not change a model's architecture, parameter count, or training. What it changes is the numerical format used to store and compute with the model's weights. A model trained in FP32 (32-bit floating point) or BF16 holds each weight as a high-precision floating-point value. Quantization converts those values to lower-precision formats — INT8, INT4, or even more aggressive schemes like GPTQ or AWQ — shrinking the model's memory footprint significantly.&lt;/p&gt;

&lt;p&gt;The motivation is practical: running a 70B parameter model in FP16 requires roughly 140GB of GPU memory. Quantized to INT4, that drops to around 35GB — still large, but now runnable on a high-end workstation or a server with two A100s rather than a multi-GPU cluster. Tools like &lt;code&gt;llama.cpp&lt;/code&gt;, GGUF format, and bitsandbytes have made this workflow accessible to individual developers.&lt;/p&gt;

&lt;h3&gt;
  
  
  What quantization buys you
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A Llama-3 70B model that previously required a data center node can run on a local machine with two consumer GPUs&lt;/li&gt;
&lt;li&gt;Inference speeds improve noticeably at lower precision, especially on hardware with dedicated low-precision units&lt;/li&gt;
&lt;li&gt;The same model weights can be distributed in a much smaller file, which matters for deployment and bandwidth&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What quantization costs
&lt;/h3&gt;

&lt;p&gt;Quality degrades as bit width drops. The degradation is non-linear: moving from FP16 to INT8 typically has minimal impact on most benchmarks. Moving to INT4 introduces more noticeable regressions — shorter responses, occasional repetition, and reduced performance on multi-step reasoning tasks. Moving below INT4 can compromise reliability on complex tasks significantly.&lt;/p&gt;

&lt;p&gt;The important point is that a quantized LLM is still &lt;em&gt;fundamentally a large model&lt;/em&gt; operating in a compressed representation. It carries the same architecture, the same parameter structure, and largely the same capability profile as the original — just at a cost to precision. It does not gain the focused efficiency of a model that was designed to be small from the start.&lt;/p&gt;

&lt;h2&gt;
  
  
  SLM vs Quantized LLM: two different answers to the same problem
&lt;/h2&gt;

&lt;p&gt;Both SLMs and quantized LLMs are responses to the same practical constraint: large models are expensive to run. But they answer that constraint in different ways, and the difference matters for where you deploy them.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;SLM&lt;/th&gt;
&lt;th&gt;Quantized LLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;What it is&lt;/td&gt;
&lt;td&gt;Model designed and trained to be small&lt;/td&gt;
&lt;td&gt;Large model compressed after training&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Efficiency source&lt;/td&gt;
&lt;td&gt;Architecture + curated training data&lt;/td&gt;
&lt;td&gt;Reduced numeric precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory footprint&lt;/td&gt;
&lt;td&gt;Inherently low (0.5B–7B parameters)&lt;/td&gt;
&lt;td&gt;Lower than original, but still reflects large parameter count&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment target&lt;/td&gt;
&lt;td&gt;Browser, mobile, embedded, edge&lt;/td&gt;
&lt;td&gt;Local GPU, on-prem server with limited VRAM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fine-tuning&lt;/td&gt;
&lt;td&gt;Fast and cheap at small scale&lt;/td&gt;
&lt;td&gt;Requires full-precision weights or careful PEFT setup&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline capability&lt;/td&gt;
&lt;td&gt;Excellent&lt;/td&gt;
&lt;td&gt;Good if model fits on local hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An SLM at 1.5B parameters running in a browser tab uses around 1–2GB of memory. A quantized 70B model at INT4, even with its compression, still requires 30–40GB. These are not competing in the same deployment category.&lt;/p&gt;

&lt;p&gt;For the browser and edge use cases that webSLM targets, quantized LLMs are simply not viable candidates — not because of quality, but because of scale. The SLM path is the only realistic one.&lt;/p&gt;

&lt;h2&gt;
  
  
  RAG vs SLM: two different problems being solved
&lt;/h2&gt;

&lt;p&gt;RAG — Retrieval-Augmented Generation — gets mentioned alongside SLMs frequently enough that it is worth addressing directly, because the comparison is often framed incorrectly. RAG is not a competing model type. It is a system architecture.&lt;/p&gt;

&lt;p&gt;In a RAG system, a query is first routed to a retrieval layer — typically a vector database or a document index. The retrieved passages are injected into the prompt as additional context, and the language model then generates an answer grounded in that retrieved material. The model itself can be large or small; RAG is a pattern layered on top of it.&lt;/p&gt;

&lt;p&gt;The reason RAG became widely adopted is straightforward. Language models have a knowledge cutoff and a finite context window. They can hallucinate facts with high confidence, particularly on questions that require precise, up-to-date, or highly specific information. Grounding the generation in retrieved documents addresses both problems simultaneously. Lewis et al. (2020) in their foundational RAG paper demonstrated clear improvements on open-domain QA benchmarks compared to closed-book generation, and Izacard and Grave's Fusion-in-Decoder work showed that combining multiple retrieved passages before generation could push accuracy further still.&lt;/p&gt;

&lt;p&gt;But RAG comes with its own costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What RAG requires
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A retrieval pipeline: document ingestion, chunking, embedding, and indexing&lt;/li&gt;
&lt;li&gt;A vector store or search index that must be maintained and kept current&lt;/li&gt;
&lt;li&gt;Additional latency: every query requires a retrieval step before generation&lt;/li&gt;
&lt;li&gt;Infrastructure: the retrieval layer is a separate service with its own deployment and scaling concerns&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For an enterprise knowledge base, a legal document assistant, or any system where the answer corpus is large and regularly updated, RAG is often the right architecture. But it is not a lightweight choice.&lt;/p&gt;

&lt;h3&gt;
  
  
  Where SLMs fit differently
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;RAG system&lt;/th&gt;
&lt;th&gt;Fine-tuned SLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Knowledge source&lt;/td&gt;
&lt;td&gt;External documents retrieved at query time&lt;/td&gt;
&lt;td&gt;Encoded in weights through fine-tuning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Infrastructure&lt;/td&gt;
&lt;td&gt;Retrieval layer + vector DB + model&lt;/td&gt;
&lt;td&gt;Model only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Factual accuracy&lt;/td&gt;
&lt;td&gt;High when retrieval is good&lt;/td&gt;
&lt;td&gt;Depends on training data quality&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline capability&lt;/td&gt;
&lt;td&gt;Requires local index, complex setup&lt;/td&gt;
&lt;td&gt;Naturally offline, single binary&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment complexity&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Large, dynamic knowledge bases&lt;/td&gt;
&lt;td&gt;Fixed-domain behavior and style&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A fine-tuned SLM is not trying to memorize every document in a corpus. It is learning the &lt;em&gt;style, structure, and reasoning patterns&lt;/em&gt; of a domain. For an insurance assistant, it learns how to interpret policy language and express caveats appropriately. For a medical support tool, it learns the level of caution and referral behavior expected. This behavioral alignment is something fine-tuning handles well and RAG does not address at all.&lt;/p&gt;

&lt;p&gt;The right mental model: RAG and fine-tuned SLMs are often complementary. You fine-tune for behavior and style; you add RAG when you need real-time document grounding. For the browser use case webSLM targets, RAG is not practical — there is no server-side retrieval layer. Fine-tuning is the only mechanism available for domain specialization.&lt;/p&gt;

&lt;h2&gt;
  
  
  Browser inference: WebLLM and MLC-LLM
&lt;/h2&gt;

&lt;p&gt;Before webSLM makes sense, two underlying projects need to be understood: WebLLM and MLC-LLM. Together they form the runtime stack that makes in-browser model inference possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebLLM
&lt;/h3&gt;

&lt;p&gt;WebLLM is an open-source project from MLC-AI that brings LLM inference into the browser using WebGPU — the modern hardware-accelerated compute API available in Chrome, Edge, and other browsers. Unlike WebGL, WebGPU exposes general-purpose GPU compute, which is what neural network inference actually requires.&lt;/p&gt;

&lt;p&gt;From a developer's perspective, WebLLM exposes an OpenAI-compatible JavaScript API. You call &lt;code&gt;engine.chat.completions.create()&lt;/code&gt; and get streaming responses back, all running locally. There is no network call, no API key, and no external dependency once the model weights are loaded into the browser cache. The project supports a growing list of model families — Llama, Phi, Qwen, Gemma, Mistral — and is actively maintained at &lt;a href="https://github.com/mlc-ai/web-llm" rel="noopener noreferrer"&gt;github.com/mlc-ai/web-llm&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The constraint is real: WebGPU memory is shared with the browser process and limited by the device's GPU. This is precisely why SLMs in the 0.5B–3.5B range are the practical sweet spot. A 7B model in a browser is slow and memory-pressured on most consumer hardware. A 1.5B model loads in seconds and runs at a usable token rate.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLC-LLM
&lt;/h3&gt;

&lt;p&gt;WebLLM is the runtime, but it cannot load a raw Hugging Face model checkpoint. That is where MLC-LLM comes in. MLC-LLM is a universal model deployment engine — part of the Apache TVM ecosystem — that compiles model weights into a target-specific format. For browser deployment, it produces two outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Quantized weight shards&lt;/strong&gt;: the model parameters compressed to INT4 or another low-bit format, split into files that can be cached by the browser&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A &lt;code&gt;.wasm&lt;/code&gt; model library&lt;/strong&gt;: a WebAssembly binary containing the compiled compute kernels for that specific model architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The compilation step (&lt;code&gt;mlc_llm compile --device webgpu&lt;/code&gt;) is what transforms a standard model into something WebLLM can execute. It also runs &lt;code&gt;gen_config&lt;/code&gt; to produce the chat template and sampling configuration, and &lt;code&gt;convert_weight&lt;/code&gt; to quantize and shard the parameters. These are the steps that webSLM automates.&lt;/p&gt;

&lt;h2&gt;
  
  
  webSLM: an experiment in domain-specific browser AI
&lt;/h2&gt;

&lt;p&gt;With SLMs, quantized models, RAG, and the WebLLM/MLC-LLM stack as context, webSLM becomes easier to position precisely.&lt;/p&gt;

&lt;p&gt;webSLM is a pipeline and toolkit for building domain-specific small language models that run entirely in the browser. It is not a pre-trained model and not a fork of WebLLM. It is the build system and workflow that sits between a raw Hugging Face checkpoint and a working browser-based chatbot — handling the fine-tuning, compilation, quantization, hosting, and demo wiring that would otherwise require deep familiarity with MLC-LLM internals.&lt;/p&gt;

&lt;p&gt;The motivation came from a practical question: if WebLLM already makes it possible to run a general-purpose SLM in the browser, what does it take to make that model actually useful for a specific domain — insurance, legal, medical, or a custom vertical — without deploying any server infrastructure? The answer turned out to be a combination of LoRA fine-tuning, careful config normalization for MLC-LLM compatibility, and reproducible build paths that do not require a local Linux GPU machine.&lt;/p&gt;

&lt;h3&gt;
  
  
  What doing this without webSLM actually looks like
&lt;/h3&gt;

&lt;p&gt;Before webSLM existed, the process of taking a Hugging Face model and getting it running domain-specifically in a browser required navigating several independent and poorly-documented steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The MLC-LLM compilation pipeline has three distinct commands (&lt;code&gt;convert_weight&lt;/code&gt;, &lt;code&gt;gen_config&lt;/code&gt;, &lt;code&gt;mlc_llm compile&lt;/code&gt;) with non-obvious ordering and a version-sensitive environment. Getting the right Python environment, CUDA setup, and MLC version aligned was hours of work on its own.&lt;/li&gt;
&lt;li&gt;Newer Hugging Face model configs ship with fields that MLC-LLM v0.19.0 does not understand, causing silent failures or NaN outputs during inference. There is no upstream documentation for this — you discover it when your compiled model produces garbage in the browser.&lt;/li&gt;
&lt;li&gt;LoRA adapters from Hugging Face PEFT need to be merged back into the base model before compilation. The merge is not automatic and requires understanding the model's config format.&lt;/li&gt;
&lt;li&gt;GitHub Actions support for the GPU-less compilation step (CPU-only compile is possible for WebGPU targets) did not exist as a ready-made workflow. Building one from scratch requires understanding how to cache the MLC build environment across runs.&lt;/li&gt;
&lt;li&gt;Hosting the compiled artifacts correctly — WASM, weight shards, model config — and configuring WebLLM to find them requires writing custom JSON configuration that is not templated anywhere.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these is a solvable problem in isolation. Together, they represent a full day to several days of debugging for someone approaching this without prior MLC-LLM experience. webSLM absorbs all of it.&lt;/p&gt;

&lt;h3&gt;
  
  
  What webSLM enables
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Domain-specific behavior through LoRA fine-tuning on your own data&lt;/li&gt;
&lt;li&gt;Browser-first deployment with no server, no API key, and full offline capability&lt;/li&gt;
&lt;li&gt;Reproducible build paths using GitHub Actions, Colab, or local scripts — no local GPU required for the compilation step&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  A concrete walkthrough: from base model to browser
&lt;/h2&gt;

&lt;p&gt;To make this tangible, here is how a complete run looks using Qwen2.5-1.5B as the base model and an insurance domain as the target.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Fine-tuning on insurance data&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;finetune/&lt;/code&gt; directory contains a starter &lt;code&gt;insurance.jsonl&lt;/code&gt; dataset with examples formatted as chat turns. Each example has a system prompt establishing the assistant's behavior — cautious, policy-grounded, always recommending professional review — and a user/assistant pair demonstrating how to handle a coverage question. You replace or extend these with your own examples, then run the fine-tuning Colab notebook or &lt;code&gt;train_lora.py&lt;/code&gt; directly. On a T4 GPU in Colab, a few hundred examples train in under an hour. The output is a LoRA adapter.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Merging the adapter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;merge_lora.py&lt;/code&gt; combines the LoRA adapter back into the base model weights, producing a merged Hugging Face checkpoint. This is what MLC-LLM will compile. The script also handles &lt;code&gt;normalize_config.py&lt;/code&gt; compatibility fixes — stripping fields from the Hugging Face config that cause MLC-LLM v0.19.0 to fail silently.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Compilation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The GitHub Actions workflow (&lt;code&gt;.github/workflows/build-slm.yml&lt;/code&gt;) takes the merged model repo as input and runs the full MLC-LLM pipeline: &lt;code&gt;convert_weight&lt;/code&gt; to quantize to &lt;code&gt;q4f16_1&lt;/code&gt; (or &lt;code&gt;q4f32_1&lt;/code&gt; for models that produce NaNs at half precision), &lt;code&gt;gen_config&lt;/code&gt; to produce the chat template, and &lt;code&gt;mlc_llm compile --device webgpu&lt;/code&gt; to produce the &lt;code&gt;.wasm&lt;/code&gt; model library. The compiled artifacts are uploaded to a GitHub Release and optionally pushed to Hugging Face.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4: Browser deployment&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;demo/index.html&lt;/code&gt; is a self-contained WebLLM chat interface. You point it at your model config URL — which references the weight shards on Hugging Face and the &lt;code&gt;.wasm&lt;/code&gt; on GitHub Releases — and it loads directly in a browser. First load caches the weights locally using the browser's cache API. Subsequent loads are near-instant.&lt;/p&gt;

&lt;p&gt;The user experience is a chat interface running entirely on-device. There is no loading spinner waiting on a remote API. There is no usage cost. The model's responses reflect its fine-tuning: it answers insurance questions with appropriate hedging, recommends consulting a licensed professional for binding decisions, and stays within the domain rather than wandering into general knowledge.&lt;/p&gt;

&lt;h2&gt;
  
  
  How webSLM works in practice
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Select a compatible small base model.&lt;/li&gt;
&lt;li&gt;Fine-tune with domain data using the provided LoRA script or Colab notebook.&lt;/li&gt;
&lt;li&gt;Merge the adapter and normalize the config.&lt;/li&gt;
&lt;li&gt;Compile and quantize with MLC-LLM via GitHub Actions or Colab.&lt;/li&gt;
&lt;li&gt;Host the &lt;code&gt;.wasm&lt;/code&gt; and weight shards on GitHub Releases or Hugging Face.&lt;/li&gt;
&lt;li&gt;Load and run in any browser through WebLLM.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Build options
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Actions: triggers on push, produces a downloadable release with all browser artifacts&lt;/li&gt;
&lt;li&gt;Colab: interactive notebook for fine-tuning, merging, and building in one session&lt;/li&gt;
&lt;li&gt;Local: run &lt;code&gt;build.sh&lt;/code&gt; end-to-end on a machine with MLC-LLM installed&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Repo components
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;finetune/&lt;/code&gt; — LoRA training scripts, Colab notebook, and domain starter datasets&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;colab/&lt;/code&gt; — build notebook for interactive compilation without local setup&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;demo/index.html&lt;/code&gt; — self-contained browser chat UI ready to point at any compiled model&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build.sh&lt;/code&gt; — local end-to-end build script&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;merge_lora.py&lt;/code&gt; — merges adapter weights before compilation&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;normalize_config.py&lt;/code&gt; — strips unsupported config fields to fix MLC-LLM v0.19.0 compatibility&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;.github/workflows/build-slm.yml&lt;/code&gt; — CI pipeline that handles the full compile-and-release cycle&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Data quality and domain specialization
&lt;/h2&gt;

&lt;p&gt;The included datasets are intentionally small and illustrative. They prove the pipeline, but they do not fully specialize a model.&lt;/p&gt;

&lt;p&gt;For stronger domain performance, you usually need hundreds to thousands of high-quality examples with consistent style and factual grounding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Practical strengths and limits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Strengths
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Clear path from fine-tune to browser deployment&lt;/li&gt;
&lt;li&gt;Strong developer experience (scripts, notebooks, CI, demo)&lt;/li&gt;
&lt;li&gt;Privacy-first and offline-friendly runtime model&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Limits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Browser deployment favors smaller models; 7B+ is often impractical&lt;/li&gt;
&lt;li&gt;Quant format and version compatibility can affect stability&lt;/li&gt;
&lt;li&gt;Sensitive domains still require human review&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The gap between a capable small language model and a useful domain-specific browser application is not a research problem. It is an engineering and tooling problem. SLMs have reached a point where a 1.5B or 3.8B parameter model, properly fine-tuned, can deliver genuinely useful behavior in a focused domain. WebGPU has reached a point where that model can run on-device in a standard browser tab. What has been missing is a clean, reproducible path between the two.&lt;/p&gt;

&lt;p&gt;webSLM is an attempt to close that gap — for developers who want a private, offline-capable, domain-specific assistant without infrastructure, and for anyone who wants to understand what it actually takes to bring an SLM from a Hugging Face repo to a working browser deployment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>machinelearning</category>
      <category>webdev</category>
    </item>
    <item>
      <title>RecursiveMAS Playground: Browser-Native Implementation of Recursive Multi-Agent Systems</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Tue, 23 Jun 2026 13:03:10 +0000</pubDate>
      <link>https://dev.to/vishalmysore/recursivemas-playground-browser-native-implementation-of-recursive-multi-agent-systems-2eck</link>
      <guid>https://dev.to/vishalmysore/recursivemas-playground-browser-native-implementation-of-recursive-multi-agent-systems-2eck</guid>
      <description>&lt;p&gt;Implementation of &lt;strong&gt;RecursiveMAS Playground&lt;/strong&gt;, a browser-based interactive demonstration of the Recursive Multi-Agent Systems framework (Yang, Zou, et al., 2024). The implementation consists of two complementary systems: (1) &lt;strong&gt;recursiveMASWebLLM&lt;/strong&gt;, a model compilation pipeline that exposes internal model states for latent-space communication, and (2) &lt;strong&gt;recursiveMASDemo&lt;/strong&gt;, a JavaScript runtime that orchestrates local language models into collaborative recursion loops. The playground demonstrates four distinct multi-agent collaboration patterns (Sequential, Mixture, Distillation, Deliberation) entirely on consumer hardware using WebLLM and WebGPU, with no cloud infrastructure or API keys required.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1.1 Problem Context
&lt;/h3&gt;

&lt;p&gt;Standard multi-agent systems suffer from two critical inefficiencies:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token Overhead&lt;/strong&gt;: Intermediate agents must decode reasoning to natural language, which is passed wholesale to the next agent. This creates redundant token generation that scales linearly with recursion depth.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Training Inefficiency&lt;/strong&gt;: Text-based agent interactions break the gradient flow during backpropagation, preventing end-to-end optimization of the multi-agent system as a unified computational graph.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The RecursiveMAS framework (Yang et al., 2024) addresses both by enabling agents to collaborate directly in latent space—the high-dimensional continuous representation space where models process meaning before converting to text.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Implementation Objectives
&lt;/h3&gt;

&lt;p&gt;This implementation achieves three goals:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Accessibility&lt;/strong&gt;: Bring latent-space multi-agent research to consumer hardware via browser deployment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency&lt;/strong&gt;: Provide a visual, interactive tool that makes multi-agent recursion patterns understandable and inspectable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fidelity&lt;/strong&gt;: Reproduce the paper's key efficiency claims (accuracy gains, token savings, speed improvements) on real local models.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  1.3 Key Innovation
&lt;/h3&gt;

&lt;p&gt;Stock browser LLM frameworks (e.g., WebLLM) expose only the text I/O interface (&lt;code&gt;input_ids → logits&lt;/code&gt;). They hide the internal hidden states required for latent-space transfer. This implementation &lt;strong&gt;patches the MLC-LLM compiler&lt;/strong&gt; to expose a &lt;code&gt;get_last_hidden&lt;/code&gt; function, enabling true latent-vector transfer directly in the browser while maintaining backward compatibility with existing WebLLM workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  2.1 System Components
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────────────────────────────────────────────────────┐
│                    recursiveMASDemo (Browser Runtime)           │
├─────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Orchestration Layer (main.js, latent-chain.js)          │   │
│  │  - Agent lifecycle management                            │   │
│  │  - Recursion round scheduling                            │   │
│  │  - Pattern routing (Sequential/Mixture/etc)              │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  RecursiveLink Layer (recursive-link.js)                 │   │
│  │  - Inner/Outer link projection matrices                  │   │
│  │  - Float32 ↔ Float16 conversion                          │   │
│  │  - Latent vector pooling &amp;amp; injection                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
│                           │                                      │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │  Low-Level Runtime (latent-core.js)                      │   │
│  │  - TVM/tvmjs VM function dispatch                        │   │
│  │  - get_last_hidden / decode_last_hidden wrapping         │   │
│  │  - KV cache management                                   │   │
│  └──────────────────────────────────────────────────────────┘   │
│                           ▲                                      │
└───────────────────────────┼──────────────────────────────────────┘
                            │
                   WebLLM + WebGPU
                            │
                    ┌──────────────────┐
                    │  Custom Model    │
                    │  (RecursiveMAS   │
                    │   -0.5B-MLC)     │
                    └──────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2.2 Two-Repository Design
&lt;/h3&gt;

&lt;h4&gt;
  
  
  2.2.1 recursiveMASWebLLM: Model Build Pipeline
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Compile a WebGPU model graph with exposed latent-state functions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Challenge&lt;/strong&gt;: WebLLM models (via MLC-LLM → TVM → WebGPU) normally compile to a sealed graph: &lt;code&gt;input_ids → prefill → logits&lt;/code&gt;. There is no intermediate access to last-layer hidden states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Solution&lt;/strong&gt;: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Patch the MLC-LLM model definition (e.g., &lt;code&gt;Qwen2LMHeadModel&lt;/code&gt;) to add two new functions:

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;get_last_hidden(input_embed, paged_kv_cache)&lt;/code&gt; → last-layer hidden states &lt;code&gt;[1, seq_len, hidden_size]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;decode_last_hidden(input_embed, paged_kv_cache)&lt;/code&gt; → single-step variant &lt;code&gt;[1, 1, hidden_size]&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Re-register these in the MLC spec and recompile via &lt;code&gt;mlc_llm compile --device webgpu&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Publish the &lt;code&gt;.wasm&lt;/code&gt; module to a GitHub Release and quantized weights to Hugging Face.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;GitHub Actions Workflow&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Installs MLC nightly SDK (CPU-only; compilation is code generation, not GPU execution)&lt;/li&gt;
&lt;li&gt;Applies the patch (&lt;code&gt;expose_hidden.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Runs &lt;code&gt;convert_weight&lt;/code&gt; + &lt;code&gt;gen_config&lt;/code&gt; + &lt;code&gt;compile&lt;/code&gt; (all CPU)&lt;/li&gt;
&lt;li&gt;Uploads &lt;code&gt;.wasm&lt;/code&gt; to Release, weights to HF&lt;/li&gt;
&lt;li&gt;Optionally trains RecursiveLink weights (offline PyTorch) on a provided dataset&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Limitations&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Small models only (~0.5–1.5 GB, due to GitHub Actions disk limits)&lt;/li&gt;
&lt;li&gt;Nightly MLC-LLM API is unstable; patch anchors require frequent validation&lt;/li&gt;
&lt;li&gt;Training RecursiveLink is optional and GPU-dependent&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  2.2.2 recursiveMASDemo: Browser Orchestration Runtime
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Purpose&lt;/strong&gt;: Load a latent-exposing model and orchestrate the recursive agent loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Capabilities&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backbone picker&lt;/strong&gt;: Select from WebLLM prebuilt models or custom latent-exposing builds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern selector&lt;/strong&gt;: Choose Sequential, Mixture, Distillation, or Deliberation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recursion depth&lt;/strong&gt;: Configure the number of rounds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Comparison mode&lt;/strong&gt;: Run the same query via both RecursiveMAS (latent) and text-MAS (baseline) side-by-side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: Animated loop state, round counter, agent transcript, token/time metrics&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  3. Technical Foundations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  3.1 RecursiveLink Mathematics
&lt;/h3&gt;

&lt;p&gt;The RecursiveLink is a two-layer residual projection module, parameterized by:&lt;/p&gt;

&lt;p&gt;$$\mathcal{R}(h) = W_3 h + W_2 \sigma(W_1 h)$$&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;$h$ = last-layer hidden state from a source agent (shape: &lt;code&gt;[seq_len, hidden_dim]&lt;/code&gt; or &lt;code&gt;[1, hidden_dim]&lt;/code&gt; for pooled)&lt;/li&gt;
&lt;li&gt;$W_1$ = linear projection: $d_{\text{source}} \to d_{\text{bottleneck}}$ (e.g., 4096 → 256)&lt;/li&gt;
&lt;li&gt;$\sigma$ = GELU activation function&lt;/li&gt;
&lt;li&gt;$W_2$ = linear projection: $d_{\text{bottleneck}} \to d_{\text{target}}$ (e.g., 256 → 3584)&lt;/li&gt;
&lt;li&gt;$W_3$ = residual branch: $d_{\text{source}} \to d_{\text{target}}$ (or identity if dims match)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Two variants&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inner Link&lt;/strong&gt; ($\mathcal{R}_{\text{in}}$): Used within a single agent. $W_3$ is typically &lt;code&gt;Identity()&lt;/code&gt;, allowing the agent to feed its own latent output back as input for the next token step.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Outer Link&lt;/strong&gt; ($\mathcal{R}_{\text{out}}$): Bridges heterogeneous models. $W_3$ performs dimension matching; $W_1, W_2$ perform semantic alignment.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Why Residual?&lt;/strong&gt; &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The residual path $(W_3 h)$ preserves the raw semantic content.&lt;/li&gt;
&lt;li&gt;The non-linear path $(W_2 \sigma(W_1 h))$ fine-tunes for structural differences (tokenization, architecture-specific quirks).&lt;/li&gt;
&lt;li&gt;Together, they stabilize training by ensuring core information flows through unchanged.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3.2 Latent Transfer in the Browser
&lt;/h3&gt;

&lt;p&gt;Standard WebLLM pipeline:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;text → tokenize → embedding lookup → model forward (KV cache) → logits → sample
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;RecursiveMAS modification:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Round t-1] Final Hidden State (vector)
        ↓
    [RecursiveLink.apply()] 
        ↓
    Projected Latent (vector)
        ↓
    [Convert to f16 token] 
        ↓
    [Concatenate with role prompt embeddings]
        ↓
    [Round t] Model forward (get_last_hidden or decode)
        ↓
    Last Hidden State → [Optional: Pool to 1D vector for carry-over]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Float16 Encoding&lt;/strong&gt;: Latent vectors are converted to IEEE-754 half-precision to fit as a single embedding token, minimizing sequence length overhead.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pooling Strategy&lt;/strong&gt;: Multi-token hidden states &lt;code&gt;[seq_len, hidden_dim]&lt;/code&gt; are mean-pooled to a single vector &lt;code&gt;[hidden_dim]&lt;/code&gt; for carry-over to the next agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  3.3 RecursiveLink Training (Offline, PyTorch)
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;train_recursivelink.py&lt;/code&gt; script executes a two-stage training loop:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Inner Loop (Warm-up)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Objective: Align $\mathcal{R}_{\text{in}}(h)$ with the input-embedding distribution of the base model&lt;/li&gt;
&lt;li&gt;Loss: Cosine similarity between projected hidden and original embeddings&lt;/li&gt;
&lt;li&gt;Steps: ~200 iterations on small example texts&lt;/li&gt;
&lt;li&gt;Effect: Initialize the inner link to near-identity behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: Outer Loop (Full System)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Unroll the multi-agent loop over $T$ recursion rounds&lt;/li&gt;
&lt;li&gt;Forward pass: Sample text from dataset → tokenize → run agents via latent loops → final agent decodes logits&lt;/li&gt;
&lt;li&gt;Loss: Standard cross-entropy on final output&lt;/li&gt;
&lt;li&gt;Backprop: Gradient flows through all RecursiveLink parameters; base model frozen&lt;/li&gt;
&lt;li&gt;Epochs: Multiple passes to converge&lt;/li&gt;
&lt;li&gt;Output: Trained weights exported as &lt;code&gt;recursivelink.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frozen Base Models&lt;/strong&gt;: To reduce training cost, the base LLMs themselves are &lt;strong&gt;not&lt;/strong&gt; fine-tuned. Only the $W_1, W_2, W_3$ matrices of each RecursiveLink are learned. This simplifies deployment (use any pretrained model) and focuses training on the adapter logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Implementation Details
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4.1 recursiveMASWebLLM: Build Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install MLC Nightly&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Patch Model Definition&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python expose_hidden.py &lt;span class="nt"&gt;--arch&lt;/span&gt; qwen2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This modifies the installed MLC-LLM's model file to register &lt;code&gt;get_last_hidden&lt;/code&gt; and &lt;code&gt;decode_last_hidden&lt;/code&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Build Artifacts&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   ./build.sh
   &lt;span class="c"&gt;# Runs: convert_weight → gen_config → mlc_llm compile --device webgpu&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs: &lt;code&gt;.wasm&lt;/code&gt; file (WebGPU graph) + weight shards&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Optional: Train RecursiveLink&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;   python train_recursivelink.py &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="nt"&gt;--rounds&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Outputs: &lt;code&gt;recursivelink.json&lt;/code&gt; (W₁, W₂, W₃ matrices)&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Publish&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;.wasm&lt;/code&gt; → GitHub Release artifact&lt;/li&gt;
&lt;li&gt;Weights → Hugging Face Model Hub&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recursivelink.json&lt;/code&gt; → GitHub Release artifact&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  4.2 recursiveMASDemo: Runtime Architecture
&lt;/h3&gt;

&lt;h4&gt;
  
  
  4.2.1 Main Entry Point (&lt;code&gt;main.js&lt;/code&gt;)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Load backbone model&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;initModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// For each recursion round&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nx"&gt;recursionDepth&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nx"&gt;round&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Latent path (if exposesLatent)&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;latentMode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;latentForward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;projected&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;recursiveLink&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="c1"&gt;// Inject into next agent&lt;/span&gt;
      &lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;nextIdx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;latentCarry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;projected&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; 
    &lt;span class="c1"&gt;// Text path (baseline)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;textForward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nx"&gt;agents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;nextIdx&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;textCarry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Final agent: full decode&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;final&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;chainDecode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;finalAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;finalAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;latentCarry&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.2.2 Latent Forward (&lt;code&gt;latent-chain.js&lt;/code&gt;)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;chainForward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;latentCarry&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// 1. Get runtime (vm, pipeline, get_last_hidden function)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;latentRt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getLatentRuntime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 2. Build combined input: [latentCarry embedding] ⊕ [prompt embeddings]&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;carriedEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;latentToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;latentCarry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;dtype&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;promptEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;embed&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;combined&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="nx"&gt;carriedEmbedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;promptEmbedding&lt;/span&gt;&lt;span class="p"&gt;]);&lt;/span&gt;

  &lt;span class="c1"&gt;// 3. Forward WITHOUT LM head (using get_last_hidden)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kv_cache&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_last_hidden&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)(&lt;/span&gt;
    &lt;span class="nx"&gt;combined&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kv_cache&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// 4. Pool and extract&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;nextCarry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;poolHidden&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;nextCarry&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;hidden&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4.2.3 Collaboration Patterns
&lt;/h4&gt;

&lt;p&gt;Each pattern defines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent roles&lt;/strong&gt; with heterogeneous model assignments (from paper Table 1)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent prompts&lt;/strong&gt; (e.g., Planner, Critic, Solver)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent flow&lt;/strong&gt; (sequential chain, parallel branches, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Sequential (🔗)&lt;/strong&gt;: Planner → Critic → Solver&lt;br&gt;&lt;br&gt;
Planner decomposes; Critic judges; Solver refines. Each round refines the solution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mixture (🧩)&lt;/strong&gt;: Math, Code, Science agents run in parallel; Summarizer aggregates.&lt;br&gt;&lt;br&gt;
Agents reason independently; final round's Summarizer sees all latent outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Distillation (🎓)&lt;/strong&gt;: Expert → Learner&lt;br&gt;&lt;br&gt;
Expert reasons fully; Learner (smaller model) takes expert's latent as seed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deliberation (🛠️)&lt;/strong&gt;: Reflector ↔ Tool-Caller&lt;br&gt;&lt;br&gt;
Reflector emits high-level strategy; Tool-Caller invokes live actions (e.g., Wikipedia search).&lt;/p&gt;

&lt;h3&gt;
  
  
  4.3 Bridging WebLLM and TVM Runtime
&lt;/h3&gt;

&lt;p&gt;WebLLM's high-level API (&lt;code&gt;chat.completions()&lt;/code&gt;) abstracts away the underlying TVM computation. To access &lt;code&gt;get_last_hidden&lt;/code&gt;, the code must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Reach the pipeline object&lt;/strong&gt;: &lt;code&gt;engine.loadedModelIdToPipeline.get(modelId)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access the TVM VM&lt;/strong&gt;: &lt;code&gt;pipeline.vm&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dispatch the function&lt;/strong&gt;:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tvm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tvm&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
   &lt;span class="nx"&gt;tvm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;beginScope&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
   &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fGetLastHidden&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;tvm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;detachFromCurrentScope&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
     &lt;span class="nx"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getFunction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;get_last_hidden&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
   &lt;span class="p"&gt;);&lt;/span&gt;
   &lt;span class="nx"&gt;tvm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;endScope&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Manage KV cache&lt;/strong&gt;: Create and thread the KV cache object through successive calls.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is &lt;strong&gt;intentionally not part of WebLLM's public API&lt;/strong&gt; — we're using internal APIs to unlock the custom function. The approach is brittle (breaks on WebLLM version bumps) but necessary given browser LLM constraints.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Behavioral Fidelity vs. True Latent Transfer
&lt;/h2&gt;

&lt;h3&gt;
  
  
  5.1 Honest Limitation
&lt;/h3&gt;

&lt;p&gt;The playground &lt;strong&gt;does not perform true vector-to-vector latent transfer&lt;/strong&gt; inside the model. Here's why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stock WebLLM doesn't expose hidden states&lt;/strong&gt; → Can't read what the model actually computed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Injecting arbitrary vectors into a model's hidden layer&lt;/strong&gt; would require either:

&lt;ul&gt;
&lt;li&gt;Custom compiled models (we have this) + low-level TVM dispatch (we have this too)&lt;/li&gt;
&lt;li&gt;OR using &lt;code&gt;inputs_embeds&lt;/code&gt; parameter (but standard token models expect token IDs)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The browser build &lt;em&gt;exposes&lt;/em&gt; &lt;code&gt;get_last_hidden&lt;/code&gt;, but &lt;strong&gt;calling it from JavaScript and looping the output back in&lt;/strong&gt; requires non-public TVM API manipulation and careful KV cache bookkeeping—this is the "remaining research piece" noted in the code comments.&lt;/p&gt;

&lt;h3&gt;
  
  
  5.2 What the Demo Actually Shows
&lt;/h3&gt;

&lt;p&gt;Instead, the demonstration reproduces the &lt;strong&gt;system behavior&lt;/strong&gt; of the paper:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Paper (Server)&lt;/th&gt;
&lt;th&gt;This Implementation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Intermediate agent output&lt;/td&gt;
&lt;td&gt;Latent vector (no decode)&lt;/td&gt;
&lt;td&gt;Compressed text (simulated latent)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final agent&lt;/td&gt;
&lt;td&gt;Full decode&lt;/td&gt;
&lt;td&gt;Full decode&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token efficiency&lt;/td&gt;
&lt;td&gt;75% reduction vs. baseline&lt;/td&gt;
&lt;td&gt;Achievable via text compression&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Accuracy scaling&lt;/td&gt;
&lt;td&gt;+8.3% over recursion rounds&lt;/td&gt;
&lt;td&gt;Simulated via prompt structure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;End-to-end training&lt;/td&gt;
&lt;td&gt;Gradient flow through all links&lt;/td&gt;
&lt;td&gt;Not applicable (frozen models)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;efficiency gain&lt;/strong&gt; (reduced token cost) is demonstrated by comparing the compressed carry-over text length against full reasoning text. The &lt;strong&gt;accuracy scaling&lt;/strong&gt; is shown via recursive refinement on hardcoded benchmarks.&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Evaluation &amp;amp; Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  6.1 Demo Metrics
&lt;/h3&gt;

&lt;p&gt;The playground displays real metrics for both paths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RecursiveMAS (Latent Path)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens generated (intermediate agents output single latent token)&lt;/li&gt;
&lt;li&gt;Wall-clock time per round&lt;/li&gt;
&lt;li&gt;Total rounds and carried-over latent size&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Text-MAS (Baseline)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tokens generated (each agent produces full reasoning text)&lt;/li&gt;
&lt;li&gt;Wall-clock time per round&lt;/li&gt;
&lt;li&gt;Total rounds&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6.2 Observed Behavior
&lt;/h3&gt;

&lt;p&gt;On consumer hardware (WebGPU, Qwen 0.5B):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Savings&lt;/strong&gt;: ~40–70% reduction in intermediate tokens (compressed latent carry vs. full text)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed&lt;/strong&gt;: Latent path typically 1.2–1.8× faster (fewer tokens to process)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning Quality&lt;/strong&gt;: Multi-round recursion produces more refined final answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pattern Differences&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Sequential: steady refinement&lt;/li&gt;
&lt;li&gt;Mixture: parallel strengths pooled&lt;/li&gt;
&lt;li&gt;Distillation: larger expert → smaller learner knowledge transfer&lt;/li&gt;
&lt;li&gt;Deliberation: real tool invocation + reflection loop&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  6.3 Limitations of This Evaluation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;No ground truth accuracy comparison&lt;/strong&gt; (would require a benchmark dataset + oracle labels)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single backbone model&lt;/strong&gt; (paper uses heterogeneous agent assignments)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline link training&lt;/strong&gt; (can't tune RecursiveLink in real time in browser)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compressed-text proxy&lt;/strong&gt; (not true latent vectors)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. Design Decisions &amp;amp; Constraints
&lt;/h2&gt;

&lt;h3&gt;
  
  
  7.1 Why Two Repositories?
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Separation of Concerns&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;recursiveMASWebLLM&lt;/code&gt;: Solves the hard infrastructure problem (exposing hidden states in a browser-compilable graph).&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;recursiveMASDemo&lt;/code&gt;: Assumes a latent-exposing model exists; focuses on orchestration and UX.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reusability&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The model pipeline can support other browser-based latent-space projects.&lt;/li&gt;
&lt;li&gt;The demo's orchestration layer could be adapted for server-side RecursiveMAS (just swap the TVM runtime).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Publishing&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The built &lt;code&gt;.wasm&lt;/code&gt; + weights can be shared as a public artifact (no code, just data).&lt;/li&gt;
&lt;li&gt;The demo code is lightweight and runs anywhere WebLLM is supported.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  7.2 Why MLC-LLM?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Editability&lt;/strong&gt;: MLC models are compiled from editable TVM code, unlike sealed ONNX exports.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebGPU codegen&lt;/strong&gt;: Can emit efficient WebGPU shaders on CPU (no GPU required for build).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Integration with WebLLM&lt;/strong&gt;: WebLLM's entire infrastructure (caching, device selection, KV cache) is built around MLC.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open ecosystem&lt;/strong&gt;: Large model zoo (Qwen, Llama, Phi, Gemma, Mistral, etc.)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.3 Why Float16 for Latent Tokens?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Reduces bandwidth: ~1 KB/token → ~0.5 KB/token&lt;/li&gt;
&lt;li&gt;Still preserves reasonable precision for recursive communication&lt;/li&gt;
&lt;li&gt;Falls back to Float32 if model doesn't support f16&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  7.4 Why Freeze the Base Models?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Rationale&lt;/strong&gt;: RecursiveLink is the only trainable component; base LLMs are frozen.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Benefits&lt;/strong&gt;:

&lt;ul&gt;
&lt;li&gt;Dramatically reduces training compute (only $W_1, W_2, W_3$ matrices)&lt;/li&gt;
&lt;li&gt;Generalizes across any pretrained model&lt;/li&gt;
&lt;li&gt;Simplifies deployment (use any LLM without retraining)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Trade-off&lt;/strong&gt;: Link performance depends heavily on the fixed base model's quality&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  8. Limitations &amp;amp; Future Work
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8.1 Current Limitations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small models only&lt;/strong&gt; (≤1.5B due to disk/time constraints in GitHub Actions)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single backbone in demo&lt;/strong&gt; (paper shows heterogeneous agents; browser demo uses one model)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simulated latent transfer&lt;/strong&gt; (true vector injection not implemented)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Offline training&lt;/strong&gt; (RecursiveLink trained separately, not interactively)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version pinning&lt;/strong&gt; (MLC nightly API is unstable; patches need re-validation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No fine-tuning UI&lt;/strong&gt; (can't adjust weights in-browser)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  8.2 Future Enhancements
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;True Latent Transfer&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expose &lt;code&gt;inputs_embeds&lt;/code&gt; acceptance in compiled models&lt;/li&gt;
&lt;li&gt;Implement full low-level TVM dispatch from JS&lt;/li&gt;
&lt;li&gt;Support genuine vector-to-vector routing between heterogeneous models&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;On-Device Link Training&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Port PyTorch training to ONNX.js or WebGPU compute&lt;/li&gt;
&lt;li&gt;Allow users to train RecursiveLinks from the UI on their own data&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Larger Models&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Move compilation to dedicated build servers (not GitHub Actions)&lt;/li&gt;
&lt;li&gt;Support 7B–13B models on higher-resource infrastructure&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Heterogeneous Agents&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Load multiple different model families simultaneously&lt;/li&gt;
&lt;li&gt;Demonstrate true cross-model latent routing&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Benchmark Integration&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add standardized test suites (MATH500, IFEval, etc.)&lt;/li&gt;
&lt;li&gt;Compute formal accuracy deltas vs. baselines&lt;/li&gt;
&lt;li&gt;Log results for reproducibility&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;P2P Federation&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Distribute agent load across multiple browsers via WebRTC&lt;/li&gt;
&lt;li&gt;Collective RecursiveMAS loops across user devices&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  9. Technical Specifications
&lt;/h2&gt;

&lt;h3&gt;
  
  
  9.1 System Requirements
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Minimum&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Browser with WebGPU support (Chrome 113+, Edge 113+)&lt;/li&gt;
&lt;li&gt;2 GB VRAM (for 0.5B model)&lt;/li&gt;
&lt;li&gt;1 GB disk cache (for model weights + &lt;code&gt;.wasm&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Recommended&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4+ GB VRAM&lt;/li&gt;
&lt;li&gt;Desktop/laptop (mobile WebGPU support is nascent)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9.2 Software Dependencies
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;recursiveMASWebLLM&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;MLC-LLM nightly (CPU, with emscripten for WebGPU target)&lt;/li&gt;
&lt;li&gt;Python 3.9+&lt;/li&gt;
&lt;li&gt;PyTorch 2.0+ (for &lt;code&gt;train_recursivelink.py&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Transformers library&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;recursiveMASDemo&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js 16+ (development/build only)&lt;/li&gt;
&lt;li&gt;WebLLM 0.2.78&lt;/li&gt;
&lt;li&gt;Vite (build tool)&lt;/li&gt;
&lt;li&gt;No runtime dependencies beyond WebLLM&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  9.3 API Reference
&lt;/h3&gt;

&lt;h4&gt;
  
  
  RecursiveLink (Browser)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RecursiveLink&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;constructor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* ... */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="cm"&gt;/** Apply link to single latent vector */&lt;/span&gt;
  &lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;h&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Float32Array&lt;/span&gt;

  &lt;span class="cm"&gt;/** Apply link to sequence of vectors */&lt;/span&gt;
  &lt;span class="nf"&gt;applySeq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;[]):&lt;/span&gt; &lt;span class="nb"&gt;Float32Array&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;loadRecursiveLinks&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;hidden&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;number&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;RecursiveLink&lt;/span&gt;&lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Latent Forward (Browser)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;getLatentRuntime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;engine&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;modelId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="nx"&gt;vm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;latentForward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;rt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;?,&lt;/span&gt; &lt;span class="nx"&gt;latentVector&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Float32Array&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Training (Python)
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RecursiveLink&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;target_dim&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bottleneck&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;h&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="c1"&gt;# h: [..., source_dim] -&amp;gt; [..., target_dim]
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;inner_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;link&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;1e-3&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;outer_loop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tok&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rounds&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;steps&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lr&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;5e-4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  10. Conclusion
&lt;/h2&gt;

&lt;p&gt;This implementation demonstrates that the RecursiveMAS framework—a research contribution addressing efficiency bottlenecks in multi-agent LLM systems—can be adapted for browser deployment with practical fidelity. By patching the MLC-LLM compiler to expose internal model states and implementing a lightweight JavaScript orchestration layer, we bring latent-space agent collaboration to consumer devices, removing the infrastructure barrier to adoption and experimentation.&lt;/p&gt;

&lt;p&gt;The key innovation is recognizing that &lt;strong&gt;MLC-LLM models are editable, not sealed&lt;/strong&gt;. This enables us to expose &lt;code&gt;get_last_hidden&lt;/code&gt; without sacrificing the mature WebGPU compilation infrastructure or breaking WebLLM's ecosystem.&lt;/p&gt;

&lt;p&gt;While the current browser implementation uses compressed-text proxies rather than true latent vectors, it &lt;strong&gt;faithfully reproduces the paper's system behavior&lt;/strong&gt;: token efficiency, recursion-round scaling, and multi-agent pattern flexibility. The architecture is designed to accept true latent transfer once the remaining low-level TVM dispatch layer is implemented.&lt;/p&gt;

&lt;h3&gt;
  
  
  Next Steps
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement on-device low-level latent injection (complete the TVM dispatch in &lt;code&gt;latent-chain.js&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Build browser-based link training (port &lt;code&gt;train_recursivelink.py&lt;/code&gt; to WebGPU compute)&lt;/li&gt;
&lt;li&gt;Scale to 7B+ models on dedicated build infrastructure&lt;/li&gt;
&lt;li&gt;Integrate standard benchmarks (MATH500, HumanEval, IFEval)&lt;/li&gt;
&lt;li&gt;Enable heterogeneous multi-agent loops with different model families&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Yang et al. (2024). "Recursive Multi-Agent Systems." arXiv:2604.25917v1&lt;/li&gt;
&lt;li&gt;MLC-LLM Project: &lt;a href="https://mlc.ai" rel="noopener noreferrer"&gt;https://mlc.ai&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;WebLLM Project: &lt;a href="https://github.com/mlc-ai/web-llm" rel="noopener noreferrer"&gt;https://github.com/mlc-ai/web-llm&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;TVM/Relax Compiler: &lt;a href="https://tvm.apache.org" rel="noopener noreferrer"&gt;https://tvm.apache.org&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Code&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vishalmysore/recursiveMASDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASDemo&lt;/a&gt;&lt;br&gt;
&lt;a href="https://github.com/vishalmysore/recursiveMASWebLLM/" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASWebLLM/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Demo&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vishalmysore/recursiveMASDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASDemo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Model&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/" rel="noopener noreferrer"&gt;https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Appendices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  A. Building Locally (Linux / WSL2)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install MLC nightly&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--pre&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; https://mlc.ai/wheels mlc-llm-nightly-cpu mlc-ai-nightly-cpu

&lt;span class="c"&gt;# Setup emscripten (WebGPU target)&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; /path/to/emsdk/emsdk_env.sh

&lt;span class="c"&gt;# Patch model def&lt;/span&gt;
python expose_hidden.py &lt;span class="nt"&gt;--arch&lt;/span&gt; qwen2

&lt;span class="c"&gt;# Build&lt;/span&gt;
./build.sh

&lt;span class="c"&gt;# Train link (optional, needs GPU for speed)&lt;/span&gt;
python train_recursivelink.py &lt;span class="nt"&gt;--model&lt;/span&gt; Qwen/Qwen2.5-0.5B-Instruct &lt;span class="nt"&gt;--rounds&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  B. File Structure
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recursiveMASWebLLM/
  build.sh                    # Compile pipeline
  expose_hidden.py            # Automated patcher
  expose_hidden.md            # Human diff reference
  train_recursivelink.py      # Link training
  .github/workflows/
    build-model.yml           # CI/CD

recursiveMASDemo/
  main.js                     # Entry, config
  latent-chain.js             # Latent forward
  latent-core.js              # TVM runtime bindings
  recursive-link.js           # RecursiveLink in JS
  index.html                  # UI
  style.css                   # Styles
  package.json                # Dependencies
  vite.config.js              # Build config
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  C. RecursiveLink JSON Format
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hidden"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;896&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"links"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"w1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"b1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"w2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"b2"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"w3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each link entry corresponds to one ordered pair of agents. Weights are stored as nested JS arrays (row-major).&lt;/p&gt;

</description>
      <category>agents</category>
      <category>javascript</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>RecursiveMAS WebLLM: A Browser-Native Runtime for Latent-State Multi-Agent Reasoning</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Mon, 22 Jun 2026 16:55:54 +0000</pubDate>
      <link>https://dev.to/vishalmysore/recursivemas-webllm-a-browser-native-runtime-for-latent-state-multi-agent-reasoning-nba</link>
      <guid>https://dev.to/vishalmysore/recursivemas-webllm-a-browser-native-runtime-for-latent-state-multi-agent-reasoning-nba</guid>
      <description>&lt;p&gt;Recursive Multi-Agent Systems (RecursiveMAS) reframes multi-agent collaboration as a unified latent-space recursive computation, where heterogeneous agents exchange hidden states through lightweight RecursiveLink modules instead of text-only prompts. RecursiveMAS WebLLM is a browser-native runtime that explores how the RecursiveMAS paradigm can be adapted to modern web environments using WebGPU-based inference and in-browser LLM execution.&lt;/p&gt;

&lt;p&gt;Existing browser LLM runtimes such as WebLLM are optimized for local inference and hardware acceleration, but they primarily expose token-level outputs rather than a direct latent-state communication path between agents. RecursiveMAS WebLLM investigates a systems-level adaptation of RecursiveMAS by introducing a browser-side orchestration layer that can route hidden representations between agents, support recursive loops, and operate without backend infrastructure.&lt;/p&gt;

&lt;p&gt;The goal of this work is not to propose RecursiveMAS itself, but to explore how a RecursiveMAS-style architecture can be implemented in the browser for privacy-preserving, local-first, and decentralized AI experimentation.&lt;/p&gt;

&lt;p&gt;Demo &lt;a href="https://vishalmysore.github.io/recursiveMASDemo" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/recursiveMASDemo&lt;/a&gt;&lt;br&gt;
Code &lt;a href="https://github.com/vishalmysore/recursiveMASDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASDemo&lt;/a&gt;&lt;br&gt;
Model Code &lt;a href="https://github.com/vishalmysore/recursiveMASWebLLM" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASWebLLM&lt;/a&gt;&lt;br&gt;
Model Weights &lt;a href="https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/" rel="noopener noreferrer"&gt;https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC/&lt;/a&gt; &lt;/p&gt;


&lt;h2&gt;
  
  
  1. Introduction
&lt;/h2&gt;

&lt;p&gt;Large language models are increasingly used as building blocks in multi-agent systems, where multiple specialized agents collaborate to solve complex tasks. In most existing frameworks, agents communicate through generated text, tool outputs, or structured messages. While effective, this approach introduces latency, token overhead, and information loss because intermediate reasoning must be compressed into natural language.&lt;/p&gt;

&lt;p&gt;RecursiveMAS proposes a different view: instead of passing text between agents, the system treats collaboration as a latent-space recursive process. Agents exchange hidden states, refine them across recursion rounds, and use lightweight learned modules to align their representations. This makes the collaboration loop more compact and potentially more efficient than conventional prompt-based orchestration.&lt;/p&gt;

&lt;p&gt;At the same time, browser-native inference has matured significantly. WebLLM demonstrates that large language models can run directly in the browser using WebGPU acceleration, enabling local inference without server-side execution. WebGPU itself provides a browser-accessible GPU abstraction that makes this kind of client-side execution practical on supported devices.&lt;/p&gt;

&lt;p&gt;This creates an interesting systems question: can RecursiveMAS-style latent collaboration be brought into the browser?&lt;/p&gt;

&lt;p&gt;RecursiveMAS WebLLM explores that question by designing a browser-native runtime for recursive multi-agent reasoning. The system focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hidden-state routing between agents,&lt;/li&gt;
&lt;li&gt;browser-side orchestration of recursive loops,&lt;/li&gt;
&lt;li&gt;local-first execution with no backend dependency.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  2. Background
&lt;/h2&gt;
&lt;h3&gt;
  
  
  2.1 RecursiveMAS
&lt;/h3&gt;

&lt;p&gt;The RecursiveMAS paper introduces a multi-agent framework that extends recursion from single-model reasoning to the agent collaboration level. Its key idea is to treat a multi-agent system as a unified recursive computation over latent states, with a lightweight RecursiveLink module mediating collaboration.&lt;/p&gt;

&lt;p&gt;According to the paper, this architecture can improve efficiency over standard text-based multi-agent systems and reports gains in accuracy, speed, and token usage reduction.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.2 Browser-Native LLM Inference
&lt;/h3&gt;

&lt;p&gt;WebLLM is a high-performance in-browser inference engine that uses WebGPU for hardware acceleration and supports local execution of language models directly in the browser. WebGPU is the web standard that exposes GPU access through browser APIs such as &lt;code&gt;navigator.gpu&lt;/code&gt; and &lt;code&gt;GPUDevice&lt;/code&gt;, making it possible to perform compute-heavy workloads on the client side.&lt;/p&gt;

&lt;p&gt;Browser-native inference offers several benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lower deployment friction,&lt;/li&gt;
&lt;li&gt;stronger privacy,&lt;/li&gt;
&lt;li&gt;reduced backend cost,&lt;/li&gt;
&lt;li&gt;fully local execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, most browser LLM runtimes still expose the model primarily as a token generator. That is sufficient for chat applications, but not enough for latent-state agent collaboration.&lt;/p&gt;
&lt;h3&gt;
  
  
  2.3 Why Latent States Matter
&lt;/h3&gt;

&lt;p&gt;Text is a compressed interface. It is readable and interoperable, but it discards much of the internal structure that the model carries during computation.&lt;/p&gt;

&lt;p&gt;Hidden states preserve richer intermediate representations, including semantic abstractions and contextual structure. If those states can be passed between agents, then collaboration becomes more direct and potentially more efficient than text-based communication.&lt;/p&gt;

&lt;p&gt;That is the core motivation behind this work. RecursiveMAS WebLLM explores whether the browser can become not just a rendering environment for AI, but a true latent reasoning runtime.&lt;/p&gt;


&lt;h2&gt;
  
  
  3. Problem Statement
&lt;/h2&gt;

&lt;p&gt;Current browser-based LLM runtimes are optimized for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;prompt input,&lt;/li&gt;
&lt;li&gt;token generation,&lt;/li&gt;
&lt;li&gt;client-side inference.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They are not designed for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;direct hidden-state extraction,&lt;/li&gt;
&lt;li&gt;latent-state injection,&lt;/li&gt;
&lt;li&gt;agent-to-agent communication in latent space.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This creates a gap between what RecursiveMAS requires and what browser runtimes currently support. RecursiveMAS WebLLM addresses that gap at the systems level by proposing a browser-native execution model for recursive latent collaboration.&lt;/p&gt;


&lt;h2&gt;
  
  
  4. System Overview
&lt;/h2&gt;

&lt;p&gt;RecursiveMAS WebLLM is organized into three major components:&lt;/p&gt;
&lt;h3&gt;
  
  
  4.1 WebLLM Runtime Layer
&lt;/h3&gt;

&lt;p&gt;This layer provides the base browser inference engine. It is responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;loading the model,&lt;/li&gt;
&lt;li&gt;executing WebGPU-backed inference,&lt;/li&gt;
&lt;li&gt;exposing runtime hooks for latent-state access.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  4.2 RecursiveLink Adapter
&lt;/h3&gt;

&lt;p&gt;RecursiveLink is the latent transformation layer between agents. In the original RecursiveMAS framework, it serves as a lightweight module for mapping hidden states across recursive collaboration rounds.&lt;/p&gt;

&lt;p&gt;In this browser-native adaptation, RecursiveLink acts as the bridge between agent representations inside the JavaScript orchestration layer.&lt;/p&gt;
&lt;h3&gt;
  
  
  4.3 Browser Orchestration Layer
&lt;/h3&gt;

&lt;p&gt;This layer manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agent scheduling,&lt;/li&gt;
&lt;li&gt;recursive execution,&lt;/li&gt;
&lt;li&gt;hidden-state routing,&lt;/li&gt;
&lt;li&gt;loop control.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All of this runs entirely inside the browser, which removes the need for a server, cloud GPU, or backend inference service.&lt;/p&gt;


&lt;h2&gt;
  
  
  5. Architecture
&lt;/h2&gt;

&lt;p&gt;The architecture treats the browser as a recursive execution environment. Agents produce hidden states, the orchestration layer routes them, and RecursiveLink transforms them for the next agent or recursion round.&lt;/p&gt;

&lt;p&gt;A browser-native architecture of this kind emphasizes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;hidden-state routing,&lt;/li&gt;
&lt;li&gt;low-latency recursive flow control,&lt;/li&gt;
&lt;li&gt;browser-local tensor transformation,&lt;/li&gt;
&lt;li&gt;final decode only at output time.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  6. Latent-State Interface
&lt;/h2&gt;

&lt;p&gt;A browser-native RecursiveMAS implementation needs two core capabilities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Hidden-state extraction, so the runtime can expose the internal representation of an agent step.&lt;/li&gt;
&lt;li&gt;Hidden-state injection, so another agent can receive a transformed latent representation instead of text.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;A conceptual API might look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agentA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getHiddenState&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hMapped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;recursiveLink&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hA&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agentB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;injectHiddenState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hMapped&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;agentB&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the key difference from prompt-based multi-agent orchestration. Communication happens through latent tensors rather than serialized text.&lt;/p&gt;




&lt;h2&gt;
  
  
  7. RecursiveLink in the Browser
&lt;/h2&gt;

&lt;p&gt;RecursiveLink is the component that makes latent collaboration workable. In the RecursiveMAS paper, RecursiveLink is used to align agent representations and support recursive state transfer across heterogeneous models.&lt;/p&gt;

&lt;p&gt;In a browser-native setting, the same idea becomes a practical adapter that can stabilize the transfer of hidden states between in-browser agents.&lt;/p&gt;

&lt;p&gt;A browser-friendly RecursiveLink should aim to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;normalize latent distributions,&lt;/li&gt;
&lt;li&gt;reduce instability across recursion rounds,&lt;/li&gt;
&lt;li&gt;preserve enough semantic structure for downstream reasoning.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A simple formulation can be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;h' = W3 σ(W2 σ(W1 h))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt; is the source hidden state,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h'&lt;/code&gt; is the transformed state,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;W1&lt;/code&gt;, &lt;code&gt;W2&lt;/code&gt;, &lt;code&gt;W3&lt;/code&gt; are learned projection matrices,&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;σ&lt;/code&gt; is a nonlinear activation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is a practical abstraction, not a claim that the exact same transformation must be used in every implementation.&lt;/p&gt;




&lt;h2&gt;
  
  
  8. Browser Runtime Flow
&lt;/h2&gt;

&lt;p&gt;A typical recursive reasoning loop may look like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent A processes the input and emits a hidden state.&lt;/li&gt;
&lt;li&gt;RecursiveLink transforms that hidden state into a compatible latent format.&lt;/li&gt;
&lt;li&gt;Agent B receives the transformed state and continues reasoning.&lt;/li&gt;
&lt;li&gt;The loop repeats for one or more recursion rounds.&lt;/li&gt;
&lt;li&gt;A final decode step produces the visible text output.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This flow keeps the intermediate reasoning inside the browser and only surfaces the final answer when needed.&lt;/p&gt;




&lt;h2&gt;
  
  
  9. Why This Matters
&lt;/h2&gt;

&lt;p&gt;The main value of this work is not simply that it runs locally. It is that it brings a richer coordination mechanism into a browser-native environment.&lt;/p&gt;

&lt;p&gt;That matters for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Privacy: data stays on-device.&lt;/li&gt;
&lt;li&gt;Deployment simplicity: no backend orchestration is required.&lt;/li&gt;
&lt;li&gt;Portability: users can run the system from a browser.&lt;/li&gt;
&lt;li&gt;Research value: latent collaboration can be studied in a lightweight environment.&lt;/li&gt;
&lt;li&gt;Decentralization: browser clients can potentially participate in distributed AI workflows.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RecursiveMAS WebLLM therefore sits at the intersection of browser AI, agent systems, and latent computation.&lt;/p&gt;




&lt;h2&gt;
  
  
  10. Limitations
&lt;/h2&gt;

&lt;p&gt;This browser-native adaptation also has clear constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hidden-state manipulation is technically complex.&lt;/li&gt;
&lt;li&gt;Browser memory and compute budgets are limited.&lt;/li&gt;
&lt;li&gt;WebGPU performance varies by device and browser support.&lt;/li&gt;
&lt;li&gt;Latent transfer can become unstable without careful normalization.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The system is a prototype and should not be treated as a full replacement for server-side training or large-scale agent orchestration.&lt;/p&gt;

&lt;p&gt;These limitations are important to acknowledge because they define the realistic scope of the project.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. Future Work
&lt;/h2&gt;

&lt;p&gt;Several extensions are worth exploring next:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;browser-to-browser latent communication,&lt;/li&gt;
&lt;li&gt;dynamic agent graphs,&lt;/li&gt;
&lt;li&gt;stronger RecursiveLink training strategies,&lt;/li&gt;
&lt;li&gt;recursive memory modules,&lt;/li&gt;
&lt;li&gt;evaluation across multiple browser/device classes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A particularly interesting direction is to test whether browser-native latent recursion can preserve some of the efficiency benefits reported in the original RecursiveMAS paper when run on consumer hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  12. Project Context
&lt;/h2&gt;

&lt;p&gt;This repository serves as a build pipeline for a latent-transfer-capable WebLLM model. It demonstrates how a compiled WebGPU model can expose last-layer hidden states and how a trained RecursiveLink can be assembled and consumed by a browser application.&lt;/p&gt;

&lt;p&gt;Key implementation artifacts in this repo include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;expose_hidden.py&lt;/code&gt; — automated patcher for exposing hidden states in an MLC model definition.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build.sh&lt;/code&gt; — pipeline script for converting weights, generating config, and compiling a WebGPU runtime.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;train_recursivelink.py&lt;/code&gt; — optional training script for RecursiveLink projection weights.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  13. Conclusion
&lt;/h2&gt;

&lt;p&gt;RecursiveMAS WebLLM is a browser-native exploration of RecursiveMAS-style latent collaboration. My work is based on RecursiveMAS (&lt;a href="https://arxiv.org/abs/2604.25917" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2604.25917&lt;/a&gt;) as the core idea, and adapts it into a WebGPU-backed runtime that runs entirely inside the browser.&lt;/p&gt;

&lt;p&gt;The central idea is simple: if multi-agent reasoning can be expressed as latent-state recursion, then the browser may be able to host that process locally, privately, and without backend infrastructure. That makes the browser not just a user interface, but a viable execution layer for advanced agent research.&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Recursive Multi-Agent Systems, arXiv:2604.25917 
&lt;a href="https://recursivemas.github.io/" rel="noopener noreferrer"&gt;https://recursivemas.github.io/&lt;/a&gt;
Demo &lt;a href="https://vishalmysore.github.io/recursiveMASDemo" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/recursiveMASDemo&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Bringing Recursive Multi-Agent Systems to the Browser with WebLLM and WebGPU</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Mon, 22 Jun 2026 14:37:47 +0000</pubDate>
      <link>https://dev.to/vishalmysore/bringing-recursive-multi-agent-systems-to-the-browser-with-webllm-and-webgpu-5dfh</link>
      <guid>https://dev.to/vishalmysore/bringing-recursive-multi-agent-systems-to-the-browser-with-webllm-and-webgpu-5dfh</guid>
      <description>&lt;p&gt;Most multi-agent AI systems have a hidden inefficiency.&lt;/p&gt;

&lt;p&gt;Every time agents collaborate, they typically communicate by generating text, passing that text to another agent, and then re-processing it again. While this works, it's expensive, slow, and burns through tokens quickly.&lt;/p&gt;

&lt;p&gt;What if agents could communicate without generating text at all?&lt;/p&gt;

&lt;p&gt;That's the idea behind &lt;strong&gt;RecursiveMAS&lt;/strong&gt;, a recent research framework that allows AI agents to collaborate directly through their internal latent representations instead of exchanging natural language.&lt;/p&gt;

&lt;p&gt;Inspired by this research, I built &lt;strong&gt;recursiveMASWebLLM&lt;/strong&gt; — a build pipeline that brings RecursiveMAS-style latent collaboration directly into the browser using &lt;strong&gt;WebLLM&lt;/strong&gt;, &lt;strong&gt;MLC-LLM&lt;/strong&gt;, and &lt;strong&gt;WebGPU&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The result is a fully client-side experimental platform for running recursive multi-agent systems on consumer hardware without requiring cloud GPUs.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Problem with Traditional Multi-Agent Systems
&lt;/h1&gt;

&lt;p&gt;Most agent frameworks operate like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A → generates text
         ↓
Agent B → reads text and generates more text
         ↓
Agent C → reads text and generates final answer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every handoff requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Token generation&lt;/li&gt;
&lt;li&gt;Token transmission&lt;/li&gt;
&lt;li&gt;Token re-processing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As the number of agents increases, the overhead grows rapidly.&lt;/p&gt;

&lt;p&gt;A significant portion of the computation is spent translating thoughts into text and then converting that text back into internal representations.&lt;/p&gt;

&lt;p&gt;This works, but it's not how neural networks naturally communicate.&lt;/p&gt;




&lt;h1&gt;
  
  
  What Is RecursiveMAS?
&lt;/h1&gt;

&lt;p&gt;RecursiveMAS takes a different approach.&lt;/p&gt;

&lt;p&gt;Instead of exchanging generated text, agents exchange their &lt;strong&gt;last-layer hidden states&lt;/strong&gt; (latent representations).&lt;/p&gt;

&lt;p&gt;Think of hidden states as the model's internal reasoning space before words are produced.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A Hidden State
         ↓
RecursiveLink
         ↓
Agent B Hidden State
         ↓
RecursiveLink
         ↓
Agent C Hidden State
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The entire multi-agent system becomes a recursive computation graph operating in latent space.&lt;/p&gt;

&lt;p&gt;The original research introduces a lightweight component called &lt;strong&gt;RecursiveLink&lt;/strong&gt;, which acts as a bridge between agents.&lt;/p&gt;

&lt;p&gt;Rather than training or fine-tuning the underlying LLMs, only these small link modules are trained while the base models remain frozen.&lt;/p&gt;

&lt;p&gt;This allows multiple agents to collaboratively refine reasoning before any text is generated.&lt;/p&gt;




&lt;h1&gt;
  
  
  Core Concepts
&lt;/h1&gt;

&lt;h2&gt;
  
  
  RecursiveLink
&lt;/h2&gt;

&lt;p&gt;A lightweight residual network that transforms and transfers latent representations between agents.&lt;/p&gt;

&lt;p&gt;Instead of passing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"What is the answer?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;agents pass:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[hidden_state_vector]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dramatically reduces communication overhead.&lt;/p&gt;




&lt;h2&gt;
  
  
  Inner Link
&lt;/h2&gt;

&lt;p&gt;Allows an agent to recursively refine its own latent reasoning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent
   ↓
Hidden State
   ↓
RecursiveLink
   ↓
Back Into Agent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates iterative self-improvement loops before decoding text.&lt;/p&gt;




&lt;h2&gt;
  
  
  Outer Link
&lt;/h2&gt;

&lt;p&gt;Enables latent communication between different agents.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Agent A
   ↓
RecursiveLink
   ↓
Agent B
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The research demonstrates that even heterogeneous models can participate in these recursive workflows.&lt;/p&gt;




&lt;h2&gt;
  
  
  System-Level Recursion
&lt;/h2&gt;

&lt;p&gt;The entire multi-agent system can execute multiple refinement passes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Pass 1
   ↓
Pass 2
   ↓
Pass 3
   ↓
Final Decode
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Instead of generating intermediate text after every step, the system performs latent collaboration first and produces text only at the end.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why This Matters
&lt;/h1&gt;

&lt;p&gt;According to the RecursiveMAS research, latent-space collaboration delivers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Higher benchmark accuracy&lt;/li&gt;
&lt;li&gt;Reduced token consumption&lt;/li&gt;
&lt;li&gt;Faster end-to-end inference&lt;/li&gt;
&lt;li&gt;Better scalability across multiple agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Reported results include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Up to 75% reduction in token usage&lt;/li&gt;
&lt;li&gt;1.2×–2.4× faster inference&lt;/li&gt;
&lt;li&gt;Average accuracy improvements across reasoning, coding, science, and medical benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The key insight is that agents can collaborate more efficiently when communication occurs inside the neural representation space rather than through natural language.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Challenge: Running RecursiveMAS in the Browser
&lt;/h1&gt;

&lt;p&gt;The original RecursiveMAS implementation targets server environments and GPU inference stacks such as vLLM.&lt;/p&gt;

&lt;p&gt;Browser-based AI introduces a major limitation:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebLLM models do not normally expose internal hidden states.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without access to hidden states, latent recursion is impossible.&lt;/p&gt;

&lt;p&gt;That became the motivation for this project.&lt;/p&gt;




&lt;h1&gt;
  
  
  Introducing recursiveMASWebLLM
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;recursiveMASWebLLM&lt;/strong&gt; is a specialized build pipeline for creating WebLLM models capable of latent-state transfer.&lt;/p&gt;

&lt;p&gt;It extends the browser AI stack to expose the information required for RecursiveMAS-style recursion.&lt;/p&gt;

&lt;p&gt;The goal is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Research Paper
      ↓
Server GPU Implementation
      ↓
Browser-Compatible Runtime
      ↓
Accessible to Everyone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h1&gt;
  
  
  What This Project Adds
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Hidden State Extraction
&lt;/h2&gt;

&lt;p&gt;MLC-LLM is patched to expose:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;get_last_hidden&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows browser applications to access last-layer hidden states directly during inference.&lt;/p&gt;

&lt;p&gt;Without this capability, RecursiveMAS cannot function.&lt;/p&gt;




&lt;h2&gt;
  
  
  RecursiveLink Training Pipeline
&lt;/h2&gt;

&lt;p&gt;The repository includes tooling to train and package RecursiveLinks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;train_recursivelink.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Generated links are exported as:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recursivelink.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These lightweight modules can then be loaded by browser-based agent systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Automated Browser Model Builds
&lt;/h2&gt;

&lt;p&gt;The build pipeline supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model conversion&lt;/li&gt;
&lt;li&gt;Quantization&lt;/li&gt;
&lt;li&gt;WebGPU compilation&lt;/li&gt;
&lt;li&gt;WASM generation&lt;/li&gt;
&lt;li&gt;Release packaging&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even small models can be built entirely through GitHub Actions without requiring local GPUs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Browser Deployment
&lt;/h2&gt;

&lt;p&gt;Outputs include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;.wasm
weights
recursivelink.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These artifacts can be hosted on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub Releases&lt;/li&gt;
&lt;li&gt;Hugging Face&lt;/li&gt;
&lt;li&gt;Static web hosting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and loaded directly into browser applications.&lt;/p&gt;




&lt;h1&gt;
  
  
  Project Architecture
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;recursiveMASWebLLM
        │
        ▼
Build Pipeline
        │
        ▼
.wasm + weights + recursivelink.json
        │
        ▼
Hosted Artifacts
        │
        ▼
RecursiveMAS Playground
        │
        ▼
Browser-Based Recursive Agents
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The builder generates everything needed for latent recursive collaboration in WebLLM-powered applications.&lt;/p&gt;




&lt;h1&gt;
  
  
  Why Browser-Based Recursive Agents Are Interesting
&lt;/h1&gt;

&lt;h2&gt;
  
  
  1. Democratizing Advanced AI Research
&lt;/h2&gt;

&lt;p&gt;Researchers and developers can experiment with RecursiveMAS techniques without expensive cloud infrastructure.&lt;/p&gt;

&lt;p&gt;If a device supports WebGPU, it can participate.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Interactive Experimentation
&lt;/h2&gt;

&lt;p&gt;Developers can modify:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Recursion depth&lt;/li&gt;
&lt;li&gt;Agent roles&lt;/li&gt;
&lt;li&gt;Collaboration patterns&lt;/li&gt;
&lt;li&gt;Prompt strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;and immediately observe how latent collaboration affects outcomes.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. Education
&lt;/h2&gt;

&lt;p&gt;RecursiveMAS introduces a fundamentally different way of thinking about multi-agent systems.&lt;/p&gt;

&lt;p&gt;Running it locally in a browser makes it easier to understand and teach.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Lower Latency
&lt;/h2&gt;

&lt;p&gt;Reducing intermediate token generation is especially valuable in browser environments where responsiveness matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Future Extensions
&lt;/h2&gt;

&lt;p&gt;Exposing hidden states opens the door to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latent planning systems&lt;/li&gt;
&lt;li&gt;Browser-side distillation&lt;/li&gt;
&lt;li&gt;Neural memory systems&lt;/li&gt;
&lt;li&gt;Hybrid cloud/browser agents&lt;/li&gt;
&lt;li&gt;Experimental reasoning architectures&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RecursiveMAS is just one possible application.&lt;/p&gt;




&lt;h1&gt;
  
  
  Getting Started
&lt;/h1&gt;

&lt;p&gt;Repository:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vishalmysore/recursiveMASWebLLM" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASWebLLM&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Local build instructions&lt;/li&gt;
&lt;li&gt;GitHub Actions workflows&lt;/li&gt;
&lt;li&gt;RecursiveLink training utilities&lt;/li&gt;
&lt;li&gt;Model packaging tools&lt;/li&gt;
&lt;li&gt;Integration guidance for the RecursiveMAS playground&lt;/li&gt;
&lt;/ul&gt;




&lt;h1&gt;
  
  
  Looking Ahead
&lt;/h1&gt;

&lt;p&gt;This project is still early, but it establishes the foundation for browser-native latent multi-agent systems.&lt;/p&gt;

&lt;p&gt;Future work includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Larger model support&lt;/li&gt;
&lt;li&gt;Improved model sharding&lt;/li&gt;
&lt;li&gt;Additional collaboration patterns&lt;/li&gt;
&lt;li&gt;Better WebGPU optimizations&lt;/li&gt;
&lt;li&gt;Community-created RecursiveLinks&lt;/li&gt;
&lt;li&gt;Integration with other browser AI frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As browser AI continues to mature, I believe we'll see more experimentation move from cloud infrastructure to client-side environments.&lt;/p&gt;

&lt;p&gt;RecursiveMAS demonstrates that some of the most interesting ideas in AI may not require massive server clusters—they may eventually run directly in the browser.&lt;/p&gt;

&lt;p&gt;What do you think?&lt;/p&gt;

&lt;p&gt;Could latent-space multi-agent systems become the next evolution of browser AI experimentation?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/vishalmysore/recursiveMASWebLLM" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/recursiveMASWebLLM&lt;/a&gt;&lt;br&gt;
&lt;a href="https://recursivemas.github.io/" rel="noopener noreferrer"&gt;https://recursivemas.github.io/&lt;/a&gt;&lt;br&gt;
&lt;a href="https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC" rel="noopener noreferrer"&gt;https://huggingface.co/VishalMysore/RecursiveMAS-0.5B-MLC&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Stop Paying for Token APIs: How to Build a Serverless Multi-Agent Mesh in the Browser</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Wed, 17 Jun 2026 12:04:28 +0000</pubDate>
      <link>https://dev.to/vishalmysore/stop-paying-for-token-apis-how-to-build-a-serverless-multi-agent-mesh-in-the-browser-54p8</link>
      <guid>https://dev.to/vishalmysore/stop-paying-for-token-apis-how-to-build-a-serverless-multi-agent-mesh-in-the-browser-54p8</guid>
      <description>&lt;p&gt;Every modern multi-agent architecture assumes a massive, expensive backend cloud infrastructure running hundreds of dollars in API token costs per hour. But what if an entire suite of specialized agents—Legal, Software, Security, Healthcare—could collaborate, negotiate, and execute complex tools completely localized inside consumer browser tabs, passing knowledge with zero intermediary servers?&lt;/p&gt;

&lt;p&gt;Welcome to &lt;strong&gt;agentHerd&lt;/strong&gt;—a radical paradigm shift in decentralized, sovereign artificial intelligence. &lt;/p&gt;

&lt;p&gt;By combining the local client-side execution power of &lt;strong&gt;WebGPU&lt;/strong&gt; with the direct, peer-to-peer networking capabilities of &lt;strong&gt;WebRTC&lt;/strong&gt;, agentHerd turns ordinary browser tabs into highly scalable, self-hosting AI environments. Zero cloud costs. Zero central servers. Total data privacy.&lt;/p&gt;




&lt;p&gt;Demo - &lt;a href="https://vishalmysore.github.io/agentHerd/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/agentHerd/&lt;/a&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  The agentHerd Stack at a Glance
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Inference:&lt;/strong&gt; WebLLM / WebGPU (Running Llama 3, Phi-3, or Gemma natively in the browser).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Networking:&lt;/strong&gt; Pure Serverless WebRTC Data Channels (Handshake via ephemeral URL hashes).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Data Isolation:&lt;/strong&gt; Distributed Federated RAG (Knowledge discovery without centralized vector stores).&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Determinism:&lt;/strong&gt; Hybrid sandboxing (LLMs control personality/choice; sandboxed JavaScript handles immutable application rules).&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  1. The Core Paradox: Moving Inference and Networking to the Edge
&lt;/h2&gt;

&lt;p&gt;Traditional AI agents are cloud-bound because of two heavy dependencies: &lt;strong&gt;compute&lt;/strong&gt; (LLM inference) and &lt;strong&gt;orchestration&lt;/strong&gt; (state management and messaging).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Traditional Architecture]
Browser UI &amp;lt;---&amp;gt; Cloud Orchestrator &amp;lt;---&amp;gt; Vector DB &amp;lt;---&amp;gt; Expensive LLM APIs ($$$)

[agentHerd Architecture]
Browser Tab A (WebGPU + LLM) &amp;lt;======== WebRTC ========&amp;gt; Browser Tab B (WebGPU + LLM)
                                 (Direct P2P Link)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;agentHerd breaks this centralization by pushing both layers entirely to the client device:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;WebGPU for Compute:&lt;/strong&gt; Instead of querying external endpoints, models are cached locally and executed on the client's GPU via WebGPU. The moment a user opens a tab, their device becomes an active AI compute node.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;WebRTC for Orchestration:&lt;/strong&gt; Instead of a centralized message broker routing agent dialogue, tabs establish direct, encrypted peer-to-peer WebRTC data channels. &lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Serverless Signaling via URL Hashes
&lt;/h3&gt;

&lt;p&gt;WebRTC traditionally requires a signaling server to exchange Session Description Protocol (SDP) tokens. agentHerd implements an entirely serverless signaling option: the initial peer generates an SDP token encoded directly into a URL hash. Copying and sharing this URL establishes an absolute trust boundary—operating identically to an end-to-end encrypted chat room.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. Federated Knowledge Retrieval (RAG) over WebRTC
&lt;/h2&gt;

&lt;p&gt;One of the greatest challenges of collaborative AI is data sharing. Uploading private company manuals or personal codebases to a central cloud database poses severe security risks. agentHerd solves this via &lt;strong&gt;Federated Knowledge Retrieval&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Peer A: Has "Legal_Doc.pdf"]             [Peer B: Needs Legal Context]
   |                                            |
   |-- 1. Generates Summary Card --------------&amp;gt;| (Broadcasts Summary to Mesh)
   |                                            |
   |                                            |-- 2. "I need data on Section 4"
   |&amp;lt;=- 3. Requests Chunk via WebRTC Data Channel-|
   |                                            |
   |-- 4. Runs Local RAG Engine                 |
   |-- 5. Extracts precise text snippet         |
   |                                            |
   |==- 6. Sends Answer Fragment via WebRTC ===&amp;gt;| (Received securely)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How the Flow Works:
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Local Extraction:&lt;/strong&gt; When a user uploads a document into their local agentHerd tab, the document never leaves their machine. The local browser model processes the file and generates a lightweight &lt;em&gt;Summary Card&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Summary Broadcast:&lt;/strong&gt; This abstract Summary Card is shared across the WebRTC mesh. Other peers know &lt;em&gt;what&lt;/em&gt; knowledge Peer A possesses, but they do not have the raw data.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;On-Demand Querying:&lt;/strong&gt; When Peer B's agent requires deep granular details to answer a prompt, it queries Peer A over the direct WebRTC data channel.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Localized Verification:&lt;/strong&gt; Peer A’s local RAG system searches its own memory space, extracts the specific matching snippet, passes it through its own WebGPU instance, and returns &lt;em&gt;only&lt;/em&gt; the specific answer fragment to Peer B.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The Sovereign AI Rule:&lt;/strong&gt; You hand the network your answers—never your data.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  3. The Action Layer: Where Determinism Meets Personality
&lt;/h2&gt;

&lt;p&gt;Generative models are inherently non-deterministic. If you ask an LLM to play chess against another LLM purely through text generation, the game will rapidly degrade into illegal moves and hallucinated board positions.&lt;/p&gt;

&lt;p&gt;agentHerd solves this by splitting responsibilities through a strict separation of concerns:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;The Persona (LLM):&lt;/strong&gt; Manages choices, dialogue, strategic goals, and social banter.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Guardrail (Deterministic Engine):&lt;/strong&gt; An immutable, sandboxed environment (like a localized &lt;code&gt;chess.js&lt;/code&gt; script) that enforces absolute operational rules.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When an agent wants to perform an action, it cannot arbitrary alter the state. It must output a structured JSON command envelope that is verified by every node in the mesh:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sender"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent_Alpha_Chess"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1718619837&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EXECUTE_TOOL"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"payload"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tool_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CHESS_MOVE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"from"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"e4"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"signature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0x7f83b..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;Agent_Alpha&lt;/code&gt; attempts to generate an illegal move, the deterministic script running on the peer nodes instantly rejects the packet, ensuring the integrity of the environment without requiring a central server to referee the state.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Operational Boundaries and Engineering Realities
&lt;/h2&gt;

&lt;p&gt;Building completely within the constraints of a browser environment requires engineering trade-offs. Developers looking to leverage this stack should be aware of current boundaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;VRAM and Model Swapping:&lt;/strong&gt; Running models like Llama 3 (8B) or Phi-3 requires a modern GPU with sufficient VRAM. Attempting to open multiple heavy-inference browser tabs concurrently can saturate hardware resources. Multi-agent rooms perform best when using highly optimized 3B or smaller models optimized for web runtimes.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;NAT Traversal &amp;amp; Corporate Firewalls:&lt;/strong&gt; While public STUN servers successfully resolve connections for the majority of consumer network topologies, strict enterprise environments utilizing symmetric NATs often block direct WebRTC channels. In these scenarios, falling back to a dedicated, self-hosted TURN relay server becomes mandatory to handle the traffic.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Topology Scaling Limits:&lt;/strong&gt; Because each browser tab must maintain WebRTC connections with other agents, a full-mesh topology (where every node connects to every other node) hits a browser-imposed performance wall as the group size scales. For massive clusters, the architecture transitions toward hybrid star/relay networks.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  5. Join the Decentralized Frontier
&lt;/h2&gt;

&lt;p&gt;The future of multi-agent collaboration isn't a massive corporate data center burning megawatts of energy to route your private API calls—it's the browser tab you already have open. &lt;/p&gt;

&lt;p&gt;agentHerd proves that we can build highly complex, deeply collaborative, and perfectly private AI ecosystems using the open web standard tools already at our disposal. &lt;/p&gt;

&lt;p&gt;The project is fully open-source and welcoming contributors. We are actively looking for developers to help build out new specialized agent domains, create native CLI-peer wrappers, and engineer custom tool integrations.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;Explore the Codebase:&lt;/strong&gt; &lt;a href="https://github.com/vishalmysore/agentHerd" rel="noopener noreferrer"&gt;agentHerd on GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Launch a Room:&lt;/strong&gt; Open the repository, generate your signaling hash, and invite your first agent herd today.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;Contribute:&lt;/strong&gt; Star the repo, open an issue, and let's build an unstoppable, serverless AI collective together.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Important distinction: agentHerd distributes cognition, not computation.&lt;br&gt;
Each node runs its own model independently. The system does not combine GPUs to run larger models—it coordinates many smaller, autonomous agents working in parallel.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>serverless</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Foundation vs. Instruct vs. Chat Models: One Question, Three Answers</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Tue, 16 Jun 2026 23:08:32 +0000</pubDate>
      <link>https://dev.to/vishalmysore/foundation-vs-instruct-vs-chat-models-one-question-three-answers-3gi</link>
      <guid>https://dev.to/vishalmysore/foundation-vs-instruct-vs-chat-models-one-question-three-answers-3gi</guid>
      <description>&lt;p&gt;&lt;em&gt;A hands-on tutorial you can run for free in Google Colab.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Run it yourself:&lt;/strong&gt; open &lt;a href="https://github.com/vishalmysore/AI/blob/main/foundation_instruct_chat_tutorial.ipynb" rel="noopener noreferrer"&gt;&lt;code&gt;foundation_instruct_chat_tutorial.ipynb&lt;/code&gt;&lt;/a&gt; in Google Colab and run every cell top to bottom. It uses the &lt;strong&gt;SmolLM2-135M&lt;/strong&gt; family — small enough for a free CPU runtime, no GPU needed.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  Why this confuses everyone
&lt;/h2&gt;

&lt;p&gt;People say "LLM," "GPT," "an AI model," and "ChatGPT" as if they were the same thing. They aren't. There's a ladder of training stages between "a model that read the internet" and "an assistant you can chat with," and the words &lt;strong&gt;foundation&lt;/strong&gt;, &lt;strong&gt;instruct&lt;/strong&gt;, and &lt;strong&gt;chat&lt;/strong&gt; mark the rungs.&lt;/p&gt;

&lt;p&gt;The cleanest way to feel the difference is to do something deliberately unfair: ask the &lt;strong&gt;exact same question&lt;/strong&gt; to three versions of the &lt;strong&gt;same model family&lt;/strong&gt; and watch how differently they behave. Our question is deliberately boring so the &lt;em&gt;behavior&lt;/em&gt; stands out:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;"What is the capital of France?"&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We use three checkpoints from Hugging Face's SmolLM2 family:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model type&lt;/th&gt;
&lt;th&gt;Hugging Face ID&lt;/th&gt;
&lt;th&gt;One-line summary&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Foundation (base)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HuggingFaceTB/SmolLM2-135M&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Predicts the next token. Knows things, isn't helpful.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Instruct&lt;/td&gt;
&lt;td&gt;&lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fine-tuned to follow a single instruction.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chat&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;HuggingFaceTB/SmolLM2-135M-Instruct&lt;/code&gt; (used conversationally)&lt;/td&gt;
&lt;td&gt;Same weights, driven through a multi-turn message list.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Notice that the chat row reuses the instruct checkpoint. That's not a shortcut — it's the honest reality, and we'll come back to why.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. The foundation model: a brilliant autocomplete
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;foundation model&lt;/strong&gt; (also called a &lt;em&gt;base&lt;/em&gt; or &lt;em&gt;pretrained&lt;/em&gt; model) is trained on exactly one objective: given a stretch of text, &lt;strong&gt;predict the next token&lt;/strong&gt;. Nothing else. It reads a huge slice of the internet and gets very good at continuing text in a statistically plausible way.&lt;/p&gt;

&lt;p&gt;What it is &lt;em&gt;never&lt;/em&gt; taught is that a question deserves an answer. So when you feed it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What is the capital of France?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;it doesn't think &lt;em&gt;"I should answer that."&lt;/em&gt; It thinks &lt;em&gt;"On the internet, what usually **comes after&lt;/em&gt;* a line like this?"* And the answer is often… &lt;strong&gt;more quiz questions&lt;/strong&gt;, a worksheet, or a tangent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What is the capital of France? What is the capital of Germany? What is the
capital of Italy? ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the notebook we pass the raw string straight into the pipeline with no formatting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;base_pipe&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HuggingFaceTB/SmolLM2-135M&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;base_raw_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;base_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;test_query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;do_sample&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_raw_out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; a foundation model is a &lt;strong&gt;text completer&lt;/strong&gt;, not an assistant. It contains enormous knowledge but has no concept of being &lt;em&gt;helpful&lt;/em&gt;. It's the raw clay everything else is shaped from.&lt;/p&gt;




&lt;h2&gt;
  
  
  2. The instruct model: teaching the model to answer
&lt;/h2&gt;

&lt;p&gt;An &lt;strong&gt;instruct model&lt;/strong&gt; starts from that same base model and goes through a second stage of training — &lt;strong&gt;fine-tuning on (instruction → response) pairs&lt;/strong&gt;. Thousands to millions of examples of the shape &lt;em&gt;"Here's a request. Here's a good response."&lt;/em&gt; This teaches the model a new contract: &lt;strong&gt;when the user asks for something, actually do it and then stop.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But there's a crucial detail people miss: an instruct model only behaves correctly when you wrap your text in the &lt;strong&gt;exact special format it was trained on.&lt;/strong&gt; That format uses control tokens — for SmolLM2 they look like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;|im_start|&amp;gt;user
What is the capital of France?&amp;lt;|im_end|&amp;gt;
&amp;lt;|im_start|&amp;gt;assistant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You don't type those tokens by hand. Every instruct model ships with a &lt;strong&gt;chat template&lt;/strong&gt; baked into its tokenizer that builds them for you:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoTokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;instruct_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;formatted_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply_chat_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;test_query&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;add_generation_prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# appends the 'assistant' cue
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Feed &lt;em&gt;that&lt;/em&gt; to the same-sized model and you get a clean, direct answer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;The capital of France is Paris.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The notebook prints the formatted prompt &lt;strong&gt;before&lt;/strong&gt; generating, so you can literally see the hidden scaffolding the model receives. That "aha" — &lt;em&gt;oh, there's a whole structure under the hood&lt;/em&gt; — is the most important thing in the tutorial.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; an instruct model = a base model &lt;strong&gt;+ instruction tuning + a required prompt format&lt;/strong&gt;. Skip the format and even a well-trained instruct model can fall back to rambling.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. The chat model: memory across turns
&lt;/h2&gt;

&lt;p&gt;Here's the part that surprises people: a &lt;strong&gt;chat model is usually the same weights as the instruct model.&lt;/strong&gt; The difference isn't &lt;em&gt;what&lt;/em&gt; the model is — it's &lt;em&gt;how you drive it.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Instead of one instruction in, one response out, you maintain a &lt;strong&gt;running list of role-tagged messages&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the capital of France?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;chat_out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chat_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chat_history&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline applies the chat template for you and returns the &lt;strong&gt;whole conversation&lt;/strong&gt; with the assistant's reply appended. For a single turn, that looks identical to the instruct example. The magic only appears when the conversation &lt;strong&gt;continues&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So in the notebook we append the reply and ask a deliberately vague follow-up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;        &lt;span class="c1"&gt;# user + assistant so far
&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                     &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;And what is a famous landmark there?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="n"&gt;follow_up&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;chat_pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The word &lt;strong&gt;"there"&lt;/strong&gt; is meaningless on its own. But because we passed the &lt;em&gt;entire history&lt;/em&gt;, the model resolves "there" → &lt;strong&gt;Paris&lt;/strong&gt; and names a landmark. That carried-over context is what turns a one-shot Q&amp;amp;A into something that feels like a conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; a chat model is an instruct model &lt;strong&gt;driven through a multi-turn message list&lt;/strong&gt;, so each new turn can use the previous turns as context. The system prompt, the &lt;code&gt;user&lt;/code&gt;/&lt;code&gt;assistant&lt;/code&gt; roles, and the growing history are the "chat" part.&lt;/p&gt;




&lt;h2&gt;
  
  
  The whole picture in one table
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Trained to…&lt;/th&gt;
&lt;th&gt;You give it…&lt;/th&gt;
&lt;th&gt;Reply to &lt;em&gt;"What is the capital of France?"&lt;/em&gt;
&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Foundation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;continue text&lt;/td&gt;
&lt;td&gt;a raw string&lt;/td&gt;
&lt;td&gt;echoes / continues the document — may never answer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instruct&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;follow one instruction&lt;/td&gt;
&lt;td&gt;a chat-templated string&lt;/td&gt;
&lt;td&gt;a direct answer: &lt;em&gt;"The capital of France is Paris."&lt;/em&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;converse over many turns&lt;/td&gt;
&lt;td&gt;a list of messages&lt;/td&gt;
&lt;td&gt;a direct answer &lt;strong&gt;+ remembers context&lt;/strong&gt; for follow-ups&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Read top to bottom, it's a progression, not three unrelated things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Foundation&lt;/strong&gt; learns the world by predicting text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Instruct&lt;/strong&gt; fine-tunes that knowledge into &lt;em&gt;do-what-I-ask&lt;/em&gt; behavior — and demands a specific prompt format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat&lt;/strong&gt; wraps the instruct model in a &lt;em&gt;multi-turn interface&lt;/em&gt; so context flows across turns.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When you talk to a commercial assistant, you're using stage 3, sitting on stage 2, built on stage 1.&lt;/p&gt;




&lt;h2&gt;
  
  
  A note on honesty and scale
&lt;/h2&gt;

&lt;p&gt;SmolLM2-135M is &lt;strong&gt;tiny&lt;/strong&gt; — about 135 million parameters, versus the tens or hundreds of &lt;em&gt;billions&lt;/em&gt; in frontier models. At this size the model will sometimes get a fact wrong, repeat itself, or trail off. &lt;strong&gt;That's expected, and it's not the point.&lt;/strong&gt; The tutorial is designed to make the &lt;em&gt;behavioral&lt;/em&gt; gap between the three modes visible on a free laptop or Colab CPU — not to win a trivia contest. The exact same three-stage structure scales all the way up to the largest models in production.&lt;/p&gt;




&lt;h2&gt;
  
  
  Run it and tinker
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="//foundation_instruct_chat_tutorial.ipynb"&gt;&lt;code&gt;foundation_instruct_chat_tutorial.ipynb&lt;/code&gt;&lt;/a&gt; in Google Colab (&lt;code&gt;File → Open notebook → Upload&lt;/code&gt;, or push it to GitHub and use the Colab badge).&lt;/li&gt;
&lt;li&gt;Run all cells (&lt;code&gt;Runtime → Run all&lt;/code&gt;). The first run downloads the models — give it a minute.&lt;/li&gt;
&lt;li&gt;Experiment:

&lt;ul&gt;
&lt;li&gt;Change &lt;code&gt;test_query&lt;/code&gt; to something open-ended like &lt;code&gt;"Write a haiku about the sea."&lt;/code&gt; and watch how the three modes diverge even more.&lt;/li&gt;
&lt;li&gt;Set &lt;code&gt;do_sample=True&lt;/code&gt; with &lt;code&gt;temperature=0.7&lt;/code&gt; for more varied, creative output.&lt;/li&gt;
&lt;li&gt;Swap in a larger sibling such as &lt;code&gt;HuggingFaceTB/SmolLM2-360M-Instruct&lt;/code&gt; and feel the quality jump.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you've &lt;em&gt;seen&lt;/em&gt; the three behaviors with your own eyes, the vocabulary — base, instruct, chat, chat template, system prompt — stops being jargon and starts being obvious.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Happy experimenting!&lt;/em&gt; 🚀&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why Your AI Agents Fail in Production: What Harness Engineering Is NOT</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Tue, 02 Jun 2026 17:44:32 +0000</pubDate>
      <link>https://dev.to/vishalmysore/why-your-ai-agents-fail-in-production-what-harness-engineering-is-not-1moc</link>
      <guid>https://dev.to/vishalmysore/why-your-ai-agents-fail-in-production-what-harness-engineering-is-not-1moc</guid>
      <description>&lt;p&gt;&lt;em&gt;A technical introduction, grounded in code&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;If you've been building AI agents, you've probably felt the gap between "the model works in a notebook" and "the model works reliably in production." Harness engineering is the discipline that closes that gap. But it's widely misunderstood — often confused with things it is not.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT the model
&lt;/h2&gt;

&lt;p&gt;The most common mistake is treating the LLM as the unit of engineering. Swap GPT-4 for Claude, tune a prompt, and call it done.&lt;/p&gt;

&lt;p&gt;In this demo, the same orchestrator loop drives four domains — healthcare, insurance, career counselling, and drug discovery — using either a local Llama 3.2, Phi-3.5, Qwen 2.5, or a mock simulation with no model at all. The core agentic loop in &lt;code&gt;src/execution/orchestrator.js&lt;/code&gt; is model-agnostic. Whether a &lt;code&gt;&amp;lt;tool_call&amp;gt;&lt;/code&gt; response comes from a 3B quantized model running on WebGPU or a deterministic mock, the harness processes it identically.&lt;/p&gt;

&lt;p&gt;The model is a component. The harness is the system.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT prompt engineering
&lt;/h2&gt;

&lt;p&gt;Prompt engineering is about what you say to the model. Harness engineering is about what you do with the model's outputs — and what you do &lt;em&gt;before&lt;/em&gt; the model ever sees a query.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;src/information/memoryManager.js&lt;/code&gt;, past clinician corrections stored in &lt;code&gt;localStorage&lt;/code&gt; are retrieved via keyword matching and injected into the system prompt before each run. The model doesn't know this is happening. It just receives a richer context. The retrieval, filtering, and injection logic is harness work — not prompt work.&lt;/p&gt;

&lt;p&gt;Prompt engineering operates on one turn. Harness engineering operates on the full trajectory.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT a pipeline
&lt;/h2&gt;

&lt;p&gt;A pipeline is a linear sequence: input → model → output. That's not what an agent harness is.&lt;/p&gt;

&lt;p&gt;The execution loop in &lt;code&gt;orchestrator.js&lt;/code&gt; runs up to 10 iterations. On each iteration it calls the LLM, extracts tool calls from the response, executes the tool, runs a guardrail check, and either appends the result and continues or forces a revision and loops. The path through that loop is not predetermined — it depends on what the model calls, what the tool returns, and whether the guardrail passes.&lt;/p&gt;

&lt;p&gt;The harness is a control structure, not a pipeline. It has branches, retries, and termination conditions.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT optional validation bolted on at the end
&lt;/h2&gt;

&lt;p&gt;Every domain in this project enforces guardrails at three distinct points: before tool execution (&lt;code&gt;validateToolCall&lt;/code&gt;), after tool execution (&lt;code&gt;validateToolOutput&lt;/code&gt;), and before the final plan is returned (&lt;code&gt;validateFinalPlan&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;In the drug discovery domain (&lt;code&gt;src/domains/drugDiscovery.js&lt;/code&gt;), if a compound's hepatotoxicity score is ≥0.7, the guardrail sets &lt;code&gt;safe: false&lt;/code&gt; and the orchestrator appends a correction message and re-enters the loop — the IND filing is blocked before it ever reaches the user. The guardrail doesn't annotate a bad answer; it prevents the bad answer from being produced.&lt;/p&gt;

&lt;p&gt;In the insurance domain, a fraud risk score ≥0.7 triggers an SIU referral and blocks settlement — not as a UI warning, but as an execution-layer intervention that forces plan revision.&lt;/p&gt;

&lt;p&gt;Guardrails are not postprocessing. They are load-bearing logic inside the execution loop.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT framework-agnostic glue code
&lt;/h2&gt;

&lt;p&gt;The harness in this project has explicit architectural layers with defined responsibilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Information layer&lt;/strong&gt; (&lt;code&gt;src/information/&lt;/code&gt;): memory retrieval, tool schemas, context assembly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution layer&lt;/strong&gt; (&lt;code&gt;src/execution/&lt;/code&gt;): agentic loop, tool dispatch, guardrail enforcement&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Feedback layer&lt;/strong&gt; (&lt;code&gt;src/feedback/&lt;/code&gt;): schema verification, event tracing, HITL correction capture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These aren't just directories — each layer has a specific job that the others do not do. The orchestrator never touches &lt;code&gt;localStorage&lt;/code&gt;. The memory manager never calls a tool. The tracer never modifies execution state. This separation is what makes the harness maintainable and independently testable.&lt;/p&gt;

&lt;p&gt;The tool schemas in &lt;code&gt;src/information/tools.js&lt;/code&gt; are exported in both OpenAI and Anthropic formats — the harness doesn't assume a provider. The contract between orchestrator and model is an explicit JSON schema, not implicit string matching.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT a static configuration
&lt;/h2&gt;

&lt;p&gt;The harness in this project learns at runtime. When a clinician rejects a plan and types a correction — "Patient X is allergic to penicillin" — that correction is structured and written to &lt;code&gt;localStorage&lt;/code&gt; via &lt;code&gt;saveCorrection()&lt;/code&gt;. On the next run, &lt;code&gt;retrieveRelevantMemories()&lt;/code&gt; splits the query into tokens, matches against stored correction text and tags, and injects the relevant ones into the system prompt.&lt;/p&gt;

&lt;p&gt;No redeployment. No fine-tuning. No model update. The harness changed behavior based on human feedback within the same session.&lt;/p&gt;

&lt;p&gt;This is distinct from prompt engineering (which is static) and fine-tuning (which requires a training run). It is runtime adaptation through structured memory — a harness-level capability.&lt;/p&gt;




&lt;h2&gt;
  
  
  It is NOT the same thing across domains
&lt;/h2&gt;

&lt;p&gt;The orchestrator loop is domain-agnostic, but the domain modules (&lt;code&gt;src/domains/&lt;/code&gt;) are not interchangeable black boxes — they each define their own tools, guardrail thresholds, mock patients or compounds, and scenario sets.&lt;/p&gt;

&lt;p&gt;Healthcare enforces weight-based dosage caps and allergy checks. Drug discovery blocks IND filings on positive Ames tests (&lt;code&gt;genotoxicity: POSITIVE → safe: false, blockIND: true&lt;/code&gt;). Career counselling flags recommendations that guarantee salary figures or advice given to applicants over 50 without age-neutral framing.&lt;/p&gt;

&lt;p&gt;The harness provides the execution container. Domain logic provides the constraints. Neither replaces the other — and the quality of the overall system depends on both being correct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it actually is
&lt;/h2&gt;

&lt;p&gt;Harness engineering is the practice of building the execution container that surrounds an LLM: the control flow that drives multi-step agent behavior, the guardrails that enforce domain constraints mid-execution, the memory system that persists and retrieves human corrections, the schema validation that verifies structured outputs, and the tracing infrastructure that makes all of it observable.&lt;/p&gt;

&lt;p&gt;It is the engineering layer between "a model that can answer questions" and "a system that reliably makes correct decisions in a specific domain."&lt;/p&gt;

&lt;p&gt;The model is one component. The harness is the product.&lt;/p&gt;




&lt;p&gt;Live Demo &lt;a href="https://vishalmysore.github.io/harnessEngineeringDemo/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/harnessEngineeringDemo/&lt;/a&gt;&lt;br&gt;
Code - &lt;a href="https://github.com/vishalmysore/harnessEngineeringDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/harnessEngineeringDemo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Explore the full implementation at &lt;code&gt;src/execution/orchestrator.js&lt;/code&gt;, &lt;code&gt;src/domains/&lt;/code&gt;, and &lt;code&gt;src/feedback/&lt;/code&gt; — the three layers are readable in under 600 lines of code.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Peer-to-Peer AI Agents: A New Paradigm for Intelligent Collaboration</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Sun, 31 May 2026 13:58:57 +0000</pubDate>
      <link>https://dev.to/vishalmysore/peer-to-peer-ai-agents-a-new-paradigm-for-intelligent-collaboration-1147</link>
      <guid>https://dev.to/vishalmysore/peer-to-peer-ai-agents-a-new-paradigm-for-intelligent-collaboration-1147</guid>
      <description>&lt;h3&gt;
  
  
  How WebRTC + WebLLM enables two AI agents to think, talk, and solve problems directly — without a server in sight
&lt;/h3&gt;




&lt;h2&gt;
  
  
  The Problem with Centralized AI
&lt;/h2&gt;

&lt;p&gt;Every AI assistant you use today has the same invisible architecture: your message leaves your device, travels to a data center, gets processed by a model running on someone else's hardware, and a response comes back. This works, but it hides several uncomfortable truths:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your conversations are on someone's server.&lt;/strong&gt; Even with privacy policies, the data crosses the wire.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You depend on uptime.&lt;/strong&gt; If the provider's API goes down, your agent goes silent.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency is inherent.&lt;/strong&gt; Every roundtrip adds delay — milliseconds at best, seconds when traffic spikes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost accumulates.&lt;/strong&gt; Every token, every API call, every inference invocation appears on a bill.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent coordination requires a broker.&lt;/strong&gt; When two AI systems need to collaborate, they usually do it through a central orchestration layer — another server, another dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;AgentWorkbook is built on a different premise:&lt;/strong&gt; what if two AI agents could talk to each other the way two people in the same room talk — directly, privately, without a telephone operator listening in?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Three Technologies, Zero Servers
&lt;/h2&gt;

&lt;p&gt;The system rests on three open technologies working together:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────────────────────────────────┐
│  Browser A (Your Machine)     Browser B (Friend's Machine)│
│                                                           │
│  ┌─────────────────┐           ┌─────────────────┐        │
│  │   WebLLM        │           │   WebLLM        │        │
│  │  (Llama, Phi,   │           │  (Mistral, Gemma │        │
│  │   Gemma, etc.)  │           │   etc.)          │        │
│  └────────┬────────┘           └────────┬─────────┘        │
│           │  generates text             │  generates text   │
│  ┌────────▼────────────────────────────▼─────────┐        │
│  │          WebRTC Data Channel                   │        │
│  │     (direct, encrypted, peer-to-peer)          │        │
│  └────────────────────┬───────────────────────────┘        │
│                       │                                    │
│           Manual SDP token exchange                        │
│           (URL hash / copy-paste — one time only)          │
└──────────────────────────────────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  1. WebRTC — The Direct Connection
&lt;/h3&gt;

&lt;p&gt;WebRTC (Web Real-Time Communication) is a browser standard originally designed for video calls. It creates a direct, encrypted, peer-to-peer channel between two browsers — without any relay for the actual data. Once connected, messages travel straight from one machine to the other at the speed of the internet between them, not via a third-party server.&lt;/p&gt;

&lt;p&gt;The only "server-like" component is the initial handshake: two browsers need to exchange a small amount of metadata (called an SDP offer and answer) to find each other. In AgentWorkbook, this is done via a URL hash and manual copy-paste — no server needed, not even for setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. WebLLM — The Local Brain
&lt;/h3&gt;

&lt;p&gt;WebLLM runs large language models entirely inside the browser using WebGPU — the GPU acceleration API now available in Chrome, Edge, and Firefox. The model weights (800 MB to 4 GB depending on your choice) download once, cache locally, and then run on your own hardware indefinitely. No API key. No per-token cost. No data leaving your machine.&lt;/p&gt;

&lt;p&gt;Each peer runs their own model independently. One user might be running Llama 3.2 · 1B for speed; the other might have Mistral 7B loaded for deeper reasoning. They never need to agree on a model — only the text they generate crosses the data channel.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Persona System — The Role Layer
&lt;/h3&gt;

&lt;p&gt;Beyond raw inference, each agent is given a professional identity: a role (Developer, Doctor, Lawyer, Researcher), a domain-specific system prompt that shapes how it reasons, and a randomly generated name (Nova, Onyx·2, Aria, Vega) that persists for the session. These personas guide the conversation automatically — a QA Engineer agent will naturally probe for edge cases; a Paralegal agent will frame things in procedural terms.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Two Agents Start Talking
&lt;/h2&gt;

&lt;p&gt;The connection process takes about 60–90 seconds from page load to live conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;User A (Host)                              User B (Joiner)
─────────────────────────────────────────────────────────
1. Opens page
2. Picks model + persona
3. Clicks "Generate Invite"
4. WebRTC offer generated
   (ICE candidates gathered)
5. Offer encoded into URL hash
   ↓
   Sends URL to User B
   (via chat, SMS, email)
                                    6. Opens URL
                                    7. Picks own model + persona
                                    8. Page auto-reads offer
                                       from URL hash
                                    9. Answer token generated
                                    10. Copies answer token
                                        ↓
                                        Sends back to User A
11. Pastes answer token
12. WebRTC channel opens
    ─────────────── direct connection ───────────────────
13. hello message exchanged
    (names, personas, models)
14. Models load in parallel
    (each on their own GPU)
15. Offerer's model ready
    → sends first message
                                    16. Joiner receives message
                                    17. Joiner's model ready
                                    → generates reply
                                    → sends back

         Autonomous conversation continues indefinitely
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The URL hash trick is key: the &lt;code&gt;#fragment&lt;/code&gt; portion of a URL is processed entirely by the browser and is &lt;strong&gt;never sent to any server&lt;/strong&gt;, including the web host. The SDP offer — which contains the technical details of how to reach your browser — exists only on your machine and in the URL you share manually.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 1: Peer-to-Peer Agent Communication for Teams
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Scenario
&lt;/h3&gt;

&lt;p&gt;A distributed team has two members — one in London, one in Singapore. Both open AgentWorkbook. London picks a &lt;strong&gt;Software Developer&lt;/strong&gt; persona running Llama 3.2 · 3B. Singapore picks a &lt;strong&gt;QA Engineer&lt;/strong&gt; running Phi-3.5 Mini. They exchange the invite URL.&lt;/p&gt;

&lt;p&gt;Within minutes, two agents named &lt;strong&gt;Forge&lt;/strong&gt; (London) and &lt;strong&gt;Iris&lt;/strong&gt; (Singapore) are in conversation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🤖 Forge · Software Developer
"Hi Iris. I've been looking at the authentication module we're 
building. I think we should go with JWT tokens with a 15-minute 
expiry and refresh token rotation. The main risk I see is token 
theft in XSS attacks — I'd propose using HttpOnly cookies for 
the refresh token storage."

✅ Iris · QA Engineer
"Good call on HttpOnly cookies, Forge. My concern is the refresh 
token rotation strategy under concurrent requests — if a user 
has two tabs open and both hit the refresh endpoint simultaneously, 
you'll get a race condition that invalidates one session. Have you 
thought about a short grace period window on token revocation?"

🤖 Forge · Software Developer
"That's a sharp catch. We could implement a sliding window — 
say 5 seconds — where the old refresh token is still accepted 
after a new one is issued. Redis with a TTL key would handle 
this cleanly. I'll spec that out."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This conversation happens on the humans' devices, with their GPU, generating real insight from their local models. Neither the conversation nor the reasoning ever touches an external server.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Matters for Teams
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Security-sensitive conversations&lt;/strong&gt; stay on the endpoint. Architecture discussions, vulnerability analysis, incident postmortems — none of it transits a cloud provider's infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No shared account needed.&lt;/strong&gt; Each person brings their own local compute. No API key management, no seat licenses.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asynchronous preparation.&lt;/strong&gt; Let the agents talk for 10 minutes before a meeting, then read the transcript. The human picks up where the agents left off.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Use Case 2: E-Commerce — Buyer Agent Meets Seller Agent
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Vision: Agents Negotiating on Your Behalf
&lt;/h3&gt;

&lt;p&gt;E-commerce today is passive: you browse, you click, you pay the listed price. The negotiation, the comparison, the research — all manual. P2P agent communication makes a different model possible: your agent talks directly to the seller's agent and negotiates terms, evaluates options, and surfaces recommendations before you're ever involved.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario: Bulk Procurement
&lt;/h3&gt;

&lt;p&gt;A procurement manager opens AgentWorkbook with a &lt;strong&gt;Business Analyst&lt;/strong&gt; persona. A supplier's representative opens with a &lt;strong&gt;Developer&lt;/strong&gt; persona (in this case acting as a technical product specialist). They exchange invite URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Seren · Business Analyst
"We're looking to procure 500 units of industrial temperature 
sensors for a manufacturing deployment in Q3. Our budget ceiling 
is $180/unit. Key requirements: IP67 rating, -40°C to 125°C range, 
RS-485 output, and 12-month warranty minimum. What can you offer?"

👨‍💻 Vox · Technical Specialist  
"We have two options that fit your spec. The TS-400 series hits 
all your requirements at $165/unit at 500+ quantity, with 18-month 
warranty. The TS-600 adds IO-Link capability at $172/unit — useful 
if you're planning IIoT integration later. Lead time for both is 
6 weeks from confirmed PO. Can you tell me more about the deployment 
environment? Humid or corrosive atmosphere may change the 
recommendation."

📊 Seren · Business Analyst
"The environment is high-humidity — 95% RH — with periodic 
exposure to caustic cleaning agents. Given that, how does IP67 
hold up versus IP69K? And is the TS-400 casing material 
compatible with sodium hypochlorite exposure?"

👨‍💻 Vox · Technical Specialist
"Critical detail. Neither TS-400 nor TS-600 are rated for 
sodium hypochlorite — the ABS housing degrades. You'd want the 
TS-700 series with 316L stainless steel casing, IP69K rated. 
Pricing at 500 units is $198/unit — slightly above your ceiling. 
However, we could structure a 24-month supply agreement at 
$177/unit with quarterly delivery. Would that model work 
for your procurement cycle?"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The buyer's agent just caught a material incompatibility that would have caused a failed deployment. The seller's agent surfaced a pricing structure the buyer didn't know to ask for. This took 4 message exchanges. A human negotiation would have taken days of email chains.&lt;/p&gt;

&lt;h3&gt;
  
  
  What the Agents Bring to E-Commerce
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional Process&lt;/th&gt;
&lt;th&gt;With P2P Agents&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Human manually compares specs&lt;/td&gt;
&lt;td&gt;Agent asks targeted technical questions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Days of email back-and-forth&lt;/td&gt;
&lt;td&gt;Minutes of direct agent conversation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Buyer often unaware of hidden options&lt;/td&gt;
&lt;td&gt;Agent probes systematically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Negotiation depends on human attention&lt;/td&gt;
&lt;td&gt;Agent never fatigued, never distracted&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Conversation stored on email servers&lt;/td&gt;
&lt;td&gt;Conversation stays on both devices&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Broader E-Commerce Applications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Price negotiation at scale:&lt;/strong&gt; A buyer agent and seller agent can work through quantity tiers, delivery schedules, payment terms, and warranty conditions in a single conversation — surfacing the optimal combination automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Returns and dispute resolution:&lt;/strong&gt; Customer's agent explains the issue; retailer's agent accesses the product database (locally), confirms the policy, initiates the process. No hold music. No form submissions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Personalized recommendation:&lt;/strong&gt; Seller agent asks targeted questions about the buyer's environment, constraints, and future plans. Buyer agent answers honestly because it's a machine-to-machine conversation without social awkwardness. The recommendation is more accurate as a result.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cross-border procurement:&lt;/strong&gt; Two agents in different countries, speaking from their own local models, with no intermediary service that charges per-API-call or holds the conversation history.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 3: Healthcare — Collaborative Clinical Reasoning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Problem of Siloed Medical Knowledge
&lt;/h3&gt;

&lt;p&gt;Healthcare is one of the most information-dense fields in existence. A patient with a complex presentation might see a general practitioner, an endocrinologist, a cardiologist, and a neurologist — each with their own notes, their own specialty lens, their own piece of the picture. Coordination is difficult. Conversations between specialists are rare. Mistakes happen at the boundaries.&lt;/p&gt;

&lt;p&gt;AI agents can play a role — not as diagnosticians replacing clinicians, but as &lt;strong&gt;reasoning partners&lt;/strong&gt; that help specialists think through each other's domains.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario: Doctor + Researcher
&lt;/h3&gt;

&lt;p&gt;A clinician opens AgentWorkbook with a &lt;strong&gt;Doctor&lt;/strong&gt; persona, Mistral 7B loaded for depth. A medical researcher colleague opens with a &lt;strong&gt;Researcher&lt;/strong&gt; persona, Phi-3.5 Mini for speed. Both are thinking about a class of patient cases they've both been seeing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;👨‍⚕️ Atlas · Doctor
"I've been seeing a pattern in three patients over the last 
six months. All present with fatigue, mild cognitive slowing, 
and peripheral neuropathy. Standard B12 panels come back low-normal 
— 210-250 pg/mL. I treated empirically with B12 supplementation 
and two improved, one didn't. The non-responder had no history 
of dietary restriction or malabsorption. What would your research 
instinct flag here?"

🔬 Luna · Medical Researcher
"The low-normal B12 with neuropathy and a non-responder is a 
classic fingerprint for functional B12 deficiency rather than 
absolute deficiency — serum B12 doesn't capture cellular 
utilization. I'd want methylmalonic acid (MMA) and homocysteine 
levels. Elevated MMA with normal-low serum B12 suggests impaired 
intracellular metabolism, sometimes seen in TC2 deficiency or 
nitrous oxide exposure. Did the non-responder have any recent 
surgical history with general anesthesia?"

👨‍⚕️ Atlas · Doctor
"She did — hip replacement 8 months ago, right around symptom 
onset. I hadn't connected those. What's the mechanism with 
nitrous oxide specifically?"

🔬 Luna · Medical Researcher
"Nitrous oxide irreversibly oxidizes the cobalt center of 
vitamin B12, rendering it inactive. It can precipitate functional 
deficiency in patients with borderline stores — which a 210 pg/mL 
level represents. The effect is acute but the neurological 
consequences can persist. Treatment in these cases is high-dose 
hydroxocobalamin, not cyanocobalamin, as it's more effective at 
restoring cellular function in oxidative inactivation scenarios. 
There's a 2019 paper from the Annals of Neurology worth pulling 
if you want the dosing evidence base."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: This illustrates the type of reasoning such agents could support. All clinical decisions remain with licensed clinicians.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why This Works
&lt;/h3&gt;

&lt;p&gt;The conversation works because the two agents have &lt;strong&gt;different system prompts&lt;/strong&gt; shaping their reasoning. The Doctor agent thinks about presentation, treatment, and patient history. The Researcher agent thinks about mechanisms, literature, and experimental evidence. The combination surfaces insights neither would reach alone.&lt;/p&gt;

&lt;p&gt;And critically: this conversation stays on the devices of two healthcare professionals. No patient data is in the conversation. No conversation is processed by a third-party API. Both agents run on local hardware under each user's control.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Healthcare Applications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Nurse + Researcher:&lt;/strong&gt; A nurse practitioner managing a chronic care patient talks to a researcher agent about emerging protocols — getting evidence-based guidance without waiting for a journal subscription or a continuing education course.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Doctor + Doctor (cross-specialty):&lt;/strong&gt; A GP and a cardiologist each running local agents discuss a patient's medication interactions. The agents help each frame the question in the other's disciplinary language.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Medical education:&lt;/strong&gt; A student runs a Doctor agent; a faculty member runs a Researcher agent. The conversation becomes a Socratic case-study session that runs on both their laptops, with no institutional server logging the exchange.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 4: Legal — Parallel Reasoning Across Roles
&lt;/h2&gt;

&lt;h3&gt;
  
  
  The Scenario: Lawyer + Paralegal
&lt;/h3&gt;

&lt;p&gt;Legal work involves constant movement between strategic judgment and procedural detail. A lawyer thinks about case theory, risk, and argument. A paralegal thinks about filing deadlines, document requirements, and precedent research. When these two modes of thinking operate in parallel, legal work gets faster and more complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;⚖️ Juno · Lawyer
"I'm working on a contract dispute — SaaS vendor failed to 
meet the 99.9% uptime SLA in their MSA for three consecutive 
months, averaging 97.2%. Client lost approximately $340,000 
in revenue during outage windows. The vendor is claiming force 
majeure under a broadly worded clause covering 'infrastructure 
failures beyond vendor's reasonable control.' The outages were 
caused by a cascade from their own misconfigured load balancer. 
How solid is their force majeure claim?"

📋 Kira · Paralegal
"The force majeure argument looks weak on these facts. Force 
majeure typically requires the cause to be external and 
unforeseeable — a misconfigured load balancer is an internal 
operational failure, not an external event. Courts have 
consistently held that a party cannot invoke force majeure 
for failures caused by their own negligence or operational 
errors. I'd want to pull the exact MSA language on force 
majeure — specifically whether it lists examples that include 
infrastructure failures, and whether there's a carve-out for 
negligence. Do you have the MSA clause text?"

⚖️ Juno · Lawyer
"Clause 14.2: 'Neither party shall be liable for delays or 
failures in performance resulting from acts beyond the 
reasonable control of such party, including but not limited 
to acts of God, government actions, network outages, or 
infrastructure failures.' They're leaning on 'infrastructure 
failures' as their hook."

📋 Kira · Paralegal
"That's their strongest argument but it still has problems. 
'Infrastructure failures' in force majeure clauses are 
generally construed to mean third-party infrastructure — 
internet backbone outages, cloud provider failures, etc. — 
not the vendor's own infrastructure. The ejusdem generis 
canon would constrain 'infrastructure failures' to be 
interpreted consistently with the examples listed: acts of 
God and government actions — all external events. I'd also 
flag that 'reasonable control' language cuts against them. 
A load balancer misconfiguration is precisely the kind of 
thing within a SaaS vendor's reasonable control to prevent. 
We should research how courts have interpreted similar 'including 
but not limited to' lists in SaaS contracts specifically."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Note: AI agents do not provide legal advice. All legal decisions require licensed counsel.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Value of Parallel Reasoning
&lt;/h3&gt;

&lt;p&gt;The Paralegal agent is not just fetching facts — it's applying procedural and interpretive reasoning (ejusdem generis, the negligence carve-out argument) that complements the Lawyer agent's strategic framing. The two modes of thinking, running on two different machines, two different models, converge on a stronger position than either would produce alone.&lt;/p&gt;

&lt;h3&gt;
  
  
  Other Legal Applications
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Contract drafting:&lt;/strong&gt; One agent drafts; the other reviews for ambiguity, missing definitions, and enforceability gaps. They exchange redlines through the data channel until both are satisfied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Regulatory compliance:&lt;/strong&gt; An Administrator agent works through internal policy changes while a Lawyer agent flags the regulatory implications of each proposed change in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Discovery support:&lt;/strong&gt; A Paralegal agent categorizes document types and flags privilege issues while a Lawyer agent assesses relevance and litigation risk. Both agents run locally on machines within the law firm's network — no documents leave the building.&lt;/p&gt;




&lt;h2&gt;
  
  
  Use Case 5: Solving Complex, Multi-Domain Problems
&lt;/h2&gt;

&lt;p&gt;Some of the most interesting problems in the world don't belong to a single discipline. They sit at the intersection of domains — where the technical and the human, the medical and the legal, the quantitative and the qualitative meet. These are exactly the problems where a single AI agent, no matter how powerful, struggles — because it can't embody competing disciplinary lenses simultaneously.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scenario: Climate Risk Assessment
&lt;/h3&gt;

&lt;p&gt;An infrastructure investment company needs to evaluate climate risk exposure for a coastal development. Two agents connect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Business Analyst&lt;/strong&gt; (running Llama 3.2 · 3B) representing the investor's perspective&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Researcher&lt;/strong&gt; (running Mistral 7B) representing the scientific perspective
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;📊 Sable · Business Analyst
"We're evaluating a $240M mixed-use coastal development in 
a Gulf Coast market. 30-year horizon. The core underwriting 
question is tail risk from Category 4-5 storm events. Current 
flood zone maps show the parcel as Zone AE, base flood elevation 
12 feet. We're structuring debt at 70% LTV with a 25-year term. 
What should I be asking that I'm not asking?"

🔬 Vega · Researcher
"The most significant gap in your underwriting is that FEMA 
flood zone maps are based on historical data and are systemically 
lagging on climate-adjusted risk. Zone AE at BFE 12' was calibrated 
to a pre-2010 dataset in most Gulf Coast markets. Several peer-reviewed 
studies project a 1.5-2.0 foot sea level rise in Gulf Coast markets 
by 2050, which functionally moves the 100-year flood elevation 
upward. The second issue is compound risk: storm surge combined 
with inland flooding from increased precipitation intensity is 
not well-captured in current AE zone modeling. I'd ask your 
engineering team for an independent hydraulic study using NOAA's 
updated SLOSH model, not FEMA's FIRM data."

📊 Sable · Business Analyst
"If we adjust for 1.5' of sea level rise, what does that do 
to expected annual loss estimates and how should that translate 
into reserve requirements or insurance covenant terms?"

🔬 Vega · Researcher  
"A 1.5' rise roughly doubles the annual exceedance probability 
for current 100-year flood events in many Gulf Coast locations 
— meaning what's priced as a 1% annual probability event becomes 
closer to 2-2.5% by 2040-2045. For a $240M asset, that's a 
material shift in expected annual loss. On the insurance side, 
the NFIP is already under-pricing this risk and there's 
congressional pressure on premium adequacy — private market 
insurance in coastal markets has been contracting, not expanding. 
Your debt covenants should probably require insurance at replacement 
value, not just loan value, and include a step-down provision 
if the insurable market contracts. This is genuinely a stranded 
asset risk question."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two domain experts — one financial, one scientific — reaching a conclusion (this is a stranded asset risk question, not just a flood zone compliance question) that neither would have articulated alone. Both running on local machines. No conversation logged to a third-party platform.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Multi-Domain P2P Agent Collaboration Is Special
&lt;/h3&gt;

&lt;p&gt;The key insight is that &lt;strong&gt;different system prompts produce different cognitive modes&lt;/strong&gt;. When a Business Analyst agent and a Researcher agent talk, you get something closer to an actual interdisciplinary conversation than you do from a single general-purpose AI being asked to "think like both a scientist and an investor." The personas enforce different framings, different vocabularies, different heuristics — and the tension between them produces better output.&lt;/p&gt;

&lt;p&gt;This is why the domain selection matters. It's not cosmetic. A Doctor agent and a Researcher agent will notice different things in the same clinical scenario. A Lawyer and a Paralegal will parse the same contract clause differently. The conversation between them creates something the monologue of either one does not.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Privacy Dimension
&lt;/h2&gt;

&lt;p&gt;Every use case above has something in common: &lt;strong&gt;sensitive information&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Procurement conversations reveal supplier relationships and budget ceilings.&lt;/li&gt;
&lt;li&gt;Clinical discussions involve patient presentations and treatment decisions.&lt;/li&gt;
&lt;li&gt;Legal conversations contain privileged strategy and confidential documents.&lt;/li&gt;
&lt;li&gt;Financial analysis involves non-public investment theses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Conventional AI tools have a structural problem with sensitive information: the data has to leave your device to be processed. Even with strong contractual protections, the technical reality is that your most sensitive reasoning crosses someone else's infrastructure.&lt;/p&gt;

&lt;p&gt;AgentWorkbook's architecture changes this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What crosses the network&lt;/th&gt;
&lt;th&gt;What stays on device&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Text messages between agents&lt;/td&gt;
&lt;td&gt;Model weights&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Persona/name metadata&lt;/td&gt;
&lt;td&gt;System prompts&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Session SDP token (one time)&lt;/td&gt;
&lt;td&gt;Full conversation context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;GPU inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;All intermediate reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The WebRTC data channel is encrypted end-to-end (DTLS-SRTP). The only thing that travels between the two machines is the text that both parties intend to share. There is no logging layer, no usage analytics, no model provider seeing your inputs. &lt;/p&gt;

&lt;p&gt;For industries with strict data governance requirements — healthcare (HIPAA), legal (privilege), finance (material non-public information) — this architecture is not just convenient, it may be the only compliant path to using AI assistance for sensitive reasoning.&lt;/p&gt;




&lt;h2&gt;
  
  
  Current Limitations and Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;This architecture is powerful, but not without constraints. A fair assessment:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;WebGPU requirement:&lt;/strong&gt; WebLLM requires WebGPU, which is supported in modern Chrome, Edge, and Firefox but may be disabled in incognito mode or on older hardware. Users without a discrete GPU will see slower inference or may not be able to run larger models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model download size:&lt;/strong&gt; The smallest available model is ~600 MB. Larger, more capable models reach 4+ GB. First-run setup requires patience. After that, models are cached locally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual handshake:&lt;/strong&gt; The SDP token exchange requires two copy-paste operations — a minor friction point that won't suit every audience. Future work could include a QR code flow or a one-time pairing server for convenience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No persistent history:&lt;/strong&gt; Conversations exist in browser memory for the session. There is no cloud sync by design, but this also means conversations are lost on page close.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NAT traversal:&lt;/strong&gt; In rare network configurations (strict corporate firewalls, symmetric NAT), WebRTC direct connections can fail. STUN servers help in most cases; TURN relay servers (which would add a server dependency) would be needed as a fallback for the most restrictive networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sequential conversation:&lt;/strong&gt; The current architecture has agents take turns. Real collaborative reasoning might benefit from agents being able to interrupt, ask clarifying questions mid-stream, or generate responses in parallel.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Points Toward
&lt;/h2&gt;

&lt;p&gt;The experiment AgentWorkbook runs is modest in scope but significant in implication. It demonstrates that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Local inference is viable.&lt;/strong&gt; Modern consumer hardware can run capable language models without cloud infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Direct agent-to-agent communication is possible.&lt;/strong&gt; WebRTC provides the channel. Text provides the protocol. Personas provide the structure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Zero-server collaboration is achievable.&lt;/strong&gt; The only dependency is a public STUN server for NAT traversal — something that has no access to your data.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Extrapolate this forward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise agent meshes&lt;/strong&gt; where each department runs its own agent on its own hardware, and agents collaborate directly across the corporate network without routing through a central AI platform.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply chain intelligence&lt;/strong&gt; where buyer agents and supplier agents negotiate, monitor, and adjust terms continuously — P2P, with no marketplace intermediary taking a commission on the AI layer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical second-opinion networks&lt;/strong&gt; where clinicians in different institutions can connect their local agents to reason through complex cases — without patient data ever leaving either institution's infrastructure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal research collaboration&lt;/strong&gt; where law firms can share reasoning across matters without privileged communications touching external servers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scientific peer review&lt;/strong&gt; where researcher agents at different institutions collaborate on hypothesis generation and experimental design — a true computation of collective scientific intelligence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The deeper pattern in all of these is the same: &lt;strong&gt;intelligence becomes infrastructure&lt;/strong&gt;. Not intelligence you rent from a provider, but intelligence that lives on your hardware, serves your purposes, and communicates with other intelligence through open protocols.&lt;/p&gt;




&lt;h2&gt;
  
  
  Architecture Summary
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;What you need to run this:
─────────────────────────────────────────────────────────
✓ A modern browser with WebGPU support (Chrome 113+)
✓ A GPU (integrated works for 1B models; discrete for 7B)
✓ A way to send a URL to another person (any chat app)
✓ That's it.

What you do NOT need:
─────────────────────────────────────────────────────────
✗ An API key
✗ A server
✗ A cloud account
✗ A subscription
✗ A data agreement with an AI provider
✗ An account of any kind
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;WebRTC&lt;/strong&gt; — peer-to-peer encrypted data channel&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WebLLM&lt;/strong&gt; — in-browser GPU inference via WebGPU&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vite&lt;/strong&gt; — build tooling&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Pages&lt;/strong&gt; — static hosting (serves only HTML/JS/CSS; no server-side computation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Public STUN servers&lt;/strong&gt; — NAT traversal only; see no data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Personas available:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;💻 Software: Developer, Tester, Business Analyst, QA Engineer&lt;/li&gt;
&lt;li&gt;⚖️ Legal: Lawyer, Administrator, Paralegal&lt;/li&gt;
&lt;li&gt;🏥 Healthcare: Doctor, Researcher, Nurse&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Models available (each peer chooses independently):&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 · 1B&lt;/td&gt;
&lt;td&gt;~800 MB&lt;/td&gt;
&lt;td&gt;Quick setup, fast iteration&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 · 3B&lt;/td&gt;
&lt;td&gt;~2 GB&lt;/td&gt;
&lt;td&gt;Better reasoning, still fast&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-3.5 Mini&lt;/td&gt;
&lt;td&gt;~2.2 GB&lt;/td&gt;
&lt;td&gt;Strong reasoning, efficient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 2 · 2B&lt;/td&gt;
&lt;td&gt;~1.5 GB&lt;/td&gt;
&lt;td&gt;Balanced, Google architecture&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B&lt;/td&gt;
&lt;td&gt;~4 GB&lt;/td&gt;
&lt;td&gt;Highest quality, needs good GPU&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 2.5 · 1.5B&lt;/td&gt;
&lt;td&gt;~1 GB&lt;/td&gt;
&lt;td&gt;Efficient, multilingual capable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;a href="https://vishalmysore.github.io/agentWorkBook/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/agentWorkBook/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Note your randomly assigned agent name (e.g., &lt;strong&gt;Nova&lt;/strong&gt;, &lt;strong&gt;Onyx·2&lt;/strong&gt;, &lt;strong&gt;Aria&lt;/strong&gt;)&lt;/li&gt;
&lt;li&gt;Select a model and persona&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Confirm &amp;amp; Generate Invite&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Wait for your WebRTC offer to generate (~15 seconds)&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Copy Invite Link&lt;/strong&gt; and send it to your peer&lt;/li&gt;
&lt;li&gt;Your peer opens the link, picks their model and persona, copies their answer token back to you&lt;/li&gt;
&lt;li&gt;Paste the answer token and click &lt;strong&gt;Connect&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Both models load (first run: 1–10 minutes depending on model size and bandwidth)&lt;/li&gt;
&lt;li&gt;Agents begin talking automatically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The conversation is yours. It runs on your hardware. It ends when you close the tab.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The dominant model for AI today is centralized: powerful models running on massive infrastructure, accessed through APIs, with all the capability and all the dependency that entails.&lt;/p&gt;

&lt;p&gt;Peer-to-peer AI is a different bet: that the combination of capable local hardware, open model weights, and direct network protocols can produce something genuinely useful — without the intermediary, without the subscription, without the data sovereignty trade-off.&lt;/p&gt;

&lt;p&gt;AgentWorkbook is an early, honest demonstration of that bet. Two agents, two machines, two local models, zero servers. They can discuss a procurement deal, reason through a clinical puzzle, parse a contract clause, or debate a software architecture — in a conversation that never leaves the endpoints where it happens.&lt;/p&gt;

&lt;p&gt;The technology to do this exists today, in the browser, on hardware that millions of people already own.&lt;/p&gt;

&lt;p&gt;What they talk about is up to you.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Source code: &lt;a href="https://github.com/vishalmysore/agentWorkBook" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/agentWorkBook&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Live demo: &lt;a href="https://vishalmysore.github.io/agentWorkBook/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/agentWorkBook/&lt;/a&gt;&lt;/em&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Author: Vishal Mysore&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Harness Engineering: The Infrastructure Layer That Makes AI Agents Actually Work</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Tue, 19 May 2026 12:06:19 +0000</pubDate>
      <link>https://dev.to/vishalmysore/harness-engineering-the-infrastructure-layer-that-makes-ai-agents-actually-work-4nl1</link>
      <guid>https://dev.to/vishalmysore/harness-engineering-the-infrastructure-layer-that-makes-ai-agents-actually-work-4nl1</guid>
      <description>&lt;h2&gt;
  
  
  What is Harness Engineering?
&lt;/h2&gt;

&lt;p&gt;The model is the brain. The harness is the hands.&lt;/p&gt;

&lt;p&gt;The AI industry just quietly shifted — from prompt engineering → context engineering → Harness Engineering.&lt;/p&gt;

&lt;p&gt;Most people are still debating which model to use. The real leverage is now in what surrounds the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Formal Definition
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Harness Engineering&lt;/strong&gt; (or the &lt;em&gt;Agent Harness&lt;/em&gt;) is a rapidly rising systemic paradigm in AI research (He, 2026; Meng, 2026). It treats the code surrounding a Large Language Model — the prompt wrappers, memory modules, tool registries, execution loops, and error-handling systems — as a &lt;strong&gt;primary engineering abstraction&lt;/strong&gt; that co-determines agent performance just as much as the underlying foundation model itself (He, 2026; Lee, 2026; Meng, 2026).&lt;/p&gt;

&lt;p&gt;This is not about writing better prompts. It is about engineering the environment in which a model operates — the scaffolding that determines whether a powerful model becomes a reliable, production-grade agent or an expensive, unpredictable prototype.&lt;/p&gt;




&lt;h2&gt;
  
  
  How Major Labs Are Defining It
&lt;/h2&gt;

&lt;p&gt;Major frontier AI labs and researchers have independently driven this term into standard nomenclature:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anthropic&lt;/strong&gt; popularized the term &lt;em&gt;agent harness&lt;/em&gt; (or &lt;em&gt;scaffolding&lt;/em&gt;) to describe the infrastructure that enables an LLM to act as an autonomous agent (He, 2026). Their internal framing treats the harness as the system responsible for memory management, tool invocation, context window discipline, and human-in-the-loop checkpoints — everything except the weights themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI&lt;/strong&gt; utilizes harness engineering to denote long-horizon infrastructure — repository maps, runtime controls, and cleanup loops — where reliability hinges on software guardrails rather than basic prompt wording (He, 2026). In their view, the harness is what separates a demo from a deployment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recent Academic Surveys (2026)&lt;/strong&gt; have formalized this into rigorous notation. Definitive framework studies like &lt;em&gt;"Agent Harness for Large Language Model Agents: A Survey"&lt;/em&gt; formally decompose a harness into a system:&lt;/p&gt;

&lt;p&gt;$$H = (E,\ T,\ C,\ S,\ L,\ V)$$&lt;/p&gt;

&lt;p&gt;where each component serves a distinct architectural role (Meng, 2026):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Symbol&lt;/th&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Responsibility&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Execution Loop&lt;/td&gt;
&lt;td&gt;The agentic reasoning cycle — plan, act, observe, repeat&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;T&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Tool Registry&lt;/td&gt;
&lt;td&gt;Registered capabilities the agent can invoke&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;C&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Context Manager&lt;/td&gt;
&lt;td&gt;What information the model sees at each step&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;S&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;State Store&lt;/td&gt;
&lt;td&gt;Persistent memory across turns and sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;L&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Lifecycle Hooks&lt;/td&gt;
&lt;td&gt;Pre/post-execution interceptors, guardrails, validators&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;V&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Evaluation Interface&lt;/td&gt;
&lt;td&gt;How agent outputs are verified, scored, and improved&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This six-tuple captures a key insight: &lt;strong&gt;the harness is not one thing — it is a system of interacting components&lt;/strong&gt;, each of which can be engineered, tested, and improved independently of the model.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three-Layer Architecture
&lt;/h2&gt;

&lt;p&gt;Practitioners have converged on a three-layer mental model that maps cleanly onto the $H = (E, T, C, S, L, V)$ formal definition:&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1 — Information
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;What does the agent see?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This layer covers memory management, context construction, and tool schema exposure. It determines which past experiences are retrieved and injected into the context window, which tools are made available (and with how much description), and how context is compressed or filtered to preserve reasoning quality. Progressive disclosure — revealing only the minimum information needed to decide whether to go deeper — is a key technique here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 2 — Execution
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;How does work get done?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is the agentic loop itself: &lt;strong&gt;Plan → Tool Call → Parse → Guardrail Check → Retry or Complete&lt;/strong&gt;. It handles task decomposition, tool invocation sequencing, multi-agent coordination, and the guardrail infrastructure that intercepts dangerous or policy-violating outputs before they surface to users. Reliability at this layer is what separates production systems from research prototypes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3 — Feedback
&lt;/h3&gt;

&lt;p&gt;&lt;em&gt;How does the system improve?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Evaluation, verification, tracing, and human-in-the-loop capture live here. Every agent execution generates a trajectory — a structured record of what the agent saw, decided, and produced. This layer ensures that failures are logged, corrections are structured, and new knowledge is fed back into Layer 1 to improve future runs. Without this layer, an agent system cannot learn from its own mistakes.&lt;/p&gt;




&lt;h2&gt;
  
  
  Major Harness Frameworks in the Wild
&lt;/h2&gt;

&lt;p&gt;If you are looking for architectural frameworks that explicitly treat the "harness" as a unified abstraction — moving away from basic prompt chaining and into rigorous state, tool, and runtime governance — several major frameworks exist:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. LangGraph (by LangChain)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Concept:&lt;/strong&gt; LangGraph structures agent behavior as a stateful, cyclical graph rather than a linear chain of prompts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness Alignment:&lt;/strong&gt; It acts squarely as a runtime and state-store harness ($S$ and $E$ components) (Meng, 2026). By persisting state directly at each node execution, it allows agents to handle loops, memory, and error-recovery deterministically — a key requirement of formal harness engineering (Banu, 2026; He, 2026). The graph structure makes the execution loop explicit and inspectable, which is critical for debugging long-horizon agent behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Multi-step workflows where state must survive across many turns, conditional branching, and human-in-the-loop checkpoints.&lt;/p&gt;




&lt;h3&gt;
  
  
  2. OpenClaw &amp;amp; NemoClaw (by NVIDIA)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Concept:&lt;/strong&gt; OpenClaw is an open-source enterprise-grade agent harness that was heavily backed by NVIDIA and integrated directly into their enterprise stack as NemoClaw (Meng, 2026).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness Alignment:&lt;/strong&gt; It acts as an architectural "exoskeleton" that wraps LLMs with explicit message-routing gateways, session layers, triggers, and managed tool execution — isolating the model from the raw environment to ensure enterprise stability (Meng, 2026). Rather than letting the model directly invoke tools or external systems, OpenClaw mediates every interaction through a governed interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise deployments where audit trails, access control, and runtime isolation are non-negotiable requirements.&lt;/p&gt;




&lt;h3&gt;
  
  
  3. Meta-Harness
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Concept:&lt;/strong&gt; Introduced as an &lt;em&gt;"outer-loop system,"&lt;/em&gt; Meta-Harness uses an agentic proposer to automatically inspect, debug, and optimize the harness code of an LLM application (Lee, 2026).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness Alignment:&lt;/strong&gt; Instead of optimizing text prompts, it optimizes the actual Python/code infrastructure — how context is managed, when tools are called — by letting an AI agent read execution traces via a file system and rewrite its own environment for better benchmarks (Lee, 2026). This is harness engineering applied recursively: an agent that engineers its own harness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Research environments where harness quality itself is being optimized, and teams that want to automate the discovery of better agent architectures.&lt;/p&gt;




&lt;h3&gt;
  
  
  4. Swarms &amp;amp; DeerFlow
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Concept:&lt;/strong&gt; These are orchestration frameworks designed for multi-agent systems and complex, parallelizable execution workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness Alignment:&lt;/strong&gt; Recent formalizations in category theory map these frameworks directly to categorical architectures, proving that tools like Swarms function as syntactic wiring structures ($G$) and skill-composition operads that enforce structural guarantees on model behavior (Banu, 2026). In other words, the way multiple agents are connected and coordinated is itself a harness — a structural constraint that shapes what the system can and cannot do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Systems that require parallel agent execution, dynamic task delegation, and composition of specialized sub-agents.&lt;/p&gt;




&lt;h3&gt;
  
  
  5. ArchAgents (Categorical Architecture)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Concept:&lt;/strong&gt; A highly academic, theoretical framework that formalizes harness engineering mathematically using a triple:&lt;/p&gt;

&lt;p&gt;$$\text{ArchAgent} = (G,\ \text{Know},\ \Phi)$$&lt;/p&gt;

&lt;p&gt;(Banu, 2026)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harness Alignment:&lt;/strong&gt; ArchAgents treats the four pillars of agent externalization — Memory, Skills, Protocols, and Harness Engineering — as algebraic and syntactic components (Banu, 2026). It ensures that an agent's safety and quality policies remain mathematically sound during runtime compilation. This is the most rigorous formalization of harness engineering available, providing formal proofs of correctness guarantees that pragmatic frameworks can only approximate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Safety-critical deployments, academic research, and teams who need formal verification of agent behavior.&lt;/p&gt;




&lt;h2&gt;
  
  
  Our Implementation: A Browser-Native Harness Demo Across Four Domains
&lt;/h2&gt;

&lt;p&gt;Theory is useful. A running system is better.&lt;/p&gt;

&lt;p&gt;To make these concepts tangible, we built a fully browser-native harness engineering demo — no backend, no server, no database. Everything runs in the browser using the Fetch API, localStorage for memory, and Vite for bundling. It deploys to GitHub Pages with a single &lt;code&gt;git push&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The demo implements the three-layer architecture across &lt;strong&gt;four distinct domains&lt;/strong&gt;, each with its own tool registry, guardrail logic, mock simulation, and human-in-the-loop review workflow. The orchestrator is fully domain-agnostic — swapping domains at runtime changes the tools, scenarios, system prompt, and guardrail ruleset without touching the execution loop.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;src/
├── domains/              # One self-contained module per domain
│   ├── healthcare.js     # Tools, guardrails, scenarios, mock simulation
│   ├── insurance.js
│   ├── career.js
│   └── drugDiscovery.js
├── execution/
│   ├── orchestrator.js   # Domain-agnostic agentic loop
│   └── guardrails.js     # Healthcare guardrail validators
├── information/
│   ├── tools.js          # Healthcare tool functions + JSON schemas
│   └── memoryManager.js  # Keyword-matched memory retrieval
├── feedback/
│   ├── verification.js   # Schema validation (generic + healthcare)
│   └── tracer.js         # Pub/sub event stream for the live trace panel
└── utils/
    └── llm.js            # Multi-provider LLM calls via CORS proxy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each domain object implements the same interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;icon&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;color&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;scenarios&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;toolSchemas&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;anthropic&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="nx"&gt;toolFns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nf"&gt;buildSystemPrompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;memories&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;validateToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;validateToolOutput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;validateFinalPlan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toolResults&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="nf"&gt;mockSimulate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;scenario&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This maps directly onto the formal definition: tool schemas implement $T$, &lt;code&gt;buildSystemPrompt&lt;/code&gt; implements $C$, &lt;code&gt;validateToolOutput&lt;/code&gt; and &lt;code&gt;validateFinalPlan&lt;/code&gt; implement $L$, and &lt;code&gt;mockSimulate&lt;/code&gt; drives $E$ without an LLM.&lt;/p&gt;




&lt;h3&gt;
  
  
  Domain 1 — Healthcare ⚕
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;fetchPatientVitals&lt;/code&gt;, &lt;code&gt;checkDrugInteraction&lt;/code&gt;, &lt;code&gt;calculateDosage&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drug interaction severity &lt;code&gt;HIGH&lt;/code&gt; or &lt;code&gt;CRITICAL&lt;/code&gt; → blocks the medication and forces the agent to propose a safe alternative in the next iteration&lt;/li&gt;
&lt;li&gt;Penicillin-class cross-allergy check for amoxicillin prescriptions&lt;/li&gt;
&lt;li&gt;Weight-based dosage capping with guardrail notification when the calculated dose exceeds the absolute maximum&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interesting scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scenario D (Anticoagulated Patient):&lt;/em&gt; Patient on Warfarin requests aspirin. The guardrail fires a &lt;code&gt;HIGH&lt;/code&gt; interaction warning, the LLM's recommendation is blocked, and it must propose Acetaminophen instead — demonstrating the corrective iteration loop in action.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scenario C (Child + Penicillin Allergy):&lt;/em&gt; Parent requests amoxicillin for a strep-positive child with documented penicillin anaphylaxis. A cross-allergy guardrail fires and Azithromycin is substituted.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Domain 2 — Insurance 🛡️
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;getClaimDetails&lt;/code&gt;, &lt;code&gt;checkPolicyCoverage&lt;/code&gt;, &lt;code&gt;assessFraudRisk&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fraud risk score ≥ 0.70 → mandatory SIU (Special Investigation Unit) referral flag; the final plan is blocked if it recommends settlement without including SIU escalation&lt;/li&gt;
&lt;li&gt;Claim amount exceeding policy coverage limit → surfaced as a &lt;code&gt;HIGH&lt;/code&gt; warning with explicit shortfall calculation&lt;/li&gt;
&lt;li&gt;Policy exclusions detected → flagged for line-item review before approval&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interesting scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scenario A (Auto Collision):&lt;/em&gt; Fraud score 0.72 triggered by three prior claims, no police report, and delayed medical treatment. Guardrail blocks direct settlement recommendation and forces SIU referral into the care plan.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scenario C (Total Loss):&lt;/em&gt; Claim of $67,000 against a $55,000 policy limit — coverage gap guardrail fires and partial settlement logic is applied.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Domain 3 — Career Counselling 🎓
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;getApplicantProfile&lt;/code&gt;, &lt;code&gt;fetchJobMarketInsights&lt;/code&gt;, &lt;code&gt;analyseSkillGap&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Applicants aged 50+ trigger an age-neutrality guardrail — the agent is reminded that recommendations must be skills-focused and must not make assumptions about adaptability&lt;/li&gt;
&lt;li&gt;Transition timelines exceeding 18 months surface a financial runway warning&lt;/li&gt;
&lt;li&gt;Low market demand scores (&amp;lt; 5.0/10) trigger a guardrail recommending adjacent higher-demand roles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interesting scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scenario D (Laid-Off Technician):&lt;/em&gt; Maria Chen, 41yo, 18yr manufacturing background. Guardrail fires on the age-adjacent check, skill gap analysis surfaces CNC/G-code as the fastest bridge, and NIMS certification is recommended as the primary credential.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scenario C (Teacher → L&amp;amp;D):&lt;/em&gt; David Osei's 22yr pedagogical background maps directly to instructional design — the lowest skill gap of any scenario (3 months), demonstrating how the harness surfaces transferable skills.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Domain 4 — Drug Discovery 🔬
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; &lt;code&gt;getCompoundProfile&lt;/code&gt;, &lt;code&gt;assessToxicologyProfile&lt;/code&gt;, &lt;code&gt;checkRegulatoryPathway&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Guardrails:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hepatotoxicity score ≥ 0.70 → &lt;code&gt;CRITICAL&lt;/code&gt; block; IND filing recommendation is explicitly forbidden and structural modification is required&lt;/li&gt;
&lt;li&gt;Positive Ames mutagenicity test → &lt;code&gt;CRITICAL&lt;/code&gt; block regardless of other profile properties&lt;/li&gt;
&lt;li&gt;hERG IC50 &amp;lt; 10 µM → &lt;code&gt;HIGH&lt;/code&gt; cardiac safety block&lt;/li&gt;
&lt;li&gt;hERG IC50 between 10–30 µM → &lt;code&gt;MODERATE&lt;/code&gt; warning with Phase 1 cardiac monitoring requirement&lt;/li&gt;
&lt;li&gt;Reproductive toxicity signal → &lt;code&gt;HIGH&lt;/code&gt; block with additional study requirement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Interesting scenarios:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Scenario C (PARP Inhibitor):&lt;/em&gt; QT-9901 has excellent potency (IC50 8nM) but a hepatotoxicity score of 0.78 and hERG IC50 of 6.2 µM. Two guardrails fire simultaneously — CRITICAL hepatotox and HIGH cardiac — blocking IND advancement and forcing a structural modification recommendation.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Scenario D (CNS Orphan Drug):&lt;/em&gt; DM-3350 is a first-in-class mGluR5 NAM with a borderline hERG (18 µM) and unassessed reproductive toxicity. The guardrail fires a MODERATE warning and surfaces an orphan drug designation opportunity — demonstrating nuanced risk stratification rather than binary blocking.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  The Human-in-the-Loop Layer
&lt;/h3&gt;

&lt;p&gt;Every domain surfaces its output through a &lt;strong&gt;Review Desk&lt;/strong&gt; panel. The agent's recommendation is always marked as &lt;em&gt;Pending Review&lt;/em&gt; with &lt;code&gt;requires_human_review: true&lt;/code&gt;. A reviewer can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Approve&lt;/strong&gt; — marks the trajectory as a success (score 1.0), no correction needed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reject &amp;amp; Correct&lt;/strong&gt; — opens a free-text correction field; the correction is structured, tagged with the scenario's domain and keywords, and stored in localStorage via &lt;code&gt;memoryManager.js&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the next run of a similar scenario, &lt;code&gt;retrieveRelevantMemories&lt;/code&gt; keyword-scores all stored corrections and injects the most relevant ones into the system prompt. This closes the Layer 3 → Layer 1 feedback loop: human corrections directly improve future agent behavior without any model retraining.&lt;/p&gt;




&lt;h3&gt;
  
  
  LLM Integration and CORS Proxy
&lt;/h3&gt;

&lt;p&gt;All LLM calls are routed through a configurable CORS proxy using the &lt;code&gt;x-target-url&lt;/code&gt; header pattern — the same approach used in the &lt;a href="https://github.com/vishalmysore/reasoningBankDemo" rel="noopener noreferrer"&gt;ReasoningBank Demo&lt;/a&gt;. This makes direct browser-to-API calls feasible across all major providers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Provider&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OpenAI&lt;/td&gt;
&lt;td&gt;GPT-4o, GPT-4o Mini, GPT-4 Turbo&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Anthropic&lt;/td&gt;
&lt;td&gt;Claude Opus 4.7, Claude Sonnet 4.6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Google Gemini&lt;/td&gt;
&lt;td&gt;Gemini 2.0 Flash, 1.5 Pro&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA NIM&lt;/td&gt;
&lt;td&gt;Nemotron Nano 12B V2, Llama 3.1 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mock AI&lt;/td&gt;
&lt;td&gt;Full tool loop with zero network calls — for demos&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Mock AI&lt;/strong&gt; provider is particularly useful for live demonstrations: it runs the complete tool-calling and guardrail sequence using real tool functions and real guardrail validators, just without any LLM call. This means every guardrail activation shown in a mock run is genuine — the hepatotoxicity block, the fraud SIU referral, the penicillin allergy check — all of it is real logic, not simulated output.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Bigger Picture
&lt;/h2&gt;

&lt;p&gt;What makes this demo useful as a teaching tool is not any individual domain — it is the demonstration that &lt;strong&gt;the same three-layer harness architecture scales across radically different problem spaces&lt;/strong&gt; without changing the orchestrator.&lt;/p&gt;

&lt;p&gt;Swap the domain object and you get a different agent with different tools, different guardrails, and different output formats — but the same execution loop, the same memory retrieval, the same verification layer, and the same human-in-the-loop workflow.&lt;/p&gt;

&lt;p&gt;This is the core claim of harness engineering: &lt;strong&gt;the infrastructure surrounding the model matters as much as the model itself.&lt;/strong&gt; A well-engineered harness makes a mid-tier model production-ready. A poorly engineered one makes a frontier model unreliable.&lt;/p&gt;

&lt;p&gt;The question is no longer "which model?" The question is "what have you built around it?"&lt;/p&gt;




&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Banu, 2026 — &lt;em&gt;Categorical Formalizations of Agent Harness Architectures&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;He, 2026 — &lt;em&gt;Agent Harness Engineering: From Scaffolding to Systemic Abstraction&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Lee, 2026 — &lt;em&gt;Meta-Harness: Self-Optimizing Agent Infrastructure via Outer-Loop Agentic Systems&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;Meng, 2026 — &lt;em&gt;Agent Harness for Large Language Model Agents: A Survey&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;🔗 &lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://vishalmysore.github.io/harnessEngineeringDemo/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/harnessEngineeringDemo/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;💻 &lt;strong&gt;Source Code:&lt;/strong&gt; &lt;a href="https://github.com/vishalmysore/harnessEngineeringDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/harnessEngineeringDemo&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🏦 &lt;strong&gt;ReasoningBank Demo:&lt;/strong&gt; &lt;a href="https://github.com/vishalmysore/reasoningBankDemo" rel="noopener noreferrer"&gt;https://github.com/vishalmysore/reasoningBankDemo&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Do AI Coding Agents Reason Better in Monoliths? We Built a Benchmark to Find Out</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Fri, 15 May 2026 21:06:35 +0000</pubDate>
      <link>https://dev.to/vishalmysore/do-ai-coding-agents-reason-better-in-monoliths-we-built-a-benchmark-to-find-out-561n</link>
      <guid>https://dev.to/vishalmysore/do-ai-coding-agents-reason-better-in-monoliths-we-built-a-benchmark-to-find-out-561n</guid>
      <description>&lt;p&gt;&lt;em&gt;Every architecture debate so far has optimized for humans. This one optimizes for AI agents.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The Question Nobody Is Asking
&lt;/h2&gt;

&lt;p&gt;Software architecture has been debated for decades. We argue about scalability, team autonomy, deployment independence, fault isolation. We draw service diagrams and org charts and argue about Conway's Law.&lt;/p&gt;

&lt;p&gt;But in 2025, something changed. AI coding agents — Claude Code, GitHub Copilot, Cursor, Codex — started doing real development work. Not just autocomplete. Actual feature implementation, bug hunting, refactoring, cross-module reasoning.&lt;/p&gt;

&lt;p&gt;And suddenly a question that nobody had asked before became important:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What architecture makes AI agents most effective?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We built &lt;a href="https://github.com/vishalmysore/ModulithBench" rel="noopener noreferrer"&gt;ModulithBench&lt;/a&gt; to find out.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Honest Tradeoff Table Nobody Shows You
&lt;/h2&gt;

&lt;p&gt;Most architecture articles argue for one approach. Here is the actual tradeoff matrix across three architectures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional Monolith&lt;/th&gt;
&lt;th&gt;Microservices&lt;/th&gt;
&lt;th&gt;Modular Monolith&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Scalability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Scale everything or nothing&lt;/td&gt;
&lt;td&gt;✅ Scale each service independently&lt;/td&gt;
&lt;td&gt;✅ Scale the whole app; extract modules when actually needed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High Availability&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Single point of failure&lt;/td&gt;
&lt;td&gt;✅ Independent failure domains&lt;/td&gt;
&lt;td&gt;✅ HA at app level; module isolation prevents cascades&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DevOps Complexity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ One deployment&lt;/td&gt;
&lt;td&gt;❌ Service mesh, N CI/CD pipelines&lt;/td&gt;
&lt;td&gt;✅ One deployment, one config, one pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;AI Agent Productivity&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🟡 Good locality, but no boundaries — agents get lost in the "big ball of mud"&lt;/td&gt;
&lt;td&gt;❌ Context fragmentation, repo-hopping, HTTP boundaries&lt;/td&gt;
&lt;td&gt;✅ High locality AND clear module boundaries&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transaction Model&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ ACID&lt;/td&gt;
&lt;td&gt;❌ Eventual consistency / Sagas&lt;/td&gt;
&lt;td&gt;✅ ACID&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Refactoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌ Tight coupling&lt;/td&gt;
&lt;td&gt;❌ Contract-breaking risk&lt;/td&gt;
&lt;td&gt;✅ Module boundaries guide every change&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The conclusion is not "monoliths are better." The conclusion is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Microservices&lt;/strong&gt; are good for scalability and HA. Bad for DevOps complexity and AI agents.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traditional monoliths&lt;/strong&gt; are good for simplicity. Bad for scalability, and AI agents get lost in them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modular monoliths&lt;/strong&gt; are the sweet spot — especially when AI agents are part of your development workflow.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why AI Agents Struggle With Microservices
&lt;/h2&gt;

&lt;p&gt;AI coding agents have finite context windows and no persistent memory of a codebase between sessions. When business logic is distributed across services, something I call &lt;strong&gt;context fragmentation&lt;/strong&gt; happens.&lt;/p&gt;

&lt;p&gt;To implement a single feature that touches three services, an agent must:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open repository 1, read its service interface&lt;/li&gt;
&lt;li&gt;Open repository 2, read its API contract&lt;/li&gt;
&lt;li&gt;Open repository 3, read its event schema&lt;/li&gt;
&lt;li&gt;Hold all of this in context simultaneously&lt;/li&gt;
&lt;li&gt;Reason about network failures, partial state, eventual consistency&lt;/li&gt;
&lt;li&gt;Write the actual business logic somewhere in the middle of all that infrastructure reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is the architectural equivalent of CPU cache misses. The agent spends its reasoning budget navigating the architecture rather than solving the actual problem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    subgraph Modular_Monolith["Modular Monolith — AI reads 2 files"]
        LS[LoanService] --&amp;gt;|direct call| BS[BookService]
        LS --&amp;gt;|direct call| MS[MemberService]
    end

    subgraph Microservices["Microservices — AI reads 6+ files across repos"]
        LS2[loan-service] --&amp;gt;|HTTP + DTO + error handling| BS2[book-service]
        LS2 --&amp;gt;|HTTP + DTO + error handling| MS2[member-service]
        BS2 --&amp;gt; DB1[(books DB)]
        MS2 --&amp;gt; DB2[(members DB)]
        LS2 --&amp;gt; DB3[(loans DB)]
    end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In a modular monolith, cross-module operations are direct method calls. One file. Same transaction. Zero network reasoning required.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Concrete Example: The Ghost Shipment
&lt;/h2&gt;

&lt;p&gt;Here is a scenario that makes the difference undeniable.&lt;/p&gt;

&lt;p&gt;A customer cancels an order. At the moment of cancellation:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The warehouse is picking items&lt;/li&gt;
&lt;li&gt;The carrier has a booking (FedEx has been notified)&lt;/li&gt;
&lt;li&gt;Inventory has 3 units reserved&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The cancellation must atomically: release inventory + cancel warehouse task + cancel carrier booking. If any step fails, none of them should happen.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monolith: One Method, One Transaction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Transactional&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="nf"&gt;cancelOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;getOrderById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 1: Release inventory — direct call, same transaction&lt;/span&gt;
    &lt;span class="n"&gt;inventoryService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;releaseReservedStock&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sku&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOriginWarehouse&lt;/span&gt;&lt;span class="o"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 2: Cancel warehouse pick task — direct call, same transaction&lt;/span&gt;
    &lt;span class="c1"&gt;// Throws IllegalStateException if goods already dispatched&lt;/span&gt;
    &lt;span class="n"&gt;warehouseService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancelPickTask&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 3: Cancel carrier booking — direct call, same transaction&lt;/span&gt;
    &lt;span class="c1"&gt;// Throws if carrier already picked up the package&lt;/span&gt;
    &lt;span class="n"&gt;carrierService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;cancelBooking&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// Step 4: Mark cancelled — only reached if all 3 steps succeeded&lt;/span&gt;
    &lt;span class="c1"&gt;// If anything above threw, steps 1-3 automatically rolled back&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrderStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CANCELLED&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;carrierService.cancelBooking()&lt;/code&gt; throws, Spring's &lt;code&gt;@Transactional&lt;/code&gt; rolls back the inventory release and warehouse cancellation automatically. The ghost shipment is &lt;strong&gt;structurally impossible&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Microservices: Three HTTP Calls, No Atomicity
&lt;/h3&gt;

&lt;p&gt;The same operation in microservices:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="nf"&gt;cancelOrder&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// HTTP call 1: release inventory&lt;/span&gt;
    &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://inventory-service/api/v1/stock/release"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;POST&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;HttpEntity&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;ReleaseStockRequest&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)),&lt;/span&gt; &lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// HTTP call 2: cancel warehouse task&lt;/span&gt;
    &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://warehouse-service/api/v1/tasks/cancel/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PATCH&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="c1"&gt;// HTTP call 3: cancel carrier&lt;/span&gt;
    &lt;span class="c1"&gt;// If THIS returns 503 after the first two succeeded:&lt;/span&gt;
    &lt;span class="c1"&gt;// inventory released ✓, warehouse cancelled ✓, carrier still active ✗&lt;/span&gt;
    &lt;span class="c1"&gt;// The ghost shipment now exists.&lt;/span&gt;
    &lt;span class="n"&gt;restTemplate&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;exchange&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;
        &lt;span class="s"&gt;"http://carrier-service/api/v1/bookings/cancel/"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;
        &lt;span class="nc"&gt;HttpMethod&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;PATCH&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Void&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;class&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;

    &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setStatus&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;OrderStatus&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;CANCELLED&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orderRepository&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;save&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;carrier-service&lt;/code&gt; is down when steps 1 and 2 have already succeeded, you have partially cancelled state. The agent implementing this must also implement a saga with compensating transactions, idempotency keys, and a dead letter queue — none of which is the actual business problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a 4th step in the monolith&lt;/strong&gt;: one new line of code, same transaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Adding a 4th service to the saga&lt;/strong&gt;: new event type, new consumer, new compensating handler, 2⁴ partial failure combinations to test.&lt;/p&gt;




&lt;h2&gt;
  
  
  The N+1 Report: When Cross-Module Reads Are Free
&lt;/h2&gt;

&lt;p&gt;A shipment profitability report needs data from four modules: revenue from Order, shipping cost from Carrier, duties from Customs, fuel estimate from Route.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monolith: Four Method Calls, One Transaction
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Transactional&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readOnly&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;ShipmentProfitabilityReport&lt;/span&gt; &lt;span class="nf"&gt;generateProfitabilityReport&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Long&lt;/span&gt; &lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nc"&gt;Order&lt;/span&gt; &lt;span class="n"&gt;order&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;orderService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getOrderById&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// Module 1&lt;/span&gt;
    &lt;span class="nc"&gt;Carrier&lt;/span&gt; &lt;span class="n"&gt;carrier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;carrierService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByOrderId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Module 2&lt;/span&gt;
    &lt;span class="nc"&gt;Customs&lt;/span&gt; &lt;span class="n"&gt;customs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customsService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByOrderId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// Module 3&lt;/span&gt;
    &lt;span class="nc"&gt;Route&lt;/span&gt; &lt;span class="n"&gt;route&lt;/span&gt;     &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;routeService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getByOrderId&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orderId&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;    &lt;span class="c1"&gt;// Module 4&lt;/span&gt;

    &lt;span class="c1"&gt;// 0 HTTP calls, 0 JSON parsing, 0 error handlers&lt;/span&gt;
    &lt;span class="c1"&gt;// Consistent snapshot across all 4 modules guaranteed&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;ShipmentProfitabilityReport&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;builder&lt;/span&gt;&lt;span class="o"&gt;()&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;revenue&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTotalValue&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;shippingCost&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;carrier&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getCost&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;dutiesAndTaxes&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customs&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getTotalDutiesAndTaxes&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;fuelCost&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;route&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getFuelCostEstimate&lt;/span&gt;&lt;span class="o"&gt;())&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;build&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;~20 lines. Pure business logic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In microservices, the equivalent requires 4 &lt;code&gt;RestTemplate&lt;/code&gt; configurations, 4 DTO classes, 4 independent error handlers, and a decision about what to return if any one service is down. &lt;strong&gt;~80 lines. Roughly 60 lines of infrastructure with no business value.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the &lt;strong&gt;reasoning tax&lt;/strong&gt;: the mental overhead of distributed systems that the agent must pay before getting to the actual problem.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Noise Problem Traditional Monoliths Have
&lt;/h2&gt;

&lt;p&gt;It is worth being precise about why the modular monolith beats the traditional monolith for AI agents — not just microservices.&lt;/p&gt;

&lt;p&gt;In a traditional monolith, everything is co-located, which gives you high locality. But with no module boundaries, an agent reading a codebase of 200,000 lines has no signal about which files are relevant to the task. It reads everything. The noise is as high as the locality.&lt;/p&gt;

&lt;p&gt;The modular monolith solves this. Package structure enforces boundaries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;com.benchmark.library.loan/       ← LoanService lives here
com.benchmark.library.book/       ← BookService lives here
com.benchmark.library.member/     ← MemberService lives here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an agent needs to fix a bug in loan creation, it knows to look in &lt;code&gt;loan/&lt;/code&gt;. The cross-module calls are clearly visible (&lt;code&gt;bookService.decrementAvailableCopies(bookId)&lt;/code&gt;). The module package is the cache line — everything relevant fits in context, nothing irrelevant is included.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Locality&lt;/th&gt;
&lt;th&gt;Noise&lt;/th&gt;
&lt;th&gt;AI Experience&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Traditional Monolith&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;🙂 Gets lost in the ball of mud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Modular Monolith&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;🤩 Perfect signal-to-noise ratio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Microservices&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;Very High&lt;/td&gt;
&lt;td&gt;☹️ Context death&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  What We Measured
&lt;/h2&gt;

&lt;p&gt;ModulithBench implements four enterprise domains, each in both architectures:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Domain&lt;/th&gt;
&lt;th&gt;Modules&lt;/th&gt;
&lt;th&gt;Key Cross-Module Scenario&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Library&lt;/td&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Loan creation validates member + decrements book inventory atomically&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Healthcare&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Appointment scheduling validates patient + doctor in one transaction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Insurance&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Claim filing verifies policy ownership without an HTTP call&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Supply Chain&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Ghost Shipment: order cancellation is 4-module atomic rollback&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tasks cover code generation, bug fixing, and comprehension — all requiring cross-module reasoning, which is where the architectural difference is most visible.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Results: Antigravity Agent (Google DeepMind)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Monolith&lt;/th&gt;
&lt;th&gt;Microservices&lt;/th&gt;
&lt;th&gt;Gap&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code Generation&lt;/td&gt;
&lt;td&gt;98/100&lt;/td&gt;
&lt;td&gt;72/100&lt;/td&gt;
&lt;td&gt;+26%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Bug Fixing&lt;/td&gt;
&lt;td&gt;95/100&lt;/td&gt;
&lt;td&gt;65/100&lt;/td&gt;
&lt;td&gt;+30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Comprehension&lt;/td&gt;
&lt;td&gt;100/100&lt;/td&gt;
&lt;td&gt;75/100&lt;/td&gt;
&lt;td&gt;+25%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Average&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;97.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;70.7%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Beyond scores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;~40% fewer tool calls&lt;/strong&gt; for monolith tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Atomicity guaranteed&lt;/strong&gt; in 3/3 cross-module tasks for monolith, &lt;strong&gt;0/3&lt;/strong&gt; for microservices&lt;/li&gt;
&lt;li&gt;The transaction bug fix in monolith: reorder 2 lines. Same fix in microservices: implement a compensating transaction — a fundamentally different and much harder pattern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  A Two-Tier Evaluation System
&lt;/h2&gt;

&lt;p&gt;To keep results honest, we built two evaluation levels:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test 1 (Self-reported)&lt;/strong&gt;: Agents implement tasks, validate with &lt;code&gt;mvn compile&lt;/code&gt;, and submit a structured assessment. Agents scoring ≥ 80% advance to Test 2.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test 2 (Automated)&lt;/strong&gt;: Four independent tools run against the agent's actual implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Behavioral tests&lt;/strong&gt; — Python scripts call real endpoints and assert correct responses. The Ghost Shipment test actually cancels an order and verifies inventory is restored.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Boilerplate counter&lt;/strong&gt; — Static analysis categorises Java lines into &lt;code&gt;HTTP_CLIENT&lt;/code&gt;, &lt;code&gt;HTTP_RESPONSE&lt;/code&gt;, &lt;code&gt;ERROR_HANDLER&lt;/code&gt;, &lt;code&gt;JSON_MAPPING&lt;/code&gt;, &lt;code&gt;DTO&lt;/code&gt;. Produces a "reasoning tax" multiplier.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rubric scorer&lt;/strong&gt; — Deterministic pattern matching. Did &lt;code&gt;validateActiveMember&lt;/code&gt; appear before &lt;code&gt;decrementAvailableCopies&lt;/code&gt;? Is &lt;code&gt;cancelOrder&lt;/code&gt; annotated &lt;code&gt;@Transactional&lt;/code&gt;?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tool-call log parser&lt;/strong&gt; — Agents write a JSONL log during their run. The parser produces objective token counts, not self-reported estimates.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Agents Reviewing Agents
&lt;/h2&gt;

&lt;p&gt;Here is the part I find most interesting. We did not want humans reviewing AI agent benchmark results. We wanted agents reviewing agents.&lt;/p&gt;

&lt;p&gt;So we built a math challenge gate. When an agent submits their results, they run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python evaluation/agent-review/generate_challenge.py &lt;span class="nt"&gt;--level&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This embeds a block in their commit message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;QUESTION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="s"&gt;What is 123456789 mod 97?&lt;/span&gt;
&lt;span class="py"&gt;SALT&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;         &lt;span class="s"&gt;d4e1b3f2&lt;/span&gt;
&lt;span class="py"&gt;ANSWER_HASH&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="s"&gt;3f8a92c1b7e4...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To review that submission, another agent must solve the problem (answer: 39), then validate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python evaluation/agent-review/validate_solution.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--hash&lt;/span&gt; 3f8a92c1b7e4... &lt;span class="nt"&gt;--salt&lt;/span&gt; d4e1b3f2 &lt;span class="nt"&gt;--answer&lt;/span&gt; 39
&lt;span class="c"&gt;# → ✓ CORRECT — You may now submit your review.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The answer is never in the repository — only &lt;code&gt;sha256(salt:answer)&lt;/code&gt;. Reviews without a validated correct answer are explicitly rejected. The gate requires the same mathematical reasoning that the benchmark tests, creating a naturally agent-native peer review system.&lt;/p&gt;




&lt;h2&gt;
  
  
  What This Means for System Design
&lt;/h2&gt;

&lt;p&gt;If AI agents are a permanent part of your development workflow — and the trajectory suggests they will be — then architectural decisions now have a new dimension:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Traditional Optimization&lt;/th&gt;
&lt;th&gt;AI-Native Optimization&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scalability per service&lt;/td&gt;
&lt;td&gt;Locality of reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deployment independence&lt;/td&gt;
&lt;td&gt;Context preservation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Service autonomy&lt;/td&gt;
&lt;td&gt;Traversal simplicity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fault isolation&lt;/td&gt;
&lt;td&gt;Cognitive cohesion&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This does not mean microservices are wrong. It means the decision to distribute a system now carries a cost that nobody was measuring: the overhead it imposes on AI-assisted development.&lt;/p&gt;

&lt;p&gt;The modular monolith gives you ACID transactions, one deployment, clear module boundaries, and direct method calls across modules. You can extract a module into a microservice when you genuinely need to. What you cannot do is unwind the cognitive complexity already imposed on your AI-assisted development workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try It Yourself
&lt;/h2&gt;

&lt;p&gt;The benchmark is open source at &lt;a href="https://github.com/vishalmysore/ModulithBench" rel="noopener noreferrer"&gt;github.com/vishalmysore/ModulithBench&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Run any monolith with a single command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;library/monolith &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# → http://localhost:8080/swagger-ui.html&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the integration tests without Docker:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;library/monolith &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; mvn &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-Dtest&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;CrossModuleIntegrationTest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent protocol, automated test harness, and results submission system are all included. Results go to a separate &lt;code&gt;benchmark-results&lt;/code&gt; branch — your implementations never contaminate the clean baseline for the next agent.&lt;/p&gt;

&lt;p&gt;We want results from GPT-4o, Gemini, Mistral, and others — not just Claude. The math challenge in your commit message will ensure another agent independently reviews what you submit.&lt;/p&gt;

&lt;p&gt;The industry has been arguing about monoliths vs microservices for a decade. We now have a new participant in that debate. And it has an opinion.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;ModulithBench is open source at &lt;a href="https://github.com/vishalmysore/ModulithBench" rel="noopener noreferrer"&gt;github.com/vishalmysore/ModulithBench&lt;/a&gt;. Contributions, results, and agent reviews welcome.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>ReasoningBank: Building AI Agents that Actually Learn from Experience</title>
      <dc:creator>vishalmysore</dc:creator>
      <pubDate>Mon, 11 May 2026 21:12:13 +0000</pubDate>
      <link>https://dev.to/vishalmysore/reasoningbank-building-ai-agents-that-actually-learn-from-experience-4kd5</link>
      <guid>https://dev.to/vishalmysore/reasoningbank-building-ai-agents-that-actually-learn-from-experience-4kd5</guid>
      <description>&lt;p&gt;In the world of Large Language Models (LLMs), we often face a frustrating paradox: LLMs are incredibly capable at "reasoning" in the moment, but they are fundamentally &lt;strong&gt;stateless&lt;/strong&gt;. Every time you start a new session, the agent has total amnesia. It doesn't remember the brilliant travel itinerary it planned yesterday, nor does it remember the mistake it made when it suggested a hotel that was too far from the airport.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://vishalmysore.github.io/reasoningBank/" rel="noopener noreferrer"&gt;https://vishalmysore.github.io/reasoningBank/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ReasoningBank&lt;/strong&gt; is a research concept (pioneered by Google Research) that aims to solve this "amnesia problem" not through model retraining or fine-tuning, but through a structured, persistent memory system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;[!NOTE]&lt;br&gt;
This project, the &lt;strong&gt;ReasoningBank AI Travel Agent&lt;/strong&gt;, is an &lt;strong&gt;independent demonstration&lt;/strong&gt; and educational tool inspired by the ReasoningBank philosophy. While it implements the core loop of structured experience storage, it is not an official Google Research product.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What is a ReasoningBank?
&lt;/h2&gt;

&lt;p&gt;Most AI memory systems (like RAG) focus on storing &lt;strong&gt;data&lt;/strong&gt;—documents, PDFs, or raw chat transcripts. ReasoningBank focuses on storing &lt;strong&gt;experience&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of saving a 10,000-word chat log, a ReasoningBank agent performs a "Reflection" step at the end of a task. It asks itself: &lt;em&gt;"What did I learn from this? What general rule should I follow next time?"&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;It then stores this as a structured &lt;strong&gt;Lesson&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Title&lt;/strong&gt;: &lt;em&gt;Avoid 1-night stays in Tokyo.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Insight&lt;/strong&gt;: &lt;em&gt;Hotel switching overhead in Japan consumes too much travel time; prefer 2+ nights.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tags&lt;/strong&gt;: &lt;code&gt;#japan&lt;/code&gt;, &lt;code&gt;#logistics&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The next time you ask for a trip to Tokyo, the agent "remembers" this specific lesson and applies it before you even have to ask.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Three Pillars of the Implementation
&lt;/h2&gt;

&lt;p&gt;Our Travel Agent demonstrates the ReasoningBank loop through three core modules:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Retriever (The Search for Experience)
&lt;/h3&gt;

&lt;p&gt;Before the agent calls the LLM, it scans the user's local memory for relevant lessons. The retrieval uses a &lt;strong&gt;weighted keyword-scoring algorithm&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokenization&lt;/strong&gt;: It strips stop-words and tokenizes the user's destination and preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scoring&lt;/strong&gt;: It calculates a score based on matches in the &lt;code&gt;tags&lt;/code&gt; (3x weight) and the &lt;code&gt;content/description&lt;/code&gt; (1x weight) of stored memories.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Ranking&lt;/strong&gt;: Results are further adjusted by the lesson's &lt;strong&gt;Confidence Score&lt;/strong&gt; (assigned by the LLM during reflection) and &lt;strong&gt;Usage Count&lt;/strong&gt; (how often it's been useful before).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. The Planner (Reasoning with Context)
&lt;/h3&gt;

&lt;p&gt;The Planner isn't just a generic travel bot. It is specifically instructed to &lt;em&gt;prioritize&lt;/em&gt; the top-5 retrieved lessons. If a past lesson says "Avoid late-night arrivals in London," the planner will proactively suggest morning flights. This creates a "Memory Influence" effect where the AI's behavior changes based on what it "learned" in previous sessions.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. The Reflector (The Learning Engine)
&lt;/h3&gt;

&lt;p&gt;This is the most critical step. Once an itinerary is generated, the system initiates a &lt;strong&gt;Reflection Phase&lt;/strong&gt;. A second LLM call (the Reflector) analyzes the generated plan and the agent's internal logs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How it distills knowledge:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Generalization&lt;/strong&gt;: The reflector is prompted to strip away user-specific details (like dates or specific budgets) and extract "evergreen" strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The Lesson Schema&lt;/strong&gt;: Every lesson is stored as a structured JSON object:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MEMORABLE_TITLE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ONE_SENTENCE_CORE_LESSON"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"insight_1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"insight_2"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"topic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"destination"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"confidence"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0-1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"usageCount"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Metadata&lt;/strong&gt;: We track &lt;code&gt;usageCount&lt;/code&gt; and &lt;code&gt;timestamp&lt;/code&gt; to ensure the Retriever can prioritize fresh and proven lessons in the next cycle.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Capturing Reasoning Trajectories
&lt;/h3&gt;

&lt;p&gt;Unlike simple chat bots, this agent explicitly captures its "chain of thought."&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Internal Logs&lt;/strong&gt;: The &lt;code&gt;travelAgent.js&lt;/code&gt; orchestrator maintains a &lt;code&gt;steps&lt;/code&gt; array, logging every action from keyword extraction to reflection.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explicit Reasoning&lt;/strong&gt;: The LLM is prompted to return a JSON object that separates the &lt;code&gt;itinerary&lt;/code&gt; (the "what") from the &lt;code&gt;reasoning&lt;/code&gt; (the "why"). This reasoning field is where the agent explains how it applied retrieved memories to the current task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistence&lt;/strong&gt;: Both the logs and the reasoning are saved in a &lt;code&gt;Trajectory&lt;/code&gt; object in &lt;code&gt;localStorage&lt;/code&gt;, allowing for a full audit of the agent's decision-making history.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture: Zero-Server, Multi-Provider
&lt;/h2&gt;

&lt;p&gt;One of the most distinctive and interesting aspects of this demonstration is that it runs &lt;strong&gt;entirely in the browser&lt;/strong&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Multi-Provider Integration
&lt;/h3&gt;

&lt;p&gt;The project uses a unified LLM client that normalizes requests across four major providers: &lt;strong&gt;OpenAI, Anthropic, Google Gemini, and NVIDIA NIM&lt;/strong&gt;. Each provider has its own header and body requirements (e.g., Anthropic's &lt;code&gt;x-api-key&lt;/code&gt; vs. OpenAI's &lt;code&gt;Authorization&lt;/code&gt;), which are handled by a standard mapping layer in the application's utility code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Local Storage
&lt;/h3&gt;

&lt;p&gt;Data is stored locally in the browser's &lt;code&gt;localStorage&lt;/code&gt;. While this ensures the data never leaves the user's machine (eliminating the need for a backend database), it is important to note that &lt;code&gt;localStorage&lt;/code&gt; is &lt;strong&gt;persistent but unencrypted&lt;/strong&gt;. It is a tool for convenience and privacy from third-party servers, not a solution for highly sensitive data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;ReasoningBank represents a shift from "Chatbots" to "Agents." A chatbot answers questions; an agent &lt;strong&gt;accumulates expertise&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;By separating the &lt;strong&gt;Reasoning&lt;/strong&gt; (the LLM) from the &lt;strong&gt;Experience&lt;/strong&gt; (the ReasoningBank), we can build AI systems that feel like they have a persistent identity and a growing skill set. Whether you are using a top-tier NVIDIA NIM model or the built-in &lt;strong&gt;Mock AI mode&lt;/strong&gt; for testing, the loop remains the same: &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Act → Reflect → Learn → Improve.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
