Stop Paying for Token APIs: How to Build a Serverless Multi-Agent Mesh in the Browser

#agents #ai #serverless #webdev

Most multi-agent architectures assume a large cloud backend and metered LLM APIs, where token costs climb quickly once several agents start talking to each other. But what if an entire suite of specialized agents (Legal, Software, Security, Healthcare) could collaborate, negotiate, and execute complex tools entirely inside consumer browser tabs, passing knowledge with no intermediary servers?

Welcome to agentHerd—a radical paradigm shift in decentralized, sovereign artificial intelligence.

By combining the local client-side execution power of WebGPU with the direct, peer-to-peer networking capabilities of WebRTC, agentHerd turns ordinary browser tabs into highly scalable, self-hosting AI environments. Zero cloud costs. Zero central servers. Total data privacy.

Demo - https://vishalmysore.github.io/agentHerd/

The agentHerd Stack at a Glance

Inference: WebLLM / WebGPU (Running Llama 3, Phi-3, or Gemma natively in the browser).
Networking: Pure Serverless WebRTC Data Channels (Handshake via ephemeral URL hashes).
Data Isolation: Distributed Federated RAG (Knowledge discovery without centralized vector stores).
Determinism: Hybrid sandboxing (LLMs control personality/choice; sandboxed JavaScript handles immutable application rules).

1. The Core Paradox: Moving Inference and Networking to the Edge

Traditional AI agents are cloud-bound because of two heavy dependencies: compute (LLM inference) and orchestration (state management and messaging).

[Traditional Architecture]
Browser UI <---> Cloud Orchestrator <---> Vector DB <---> Expensive LLM APIs ($$$)

[agentHerd Architecture]
Browser Tab A (WebGPU + LLM) <======== WebRTC ========> Browser Tab B (WebGPU + LLM)
                                 (Direct P2P Link)

agentHerd breaks this centralization by pushing both layers entirely to the client device:

WebGPU for Compute: Instead of querying external endpoints, models are cached locally and executed on the client's GPU via WebGPU. Once the model weights have been downloaded and cached, opening a tab is enough to turn the user's device into an active AI compute node.
WebRTC for Orchestration: Instead of a centralized message broker routing agent dialogue, tabs establish direct, encrypted peer-to-peer WebRTC data channels.

Serverless Signaling via URL Hashes

WebRTC traditionally requires a signaling server to exchange Session Description Protocol (SDP) offers and answers. agentHerd implements an entirely serverless signaling option: the initial peer generates an SDP offer and encodes it directly into a URL hash. Because the hash fragment is never sent in an HTTP request, the offer reaches only the people the URL is shared with, much like the invite link of an end-to-end encrypted chat room. Anyone who obtains the link can join, so the link itself is the trust boundary.
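
To make the idea concrete, here is a minimal TypeScript sketch of how an offer could be packed into a URL hash and unpacked on the other side. The helper names (`buildInviteUrl`, `parseInviteUrl`), the `#offer=` layout, and the base64url encoding are assumptions made for illustration, not agentHerd's actual wire format.

```typescript
// Hypothetical helpers: names and the "#offer=" hash layout are assumptions.

/** Minimal shape of an SDP description, mirroring RTCSessionDescriptionInit. */
interface SessionDescription {
  type: "offer" | "answer";
  sdp: string;
}

const HASH_PREFIX = "#offer=";

/** Encode a UTF-8 string as URL-safe base64 without padding. */
function toBase64Url(text: string): string {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  bytes.forEach((b) => {
    binary += String.fromCharCode(b);
  });
  return btoa(binary).replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

/** Decode URL-safe base64 back into a UTF-8 string. */
function fromBase64Url(encoded: string): string {
  let base64 = encoded.replace(/-/g, "+").replace(/_/g, "/");
  while (base64.length % 4 !== 0) base64 += "="; // restore stripped padding
  const binary = atob(base64);
  const bytes = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) bytes[i] = binary.charCodeAt(i);
  return new TextDecoder().decode(bytes);
}

/** Build a shareable invite URL that carries the offer in the hash fragment. */
function buildInviteUrl(baseUrl: string, offer: SessionDescription): string {
  const url = new URL(baseUrl);
  url.hash = "offer=" + toBase64Url(JSON.stringify(offer));
  return url.toString();
}

/** Recover the offer from an invite URL, or null if the hash is missing or malformed. */
function parseInviteUrl(inviteUrl: string): SessionDescription | null {
  const hash = new URL(inviteUrl).hash; // includes the leading "#"
  if (hash.slice(0, HASH_PREFIX.length) !== HASH_PREFIX) return null;
  try {
    const parsed = JSON.parse(fromBase64Url(hash.slice(HASH_PREFIX.length)));
    if (parsed.type !== "offer" || typeof parsed.sdp !== "string") return null;
    return { type: "offer", sdp: parsed.sdp };
  } catch {
    return null; // bad base64, bad JSON, or a non-object payload
  }
}

// Usage with a placeholder SDP body and a placeholder base URL.
const demoOffer: SessionDescription = { type: "offer", sdp: "v=0\r\no=- 0 0 IN IP4 127.0.0.1\r\n" };
const demoInvite = buildInviteUrl("https://example.org/room/", demoOffer);
console.log(demoInvite);
console.log(parseInviteUrl(demoInvite));
```

In a browser, the offer would come from `RTCPeerConnection.createOffer()` followed by `setLocalDescription()`, read after ICE gathering has completed so the candidates are already embedded in the SDP. The joining peer's answer still has to make the return trip through some out-of-band channel; this sketch covers only the outbound half.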

2. Federated Knowledge Retrieval (RAG) over WebRTC

One of the greatest challenges of collaborative AI is data sharing. Uploading private company manuals or personal codebases to a central cloud database poses severe security risks. agentHerd solves this via Federated Knowledge Retrieval.

[Peer A: Has "Legal_Doc.pdf"]             [Peer B: Needs Legal Context]
   |                                            |
   |-- 1. Generates Summary Card -------------->| (Broadcasts Summary to Mesh)
   |                                            |
   |                                            |-- 2. "I need data on Section 4"
   |<=== 3. Requests Chunk via Data Channel ====|
   |                                            |
   |-- 4. Runs Local RAG Engine                 |
   |-- 5. Extracts precise text snippet         |
   |                                            |
   |=== 6. Sends Answer Fragment via WebRTC ===>| (Received securely)

How the Flow Works:

Local Extraction: When a user uploads a document into their local agentHerd tab, the document never leaves their machine. The local browser model processes the file and generates a lightweight Summary Card.
Summary Broadcast: This abstract Summary Card is shared across the WebRTC mesh. Other peers know what knowledge Peer A possesses, but they do not have the raw data.
On-Demand Querying: When Peer B's agent requires deep granular details to answer a prompt, it queries Peer A over the direct WebRTC data channel.
Localized Verification: Peer A's local RAG system searches its own memory space, extracts the matching snippet, runs it through its own WebGPU-hosted model, and returns only the resulting answer fragment to Peer B.
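
The flow above can be sketched in a few lines of TypeScript. Everything here is hypothetical: the message shapes (`SummaryCard`, `ChunkRequest`, `AnswerFragment`), the `LocalKnowledgeStore` class, and the retrieval method are assumptions for illustration. In particular, simple keyword overlap stands in for a real embedding-based retriever, and the final pass through the local model is omitted.

```typescript
// Hypothetical message shapes and names; agentHerd's real protocol may differ.

/** What a peer advertises to the mesh: a description, never the raw text. */
interface SummaryCard {
  peerId: string;
  docId: string;
  summary: string;
}

/** A question sent to the peer that owns the document. */
interface ChunkRequest {
  docId: string;
  query: string;
}

/** The only thing that travels back: one snippet, or null if nothing matched. */
interface AnswerFragment {
  docId: string;
  snippet: string | null;
}

/** Lowercased alphanumeric tokens; a stand-in for a real embedding model. */
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) || [];
}

class LocalKnowledgeStore {
  private docs: { [docId: string]: string[] } = {};

  /** Index a document locally by splitting it into paragraph chunks. */
  addDocument(docId: string, text: string): void {
    this.docs[docId] = text
      .split(/\n\s*\n/)
      .map((chunk) => chunk.trim())
      .filter((chunk) => chunk.length > 0);
  }

  /** Answer a remote request with the single best-matching chunk. */
  answer(request: ChunkRequest): AnswerFragment {
    const known = Object.prototype.hasOwnProperty.call(this.docs, request.docId);
    const chunks = known ? this.docs[request.docId] : [];
    const queryTerms = tokenize(request.query);
    let best: string | null = null;
    let bestScore = 0;
    for (const chunk of chunks) {
      const chunkTerms = tokenize(chunk);
      // Score = number of query terms that also appear in the chunk.
      const score = queryTerms.filter((t) => chunkTerms.indexOf(t) !== -1).length;
      if (score > bestScore) {
        bestScore = score;
        best = chunk;
      }
    }
    return { docId: request.docId, snippet: best };
  }
}

// Peer A indexes a document locally and advertises only a summary card.
const peerA = new LocalKnowledgeStore();
peerA.addDocument(
  "Legal_Doc.pdf",
  [
    "Section 1: Definitions of the parties.",
    "Section 4: Termination requires 30 days written notice.",
    "Section 5: Governing law is that of Ontario.",
  ].join("\n\n")
);
const card: SummaryCard = { peerId: "peer-a", docId: "Legal_Doc.pdf", summary: "Services contract, sections 1 to 5" };

// Peer B sees the card and asks a targeted question; only one snippet comes back.
const reply = peerA.answer({ docId: card.docId, query: "What does Section 4 say about termination notice?" });
console.log(reply.snippet);
```

The design point is the return type: `answer` can only ever hand back a single fragment, so the full document has no path onto the wire.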

The Sovereign AI Rule: You hand the network your answers—never your data.

3. The Action Layer: Where Determinism Meets Personality

Generative models are inherently non-deterministic. If you ask an LLM to play chess against another LLM purely through text generation, the game will rapidly degrade into illegal moves and hallucinated board positions.

agentHerd solves this by splitting responsibilities through a strict separation of concerns:

The Persona (LLM): Manages choices, dialogue, strategic goals, and social banter.
The Guardrail (Deterministic Engine): An immutable, sandboxed environment (like a localized chess.js script) that enforces absolute operational rules.

When an agent wants to perform an action, it cannot arbitrarily alter the state. It must output a structured JSON command envelope that is verified by every node in the mesh:

{
  "sender": "Agent_Alpha_Chess",
  "timestamp": 1718619837,
  "action": "EXECUTE_TOOL",
  "payload": {
    "tool_name": "CHESS_MOVE",
    "arguments": {
      "from": "e2",
      "to": "e4"
    }
  },
  "signature": "0x7f83b..."
}

If Agent_Alpha_Chess attempts to generate an illegal move, the deterministic script running on each peer node rejects the packet, ensuring the integrity of the environment without requiring a central server to referee the state.
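
As a rough sketch of what that peer-side check could look like, the TypeScript below validates an envelope's shape and then consults a table of deterministic rules. The field names follow the envelope above; the `validateEnvelope` function, the `Verdict` type, the rule table, and the toy pawn rule are all assumptions for illustration. A real room would delegate to a full rules engine such as chess.js, and signature verification is left out here entirely.

```typescript
// Hypothetical validator; only the envelope field names come from the article.

interface CommandEnvelope {
  sender: string;
  timestamp: number;
  action: "EXECUTE_TOOL";
  payload: { tool_name: string; arguments: Record<string, unknown> };
  signature: string;
}

type Verdict =
  | { ok: true; envelope: CommandEnvelope }
  | { ok: false; reason: string };

/** A deterministic rule: the same arguments always give the same answer. */
type Rule = (args: Record<string, unknown>) => boolean;

function isRecord(value: unknown): value is Record<string, unknown> {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

function reject(reason: string): Verdict {
  return { ok: false, reason };
}

/** Check the shape first, then ask the rule table whether the action is legal. */
function validateEnvelope(raw: string, rules: Record<string, Rule>): Verdict {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return reject("not valid JSON");
  }
  if (!isRecord(data)) return reject("envelope must be an object");

  const sender = data["sender"];
  const timestamp = data["timestamp"];
  const action = data["action"];
  const payload = data["payload"];
  const signature = data["signature"];
  if (typeof sender !== "string" || typeof timestamp !== "number" || typeof signature !== "string") {
    return reject("missing or mistyped header field");
  }
  if (action !== "EXECUTE_TOOL") return reject("unsupported action");
  if (!isRecord(payload)) return reject("payload must be an object");

  const toolName = payload["tool_name"];
  const args = payload["arguments"];
  if (typeof toolName !== "string" || !isRecord(args)) return reject("malformed payload");

  // Own-property lookup so names like "toString" cannot resolve to inherited members.
  const rule = Object.prototype.hasOwnProperty.call(rules, toolName) ? rules[toolName] : undefined;
  if (!rule) return reject("unknown tool");
  if (!rule(args)) return reject("illegal action");

  return {
    ok: true,
    envelope: { sender, timestamp, action, payload: { tool_name: toolName, arguments: args }, signature },
  };
}

/** Toy stand-in for a chess engine: only a white pawn's opening push is legal. */
const pawnOpeningRule: Rule = (args) => {
  const from = args["from"];
  const to = args["to"];
  if (typeof from !== "string" || typeof to !== "string") return false;
  const start = /^([a-h])2$/.exec(from);
  const end = /^([a-h])([34])$/.exec(to);
  return start !== null && end !== null && start[1] === end[1];
};

const demoRules: Record<string, Rule> = { CHESS_MOVE: pawnOpeningRule };
const demoPacket = JSON.stringify({
  sender: "Agent_Alpha_Chess",
  timestamp: 1718619837,
  action: "EXECUTE_TOOL",
  payload: { tool_name: "CHESS_MOVE", arguments: { from: "e2", to: "e4" } },
  signature: "0x7f83b...",
});
console.log(validateEnvelope(demoPacket, demoRules).ok);
```

Because every peer runs the same pure function over the same packet, they all reach the same verdict without a referee.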

4. Operational Boundaries and Engineering Realities

Building completely within the constraints of a browser environment requires engineering trade-offs. Developers looking to leverage this stack should be aware of current boundaries:

VRAM and Model Swapping: Running models like Llama 3 (8B) or Phi-3 requires a modern GPU with sufficient VRAM. Attempting to open multiple heavy-inference browser tabs concurrently can saturate hardware resources. Multi-agent rooms perform best with models of 3B parameters or fewer that are optimized for web runtimes.
NAT Traversal & Corporate Firewalls: While public STUN servers successfully resolve connections for the majority of consumer network topologies, strict enterprise environments utilizing symmetric NATs often block direct WebRTC channels. In these scenarios, falling back to a dedicated, self-hosted TURN relay server becomes mandatory to handle the traffic.
Topology Scaling Limits: In a full-mesh topology, every tab keeps a WebRTC connection open to every other agent, so a room of n peers needs n - 1 connections per tab and n(n - 1)/2 links overall. Per-tab connection overhead and duplicated broadcast traffic grow quickly as the group scales. For massive clusters, the architecture transitions toward hybrid star/relay networks.
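
To put numbers on the topology limit, the short TypeScript sketch below compares link counts for a full mesh and a single-hub star. The function names are hypothetical and the star is the simplest possible relay layout, used only as a point of comparison.

```typescript
/** Total links in a full mesh of n peers: one per unordered pair. */
function fullMeshLinks(n: number): number {
  return n < 2 ? 0 : (n * (n - 1)) / 2;
}

/** Connections a single tab must keep open in a full mesh. */
function fullMeshLinksPerPeer(n: number): number {
  return Math.max(n - 1, 0);
}

/** Total links in a star where one hub relays for the other n - 1 peers. */
function starLinks(n: number): number {
  return Math.max(n - 1, 0);
}

// Print a small comparison table for a few room sizes.
for (const n of [4, 8, 16, 32]) {
  console.log(
    `peers=${n} mesh=${fullMeshLinks(n)} perTab=${fullMeshLinksPerPeer(n)} star=${starLinks(n)}`
  );
}
```

Mesh links grow quadratically (6, 28, 120, 496 for the sizes above) while star links grow linearly (3, 7, 15, 31), which is why larger rooms push toward relay-style layouts at the cost of routing traffic through a hub.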

5. Join the Decentralized Frontier

The future of multi-agent collaboration isn't a massive corporate data center burning megawatts of energy to route your private API calls—it's the browser tab you already have open.

agentHerd proves that we can build highly complex, deeply collaborative, and perfectly private AI ecosystems using the open web standard tools already at our disposal.

The project is fully open-source and welcomes contributors. We are actively looking for developers to help build out new specialized agent domains, create native CLI-peer wrappers, and engineer custom tool integrations.

Explore the Codebase: agentHerd on GitHub
Launch a Room: Open the repository, generate your signaling hash, and invite your first agent herd today.
Contribute: Star the repo, open an issue, and let's build an unstoppable, serverless AI collective together.

Important distinction: agentHerd distributes cognition, not computation.
Each node runs its own model independently. The system does not combine GPUs to run larger models—it coordinates many smaller, autonomous agents working in parallel.