Introducing Duplex: A Zero-Backend, Multiplexed LLM Inference Engine for True Client-Side Parallel AI

#mcp #ai #llm #webdev

Hi there. I’m Gurutva Murdia, the developer behind Duplex. Today I’m excited to share the story, architecture, and technical deep dives of a project that has consumed my focus for months: a fully decentralised, browser-native wrapper that lets you run multiple Large Language Models in true parallel, mixing local hardware (Ollama, LM Studio, vLLM, GGUF weights) with cloud frontier APIs (Claude, GPT, etc.), all with zero backend servers.

The Problem It Solves
Most “multi-LLM” tools today are either:
• SaaS proxies (introducing latency, cost, privacy leaks, and vendor lock-in), or
• Simple sequential wrappers that choke the browser thread when you try to run several models at once.
I wanted something different: sovereign, high-performance inference that can run fully air-gapped on local models, and where local silicon and cloud supercomputers work together seamlessly in one workspace when you want them to. No data leaves your machine unless you explicitly route it to a cloud API you control. No middleman. No telemetry. Full operator agency.

Core Concept: Multiplexed Parallel Inference

Duplex introduces the Multiple X & Paralla X Engines: a concurrency core built on the browser’s JavaScript runtime (V8 in Chromium-based browsers) that spreads work across separate network connections and Web Workers and throttles DOM updates.
Key technical highlights:
• True concurrency without UI lockup: Traditional approaches queue requests sequentially or cause massive memory fragmentation. Duplex runs independent fetch pipelines in parallel, each with its own ReadableStream reader and AbortController. High-frequency token streams are throttled to a fixed 16 ms Virtual DOM render tick (about one frame at 60 Hz) to protect the main thread from garbage-collection stutter.
• Orphaned flow termination: One “Stop Generation” button instantly fires AbortController.abort() across all active local and remote lanes simultaneously. No lingering tokens or zombie connections.
• Mixed ecosystem blending: Run Ollama on localhost:11434 (Llama 8B at ~63 tok/s), Claude 3.5 Sonnet (~112 tok/s), and a vLLM cluster at the same time in the same chat workspace. Compare outputs side-by-side in real time.
• Failover advisor: Submit a prompt with no models active? Duplex auto-injects a lightweight client-side advisor so you’re never blocked.
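A minimal sketch of the parallel-lane and shared-stop ideas above, assuming each lane is a streaming HTTP POST. The names `LaneRegistry`, `streamLane`, `runLanes`, and `LaneRequest` are hypothetical and chosen for illustration; they are not taken from the Duplex codebase.

```typescript
// Hypothetical registry: one AbortController per active lane.
class LaneRegistry {
  private controllers = new Map<string, AbortController>();

  // Register a lane and return the signal its fetch should listen to.
  start(laneId: string): AbortSignal {
    const controller = new AbortController();
    this.controllers.set(laneId, controller);
    return controller.signal;
  }

  // Drop a lane that ended on its own (only if it is still the same run).
  finish(laneId: string, signal: AbortSignal): void {
    const current = this.controllers.get(laneId);
    if (current && current.signal === signal) this.controllers.delete(laneId);
  }

  // "Stop Generation": abort every lane, local or remote, in one pass.
  stopAll(): void {
    this.controllers.forEach((controller) => controller.abort());
    this.controllers.clear();
  }

  get activeCount(): number {
    return this.controllers.size;
  }
}

// Assumed request shape; real endpoints and payloads differ per provider.
interface LaneRequest {
  id: string;
  url: string;
  body: unknown;
}

// Read one lane's response body chunk by chunk until it ends or is aborted.
async function streamLane(
  url: string,
  body: unknown,
  signal: AbortSignal,
  onChunk: (text: string) => void,
): Promise<void> {
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
    signal,
  });
  if (!res.ok || !res.body) throw new Error(`Lane request failed: ${res.status}`);
  const reader = res.body.getReader();
  const decoder = new TextDecoder();
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    onChunk(decoder.decode(value, { stream: true }));
  }
}

// Start every lane at once; a failed or aborted lane does not sink the others.
function runLanes(
  registry: LaneRegistry,
  lanes: LaneRequest[],
  onChunk: (laneId: string, text: string) => void,
): Promise<void[]> {
  return Promise.all(
    lanes.map(async (lane) => {
      const signal = registry.start(lane.id);
      try {
        await streamLane(lane.url, lane.body, signal, (text) => onChunk(lane.id, text));
      } catch (err) {
        // Swallowed here for brevity; a real UI would surface the error per lane.
      } finally {
        registry.finish(lane.id, signal);
      }
    }),
  );
}
```

Because every lane holds its own controller, a single `registry.stopAll()` call reaches all in-flight requests, and each lane's `fetch` or `reader.read()` rejects with an abort error instead of continuing to deliver tokens.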

The live Multiplex Active Pipeline Monitor shows lanes with real-time metrics: tokens/sec, TTFT (Time To First Token), system strain, network packets, etc.
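As a rough illustration of how two of these metrics could be derived per lane, here is a small sketch. The `LaneTiming` shape and the `computeLaneMetrics` helper are assumptions made for this example, and the throughput definition (tokens divided by time elapsed since the first token) is one common convention rather than necessarily the one Duplex uses.

```typescript
// Hypothetical per-lane timestamps, e.g. captured with performance.now().
interface LaneTiming {
  requestSentMs: number;
  firstTokenMs: number | null; // null until the first token arrives
  lastTokenMs: number | null;
  tokenCount: number;
}

interface LaneMetrics {
  ttftMs: number | null;
  tokensPerSec: number | null;
}

function computeLaneMetrics(t: LaneTiming): LaneMetrics {
  // Nothing has streamed yet, so neither metric is defined.
  if (t.firstTokenMs === null) return { ttftMs: null, tokensPerSec: null };

  // TTFT: delay between sending the request and seeing the first token.
  const ttftMs = t.firstTokenMs - t.requestSentMs;

  // Throughput is measured over the decode phase only, so a slow TTFT
  // does not drag down the tokens/sec figure.
  const decodeMs = (t.lastTokenMs ?? t.firstTokenMs) - t.firstTokenMs;
  const tokensPerSec = decodeMs > 0 ? t.tokenCount / (decodeMs / 1000) : null;

  return { ttftMs, tokensPerSec };
}
```

For example, a lane with `requestSentMs: 0`, `firstTokenMs: 250`, `lastTokenMs: 1250`, and `tokenCount: 63` yields a TTFT of 250 ms and 63 tokens/sec. Note that a browser client often only sees stream chunks, so the token count may be an approximation unless the server reports usage.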

Local vs Cloud: Architectural Dualities
One of the strongest parts of Duplex is how it embraces the dual-engine reality instead of forcing an either/or choice.
Local GGUF Weights (Consumer GPU Hardware):
• Unlimited queries, zero marginal cost.
• 100% air-gapped isolation.
• Full control over system prompts, sampling params, context, etc.
• Works offline after initial setup.

Cloud Frontier APIs:
• Access to frontier-scale models that are far too large to fit on consumer hardware.
• Great for complex reasoning, math, or creative tasks where local hardware falls short.
Duplex makes switching or blending them frictionless. You can have dedicated lanes for different strengths and cross-validate outputs instantly.

Privacy & Decentralization First

This is a strictly client-side application:
• No database.
• No analytics or telemetry sent anywhere.
• All configuration (including API keys) can be AES-256 encrypted with a user password before storage.
• Runs entirely in your browser tab (static Vercel deployment).
• Supports DoH (DNS-over-HTTPS), strict CSP hardening, and enterprise compliance scenarios (e.g., HIPAA-friendly air-gapped setups).
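To make the password-based encryption point concrete, here is a sketch using the standard Web Crypto API. The post only states "AES-256 encrypted with a user password"; the choice of AES-GCM, PBKDF2 with SHA-256, the iteration count, and the names `encryptSecret`, `decryptSecret`, and `EncryptedBlob` are all assumptions for illustration.

```typescript
// Assumed parameters; the post does not specify the mode or the KDF.
const PBKDF2_ITERATIONS = 600000;

// Strings only, so the blob can be JSON-serialized into localStorage.
interface EncryptedBlob {
  salt: string;
  iv: string;
  ciphertext: string;
}

function toB64(bytes: Uint8Array): string {
  let binary = "";
  for (let i = 0; i < bytes.length; i++) binary += String.fromCharCode(bytes[i]);
  return btoa(binary);
}

function fromB64(text: string) {
  const binary = atob(text);
  const out = new Uint8Array(binary.length);
  for (let i = 0; i < binary.length; i++) out[i] = binary.charCodeAt(i);
  return out;
}

// Stretch the password into a 256-bit AES-GCM key.
async function deriveKey(password: string, salt: BufferSource): Promise<CryptoKey> {
  const material = await crypto.subtle.importKey(
    "raw",
    new TextEncoder().encode(password),
    "PBKDF2",
    false,
    ["deriveKey"],
  );
  return crypto.subtle.deriveKey(
    { name: "PBKDF2", salt, iterations: PBKDF2_ITERATIONS, hash: "SHA-256" },
    material,
    { name: "AES-GCM", length: 256 },
    false,
    ["encrypt", "decrypt"],
  );
}

async function encryptSecret(plaintext: string, password: string): Promise<EncryptedBlob> {
  // Fresh salt and IV for every encryption; both are safe to store in the clear.
  const salt = crypto.getRandomValues(new Uint8Array(16));
  const iv = crypto.getRandomValues(new Uint8Array(12));
  const key = await deriveKey(password, salt);
  const data = new TextEncoder().encode(plaintext);
  const ciphertext = await crypto.subtle.encrypt({ name: "AES-GCM", iv }, key, data);
  return { salt: toB64(salt), iv: toB64(iv), ciphertext: toB64(new Uint8Array(ciphertext)) };
}

// Rejects if the password is wrong or the stored blob was tampered with.
async function decryptSecret(blob: EncryptedBlob, password: string): Promise<string> {
  const key = await deriveKey(password, fromB64(blob.salt));
  const plaintext = await crypto.subtle.decrypt(
    { name: "AES-GCM", iv: fromB64(blob.iv) },
    key,
    fromB64(blob.ciphertext),
  );
  return new TextDecoder().decode(plaintext);
}
```

With this shape, something like `localStorage.setItem("apiKey", JSON.stringify(await encryptSecret(key, password)))` would keep only ciphertext at rest. AES-GCM is authenticated, so a wrong password surfaces as a rejected promise rather than garbage output.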

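For local daemons like Ollama, you’ll need to handle CORS once by setting the OLLAMA_ORIGINS environment variable (OLLAMA_ORIGINS="*" is the simplest setting; listing only the app’s origin is stricter). Detailed guides are built into the app.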
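A quick way to check whether the CORS setup worked is to probe the daemon from the page. The sketch below assumes Ollama's default address and its model-listing endpoint (`/api/tags`); `probeOllama`, `ProbeResult`, and the injectable `fetchImpl` parameter are hypothetical names used for this example.

```typescript
// Minimal subset of fetch that the probe needs; lets tests inject a fake.
type FetchLike = (url: string) => Promise<{ ok: boolean; status: number }>;

type ProbeResult = "reachable" | "http-error" | "blocked-or-offline";

async function probeOllama(
  baseUrl: string = "http://localhost:11434",
  fetchImpl: FetchLike = fetch,
): Promise<ProbeResult> {
  try {
    // GET /api/tags lists locally installed models, so it is a cheap health check.
    const res = await fetchImpl(`${baseUrl}/api/tags`);
    return res.ok ? "reachable" : "http-error";
  } catch (err) {
    // In a browser, a CORS rejection and a daemon that is not running both
    // surface as a rejected fetch, so the two cases cannot be told apart here.
    return "blocked-or-offline";
  }
}
```

If the probe returns `"blocked-or-offline"` while the daemon is known to be running, the origin allow-list is the first thing to check.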

Engineering Challenges & Solutions
Building this pushed me into several deep rabbit holes:

• Streaming + React Reconciliation: Managing high-velocity SSE streams without causing layout thrashing or excessive re-renders. Heavy use of React.memo, careful key management, and requestAnimationFrame batching.
• Memory Management: Long context windows + multiple streams = potential memory leaks in V8. Aggressive string handling, buffer reuse, and KV cache considerations.
• Browser Constraints: CORS for localhost, WebGPU/WebNN experimentation for future local acceleration, Service Workers for offline caching and faster reloads.
• Telemetry HUD: Real-time performance monitoring without impacting the inference paths themselves. CPU, memory heap, render ticks, and latency are all surfaced beautifully.

There are extensive engineering blog posts embedded in the app covering topics like VRAM offloading, quantization (Q4 vs Q8), Web Crypto for secure storage, CSS Grid for multiplex layouts, AbortController patterns, and more.
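The render-batching idea from the streaming challenge can be sketched as a small buffer that collects tokens and flushes them at most once per tick. `TokenBatcher` and its injectable scheduler are hypothetical; the default uses a 16 ms timer so the sketch runs anywhere, whereas in a browser you would more likely pass a scheduler built on requestAnimationFrame.

```typescript
// A scheduler asks for `flush` to be called once, some time later.
type Schedule = (flush: () => void) => void;

class TokenBatcher {
  private pending = "";
  private scheduled = false;
  private onFlush: (text: string) => void;
  private schedule: Schedule;

  constructor(
    onFlush: (text: string) => void,
    // Assumed default: a 16 ms timer. In a browser, a scheduler such as
    // (cb) => requestAnimationFrame(() => cb()) aligns flushes with frames.
    schedule: Schedule = (cb) => {
      setTimeout(cb, 16);
    },
  ) {
    this.onFlush = onFlush;
    this.schedule = schedule;
  }

  // Called for every incoming token; cheap, never touches the UI directly.
  push(token: string): void {
    this.pending += token;
    if (!this.scheduled) {
      this.scheduled = true;
      this.schedule(() => this.flush());
    }
  }

  // Hand everything collected since the last tick to the UI in one update.
  flush(): void {
    this.scheduled = false;
    if (this.pending === "") return;
    const text = this.pending;
    this.pending = "";
    this.onFlush(text);
  }
}
```

In a React component, `onFlush` would typically be the only place that calls a state setter, so a lane emitting hundreds of tokens per second still triggers at most one re-render per tick.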

Tech Stack (High-Level)
• Frontend: React (with heavy optimization), TypeScript, Tailwind.
• Streaming: Native Fetch + ReadableStreams + SSE.
• Local Integration: Direct HTTP to Ollama/LM Studio/vLLM endpoints.
• Cloud: Standard OpenAI/Anthropic/etc. compatible APIs.
• State & Persistence: Encrypted LocalStorage + IndexedDB options, multi-tab BroadcastChannel sync.
• Build/Deploy: Static site on Vercel.

Duplex is fully open source under the AGPLv3. Attribution is appreciated if you build on it or fork it.
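For the streaming layer listed in the stack, the fiddly part is usually that network chunks do not line up with SSE event boundaries. Below is a simplified parser sketch that buffers partial lines between chunks. `SseParser` is a hypothetical name, and the parser deliberately handles only `data:` fields with LF or CRLF line endings; it ignores `event:`, `id:`, and `retry:` fields.

```typescript
class SseParser {
  private buffer = "";
  private dataLines: string[] = [];

  // Feed one decoded network chunk; returns the data payload of every
  // event that was completed by this chunk (possibly none).
  feed(chunk: string): string[] {
    const events: string[] = [];
    this.buffer += chunk;

    let newline = this.buffer.indexOf("\n");
    while (newline !== -1) {
      // Take one full line off the buffer, tolerating CRLF endings.
      const line = this.buffer.slice(0, newline).replace(/\r$/, "");
      this.buffer = this.buffer.slice(newline + 1);

      if (line === "") {
        // A blank line terminates the current event.
        if (this.dataLines.length > 0) {
          events.push(this.dataLines.join("\n"));
          this.dataLines = [];
        }
      } else if (line.indexOf("data:") === 0) {
        // Strip the field name and the single optional space after the colon.
        this.dataLines.push(line.slice(5).replace(/^ /, ""));
      }
      // Comment lines and other fields are ignored in this sketch.

      newline = this.buffer.indexOf("\n");
    }
    return events;
  }
}
```

A caller would pass each decoded chunk from a ReadableStream reader into `feed` and JSON-parse the returned payloads. OpenAI-compatible streams end with a literal `[DONE]` payload, which should be checked before parsing.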

Who Is This For?
• Developers and researchers who want maximum control and privacy.
• Power users running local models but needing occasional cloud boosts.
• Teams handling sensitive data (legal, medical, enterprise) who need air-gapped options.
• Anyone tired of paying per token for basic tasks that local hardware can handle.
• AI tinkerers who love comparing models in parallel and watching real-time metrics.

Try It & Get Involved

Head over to https://duplexinterface.vercel.app and start configuring your lanes. The in-app documentation is comprehensive — from CORS setup to performance interpretation to hardware sizing charts.

The repo: https://github.com/Ryuk1811/Duplex

Feel free to reach out at duplexinterface@outlook.com if you have questions, want to contribute ideas, or need setup help.

I’d love to hear your feedback, benchmark results on different hardware, or feature requests. What local/cloud combinations are you running? How does the parallelism feel compared to other tools you’ve tried?
Happy inferencing! Let’s push client-side AI forward together. 🚀
— Gurutva Murdia
Creator of Duplex
