Ever wanted to pit two LLMs against each other with the exact same prompt and see who wins — in real time? That's what I built into Locally Uncensored v2.1, and the implementation turned out to be more interesting than I expected.
## The Problem
I was constantly switching between Ollama models, trying to figure out which one actually gives better answers for my use cases. Copy-pasting prompts between tabs is tedious. So I built a split-view A/B comparison — same prompt, two models, streaming side by side.
The twist: the two models can be on completely different providers. Local Ollama model on the left, cloud API on the right. Different streaming protocols, different response formats, one unified view.
## How Parallel Streaming Actually Works
The core is surprisingly simple — two async generators consumed in parallel inside a `Promise.all()`:
```typescript
await Promise.all([streamA(), streamB()])
```
Each stream function independently:
- Resolves its provider via a `"provider::modelId"` prefix system (e.g., `"ollama::llama3.1:8b"` vs `"openai::gpt-4o"`)
- Opens its own streaming connection with its own `AbortController`
- Iterates chunks via `for-await-of` over an `AsyncGenerator`
- Writes to its own slice of a Zustand store
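Stripped to its essentials, the pattern looks something like this — a minimal sketch with stand-in names (`fakeProvider`, `runStream`, the `out` arrays); the real app resolves actual providers and writes to a Zustand store rather than arrays:

```typescript
type ChatStreamChunk = { content: string };

// Stand-in for a provider's AsyncGenerator of chunks
async function* fakeProvider(parts: string[]): AsyncGenerator<ChatStreamChunk> {
  for (const content of parts) {
    yield { content };
  }
}

// Each side owns its abort signal and its own output slice
async function runStream(
  gen: AsyncGenerator<ChatStreamChunk>,
  out: string[],
  signal: AbortSignal,
) {
  for await (const chunk of gen) {
    if (signal.aborted) break; // independent cancellation per column
    out.push(chunk.content);   // write only to this stream's slice
  }
}

async function compare() {
  const outA: string[] = [];
  const outB: string[] = [];
  const ctrlA = new AbortController();
  const ctrlB = new AbortController();

  // Both loops run concurrently; neither waits on the other
  await Promise.all([
    runStream(fakeProvider(["Hello", " from", " A"]), outA, ctrlA.signal),
    runStream(fakeProvider(["Hi", " from", " B"]), outB, ctrlB.signal),
  ]);

  return { a: outA.join(""), b: outB.join("") };
}
```

Because each loop writes to its own slice, neither side can clobber the other — the only coordination point is the final `Promise.all()` join.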
The tricky part? Each provider speaks a different streaming language:
- Ollama: NDJSON (newline-delimited JSON)
- OpenAI-compatible (OpenRouter, Groq, LM Studio): SSE (`data: {...}\n\n`)
- Anthropic: SSE with event types (`event: content_block_delta\ndata: {...}`)
Three different parsers (`parseNDJSONStream`, `parseSSEStream`, `parseSSEWithEvents`) all yield the same `ChatStreamChunk` type. The provider abstraction handles the translation.
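A simplified sketch of two of those parsers, assuming the wire formats above. The parser names match the post, but the internals here are assumptions — the real ones read from a `ReadableStream` and have to handle partial lines, and the exact `ChatStreamChunk` fields may differ:

```typescript
type ChatStreamChunk = { content: string; done: boolean };

// Ollama: one JSON object per line (NDJSON)
async function* parseNDJSONStream(
  lines: AsyncIterable<string>,
): AsyncGenerator<ChatStreamChunk> {
  for await (const line of lines) {
    if (!line.trim()) continue;
    const obj = JSON.parse(line);
    yield { content: obj.message?.content ?? "", done: obj.done === true };
  }
}

// OpenAI-style SSE: "data: {...}" frames, terminated by "data: [DONE]"
async function* parseSSEStream(
  frames: AsyncIterable<string>,
): AsyncGenerator<ChatStreamChunk> {
  for await (const frame of frames) {
    if (!frame.startsWith("data: ")) continue;
    const payload = frame.slice("data: ".length);
    if (payload === "[DONE]") {
      yield { content: "", done: true };
      return;
    }
    const obj = JSON.parse(payload);
    yield { content: obj.choices?.[0]?.delta?.content ?? "", done: false };
  }
}
```

The consumer never knows which format it's reading — it just iterates `ChatStreamChunk`s, which is what makes mixing providers in one view possible.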
## The Zustand State Race Condition Fix
Here's where it got interesting. React hooks create closures, and closures capture stale state. When you have two async loops writing to the same store concurrently, you can't use the hook's state reference:
```typescript
// BAD - stale closure
const store = useCompareStore()
// Inside async loop:
store.addContentA(chunk) // might write to old state

// GOOD - always fresh
useCompareStore.getState().addContentA(chunk)
```
`getState()` bypasses React's reactive system and reads the store directly. Since Zustand's `set()` is synchronous and each stream writes to separate slices (`messagesA` vs `messagesB`), there's no actual data race — but the stale closure would silently drop chunks.
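A minimal stand-in for a Zustand-style store makes the staleness concrete. This is not Zustand itself — just enough of its `getState()` contract to show why a captured snapshot goes stale while `getState()` stays live:

```typescript
type CompareState = {
  messagesA: string[];
  addContentA: (chunk: string) => void;
};

// Tiny vanilla store: set() replaces state immutably, getState() reads it live
function createStore<T>(
  init: (set: (partial: Partial<T>) => void, get: () => T) => T,
) {
  let state!: T;
  const set = (partial: Partial<T>) => { state = { ...state, ...partial }; };
  const get = () => state;
  state = init(set, get);
  return { getState: get };
}

const useCompareStore = createStore<CompareState>((set, get) => ({
  messagesA: [],
  addContentA: (chunk) => set({ messagesA: [...get().messagesA, chunk] }),
}));

// A React render would hand you a snapshot like this:
const snapshot = useCompareStore.getState();

// Writes that happen later (e.g. inside the async stream loop)...
useCompareStore.getState().addContentA("hello");

// ...are visible through getState(), but the old snapshot still sees []
```

The action itself reads through `get()`, so calling `useCompareStore.getState().addContentA(chunk)` always appends to the current state — that's the whole fix.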
## The CORS Problem Nobody Talks About
Local models (Ollama) run on `localhost:11434`. In a Tauri desktop app, you can't just `fetch()` localhost from the webview — CORS blocks it.

The solution: a dual-path fetch system. In dev mode, regular `fetch()` works (Vite proxy). In production, every local request routes through a Rust IPC command (`proxy_localhost_stream`) that makes the HTTP call from the Tauri backend and streams the response back through IPC. Cloud providers don't need this since they have proper CORS headers.
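The routing decision on the TypeScript side can be sketched like this. The names `isTauriProd`, `directFetch`, and `proxyViaIpc` are hypothetical stand-ins for the app's environment check, plain `fetch()`, and the call into the `proxy_localhost_stream` Rust command:

```typescript
type Fetcher = (url: string) => Promise<string>;

function makeLocalFetch(opts: {
  isTauriProd: boolean;  // packaged Tauri build vs dev server
  directFetch: Fetcher;  // dev: plain fetch() through the Vite proxy
  proxyViaIpc: Fetcher;  // prod: Rust backend makes the HTTP call
}): Fetcher {
  return (url) => {
    const isLocal =
      url.includes("localhost") || url.includes("127.0.0.1");
    // Cloud providers send CORS headers, so only localhost needs the IPC hop
    if (isLocal && opts.isTauriProd) return opts.proxyViaIpc(url);
    return opts.directFetch(url);
  };
}
```

Injecting the two fetchers keeps the provider code oblivious to which path its bytes actually took.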
## What You Actually See
Each column shows the response streaming in real-time with independent auto-scroll. When both finish, you get stats:
- Tokens per second
- Total time
- Token count
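Those stats fall out of chunk count and wall-clock time. A sketch — using `chunkCount` as the token proxy, which is the same approximation discussed below:

```typescript
// Per-column stats computed when a stream finishes.
// chunkCount ≈ tokens (roughly true for Ollama, looser for cloud APIs).
function streamStats(chunkCount: number, startMs: number, endMs: number) {
  const seconds = (endMs - startMs) / 1000;
  return {
    tokens: chunkCount,
    totalSeconds: seconds,
    tokensPerSecond: seconds > 0 ? chunkCount / seconds : 0,
  };
}
```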
It doubles as a quick benchmark. "Is `llama3.1:8b` faster than `qwen2.5:7b` on my hardware?" — you can literally watch them race.
## What I'd Do Differently
The token counting is approximate — it increments per chunk, not per actual token. Ollama chunks are roughly 1 token each, but OpenAI/Anthropic chunks can contain multiple tokens. Good enough for comparison, not for billing.
Also, conversation context is pulled from Model A's history only. Both models see the same history, but it's Model A's previous responses. Maintaining truly separate conversation threads would double memory usage and complicate the UX significantly.
The repo is MIT licensed if you want to dig into the code: Locally Uncensored on GitHub
This is part of Locally Uncensored, a standalone desktop app (Tauri v2 + React 19) for local AI chat, image gen, and video gen. Single .exe, no Docker required.