I'm building Holocron, a browser-based combat log analyzer for the Star Wars: The Old Republic (SWTOR) video game. The core product thesis is that parsers that stop at showing you numbers aren't useful enough. A good tool tells you what to do about them.
The coaching layer I'm building takes ~1500 tokens of structured combat stats (spec, abilities, DPS numbers, rule-based findings) and returns ~500 tokens of plain-language guidance. It runs after parsing, entirely client-side. No server. No account. No data leaving the browser.
I already had Ollama working as a local LLM provider. But Ollama requires the user to install a background service, pull a model, and make sure it's running. For a tool where frictionless entry is a design constraint, that's a real drop-off risk. So I ran a spike to find out whether @mlc-ai/web-llm with WebGPU could replace that setup entirely: just open the page, wait under 30 seconds on first visit (measured 23.7s on the test hardware), and get AI coaching with zero install.
This post covers the full methodology, every number I measured, and the implementation decisions I made based on the results.
The Output Contract
Before getting into models and benchmarks, it helps to understand exactly what I needed the LLM to produce. The coaching system has a strict output schema:
```typescript
interface CoachingOutput {
  narrativeSummary: string; // 2-3 sentence performance narrative
  additionalFindings: Array<{
    priority: number; // 1-3
    headline: string;
    body: string;
    recommendation: string;
  }>; // max 3 items
  additionalPositives: string[]; // max 3 plain strings
}
```
The schema is intentionally flat and bounded. additionalPositives is an array of strings, not objects. This matters. A lot. I'll come back to it.
Production validation rejects anything that doesn't conform. There's no "close enough" here.
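For concreteness, here's a minimal sketch of what that validation has to check. This is a hypothetical stand-in, not the production validator, but it captures the contract:

```typescript
// Minimal structural check for the coaching contract (illustrative sketch,
// not the production implementation).
function isValidCoachingOutput(value: unknown): boolean {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  if (typeof v.narrativeSummary !== "string") return false;

  if (!Array.isArray(v.additionalFindings) || v.additionalFindings.length > 3) return false;
  for (const f of v.additionalFindings) {
    if (typeof f !== "object" || f === null) return false;
    const g = f as Record<string, unknown>;
    if (typeof g.priority !== "number" || g.priority < 1 || g.priority > 3) return false;
    if (typeof g.headline !== "string") return false;
    if (typeof g.body !== "string") return false;
    if (typeof g.recommendation !== "string") return false;
  }

  // The check that matters most in practice: positives must be plain
  // strings, not finding-shaped objects.
  if (!Array.isArray(v.additionalPositives) || v.additionalPositives.length > 3) return false;
  return v.additionalPositives.every((p) => typeof p === "string");
}
```

That last check is the one that becomes important later in this post.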
Why WebLLM
WebLLM is an in-browser LLM inference engine built by the MLC AI team. It compiles models into a WebGPU-accelerated WASM runtime, ships a prebuilt model library hosted on HuggingFace, and exposes an OpenAI-compatible API. You load a model with CreateMLCEngine(), then call engine.chat.completions.create() exactly like you would with the OpenAI SDK.
The two features that made it worth spiking:
Grammar-constrained generation. WebLLM supports response_format: { type: 'json_object', schema: ... }, implemented at the WASM layer. This isn't prompt engineering hoping the model behaves. It enforces the schema at the token sampling level. The model literally cannot produce output that violates the schema.
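A sketch of what that call looks like in practice. The schema below mirrors the output contract; `engine` is assumed to be the result of `CreateMLCEngine(...)` from @mlc-ai/web-llm, typed loosely here so the snippet stands alone:

```typescript
// Sketch: grammar-constrained generation via WebLLM's OpenAI-compatible API.
// WebLLM accepts the JSON schema as a string in response_format.schema.
const coachingJsonSchema = JSON.stringify({
  type: "object",
  properties: {
    narrativeSummary: { type: "string" },
    additionalFindings: {
      type: "array",
      maxItems: 3,
      items: {
        type: "object",
        properties: {
          priority: { type: "integer", minimum: 1, maximum: 3 },
          headline: { type: "string" },
          body: { type: "string" },
          recommendation: { type: "string" },
        },
        required: ["priority", "headline", "body", "recommendation"],
      },
    },
    additionalPositives: { type: "array", maxItems: 3, items: { type: "string" } },
  },
  required: ["narrativeSummary", "additionalFindings", "additionalPositives"],
});

async function generateCoaching(engine: any, statsPrompt: string): Promise<string> {
  const reply = await engine.chat.completions.create({
    messages: [{ role: "user", content: statsPrompt }],
    // Enforced at the token-sampling level inside the WASM runtime:
    response_format: { type: "json_object", schema: coachingJsonSchema },
  });
  return reply.choices[0].message.content ?? "";
}
```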
OPFS caching. Model weights are cached to the Origin Private File System after the first download. A 1.3 GB model that takes 23 seconds to load cold takes 2.3 seconds warm. Repeat users pay nothing.
Test Setup
- Hardware: Apple Silicon Mac (Apple M3 Max, 64 GB unified memory)
- Browser: Chrome (WebGPU enabled)
- WebLLM version: 0.2.82
- Benchmark: 10 coaching prompts per model using the production SPF prompt structure (1500 token input, targeting 500 token output)
- Quality scoring: Automated 6-signal composite (0-100 scale), equally weighted: (1) narrative depth -- word count and sentence structure of narrativeSummary; (2) schema compliance -- all required fields present and correctly typed; (3) template parroting -- prompt text appearing verbatim in output; (4) ability name accuracy -- capitalized phrases cross-referenced against ability names present in the input; (5) finding duplication -- semantic overlap across additionalFindings items; (6) actionability -- presence of concrete, imperative language in recommendation fields. Template parrot and hallucination counts in the side-by-side table are raw per-prompt tallies, not components of the composite score.
I tested three models, chosen to cover the quality/size/speed tradeoff space:
| Model | Size | Notes |
|---|---|---|
| Llama-3.2-1B-Instruct-q4f16_1-MLC | ~0.7 GB | Smallest viable instruct model |
| Llama-3.2-3B-Instruct-q4f16_1-MLC | ~1.3 GB | Sweet spot candidate |
| Phi-3.5-mini-instruct-q4f16_1-MLC | ~2.0 GB | Quality ceiling for this size class |
I also ran the same 10 prompts against plain Ollama (no grammar enforcement) as a baseline for each model. That comparison turned out to be the most interesting part of the whole exercise.
Results
Llama 3.2 3B
| Metric | Value | Target | Verdict |
|---|---|---|---|
| Download size | ~1.3 GB | n/a | Acceptable |
| Cold load time | 23.7s | ≤ 30s | PASS |
| Warm load time | 2.3s | n/a | Excellent |
| Tokens/sec | 49.8 | ≥ 10 | PASS |
| GPU memory | ~3.0 GB | ≤ 2 GB | FLAG |
| JSON parse success | 10/10 | ≥ 9/10 | PASS |
| Schema valid | 10/10 | ≥ 9/10 | PASS |
| Avg quality score | 76/100 | subjective | Good |
| Avg latency/prompt | 5.8s | n/a | Acceptable |
Content quality: PASS
Strengths: substantive narratives (avg 20+ words), zero template parroting, all findings have real body and recommendation text, references specific numbers from input in 8/10 prompts.
Weaknesses: hallucinated ability names in 9/10 prompts (2-4 per prompt), occasional duplication of findings across 4 of 10 prompts, VRAM at ~3 GB exceeds the 2 GB flag.
GPU memory is measured at the browser process level and includes driver and WebGPU runtime overhead beyond model weights. The 3B weights alone are ~1.6 GB at 4-bit quantization; the remainder is KV cache at 1500 token context plus browser overhead. Numbers will vary across machines and Chrome versions. The 2 GB threshold assumes a minimum-spec user running SWTOR on a machine with 8 GB unified memory: the game typically holds 3–4 GB GPU memory under load, leaving 4 GB headroom. Anything above 2 GB for the coaching model narrows that margin on older hardware.
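For a rough sense of where memory goes beyond the weights, here's a back-of-the-envelope KV cache estimate. The architecture constants below are the published Llama 3.2 3B config values (28 layers, 8 KV heads via grouped-query attention, head dimension 128); this is an estimate, not a measurement:

```typescript
// Back-of-the-envelope KV cache size for Llama 3.2 3B with an f16 cache.
// Assumed architecture constants (published model config, not measured):
const LAYERS = 28;
const KV_HEADS = 8;   // grouped-query attention
const HEAD_DIM = 128;
const BYTES_F16 = 2;

// Per token: one K and one V vector per layer, per KV head.
const bytesPerToken = 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES_F16; // 114,688 bytes

const contextTokens = 1500; // the coaching prompt budget
const kvCacheMB = (bytesPerToken * contextTokens) / (1024 * 1024);
// Roughly 164 MB at 1500 tokens. Real allocations run higher because the
// runtime typically reserves the model's full configured context window,
// not just the tokens actually used -- plus browser and driver overhead.
```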
Ollama baseline comparison: Plain Ollama 3B is 10/10 JSON valid but only 1/10 schema valid. The model consistently emits additionalPositives as objects with headline/body/recommendation fields instead of plain strings. This is a silent breaking failure. Grammar-constrained WebLLM generation is 10/10 schema valid under identical prompts. Content quality: Ollama 73/100 vs WebLLM 76/100 -- no degradation from running in-browser.
Llama 3.2 1B
| Metric | Value | Target | Verdict |
|---|---|---|---|
| Download size | ~0.7 GB | n/a | Good |
| Cold load time | 11.7s | ≤ 30s | PASS |
| Warm load time | 1.4s | n/a | Excellent |
| Tokens/sec | 118.6 | ≥ 10 | PASS |
| GPU memory | ~1.1 GB | ≤ 2 GB | PASS |
| JSON parse success | 10/10 | ≥ 9/10 | PASS |
| Schema valid | 10/10 | ≥ 9/10 | PASS |
| Avg quality score | 70/100 | subjective | Marginal |
| Avg latency/prompt | 2.1s | n/a | Fast |
Content quality: MARGINAL FAIL
The numbers look great. 118.6 tok/s. 11.7s cold load. 1.4s warm. 0.7 GB download. Under the hood it falls apart.
Template parroting in 7/10 prompts -- the model echoes prompt text like "Things the player did well that the rule engine missed" verbatim in the output. Prompt 9 returned all three additionalPositives as identical copies of that string. Individual prompt scores ranged from 46 to 96. About 30% of runs would produce output that embarrasses the product. Speed doesn't offset that.
Ollama baseline comparison: Plain Ollama 1B is 8/10 schema valid (better than the 3B, because the simpler model apparently follows field type instructions more literally). Content quality: Ollama 52/100 vs WebLLM 70/100. The grammar constraints improve structural compliance and seem to improve content quality too, but the underlying weaknesses (parroting, duplication, hallucination) persist.
Phi-3.5 Mini
| Metric | Value | Target | Verdict |
|---|---|---|---|
| Download size | ~2.0 GB | n/a | Large |
| Cold load time | 37.4s | ≤ 30s | FAIL |
| Warm load time | 2.4s | n/a | Good |
| Tokens/sec | 52.5 | ≥ 10 | PASS |
| GPU memory | ~2.3 GB | ≤ 2 GB | FLAG |
| JSON parse success | 10/10 | ≥ 9/10 | PASS |
| Schema valid | 10/10 | ≥ 9/10 | PASS |
| Avg quality score | 77/100 | subjective | Good |
| Avg latency/prompt | 6.8s | n/a | Acceptable |
Content quality: PASS
Best average quality score (77), best narrative depth, most actionable recommendations, zero template parroting. But it fails the cold load target (37.4s against the 30s threshold) and draws the VRAM flag at ~2.3 GB. The 1-point quality delta over the 3B doesn't justify the extra 700 MB of download and the load-time failure.
Side-by-Side
| Metric | 3B | 1B | Phi-3.5 |
|---|---|---|---|
| Cold load | 23.7s | 11.7s | 37.4s |
| Warm load | 2.3s | 1.4s | 2.4s |
| Tok/s | 49.8 | 118.6 | 52.5 |
| Latency/prompt | 5.8s | 2.1s | 6.8s |
| Download | 1.3 GB | 0.7 GB | 2.0 GB |
| VRAM | ~3.0 GB | ~1.1 GB | ~2.3 GB |
| Quality | 76 | 70 | 77 |
| Schema valid | 10/10 | 10/10 | 10/10 |
| Template parrot | 0/10 | 7/10 | 0/10 |
| Hallucinations | 9/10 | 5/10 | 8/10 |
| Quality floor | 52 | 46 | 67 |
The Finding That Changes the Architecture
The Ollama baseline comparison wasn't in the original spike plan. I added it as a sanity check. It turned out to be the most important data in the whole exercise.
Plain Ollama 3B (no grammar enforcement) fails schema validation 90% of the time on this output contract. The model produces valid JSON. It just puts objects where the schema expects strings. parseLlmResponse() rejects it.
This means the existing Ollama integration, before this spike, was silently broken at the schema level for the 3B model. It would have worked fine for smaller models that happen to follow field type instructions more literally, but for the model you actually want to use for quality coaching, it would fail in production nearly every time.
WebLLM's grammar-constrained generation doesn't improve the situation. It defines the situation. Without it, you're rolling the dice on whether the model happens to output the right types.
Implication for any project using Ollama for structured output: Ollama added JSON schema support to its format parameter in v0.5. Use it. Note that Ollama enforces schema compliance at the completion layer, not at the token sampling level — it's not equivalent to constrained decoding, but it substantially improves structured-output reliability over prompt engineering alone. If you rely on prompts alone to get schema-compliant output from a small model, you're going to see silent failures in production that look like valid JSON until your validator catches them.
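A sketch of what that looks like against Ollama's /api/chat endpoint. `buildOllamaRequest` and `coachViaOllama` are hypothetical helpers, and the schema is abbreviated; the endpoint and the schema-valued `format` field are Ollama's real API:

```typescript
// Sketch: schema-constrained output from a local Ollama instance.
// Passing a JSON schema object as `format` (rather than the string "json")
// enables Ollama's structured outputs.
function buildOllamaRequest(prompt: string) {
  return {
    model: "llama3.2:3b",
    messages: [{ role: "user", content: prompt }],
    stream: false,
    format: {
      type: "object",
      properties: {
        narrativeSummary: { type: "string" },
        // ...additionalFindings omitted for brevity...
        additionalPositives: { type: "array", items: { type: "string" } },
      },
      required: ["narrativeSummary", "additionalPositives"],
    },
  };
}

async function coachViaOllama(prompt: string): Promise<string> {
  const res = await fetch("http://localhost:11434/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(buildOllamaRequest(prompt)),
  });
  const data = await res.json();
  return data.message.content;
}
```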
The Ability Hallucination Problem
Every model, at every size, hallucinates ability names. The 3B invents them in 9 out of 10 prompts. The 1B in 5 out of 10. Phi-3.5 in 8 out of 10.
Coaching that tells a player to "increase your uptime on Shadow Strike" when their class doesn't have an ability called Shadow Strike destroys credibility instantly. This is domain-specific and model-agnostic. The models don't have SWTOR ability databases. They pattern-match on capitalized phrases that look like they belong in a game context and generate plausible-sounding names.
The mitigation I'm implementing: post-process every LLM response against the set of ability names present in the input prompt. Any capitalized phrase in the output that isn't in the known set gets flagged. Starting in warn mode (log to console) before considering strip mode, because I want observability into how often this fires before making a content decision that could remove legitimate text.
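A sketch of that warn-mode pass. The capitalized-phrase regex is a deliberately crude heuristic: it also catches sentence-initial words and ordinary proper nouns, which is exactly why the rollout starts with logging rather than stripping:

```typescript
// Warn-mode ability check (sketch): flag capitalized phrases in the LLM
// output that aren't among the ability names supplied in the prompt.
function flagUnknownAbilities(output: string, knownAbilities: Set<string>): string[] {
  // Heuristic: runs of 1-3 capitalized words, e.g. "Shadow Strike".
  const candidates = output.match(/\b[A-Z][a-z]+(?: [A-Z][a-z]+){0,2}\b/g) ?? [];
  const flagged = candidates.filter((phrase) => !knownAbilities.has(phrase));
  for (const phrase of flagged) {
    // Warn mode: observe frequency before deciding whether to strip.
    console.warn(`[coach] possible hallucinated ability: "${phrase}"`);
  }
  return flagged;
}
```

A production version needs an allowlist for common English capitalizations and sentence starts; the point here is the shape of the check, not the final heuristic.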
This is a reminder that domain-specific hallucination isn't solved by model size. It's solved by grounding. If you're building in a domain with specific terminology (game abilities, medical terms, legal citations), plan for a validation pass.
Implementation Decisions
These are the implementation decisions the spike produced — the design I'm building toward. None of this is merged to production yet.
Chosen model: Llama-3.2-3B-Instruct-q4f16_1-MLC. Meets all performance targets. Quality comparable to Ollama baseline. Zero user setup.
Web Worker is non-optional. CreateWebWorkerMLCEngine runs all inference off the main thread; on the main thread, the ~24 second cold load freezes the UI completely.
Lazy loading. The model doesn't load on page load or provider construction. It loads on the first generateCoaching() call, with a progress callback wired to a UI progress bar. Repeat users hit the OPFS cache at 2.3s.
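The lazy-load pattern itself is an ordinary memoized promise. A sketch, with the loader injected so the shape stands alone — in the real provider the loader would wrap CreateWebWorkerMLCEngine with the progress callback:

```typescript
// Lazy, load-once engine initialization: the first generateCoaching() call
// triggers the download; concurrent callers share the same in-flight promise.
type ProgressCb = (pct: number) => void;

function makeLazyEngine<T>(loadEngine: (onProgress: ProgressCb) => Promise<T>) {
  let enginePromise: Promise<T> | null = null;
  return (onProgress: ProgressCb = () => {}): Promise<T> => {
    enginePromise ??= loadEngine(onProgress);
    return enginePromise;
  };
}
```

Because the promise is cached rather than the resolved engine, two coaching requests fired during the cold load both await the same download instead of starting two.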
VRAM guard. The 3B model uses ~3 GB GPU memory, which can conflict with the game running simultaneously on 8 GB machines. Before loading, call navigator.gpu.requestAdapter() and surface a warning if the device looks constrained. Don't block the load. Just warn. Sustained inference at ~50 tok/s also has thermal and power-draw implications on a laptop running the game simultaneously; the lazy-load design keeps idle overhead at zero.
1B fast mode. Exposed as an opt-in user preference (webllmModel: '3b' | '1b'), persisted to localStorage. Disclosed as "Fast mode uses a smaller model. Coaching depth may be reduced." The 1B quality floor is too low to be a default, but at 118.6 tok/s and 11.7s cold load it's genuinely compelling for users who know what they're trading.
Fallback chain: WebLlmProvider (if WebGPU available) -> OllamaProvider (if localhost:11434 reachable, with schema enforcement) -> rule-based coaching (always available). Never let a WebLLM failure surface to the user.
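The chain reduces to a simple loop over providers that share one interface. A sketch (interface and names are illustrative, not the production types); the rule-based provider at the end is always available, so the loop always terminates with output:

```typescript
// Provider fallback chain: first available provider wins; any failure
// falls through silently to the next.
interface CoachingProvider {
  name: string;
  isAvailable(): Promise<boolean>;
  generateCoaching(prompt: string): Promise<string>;
}

async function coachWithFallback(providers: CoachingProvider[], prompt: string): Promise<string> {
  for (const p of providers) {
    try {
      if (await p.isAvailable()) return await p.generateCoaching(prompt);
    } catch {
      // Swallow and fall through: a WebLLM failure must never reach the user.
    }
  }
  throw new Error("unreachable when the rule-based provider is last");
}
```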
CSP update. Model shards fetch from HuggingFace CDN. Add bounded connect-src exceptions for https://huggingface.co and https://raw.githubusercontent.com. No wildcards.
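With both hosts pinned, the added directive is a one-liner (sketch; any existing connect-src sources your app already needs stay in front of it):

```
Content-Security-Policy: connect-src 'self' https://huggingface.co https://raw.githubusercontent.com
```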
What I'd Do Differently
Test grammar enforcement before testing model quality. The schema compliance numbers are what determine whether the integration works at all. Content quality is a secondary concern. A model that produces 80/100 content but fails schema validation 50% of the time is less useful than a model that produces 70/100 content and passes validation 100% of the time.
For anyone running a similar spike on a different output schema: start with structured generation enforced at the runtime level. Don't test prompt-engineering-only compliance and expect it to generalize.
Conclusion
WebLLM with WebGPU is production-ready for this use case — and it's what I'm building toward. The 3B Llama model clears every performance target, produces coaching quality that matches the Ollama baseline, and requires zero user setup. Grammar-constrained generation isn't a nice-to-have -- it's the feature that makes small-model structured output viable at all.
The ability hallucination problem is real and unsolved by model size. Plan for a post-processing validation pass if your domain has specific terminology.
The most useful thing I measured was the thing I almost didn't measure: what happens when you remove the grammar enforcement. The answer is that it breaks quietly and often. If I were running this spike again, I'd run the Ollama baseline first — before testing any WebLLM model. Schema compliance is a binary gate. There's no point benchmarking content quality on a model whose output your validator will reject.
Holocron is a browser-based SWTOR combat log analyzer. It's free, requires no install, and all parsing happens client-side. If you play SWTOR and want to understand your logs, try it at holocronparse.com.