DEV Community: Juan Pablo Enriquez Ortiz

5 production patterns for running Gemma 4 in the browser — what the docs don't tell you

Juan Pablo Enriquez Ortiz — Sat, 23 May 2026 04:25:37 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

I spent 11 days shipping AULA — an AI tutor that runs Gemma 4 entirely inside the browser for Latin American students without reliable internet. The build forced me to learn things about deploying Gemma 4 in production that the official documentation glosses over.

This post distills the 5 patterns I wish someone had handed me on day one. Every one of them cost me hours (or in one case, an entire afternoon) to figure out. If you're shipping Gemma 4 to real users on real hardware, this is the playbook I would have wanted.

If you want to see the result first, AULA's repo is here: github.com/jpablortiz96/aula. The Build with Gemma 4 submission has the full demo. This post is the technical postmortem.

Pattern 1 — MediaPipe is the right runtime, not transformers.js (yet)

If you Google "run Gemma 4 in the browser", you'll mostly find tutorials using @huggingface/transformers.js. It's a fantastic library and the obvious starting point. I started there too.

On my development laptop — a Windows machine with an NVIDIA RTX 3050 (Ampere, 6 GB VRAM) — transformers.js with WebGPU gave me 2 tokens per second. The benchmarks I'd seen online claimed 20-30 tok/s on similar hardware. Something was very wrong.

After a full afternoon of debugging (chrome://gpu, Task Manager GPU monitor, NVIDIA Control Panel, Vulkan flags, switching to Edge), I found the root cause: on NVIDIA Optimus laptops, dispatch was routing through the integrated Intel UHD GPU, not the discrete NVIDIA. WebGPU's requestAdapter({ powerPreference: 'high-performance' }) is ignored on Windows (Chromium bug 369219127). The model "worked" but ran on the wrong silicon.
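A quick way to confirm which GPU WebGPU actually selected is to log the adapter's identity before loading a model. The sketch below is an illustration under stated assumptions, not AULA's code: `describeAdapter`, `isLikelyIntegrated`, and `logSelectedAdapter` are hypothetical helper names, the vendor check is a rough heuristic (Intel also ships discrete GPUs), and `adapter.info` is only exposed by recent Chromium builds.

```typescript
// Subset of the fields WebGPU exposes on GPUAdapterInfo.
interface AdapterInfoLike {
  vendor: string;
  architecture: string;
  device?: string;
  description?: string;
}

// Human-readable label for logs and bug reports.
function describeAdapter(info: AdapterInfoLike): string {
  const parts = [info.vendor, info.architecture, info.description ?? ""];
  return parts.filter((part) => part.length > 0).join(" / ");
}

// Heuristic (assumption): on an Optimus laptop, an Intel adapter usually
// means the integrated GPU was chosen instead of the discrete NVIDIA one.
function isLikelyIntegrated(info: AdapterInfoLike): boolean {
  return info.vendor.toLowerCase().includes("intel");
}

// Browser-only: asks WebGPU for the high-performance adapter and reports
// what it actually returned.
async function logSelectedAdapter(): Promise<void> {
  const gpu = (globalThis as any).navigator?.gpu;
  if (!gpu) {
    console.warn("WebGPU is not available in this environment.");
    return;
  }
  const adapter = await gpu.requestAdapter({ powerPreference: "high-performance" });
  const info: AdapterInfoLike | undefined = adapter?.info;
  if (!info) {
    console.warn("No adapter info exposed; check chrome://gpu instead.");
    return;
  }
  console.log(`WebGPU adapter: ${describeAdapter(info)}`);
  if (isLikelyIntegrated(info)) {
    console.warn("Inference may be running on the integrated GPU.");
  }
}
```

Calling `logSelectedAdapter()` once at startup makes a wrong-adapter situation visible in the console instead of showing up only as low throughput.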

What fixed it: migrating to @mediapipe/tasks-genai with the WebGPU delegate.

import { FilesetResolver, LlmInference } from "@mediapipe/tasks-genai";

const fileset = await FilesetResolver.forGenAiTasks(
  "https://cdn.jsdelivr.net/npm/@mediapipe/tasks-genai/wasm"
);

const llm = await LlmInference.createFromOptions(fileset, {
  baseOptions: {
    modelAssetPath: "https://huggingface.co/litert-community/gemma-4-e2b-it/resolve/main/gemma-4-e2b-it-int4-web.task",
  },
  maxTokens: 2048,
  topK: 40,
  temperature: 0.7,
});

const response = await llm.generateResponse(prompt);

Same hardware. Same model. Jumped from 2 tok/s to 14-16 tok/s. A 7x speedup, just from switching runtimes.

MediaPipe is Google's official runtime for Gemma on edge devices. The team optimized the dispatch path specifically for the WebGPU delegate. It's also the only path that supports the .task artifact format Google publishes for browser deployment.

Lesson: if you're targeting consumer hardware in 2026, MediaPipe is the production runtime. transformers.js is excellent for prototyping but has not yet caught up on dispatch quality across all GPU/OS combinations. Use MediaPipe for the local engine and revisit transformers.js in 6-12 months.

Pattern 2 — Pick the right Gemma 4 variant for the constraint, not the benchmark

Gemma 4 comes in four variants, and the marketing pages emphasize the 31B Dense and 26B MoE as the headline models. For a browser deployment, the only variant that actually matters is the E2B (~2 billion effective parameters, quantized to ~1.5 GB).

Here's the honest tradeoff matrix I built when picking the model for AULA's local engine:

Variant	Size on disk	Runs in browser?	Math/reasoning quality	When to use
Gemma 4 E2B-IT	~1.5 GB (q4f16)	✅ Yes, WebGPU	Good for conversational tutoring	Local browser deployments
Gemma 4 E4B-IT	~3 GB (q4f16)	⚠️ Only on 8 GB+ VRAM GPUs	Slightly better than E2B	Mid-range GPUs only
Gemma 4 26B-A4B-IT (MoE)	~13 GB	❌ Cloud only	Near-31B quality, lower latency	Cloud API for structured output
Gemma 4 31B-IT (Dense)	~16 GB	❌ Cloud only	Best reasoning	When latency doesn't matter

For AULA's offline-first use case, the picking logic was straightforward: the model has to fit in memory on a Raspberry Pi 5 (8 GB of RAM shared between CPU and GPU). E4B is too big the moment you account for KV cache + browser overhead. E2B fits with margin.

The non-obvious learning: on my RTX 3050 (6 GB VRAM), I tried to ship with E4B because it scored better on benchmarks. The model loaded but spilled into shared system memory via PCIe, dropping inference to ~1.8 tok/s. Switching to E2B (which actually fits in dedicated VRAM) jumped me back to 14+ tok/s.

Rule of thumb: for in-browser inference, the right model is the largest one whose weights and KV cache fit entirely in dedicated VRAM after reserving ~1.5 GB for browser/system overhead. Anything larger spills into shared system memory over PCIe, and throughput collapses.
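The rule of thumb can be written as a small selection function. This is a minimal sketch, assuming you have your own estimate of each model's runtime footprint (weights plus KV cache and activations); `ModelOption`, `pickLargestFittingModel`, and the footprint numbers are hypothetical placeholders, not measurements from AULA.

```typescript
interface ModelOption {
  id: string;
  // Estimated runtime footprint in GB: weights plus KV cache and activations.
  footprintGb: number;
}

// Returns the largest option that fits in dedicated VRAM once the
// browser/system overhead is reserved, or null if nothing fits.
function pickLargestFittingModel(
  options: ModelOption[],
  dedicatedVramGb: number,
  overheadGb = 1.5,
): ModelOption | null {
  const budgetGb = dedicatedVramGb - overheadGb;
  const fitting = options.filter((option) => option.footprintGb <= budgetGb);
  if (fitting.length === 0) return null;
  return fitting.reduce((best, option) =>
    option.footprintGb > best.footprintGb ? option : best,
  );
}

// Placeholder footprints for illustration only; measure your own.
const candidates: ModelOption[] = [
  { id: "small", footprintGb: 2 },
  { id: "large", footprintGb: 5 },
];

console.log(pickLargestFittingModel(candidates, 6)?.id); // "small" (budget 4.5 GB)
console.log(pickLargestFittingModel(candidates, 8)?.id); // "large" (budget 6.5 GB)
```

The size on disk is only a lower bound for the footprint, which is why the sketch takes an explicit runtime estimate rather than the file size.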

For Cloud Boost (the optional half of AULA), I picked 26B-A4B over 31B Dense despite the lower parameter count. The mixture-of-experts architecture activates only ~4B parameters per forward pass, giving 2-3x lower latency at near-31B quality. For short structured outputs (a quiz JSON, a Mermaid diagram), this latency difference is the difference between "feels instant" and "user gives up".

Pattern 3 — Don't force small models into rigid structured output

This is the pattern I had to relearn three times before accepting it.

Gemma 4 E2B is brilliant at conversational tasks: math explanations, language tutoring, Socratic dialogue, multi-step reasoning in plain text. It is not reliable at:

Producing valid JSON without surrounding prose
Generating syntactically-valid Mermaid diagrams
Outputting coherent SVG with proper geometry
Following "respond ONLY with X" instructions consistently

This is not a bug. It's a known property of small open models. The "instruction following" capability scales roughly with parameter count, and at 2B effective parameters, E2B sits at the edge.

My first three approaches all failed:

Stricter prompts. "Respond ONLY with valid JSON, no markdown, no prose." This worked about 70% of the time. The other 30% of the time, the model added an explanation paragraph or a "Here is the JSON:" prefix.
Higher temperature for diversity, lower for structure. Marginal improvement, but introduced its own failure modes.
Tolerant JSON parser that strips fences and reaches for the first {. Helped, but didn't fix the cases where the model produced almost-valid JSON with unescaped quotes inside string values.
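For reference, the tolerant parser from the third attempt looks roughly like the sketch below. It is a reconstruction under assumptions, not AULA's exact code, and `parseLenientJson` is a hypothetical name. It also shows why the approach has a ceiling: stripping fences recovers wrapped JSON, but unescaped quotes still make `JSON.parse` throw.

```typescript
// Tries to recover a JSON object from model output that may be wrapped in
// prose or Markdown code fences. Returns null when nothing parseable is found.
function parseLenientJson(raw: string): unknown {
  // Drop fence markers: three backticks, optionally followed by "json".
  const withoutFences = raw.replace(/`{3}(?:json)?/gi, "");
  const start = withoutFences.indexOf("{");
  const end = withoutFences.lastIndexOf("}");
  if (start === -1 || end <= start) return null;
  try {
    return JSON.parse(withoutFences.slice(start, end + 1));
  } catch {
    // Almost-valid JSON (for example unescaped quotes) still fails here.
    return null;
  }
}

const fence = "`".repeat(3);
const wrapped = `Here is the JSON:\n${fence}json\n{"answer": 42}\n${fence}`;
const broken = '{"hint": "Use the "chain rule" here"}';

console.log(parseLenientJson(wrapped)); // { answer: 42 }
console.log(parseLenientJson(broken)); // null
```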

What actually worked: route structured-output features to a larger model in the cloud (26B-A4B), keep the local model for conversational features, and be transparent about the routing in the UI.

In AULA, every screen shows a badge: green for local, blue for cloud. The user always knows which engine answered. This is the design pattern I'd argue for as a general principle:

Don't pretend your small model can do something it can't. Make the limitation a UX surface, not a hidden failure mode.

Here's the shape of the routing logic:

// Routing decision per feature, not per request
function chooseEngine(feature: Feature, hasApiKey: boolean): EngineId {
  const structuredOutputFeatures: Feature[] = [
    "infinite-practice",      // requires JSON
    "svg-illustration",       // requires valid SVG
    "mermaid-mindmap",        // requires strict syntax
    "interactive-quiz",       // requires JSON array
    "handwriting-ocr",        // requires vision (E2B is text-only)
  ];

  if (structuredOutputFeatures.includes(feature)) {
    return hasApiKey ? "cloud-boost" : "unavailable";
  }
  return "local"; // chat, voice, calculator, Socratic, etc.
}

And the user always sees the routing decision, with an honest reason if cloud isn't available:

if (engine === "unavailable") {
  showInfoMessage(
    "This feature needs Cloud Boost. Add your free Google AI Studio API key in Settings to unlock it. The rest of AULA works offline regardless."
  );
}

Pattern 4 — `LlmInference` is exclusive. Build a queue.

This bit me on day 9 and cost me half a day to diagnose.

MediaPipe's LlmInference instance is a singleton with exclusive access. It can process exactly one generation at a time. If you call generateResponse() while a previous generation is still in flight, you get:
Previous invocation or loading is still ongoing.

In a single-page app with multiple routes (chat, practice, mind maps), this is easy to trigger:

User starts a long response in /chat
User navigates to /practice before it finishes
/practice tries to generate an exercise
The model is locked. Everything breaks.

The fix: a FIFO queue with abort propagation across navigations.

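import type { LlmInference } from "@mediapipe/tasks-genai";

class LocalEngine {
  private isGenerating = false;
  private currentAbort: AbortController | null = null;
  // Queued tasks keep their reject callback so abort() can settle them.
  private queue: Array<{ run: () => Promise<void>; reject: (e: Error) => void }> = [];

  constructor(private llm: LlmInference) {}

  // GenerateOptions is the app's own options type (unused in this excerpt).
  async generate(prompt: string, opts: GenerateOptions): Promise<string> {
    // Cancel any in-flight generation
    if (this.isGenerating) {
      this.abort();
      await new Promise((r) => setTimeout(r, 200)); // small buffer
    }

    return new Promise((resolve, reject) => {
      const run = async () => {
        this.isGenerating = true;
        const controller = new AbortController();
        this.currentAbort = controller;
        try {
          const result = await this.llm.generateResponse(prompt);
          // Nothing in this class passes the signal to MediaPipe, so it does
          // not stop the generation; it only lets us discard a stale result.
          if (controller.signal.aborted) reject(new Error("Generation aborted"));
          else resolve(result);
        } catch (err) {
          reject(err);
        } finally {
          // Release the lock only if forceReset() has not already done so.
          if (this.currentAbort === controller) {
            this.isGenerating = false;
            this.currentAbort = null;
            this.queue.shift()?.run();
          }
        }
      };
      this.isGenerating ? this.queue.push({ run, reject }) : run();
    });
  }

  abort() {
    this.currentAbort?.abort();
    this.rejectQueued();
  }

  // Recovery path when the model gets stuck
  forceReset() {
    this.isGenerating = false;
    this.currentAbort = null;
    this.rejectQueued();
  }

  // Reject dropped requests so their callers do not wait forever.
  private rejectQueued() {
    for (const task of this.queue) task.reject(new Error("Generation aborted"));
    this.queue = [];
  }
}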

Critical: every component that uses the engine must call abort() on unmount.

useEffect(() => {
  return () => engine.abort();
}, []);

Without this cleanup, navigating away mid-generation leaves the model locked, and the next page that wants to generate will silently hang.

Pattern 5 — Gemma 4 26B does not stream reliably. Use `generateContent`, not `streamGenerateContent`.

This one took an afternoon and a careful read of the DevTools Network tab to find.

The Gemini API exposes two endpoints for Gemma 4 models:
POST .../models/gemma-4-26b-a4b-it:generateContent ← Full response
POST .../models/gemma-4-26b-a4b-it:streamGenerateContent ← SSE chunks

For chat use cases, you obviously want streaming. So I wired everything through :streamGenerateContent?alt=sse and assumed it would Just Work.

It did, for chat. It returned 400 Bad Request for AULA's Practice and Mind Map features.

The DevTools investigation revealed: when the prompt requested structured output (JSON, SVG, Mermaid), the streaming endpoint failed with 400 while the non-streaming endpoint succeeded with the same payload. I never got a clear root cause from the API — it may be a Gemma-specific quirk in how streamGenerateContent handles certain responseSchema configurations or thinking-mode trailers.

The fix that unblocked everything: two separate API client paths.

// For chat — streaming, long responses
async function streamChat(prompt: string, onToken: (t: string) => void) {
  const res = await fetch(
    `${BASE}/gemma-4-26b-a4b-it:streamGenerateContent?alt=sse&key=${apiKey}`,
    { method: "POST", body: JSON.stringify({ contents: [...] }) }
  );
  // ...parse SSE chunks, call onToken per chunk
}

// For structured output — single-shot, short responses
async function cloudGenerate(opts: CloudOptions): Promise<string> {
  const res = await fetch(
    `${BASE}/gemma-4-26b-a4b-it:generateContent?key=${apiKey}`,
    { method: "POST", body: JSON.stringify({
      systemInstruction: { parts: [{ text: opts.system }] },
      contents: [{ role: "user", parts: [{ text: opts.prompt }] }],
      generationConfig: { temperature: 0.85, maxOutputTokens: 1024 },
    })}
  );
  const data = await res.json();
  // Filter out "thought" parts (Gemma 4 thinking mode)
  const parts = data.candidates?.[0]?.content?.parts ?? [];
  return parts
    .filter((p) => !p.thought)
    .map((p) => p.text ?? "")
    .join("");
}

Lesson: for short structured outputs, streaming buys you nothing. The user is waiting for one complete artifact, not a slow reveal of text. Use generateContent. Save streaming for genuine chat.

One more detail worth flagging: Gemma 4 has a thinking mode that emits "thought" parts in the response. If you naively concatenate all parts[].text, you'll surface the model's chain-of-thought in the user-visible output. Filter on part.thought === true and skip those. AULA's chat looked very weird until I added that filter — the model was literally showing its work to the student, which is not the goal.
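The SSE parsing elided in `streamChat` and the thought filter can be combined in two small pure functions. This is a minimal sketch, assuming the stream delivers `data:` lines separated by blank lines and that each payload has the same `candidates[0].content.parts` shape as the non-streaming response; `splitSseEvents` and `visibleText` are hypothetical names, not AULA's helpers.

```typescript
interface Part {
  text?: string;
  thought?: boolean;
}

interface StreamChunk {
  candidates?: Array<{ content?: { parts?: Part[] } }>;
}

// Splits buffered SSE text into complete "data:" payloads and returns the
// trailing incomplete event so it can be prepended to the next network chunk.
function splitSseEvents(buffer: string): { payloads: string[]; rest: string } {
  const blocks = buffer.replace(/\r\n/g, "\n").split("\n\n");
  const rest = blocks.pop() ?? "";
  const payloads = blocks
    .map((block) =>
      block
        .split("\n")
        .filter((line) => line.startsWith("data:"))
        .map((line) => line.slice("data:".length).trimStart())
        .join("\n"),
    )
    .filter((payload) => payload.length > 0);
  return { payloads, rest };
}

// Extracts user-visible text from one chunk, skipping thinking-mode parts.
function visibleText(payload: string): string {
  const chunk = JSON.parse(payload) as StreamChunk;
  const parts = chunk.candidates?.[0]?.content?.parts ?? [];
  return parts
    .filter((part) => !part.thought)
    .map((part) => part.text ?? "")
    .join("");
}

// One complete event followed by the start of a second, incomplete one.
const sample =
  'data: {"candidates":[{"content":{"parts":[{"text":"plan","thought":true},{"text":"4"}]}}]}' +
  '\n\ndata: {"cand';
const demo = splitSseEvents(sample);
console.log(demo.payloads.map(visibleText)); // logs an array containing "4"
console.log(demo.rest); // data: {"cand
```

In the fetch loop, decode each `Uint8Array` with `TextDecoder` (using `{ stream: true }`), append it to the previous `rest`, call `splitSseEvents` again, and pass each `visibleText` result to `onToken`.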

What this means for developers shipping Gemma 4 today

If you're building with Gemma 4 in 2026, the patterns I'd internalize before writing a single line of code:

MediaPipe for browser, period. Don't waste a week on transformers.js benchmarks. Migrate or start there.
Pick the model that fits in VRAM, not the model that benchmarks best. Spilling to PCIe destroys throughput. On most consumer hardware in 2026, E2B is the only realistic browser model.
Design routing as a UX surface. Small models can't do everything. Make the limitation visible and let the user opt into cloud where it matters. Honesty beats hiding limitations behind retries.
Treat LlmInference as a single-threaded mutex. Queue your requests, abort on unmount, expose a recovery path. The cost of not doing this is a frustrating "the AI broke" experience that the user can't diagnose.
Streaming is for chat. generateContent is for everything else. Don't fight the API.

These five patterns saved me probably a week of additional debugging once I internalized them. AULA exists because Gemma 4 is genuinely good enough to run in a browser tab — but it only feels good to use because the patterns above turn the rough edges into smooth UX.

What I'm hopeful about

The interesting thing about all five patterns above: none of them are about Gemma 4's quality. They're about deployment ergonomics. The model itself is remarkable. A 2-billion-parameter open model that runs at 15 tok/s in a browser tab and can hold a real tutoring conversation with a high school student is a thing that genuinely did not exist 18 months ago.

For my specific use case — students in rural Latin America who have no other access to AI tools — Gemma 4 is the first model that crosses the practical viability line. It's small enough to download once over a school WiFi connection. It's capable enough to teach. It runs offline. It's free.

If you're working on local-first AI for any underserved population, I'd encourage you to start with Gemma 4. The deployment patterns above will save you a week. The model will do the rest.

If you want to see what the patterns look like in a finished product, AULA is open source under MIT: github.com/jpablortiz96/aula. Pull requests welcome.

About the author: I'm a solo founder in Cali, Colombia building educational tech for Latin American students. AULA was built solo in 11 days for this challenge.

Companion submission (Build track): AULA — The AI tutor that fits in a browser tab — live demo, video walkthrough, full architecture.

🇨🇴 Made in LATAM, for the students the world forgot.

AULA — The AI tutor that fits in a browser tab, built for the students the internet leaves behind

Juan Pablo Enriquez Ortiz — Sat, 23 May 2026 04:13:16 +0000

This is a submission for the Gemma 4 Challenge: Build with Gemma 4

What I Built

AULA is a complete AI tutoring platform that runs Google's Gemma 4 entirely inside the browser — no server, no account, no internet required after the first 1.5 GB download. It is designed for the 65+ million Latin American students living in areas where reliable internet is the exception, not the norm.

The premise is simple: if Gemma 4 can run on a Raspberry Pi 5, it can run on a teacher's laptop in rural Boyacá, Colombia. With WebGPU and MediaPipe, this is now possible — and AULA is what that looks like as a finished product.

The problem AULA solves

In Latin America, ~40% of students live with unreliable, capped, or non-existent connectivity. ChatGPT, Gemini, Khan Academy's AI tutor — all require a stable connection. The very tools that could close the global education gap are inaccessible exactly where they are needed most.

AULA flips this: the AI runs on the student's device, not on a server thousands of miles away.

What AULA does — offline (100% local, Gemma 4 E2B)

After loading once, these features work with WiFi off, in airplane mode, in a rural school with no signal:

🎓 Conversational tutor — chat with Gemma 4 in natural language. Full LaTeX rendering for math and science. ~15 tokens/sec on a mid-range laptop GPU.
🧮 Scientific calculator that teaches — visual keypad with trig functions, exponents, roots. Gemma 4 doesn't just solve. It explains the why.
🎙️ Voice tutoring (bidirectional) — ask by speaking, listen to the response. Optional hands-free mode chains them together.
🦉 Socratic mode — Gemma 4 stops giving answers and only asks guiding questions. Pedagogy-first.
🤔 "Explain it simpler" — three escalating reformulation levels on demand.
💡 Conceptual error detection — Gemma 4 diagnoses which concept the student misunderstood, not just "wrong, try again".
📚 Persistent study sessions in IndexedDB. No cloud sync ever.
♿ Accessibility first — high contrast, large text, easy reading mode (for dyslexia), auto-read responses.
🌍 Spanish ↔ English — full i18n. System prompts translate, not just the labels.
🏆 Local gamification — XP, levels, streak, achievements. All in the browser.

What AULA does — Cloud Boost (optional, Gemma 4 26B-A4B)

For features that require strict structured output (which is beyond what a 2B-parameter model can do reliably), AULA routes through the user's own free Google AI Studio API key:

✍️ Handwritten whiteboard — draw equations with finger or mouse, Gemma 4 reads and solves.
📷 Photo OCR + reasoning — point camera at a printed exercise, get a step-by-step solution.
♾️ Infinite adaptive practice — exercises that never repeat, with difficulty calibrated dynamically.
🎯 Interactive student quiz — self-assessment with scoring and per-error conceptual review.
👩‍🏫 Teacher mode with PDF export — generate quizzes, export student/teacher PDFs ready to print.
🎨 SVG illustrations — Gemma 4 generates educational diagrams.
🗺️ Mermaid mind maps — concept diagrams rendered interactively, downloadable as PNG/SVG.

Critical: Cloud Boost is always opt-in. AULA never sends data without an explicit API key configured by the user. The core educational experience never requires the internet.

Demo

🎥 Watch the 2-minute walkthrough: https://youtu.be/d0jN8Kw_Cz4

🔗 Live demo: https://aula.run (or local: pnpm dev -p 3100 after cloning)

Key screenshots

Chat tutor running 100% locally with full LaTeX rendering

Mermaid mind maps generated by Gemma 4 — click to enlarge, download as PNG

SVG illustrations — educational diagrams generated by Gemma 4

Scientific calculator that explains, powered locally

Teacher mode with PDF export — ready for classroom

Accessibility built-in: high contrast mode

Code

🔗 Repository: https://github.com/jpablortiz96/aula

The repo includes a comprehensive README with architecture diagrams, hardware benchmarks across devices (Raspberry Pi 5 to RTX 3050 to MacBook M3), full tech stack documentation, and a roadmap for v1.1 through v3.0.

License: MIT

How I Used Gemma 4

AULA uses a dual-engine architecture with intentional model selection for each tier:

Model	Variant	Where it runs	What it powers
Gemma 4 E2B-IT	~1.5 GB (q4f16 quantized)	Browser, via MediaPipe + WebGPU	All offline features
Gemma 4 26B-A4B-IT	Cloud (MoE)	Gemini API	Structured-output features

Why Gemma 4 E2B for local

The E2B variant is the only Gemma 4 model that fits realistically on consumer hardware while preserving the multimodal capability path. It runs at:

~15 tokens/sec on an NVIDIA RTX 3050 laptop
~20-25 tokens/sec on a MacBook M3
~7 tokens/sec on a Raspberry Pi 5 (CPU fallback)

This range covers every realistic device a Latin American student or teacher might have access to, from an $80 single-board computer to a school laptop. The 31B Dense model would never fit in a browser tab; the 26B MoE requires server-grade resources. E2B is the only viable choice for the rural offline use case, and that's exactly why I picked it.

Why Gemma 4 26B-A4B for cloud-enhanced features

Some features in AULA require strict structured output: JSON for quiz exercises, syntactically-valid Mermaid for mind maps, coherent SVG for illustrations. Small models are unreliable for this — they're brilliant at conversation but tend to add prose around JSON, produce malformed SVG, or break Mermaid syntax.

Rather than fight this limitation or hide it, AULA makes the routing explicit and visible to the user. Every screen shows which engine answered: green badge for local, blue badge for cloud. The 26B-A4B variant gives me near-31B quality at substantially lower latency thanks to its mixture-of-experts architecture — ideal for short structured outputs.

Technical challenges I solved

1. transformers.js was not viable on NVIDIA Optimus laptops.
My first prototype used transformers.js + WebGPU. On an RTX 3050, I got 2 tokens/sec because dispatch was routing through the iGPU. Migrating to MediaPipe's WebGPU delegate unlocked 14-16 tokens/sec on the same hardware — a 7x improvement. MediaPipe is Google's official runtime for Gemma 4 on edge, and the difference is real.

2. Concurrency on LlmInference is exclusive.
A single MediaPipe LlmInference instance processes one prompt at a time. When /chat and /practice competed for the same singleton, the model locked with the error "Previous invocation or loading is still ongoing." I implemented a FIFO queue with abort propagation across navigations, plus a forceReset() recovery path.

3. Gemma 4 26B does not support streamGenerateContent reliably.
This took an afternoon of DevTools debugging to identify: calling :streamGenerateContent returned 400, while :generateContent (no streaming) worked perfectly. The fix was creating a separate cloudNoStream.ts helper for Practice, Illustrator, and Mermaid — features that don't benefit from streaming anyway since the user is waiting for one complete response.

4. Easy Reading Mode is more than a CSS toggle.
For students with dyslexia or reading difficulties, AULA changes both the visual presentation (letter spacing, line height, max-width) and the system prompt sent to Gemma 4 ("Short sentences. Simple vocabulary. One idea per line."). This is the kind of accessibility that AI uniquely enables — the model adapts its output style, not just the typography.
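The prompt side of Easy Reading Mode can be sketched as a pure function. This is a minimal sketch, assuming the style instruction is simply appended to the base system prompt; `buildSystemPrompt`, `TutorSettings`, and the base prompt text are hypothetical, and only the style sentence itself is quoted from above.

```typescript
// Style instruction quoted above for Easy Reading Mode.
const EASY_READING_STYLE = "Short sentences. Simple vocabulary. One idea per line.";

interface TutorSettings {
  easyReading: boolean;
}

// Appends the style instruction only when Easy Reading Mode is on.
function buildSystemPrompt(basePrompt: string, settings: TutorSettings): string {
  return settings.easyReading
    ? `${basePrompt}\n\n${EASY_READING_STYLE}`
    : basePrompt;
}

// Hypothetical base prompt, for illustration only.
const basePrompt = "You are a patient tutor.";
console.log(buildSystemPrompt(basePrompt, { easyReading: true }));
```

Keeping the toggle in one function means the same setting can drive both the CSS class and the prompt, so the two never drift apart.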

What Gemma 4 unlocked that wasn't possible 18 months ago

Browser-native inference at this quality was genuinely impossible until WebGPU stabilized. AULA is only buildable in 2026. The combination of Gemma 4's open weights, WebGPU's GPU access, and MediaPipe's optimized runtime is what makes a Pi-friendly AI tutor a real thing, not a thought experiment.

For 65 million students in Latin America who have been excluded from the AI revolution, this matters more than I can describe in this post.

Tech stack: Next.js 15, TypeScript strict, Tailwind v4, MediaPipe LLM Inference, WebGPU, Gemini API (REST + SSE), Zustand, IndexedDB, jsPDF, Mermaid, tesseract.js, Web Speech API.

Built solo in 11 days for the DEV.to Gemma 4 Challenge.

AULA is open source under MIT. Fork it, run it in your school, contribute to it. If you're a teacher in a low-connectivity region and want help deploying AULA, open an issue on GitHub.

🇨🇴 Made in LATAM, for the students the world forgot.

Building AccessBridge AI: How 5 AI Agents Collaborate to Make the Web Accessible

Juan Pablo Enriquez Ortiz — Sat, 28 Mar 2026 02:04:39 +0000

Built for the JS AI Build-a-thon 2026 — Agents for Impact

The Problem That Inspired Us

96.3% of the top million websites fail basic accessibility standards.

That statistic, from the 2023 WebAIM Million report, stopped me cold. We're not talking about edge cases. We're talking about the overwhelming majority of the web being effectively inaccessible to the 1.3 billion people who live with some form of disability.

The tools that exist today are part of the problem. Axe, WAVE, Lighthouse — these are excellent auditors. They'll tell you that you have 23 accessibility violations. What they won't do is fix a single one of them. The burden always falls back on the developer, who may not have the time, budget, or expertise to address every flag.

We wanted to change the question from "Where are the problems?" to "Here's the fixed version — would you like to use it?"

That's AccessBridge AI.

What We Built

AccessBridge AI is a multi-agent system where 5 specialized AI agents collaborate in real-time to transform any web page into universally accessible content. You paste a URL. Fifteen seconds later, you get back:

An accessibility score (before and after) on a 0-100 scale
A list of every issue found, with the agent that found it and the confidence score
An automatically transformed HTML file with fixes applied
A full decision log explaining every choice the system made
A WCAG breakdown across all four principles: Perceivable, Operable, Understandable, Robust

When you analyze a URL, here's what happens under the hood:

The Orchestrator fetches the HTML server-side (15s timeout, custom User-Agent)
The Scanner, Vision, Simplifier, and Navigator agents all run in parallel
The Orchestrator resolves conflicts between agents (more on this below)
High-confidence fixes are automatically applied to the HTML
Low-confidence suggestions are flagged for human review
Scores are calculated and the full result is returned to the UI
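The split between auto-applied fixes and fixes flagged for review can be sketched as a simple partition. This is an illustration under assumptions: the cut-off value is not stated anywhere in this post, each fix is assumed to carry its own confidence, and `ProposedFix`, `triageFixes`, and `AUTO_APPLY_THRESHOLD` are hypothetical names.

```typescript
interface ProposedFix {
  selector: string;
  newValue: string;
  confidence: number; // 0..1
}

// Assumed cut-off; the real threshold is not documented here.
const AUTO_APPLY_THRESHOLD = 0.8;

// Partitions fixes into those applied automatically and those left for a human.
function triageFixes(fixes: ProposedFix[], threshold = AUTO_APPLY_THRESHOLD) {
  const autoApply: ProposedFix[] = [];
  const needsReview: ProposedFix[] = [];
  for (const fix of fixes) {
    if (fix.confidence >= threshold) autoApply.push(fix);
    else needsReview.push(fix);
  }
  return { autoApply, needsReview };
}

const triaged = triageFixes([
  { selector: "img.hero", newValue: "Students in a classroom", confidence: 0.93 },
  { selector: "p.intro", newValue: "Shorter intro text", confidence: 0.55 },
]);
console.log(triaged.autoApply.length, triaged.needsReview.length); // 1 1
```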

On our test runs: average score improvement of +31 to +42 points, depending on how accessibility-challenged the original page was.

Architecture Deep Dive

The BaseAgent Contract

Every agent in the system implements a single interface:

export interface BaseAgent {
  name: string;
  type: AgentType;
  description: string;
  analyze(html: string, url: string, context?: any): Promise<AgentResult>;
}

That's it. Every agent receives raw HTML and a URL, and returns a structured AgentResult containing issues found, fixes proposed, metadata, and a confidence score. This contract is what makes the system composable — swapping the cloud Vision agent for an offline heuristic agent requires zero changes to the Orchestrator.

The AgentResult shape:

export interface AgentResult {
  agentType: AgentType;
  status: AgentStatus;
  issues: AccessibilityIssue[];
  fixes: Array<{
    selector: string;
    attribute?: string;
    oldValue: string;
    newValue: string;
    reason: string;
  }>;
  metadata: Record<string, any>;
  startTime: number;
  endTime: number;
  confidence: number;
}

Every fix carries a selector (CSS selector targeting the element), the old value, the new value, and — critically — a human-readable reason. This is what powers the Decision Log in the UI.

The `isEnhancement` Flag: An Honest Score Model

This is one of the subtler design decisions that took three iterations to get right.

The problem: Vision and Simplifier agents find opportunities — images that could have better alt text, paragraphs that could be simpler. These aren't pre-existing defects that the website owner created. They're improvements AccessBridge can make. If we counted them in the scoreBefore calculation, we'd be artificially penalizing the site for things it never claimed to do.

The solution: an isEnhancement flag on AccessibilityIssue.

export interface AccessibilityIssue {
  // ...
  fixApplied: boolean;
  confidence: number;
  /** True for Vision / Simplifier issues that represent *improvements*
   *  AccessBridge found, not pre-existing defects. These are shown in the
   *  UI but never penalise scoreBefore, and their fixes (if applied)
   *  add to scoreAfter. */
  isEnhancement?: boolean;
}

The scoring model then becomes additive — honest and non-decreasing:

// scoreBefore: only real pre-existing defects (Scanner + Navigator)
function calcScoreBefore(issues: IssueLike[]): number {
  const baseline = issues.filter(
    i => !i.isEnhancement &&
         (i.agentType === AgentType.SCANNER || i.agentType === AgentType.NAVIGATOR),
  );
  let score = 100;
  for (const i of baseline) {
    if (i.severity === 'critical')    score -= 10;
    else if (i.severity === 'major')  score -= 5;
    else                              score -= 2;
  }
  return Math.max(0, Math.min(100, score));
}

// scoreAfter: scoreBefore + points earned per applied fix
const FIX_POINTS: Record<string, number> = {
  vision:     3,  // contextual alt text
  navigator:  4,  // structural fixes have high WCAG impact
  simplifier: 2,  // readability improvements
  scanner:    3,
};

function calcScoreAfter(issues: IssueLike[], before: number): number {
  let gain = 0;
  for (const i of issues) {
    if (!i.fixApplied) continue;
    gain += FIX_POINTS[i.agentType] ?? 2;
  }
  return Math.max(0, Math.min(100, before + gain));
}

Navigator gets the highest fix points because structural changes — adding landmark regions, fixing heading hierarchy, inserting skip links — have the biggest real-world impact for keyboard and screen reader users.
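A worked example makes the arithmetic concrete. The sketch below re-implements the two scoring functions with simplified stand-in types (`Issue`, `Agent`, and the "minor" severity label are assumptions for illustration) and uses the penalty and fix-point values shown above.

```typescript
type Severity = "critical" | "major" | "minor";
type Agent = "scanner" | "navigator" | "vision" | "simplifier";

interface Issue {
  agent: Agent;
  severity: Severity;
  fixApplied: boolean;
  isEnhancement?: boolean;
}

const PENALTY: Record<Severity, number> = { critical: 10, major: 5, minor: 2 };
const FIX_POINTS: Record<Agent, number> = { vision: 3, navigator: 4, simplifier: 2, scanner: 3 };

const clamp = (n: number) => Math.max(0, Math.min(100, n));

// Only pre-existing Scanner/Navigator defects lower the starting score.
function scoreBefore(issues: Issue[]): number {
  const penalty = issues
    .filter((i) => !i.isEnhancement && (i.agent === "scanner" || i.agent === "navigator"))
    .reduce((sum, i) => sum + PENALTY[i.severity], 0);
  return clamp(100 - penalty);
}

// Every applied fix, including enhancements, adds points.
function scoreAfter(issues: Issue[], before: number): number {
  const gain = issues
    .filter((i) => i.fixApplied)
    .reduce((sum, i) => sum + FIX_POINTS[i.agent], 0);
  return clamp(before + gain);
}

const issues: Issue[] = [
  { agent: "scanner", severity: "critical", fixApplied: true },
  { agent: "scanner", severity: "major", fixApplied: false },
  { agent: "navigator", severity: "major", fixApplied: true },
  { agent: "navigator", severity: "minor", fixApplied: true },
  { agent: "vision", severity: "minor", fixApplied: true, isEnhancement: true },
];

const before = scoreBefore(issues); // 100 - 10 - 5 - 5 - 2 = 78
const after = scoreAfter(issues, before); // 78 + 3 + 4 + 4 + 3 = 92
console.log(before, after); // 78 92
```

The Vision issue never lowers the starting score, but its applied fix still earns 3 points, which is what keeps the after-score from dropping below the before-score.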

Parallel Execution via Promise.all

All four specialist agents run concurrently. The Orchestrator wraps each in a try/catch so one failing agent (e.g., Azure timeout) doesn't bring down the entire analysis:

const settled = await Promise.all(
  this.agents.map(async (agent) => {
    this.emitEvent({
      timestamp: Date.now(),
      agentType: agent.type,
      status: AgentStatus.WORKING,
      message: `${agent.name} started analyzing…`,
    });

    try {
      const result = await agent.analyze(html, url);
      this.emitEvent({
        timestamp: Date.now(),
        agentType: agent.type,
        status: AgentStatus.DONE,
        message: `${agent.name} found ${result.issues.length} issues`,
        data: { issueCount: result.issues.length, fixCount: result.fixes.length },
      });
      return result;
    } catch (error) {
      const msg = error instanceof Error ? error.message : String(error);
      this.emitEvent({
        timestamp: Date.now(),
        agentType: agent.type,
        status: AgentStatus.ERROR,
        message: `${agent.name} failed: ${msg}`,
      });
      return null;
    }
  })
);

Each emitEvent call feeds the real-time agent timeline in the UI via an EventEmitter pattern — the Orchestrator extends Node's EventEmitter, and the API route streams events back to the browser using a readable stream.

The Conflict Resolution Engine

Agents running in parallel will inevitably step on each other's toes. We handle two conflict types:

Type 1: Same WCAG rule, same element, different agents.
The Scanner might flag img:nth-of-type(3) for missing alt text (WCAG 1.1.1), and so might the Navigator. We deduplicate by keeping the first reporter:

const seenIssues = new Map<string, { agentType: AgentType }>();

for (const result of results) {
  for (const issue of result.issues) {
    const key = `${issue.selector}::${issue.wcagRule}`;
    const existing = seenIssues.get(key);

    if (existing && existing.agentType !== result.agentType) {
      // Log conflict, first-reporter wins
      conflicts.push({ winner: existing.agentType, reasoning: 'First-reporter wins' });
    } else if (!existing) {
      seenIssues.set(key, { agentType: result.agentType });
    }
  }
}

Type 2: Vision vs Simplifier — the context preservation conflict.
This is the interesting one. Imagine Vision generates the alt text: "Promotes transforming your future through education and growth opportunities" for an image inside a paragraph. Then Simplifier comes along and rewrites that paragraph to be shorter. Now the alt text no longer makes sense in context — screen reader users would hear the simplified text followed by the original (now out-of-context) alt text.

Our rule: Vision always wins over Simplifier on the same element. If a Vision-fixed image lives inside a paragraph that Simplifier wants to rewrite, that paragraph is blocked:

// Find text blocks where Vision has fixed an img inside
for (const simplFix of simplifierResult.fixes) {
  const $block = $(simplFix.selector).first();

  for (const imgSel of visionImgSelectors) {
    if ($block.find(imgSel).length > 0) {
      // Vision wins — block the Simplifier fix
      blockedSimplifierSelectors.add(simplFix.selector);
      conflicts.push({
        winner: AgentType.VISION,
        reasoning:
          'Alt text generated by Vision Agent is calibrated to the image\'s ' +
          'surrounding context. Rewriting that context could make the alt text ' +
          'misleading for screen reader users.',
      });
      break; // one conflict entry per blocked paragraph is enough
    }
  }
}

This conflict — and its resolution — is recorded in the Decision Log and surfaced in the Responsible AI panel so users can understand why a particular fix wasn't applied.

The Secret Sauce: Contextual Alt Text

The Vision Agent is where Azure OpenAI earns its place in the system.

Most accessibility scanners will tell you: "This image has no alt text." The best ones will say: "Add meaningful alt text." But what does "meaningful" mean for a specific image on a specific page?

Before even calling the API, we extract rich context from the DOM:

interface ExtractedContext {
  imageUrl: string;
  filename: string;
  imageType: 'decorative' | 'functional' | 'informative';
  heading: string;         // nearest ancestor or preceding h1-h6
  surroundingText: string; // text content of parent element
  caption: string;         // figcaption if present
  linkText: string;        // text of wrapping <a> if present
  title: string;           // title attribute
  selector: string;
  snippet: string;         // the raw HTML element
  currentAlt: string | undefined;
}

The agent classifies each image into one of three roles:

Decorative — purely visual, no information content → alt="" (handled by Scanner)
Functional — inside a link or button → alt text describes the destination/action
Informative — content image → alt text describes what the image communicates
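The three roles can be sketched as a small classifier over facts collected from the DOM. This is an illustration, not the agent's actual code: the `ImageFacts` shape and the decorative markers (`role="presentation"`, `role="none"`, `aria-hidden="true"`) are assumptions, since the post only states that functional images sit inside a link or button.

```typescript
type ImageRole = "decorative" | "functional" | "informative";

// Hypothetical input: facts a DOM walk would collect for one <img>.
interface ImageFacts {
  insideLinkOrButton: boolean;
  roleAttr?: string;   // value of the role attribute, if present
  ariaHidden?: string; // value of aria-hidden, if present
}

function classifyImageRole(facts: ImageFacts): ImageRole {
  // Assumed markers for "purely visual" images; checked first so an
  // explicit author signal is respected.
  const markedDecorative =
    facts.roleAttr === "presentation" ||
    facts.roleAttr === "none" ||
    facts.ariaHidden === "true";
  if (markedDecorative) return "decorative";

  // Stated in the post: functional means inside a link or button.
  if (facts.insideLinkOrButton) return "functional";

  // Everything else is treated as a content image.
  return "informative";
}
```

Checking the decorative markers before the link test is a design choice in this sketch; a real implementation might prefer the opposite order for images that are the only content of a link.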

This role classification shapes the system prompt sent to GPT-4o:

You are an accessibility expert generating alt text for a web image.

Image role: FUNCTIONAL (this image is inside a link or button)
For functional images, describe the DESTINATION or ACTION, not just what you see.
Generate alt text that a screen reader user would find helpful.

RULES:
- Be concise (under 125 characters)
- Describe PURPOSE, not visual appearance
- If it's functional, what does it DO or WHERE does it go?
- Do NOT start with "Image of", "Picture of", "Photo of"
- Do NOT include quotes in your response

Context:
- Surrounding text: "Learn more about our engineering bootcamp programs"
- Link text: "Apply now"
- Nearest heading: "Transform Your Career in Tech"

Result: "Apply for engineering bootcamp — Transform Your Career in Tech"

Without this context, a generic vision model might return: "A button with text."
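A rough sketch of how a prompt like the one above could be assembled from the extracted context. The function name `buildAltTextPrompt`, the `PromptContext` shape, and the wording of the informative guidance are assumptions; the RULES section is left out for brevity.

```typescript
type PromptRole = "functional" | "informative";

// Subset of the extracted context that feeds the prompt.
interface PromptContext {
  heading: string;
  surroundingText: string;
  caption: string;
  linkText: string;
}

const ROLE_GUIDANCE: Record<PromptRole, string> = {
  functional:
    "Image role: FUNCTIONAL (this image is inside a link or button)\n" +
    "For functional images, describe the DESTINATION or ACTION, not just what you see.",
  // Assumed wording; the post only shows the functional variant.
  informative:
    "Image role: INFORMATIVE (this is a content image)\n" +
    "For informative images, describe what the image communicates.",
};

function buildAltTextPrompt(role: PromptRole, ctx: PromptContext): string {
  const labelled: Array<[string, string]> = [
    ["Surrounding text", ctx.surroundingText],
    ["Link text", ctx.linkText],
    ["Caption", ctx.caption],
    ["Nearest heading", ctx.heading],
  ];
  // Only include context fields that actually carry text.
  const contextLines = labelled
    .filter(([, value]) => value.trim() !== "")
    .map(([label, value]) => `- ${label}: "${value.trim()}"`);

  return [
    "You are an accessibility expert generating alt text for a web image.",
    "",
    ROLE_GUIDANCE[role],
    "",
    "Context:",
    ...(contextLines.length > 0 ? contextLines : ["- (none available)"]),
  ].join("\n");
}
```

Dropping empty fields keeps the prompt from carrying lines like `Caption: ""`, which would add noise without adding context.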

The agent also penalizes its own confidence score when context is thin:

const hasContext = ctx.heading || ctx.surroundingText || ctx.caption || ctx.linkText;
confidence = hasContext ? 0.88 : 0.72;

A confidence below 0.5 means the fix is surfaced as a suggestion, never auto-applied. This is human-in-the-loop by design — the system acknowledges its own uncertainty.

Going Offline: Accessibility Without Internet

We built two fully functional modes from day one, and the offline mode is not a degraded fallback — it's a genuine capability with a specific use case.

Why? Because the communities that most need accessibility tooling — nonprofits, government agencies in emerging markets, small educational institutions — often have unreliable or metered internet connectivity. A tool that stops working without cloud connectivity isn't truly accessible.

Feature	☁️ Cloud	📡 Offline
Scanner (20+ WCAG rules)	Full	Full (same code)
Navigator (structure)	Full	Full (same code)
Vision (alt text)	AI-powered via GPT-4o	5-tier heuristic
Simplifier (readability)	AI rewriting	Deterministic splitting
Typical speed	~12 seconds	~2 seconds
Privacy	Processed via Azure	Zero external requests

The Offline Vision Heuristic

When no API key is present, the Vision agent falls back to a 5-tier priority system:

Tier 1: <img> inside <a>
  → "Link to {link text}" or "Link to {domain name}"
  Rationale: functional images communicate navigation intent

Tier 2: <figure> with <figcaption>
  → Use the caption verbatim (the author already wrote it)

Tier 3: Meaningful filename
  → "hero-education-program.jpg" → "Hero education program image"
  (strip extension, convert hyphens/underscores to spaces, capitalize the first word, append "image")

Tier 4: Nearest heading in the DOM
  → "Image related to: {heading text}"

Tier 5: Image URL domain
  → "Image — cdn.example.com"

All offline Vision issues are marked isEnhancement: true with confidence 0.5, which means they're auto-applied (the threshold is ≥ 0.5) but don't penalize the before-score.
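The tier order above can be sketched as a single function that returns at the first tier that applies. Only the name `generateFallbackAlt` comes from the prompts quoted later in this post; the `FallbackContext` shape, the `insideLink` and `linkHref` fields, and the test for a "meaningful" filename are assumptions made for illustration.

```typescript
// Hypothetical context shape for the offline fallback.
interface FallbackContext {
  imageUrl: string;
  filename: string;
  insideLink: boolean; // true when the <img> sits inside an <a>
  linkText: string;
  linkHref: string;    // assumed field, used for "Link to {domain name}"
  caption: string;
  heading: string;
}

// Extract the host from an absolute URL without relying on the URL global.
function hostOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  return (match && match[1]) || "";
}

// Returns "" when the filename looks machine-generated (assumed rule).
function humanizeFilename(filename: string): string {
  const words = filename
    .replace(/\.[a-z0-9]+$/i, "") // strip extension
    .replace(/[-_]+/g, " ")       // hyphens/underscores to spaces
    .trim();
  const hasRealWord = /[a-z]{3,}/i.test(words);
  const looksGenerated = /^(img|image|dsc|photo|screenshot)?\s*\d*$/i.test(words);
  if (!hasRealWord || looksGenerated) return "";
  return words.charAt(0).toUpperCase() + words.slice(1).toLowerCase() + " image";
}

function generateFallbackAlt(ctx: FallbackContext): string {
  // Tier 1: functional image inside a link.
  if (ctx.insideLink) {
    const target = ctx.linkText.trim() || hostOf(ctx.linkHref);
    if (target) return `Link to ${target}`;
  }
  // Tier 2: figcaption, used verbatim.
  if (ctx.caption.trim()) return ctx.caption.trim();
  // Tier 3: meaningful filename.
  const fromName = humanizeFilename(ctx.filename);
  if (fromName) return fromName;
  // Tier 4: nearest heading.
  if (ctx.heading.trim()) return `Image related to: ${ctx.heading.trim()}`;
  // Tier 5: domain of the image URL (\u2014 is the dash used in the tier list).
  const host = hostOf(ctx.imageUrl);
  return host ? `Image \u2014 ${host}` : "Image";
}
```

Each tier returns early, so adding or reordering tiers is a local change.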

The Offline Simplifier

The offline Simplifier uses a deterministic algorithm instead of calling GPT-4o. For any <p> element with a sentence over 30 words, it attempts a three-pass split:

Pass 1 — Natural break (comma near midpoint):
  Find the comma closest to the midpoint, searching within ±30% of the sentence length.
  "The program, which was founded in 2019, has helped over 1,000 students..."
  → Split at the comma before "has"

Pass 2 — Conjunction split:
  Find the first coordinating/subordinating conjunction after the midpoint:
  (and, but, which, because, however, although, while, whereas...)
  → Split before the conjunction, add a period

Pass 3 — Hard midpoint:
  If no natural break found, split at the word nearest the midpoint.
  (Last resort — preserves meaning better than cutting arbitrarily)
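The three passes can be sketched as follows. This is a minimal word-based version under stated assumptions: the function names are hypothetical, the ±30% window is measured in words around the midpoint, and the conjunction list contains only the words named above.

```typescript
const SPLIT_CONJUNCTIONS = ["and", "but", "which", "because", "however", "although", "while", "whereas"];

// Close the first half with a period and capitalize the second half.
function cutAt(words: string[], at: number): string[] {
  const first = words.slice(0, at).join(" ").replace(/[,;:]$/, "") + ".";
  const rest = words.slice(at).join(" ");
  return [first, rest.charAt(0).toUpperCase() + rest.slice(1)];
}

function splitLongSentence(sentence: string, maxWords = 30): string[] {
  const words = sentence.trim().split(/\s+/);
  if (words.length <= maxWords) return [sentence.trim()];

  const mid = Math.floor(words.length / 2);
  const reach = Math.floor(words.length * 0.3); // assumed reading of "±30%"

  // Pass 1: the comma closest to the midpoint, within the window.
  let bestCut = -1;
  words.forEach((word, i) => {
    const cut = i + 1; // split after the word that carries the comma
    if (!word.endsWith(",") || cut >= words.length) return;
    const distance = Math.abs(cut - mid);
    if (distance <= reach && (bestCut === -1 || distance < Math.abs(bestCut - mid))) {
      bestCut = cut;
    }
  });
  if (bestCut !== -1) return cutAt(words, bestCut);

  // Pass 2: the first conjunction at or after the midpoint; split before it.
  const conjunctionAt = words.findIndex(
    (word, i) =>
      i >= mid && SPLIT_CONJUNCTIONS.indexOf(word.toLowerCase().replace(/[^a-z]/g, "")) !== -1
  );
  if (conjunctionAt !== -1) return cutAt(words, conjunctionAt);

  // Pass 3: hard split at the midpoint word.
  return cutAt(words, mid);
}
```

Sentences at or under the word limit are returned unchanged, so the function is safe to map over every sentence in a paragraph.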

Result on Wikipedia: Cloud mode +37 pts before the 100-point cap (+32 after capping), Offline mode +31 pts. The gap is real but smaller than you'd expect.

Responsible AI: Not an Afterthought

We made a deliberate decision early: transparency and human oversight are architectural requirements, not features we'd add later.

Every AgentEvent is timestamped and stored:

export interface AgentEvent {
  timestamp: number;
  agentType: AgentType;
  status: AgentStatus; // WORKING | DONE | ERROR | CONFLICT
  message: string;
  data?: any;
}

The Decision Log in the UI renders every event — including conflicts — in chronological order. Conflict events are highlighted in amber. The Responsible AI panel shows:

Transparency: total number of agent decisions logged
Human-in-the-Loop: count of suggestions vs auto-applied fixes
Confidence Scoring: breakdown of high/medium/low confidence fixes per agent
Privacy: mode used and data retention policy (none — all processing is ephemeral)

The confidence threshold for auto-apply is explicitly ≥ 0.5. Anything below that is shown as a suggestion with a reason: "Confidence 0.42 — flagged for human review."
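As a small sketch of that rule, the function below partitions proposed fixes at the 0.5 threshold and attaches the review reason to anything below it. The `ProposedFix` and `TriageResult` shapes and the name `triageFixes` are hypothetical.

```typescript
const AUTO_APPLY_THRESHOLD = 0.5;

// Hypothetical minimal shape of a proposed fix.
interface ProposedFix {
  selector: string;
  confidence: number;
}

interface TriageResult {
  autoApplied: ProposedFix[];
  suggestions: Array<ProposedFix & { reason: string }>;
}

function triageFixes(fixes: ProposedFix[]): TriageResult {
  const result: TriageResult = { autoApplied: [], suggestions: [] };
  fixes.forEach((fix) => {
    if (fix.confidence >= AUTO_APPLY_THRESHOLD) {
      result.autoApplied.push(fix);
    } else {
      // Below the threshold: never applied, only shown with a reason.
      result.suggestions.push({
        ...fix,
        reason: `Confidence ${fix.confidence.toFixed(2)} \u2014 flagged for human review`,
      });
    }
  });
  return result;
}
```

The two array lengths are what a Human-in-the-Loop counter would display: suggestions versus auto-applied fixes.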

This design reflects a real belief: AI systems that affect people's lives — and accessibility directly affects how 1.3 billion people experience the web — need to be auditable, explainable, and humble about their own limitations.

Building with AI: Our Claude Code Workflow

This section is the most honest part of this post.

We used Claude Code (Anthropic's CLI coding assistant) for the vast majority of this project. Here's what that actually looked like, warts included.

What Worked Exceptionally Well

Generating the type system. We gave Claude Code the exact interfaces we wanted and it produced clean, idiomatic TypeScript on the first try. The AccessibilityIssue, AgentResult, and AnalysisResult interfaces required almost no revision.

The Scanner Agent. We asked for a WCAG 2.1 auditor covering all four principles. Claude generated 20+ detection rules using cheerio, each wrapped in its own try/catch, with proper severity and WCAG rule codes. This would have taken a week to research and write manually.

UI components with specific constraints. When we described the exact visual behavior we wanted — "a segmented control using visually-hidden radio inputs, two options, with an offline disclaimer that animates in with aria-live='polite'" — we got exactly that. No hallucinated React libraries, no unnecessary dependencies.

Debugging TypeScript errors across a multi-agent system. When we hit a TS2322 error about IssueSeverity string literals, we described the error and the surrounding code, and got the right fix immediately: import the enum and use IssueSeverity.MAJOR instead of the string 'major'.
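For readers who have not hit this error, here is a minimal reproduction of the pattern. The enum values are assumptions (the post names only IssueSeverity.MAJOR and the literals 'critical', 'major', and 'minor'), and `parseSeverity` is a hypothetical helper for converting untyped strings.

```typescript
// Assumed enum definition for illustration.
enum IssueSeverity {
  CRITICAL = "critical",
  MAJOR = "major",
  MINOR = "minor",
}

interface SeverityCarrier {
  severity: IssueSeverity;
}

const sampleIssue: SeverityCarrier = { severity: IssueSeverity.MINOR };
// sampleIssue.severity = "major";          // rejected by the compiler (TS2322)
sampleIssue.severity = IssueSeverity.MAJOR; // accepted: uses the enum member

// Hypothetical helper: turn an untyped string into an enum member,
// or undefined when the string is not a known severity.
function parseSeverity(raw: string): IssueSeverity | undefined {
  switch (raw) {
    case "critical":
      return IssueSeverity.CRITICAL;
    case "major":
      return IssueSeverity.MAJOR;
    case "minor":
      return IssueSeverity.MINOR;
    default:
      return undefined;
  }
}
```

String enums are nominal in TypeScript: a plain string is not assignable to the enum type even when its text matches a member's value, which is why the literal has to be replaced by the member.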

What Didn't Work (At First)

The scoring algorithm needed three iterations to get right.

Our first attempt: a single calcScore(issues, afterFixes: boolean) function that counted all issues and tried to subtract fixed ones. When we tested in offline mode, scores were going down — from 72 to 51 — because Vision and Simplifier were generating issues that got counted against the baseline.

Second attempt: separate before/after calculations. Better, but still wrong — the "after" score was recounting all unfixed issues instead of adding earned points.

Third attempt: the additive model with the isEnhancement flag described above. The key insight came from identifying why the model was wrong, not just that it was wrong.

The lesson: AI-assisted coding works best when you can articulate the bug precisely. "The offline score goes down" didn't help. "The before-score counts Vision issues that are improvements, not defects — they shouldn't appear in the baseline" produced an exact, correct solution.
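To make the fix concrete, here is a sketch of an additive score with the isEnhancement exclusion. The `ScoredIssue` shape and the per-issue point values are hypothetical; only the two rules come from this post: enhancements stay out of the baseline, and the after-score adds earned points and is capped at 100.

```typescript
// Hypothetical issue shape for scoring.
interface ScoredIssue {
  points: number;         // assumed weight: points lost (defect) or earned (fix)
  isEnhancement: boolean; // true for improvements such as generated alt text
  fixed: boolean;         // true when a fix was auto-applied
}

// Baseline: only real defects count against the page.
function scoreBefore(issues: ScoredIssue[]): number {
  const lost = issues
    .filter((i) => !i.isEnhancement)
    .reduce((sum, i) => sum + i.points, 0);
  return Math.max(0, 100 - lost);
}

// Additive: start from the baseline and add points for every applied fix.
function scoreAfter(issues: ScoredIssue[]): number {
  const earned = issues
    .filter((i) => i.fixed)
    .reduce((sum, i) => sum + i.points, 0);
  return Math.min(100, scoreBefore(issues) + earned);
}
```

With this structure an enhancement can only raise the after-score; it can never lower the before-score, which is the behavior the first attempt got wrong.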

Complex cheerio selectors were brittle.

Early versions of the agents generated selectors like div.container > section:first-child > img:nth-child(3). These worked on the test page but broke on real sites. We had to manually establish the selector priority rule (id > class > src attribute > nth-of-type) and explain it precisely before the generated code became stable.
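A sketch of that priority rule, written against a plain object instead of a cheerio node so it stays self-contained. The `ElementFacts` shape and the name `stableSelector` are hypothetical, and CSS escaping of unusual ids or class names is left out.

```typescript
// Hypothetical facts about one element, collected during the DOM walk.
interface ElementFacts {
  tag: string;
  id?: string;
  classes: string[];
  src?: string;
  indexOfType: number; // 1-based position among siblings with the same tag
}

function stableSelector(el: ElementFacts): string {
  // 1. An id is unique on a valid page.
  if (el.id) return `#${el.id}`;
  // 2. Classes usually survive layout changes.
  if (el.classes.length > 0) return `${el.tag}.${el.classes.join(".")}`;
  // 3. The src attribute identifies an image regardless of position.
  if (el.src) return `${el.tag}[src="${el.src.replace(/"/g, '\\"')}"]`;
  // 4. Positional fallback, the most brittle option.
  return `${el.tag}:nth-of-type(${el.indexOfType})`;
}
```

The positional form is used only when nothing more stable exists, which avoids long ancestor chains like the one quoted above.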

Conflict resolution logic needed manual refinement.

The initial conflict resolution was purely deduplication. The Vision-vs-Simplifier context preservation conflict — where rewriting a paragraph could make an adjacent alt text misleading — was a design decision we arrived at ourselves, then asked Claude to implement. The "what" came from us; the "how" came from Claude.

Our Prompting Strategy

The difference between a prompt that works and one that doesn't, in our experience:

Specify the interface, not just the behavior.
Instead of: "Create a Scanner Agent that checks accessibility"
We used: "Create a Scanner Agent that implements the BaseAgent interface below. It should use cheerio to parse the HTML and detect these specific WCAG violations, returning issues with these exact fields..."

Describe bugs with reproduction steps, not symptoms.
Instead of: "The score is wrong"
We used: "The scoreBefore function at line 35 is including Vision agent issues (marked isEnhancement: true) in its baseline count. These should be excluded. The fix should modify the filter condition to check !i.isEnhancement && before the agentType check."

Iterate on real output, not hypothetical code.
We ran the app, analyzed a real URL, saw the output, identified what was wrong, then described that specific wrong output and the expected correct output. Every iteration was grounded in real behavior.

Example prompts we used:

Create the Vision Agent in /src/agents/vision.ts. It must implement BaseAgent.
It should:
1. Use cheerio to find all <img> elements with missing or generic alt text
2. For each image, extract this context object: [interface definition]
3. Call Azure OpenAI with this exact system prompt: [prompt text]
4. Mark each issue with isEnhancement: true and confidence: 0.85
5. Fall back to generateFallbackAlt() if Azure is not configured
The fallback function should try: linkText → figcaption → filename → heading → domain

There is a TypeScript error in orchestrator.ts line 482:
  Type 'string' is not assignable to type 'IssueSeverity'
The line reads: issue.severity = 'major';
Fix it by importing IssueSeverity from @/types/agents and using IssueSeverity.MAJOR.
Apply the same fix wherever 'critical' and 'minor' string literals are used.

Results

We tested on a range of real websites. Here's a representative sample from a real run:

Test site: eduky.co (education platform)

Score before: 51 / 100
Score after: 93 / 100 (+42 points)
Issues detected: 21 across all 4 WCAG categories
Fixes auto-applied: 13 (high confidence)
Suggestions for review: 8 (lower confidence)
Analysis time (cloud): 14.4 seconds
Analysis time (offline): 1.6 seconds

WCAG Breakdown (before → after):

Perceivable: 48 → 89 (+41)
Operable: 71 → 85 (+14)
Understandable: 62 → 78 (+16)
Robust: 55 → 91 (+36)

Test site: Wikipedia (English article)

Score before: 68 / 100
Cloud mode score after: 105 → capped at 100 (+32 points)
Offline mode score after: 99 (+31 points)
Analysis time (cloud): 18.2 seconds (many images → many API calls)
Analysis time (offline): 2.1 seconds

The near-parity between cloud and offline on Wikipedia demonstrates that the heuristic offline agents are genuinely useful — most Wikipedia images follow predictable patterns (figures with captions, file-name-described diagrams) that the heuristic system handles well.

What's Next

AccessBridge AI was built in five days for a hackathon. Here's where we'd take it with more time:

Browser Extension — Run AccessBridge directly in the browser without pasting URLs. Inject the transformed HTML into the current tab so users can see the before/after in situ.

CI/CD Integration — An API endpoint that returns a machine-readable WCAG report and exits non-zero when critical violations are detected. Plug it into your GitHub Actions pipeline: no PR gets merged if it regresses accessibility.

Foundry Local Integration — Replace the offline heuristics with actual on-device AI inference using Azure AI Foundry Local and Phi-4. True intelligence without any internet dependency.

Multi-language Support — The Simplifier currently targets English readability (Flesch-Kincaid). Extending to Spanish, French, and Portuguese would dramatically expand the tool's impact in underserved markets.

Accessibility Score Tracking — Store historical scores per domain. Show a site owner their accessibility trend over time, not just a single snapshot.

Try It Yourself

🚀 Live Demo

📦 GitHub Repository

Drop any public URL into the analyzer and watch 5 agents work in real time. Try it on:

A site you own (and care about improving)
https://example.com (a minimal, intentionally bare page)
A Wikipedia article (rich with images, complex structure)
A government or nonprofit site (where accessibility matters most)

If the analysis finds something, the fixed HTML is available for download immediately.

Final Thoughts

Accessibility is one of those problems where the technical solution is well-understood and the barrier is almost entirely friction. We know what good alt text looks like. We know what heading hierarchy should be. We know what ARIA landmarks do. The problem is that fixing 47 violations across a 200-page website is a week of tedious work.

AI agents can absorb that friction. Not perfectly — our confidence scores and human-in-the-loop design reflect genuine humility about what the system can and can't do reliably. But good enough, fast enough, that the decision for a small nonprofit to have an accessible website no longer has to be "we can't afford the developer time."

That's the goal. Everything else is implementation details.

Built with ❤️ for the JS AI Build-a-thon 2026 — because the web should work for everyone.

— Juan Pablo Enriquez Ortiz