I Built Live Captions in the Browser — No API Key, No Server

#webdev #javascript #beginners #ai

You don't need Whisper, an API key, or a server of your own to caption speech in real time. Chrome and Edge expose a speech-to-text engine through the Web Speech API. I built live captions with it in about 30 lines of JavaScript, and here's the whole thing.

🎤 Try it (talk to it): https://dev48v.infy.uk/solve/day8-live-captions.html

This is Day 8 of my SolveFromZero series — practical tools, built in the browser.

1. The browser already has a speech engine

const SR = window.SpeechRecognition || window.webkitSpeechRecognition;
const rec = new SR();
rec.lang = "en-US";

One object, plus a call to rec.start(), and you have transcription. (Heads up: Chrome streams the audio to Google's servers for this, so it needs an internet connection and your audio leaves the device. That's the one catch.)
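
Since the constructor only exists in some browsers, it's worth guarding before calling new SR(). Here's a minimal sketch of that check. The helper names getSpeechRecognition and createRecognizer are my own, not part of the API, and root is just a stand-in for window so the check can also run outside a browser:

```javascript
// Returns the SpeechRecognition constructor, or null if this environment has none.
// `root` stands in for `window` (hypothetical parameter, passed in for testability).
function getSpeechRecognition(root) {
  if (!root) return null;
  return root.SpeechRecognition || root.webkitSpeechRecognition || null;
}

// Hypothetical helper: create a configured recogniser, or null when unsupported.
function createRecognizer(root, lang = "en-US") {
  const SR = getSpeechRecognition(root);
  if (!SR) return null;
  const rec = new SR();
  rec.lang = lang;
  return rec;
}

// In a page:
//   const rec = createRecognizer(window);
//   if (!rec) { /* show a "captions need Chrome or Edge" message instead */ }
```

Returning null instead of throwing lets the page fall back to a friendly message rather than a console error.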

2. Two flags turn dictation into live captions

rec.continuous = true;       // don't stop after one phrase
rec.interimResults = true;   // stream guesses WHILE you speak

continuous keeps it listening; interimResults makes it emit its best guess as you talk instead of only after you pause. Those two flags are the whole difference between a dictation box and live captions.

3. Split interim (grey) from final (locked)

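Each result carries an isFinal flag. Once it is true, the engine has committed to that text and won't revise it; until then, the transcript is an interim guess that can still change. Render finals solid and interim text greyed out for that authentic flickering-caption feel: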

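let finalText = "";            // committed text; persists across events

rec.onresult = e => {
  let interim = "";            // rebuilt from scratch on every event
  // resultIndex is the first result that changed since the last event
  for (let i = e.resultIndex; i < e.results.length; i++) {
    const r = e.results[i];    // r[0] is the top-ranked alternative
    if (r.isFinal) finalText += r[0].transcript;
    else interim += r[0].transcript;
  }
  render(finalText, interim);  // render() is your own function that updates the page
};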
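
The handler calls render(), which isn't shown here. One way it could work, as a sketch rather than the demo's actual code, is to build two spans and let CSS grey out the interim one. The class names final and interim, the element id captions, and the helper names are all placeholders I made up:

```javascript
// Escape text before it goes into HTML; transcripts are arbitrary user speech.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the caption markup: final text solid, interim text in its own span.
// The class names "final" and "interim" are placeholders for your own CSS.
function captionHtml(finalText, interim) {
  const solid = `<span class="final">${escapeHtml(finalText)}</span>`;
  const grey = interim
    ? `<span class="interim">${escapeHtml(interim)}</span>`
    : "";
  return solid + grey;
}

// In a page, render() could then be (the element id "captions" is hypothetical):
//   function render(finalText, interim) {
//     document.getElementById("captions").innerHTML = captionHtml(finalText, interim);
//   }
```

Keeping the string-building separate from the DOM write means the interim span simply disappears whenever the engine has nothing pending.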

4. It stops on silence — so restart it

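Even with continuous set, the engine ends the session on its own after a stretch of silence. For uninterrupted captions, restart it in onend:

let listening = true;   // flip to false when the user turns captions off
rec.onend = () => { if (listening) rec.start(); };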

That loop keeps captions alive through pauses, ums, and quiet moments.
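
One thing to watch with a bare restart loop: if the session ended because of an unrecoverable error (say, the mic permission was denied), restarting in onend can spin forever. Here's a hedged sketch of a guard. keepAlive is a hypothetical helper of mine, and which error codes to treat as fatal is my own judgment call; the codes themselves are ones the Web Speech API defines:

```javascript
// Wraps a recogniser so it restarts after silence but gives up on fatal errors.
// `rec` is a SpeechRecognition instance (or anything with start/stop methods).
function keepAlive(rec) {
  let listening = false;

  // Error codes from the Web Speech API that a restart cannot fix.
  const fatal = new Set(["not-allowed", "service-not-allowed", "audio-capture"]);

  rec.onerror = (e) => {
    if (fatal.has(e.error)) listening = false; // stop the restart loop
  };

  rec.onend = () => {
    if (!listening) return;
    try {
      rec.start(); // resume after a silence timeout
    } catch (err) {
      listening = false; // start() threw, so stop trying
    }
  };

  return {
    start() { listening = true; rec.start(); },
    stop() { listening = false; rec.stop(); },
    isListening: () => listening,
  };
}

// Usage in a page:
//   const captions = keepAlive(rec);
//   captions.start();   // from a button click
```

Errors like no-speech are left alone on purpose, so ordinary quiet moments still fall through to the restart.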

5. From mic to any video or tab

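This demo captions the microphone. Captioning a video call or a YouTube tab takes one more step, because the recogniser, as used in this demo, listens to the mic rather than to a stream you hand it. Capture the tab's audio first:

// run inside an async function, triggered by a click
const stream = await navigator.mediaDevices.getDisplayMedia({ video: true, audio: true });
// the user has to opt in to sharing audio in the picker, or no audio track comes back
// then route the audio to your speakers (so the recogniser hears it through the mic),
// or run an on-device model like whisper.cpp on the captured track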

Same render loop — different audio source.
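
A captured stream can come back with no audio track at all if the user didn't share audio, so it helps to check before wiring anything up. Here's a rough sketch; pickAudioTrack and captureTabAudio are names I invented, and what you do with the track afterwards (speakers or an on-device model) is left open:

```javascript
// Return the first audio track of a captured stream, or throw a readable error.
function pickAudioTrack(stream) {
  const tracks = stream.getAudioTracks();
  if (tracks.length === 0) {
    throw new Error("No audio track: was audio sharing enabled in the picker?");
  }
  return tracks[0];
}

// Browser-only: call this from a user gesture such as a button click.
async function captureTabAudio() {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: true, // requested alongside audio; the video track is unused here
    audio: true,
  });
  return pickAudioTrack(stream);
}

// Usage in a page:
//   button.onclick = async () => {
//     const track = await captureTabAudio();
//     // hand `track` to whatever does the transcribing
//   };
```

Failing loudly here is easier to debug than captions that silently never appear.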

The takeaway

A whole category of "AI features" is sitting unused in the browser. Live captions, voice commands, hands-free search — all a new SpeechRecognition() away, no backend required.

Open it and start talking. Best in Chrome or Edge.