Saqueib Ansari

Posted on • Originally published at qcode.in

Build Real-Time AI Voice Transcription for Web Meetings Fast

Web meetings generate thousands of hours of spoken content every day, and most of it vanishes the moment the call ends — unless you build something to catch it.

Why Real-Time AI Voice Transcription for Web Meetings Has Become a Core Feature

A year ago, transcription was a nice-to-have. In 2026, it's table stakes. Users expect live captions, searchable meeting notes, and action-item extraction without any manual effort. The tools to deliver all of this have matured significantly — Whisper, Deepgram, and AssemblyAI now offer sub-300ms latency on streaming audio, and browser APIs have finally caught up to make capturing audio from a meeting tab genuinely feasible without native plugins.

What changed? A few things converged at once:

  • WebSockets and WebRTC became universally supported and well-documented
  • Transformer-based ASR models got small enough to run at the edge
  • Streaming transcription APIs stabilized with proper WebSocket endpoints
  • Browser MediaStream APIs became reliable enough to capture tab and microphone audio simultaneously

If you're building a meeting tool, a productivity extension, or an AI assistant for your organization, this is the stack you need to understand.

The Core Architecture

Before writing a single line of code, understand the data flow:

Browser Tab Audio → MediaStream → AudioWorklet → WebSocket → ASR API → Transcript

You're capturing raw PCM audio from the browser, chunking it into small frames (typically 100–250ms), sending those frames over a WebSocket to a streaming ASR endpoint, and receiving partial + final transcripts back in real time. The challenge isn't any one piece — it's keeping the pipeline low-latency and handling edge cases like network interruptions, speaker changes, and audio resampling.
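To make those frame sizes concrete, here's a quick back-of-the-envelope sketch at 16kHz mono 16-bit PCM (the format used throughout this article) — a sanity check, not part of the pipeline itself:

```javascript
// Frame-size arithmetic for 16 kHz mono 16-bit PCM.
const SAMPLE_RATE = 16000; // samples per second
const BYTES_PER_SAMPLE = 2; // Int16

function frameBytes(frameMs) {
  // samples in the frame × 2 bytes per Int16 sample
  return (SAMPLE_RATE * frameMs / 1000) * BYTES_PER_SAMPLE;
}

console.log(frameBytes(100)); // 100 ms → 3200 bytes
console.log(frameBytes(250)); // 250 ms → 8000 bytes
```

So each WebSocket message carries a few kilobytes at most — small enough that the network, not serialization, is your latency budget.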

Setting Up Audio Capture in the Browser

Most developers hit their first wall here. Capturing both the meeting audio (system/tab audio) and the user's microphone requires combining two MediaStream tracks.

Capturing Tab Audio with getDisplayMedia

async function captureAudio() {
  // Note: Chromium rejects audio-only getDisplayMedia requests, so ask
  // for video too and stop the video track once the stream is captured.
  const displayStream = await navigator.mediaDevices.getDisplayMedia({
    video: true,
    audio: {
      echoCancellation: false,
      noiseSuppression: false,
      sampleRate: 16000,
    },
  });
  displayStream.getVideoTracks().forEach((track) => track.stop());

  const micStream = await navigator.mediaDevices.getUserMedia({
    audio: {
      echoCancellation: true,
      noiseSuppression: true,
      sampleRate: 16000,
    },
  });

  const audioContext = new AudioContext({ sampleRate: 16000 });
  const dest = audioContext.createMediaStreamDestination();

  audioContext.createMediaStreamSource(displayStream).connect(dest);
  audioContext.createMediaStreamSource(micStream).connect(dest);

  return dest.stream;
}

Target 16kHz mono PCM — this is what every major ASR API expects, and resampling in the browser before sending is dramatically cheaper than doing it server-side at scale.
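Some browsers ignore the AudioContext sampleRate hint and run at the hardware rate (typically 44.1 or 48kHz). If you hit that, a linear-interpolation downsampler is enough for speech — a minimal sketch (the function name is mine, not a standard API):

```javascript
// Downsample a Float32 PCM buffer to 16 kHz with linear interpolation.
// Only needed when the AudioContext runs at its hardware rate.
function downsampleTo16k(input, inputRate) {
  const targetRate = 16000;
  if (inputRate === targetRate) return input;
  const ratio = inputRate / targetRate;
  const outLength = Math.floor(input.length / ratio);
  const output = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * ratio;
    const left = Math.floor(pos);
    const right = Math.min(left + 1, input.length - 1);
    const frac = pos - left;
    // Weighted average of the two nearest source samples
    output[i] = input[left] * (1 - frac) + input[right] * frac;
  }
  return output;
}
```

Check `audioContext.sampleRate` after construction and only resample when it doesn't match your target.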

Using AudioWorklet for Zero-Copy Audio Processing

Avoid ScriptProcessorNode — it's deprecated and runs on the main thread. Use AudioWorklet instead:

// processor.js (loaded as a worklet module)
class PCMProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const input = inputs[0][0]; // first channel of the first input
    if (input) {
      // Copy before transferring — the engine may reuse the input
      // buffer between render quanta, so detaching it is unsafe.
      const copy = input.slice(0);
      this.port.postMessage(copy.buffer, [copy.buffer]);
    }
    return true;
  }
}
registerProcessor("pcm-processor", PCMProcessor);

// main.js
await audioContext.audioWorklet.addModule("/processor.js");
const workletNode = new AudioWorkletNode(audioContext, "pcm-processor");

workletNode.port.onmessage = (event) => {
  // Each message is one 128-sample render quantum (~8ms at 16kHz) —
  // batch several into 100–250ms chunks before sending.
  sendAudioChunk(event.data); // send to WebSocket
};

source.connect(workletNode);

This gives you raw Float32 PCM frames off the main thread. Before sending to the API, convert to Int16 — most APIs expect 16-bit PCM, not 32-bit float:

function float32ToInt16(float32) {
  const int16 = new Int16Array(float32.length);
  for (let i = 0; i < float32.length; i++) {
    // Clamp to [-1, 1], then scale to the signed 16-bit range
    const s = Math.max(-1, Math.min(1, float32[i]));
    int16[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
  }
  return int16.buffer;
}

Choosing the Right ASR API for Real-Time AI Voice Transcription for Web Meetings

Not all ASR APIs are equal for live meeting use cases. Here's how the major players stack up in 2026:

Deepgram Nova-3

Deepgram's Nova-3 is currently the best balance of latency and accuracy for English. It supports speaker diarization (identifying who's speaking) in the streaming endpoint, which is critical for meeting transcripts — and honestly non-negotiable if you want the output to be readable. Enable it with:

wss://api.deepgram.com/v1/listen?model=nova-3&diarize=true&punctuate=true&language=en-US

Expect ~150–250ms for interim results and ~500ms for finals. The diarization adds about 50ms overhead — worth it every time.
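The per-word speaker labels arrive flat, so to render a readable transcript you'll want to group consecutive words into speaker turns. A minimal sketch — the `{ word, speaker }` shape matches Deepgram's streaming response words array, but verify field names against their docs:

```javascript
// Group diarized word objects ({ word, speaker }) into speaker turns.
function groupBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      last.text += " " + w.word; // same speaker — extend the current turn
    } else {
      turns.push({ speaker: w.speaker, text: w.word }); // new speaker turn
    }
  }
  return turns;
}
```

Run per final result and you get the familiar "Speaker 0: … / Speaker 1: …" layout for free.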

AssemblyAI Universal-2

AssemblyAI's Universal-2 is the right call if you need strong multilingual support or you're processing meetings with heavy technical vocabulary. Their custom vocabulary feature lets you boost recognition of product names, acronyms, and jargon that would otherwise get mangled. I've seen it save transcripts that Deepgram turned into gibberish for domain-specific content.
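On their real-time endpoint, custom vocabulary is passed as a `word_boost` query parameter containing a JSON-encoded array of terms — a sketch of the URL construction (check AssemblyAI's docs for limits and availability on your plan):

```javascript
// Build an AssemblyAI real-time URL with boosted vocabulary terms.
// word_boost takes a JSON-encoded, URL-escaped array of strings.
function assemblyAiUrl(terms) {
  const boost = encodeURIComponent(JSON.stringify(terms));
  return `wss://api.assemblyai.com/v2/realtime/ws?sample_rate=16000&word_boost=${boost}`;
}

const url = assemblyAiUrl(["Kubernetes", "qcode"]);
```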

OpenAI Whisper via Local Deployment

If you're running on-premises — common in enterprise and healthcare — Whisper large-v3-turbo on a GPU instance behind a WebSocket proxy is viable. Use faster-whisper with CTranslate2 for 4–6x faster inference than the original implementation. You won't match Deepgram's latency, but you own the entire data pipeline. For regulated industries, that trade-off is often mandatory, not optional.

Building the Server-Side WebSocket Relay

Your browser can't call the ASR API directly without exposing API keys. You need a lightweight relay server. Here's a minimal Node.js implementation using Fastify and the ws library:

import Fastify from "fastify";
import WebSocket, { WebSocketServer } from "ws";

const fastify = Fastify();
const wss = new WebSocketServer({ server: fastify.server });

wss.on("connection", (clientWs) => {
  const deepgramWs = new WebSocket(
    "wss://api.deepgram.com/v1/listen?model=nova-3&diarize=true&punctuate=true",
    { headers: { Authorization: `Token ${process.env.DEEPGRAM_API_KEY}` } }
  );

  deepgramWs.on("message", (data) => {
    const result = JSON.parse(data);
    const transcript = result.channel?.alternatives?.[0]?.transcript;
    if (transcript) {
      clientWs.send(JSON.stringify({
        text: transcript,
        speaker: result.channel?.alternatives?.[0]?.words?.[0]?.speaker,
        is_final: result.is_final,
      }));
    }
  });

  clientWs.on("message", (audioChunk) => {
    if (deepgramWs.readyState === WebSocket.OPEN) {
      deepgramWs.send(audioChunk);
    }
  });

  clientWs.on("close", () => deepgramWs.close());
});

await fastify.listen({ port: 3000 });

For Laravel/PHP backends, use Ratchet or delegate the WebSocket relay to a small Node.js sidecar. PHP's blocking I/O model makes it genuinely ill-suited for sustained bidirectional streaming — don't fight it. I've seen teams spend weeks trying to make it work before giving up and spinning up a tiny Node service that took an afternoon.

Handling Reconnection and Partial Results

Production pipelines need reconnection logic. Implement exponential backoff on the client side and treat partial transcripts as disposable — only persist is_final: true results to your database. Partial results are for display only. Storing them is how you end up with a database full of duplicate fragments and confused users:

let reconnectDelay = 1000;
function connectToRelay() {
  const ws = new WebSocket("wss://your-relay.example.com");
  ws.onclose = () => {
    setTimeout(connectToRelay, reconnectDelay);
    reconnectDelay = Math.min(reconnectDelay * 2, 30000);
  };
  ws.onopen = () => { reconnectDelay = 1000; };
  // ...
}
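For display, a pattern that works well is a single mutable interim line plus a growing list of finals — each partial overwrites the interim, each final appends and clears it. A small sketch, assuming the `{ text, is_final }` message shape produced by the relay above:

```javascript
// Fold one transcript message into UI state: finals accumulate,
// the interim line is overwritten by partials and cleared on finals.
function applyTranscriptMessage(state, msg) {
  if (msg.is_final) {
    return { finals: [...state.finals, msg.text], interim: "" };
  }
  return { finals: state.finals, interim: msg.text };
}

let state = { finals: [], interim: "" };
state = applyTranscriptMessage(state, { text: "hello wor", is_final: false });
state = applyTranscriptMessage(state, { text: "hello world", is_final: true });
```

Persist only the `finals` array — exactly the rule above: partials are for pixels, not for the database.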

Post-Transcription: Turning Text Into Actionable Meeting Intelligence

Raw transcripts aren't the end goal — they're the input. Once you have a stream of final transcript segments with speaker labels and timestamps, pipe them to a downstream LLM for:

  • Action item extraction — pass the full transcript to GPT-4o or Claude 3.5 Sonnet with a structured extraction prompt
  • Meeting summarization — chunk transcripts into 5-minute windows and summarize progressively
  • Sentiment and engagement scoring — identify when discussions became tense or one-sided

Store transcript segments with speaker_id, start_time, end_time, and text in a time-series-friendly schema. If you're on PostgreSQL, use JSONB for the metadata and full-text search indexes on the transcript content. Don't skip the timestamps. You'll want them the first time someone asks "when did we decide that?" and you actually need to answer.
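For the action-item step, the extraction prompt can be assembled straight from those stored segments. A sketch assuming the segment schema above (`speaker_id`, `start_time`, `text`) — the actual model call is whatever LLM client you already use, and the output format is one I'm choosing for illustration:

```javascript
// Build an action-item extraction prompt from stored transcript segments.
// Assumes the { speaker_id, start_time, text } schema described above.
function buildActionItemPrompt(segments) {
  const transcript = segments
    .map((s) => `[${s.start_time}s] Speaker ${s.speaker_id}: ${s.text}`)
    .join("\n");
  return (
    "Extract all action items from this meeting transcript. " +
    'Return JSON: [{"owner": string, "task": string, "timestamp": number}].\n\n' +
    transcript
  );
}
```

Keeping the timestamps in the prompt is what lets the model (and later, your users) jump back to the moment a commitment was made.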

Conclusion: Real-Time AI Voice Transcription for Web Meetings Is Buildable Today

The architecture described here isn't theoretical — it runs in production. Real-Time AI Voice Transcription for Web Meetings is no longer a research problem; it's an engineering problem, and most of the hard parts have already been solved by the API providers. Your job is to wire up the audio pipeline correctly, pick an ASR API that matches your latency and accuracy requirements, and build the post-processing layer that makes the raw transcript genuinely useful.

Start with Deepgram Nova-3 if you want the fastest path to production. Add speaker diarization from day one — retrofitting it later is painful in ways that will make you regret the shortcut. And invest in the reconnection and error-handling logic before you go live. Audio streams are inherently flaky, and your users will notice every dropped word.

Why let thousands of hours of decisions, commitments, and ideas disappear at the end of every call? The meetings are happening. Build the system that remembers them.

