<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Flo</title>
    <description>The latest articles on DEV Community by Flo (@flo152121063061).</description>
    <link>https://dev.to/flo152121063061</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3707335%2F135c27ec-9da0-44a7-bada-c4e8430fe475.png</url>
      <title>DEV Community: Flo</title>
      <link>https://dev.to/flo152121063061</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/flo152121063061"/>
    <language>en</language>
    <item>
      <title>I built a real-time audio pipeline from the browser to my server. Here's what actually works.</title>
      <dc:creator>Flo</dc:creator>
      <pubDate>Thu, 26 Feb 2026 22:41:49 +0000</pubDate>
      <link>https://dev.to/flo152121063061/i-built-a-real-time-audio-pipeline-from-the-browser-to-my-server-heres-what-actually-works-5465</link>
      <guid>https://dev.to/flo152121063061/i-built-a-real-time-audio-pipeline-from-the-browser-to-my-server-heres-what-actually-works-5465</guid>
      <description>&lt;p&gt;Getting audio from a browser to a server in real-time sounds like a two-line solution. It isn't.&lt;/p&gt;

&lt;p&gt;I built this pipeline for &lt;a href="https://www.livesuggest.com" rel="noopener noreferrer"&gt;LiveSuggest&lt;/a&gt;, an AI assistant that listens to meetings and gives suggestions as the conversation happens. That means streaming audio continuously, with as little delay as possible, across a WebSocket connection that can drop at any time.&lt;/p&gt;

&lt;h2&gt;The pipeline&lt;/h2&gt;

&lt;p&gt;Here's the full chain:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Capture audio with &lt;code&gt;getUserMedia&lt;/code&gt; (mic) or &lt;code&gt;getDisplayMedia&lt;/code&gt; (tab audio)&lt;/li&gt;
&lt;li&gt;Feed it into a &lt;code&gt;MediaRecorder&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Slice it into chunks every N seconds&lt;/li&gt;
&lt;li&gt;Encode each chunk to base64&lt;/li&gt;
&lt;li&gt;Send it over WebSocket to the server&lt;/li&gt;
&lt;li&gt;Server decodes and sends to a transcription API&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Every step has a gotcha.&lt;/p&gt;
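
&lt;p&gt;Before getting into the gotchas, here's roughly what the receiving end (step 6) looks like. This is a minimal sketch rather than the production handler: it assumes a Socket.io server, and &lt;code&gt;sendToTranscription&lt;/code&gt; is a placeholder, but the event name and payload shape match what the client emits later in this post.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch of the server side: decode base64 and hand the bytes to the transcription API
io.on('connection', (socket) =&amp;gt; {
  socket.on('audio-chunk', async ({ sessionId, audio, format }) =&amp;gt; {
    const buffer = Buffer.from(audio, 'base64'); // back to raw WebM bytes
    await sendToTranscription(sessionId, buffer, format);
  });
});
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;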

&lt;h2&gt;MediaRecorder is great until it isn't&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;MediaRecorder&lt;/code&gt; handles encoding for you. I use &lt;code&gt;audio/webm;codecs=opus&lt;/code&gt; because it's widely supported and compresses well.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;mediaRecorder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;MediaRecorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;mimeType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;audio/webm;codecs=opus&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
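
&lt;p&gt;One thing that snippet glosses over: not every browser accepts that exact mimeType string, and the constructor throws if it doesn't. A small guard with &lt;code&gt;MediaRecorder.isTypeSupported&lt;/code&gt; avoids that (the MP4 fallback here is just an illustration):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Pick a container/codec the current browser can actually record
const preferred = 'audio/webm;codecs=opus';
const mimeType = MediaRecorder.isTypeSupported(preferred)
  ? preferred
  : 'audio/mp4'; // e.g. Safari's recorder leans toward MP4/AAC

const mediaRecorder = new MediaRecorder(stream, { mimeType });
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;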



&lt;p&gt;The problem: you don't control the chunk boundaries. &lt;code&gt;ondataavailable&lt;/code&gt; fires when the browser feels like it, not when you need it, and if you rely on the timeslice argument, only the first chunk carries the WebM header, so later chunks can't be decoded on their own. Calling &lt;code&gt;mediaRecorder.stop()&lt;/code&gt; and &lt;code&gt;start()&lt;/code&gt; to force a boundary gives each chunk its own header, so every chunk is a self-contained file the transcription API can handle, but you can't just concatenate those files back into one continuous recording.&lt;/p&gt;

&lt;p&gt;I settled on 10-second segments. Short enough for responsive transcription, long enough for the transcription API to have decent context.&lt;/p&gt;
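
&lt;p&gt;The segmenting loop itself is small. Here's a sketch of the stop/start cycle (the interval constant and &lt;code&gt;sendChunk&lt;/code&gt; are placeholders; the base64 encoding that &lt;code&gt;sendChunk&lt;/code&gt; would do is shown in the next section):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Force a self-contained WebM segment every 10 seconds by restarting the recorder
const SEGMENT_MS = 10_000;

mediaRecorder.ondataavailable = (event) =&amp;gt; {
  if (event.data.size &amp;gt; 0) sendChunk(event.data); // each blob is a complete file
};

mediaRecorder.start();
setInterval(() =&amp;gt; {
  if (mediaRecorder.state === 'recording') {
    mediaRecorder.stop();  // flushes the current segment through ondataavailable
    mediaRecorder.start(); // immediately begin the next one
  }
}, SEGMENT_MS);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;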

&lt;h2&gt;Base64 is wasteful but practical&lt;/h2&gt;

&lt;p&gt;Binary WebSocket frames would be more efficient. But base64 over JSON keeps the payload inspectable, works with Socket.io out of the box, and makes debugging way easier.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;FileReader&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;readAsDataURL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onloadend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;base64&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;reader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
  &lt;span class="nx"&gt;socket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;emit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;audio-chunk&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;sessionId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;webm&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The 33% size overhead hasn't been an issue in practice. A 10-second Opus chunk is tiny.&lt;/p&gt;
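
&lt;p&gt;As a rough sanity check (assuming Opus settles around 32 kbps for speech): 10 seconds is about 40 KB of audio, and base64 inflates that to roughly 53 KB per message. Even at several times that bitrate, each chunk stays comfortably small.&lt;/p&gt;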

&lt;h2&gt;Mixing two audio sources&lt;/h2&gt;

&lt;p&gt;If you want both mic and system audio (from a browser tab), you need to mix them. The Web Audio API makes this possible but unintuitive:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;AudioContext&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;destination&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamDestination&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;micSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;micStream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tabSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;audioContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createMediaStreamSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;tabStream&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;micSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;tabSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;destination&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// destination.stream is your mixed stream&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The resulting stream goes into &lt;code&gt;MediaRecorder&lt;/code&gt;. Both sides of the conversation end up in one stream. It works better than you'd expect.&lt;/p&gt;
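
&lt;p&gt;To make the hand-off explicit: &lt;code&gt;destination.stream&lt;/code&gt; is an ordinary &lt;code&gt;MediaStream&lt;/code&gt;, so it plugs straight into the recorder from earlier:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Record the mixed (mic + tab) stream exactly like a single source
const mediaRecorder = new MediaRecorder(destination.stream, {
  mimeType: 'audio/webm;codecs=opus',
});
mediaRecorder.start();
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;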

&lt;h2&gt;What I learned about reliability&lt;/h2&gt;

&lt;p&gt;The stream can die at any time. Chrome's "Stop sharing" button kills &lt;code&gt;getDisplayMedia&lt;/code&gt; streams instantly. Listening for the &lt;code&gt;ended&lt;/code&gt; event on every track is not optional.&lt;/p&gt;
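
&lt;p&gt;A minimal version of that listener (the handler name is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// 'ended' fires when the user clicks "Stop sharing" or the device disappears
for (const track of stream.getTracks()) {
  track.addEventListener('ended', handleStreamEnded); // e.g. stop the recorder, tell the server
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;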

&lt;p&gt;Rate limiting saved me from a nasty bug. I do sliding-window rate limiting in Redis: 60 chunks per minute per session. Without it, a buggy client can silently flood the transcription API for hours.&lt;/p&gt;
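
&lt;p&gt;The check itself fits in one small function. A sketch assuming ioredis (the key name and limits mirror the description above but are illustrative, and the read-then-write isn't atomic, which is fine for a soft limit):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sliding window: at most 60 chunks per session in any 60-second span
async function allowChunk(redis, sessionId) {
  const key = `rate:audio:${sessionId}`;
  const now = Date.now();
  await redis.zremrangebyscore(key, 0, now - 60_000); // drop entries older than the window
  const count = await redis.zcard(key);
  if (count &amp;gt;= 60) return false;
  await redis.zadd(key, now, `${now}:${Math.random()}`);
  await redis.expire(key, 120); // idle sessions clean up after themselves
  return true;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;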

&lt;p&gt;Small chunks are almost always noise. Buffers under 2KB get filtered before hitting the API. Same for transcriptions under 4 words — silence, breathing, keyboard sounds. The transcription model isn't cheap, and garbage in means garbage out regardless.&lt;/p&gt;
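
&lt;p&gt;The filters are deliberately dumb. Roughly (thresholds as above; &lt;code&gt;transcribe&lt;/code&gt; is a placeholder):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Drop obvious noise before paying for transcription
if (buffer.length &amp;lt; 2048) return; // under ~2 KB: almost certainly silence

const text = await transcribe(buffer);
if (text.trim().split(/\s+/).length &amp;lt; 4) return; // breathing, keyboard, half a word
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;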

&lt;p&gt;Reconnection is non-trivial. WebSocket drops happen. I use exponential backoff with jitter, and the server restores session state from Redis when a client reconnects to a different instance.&lt;/p&gt;
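
&lt;p&gt;The backoff is nothing exotic. A sketch of the delay calculation (the caps are illustrative; Socket.io's built-in reconnection options can be tuned to do the same thing):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Exponential backoff with jitter: 1s, 2s, 4s... capped at 30s,
// minus a random 0-50% so clients don't all reconnect in lockstep
function reconnectDelay(attempt) {
  const base = Math.min(1000 * 2 ** attempt, 30_000);
  return base - base * 0.5 * Math.random();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;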

&lt;h2&gt;Was it worth building from scratch?&lt;/h2&gt;

&lt;p&gt;I considered third-party services that handle the whole pipeline. But owning the audio layer means controlling latency, cost, and what data leaves the app. For a product where those three things matter, it was worth the complexity.&lt;/p&gt;

&lt;p&gt;The pipeline now handles thousands of audio chunks per day. Not glamorous code, but it's the plumbing everything else depends on.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>showdev</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I tried to capture system audio in the browser. Here's what I learned.</title>
      <dc:creator>Flo</dc:creator>
      <pubDate>Mon, 12 Jan 2026 16:12:56 +0000</pubDate>
      <link>https://dev.to/flo152121063061/i-tried-to-capture-system-audio-in-the-browser-heres-what-i-learned-1f99</link>
      <guid>https://dev.to/flo152121063061/i-tried-to-capture-system-audio-in-the-browser-heres-what-i-learned-1f99</guid>
      <description>&lt;p&gt;I'm building LiveSuggest, a real-time AI assistant that listens to your meetings and gives you suggestions as you talk. Simple idea, right?&lt;/p&gt;

&lt;p&gt;Turns out, capturing audio from a browser tab is... complicated.&lt;/p&gt;

&lt;h2&gt;The good news&lt;/h2&gt;

&lt;p&gt;Chrome and Edge support it. You use &lt;code&gt;getDisplayMedia&lt;/code&gt;, the same API for screen sharing, but with an audio option:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;navigator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mediaDevices&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getDisplayMedia&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;video&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;audio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;systemAudio&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;include&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The user picks a tab to share, checks "Share tab audio", and boom — you get the audio stream. Works great for Zoom, Teams, Meet, whatever runs in a browser tab.&lt;/p&gt;

&lt;h2&gt;The bad news&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Firefox?&lt;/strong&gt; Implements &lt;code&gt;getDisplayMedia&lt;/code&gt; but completely ignores the audio part. No error, no warning. You just... don't get audio.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safari?&lt;/strong&gt; Same story. The API exists, audio doesn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile browsers?&lt;/strong&gt; None of them support it. iOS, Android, doesn't matter.&lt;/p&gt;

&lt;p&gt;So if you're building something that needs system audio, you're looking at Chrome/Edge desktop only. That's maybe 60-65% of your potential users.&lt;/p&gt;

&lt;h2&gt;What I ended up doing&lt;/h2&gt;

&lt;p&gt;I detect the browser upfront and show a clear message:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Firefox doesn't support system audio capture for meetings. Use Chrome or Edge for this feature. Microphone capture is still available."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No tricks, no workarounds. Just honesty. Users appreciate knowing why something doesn't work rather than wondering if they did something wrong.&lt;/p&gt;

&lt;p&gt;For Firefox/Safari users, the app falls back to microphone-only mode. It's not ideal for capturing both sides of a conversation, but it's better than nothing.&lt;/p&gt;
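
&lt;p&gt;The detection is unglamorous. A sketch of the check and the fallback (the user-agent sniffing is deliberately crude, and a real check should also verify the returned stream actually has audio tracks, which the next section covers):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Rough capability check before offering tab-audio capture
function canCaptureTabAudio() {
  if (typeof navigator.mediaDevices.getDisplayMedia !== 'function') return false;
  const ua = navigator.userAgent;
  if (/Mobi/.test(ua)) return false; // no mobile browser supports tab audio
  return /Chrome\/|Edg\//.test(ua);  // Firefox and Safari expose the API but return no tab audio
}

// Firefox/Safari/mobile path: microphone only
if (!canCaptureTabAudio()) {
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;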

&lt;h2&gt;The annoying details&lt;/h2&gt;

&lt;p&gt;A few things that wasted my time so they don't waste yours:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You have to request video.&lt;/strong&gt; Even if you only want audio. &lt;code&gt;video: true&lt;/code&gt; is mandatory. I immediately stop the video track after getting the stream, but you can't skip it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Share tab audio" checkbox is easy to miss.&lt;/strong&gt; Chrome shows it in the sharing dialog, but it's not checked by default. If your user doesn't check it, you get a stream with zero audio tracks. No error, just silence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The stream can die anytime.&lt;/strong&gt; User clicks "Stop sharing" in Chrome's toolbar? Your stream ends. You need to listen for the &lt;code&gt;ended&lt;/code&gt; event and handle it gracefully.&lt;/p&gt;
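
&lt;p&gt;Putting those three together, the capture call ends up looking roughly like this (simplified; &lt;code&gt;handleSharingStopped&lt;/code&gt; is a placeholder for whatever teardown your app needs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;const stream = await navigator.mediaDevices.getDisplayMedia({
  video: true, // mandatory, even though we only want the audio
  audio: { systemAudio: 'include' },
});

// We never use the video, so release it immediately
stream.getVideoTracks().forEach(function (track) { track.stop(); });

// If "Share tab audio" wasn't checked there is no error, just zero audio tracks
if (stream.getAudioTracks().length === 0) {
  throw new Error('No tab audio: re-share with "Share tab audio" checked');
}

// "Stop sharing" ends the stream without warning
stream.getAudioTracks()[0].addEventListener('ended', handleSharingStopped);
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;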

&lt;h2&gt;Was it worth it?&lt;/h2&gt;

&lt;p&gt;Absolutely. For the browsers that support it, capturing tab audio is a game-changer. You can build things that weren't possible before — meeting assistants, live translators, accessibility tools.&lt;/p&gt;

&lt;p&gt;Just go in knowing that you'll spend time on browser detection and fallbacks. That's the web in 2025.&lt;/p&gt;

&lt;p&gt;If you're curious about what I built, check out &lt;a href="https://livesuggest.ai" rel="noopener noreferrer"&gt;LiveSuggest&lt;/a&gt;. And if you've found better workarounds for Firefox/Safari, I'd love to hear about them in the comments.&lt;/p&gt;

</description>
      <category>api</category>
      <category>javascript</category>
      <category>learning</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
