Getting audio from a browser to a server in real-time sounds like a two-line solution. It isn't.
I built this pipeline for LiveSuggest, an AI assistant that listens to meetings and gives suggestions as the conversation happens. That means streaming audio continuously, with as little delay as possible, across a WebSocket connection that can drop at any time.
## The pipeline
Here's the full chain:
- Capture audio with `getUserMedia` (mic) or `getDisplayMedia` (tab audio)
- Feed it into a `MediaRecorder`
- Slice it into chunks every N seconds
- Encode each chunk to base64
- Send it over WebSocket to the server
- Server decodes and sends to a transcription API
Every step has a gotcha.
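The last step — server-side decode — is the one part that really is a one-liner in Node. A minimal sketch (the payload shape mirrors what the client emits later in this post; the function name is mine, not the actual LiveSuggest code):

```javascript
// Minimal sketch of the server-side decode step (Node.js).
// The payload field names are assumptions for illustration.
function decodeAudioChunk(payload) {
  // payload.audio is the base64 string the client emitted
  return Buffer.from(payload.audio, 'base64'); // raw WebM/Opus bytes
}

// Round-trip demo with fake "audio" bytes:
const fake = Buffer.from([0x1a, 0x45, 0xdf, 0xa3]); // WebM EBML magic
const encoded = fake.toString('base64');
const decoded = decodeAudioChunk({ audio: encoded });
console.log(decoded.equals(fake)); // true
```

The resulting `Buffer` is what gets forwarded to the transcription API.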
## MediaRecorder is great until it isn't

MediaRecorder handles encoding for you. I use `audio/webm;codecs=opus` because it's widely supported and compresses well.
```javascript
const mediaRecorder = new MediaRecorder(stream, {
  mimeType: 'audio/webm;codecs=opus',
});
```
The problem: you don't control the chunk boundaries. If you pass a timeslice to `start()`, `ondataavailable` fires roughly on your interval, but only the first chunk carries the WebM header — the later chunks aren't standalone files you can decode on their own. If you instead call `mediaRecorder.stop()` and `start()` to force a new segment, every chunk gets its own header and is self-contained, at the cost of a small gap between segments.
I settled on 10-second segments. Short enough for responsive transcription, long enough for the transcription API to have decent context.
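The stop/start rotation can be sketched as a small loop. This is a hedged sketch, not the production code: `createRecorder` is an assumed factory wrapping `new MediaRecorder(...)`, and real recorder lifecycle details vary by browser.

```javascript
// Rotate recorder segments every SEGMENT_MS so each chunk is a
// standalone WebM file with its own header. `createRecorder` is
// assumed to wrap `new MediaRecorder(stream, { ... })`.
const SEGMENT_MS = 10_000;

function startSegmentLoop(createRecorder, onChunk) {
  let recorder = null;
  const rotate = () => {
    if (recorder) recorder.stop(); // flushes a final dataavailable event
    recorder = createRecorder();
    recorder.ondataavailable = (e) => onChunk(e.data);
    recorder.start();
  };
  rotate();
  const timer = setInterval(rotate, SEGMENT_MS);
  // Returns a teardown function that stops rotation and the recorder.
  return () => {
    clearInterval(timer);
    if (recorder) recorder.stop();
  };
}
```

Each `onChunk` blob then goes through the base64 encoding below.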
## Base64 is wasteful but practical
Binary WebSocket frames would be more efficient. But base64 over JSON keeps the payload inspectable, works with Socket.io out of the box, and makes debugging way easier.
```javascript
const reader = new FileReader();
reader.onloadend = () => {
  // reader.result is a data: URL; strip the "data:...;base64," prefix
  const base64 = reader.result.split(',')[1];
  socket.emit('audio-chunk', {
    sessionId,
    audio: base64,
    format: 'webm',
    duration,
    timestamp: Date.now(),
  });
};
reader.readAsDataURL(blob);
```
The 33% size overhead hasn't been an issue in practice. A 10-second Opus chunk is tiny.
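The 33% figure falls straight out of the encoding: base64 emits 4 output characters for every 3 input bytes. A quick sanity check (the 40 KB chunk size is an assumed ballpark for ~10 seconds of Opus at ~32 kbps, not a measured number):

```javascript
// Base64 output length for n input bytes: 4 * ceil(n / 3) characters.
function base64Length(n) {
  return 4 * Math.ceil(n / 3);
}

const rawBytes = 40 * 1024;            // assumed ~10 s of Opus audio
const encodedBytes = base64Length(rawBytes);
console.log(encodedBytes / rawBytes);  // ≈ 1.33 — the 33% overhead
```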
## Mixing two audio sources
If you want both mic and system audio (from a browser tab), you need to mix them. The Web Audio API makes this possible but unintuitive:
```javascript
const audioContext = new AudioContext();
const destination = audioContext.createMediaStreamDestination();

const micSource = audioContext.createMediaStreamSource(micStream);
const tabSource = audioContext.createMediaStreamSource(tabStream);

micSource.connect(destination);
tabSource.connect(destination);

// destination.stream is your mixed stream
```
The resulting stream goes into MediaRecorder. Both sides of the conversation end up in one stream. It works better than you'd expect.
## What I learned about reliability
The stream can die at any time. Chrome's "Stop sharing" button kills `getDisplayMedia` streams instantly. Listening for the `ended` event on every track is not optional.
Rate limiting saved me from a nasty bug. I do sliding-window rate limiting in Redis: 60 chunks per minute per session. Without it, a buggy client can silently flood the transcription API for hours.
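The production limiter lives in Redis, but the sliding-window logic itself is simple. Here is an in-memory stand-in using the limits from the paragraph above (60 chunks per 60-second window per session); class and method names are mine:

```javascript
// In-memory stand-in for the Redis sliding-window rate limiter:
// allow at most `limit` chunks per `windowMs` per session.
class SlidingWindowLimiter {
  constructor(limit = 60, windowMs = 60_000) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.hits = new Map(); // sessionId -> array of hit timestamps
  }

  allow(sessionId, now = Date.now()) {
    const cutoff = now - this.windowMs;
    // Keep only hits inside the current window.
    const recent = (this.hits.get(sessionId) || []).filter((t) => t > cutoff);
    if (recent.length >= this.limit) {
      this.hits.set(sessionId, recent);
      return false; // over budget: reject the chunk
    }
    recent.push(now);
    this.hits.set(sessionId, recent);
    return true;
  }
}
```

With Redis, the same shape is typically a per-session sorted set driven by `ZADD`, `ZREMRANGEBYSCORE`, and `ZCARD`.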
Small chunks are almost always noise. Buffers under 2KB get filtered before hitting the API. Same for transcriptions under 4 words — silence, breathing, keyboard sounds. The transcription model isn't cheap, and garbage in means garbage out regardless.
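Both filters are cheap predicates. A sketch using the thresholds from the paragraph above (2 KB, 4 words); the function names are illustrative:

```javascript
// Drop chunks that are almost certainly silence or noise before
// they cost a transcription API call.
const MIN_CHUNK_BYTES = 2 * 1024;
const MIN_TRANSCRIPT_WORDS = 4;

function isWorthTranscribing(buffer) {
  return buffer.length >= MIN_CHUNK_BYTES;
}

function isWorthKeeping(transcript) {
  const words = transcript.trim().split(/\s+/).filter(Boolean);
  return words.length >= MIN_TRANSCRIPT_WORDS;
}
```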
Reconnection is non-trivial. WebSocket drops happen. I use exponential backoff with jitter, and the server restores session state from Redis when a client reconnects to a different instance.
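Backoff with "full jitter" fits in a few lines; a sketch with illustrative base and cap values (not the production settings):

```javascript
// Full-jitter exponential backoff: pick a random delay in
// [0, min(cap, base * 2^attempt)]. The randomness spreads reconnects
// out so a server restart doesn't trigger a synchronized stampede.
function backoffDelay(attempt, baseMs = 500, capMs = 30_000) {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}
```

On reconnect, the client re-sends its `sessionId`, and because session state lives in Redis rather than in-process, it doesn't matter which server instance it lands on.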
## Was it worth building from scratch?
I considered third-party services that handle the whole pipeline. But owning the audio layer means controlling latency, cost, and what data leaves the app. For a product where those three things matter, it was worth the complexity.
The pipeline now handles thousands of audio chunks per day. Not glamorous code, but it's the plumbing everything else depends on.