
Robert Corn

TIL canvas.captureStream() is video-only — here's how I mixed voiceover + music into a MediaRecorder export

TL;DR: canvas.captureStream() only carries video tracks. To get audio into your MediaRecorder export you need to build an AudioContext graph that lets every <audio> and <video> element play through speakers AND tap into a MediaStreamDestination simultaneously. There are at least five footguns along the way. Here's the singleton wrapper that finally got it right.


I've been building a client-side video editor over the last few months (AetherCut — 100% in-browser, nothing uploads, no SaaS backend for the editing itself). The video pipeline was straightforward enough: HTML5 <video> elements drawn into a <canvas> on every animation frame, then canvas.captureStream(30) piped into MediaRecorder for the export.

Then I added voiceover. Then background music. Then I shipped it.

Then a user said "why is the exported file silent."

Turns out canvas.captureStream() only carries video tracks. Cool. No problem, I thought, I'll just addTrack() the audio elements. That's where the fun started.

Problem 1: createMediaElementSource() can only be called ONCE per element

The moment you call audioContext.createMediaElementSource(audioEl), the element's normal speaker output is rerouted through the graph. Call it twice on the same element and you get an InvalidStateError.

Fine — bind once on mount. But now the element is silent in preview unless you explicitly gainNode.connect(audioContext.destination). The element's own volume attribute is still respected (via the gain stage), but the speaker output literally goes through your graph now.

// First call: works, reroutes audio through the graph
const src = ctx.createMediaElementSource(audioEl);

// Second call on the same element: throws InvalidStateError
const src2 = ctx.createMediaElementSource(audioEl); // 💥

In React this means a hard rule: bind exactly once on mount, store the source node somewhere stable (a Map outside the component is fine), and never re-create it on re-render.
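Stripped of the Web Audio specifics, the bind-once rule is just an idempotent registry. A minimal sketch (the `bindOnce` helper and its `createSource` parameter are my names; in the real app `createSource` would be `ctx.createMediaElementSource`):

```javascript
// A module-level Map survives React re-renders, so createSource()
// runs at most once per element. A second real call on the same
// element is what throws InvalidStateError.
const boundSources = new Map();

function bindOnce(element, createSource) {
  if (boundSources.has(element)) return boundSources.get(element);
  const srcNode = createSource(element);
  boundSources.set(element, srcNode);
  return srcNode;
}
```

Call this from a mount effect, never from render, and the source node stays stable across the component's lifetime.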

Problem 2: <video>.muted = true silences the export, not just the speakers

This one cost me hours.

Once a <video> element is bound to the AudioContext graph, setting video.muted = true mutes BOTH the speaker output AND the input to the graph. So if the user toggles preview-mute (a normal "I'm at the office" gesture), their export comes out silent and they don't notice until they open it in another player.

The fix isn't on the element — it's at the graph level (see Problem 3).

Problem 3: Speaker mute and export volume need to be independent

The user must be able to silence preview without silencing the export. Solution: a master gain node that controls speakers only. Per-source gains feed both the master gain AND any active export taps.

<video clip A>  ─┐
<video clip B>  ─┤── srcNode → gainNode ─┬─→ masterGain ─→ ctx.destination (speakers)
<audio voiceover>┤                       └─→ MediaStreamDestination (export, when active)
<audio bgMusic> ─┘

Preview mute → masterGain.gain.value = 0. Export tap → pull from per-source gains BEFORE the master gain. Both decisions are independent.
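The independence falls out of where each consumer sits in the gain math. A toy model of the routing above (pure arithmetic, not real Web Audio nodes; `speakerLevel` and `exportLevel` are names I made up):

```javascript
// Speakers hear sourceGain * masterGain; the export tap is pulled
// before the master gain, so it only ever sees sourceGain.
function speakerLevel(sourceGain, masterGain) {
  return sourceGain * masterGain;
}

function exportLevel(sourceGain) {
  return sourceGain; // masterGain deliberately absent
}
```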

Problem 4: Multiple recorder.start() calls = stacking taps

MediaStreamDestination instances don't auto-clean. If you call gainNode.connect(dest) on every export and never disconnect(), the next export gets a phantom mix from the last one (audible echoes at first, eventually silent because each gain node hits its max-fanout limit).

Keep the destination reference, and on recorder.onstop call gainNode.disconnect(dest) for every source you connected.
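The bookkeeping pattern, sketched against a generic node interface (the `createTap` name is mine; in the real graph the nodes are the per-source GainNodes and `dest` is the MediaStreamDestination):

```javascript
// Remember every node you connect to the tap so that onstop can
// disconnect all of them again.
function createTap(sourceNodes, dest) {
  const connected = [];
  for (const node of sourceNodes) {
    node.connect(dest);
    connected.push(node);
  }
  return {
    stream: dest.stream,
    disconnect() {
      // Skipping this is what stacks the phantom mix on the next export.
      connected.forEach((n) => n.disconnect(dest));
    },
  };
}
```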

Problem 5: audioContext.resume() requires a user gesture

Chrome blocks the AudioContext until a click. The export button click is a gesture, but if the user has been editing for 10 minutes and the context auto-suspended, your recorder.start() call fires before audio actually flows → silent first second of the export.

// Resume + small grace period before starting the recorder
await audioContext.resume();
await new Promise(r => setTimeout(r, 150));
recorder.start();

The 150ms isn't theoretical jitter — it's the empirical "first chunk has audible audio" threshold across Chrome/Safari/Firefox.


What ended up working

One singleton AudioContext. One Map<HTMLMediaElement, gainNode>. Every <video> and <audio> in the editor binds exactly once at mount. Export taps the entire mix downstream. Preview mute is a single masterGain.gain.value = 0.

Here's the whole utility — about 120 lines:

/**
 * Generalized AudioContext mixing graph for the editor's preview + export.
 *
 * Multiple HTMLMediaElements (voiceover <audio>, background music <audio>,
 * and the per-clip <video> elements in the preview pool) need to be:
 *   1. Audible through the user's speakers during preview, AND
 *   2. Tappable by a MediaStreamDestination during MediaRecorder export
 *      so all those audio sources are mixed into the exported file.
 */

let ctx = null;
// Map of element → { srcNode, gainNode, key }
const sources = new Map();
// Master gain controls SPEAKER output only — preview mute toggles this
// so exports still capture audio. Export taps pull from each
// per-source gainNode BEFORE the master gain.
let masterGain = null;

function ensureCtx() {
  if (!ctx) {
    const AC = window.AudioContext || window.webkitAudioContext;
    if (!AC) return null;
    ctx = new AC();
    masterGain = ctx.createGain();
    masterGain.gain.value = 1;
    masterGain.connect(ctx.destination);
  }
  return ctx;
}

// Bind an HTMLMediaElement to the graph. Idempotent.
export function bindMediaElement(key, mediaEl) {
  if (!mediaEl) return false;
  if (sources.has(mediaEl)) return true;
  const c = ensureCtx();
  if (!c) return false;
  let srcNode;
  try {
    srcNode = c.createMediaElementSource(mediaEl);
  } catch (e) {
    // Already bound to another context (HMR / double-mount).
    console.warn(`[audioGraph] bind failed for "${key}":`, e?.message);
    return false;
  }
  const gainNode = c.createGain();
  srcNode.connect(gainNode);
  gainNode.connect(masterGain);
  sources.set(mediaEl, { srcNode, gainNode, key });
  return true;
}

// Preview mute (0 = silent speakers, 1 = full). Doesn't touch export.
export function setMasterGain(value) {
  if (!masterGain) return;
  masterGain.gain.value = Math.max(0, Math.min(1, value));
}

export function unbindMediaElement(mediaEl) {
  const entry = sources.get(mediaEl);
  if (!entry) return;
  try { entry.srcNode.disconnect(); } catch {}
  try { entry.gainNode.disconnect(); } catch {}
  sources.delete(mediaEl);
}

// Mixed-audio tap for MediaRecorder. Caller MUST call disconnect()
// after recorder.onstop or the next export stacks a phantom mix.
export function getMixTap() {
  if (!ctx || sources.size === 0) return null;
  const dest = ctx.createMediaStreamDestination();
  const connected = [];
  for (const entry of sources.values()) {
    try {
      entry.gainNode.connect(dest);
      connected.push(entry.gainNode);
    } catch { /* keep connecting the rest */ }
  }
  if (connected.length === 0) return null;
  return {
    stream: dest.stream,
    disconnect: () => {
      connected.forEach((g) => {
        try { g.disconnect(dest); } catch {}
      });
    },
  };
}

export function resumeAudioContext() {
  if (ctx && ctx.state === "suspended") {
    return ctx.resume().catch(() => {});
  }
  return Promise.resolve();
}

export function hasAudioGraph() {
  return !!(ctx && sources.size > 0);
}

And here's how the export hook uses it:

const stream = canvas.captureStream(30);

// Mix every bound audio source into the recorded stream
await resumeAudioContext();
const tap = getMixTap();
if (tap) {
  tap.stream.getAudioTracks().forEach((track) => {
    stream.addTrack(track);
  });
}

const recorder = new MediaRecorder(stream, {
  mimeType: "video/webm;codecs=vp9",
  videoBitsPerSecond: 4_000_000,
});

recorder.start(100);
// ... drive playback so the canvas paints frames ...

recorder.onstop = () => {
  // CRITICAL: release the tap or the next export stacks a phantom mix
  if (tap) tap.disconnect();
};

The one rule I'd nail to the wall

Bind every media element to your graph ONCE on mount, even if it doesn't currently have audio. Adding it later is a footgun because by then it's already played sound through ctx.destination and the second createMediaElementSource() call will throw.

On WebCodecs vs MediaRecorder

WebCodecs is faster (1.5–3× realtime in my testing for video-only exports) but in current browsers it's painful to mux audio + video into one container without writing your own ISOBMFF muxer. So my codepath is:

  • Has any audio source bound? → MediaRecorder path with the audio graph tap
  • Pure-video export? → WebCodecs (GPU-accelerated)

Most users get the slower-but-correct path. Power users with silent timelines get the fast path. That trade-off has held up in production.
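The dispatch itself reduces to a tiny pure function (a sketch; `chooseExportPath` is my name, and in the real app the availability flag would come from something like `typeof VideoEncoder !== "undefined"`):

```javascript
// Any bound audio source forces the MediaRecorder path; a silent
// timeline takes the faster WebCodecs path when the browser supports it.
function chooseExportPath(hasAudioSources, webCodecsAvailable) {
  if (hasAudioSources) return "mediarecorder";
  return webCodecsAvailable ? "webcodecs" : "mediarecorder";
}
```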

Stack

React 18, Web Audio API, canvas.captureStream, MediaRecorder, WebCodecs (where available), FastAPI backend (only for auth + Stripe — zero video data). No FFmpeg, no server-side rendering. ~17k lines of frontend.

If anyone wants to see this graph in a working app, give it a spin — drop your video on the page and try the export. No signup needed for the editor itself.

Happy to nerd out in comments. 🎚️
