Why your browser multitrack audio drifts out of sync (and how to fix it)

#webaudio #javascript #webdev #audio

If you've ever tried to build anything more ambitious than a single <audio> tag in the browser, you've probably hit this wall: you start two or three audio tracks at "the same time", and within 30 seconds they sound like a drunk wedding band. Drums leading the bass by 80ms, vocals lagging the guitar, the whole thing falling apart.

I hit this exact problem on a project last month — building a small multitrack practice tool for a friend who teaches guitar. First version used <audio> elements and play() calls in a loop. It worked beautifully for about 4 seconds before the tracks started smearing.

Let's dig into why this happens, and how to actually solve it.

The root cause: two clocks, no coordination

The core problem is that the browser has multiple timing systems and they do not agree with each other. When you call audio.play() on an HTMLAudioElement, you're at the mercy of:

The main thread event loop (which can stall during GC, layout, anything)
The media element's internal scheduler (which buffers independently per element)
Date.now() / performance.now() (system clocks, not the audio clock)

Each <audio> element schedules itself against its own internal clock. There is no shared timebase. So when you do this:

// This is the trap. Looks correct. Is not.
track1.play();
track2.play();
track3.play();

The three play() calls return immediately, but each element starts producing samples whenever its decoder feels ready. On Chrome the gap might be 5ms. On Firefox under load, 40ms. And the elements keep drifting because they aren't slaved to the same sample clock.

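How much offset you can get away with depends on the material. When two tracks share content (a doubled part, mic bleed), even 1ms of offset produces audible comb filtering. Between separate rhythmic parts, offsets in the tens of milliseconds start to sound like flams, and it only gets worse from there.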
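If you want to see the drift rather than just hear it, one rough approach is to compare each element's playhead against the first one. The helper below is a hypothetical sketch (the name measureOffsetsMs is mine), and it assumes the elements were all started from position 0. Media element currentTime is only updated at a coarse granularity, so treat the numbers as an indication, not a measurement.

```javascript
// Hypothetical helper: reports how far each media element's playhead is
// from the first element's playhead, in milliseconds.
// Works on anything with a numeric `currentTime`, e.g. <audio> elements.
function measureOffsetsMs(elements) {
  const reference = elements[0].currentTime;
  return elements.map(el => (el.currentTime - reference) * 1000);
}

// Possible usage in a page (track1..track3 are <audio> elements):
//   setInterval(() => {
//     console.log(measureOffsetsMs([track1, track2, track3]));
//   }, 1000);
```

Logging this once per second while the tracks play should show the offsets wandering instead of staying at zero.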

The fix: schedule against the AudioContext clock

The Web Audio API was designed specifically to solve this. The AudioContext exposes a single, sample-accurate clock (context.currentTime), and every source node you create is scheduled against that clock with sub-millisecond precision.
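Sometimes you still need to relate the audio clock to the main-thread clock, for example to line up a UI animation with what is being heard. A minimal sketch, assuming the browser supports AudioContext.getOutputTimestamp(), which returns one paired reading of both clocks. The function name toContextTime is hypothetical. Note that the contextTime in that reading describes the audio currently leaving the output device, so it lags ctx.currentTime by the output latency.

```javascript
// Converts a performance.now() timestamp (milliseconds) into a time on the
// AudioContext clock (seconds), using one paired reading of both clocks.
// Assumes `ctx.getOutputTimestamp()` is available in the target browser.
function toContextTime(ctx, performanceTimeMs) {
  const { contextTime, performanceTime } = ctx.getOutputTimestamp();
  return contextTime + (performanceTimeMs - performanceTime) / 1000;
}

// Possible usage: which context time is being heard right now?
//   const heardNow = toContextTime(ctx, performance.now());
```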

The trick is to decode your audio into AudioBuffer objects up front, then schedule playback at a single future timestamp.

// One context = one shared clock for everything
const ctx = new AudioContext();

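async function loadTrack(url) {
  const res = await fetch(url);
  // fetch() only rejects on network errors, so check the HTTP status too
  if (!res.ok) throw new Error(`Failed to load ${url}: HTTP ${res.status}`);
  const arrayBuffer = await res.arrayBuffer();
  // decodeAudioData returns a fully decoded buffer in memory
  return await ctx.decodeAudioData(arrayBuffer);
}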

const [drums, bass, guitar] = await Promise.all([
  loadTrack('/drums.wav'),
  loadTrack('/bass.wav'),
  loadTrack('/guitar.wav'),
]);
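Since decoding is the expensive step, it can help to make sure each file is only ever fetched and decoded once. Here is one way to sketch that, with a small cache keyed by URL. The createBufferLoader name and the injectable fetchFn parameter are my own additions for illustration; they are not part of the Web Audio API.

```javascript
// Hypothetical helper: returns a load(url) function that caches decoded
// buffers by URL, so each file is fetched and decoded at most once.
// `fetchFn` is injectable to make the loader easy to test.
function createBufferLoader(ctx, fetchFn = fetch) {
  const cache = new Map(); // url -> Promise<AudioBuffer>

  return function load(url) {
    if (!cache.has(url)) {
      const pending = fetchFn(url)
        .then(res => {
          if (!res.ok) {
            throw new Error(`Failed to load ${url}: HTTP ${res.status}`);
          }
          return res.arrayBuffer();
        })
        .then(data => ctx.decodeAudioData(data))
        .catch(err => {
          cache.delete(url); // allow a retry after a failure
          throw err;
        });
      cache.set(url, pending);
    }
    return cache.get(url);
  };
}

// Possible usage:
//   const load = createBufferLoader(ctx);
//   const buffers = await Promise.all(['/drums.wav', '/bass.wav'].map(load));
```

Caching the promise instead of the finished buffer means two callers asking for the same URL at the same moment share one download and one decode.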

Now here's the part most tutorials get wrong. You don't start the sources at ctx.currentTime. You schedule them slightly in the future, so all three start() calls definitely land before the playback time arrives:

function playSynced(buffers) {
  // 100ms lookahead. Long enough to absorb main-thread jitter,
  // short enough that the user does not perceive delay.
  const startAt = ctx.currentTime + 0.1;

  const sources = buffers.map(buf => {
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    // All three sources are armed against the SAME future timestamp
    src.start(startAt);
    return src;
  });

  return sources;
}

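The magic is in that shared startAt value. The audio thread (which runs at high priority, separate from the main thread) sees three sources all scheduled for the same sample frame and starts them in lockstep. Because every source is rendered from the same sample clock, the tracks cannot drift apart afterward, as long as every start() call lands before startAt and you leave playbackRate alone.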
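The lookahead only works if every start() call really does land before the scheduled time. A source whose start time has already passed begins playing immediately, which brings the offset right back. One way to catch that, sketched below, is to read the clock again after scheduling. The playSyncedChecked name and the onTime flag are hypothetical additions of mine.

```javascript
// Variant of the scheduling function that also reports whether all start()
// calls were issued before the scheduled start time.
function playSyncedChecked(ctx, buffers, lookahead = 0.1) {
  const startAt = ctx.currentTime + lookahead;

  const sources = buffers.map(buf => {
    const src = ctx.createBufferSource();
    src.buffer = buf;
    src.connect(ctx.destination);
    src.start(startAt);
    return src;
  });

  // The clock only moves forward. If it is still before startAt now, then
  // every start() call above was issued before startAt as well.
  const onTime = ctx.currentTime < startAt;
  return { sources, startAt, onTime };
}

// Possible usage:
//   const { onTime } = playSyncedChecked(ctx, [drums, bass, guitar]);
//   if (!onTime) console.warn('Scheduling ran late; consider more lookahead');
```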

Per-track gain, mute, and solo

Once you have the basics, you almost always want per-track volume control. Insert a GainNode between each source and the destination:

function createTrack(buffer) {
  const gain = ctx.createGain();
  gain.connect(ctx.destination);
  let volume = 1; // last volume requested via setVolume

  // Tiny ramp to avoid clicks on sudden changes
  const rampTo = value =>
    gain.gain.setTargetAtTime(value, ctx.currentTime, 0.01);

  return {
    buffer,
    gain,
    setVolume(value) {
      volume = value;
      rampTo(value);
    },
    mute() { rampTo(0); },
    // Restore the volume that was set before muting, not a hardcoded 1
    unmute() { rampTo(volume); },
  };
}

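Note setTargetAtTime instead of assigning gain.gain.value directly. Direct assignment makes the gain jump from one sample frame to the next, which you hear as a click. setTargetAtTime approaches the target exponentially instead: with a time constant of 0.01, it covers about 63% of the change in the first 10ms and about 95% after 30ms.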
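Solo is the one control that cannot be decided per track, because whether a track is audible depends on whether any other track is soloed. A minimal sketch of that rule follows. It assumes each track object carries a GainNode in gain plus volume, muted, and soloed fields; those three fields and the applyMix name are hypothetical and go beyond what createTrack returns.

```javascript
// Hypothetical helper: recomputes every track's gain from its state.
// Rule: if any track is soloed, only soloed tracks are audible;
// otherwise every track that is not muted is audible.
function applyMix(ctx, tracks) {
  const anySolo = tracks.some(t => t.soloed);

  for (const t of tracks) {
    const audible = anySolo ? t.soloed : !t.muted;
    const target = audible ? t.volume : 0;
    // Same short exponential ramp as for volume changes, to avoid clicks
    t.gain.gain.setTargetAtTime(target, ctx.currentTime, 0.01);
  }
}

// Possible usage:
//   tracks[1].soloed = true;
//   applyMix(ctx, tracks);
```

Recomputing the whole mix from state on every change is a bit wasteful, but it keeps mute and solo from getting out of step with each other.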

The gotcha nobody mentions: source nodes are one-shot

This tripped me up for an embarrassing amount of time. An AudioBufferSourceNode can only be start()ed once. Ever. If you stop playback and want to play again, you have to create a new source node:

class Track {
  constructor(ctx, buffer) {
    this.ctx = ctx;
    this.buffer = buffer;
    this.gain = ctx.createGain();
    this.gain.connect(ctx.destination);
    this.source = null;
  }

  play(startAt) {
    // Stop any source that is still playing so it is not left running
    this.stop();
    // Always create a fresh source; they are disposable
    this.source = this.ctx.createBufferSource();
    this.source.buffer = this.buffer;
    this.source.connect(this.gain);
    this.source.start(startAt);
  }

  stop() {
    if (this.source) {
      this.source.stop();
      this.source.disconnect();
      this.source = null;
    }
  }
}

The AudioBuffer is the expensive thing — decoded PCM data sitting in memory. The source node is cheap, ephemeral plumbing. Treat them differently.
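One-shot sources also mean there is no built-in pause. A common workaround is to remember how far playback got, throw the sources away, and later start fresh ones from that position using the second argument of start(), which is an offset into the buffer in seconds. The Transport class below is a hypothetical sketch of that idea; it ignores what happens when a track reaches its end.

```javascript
// Hypothetical pause/resume transport built on disposable source nodes.
class Transport {
  constructor(ctx, buffers) {
    this.ctx = ctx;
    this.buffers = buffers;
    this.sources = [];
    this.offset = 0;       // position in the song, in seconds
    this.startedAt = null; // context time at which playback (re)started
  }

  play(lookahead = 0.1) {
    if (this.startedAt !== null) return; // already playing
    const startAt = this.ctx.currentTime + lookahead;
    this.sources = this.buffers.map(buf => {
      const src = this.ctx.createBufferSource();
      src.buffer = buf;
      src.connect(this.ctx.destination);
      // start(when, offset): begin at `when`, from `offset` into the buffer
      src.start(startAt, this.offset);
      return src;
    });
    this.startedAt = startAt;
  }

  pause() {
    if (this.startedAt === null) return; // not playing
    // Clamp to 0 in case pause() arrives during the lookahead window
    const elapsed = Math.max(0, this.ctx.currentTime - this.startedAt);
    this.offset += elapsed;
    for (const src of this.sources) {
      src.stop();
      src.disconnect();
    }
    this.sources = [];
    this.startedAt = null;
  }
}
```

Because every resume goes through the same shared start time and the same offset, the tracks come back in sync with each other after a pause.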

Prevention: a checklist

After shipping a couple of multitrack browser tools, here's what I check before any audio code goes anywhere near production:

One AudioContext per app. Creating multiple contexts means multiple clocks, which defeats the entire point.
Decode once, play many. Hold AudioBuffer objects in memory; never re-decode on each play.
Always schedule with lookahead. Never start(ctx.currentTime). Use at least ctx.currentTime + 0.05 to absorb jitter.
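Resume the context on user gesture. Under browser autoplay policies, a context created before any user interaction starts in the suspended state. Call ctx.resume() inside a click handler; otherwise the clock never advances and you get silence with no error.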
Use setTargetAtTime for parameter changes. Direct assignment to .value clicks. Ramps don't.
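Watch your buffer sizes. Decoded audio is held as 32-bit float samples, so a stereo minute at 44.1kHz is about 21MB in memory (44,100 frames × 2 channels × 4 bytes × 60 seconds), no matter how small the compressed file was. Long sessions with many tracks add up fast.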
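Two of these checks are easy to turn into small helpers. The sketch below is hypothetical (both function names are mine). The memory estimate assumes 4 bytes per sample, which matches the 32-bit float format the Web Audio spec describes for AudioBuffer; actual browser memory use may differ.

```javascript
// Resumes the context if it is suspended. Call this from inside a user
// gesture handler (for example a click) so the browser allows it.
async function ensureRunning(ctx) {
  if (ctx.state === 'suspended') {
    await ctx.resume();
  }
  return ctx.state;
}

// Rough memory estimate for a decoded buffer, in bytes.
// Assumes 32-bit float samples (4 bytes each).
function estimateBufferBytes(buffer) {
  const BYTES_PER_SAMPLE = 4;
  return buffer.length * buffer.numberOfChannels * BYTES_PER_SAMPLE;
}

// Possible usage:
//   playButton.addEventListener('click', async () => {
//     await ensureRunning(ctx);
//     playSynced([drums, bass, guitar]);
//   });
//   console.log(estimateBufferBytes(drums) / 1e6, 'MB');
```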

The Web Audio API has a steep initial learning curve because it inverts how you usually think about timing — you describe when things should happen, then let the audio thread execute it, instead of imperatively saying do it now. Once that clicks, the drift problems disappear entirely.

For a deeper dive into scheduling patterns, Chris Wilson's A Tale of Two Clocks is still the canonical reference and worth bookmarking. It's the article that finally made all of this make sense to me.