DEV Community

monkeymore studio


Extracting Vocals with AI in the Browser: A Technical Deep Dive

Have you ever wanted to isolate the vocals from a song? Maybe you're a producer looking to create a remix, a singer wanting to practice with the original backing, or just curious about how AI can separate audio sources. In this guide, I'll walk you through how we built a vocal extraction tool that runs entirely in your browser using the same AI technology as our free online vocal extractor.

Why Extract Vocals in the Browser?

Before we get into the technical details, let's talk about why you'd want to do this locally instead of using a cloud service.

Your Audio Stays Private

When you upload a song to a server, you're trusting that company with your data. For musicians working on unreleased tracks or anyone concerned about privacy, this is a big deal. Browser-based processing means your audio never leaves your device.

No Upload Delays

Uploading a high-quality audio file can take several minutes. With local processing, you skip the upload entirely. The only initial delay is downloading the AI model once (about 170MB), which gets cached for future sessions.

Works Without Internet

Once the page and model are loaded, you can extract vocals even without an internet connection. This is perfect for studio environments or when you're traveling.

Completely Free

Running AI on servers costs money. By moving the computation to your device, we can offer this tool forever without charging a dime.

The Architecture Overview

Our vocal extractor uses Demucs, a state-of-the-art music source separation model from Meta (Facebook Research), converted to run in the browser with ONNX Runtime Web. Three pieces fit together: Demucs performs the separation, ONNX Runtime Web executes the model in the browser (on WebGPU or WASM), and the Web Audio API handles decoding the input file and encoding the extracted vocals.

Understanding Demucs: The AI That Separates Music

Demucs is a neural network trained to separate mixed audio into four distinct sources: drums, bass, other instruments, and vocals. It was developed by Meta's AI research team and represents the cutting edge of music source separation.

How Demucs Works

The model uses a hybrid approach that processes audio in both the time domain and frequency domain:

  1. Time-domain processing: The raw audio waveform goes through a neural network that learns temporal patterns
  2. Frequency-domain processing: The audio is converted to a spectrogram (visual representation of frequencies over time) using STFT
  3. Hybrid combination: Both representations are fused together for the final separation

The key insight for vocal extraction is simple: Demucs gives us four separate tracks, and we just keep the vocals while discarding everything else.

Core Data Structures

Let's look at the essential data structures that power our vocal extractor:

Demucs Model Constants

export const CONSTANTS = {
  SAMPLE_RATE: 44100,
  FFT_SIZE: 4096,
  HOP_SIZE: 1024,
  TRAINING_SAMPLES: 343980,
  MODEL_SPEC_BINS: 2048,
  MODEL_SPEC_FRAMES: 336,
  SEGMENT_OVERLAP: 0.25,
  TRACKS: ['drums', 'bass', 'other', 'vocals'],
  DEFAULT_MODEL_URL: 'https://huggingface.co/timcsy/demucs-web-onnx/resolve/main/htdemucs_embedded.onnx'
};

These constants are crucial:

  • TRAINING_SAMPLES (343980): Each segment is about 7.8 seconds at 44.1kHz sample rate
  • SEGMENT_OVERLAP (0.25): 25% overlap prevents audible artifacts at segment boundaries
  • TRACKS: The four outputs we get - we only keep 'vocals'
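
You can sanity-check those numbers in a couple of lines:

```typescript
// Verify the segment math implied by the constants above.
const SAMPLE_RATE = 44100;
const TRAINING_SAMPLES = 343980;
const SEGMENT_OVERLAP = 0.25;

// Each segment covers TRAINING_SAMPLES / SAMPLE_RATE seconds of audio.
const segmentSeconds = TRAINING_SAMPLES / SAMPLE_RATE; // exactly 7.8 s

// With 25% overlap, consecutive segments advance by 75% of a segment.
const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP)); // 257985 samples

console.log(segmentSeconds, stride);
```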

React State Management

const [file, setFile] = useState<File | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [progress, setProgress] = useState(0);
const [error, setError] = useState<string | null>(null);
const [vocalUrl, setVocalUrl] = useState<string | null>(null);
const [processTime, setProcessTime] = useState<number | null>(null);
const [elapsedTime, setElapsedTime] = useState<number>(0);
const startTimeRef = useRef<number>(0);
const timerIntervalRef = useRef<NodeJS.Timeout | null>(null);

We track file state, processing status, progress (0-1), timing information, and the resulting vocal audio URL.
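
The startTimer and stopTimer helpers referenced in the next section aren't shown in this post; a minimal sketch, assuming a plain setInterval-based implementation (with onTick standing in for setElapsedTime), might look like:

```typescript
// Hypothetical sketch of the startTimer/stopTimer helpers used below.
// In the React component, intervalId would live in timerIntervalRef.current.
type TimerState = {
  startTime: number;
  intervalId: ReturnType<typeof setInterval> | null;
};

const timer: TimerState = { startTime: 0, intervalId: null };

function startTimer(onTick: (elapsedSeconds: number) => void): void {
  timer.startTime = Date.now();
  timer.intervalId = setInterval(() => {
    // Report elapsed seconds once per second while processing runs.
    onTick((Date.now() - timer.startTime) / 1000);
  }, 1000);
}

function stopTimer(): void {
  if (timer.intervalId !== null) {
    clearInterval(timer.intervalId); // stop ticking once processing ends
    timer.intervalId = null;
  }
}
```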

The Complete Processing Flow

Here's the entire journey from uploaded file to isolated vocals:

Loading and Initializing the AI Model

The first step is loading the Demucs model using ONNX Runtime Web:

const extractVocals = async () => {
  if (!file) return;

  setIsProcessing(true);
  setError(null);
  setProgress(0);
  setIsPreview(false);
  setProcessTime(null);
  startTimeRef.current = Date.now();
  startTimer();

  try {
    // Check WebGPU support for better performance
    const hasWebGPU = 'gpu' in navigator;
    console.log('WebGPU support:', hasWebGPU);

    // Dynamically import demucs-web and onnxruntime-web
    const [{ DemucsProcessor }, ort] = await Promise.all([
      import('demucs-web'),
      import('onnxruntime-web')
    ]);

    setProgress(0.05);

    // Initialize processor with progress tracking
    const processor = new DemucsProcessor({
      ort,
      onProgress: ({ progress }: { progress: number }) => {
        const validProgress = typeof progress === 'number' && !isNaN(progress) ? progress : 0;
        const scaledProgress = 0.2 + validProgress * 0.6;
        console.log(`Demucs progress: ${(validProgress * 100).toFixed(1)}%, Overall: ${(scaledProgress * 100).toFixed(1)}%`);
        setProgress(scaledProgress);
      },
      onLog: (phase: string, msg: string) => {
        console.log(`[${phase}] ${msg}`);
        if (phase === 'inference') {
          console.log('AI inference in progress...');
        }
      }
    });

    // Load model
    setProgress(0.1);
    const { CONSTANTS } = await import('demucs-web');

    // Model is about 170MB, loading may take a while
    console.log('Loading demucs model...');
    await processor.loadModel(CONSTANTS.DEFAULT_MODEL_URL);

    setProgress(0.25);
    // ... continue with audio processing
  } catch (err) {
    console.error('Vocal extraction error:', err);
    setError(t.vocalExtractError || 'Failed to extract vocals. Please try a different file.');
  } finally {
    setIsProcessing(false);
    stopTimer();
  }
};

We use dynamic imports to load the heavy libraries only when needed. The model (~170MB) is downloaded from Hugging Face and cached by the browser for future use.
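
To surface the download itself in the progress bar, the processor's download callback can be mapped into the 0.1–0.25 range the component reserves for model loading. The mapping below is an illustrative choice of ours, not something demucs-web prescribes:

```typescript
// Map the model-download fraction into the component's overall progress bar,
// which reserves roughly 0.1-0.25 for model loading. This mapping is an
// illustrative choice, not part of the demucs-web API.
function downloadToOverallProgress(loaded: number, total: number): number {
  if (total <= 0) return 0.1; // unknown size: stay at the "loading" floor
  const fraction = Math.min(loaded / total, 1);
  return 0.1 + fraction * 0.15;
}
```

Wired up (assuming the processor accepts an onDownloadProgress option matching the loadModel code shown later), it would be onDownloadProgress: (loaded, total) => setProgress(downloadToOverallProgress(loaded, total)).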

Decoding and Preparing Audio

Once the model is loaded, we decode the audio file:

// Decode audio file
console.log('Decoding audio file...');
const arrayBuffer = await file.arrayBuffer();
const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();
let audioBuffer;
try {
  audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
} catch (decodeErr) {
  console.error('Audio decode error:', decodeErr);
  throw new Error('Failed to decode audio file. Please try a different format (MP3, WAV).');
}

// Check audio duration
console.log('Audio duration:', audioBuffer.duration, 'seconds');

// Show warning for long audio
if (audioBuffer.duration > 60) {
  const estimatedMinutes = Math.ceil(audioBuffer.duration / 60 * 5);
  console.log(`Long audio detected. Estimated processing time: ${estimatedMinutes} minutes`);
}

// Get audio data - process full audio
console.log('Getting audio channels...');
const leftChannel = audioBuffer.getChannelData(0);
const rightChannel = audioBuffer.numberOfChannels > 1 
  ? audioBuffer.getChannelData(1) 
  : audioBuffer.getChannelData(0);

The decodeAudioData method decodes common formats (MP3, WAV, OGG, M4A, FLAC) into raw PCM data at the AudioContext's sample rate.
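
One caveat: decodeAudioData resamples to the AudioContext's sample rate, which often matches the hardware (48kHz on many devices), while Demucs expects 44.1kHz. Creating the context with new AudioContext({ sampleRate: 44100 }) avoids the mismatch; a small guard (a hypothetical helper of ours) makes the assumption explicit:

```typescript
// Demucs is trained on 44.1 kHz audio, so a mismatched rate degrades output.
// decodeAudioData resamples to the AudioContext's rate, which you can pin
// with `new AudioContext({ sampleRate: 44100 })`. This guard is hypothetical.
function assertDemucsSampleRate(buffer: { sampleRate: number }): void {
  if (buffer.sampleRate !== 44100) {
    throw new Error(
      `Expected 44100 Hz audio, got ${buffer.sampleRate} Hz - ` +
      'create the AudioContext with { sampleRate: 44100 } so decoding resamples.'
    );
  }
}
```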

The AI Separation Process

Here's where the AI does its magic. We call the separate method:

console.log('Separating tracks...');
let result;
try {
  result = await processor.separate(leftChannel, rightChannel);
  console.log('Separation complete:', Object.keys(result));
} catch (sepErr: any) {
  console.error('Separation error:', sepErr);
  throw new Error('Failed to separate audio tracks. Please try a different file or refresh the page.');
}

The separate method returns an object with four tracks: drums, bass, other, and vocals. Each track contains separate left and right channel data.
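
In TypeScript terms, the result shape looks roughly like this (the type names are ours; demucs-web may not export them):

```typescript
// Approximate shape of the separation result. These type names are our own
// annotations, not necessarily what demucs-web exports.
interface StereoTrack {
  left: Float32Array;
  right: Float32Array;
}

interface SeparationResult {
  drums: StereoTrack;
  bass: StereoTrack;
  other: StereoTrack;
  vocals: StereoTrack;
}

// For vocal extraction we only ever read result.vocals:
function pickVocals(result: SeparationResult): StereoTrack {
  return result.vocals;
}
```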

Extracting the Vocals Track

Unlike vocal removal where we mix the instrumental tracks, here we simply keep only the vocals:

// Create audio buffer from vocals only
const vocalBuffer = audioContext.createBuffer(
  2,
  result.vocals.left.length,
  44100
);
vocalBuffer.copyToChannel(result.vocals.left, 0);
vocalBuffer.copyToChannel(result.vocals.right, 1);

This creates a stereo audio buffer containing only the isolated vocals.

Deep Dive: The DemucsProcessor Class

Let's examine how the DemucsProcessor class works. This is the heart of the AI processing.

Model Loading with Progress Tracking

async loadModel(modelPathOrBuffer) {
  if (!this.ort) {
    throw new Error('ONNX Runtime not provided. Pass ort in constructor options.');
  }

  this.onLog('model', 'Loading model...');

  let modelBuffer;
  if (modelPathOrBuffer instanceof ArrayBuffer) {
    modelBuffer = modelPathOrBuffer;
  } else {
    const response = await fetch(modelPathOrBuffer || this.modelPath);

    // Check if we can track progress
    const contentLength = response.headers.get('Content-Length');
    if (contentLength && response.body) {
      const totalSize = parseInt(contentLength, 10);
      const reader = response.body.getReader();
      const chunks = [];
      let loadedSize = 0;

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);
        loadedSize += value.length;
        this.onDownloadProgress(loadedSize, totalSize);
      }

      // Combine chunks into single ArrayBuffer
      const combined = new Uint8Array(loadedSize);
      let offset = 0;
      for (const chunk of chunks) {
        combined.set(chunk, offset);
        offset += chunk.length;
      }
      modelBuffer = combined.buffer;
    } else {
      // Fallback: no progress tracking
      modelBuffer = await response.arrayBuffer();
    }
  }

  const defaultSessionOptions = {
    executionProviders: ['webgpu', 'wasm'],
    graphOptimizationLevel: 'basic'
  };

  this.session = await this.ort.InferenceSession.create(modelBuffer, {
    ...defaultSessionOptions,
    ...this.sessionOptions
  });

  this.onLog('model', 'Model loaded successfully');
  return this.session;
}

The model loading uses the Streams API to track download progress, giving users visual feedback during the ~170MB download. It configures ONNX Runtime to use WebGPU for GPU acceleration (falling back to WASM if unavailable).

The Separation Algorithm

The separate method processes audio in overlapping segments:

async separate(leftChannel, rightChannel) {
  if (!this.session) {
    throw new Error('Model not loaded. Call loadModel() first.');
  }

  const totalSamples = leftChannel.length;
  const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP));
  const numSegments = Math.ceil((totalSamples - TRAINING_SAMPLES) / stride) + 1;

  const outputs = TRACKS.map(() => ({
    left: new Float32Array(totalSamples),
    right: new Float32Array(totalSamples)
  }));
  const weights = new Float32Array(totalSamples);

  let segmentIdx = 0;

  for (let start = 0; start < totalSamples; start += stride) {
    const end = Math.min(start + TRAINING_SAMPLES, totalSamples);
    const segmentLength = end - start;

    const segLeft = new Float32Array(TRAINING_SAMPLES);
    const segRight = new Float32Array(TRAINING_SAMPLES);

    for (let i = 0; i < segmentLength; i++) {
      segLeft[i] = leftChannel[start + i];
      segRight[i] = rightChannel[start + i];
    }

    const input = prepareModelInput(segLeft, segRight);

    const waveformTensor = new this.ort.Tensor('float32', input.waveform, [1, 2, TRAINING_SAMPLES]);
    const magSpecTensor = new this.ort.Tensor('float32', input.magSpec, [1, 4, MODEL_SPEC_BINS, MODEL_SPEC_FRAMES]);

    const feeds = {};
    feeds[this.session.inputNames[0]] = waveformTensor;
    if (this.session.inputNames.length > 1) {
      feeds[this.session.inputNames[1]] = magSpecTensor;
    }

    const inferResults = await this.session.run(feeds);
    // ... process outputs

    segmentIdx++;
    this.onProgress({
      progress: segmentIdx / numSegments,
      currentSegment: segmentIdx,
      totalSegments: numSegments
    });
  }

  // Normalize by weights
  for (let t = 0; t < TRACKS.length; t++) {
    for (let i = 0; i < totalSamples; i++) {
      if (weights[i] > 0) {
        outputs[t].left[i] /= weights[i];
        outputs[t].right[i] /= weights[i];
      }
    }
  }

  return {
    drums: outputs[0],
    bass: outputs[1],
    other: outputs[2],
    vocals: outputs[3]
  };
}

Key aspects of this algorithm:

  1. Segmentation: Long audio is split into overlapping segments (343,980 samples each with 25% overlap)
  2. STFT Preparation: Each segment is converted to a spectrogram
  3. ONNX Inference: The model runs on both time-domain and frequency-domain inputs
  4. Overlap-Add: Results are combined using weighted overlap-add to prevent artifacts
  5. Progress Tracking: Updates the UI after each segment
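
The "// ... process outputs" step elided above is where each segment gets accumulated into the output buffers while its window is added into weights. A minimal single-channel sketch, using a constant weight of 1 instead of the real fade window:

```typescript
// Minimal sketch of the weighted overlap-add step elided in the snippet above,
// reduced to one channel. The real code applies a fade window per segment;
// a constant weight keeps the idea visible.
function accumulateSegment(
  output: Float32Array,
  weights: Float32Array,
  segment: Float32Array,
  start: number,
  length: number
): void {
  for (let i = 0; i < length; i++) {
    output[start + i] += segment[i]; // sum overlapping contributions
    weights[start + i] += 1;         // remember how much weight landed on each sample
  }
}
```

After all segments are accumulated, dividing output[i] by weights[i] (as the normalization loop above does) recovers the average, so a constant signal covered by two overlapping segments stays constant.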

Signal Processing: FFT and STFT

The vocal extractor relies on Fast Fourier Transform (FFT) operations. Here's the core FFT implementation:

export function fft(realOut, imagOut, realIn, n) {
  const bits = Math.log2(n) | 0;
  const twiddles = getFFTTwiddles(n);

  // Bit-reverse permutation
  for (let i = 0; i < n; i++) {
    const j = bitReverse(i, bits);
    realOut[i] = realIn[j];
    imagOut[i] = 0;
  }

  // Cooley-Tukey butterfly operations
  for (let size = 2; size <= n; size *= 2) {
    const halfSize = size / 2;
    const step = n / size;
    for (let i = 0; i < n; i += size) {
      for (let j = 0; j < halfSize; j++) {
        const k = j * step;
        const tReal = twiddles.real[k];
        const tImag = twiddles.imag[k];
        const idx1 = i + j;
        const idx2 = i + j + halfSize;
        const eReal = realOut[idx1];
        const eImag = imagOut[idx1];
        const oReal = realOut[idx2] * tReal - imagOut[idx2] * tImag;
        const oImag = realOut[idx2] * tImag + imagOut[idx2] * tReal;
        realOut[idx1] = eReal + oReal;
        imagOut[idx1] = eImag + oImag;
        realOut[idx2] = eReal - oReal;
        imagOut[idx2] = eImag - oImag;
      }
    }
  }
}

This implements the Cooley-Tukey radix-2 FFT algorithm.
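
The fft function leans on two helpers not shown here. Plausible implementations, assuming the usual forward-transform convention W_n^k = exp(-2πik/n), would be:

```typescript
// Assumed implementations of the two helpers the fft above relies on.
// The twiddle sign matches a forward transform, W_n^k = exp(-2*pi*i*k/n).

// Reverse the lowest `bits` bits of i (the FFT's input permutation).
function bitReverse(i: number, bits: number): number {
  let reversed = 0;
  for (let b = 0; b < bits; b++) {
    reversed = (reversed << 1) | ((i >> b) & 1);
  }
  return reversed;
}

// Precompute twiddle factors for the butterfly stages (cacheable per n).
function getFFTTwiddles(n: number): { real: Float32Array; imag: Float32Array } {
  const real = new Float32Array(n / 2);
  const imag = new Float32Array(n / 2);
  for (let k = 0; k < n / 2; k++) {
    const angle = (-2 * Math.PI * k) / n;
    real[k] = Math.cos(angle);
    imag[k] = Math.sin(angle);
  }
  return { real, imag };
}
```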

The STFT (Short-Time Fourier Transform) applies FFT to overlapping windows:

export function stft(signal, fftSize, hopSize) {
  const numFrames = Math.floor((signal.length - fftSize) / hopSize) + 1;
  const numBins = fftSize / 2 + 1;
  const window = getHannWindow(fftSize);
  const scale = 1.0 / Math.sqrt(fftSize);

  const specReal = new Float32Array(numFrames * numBins);
  const specImag = new Float32Array(numFrames * numBins);
  const frameReal = new Float32Array(fftSize);
  const frameImag = new Float32Array(fftSize);
  const windowedFrame = new Float32Array(fftSize);

  for (let frame = 0; frame < numFrames; frame++) {
    const start = frame * hopSize;
    for (let i = 0; i < fftSize; i++) {
      windowedFrame[i] = signal[start + i] * window[i];
    }
    fft(frameReal, frameImag, windowedFrame, fftSize);
    const outOffset = frame * numBins;
    for (let k = 0; k < numBins; k++) {
      specReal[outOffset + k] = frameReal[k] * scale;
      specImag[outOffset + k] = frameImag[k] * scale;
    }
  }

  return { real: specReal, imag: specImag, numFrames, numBins };
}
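
The getHannWindow helper isn't shown either; a standard periodic Hann window (our assumed implementation, the common choice for STFT analysis) is:

```typescript
// A periodic Hann window, the usual choice for STFT analysis. The article's
// getHannWindow isn't shown, so this is an assumed implementation.
function getHannWindow(size: number): Float32Array {
  const window = new Float32Array(size);
  for (let i = 0; i < size; i++) {
    // Tapers from 0 at the edges to 1 at the center of the frame.
    window[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / size));
  }
  return window;
}
```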

Converting to WAV Format

After extracting the vocals, we convert to WAV:

const audioBufferToWav = async (buffer: AudioBuffer): Promise<Blob> => {
  const numChannels = buffer.numberOfChannels;
  const sampleRate = buffer.sampleRate;
  const format = 1; // PCM
  const bitDepth = 16;

  const bytesPerSample = bitDepth / 8;
  const blockAlign = numChannels * bytesPerSample;

  const dataLength = buffer.length * numChannels * bytesPerSample;
  const bufferLength = 44 + dataLength;

  const arrayBuffer = new ArrayBuffer(bufferLength);
  const view = new DataView(arrayBuffer);

  // Write WAV header
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataLength, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);
  view.setUint16(20, format, true);
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitDepth, true);
  writeString(36, 'data');
  view.setUint32(40, dataLength, true);

  // Write audio data
  const offset = 44;
  const channels = [];
  for (let i = 0; i < numChannels; i++) {
    channels.push(buffer.getChannelData(i));
  }

  let index = 0;
  for (let i = 0; i < buffer.length; i++) {
    for (let channel = 0; channel < numChannels; channel++) {
      const sample = Math.max(-1, Math.min(1, channels[channel][i]));
      const intSample = sample < 0 ? sample * 0x8000 : sample * 0x7FFF;
      view.setInt16(offset + index, intSample, true);
      index += 2;
    }
  }

  return new Blob([arrayBuffer], { type: 'audio/wav' });
};
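
If a generated file refuses to play, inspecting the first 44 bytes usually reveals the problem. A small checker for the header the function above writes:

```typescript
// Parse the fixed 44-byte PCM WAV header written by audioBufferToWav,
// handy when debugging "file won't play" issues.
function parseWavHeader(view: DataView) {
  const tag = (offset: number, len: number): string => {
    let s = '';
    for (let i = 0; i < len; i++) s += String.fromCharCode(view.getUint8(offset + i));
    return s;
  };
  return {
    riff: tag(0, 4),                    // should be 'RIFF'
    wave: tag(8, 4),                    // should be 'WAVE'
    format: view.getUint16(20, true),   // 1 = PCM
    channels: view.getUint16(22, true),
    sampleRate: view.getUint32(24, true),
    bitDepth: view.getUint16(34, true),
  };
}
```

Running parseWavHeader(new DataView(await blob.arrayBuffer())) on the output should report riff: 'RIFF', format: 1, channels: 2, sampleRate: 44100, and bitDepth: 16.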

Performance Optimizations

WebGPU Acceleration

ONNX Runtime Web uses WebGPU when available:

const defaultSessionOptions = {
  executionProviders: ['webgpu', 'wasm'],
  graphOptimizationLevel: 'basic'
};

Overlapping Segments

The 25% overlap prevents artifacts:

const overlapWindow = new Float32Array(segmentLength);
for (let i = 0; i < segmentLength; i++) {
  const fadeIn = Math.min(i / (stride * 0.5), 1);
  const fadeOut = Math.min((segmentLength - i) / (stride * 0.5), 1);
  overlapWindow[i] = Math.min(fadeIn, fadeOut);
}

Dynamic Imports

Only load heavy libraries when needed:

const [{ DemucsProcessor }, ort] = await Promise.all([
  import('demucs-web'),
  import('onnxruntime-web')
]);

Browser Compatibility

Our vocal extractor works in modern browsers:

  • Chrome/Edge 113+: Full support with WebGPU
  • Firefox 121+: Full support (WASM fallback)
  • Safari 17+: Full support (WASM fallback)

Required APIs:

  • AudioContext: Universal support
  • fetch: Universal support
  • WebGPU: Chrome/Edge (optional)
  • WebAssembly: Universal support
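
Feature detection can drive the provider list explicitly. This helper is our own convenience, not part of ONNX Runtime, which also falls back on its own when a listed provider is unavailable:

```typescript
// Pick ONNX Runtime Web execution providers from feature detection.
// Illustrative only: ORT already skips unavailable providers in the list.
function pickExecutionProviders(nav: { gpu?: unknown }): string[] {
  return 'gpu' in nav && nav.gpu ? ['webgpu', 'wasm'] : ['wasm'];
}
```

In the browser you would call pickExecutionProviders(navigator) and pass the result as executionProviders in the session options.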

Try It Yourself

Ready to extract vocals from your favorite songs? Visit our free online vocal extractor and give it a try. All processing happens locally - your files never leave your device.

Conclusion

Building a browser-based vocal extractor shows what's possible with modern web technologies:

  1. AI in the browser works: ONNX Runtime Web brings powerful ML models to the client side.

  2. Privacy by design: Local processing means your data stays yours.

  3. Signal processing fundamentals: FFT, STFT, and overlap-add are essential for audio AI.

  4. Performance is key: WebGPU, smart segmentation, and caching make it practical.

The complete source is available in our repository. Whether you're building a music app, learning about AI, or just love karaoke, I hope this guide helps you understand how vocal extraction works.

Happy extracting! šŸŽ¤šŸŽµ
