monkeymore studio

Building an AI-Powered Vocal Remover in the Browser: A Deep Technical Dive

Ever wanted to create karaoke versions of your favorite songs without uploading them to a server you know nothing about? In this guide, I'll show you how we built a vocal remover that runs entirely in your browser using AI, with no server required. This is the same technology powering our free online vocal remover tool.

Why Browser-Based AI Audio Processing?

Before we dive into the code, let's talk about why you'd want to process audio with AI directly in the browser instead of using a cloud service.

Complete Privacy

When you upload a song to a server, you're sending your personal data (or copyrighted material) to a third party. With browser-based processing, your audio never leaves your device. This is especially important for musicians working on unreleased tracks or anyone concerned about privacy.

No Upload Wait Times

Uploading a 5-minute song can take several minutes depending on your connection. With local processing, you skip the upload entirely. The only delay is downloading the AI model once (about 170MB), which gets cached for future use.

Works Offline

Once the page and model are loaded, you can process audio even without an internet connection. This makes it reliable for studio environments or locations with poor connectivity.

Cost-Free Forever

Running AI inference on servers costs money. By moving the computation to the user's device, we can offer this tool completely free without worrying about cloud computing bills.

The Architecture: How It All Fits Together

Our vocal remover uses an AI model called Demucs, converted to run in the browser with ONNX Runtime Web. The sections below walk through the architecture piece by piece.

Understanding Demucs: The AI Behind the Magic

Demucs is a state-of-the-art music source separation model developed at Meta AI (formerly Facebook AI Research). It separates a song into four distinct tracks: drums, bass, other (remaining instruments), and vocals.

How Demucs Works

The model uses a hybrid approach combining time-domain and frequency-domain processing:

  1. Time-domain processing: The raw audio waveform is fed through a neural network
  2. Frequency-domain processing: The audio is converted to a spectrogram using STFT (Short-Time Fourier Transform)
  3. Hybrid output: Both representations are combined for the final separation

Here's the key insight: to remove vocals, we simply take the drums, bass, and other tracks, then mix them together while discarding the vocals track.

Core Data Structures

Let's look at the important data structures that power our vocal remover:

Demucs Constants

export const CONSTANTS = {
  SAMPLE_RATE: 44100,
  FFT_SIZE: 4096,
  HOP_SIZE: 1024,
  TRAINING_SAMPLES: 343980,
  MODEL_SPEC_BINS: 2048,
  MODEL_SPEC_FRAMES: 336,
  SEGMENT_OVERLAP: 0.25,
  TRACKS: ['drums', 'bass', 'other', 'vocals'],
  DEFAULT_MODEL_URL: 'https://huggingface.co/timcsy/demucs-web-onnx/resolve/main/htdemucs_embedded.onnx'
};

These constants define the model's expected input format:

  • TRAINING_SAMPLES (343980): Each segment processed is about 7.8 seconds at 44.1kHz
  • SEGMENT_OVERLAP (0.25): Segments overlap by 25% to avoid artifacts at boundaries
  • TRACKS: The four outputs we get from the model
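
To make the numbers concrete, here's the segment arithmetic these constants imply:

```javascript
// Segment arithmetic implied by the constants above.
const SAMPLE_RATE = 44100;
const TRAINING_SAMPLES = 343980;
const SEGMENT_OVERLAP = 0.25;

// Each model segment covers TRAINING_SAMPLES / SAMPLE_RATE seconds of audio.
const segmentSeconds = TRAINING_SAMPLES / SAMPLE_RATE;

// With 25% overlap, consecutive segments advance by 75% of a segment.
const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP));

console.log(segmentSeconds); // 7.8 seconds per segment
console.log(stride);         // 257985 samples per hop
```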

State Management

const [file, setFile] = useState<File | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [progress, setProgress] = useState(0);
const [error, setError] = useState<string | null>(null);
const [instrumentalUrl, setInstrumentalUrl] = useState<string | null>(null);
const [processTime, setProcessTime] = useState<number | null>(null);
const [elapsedTime, setElapsedTime] = useState<number>(0);
const startTimeRef = useRef<number>(0);
const timerIntervalRef = useRef<NodeJS.Timeout | null>(null);

We track processing state, progress (0-1), timing information, and the resulting instrumental audio URL.

The Complete Processing Flow

Here's the entire flow from file upload to instrumental output:

Loading and Initializing the AI Model

The first step is loading the Demucs model using ONNX Runtime Web:

const removeVocals = async () => {
  setIsProcessing(true);
  setProgress(0);
  startTimeRef.current = Date.now();
  startTimer();

  try {
    // Dynamically import demucs-web and onnxruntime-web
    const [{ DemucsProcessor }, ort] = await Promise.all([
      import('demucs-web'),
      import('onnxruntime-web')
    ]);

    setProgress(0.05);

    // Initialize processor
    const processor = new DemucsProcessor({
      ort,
      onProgress: ({ progress }: { progress: number }) => {
        const validProgress = typeof progress === 'number' && !isNaN(progress) ? progress : 0;
        setProgress(0.2 + validProgress * 0.6);
      },
      onLog: (phase: string, msg: string) => console.log(`[${phase}] ${msg}`)
    });

    // Load model
    setProgress(0.1);
    const { CONSTANTS } = await import('demucs-web');

    // Model is about 170MB, loading may take a while
    console.log('Loading demucs model...');
    await processor.loadModel(CONSTANTS.DEFAULT_MODEL_URL);

    setProgress(0.25);
    // ... continue processing
  } catch (err) {
    console.error('Vocal removal error:', err);
    setError('Failed to remove vocals. Please try a different file.');
  } finally {
    setIsProcessing(false);
    stopTimer();
  }
};

We use dynamic imports to avoid loading the large libraries until they're needed. The model (~170MB) is downloaded from Hugging Face and cached by the browser.

Decoding and Preparing Audio

Once the model is loaded, we decode the audio file:

// Decode audio file
console.log('Decoding audio file...');
const arrayBuffer = await file.arrayBuffer();
const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();
let audioBuffer;
try {
  audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
} catch (decodeErr) {
  console.error('Audio decode error:', decodeErr);
  throw new Error('Failed to decode audio file. Please try a different format (MP3, WAV).');
}

// Check audio duration
console.log('Audio duration:', audioBuffer.duration, 'seconds');

// Get audio data
console.log('Getting audio channels...');
const leftChannel = audioBuffer.getChannelData(0);
const rightChannel = audioBuffer.numberOfChannels > 1 
  ? audioBuffer.getChannelData(1) 
  : audioBuffer.getChannelData(0);

The decodeAudioData method converts various audio formats (MP3, WAV, OGG, etc.) into raw PCM data that we can process.
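
One subtlety the snippet glosses over (my observation, not something the original code handles): decodeAudioData resamples to the AudioContext's sample rate, which on many devices is 48000 Hz, while the model expects 44100 Hz. An OfflineAudioContext can bring the buffer back to 44.1kHz before separation — a sketch:

```javascript
// Resample a decoded AudioBuffer to 44100 Hz using an OfflineAudioContext.
// Browser-only sketch; if the separation library resamples internally,
// this step is unnecessary.
async function resampleTo44100(buffer) {
  if (buffer.sampleRate === 44100) return buffer;
  const frames = Math.ceil(buffer.duration * 44100);
  const offline = new OfflineAudioContext(buffer.numberOfChannels, frames, 44100);
  const source = offline.createBufferSource();
  source.buffer = buffer;
  source.connect(offline.destination);
  source.start(0);
  return offline.startRendering(); // resolves with a 44100 Hz AudioBuffer
}

// Usage: audioBuffer = await resampleTo44100(audioBuffer);
```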

The Separation Process

Here's where the magic happens. We call the separate method on our processor:

console.log('Separating tracks...');
let result;
try {
  result = await processor.separate(leftChannel, rightChannel);
  console.log('Separation complete:', Object.keys(result));
} catch (sepErr: any) {
  console.error('Separation error:', sepErr);
  throw new Error('Failed to separate audio tracks. Please try a different file or refresh the page.');
}

The separate method returns an object with four tracks: drums, bass, other, and vocals. Each track contains left and right channel data.

Mixing the Instrumental Track

To create the instrumental (karaoke) version, we simply mix the non-vocal tracks:

// Mix drums, bass, and other (remove vocals)
const mixedLength = result.drums.left.length;
const mixedLeft = new Float32Array(mixedLength);
const mixedRight = new Float32Array(mixedLength);

for (let i = 0; i < mixedLength; i++) {
  mixedLeft[i] = result.drums.left[i] + result.bass.left[i] + result.other.left[i];
  mixedRight[i] = result.drums.right[i] + result.bass.right[i] + result.other.right[i];
}

// Create audio buffer
const instrumentalBuffer = audioContext.createBuffer(2, mixedLength, 44100);
instrumentalBuffer.copyToChannel(mixedLeft, 0);
instrumentalBuffer.copyToChannel(mixedRight, 1);

Demucs separates everything; to build any mix, we just recombine the stems we want to keep.
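
A small helper makes this recombination generic (my sketch, assuming the { left, right } stem shape returned by separate()):

```javascript
// Mix any subset of stems into a stereo pair. `result` is assumed to have
// the shape returned by separate(): { drums, bass, other, vocals }, each
// with Float32Array `left` and `right` channels.
function mixStems(result, stems) {
  const n = result[stems[0]].left.length;
  const left = new Float32Array(n);
  const right = new Float32Array(n);
  for (const stem of stems) {
    for (let i = 0; i < n; i++) {
      left[i] += result[stem].left[i];
      right[i] += result[stem].right[i];
    }
  }
  return { left, right };
}

// const acapella = mixStems(result, ['vocals']);
// const drumless = mixStems(result, ['bass', 'other', 'vocals']);
```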

Deep Dive: The DemucsProcessor Class

Let's look at how the DemucsProcessor class works internally. This is the core of the AI processing:

Model Loading with Progress Tracking

async loadModel(modelPathOrBuffer) {
  if (!this.ort) {
    throw new Error('ONNX Runtime not provided. Pass ort in constructor options.');
  }

  this.onLog('model', 'Loading model...');

  let modelBuffer;
  if (modelPathOrBuffer instanceof ArrayBuffer) {
    modelBuffer = modelPathOrBuffer;
  } else {
    const response = await fetch(modelPathOrBuffer || this.modelPath);

    // Check if we can track progress
    const contentLength = response.headers.get('Content-Length');
    if (contentLength && response.body) {
      const totalSize = parseInt(contentLength, 10);
      const reader = response.body.getReader();
      const chunks = [];
      let loadedSize = 0;

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);
        loadedSize += value.length;
        this.onDownloadProgress(loadedSize, totalSize);
      }

      // Combine chunks into single ArrayBuffer
      const combined = new Uint8Array(loadedSize);
      let offset = 0;
      for (const chunk of chunks) {
        combined.set(chunk, offset);
        offset += chunk.length;
      }
      modelBuffer = combined.buffer;
    } else {
      // Fallback: no progress tracking
      modelBuffer = await response.arrayBuffer();
    }
  }

  const defaultSessionOptions = {
    executionProviders: ['webgpu', 'wasm'],
    graphOptimizationLevel: 'basic'
  };

  this.session = await this.ort.InferenceSession.create(modelBuffer, {
    ...defaultSessionOptions,
    ...this.sessionOptions
  });

  this.onLog('model', 'Model loaded successfully');
  return this.session;
}

The model loading uses the Streams API to track download progress, giving users feedback during the ~170MB download. It also configures ONNX Runtime to use WebGPU for acceleration (falling back to WASM if unavailable).

The Separation Algorithm

The separate method processes audio in overlapping segments:

async separate(leftChannel, rightChannel) {
  if (!this.session) {
    throw new Error('Model not loaded. Call loadModel() first.');
  }

  const totalSamples = leftChannel.length;
  const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP));
  const numSegments = Math.ceil(totalSamples / stride); // one per loop iteration below

  const outputs = TRACKS.map(() => ({
    left: new Float32Array(totalSamples),
    right: new Float32Array(totalSamples)
  }));
  const weights = new Float32Array(totalSamples);

  let segmentIdx = 0;

  for (let start = 0; start < totalSamples; start += stride) {
    const end = Math.min(start + TRAINING_SAMPLES, totalSamples);
    const segmentLength = end - start;

    const segLeft = new Float32Array(TRAINING_SAMPLES);
    const segRight = new Float32Array(TRAINING_SAMPLES);

    for (let i = 0; i < segmentLength; i++) {
      segLeft[i] = leftChannel[start + i];
      segRight[i] = rightChannel[start + i];
    }

    const input = prepareModelInput(segLeft, segRight);

    const waveformTensor = new this.ort.Tensor('float32', input.waveform, [1, 2, TRAINING_SAMPLES]);
    const magSpecTensor = new this.ort.Tensor('float32', input.magSpec, [1, 4, MODEL_SPEC_BINS, MODEL_SPEC_FRAMES]);

    const feeds = {};
    feeds[this.session.inputNames[0]] = waveformTensor;
    if (this.session.inputNames.length > 1) {
      feeds[this.session.inputNames[1]] = magSpecTensor;
    }

    const inferResults = await this.session.run(feeds);
    // ... process outputs

    segmentIdx++;
    this.onProgress({
      progress: segmentIdx / numSegments,
      currentSegment: segmentIdx,
      totalSegments: numSegments
    });
  }

  // Normalize by weights
  for (let t = 0; t < TRACKS.length; t++) {
    for (let i = 0; i < totalSamples; i++) {
      if (weights[i] > 0) {
        outputs[t].left[i] /= weights[i];
        outputs[t].right[i] /= weights[i];
      }
    }
  }

  return {
    drums: outputs[0],
    bass: outputs[1],
    other: outputs[2],
    vocals: outputs[3]
  };
}

Key aspects of this algorithm:

  1. Segmentation: Long audio is split into overlapping segments (343,980 samples each with 25% overlap)
  2. STFT Preparation: Each segment is converted to a spectrogram using prepareModelInput
  3. ONNX Inference: The model runs inference on both time-domain and frequency-domain inputs
  4. Overlap-Add: Results are combined using weighted overlap-add to avoid boundary artifacts
  5. Progress Tracking: The onProgress callback updates the UI after each segment
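
For completeness, the accumulation elided at `// ... process outputs` presumably looks something like this weighted overlap-add (a sketch of mine, not the library's actual code — the per-sample weight `w` would be the fade window shown in the Performance section):

```javascript
// Weighted overlap-add: accumulate one segment's model output into the
// full-length output buffers, tracking per-sample weights for later
// normalization. `segmentOut[t]` holds { left, right } Float32Arrays for
// track t; `w` is the per-sample weight for this segment.
function accumulateSegment(outputs, weights, segmentOut, w, start, segmentLength) {
  for (let t = 0; t < segmentOut.length; t++) {
    for (let i = 0; i < segmentLength; i++) {
      outputs[t].left[start + i] += segmentOut[t].left[i] * w[i];
      outputs[t].right[start + i] += segmentOut[t].right[i] * w[i];
    }
  }
  for (let i = 0; i < segmentLength; i++) {
    weights[start + i] += w[i]; // divided out afterwards to normalize overlaps
  }
}
```

Dividing each output sample by its accumulated weight (as in the normalization loop at the end of separate) then recovers unity gain in the overlapped regions.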

Signal Processing: FFT and STFT

The vocal remover relies heavily on Fast Fourier Transform (FFT) operations. Here's the FFT implementation:

export function fft(realOut, imagOut, realIn, n) {
  const bits = Math.log2(n) | 0;
  const twiddles = getFFTTwiddles(n);

  // Bit-reverse permutation
  for (let i = 0; i < n; i++) {
    const j = bitReverse(i, bits);
    realOut[i] = realIn[j];
    imagOut[i] = 0;
  }

  // Cooley-Tukey butterfly operations
  for (let size = 2; size <= n; size *= 2) {
    const halfSize = size / 2;
    const step = n / size;
    for (let i = 0; i < n; i += size) {
      for (let j = 0; j < halfSize; j++) {
        const k = j * step;
        const tReal = twiddles.real[k];
        const tImag = twiddles.imag[k];
        const idx1 = i + j;
        const idx2 = i + j + halfSize;
        const eReal = realOut[idx1];
        const eImag = imagOut[idx1];
        const oReal = realOut[idx2] * tReal - imagOut[idx2] * tImag;
        const oImag = realOut[idx2] * tImag + imagOut[idx2] * tReal;
        realOut[idx1] = eReal + oReal;
        imagOut[idx1] = eImag + oImag;
        realOut[idx2] = eReal - oReal;
        imagOut[idx2] = eImag - oImag;
      }
    }
  }
}

This implements the Cooley-Tukey radix-2 FFT algorithm, which converts time-domain audio to frequency-domain representation.
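
The helpers `bitReverse` and `getFFTTwiddles` aren't shown in the article. If you want to experiment, here's a minimal self-contained version of the same radix-2 transform (my own sketch, not the library's exact code):

```javascript
// Minimal radix-2 Cooley-Tukey FFT with the two helper functions the
// article's version relies on. Forward transform of a real input signal.
function bitReverse(x, bits) {
  let r = 0;
  for (let b = 0; b < bits; b++) { r = (r << 1) | (x & 1); x >>= 1; }
  return r;
}

function getFFTTwiddles(n) {
  // Forward-transform twiddle factors: e^(-2Ļ€ik/n) for k = 0 .. n/2-1
  const real = new Float32Array(n / 2);
  const imag = new Float32Array(n / 2);
  for (let k = 0; k < n / 2; k++) {
    real[k] = Math.cos((2 * Math.PI * k) / n);
    imag[k] = -Math.sin((2 * Math.PI * k) / n);
  }
  return { real, imag };
}

function fft(realOut, imagOut, realIn, n) {
  const bits = Math.log2(n) | 0;
  const tw = getFFTTwiddles(n);
  // Bit-reverse permutation of the real input
  for (let i = 0; i < n; i++) {
    realOut[i] = realIn[bitReverse(i, bits)];
    imagOut[i] = 0;
  }
  // Butterfly stages
  for (let size = 2; size <= n; size *= 2) {
    const half = size / 2, step = n / size;
    for (let i = 0; i < n; i += size) {
      for (let j = 0; j < half; j++) {
        const k = j * step;
        const tR = tw.real[k], tI = tw.imag[k];
        const a = i + j, b = i + j + half;
        const eR = realOut[a], eI = imagOut[a];
        const oR = realOut[b] * tR - imagOut[b] * tI;
        const oI = realOut[b] * tI + imagOut[b] * tR;
        realOut[a] = eR + oR; imagOut[a] = eI + oI;
        realOut[b] = eR - oR; imagOut[b] = eI - oI;
      }
    }
  }
}

// Sanity check: a constant (DC) signal puts all its energy in bin 0.
const dcRe = new Float32Array(8), dcIm = new Float32Array(8);
fft(dcRe, dcIm, new Float32Array(8).fill(1), 8);
console.log(dcRe[0]); // 8 — all other bins ā‰ˆ 0
```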

The STFT (Short-Time Fourier Transform) applies FFT to overlapping windows:

export function stft(signal, fftSize, hopSize) {
  const numFrames = Math.floor((signal.length - fftSize) / hopSize) + 1;
  const numBins = fftSize / 2 + 1;
  const window = getHannWindow(fftSize);
  const scale = 1.0 / Math.sqrt(fftSize);

  const specReal = new Float32Array(numFrames * numBins);
  const specImag = new Float32Array(numFrames * numBins);
  const frameReal = new Float32Array(fftSize);
  const frameImag = new Float32Array(fftSize);
  const windowedFrame = new Float32Array(fftSize);

  for (let frame = 0; frame < numFrames; frame++) {
    const start = frame * hopSize;
    for (let i = 0; i < fftSize; i++) {
      windowedFrame[i] = signal[start + i] * window[i];
    }
    fft(frameReal, frameImag, windowedFrame, fftSize);
    const outOffset = frame * numBins;
    for (let k = 0; k < numBins; k++) {
      specReal[outOffset + k] = frameReal[k] * scale;
      specImag[outOffset + k] = frameImag[k] * scale;
    }
  }

  return { real: specReal, imag: specImag, numFrames, numBins };
}

The Hann window reduces spectral leakage, and the hop size (1024 samples) determines the time resolution.
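
`getHannWindow` isn't shown either; a typical implementation uses the periodic Hann form common in STFT pipelines (a sketch — the library's exact window may differ):

```javascript
// Periodic Hann window: tapers each frame to zero at its edges, which
// reduces spectral leakage when frames are windowed before the FFT.
function getHannWindow(size) {
  const w = new Float32Array(size);
  for (let i = 0; i < size; i++) {
    w[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / size));
  }
  return w;
}

const hann = getHannWindow(8);
console.log(hann[0]); // 0 — zero at the frame edge
console.log(hann[4]); // 1 — peak mid-frame
```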

Converting Back to WAV

After processing, we convert the instrumental audio back to WAV format:

const audioBufferToWav = async (buffer: AudioBuffer): Promise<Blob> => {
  const numChannels = buffer.numberOfChannels;
  const sampleRate = buffer.sampleRate;
  const format = 1; // PCM
  const bitDepth = 16;

  const bytesPerSample = bitDepth / 8;
  const blockAlign = numChannels * bytesPerSample;

  const dataLength = buffer.length * numChannels * bytesPerSample;
  const bufferLength = 44 + dataLength;

  const arrayBuffer = new ArrayBuffer(bufferLength);
  const view = new DataView(arrayBuffer);

  // Write WAV header
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataLength, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);
  view.setUint16(20, format, true);
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitDepth, true);
  writeString(36, 'data');
  view.setUint32(40, dataLength, true);

  // Write audio data
  const offset = 44;
  const channels = [];
  for (let i = 0; i < numChannels; i++) {
    channels.push(buffer.getChannelData(i));
  }

  let index = 0;
  for (let i = 0; i < buffer.length; i++) {
    for (let channel = 0; channel < numChannels; channel++) {
      const sample = Math.max(-1, Math.min(1, channels[channel][i]));
      const intSample = sample < 0 ? sample * 0x8000 : sample * 0x7FFF;
      view.setInt16(offset + index, intSample, true);
      index += 2;
    }
  }

  return new Blob([arrayBuffer], { type: 'audio/wav' });
};

This creates a standard WAV file with a 44-byte header followed by 16-bit PCM audio data.

Performance Considerations

Processing audio with AI in the browser is computationally intensive. Here are some key optimizations:

1. WebGPU Acceleration

ONNX Runtime Web uses WebGPU when available, providing significant speedup over CPU-only WASM:

const defaultSessionOptions = {
  executionProviders: ['webgpu', 'wasm'],
  graphOptimizationLevel: 'basic'
};

2. Overlapping Segments

The 25% overlap between segments prevents audible artifacts at segment boundaries:

const overlapWindow = new Float32Array(segmentLength);
for (let i = 0; i < segmentLength; i++) {
  const fadeIn = Math.min(i / (stride * 0.5), 1);
  const fadeOut = Math.min((segmentLength - i) / (stride * 0.5), 1);
  overlapWindow[i] = Math.min(fadeIn, fadeOut);
}

3. Dynamic Imports

We only load the heavy libraries when needed:

const [{ DemucsProcessor }, ort] = await Promise.all([
  import('demucs-web'),
  import('onnxruntime-web')
]);

4. Model Caching

The browser's HTTP cache keeps the 170MB model file, so subsequent visits can skip the download entirely (subject to cache eviction and the server's cache headers).

Browser Compatibility

Our vocal remover works in modern browsers with the following requirements:

  • Chrome/Edge 113+: Full support with WebGPU acceleration
  • Firefox 121+: Full support (WASM fallback)
  • Safari 17+: Full support (WASM fallback)

Required APIs:

  • AudioContext: Universal support
  • fetch: Universal support
  • WebGPU: Chrome/Edge (optional, falls back to WASM)
  • WebAssembly: Universal support
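
Before kicking off processing, a quick feature check can pick the right path or show a helpful error (a sketch of mine, written with `typeof` guards so it also runs outside the browser):

```javascript
// Detect which of the required/optional APIs are available in the current
// environment. WebGPU is optional (WASM fallback); the others are required.
function detectCapabilities() {
  const hasWindow = typeof window !== 'undefined';
  return {
    webgpu: typeof navigator !== 'undefined' && 'gpu' in navigator,
    wasm: typeof WebAssembly !== 'undefined',
    audioContext: hasWindow && ('AudioContext' in window || 'webkitAudioContext' in window),
    fetch: typeof fetch === 'function'
  };
}

const caps = detectCapabilities();
console.log(caps);
```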

Try It Yourself

Now that you understand how AI-powered vocal removal works, try it out on your own songs! Visit our free online vocal remover to create karaoke versions of your favorite tracks. Remember, all processing happens locally in your browser: your audio files never leave your device.

Conclusion

Building a browser-based vocal remover demonstrates the incredible capabilities of modern web technologies:

  1. AI in the browser is viable: With ONNX Runtime Web, we can run sophisticated machine learning models entirely client-side.

  2. Privacy by default: Local processing means complete data privacy without sacrificing functionality.

  3. Signal processing fundamentals: Understanding FFT, STFT, and overlap-add is crucial for audio AI applications.

  4. Performance matters: WebGPU acceleration, smart segmentation, and caching make real-time AI audio processing feasible.

The complete source code is available in our repository. Whether you're building a music app, a podcast editor, or just curious about AI audio processing, I hope this guide gives you a solid foundation to build upon.

Happy coding, and enjoy your karaoke sessions! šŸŽ¤šŸŽµ
