Have you ever wanted to isolate the vocals from a song? Maybe you're a producer looking to create a remix, a singer wanting to practice with the original backing, or just curious about how AI can separate audio sources. In this guide, I'll walk you through how we built a vocal extraction tool that runs entirely in your browser using the same AI technology as our free online vocal extractor.
Why Extract Vocals in the Browser?
Before we get into the technical details, let's talk about why you'd want to do this locally instead of using a cloud service.
Your Audio Stays Private
When you upload a song to a server, you're trusting that company with your data. For musicians working on unreleased tracks or anyone concerned about privacy, this is a big deal. Browser-based processing means your audio never leaves your device.
No Upload Delays
Uploading a high-quality audio file can take several minutes. With local processing, you skip the upload entirely. The only initial delay is downloading the AI model once (about 170MB), which gets cached for future sessions.
Works Without Internet
Once the page and model are loaded, you can extract vocals even without an internet connection. This is perfect for studio environments or when you're traveling.
Completely Free
Running AI on servers costs money. By moving the computation to your device, we can offer this tool forever without charging a dime.
The Architecture Overview
Our vocal extractor uses Demucs, a state-of-the-art music source separation model from Meta (Facebook Research), converted to run in the browser using ONNX Runtime Web. Here's how the pieces fit together: a React UI handles file selection and progress, the Web Audio API decodes the uploaded file into raw samples, a DemucsProcessor feeds those samples through the ONNX model in overlapping segments, and the isolated vocals are re-encoded as a WAV file for playback and download.
Understanding Demucs: The AI That Separates Music
Demucs is a neural network trained to separate mixed audio into four distinct sources: drums, bass, other instruments, and vocals. It was developed by Meta's AI research team and represents the cutting edge of music source separation.
How Demucs Works
The model uses a hybrid approach that processes audio in both the time domain and frequency domain:
- Time-domain processing: The raw audio waveform goes through a neural network that learns temporal patterns
- Frequency-domain processing: The audio is converted to a spectrogram (visual representation of frequencies over time) using STFT
- Hybrid combination: Both representations are fused together for the final separation
The key insight for vocal extraction is simple: Demucs gives us four separate tracks, and we just keep the vocals while discarding everything else.
Core Data Structures
Let's look at the essential data structures that power our vocal extractor:
Demucs Model Constants
export const CONSTANTS = {
  SAMPLE_RATE: 44100,
  FFT_SIZE: 4096,
  HOP_SIZE: 1024,
  TRAINING_SAMPLES: 343980,
  MODEL_SPEC_BINS: 2048,
  MODEL_SPEC_FRAMES: 336,
  SEGMENT_OVERLAP: 0.25,
  TRACKS: ['drums', 'bass', 'other', 'vocals'],
  DEFAULT_MODEL_URL: 'https://huggingface.co/timcsy/demucs-web-onnx/resolve/main/htdemucs_embedded.onnx'
};
These constants are crucial:
- TRAINING_SAMPLES (343980): Each segment is exactly 7.8 seconds at the 44.1kHz sample rate (343980 / 44100 = 7.8)
- SEGMENT_OVERLAP (0.25): 25% overlap prevents audible artifacts at segment boundaries
- TRACKS: The four outputs we get - we only keep 'vocals'
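To make these numbers concrete, here's a quick sketch (plain TypeScript, independent of the library) deriving the segment duration, the stride between segments, and the segment count for a 3-minute song:

```typescript
// Derived values from the Demucs constants (a sketch, not library code).
const SAMPLE_RATE = 44100;
const TRAINING_SAMPLES = 343980;
const SEGMENT_OVERLAP = 0.25;

// Each segment covers TRAINING_SAMPLES / SAMPLE_RATE seconds of audio.
const segmentSeconds = TRAINING_SAMPLES / SAMPLE_RATE; // 7.8 s

// With 25% overlap, consecutive segments advance by 75% of a segment.
const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP)); // 257985 samples

// Segments needed for a 3-minute song (same formula as in separate()).
const totalSamples = 180 * SAMPLE_RATE;
const numSegments = Math.ceil((totalSamples - TRAINING_SAMPLES) / stride) + 1;

console.log({ segmentSeconds, stride, numSegments });
```

So a typical 3-minute track is processed in 31 overlapping chunks, which is also why per-segment progress updates feel smooth in the UI.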
React State Management
const [file, setFile] = useState<File | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [progress, setProgress] = useState(0);
const [error, setError] = useState<string | null>(null);
const [vocalUrl, setVocalUrl] = useState<string | null>(null);
const [processTime, setProcessTime] = useState<number | null>(null);
const [elapsedTime, setElapsedTime] = useState<number>(0);
const startTimeRef = useRef<number>(0);
const timerIntervalRef = useRef<NodeJS.Timeout | null>(null);
We track file state, processing status, progress (0-1), timing information, and the resulting vocal audio URL.
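The `startTimer`/`stopTimer` helpers used during processing aren't shown in this post; here's a framework-free sketch of the shape we'd assume for them (a factory instead of refs, with `onTick` standing in for `setElapsedTime`):

```typescript
// Sketch of the elapsed-time tracking (assumed implementation, not the actual component code).
function createElapsedTimer(onTick: (seconds: number) => void) {
  let startMs = 0;
  let interval: ReturnType<typeof setInterval> | null = null;

  return {
    start() {
      startMs = Date.now();
      // Report whole elapsed seconds once per second.
      interval = setInterval(() => onTick(computeElapsedSeconds(startMs, Date.now())), 1000);
    },
    stop() {
      if (interval) {
        clearInterval(interval);
        interval = null;
      }
    },
  };
}

// Pure helper so the arithmetic is testable without real timers.
function computeElapsedSeconds(startMs: number, nowMs: number): number {
  return Math.floor((nowMs - startMs) / 1000);
}
```

In the real component the interval handle lives in `timerIntervalRef` so it survives re-renders; the idea is the same.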
The Complete Processing Flow
Here's the entire journey from uploaded file to isolated vocals:
Loading and Initializing the AI Model
The first step is loading the Demucs model using ONNX Runtime Web:
const extractVocals = async () => {
  if (!file) return;

  setIsProcessing(true);
  setError(null);
  setProgress(0);
  setProcessTime(null);
  startTimeRef.current = Date.now();
  startTimer();

  try {
    // Check WebGPU support for better performance
    const hasWebGPU = 'gpu' in navigator;
    console.log('WebGPU support:', hasWebGPU);

    // Dynamically import demucs-web and onnxruntime-web
    const [{ DemucsProcessor }, ort] = await Promise.all([
      import('demucs-web'),
      import('onnxruntime-web')
    ]);
    setProgress(0.05);

    // Initialize processor with progress tracking
    const processor = new DemucsProcessor({
      ort,
      // The processor reports progress as an object (see separate() below)
      onProgress: ({ progress: p }: { progress: number }) => {
        const validProgress = typeof p === 'number' && !isNaN(p) ? p : 0;
        const scaledProgress = 0.2 + validProgress * 0.6;
        console.log(`Demucs progress: ${(validProgress * 100).toFixed(1)}%, Overall: ${(scaledProgress * 100).toFixed(1)}%`);
        setProgress(scaledProgress);
      },
      onLog: (phase: string, msg: string) => {
        console.log(`[${phase}] ${msg}`);
        if (phase === 'inference') {
          console.log('AI inference in progress...');
        }
      }
    });

    // Load model
    setProgress(0.1);
    const { CONSTANTS } = await import('demucs-web');
    // Model is about 170MB, loading may take a while
    console.log('Loading demucs model...');
    await processor.loadModel(CONSTANTS.DEFAULT_MODEL_URL);
    setProgress(0.25);

    // ... continue with audio processing
  } catch (err) {
    console.error('Vocal extraction error:', err);
    setError(t.vocalExtractError || 'Failed to extract vocals. Please try a different file.');
  } finally {
    setIsProcessing(false);
    stopTimer();
  }
};
We use dynamic imports to load the heavy libraries only when needed. The model (~170MB) is downloaded from Hugging Face and cached by the browser for future use.
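The overall progress bar is stitched together from phases: setup and model download fill 0.0-0.2, inference fills 0.2-0.8. The mapping in the callback above can be factored out as a small pure function (a sketch of the same arithmetic):

```typescript
// Map per-phase progress (0-1) into a slice of the overall progress bar.
// Non-numeric input is treated as 0, out-of-range input is clamped.
function scaleProgress(phaseProgress: number, phaseStart: number, phaseSpan: number): number {
  const p = Number.isFinite(phaseProgress) ? Math.min(Math.max(phaseProgress, 0), 1) : 0;
  return phaseStart + p * phaseSpan;
}

// Inference at 50% maps to 0.2 + 0.5 * 0.6 = 0.5 overall.
const overall = scaleProgress(0.5, 0.2, 0.6);
console.log(overall);
```

Keeping the mapping in one place makes it easy to re-budget the phases later, say if model download turns out to dominate on slow connections.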
Decoding and Preparing Audio
Once the model is loaded, we decode the audio file:
// Decode audio file
console.log('Decoding audio file...');
const arrayBuffer = await file.arrayBuffer();
const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();

let audioBuffer;
try {
  audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
} catch (decodeErr) {
  console.error('Audio decode error:', decodeErr);
  throw new Error('Failed to decode audio file. Please try a different format (MP3, WAV).');
}

// Check audio duration
console.log('Audio duration:', audioBuffer.duration, 'seconds');

// Show warning for long audio
if (audioBuffer.duration > 60) {
  const estimatedMinutes = Math.ceil(audioBuffer.duration / 60 * 5);
  console.log(`Long audio detected. Estimated processing time: ${estimatedMinutes} minutes`);
}

// Get audio data - process full audio
console.log('Getting audio channels...');
const leftChannel = audioBuffer.getChannelData(0);
const rightChannel = audioBuffer.numberOfChannels > 1
  ? audioBuffer.getChannelData(1)
  : audioBuffer.getChannelData(0);
The decodeAudioData method handles common audio formats (MP3, WAV, OGG, M4A, FLAC), resamples the audio to the AudioContext's sample rate, and hands back raw PCM data as Float32Array channels.
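The mono fallback above (reusing channel 0 as the right channel) can be isolated into a small helper. Here's a sketch using a structural type so it works on anything AudioBuffer-shaped, browser or test mock alike:

```typescript
// Minimal structural type covering the part of AudioBuffer we need.
interface ChannelSource {
  numberOfChannels: number;
  getChannelData(channel: number): Float32Array;
}

// Return stereo channel data, duplicating the single channel for mono input.
function getStereoChannels(buffer: ChannelSource): { left: Float32Array; right: Float32Array } {
  const left = buffer.getChannelData(0);
  const right = buffer.numberOfChannels > 1 ? buffer.getChannelData(1) : left;
  return { left, right };
}

// Works on a plain mock as well as a real AudioBuffer:
const mono: ChannelSource = {
  numberOfChannels: 1,
  getChannelData: () => new Float32Array([0.1, -0.2, 0.3]),
};
const { left, right } = getStereoChannels(mono);
```

For mono files both returned references point at the same Float32Array, so no extra copy is made.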
The AI Separation Process
Here's where the AI does its magic. We call the separate method:
console.log('Separating tracks...');
let result;
try {
  result = await processor.separate(leftChannel, rightChannel);
  console.log('Separation complete:', Object.keys(result));
} catch (sepErr: any) {
  console.error('Separation error:', sepErr);
  throw new Error('Failed to separate audio tracks. Please try a different file or refresh the page.');
}
The separate method returns an object with four tracks: drums, bass, other, and vocals. Each track contains separate left and right channel data.
Extracting the Vocals Track
Unlike vocal removal where we mix the instrumental tracks, here we simply keep only the vocals:
// Create audio buffer from vocals only
const vocalBuffer = audioContext.createBuffer(
  2,
  result.vocals.left.length,
  44100
);
vocalBuffer.copyToChannel(result.vocals.left, 0);
vocalBuffer.copyToChannel(result.vocals.right, 1);
This creates a stereo audio buffer containing only the isolated vocals.
Deep Dive: The DemucsProcessor Class
Let's examine how the DemucsProcessor class works. This is the heart of the AI processing.
Model Loading with Progress Tracking
async loadModel(modelPathOrBuffer) {
  if (!this.ort) {
    throw new Error('ONNX Runtime not provided. Pass ort in constructor options.');
  }
  this.onLog('model', 'Loading model...');

  let modelBuffer;
  if (modelPathOrBuffer instanceof ArrayBuffer) {
    modelBuffer = modelPathOrBuffer;
  } else {
    const response = await fetch(modelPathOrBuffer || this.modelPath);

    // Check if we can track progress
    const contentLength = response.headers.get('Content-Length');
    if (contentLength && response.body) {
      const totalSize = parseInt(contentLength, 10);
      const reader = response.body.getReader();
      const chunks = [];
      let loadedSize = 0;

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        chunks.push(value);
        loadedSize += value.length;
        this.onDownloadProgress(loadedSize, totalSize);
      }

      // Combine chunks into single ArrayBuffer
      const combined = new Uint8Array(loadedSize);
      let offset = 0;
      for (const chunk of chunks) {
        combined.set(chunk, offset);
        offset += chunk.length;
      }
      modelBuffer = combined.buffer;
    } else {
      // Fallback: no progress tracking
      modelBuffer = await response.arrayBuffer();
    }
  }

  const defaultSessionOptions = {
    executionProviders: ['webgpu', 'wasm'],
    graphOptimizationLevel: 'basic'
  };

  this.session = await this.ort.InferenceSession.create(modelBuffer, {
    ...defaultSessionOptions,
    ...this.sessionOptions
  });

  this.onLog('model', 'Model loaded successfully');
  return this.session;
}
The model loading uses the Streams API to track download progress, giving users visual feedback during the ~170MB download. It configures ONNX Runtime to use WebGPU for GPU acceleration (falling back to WASM if unavailable).
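The chunk-combining step after the reader loop is a common Streams API pattern worth having as a standalone helper. Here's a sketch (not the library's exported API) of the same logic:

```typescript
// Concatenate streamed Uint8Array chunks into one contiguous buffer,
// as done after the reader loop in loadModel().
function concatChunks(chunks: Uint8Array[]): Uint8Array {
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const combined = new Uint8Array(total);
  let offset = 0;
  for (const chunk of chunks) {
    combined.set(chunk, offset);
    offset += chunk.length;
  }
  return combined;
}

const joined = concatChunks([new Uint8Array([1, 2]), new Uint8Array([3]), new Uint8Array([4, 5])]);
console.log(joined);
```

Note that this briefly doubles peak memory (chunks plus the combined copy), which matters for a ~170MB model on low-memory devices.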
The Separation Algorithm
The separate method processes audio in overlapping segments:
async separate(leftChannel, rightChannel) {
  if (!this.session) {
    throw new Error('Model not loaded. Call loadModel() first.');
  }

  const totalSamples = leftChannel.length;
  const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP));
  const numSegments = Math.ceil((totalSamples - TRAINING_SAMPLES) / stride) + 1;

  const outputs = TRACKS.map(() => ({
    left: new Float32Array(totalSamples),
    right: new Float32Array(totalSamples)
  }));
  const weights = new Float32Array(totalSamples);

  let segmentIdx = 0;
  for (let start = 0; start < totalSamples; start += stride) {
    const end = Math.min(start + TRAINING_SAMPLES, totalSamples);
    const segmentLength = end - start;

    const segLeft = new Float32Array(TRAINING_SAMPLES);
    const segRight = new Float32Array(TRAINING_SAMPLES);
    for (let i = 0; i < segmentLength; i++) {
      segLeft[i] = leftChannel[start + i];
      segRight[i] = rightChannel[start + i];
    }

    const input = prepareModelInput(segLeft, segRight);
    const waveformTensor = new this.ort.Tensor('float32', input.waveform, [1, 2, TRAINING_SAMPLES]);
    const magSpecTensor = new this.ort.Tensor('float32', input.magSpec, [1, 4, MODEL_SPEC_BINS, MODEL_SPEC_FRAMES]);

    const feeds = {};
    feeds[this.session.inputNames[0]] = waveformTensor;
    if (this.session.inputNames.length > 1) {
      feeds[this.session.inputNames[1]] = magSpecTensor;
    }

    const inferResults = await this.session.run(feeds);
    // ... process outputs

    segmentIdx++;
    this.onProgress({
      progress: segmentIdx / numSegments,
      currentSegment: segmentIdx,
      totalSegments: numSegments
    });
  }

  // Normalize by weights
  for (let t = 0; t < TRACKS.length; t++) {
    for (let i = 0; i < totalSamples; i++) {
      if (weights[i] > 0) {
        outputs[t].left[i] /= weights[i];
        outputs[t].right[i] /= weights[i];
      }
    }
  }

  return {
    drums: outputs[0],
    bass: outputs[1],
    other: outputs[2],
    vocals: outputs[3]
  };
}
Key aspects of this algorithm:
- Segmentation: Long audio is split into overlapping segments (343,980 samples each with 25% overlap)
- STFT Preparation: Each segment is converted to a spectrogram
- ONNX Inference: The model runs on both time-domain and frequency-domain inputs
- Overlap-Add: Results are combined using weighted overlap-add to prevent artifacts
- Progress Tracking: Updates the UI after each segment
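The overlap-add bookkeeping is easiest to see with toy numbers. Here's a sketch using 10-sample segments with 25% overlap, standing in for the real 343,980-sample segments (the elided `// ... process outputs` step would also accumulate windowed model output alongside these weights):

```typescript
// Accumulate overlap-add weights for segmented processing (toy sizes).
function overlapAddWeights(totalSamples: number, segLen: number, overlap: number): Float32Array {
  const stride = Math.floor(segLen * (1 - overlap));
  const weights = new Float32Array(totalSamples);
  for (let start = 0; start < totalSamples; start += stride) {
    const end = Math.min(start + segLen, totalSamples);
    // The real code would also accumulate (model output * window) here.
    for (let i = start; i < end; i++) weights[i] += 1;
  }
  return weights;
}

// 10-sample segments, 25% overlap => stride 7; samples 7-9 of the first
// segment are covered twice, and dividing by the weight averages them.
const w = overlapAddWeights(20, 10, 0.25);
console.log(Array.from(w));
```

Dividing each output sample by its accumulated weight (the "Normalize by weights" loop) averages the overlapping predictions, which is what smooths over segment boundaries.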
Signal Processing: FFT and STFT
The vocal extractor relies on Fast Fourier Transform (FFT) operations. Here's the core FFT implementation:
export function fft(realOut, imagOut, realIn, n) {
  const bits = Math.log2(n) | 0;
  const twiddles = getFFTTwiddles(n);

  // Bit-reverse permutation
  for (let i = 0; i < n; i++) {
    const j = bitReverse(i, bits);
    realOut[i] = realIn[j];
    imagOut[i] = 0;
  }

  // Cooley-Tukey butterfly operations
  for (let size = 2; size <= n; size *= 2) {
    const halfSize = size / 2;
    const step = n / size;
    for (let i = 0; i < n; i += size) {
      for (let j = 0; j < halfSize; j++) {
        const k = j * step;
        const tReal = twiddles.real[k];
        const tImag = twiddles.imag[k];
        const idx1 = i + j;
        const idx2 = i + j + halfSize;
        const eReal = realOut[idx1];
        const eImag = imagOut[idx1];
        const oReal = realOut[idx2] * tReal - imagOut[idx2] * tImag;
        const oImag = realOut[idx2] * tImag + imagOut[idx2] * tReal;
        realOut[idx1] = eReal + oReal;
        imagOut[idx1] = eImag + oImag;
        realOut[idx2] = eReal - oReal;
        imagOut[idx2] = eImag - oImag;
      }
    }
  }
}
This implements the Cooley-Tukey radix-2 FFT algorithm. Note that it assumes real-valued input (the imaginary part is zeroed during the bit-reverse permutation), which is exactly what we have with audio samples.
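The helpers `getFFTTwiddles` and `bitReverse` aren't shown above. To sanity-check the structure, here's a self-contained sketch with those helpers inlined (slower than the cached-twiddle original, same math), verified against a known transform:

```typescript
// Self-contained radix-2 Cooley-Tukey FFT for real input (imaginary part starts at zero).
function fftReal(realIn: Float32Array): { real: Float32Array; imag: Float32Array } {
  const n = realIn.length;
  const bits = Math.log2(n) | 0;
  const real = new Float32Array(n);
  const imag = new Float32Array(n);

  // Bit-reverse permutation (bitReverse inlined).
  for (let i = 0; i < n; i++) {
    let j = 0;
    for (let b = 0; b < bits; b++) j = (j << 1) | ((i >> b) & 1);
    real[i] = realIn[j];
  }

  // Butterflies with twiddles e^(-2*pi*i*k/n) computed on the fly.
  for (let size = 2; size <= n; size *= 2) {
    const half = size / 2;
    const step = n / size;
    for (let i = 0; i < n; i += size) {
      for (let j = 0; j < half; j++) {
        const angle = (-2 * Math.PI * j * step) / n;
        const tReal = Math.cos(angle);
        const tImag = Math.sin(angle);
        const a = i + j;
        const b = i + j + half;
        const oReal = real[b] * tReal - imag[b] * tImag;
        const oImag = real[b] * tImag + imag[b] * tReal;
        real[b] = real[a] - oReal;
        imag[b] = imag[a] - oImag;
        real[a] = real[a] + oReal;
        imag[a] = imag[a] + oImag;
      }
    }
  }
  return { real, imag };
}

// A unit impulse transforms to a flat spectrum: FFT([1,0,0,0]) = [1,1,1,1].
const { real, imag } = fftReal(new Float32Array([1, 0, 0, 0]));
```

The impulse-to-flat-spectrum identity is a handy smoke test for any FFT implementation.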
The STFT (Short-Time Fourier Transform) applies FFT to overlapping windows:
export function stft(signal, fftSize, hopSize) {
  const numFrames = Math.floor((signal.length - fftSize) / hopSize) + 1;
  const numBins = fftSize / 2 + 1;
  const window = getHannWindow(fftSize);
  const scale = 1.0 / Math.sqrt(fftSize);

  const specReal = new Float32Array(numFrames * numBins);
  const specImag = new Float32Array(numFrames * numBins);
  const frameReal = new Float32Array(fftSize);
  const frameImag = new Float32Array(fftSize);
  const windowedFrame = new Float32Array(fftSize);

  for (let frame = 0; frame < numFrames; frame++) {
    const start = frame * hopSize;
    for (let i = 0; i < fftSize; i++) {
      windowedFrame[i] = signal[start + i] * window[i];
    }
    fft(frameReal, frameImag, windowedFrame, fftSize);

    const outOffset = frame * numBins;
    for (let k = 0; k < numBins; k++) {
      specReal[outOffset + k] = frameReal[k] * scale;
      specImag[outOffset + k] = frameImag[k] * scale;
    }
  }

  return { real: specReal, imag: specImag, numFrames, numBins };
}
Converting to WAV Format
After extracting the vocals, we convert to WAV:
const audioBufferToWav = async (buffer: AudioBuffer): Promise<Blob> => {
  const numChannels = buffer.numberOfChannels;
  const sampleRate = buffer.sampleRate;
  const format = 1; // PCM
  const bitDepth = 16;

  const bytesPerSample = bitDepth / 8;
  const blockAlign = numChannels * bytesPerSample;
  const dataLength = buffer.length * numChannels * bytesPerSample;
  const bufferLength = 44 + dataLength;

  const arrayBuffer = new ArrayBuffer(bufferLength);
  const view = new DataView(arrayBuffer);

  // Write WAV header
  const writeString = (offset: number, string: string) => {
    for (let i = 0; i < string.length; i++) {
      view.setUint8(offset + i, string.charCodeAt(i));
    }
  };

  writeString(0, 'RIFF');
  view.setUint32(4, 36 + dataLength, true);
  writeString(8, 'WAVE');
  writeString(12, 'fmt ');
  view.setUint32(16, 16, true);
  view.setUint16(20, format, true);
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitDepth, true);
  writeString(36, 'data');
  view.setUint32(40, dataLength, true);

  // Write audio data
  const offset = 44;
  const channels = [];
  for (let i = 0; i < numChannels; i++) {
    channels.push(buffer.getChannelData(i));
  }

  let index = 0;
  for (let i = 0; i < buffer.length; i++) {
    for (let channel = 0; channel < numChannels; channel++) {
      const sample = Math.max(-1, Math.min(1, channels[channel][i]));
      const intSample = sample < 0 ? sample * 0x8000 : sample * 0x7FFF;
      view.setInt16(offset + index, intSample, true);
      index += 2;
    }
  }

  return new Blob([arrayBuffer], { type: 'audio/wav' });
};
Performance Optimizations
WebGPU Acceleration
ONNX Runtime Web uses WebGPU when available:
const defaultSessionOptions = {
  executionProviders: ['webgpu', 'wasm'],
  graphOptimizationLevel: 'basic'
};
Overlapping Segments
The 25% overlap prevents artifacts:
const overlapWindow = new Float32Array(segmentLength);
for (let i = 0; i < segmentLength; i++) {
  const fadeIn = Math.min(i / (stride * 0.5), 1);
  const fadeOut = Math.min((segmentLength - i) / (stride * 0.5), 1);
  overlapWindow[i] = Math.min(fadeIn, fadeOut);
}
Dynamic Imports
Only load heavy libraries when needed:
const [{ DemucsProcessor }, ort] = await Promise.all([
  import('demucs-web'),
  import('onnxruntime-web')
]);
Browser Compatibility
Our vocal extractor works in modern browsers:
- Chrome/Edge 113+: Full support with WebGPU
- Firefox 121+: Full support (WASM fallback)
- Safari 17+: Full support (WASM fallback)
Required APIs:
- AudioContext: Universal support
- fetch: Universal support
- WebGPU: Chrome/Edge (optional)
- WebAssembly: Universal support
Try It Yourself
Ready to extract vocals from your favorite songs? Visit our free online vocal extractor and give it a try. All processing happens locally - your files never leave your device.
Conclusion
Building a browser-based vocal extractor shows what's possible with modern web technologies:
AI in the browser works: ONNX Runtime Web brings powerful ML models to the client side.
Privacy by design: Local processing means your data stays yours.
Signal processing fundamentals: FFT, STFT, and overlap-add are essential for audio AI.
Performance is key: WebGPU, smart segmentation, and caching make it practical.
The complete source is available in our repository. Whether you're building a music app, learning about AI, or just love karaoke, I hope this guide helps you understand how vocal extraction works.
Happy extracting! 🎤🎵

