Ever wanted to create karaoke versions of your favorite songs without uploading them to some mysterious server? In this guide, I'll show you how we built a vocal remover that runs entirely in your browser using AI, no server required. This is the same technology powering our free online vocal remover tool.
Why Browser-Based AI Audio Processing?
Before we dive into the code, let's talk about why you'd want to process audio with AI directly in the browser instead of using a cloud service.
Complete Privacy
When you upload a song to a server, you're sending your personal data (or copyrighted material) to a third party. With browser-based processing, your audio never leaves your device. This is especially important for musicians working on unreleased tracks or anyone concerned about privacy.
No Upload Wait Times
Uploading a 5-minute song can take several minutes depending on your connection. With local processing, you skip the upload entirely. The only delay is downloading the AI model once (about 170MB), which gets cached for future use.
Works Offline
Once the page and model are loaded, you can process audio even without an internet connection. This makes it reliable for studio environments or locations with poor connectivity.
Cost-Free Forever
Running AI inference on servers costs money. By moving the computation to the user's device, we can offer this tool completely free without worrying about cloud computing bills.
The Architecture: How It All Fits Together
Our vocal remover uses a sophisticated AI model called Demucs, which has been converted to run in the browser using ONNX Runtime Web. Here's the high-level architecture:
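In outline, a file flows through these stages:

Audio file (MP3/WAV/OGG)
→ decode to raw stereo PCM with the Web Audio API
→ split into overlapping ~7.8-second segments
→ run each segment through Demucs via ONNX Runtime Web (WebGPU, or WASM as fallback)
→ receive four stems: drums, bass, other, vocals
→ mix drums + bass + other, discarding vocals
→ encode the result as a 16-bit WAV for playback and download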
Understanding Demucs: The AI Behind the Magic
Demucs is a state-of-the-art music source separation model developed by Facebook Research. It can separate a song into four distinct tracks: drums, bass, other (instruments), and vocals.
How Demucs Works
The model uses a hybrid approach combining time-domain and frequency-domain processing:
- Time-domain processing: The raw audio waveform is fed through a neural network
- Frequency-domain processing: The audio is converted to a spectrogram using STFT (Short-Time Fourier Transform)
- Hybrid output: Both representations are combined for the final separation
Here's the key insight: to remove vocals, we simply take the drums, bass, and other tracks, then mix them together while discarding the vocals track.
Core Data Structures
Let's look at the important data structures that power our vocal remover:
Demucs Constants
export const CONSTANTS = {
SAMPLE_RATE: 44100,
FFT_SIZE: 4096,
HOP_SIZE: 1024,
TRAINING_SAMPLES: 343980,
MODEL_SPEC_BINS: 2048,
MODEL_SPEC_FRAMES: 336,
SEGMENT_OVERLAP: 0.25,
TRACKS: ['drums', 'bass', 'other', 'vocals'],
DEFAULT_MODEL_URL: 'https://huggingface.co/timcsy/demucs-web-onnx/resolve/main/htdemucs_embedded.onnx'
};
These constants define the model's expected input format:
- TRAINING_SAMPLES (343980): Each segment processed is about 7.8 seconds at 44.1kHz
- SEGMENT_OVERLAP (0.25): Segments overlap by 25% to avoid artifacts at boundaries
- TRACKS: The four outputs we get from the model
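A quick sanity check on those numbers, using the same stride formula that separate() applies later:

import { CONSTANTS } from 'demucs-web';

// 343980 samples / 44100 samples per second = exactly 7.8 seconds per segment
const segmentSeconds = CONSTANTS.TRAINING_SAMPLES / CONSTANTS.SAMPLE_RATE;

// With 25% overlap, consecutive segments start 75% of a segment apart
const stride = Math.floor(CONSTANTS.TRAINING_SAMPLES * (1 - CONSTANTS.SEGMENT_OVERLAP)); // 257985

console.log(`${segmentSeconds}s per segment, stride of ${stride} samples`);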
State Management
import { useState, useRef } from 'react';

const [file, setFile] = useState<File | null>(null);
const [isProcessing, setIsProcessing] = useState(false);
const [progress, setProgress] = useState(0);
const [error, setError] = useState<string | null>(null);
const [instrumentalUrl, setInstrumentalUrl] = useState<string | null>(null);
const [processTime, setProcessTime] = useState<number | null>(null);
const [elapsedTime, setElapsedTime] = useState<number>(0);
const startTimeRef = useRef<number>(0);
const timerIntervalRef = useRef<NodeJS.Timeout | null>(null);
We track processing state, progress (0-1), timing information, and the resulting instrumental audio URL.
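The startTimer and stopTimer helpers referenced below aren't shown in the original snippet; here's a minimal sketch, assuming they just poll startTimeRef to drive the elapsed-time display:

const startTimer = () => {
  timerIntervalRef.current = setInterval(() => {
    setElapsedTime((Date.now() - startTimeRef.current) / 1000);
  }, 100);
};

const stopTimer = () => {
  if (timerIntervalRef.current) {
    clearInterval(timerIntervalRef.current);
    timerIntervalRef.current = null;
  }
};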
The Complete Processing Flow
Here's the entire flow from file upload to instrumental output:
Loading and Initializing the AI Model
The first step is loading the Demucs model using ONNX Runtime Web:
const removeVocals = async () => {
setIsProcessing(true);
setProgress(0);
startTimeRef.current = Date.now();
startTimer();
try {
// Dynamically import demucs-web and onnxruntime-web
const [{ DemucsProcessor, CONSTANTS }, ort] = await Promise.all([
import('demucs-web'),
import('onnxruntime-web')
]);
setProgress(0.05);
// Initialize processor
const processor = new DemucsProcessor({
ort,
onProgress: (p: number | { progress: number }) => {
// The processor reports separation progress as an object
// ({ progress, currentSegment, totalSegments }); unwrap it, but
// accept a bare number too for safety
const value = typeof p === 'number' ? p : p.progress;
const validProgress = typeof value === 'number' && !isNaN(value) ? value : 0;
setProgress(0.2 + validProgress * 0.6);
},
onLog: (phase: string, msg: string) => console.log(`[${phase}] ${msg}`)
});
// Load model (about 170MB, so the first load may take a while)
setProgress(0.1);
console.log('Loading demucs model...');
await processor.loadModel(CONSTANTS.DEFAULT_MODEL_URL);
setProgress(0.25);
// ... continue processing
} catch (err) {
console.error('Vocal removal error:', err);
setError('Failed to remove vocals. Please try a different file.');
} finally {
setIsProcessing(false);
stopTimer();
}
};
We use dynamic imports to avoid loading the large libraries until they're needed. The model (~170MB) is downloaded from Hugging Face and cached by the browser.
Decoding and Preparing Audio
Once the model is loaded, we decode the audio file:
// Decode audio file
console.log('Decoding audio file...');
const arrayBuffer = await file.arrayBuffer();
const audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();
let audioBuffer;
try {
audioBuffer = await audioContext.decodeAudioData(arrayBuffer);
} catch (decodeErr) {
console.error('Audio decode error:', decodeErr);
throw new Error('Failed to decode audio file. Please try a different format (MP3, WAV).');
}
// Check audio duration
console.log('Audio duration:', audioBuffer.duration, 'seconds');
// Get audio data
console.log('Getting audio channels...');
const leftChannel = audioBuffer.getChannelData(0);
const rightChannel = audioBuffer.numberOfChannels > 1
? audioBuffer.getChannelData(1)
: audioBuffer.getChannelData(0);
The decodeAudioData method converts various audio formats (MP3, WAV, OGG, etc.) into raw PCM data that we can process.
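One subtlety the snippet above doesn't handle: decodeAudioData resamples to the AudioContext's own sample rate, which on many devices defaults to 48kHz rather than the 44.1kHz the model was trained on. If you want to pin the rate, one option (an addition on my part, not in the original code) is to request it in the constructor:

// Pin decoding to 44.1 kHz so the data matches CONSTANTS.SAMPLE_RATE;
// otherwise decodeAudioData resamples to the device's native rate.
const audioContext = new AudioContext({ sampleRate: 44100 });
const audioBuffer = await audioContext.decodeAudioData(arrayBuffer);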
The Separation Process
Here's where the magic happens. We call the separate method on our processor:
console.log('Separating tracks...');
let result;
try {
result = await processor.separate(leftChannel, rightChannel);
console.log('Separation complete:', Object.keys(result));
} catch (sepErr: any) {
console.error('Separation error:', sepErr);
throw new Error('Failed to separate audio tracks. Please try a different file or refresh the page.');
}
The separate method returns an object with four tracks: drums, bass, other, and vocals. Each track contains left and right channel data.
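The excerpt doesn't declare a return type, but the shape implied by how result is used looks like this:

interface StereoTrack {
  left: Float32Array;
  right: Float32Array;
}

interface SeparationResult {
  drums: StereoTrack;
  bass: StereoTrack;
  other: StereoTrack;
  vocals: StereoTrack;
}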
Mixing the Instrumental Track
To create the instrumental (karaoke) version, we simply mix the non-vocal tracks:
// Mix drums, bass, and other (remove vocals)
const mixedLength = result.drums.left.length;
const mixedLeft = new Float32Array(mixedLength);
const mixedRight = new Float32Array(mixedLength);
for (let i = 0; i < mixedLength; i++) {
mixedLeft[i] = result.drums.left[i] + result.bass.left[i] + result.other.left[i];
mixedRight[i] = result.drums.right[i] + result.bass.right[i] + result.other.right[i];
}
// Create audio buffer
const instrumentalBuffer = audioContext.createBuffer(2, mixedLength, 44100);
instrumentalBuffer.copyToChannel(mixedLeft, 0);
instrumentalBuffer.copyToChannel(mixedRight, 1);
This is the key insight: Demucs separates everything, and we just recombine what we want to keep.
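The same recombination trick gives you other mixes for free. For example, an acapella is just the vocals stem on its own, and a drumless practice track is everything except drums:

// Acapella: the vocals stem is already a standalone track
const acapella = result.vocals;

// Drumless mix: bass + other + vocals (left channel shown; right is analogous)
const drumlessLeft = new Float32Array(mixedLength);
for (let i = 0; i < mixedLength; i++) {
  drumlessLeft[i] = result.bass.left[i] + result.other.left[i] + result.vocals.left[i];
}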
Deep Dive: The DemucsProcessor Class
Let's look at how the DemucsProcessor class works internally. This is the core of the AI processing:
Model Loading with Progress Tracking
async loadModel(modelPathOrBuffer) {
if (!this.ort) {
throw new Error('ONNX Runtime not provided. Pass ort in constructor options.');
}
this.onLog('model', 'Loading model...');
let modelBuffer;
if (modelPathOrBuffer instanceof ArrayBuffer) {
modelBuffer = modelPathOrBuffer;
} else {
const response = await fetch(modelPathOrBuffer || this.modelPath);
// Check if we can track progress
const contentLength = response.headers.get('Content-Length');
if (contentLength && response.body) {
const totalSize = parseInt(contentLength, 10);
const reader = response.body.getReader();
const chunks = [];
let loadedSize = 0;
while (true) {
const { done, value } = await reader.read();
if (done) break;
chunks.push(value);
loadedSize += value.length;
this.onDownloadProgress(loadedSize, totalSize);
}
// Combine chunks into single ArrayBuffer
const combined = new Uint8Array(loadedSize);
let offset = 0;
for (const chunk of chunks) {
combined.set(chunk, offset);
offset += chunk.length;
}
modelBuffer = combined.buffer;
} else {
// Fallback: no progress tracking
modelBuffer = await response.arrayBuffer();
}
}
const defaultSessionOptions = {
executionProviders: ['webgpu', 'wasm'],
graphOptimizationLevel: 'basic'
};
this.session = await this.ort.InferenceSession.create(modelBuffer, {
...defaultSessionOptions,
...this.sessionOptions
});
this.onLog('model', 'Model loaded successfully');
return this.session;
}
The model loading uses the Streams API to track download progress, giving users feedback during the ~170MB download. It also configures ONNX Runtime to use WebGPU for acceleration (falling back to WASM if unavailable).
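The excerpt doesn't show the constructor, but since loadModel calls this.onDownloadProgress, it's presumably accepted as a constructor option alongside onProgress and onLog. Assuming so, wiring download progress into the UI might look like this (treat the option name as an assumption):

const processor = new DemucsProcessor({
  ort,
  // Assumed constructor option, mirroring onProgress/onLog
  onDownloadProgress: (loaded: number, total: number) => {
    console.log(`Model download: ${((loaded / total) * 100).toFixed(1)}%`);
  },
  onLog: (phase: string, msg: string) => console.log(`[${phase}] ${msg}`)
});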
The Separation Algorithm
The separate method processes audio in overlapping segments:
async separate(leftChannel, rightChannel) {
if (!this.session) {
throw new Error('Model not loaded. Call loadModel() first.');
}
const totalSamples = leftChannel.length;
const stride = Math.floor(TRAINING_SAMPLES * (1 - SEGMENT_OVERLAP));
const numSegments = Math.ceil((totalSamples - TRAINING_SAMPLES) / stride) + 1;
const outputs = TRACKS.map(() => ({
left: new Float32Array(totalSamples),
right: new Float32Array(totalSamples)
}));
const weights = new Float32Array(totalSamples);
let segmentIdx = 0;
for (let start = 0; start < totalSamples; start += stride) {
const end = Math.min(start + TRAINING_SAMPLES, totalSamples);
const segmentLength = end - start;
const segLeft = new Float32Array(TRAINING_SAMPLES);
const segRight = new Float32Array(TRAINING_SAMPLES);
for (let i = 0; i < segmentLength; i++) {
segLeft[i] = leftChannel[start + i];
segRight[i] = rightChannel[start + i];
}
const input = prepareModelInput(segLeft, segRight);
const waveformTensor = new this.ort.Tensor('float32', input.waveform, [1, 2, TRAINING_SAMPLES]);
const magSpecTensor = new this.ort.Tensor('float32', input.magSpec, [1, 4, MODEL_SPEC_BINS, MODEL_SPEC_FRAMES]);
const feeds = {};
feeds[this.session.inputNames[0]] = waveformTensor;
if (this.session.inputNames.length > 1) {
feeds[this.session.inputNames[1]] = magSpecTensor;
}
const inferResults = await this.session.run(feeds);
// ... process outputs
segmentIdx++;
this.onProgress({
progress: segmentIdx / numSegments,
currentSegment: segmentIdx,
totalSegments: numSegments
});
}
// Normalize by weights
for (let t = 0; t < TRACKS.length; t++) {
for (let i = 0; i < totalSamples; i++) {
if (weights[i] > 0) {
outputs[t].left[i] /= weights[i];
outputs[t].right[i] /= weights[i];
}
}
}
return {
drums: outputs[0],
bass: outputs[1],
other: outputs[2],
vocals: outputs[3]
};
}
Key aspects of this algorithm:
- Segmentation: Long audio is split into overlapping segments (343,980 samples each, with 25% overlap)
- STFT preparation: Each segment is converted to a spectrogram using prepareModelInput
- ONNX inference: The model runs inference on both time-domain and frequency-domain inputs
- Overlap-add: Results are combined using weighted overlap-add to avoid boundary artifacts (see the sketch after this list)
- Progress tracking: The onProgress callback updates the UI after each segment
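The "... process outputs" step elided in the code above is the weighted overlap-add accumulation. Here's a sketch of what it might look like, assuming the model's single output tensor is laid out as [1, 4 tracks, 2 channels, TRAINING_SAMPLES] (the excerpt doesn't show the actual layout, so treat that as an assumption):

// Accumulate each track's segment into the output, weighted by the
// triangular cross-fade window from the Performance section below,
// and record the total weight per sample for later normalization.
const outData = inferResults[this.session.outputNames[0]].data;
for (let t = 0; t < TRACKS.length; t++) {
  const trackOffset = t * 2 * TRAINING_SAMPLES;
  for (let i = 0; i < segmentLength; i++) {
    const w = overlapWindow[i];
    outputs[t].left[start + i] += outData[trackOffset + i] * w;
    outputs[t].right[start + i] += outData[trackOffset + TRAINING_SAMPLES + i] * w;
  }
}
// Weights accumulate once per segment, not once per track
for (let i = 0; i < segmentLength; i++) {
  weights[start + i] += overlapWindow[i];
}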
Signal Processing: FFT and STFT
The vocal remover relies heavily on Fast Fourier Transform (FFT) operations. Here's the FFT implementation:
export function fft(realOut, imagOut, realIn, n) {
const bits = Math.log2(n) | 0;
const twiddles = getFFTTwiddles(n);
// Bit-reverse permutation
for (let i = 0; i < n; i++) {
const j = bitReverse(i, bits);
realOut[i] = realIn[j];
imagOut[i] = 0;
}
// Cooley-Tukey butterfly operations
for (let size = 2; size <= n; size *= 2) {
const halfSize = size / 2;
const step = n / size;
for (let i = 0; i < n; i += size) {
for (let j = 0; j < halfSize; j++) {
const k = j * step;
const tReal = twiddles.real[k];
const tImag = twiddles.imag[k];
const idx1 = i + j;
const idx2 = i + j + halfSize;
const eReal = realOut[idx1];
const eImag = imagOut[idx1];
const oReal = realOut[idx2] * tReal - imagOut[idx2] * tImag;
const oImag = realOut[idx2] * tImag + imagOut[idx2] * tReal;
realOut[idx1] = eReal + oReal;
imagOut[idx1] = eImag + oImag;
realOut[idx2] = eReal - oReal;
imagOut[idx2] = eImag - oImag;
}
}
}
}
This implements the Cooley-Tukey radix-2 FFT algorithm, which converts time-domain audio to frequency-domain representation.
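The bitReverse and getFFTTwiddles helpers referenced above aren't shown in the excerpt. Here's my reconstruction, matched to how fft indexes the twiddle table (a table of n/2 complex roots of unity):

// Reverse the lowest `bits` bits of x (standard radix-2 reordering)
function bitReverse(x, bits) {
  let y = 0;
  for (let i = 0; i < bits; i++) {
    y = (y << 1) | (x & 1);
    x >>= 1;
  }
  return y;
}

// Cached twiddle factors e^(-2πik/n) for k = 0 .. n/2 - 1
const twiddleCache = new Map();
function getFFTTwiddles(n) {
  if (!twiddleCache.has(n)) {
    const real = new Float32Array(n / 2);
    const imag = new Float32Array(n / 2);
    for (let k = 0; k < n / 2; k++) {
      const angle = (-2 * Math.PI * k) / n;
      real[k] = Math.cos(angle);
      imag[k] = Math.sin(angle);
    }
    twiddleCache.set(n, { real, imag });
  }
  return twiddleCache.get(n);
}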
The STFT (Short-Time Fourier Transform) applies FFT to overlapping windows:
export function stft(signal, fftSize, hopSize) {
const numFrames = Math.floor((signal.length - fftSize) / hopSize) + 1;
const numBins = fftSize / 2 + 1;
const window = getHannWindow(fftSize);
const scale = 1.0 / Math.sqrt(fftSize);
const specReal = new Float32Array(numFrames * numBins);
const specImag = new Float32Array(numFrames * numBins);
const frameReal = new Float32Array(fftSize);
const frameImag = new Float32Array(fftSize);
const windowedFrame = new Float32Array(fftSize);
for (let frame = 0; frame < numFrames; frame++) {
const start = frame * hopSize;
for (let i = 0; i < fftSize; i++) {
windowedFrame[i] = signal[start + i] * window[i];
}
fft(frameReal, frameImag, windowedFrame, fftSize);
const outOffset = frame * numBins;
for (let k = 0; k < numBins; k++) {
specReal[outOffset + k] = frameReal[k] * scale;
specImag[outOffset + k] = frameImag[k] * scale;
}
}
return { real: specReal, imag: specImag, numFrames, numBins };
}
The Hann window reduces spectral leakage, and the hop size (1024 samples) determines the time resolution.
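getHannWindow is likewise not shown; a typical periodic Hann window, cached per size (my sketch; the original may differ in details):

const hannCache = new Map();
function getHannWindow(size) {
  if (!hannCache.has(size)) {
    const w = new Float32Array(size);
    for (let i = 0; i < size; i++) {
      // Periodic Hann, the usual choice for STFT analysis
      w[i] = 0.5 * (1 - Math.cos((2 * Math.PI * i) / size));
    }
    hannCache.set(size, w);
  }
  return hannCache.get(size);
}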
Converting Back to WAV
After processing, we convert the instrumental audio back to WAV format:
const audioBufferToWav = async (buffer: AudioBuffer): Promise<Blob> => {
const numChannels = buffer.numberOfChannels;
const sampleRate = buffer.sampleRate;
const format = 1; // PCM
const bitDepth = 16;
const bytesPerSample = bitDepth / 8;
const blockAlign = numChannels * bytesPerSample;
const dataLength = buffer.length * numChannels * bytesPerSample;
const bufferLength = 44 + dataLength;
const arrayBuffer = new ArrayBuffer(bufferLength);
const view = new DataView(arrayBuffer);
// Write WAV header
const writeString = (offset: number, string: string) => {
for (let i = 0; i < string.length; i++) {
view.setUint8(offset + i, string.charCodeAt(i));
}
};
writeString(0, 'RIFF');
view.setUint32(4, 36 + dataLength, true);
writeString(8, 'WAVE');
writeString(12, 'fmt ');
view.setUint32(16, 16, true);
view.setUint16(20, format, true);
view.setUint16(22, numChannels, true);
view.setUint32(24, sampleRate, true);
view.setUint32(28, sampleRate * blockAlign, true);
view.setUint16(32, blockAlign, true);
view.setUint16(34, bitDepth, true);
writeString(36, 'data');
view.setUint32(40, dataLength, true);
// Write audio data
const offset = 44;
const channels = [];
for (let i = 0; i < numChannels; i++) {
channels.push(buffer.getChannelData(i));
}
let index = 0;
for (let i = 0; i < buffer.length; i++) {
for (let channel = 0; channel < numChannels; channel++) {
const sample = Math.max(-1, Math.min(1, channels[channel][i]));
const intSample = sample < 0 ? sample * 0x8000 : sample * 0x7FFF;
view.setInt16(offset + index, intSample, true);
index += 2;
}
}
return new Blob([arrayBuffer], { type: 'audio/wav' });
};
This creates a standard WAV file with a 44-byte header followed by 16-bit PCM audio data.
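Hooking the encoder up to the React state from earlier is just a Blob URL away (roughly; the original component presumably binds instrumentalUrl to an audio player and download link):

const wavBlob = await audioBufferToWav(instrumentalBuffer);
setInstrumentalUrl(URL.createObjectURL(wavBlob));
setProcessTime((Date.now() - startTimeRef.current) / 1000);
setProgress(1);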
Performance Considerations
Processing audio with AI in the browser is computationally intensive. Here are some key optimizations:
1. WebGPU Acceleration
ONNX Runtime Web uses WebGPU when available, providing significant speedup over CPU-only WASM:
const defaultSessionOptions = {
executionProviders: ['webgpu', 'wasm'],
graphOptimizationLevel: 'basic'
};
2. Overlapping Segments
The 25% overlap between segments prevents audible artifacts at segment boundaries:
const overlapWindow = new Float32Array(segmentLength);
for (let i = 0; i < segmentLength; i++) {
const fadeIn = Math.min(i / (stride * 0.5), 1);
const fadeOut = Math.min((segmentLength - i) / (stride * 0.5), 1);
overlapWindow[i] = Math.min(fadeIn, fadeOut);
}
3. Dynamic Imports
We only load the heavy libraries when needed:
const [{ DemucsProcessor }, ort] = await Promise.all([
import('demucs-web'),
import('onnxruntime-web')
]);
4. Model Caching
The browser's HTTP cache keeps the ~170MB model file after the first download, so subsequent uses can skip straight to processing.
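HTTP caching of a file this size is ultimately at the browser's discretion, though. If you want a stronger guarantee, one option (not in the original code) is to store the model explicitly with the Cache API, then hand loadModel the ArrayBuffer directly, which it accepts per the instanceof check above:

// Fetch the model through Cache Storage so the ~170MB download
// survives better than it would in the plain HTTP cache.
async function fetchModelCached(url: string): Promise<ArrayBuffer> {
  const cache = await caches.open('demucs-model-v1');
  let response = await cache.match(url);
  if (!response) {
    response = await fetch(url);
    await cache.put(url, response.clone());
  }
  return response.arrayBuffer();
}

await processor.loadModel(await fetchModelCached(CONSTANTS.DEFAULT_MODEL_URL));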
Browser Compatibility
Our vocal remover works in modern browsers with the following requirements:
- Chrome/Edge 113+: Full support with WebGPU acceleration
- Firefox 121+: Full support (WASM fallback)
- Safari 17+: Full support (WASM fallback)
Required APIs:
- AudioContext: Universal support
- fetch: Universal support
- WebGPU: Chrome/Edge (optional, falls back to WASM)
- WebAssembly: Universal support
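A quick runtime check for the optional WebGPU path:

// ONNX Runtime Web falls back to WASM automatically because both are
// listed in executionProviders, but you can surface the backend to users:
const hasWebGPU = typeof navigator !== 'undefined' && 'gpu' in navigator;
console.log(hasWebGPU ? 'WebGPU acceleration available' : 'Using WASM fallback');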
Try It Yourself
Now that you understand how AI-powered vocal removal works, try it out on your own songs! Visit our free online vocal remover to create karaoke versions of your favorite tracks. Remember, all processing happens locally in your browser - your audio files never leave your device.
Conclusion
Building a browser-based vocal remover demonstrates the incredible capabilities of modern web technologies:
AI in the browser is viable: With ONNX Runtime Web, we can run sophisticated machine learning models entirely client-side.
Privacy by default: Local processing means complete data privacy without sacrificing functionality.
Signal processing fundamentals: Understanding FFT, STFT, and overlap-add is crucial for audio AI applications.
Performance matters: WebGPU acceleration, smart segmentation, and caching make real-time AI audio processing feasible.
The complete source code is available in our repository. Whether you're building a music app, a podcast editor, or just curious about AI audio processing, I hope this guide gives you a solid foundation to build upon.
Happy coding, and enjoy your karaoke sessions! 🎤🎵