How I Built a Prompt-to-Music AI Agent & Browser-Based Karaoke Separator with React & ONNX

#ai #react #webdev #devops

Tags: react, webdev, onnx, audio

Introduction

Music generation, vocal separation, and intelligent arrangement have traditionally been server-side tasks requiring complex pipelines and expensive GPU clusters. But what if we could bring the entire interactive music-creation experience — real-time preview, offline export, prompt-based AI music generation, and local karaoke processing — directly into the browser?

In this post, I'll share how I built Rapid AI Studio, a client-side React and Tone.js application featuring:

A Prompt-to-Music AI Agent: Enter any prompt (e.g., "Create an energetic Tamil Kuthu beat with a driving bassline and a Nadaswaram melody"), and the agent composes and adds the tracks directly to the arrangement.
A Client-Side Karaoke Separator: Uses ONNX Runtime Web to run a neural network (84% accuracy) in the browser, separating vocals from accompaniment without uploading any audio.
High-Performance Audio Engine: Tone.js scheduling, synth fallbacks, and real-time playback.

The Tech Stack

Frontend UI: React + TypeScript + Tailwind CSS for a premium, glassmorphic dark-mode interface.
Audio Engine: Tone.js v15 (built on top of the Web Audio API) for sample playback, precise timing scheduling, and synthesis.
Client-Side AI: ONNX Runtime Web (onnxruntime-web) executing a local neural network with 84% accuracy for vocal/accompaniment separation (Karaoke mode).
AI Music Agent: A natural-language agent interface that takes user prompts to compose MIDI sequences, beats, harmony, and arrangements in real time.
Offline Rendering: OfflineAudioContext for high-speed, non-realtime rendering of arrangements straight to .wav files.
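
To make the offline-rendering item concrete, here is a minimal sketch of the last step: turning rendered channel data into a 16-bit PCM .wav file. The helper name encodeWav is hypothetical and the post does not show the project's actual exporter, so treat this as one plausible way to do it rather than the real implementation.

```typescript
// Hypothetical helper (not part of Tone.js or the Web Audio API):
// encodes planar Float32 channel data as a 16-bit PCM WAV file.
function encodeWav(channels: Float32Array[], sampleRate: number): ArrayBuffer {
  const numChannels = channels.length;
  const numFrames = numChannels > 0 ? channels[0].length : 0;
  const blockAlign = numChannels * 2; // 2 bytes per 16-bit sample
  const dataSize = numFrames * blockAlign;
  const buffer = new ArrayBuffer(44 + dataSize);
  const view = new DataView(buffer);

  const writeAscii = (offset: number, text: string): void => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };

  // RIFF header followed by the "fmt " and "data" chunks (all little-endian).
  writeAscii(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);
  writeAscii(8, "WAVE");
  writeAscii(12, "fmt ");
  view.setUint32(16, 16, true); // size of the fmt chunk
  view.setUint16(20, 1, true); // audio format 1 = PCM
  view.setUint16(22, numChannels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * blockAlign, true); // byte rate
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, 16, true); // bits per sample
  writeAscii(36, "data");
  view.setUint32(40, dataSize, true);

  // Interleave the channels and convert [-1, 1] floats to signed 16-bit integers.
  let offset = 44;
  for (let frame = 0; frame < numFrames; frame++) {
    for (let ch = 0; ch < numChannels; ch++) {
      const s = Math.max(-1, Math.min(1, channels[ch][frame]));
      view.setInt16(offset, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
      offset += 2;
    }
  }
  return buffer;
}
```

In the browser, the input would come from the AudioBuffer that OfflineAudioContext.startRendering() resolves with: collect getChannelData(i) for each channel and pass the buffer's sampleRate.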

🤖 The Prompt-to-Music AI Agent

With Rapid AI Studio, users don't need to be music-theory experts. They simply write what they want to hear.

The AI Agent interprets the prompt and generates a multi-track composition containing:

Groove & Beats: Automatically maps drum samples and rhythmic patterns (e.g. Parai drum, Pambai hits for Kuthu).
Melody & Harmony: Dynamically selects matching lead instruments (like Nadaswaram) and harmony instruments (like Veena).
Real-time Arrangement: Generates MIDI notes, velocities, and durations, and inserts them directly as tracks into the playback engine.
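
As a rough illustration of what "MIDI notes, velocities, and durations" could look like before they reach the playback engine, here is a sketch of a composition data shape and a converter to timed events. All type and function names (AgentNote, AgentTrack, toScheduledEvents) are hypothetical; the post does not show the agent's real output format.

```typescript
// Hypothetical shapes for an agent-generated composition; field names are illustrative.
interface AgentNote {
  midi: number; // MIDI note number, 0-127
  startBeats: number; // onset, in beats from the start of the arrangement
  durationBeats: number;
  velocity: number; // expected in the range 0-1
}

interface AgentTrack {
  instrumentId: string;
  notes: AgentNote[];
}

interface ScheduledEvent {
  instrumentId: string;
  timeSeconds: number;
  durationSeconds: number;
  frequencyHz: number;
  velocity: number;
}

// Standard equal-temperament conversion with A4 (MIDI 69) at 440 Hz.
function midiToHz(midi: number): number {
  return 440 * Math.pow(2, (midi - 69) / 12);
}

// Flattens all tracks into one time-ordered list of events at the given tempo.
function toScheduledEvents(tracks: AgentTrack[], bpm: number): ScheduledEvent[] {
  const secondsPerBeat = 60 / bpm;
  const events: ScheduledEvent[] = [];
  for (const track of tracks) {
    for (const note of track.notes) {
      // Skip notes a generator might emit out of range instead of failing the whole track.
      if (note.durationBeats <= 0 || note.midi < 0 || note.midi > 127) continue;
      events.push({
        instrumentId: track.instrumentId,
        timeSeconds: note.startBeats * secondsPerBeat,
        durationSeconds: note.durationBeats * secondsPerBeat,
        frequencyHz: midiToHz(note.midi),
        velocity: Math.max(0, Math.min(1, note.velocity)),
      });
    }
  }
  return events.sort((a, b) => a.timeSeconds - b.timeSeconds);
}
```

Validating and clamping at this boundary means a malformed note from the agent is dropped quietly rather than reaching the synth or sampler.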

🎤 High-Performance AI: Client-Side Karaoke with ONNX Runtime Web

Instead of sending audio files to a cloud server to extract vocals or create karaoke tracks, Rapid AI Studio performs the separation locally in the browser.

How it works

We export our vocal-separation neural network to the ONNX format and run it client-side with onnxruntime-web, using its WebAssembly (WASM) backend with SIMD acceleration:

84% Accuracy: Performs state-of-the-art separation on the fly.
Zero Backend Costs: By offloading inference to the user's browser, there are no expensive GPU servers to pay for or maintain.
100% Privacy: Since audio files never leave the user's device, their recordings and tracks remain completely private.

// Initializing the local ONNX session for vocal separation
import * as ort from 'onnxruntime-web';

const session = await ort.InferenceSession.create('/assets/vocal_separator.onnx', {
  executionProviders: ['wasm'],
  enableCpuMemArena: true,
  enableMemPattern: true
});
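
One setting worth considering alongside the session options is the WASM thread count. onnxruntime-web can only run multi-threaded when SharedArrayBuffer is available, which browsers expose only on cross-origin-isolated pages. The helper below is a hypothetical sketch (pickWasmThreads and the cap of 4 are assumptions, not taken from the project) that chooses a safe value.

```typescript
// Hypothetical helper: choose a thread count for the onnxruntime-web WASM backend.
// The default cap of 4 is an assumption; tune it for your model and target devices.
function pickWasmThreads(
  hardwareConcurrency: number | undefined,
  isCrossOriginIsolated: boolean,
  maxThreads: number = 4
): number {
  // Without cross-origin isolation there is no SharedArrayBuffer, so stay single-threaded.
  if (!isCrossOriginIsolated) return 1;
  const cores =
    hardwareConcurrency !== undefined && hardwareConcurrency > 0
      ? Math.floor(hardwareConcurrency)
      : 1;
  return Math.max(1, Math.min(maxThreads, cores));
}

// In the browser, before calling InferenceSession.create
// (left as a comment so this sketch has no dependencies):
//   ort.env.wasm.numThreads = pickWasmThreads(
//     navigator.hardwareConcurrency,
//     self.crossOriginIsolated
//   );
```

Cross-origin isolation requires the page to be served with the appropriate COOP and COEP headers, so on many static hosts the helper will simply return 1.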

Technical Challenge 1: The "Treble Collapse" and FM Aliasing

During development, we noticed that when sample files failed to load due to cache issues or slow network connections, the synthetic fallback synthesizer for lead instruments (like the Nadaswaram) played with extreme digital noise, distortion, and a "treble collapse" sound.

The Cause: the live-preview fallback instrument was originally defined using Tone.FMSynth with a sawtooth carrier wave modulated by a sine modulator:

new Tone.PolySynth(Tone.FMSynth, {
  oscillator: { type: 'sawtooth' },
  harmonicity: 1.5,
  modulationIndex: 3.2,
  // ...
})

While FM synthesis sounds great with simple waves (like sine or triangle), applying frequency modulation to a sawtooth wave causes severe digital aliasing. A sawtooth already contains a long series of harmonics, and FM adds its own set of sidebands around every one of them. Any sideband that lands above the Nyquist frequency (half the sample rate) folds back into the audible range as inharmonic, harsh digital noise.
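
The fold-back can be checked with a little arithmetic. The sketch below is illustrative only (the function names are hypothetical and it models ideal sampling, not Tone.js internals): it lists the FM sidebands of a single sawtooth partial that exceed Nyquist and the frequencies at which they would be heard.

```typescript
// Reflects a frequency into the 0..Nyquist range, the way a sampled system aliases it.
function foldFrequency(freqHz: number, sampleRate: number): number {
  const r = Math.abs(freqHz) % sampleRate;
  return r > sampleRate / 2 ? sampleRate - r : r;
}

interface AliasedSideband {
  idealHz: number; // where the sideband would sit with unlimited bandwidth
  heardHz: number; // where it lands after folding around Nyquist
}

// Sidebands of one carrier partial sit at partial ± k * modulator, for k = 1..order.
// Returns only the ones above Nyquist, i.e. the ones that alias.
function aliasedSidebands(
  partialHz: number,
  modulatorHz: number,
  order: number,
  sampleRate: number
): AliasedSideband[] {
  const nyquist = sampleRate / 2;
  const result: AliasedSideband[] = [];
  for (let k = 1; k <= order; k++) {
    for (const idealHz of [partialHz + k * modulatorHz, partialHz - k * modulatorHz]) {
      if (Math.abs(idealHz) > nyquist) {
        result.push({ idealHz, heardHz: foldFrequency(idealHz, sampleRate) });
      }
    }
  }
  return result;
}

// A 440 Hz note with harmonicity 1.5 has a 660 Hz modulator.
// Take the 40th sawtooth harmonic (17600 Hz) at a 44.1 kHz sample rate:
const folded = aliasedSidebands(440 * 40, 440 * 1.5, 8, 44100);
// folded[0] is { idealHz: 22220, heardHz: 21880 }
// folded[1] is { idealHz: 22880, heardHz: 21220 }
```

The folded frequencies are not multiples of 440 Hz, which is why the result sounds like noise rather than added brightness. Higher partials and higher sideband orders fold further down into the audible range.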

The Fix: we refactored the preview fallback to use a plain Tone.Synth (a single sawtooth oscillator shaped by an amplitude envelope), which removes frequency modulation from the signal path entirely:

new Tone.PolySynth(Tone.Synth, {
  oscillator: { type: 'sawtooth' },
  envelope: { attack: 0.04, decay: 0.1, sustain: 0.85, release: 0.5 }
})

This matched our offline export engine perfectly and instantly restored a warm, clean lead-melody sound.

Technical Challenge 2: Upgrading Fallback Players Asynchronously

When a user loads the app, sample downloads are triggered asynchronously. If a user starts playing before samples finish downloading, the engine falls back to the synthetic instruments.

However, we found that once a synthetic player was cached in the active track map (playersMap), it would never get replaced — even after the samples finished downloading in the background. The user was stuck listening to the synth, even though the high-quality samples were now fully loaded in memory!

The Solution: we implemented an automatic upgrade check inside our state-synchronization loop (syncState). Every time the track arrangement changes or the user interacts with playback, the engine checks whether the active player is still a synthetic fallback (Tone.PolySynth) even though its sample buffers have since loaded into the cache:

// If the player is currently synthetic, but samples have loaded in the background, reload the track!
const currentPlayer = this.playersMap.get(id);
if (currentPlayer && currentPlayer instanceof Tone.PolySynth) {
  const data = INSTRUMENT_DATA[t.instrumentId];
  if (data && data.notes) {
    const hasLoadedBuffers = data.notes.some(n => {
      const kurl = this.normalizeStorageUrl(n.url);
      return this.bufferCache.has(kurl) && this.bufferCache.get(kurl)?.loaded;
    });
    if (hasLoadedBuffers) {
      changedExistingIds.add(id); // Forces re-instantiation as a Tone.Sampler!
    }
  }
}

This swaps the fallback synths for high-quality multi-sample instruments on the fly, with no interruption to the user experience.
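
One design choice in the check above is that it upgrades as soon as any one sample buffer has loaded. A possible variation, sketched below with hypothetical names (BufferEntry, loadedFraction, shouldUpgrade) and a plain Map standing in for the engine's bufferCache, is to make that threshold explicit.

```typescript
// Stand-in for a cached sample buffer; only the loaded flag matters for this decision.
interface BufferEntry {
  loaded: boolean;
}

// Fraction of an instrument's sample URLs that are present in the cache and fully loaded.
function loadedFraction(urls: string[], cache: Map<string, BufferEntry>): number {
  if (urls.length === 0) return 0;
  const loadedCount = urls.filter((url) => cache.get(url)?.loaded === true).length;
  return loadedCount / urls.length;
}

// Decide whether a track should be rebuilt as a sampler.
// minFraction = 1 waits for every sample; smaller values upgrade earlier.
function shouldUpgrade(
  isSyntheticFallback: boolean,
  urls: string[],
  cache: Map<string, BufferEntry>,
  minFraction: number = 1
): boolean {
  return (
    isSyntheticFallback &&
    urls.length > 0 &&
    loadedFraction(urls, cache) >= minFraction
  );
}
```

The `some` check in the snippet above corresponds to accepting any fraction greater than zero. Because Tone.Sampler repitches the nearest available sample, an early upgrade still plays every note, just with more pitch-shifting until the remaining buffers arrive, so the right threshold is a trade-off between upgrade speed and initial fidelity.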

Conclusion & Try It Out!

Building an audio workstation on top of the Web Audio API requires balancing CPU budgets, network latency, and synthesis algorithms. By combining Tone.js, smart preloading, local ONNX neural networks for karaoke vocal separation, a prompt-to-music AI Agent, and dynamic fallback-to-sampler upgrades, you can deliver a premium, desktop-grade audio production experience directly in a standard web browser.

Rapid AI Studio — make music in minutes, free, in your browser or on Android.