DEV Community

monkeymore studio

I Built an Audio-to-MIDI Converter That Runs on Your Laptop — Using Spotify's AI

Have you ever listened to a melody and thought, "I wish I had the sheet music for that"? Or maybe you recorded a guitar riff on your phone and now you want to edit it in a DAW as MIDI notes. The usual options aren't great: expensive transcription software, sketchy online upload tools, or doing it by ear like it's 1995.

I built something better. It's a free audio-to-MIDI converter that runs a machine learning model — based on Spotify's open-source Basic Pitch — entirely inside your web browser. Drop in an MP3, WAV, FLAC, or M4A file, wait a few seconds while the AI listens, and download a MIDI file you can open in any notation or production software.

No uploads. No servers. Your audio never leaves your machine. Try it on our free audio to MIDI converter.

Why Run AI in the Browser?

Machine learning transcription sounds like a server job. But forcing users to upload audio to a remote GPU cluster introduces problems that are easy to overlook.

Your Audio Stays Private

Maybe it's an unreleased song. Maybe it's a voice memo with personal content. Maybe you just don't trust random websites with your files. When the model runs locally, none of those concerns exist. The audio is decoded, analyzed, and discarded — all inside your browser's memory.

No API Keys, No Rate Limits, No Quotas

Server-side ML inference costs money. Most services pass that cost to users through subscription tiers, per-minute fees, or daily limits. A local model is free forever, for unlimited files, with no account required.

Works Offline After First Load

The AI model downloads once (about 900 KB), then caches in your browser. After that, you can transcribe audio even without an internet connection. Plane ride? Studio session with no Wi-Fi? Doesn't matter.

How the Pipeline Works

Here's the journey from a raw audio file to a downloadable MIDI:

  1. Decode the uploaded file with the Web Audio API
  2. Mix stereo down to a single mono channel
  3. Resample to 22,050 Hz
  4. Run Basic Pitch inference on the raw samples
  5. Post-process the model's probability outputs into note events
  6. Package the note events as a standard MIDI file

Let's dig into each step.

Audio Preprocessing: Getting the Signal Ready

Neural networks are picky about input. The Basic Pitch model expects audio at exactly 22,050 Hz, mono, as a Float32Array. Your uploaded file could be a 48 kHz stereo WAV or a 44.1 kHz MP3. We need to normalize it.

Decoding with the Web Audio API

async function prepareAudio(arrayBuffer: ArrayBuffer): Promise<Float32Array> {
  const targetSampleRate = 22050;
  const audioCtx = new (window.AudioContext || (window as any).webkitAudioContext)();
  const decoded = await audioCtx.decodeAudioData(arrayBuffer.slice(0));
  await audioCtx.close();
  // ...
}

decodeAudioData is the browser's built-in audio decoder. It handles MP3, WAV, OGG, FLAC, and M4A automatically. No external codecs needed. We immediately close the AudioContext afterward to free up the audio thread — we're done with playback; we just needed the raw samples.

Mixing to Mono

Stereo files get averaged down to a single channel:

const numChannels = decoded.numberOfChannels;
const originalLength = decoded.length;
const monoData = new Float32Array(originalLength);

// Grab each channel's buffer once, outside the hot loop
const channels: Float32Array[] = [];
for (let ch = 0; ch < numChannels; ch++) {
  channels.push(decoded.getChannelData(ch));
}

for (let i = 0; i < originalLength; i++) {
  let sum = 0;
  for (let ch = 0; ch < numChannels; ch++) {
    sum += channels[ch][i];
  }
  monoData[i] = sum / numChannels;
}

This is a simple mean across channels. It works because pitch content is usually similar in both stereo channels. If they're radically different (rare in practice), the averaged signal still contains enough harmonic information for the model to work.
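If you want the down-mix as a reusable, testable unit, the averaging loop can be packaged as a pure function. This is a sketch with my own naming, not code from the project:

```typescript
// Average an arbitrary number of equal-length channel buffers into one
// mono buffer. This is the same mean-across-channels the article describes.
function mixToMono(channels: Float32Array[]): Float32Array {
  const length = channels[0].length;
  const mono = new Float32Array(length);
  for (let i = 0; i < length; i++) {
    let sum = 0;
    for (const channel of channels) {
      sum += channel[i];
    }
    mono[i] = sum / channels.length;
  }
  return mono;
}
```

Extracting it this way also makes the mono path trivially correct: a one-channel input passes through unchanged, since the mean of one value is itself.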

Resampling to 22050 Hz

Basic Pitch was trained on audio at 22.05 kHz. If the source is already at that rate, we pass it through. Otherwise, we use linear interpolation:

const ratio = targetSampleRate / originalRate;
const newLength = Math.floor(originalLength * ratio);
const resampled = new Float32Array(newLength);

for (let i = 0; i < newLength; i++) {
  const pos = i / ratio;
  const idx = Math.floor(pos);
  const frac = pos - idx;
  const a = monoData[idx] || 0;
  const b = monoData[idx + 1] || monoData[idx] || 0;
  resampled[i] = a + (b - a) * frac;
}

Linear interpolation is fast and good enough here. The model operates on a spectrogram anyway, so slight resampling artifacts get smoothed out by the frequency-domain transformation inside the neural network.
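The same loop can be lifted into a standalone helper, which makes the edge cases (the last sample, matching rates) easy to unit-test. The function name is mine; the logic mirrors the loop above:

```typescript
// Linear-interpolation resampler: map each output index back to a
// fractional position in the source buffer and blend the two neighbors.
function resampleLinear(
  input: Float32Array,
  fromRate: number,
  toRate: number
): Float32Array {
  if (fromRate === toRate) return input; // pass-through, as the article notes
  const ratio = toRate / fromRate;
  const newLength = Math.floor(input.length * ratio);
  const out = new Float32Array(newLength);
  for (let i = 0; i < newLength; i++) {
    const pos = i / ratio;
    const idx = Math.floor(pos);
    const frac = pos - idx;
    const a = input[idx] ?? 0;
    const b = input[idx + 1] ?? a; // clamp at the buffer edge
    out[i] = a + (b - a) * frac;
  }
  return out;
}
```

Upsampling a two-sample ramp from 2 Hz to 4 Hz, for example, yields the intermediate value 0.5 exactly where you'd expect it.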

The AI Model: What Basic Pitch Actually Does

The model itself is a lightweight convolutional neural network published by Spotify's Audio Intelligence Lab. It was trained to solve a specific problem: given a short audio snippet, predict three things simultaneously:

  1. Note frames — which pitches are active at each time step
  2. Onsets — when each note starts
  3. Contours — fine-grained pitch variations (vibrato, bends, slides)

The model file is split into two parts served from the public folder:

  • model.json — the architecture and weight manifest (~175 KB)
  • group1-shard1of1.bin — the actual trained parameters (~742 KB)

Total download: under a megabyte. That's tiny by modern ML standards.

Loading and Running the Model

const { BasicPitch, outputToNotesPoly, addPitchBendsToNoteEvents, noteFramesToTime } =
  await import("@spotify/basic-pitch");

const basicPitch = new BasicPitch("/basic-pitch-model/model.json");

const frames: number[][] = [];
const onsets: number[][] = [];
const contours: number[][] = [];

await basicPitch.evaluateModel(
  audioData,
  (f: number[][], o: number[][], c: number[][]) => {
    frames.push(...f);
    onsets.push(...o);
    contours.push(...c);
  },
  (p: number) => {
    setProgress(p);
  }
);

The evaluateModel call runs inference in chunks. The callback receives batched predictions and a progress value between 0 and 1. For a 3-minute song, this typically takes 5–15 seconds on a modern laptop CPU. No GPU required, though it helps.

From Neural Network Outputs to Musical Notes

The raw model outputs are probability matrices — not notes. We need post-processing to extract actual note events with start times, durations, and pitches.

Polyphonic Note Extraction

const notes = noteFramesToTime(
  addPitchBendsToNoteEvents(
    contours,
    outputToNotesPoly(frames, onsets, 0.5, 0.3, 5)
  )
);

This is a three-stage pipeline:

  1. outputToNotesPoly(frames, onsets, 0.5, 0.3, 5) — converts frame and onset activations into discrete note events. The parameters control sensitivity:

    • 0.5 — onset confidence threshold (higher = fewer false starts)
    • 0.3 — frame confidence threshold (higher = more strict about sustained notes)
    • 5 — minimum note length in frames (filters out tiny blips)
  2. addPitchBendsToNoteEvents(contours, ...) — matches contour predictions to each note event. If the pitch wavers during the note (vibrato, string bending, portamento), this captures it as a series of pitch-bend values.

  3. noteFramesToTime(...) — converts frame indices into actual seconds using the known hop size and sample rate.

The resulting notes array contains objects like this:

interface NoteEventTime {
  startTimeSeconds: number;
  durationSeconds: number;
  pitchMidi: number;
  amplitude: number;
  pitchBends?: number[];
}
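Before handing these events to the MIDI writer, it can help to drop very quiet or very short notes that are likely model noise. This filter is a hypothetical addition of mine, not part of the pipeline above (the interface is repeated so the snippet is self-contained):

```typescript
interface NoteEventTime {
  startTimeSeconds: number;
  durationSeconds: number;
  pitchMidi: number;
  amplitude: number;
  pitchBends?: number[];
}

// Keep only notes loud enough and long enough to be intentional.
// Thresholds are illustrative defaults, not values from the project.
function filterNotes(
  notes: NoteEventTime[],
  minAmplitude = 0.1,
  minDurationSeconds = 0.05
): NoteEventTime[] {
  return notes.filter(
    (n) => n.amplitude >= minAmplitude && n.durationSeconds >= minDurationSeconds
  );
}
```

A pass like this is cheap insurance against the "tiny blip" artifacts that the minimum-note-length parameter doesn't fully catch.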

Generating the MIDI File

Once we have note events, the last step is packaging them into a standard MIDI file. We use @tonejs/midi for this:

async function generateMidiFileData(notes: NoteEventTime[]): Promise<Uint8Array> {
  const { Midi } = await import('@tonejs/midi');
  const midi = new Midi();
  const track = midi.addTrack();

  notes.forEach((note) => {
    track.addNote({
      midi: note.pitchMidi,
      time: note.startTimeSeconds,
      duration: note.durationSeconds,
      velocity: Math.min(1, Math.max(0, note.amplitude)),
    });

    const bends = note.pitchBends;
    if (bends) {
      bends.forEach((bend, i) => {
        track.addPitchBend({
          time: note.startTimeSeconds + (i * note.durationSeconds) / bends.length,
          value: bend,
        });
      });
    }
  });

  return midi.toArray();
}

A few details worth noting:

  • Velocity mapping: The model outputs an amplitude value (0–1) representing how loud the note was. We clamp it to the 0–1 range; @tonejs/midi then scales that normalized velocity to the standard 0–127 MIDI range when writing the file.

  • Pitch bends: If the AI detected vibrato or string bends, we distribute pitch-bend events evenly across the note duration. This isn't as smooth as a continuous bend controller, but it captures the expressive character well enough for most DAWs.

  • Single track: Everything goes into one MIDI track. The model doesn't separate instruments, so if you feed it a full band recording, you'll get all the melodic content mashed together. For best results, use monophonic or sparse polyphonic sources — solo piano, vocal lines, guitar melodies, synth leads.
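The even pitch-bend distribution described above boils down to a small time-spacing formula. Pulled out as a helper (my naming), it mirrors the arithmetic inside generateMidiFileData and is easy to verify:

```typescript
// Spread n pitch-bend events evenly across a note's duration,
// starting at the note onset (the last event lands before the note ends).
function pitchBendTimes(
  startSeconds: number,
  durationSeconds: number,
  bendCount: number
): number[] {
  const times: number[] = [];
  for (let i = 0; i < bendCount; i++) {
    times.push(startSeconds + (i * durationSeconds) / bendCount);
  }
  return times;
}
```

For a one-second note with four contour samples, the bends land at 0, 0.25, 0.5, and 0.75 seconds — a stepped approximation of the continuous contour.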

The UI Flow

The interface stays simple: upload, convert, download. A progress bar shows real-time inference progress. Errors are surfaced if the model detects no notes (common with drums, noise, or very dense mixes) or if the audio file is corrupted.

Limitations and Realistic Expectations

This tool is genuinely useful, but it's not magic. Understanding its limits helps you get better results.

Monophonic and Sparse Polyphonic Sources Work Best

The model was trained primarily on solo melodic instruments. A single vocal line, a guitar melody, or a piano riff — these transcribe cleanly. Feed it a full rock mix with drums, bass, guitars, and vocals, and you'll get a mess of overlapping notes. The AI hears everything and tries to notate it all.

Drums and Percussion Are Problematic

Drums don't have pitched content in the way the model understands. A snare hit might get transcribed as a random low note, or ignored entirely. This tool is for melodic transcription, not rhythm extraction.

Reverb and Effects Can Confuse It

Heavy reverb creates phantom harmonics that the model may interpret as extra notes. A dry, close-mic'd recording will always produce cleaner output than a washed-out ambient track.

Quantization Is Your Friend

The output MIDI is not quantized. Note timings reflect the exact micro-timing of the original performance — which is great for preserving feel, but messy if you want grid-aligned notes. Import the MIDI into your DAW and quantize to taste.
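If you'd rather snap timings in code before importing, a basic grid quantizer is only a few lines. This is a sketch of what a DAW's quantize function does at its simplest — not a feature of the tool:

```typescript
// Snap a time in seconds to the nearest grid line.
// gridSeconds would be e.g. 60 / bpm / 4 for sixteenth notes at a given tempo.
function quantizeTime(timeSeconds: number, gridSeconds: number): number {
  return Math.round(timeSeconds / gridSeconds) * gridSeconds;
}
```

Real DAW quantization adds strength, swing, and note-end handling on top of this, which is why leaving the decision to the DAW is usually the right call.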

Why This Architecture Works

The combination of technologies here is pragmatic rather than cutting-edge, and that's the point:

  • Web Audio API handles decoding — no ffmpeg WASM binary needed
  • TensorFlow.js runs the model on the CPU (or WebGPU if available) — no cloud dependency
  • @spotify/basic-pitch provides a pre-trained, well-documented model — no custom training pipeline
  • @tonejs/midi generates valid MIDI files — no hand-rolling binary formats

The total frontend bundle stays lean because everything is loaded on demand via dynamic import(). The model itself is under a megabyte. On a decent laptop, the entire process from drop to download takes under 20 seconds for a typical song.

Try It Yourself

Got an audio file with a melody you'd love to transcribe? Maybe a piano recording, a guitar riff, or a vocal line you hummed into your phone?

Upload it to our free audio to MIDI converter and let the AI do the listening. Download the MIDI, open it in your favorite DAW or notation software, and start editing. All the hard work happens on your machine — your audio stays yours.
