StemSplit

Posted on Mar 11

How to Isolate Vocals in Python: API vs Demucs vs Audacity CLI (2026)

#python #ai #music #tutorial

I keep seeing "how to isolate vocals" questions on Stack Overflow where the accepted answer is five years old and recommends Spleeter. Let's fix that.

Here are three methods that actually work in 2026, with working code for each, and SDR benchmarks so you know what quality to expect before you commit to one.

What You'll Learn

✅ The fastest way to isolate vocals with no local setup (API, ~5 lines of Python)
✅ How to run Demucs locally for best quality
✅ How to automate Audacity via CLI for legacy workflows
✅ SDR scores across all three methods on the same test tracks
✅ Which method fits which use case

Prerequisites

pip install requests librosa soundfile mir_eval numpy

For Method 2 (local Demucs):

pip install demucs torch

For Method 3 (Audacity CLI):

# macOS
brew install audacity

# Ubuntu
sudo snap install audacity

Method 1: Online API (Fastest, No Setup)

Best for: Prototypes, web apps, when you don't own a GPU, or when you need results immediately.

The easiest path is to call a stem separator online rather than running a model locally. StemSplit's stem separator runs HTDemucs on GPU-backed servers — same model quality as running Demucs yourself, but a single HTTP call.

import requests
import time
from pathlib import Path


def isolate_vocals_api(
    audio_path: str,
    api_key: str,
    output_dir: str = "output",
) -> str:
    """
    Isolate vocals using StemSplit API.
    Returns path to downloaded vocals file.

    Free tier: 10 minutes included on signup.
    Docs: https://stemsplit.io/developers/docs
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    # 1. Upload and start job
    with open(audio_path, "rb") as f:
        resp = requests.post(
            "https://api.stemsplit.io/v1/separate",
            headers={"Authorization": f"Bearer {api_key}"},
            files={"audio": (Path(audio_path).name, f)},
            # 2-stem = vocals + instrumental. Sent as form fields because
            # requests silently drops json= when files= is also passed.
            data={"stems": 2, "format": "wav"},
            timeout=30,
        )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]
    print(f"Job started: {job_id}")

    # 2. Poll for completion
    while True:
        poll = requests.get(
            f"https://api.stemsplit.io/v1/jobs/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
            timeout=30,
        )
        poll.raise_for_status()  # surface HTTP errors instead of a KeyError later
        status = poll.json()

        if status["status"] == "completed":
            vocals_url = status["stems"]["vocals"]
            break
        if status["status"] == "failed":
            raise RuntimeError(status.get("error", "Job failed"))

        print("  Processing...")
        time.sleep(3)

    # 3. Download vocals
    download = requests.get(vocals_url, timeout=60)
    download.raise_for_status()  # don't write an error page to disk as a .wav
    vocals_data = download.content
    out_path = Path(output_dir) / f"{Path(audio_path).stem}_vocals.wav"
    out_path.write_bytes(vocals_data)

    print(f"✅ Vocals saved: {out_path}")
    return str(out_path)


# Usage
vocals = isolate_vocals_api("song.mp3", api_key="your_key_here")

Output:

Job started: job_abc123
  Processing...
  Processing...
✅ Vocals saved: output/song_vocals.wav

Pros: No installation, GPU-backed speed (~40s for 4-min track), same model quality as Demucs

Cons: Requires internet, free tier has usage limits

SDR: 8.7 dB (pop), 8.1 dB (rock), 7.9 dB (hip-hop)
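
The polling loop in the example above runs until the job reports `completed` or `failed`, so a job that never finishes would block forever. Here is a minimal sketch of a bounded version, assuming the same response shape as that example (`status`, `error`). `poll_until_done` and its parameters are hypothetical names for illustration, not part of any StemSplit client.

```python
import time
from typing import Callable, Dict


def poll_until_done(
    fetch_status: Callable[[], Dict],
    timeout_s: float = 300.0,
    interval_s: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict:
    """Call fetch_status() until the job completes, fails, or the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(status.get("error", "Job failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job not finished after {timeout_s:.0f}s")
        sleep(interval_s)
```

To use it with the function above, pass a small callable that performs the `requests.get(...)` for the job URL and returns the parsed JSON. Injecting `sleep` keeps the helper easy to test without real waiting.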

Method 2: Demucs Locally (Best Quality, Runs Offline)

Best for: Batch processing, privacy-sensitive audio, when you have a GPU, production pipelines.

Demucs is Meta's open-source model and the current state-of-the-art for free stem separation. htdemucs_ft (fine-tuned) gives the best results.

Installation

pip install demucs

# Verify
python -m demucs --help

On first run, Demucs downloads the model (~300MB). This happens once.

Basic Vocal Isolation

import subprocess
from pathlib import Path


def isolate_vocals_demucs(
    audio_path: str,
    output_dir: str = "output",
    model: str = "htdemucs_ft",
    output_format: str = "wav",
) -> dict:
    """
    Isolate vocals (and instrumental) using Demucs locally.

    Args:
        audio_path:    Path to input audio file
        output_dir:    Directory to write stems to
        model:         'htdemucs_ft' (best), 'htdemucs' (faster)
        output_format: 'wav' or 'mp3'

    Returns:
        dict with 'vocals' and 'no_vocals' paths
    """
    cmd = [
        "python", "-m", "demucs",
        "--two-stems", "vocals",   # only separate vocals vs everything else
        "-n", model,
        "-o", output_dir,
        audio_path,
    ]

    if output_format == "mp3":
        cmd += ["--mp3", "--mp3-bitrate", "320"]

    subprocess.run(cmd, check=True)

    song_name = Path(audio_path).stem
    stems_dir = Path(output_dir) / model / song_name

    return {
        "vocals":    str(stems_dir / f"vocals.{output_format}"),
        "no_vocals": str(stems_dir / f"no_vocals.{output_format}"),
    }


# Isolate vocals
result = isolate_vocals_demucs("song.mp3")
print(f"Vocals:       {result['vocals']}")
print(f"Instrumental: {result['no_vocals']}")

With GPU Acceleration

If you have an NVIDIA GPU, Demucs uses it automatically. Check availability:

import torch

if torch.cuda.is_available():
    gpu = torch.cuda.get_device_name(0)
    print(f"✅ GPU: {gpu}")
    # Processing time drops from ~4 min to ~35s on a 4-minute track
else:
    print("❌ No GPU — Demucs will run on CPU (~4 min per song)")
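
Automatic GPU selection is usually what you want, but sometimes you need to force CPU, for example when the GPU runs out of memory on long tracks. The Demucs CLI takes a `-d` / `--device` option for this. Below is a small sketch of a command builder that adds it only when requested. `build_demucs_cmd` is a hypothetical helper name, and the other flags mirror the ones used in `isolate_vocals_demucs` above.

```python
import sys
from typing import List, Optional


def build_demucs_cmd(
    audio_path: str,
    model: str = "htdemucs_ft",
    output_dir: str = "output",
    device: Optional[str] = None,
) -> List[str]:
    """Build a two-stem Demucs command, optionally pinning the device."""
    cmd = [
        sys.executable, "-m", "demucs",  # same interpreter as the caller
        "--two-stems", "vocals",
        "-n", model,
        "-o", output_dir,
    ]
    if device is not None:
        cmd += ["-d", device]  # e.g. "cpu" or "cuda"
    cmd.append(audio_path)
    return cmd
```

Run it with `subprocess.run(build_demucs_cmd("song.mp3", device="cpu"), check=True)`. Leaving `device` as `None` keeps the automatic behaviour described above.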

All 4 Stems (Not Just Vocals)

If you need drums, bass, and other instruments too:

def separate_all_stems(
    audio_path: str,
    output_dir: str = "output",
    model: str = "htdemucs_ft",
) -> dict:
    """Separate into vocals, drums, bass, and other."""
    subprocess.run(
        ["python", "-m", "demucs", "-n", model, "-o", output_dir, audio_path],
        check=True,
    )

    song_name = Path(audio_path).stem
    stems_dir = Path(output_dir) / model / song_name

    return {
        stem: str(stems_dir / f"{stem}.wav")
        for stem in ["vocals", "drums", "bass", "other"]
    }


stems = separate_all_stems("song.mp3")
for name, path in stems.items():
    print(f"{name}: {path}")

Pros: Free, offline, best quality, full control over model/format

Cons: ~300MB model download, slow on CPU, requires Python environment

SDR: 8.7 dB (pop), 8.2 dB (rock), 8.0 dB (hip-hop)

Method 3: Audacity via CLI (For Legacy Workflows)

Best for: Teams already using Audacity, scripting into existing audio production workflows, macOS/Windows environments.

Audacity has a Python scripting interface via its pipe mechanism. This is more complex to set up but useful if you're integrating into an existing Audacity-based workflow.

⚠️ This method uses phase cancellation, which works by subtracting the stereo channels. It's much lower quality than AI methods — only use it if you specifically need Audacity integration.
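
To see what channel subtraction actually does, here is a toy NumPy sketch. The two sine waves standing in for a "vocal" and a "guitar" are assumptions made up for illustration; no real audio is involved.

```python
from typing import Tuple

import numpy as np


def mid_side(stereo: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Split a (2, n_samples) stereo array into mid and side signals."""
    left, right = stereo[0], stereo[1]
    mid = (left + right) / 2.0   # keeps what both channels share
    side = (left - right) / 2.0  # cancels what both channels share
    return mid, side


# Toy mix: the "vocal" is identical in both channels (panned center),
# the "guitar" exists only in the left channel.
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
vocal = np.sin(2 * np.pi * 220 * t)
guitar = 0.5 * np.sin(2 * np.pi * 330 * t)

stereo = np.stack([vocal + guitar, vocal])
mid, side = mid_side(stereo)
```

In this toy mix the side signal contains only the guitar, because the centered vocal cancels exactly. The mid signal still holds the vocal plus half of the guitar. Subtraction can remove centered content cleanly, but it cannot hand back the centered content on its own, which is one reason the SDR for this method is so much lower.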

Enable Audacity's Scripting Interface

In Audacity: Edit → Preferences → Modules → mod-script-pipe → Enable

Restart Audacity after enabling.

Python Bridge

import os
import sys
import time


def get_audacity_pipe():
    """Return read/write pipes to Audacity's scripting interface."""
    if sys.platform == "win32":
        toname = "\\\\.\\pipe\\ToSrvPipe"
        fromname = "\\\\.\\pipe\\FromSrvPipe"
        eol = "\r\n\0"
    else:
        # On Linux/macOS the pipe names end in the user ID, not the process ID
        uid = os.getuid()
        toname = f"/tmp/audacity_script_pipe.to.{uid}"
        fromname = f"/tmp/audacity_script_pipe.from.{uid}"
        eol = "\n"

    # The pipes only exist while Audacity is running with mod-script-pipe enabled
    if not (os.path.exists(toname) and os.path.exists(fromname)):
        raise RuntimeError("Audacity not running or scripting pipe not enabled")

    write_pipe = open(toname, "w")
    read_pipe = open(fromname, "r")
    return write_pipe, read_pipe, eol


def send_command(write_pipe, read_pipe, eol: str, command: str) -> str:
    """Send a command to Audacity and return the response."""
    write_pipe.write(command + eol)
    write_pipe.flush()
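    response = []
    while True:
        line = read_pipe.readline()
        if line == "":
            # EOF: Audacity closed the pipe, so stop instead of looping forever
            raise RuntimeError("Audacity closed the scripting pipe")
        if line == "\n":
            # A blank line marks the end of the response
            break
        response.append(line.strip())
    return "\n".join(response)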


def isolate_vocals_audacity(input_path: str, output_path: str) -> str:
    """
    Isolate vocals using Audacity's phase cancellation method.

    Lower quality than AI methods — only works well on stereo tracks
    where vocals are panned center.
    """
    write_pipe, read_pipe, eol = get_audacity_pipe()

    try:
        # Import audio
        send_command(write_pipe, read_pipe, eol, f'Import2: Filename="{input_path}"')
        time.sleep(1)

        # Duplicate track (we need two copies)
        send_command(write_pipe, read_pipe, eol, "Duplicate:")

        # On the duplicate: mix down to mono, then invert its phase
        send_command(write_pipe, read_pipe, eol, "SelectTracks: Track=1")
        send_command(write_pipe, read_pipe, eol, "StereoToMono:")
        send_command(write_pipe, read_pipe, eol, "Invert:")

        # Mix down: content panned dead center cancels against the inverted copy.
        # What remains is the side signal, i.e. whatever was NOT panned center.
        # A centered lead vocal is what cancels, so check the export by ear.
        send_command(write_pipe, read_pipe, eol, "SelectAll:")
        send_command(write_pipe, read_pipe, eol, "MixAndRender:")

        # Export
        send_command(write_pipe, read_pipe, eol, f'Export2: Filename="{output_path}" NumChannels=1')

    finally:
        write_pipe.close()
        read_pipe.close()

    return output_path

Pros: Integrates with existing Audacity workflows

Cons: Very low quality, only works on stereo tracks with centered vocals, requires Audacity running

SDR: 3.1 dB (pop), 2.4 dB (rock), 1.9 dB (hip-hop)

Quality Comparison

Same three test tracks, same mir_eval SDR measurement across all methods:

import librosa
import mir_eval
import numpy as np


def compute_sdr(reference_path: str, estimated_path: str) -> float:
    ref, _ = librosa.load(reference_path, sr=44100, mono=True)
    est, _ = librosa.load(estimated_path, sr=44100, mono=True)
    n = min(len(ref), len(est))
    sdr, _, _, _ = mir_eval.separation.bss_eval_sources(
        ref[:n][np.newaxis, :], est[:n][np.newaxis, :]
    )
    return float(sdr[0])

Method	Pop SDR	Rock SDR	Hip-Hop SDR	Speed	Cost
Demucs htdemucs_ft	8.7 dB	8.2 dB	8.0 dB	4 min CPU / 35s GPU	Free
StemSplit API	8.7 dB	8.1 dB	7.9 dB	~42s	Free tier
Audacity (phase cancel)	3.1 dB	2.4 dB	1.9 dB	5s	Free

The Audacity method is effectively unusable for anything that needs to sound clean. It's included here for completeness and for workflows that specifically need Audacity integration regardless of quality.
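
If you have not worked with SDR before: it is a ratio, in decibels, between the energy of the reference signal and the energy of the error in the estimate. Higher is better. The sketch below computes the plain energy-ratio form. It is an illustrative assumption, not what `mir_eval.separation.bss_eval_sources` does internally (BSS Eval uses a more elaborate decomposition), so the numbers will not match the table exactly. `simple_sdr` is a hypothetical name.

```python
import numpy as np


def simple_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Plain signal-to-distortion ratio in dB: 10 * log10(|s|^2 / |s - s_hat|^2)."""
    n = min(len(reference), len(estimate))
    ref = np.asarray(reference[:n], dtype=np.float64)
    est = np.asarray(estimate[:n], dtype=np.float64)
    signal_energy = np.sum(ref ** 2)
    error_energy = np.sum((ref - est) ** 2)
    # eps avoids division by zero for a perfect estimate or a silent reference
    return float(10.0 * np.log10((signal_energy + eps) / (error_energy + eps)))
```

As a rough guide to the scale, an estimate that is off by 10% in amplitude everywhere scores about 20 dB under this definition, and each extra 10 dB means the error energy is ten times smaller.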

Choosing the Right Method

Do you have a GPU?
├── Yes → Use Demucs locally (free, fastest, best quality)
└── No
    ├── Processing < 100 files? → Use StemSplit API (no setup, same quality)
    └── Processing 100+ files?  → Use Demucs on CPU or rent GPU time
                                   (StemSplit API costs stack up at scale)

Do you need offline/private processing?
└── Use Demucs locally regardless of GPU

Do you need Audacity integration specifically?
└── Use Audacity CLI — but expect poor quality, use only for legacy pipelines
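
If you want the same decision inside a script, the tree can be sketched as a small function. The precedence between the three questions is an assumption (Audacity requirement first, then offline/privacy, then GPU and file count), and `choose_method` and its return labels are hypothetical.

```python
def choose_method(
    has_gpu: bool,
    n_files: int,
    needs_offline: bool = False,
    needs_audacity: bool = False,
) -> str:
    """Return 'audacity', 'demucs', or 'api' following the decision tree above."""
    if needs_audacity:
        return "audacity"  # legacy pipelines only, expect poor quality
    if needs_offline or has_gpu:
        return "demucs"    # local processing, with or without GPU
    if n_files < 100:
        return "api"       # no setup, small volume
    return "demucs"        # 100+ files: CPU or rented GPU time
```

For example, `choose_method(has_gpu=False, n_files=20)` returns `"api"`, while the same call with `needs_offline=True` returns `"demucs"`.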

Bonus: Batch Processing Multiple Files

from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import glob


def batch_isolate_vocals(
    input_dir: str,
    api_key: str,
    max_workers: int = 3,   # keep this low to avoid rate limiting
) -> list:
    """Isolate vocals from all audio files in a directory using the API."""
    audio_files = glob.glob(f"{input_dir}/*.mp3") + glob.glob(f"{input_dir}/*.wav")
    print(f"Found {len(audio_files)} files")

    results = []

    def process(path: str) -> dict:
        try:
            vocals_path = isolate_vocals_api(path, api_key)
            return {"input": path, "output": vocals_path, "status": "ok"}
        except Exception as e:
            return {"input": path, "output": None, "status": "error", "error": str(e)}

    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        results = list(executor.map(process, audio_files))

    ok = [r for r in results if r["status"] == "ok"]
    errors = [r for r in results if r["status"] == "error"]
    print(f"\n✅ Completed: {len(ok)}  ❌ Failed: {len(errors)}")

    return results


# Process a folder
results = batch_isolate_vocals("./music", api_key="your_key_here")

For large batches with Demucs locally, see the batch processing guide.
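
As a starting point for local batches: the Demucs CLI accepts several track paths in one invocation, so the model only has to load once for the whole folder. Here is a minimal sketch that builds such a command. `build_batch_cmd` and the extension list are assumptions for illustration.

```python
import sys
from pathlib import Path
from typing import List

AUDIO_EXTENSIONS = {".mp3", ".wav", ".flac"}


def build_batch_cmd(
    input_dir: str,
    output_dir: str = "output",
    model: str = "htdemucs_ft",
) -> List[str]:
    """Build one Demucs command covering every audio file in input_dir."""
    files = sorted(
        p for p in Path(input_dir).iterdir()
        if p.is_file() and p.suffix.lower() in AUDIO_EXTENSIONS
    )
    if not files:
        raise FileNotFoundError(f"No audio files found in {input_dir}")
    return [
        sys.executable, "-m", "demucs",
        "--two-stems", "vocals",
        "-n", model,
        "-o", output_dir,
        *[str(p) for p in files],  # all tracks in a single invocation
    ]
```

Run it with `subprocess.run(build_batch_cmd("./music"), check=True)`. Stems land in the same `output/<model>/<track name>/` layout used by `isolate_vocals_demucs`.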

Common Issues

"Vocals have metallic artifacts"

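Start from a less compressed source if you can. Demucs degrades noticeably on heavily compressed MP3s (<192kbps). Decoding an MP3 to WAV does not restore detail the encoder already discarded, so use a lossless or higher-bitrate copy where one exists. If the MP3 is all you have, decoding it to 44.1 kHz 16-bit WAV at least gives Demucs a consistent input:

import subprocess

# check=True raises if ffmpeg fails instead of silently leaving no output file
subprocess.run(["ffmpeg", "-i", "song.mp3", "-ar", "44100", "-acodec", "pcm_s16le", "song.wav"], check=True)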

"Demucs is very slow"

Use the lighter model for a ~30% speed boost with minimal quality loss:

# Replace 'htdemucs_ft' with 'htdemucs' for faster processing
isolate_vocals_demucs("song.mp3", model="htdemucs")

Or use the API — their GPU backend is faster than CPU Demucs for most single-file use cases.

"Phase cancellation removed instruments, not vocals"

This happens when the vocals aren't panned center in the stereo field. The Audacity method assumes vocals are in the center channel. Modern productions frequently break this assumption. Use Demucs instead.

Summary

Use Case	Method
Best quality, have GPU	Demucs `htdemucs_ft`
Best quality, no GPU	StemSplit API
Prototyping / no install	Stem separator online
Legacy Audacity workflow	Audacity CLI (expect low quality)
Batch processing 1000+ files	Demucs local with GPU

What are you using vocal isolation for? Building a karaoke tool, a music practice app, something else? Drop it in the comments.
