StemSplit

How to Build an AI Vocal Remover in Python with the StemSplit API (2026)

I needed to add vocal removal to an app last week without shipping a 300 MB Demucs model with the binary. The shortest path I found was StemSplit's API — three endpoints, one model (HTDemucs), and a free tier big enough to prototype on.

This is the working tutorial I wish I'd had. Single file, batch, webhooks, and a Flask wrapper at the end. All copy-paste runnable.

What You'll Learn

  • ✅ How to remove vocals from any audio file in 5 lines of Python
  • ✅ How the async job pattern works (upload → poll → download)
  • ✅ Batch processing with bounded concurrency
  • ✅ Webhook callbacks instead of polling
  • ✅ Wrapping the API as a Flask backend for your own frontend
  • ✅ Cost math, rate limits, and the production gotchas

Prerequisites

pip install requests python-dotenv tenacity flask

You'll need a free API key from the AI vocal remover dashboard. Signing up gives you 10 minutes of processing, no card required. Drop the key in .env:

STEMSPLIT_API_KEY=your_key_here

Then load it in a small config module:

# config.py
import os
from dotenv import load_dotenv

load_dotenv()

API_KEY  = os.environ["STEMSPLIT_API_KEY"]
API_BASE = "https://api.stemsplit.io/v1"
HEADERS  = {"Authorization": f"Bearer {API_KEY}"}

The 5-Line Version

If you just want to see it work before committing to anything:

import requests, time
from config import HEADERS

job = requests.post(
    "https://api.stemsplit.io/v1/separate",
    headers=HEADERS,
    files={"audio": open("song.mp3", "rb")},
    # Send options as form fields: requests drops json= when files= is set
    data={"stems": "2", "format": "wav"},
).json()
status = {"status": "processing"}
while status["status"] not in ("completed", "failed"):
    time.sleep(3)
    status = requests.get(f"https://api.stemsplit.io/v1/jobs/{job['job_id']}", headers=HEADERS).json()
open("instrumental.wav", "wb").write(requests.get(status["stems"]["instrumental"]).content)

That's the entire flow. Upload, poll, download. The rest of this article is making that production-grade.


How the API Works

Three endpoints. That's the whole surface area:

| Method | Path | Purpose |
| --- | --- | --- |
| POST | /v1/separate | Upload audio + start a job |
| GET | /v1/jobs/{id} | Poll job status |
| POST | /v1/webhooks | Register a callback URL (skip the polling) |

The stems field on the upload controls what you get back:

| stems | Output |
| --- | --- |
| 2 | vocals + instrumental (vocal removal) |
| 4 | vocals + drums + bass + other |
| 6 | adds guitar + piano |

For vocal removal you want stems: 2. The instrumental is the "everything except vocals" file.
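
Since uploads count against your minutes, it can be worth rejecting bad input before calling the API at all. The sketch below is one way to do that. The stem names for the 4- and 6-stem layouts are assumed from the table above, the format list and 100 MB limit are the ones quoted later in the remove_vocals docstring, and `preflight` is a hypothetical helper name, not part of any StemSplit SDK.

```python
from pathlib import Path

SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".flac", ".m4a", ".ogg", ".webm"}
MAX_UPLOAD_BYTES = 100 * 1024 * 1024

# Assumed stem names per layout, read off the table above
STEM_LAYOUTS = {
    2: ["vocals", "instrumental"],
    4: ["vocals", "drums", "bass", "other"],
    6: ["vocals", "drums", "bass", "other", "guitar", "piano"],
}


def preflight(audio_path, stems=2):
    """Validate a request locally and return the stem names to expect."""
    if stems not in STEM_LAYOUTS:
        raise ValueError(f"stems must be one of {sorted(STEM_LAYOUTS)}, got {stems}")

    path = Path(audio_path)
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"unsupported format: {path.suffix or '(none)'}")

    size = path.stat().st_size  # raises FileNotFoundError if the file is missing
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"file is {size} bytes, limit is {MAX_UPLOAD_BYTES}")

    return STEM_LAYOUTS[stems]
```

Calling `preflight("song.mp3", stems=2)` before the upload turns a wasted round trip into a local exception.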

📝 BPM and musical key come back in every job response at no extra cost. Useful if you're piping into a DJ tool, practice app, or recommender.


Single File: The Production Version

Same flow as the 5-liner, but with timeouts, retries on flaky uploads, a bounded polling loop, and streamed downloads for large files.

import requests
import time
from pathlib import Path
from tenacity import retry, stop_after_attempt, wait_exponential

from config import API_BASE, HEADERS


@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=2, max=20))
def _upload_with_retry(audio_path: str) -> dict:
    # Reopen the file on every attempt: retrying with an already-read
    # file handle would upload an empty body.
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{API_BASE}/separate",
            headers=HEADERS,
            files={"audio": (Path(audio_path).name, f)},
            data={"stems": "2", "format": "wav"},
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()


def remove_vocals(
    audio_path: str,
    output_dir: str = "output",
    poll_interval: float = 3.0,
    timeout: float = 300.0,
) -> dict:
    """
    Remove vocals from a single audio file using the StemSplit API.

    Args:
        audio_path:    Path to input file. MP3, WAV, FLAC, M4A, OGG, WEBM up to 100 MB.
        output_dir:    Where to write the downloaded stems.
        poll_interval: Seconds between status checks.
        timeout:       Maximum seconds to wait before giving up.

    Returns:
        Dict with 'vocals' and 'instrumental' file paths, plus 'bpm' and 'key'.
    """
    Path(output_dir).mkdir(parents=True, exist_ok=True)

    job = _upload_with_retry(audio_path)

    job_id = job["job_id"]
    deadline = time.time() + timeout

    while time.time() < deadline:
        status = requests.get(f"{API_BASE}/jobs/{job_id}", headers=HEADERS, timeout=30).json()

        if status["status"] == "completed":
            break
        if status["status"] == "failed":
            raise RuntimeError(f"Job {job_id} failed: {status.get('error')}")

        time.sleep(poll_interval)
    else:
        raise TimeoutError(f"Job {job_id} did not complete within {timeout}s")

    out = {}
    for stem, url in status["stems"].items():
        path = Path(output_dir) / f"{Path(audio_path).stem}_{stem}.wav"
        with requests.get(url, stream=True, timeout=120) as r:
            r.raise_for_status()
            with open(path, "wb") as f:
                for chunk in r.iter_content(chunk_size=8192):
                    f.write(chunk)
        out[stem] = str(path)

    out["bpm"] = status.get("bpm")
    out["key"] = status.get("key")
    return out

Usage:

result = remove_vocals("song.mp3")
print(result)
# {
#   'vocals': 'output/song_vocals.wav',
#   'instrumental': 'output/song_instrumental.wav',
#   'bpm': 124.0,
#   'key': 'C# minor'
# }

Notes on what this gives you that the 5-liner doesn't:

  • Streaming download. Stems can be 30–80 MB each. Don't .content them into memory if you're processing many files.
  • Bounded wait. A timeout parameter means a stuck job can't hang your worker forever.
  • Retried uploads. tenacity wraps the upload in three attempts with exponential backoff — survives transient network blips.
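
The loop in remove_vocals polls at a fixed interval. If you would rather have the interval grow while a job sits in the queue, a small helper along these lines could replace the while loop. This is a sketch, not something the API requires: `poll_with_backoff` and its parameters are hypothetical names, and the status values ("completed", "failed") are the ones used in the code above. The `sleep` and `clock` arguments are injectable only so the function can be exercised without waiting.

```python
import time


def poll_with_backoff(
    fetch_status,
    initial=2.0,
    factor=1.5,
    max_interval=15.0,
    timeout=300.0,
    sleep=time.sleep,
    clock=time.monotonic,
):
    """Call fetch_status() until the job completes, waiting longer each round."""
    deadline = clock() + timeout
    interval = initial
    while True:
        status = fetch_status()
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"job failed: {status.get('error')}")
        if clock() + interval > deadline:
            raise TimeoutError(f"job did not complete within {timeout}s")
        sleep(interval)
        # Grow the wait geometrically, capped so long jobs are still checked regularly
        interval = min(interval * factor, max_interval)
```

Inside remove_vocals, `fetch_status` would be something like `lambda: requests.get(f"{API_BASE}/jobs/{job_id}", headers=HEADERS, timeout=30).json()`.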

Batch Processing

The naive batch loop is slow because most of the wall-clock time is the model running on the server. You want to upload several files, then poll them all in parallel.

from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
import glob


def remove_vocals_batch(
    input_dir: str,
    output_dir: str = "output",
    max_concurrent: int = 3,
) -> list[dict]:
    """
    Remove vocals from every audio file in a directory.

    Keep max_concurrent low: the free tier rate-limits aggressively.
    Bump it to 5–10 once you're on a paid plan.
    """
    files = []
    for ext in ("mp3", "wav", "flac", "m4a", "ogg", "webm"):
        files.extend(glob.glob(f"{input_dir}/*.{ext}"))

    print(f"Found {len(files)} files to process")
    results = []

    with ThreadPoolExecutor(max_workers=max_concurrent) as ex:
        futures = {ex.submit(remove_vocals, f, output_dir): f for f in files}
        for future in as_completed(futures):
            src = futures[future]
            try:
                result = future.result()
                results.append({"input": src, **result, "status": "ok"})
                print(f"{Path(src).name} ({result.get('bpm', '?')} BPM)")
            except Exception as e:
                results.append({"input": src, "status": "error", "error": str(e)})
                print(f"{Path(src).name}: {e}")

    ok = sum(1 for r in results if r["status"] == "ok")
    print(f"\nDone. {ok}/{len(files)} succeeded.")
    return results


# Usage
results = remove_vocals_batch("./music")

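A 50-file batch of ~4-minute songs with max_concurrent=3 (the free-tier concurrency cap) finishes in around 12–15 minutes wall-clock. Most of that is wait time on the GPU queue, not network. A batch this size needs paid credits, though: 50 four-minute songs is 200 minutes of audio, well past the 10 free minutes.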
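
That wall-clock figure can be sanity-checked with a rough model: files go through in waves of max_concurrent, and each wave takes about as long as one job. The sketch below assumes roughly 40–60 seconds of server time per job (the figure quoted for a 4-minute track under Common Issues) and ignores upload and download time, so treat it as an estimate only. `estimate_batch_seconds` is a hypothetical helper.

```python
import math


def estimate_batch_seconds(n_files, max_concurrent=3, seconds_per_job=50.0):
    """Rough wall-clock estimate: number of waves times time per job."""
    waves = math.ceil(n_files / max_concurrent)
    return waves * seconds_per_job


for per_job in (40.0, 50.0, 60.0):
    minutes = estimate_batch_seconds(50, 3, per_job) / 60
    print(f"{per_job:.0f}s per job: about {minutes:.1f} min for 50 files")
```

With 50 files and 3 workers that is 17 waves, so the estimate moves by about 17 seconds for every second of per-job time.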

⚠️ Set max_concurrent to your tier's allowed parallel jobs. The free tier rate-limits at 3 concurrent. You'll get HTTP 429 above that, which will retry-storm if you don't catch it.


Skip the Polling: Webhooks

Polling is fine for scripts. For a backend, you want webhooks so you're not burning HTTP requests waiting around.

Register a callback URL when you create the job:

def remove_vocals_async(audio_path: str, callback_url: str) -> str:
    """
    Start a vocal removal job. The API will POST to callback_url when done.
    Returns the job_id immediately.
    """
    with open(audio_path, "rb") as f:
        resp = requests.post(
            f"{API_BASE}/separate",
            headers=HEADERS,
            files={"audio": (Path(audio_path).name, f)},
            data={
                "stems": "2",
                "format": "wav",
                "webhook_url": callback_url,
            },
            timeout=60,
        )
    resp.raise_for_status()
    return resp.json()["job_id"]

The callback payload looks like:

{
  "job_id": "job_a8f2c1",
  "status": "completed",
  "stems": {
    "vocals": "https://stems.stemsplit.io/...",
    "instrumental": "https://stems.stemsplit.io/..."
  },
  "bpm": 124.0,
  "key": "C# minor",
  "duration_seconds": 218.5
}

Receive it in Flask:

from flask import Flask, request

# Your own Celery task that downloads a stem (hypothetical module, not shown here)
from tasks import queue_download

app = Flask(__name__)


@app.route("/webhook/stemsplit", methods=["POST"])
def stemsplit_webhook():
    # silent=True returns None instead of raising on a non-JSON body
    payload = request.get_json(silent=True) or {}

    if payload.get("status") != "completed":
        # log + alert; payload.get("error") has the reason
        return "", 204

    job_id = payload["job_id"]
    instrumental_url = payload["stems"]["instrumental"]

    # Download in a background task; don't block the webhook response
    queue_download.delay(job_id, instrumental_url)

    return "", 200

Two rules for webhooks that bite people:

  1. Always return 2xx fast. The webhook caller won't wait for you to download the stem. Queue the download, return, then process out-of-band.
  2. Verify the payload. Set a webhook secret in your dashboard and check the X-StemSplit-Signature header. The format matches Stripe-style HMAC-SHA256 over the raw body.
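
Rule 2 can be sketched with the standard library. The exact header layout is an assumption here: this version treats X-StemSplit-Signature as a plain hex HMAC-SHA256 digest of the raw request body. If the header turns out to use Stripe's `t=...,v1=...` layout, the parsing and the signed string will differ, so check the dashboard docs before relying on it. `verify_signature` is a hypothetical helper name.

```python
import hashlib
import hmac


def verify_signature(raw_body: bytes, signature_header: str, secret: str) -> bool:
    """Return True if signature_header is the hex HMAC-SHA256 of raw_body."""
    expected = hmac.new(secret.encode("utf-8"), raw_body, hashlib.sha256).hexdigest()
    received = (signature_header or "").strip()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct
    return hmac.compare_digest(expected.encode("utf-8"), received.encode("utf-8"))
```

In the Flask handler, compute this over `request.get_data()` (the raw bytes) before parsing the JSON, and return 401 when it fails. Re-serializing the parsed payload will not reproduce the signed bytes.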

Wrapping It as Your Own API

If you're building a frontend for vocal removal, you usually don't want browsers calling the StemSplit API directly — your key would leak. Wrap it.

from flask import Flask, request, jsonify, send_file
import io
import time
import requests
from pathlib import Path

from config import API_BASE, HEADERS

app = Flask(__name__)
MAX_UPLOAD_BYTES = 100 * 1024 * 1024


@app.route("/api/remove-vocals", methods=["POST"])
def remove_vocals_endpoint():
    """
    POST a multipart audio file. Returns the instrumental as the response body.
    """
    if "audio" not in request.files:
        return jsonify({"error": "missing 'audio' file"}), 400

    upload = request.files["audio"]

    # The per-file content_length is usually unset for multipart parts,
    # so check the size of the whole request instead.
    if request.content_length and request.content_length > MAX_UPLOAD_BYTES:
        return jsonify({"error": "file too large (max 100 MB)"}), 413

    resp = requests.post(
        f"{API_BASE}/separate",
        headers=HEADERS,
        files={"audio": (upload.filename, upload.stream)},
        data={"stems": "2", "format": "wav"},
        timeout=60,
    )
    resp.raise_for_status()
    job_id = resp.json()["job_id"]

    # Pause between polls and give up at a deadline so a stuck job
    # can't spin this worker forever.
    deadline = time.time() + 300
    while True:
        status = requests.get(f"{API_BASE}/jobs/{job_id}", headers=HEADERS, timeout=30).json()
        if status["status"] == "completed":
            break
        if status["status"] == "failed":
            return jsonify({"error": status.get("error", "job failed")}), 500
        if time.time() > deadline:
            return jsonify({"error": "job timed out"}), 504
        time.sleep(3)

    instrumental_url = status["stems"]["instrumental"]
    audio_bytes = requests.get(instrumental_url, timeout=120).content

    return send_file(
        io.BytesIO(audio_bytes),
        mimetype="audio/wav",
        as_attachment=True,
        download_name=f"{Path(upload.filename).stem}_instrumental.wav",
    )


if __name__ == "__main__":
    app.run(port=8000)

For real production, swap the inline polling for a Celery task and push the result back to the user via WebSocket or SSE. The pattern in the Stem Splitter API with FastAPI and Celery article drops in cleanly here: same architecture, different upstream.


Cost Math

The pricing is $0.10 per minute of input audio. So a 4-minute song costs $0.40 to process. The 10-minute free tier on signup works out to ~2–3 average tracks, enough to wire up the integration before you commit a card.

A few back-of-envelope numbers I ran for the side project:

| Workload | Math | Monthly |
| --- | --- | --- |
| Personal app, 50 songs/month | 50 × 4 min × $0.10 | $20 |
| Side project, 1,000 songs/month | 1000 × 4 × $0.10 | $400 |
| Production batch, 10,000 songs/month | 10000 × 4 × $0.10 | $4,000 |
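
The same arithmetic as a few lines of Python, in case you want to plug in your own volumes. The $0.10 per minute rate, the 4-minute average track, and the 10 free minutes are the figures quoted above. The function names are just for illustration.

```python
def api_cost_usd(songs, minutes_per_song=4.0, rate_per_minute=0.10):
    """Monthly API cost: songs x average length x per-minute rate."""
    return songs * minutes_per_song * rate_per_minute


def free_tier_tracks(free_minutes=10.0, minutes_per_song=4.0):
    """How many average-length tracks fit in the signup allowance."""
    return free_minutes / minutes_per_song


for songs in (50, 1_000, 10_000):
    print(f"{songs:>6} songs/month: ${api_cost_usd(songs):,.2f}")
print(f"Free tier covers about {free_tier_tracks():.1f} tracks")
```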

Above ~5,000 songs/month, running Demucs yourself on a $0.40/hr GPU starts to make sense if you have someone willing to babysit the queue. Below that, the API is the cheaper option because you're not paying for idle GPU time.

Credits don't expire, so buying $50 of credits and burning through them slowly is fine — useful for hobby projects.


Common Issues

"Job stuck on processing for >2 minutes"

A 4-minute track typically completes in 40–60 seconds. If it's been over two minutes, the file is probably long (10+ min mixes are slower) or the queue is backed up. The API has a hard 5-minute SLA per job; my polling code times out at 300 s for a reason.

"HTTP 429 on every other request"

You're hitting the concurrency cap. Drop max_concurrent to 3 and add a backoff:

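import requests
from tenacity import retry, retry_if_exception, stop_after_attempt, wait_exponential

def is_rate_limited(exc):
    # Retry only on HTTP 429; other HTTP errors (400, 401, ...) should fail fast
    return isinstance(exc, requests.HTTPError) and exc.response is not None and exc.response.status_code == 429

@retry(retry=retry_if_exception(is_rate_limited), wait=wait_exponential(min=2, max=60),
       stop=stop_after_attempt(6), reraise=True)
def safe_post(*args, **kwargs):
    r = requests.post(*args, **kwargs)
    r.raise_for_status()
    return r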

"The instrumental still has a faint vocal in the chorus"

Heavily layered choruses are the hard case for any AI vocal remover. Two things help:

  1. Try stems: 6 instead of 2 and re-mix without the vocal stem. Backing harmonies sometimes get bucketed into other or guitar when the model isn't sure they're lead vocal.
  2. Start from a lossless or higher-bitrate source if your input is an MP3 below 192 kbps. Lossy compression strips frequency detail the model relies on, and converting the MP3 to WAV afterwards does not bring it back.
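
For the first suggestion, the re-mix step is just a sample-by-sample sum of every stem except vocals. Here is a standard-library sketch. It assumes the downloaded stems are 16-bit PCM WAV files with identical channel count, sample rate, and length, which is not confirmed by the API docs, and it holds everything in Python lists, so it is slow on full-length tracks (NumPy would be the faster choice). `mix_wav_stems` is a hypothetical helper.

```python
import array
import sys
import wave


def mix_wav_stems(stem_paths, out_path):
    """Sum 16-bit PCM WAV files sample by sample and write the result."""
    mixed = None
    layout = None  # (channels, sample_rate) shared by all inputs

    for path in stem_paths:
        with wave.open(str(path), "rb") as wav:
            if wav.getsampwidth() != 2:
                raise ValueError(f"{path}: expected 16-bit PCM")
            this_layout = (wav.getnchannels(), wav.getframerate())
            frames = wav.readframes(wav.getnframes())

        samples = array.array("h")
        samples.frombytes(frames)
        if sys.byteorder == "big":
            samples.byteswap()  # WAV sample data is little-endian

        if mixed is None:
            mixed, layout = list(samples), this_layout
        elif this_layout != layout or len(samples) != len(mixed):
            raise ValueError(f"{path}: layout or length differs from the first stem")
        else:
            mixed = [a + b for a, b in zip(mixed, samples)]

    if mixed is None:
        raise ValueError("no stems given")

    # Clip to the int16 range so loud passages saturate instead of wrapping around
    clipped = array.array("h", (max(-32768, min(32767, s)) for s in mixed))
    if sys.byteorder == "big":
        clipped.byteswap()

    with wave.open(str(out_path), "wb") as out:
        out.setnchannels(layout[0])
        out.setsampwidth(2)
        out.setframerate(layout[1])
        out.writeframes(clipped.tobytes())
```

With a 6-stem result you would pass the paths for every stem except vocals, for example the drums, bass, other, guitar, and piano files, and compare the output against the 2-stem instrumental by ear.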

"Webhook never fires"

Three things to check, in order:

  1. Is your webhook URL publicly reachable? (Use a tunnel like cloudflared for local dev.)
  2. Are you returning 2xx within 10 seconds? Slow handlers get marked failed.
  3. Is the URL on HTTPS? The API won't POST to plain HTTP.
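
Checks 1 and 3 can be caught locally before you ever register the URL. The sketch below only inspects the URL string. It cannot prove the endpoint is reachable from the internet, and `check_webhook_url` is a hypothetical helper, not part of the API.

```python
import ipaddress
from urllib.parse import urlparse


def check_webhook_url(url):
    """Return a list of problems that would stop a webhook from being delivered."""
    problems = []
    parts = urlparse(url)

    if parts.scheme != "https":
        problems.append("URL must use https")

    host = parts.hostname or ""
    if not host:
        problems.append("URL has no host")
    elif host == "localhost" or host.endswith(".local"):
        problems.append("host is only reachable from your own machine or network")
    else:
        try:
            ip = ipaddress.ip_address(host)
        except ValueError:
            ip = None  # a regular hostname, not an IP literal
        if ip is not None and (ip.is_private or ip.is_loopback):
            problems.append("IP address is private or loopback")

    return problems
```

An empty list means the URL passes the static checks. The 10-second response rule still has to be verified against the running handler.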

"Big files time out on upload"

The examples above pass timeout=60 to requests.post (requests itself has no timeout by default), which covers maybe a 50 MB upload on a decent connection. For 100 MB files, bump to 300:

requests.post(..., timeout=300)

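Note that passing an open file through files= does not stream it: requests reads the whole file into memory to build the multipart body. That is fine at 100 MB. If memory is tight, a streaming multipart encoder such as the one in requests-toolbelt avoids the copy.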


Summary

| Use Case | Pattern |
| --- | --- |
| Quick script to remove vocals from one file | The 5-line version above |
| CLI tool for a folder of files | remove_vocals_batch with bounded concurrency |
| Backend for a vocal-removal frontend | Flask wrapper + webhook callbacks |
| Production pipeline | Celery + webhooks + retry-with-backoff |
| 5,000+ songs/month | Reconsider local Demucs on your own GPU |

The whole AI vocal remover workflow boils down to three HTTP calls. The interesting part is what you build around them — error handling, batching, and how you stream results back to the user without blocking your workers.

