You're about to ship stem separation in a product. Marketing pages say "state-of-the-art AI." Your job is to pick an actual model_id string — htdemucs, htdemucs_ft, htdemucs_6s, or mdx_extra_q — and defend that choice to whoever reads the PR.
Listening tests on three Spotify tracks won't cut it. Running MUSDB18 yourself takes a weekend and 22 GB of disk.
There's a third option: stem-separation-benchmark-2026 — 800 rows of BSS Eval v4 scores (50 MUSDB18-HQ test tracks × 4 models × 4 stems), CC-BY-4.0, queryable in pandas from your laptop in under a minute.
This post shows how to turn that table into a model picker and a regression test — not how to re-run the benchmark (see Best Free AI Stem Splitters in 2026 for DIY mir_eval).
What You'll Learn
- ✅ Load the public leaderboard into pandas without downloading audio
- ✅ Pick a model by stem priority (`vocals` vs `drums`) and latency budget (RTF)
- ✅ Find worst-case tracks per model before they hit production
- ✅ Assert median vocal SDR in CI so model upgrades don't silently regress
- ✅ Map picks to StemSplit API quality tiers (same weights)
Prerequisites
```bash
pip install datasets pandas
```
No GPU. No MUSDB18 download. The dataset ships metrics only (MUSDB18 audio can't be redistributed).
What's in the dataset?
Short answer: One Parquet config (metrics_only) with per-track, per-stem SDR/ISR/SIR/SAR plus wall_time_s, rtf, and host metadata from an Apple M4 Pro + PyTorch MPS run.
| Column | Use in your app |
|---|---|
| `model_id` | Which weights to load or API tier to call |
| `stem` | `vocals`, `drums`, `bass`, `other` |
| `sdr_median` | Primary quality score (dB, higher = better) |
| `rtf` | `wall_time_s / duration_s`; 0.07 ≈ 14× faster than realtime |
| `track_id` | MUSDB18-HQ test folder name |
Four models in v1: htdemucs, htdemucs_ft, htdemucs_6s, mdx_extra_q.
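To make the `rtf` column concrete, here is a small sketch of the arithmetic behind it. The helper names (`speedup`, `estimated_wall_time_s`) are illustrative and not part of the dataset, and the estimates only carry over to hardware comparable to the benchmark host.

```python
def speedup(rtf: float) -> float:
    """How many times faster than realtime a run is (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def estimated_wall_time_s(duration_s: float, rtf: float) -> float:
    """Invert rtf = wall_time_s / duration_s to estimate processing time."""
    return duration_s * rtf


if __name__ == "__main__":
    print(f"{speedup(0.07):.1f}x realtime")             # 14.3x realtime
    print(f"{estimated_wall_time_s(240, 0.07):.1f} s")  # 16.8 s for a 4-minute track
```

Treat the result as a rough planning number: a different GPU, CPU fallback, or batch size will shift RTF.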
Note on `htdemucs_6s`: `other` scores look broken (~0.2 dB) because piano/guitar are split out of MUSDB's `other` stem. Compare `htdemucs_6s` on vocals/drums/bass only, or wait for v1.1 where `piano + guitar + other` is summed before eval. See the dataset card.
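One way to apply that caveat in code is to mask out the affected rows before aggregating. This is a minimal sketch: `comparable_rows` is a made-up helper name, and the demo frame contains invented numbers that only mimic the dataset's column names.

```python
import pandas as pd


def comparable_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop htdemucs_6s/other rows, which are not comparable in v1."""
    broken = (df["model_id"] == "htdemucs_6s") & (df["stem"] == "other")
    return df[~broken]


if __name__ == "__main__":
    # Tiny invented frame with the same column names as the dataset.
    demo = pd.DataFrame(
        {
            "model_id": ["htdemucs_6s", "htdemucs_6s", "htdemucs_ft", "htdemucs_ft"],
            "stem": ["vocals", "other", "vocals", "other"],
            "sdr_median": [8.7, 0.2, 9.2, 6.3],
        }
    )
    print(comparable_rows(demo))  # three rows; the htdemucs_6s/other row is gone
```

Running every cross-model average through a filter like this keeps the ~0.2 dB artifact from dragging `htdemucs_6s` down in summaries that mix stems.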
For background on what SDR means in practice, how HTDemucs separates audio is a better read than repeating the metric tutorial here.
Load the leaderboard in three lines
Short answer: load_dataset → to_pandas() → groupby.
```python
from datasets import load_dataset

ds = load_dataset(
    "StemSplitio/stem-separation-benchmark-2026",
    "metrics_only",
    split="results",
)
df = ds.to_pandas()

# Median of the per-track SDR for each (model, stem) pair, one column per stem
leaderboard = (
    df.groupby(["model_id", "stem"])["sdr_median"]
    .median()
    .unstack()
    .round(2)
)
print(leaderboard)
```
Example output (v1, May 2026):
```text
stem          bass  drums  other  vocals
model_id
htdemucs      9.78  10.01   6.42    8.53
htdemucs_6s   9.11   9.54   0.22    8.66
htdemucs_ft  10.38  10.11   6.34    9.19
mdx_extra_q  11.42  11.49   7.67    9.04
```
Vocal-first apps: htdemucs_ft wins on vocals. Drum/bass-heavy apps (DJ tools, rhythm games): mdx_extra_q leads. Latency-sensitive previews: check RTF next.
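If you'd rather not eyeball the table, the same reading can be computed. The sketch below hard-codes the values from the example output above so it runs without the dataset. With the real data you would reuse the `leaderboard` frame directly.

```python
import pandas as pd

# Values copied from the example output (v1, May 2026).
leaderboard = pd.DataFrame(
    {
        "bass": [9.78, 9.11, 10.38, 11.42],
        "drums": [10.01, 9.54, 10.11, 11.49],
        "other": [6.42, 0.22, 6.34, 7.67],
        "vocals": [8.53, 8.66, 9.19, 9.04],
    },
    index=pd.Index(
        ["htdemucs", "htdemucs_6s", "htdemucs_ft", "mdx_extra_q"], name="model_id"
    ),
)

winners = leaderboard.idxmax()  # best model_id per stem
runner_up = leaderboard.apply(lambda col: col.nlargest(2).iloc[1])
margin = (leaderboard.max() - runner_up).round(2)  # lead over second place, in dB

print(pd.DataFrame({"winner": winners, "margin_db": margin}))
```

The margin column is worth a look before you commit: on these numbers the vocal lead of htdemucs_ft over mdx_extra_q is only 0.15 dB, while the drum and bass leads are above 1 dB.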
Build a model picker function
Short answer: Filter by stem + max RTF, then idxmax on sdr_median.
```python
# picker.py
import pandas as pd

# All four v1 models; pass a subset as `candidates` to narrow the search
VOCAL_MODELS = ["htdemucs", "htdemucs_ft", "htdemucs_6s", "mdx_extra_q"]


def load_benchmark() -> pd.DataFrame:
    # Deferred import so importing this module does not require `datasets`
    from datasets import load_dataset

    return load_dataset(
        "StemSplitio/stem-separation-benchmark-2026",
        "metrics_only",
        split="results",
    ).to_pandas()


def pick_model(
    df: pd.DataFrame,
    *,
    stem: str = "vocals",
    max_rtf: float | None = 0.10,
    candidates: list[str] | None = None,
) -> str:
    """Return model_id with best median SDR for `stem` under `max_rtf`."""
    subset = df[df["stem"] == stem]
    if candidates:
        subset = subset[subset["model_id"].isin(candidates)]
    if max_rtf is not None:
        # Row-wise filter: drops individual tracks above the cap, not whole models
        subset = subset[subset["rtf"] <= max_rtf]
    if subset.empty:
        raise ValueError(f"No model satisfies stem={stem!r}, max_rtf={max_rtf}")
    scores = subset.groupby("model_id")["sdr_median"].median()
    return scores.idxmax()


if __name__ == "__main__":
    df = load_benchmark()
    print("Vocal model (RTF ≤ 0.10):", pick_model(df, stem="vocals"))
    print("Drums model (no RTF cap):", pick_model(df, stem="drums", max_rtf=None))
```
On the current data, pick_model(..., stem="vocals", max_rtf=0.10) returns htdemucs_ft. Drop the RTF cap on drums and you get mdx_extra_q.
Wire this into your config:
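```python
# settings.py
from picker import load_benchmark, pick_model

df = load_benchmark()  # load once at import time and reuse for every stem
STEM_MODEL = {
    "vocals": pick_model(df, stem="vocals", max_rtf=0.08),
    "drums": pick_model(df, stem="drums", max_rtf=None),
}
```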
That's a defensible default you can paste into a design doc with a link to the dataset.
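One caveat before you paste it: `pick_model` applies the RTF cap row by row, so a model whose slowest tracks blow the budget can still win on the strength of its faster tracks. If your latency budget is a hard limit, a stricter variant can qualify whole models instead. The sketch below is an assumption, not part of the dataset tooling: `pick_model_strict` is a hypothetical name, and using the 95th percentile as the cutoff is a choice you may want to change.

```python
import pandas as pd


def pick_model_strict(
    df: pd.DataFrame,
    *,
    stem: str = "vocals",
    max_rtf: float = 0.10,
    quantile: float = 0.95,
) -> str:
    """Best median SDR among models whose RTF quantile fits the budget."""
    subset = df[df["stem"] == stem]
    stats = subset.groupby("model_id").agg(
        sdr=("sdr_median", "median"),
        rtf_q=("rtf", lambda s: s.quantile(quantile)),
    )
    eligible = stats[stats["rtf_q"] <= max_rtf]  # whole models pass or fail
    if eligible.empty:
        raise ValueError(f"No model satisfies stem={stem!r}, max_rtf={max_rtf}")
    return eligible["sdr"].idxmax()


if __name__ == "__main__":
    # Invented data: model_a scores higher but has one slow track.
    demo = pd.DataFrame(
        {
            "model_id": ["model_a", "model_a", "model_b", "model_b"],
            "stem": ["vocals"] * 4,
            "sdr_median": [9.5, 9.5, 9.0, 9.0],
            "rtf": [0.05, 0.30, 0.06, 0.07],
        }
    )
    print(pick_model_strict(demo))  # model_b: model_a's p95 RTF is over budget
```

Which behaviour you want depends on the product: the row-wise version answers "what is the best quality on tracks that finish in time", while the strict version answers "which model can I promise a latency for".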
Find your worst-case tracks before users do
Short answer: Sort by sdr_median ascending per model — the tail is where separation breaks.
```python
def worst_tracks(df: pd.DataFrame, model_id: str, stem: str, n: int = 5) -> pd.DataFrame:
    """Return the n lowest-SDR tracks for one model and stem."""
    return (
        df[(df["model_id"] == model_id) & (df["stem"] == stem)]
        .nsmallest(n, "sdr_median")[["track_id", "sdr_median", "rtf", "duration_s"]]
    )


df = load_benchmark()
print(worst_tracks(df, "htdemucs_ft", "vocals"))
```
If your product targets metal or dense mixes, cross-check whether your genre's failure modes show up in these track names — then add those clips to your own QA set. The HF table won't replace genre-specific testing, but it surfaces landmines faster than random uploads.
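A related question is whether a bad track is bad for one model or for all of them. If even the best model scores low, swapping weights won't save you. Here is a minimal sketch of that check. `hard_for_every_model` is a hypothetical helper, and the demo rows are invented.

```python
import pandas as pd


def hard_for_every_model(df: pd.DataFrame, stem: str, n: int = 5) -> pd.Series:
    """Tracks ranked by their best SDR across models, lowest first."""
    per_track_best = (
        df[df["stem"] == stem]
        .groupby("track_id")["sdr_median"]
        .max()  # the best any model managed on this track
    )
    return per_track_best.nsmallest(n)


if __name__ == "__main__":
    # Invented rows: track_b is bad for m2 only, track_a is bad for both.
    demo = pd.DataFrame(
        {
            "model_id": ["m1", "m2", "m1", "m2", "m1", "m2"],
            "stem": ["vocals"] * 6,
            "track_id": ["track_a", "track_a", "track_b", "track_b", "track_c", "track_c"],
            "sdr_median": [2.0, 3.0, 9.0, 1.0, 5.0, 6.0],
        }
    )
    print(hard_for_every_model(demo, "vocals", n=2))  # track_a (3.0), then track_c (6.0)
```

Tracks that show up here are candidates for product-level handling (a quality warning, a retry with different settings) rather than a model swap.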
Regression-test model quality in CI
Short answer: Assert median vocal SDR from the public table stays above a floor — catches bad deploys without running inference in GitHub Actions.
You are not running Demucs in CI. You're pinning expectations to the published benchmark so nobody ships htdemucs where product spec says htdemucs_ft.
```python
# tests/test_model_policy.py
import pytest
from picker import load_benchmark, pick_model  # module from the picker section

VOCAL_SDR_FLOOR_DB = 9.0  # below htdemucs_ft median; adjust with product input


def test_vocal_model_meets_benchmark_floor():
    df = load_benchmark()
    vocal_median = (
        df[(df["model_id"] == "htdemucs_ft") & (df["stem"] == "vocals")]["sdr_median"].median()
    )
    assert vocal_median >= VOCAL_SDR_FLOOR_DB, (
        f"htdemucs_ft vocal median {vocal_median:.2f} dB < floor {VOCAL_SDR_FLOOR_DB}"
    )


def test_picker_respects_rtf_budget():
    df = load_benchmark()
    model = pick_model(df, stem="vocals", max_rtf=0.10)
    # p95 over all of the model's vocal rows, including ones the picker filtered out
    rtf_p95 = df[(df["model_id"] == model) & (df["stem"] == "vocals")]["rtf"].quantile(0.95)
    assert rtf_p95 <= 0.10, f"{model} p95 RTF {rtf_p95:.3f} exceeds budget"
```
Cache the dataset in CI (HF_HOME or datasets cache dir) so pulls don't hit rate limits every run.
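One way to do the `HF_HOME` part from Python is sketched below. The helper name `configure_hf_cache` and the `.cache/huggingface` default are assumptions. The directory only helps if your CI provider is configured to persist it between runs.

```python
import os
from pathlib import Path


def configure_hf_cache(cache_root: str = ".cache/huggingface") -> Path:
    """Point HF_HOME at a directory the CI runner can persist between jobs.

    Call this before importing `datasets`, because the library reads the
    variable when it is imported. An HF_HOME already set by the runner wins.
    """
    os.environ.setdefault("HF_HOME", str(Path(cache_root).resolve()))
    cache_dir = Path(os.environ["HF_HOME"])
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


if __name__ == "__main__":
    print("Hugging Face cache:", configure_hf_cache())
```

A natural place to call it is the top of `tests/conftest.py`, before anything imports `picker`.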
Map picks to StemSplit API tiers
Short answer: Hosted StemSplit runs the same weights — no separate stemsplit_api row in the dataset because numbers would duplicate.
| Your picker result | StemSplit tier | Benchmark row |
|---|---|---|
| Speed / bulk preview | FAST | htdemucs |
| Vocal isolation default | BALANCED | htdemucs_ft |
| Piano + guitar stems | BEST (6-stem) | htdemucs_6s |
If you'd rather not operate Demucs + ffmpeg yourself, the API uses the same models — see building a vocal remover with the StemSplit API for the async job flow.
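If you want the tier table in code, a plain lookup is enough. This sketch only encodes the three rows above. `tier_for` is a hypothetical helper, and how the tier string is passed to the API is deliberately left out because it isn't covered here.

```python
TIER_BY_MODEL = {
    "htdemucs": "FAST",
    "htdemucs_ft": "BALANCED",
    "htdemucs_6s": "BEST",  # the 6-stem tier
}


def tier_for(model_id: str) -> str:
    """Translate a benchmark model_id into a StemSplit tier name."""
    try:
        return TIER_BY_MODEL[model_id]
    except KeyError:
        raise ValueError(
            f"{model_id!r} has no tier in the mapping table; "
            "pick among " + ", ".join(TIER_BY_MODEL)
        ) from None


if __name__ == "__main__":
    print(tier_for("htdemucs_ft"))  # BALANCED
```

Note that mdx_extra_q has no row in the table, so a picker result of mdx_extra_q raises here. Passing `candidates=list(TIER_BY_MODEL)` to `pick_model` keeps the picker inside the set of models that do have a tier.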
When you still need your own benchmark
The HF table answers: "Which open-source Demucs variant is best on MUSDB18-HQ?"
It does not answer:
- How your users' uploads (low-bitrate MP3, phone recordings) behave
- Commercial tools (LALAL.AI, Moises) — see AI Stem Splitter API Comparison
- Spleeter migration: Spleeter is Dead + local `mir_eval` if you need custom tracks
Use the public dataset for model ID selection and CI floors. Layer your own golden files on top for product QA.
Wrapping Up
- Dataset: huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026
- Pattern: load once → `pick_model(stem, max_rtf)` → worst-track audit → CI median assertion
- Don't duplicate work: MUSDB18 eval is already done; spend your GPU time on your audio
If you build something on top of this table (dashboard, CLI, fine-tune report), drop a link in the comments — the dataset is CC-BY-4.0.
Citation:
@misc{stemsplit_benchmark_2026,
title = {StemSplit Stem-Separation Benchmark 2026},
author = {StemSplit},
year = {2026},
url = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}