StemSplit

Posted on May 19

Pick the Right HTDemucs Model in Python — Query 800 MUSDB18-HQ Scores on Hugging Face (2026)

#ai #python #opensource #machinelearning

You're about to ship stem separation in a product. Marketing pages say "state-of-the-art AI." Your job is to pick an actual model_id string — htdemucs, htdemucs_ft, htdemucs_6s, or mdx_extra_q — and defend that choice to whoever reads the PR.

Listening tests on three Spotify tracks won't cut it. Running MUSDB18 yourself takes a weekend and 22 GB of disk.

There's a third option: stem-separation-benchmark-2026 — 800 rows of BSS Eval v4 scores (50 MUSDB18-HQ test tracks × 4 models × 4 stems), CC-BY-4.0, queryable in pandas from your laptop in under a minute.

This post shows how to turn that table into a model picker and a regression test — not how to re-run the benchmark (see Best Free AI Stem Splitters in 2026 for DIY mir_eval).

What You'll Learn

✅ Load the public leaderboard into pandas without downloading audio
✅ Pick a model by stem priority (vocals vs drums) and latency budget (RTF)
✅ Find worst-case tracks per model before they hit production
✅ Assert median vocal SDR in CI so model upgrades don't silently regress
✅ Map picks to StemSplit API quality tiers (same weights)

Prerequisites

pip install datasets pandas

No GPU. No MUSDB18 download. The dataset ships metrics only (MUSDB18 audio can't be redistributed).

What's in the dataset?

Short answer: One Parquet config (metrics_only) with per-track, per-stem SDR/ISR/SIR/SAR plus wall_time_s, rtf, and host metadata from an Apple M4 Pro + PyTorch MPS run.

Column	Use in your app
`model_id`	Which weights to load or API tier to call
`stem`	`vocals`, `drums`, `bass`, `other`
`sdr_median`	Primary quality score (dB, higher = better)
`rtf`	`wall_time_s / duration_s` — 0.07 ≈ 14× faster than realtime
`track_id`	MUSDB18-HQ test folder name

Four models in v1: htdemucs, htdemucs_ft, htdemucs_6s, mdx_extra_q.
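Before trusting any aggregate, it can help to confirm that the table you loaded has the shape described above. The following is a minimal sketch, assuming the column names from the table and the 50 × 4 × 4 layout. `check_shape` and the synthetic frame are illustrative names, not part of the dataset's tooling, and the demo values are made up.

```python
import itertools

import pandas as pd

EXPECTED_MODELS = {"htdemucs", "htdemucs_ft", "htdemucs_6s", "mdx_extra_q"}
EXPECTED_STEMS = {"vocals", "drums", "bass", "other"}
EXPECTED_TRACKS = 50  # MUSDB18-HQ test tracks, as quoted in the intro


def check_shape(df: pd.DataFrame) -> list[str]:
    """Return human-readable problems; an empty list means the shape matches."""
    required = {"model_id", "stem", "track_id", "sdr_median", "rtf"}
    missing = required - set(df.columns)
    if missing:
        return [f"missing columns: {sorted(missing)}"]

    problems = []
    if set(df["model_id"]) != EXPECTED_MODELS:
        problems.append(f"unexpected models: {sorted(set(df['model_id']))}")
    if set(df["stem"]) != EXPECTED_STEMS:
        problems.append(f"unexpected stems: {sorted(set(df['stem']))}")

    # Every (model, stem) cell should cover the full test set.
    per_cell = df.groupby(["model_id", "stem"])["track_id"].nunique()
    for (model, stem), n in per_cell[per_cell != EXPECTED_TRACKS].items():
        problems.append(f"{model}/{stem}: {n} tracks, expected {EXPECTED_TRACKS}")
    return problems


# Synthetic stand-in with the same layout (values are made up, not benchmark data).
rows = [
    {"model_id": m, "stem": s, "track_id": f"track_{t:02d}", "sdr_median": 8.0, "rtf": 0.07}
    for m, s, t in itertools.product(
        sorted(EXPECTED_MODELS), sorted(EXPECTED_STEMS), range(EXPECTED_TRACKS)
    )
]
demo = pd.DataFrame(rows)
print(len(demo), check_shape(demo))
```

Running the same check on the real frame right after loading gives an early warning if a later dataset version adds models or drops tracks.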

Note on htdemucs_6s: other scores look broken (~0.2 dB) because piano/guitar are split out of MUSDB's other stem. Compare htdemucs_6s on vocals/drums/bass only, or wait for v1.1 where piano + guitar + other is summed before eval. See the dataset card.
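One way to follow that advice in code is to drop the non-comparable rows before aggregating, so the 6-stem model's other column shows up as missing rather than as a misleading low score. This is a sketch under that assumption. `comparable_rows` is a hypothetical helper and the demo values are made up.

```python
import pandas as pd


def comparable_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop htdemucs_6s 'other' rows, which are not comparable in v1."""
    skip = (df["model_id"] == "htdemucs_6s") & (df["stem"] == "other")
    return df[~skip]


# Made-up values purely to exercise the filter; not benchmark data.
demo = pd.DataFrame(
    {
        "model_id": ["htdemucs", "htdemucs", "htdemucs_6s", "htdemucs_6s"],
        "stem": ["vocals", "other", "vocals", "other"],
        "sdr_median": [8.5, 6.4, 8.7, 0.2],
    }
)
board = (
    comparable_rows(demo)
    .groupby(["model_id", "stem"])["sdr_median"]
    .median()
    .unstack()
)
print(board)  # the htdemucs_6s / other cell is NaN instead of 0.2
```

A NaN cell is harder to misread than a 0.2 dB score, and pandas skips it by default in `idxmax` and `median`.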

For background on what SDR means in practice, how HTDemucs separates audio is a better read than repeating the metric tutorial here.


Load the leaderboard in three lines

Short answer: load_dataset → to_pandas() → groupby.

from datasets import load_dataset

ds = load_dataset(
    "StemSplitio/stem-separation-benchmark-2026",
    "metrics_only",
    split="results",
)
df = ds.to_pandas()

leaderboard = (
    df.groupby(["model_id", "stem"])["sdr_median"]
    .median()
    .unstack()
    .round(2)
)
print(leaderboard)

Example output (v1, May 2026):

stem          bass  drums  other  vocals
model_id
htdemucs      9.78  10.01   6.42    8.53
htdemucs_6s   9.11   9.54   0.22    8.66
htdemucs_ft  10.38  10.11   6.34    9.19
mdx_extra_q  11.42  11.49   7.67    9.04

Vocal-first apps: htdemucs_ft wins on vocals. Drum/bass-heavy apps (DJ tools, rhythm games): mdx_extra_q leads. Latency-sensitive previews: check RTF next.
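The RTF check can use the same groupby pattern. The sketch below summarizes median and 95th-percentile RTF per model and converts the median into a speedup figure (1 / RTF, so 0.07 becomes roughly 14×). `rtf_summary` is an illustrative name and the demo values are made up. Keep in mind that the published RTF comes from the Apple M4 Pro run, so it is probably best read as a relative ranking rather than a promise about your hardware.

```python
import pandas as pd


def rtf_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Median and p95 RTF per model, plus the implied speedup over realtime."""
    out = df.groupby("model_id")["rtf"].agg(
        rtf_median="median",
        rtf_p95=lambda s: s.quantile(0.95),
    )
    # RTF = wall_time_s / duration_s, so 1 / RTF is "times faster than realtime".
    out["speedup_x"] = (1.0 / out["rtf_median"]).round(1)
    return out.sort_values("rtf_median")


# Made-up RTF values for two placeholder models; not benchmark data.
demo = pd.DataFrame(
    {
        "model_id": ["model_a"] * 3 + ["model_b"] * 3,
        "rtf": [0.05, 0.07, 0.09, 0.10, 0.20, 0.30],
    }
)
summary = rtf_summary(demo)
print(summary)
```

Looking at p95 next to the median matters for latency budgets: a model with a comfortable median can still blow the budget on its slowest tracks.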

Build a model picker function

Short answer: Filter by stem + max RTF, take the per-model median of sdr_median, then idxmax.

import pandas as pd

# Optional allow-list: pass as `candidates=` to restrict the picker.
VOCAL_MODELS = ["htdemucs", "htdemucs_ft", "htdemucs_6s", "mdx_extra_q"]


def load_benchmark() -> pd.DataFrame:
    from datasets import load_dataset

    return load_dataset(
        "StemSplitio/stem-separation-benchmark-2026",
        "metrics_only",
        split="results",
    ).to_pandas()


def pick_model(
    df: pd.DataFrame,
    *,
    stem: str = "vocals",
    max_rtf: float | None = 0.10,
    candidates: list[str] | None = None,
) -> str:
    """Return model_id with best median SDR for `stem` under `max_rtf`."""
    subset = df[df["stem"] == stem]
    if candidates:
        subset = subset[subset["model_id"].isin(candidates)]
    if max_rtf is not None:
        subset = subset[subset["rtf"] <= max_rtf]
    if subset.empty:
        raise ValueError(f"No model satisfies stem={stem!r}, max_rtf={max_rtf}")

    scores = subset.groupby("model_id")["sdr_median"].median()
    return scores.idxmax()


if __name__ == "__main__":
    df = load_benchmark()
    print("Vocal model (RTF ≤ 0.10):", pick_model(df, stem="vocals"))
    print("Drums model (no RTF cap):", pick_model(df, stem="drums", max_rtf=None))

On the current data, pick_model(..., stem="vocals", max_rtf=0.10) returns htdemucs_ft. Drop the RTF cap on drums and you get mdx_extra_q.

Wire this into your config:

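# settings.py
from picker import load_benchmark, pick_model  # module from the previous section

df = load_benchmark()

STEM_MODEL = {
    "vocals": pick_model(df, stem="vocals", max_rtf=0.08),
    "drums": pick_model(df, stem="drums", max_rtf=None),
}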

That's a defensible default you can paste into a design doc with a link to the dataset.
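One caveat about pick_model as written: it applies max_rtf row by row, so a model that only meets the budget on some tracks stays in the running and is scored on its fast tracks alone. If you want the budget to apply to the model as a whole, a variant can filter on a per-model RTF quantile instead. This is a sketch, not the dataset authors' method. `pick_model_by_budget`, the `rtf_quantile` parameter, and the demo values are all assumptions for illustration.

```python
from typing import Optional

import pandas as pd


def pick_model_by_budget(
    df: pd.DataFrame,
    *,
    stem: str = "vocals",
    max_rtf: Optional[float] = 0.10,
    rtf_quantile: float = 0.95,
) -> str:
    """Best median SDR for `stem` among models whose RTF quantile fits the budget."""
    subset = df[df["stem"] == stem]
    stats = subset.groupby("model_id").agg(
        sdr=("sdr_median", "median"),
        rtf_q=("rtf", lambda s: s.quantile(rtf_quantile)),
    )
    if max_rtf is not None:
        stats = stats[stats["rtf_q"] <= max_rtf]
    if stats.empty:
        raise ValueError(f"No model satisfies stem={stem!r}, max_rtf={max_rtf}")
    return stats["sdr"].idxmax()


# Made-up rows: "slow_but_good" has one fast track and two slow ones.
demo = pd.DataFrame(
    {
        "model_id": ["slow_but_good"] * 3 + ["fast"] * 3,
        "stem": ["vocals"] * 6,
        "sdr_median": [9.5, 9.4, 9.6, 8.5, 8.6, 8.4],
        "rtf": [0.05, 0.30, 0.30, 0.05, 0.06, 0.07],
    }
)
print(pick_model_by_budget(demo))                # budget applied per model
print(pick_model_by_budget(demo, max_rtf=None))  # no budget
```

On this toy data a row-level filter would keep the single fast row of slow_but_good and pick it, while the quantile-based version rejects it because its p95 RTF exceeds the budget.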

Find your worst-case tracks before users do

Short answer: Sort by sdr_median ascending per model — the tail is where separation breaks.

def worst_tracks(df: pd.DataFrame, model_id: str, stem: str, n: int = 5) -> pd.DataFrame:
    return (
        df[(df["model_id"] == model_id) & (df["stem"] == stem)]
        .nsmallest(n, "sdr_median")[["track_id", "sdr_median", "rtf", "duration_s"]]
    )


df = load_benchmark()
print(worst_tracks(df, "htdemucs_ft", "vocals"))

If your product targets metal or dense mixes, cross-check whether your genre's failure modes show up in these track names — then add those clips to your own QA set. The HF table won't replace genre-specific testing, but it surfaces landmines faster than random uploads.

Regression-test model quality in CI

Short answer: Assert median vocal SDR from the public table stays above a floor — catches bad deploys without running inference in GitHub Actions.

You are not running Demucs in CI. You're pinning expectations to the published benchmark, so a changed table or a picker that starts returning an over-budget model fails the build.

# tests/test_model_policy.py
from picker import load_benchmark, pick_model  # module from the previous section

VOCAL_SDR_FLOOR_DB = 9.0  # below htdemucs_ft median; adjust with product input


def test_vocal_model_meets_benchmark_floor():
    df = load_benchmark()
    vocal_median = (
        df[(df["model_id"] == "htdemucs_ft") & (df["stem"] == "vocals")]["sdr_median"].median()
    )
    assert vocal_median >= VOCAL_SDR_FLOOR_DB, (
        f"htdemucs_ft vocal median {vocal_median:.2f} dB < floor {VOCAL_SDR_FLOOR_DB}"
    )


def test_picker_respects_rtf_budget():
    df = load_benchmark()
    model = pick_model(df, stem="vocals", max_rtf=0.10)
    rtf_p95 = df[(df["model_id"] == model) & (df["stem"] == "vocals")]["rtf"].quantile(0.95)
    assert rtf_p95 <= 0.10, f"{model} p95 RTF {rtf_p95:.3f} exceeds budget"

Cache the dataset in CI (HF_HOME or datasets cache dir) so pulls don't hit rate limits every run.
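The first test above names htdemucs_ft directly, so it would still pass if your config pointed at a different model. One way to tie the floor to what actually ships is to read the model IDs from configuration and compare each one with its floor. This is a minimal sketch. `policy_violations`, the `stem_model` mapping (shaped like the STEM_MODEL dict from settings.py), and the floor values are assumptions, and the demo frame holds one row per cell using the medians printed in the example output earlier.

```python
import pandas as pd


def policy_violations(
    df: pd.DataFrame,
    stem_model: dict[str, str],
    floors_db: dict[str, float],
) -> list[str]:
    """Compare each configured (stem -> model_id) pair with its SDR floor."""
    medians = df.groupby(["model_id", "stem"])["sdr_median"].median()
    problems = []
    for stem, model_id in stem_model.items():
        floor = floors_db.get(stem)
        if floor is None:
            continue  # no floor agreed for this stem
        if (model_id, stem) not in medians.index:
            problems.append(f"{model_id}/{stem}: not in benchmark")
            continue
        score = medians.loc[(model_id, stem)]
        if score < floor:
            problems.append(f"{model_id}/{stem}: {score:.2f} dB < floor {floor:.2f} dB")
    return problems


# Stand-in frame: one row per cell, values copied from the example output.
demo = pd.DataFrame(
    {
        "model_id": ["htdemucs", "htdemucs_ft", "mdx_extra_q"],
        "stem": ["vocals", "vocals", "drums"],
        "sdr_median": [8.53, 9.19, 11.49],
    }
)
floors = {"vocals": 9.0}  # hypothetical floor; agree the real one with product
print(policy_violations(demo, {"vocals": "htdemucs"}, floors))
print(policy_violations(demo, {"vocals": "htdemucs_ft"}, floors))
```

In a test file this collapses to a single assertion that the returned list is empty for your real settings and the loaded benchmark.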

Map picks to StemSplit API tiers

Short answer: Hosted StemSplit runs the same weights, so the dataset has no separate stemsplit_api row; its numbers would only duplicate the existing ones.

Your picker result	StemSplit tier	Benchmark row
Speed / bulk preview	FAST	`htdemucs`
Vocal isolation default	BALANCED	`htdemucs_ft`
Piano + guitar stems	BEST (6-stem)	`htdemucs_6s`

If you'd rather not operate Demucs + ffmpeg yourself, the API uses the same models — see building a vocal remover with the StemSplit API for the async job flow.
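The table translates directly into a lookup. Note that it lists no tier for mdx_extra_q, so a picker result can land on a model with no hosted equivalent, and the lookup should fail loudly in that case. This is a sketch. `TIER_BY_MODEL` and `tier_for` are hypothetical names, and the tier strings are copied from the table rather than from the API reference, so check how the API actually expects them.

```python
# Tier names as they appear in the table above (assumed spelling).
TIER_BY_MODEL = {
    "htdemucs": "FAST",
    "htdemucs_ft": "BALANCED",
    "htdemucs_6s": "BEST",  # listed as "BEST (6-stem)"
}


def tier_for(model_id: str) -> str:
    """Translate a benchmark model_id into the tier name from the table."""
    try:
        return TIER_BY_MODEL[model_id]
    except KeyError:
        raise ValueError(f"No hosted tier listed for {model_id!r}") from None


print(tier_for("htdemucs_ft"))
```

Raising instead of falling back to a default keeps a drums-first pick of mdx_extra_q from silently turning into a different model in production.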

When you still need your own benchmark

The HF table answers: "Which open-source Demucs variant is best on MUSDB18-HQ?"

It does not answer:

How your users' uploads (low-bitrate MP3, phone recordings) behave
Commercial tools (LALAL.AI, Moises) — see AI Stem Splitter API Comparison
Spleeter migration: see Spleeter is Dead, and run mir_eval locally if you need custom tracks

Use the public dataset for model ID selection and CI floors. Layer your own golden files on top for product QA.

Wrapping Up

Dataset: huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026
Pattern: load once → pick_model(stem, max_rtf) → worst-track audit → CI median assertion
Don't duplicate work: MUSDB18 eval is already done; spend your GPU time on your audio

If you build something on top of this table (dashboard, CLI, fine-tune report), drop a link in the comments — the dataset is CC-BY-4.0.

Citation:

@misc{stemsplit_benchmark_2026,
  title  = {StemSplit Stem-Separation Benchmark 2026},
  author = {StemSplit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}

DEV Community