StemSplit

Pick the Right HTDemucs Model in Python — Query 800 MUSDB18-HQ Scores on Hugging Face (2026)

You're about to ship stem separation in a product. Marketing pages say "state-of-the-art AI." Your job is to pick an actual model_id string — htdemucs, htdemucs_ft, htdemucs_6s, or mdx_extra_q — and defend that choice to whoever reads the PR.

Listening tests on three Spotify tracks won't cut it. Running MUSDB18 yourself takes a weekend and 22 GB of disk.

There's a third option: stem-separation-benchmark-2026 — 800 rows of BSS Eval v4 scores (50 MUSDB18-HQ test tracks × 4 models × 4 stems), CC-BY-4.0, queryable in pandas from your laptop in under a minute.

This post shows how to turn that table into a model picker and a regression test — not how to re-run the benchmark (see Best Free AI Stem Splitters in 2026 for DIY mir_eval).

What You'll Learn

  • ✅ Load the public leaderboard into pandas without downloading audio
  • ✅ Pick a model by stem priority (vocals vs drums) and latency budget (RTF)
  • ✅ Find worst-case tracks per model before they hit production
  • ✅ Assert median vocal SDR in CI so model upgrades don't silently regress
  • ✅ Map picks to StemSplit API quality tiers (same weights)

Prerequisites

pip install datasets pandas

No GPU. No MUSDB18 download. The dataset ships metrics only (MUSDB18 audio can't be redistributed).


What's in the dataset?

Short answer: One Parquet config (metrics_only) with per-track, per-stem SDR/ISR/SIR/SAR plus wall_time_s, rtf, and host metadata from an Apple M4 Pro + PyTorch MPS run.

Column     | Use in your app
model_id   | Which weights to load or API tier to call
stem       | vocals, drums, bass, other
sdr_median | Primary quality score (dB, higher = better)
rtf        | wall_time_s / duration_s (0.07 ≈ 14× faster than realtime)
track_id   | MUSDB18-HQ test folder name

Four models in v1: htdemucs, htdemucs_ft, htdemucs_6s, mdx_extra_q.

Note on htdemucs_6s: other scores look broken (~0.2 dB) because piano/guitar are split out of MUSDB's other stem. Compare htdemucs_6s on vocals/drums/bass only, or wait for v1.1 where piano + guitar + other is summed before eval. See the dataset card.
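One way to act on that note is to drop the htdemucs_6s `other` rows before any cross-model comparison. The sketch below is a minimal illustration on a hand-made frame: the column names follow the table above, the values are invented, and `comparable_rows` is a hypothetical helper name.

```python
import pandas as pd


def comparable_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Drop htdemucs_6s 'other' rows, which are not comparable in v1."""
    not_comparable = (df["model_id"] == "htdemucs_6s") & (df["stem"] == "other")
    return df[~not_comparable]


# Invented values; only the column names mirror the dataset.
demo = pd.DataFrame(
    {
        "model_id": ["htdemucs_6s", "htdemucs_6s", "htdemucs", "htdemucs"],
        "stem": ["vocals", "other", "vocals", "other"],
        "sdr_median": [8.0, 0.2, 8.0, 6.0],
    }
)
print(comparable_rows(demo))
```

Applying the same filter to the real frame before any groupby keeps the 6-stem model out of `other` rankings while leaving its vocals, drums, and bass rows intact.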

For background on what SDR means in practice, how HTDemucs separates audio is a better read than repeating the metric tutorial here.


Load the leaderboard in three lines

Short answer: load_dataset → to_pandas() → groupby.

from datasets import load_dataset

ds = load_dataset(
    "StemSplitio/stem-separation-benchmark-2026",
    "metrics_only",
    split="results",
)
df = ds.to_pandas()

leaderboard = (
    df.groupby(["model_id", "stem"])["sdr_median"]
    .median()
    .unstack()
    .round(2)
)
print(leaderboard)

Example output (v1, May 2026):

stem          bass  drums  other  vocals
model_id
htdemucs      9.78  10.01   6.42    8.53
htdemucs_6s   9.11   9.54   0.22    8.66
htdemucs_ft  10.38  10.11   6.34    9.19
mdx_extra_q  11.42  11.49   7.67    9.04

Vocal-first apps: htdemucs_ft wins on vocals. Drum/bass-heavy apps (DJ tools, rhythm games): mdx_extra_q leads. Latency-sensitive previews: check RTF next.
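A latency check can follow the same groupby pattern. The sketch below summarizes RTF per model and converts the median into a speed-up factor (1 / rtf). It assumes rtf is identical on every stem row of a given track, so it keeps one row per (model_id, track_id); `rtf_summary` is a hypothetical helper and the demo values are invented.

```python
import pandas as pd


def rtf_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Median and 95th-percentile RTF per model, plus the implied speed-up."""
    # Assumption: rtf is per track, repeated on each stem row, so deduplicate first.
    per_track = df.drop_duplicates(["model_id", "track_id"])
    grouped = per_track.groupby("model_id")["rtf"]
    summary = pd.DataFrame(
        {"rtf_median": grouped.median(), "rtf_p95": grouped.quantile(0.95)}
    )
    # rtf = wall_time_s / duration_s, so 1 / rtf is "times faster than realtime".
    summary["x_realtime"] = 1.0 / summary["rtf_median"]
    return summary


# Invented values; only the column names mirror the dataset.
demo = pd.DataFrame(
    {
        "model_id": ["a", "a", "a", "b", "b"],
        "track_id": ["t1", "t1", "t2", "t1", "t2"],
        "stem": ["vocals", "drums", "vocals", "vocals", "vocals"],
        "rtf": [0.04, 0.04, 0.06, 0.10, 0.30],
    }
)
print(rtf_summary(demo))
```

Keep in mind that the published RTF values come from a single Apple M4 Pro + MPS host, so they rank models against each other more reliably than they predict latency on other hardware.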


Build a model picker function

Short answer: Filter by stem + max RTF, then idxmax on sdr_median.

import pandas as pd

VOCAL_MODELS = ["htdemucs", "htdemucs_ft", "htdemucs_6s", "mdx_extra_q"]


def load_benchmark() -> pd.DataFrame:
    from datasets import load_dataset

    return load_dataset(
        "StemSplitio/stem-separation-benchmark-2026",
        "metrics_only",
        split="results",
    ).to_pandas()


def pick_model(
    df: pd.DataFrame,
    *,
    stem: str = "vocals",
    max_rtf: float | None = 0.10,
    candidates: list[str] | None = None,
) -> str:
    """Return model_id with best median SDR for `stem` under `max_rtf`."""
    subset = df[df["stem"] == stem]
    if candidates:
        subset = subset[subset["model_id"].isin(candidates)]
    if max_rtf is not None:
        # Row-level filter: drops individual tracks over the cap, not whole models.
        subset = subset[subset["rtf"] <= max_rtf]
    if subset.empty:
        raise ValueError(f"No model satisfies stem={stem!r}, max_rtf={max_rtf}")

    scores = subset.groupby("model_id")["sdr_median"].median()
    return scores.idxmax()


if __name__ == "__main__":
    df = load_benchmark()
    print("Vocal model (RTF ≤ 0.10):", pick_model(df, stem="vocals"))
    print("Drums model (no RTF cap):", pick_model(df, stem="drums", max_rtf=None))

On the current data, pick_model(..., stem="vocals", max_rtf=0.10) returns htdemucs_ft. Drop the RTF cap on drums and you get mdx_extra_q.

Wire this into your config:

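# settings.py
from picker import load_benchmark, pick_model  # module from the previous section

df = load_benchmark()  # fetched once, at import time
STEM_MODEL = {
    "vocals": pick_model(df, stem="vocals", max_rtf=0.08),
    "drums": pick_model(df, stem="drums", max_rtf=None),
}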

That's a defensible default you can paste into a design doc with a link to the dataset.
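One design choice worth flagging: pick_model filters rtf row by row, so a model with a few slow tracks stays eligible and is scored only on its fast tracks. If the budget is meant to apply to the model as a whole, a stricter variant can gate on each model's p95 RTF first. The sketch below is one possible version, not the picker used above; `pick_model_by_p95` and the demo values are hypothetical.

```python
import pandas as pd


def pick_model_by_p95(
    df: pd.DataFrame, *, stem: str = "vocals", max_rtf_p95: float = 0.10
) -> str:
    """Best median SDR among models whose p95 RTF fits the budget."""
    grouped = df[df["stem"] == stem].groupby("model_id")
    sdr = grouped["sdr_median"].median()
    rtf_p95 = grouped["rtf"].quantile(0.95)
    eligible = sdr[rtf_p95 <= max_rtf_p95]  # whole models in or out
    if eligible.empty:
        raise ValueError(f"No model has p95 RTF <= {max_rtf_p95} for stem={stem!r}")
    return eligible.idxmax()


# Invented values: "slow" scores higher but has one track far over budget.
demo = pd.DataFrame(
    {
        "model_id": ["fast", "fast", "slow", "slow"],
        "stem": ["vocals"] * 4,
        "sdr_median": [8.0, 8.2, 10.0, 10.4],
        "rtf": [0.05, 0.06, 0.05, 0.50],
    }
)
print(pick_model_by_p95(demo, stem="vocals", max_rtf_p95=0.10))
```

On this toy frame the row-level filter would keep the one fast "slow" row and pick "slow", while the model-level gate excludes it. Which behavior is right depends on whether occasional slow tracks are acceptable in your product.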


Find your worst-case tracks before users do

Short answer: Sort by sdr_median ascending per model — the tail is where separation breaks.

def worst_tracks(df: pd.DataFrame, model_id: str, stem: str, n: int = 5) -> pd.DataFrame:
    return (
        df[(df["model_id"] == model_id) & (df["stem"] == stem)]
        .nsmallest(n, "sdr_median")[["track_id", "sdr_median", "rtf", "duration_s"]]
    )


df = load_benchmark()
print(worst_tracks(df, "htdemucs_ft", "vocals"))

If your product targets metal or dense mixes, cross-check whether your genre's failure modes show up in these track names — then add those clips to your own QA set. The HF table won't replace genre-specific testing, but it surfaces landmines faster than random uploads.
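A related question is which tracks are hard for every model, since switching model_id will not rescue those. The sketch below ranks tracks by the best score any model achieved on a stem; `hard_tracks` is a hypothetical helper and the demo values are invented.

```python
import pandas as pd


def hard_tracks(df: pd.DataFrame, stem: str, n: int = 5) -> pd.DataFrame:
    """Tracks where even the best-scoring model has a low SDR on `stem`."""
    per_track = (
        df[df["stem"] == stem]
        .groupby("track_id")["sdr_median"]
        .agg(["max", "mean"])
        .rename(columns={"max": "best_sdr", "mean": "mean_sdr"})
    )
    return per_track.nsmallest(n, "best_sdr")


# Invented values; only the column names mirror the dataset.
demo = pd.DataFrame(
    {
        "model_id": ["a", "b", "a", "b", "a", "b"],
        "track_id": ["t1", "t1", "t2", "t2", "t3", "t3"],
        "stem": ["vocals"] * 6,
        "sdr_median": [2.0, 3.0, 9.0, 4.0, 8.0, 8.5],
    }
)
print(hard_tracks(demo, "vocals", n=2))
```

Tracks that surface here are candidates for a QA set regardless of which model you ship; tracks that appear only in a single model's worst list are better evidence for choosing between models.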


Regression-test model quality in CI

Short answer: Assert median vocal SDR from the public table stays above a floor — catches bad deploys without running inference in GitHub Actions.

You are not running Demucs in CI. You're pinning expectations to the published benchmark so nobody ships htdemucs where product spec says htdemucs_ft.

# tests/test_model_policy.py
import pytest
from picker import load_benchmark, pick_model  # module from previous section

VOCAL_SDR_FLOOR_DB = 9.0  # below htdemucs_ft median; adjust with product input


def test_vocal_model_meets_benchmark_floor():
    df = load_benchmark()
    vocal_median = (
        df[(df["model_id"] == "htdemucs_ft") & (df["stem"] == "vocals")]["sdr_median"].median()
    )
    assert vocal_median >= VOCAL_SDR_FLOOR_DB, (
        f"htdemucs_ft vocal median {vocal_median:.2f} dB < floor {VOCAL_SDR_FLOOR_DB}"
    )


def test_picker_respects_rtf_budget():
    df = load_benchmark()
    model = pick_model(df, stem="vocals", max_rtf=0.10)
    rtf_p95 = df[(df["model_id"] == model) & (df["stem"] == "vocals")]["rtf"].quantile(0.95)
    assert rtf_p95 <= 0.10, f"{model} p95 RTF {rtf_p95:.3f} exceeds budget"

Cache the dataset in CI (HF_HOME or datasets cache dir) so pulls don't hit rate limits every run.
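A minimal sketch of the cache-dir route, assuming your CI restores a directory between runs: pass it to load_dataset through its cache_dir parameter. BENCH_CACHE_DIR and both helper names are made up for this example; HF_HOME remains the other option if you would rather configure it in the workflow environment.

```python
import os
from pathlib import Path


def benchmark_cache_dir() -> Path:
    """Cache location; BENCH_CACHE_DIR is a hypothetical variable set by your CI."""
    default = Path.home() / ".cache" / "stem-benchmark"
    return Path(os.environ.get("BENCH_CACHE_DIR", default))


def load_benchmark_cached():
    """Same load as load_benchmark(), but with an explicit cache directory."""
    from datasets import load_dataset  # lazy import keeps the helper above lightweight

    return load_dataset(
        "StemSplitio/stem-separation-benchmark-2026",
        "metrics_only",
        split="results",
        cache_dir=str(benchmark_cache_dir()),
    ).to_pandas()
```

Point your CI cache step at the same directory so the first run downloads the data and later runs reuse it.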


Map picks to StemSplit API tiers

Short answer: Hosted StemSplit runs the same weights — no separate stemsplit_api row in the dataset because numbers would duplicate.

Your picker result      | StemSplit tier | Benchmark row
Speed / bulk preview    | FAST           | htdemucs
Vocal isolation default | BALANCED       | htdemucs_ft
Piano + guitar stems    | BEST (6-stem)  | htdemucs_6s

If you'd rather not operate Demucs + ffmpeg yourself, the API uses the same models — see building a vocal remover with the StemSplit API for the async job flow.
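The tier table can live in code next to the picker. The sketch below is a plain lookup under the assumption that the tier labels are passed around as the strings shown; the exact values the API accepts are not covered here, so treat `TIER_BY_MODEL` and `tier_for` as hypothetical names.

```python
from typing import Optional

# Mirrors the tier table; mdx_extra_q has no listed tier, so it is absent.
TIER_BY_MODEL = {
    "htdemucs": "FAST",
    "htdemucs_ft": "BALANCED",
    "htdemucs_6s": "BEST",  # listed as "BEST (6-stem)"
}


def tier_for(model_id: str) -> Optional[str]:
    """Return the hosted tier for a benchmark model_id, or None if unmapped."""
    return TIER_BY_MODEL.get(model_id)


print(tier_for("htdemucs_ft"))
print(tier_for("mdx_extra_q"))
```

Returning None for an unmapped model forces the caller to decide between self-hosting that model and falling back to a mapped one.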


When you still need your own benchmark

The HF table answers: "Which open-source Demucs variant is best on MUSDB18-HQ?"

It does not answer:

Use the public dataset for model ID selection and CI floors. Layer your own golden files on top for product QA.


Wrapping Up

If you build something on top of this table (dashboard, CLI, fine-tune report), drop a link in the comments — the dataset is CC-BY-4.0.

Citation:

@misc{stemsplit_benchmark_2026,
  title  = {StemSplit Stem-Separation Benchmark 2026},
  author = {StemSplit},
  year   = {2026},
  url    = {https://huggingface.co/datasets/StemSplitio/stem-separation-benchmark-2026}
}
