
codesugar lin

Posted on • Originally published at aistemsplitter.org

htdemucs vs BS-RoFormer vs Spleeter: A 2026 Audio Source Separation Benchmark

If you've spent any time looking at AI music separation in the last twelve months, you've probably run into the same three names: Spleeter, htdemucs (Hybrid Transformer Demucs), and BS-RoFormer. They show up in every comparison post, every research paper, and every "how to extract vocals" tutorial — but the way they're compared is usually wrong. Most posts cite a single SDR number from a 2019 paper and call it a day.

That's not useful if you're trying to ship a product, build a pipeline, or pick a model for real audio.

This post compares the three on the dimensions that actually matter when you're deploying audio separation:

  1. Quality — SDR scores from peer-reviewed sources, not vibes
  2. Inference speed — what you'll actually wait for in production
  3. Cost per song — running on commodity GPUs at 2026 prices
  4. Output flexibility — 2 stems vs 4 stems vs 6 stems
  5. When each one is the right choice — and when it isn't

Everything below is based on published benchmarks plus our own production deployment of htdemucs at scale. Where we cite numbers, we cite the source.


TL;DR (for people who want the answer now)

| Model | Best for | Output stems | Quality (avg SDR) | Speed |
| --- | --- | --- | --- | --- |
| Spleeter | Real-time, low-resource, batch processing | 2, 4, or 5 | ~5.9 dB (vocals) | ~100× real-time on GPU |
| htdemucs | Production C2C apps, balance of quality and speed | 4 or 6 | ~9.0 dB (avg) | ~5–8× real-time on A40 |
| BS-RoFormer | Highest-fidelity offline work, mastering, archival | 4 (typically) | ~9.80 dB (avg) | ~2–3× real-time on A40 |

If you take only one thing from this post: htdemucs is the right default for almost any product, and you should probably be running htdemucs_ft rather than the default checkpoint. On Replicate's serverless pricing, all three Demucs variants (default, 6s, ft) cost essentially the same per call — but ft delivers meaningfully better separation. We didn't expect this when we started; it only became clear after looking at our actual billing.

BS-RoFormer is meaningfully better only on bass and only when latency doesn't matter. Spleeter is a 2019 model running on 2026 hardware — fast, but the quality gap is now audible.

The rest of this post explains why.


What we mean by "quality" — SDR explained briefly

Music source separation quality is usually measured in Signal-to-Distortion Ratio (SDR), in decibels. Higher is better. The reference dataset is MUSDB18 (or MUSDB18-HQ for high-quality audio), which contains 150 full-length tracks with isolated stems for vocals, drums, bass, and "other."

A few practical anchors:

  • <6 dB SDR: noticeable artifacts, "phasey" vocals, audible bleed between stems
  • 6–8 dB SDR: usable for casual purposes (karaoke, learning songs, sketching ideas)
  • 8–10 dB SDR: clean enough for content creation and most DJ applications
  • >10 dB SDR: approaching transparent for the average listener; suitable for release-quality work after light cleanup

Anything above ~9 dB on vocals is generally past the point where most listeners can tell the difference in a blind test. The gains from there are about edge cases — heavy reverb, doubled vocals, complex mixes.

A note on SI-SDR: Some recent papers report SI-SDR (scale-invariant SDR), which corrects for simple gain differences and is more robust. When numbers in this post differ from other sources, the metric definition is usually the reason.
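For intuition, here is a toy Python version of both metrics. It's a simplified sketch: published MUSDB numbers come from museval's framewise BSS-eval SDR, so scores from these functions won't line up exactly with the tables below.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Plain SDR in dB: energy of the reference over energy of the error."""
    error = reference - estimate
    return 10 * np.log10(np.sum(reference**2) / (np.sum(error**2) + 1e-12))

def si_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Scale-invariant SDR: project the estimate onto the reference first,
    so a simple gain mismatch is not penalised."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + 1e-12)
    target = alpha * reference
    noise = estimate - target
    return 10 * np.log10(np.sum(target**2) / (np.sum(noise**2) + 1e-12))

# Example: a clean stem vs. the same stem with additive noise.
rng = np.random.default_rng(0)
clean = rng.standard_normal(44100)
degraded = clean + 0.3 * rng.standard_normal(44100)
print(f"SDR: {sdr(clean, degraded):.1f} dB, SI-SDR: {si_sdr(clean, degraded):.1f} dB")
```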


The three models, briefly

Spleeter (Deezer, 2019)

Released by the Deezer research team in 2019, Spleeter is a U-Net architecture operating in the spectrogram domain. It comes in 2-stem (vocals/accompaniment), 4-stem (vocals/drums/bass/other), and 5-stem (adds piano) configurations.

It was a landmark release at the time — the first time anyone could run good-enough source separation on a laptop CPU without licensing fees. Six years later, it's been overtaken on quality by every modern model, but it remains the fastest and lightest option by a wide margin.
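If you want to try it, the Python API is a few lines. A minimal sketch, assuming the `spleeter` package is installed and using a placeholder input path; the model weights download on first run:

```python
from spleeter.separator import Separator

# Pre-trained 4-stem model: vocals / drums / bass / other.
separator = Separator("spleeter:4stems")

# Writes vocals.wav, drums.wav, bass.wav, other.wav under output/song/.
separator.separate_to_file("song.mp3", "output/")
```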

htdemucs (Meta AI, 2022)

The fourth-generation Demucs model from Meta AI's research team. Unlike Spleeter, htdemucs is a hybrid model — it operates in both the time domain (waveform) and frequency domain (spectrogram), with a Transformer backbone connecting them. The original paper reports a 1.4 dB SDR improvement over the previous Demucs generation on MUSDB-HQ.

Two variants matter in practice:

  • htdemucs — the standard 4-stem model
  • htdemucs_6s — a 6-stem variant that adds isolated guitar and piano stems

There's also htdemucs_ft, a fine-tuned version that's slower but slightly more accurate on individual stems.

Its predecessor, Hybrid Demucs, won the 2021 Sony Music Demixing Challenge, and htdemucs remains the default for most production pipelines that aren't chasing the absolute SOTA.
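Switching between the variants is a one-line change in the Python API. A minimal sketch, assuming the `demucs` package is installed (weights download on first use):

```python
from demucs.pretrained import get_model

model = get_model("htdemucs")        # default 4-stem
# model = get_model("htdemucs_6s")   # adds guitar and piano stems
# model = get_model("htdemucs_ft")   # fine-tuned: slower, higher SDR

print(model.sources)  # e.g. ['drums', 'bass', 'other', 'vocals']
```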

BS-RoFormer (2023)

The current state of the art on MUSDB18-HQ, BS-RoFormer (Band-Split RoPE Transformer) is a pure-Transformer architecture that replaces RNN modules with a hierarchical RoPE Transformer. It splits the input spectrogram into multiple non-overlapping frequency sub-bands, exploiting the fact that different instruments occupy characteristic frequency ranges (bass low, cymbals high, etc.).
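As a rough illustration of the band-split idea (this is not BS-RoFormer's actual code, and the band boundaries below are made up), the input spectrogram is simply sliced along the frequency axis before each band gets its own embedding:

```python
import torch

# A batch of spectrogram magnitudes: (batch, freq_bins, time_frames).
spec = torch.randn(1, 1025, 512)

# Hypothetical band boundaries; the real model uses a finer hand-designed
# layout with narrow bands at low frequencies and wider bands up top.
band_edges = [0, 64, 128, 256, 512, 1025]
bands = [spec[:, lo:hi, :] for lo, hi in zip(band_edges[:-1], band_edges[1:])]

for i, band in enumerate(bands):
    print(f"band {i}: {band.shape[1]} frequency bins")
```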

BS-RoFormer trained on MUSDB18-HQ plus 500 extra songs won first place in the Music Source Separation track of the Sound Demixing Challenge 2023 (SDX23). Even the smaller version trained without extra data reports 9.80 dB average SDR on MUSDB18-HQ.

The downside: it's slower and more memory-intensive than htdemucs, and the production-ready open weights are still scattered across community implementations rather than a single canonical release.


1. Quality benchmark (published SDR scores)

This is where most comparison posts fall apart — they cherry-pick a single number. Here are the per-stem SDR scores from the published literature, on MUSDB18-HQ (no extra training data unless noted):

| Model | Vocals | Drums | Bass | Other | Average |
| --- | --- | --- | --- | --- | --- |
| Spleeter (4-stem) | ~5.9 dB | ~5.9 dB | ~5.5 dB | ~4.5 dB | ~5.4 dB |
| htdemucs (default) | ~8.1 dB | ~8.4 dB | ~8.6 dB | ~5.9 dB | ~7.7 dB |
| htdemucs_ft (fine-tuned) | ~8.9 dB | ~9.5 dB | ~9.4 dB | ~6.4 dB | ~8.5 dB |
| BS-RoFormer (no extra data) | – | – | ~11.28 dB | – | ~9.80 dB |
| BS-RoFormer (with 500 extra songs) | – | – | – | – | ~9.76 dB+ |

Sources: Spleeter scores from the Spleeter JOSS paper and the BeatsToRapOn separation benchmark. htdemucs scores from "Hybrid Spectrogram and Waveform Source Separation" and "Benchmarks and leaderboards for sound demixing tasks". BS-RoFormer scores from the SDX23 results documented in the same paper.

A few observations from the table:

The Spleeter → htdemucs gap is bigger than the htdemucs → BS-RoFormer gap. Going from Spleeter to htdemucs gets you roughly +2.3 dB on average. Going from htdemucs to BS-RoFormer gets you roughly another +1.3 dB from the fine-tuned checkpoint (about +2.1 dB from the default one). This is why htdemucs is the practical sweet spot for most use cases.

BS-RoFormer's biggest win is on bass. Bass separation jumps from ~8.6 dB (htdemucs) to ~11.28 dB (BS-RoFormer) — a difference you can hear in a blind test. The vocal and drum gains are smaller. If you're building something that specifically needs clean bass (DJ tools, transcription, music education for bass players), BS-RoFormer is worth the extra compute. For everything else, the gain is on the edge of perceptible.

htdemucs_ft is underrated. Many comparison posts only test the default htdemucs checkpoint. The fine-tuned version (htdemucs_ft) closes most of the gap to BS-RoFormer at the cost of roughly 4× the inference time — still faster than BS-RoFormer in practice.


2. Inference speed (real-world, not theoretical)

Approximate end-to-end time for a 3-minute song on a single A40 GPU, measured from API call to download-ready output:

| Model | End-to-end time | Real-time multiplier |
| --- | --- | --- |
| Spleeter (4-stem, GPU) | ~2–5 seconds | ~40–90× real-time |
| htdemucs (default, 4-stem) | ~30–45 seconds | ~4–6× real-time |
| htdemucs_6s (6-stem) | ~40–60 seconds | ~3–5× real-time |
| htdemucs_ft (fine-tuned) | ~90–150 seconds | ~1.2–2× real-time |
| BS-RoFormer | ~60–120 seconds | ~1.5–3× real-time |

Notes:

  • End-to-end time ≠ pure GPU inference time. Public benchmarks usually report just the model forward pass on clean inputs. Real production time includes container cold start (5–30s on serverless), audio I/O (file download, ffmpeg pre-processing), and result upload. Our numbers above are end-to-end on Replicate.
  • Spleeter is in a different league for speed. It's the only one that runs comfortably faster than real-time on CPU alone.
  • htdemucs's overlap parameter is a big speed lever. The default overlap=0.25 is a reasonable trade-off; setting overlap=0.5 improves quality slightly at ~2× the cost; setting overlap=0 makes it noticeably faster but introduces audible chunking artifacts at segment boundaries (see the sketch after this list for how it's passed in code).
  • BS-RoFormer's reference implementations vary wildly in speed depending on whose checkpoint and inference code you use. Numbers above are for the community-popular MVSep BS-RoFormer SW build.
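Here is the sketch referenced above: how the overlap knob is passed when calling htdemucs through the demucs Python API. The audio loading is stubbed out with silence for illustration; in a real pipeline you'd load the waveform with torchaudio or ffmpeg.

```python
import torch
from demucs.pretrained import get_model
from demucs.apply import apply_model

model = get_model("htdemucs")

# Placeholder input: 10 seconds of stereo silence at the model's sample rate.
wav = torch.zeros(2, model.samplerate * 10)

with torch.no_grad():
    # overlap controls how much adjacent chunks overlap when the track is
    # split for inference: 0.25 is the default trade-off, 0.5 is ~2x slower
    # but slightly cleaner, 0 is fastest but risks boundary artifacts.
    sources = apply_model(model, wav[None], overlap=0.25, shifts=1)[0]

print(sources.shape)  # (num_sources, channels, samples)
```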

If you're shipping a consumer product where users wait for results, anything slower than ~60 seconds for a 3-minute song starts to hurt conversion in our experience. That keeps htdemucs (default and 6s) inside acceptable territory and pushes htdemucs_ft and BS-RoFormer toward async/queued flows where the user can come back later.


3. Cost per song (production deployment economics)

This is the section where most online comparisons are completely wrong. Public pricing on Replicate looks straightforward — A40 at $0.000725/second, multiply by inference time, done. In practice, that calculation is off by roughly 2× from your actual bill, and there's a more interesting wrinkle that almost no comparison post mentions.

The headline finding from our production deployment

We've been running htdemucs in production at aistemsplitter.org for several months across all three Demucs variants — htdemucs (default 4-stem), htdemucs_6s (6-stem), and htdemucs_ft (fine-tuned). On Replicate's A40 GPU instances, all three variants cost approximately the same per call in our actual billing: roughly 22 calls per $1, or about $0.045 per song.

That's worth pausing on, because it contradicts what you'd expect from the published inference times.

| Model | Naive cost (public pricing × inference time) | Our actual measured cost |
| --- | --- | --- |
| Spleeter (GPU) | <$0.002 | <$0.005 |
| htdemucs (default) | ~$0.022 | ~$0.045 |
| htdemucs_6s (6-stem) | ~$0.029 | ~$0.045 |
| htdemucs_ft (fine-tuned) | ~$0.11 | ~$0.045 |
| BS-RoFormer | ~$0.065 | ~$0.06–0.10 (varies) |

Why all three Demucs variants converge to the same cost

The naive pricing model assumes you pay only for pure GPU inference time. In reality, every Replicate call also includes:

  • Container cold-start time (5–30 seconds when scaling from zero)
  • Model weight loading into GPU memory
  • Audio file download and ffmpeg pre-processing
  • Result encoding and upload back to storage
  • A minimum billable duration per call

These overheads are roughly fixed costs per invocation — they don't scale with how complex your model is. When the GPU forward pass goes from 30 seconds (htdemucs default) to 90 seconds (htdemucs_ft), the additional compute matters less to the bill than you'd expect, because the per-call overhead is already eating most of the budget.
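A toy cost model makes the effect visible. The overhead figure below is an assumption for illustration, not Replicate's actual billing logic:

```python
A40_USD_PER_SECOND = 0.000725   # Replicate's posted A40 rate
ASSUMED_OVERHEAD_S = 30.0       # illustrative cold start + I/O + minimum billing

def naive_cost(gpu_seconds: float) -> float:
    return gpu_seconds * A40_USD_PER_SECOND

def cost_with_overhead(gpu_seconds: float) -> float:
    return (gpu_seconds + ASSUMED_OVERHEAD_S) * A40_USD_PER_SECOND

for name, gpu_s in [("htdemucs", 35), ("htdemucs_6s", 50), ("htdemucs_ft", 120)]:
    print(f"{name:12s} naive ${naive_cost(gpu_s):.3f}"
          f"  with overhead ${cost_with_overhead(gpu_s):.3f}")
```

Even this crude model compresses the gap between variants; in our actual billing the gap disappears entirely, which suggests the effective per-call overhead and billing granularity are larger than the 30 seconds assumed here.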

The practical implication: if you're already on the htdemucs platform, there's almost no economic reason not to use the highest-quality variant your latency budget allows. If your users will wait 60 seconds, use htdemucs_6s (6 stems, default speed). If they'll wait 2 minutes, use htdemucs_ft (fine-tuned, near-BS-RoFormer quality on most stems). The bill is the same.

This is the opposite of the conclusion you'd reach by reading academic papers and Replicate's posted GPU pricing. It only shows up when you actually look at your bill at the end of the month.

Implications for unit economics

If you're modeling unit economics for a stem separation product, plan for $0.04–$0.05 per song as your floor, regardless of which Demucs variant you choose. That sets:

  • Free tier ceiling — at 10 free minutes per user (≈3 free songs), you're absorbing roughly $0.13 per signup before any conversion
  • Minimum viable credit pack pricing — anything below ~$0.10/song retail leaves no margin for Stripe fees, support, and infrastructure overhead
  • Bulk processing cost — at 10,000 songs/month you're looking at ~$450 in pure inference, before storage, bandwidth, and any other infrastructure
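The arithmetic behind those bullets, if you want to plug in your own numbers:

```python
COST_PER_SONG = 0.045          # our measured floor per song on Replicate

free_songs_per_signup = 3      # roughly 10 free minutes
print(f"Free-tier cost per signup: ${free_songs_per_signup * COST_PER_SONG:.3f}")

monthly_songs = 10_000
print(f"Inference at 10k songs/month: ${monthly_songs * COST_PER_SONG:,.0f}")
```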

Two important caveats:

  1. Cold starts dominate at low traffic. If your service is processing fewer than a few hundred songs per day, the cold-start overhead becomes proportionally larger. At very low traffic, the actual cost can drift up toward $0.06–$0.07 per song.
  2. Self-hosting only beats this above ~$2k/mo in inference spend. Until you have enough sustained traffic to keep a dedicated GPU >40% utilized, serverless GPU is cheaper than RunPod, Vast.ai, or your own colo. We've measured this directly — Replicate stayed cheaper than dedicated infrastructure throughout our launch period.

4. Output flexibility (stem count and format)

| Model | Available stem configurations | Notes |
| --- | --- | --- |
| Spleeter | 2, 4, or 5 stems | 5-stem adds piano (separate model) |
| htdemucs | 4 or 6 stems | htdemucs_6s adds guitar + piano |
| BS-RoFormer | 4 stems (mostly); some 6-stem community builds | Quality drops on the rarer guitar/piano stems |

This is where htdemucs_6s genuinely stands alone. If your use case requires isolated guitar or piano stems (music education, multi-track remixing, transcription), htdemucs_6s is the only widely-deployed model that delivers them at production quality. BS-RoFormer 6-stem variants exist in the community but are less mature; the canonical BS-RoFormer is a 4-stem system.

For "vocals only" or "instrumental only" use cases (the karaoke crowd), all three models work fine, and you should pick on speed, not quality. Spleeter at 90× real-time will give you a usable instrumental in milliseconds.


5. When to pick which one

After running these in production for several months, here's the simple decision tree we'd give someone starting from scratch:

Pick Spleeter when:

  • You need to process audio in real-time or near-real-time
  • You're running on CPU or constrained hardware
  • You need batch-processing throughput (e.g., feature extraction over a music catalog)
  • The quality bar is "usable" not "good"

Pick htdemucs when:

  • You're building a consumer-facing product where users wait <60 seconds
  • You need 6 stems (use htdemucs_6s)
  • You want the best quality-per-dollar ratio in production
  • You don't want to maintain custom inference code (it's well-supported on every major model-serving platform)

Pick BS-RoFormer when:

  • You're running offline or batch jobs where 1–2 minutes per song is fine
  • Bass quality specifically matters (DJ tools, transcription, audio analysis)
  • You're producing release-quality work and the marginal SDR matters
  • You're willing to invest engineering time in keeping up with community model releases

Don't pick any of these when:

  • You only need vocal removal for karaoke. Use Spleeter 2-stem; the quality difference doesn't matter for sing-along audio that's going to be played over a microphone.
  • You need real-time stem separation in a DJ application. None of these are real-time on consumer hardware. Use a DAW with built-in real-time separation (Ableton 12, etc.) or pre-process tracks offline.

What this looks like in practice

We run htdemucs_6s in production at aistemsplitter.org — a hosted version of 6-stem separation aimed at people who don't want to set up the local toolchain (which, between PyTorch versions, CUDA versions, and audio dependency hell, takes most people a full afternoon).

A few things we learned that aren't in the papers:

  • Real production cost is roughly 2× what naive calculations suggest, and roughly flat across Demucs variants. Public GPU pricing × inference time gives you a number that ignores platform overhead. Our actual Replicate bill works out to about $0.045 per song — and it's the same number whether we run htdemucs, htdemucs_6s, or htdemucs_ft. The fixed overhead per call swamps the marginal compute difference between models. This single fact changed how we think about model selection: pick on quality, not on theoretical compute cost, because the cost difference doesn't actually show up in your bill.
  • Format conversion matters more than the model. htdemucs only accepts WAV input. Users upload MP3, FLAC, M4A, OGG, and increasingly weird WebM containers. The pre-processing ffmpeg layer is non-trivial to get right at scale (a minimal version of it is sketched after this list).
  • YouTube/SoundCloud URL ingestion is half the UX win. Asking users to download a file and upload it loses ~40% of them. Direct URL ingestion via yt-dlp is fiddly to maintain (age-restricted videos, region locks, livestreams) but worth it.
  • The 6-stem case is where users see the magic. When someone hears guitar isolated from piano on their favorite song for the first time, they tell their friends. The 4-stem case is "neat"; the 6-stem case is "wait, how is this possible".
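The conversion sketch mentioned above: a thin ffmpeg wrapper that normalises whatever users upload into the 44.1 kHz stereo WAV our pipeline expects. It assumes ffmpeg is on the PATH, and the file paths are placeholders:

```python
import subprocess

def to_wav_44k_stereo(src_path: str, dst_path: str) -> None:
    """Normalise an arbitrary upload (MP3/FLAC/M4A/OGG/WebM) to 44.1 kHz
    stereo WAV before separation."""
    subprocess.run(
        [
            "ffmpeg", "-y",
            "-i", src_path,
            "-vn",           # drop any video stream (WebM uploads)
            "-ac", "2",      # force stereo
            "-ar", "44100",  # resample to 44.1 kHz
            dst_path,
        ],
        check=True,
        capture_output=True,
    )

# to_wav_44k_stereo("upload.webm", "input.wav")
```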

If you want to hear what 6-stem htdemucs sounds like on real audio without setting up the toolchain, our site has free credits to try a few songs.


What's next in this space

A few open questions worth watching in 2026:

  • Will 8-stem (vocals/backing-vocals/drums/bass/guitar/piano/synth/other) become standard? Community fine-tunes are moving in this direction, but training data for individual synth and backing-vocal stems is the bottleneck.
  • Real-time on consumer hardware? No current open model runs at real-time speed on a CPU at acceptable quality. This will change with model distillation, but probably not in 2026.
  • Multilingual / non-Western vocal separation. Most published benchmarks are dominated by English pop and rock. We see noticeably lower performance on languages with different vocal techniques (Mandarin, Cantopop with heavy auto-tune, Bollywood vocal stacks). This is a genuine gap in the field, not a model deployment issue.

If you're working in this space and have data we'd find interesting — or you've hit something on these models we haven't — drop us a line.


References

  1. htdemucs — Rouard, S., Massa, F., Défossez, A. Hybrid Transformers for Music Source Separation. arXiv:2211.08553
  2. Demucs v4 (hybrid) — Défossez, A. Hybrid Spectrogram and Waveform Source Separation. arXiv:2111.03600
  3. BS-RoFormer — Lu, W.-T., Wang, J.-C., et al. Music Source Separation with Band-Split RoPE Transformer. SDX23 Challenge results
  4. Spleeter — Hennequin, R., Khlif, A., Voituret, F., Moussallam, M. Spleeter: a fast and efficient music source separation tool with pre-trained models. JOSS 2020
  5. MUSDB18 dataset — Rafii, Z., Liutkus, A., Stöter, F.-R., Mimilakis, S. I., Bittner, R. The MUSDB18 corpus for music separation. Zenodo
  6. Sound Demixing Challenge 2023 — Mitsufuji et al., SDX23 results
  7. MVSep model leaderboard — mvsep.com/en/algorithms

Last updated: April 2026. If you find an error in the data, the SDR numbers, or any of the practical claims, send us a correction and we'll update the post with attribution.
