Wlad Radchenko

Posted on Jul 1

One model, two stems: how a vocal remover gets the instrumental for free

#machinelearning #audio #python #deeplearning

You feed a song in. You get two files back: one with only the singer, one with only the band. The obvious way to build that is to train two models, one per stem. The audio separator in my open-source app does something cheaper and, in practice, cleaner: it runs one model, then gets the second stem by subtraction.

This post walks through the actual code that does it. It is short, it is specific, and once you see the subtraction trick you will use it in other places too.

A mixing desk keeps every instrument on its own channel. Source separation tries to recover those channels after they have already been bounced down to one. Photo: Unsplash.

The mental model: a song is a stack of transparent sheets

Picture the final track as a stack of clear plastic sheets pressed together. One sheet has the vocals printed on it, the others have drums, bass, guitar. When you hear an MP3, you are looking down through the whole stack at once.

Separation tries to lift one sheet off. Here is the part people miss: if you can lift the vocals sheet cleanly, you do not need a separate model to read the instrumental. Whatever is left on the table after the vocals are gone is the instrumental. Lift one, subtract, keep both.

That is the whole idea. The rest is making "lift the vocals" and "subtract" precise enough to sound good.
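
The sheet picture can be written as plain arithmetic. Here is a minimal sketch with made-up four-sample signals (the values and the `subtract` helper are illustrative, not from the app). It assumes the mix is a sample-wise sum and that the vocal estimate is perfect, which a real model never quite is:

```python
# Toy signals, four samples each. The values are invented for illustration.
vocals = [0.10, -0.20, 0.30, 0.00]
instrumental = [0.05, 0.40, -0.10, 0.20]

# To a first approximation, mixing is sample-wise addition.
mixture = [v + i for v, i in zip(vocals, instrumental)]


def subtract(a, b):
    """Sample-wise a - b for two equal-length signals."""
    return [x - y for x, y in zip(a, b)]


# With a perfect vocal estimate, subtraction hands back the instrumental.
recovered = subtract(mixture, vocals)
```

Any error in the vocal estimate lands in `recovered` with its sign flipped, which is why the rest of the post is about making both steps precise.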

Step 1. Stop working with audio, start working with a picture

You cannot un-mix a waveform by staring at the wiggly line. So the first move is a Short-Time Fourier Transform (STFT), which turns the audio into a spectrogram: a heatmap where one axis is time, the other is frequency, and brightness is how much energy that frequency carries at that moment.

The useful framing: a spectrogram is an editable image. A vocal sits in a recognizable band with recognizable shapes. A model that edits images can learn to erase it.

class STFT:
    def __init__(self, n_fft, hop_length, dim_f, device):
        self.n_fft = n_fft            # 6144 here: window length
        self.hop_length = hop_length  # 1024: how far the window slides
        self.dim_f = dim_f            # 3072: how many frequency bins we keep
        self.hann_window = torch.hann_window(window_length=n_fft, periodic=True)

    def __call__(self, x):
        spec = torch.stft(x, n_fft=self.n_fft, hop_length=self.hop_length,
                          window=self.hann_window.to(x.device),
                          center=True, return_complex=True)
        spec = torch.view_as_real(spec)            # split into real + imaginary
        spec = spec.permute(...).reshape(...)      # pack as image-like channels
        return spec[..., : self.dim_f, :]          # keep only the first dim_f bins

Two numbers there matter more than they look.

n_fft = 6144 sets how many frequency bins exist: n_fft // 2 + 1 = 3073. hop_length = 1024 is the stride, the same idea as stride in a CNN, just along time. A smaller hop means more overlap between windows and a smoother picture, at the cost of more compute.
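
To make those two numbers concrete, here is a small back-of-the-envelope sketch. The 44.1 kHz sample rate is an assumption (the post does not state it), and `num_frames` is a hypothetical helper that applies the frame-count rule torch.stft uses when center=True:

```python
n_fft = 6144          # window length, from the post
hop_length = 1024     # stride along time, from the post
sample_rate = 44100   # ASSUMPTION: sample rate is not stated in the post

num_bins = n_fft // 2 + 1                  # 3073 frequency bins
bin_width_hz = sample_rate / n_fft         # about 7.18 Hz per bin
window_ms = 1000.0 * n_fft / sample_rate   # about 139 ms of audio per window
overlap = 1.0 - hop_length / n_fft         # about 0.83: windows overlap heavily


def num_frames(num_samples, hop=hop_length):
    """Frame count for an STFT with center=True: 1 + num_samples // hop."""
    return 1 + num_samples // hop


# Hypothetical three-minute track at the assumed sample rate.
frames_3min = num_frames(180 * sample_rate)
```

Under that assumption the window is long (good frequency detail, about 7 Hz per bin) while the hop keeps roughly 43 frames per second of audio.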

The frequency cut-off (the part that is not obvious)

Look at the last line: return spec[..., : self.dim_f, :]. The STFT produces 3073 frequency bins, but the model only ever sees 3072 of them. With these settings that drops exactly one bin, the very top one (the Nyquist bin). It is thrown away before inference and padded back with zeros on the way out.

This is not a bug, it is a deliberate trick from the KUIELab-MDX-Net paper (Kim et al., ISMIR 2021). You cut the frequency range to the band where the target source actually lives, which lets you spend the model's resolution where it counts instead of on bins that carry almost no energy. The inverse transform pads the missing bins back so the shapes line up:

def pad_frequency_dimension(self, x, ...):
    freq_padding = torch.zeros([..., num_freq_bins - freq_dim, time_dim]).to(x.device)
    return torch.cat([x, freq_padding], -2)   # put the cut bins back as zeros
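
The trim-then-pad round trip is easy to see on a tiny example. This is a dependency-free sketch using nested lists in place of tensors; `trim_bins` and `pad_bins` are hypothetical stand-ins for the slicing and padding shown above, and the 5-bin spectrogram is made up:

```python
def trim_bins(spec, dim_f):
    """Keep the first dim_f frequency rows of a [freq][time] spectrogram."""
    return [row[:] for row in spec[:dim_f]]


def pad_bins(spec, num_freq_bins):
    """Append all-zero rows until the spectrogram has num_freq_bins rows."""
    time_dim = len(spec[0])
    missing = num_freq_bins - len(spec)
    return spec + [[0.0] * time_dim for _ in range(missing)]


# Toy spectrogram: 5 frequency bins x 3 frames.
# (The post's real sizes are 3073 bins, trimmed to 3072.)
full = [[float(10 * f + t) for t in range(3)] for f in range(5)]

trimmed = trim_bins(full, dim_f=4)             # what the model would see
restored = pad_bins(trimmed, num_freq_bins=5)  # shape restored for the inverse STFT
```

The shapes line up again after padding, but whatever was in the trimmed bin is gone for good: `restored[4]` is all zeros while `full[4]` was not.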

One more small thing the code does right before inference:

spek = self.stft(mix)
spek[:, :, :3, :] *= 0   # zero the lowest 3 bins

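That zeroes the bottom three frequency bins: the DC offset and the lowest rumble, which add nothing to a vocal and only confuse the model. (Assuming a 44.1 kHz sample rate, each bin is about 7.2 Hz wide, so this is content below roughly 20 Hz.) One line, real quality difference.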

Step 2. Run the model on the picture

The model is shipped as ONNX, so inference is one call. It takes the spectrogram and returns a predicted spectrogram for the primary stem.

def run_model(self, mix):
    spek = self.stft(mix)
    spek[:, :, :3, :] *= 0
    spec_pred = self.model_run(spek)                 # ONNX forward pass
    return self.stft.inverse(spec_pred)              # back to a waveform

self.model_run is just the ONNX session wrapped in a lambda. Running through ONNX Runtime, rather than the lambda itself, is what keeps this fast on a CPU when no GPU is around:

session = ort.InferenceSession(model_path, providers=providers, sess_options=opts)
self.model_run = lambda spek: session.run(None, {"input": spek.cpu().numpy()})[0]

Step 3. The subtraction that hands you the second stem

Now the payoff. The model gave us one stem. The other one is mixture minus that stem. The code offers two ways to do it.

The plain way is waveform subtraction:

# source = the stem the model predicted
self.secondary_source = mix.T - source.T

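The careful way is spectral subtraction, used when you want to suppress bleed more aggressively. It subtracts in the frequency domain and reuses the phase of the original mix. Note the operand order in the excerpt: it computes predicted stem minus mixture, so the result comes out polarity-inverted relative to mix minus stem and the caller has to flip the sign. The invert_p argument is unused in this excerpt: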

def invert_audio(specs, invert_p=True):
    X_mag = np.abs(specs[0])                  # mixture magnitude
    y_mag = np.abs(specs[1])                  # predicted stem magnitude
    max_mag = np.where(X_mag >= y_mag, X_mag, y_mag)
    return specs[1] - max_mag * np.exp(1.0j * np.angle(specs[0]))

Both work. Waveform subtraction is the default because it is exact and fast.
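
To see where the two methods differ, it helps to look at a single complex STFT bin. This is a minimal sketch, not the app's code: `spectral_subtract_bin` and `waveform_style_bin` are hypothetical helpers, the bin values are made up, and the spectral version is written as "mixture side minus stem", i.e. with the sign already flipped relative to the invert_audio excerpt:

```python
import cmath


def spectral_subtract_bin(mix_bin, stem_bin):
    """Secondary-stem estimate for one complex bin.

    Takes the larger of the two magnitudes, attaches the mixture's phase,
    then subtracts the predicted stem.
    """
    max_mag = max(abs(mix_bin), abs(stem_bin))
    mix_phase = cmath.phase(mix_bin)
    return max_mag * cmath.exp(1j * mix_phase) - stem_bin


def waveform_style_bin(mix_bin, stem_bin):
    """Plain subtraction on the same bin, for comparison."""
    return mix_bin - stem_bin


# Bin where the mixture is louder than the stem: both methods agree.
agree_spectral = spectral_subtract_bin(1.0 + 1.0j, 0.5 + 0.2j)
agree_plain = waveform_style_bin(1.0 + 1.0j, 0.5 + 0.2j)

# Bin where the model over-predicts (stem louder than the mixture).
over_spectral = spectral_subtract_bin(0.3 + 0.0j, 0.5 + 0.0j)
over_plain = waveform_style_bin(0.3 + 0.0j, 0.5 + 0.0j)
```

In the over-predicted bin, plain subtraction leaves an inverted residue of the stem, while the spectral version (with phases aligned, as here) cancels to zero. That is one way to read "suppresses bleed more aggressively".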

The subtlety that makes subtraction sound clean

Here is the detail I am proud of, and it is easy to get wrong. If you subtract the model's clean output from the raw input, the reconstruction artifacts from the model's own STFT and inverse STFT do not cancel. You hear them as a faint warble in the second stem.

The fix is to subtract through the same lossy pipeline. Before subtracting, the code re-runs the mix through the transform in a pass-through mode that applies STFT and inverse STFT but skips the model:

raw_mix = self.demix(mix, is_match_mix=True)   # STFT -> ISTFT, no model

if self.invert_using_spec:
    self.secondary_source = invert_stem(raw_mix, source)
else:
    self.secondary_source = mix.T - source.T

In that pass-through mode, run_model just returns the spectrogram untouched:

if is_match_mix:
    spec_pred = spek.cpu().numpy()   # identity: no model, only transform round-trip

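So in the spectral path both sides of the subtraction carry the same transform artifacts, and the artifacts cancel instead of leaking into the result. Note that in the snippet above only the invert_using_spec branch consumes raw_mix; the waveform branch still subtracts from the untouched mix, so it does not get this benefit. Small idea, audible difference.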
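
The cancellation argument can be checked on a toy pipeline. The sketch below is an illustration under loud assumptions, not the app's code: a naive 8-point DFT that drops its Nyquist bin stands in for STFT, trim, and inverse STFT; the "model" is assumed perfect; and all signal values are invented:

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform (fine for a handful of samples)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spec):
    """Naive inverse DFT, returning the real part of each sample."""
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                 for k in range(n)) / n).real
            for t in range(n)]


def lossy_round_trip(x):
    """Stand-in for STFT -> trim -> ISTFT: zero the Nyquist bin, transform back."""
    spec = dft(x)
    spec[len(x) // 2] = 0
    return idft(spec)


vocals = [0.3, -0.1, 0.2, -0.4, 0.1, 0.0, 0.3, -0.2]
instrumental = [0.2, 0.1, -0.3, 0.0, 0.4, -0.1, 0.0, 0.2]
mixture = [v + i for v, i in zip(vocals, instrumental)]

# Even a perfect model's output has been through the lossy pipeline.
model_out = lossy_round_trip(vocals)

# Raw subtraction: untouched mix minus pipeline-processed stem.
raw_result = [m - s for m, s in zip(mixture, model_out)]

# Matched subtraction: send the mix through the same pipeline first.
matched_mix = lossy_round_trip(mixture)
matched_result = [m - s for m, s in zip(matched_mix, model_out)]

# Vocal content left behind in each result.
residue_raw = max(abs(r - i) for r, i in zip(raw_result, instrumental))
residue_matched = max(abs(r - i)
                      for r, i in zip(matched_result, lossy_round_trip(instrumental)))
```

Because the toy pipeline is linear, the matched result equals the round-tripped instrumental with no vocal content in it, while the raw result carries the part of the vocal that the pipeline discarded. The trade-off is that the matched instrumental has also lost its own top bin.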

The 1.022 fudge factor

One last constant worth knowing about:

if not is_match_mix:
    source *= self.compensate   # compensate = 1.022

The model very slightly under-predicts magnitude, so the output is multiplied by 1.022 before it is written. It is a calibration constant that ships with the model weights, not a magic number I picked. If your separated stem sounds a hair quiet, this is where it lives.
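
For a sense of scale, the constant can be converted to decibels. This is just arithmetic on the value quoted above; `gain_to_db` is a hypothetical helper:

```python
import math

compensate = 1.022  # calibration constant quoted in the post


def gain_to_db(gain):
    """Convert a linear amplitude gain to decibels."""
    return 20.0 * math.log10(gain)


gain_db = gain_to_db(compensate)  # about 0.19 dB
```

That is a small correction, well under a quarter of a decibel.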

Gotchas if you build this yourself

A few things that cost me time:

The audio must be stereo before it hits the model. Mono gets duplicated into two channels (np.asfortranarray([mix, mix])), otherwise the channel math downstream breaks.
Normalize before and after. The code normalizes to a peak of 0.9 so the int16 export does not clip.
The ONNX path here assumes segment_size == dim_t (256 in this model). If they differ you have to fall back to an onnx2torch path, which the code guards against with an explicit error rather than failing silently.
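
The first two gotchas can be sketched without any dependencies. The helpers below are hypothetical and use plain lists where the real code works on NumPy arrays; always scaling to the target peak (rather than only when the signal is too loud) is an assumption about the normalization:

```python
def to_stereo(samples):
    """Return [left, right]; a mono list is duplicated into both channels."""
    if samples and isinstance(samples[0], (list, tuple)):
        return [list(channel) for channel in samples]
    return [list(samples), list(samples)]


def normalize_peak(channels, max_peak=0.9):
    """Scale all channels so the loudest sample has magnitude max_peak."""
    peak = max(abs(s) for channel in channels for s in channel)
    if peak == 0:
        return channels  # silence: nothing to scale
    scale = max_peak / peak
    return [[s * scale for s in channel] for channel in channels]


def to_int16(channels):
    """Convert floats in [-1, 1] to int16-range integers."""
    return [[int(round(s * 32767)) for s in channel] for channel in channels]


mono = [0.5, -2.0, 1.0]               # deliberately too loud for int16 export
stereo = to_stereo(mono)
normalized = normalize_peak(stereo)
pcm = to_int16(normalized)
```

With the peak held at 0.9, the largest exported sample stays comfortably inside the int16 range instead of clipping at 32767.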

I left out one piece on purpose: how the song is cut into overlapping chunks and stitched back without clicks at the seams. That sliding-window-and-Hann-window dance deserves its own walkthrough, and it is next in this series.

About the author. I'm Wlad Radchenko, a software engineer. The code in this article comes from Wunjo Make (open source), local software for video makers, and Wunjo Design, an offline PWA for designers. Get in touch or find more on GitHub and LinkedIn.

Takeaway

You do not need one model per stem. Train one to lift the hardest source, get the rest by subtraction, and spend your effort on making the subtraction honest: same transform on both sides, phase preserved, magnitude calibrated. The full file is separator/mdx/model.py in the repo if you want to read past the snippets.

If you try this on your own audio, tell me what the second stem sounds like before and after the match-mix trick. That difference is the whole point.

References

Kim, Choi, Chung, Lee, Jung. "KUIELab-MDX-Net: A Two-Stream Neural Network for Music Demixing." ISMIR 2021. arXiv:2111.12203
Mitsufuji et al. "Music Demixing Challenge 2021." Frontiers in Signal Processing, 2022. arXiv:2108.13559
Défossez. "Hybrid Spectrogram and Waveform Source Separation." (Demucs v3.) 2021. arXiv:2111.03600

DEV Community