
Moving UVR5 to the Cloud: How I Built a Free Online Vocal Remover (Architecture from 0 to 1)

I got tired of wrestling with the UVR5 desktop client, so I built a free web version.

This post is not a “click here, then here” user guide — it’s a developer-focused breakdown of:

  • Why I turned everything into a Job
  • How I wrapped complex models into simple “scenes”
  • How Workers, queues, and object storage fit together

If you’re trying to move heavy offline tasks (transcoding, ML inference, batch jobs) to the web, this pattern is meant to be reusable.

1. Starting point: I just didn’t want to fight GPU drivers anymore

When I first used the UVR5 desktop client, my feelings were simple:

  • Pros: lots of models, solid quality
  • Cons: heavy install, complex local setup, picky GPU drivers

For non-technical users, “install before use” is already a hard stop.

Even for me, having to redo the whole setup every time I switched machines was annoying.

So I set myself one clear goal:

Build an “open in browser and just use it” UVR5 online version.

Users should only have to:

  • Upload an audio file
  • Pick a preset
  • Wait
  • Download the results

Everything else (models, GPU, storage, cleanup) should be buried in the backend.

This post is my architecture retrospective for how I got there.

✅ Reusable takeaway:

Before touching any code, write down which pains you want to hide from the user.

A lot of architecture decisions will fall out of that list.


2. Big picture: turning “upload a file” into “submit a Job”

With that goal in mind, I landed on a fairly standard cloud-ish shape:

  • Browser + Frontend (Next.js): handles upload, preset selection, progress, downloads
  • Job API (thin Node/Fastify-ish backend): manages Jobs and signed URLs
  • Queue (Redis + a queue library): handles queuing, retries, and throttling
  • Python Worker + GPU: runs UVR5 / Demucs, does the actual heavy lifting
  • Object Storage (S3/R2): stores original audio and separated stems, with expiry

The user sees “Upload → Processing → Download”.

Internally it’s “Create Job → Worker consumes → Write result → Cleanup”.

I gave myself one hard rule:

The frontend must only talk to Jobs, never to the separation process directly.

Once that was in place, a lot of later problems became much easier.
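The “frontend only talks to Jobs” rule can be sketched as a tiny API surface. This is an illustrative sketch, not the real service: the in-memory `JOBS` dict stands in for a database, and the function and field names (`create_job`, `input_key`, etc.) are assumptions.

```python
# Sketch of a Job-only API surface: the frontend never touches the
# separation process, it only creates and reads Job records.
import uuid

JOBS: dict[str, dict] = {}  # stand-in for a real database


def create_job(scene: str, input_key: str) -> dict:
    """Register a new separation Job; a Worker will pick it up later."""
    job = {
        "id": uuid.uuid4().hex,
        "scene": scene,           # e.g. "two_stems"
        "input_key": input_key,   # object-storage key of the uploaded audio
        "status": "queued",
        "output_keys": [],        # filled in by the Worker on success
    }
    JOBS[job["id"]] = job
    return job


def get_job(job_id: str) -> dict:
    """The only read the frontend needs: the current Job state."""
    return JOBS[job_id]
```

With just these two calls, the frontend's whole job is to render whatever `get_job` returns.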

✅ Reusable takeaway:

Whenever an operation is “noticeably slow and painful when it fails”, it probably wants to be a Job with its own lifecycle, not a long synchronous HTTP call.


3. Why Jobs are non-negotiable: turning long-running work into a tiny state machine

From the user’s perspective, separation is more like “running a CI pipeline” than “reading a record”:

  • There’s a clear start and end
  • It goes through several stages
  • It can fail, and you want a sane way to retry or at least explain why

If you try to squeeze that into a single synchronous endpoint:

  • Long-running requests make the UX flaky
  • Any disconnect / refresh becomes a mess to explain
  • The backend ends up with a giant hidden transaction doing download + inference + upload

So I forced myself to forget about “synchronous result” entirely and keep only this model:

The user submits a Job with a reference to the audio file.

The backend processes it asynchronously, updating Job status along the way.

The frontend just polls or subscribes to that Job.

That means:

  • The frontend page naturally becomes a “Job dashboard”: it can show where things are and what might be going wrong.
  • The backend can split the work into clear phases: download → inference → export → upload → cleanup.
  • If a Worker crashes, as long as the Job status is correct, the user gets a truthful story about what happened.
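The phases above can be made explicit as a tiny state machine. The state names below mirror the download → inference → export → upload phases from the text but are otherwise my own illustrative choice:

```python
# A minimal Job state machine: every legal phase transition is listed,
# so a crashed Worker can never leave a Job in an unexplainable state.
TRANSITIONS = {
    "queued":      {"downloading", "failed"},
    "downloading": {"separating", "failed"},
    "separating":  {"exporting", "failed"},
    "exporting":   {"uploading", "failed"},
    "uploading":   {"done", "failed"},
}


def advance(job: dict, new_status: str) -> dict:
    """Move a Job to a new status, rejecting illegal jumps."""
    allowed = TRANSITIONS.get(job["status"], set())
    if new_status not in allowed:
        raise ValueError(f"illegal transition {job['status']} -> {new_status}")
    job["status"] = new_status
    return job
```

Because `done` and `failed` have no outgoing transitions, a finished Job can never be accidentally resurrected.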

✅ Reusable takeaway:

If you keep asking “what if this times out / disconnects?”, that’s usually a hint:

you should promote this into a Job with explicit states.


4. Using “scenes” to wrap models: users don’t care about MDX vs Demucs

Tools like UVR5 / Demucs live in a world made of:

  • Model names
  • Versions
  • Parameters
  • Post-processing tricks

Users live in a different world:

  • “I just want vocals”
  • “I just want an instrumental for karaoke”
  • “I want separate drums and bass to practice”

Those two languages don’t match.

So in the product I only expose one abstraction: scene (aka “preset + usage scenario”).

For example:

  • “Vocals + Instrumental (two stems)” – typical for acapellas and backing tracks
  • “Karaoke” – keep only vocals / keep only instrumental
  • “Band (four stems)” – drums / bass / other / vocals

Under the hood each scene is mapped to a concrete set of models and configs:

which UVR model to use, whether to resample first, whether to apply additional post-processing — all hidden behind the scene.
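As a sketch, the scene layer can be little more than a config table. Everything below is illustrative: `htdemucs` is a real Demucs model name, but the other model ids and knobs are made up for the example.

```python
# Scene -> model config mapping: users pick a scene id, the pipeline
# resolves it to concrete models and parameters behind the curtain.
SCENES = {
    "two_stems": {
        "label": "Vocals + Instrumental",
        "model": "mdx_vocal_v2",      # hypothetical UVR model id
        "resample_hz": 44100,
        "post_process": ["denoise"],  # hypothetical post-processing step
    },
    "karaoke": {
        "label": "Karaoke",
        "model": "mdx_karaoke",       # hypothetical UVR model id
        "resample_hz": 44100,
        "post_process": [],
    },
    "band": {
        "label": "Band (four stems)",
        "model": "htdemucs",          # Demucs four-stem model
        "resample_hz": None,          # keep the original sample rate
        "post_process": [],
    },
}


def resolve_scene(scene_id: str) -> dict:
    """The only lookup the pipeline needs: scene id -> full model config."""
    return SCENES[scene_id]
```

Upgrading a scene is then a one-line config change, invisible to users.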

The benefits:

  • For users: They just remember “which scene matches my goal”, instead of learning model names.
  • For me: I can silently upgrade the model behind a scene. If a better UVR/Demucs setup appears, I swap it in; users just feel “it works better now”.

✅ Reusable takeaway:

For user-facing systems, lead with “scenes / tasks / presets”, not low-level knobs.

Let domain language be the interface; keep models and parameters behind it.


5. Workers: treating GPUs like a team of on-call engineers

Once you commit to UVR / Demucs, you’ve implicitly accepted that:

  • Models are big and slow to load
  • Inference time scales with audio length
  • VRAM is tight, and you can’t just crank up concurrency

If you spin up a new process per HTTP request, everything is going to feel terrible.

My strategy:

Treat Workers like a team of on-call engineers.

  • When they “clock in”, they preload all needed models onto the GPU
  • Then they sit there, pulling Jobs from the queue one by one
  • If a Worker crashes, you let it crash and let the queue decide to retry or fail the Job

That gives you:

  • Model load happens once, and every subsequent Job benefits from that warm state
  • You can explicitly control “how many Jobs per GPU / per Worker” and keep resource spikes under control
  • Failure recovery becomes trivial:
    • Worker dies → queue decides which Jobs to retry or mark as failed
    • New Worker comes up, keeps eating Jobs
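The “clock in once, then keep pulling Jobs” shape can be sketched like this. The queue and model calls are stubs standing in for the real Redis queue and UVR5/Demucs inference; all names are illustrative.

```python
# Long-lived Worker sketch: pay the model-load cost once, then reuse the
# warm model for every Job pulled off the queue.
import queue


def load_model(name: str):
    """Stand-in for the expensive one-time GPU model load."""
    return {"name": name, "warm": True}


def separate(model, input_key: str) -> list[str]:
    """Stand-in for the actual UVR5/Demucs inference."""
    return [f"{input_key}.vocals.wav", f"{input_key}.inst.wav"]


def worker_loop(jobs: "queue.Queue", max_jobs: int) -> list[list[str]]:
    model = load_model("mdx_vocal_v2")   # loaded once, at "clock-in"
    results = []
    for _ in range(max_jobs):            # a real Worker would loop forever
        job = jobs.get()                 # blocks until a Job is available
        results.append(separate(model, job["input_key"]))
        jobs.task_done()
    return results
```

Concurrency control is then just “how many of these loops do I run per GPU”.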

From a dev-experience angle, this structure has another underrated upside:

local reproduction of bugs becomes easy.

I can grab a Job config from prod, feed it into a local Worker, and reproduce almost exactly what happened.

✅ Reusable takeaway:

When heavy models + GPUs are involved, default to “long-lived Workers + queue”,

not “spawn a fresh process per request”.


6. Storage & cleanup: not just a tech problem, but a trust problem

Object storage (S3 / R2) is the backbone for all large files here:

  • The frontend uploads via signed URLs, so the backend doesn’t have to proxy huge files
  • Workers pull inputs and push outputs by object key
  • Downloads use signed URLs too, so the browser talks directly to storage

The real hard questions are:

  1. How long do you keep those files?
  2. How do you convince users they won’t live on some random server forever?

Here’s what I ended up doing:

  • Each Job gets an expireAt timestamp on creation (e.g. 24–72 hours later)
  • A cleanup routine periodically scans for expired Jobs and:
    • Deletes the original input
    • Deletes all generated stems
    • Marks the Job as expired, so the frontend can say “this Job has expired”
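The cleanup routine can be sketched with an in-memory store standing in for the database and for S3/R2 deletes (the `expireAt` field matches the text; everything else is illustrative):

```python
# Cleanup sketch: scan for expired Jobs, delete their objects, and mark
# them expired so the frontend can say "this Job has expired".
from datetime import datetime, timedelta, timezone


def cleanup(jobs: list[dict], storage: dict, now: datetime) -> int:
    """Delete inputs and stems for every expired Job; return objects removed."""
    removed = 0
    for job in jobs:
        if job["status"] != "expired" and job["expireAt"] <= now:
            for key in [job["input_key"], *job["output_keys"]]:
                if storage.pop(key, None) is not None:  # stand-in for delete_object
                    removed += 1
            job["status"] = "expired"
    return removed
```

Passing `now` in explicitly also makes the expiry rules trivially testable.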

On the product side, I don’t hide this only in a privacy policy footer; I say it in the UI:

  • “Files are stored in encrypted object storage and used only for this separation.”
  • “Your input and all stems are automatically deleted after X hours.”

✅ Reusable takeaway:

When designing storage, don’t just optimize for cost —

turn the lifecycle into something users can actually see and understand.

The more personal the data, the more important it is to explain the rules in plain UI text.


7. Frontend: hide the complexity, keep a single clear path

Once the architecture was set, the frontend rules became straightforward:

The UI should feel like a single linear path, not a control panel.

My frontend only has to do four things well:

  1. Make picking a file trivial

    • Drag-and-drop + click-to-select
    • After upload, show simple feedback: filename, duration, size
  2. Explain what you can do with it via scene cards

    • One short title + one line of explanation per scene
    • No buzzword salad like “ultimate AI super-resolution vocal extraction”
  3. Translate Job states into human language

    • E.g. “Splitting vocals from instrumental…”, “Queue is a bit busy, this may take a bit longer”
    • Not just “queued / running / success”
  4. Present stems clearly on the results view

    • Each stem labelled with plain usage hints: “Vocals (good for covers)”, “Instrumental (good for karaoke)” etc.
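The “translate Job states into human language” rule boils down to one lookup table. The copy below is illustrative, not the actual product text:

```python
# Raw Job status -> human-readable copy, with a fallback so unknown
# states still render something instead of crashing the UI.
STATUS_COPY = {
    "queued":     "Queue is a bit busy, this may take a bit longer",
    "separating": "Splitting vocals from instrumental…",
    "uploading":  "Packing up your stems…",
    "done":       "Done! Your stems are ready to download.",
    "failed":     "Something went wrong — you can retry this Job.",
}


def human_status(status: str) -> str:
    """Fall back to the raw status so new backend states degrade gracefully."""
    return STATUS_COPY.get(status, status)
```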

✅ Reusable takeaway:

For heavy-processing tools, frontend value isn’t about being flashy —

it’s about turning a complex multi-stage pipeline into one understandable story for the user.


8. The “follow-through”: why this architecture is worth maintaining

Once everything lived in “Job + scene + Worker + storage” land, a lot of future work became easier:

  • Add a new scene

    • Create a new scene config in the backend, render a new card in the UI, done.
  • Try a new model

    • Attach a new model config behind an existing scene and send a portion of Jobs there to compare.
  • Basic analytics

    • Jobs are already rich data points: scenes, durations, success/fail rates, error types.
  • Expose a simple external API

    • Expose “create Job / check Job / fetch result” as endpoints;
    • The core flow stays exactly the same.

For a solo dev, this kind of “follow-through” matters a lot:

it decides whether the project stays a one-off demo, or grows into a sustainable product.

✅ Reusable takeaway:

When designing architecture, keep asking “what will I want to add or change later?”

If something can be solved with configuration instead of hardcoding, lean toward config.


9. Yes, this is soft promotion — but I want the main course to be the architecture

Let me be explicit:

Yes, this post is a soft plug for the UVR5 online version I built.

Right now it’s free to use,

partly because idle GPUs are wasted GPUs,

and partly because I want real-world usage to pressure-test the system.

But more importantly, I care about:

  • Sharing a concrete path from “local tool” to “online service”
  • Giving you some patterns you can adapt to your own projects
  • Making sure that even if you never click my site, the post still feels worth reading

✅ Reusable takeaway:

In developer communities, the best way to do soft promotion is:

make the post itself valuable, even if nobody clicks your link.


10. If you want to build something similar, start with these three steps

To wrap up, if you’re thinking about moving a heavy local workflow to the web, here are three steps I’d start with:

  1. Draw it as a Job timeline first

    • What’s the input?
    • What are the stages in between?
    • Where can it fail, and what should the user see at each point?
  2. Define “scenes” before you define “endpoints”

    • Figure out how users describe their goals in their own words,
    • then map those scenes to models and pipelines behind the curtain.
  3. Give every Job a lifecycle from day one

    • When is it created?
    • When is it downloadable?
    • When does it expire?
    • When do you clean it up?
    • Try to phrase those rules so both you and your users can understand them.

Open questions for you:

  • If you were designing this, would you add another layer above Jobs, like “sessions” or “projects” to group multiple separations?
  • When queues are long and GPUs are saturated, how would you design expectation management so users still feel okay waiting?

I’d love to hear how you’d structure something similar —

and if you do try an online vocal remover built this way,

I’m equally interested in how you would refactor the architecture.
