Five weeks ago I wrote about the full stack behind VidClean, a free video and audio processing tool suite. That post covered the pipeline, the queue system, and the general architecture. This one goes deep on one specific tool: the background noise remover. It is now the second most used tool on the site. Here is exactly how it works.
FFmpeg cannot do this
FFmpeg has two noise reduction filters worth knowing about: afftdn and anlmdn. Both work fine for consistent background hiss, like tape noise or a steady hum at a fixed frequency. Neither works well for real-world noise, things like air conditioner rumble that shifts in volume, keyboard clicks, street noise, or a fan that speeds up and slows down.
The problem is that these filters are not trained on speech. They do not understand the difference between your voice and the noise behind it. They apply a statistical filter across the whole signal and hope for the best. For simple cases this is fine. For anything real, it falls apart.
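For comparison, this is roughly what the FFmpeg-only approach looks like when driven from Python. It is a minimal sketch, not VidClean's code: the helper names are hypothetical, and the `nr` (noise reduction, dB) and `nf` (noise floor, dB) values are illustrative starting points rather than tuned settings.

```python
import subprocess


def build_afftdn_command(src, dst, noise_reduction_db=12, noise_floor_db=-50):
    """Build an FFmpeg command that applies the afftdn denoiser.

    nr = noise reduction in dB, nf = assumed noise floor in dB.
    Both values are illustrative and need tuning per recording.
    """
    audio_filter = f"afftdn=nr={noise_reduction_db}:nf={noise_floor_db}"
    return ["ffmpeg", "-y", "-i", src, "-af", audio_filter, dst]


def denoise_with_ffmpeg(src, dst):
    # Requires an ffmpeg binary on PATH; raises CalledProcessError on failure.
    subprocess.run(build_afftdn_command(src, dst), check=True)
```

The same fixed settings are applied to the whole file, which is why this works on steady hiss and struggles when the noise changes over time.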
DeepFilterNet3 is different. It is a neural network trained specifically on speech enhancement. It understands what speech sounds like and suppresses everything that is not.
Why DeepFilterNet3
There are a few options in this space. RNNoise is lightweight and fast, but it is older and less accurate on complex noise. Whisper is a transcription model, not a noise suppressor, though people try to use it as one. DeepFilterNet3 is the current best open-source option for this use case: accurate, actively maintained, and small enough to run without a GPU.
The Python library is deepfilternet. The API is simple:
```python
from df.enhance import enhance, init_df, load_audio, save_audio

# Load the default model and its processing state.
model, df_state, _ = init_df()
# Load the input file, resampled to the model's sample rate.
audio, _ = load_audio(input_path, sr=df_state.sr())
enhanced = enhance(model, df_state, audio)
save_audio(output_path, enhanced, df_state.sr())
```
That is the core: four function calls after the import. The rest is plumbing.
Running it on CPU
Every DeepFilterNet3 tutorial assumes you have a GPU. The model page recommends CUDA. Most implementations are built around it.
VidClean runs on Railway with no GPU, just CPU. The stack is torch==2.0.1+cpu and torchaudio==2.0.2+cpu. No CUDA, no GPU bill.
The tradeoff is speed. DF3 on CPU runs at roughly [VERIFY: 1-2x realtime on Railway's hardware, check against a real job] so a 3-minute file takes around 3-6 minutes to process. For a free utility tool where users are not watching a progress bar in a meeting, this is an acceptable tradeoff. Nobody is running this live. They upload a file, go do something else, and come back to download.
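The arithmetic behind that estimate is simple enough to sketch. The function name is hypothetical, and the 1x and 2x factors take the rough figure above at face value, so treat the output as a ballpark rather than a measurement.

```python
def estimate_processing_seconds(audio_seconds, slowdown_low=1.0, slowdown_high=2.0):
    """Return (best_case, worst_case) processing time in seconds.

    The slowdown factors mean "processing takes N times the audio's
    duration". They are assumed values, not benchmarks.
    """
    if audio_seconds < 0:
        raise ValueError("audio_seconds must be non-negative")
    return audio_seconds * slowdown_low, audio_seconds * slowdown_high


low, high = estimate_processing_seconds(3 * 60)
print(f"{low / 60:.0f}-{high / 60:.0f} minutes")  # 3-6 minutes
```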
The cost difference is significant. A Railway CPU instance costs a fraction of any GPU instance. The whole site runs for $16-20/month.
Memory is the real constraint
Speed is not the problem with running DF3 on CPU. Memory is.
When the model loads, it pulls its weights into RAM. That is manageable on its own. The problem is concurrent jobs. If two DF3 jobs start at the same time on the same Railway replica, both model instances are in RAM simultaneously, along with both audio files being processed. On a standard instance this causes an out-of-memory crash.
The fix is a Redis semaphore. Before any heavy job starts, the worker tries to acquire a lock. If the lock is taken, the job waits in the queue. Only one heavy job runs per replica at a time.
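```python
async def acquire_heavy_lock(redis, worker_id, ttl=120):
    # One lock key per worker, so only one heavy job runs on it at a time.
    key = f"heavy_lock:{worker_id}"
    # SET NX EX: set only if the key is absent, with a TTL as a crash safety net.
    acquired = await redis.set(key, "1", nx=True, ex=ttl)
    return bool(acquired)
```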
The lock has a 120-second TTL with a heartbeat that renews every 60 seconds while the job is running. If the job crashes mid-process, the lock expires on its own and the next job can proceed. No manual cleanup, no stuck locks.
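The heartbeat can be sketched as a background task that pushes the expiry forward while the job runs. This is one possible shape, not the production code: `heavy_lock` and `_heartbeat` are hypothetical names, and `redis` is assumed to be an async client exposing `set`, `expire`, and `delete` the way `redis.asyncio` does. As a simplification, it does not check that the lock still belongs to this job before renewing or deleting it.

```python
import asyncio
from contextlib import asynccontextmanager, suppress


async def _heartbeat(redis, key, ttl, interval):
    # Push the expiry forward for as long as the job is alive.
    while True:
        await asyncio.sleep(interval)
        await redis.expire(key, ttl)


@asynccontextmanager
async def heavy_lock(redis, worker_id, ttl=120, interval=60):
    """Hold the heavy-job lock for the duration of the `async with` body.

    Yields True if the lock was acquired, False if another job holds it.
    """
    key = f"heavy_lock:{worker_id}"
    acquired = bool(await redis.set(key, "1", nx=True, ex=ttl))
    if not acquired:
        yield False
        return
    beat = asyncio.create_task(_heartbeat(redis, key, ttl, interval))
    try:
        yield True
    finally:
        # Stop renewing, then release the lock right away on a clean exit.
        beat.cancel()
        with suppress(asyncio.CancelledError):
            await beat
        await redis.delete(key)
```

If the process dies, the heartbeat dies with it, nothing renews the key, and Redis drops the lock once the TTL runs out.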
The full repair_audio pipeline
The background noise remover is one surface for DF3. There is a second tool, repair_audio, that uses DF3 as the middle step in a three-stage pipeline.
Stage 1: loudnorm. Normalizes the audio volume before DF3 sees it. DF3 performs better on audio that is already at a consistent level.
Stage 2: DF3. The actual noise removal.
Stage 3: De-hum. A notch filter targeting 60 Hz and its harmonics (50 Hz where the mains supply runs at 50 Hz, as in Europe). Some recordings have electrical hum baked in that DF3 does not fully remove. The notch filter handles it as a cleanup pass.
Order matters here. Running de-hum before DF3 can remove frequency content that DF3 would have used to make better decisions. Running loudnorm after DF3 can reintroduce clipping. The sequence loudnorm then DF3 then de-hum gives the cleanest results.
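The three stages can be sketched as a plan of FFmpeg commands around the in-process DF3 step. Everything specific here is an assumption for illustration: the function names, the intermediate file names, the `loudnorm` targets, the number of harmonics, and the notch Q are not taken from VidClean. The notch is built from FFmpeg's `bandreject` filter, one per harmonic.

```python
def loudnorm_command(src, dst, sample_rate=48000):
    # Stage 1: normalize loudness. The I/TP/LRA targets are assumed values.
    # Output is pinned to 48 kHz, the rate DeepFilterNet models work at.
    return ["ffmpeg", "-y", "-i", src,
            "-af", "loudnorm=I=-16:TP=-1.5:LRA=11",
            "-ar", str(sample_rate), dst]


def dehum_filter(mains_hz=60, harmonics=4, q=30):
    # Stage 3: one narrow band-reject (notch) per harmonic of the mains hum.
    notches = [
        f"bandreject=f={mains_hz * n}:width_type=q:w={q}"
        for n in range(1, harmonics + 1)
    ]
    return ",".join(notches)


def dehum_command(src, dst, mains_hz=60):
    return ["ffmpeg", "-y", "-i", src, "-af", dehum_filter(mains_hz), dst]


def repair_audio_plan(src, dst, mains_hz=60):
    """Return the stages in order; 'df3' marks the in-process Python step."""
    return [
        ("loudnorm", loudnorm_command(src, "stage1.wav")),
        ("df3", ("stage1.wav", "stage2.wav")),
        ("dehum", dehum_command("stage2.wav", dst, mains_hz)),
    ]
```

Keeping the plan as data makes the ordering explicit and easy to check before anything runs.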
Real numbers
The background noise remover has processed [VERIFY: 24 jobs as of May 20, update to current number on day of posting] since launch. It is the second most used tool on the site behind the silence remover, and it has more than doubled in the last three days.
Total infrastructure cost for all 16 tools: $16-20/month. No GPU. No paid noise removal API. No per-minute billing.
What is next
The next tool in this category is auto captions via Whisper. It is deferred for now because the current bottleneck is distribution, not product. Eight of the sixteen tools have zero completions. Building more before fixing that would be the wrong call.
If you want the full stack breakdown covering FastAPI, ARQ, Cloudflare R2, and Railway deployment, it is in my previous post here on Dev.to: https://dev.to/thebuciyo/how-i-built-a-free-video-audio-tool-suite-for-20month-2dhe
You can try the background noise remover at vidclean.net. Free, no account needed, and no watermark.
If you have questions about the DF3 setup, the semaphore pattern, or the pipeline order, drop a comment.