DEV Community: Melvin Bucio

How I built a free unlimited transcription tool with faster-whisper on CPU for $25/month

Melvin Bucio — Mon, 01 Jun 2026 19:30:00 +0000

Every free transcription tool caps you. Otter caps at 300 minutes per month. Turboscribe at 3 files per day. Notta at 120 minutes per month. They all do this for the same reason: transcription is expensive to run and someone has to pay for it.
I built one with no cap. Here is how it works.
Background
This is the third post in a series about VidClean, a free video and audio tool suite I have been building in public. The first post covered the full stack (FastAPI, ARQ, Railway, Cloudflare R2). The second covered running DeepFilterNet3 background noise removal on CPU. This one covers adding Whisper transcription without breaking the tools that already work.
Why you cannot just add Whisper to your existing worker
My backend already runs DeepFilterNet3 (DF3) behind two heavy tools: background noise removal and speech enhancement. Both are gated behind a Redis semaphore that allows only one heavy job at a time per worker. This prevents out-of-memory crashes when multiple users submit files simultaneously.
The naive approach to adding Whisper is to load it alongside DF3 in the same worker. The problem: a 30-minute transcription job would hold the heavy lock for 30 minutes, blocking every silence removal and noise reduction job in the queue. For a free tool where silence removal is the flagship feature, that is unacceptable.
The solution is queue isolation. Whisper gets its own Railway service, its own ARQ queue, and its own Redis namespace. Transcription jobs never touch the DF3 worker. The two systems run completely independently.
The proof is in the timing. I tested this with concurrent jobs: a transcription job and a silence removal job fired within one second of each other. The silence job completed in 6 seconds. The transcription job completed 36 seconds later. No blocking, no waiting.
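A minimal sketch of what that isolation can look like in ARQ terms, assuming two services that each run their own settings class. The queue names, task names, and class names here are hypothetical (the post does not show the real ones); `queue_name` on the worker settings and `_queue_name` on `enqueue_job` are the ARQ options that keep the two job streams apart.

```python
# Hypothetical queue names; the real ones are not shown in this post.
HEAVY_QUEUE = "arq:heavy"
TRANSCRIBE_QUEUE = "arq:transcribe"


async def remove_noise(ctx, job_id: str) -> str:
    """Placeholder task body; the real one would run DF3."""
    return f"cleaned {job_id}"


async def transcribe_audio(ctx, job_id: str) -> str:
    """Placeholder task body; the real one would run faster-whisper."""
    return f"transcribed {job_id}"


class HeavyWorkerSettings:
    """Settings for the existing DF3 service; it only reads HEAVY_QUEUE."""
    functions = [remove_noise]
    queue_name = HEAVY_QUEUE


class TranscribeWorkerSettings:
    """Settings for the separate transcription service."""
    functions = [transcribe_audio]
    queue_name = TRANSCRIBE_QUEUE


async def enqueue_transcription(pool, job_id: str):
    # `pool` is assumed to be an ArqRedis pool from arq.create_pool().
    # _queue_name routes the job to the transcription service only.
    return await pool.enqueue_job(
        "transcribe_audio", job_id, _queue_name=TRANSCRIBE_QUEUE
    )
```

Because each worker only polls its own queue, a long transcription job never sits in front of a silence removal job, which matches the timing test above.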
faster-whisper on CPU
Standard Whisper on CPU is slow. OpenAI's own model runs at roughly 1x realtime on a typical server CPU, meaning a 30-minute file takes 30 minutes to process.
faster-whisper (CTranslate2) changes that. CTranslate2 is an optimized inference engine for Transformer models that runs roughly 4x faster than the standard implementation. On Railway's 2 vCPU allocation, a 10-minute file processes in roughly 3-4 minutes. A 30-minute file finishes in roughly 8-12 minutes.
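As a quick sanity check on those timings, the arithmetic below turns them into a realtime factor (minutes of audio processed per minute of wall-clock time). The function name is just for illustration. The quoted ranges work out to roughly 2.5x to 3.75x realtime on 2 vCPUs.

```python
def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    return audio_minutes / processing_minutes


# (audio length, fastest observed, slowest observed), all in minutes,
# taken from the figures quoted above.
observations = [(10, 3, 4), (30, 8, 12)]

for audio, fastest, slowest in observations:
    low = realtime_factor(audio, slowest)
    high = realtime_factor(audio, fastest)
    print(f"{audio}-minute file: {low:.2f}x to {high:.2f}x realtime")
```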
The configuration that matters:
```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "small",
    device="cpu",
    compute_type="int8",
    cpu_threads=2,
    num_workers=1,
    download_root="/tmp/whisper-models"
)
```
compute_type="int8" quantizes the model weights to 8-bit integers. This halves the memory footprint and speeds up inference further with minimal accuracy loss on clear speech. cpu_threads=2 matches the 2 vCPU allocation.
The small model (244M parameters) downloads at worker startup. First boot takes about 30 extra seconds while the model fetches from HuggingFace. Subsequent boots are faster once cached.
VAD — how we handle 90-minute files
A 90-minute recording is not 90 minutes of continuous speech. There are pauses, gaps, filler silence between sentences. Standard Whisper processes all of it. faster-whisper has a VAD filter that skips non-speech segments entirely.
```python
segments, info = model.transcribe(
    wav_path,
    vad_filter=True,
    language=language,
    beam_size=5
)
```
vad_filter=True means a 90-minute sermon with natural pauses might only require processing 65-70 minutes of actual speech. The practical effect: long files finish faster and the transcript is cleaner because Whisper is not trying to transcribe ambient room noise between sentences.
The maximum file duration is capped at 90 minutes on the backend. Anything longer fails cleanly with an error before any model work starts.
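A minimal sketch of such a pre-flight check, assuming the upload has already been converted to a PCM WAV file (the transcribe call above takes a `wav_path`). The function and exception names are hypothetical, and the standard-library `wave` module stands in for whatever probe the real backend uses.

```python
import wave

MAX_DURATION_SECONDS = 90 * 60  # the 90-minute cap described above


class FileTooLongError(ValueError):
    """Raised before any model work when the audio exceeds the cap."""


def wav_duration_seconds(wav_path: str) -> float:
    # Duration = number of frames / frames per second.
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_duration(wav_path: str, max_seconds: float = MAX_DURATION_SECONDS) -> float:
    """Return the duration in seconds, or raise if it is over the limit."""
    duration = wav_duration_seconds(wav_path)
    if duration > max_seconds:
        raise FileTooLongError(
            f"audio is {duration / 60:.1f} min, limit is {max_seconds / 60:.0f} min"
        )
    return duration
```

Running this before the model is touched means an oversized file costs a header read, not minutes of CPU.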
Three output formats from one job
Whisper returns a list of segments, each with a start time, end time, and transcribed text. From that single segments list three output files are generated:
Plain text (.txt): clean prose, no timestamps, good for reading and copy-pasting.
SRT (.srt): numbered cues with HH:MM:SS,mmm timestamps, the format every video editor and subtitle tool accepts.
VTT (.vtt): WEBVTT format with HH:MM:SS.mmm timestamps, the format browsers use for HTML5 video tracks.
All three are uploaded to Cloudflare R2 as separate objects and returned as individual presigned download URLs. One upload, three downloads.
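The conversion from one segments list to three files can be sketched as below. The `Segment` dataclass is a stand-in for the segment objects Whisper returns (start and end in seconds, plus text), and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Stand-in for a Whisper segment: start/end in seconds plus text."""
    start: float
    end: float
    text: str


def timestamp(seconds: float, ms_separator: str) -> str:
    """Format seconds as HH:MM:SS<sep>mmm (',' for SRT, '.' for VTT)."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{ms_separator}{ms:03d}"


def to_txt(segments) -> str:
    return " ".join(s.text.strip() for s in segments) + "\n"


def to_srt(segments) -> str:
    cues = []
    for i, s in enumerate(segments, start=1):
        times = f"{timestamp(s.start, ',')} --> {timestamp(s.end, ',')}"
        cues.append(f"{i}\n{times}\n{s.text.strip()}\n")
    return "\n".join(cues)  # blank line between cues


def to_vtt(segments) -> str:
    cues = ["WEBVTT\n"]  # required header, followed by a blank line
    for s in segments:
        times = f"{timestamp(s.start, '.')} --> {timestamp(s.end, '.')}"
        cues.append(f"{times}\n{s.text.strip()}\n")
    return "\n".join(cues)
```

One caveat: faster-whisper yields segments lazily from a generator, so the result needs to be materialized once (for example with `list(segments)`) before it is passed to all three formatters.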
Real numbers
The dedicated transcription worker adds roughly $8-12 per month to the existing infrastructure. Total cost for the entire suite (silence removal, background noise, speech enhancement, transcription, and 14 other tools) is around $25-30 per month. No GPU. No paid transcription API. No per-minute billing.
The no-cap positioning
Every competitor caps the free tier because they run GPU infrastructure and need to limit free usage to protect margins. CPU inference is slower but the cost structure is completely different. Running faster-whisper small on a Railway CPU instance costs fractions of a cent per job. There is no economic pressure to cap it.
What is next
Burned-in captions: running the SRT through FFmpeg to render subtitles directly onto the video. One FFmpeg command, no new model, no new infrastructure. That and a dedicated subtitle page at /add-subtitles are next on the roadmap.
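A rough sketch of what that one command could look like, built as an argument list for `subprocess.run`. This assumes an FFmpeg build with libass (which the `subtitles` filter needs); the function name is hypothetical and nothing here is shipped yet.

```python
def burn_captions_command(video_path: str, srt_path: str, output_path: str) -> list[str]:
    """Build an FFmpeg command that renders SRT cues onto the video frames."""
    return [
        "ffmpeg", "-y",
        "-i", video_path,
        "-vf", f"subtitles={srt_path}",  # draws the cues onto each frame
        "-c:a", "copy",                  # audio is unchanged, so copy it
        output_path,
    ]
```

Paths containing colons, quotes, or backslashes need escaping inside the filter argument, so this sketch is only safe for simple file names such as UUID-based ones.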
Drop a comment if you have questions about the faster-whisper setup, the queue isolation approach, or the VAD configuration.
You can try it at vidclean.net/transcribe. Free, no account, no time limit.

How I built a free AI background noise remover that runs on CPU for $20/month

Melvin Bucio — Thu, 21 May 2026 00:37:52 +0000

About two weeks ago I wrote about the full stack behind VidClean, a free video and audio processing tool suite. That post covered the pipeline, the queue system, and the general architecture. This one goes deep on one specific tool: the background noise remover. It is now the second most used tool on the site. Here is exactly how it works.
FFmpeg cannot do this
FFmpeg has two noise reduction filters worth knowing about: afftdn and anlmdn. Both work fine for consistent background hiss, like tape noise or a steady hum at a fixed frequency. Neither works well for real-world noise, things like air conditioner rumble that shifts in volume, keyboard clicks, street noise, or a fan that speeds up and slows down.
The problem is that FFmpeg's filters are not trained on speech. They do not understand the difference between your voice and the noise behind it. They apply a statistical filter across the whole signal and hope for the best. For simple cases this is fine. For anything real, it falls apart.
DeepFilterNet3 is different. It is a neural network trained specifically on speech enhancement. It understands what speech sounds like and suppresses everything that is not.
Why DeepFilterNet3
There are a few options in this space. RNNoise is lightweight and fast but older and less accurate on complex noise. Whisper is a transcription model, not a noise suppressor, though people try to use it that way. DeepFilterNet3 is the current best open-source option for this use case: accurate, actively maintained, and small enough to run without a GPU.
The Python library is deepfilternet. The API is simple:
```python
from df.enhance import enhance, init_df, load_audio, save_audio

model, df_state, _ = init_df()
audio, _ = load_audio(input_path, sr=df_state.sr())
enhanced = enhance(model, df_state, audio)
save_audio(output_path, enhanced, df_state.sr())
```
That is the core. Four lines of Python. The rest is plumbing.
Running it on CPU
Every DeepFilterNet3 tutorial assumes you have a GPU. The model page recommends CUDA. Most implementations are built around it.
VidClean runs on Railway with no GPU, just CPU. The stack is torch==2.0.1+cpu and torchaudio==2.0.2+cpu. No CUDA, no GPU bill.
The tradeoff is speed. By my rough estimate, which I have not yet checked against a real job, DF3 on CPU takes about 1-2x the length of the audio on Railway's hardware, so a 3-minute file takes around 3-6 minutes to process. For a free utility tool where users are not watching a progress bar in a meeting, this is an acceptable tradeoff. Nobody is running this live. They upload a file, go do something else, and come back to download.
The cost difference is significant. A Railway CPU instance costs a fraction of any GPU instance. The whole site runs for $16-20/month.
Memory is the real constraint
Speed is not the problem with running DF3 on CPU. Memory is.
When the model loads, it pulls its weights into RAM. That is manageable on its own. The problem is concurrent jobs. If two DF3 jobs start at the same time on the same Railway replica, both model instances are in RAM simultaneously, along with both audio files being processed. On a standard instance this causes an out-of-memory crash.
The fix is a Redis semaphore. Before any heavy job starts, the worker tries to acquire a lock. If the lock is taken, the job waits in the queue. Only one heavy job runs per replica at a time.
```python
async def acquire_heavy_lock(redis, worker_id, ttl=120):
    key = f"heavy_lock:{worker_id}"
    acquired = await redis.set(key, "1", nx=True, ex=ttl)
    return acquired
```
The lock has a 120-second TTL with a heartbeat that renews every 60 seconds while the job is running. If the job crashes mid-process, the lock expires on its own and the next job can proceed. No manual cleanup, no stuck locks.
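The heartbeat side is not shown above, so here is a minimal sketch of how acquire, renew, and release could fit together. It assumes an async Redis client exposing `set`, `expire`, and `delete` (as `redis.asyncio` does); the `heavy_lock` context manager and `_heartbeat` helper are hypothetical names, not the production code.

```python
import asyncio
from contextlib import asynccontextmanager


async def _heartbeat(redis, key: str, ttl: int, interval: float) -> None:
    """Push the lock's expiry forward while the job is still running."""
    while True:
        await asyncio.sleep(interval)
        await redis.expire(key, ttl)


@asynccontextmanager
async def heavy_lock(redis, worker_id: str, ttl: int = 120, interval: float = 60):
    """Yield True if the lock was acquired (and keep it alive), else False."""
    key = f"heavy_lock:{worker_id}"
    acquired = bool(await redis.set(key, "1", nx=True, ex=ttl))
    if not acquired:
        yield False
        return
    beat = asyncio.create_task(_heartbeat(redis, key, ttl, interval))
    try:
        yield True
    finally:
        # Normal exit or handled error: stop renewing and release right away.
        beat.cancel()
        await asyncio.gather(beat, return_exceptions=True)
        await redis.delete(key)
```

If the process dies outright, the `finally` block never runs, the heartbeat stops with it, and the key expires after at most `ttl` seconds. That is the self-healing behavior described above.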
The full repair_audio pipeline
The background noise remover is one surface for DF3. There is a second tool, repair_audio, that uses DF3 as the middle step in a three-stage pipeline.
Stage 1: loudnorm. Normalizes the audio volume before DF3 sees it. DF3 performs better on audio that is already at a consistent level.
Stage 2: DF3. The actual noise removal.
Stage 3: De-hum. A notch filter targeting 60Hz and harmonics (50Hz for non-US content). Some recordings have electrical hum baked in that DF3 does not fully remove. The notch filter handles it as a cleanup pass.
Order matters here. Running de-hum before DF3 can remove frequency content that DF3 would have used to make better decisions. Running loudnorm after DF3 can reintroduce clipping. The sequence loudnorm then DF3 then de-hum gives the cleanest results.
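To make the de-hum stage concrete, here is a small sketch that builds a chain of FFmpeg `bandreject` (notch) filters for the mains frequency and its harmonics. The number of harmonics and the Q value are assumptions picked for illustration, and the helper names are hypothetical; the post does not state the exact filter settings used.

```python
# Stage order from the post: loudnorm first, DF3 in the middle, de-hum last.
STAGE_ORDER = ("loudnorm", "df3", "dehum")


def hum_frequencies(mains_hz: int = 60, harmonics: int = 4) -> list[int]:
    """Fundamental plus harmonics, e.g. 60, 120, 180, 240 for 60Hz mains."""
    return [mains_hz * n for n in range(1, harmonics + 1)]


def dehum_filter(mains_hz: int = 60, harmonics: int = 4, q: float = 30.0) -> str:
    """Chain one narrow bandreject filter per hum frequency for `ffmpeg -af`."""
    return ",".join(
        f"bandreject=f={freq}:width_type=q:w={q:g}"
        for freq in hum_frequencies(mains_hz, harmonics)
    )
```

For 50Hz regions the same helper is called with `mains_hz=50`, which moves every notch in the chain.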
Real numbers
The background noise remover had processed 24 jobs as of May 20. It is the second most used tool on the site behind the silence remover, and its usage has more than doubled in the last three days.
Total infrastructure cost for all 16 tools: $16-20/month. No GPU. No paid noise removal API. No per-minute billing.
What is next
The next tool in this category is auto captions via Whisper. It is deferred for now because the current bottleneck is distribution, not product. Eight of the sixteen tools have zero completions. Building more before fixing that would be the wrong call.
If you want the full stack breakdown covering FastAPI, ARQ, Cloudflare R2, and Railway deployment, it is in my previous post here on Dev.to. https://dev.to/thebuciyo/how-i-built-a-free-video-audio-tool-suite-for-20month-2dhe
You can try the background noise remover at vidclean.net. Free, no account needed, and no watermark.
If you have questions about the DF3 setup, the semaphore pattern, or the pipeline order, drop a comment.

How I Built a Free Video & Audio Tool Suite for $20/Month

Melvin Bucio — Fri, 08 May 2026 17:27:40 +0000

I got tired of video editing tools that either charged money, added watermarks, or made you create an account just to do something simple like remove silence from a recording.
So five weeks ago I built my own. And then kept building. It's now a full video and audio tool suite with 8 tools, getting organic traffic from Google, ChatGPT, and Copilot, all for about $16-20/month in infrastructure costs.
Here's exactly how it's built.

The stack

Frontend: Pure static HTML on Vercel. No React, no Next.js, no build step. Just HTML, Tailwind CDN, and vanilla JS. Vercel's free tier handles it.
Backend: FastAPI on Railway with 2 worker replicas
Queue: Redis + ARQ (async job queue for Python)
Storage: Cloudflare R2 with 1-day lifecycle rules
Processing: FFmpeg for everything

Monthly cost breakdown:

Railway (2 replicas): ~$10-12
Cloudflare R2: ~$1-2
Redis (Railway): ~$3-5
Vercel: $0
Total: $16-20/month

How the processing pipeline works
Every tool follows the same pattern:

User uploads a file to FastAPI via the browser
FastAPI streams it to Cloudflare R2
FastAPI enqueues a job in Redis via ARQ
A worker replica picks up the job, downloads the file from R2, runs FFmpeg, uploads the result back to R2
Frontend polls /status every 500ms until the job is complete
User gets a presigned download URL, file auto-deletes after 15 minutes

The flow goes: browser uploads to FastAPI, FastAPI streams to R2 and enqueues a job in Redis, a worker picks up the job, downloads from R2, runs FFmpeg, uploads the result back to R2, and writes the status. The frontend polls every 500ms until it gets a presigned download URL back.
The interesting technical decisions
No user accounts, ever
This was a deliberate architectural decision, not just a UX choice. No accounts means no user database, no sessions, no auth system, no password resets, no GDPR compliance headaches, no data breach liability. Every request is stateless. Files are identified by a UUID job ID, not a user ID.
The tradeoff is you can't offer saved history or preferences. That's fine for a free utility tool. People just want to process a file and leave.
15-minute file deletion
Files are deleted from R2 automatically via lifecycle rules after 1 day, but a cleanup job also runs 15 minutes after each job completes. This isn't just a privacy feature. It keeps R2 storage costs near zero since files never accumulate.
max_jobs=1 per worker
ARQ supports concurrent jobs per worker, but FFmpeg is CPU-bound. Running two FFmpeg processes on the same Railway instance causes them to compete for CPU and both slow down. Setting max_jobs=1 means each worker processes one file at a time. With 2 replicas you get 2 simultaneous jobs, enough for current traffic and easy to scale by adding replicas.
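In ARQ this is a single attribute on the worker settings class. A minimal sketch, with a placeholder task name that is not from the real codebase:

```python
async def process_file(ctx, job_id: str) -> str:
    """Placeholder task; the real one downloads from R2 and runs FFmpeg."""
    return job_id


class WorkerSettings:
    # ARQ reads these attributes when started with `arq module.WorkerSettings`.
    functions = [process_file]
    max_jobs = 1  # one CPU-bound FFmpeg job per replica at a time


def concurrent_capacity(replicas: int, max_jobs: int = WorkerSettings.max_jobs) -> int:
    """Total jobs that can run at once across all replicas."""
    return replicas * max_jobs
```

With `max_jobs = 1`, capacity is simply the replica count, so scaling is a deployment setting and not a code change.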
The boto3 mistake that caused 504s
This one took me an embarrassingly long time to catch.
boto3 (the AWS/R2 SDK) is synchronous. My first version called it directly inside a FastAPI async endpoint. Under any meaningful upload load, the event loop blocked while the file transferred to R2, requests piled up, and the server started returning 504s.
The fix was one line:
```python
await asyncio.to_thread(storage.upload_file, tmp_path, key)
```
This runs the blocking boto3 call in a thread pool, freeing the event loop to handle other requests during the upload. Obvious in hindsight, painful to debug live.
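The effect is easy to reproduce without boto3. In this self-contained sketch, `time.sleep` stands in for the synchronous upload and a ticker task counts how often the event loop gets to run while the upload is in flight. All names are illustrative.

```python
import asyncio
import time


def blocking_upload(seconds: float) -> str:
    """Stand-in for a synchronous boto3 upload."""
    time.sleep(seconds)
    return "uploaded"


async def count_ticks(stop: asyncio.Event) -> int:
    """Count how many times the event loop schedules this task."""
    ticks = 0
    while not stop.is_set():
        ticks += 1
        await asyncio.sleep(0.01)
    return ticks


async def run(upload_in_thread: bool) -> int:
    stop = asyncio.Event()
    ticker = asyncio.create_task(count_ticks(stop))
    await asyncio.sleep(0)  # let the ticker start
    if upload_in_thread:
        await asyncio.to_thread(blocking_upload, 0.2)
    else:
        blocking_upload(0.2)  # blocks the loop; the ticker cannot run
    stop.set()
    return await ticker


if __name__ == "__main__":
    print("direct call:", asyncio.run(run(False)), "ticks")
    print("to_thread:  ", asyncio.run(run(True)), "ticks")
```

The direct call starves the ticker for the whole upload, which is the same starvation that turned into 504s under load. With `to_thread` the ticker keeps running throughout.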
The silence removal pipeline is not one command
A naive implementation uses a single silenceremove filter:
ffmpeg -i input.mp4 -af silenceremove=stop_periods=-1:stop_duration=0.5:stop_threshold=-35dB output.mp4
This works but gives you very little control over cut padding and re-encode behavior. The actual implementation uses a two-pass approach: silencedetect to find the boundaries, then segment-cut and re-stitch with concat. More code but much better results, especially for speech with natural breath gaps you want to preserve.
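The middle of that two-pass approach can be sketched in pure Python: parse the `silence_start` / `silence_end` lines that FFmpeg's `silencedetect` filter logs (for example from `ffmpeg -i input.mp4 -af silencedetect=noise=-35dB:d=0.5 -f null -`), then invert them into the ranges to keep. The padding value and function names are assumptions for illustration, not the production settings.

```python
import re

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(ffmpeg_log: str) -> list[tuple[float, float]]:
    """Pair up silence_start / silence_end values from silencedetect output."""
    starts = [float(m) for m in SILENCE_START.findall(ffmpeg_log)]
    ends = [float(m) for m in SILENCE_END.findall(ffmpeg_log)]
    return list(zip(starts, ends))  # an unmatched trailing start is dropped


def keep_segments(silences, duration: float, padding: float = 0.1):
    """Invert silences into (start, end) ranges to keep.

    `padding` seconds of each silence are left on both sides of a cut,
    so breaths and word tails are not clipped.
    """
    keep, cursor = [], 0.0
    for start, end in silences:
        cut_start, cut_end = start + padding, end - padding
        if cut_end <= cut_start:
            continue  # silence too short to cut once padded
        if cut_start > cursor:
            keep.append((cursor, cut_start))
        cursor = cut_end
    if cursor < duration:
        keep.append((cursor, duration))
    return keep
```

Each kept range then becomes one cut, and the cuts are stitched back together with concat. The padding parameter is the control the single-filter command does not offer.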

What each tool taught me

Remove silence: Two-pass silence detection beats single-command filters for speech content
Extract audio: libmp3lame at 192k constant bitrate is the right default for spoken audio. VBR (-q:a) is fine for music but creates surprises when input is voice with quiet sections
Compress video: CRF 23 with libx264 veryfast is the sweet spot for quality vs speed on Railway's hardware
Mute video: Stream copy (-c:v copy -an) makes muting essentially instant since no re-encode is needed
Trim video: Re-encoding is worth the extra seconds vs stream copy because stream copy can only cut on keyframes, which causes off-by-seconds errors that users notice
Video to GIF: Palette generation in a single FFmpeg invocation matters far more than I expected. Without it, GIF banding is obvious even at small sizes
Resize to 9:16: The blur bars effect requires splitting into two streams in the filter graph. Non-obvious but produces much better results than black bars
MP4 to MP3: Same backend as extract audio, different SEO surface. One FFmpeg function, two landing pages, two different keyword clusters
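Two of the lessons above, the single-invocation GIF palette and the blur bars split, come down to filter graphs passed to `-filter_complex`. The sketch below builds both as strings; the default fps, sizes, and blur radius are assumptions, and the helper names are hypothetical.

```python
def gif_filter(fps: int = 12, width: int = 480) -> str:
    """One filter graph: build a palette from the clip, then apply it."""
    return (
        f"[0:v]fps={fps},scale={width}:-1:flags=lanczos,split[a][b];"
        "[a]palettegen[p];"
        "[b][p]paletteuse"
    )


def blur_bars_filter(width: int = 1080, height: int = 1920, blur: int = 20) -> str:
    """Blurred, cropped copy as background; fitted copy overlaid in the center."""
    return (
        "[0:v]split[bg][fg];"
        f"[bg]scale={width}:{height}:force_original_aspect_ratio=increase,"
        f"crop={width}:{height},boxblur={blur}[blurred];"
        f"[fg]scale={width}:{height}:force_original_aspect_ratio=decrease[fitted];"
        "[blurred][fitted]overlay=(W-w)/2:(H-h)/2"
    )
```

Either string goes straight into a command such as `ffmpeg -i input.mp4 -filter_complex "<graph>" output.gif`. In both cases `split` is what lets one input feed two branches of the graph.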

The SEO side
Since the goal is organic traffic, I put real effort into the SEO infrastructure from day one: FAQPage, HowTo, and WebApplication JSON-LD schema on every tool page, comparison pages targeting "free Descript alternative" queries, blog posts targeting long-tail keywords, and Spanish versions of all pages.
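For reference, FAQPage markup follows a fixed schema.org shape, so it can be generated from a list of question and answer pairs. A minimal sketch, with a hypothetical helper name and no claim that the site generates its markup this way:

```python
import json


def faq_jsonld(questions: list[tuple[str, str]]) -> str:
    """Serialize (question, answer) pairs as schema.org FAQPage JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "FAQPage",
        "mainEntity": [
            {
                "@type": "Question",
                "name": question,
                "acceptedAnswer": {"@type": "Answer", "text": answer},
            }
            for question, answer in questions
        ],
    }
    return json.dumps(data, indent=2)
```

The resulting string is placed inside a `<script type="application/ld+json">` tag on the tool page.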

Five weeks in: 27 clicks in Google Search Console, average position 5.5, and organic referrals already coming in from ChatGPT and Copilot.
The AI referrals were the most surprising thing. ChatGPT and Copilot were recommending the site before Google was sending meaningful traffic. Plain-language description paragraphs on each tool page ("This tool removes silence from video and audio. Upload your file and download a clean version in seconds.") seem to help AI systems understand and cite the tools accurately.

What surprised me overall
Static HTML is underrated. No build pipeline, no framework updates, no hydration errors, instant Vercel deploys. For a tool suite where each page is mostly the same structure with different copy, it's the right call. I would make the same decision again.
The other surprise was how well the cost scales. Eight tools, all running through the same pipeline, for $16-20/month. FFmpeg does the heavy lifting and Railway scales horizontally by just adding replicas. The unit economics are genuinely good.

What's next
More tools. Each one is a new keyword cluster, a new internal link target, and a new surface for AI citation. The goal is eventually a comprehensive free video utility suite, the way iLovePDF did it for PDF tools, built one tool at a time.
If you're building something similar or have questions about the FFmpeg pipeline, the ARQ setup, or the R2 lifecycle configuration, drop a comment. Happy to go deeper on any of it.
You can try what I built at vidclean.net.