DEV Community

Melvin Bucio

How I built a free unlimited transcription tool with faster-whisper on CPU for $25/month

Every free transcription tool caps you. Otter caps at 300 minutes per month. Turboscribe at 3 files per day. Notta at 120 minutes per month. They all do this for the same reason: transcription is expensive to run and someone has to pay for it.
I built one with no cap. Here is how it works.
Background
This is the third post in a series about VidClean, a free video and audio tool suite I have been building in public. The first post covered the full stack (FastAPI, ARQ, Railway, Cloudflare R2). The second covered running DeepFilterNet3 background noise removal on CPU. This one covers adding Whisper transcription without breaking the tools that already work.
Why you cannot just add Whisper to your existing worker
My backend already runs two heavy DeepFilterNet3 (DF3) jobs: background noise removal and speech enhancement. Both are gated behind a Redis semaphore that allows only one heavy job at a time per worker. This prevents out-of-memory crashes when multiple users submit files simultaneously.
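A one-slot lock of this kind can be sketched with a plain Redis SET NX EX key. This is a minimal sketch, not VidClean's actual code: the key name, the TTL, and both function names are hypothetical, and `client` is assumed to be a redis-py style client.

```python
import uuid

HEAVY_LOCK_KEY = "lock:heavy"   # hypothetical key name
LOCK_TTL_SECONDS = 45 * 60      # hypothetical safety timeout if a worker dies

# Delete the key only if we still own it, atomically on the Redis side.
RELEASE_SCRIPT = """
if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
end
return 0
"""


def acquire_heavy_lock(client, ttl=LOCK_TTL_SECONDS):
    """Try to take the single heavy-job slot. Returns a token, or None if busy."""
    token = uuid.uuid4().hex
    # SET key token NX EX ttl succeeds only when no other job holds the key.
    if client.set(HEAVY_LOCK_KEY, token, nx=True, ex=ttl):
        return token
    return None


def release_heavy_lock(client, token):
    """Release the slot, but only if this token still owns it."""
    return client.eval(RELEASE_SCRIPT, 1, HEAVY_LOCK_KEY, token) == 1
```

The token check on release matters: if a job outlives the TTL and another job takes the slot, the first job must not delete the second job's lock.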
The naive approach to adding Whisper is to load it alongside DF3 in the same worker. The problem: a 30-minute transcription job would hold the heavy lock for 30 minutes, blocking every silence removal and noise reduction job in the queue. For a free tool where silence removal is the flagship feature, that is unacceptable.
The solution is queue isolation. Whisper gets its own Railway service, its own ARQ queue, and its own Redis namespace. Transcription jobs never touch the DF3 worker. The two systems run completely independently.
The proof is in the timing. I tested this with concurrent jobs: a transcription job and a silence removal job fired within one second of each other. The silence job completed in 6 seconds. The transcription job completed 36 seconds later. No blocking, no waiting.
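In ARQ terms, queue isolation comes down to giving the transcription worker its own queue name and routing jobs to it at enqueue time. The sketch below assumes a hypothetical queue name and task name, and it assumes `pool` is the ArqRedis pool returned by `arq.create_pool()`. It is an illustration of the pattern, not the production code.

```python
TRANSCRIBE_QUEUE = "arq:transcribe"  # hypothetical queue name


async def transcribe(ctx, job_id):
    """Placeholder task body; the real worker would run faster-whisper here."""
    return f"transcribed {job_id}"


class WorkerSettings:
    """Settings for the transcription service only, started with
    `arq <module>.WorkerSettings`. A real deployment would also set
    `redis_settings` (arq.connections.RedisSettings) to point at Redis."""

    functions = [transcribe]
    queue_name = TRANSCRIBE_QUEUE  # this worker polls only this queue
    max_jobs = 1                   # one transcription at a time per worker


async def enqueue_transcription(pool, job_id):
    """API side: push a job onto the transcription queue."""
    # _queue_name routes the job away from the default queue.
    return await pool.enqueue_job("transcribe", job_id, _queue_name=TRANSCRIBE_QUEUE)
```

Because the DF3 worker polls a different queue name, it never sees these jobs, and a long transcription cannot hold up a silence removal job.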
faster-whisper on CPU
Standard Whisper on CPU is slow. OpenAI's reference implementation runs at roughly 1x realtime on a typical server CPU, meaning a 30-minute file takes about 30 minutes to process.
faster-whisper changes that. It reimplements Whisper on CTranslate2, an optimized inference engine for Transformer models, and runs roughly 4x faster than the reference implementation. On Railway's 2 vCPU allocation, a 10-minute file processes in roughly 3-4 minutes. A 30-minute file finishes in roughly 8-12 minutes.
The configuration that matters:
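```python
from faster_whisper import WhisperModel

model = WhisperModel(
    "small",
    device="cpu",
    compute_type="int8",  # 8-bit quantized weights
    cpu_threads=2,        # match the 2 vCPU allocation
    num_workers=1,
    download_root="/tmp/whisper-models",  # where the model files are cached
)
```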
compute_type="int8" quantizes the model weights to 8-bit integers. That makes them half the size of float16 weights and a quarter the size of float32 weights, and it speeds up inference further with minimal accuracy loss on clear speech. cpu_threads=2 matches the 2 vCPU allocation.
The small model has 244M parameters and downloads at worker startup. First boot takes about 30 extra seconds while the model is fetched from the Hugging Face Hub. Subsequent boots are faster once it is cached.
VAD: how we handle 90-minute files
A 90-minute recording is not 90 minutes of continuous speech. There are pauses, gaps, filler silence between sentences. Standard Whisper processes all of it. faster-whisper has a VAD filter that skips non-speech segments entirely.
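```python
# segments is a lazy generator: transcription runs as you iterate over it
segments, info = model.transcribe(
    wav_path,
    vad_filter=True,  # skip non-speech audio before decoding
    language=language,
    beam_size=5,
)
```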
vad_filter=True means a 90-minute sermon with natural pauses might only require processing 65-70 minutes of actual speech. The practical effect: long files finish faster and the transcript is cleaner because Whisper is not trying to transcribe ambient room noise between sentences.
The maximum file duration is capped at 90 minutes on the backend. Anything longer fails cleanly with an error before any model work starts.
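A pre-flight check like this can run before the model is touched. The sketch below assumes the upload has already been converted to a PCM WAV file (the `wav_path` passed to `transcribe` above) and reads the duration from the header with the standard library. The exception class and function names are hypothetical.

```python
import wave

MAX_DURATION_SECONDS = 90 * 60  # the 90-minute cap


class FileTooLongError(ValueError):
    """Raised before any model work when the audio exceeds the cap."""


def wav_duration_seconds(wav_path):
    """Duration of a PCM WAV file, computed from its header."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def check_duration(wav_path, max_seconds=MAX_DURATION_SECONDS):
    """Return the duration, or raise FileTooLongError if it is over the cap."""
    duration = wav_duration_seconds(wav_path)
    if duration > max_seconds:
        raise FileTooLongError(
            f"audio is {duration / 60:.1f} min, limit is {max_seconds / 60:.0f} min"
        )
    return duration
```

Reading the header is cheap, so an oversized file is rejected in milliseconds instead of after minutes of decoding.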
Three output formats from one job
Whisper returns a list of segments, each with a start time, end time, and transcribed text. From that single segments list three output files are generated:
Plain text (.txt): clean prose, no timestamps, good for reading and copy-pasting.
SRT (.srt): numbered cues with HH:MM:SS,mmm timestamps, the format every video editor and subtitle tool accepts.
VTT (.vtt): WEBVTT format with HH:MM:SS.mmm timestamps, the format browsers use for HTML5 video tracks.
All three are uploaded to Cloudflare R2 as separate objects and returned as individual presigned download URLs. One upload, three downloads.
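The three writers can be sketched in a few lines of pure Python. `Seg` is a hypothetical stand-in for a faster-whisper segment (which exposes `start`, `end`, and `text`), and the function names are illustrative rather than taken from the VidClean codebase.

```python
from typing import NamedTuple


class Seg(NamedTuple):
    """Stand-in for a faster-whisper segment (start and end in seconds)."""
    start: float
    end: float
    text: str


def fmt_ts(seconds, sep):
    """Format seconds as HH:MM:SS<sep>mmm (sep is ',' for SRT, '.' for VTT)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d}{sep}{ms:03d}"


def to_txt(segs):
    """Plain prose: segment texts joined with spaces, no timestamps."""
    return " ".join(s.text.strip() for s in segs) + "\n"


def to_srt(segs):
    """Numbered cues separated by blank lines, comma before milliseconds."""
    cues = [
        f"{i}\n{fmt_ts(s.start, ',')} --> {fmt_ts(s.end, ',')}\n{s.text.strip()}\n"
        for i, s in enumerate(segs, start=1)
    ]
    return "\n".join(cues)


def to_vtt(segs):
    """WEBVTT header, then unnumbered cues with a dot before milliseconds."""
    cues = [
        f"{fmt_ts(s.start, '.')} --> {fmt_ts(s.end, '.')}\n{s.text.strip()}\n"
        for s in segs
    ]
    return "WEBVTT\n\n" + "\n".join(cues)
```

Since `model.transcribe` returns a generator, the segments need to be materialized once (for example `segs = list(segments)`) before being passed to all three writers. Otherwise the first writer would exhaust the generator and the other two would produce empty files.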
Real numbers
The dedicated transcription worker adds roughly $8-12 per month to the existing infrastructure. Total cost for the entire suite (silence removal, background noise removal, speech enhancement, transcription, and 14 other tools) is around $25-30 per month. No GPU. No paid transcription API. No per-minute billing.
The no-cap positioning
The competitors above cap the free tier, most likely because they run GPU infrastructure and need to limit free usage to protect margins. CPU inference is slower but the cost structure is completely different. Running faster-whisper small on a Railway CPU instance costs fractions of a cent per job. There is no economic pressure to cap it.
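The "fractions of a cent" figure can be checked with back-of-envelope arithmetic from the numbers in this post. The sketch assumes a flat monthly bill spread evenly over a 30-day month and a worker that stays busy, which is the best case. On an idle worker, the amortized cost per job is higher.

```python
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200, assuming a 30-day month


def cost_per_job(monthly_cost_usd, job_minutes):
    """Share of a flat monthly bill consumed by one job of the given length."""
    return monthly_cost_usd / MINUTES_PER_MONTH * job_minutes


# A 10-minute file at ~4 minutes of processing, on the $12/month upper bound
print(f"${cost_per_job(12, 4):.4f}")  # prints $0.0011
```

That is about a tenth of a cent for a 10-minute file. Even a 30-minute file at 12 minutes of processing stays under a cent.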
What is next
Burned-in captions: running the SRT through FFmpeg to render subtitles directly onto the video. One FFmpeg command, no new model, no new infrastructure. That and a dedicated subtitle page at /add-subtitles are next on the roadmap.
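One way that FFmpeg command could look, wrapped in Python. This is a sketch of the general approach, not the planned implementation: the function names are hypothetical, it assumes an FFmpeg build with libass (required by the `subtitles` filter), and it assumes simple file paths, since colons, quotes, and backslashes in the SRT path would need escaping inside the filter string.

```python
import subprocess


def build_burn_in_cmd(video_path, srt_path, out_path):
    """Build an FFmpeg command that renders the SRT cues onto the video frames."""
    return [
        "ffmpeg",
        "-y",                            # overwrite the output if it exists
        "-i", video_path,
        "-vf", f"subtitles={srt_path}",  # libass-based subtitles filter
        "-c:a", "copy",                  # video is re-encoded, audio is copied
        out_path,
    ]


def burn_in(video_path, srt_path, out_path):
    """Run the command and raise CalledProcessError if FFmpeg fails."""
    subprocess.run(build_burn_in_cmd(video_path, srt_path, out_path), check=True)
```

Burning in captions always re-encodes the video stream, so this step costs CPU time in proportion to video length even though no new model is involved.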
Drop a comment if you have questions about the faster-whisper setup, the queue isolation approach, or the VAD configuration.
You can try it at vidclean.net/transcribe. Free, no account, no time limit.
