Giuseppe Carlà

I built a free, local video transcription tool because I didn't want to pay $10/hour or upload my files to a stranger's server

Every time I needed to transcribe a video at work, I hit the same wall:
the good tools cost money per minute, and the free ones upload your files
to a remote server. Neither was acceptable for work content.

So I built "Pitchfall" - a local transcription tool that runs entirely
on your own machine.

What it does

Upload any video or audio file (or paste a YouTube URL), and Pitchfall:

  • Transcribes it locally using faster-whisper
  • Shows a real-time progress bar with the current segment being recognized
  • Syncs the transcript to the video — click any line to jump to that moment
  • Exports as .txt or .srt subtitle file (see the SRT sketch after this list)
  • Optionally translates into 10 languages via OpenRouter free models
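
As a concrete example of the .srt export, here's a minimal serializer
for the kind of segments faster-whisper produces (start, end, text).
The helper name and exact shape are mine, not Pitchfall's actual code:

# Hypothetical helper: serialize transcript segments to SRT.
# Segment fields mirror faster-whisper's output (seg.start, seg.end, seg.text).
def to_srt(segments) -> str:
    def ts(seconds: float) -> str:
        # SRT timestamps look like HH:MM:SS,mmm
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(f"{i}\n{ts(seg.start)} --> {ts(seg.end)}\n{seg.text.strip()}\n")
    return "\n".join(blocks)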

No API key needed for transcription. No account. No cloud.

[Screenshot: Pitchfall result screen — video player synced to transcript segments]

The stack

faster-whisper (local Whisper model)
        ▼
FastAPI (Python)
        ▼ streaming SSE
Next.js 16 + Tailwind CSS 4

The backend streams transcription progress via Server-Sent Events —
each segment gets sent to the frontend as it's recognized, so you see
the text appear in real time rather than waiting for the whole file
to finish.
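
To make that concrete, here's a minimal sketch of what an SSE endpoint
like this can look like. The route, payload shape, and model settings
are my assumptions for illustration, not Pitchfall's actual code:

# Sketch: stream faster-whisper segments to the browser over SSE.
import json

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from faster_whisper import WhisperModel

app = FastAPI()
model = WhisperModel("small", device="cpu", compute_type="int8")

@app.get("/transcribe")
def transcribe(path: str):
    def event_stream():
        # faster-whisper yields segments lazily, so each one can be
        # pushed to the client as soon as it's recognized.
        segments, _info = model.transcribe(path)
        for seg in segments:
            payload = {"start": seg.start, "end": seg.end, "text": seg.text}
            yield f"data: {json.dumps(payload)}\n\n"
        yield "event: done\ndata: {}\n\n"

    return StreamingResponse(event_stream(), media_type="text/event-stream")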

Why local matters more than I expected

When I started this, I thought "local vs cloud" was mainly a cost issue.
It turned out to be a privacy issue too.

Faster-whisper on CPU with the small model is genuinely fast enough
for practical use — a 5-minute video takes about 2-3 minutes on a
mid-range laptop. More importantly, the transcript never touches a
third-party server. For work content, legal recordings, or anything
sensitive, that distinction matters.

The part that took longest: memory management

The original version leaked memory on every transcription. The culprit
was URL.createObjectURL(): each call creates a blob URL that pins the
entire video file in memory until it's explicitly revoked. It never
was, so after 3-4 sessions the browser was holding multiple full
videos in RAM.

The fix is a single line, but finding it required profiling:

// Before reset, always revoke the previous blob URL
if (isBlobUrl && mediaUrl) URL.revokeObjectURL(mediaUrl);

The backend had a similar problem: temp files from crashed SSE
connections weren't getting cleaned up. I solved it with a FastAPI
lifespan context manager that wipes .tmp/ on startup and shutdown.
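
For illustration, the pattern looks roughly like this; the directory
name matches the post, the rest is a sketch rather than the project's
exact code:

# Sketch: FastAPI lifespan hook that wipes .tmp/ on startup and shutdown.
import shutil
from contextlib import asynccontextmanager
from pathlib import Path

from fastapi import FastAPI

TMP_DIR = Path(".tmp")

def wipe_tmp() -> None:
    # Remove leftovers from crashed SSE connections, then recreate
    # the directory so new uploads have somewhere to land.
    shutil.rmtree(TMP_DIR, ignore_errors=True)
    TMP_DIR.mkdir(exist_ok=True)

@asynccontextmanager
async def lifespan(app: FastAPI):
    wipe_tmp()   # startup
    yield
    wipe_tmp()   # shutdown

app = FastAPI(lifespan=lifespan)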

What I'm less happy with

Translation reliability. The free OpenRouter models have rate limits
and occasionally go offline. Pitchfall tries 5 models in order with
automatic fallback, but if they're all saturated you get a 503. For
casual use it's fine; for production you'd want a paid model.
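
The fallback logic itself is simple. A sketch of the idea, with
placeholder model IDs (not the actual list) and error handling trimmed:

# Sketch: try free OpenRouter models in order, 503 if all fail.
import requests
from fastapi import HTTPException

FREE_MODELS = ["free-model-1", "free-model-2", "free-model-3",
               "free-model-4", "free-model-5"]  # placeholder IDs

def translate(text: str, target_lang: str, api_key: str) -> str:
    for model_id in FREE_MODELS:
        resp = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {api_key}"},
            json={
                "model": model_id,
                "messages": [{
                    "role": "user",
                    "content": f"Translate into {target_lang}:\n{text}",
                }],
            },
            timeout=60,
        )
        if resp.ok:
            return resp.json()["choices"][0]["message"]["content"]
        # Rate-limited or offline: fall through to the next model.
    raise HTTPException(status_code=503,
                        detail="All translation models are saturated")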

No GPU support in the Docker image. The Dockerfile uses CPU-only
inference. Adding CUDA support means a much heavier image and
nvidia-container-toolkit as a prerequisite — I left it out for now
to keep the setup simple.

YouTube sync limitation. For uploaded files, clicking a transcript
segment seeks the video instantly. For YouTube URLs, the video loads
as a plain iframe embed, which doesn't expose seek control; wiring
that up would require the YouTube IFrame Player API.

Try it

GitHub: https://github.com/scibilo/pitchfall

Manual setup takes about 5 minutes if you have Python 3.10+ and Node.js
18+ installed. Docker setup is one command.

The only hard dependency that people often don't have: ffmpeg.
sudo apt install ffmpeg on Ubuntu, brew install ffmpeg on Mac.
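
If you want a preflight check, a few lines do it (my addition, not
part of Pitchfall's documented setup):

# Fail fast if ffmpeg isn't on PATH.
import shutil

if shutil.which("ffmpeg") is None:
    raise SystemExit("ffmpeg not found: install it and re-run")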


I'm curious: do you handle transcription in any of your projects?
What's your current setup — local model, cloud API, or something else?
