<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohammed Fashan</title>
    <description>The latest articles on DEV Community by Mohammed Fashan (@mohammed_fashan_152d2c6a7).</description>
    <link>https://dev.to/mohammed_fashan_152d2c6a7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3882977%2Fa0e02073-638f-461b-8dbf-a18532993968.jpg</url>
      <title>DEV Community: Mohammed Fashan</title>
      <link>https://dev.to/mohammed_fashan_152d2c6a7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mohammed_fashan_152d2c6a7"/>
    <language>en</language>
    <item>
      <title>Building a local audio &amp; video transcription API with FastAPI and faster-whisper</title>
      <dc:creator>Mohammed Fashan</dc:creator>
      <pubDate>Thu, 16 Apr 2026 17:47:52 +0000</pubDate>
      <link>https://dev.to/mohammed_fashan_152d2c6a7/building-a-local-audio-video-transcription-api-with-fastapi-and-faster-whisper-47f8</link>
      <guid>https://dev.to/mohammed_fashan_152d2c6a7/building-a-local-audio-video-transcription-api-with-fastapi-and-faster-whisper-47f8</guid>
      <description>&lt;p&gt;I wanted a way to transcribe audio and video files without sending anything to the cloud. No OpenAI API key, no monthly bill, no data leaving my machine. So I built player2text — a local transcription API powered by faster-whisper.&lt;/p&gt;

&lt;p&gt;Here's what it does and how I built it.&lt;/p&gt;

&lt;h2&gt;The problem&lt;/h2&gt;

&lt;p&gt;Most transcription tools either cost money per minute, require an API key, or both. For personal projects, meeting recordings, or anything sensitive, that's not ideal. Whisper runs locally and it's surprisingly good — the challenge is just wrapping it in something usable.&lt;/p&gt;

&lt;h2&gt;The stack&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt; — clean async API, great auto-generated docs at /docs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;faster-whisper&lt;/strong&gt; — up to 4x faster than the original Whisper at the same accuracy, with roughly half the RAM&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ffmpeg&lt;/strong&gt; — handles all the audio/video heavy lifting&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;The auto-compression trick&lt;/h2&gt;

&lt;p&gt;The best part of this project is the pre-processing step. Before transcription, every file gets passed through ffmpeg:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ffmpeg &lt;span class="nt"&gt;-i&lt;/span&gt; input.mp4 &lt;span class="nt"&gt;-vn&lt;/span&gt; &lt;span class="nt"&gt;-acodec&lt;/span&gt; pcm_s16le &lt;span class="nt"&gt;-ar&lt;/span&gt; 16000 &lt;span class="nt"&gt;-ac&lt;/span&gt; 1 output.wav
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This strips the video stream, downsamples to 16kHz mono (Whisper's native sample rate), and converts the audio to 16-bit PCM in a WAV container. A 300MB video becomes about 5MB, and transcription is dramatically faster as a result.&lt;/p&gt;
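&lt;p&gt;Inside the service this step is easy to wrap in a small helper. A minimal sketch, assuming ffmpeg is on the PATH; the function names here are illustrative, not necessarily what the repo uses:&lt;/p&gt;

```python
import subprocess
from pathlib import Path

SAMPLE_RATE = 16000  # Whisper's native sample rate

def ffmpeg_args(src: Path, dst: Path):
    """Build the ffmpeg command that strips video and downsamples audio."""
    return [
        "ffmpeg", "-y",
        "-i", str(src),
        "-vn",                    # drop the video stream
        "-acodec", "pcm_s16le",   # 16-bit little-endian PCM
        "-ar", str(SAMPLE_RATE),  # downsample to 16 kHz
        "-ac", "1",               # mix down to mono
        str(dst),
    ]

def extract_audio(src: Path, dst: Path):
    """Convert any audio/video file into a Whisper-ready WAV."""
    subprocess.run(ffmpeg_args(src, dst), check=True, capture_output=True)
    return dst
```

&lt;p&gt;Building the argument list in its own function keeps the command testable without actually invoking ffmpeg.&lt;/p&gt;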

&lt;h2&gt;The API&lt;/h2&gt;

&lt;p&gt;One endpoint does everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST /api/v1/transcribe
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send a file (and optionally a language code), get back:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Full transcript here..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"language"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"en"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;462.74&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"segments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"start"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;9.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hello and welcome..."&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Language is auto-detected if you don't specify it — Whisper supports 99 languages out of the box.&lt;/p&gt;
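&lt;p&gt;The segments map directly onto subtitle formats. A small sketch of turning the response into SRT (only the field names from the JSON above are assumed; nothing here is part of the API itself):&lt;/p&gt;

```python
def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp, e.g. 9.5 becomes 00:00:09,500."""
    ms = round(seconds * 1000)
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments):
    """Render the transcript segments as an SRT subtitle file."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

&lt;p&gt;Write the output to a .srt file next to the video and most players will pick it up automatically.&lt;/p&gt;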

&lt;h2&gt;Running it&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/fashan7/audio-to-text
&lt;span class="nb"&gt;cd &lt;/span&gt;audio-to-text
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip setuptools wheel
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python run.py
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then open &lt;a href="http://localhost:8000/docs" rel="noopener noreferrer"&gt;http://localhost:8000/docs&lt;/a&gt; and test it right in the browser.&lt;/p&gt;
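&lt;p&gt;If you'd rather script the call than click through the docs page, the upload is a plain multipart/form-data POST. A stdlib-only sketch that builds the request; the endpoint and the file/language field names come from the API above, and the filename is made up:&lt;/p&gt;

```python
import mimetypes
import urllib.request
import uuid

def build_upload_request(url, filename, payload, language=None):
    """Build a multipart/form-data POST for the transcription endpoint."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    # The uploaded file, under the form field name "file".
    parts = [
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n".encode() + payload + b"\r\n"
    ]
    # Optional language hint; omit it to let Whisper auto-detect.
    if language:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="language"\r\n\r\n'
            f"{language}\r\n".encode()
        )
    body = b"".join(parts) + f"--{boundary}--\r\n".encode()
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
```

&lt;p&gt;Send it with urllib.request.urlopen and parse the JSON body from the response.&lt;/p&gt;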

&lt;h2&gt;What's next&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;React frontend (Lovable) for a proper UI with drag-and-drop upload&lt;/li&gt;
&lt;li&gt;Progress streaming for long files&lt;/li&gt;
&lt;li&gt;Deployment guide for Railway/Render&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The full code is on GitHub: &lt;a href="https://github.com/fashan7/audio-to-text" rel="noopener noreferrer"&gt;https://github.com/fashan7/audio-to-text&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Would love feedback — especially if you've dealt with long audio files on CPU and have ideas for speeding things up further!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>api</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
