Local-first video transcription on Apple Silicon with mlx-whisper

#python #cli #opensource #machinelearning

I do a lot of learning from online videos. Many of them are not in English. With AI becoming part of my workflow, I stopped watching full videos and started extracting transcripts, feeding them into my models, and letting the model pull out what I actually need.

The problem: the workflow was fragmented and annoying. Upload to a cloud service, wait for processing, get a transcript full of gibberish caused by background noise, then move it into another tool for translation. Slow, not private, and expensive at scale.

So I built ytx — a local-first command-line tool that runs entirely on your machine.

The tech
I chose mlx-whisper because Apple Silicon is a good fit for local inference: the GPU shares unified memory with the CPU, so a large model does not have to be copied into a separate pool of video memory. Instead of fighting TensorFlow or converting models, I could lean into Apple's native MLX framework and let the Mac GPU handle the full whisper-large-v3 model. No cloud account. No per-minute fees. Just your hardware.
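
To give a sense of how little glue this needs, here is a minimal sketch of calling mlx-whisper from Python. It assumes `mlx-whisper` is installed on an Apple Silicon Mac and that the `mlx-community/whisper-large-v3-mlx` weights on Hugging Face are the ones you want. `transcribe_file` and `segments_to_text` are illustrative names, not ytx's actual internals.

```python
from typing import Any

# Assumption: the MLX-converted whisper-large-v3 weights hosted on Hugging Face.
MODEL_REPO = "mlx-community/whisper-large-v3-mlx"


def transcribe_file(audio_path: str, model_repo: str = MODEL_REPO) -> dict[str, Any]:
    """Run Whisper locally and return the raw result dict."""
    # Imported lazily so this module still loads on machines without MLX.
    import mlx_whisper

    return mlx_whisper.transcribe(audio_path, path_or_hf_repo=model_repo)


def segments_to_text(result: dict[str, Any]) -> str:
    """Emit one line per segment, skipping segments with no text."""
    lines = (seg["text"].strip() for seg in result.get("segments", []))
    return "\n".join(line for line in lines if line)
```

Calling `segments_to_text(transcribe_file("talk.wav"))` would then return the transcript as plain text. The first run downloads the model weights, so it needs a network connection once; after that everything stays on the machine.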

The core dependencies:

Python 3.10+
mlx-whisper (Apple Silicon GPU)
whisper-large-v3 (open-source, no API key)
ffmpeg + yt-dlp for audio extraction
argos-translate as an offline translation fallback
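
The audio extraction step can be sketched roughly as follows. This is an assumption about how such a step might look, not the commands ytx actually runs: yt-dlp downloads the video and keeps only the audio, and ffmpeg converts any local file to the 16 kHz mono WAV that Whisper works with. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path


def ytdlp_command(url: str, out_dir: str) -> list[str]:
    """Build a yt-dlp call that downloads a video and keeps only the audio."""
    template = str(Path(out_dir) / "%(id)s.%(ext)s")
    return [
        "yt-dlp",
        "--extract-audio",
        "--audio-format", "wav",
        "--output", template,
        url,
    ]


def ffmpeg_command(src: str, dst: str) -> list[str]:
    """Build an ffmpeg call that converts a media file to 16 kHz mono audio."""
    # -y overwrites dst, -vn drops video, -ac 1 is mono, -ar sets the sample rate.
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "16000", dst]


def run(cmd: list[str]) -> None:
    """Run a command and raise CalledProcessError if it exits non-zero."""
    subprocess.run(cmd, check=True)
```

Building the argument list separately from running it keeps the commands easy to log and to test, and passing a list rather than a shell string avoids quoting problems with URLs.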

The hardest part
Whisper's hallucination loops on noisy audio.

On noisy audio the model would repeat the same phrase over and over, turning a clean transcript into nonsense. I built an automatic detection-and-scrub step that identifies repetition patterns and removes them before the output ever reaches you. This is not a cosmetic extra. It is the difference between a usable transcript and a broken one.
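
One simple way to detect and scrub such loops is to collapse any short phrase that repeats back to back several times. The sketch below illustrates that idea only; it is not ytx's actual cleanup logic, and `scrub_loops`, `max_phrase_len`, and `min_repeats` are hypothetical names and thresholds.

```python
def scrub_loops(text: str, max_phrase_len: int = 10, min_repeats: int = 3) -> str:
    """Collapse phrases that repeat back to back at least min_repeats times."""
    words = text.split()
    out: list[str] = []
    i = 0
    while i < len(words):
        collapsed = False
        # Try the shortest phrase first so "okay okay okay" becomes "okay".
        for n in range(1, max_phrase_len + 1):
            phrase = words[i:i + n]
            if len(phrase) < n:
                break  # not enough words left for this or any longer phrase
            repeats = 1
            while words[i + repeats * n:i + (repeats + 1) * n] == phrase:
                repeats += 1
            if repeats >= min_repeats:
                out.extend(phrase)  # keep a single copy of the looped phrase
                i += repeats * n
                collapsed = True
                break
        if not collapsed:
            out.append(words[i])
            i += 1
    return " ".join(out)
```

With the default thresholds, `scrub_loops("thank you thank you thank you thank you for watching")` returns `"thank you for watching"`, while a natural double such as `"no no it is fine"` is left alone. The comparison is exact, so a real implementation would probably also want to normalize case and punctuation before matching.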

What it does
Transcribes video/audio locally in seconds on Apple Silicon

Cleans hallucination loops automatically

Outputs SRT, VTT, TXT, or JSON

Translates through local CLI agents (Claude Code, OpenCode, Ollama), cloud APIs, or fully offline

Ships with a SKILL.md so AI agents can run the workflow autonomously
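
As an illustration of the output step, here is a minimal sketch that renders Whisper-style segments as SRT. It assumes each segment is a dict with `start` and `end` in seconds and a `text` string, which is the shape Whisper returns; the function names are illustrative and not taken from ytx.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT uses."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

VTT is close to the same format: the file starts with a `WEBVTT` header line and the timestamps use a period instead of a comma before the milliseconds, so one timestamp helper with a configurable separator could serve both.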

The repo
It is MIT licensed and open source. If you are a terminal-first developer who values local-first tools, this might fit your workflow.

Repo: https://github.com/amateur-dev/ytx
Landing page: https://amateur-dev.github.io/ytx/

Happy to take feedback on the architecture, especially around the hallucination cleanup logic and the agent integration. If you have built something similar or have thoughts on making this faster, I would love to hear from you.
