Subtitles are not the same as understanding
I kept finding incredible technical tutorials on YouTube, but they were in languages I didn't speak. Auto-subtitles weren't cutting it, so I built youtube-dubber.
You paste a YouTube URL, pick a language, and it dubs the video in real time using a natural-sounding neural voice. No robotic monotone here. It sounds like an actual person explaining the concept to you, with Hinglish code-switching for Hindi and a conversational tone for everything else.
How It Works
The application handles audio processing using two distinct pipelines:
1. Video URL Mode
- Extraction: Pulls captions from the YouTube video (falls back to Groq Whisper STT if captions are unavailable).
- Translation: Passes the text to LLaMA 3.1-8b hosted on Groq for ultra-fast processing.
- Synthesis: Generates speech with edge-tts neural voices.
- Sync: Coordinates audio and playback through mpv via a JSON IPC socket.
- Optimization: Utilizes segment-level disk caching. The first run is slow, but every rerun is instant.
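To make the segment-level caching idea concrete, here is a minimal sketch, not the project's actual implementation. It assumes each segment is keyed by a hash of its language, voice, and translated text, and that a `synthesize` callable (a hypothetical stand-in for the edge-tts step) returns audio bytes. The function names and the `.mp3` suffix are illustrative.

```python
import hashlib
from pathlib import Path
from typing import Callable


def segment_cache_path(cache_dir: Path, text: str, lang: str, voice: str) -> Path:
    """Derive a stable file path from everything that affects the audio."""
    key = "\x1f".join((lang, voice, text)).encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    return cache_dir / f"{digest}.mp3"


def get_or_synthesize(
    cache_dir: Path,
    text: str,
    lang: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical TTS hook: (text, voice) -> audio bytes
) -> Path:
    """Return the cached audio file, calling `synthesize` only on a cache miss."""
    path = segment_cache_path(cache_dir, text, lang, voice)
    if not path.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        tmp = path.with_suffix(".tmp")
        tmp.write_bytes(synthesize(text, voice))
        tmp.replace(path)  # rename into place so a crash never leaves a half-written segment
    return path
```

With a scheme like this, a rerun of the same video in the same language and voice finds every segment on disk and skips synthesis entirely, which matches the "slow first run, fast rerun" behaviour described above.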
2. Live Dub Mode (Linux Only)
- Capture: Grabs live system audio via a PulseAudio virtual sink.
- 5-Agent Pipeline: Capture → Groq Whisper STT → LLaMA Translation → edge-tts → Speaker Output.
- Latency: Holds a ~3–5 second lag floor. This is inherent to a listen-then-translate architecture rather than a software bug.
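The staged pipeline above can be sketched with `asyncio` queues. This is an illustration under stated assumptions, not the project's code: each stage is a single worker connected to the next by a bounded queue, the capture step is represented by a feeder and the speaker output by the final collection loop, and the three `fake_*` coroutines are placeholders for the real STT, translation, and TTS calls.

```python
import asyncio
from typing import Any, Awaitable, Callable, Iterable, List

Stage = Callable[[Any], Awaitable[Any]]


async def run_stage(fn: Stage, inbox: asyncio.Queue, outbox: asyncio.Queue) -> None:
    """Apply `fn` to each item from `inbox`; None is the shutdown signal."""
    while True:
        item = await inbox.get()
        if item is None:
            await outbox.put(None)  # pass the shutdown signal downstream
            return
        await outbox.put(await fn(item))


async def run_pipeline(chunks: Iterable[Any], stages: List[Stage]) -> List[Any]:
    """Feed chunks through the stages in order and collect the final outputs."""
    queues = [asyncio.Queue(maxsize=4) for _ in range(len(stages) + 1)]
    workers = [
        asyncio.create_task(run_stage(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]

    async def feed() -> None:
        # Stands in for the capture agent pushing audio chunks.
        for chunk in chunks:
            await queues[0].put(chunk)
        await queues[0].put(None)

    feeder = asyncio.create_task(feed())
    results = []
    # Stands in for the speaker-output agent draining finished audio.
    while (item := await queues[-1].get()) is not None:
        results.append(item)
    await asyncio.gather(feeder, *workers)
    return results


# Placeholder stages; real ones would call the STT, translation, and TTS services.
async def fake_stt(audio: str) -> str:
    return f"text({audio})"


async def fake_translate(text: str) -> str:
    return f"hi({text})"


async def fake_tts(text: str) -> str:
    return f"audio({text})"


if __name__ == "__main__":
    print(asyncio.run(run_pipeline(["c1", "c2"], [fake_stt, fake_translate, fake_tts])))
```

Because a chunk has to be fully captured before the first stage can start on it, and then pass through every later stage, a design like this always has a latency floor. Overlapping the stages improves throughput but does not remove the per-chunk delay.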
What the App Looks Like
The video plays inside a separate native mpv window. The Electron window acts exclusively as your control panel for tracking subtitles, adjusting volume, and managing progress. Both components stay completely in sync over mpv's IPC socket.
What Makes It Different
Hinglish Code-Switching
Translation shouldn't read like a formal dictionary. Hindi and other Indian languages naturally blend English technical terms into normal conversation. Phrases like "yeh function ek callback leta hai" ("this function takes a callback") sound natural, whereas a pure Hindi translation feels forced and confusing.
The LLaMA model is explicitly prompted for this hybrid style. Additionally, a slang replacement map processes the output to handle formal-to-casual substitutions right before synthesis.
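A slang replacement pass of the kind described here could look like the sketch below. The map entries are hypothetical examples in romanized Hindi chosen for readability; the project's real map is not shown in this post, and `casualize` is an illustrative name.

```python
import re

# Hypothetical formal-to-casual entries, for illustration only.
CASUAL_MAP = {
    "kripya": "please",
    "dhanyavaad": "thanks",
    "udaharan": "example",
}

# Longest keys first so a longer entry wins over any shorter entry it contains.
_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(k) for k in sorted(CASUAL_MAP, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)


def casualize(text: str) -> str:
    """Replace whole-word formal terms with their casual equivalents."""
    return _PATTERN.sub(lambda m: CASUAL_MAP[m.group(0).lower()], text)


if __name__ == "__main__":
    print(casualize("kripya yeh udaharan dekho"))
```

One caveat if the text is in Devanagari rather than romanized: Python's `\b` may not treat combining vowel signs as word characters, so splitting on whitespace and replacing whole tokens is probably the safer approach there.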
The Linux Driver Bug That Shaped the Architecture
Why use mpv instead of a standard built-in HTML5 video player?
Initially, Electron's <video> element threw constant SIGSEGV crashes on Linux machines with Optimus (Intel + NVIDIA) hybrid graphics. The GPU driver routinely killed Chromium's renderer the moment it touched hardware video decoding or the AudioContext API.
The Fix: Hand all heavy media lifting entirely over to mpv and control it via a JSON IPC socket. The Electron window remains a pure HTML/CSS UI layer that never touches video GPU registers. What started as a workaround became the defining architecture of the app.
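For readers who haven't used it, mpv's JSON IPC is a line-based protocol. Each command is a JSON object terminated by a newline, and each reply carries an `error` field and echoes the `request_id`. The sketch below shows a small Python client for a Unix socket. It illustrates the protocol only: the app itself talks to mpv from Electron, and the class and function names here are made up for this example.

```python
import json
import socket
from typing import Any, Optional


def encode_command(request_id: int, *args: Any) -> bytes:
    """Serialize one mpv IPC command as a newline-terminated JSON object."""
    payload = {"command": list(args), "request_id": request_id}
    return (json.dumps(payload) + "\n").encode("utf-8")


def parse_reply(line: str, request_id: int) -> Optional[dict]:
    """Return the message if it is the reply to `request_id`, else None.

    mpv also sends asynchronous event messages on the same socket;
    those have an "event" key and no matching request_id, so they are skipped.
    """
    if not line.strip():
        return None
    msg = json.loads(line)
    if "error" in msg and msg.get("request_id") == request_id:
        return msg
    return None


class MpvClient:
    """Tiny blocking client for an mpv instance started with --input-ipc-server."""

    def __init__(self, socket_path: str) -> None:
        self._sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self._sock.connect(socket_path)
        self._reader = self._sock.makefile("r", encoding="utf-8")
        self._next_id = 0

    def command(self, *args: Any) -> Any:
        """Send a command and block until its reply arrives."""
        self._next_id += 1
        self._sock.sendall(encode_command(self._next_id, *args))
        for line in self._reader:
            reply = parse_reply(line, self._next_id)
            if reply is not None:
                if reply["error"] != "success":
                    raise RuntimeError(f"mpv error: {reply['error']}")
                return reply.get("data")
        raise ConnectionError("mpv closed the IPC socket")
```

To try it, start mpv with `mpv --input-ipc-server=/tmp/mpv.sock video.mp4` (the socket path is arbitrary), then call `MpvClient("/tmp/mpv.sock").command("get_property", "time-pos")` to read the playback position or `.command("set_property", "pause", True)` to pause.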
Native Desktop App + Python Library
It works as a local desktop app, but you can also call it from your own Python code or use the CLI.
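```shell
pip install youtube-dubber
export GROQ_API_KEY=gsk_xxxx
```

```python
from youtube_dubber import dub

# Dub the video into Hindi with a female voice; output goes to ./output_audio
manifest = dub(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    lang="hindi",
    gender="female",
    out="./output_audio",
)
```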
Or execute directly from your terminal:
```shell
youtube-dubber --url https://youtu.be/VIDEO_ID --lang hindi --gender female --out ./output_audio
```
Supported Languages
The system supports 20 languages: Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, Spanish, French, German, Japanese, Chinese, Korean, Arabic, Portuguese, Russian, and Italian.
Note: Indian regional languages default to the Hinglish/code-switched style to keep technical terms in English. Other options translate fully.
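As a rough illustration of how a per-language style switch might be wired up, here is a sketch. Two things in it are assumptions: that the "Indian regional" group is the first ten languages in the list above, and the prompt wording, which is invented for this example and is not the project's actual prompt.

```python
# Assumption: the first ten supported languages form the code-switched group.
CODE_SWITCHED = frozenset({
    "hindi", "tamil", "telugu", "bengali", "marathi",
    "gujarati", "kannada", "malayalam", "punjabi", "urdu",
})


def translation_style(lang: str) -> str:
    """Return "code_switched" for the Indian-language group, else "full"."""
    return "code_switched" if lang.strip().lower() in CODE_SWITCHED else "full"


def build_system_prompt(lang: str) -> str:
    """Build a hypothetical system prompt for the translation model."""
    base = f"Translate the transcript into conversational {lang.strip().title()}."
    if translation_style(lang) == "code_switched":
        return base + " Keep English technical terms (function, callback, API) in English."
    return base + " Translate technical terms too where a common equivalent exists."


if __name__ == "__main__":
    print(build_system_prompt("hindi"))
    print(build_system_prompt("french"))
```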
The Tech Stack
| Layer | Tool | Why This Choice? |
|---|---|---|
| Transcription | Groq Whisper | Lightning-fast voice-to-text response times. |
| Translation | LLaMA 3.1-8b via Groq | Sub-second text inference speeds. |
| Voice Synthesis | edge-tts | High-quality Microsoft neural voices, no API key required. |
| Video Playback | mpv via JSON IPC | Bypasses Chromium GPU crashes entirely. |
| Desktop Framework | Electron | Fast, cross-platform UI development. |
| Audio Capture | PulseAudio | Direct system-level audio routing for Live Mode. |
| VAD | Silero VAD | Highly accurate Voice Activity Detection to segment chunks. |
| Video Download | yt-dlp + ffmpeg | Industry standard for reliable stream fetching and muxing. |
Honest Limitations
- Live Dub is Linux-Only: Bound strictly to PulseAudio dependencies for now.
- Initial Pass is Throttle-Gated: Processing long videos for the first time hits Groq free-tier rate limits during parallel batching.
- Language Variance: Hindi is heavily fine-tuned and sounds best; quality across the other 19 languages may vary.
- Connectivity Dependent: edge-tts relies on live connections to Microsoft servers to synthesize audio.
- Lag Floor: The 3–5s delay in Live Dub cannot be bypassed without implementing predictive AI text modeling.
- No Quick Seeking: You cannot skip ahead during the first pass; the dub engine generates chunks sequentially from 0:00.
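On the rate-limit item above: a common way for a client to cope with throttling is to retry with exponential backoff and jitter. The sketch below is generic and makes no claim about how youtube-dubber handles it. `RateLimited` is a hypothetical exception the caller would raise when the API signals throttling (typically HTTP 429), and `fn` stands for any single request.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical exception raised by the caller when the API reports throttling."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fn`, retrying on RateLimited with exponentially growing, jittered delays."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            # Delay doubles each attempt (about 1s, 2s, 4s, ...) with +/-50% jitter
            # so parallel workers do not all retry at the same moment.
            sleep(base_delay * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise AssertionError("unreachable")
```

Injecting `sleep` keeps the helper easy to test and lets a batch runner substitute its own scheduling.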
The Part That Surprised Me
193 GitHub clones in 3 days.
No public posts. No marketing. No Product Hunt launch. It gained traction purely from the package sitting on PyPI. The problem is real, and developers are actively hunting for solutions.
Try It Out
```shell
pip install youtube-dubber
```
- 📦 PyPI: pypi.org/project/youtube-dubber
- 🐙 GitHub: github.com/Ashut90/youtube-dubber
- 📄 License: GPL-3.0 (Fork it freely, just keep your modifications open source)
Star it, fork it, break it. GPL-3.0.

