Ashutosh Tiwari

Posted on Jun 6

I built a tool that dubs any YouTube video into Hindi — and 19 other languages — in real time

#opensource #whisper #linux #ai

Subtitles are not the same as understanding

I kept finding incredible technical tutorials on YouTube, but they were in languages I didn't speak. Auto-subtitles weren't cutting it, so I built youtube-dubber.

You paste a YouTube URL, pick a language, and it dubs the video in real time using a natural-sounding neural voice. No robotic monotone here. It sounds like an actual person explaining the concept to you, using Hinglish code-switching for Hindi and a conversational tone for everything else.

How It Works

The application handles audio processing using two distinct pipelines:

1. Video URL Mode

Extraction: Pulls captions from the YouTube video (falls back to Groq Whisper STT if captions are unavailable).
Translation: Sends the text to LLaMA 3.1-8b hosted on Groq for fast inference.
Synthesis: Converts the translated text to speech with edge-tts neural voices.
Sync: Keeps the dubbed audio aligned with video playback by talking to mpv over its JSON IPC socket.
Optimization: Caches every segment on disk. The first run is slow, but reruns load from the cache and start almost instantly.
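To make the caching idea concrete, here is a minimal sketch of segment-level disk caching. It is not the project's actual code: the function names, the `.mp3` extension, and the `synthesize` callable (a stand-in for the TTS step) are all assumptions made for illustration.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_path(cache_dir: Path, text: str, lang: str, voice: str) -> Path:
    """Derive a stable file name from everything that affects the audio."""
    key = "\x1f".join((lang, voice, text)).encode("utf-8")
    return cache_dir / f"{hashlib.sha256(key).hexdigest()}.mp3"


def get_or_synthesize(
    cache_dir: Path,
    text: str,
    lang: str,
    voice: str,
    synthesize: Callable[[str, str], bytes],  # hypothetical TTS stand-in
) -> Path:
    """Return the cached audio file for a segment, creating it on a miss."""
    path = cache_path(cache_dir, text, lang, voice)
    if path.exists():
        return path  # cache hit: no TTS call needed
    cache_dir.mkdir(parents=True, exist_ok=True)
    audio = synthesize(text, voice)
    # Write to a temporary name first, then rename into place, so an
    # interrupted run does not leave a partial file under the final name.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_bytes(audio)
    tmp.replace(path)
    return path
```

Keying on the language and voice as well as the text means switching either one produces a fresh file instead of replaying stale audio.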

2. Live Dub Mode (Linux Only)

Capture: Grabs live system audio via a PulseAudio virtual sink.
5-Agent Pipeline: Capture → Groq Whisper STT → LLaMA translation → edge-tts → speaker output.
Latency: Has a lag floor of roughly 3–5 seconds. This is inherent to a listen-then-translate architecture, not a software bug.
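One way to picture a staged pipeline like this is a chain of worker threads joined by FIFO queues. The sketch below is an assumption about the general shape, not the project's implementation, and the `fake_*` functions are hypothetical stand-ins for the real STT, translation, and TTS calls.

```python
import queue
import threading
from typing import Callable, Iterable, List

_DONE = object()  # sentinel that tells a stage the stream has ended


def run_stage(fn: Callable, inbox: queue.Queue, outbox: queue.Queue) -> None:
    """Pull items from inbox, transform them, and push results to outbox."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown downstream
            return
        outbox.put(fn(item))


def run_pipeline(chunks: Iterable, stages: List[Callable]) -> list:
    """Run each stage in its own thread, connected by FIFO queues."""
    queues = [queue.Queue() for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for chunk in chunks:
        queues[0].put(chunk)
    queues[0].put(_DONE)
    results = []
    while (item := queues[-1].get()) is not _DONE:
        results.append(item)
    for t in threads:
        t.join()
    return results


# Hypothetical stand-ins for the real network calls.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_translate(text: str) -> str:
    return f"[hi] {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")
```

Because each stage is a single thread reading from a FIFO queue, chunks come out in the order they went in. A chunk still has to be heard in full before the first stage can start on it, which is where the lag floor comes from. This sketch omits error handling: a stage that raises would stall the stages after it.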

What the App Looks Like

The video plays inside a separate native mpv window. The Electron window acts exclusively as your control panel for tracking subtitles, adjusting volume, and managing progress. Both components stay completely in sync over mpv's IPC socket.

What Makes It Different

Hinglish Code-Switching

Translation shouldn't read like a formal dictionary. Speakers of Hindi and other Indian languages naturally blend English technical terms into normal conversation. Phrases like "yeh function ek callback leta hai" sound natural, whereas a pure Hindi translation feels forced and confusing.

The LLaMA model is explicitly prompted for this hybrid style. Additionally, a slang replacement map processes the output to handle formal-to-casual substitutions right before synthesis.
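A slang replacement pass of this kind can be sketched as a whole-word substitution over the translated text. The entries below are made up for illustration (the project's real map is not shown in this post), and the sketch assumes romanized text, because Python's `\b` word boundary does not behave reliably inside Devanagari words.

```python
import re
from typing import Dict

# Hypothetical entries for illustration only. Keys must be lowercase.
CASUAL_MAP: Dict[str, str] = {
    "kripya": "please",
    "dhanyavaad": "thanks",
    "upyog": "use",
}


def build_pattern(mapping: Dict[str, str]) -> re.Pattern:
    """Compile one regex that matches any key as a whole word."""
    # Longest keys first so longer phrases win over their prefixes.
    keys = sorted(mapping, key=len, reverse=True)
    alternation = "|".join(re.escape(k) for k in keys)
    return re.compile(rf"\b({alternation})\b", re.IGNORECASE)


def casualize(text: str, mapping: Dict[str, str] = CASUAL_MAP) -> str:
    """Swap formal words for casual ones, leaving everything else untouched."""
    pattern = build_pattern(mapping)
    return pattern.sub(lambda m: mapping[m.group(0).lower()], text)
```

For example, `casualize("Kripya is function ka upyog karo")` returns `"please is function ka use karo"`, while a longer word such as "upyogi" is left alone because the match is anchored on word boundaries.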

The Linux Driver Bug That Shaped the Architecture

Why use mpv instead of a standard built-in HTML5 video player?

Initially, Electron's <video> element crashed constantly with SIGSEGV on Linux machines with Optimus (Intel + NVIDIA) hybrid graphics. The GPU driver killed Chromium's renderer the moment it touched hardware video decoding or the AudioContext API.

The Fix: Hand all media playback over to mpv and control it via a JSON IPC socket. The Electron window remains a pure HTML/CSS UI layer that never touches GPU video decoding. What started as a workaround became the defining architecture of the app.
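For readers who haven't used mpv's JSON IPC: when mpv is started with `--input-ipc-server=<path>`, it accepts newline-terminated JSON commands on that socket and answers with JSON lines, interleaved with asynchronous event messages. Below is a minimal Python sketch of a client. The helper names and the socket path are assumptions for illustration; the app's own control code runs in Electron and is not shown in this post.

```python
import json
import socket
from typing import Any, List, Optional


def encode_command(request_id: int, *args: Any) -> bytes:
    """Serialize one mpv IPC command as a newline-terminated JSON line."""
    message = {"command": list(args), "request_id": request_id}
    return json.dumps(message).encode("utf-8") + b"\n"


def find_reply(lines: List[bytes], request_id: int) -> Optional[dict]:
    """Pick the reply for request_id, skipping asynchronous event messages."""
    for line in lines:
        if not line.strip():
            continue
        message = json.loads(line)
        if message.get("request_id") == request_id:
            return message
    return None


def get_property(socket_path: str, name: str, request_id: int = 1) -> Any:
    """Ask a running mpv for a property such as "time-pos"."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        sock.sendall(encode_command(request_id, "get_property", name))
        buffer = b""
        while True:
            data = sock.recv(4096)
            if not data:
                raise ConnectionError("mpv closed the IPC socket")
            buffer += data
            # Keep the trailing partial line in the buffer for the next read.
            *complete, buffer = buffer.split(b"\n")
            reply = find_reply(complete, request_id)
            if reply is not None:
                if reply.get("error") != "success":
                    raise RuntimeError(reply.get("error"))
                return reply.get("data")
```

With mpv launched as `mpv --input-ipc-server=/tmp/mpv-socket video.mp4` (the path is arbitrary), `get_property("/tmp/mpv-socket", "time-pos")` returns the current playback position in seconds. Matching replies on `request_id` matters because mpv also pushes event messages down the same socket.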

Native Desktop App + Python Library

It works as a local desktop app, but you can also call it from your own Python code or use the CLI.

pip install youtube-dubber
export GROQ_API_KEY=gsk_xxxx

from youtube_dubber import dub

manifest = dub(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    lang="hindi",
    gender="female",
    out="./output_audio",
)

Or execute directly from your terminal:

youtube-dubber --url https://youtu.be/VIDEO_ID --lang hindi --gender female --out ./output_audio

Supported Languages

The system supports 20 languages: Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, Spanish, French, German, Japanese, Chinese, Korean, Arabic, Portuguese, Russian, and Italian.

Note: Indian regional languages default to a code-switched style (Hinglish, in the case of Hindi) that keeps technical terms in English. The other languages are translated fully.

The Tech Stack

Layer	Tool	Why This Choice?
Transcription	Groq Whisper	Lightning-fast voice-to-text response times.
Translation	LLaMA 3.1-8b via Groq	Sub-second text inference speeds.
Voice Synthesis	edge-tts	High-quality Microsoft neural voices without API key limits.
Video Playback	mpv via JSON IPC	Bypasses Chromium GPU crashes entirely.
Desktop Framework	Electron	Fast, cross-platform UI development.
Audio Capture	PulseAudio	Direct system-level audio routing for Live Mode.
VAD	Silero VAD	Highly accurate Voice Activity Detection to segment chunks.
Video Download	yt-dlp + ffmpeg	Industry standard for reliable stream fetching and muxing.

Honest Limitations

Live Dub is Linux-Only: Bound strictly to PulseAudio dependencies for now.
Initial Pass is Throttle-Gated: The first pass over a long video can hit Groq free-tier rate limits during parallel batching.
Language Variance: Hindi has had the most tuning and sounds best; quality across the other 19 languages may vary.
Connectivity Dependent: edge-tts needs a live connection to Microsoft's servers to synthesize audio.
Lag Floor: The 3–5s delay in Live Dub cannot be removed without predictive modeling that guesses upcoming speech before it is spoken.
No Quick Seeking: You cannot skip ahead during the first pass; the dub engine generates chunks sequentially from 0:00.
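On the rate-limit point, a common way to cope with throttling is to retry with exponential backoff and jitter. The sketch below is generic and does not describe how youtube-dubber handles it: `RateLimited` is a hypothetical exception the caller would raise on an HTTP 429 response, and the delay values are arbitrary.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller when the API answers HTTP 429."""


def call_with_backoff(
    fn: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry fn on RateLimited, doubling the wait (plus jitter) each time."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of retries: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out retries from parallel workers.
            sleep(delay + random.uniform(0, delay / 2))
    raise RuntimeError("retry loop exited unexpectedly")
```

The `sleep` parameter is injectable so the behavior can be tested without actually waiting.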

The Part That Surprised Me

193 GitHub clones in 3 days.

No public posts. No marketing. No Product Hunt launch. It gained traction purely from the package sitting on PyPI. The problem is real, and developers are actively hunting for solutions.

Try It Out

pip install youtube-dubber

📦 PyPI: pypi.org/project/youtube-dubber
🐙 GitHub: github.com/Ashut90/youtube-dubber
📄 License: GPL-3.0 (Fork it freely, just keep your modifications open source)

Star it, fork it, break it.

DEV Community