DEV Community: Ashutosh Tiwari

I built a tool that dubs any YouTube video into Hindi — and 19 other languages — in real time

Ashutosh Tiwari — Sat, 06 Jun 2026 11:01:47 +0000

Subtitles are not the same as understanding

I kept finding incredible technical tutorials on YouTube — but they were in languages I didn't speak. Auto-subtitles simply weren't cutting it. So, I built youtube-dubber.

You paste a YouTube URL, pick a language, and it dubs the video in real time with a natural-sounding neural voice. No robotic monotone. The goal is for it to sound like a person explaining the concept to you: Hinglish code-switching for Hindi, and a conversational tone for everything else.

How It Works

The application handles audio processing using two distinct pipelines:

1. Video URL Mode

Extraction: Pulls captions from the YouTube video (falls back to Groq Whisper STT if captions are unavailable).
Translation: Sends the text to LLaMA 3.1-8b, hosted on Groq for low-latency inference.
Synthesis: Generates speech with edge-tts neural voices.
Sync: Coordinates audio and playback through mpv via a JSON IPC socket.
Optimization: Caches every synthesized segment on disk. The first run of a video is slow, but reruns load the cached segments and start almost immediately.
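
The segment-level caching step can be sketched roughly like this. It is a minimal illustration, not the project's actual code: the function names, the SHA-256 key scheme, and the `.mp3` extension are all assumptions, and `synthesize` stands in for whatever calls the TTS engine.

```python
import hashlib
from pathlib import Path


def segment_cache_path(cache_dir: Path, text: str, lang: str, voice: str) -> Path:
    """Derive a stable file name from everything that affects the audio."""
    key = "\x1f".join((lang, voice, text)).encode("utf-8")
    return cache_dir / (hashlib.sha256(key).hexdigest() + ".mp3")


def get_or_synthesize(cache_dir: Path, text: str, lang: str, voice: str, synthesize) -> bytes:
    """Return cached audio if present; otherwise synthesize once and store it."""
    path = segment_cache_path(cache_dir, text, lang, voice)
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text)  # hypothetical callable: text in, audio bytes out
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = path.with_suffix(".tmp")
    tmp.write_bytes(audio)
    tmp.replace(path)  # rename last so a crash never leaves a half-written segment
    return audio
```

Keying on language and voice as well as text means switching from a female to a male voice does not replay stale audio.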

2. Live Dub Mode (Linux Only)

Capture: Grabs live system audio via a PulseAudio virtual sink.
5-Agent Pipeline: Five stages run in sequence: Capture → Groq Whisper STT → LLaMA Translation → edge-tts → Speaker Output.
Latency: Expect a lag floor of roughly 3–5 seconds. This is inherent to a listen-then-translate architecture, not a software bug: a sentence has to be heard before it can be translated.

What the App Looks Like

The video plays inside a separate native mpv window. The Electron window acts only as the control panel: it shows subtitles, adjusts volume, and tracks playback progress. The two stay in sync over mpv's IPC socket.

What Makes It Different

Hinglish Code-Switching

Translation shouldn't read like a formal dictionary. Speakers of Hindi and other Indian languages naturally blend English technical terms into everyday conversation. A phrase like "yeh function ek callback leta hai" sounds natural, whereas a pure Hindi translation feels forced and confusing.

The LLaMA model is explicitly prompted for this hybrid style. Additionally, a slang replacement map processes the output to handle formal-to-casual substitutions right before synthesis.
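
A replacement map like this could be applied with a single regex pass. The sketch below assumes romanized text and uses made-up entries; the project's real map and its matching rules are not shown here.

```python
import re

# Hypothetical formal-to-casual entries, for illustration only.
CASUAL_MAP = {
    "kripya": "please",
    "dhanyavaad": "thanks",
    "upyog": "use",
}

# Longest keys first so a longer entry is never shadowed by a shorter prefix.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in sorted(CASUAL_MAP, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)


def casualize(text: str) -> str:
    """Swap formal words for casual ones, matching whole words only."""
    return _PATTERN.sub(lambda m: CASUAL_MAP[m.group(1).lower()], text)
```

Whole-word matching keeps the map from rewriting the inside of longer words. Devanagari text would need a different boundary rule, because `\b` does not treat combining vowel signs as word characters.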

The Linux Driver Bug That Shaped the Architecture

Why use mpv instead of a standard built-in HTML5 video player?

Initially, Electron's <video> element crashed repeatedly with SIGSEGV on Linux machines with Optimus (Intel + NVIDIA) hybrid graphics. On those machines the GPU driver killed Chromium's renderer process as soon as it touched hardware video decoding or the AudioContext API.

The Fix: Hand all media decoding and playback to mpv and control it via a JSON IPC socket. The Electron window remains a pure HTML/CSS UI layer that never touches GPU video decoding. What started as a workaround became the defining architecture of the app.
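
For readers who have not used mpv's JSON IPC: when mpv is started with --input-ipc-server=<socket path>, it accepts one JSON object per line and replies the same way. The sketch below shows that wire format on a Unix socket. The helper names are my own and this is not the app's actual client code.

```python
import json
import socket


def encode_command(name: str, *args, request_id: int = 0) -> bytes:
    """Serialize one mpv IPC command as a newline-terminated JSON line."""
    msg = {"command": [name, *args], "request_id": request_id}
    return (json.dumps(msg) + "\n").encode("utf-8")


def decode_lines(buffer: bytes):
    """Split buffered bytes into parsed messages plus the unfinished remainder."""
    *lines, rest = buffer.split(b"\n")
    return [json.loads(line) for line in lines if line.strip()], rest


def mpv_request(sock_path: str, name: str, *args, request_id: int = 1):
    """Send one command and wait for the reply carrying the same request_id."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(sock_path)
        sock.sendall(encode_command(name, *args, request_id=request_id))
        buffer = b""
        while True:
            data = sock.recv(4096)
            if not data:
                raise ConnectionError("mpv closed the IPC socket")
            messages, buffer = decode_lines(buffer + data)
            for msg in messages:
                # Event messages carry no request_id, so they are skipped here.
                if msg.get("request_id") == request_id:
                    return msg
```

With mpv running as `mpv --input-ipc-server=/tmp/mpv.sock video.mp4`, calling `mpv_request("/tmp/mpv.sock", "get_property", "time-pos")` returns a reply whose "data" field is the playback position in seconds, and `mpv_request("/tmp/mpv.sock", "set_property", "pause", True)` pauses playback. The socket path here is just an example.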

Native Desktop App + Python Library

It runs as a local desktop app, but you can also call it as a Python library from your own code or use the CLI.

pip install youtube-dubber
export GROQ_API_KEY=gsk_xxxx

from youtube_dubber import dub

manifest = dub(
    "https://www.youtube.com/watch?v=VIDEO_ID",
    lang="hindi",
    gender="female",
    out="./output_audio",
)

Or execute directly from your terminal:

youtube-dubber --url https://youtu.be/VIDEO_ID --lang hindi --gender female --out ./output_audio

Supported Languages

The system supports 20 languages: Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu, Spanish, French, German, Japanese, Chinese, Korean, Arabic, Portuguese, Russian, and Italian.

Note: Hindi and the other Indian languages default to a code-switched style (Hinglish, in the case of Hindi) that keeps technical terms in English. The remaining languages are translated in full.

The Tech Stack

Layer	Tool	Why This Choice?
Transcription	Groq Whisper	Lightning-fast voice-to-text response times.
Translation	LLaMA 3.1-8b via Groq	Sub-second text inference speeds.
Voice Synthesis	edge-tts	High-quality Microsoft neural voices without API key limits.
Video Playback	mpv via JSON IPC	Bypasses Chromium GPU crashes entirely.
Desktop Framework	Electron	Fast, cross-platform UI development.
Audio Capture	PulseAudio	Direct system-level audio routing for Live Mode.
VAD	Silero VAD	Highly accurate Voice Activity Detection to segment chunks.
Video Download	yt-dlp + ffmpeg	Industry standard for reliable stream fetching and muxing.

Honest Limitations

Live Dub is Linux-Only: Bound strictly to PulseAudio dependencies for now.
Initial Pass is Throttle-Gated: The first pass over a long video can hit Groq free-tier rate limits during parallel batching, which slows processing down.
Language Variance: Hindi has received the most tuning and sounds best; quality across the other 19 languages may vary.
Connectivity Dependent: edge-tts needs a live connection to Microsoft's servers to synthesize audio.
Lag Floor: The 3–5s delay in Live Dub cannot be removed without predicting what the speaker will say before they say it.
No Quick Seeking: You cannot skip ahead during the first pass; the dub engine generates chunks sequentially from 0:00.
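
A common way to live with free-tier rate limits like the ones mentioned above is to retry with exponential backoff and jitter. This is a generic sketch of that technique, not a description of how youtube-dubber handles it; `RateLimited` is a hypothetical exception that a client wrapper would raise on an HTTP 429 response.

```python
import random
import time


class RateLimited(Exception):
    """Hypothetical error raised by a client wrapper on HTTP 429."""


def with_backoff(call, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Retry call() on RateLimited, doubling the wait each time, with jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            # Jitter spreads out parallel workers so they do not retry in lockstep.
            sleep(delay * random.uniform(0.5, 1.0))
```

The `sleep` parameter is injectable so the retry logic can be tested without actually waiting.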

The Part That Surprised Me

193 GitHub clones in 3 days.

No public posts. No marketing. No Product Hunt launch. It gained traction purely from the package sitting on PyPI. The problem is real, and developers are actively looking for solutions.

Try It Out

pip install youtube-dubber

📦 PyPI: pypi.org/project/youtube-dubber
🐙 GitHub: github.com/Ashut90/youtube-dubber
📄 License: GPL-3.0 (fork it freely; if you distribute a modified version, keep it open source under the same license)

Star it, fork it, break it. GPL-3.0.

I Was Tired of AI Subscriptions, So I Built a Free Local PDF Tutor for Dense Docs

Ashutosh Tiwari — Tue, 02 Jun 2026 20:05:46 +0000

I work on embedded systems, which means I spend most of my time inside dense PDFs — processor reference manuals, Linux internals textbooks, datasheets that run hundreds of pages.

When "chat with your PDF" tools showed up, I was excited. Then I actually used them for real work, and three problems kept breaking my workflow:

1. Privacy. Most tools make you upload the document to their cloud. I'm not comfortable sending proprietary datasheets or copyrighted textbooks to a server I don't control.

2. Context limits. A single technical chapter can be dozens of pages of diagrams and code. Most wrappers quietly truncate it or start hallucinating once they hit their token limit — and they don't tell you it happened.

3. Summaries don't equal understanding. Reading an AI summary of how Linux memory mapping works feels productive. But it doesn't mean you can actually use it two weeks later. Passive reading isn't retention.

I wanted something local-first, private, and built around remembering what I read — not just summarizing it. So I built PDF Tutor.

🔗 https://github.com/Ashut90/pdf-tutor (open source, MIT)

How it works

PDF Tutor is a desktop app written in Python 3.9+ with a three-pane Tkinter interface. The core idea is a hybrid local/cloud model: run everything locally by default, and fall back to free cloud APIs only when a chapter is too big for the local model's context.

                 Local PDF
                    │  (PyMuPDF, parsed locally)
                    ▼
            Orchestration engine
            ┌───────┴────────┐
            ▼                ▼
     Ollama (local)    Free cloud APIs
   qwen2.5-coder /     (Gemini 1M context,
      llama3            Groq, OpenRouter)
            └───────┬────────┘
                    ▼
              Output tracks
     ┌──────────┬──────────┬──────────┐
     │   Anki   │ Diagrams │   TTS    │
     │  export  │ Graphviz │ pyttsx3  │
     └──────────┴──────────┴──────────┘

Local parsing. PDFs are parsed entirely on-device with PyMuPDF — no upload, no telemetry.
Local inference. Ollama runs the LLM (qwen2.5-coder:7b or llama3) on consumer hardware — it works on a laptop with 8GB RAM.
Cloud only when needed. If a chapter is too large for local context, it falls back to a free tier — Gemini (1M context), Groq, or OpenRouter. Your choice.
Offline diagrams. Mind maps and flowcharts render with Graphviz/Matplotlib, so they work fully offline.
Local audio. Text-to-speech runs on-device with pyttsx3.
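
The local-first routing described above could look something like the sketch below. Everything in it is an assumption for illustration: the four-characters-per-token estimate, the 8192-token local window, the reserve for the answer, and the backend names. The real orchestration engine may decide differently.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic: about four characters per token for English text."""
    return max(1, len(text) // 4)


def choose_backend(
    chapter_text: str,
    local_context_tokens: int = 8192,   # assumed local model window
    reserve_for_answer: int = 1024,     # leave room for the model's reply
    cloud_enabled: bool = True,
) -> str:
    """Prefer the local model; use the cloud only if the chapter will not fit."""
    budget = local_context_tokens - reserve_for_answer
    if estimate_tokens(chapter_text) <= budget:
        return "local"
    if cloud_enabled:
        return "cloud"
    # Hypothetical offline fallback: split the chapter and process it piecewise.
    return "local-chunked"
```

Making the decision explicit, rather than letting the prompt overflow, avoids the silent truncation that motivated the tool in the first place.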

The part I actually use: active recall

A summary isn't learning. So the tool's prompts are built around the VARK model — turning a chapter into different formats depending on how you learn:

Visual → mind maps, flowcharts, tables
Auditory → spoken, conversational explanations
Read/Write → notes and writing prompts
Kinesthetic → extracted code snippets and shell commands you can actually run
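
As an example of the Visual track, a mind map can be expressed as Graphviz DOT text and rendered offline with the dot command. The sketch below only builds the DOT string from a topic and its branches. The function name and the input shape are my own assumptions, not PDF Tutor's internals.

```python
def to_dot(topic: str, branches: dict) -> str:
    """Build a Graphviz DOT digraph: topic -> branch -> leaf."""
    def quote(label: str) -> str:
        # DOT string literals need backslashes and double quotes escaped.
        return '"' + label.replace("\\", "\\\\").replace('"', '\\"') + '"'

    lines = ["digraph mindmap {", "  rankdir=LR;", "  node [shape=box];"]
    for branch, leaves in branches.items():
        lines.append(f"  {quote(topic)} -> {quote(branch)};")
        for leaf in leaves:
            lines.append(f"  {quote(branch)} -> {quote(leaf)};")
    lines.append("}")
    return "\n".join(lines)
```

Saving the result as map.dot and running `dot -Tpng map.dot -o map.png` produces the image without any network access.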

The feature I rely on most is Anki flashcard generation. PDF Tutor turns a chapter into atomic Q&A pairs and exports a .txt deck you import straight into Anki. So instead of reading about Linux memory mapping and hoping it sticks, I'm doing spaced repetition on it the same day:

Q: What kernel structure represents a task's state in Linux?
A: struct task_struct

Q: Difference between a process and a thread in the kernel?
A: Processes have separate virtual address spaces; threads in the same process share one.
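
Anki can import plain-text files with one note per line and fields separated by tabs, so a deck export can be as simple as the sketch below. The function names are hypothetical and PDF Tutor's real exporter may format things differently.

```python
from pathlib import Path


def clean(field: str) -> str:
    """Collapse tabs and newlines so each card stays on one line."""
    return " ".join(field.split())


def export_deck(cards, path) -> int:
    """Write (question, answer) pairs as tab-separated lines; return the count."""
    lines = [f"{clean(question)}\t{clean(answer)}" for question, answer in cards]
    Path(path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return len(lines)
```

In Anki's import dialog, the first field maps to the front of the card and the second to the back.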

Try it

git clone https://github.com/Ashut90/pdf-tutor
cd pdf-tutor

python3 -m venv venv
source venv/bin/activate   # Windows: venv\Scripts\activate
pip install -r requirements.txt

python run.py

For offline use, pull a local model first: ollama pull qwen2.5-coder:7b. For cloud mode, paste a free-tier API key into the settings.
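
To check that the local model is reachable before launching the app, you can hit Ollama's HTTP API directly. This sketch uses Ollama's /api/generate endpoint on its default port (11434) with streaming turned off. The helper names are mine, and PDF Tutor may talk to Ollama differently.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(prompt: str, model: str = "qwen2.5-coder:7b") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "qwen2.5-coder:7b", timeout: float = 120.0) -> str:
    """Send the prompt to Ollama and return the generated text."""
    with urllib.request.urlopen(build_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

If `generate("Say hello in one word.")` returns text, the model is pulled and the server is running; a connection error usually means Ollama is not started.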

What's next

Conversation history (SQLite)
EPUB and DjVu support
A built-in spaced-repetition scheduler, so you don't need to export to Anki at all

It's a work in progress and I'd genuinely like feedback — especially from people who deal with a lot of technical docs. If you hit an edge case or have ideas, open an issue or comment here. And if it's useful to you, a star helps others find it.

🔗 https://github.com/Ashut90/pdf-tutor