What if you could speak, and everyone listening heard you in their own language, with no noticeable delay?
That question turned into PolyDub.
## What It Does
Three modes:
- Live Broadcast: one speaker, listeners worldwide, each hearing a dubbed stream in their language
- Multilingual Rooms: everyone speaks their own language, everyone hears everyone else in theirs
- VOD Dubbing: upload a video, download a dubbed MP4 with SRT subtitles
The real-time pipeline:
```
Mic -> WebSocket -> Deepgram Nova-2 (STT) -> Google Translate (~300ms) -> Deepgram Aura-2 (TTS) -> Speaker
```
Perceived latency is around 1.2 to 1.5 seconds. Fast enough for a real conversation.
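The stage boundaries above can be sketched as a small async composition, with each service stubbed behind a function. The names (`makePipeline`, `stt`, `mt`, `tts`) are illustrative, not PolyDub's actual code:

```typescript
// Each utterance flows STT -> MT -> TTS; a failure in one utterance is
// caught and dropped so it cannot stall the live stream. The stage
// functions stand in for the Deepgram and Google calls.
type Stage = (input: string) => Promise<string>;

function makePipeline(stt: Stage, mt: Stage, tts: Stage) {
  return async (chunk: string): Promise<string | null> => {
    try {
      return await tts(await mt(await stt(chunk)));
    } catch {
      return null; // drop this chunk, keep the stream alive
    }
  };
}
```

In the real pipeline the input and output are audio buffers rather than strings; strings just keep the sketch self-contained.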
## A Few Decisions Worth Explaining
Why Google Translate instead of Lingo.dev for real-time? Lingo.dev is LLM-based, which means 5 to 8 seconds of latency. Fine for batch work, not for live speech. Google's gtx endpoint runs at 250 to 350ms warm. Lingo.dev is still in the project, compiling UI strings at build time across 15 locales.
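For reference, a call to that endpoint can be as small as the sketch below. The gtx endpoint is unauthenticated and unofficial, so its URL shape and nested-array response format are assumptions based on publicly observed behavior, and they can change without notice:

```typescript
// Unofficial Google Translate endpoint (assumption: shape as publicly observed).
const GTX = "https://translate.googleapis.com/translate_a/single";

function gtxUrl(text: string, source: string, target: string): string {
  const params = new URLSearchParams({
    client: "gtx",
    sl: source, // source language code, e.g. "en"
    tl: target, // target language code, e.g. "ja"
    dt: "t",    // request translated text segments
    q: text,
  });
  return `${GTX}?${params}`;
}

async function translateFast(text: string, source: string, target: string): Promise<string> {
  const res = await fetch(gtxUrl(text, source, target));
  // The response is a nested array; each segment's translation sits at [0][i][0].
  const data = (await res.json()) as [string, string][][];
  return data[0].map((seg) => seg[0]).join("");
}
```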
Why Deepgram Aura-2? Aura v1 only shipped English voices regardless of the language param. Aura-2 ships genuinely native-accent voices: Japanese prosody, Spanish regional variation, German intonation. Using an English voice mispronouncing another language defeats the entire product.
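A minimal sketch of a Deepgram TTS call, via the public `/v1/speak` endpoint with a `model` query parameter. The exact Aura-2 model name you pass is an assumption here; pick a voice from Deepgram's model list:

```typescript
// Build the Deepgram TTS URL; the model string (e.g. an Aura-2 voice)
// is supplied by the caller.
function speakUrl(model: string): string {
  return `https://api.deepgram.com/v1/speak?model=${encodeURIComponent(model)}`;
}

async function synthesize(text: string, model: string): Promise<ArrayBuffer> {
  const res = await fetch(speakUrl(model), {
    method: "POST",
    headers: {
      Authorization: `Token ${process.env.DEEPGRAM_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ text }),
  });
  if (!res.ok) throw new Error(`TTS failed: ${res.status}`);
  return res.arrayBuffer(); // encoded audio, ready to forward to a listener
}
```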
Why a per-listener TTS queue? In a room with multiple speakers, audio chunks from different people arrive at the same socket in parallel. Without serialization they interleave into noise. A per-socket promise chain fixes this, and the queue depth is capped at 1 so stale utterances get dropped rather than building an 8-second backlog.
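The serialization-plus-cap idea can be sketched as a tiny class, one instance per listener socket. The names (`TtsQueue`, `enqueue`) are illustrative, not the actual PolyDub code:

```typescript
// One TtsQueue per listener socket: utterances play strictly one at a
// time, and at most one waits behind the current playback.
type Utterance = () => Promise<void>;

class TtsQueue {
  private playing = false;
  private next: Utterance | null = null; // queue depth capped at 1

  enqueue(utterance: Utterance): void {
    if (this.playing) {
      // A newer utterance replaces any stale one that was waiting,
      // instead of building a multi-second backlog.
      this.next = utterance;
      return;
    }
    this.run(utterance);
  }

  private run(utterance: Utterance): void {
    this.playing = true;
    utterance().finally(() => {
      this.playing = false;
      const pending = this.next;
      this.next = null;
      if (pending) this.run(pending);
    });
  }
}
```

The promise chain lives inside `run`: each playback's `finally` hands control to whatever is waiting, so chunks from different speakers can never interleave on one socket.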
## Screenshots
Broadcast mode: pick source and target languages, hit Start, share the listener link.
Rooms: each participant sets their own language and voice. The server handles translation per-person.
VOD: upload a video, pick a language, get a dubbed MP4 and SRT file back.
## Testing With TestSprite MCP
The project was built under hackathon pressure. Third-party APIs can fail in specific ways. Frontend validation is easy to break quietly. Writing full test coverage by hand would have eaten most of the remaining build time.
TestSprite MCP plugs into Claude Code as an MCP server. It reads the codebase, generates a test plan, and writes runnable test code. I ran it twice: once for a baseline, and again after a round of fixes.
Backend tests generated (5/5 passing):
| Test | What it checks |
|---|---|
| TC001 | `POST /api/dub` with valid file returns `{ srt, mp3 }` |
| TC002 | `POST /api/dub` with missing params returns 400 |
| TC003 | `POST /api/dub` with a broken third-party API returns 500 |
| TC004 | `POST /api/mux` with valid inputs returns a `video/mp4` stream |
| TC005 | `POST /api/mux` with missing inputs returns 400 |
The generated code is more thorough than what you'd write in a hurry. TC001 builds a minimal valid WAV file inline, validates the base64 response actually decodes, and checks the SRT string is non-empty:
```python
mp3_bytes = base64.b64decode(json_data["mp3"], validate=True)
assert len(mp3_bytes) > 0
assert "srt" in json_data and len(json_data["srt"].strip()) > 0
```
Frontend tests generated (12 cases): broadcast start and validation, room create/join/leave/rejoin, language and voice change in-session, VOD upload validation, and landing-to-mode navigation flows.
What the first run caught:
- `/api/dub` was returning a plain string in some error paths instead of a consistent JSON shape. TC003 found it.
- The room ID field was letting through malformed IDs before hitting the server. TC009 found it.
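The room-ID fix amounts to a strict client-side pattern test before the join request goes out. The six-character alphanumeric format below is an assumption for illustration, not PolyDub's actual scheme:

```typescript
// Assumed format: six uppercase alphanumeric characters.
const ROOM_ID = /^[A-Z0-9]{6}$/;

function isValidRoomId(raw: string): boolean {
  // Normalize before testing so "abc123 " and "ABC123" are treated alike.
  return ROOM_ID.test(raw.trim().toUpperCase());
}
```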
Fixed both, reran, all clean. The dashboard keeps a full run history so you can diff before and after. That is the actual useful part: not a single passing run, but a record of what broke, what changed, and whether the fix held.
## Running It
```shell
git clone https://github.com/your-username/polydub
cd polydub
pnpm install
cp .env.example .env
# set DEEPGRAM_API_KEY and LINGO_API_KEY in .env
pnpm dev     # terminal 1: Next.js on :3000
pnpm server  # terminal 2: WebSocket server on :8080
```