TL;DR: I built PolyDub, a real‑time multilingual video dubbing app. This post shows the practical pieces that make it work: UX, WebSockets, streaming STT → translate → TTS, and automated UI i18n. You can reuse the same approach for any React app.
The use case (why this matters)
Ever tried hosting a live webinar for a global audience? You’re speaking English, half your audience prefers Spanish, and everyone else is quietly struggling.
PolyDub turns that into: speak once, listeners hear you in their language. It works for:
- Live broadcasts (one‑to‑many)
- Multilingual meetings (many‑to‑many)
- Demos, classes, and community events
The core idea
At a high level, PolyDub is just a fast loop:
Audio In → Speech‑to‑Text → Translate → Text‑to‑Speech → Audio Out
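The loop above can be sketched as three composable stages. The function names and types here are illustrative stand-ins, not real Deepgram or Lingo.dev SDK calls:

```typescript
// Illustrative types for the dubbing loop. The three stage functions
// are hypothetical stand-ins for real STT / translation / TTS calls.
type AudioChunk = Uint8Array;

interface DubbingStages {
  speechToText: (chunk: AudioChunk) => string;          // e.g. Deepgram STT
  translate: (text: string, target: string) => string;  // e.g. Lingo.dev
  textToSpeech: (text: string) => AudioChunk;           // e.g. Deepgram TTS
}

// One pass through the loop: audio in → audio out in the target language.
function dubChunk(chunk: AudioChunk, target: string, s: DubbingStages): AudioChunk {
  const transcript = s.speechToText(chunk);
  const translated = s.translate(transcript, target);
  return s.textToSpeech(translated);
}
```

Because each stage is just a function of the previous stage's output, you can swap providers or add steps (profanity filtering, glossary lookup) without touching the loop itself.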
Step 1: Build a UI that’s ready for multiple languages
I started with a Next.js app and a UI built from small components (buttons, selects, transcript panels). The key is keeping copy centralized so it can be extracted and translated.
Tip: Avoid hard‑coding strings deep in components. Make them easy to collect.
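One minimal way to keep copy collectible is a single string table with a typed lookup. The keys and strings below are examples, not PolyDub's actual copy:

```typescript
// A minimal pattern for keeping UI copy in one place instead of
// hard-coding strings inside components. Keys and strings are examples.
const copy = {
  startBroadcast: "Start broadcast",
  pickLanguage: "Pick your language",
} as const;

type CopyKey = keyof typeof copy;

// Components ask for copy by key, so an i18n tool can later swap the
// whole table per locale without touching component code.
function t(key: CopyKey): string {
  return copy[key];
}
```

With this shape, `<button>{t("startBroadcast")}</button>` is trivially extractable, and the type system catches typo'd keys at compile time.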
Step 2: Automate UI translation
Manually translating UI strings is a time sink. I used Lingo.dev to automate extraction and generation of locale files.
What that gets you:
- Automatic string extraction from React components
- Versionable JSON locale files
- One build step to update all languages
Example flow
- Write UI in English
- Run build
- Locale files are generated
- UI is instantly multilingual
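The generated locale files can then be loaded as plain JSON tables keyed by the same string IDs. This sketch assumes a flat key structure and an English fallback; the exact file shape Lingo.dev emits may differ:

```typescript
// Hypothetical shape of generated locale files: one flat JSON table
// per language, all sharing the same string IDs.
const locales: Record<string, Record<string, string>> = {
  en: { startBroadcast: "Start broadcast" },
  es: { startBroadcast: "Iniciar transmisión" },
};

// Look up a key in the requested locale, falling back to English,
// then to the key itself so missing strings are visible, not blank.
function tr(locale: string, key: string): string {
  return locales[locale]?.[key] ?? locales.en[key] ?? key;
}
```

Versioning these JSON files in git means translation changes show up in code review like any other diff.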
Step 3: Stream audio and translate in real time
For live audio, I used WebSockets and a Node server to keep latency low. The server:
- Receives audio chunks from the speaker
- Runs speech‑to‑text (Deepgram STT)
- Translates text (Lingo.dev SDK)
- Generates speech (Deepgram TTS)
- Streams the audio to listeners
Diagram (simple + memorable)
Browser → WS Server → STT → Translate → TTS → WS → Browser
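Since audio and captions share one socket, each hop needs a small framing scheme so the client knows what it received. The envelope below is an illustrative protocol, not PolyDub's actual wire format:

```typescript
// A minimal framing scheme for the WebSocket hop: every message is a
// JSON envelope so audio and captions can share one socket.
// Field names are illustrative.
interface Envelope {
  type: "audio" | "caption";
  lang: string;    // language code of the payload
  payload: string; // base64-encoded audio, or caption text
}

function encode(msg: Envelope): string {
  return JSON.stringify(msg);
}

function decode(raw: string): Envelope {
  const msg = JSON.parse(raw);
  if (msg.type !== "audio" && msg.type !== "caption") {
    throw new Error(`unknown message type: ${msg.type}`);
  }
  return msg as Envelope;
}
```

Validating the `type` field on decode keeps a malformed or stale client from silently corrupting the stream.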
Step 4: Keep it human‑sounding
Synthetic voices can feel robotic, so I used Deepgram Aura voices for more natural delivery. This makes a huge difference for engagement.
Tip: Let users pick voices per language. It adds personality and makes the app feel premium.
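Per-language voice selection with a user override can be as simple as a map plus a fallback chain. The voice IDs below are placeholders, not real entries from Deepgram's catalogue:

```typescript
// Per-language default voices with a user override.
// Voice IDs are placeholders, not real Deepgram voice names.
const defaultVoices: Record<string, string> = {
  en: "warm-en-voice",
  es: "warm-es-voice",
};

// User choice wins; otherwise the language default; otherwise English.
function pickVoice(lang: string, userChoice?: string): string {
  return userChoice ?? defaultVoices[lang] ?? defaultVoices.en;
}
```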
Step 5: Add transcripts for trust
People trust systems more when they can see what they’re hearing. I show:
- Source transcript (what was said)
- Target transcript (what was translated)
This doubles as an accessibility feature and a debugging aid during live sessions.
The architecture
- Next.js frontend for the UI
- WebSocket server for streaming audio
- Deepgram for STT + TTS
- Lingo.dev for translations + UI i18n
What you can copy into your own app
You don’t need to build a full dubbing platform. You can still:
- Add multilingual UI in one build step
- Show real‑time translated captions
- Offer a translated audio track for live events
Wrap‑up
The goal isn’t to impress with AI — it’s to remove friction. When people can understand you instantly, you unlock a much bigger audience.