TL;DR: I built PolyDub, a real‑time multilingual video dubbing app. This post shows the practical pieces that make it work: UX, WebSockets, streaming STT → translate → TTS, and automated UI i18n. You can reuse the same approach for any React app.
The use case (why this matters)
Ever tried hosting a live webinar for a global audience? You’re speaking English, half your audience prefers Spanish, and everyone else is quietly struggling.
PolyDub turns that into: speak once, listeners hear you in their language. It works for:
- Live broadcasts (one‑to‑many)
- Multilingual meetings (many‑to‑many)
- Demos, classes, and community events
The core idea
At a high level, PolyDub is just a fast loop:
Audio In → Speech‑to‑Text → Translate → Text‑to‑Speech → Audio Out
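The loop above can be sketched as three composable stages. The function names and types here are illustrative stand-ins, not real Deepgram or Lingo.dev SDK calls:

```typescript
// Illustrative types for the dubbing loop. The three stage functions
// are hypothetical stand-ins for real STT / translation / TTS calls.
type AudioChunk = Uint8Array;

interface DubbingStages {
  speechToText: (chunk: AudioChunk) => string;          // e.g. Deepgram STT
  translate: (text: string, target: string) => string;  // e.g. Lingo.dev
  textToSpeech: (text: string) => AudioChunk;           // e.g. Deepgram TTS
}

// One pass through the loop: audio in → audio out in the target language.
function dubChunk(chunk: AudioChunk, target: string, s: DubbingStages): AudioChunk {
  const transcript = s.speechToText(chunk);
  const translated = s.translate(transcript, target);
  return s.textToSpeech(translated);
}
```

Because each stage is just a function of the previous stage's output, you can swap providers or add steps (profanity filtering, glossary lookup) without touching the loop itself.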
Step 1: Build a UI that’s ready for multiple languages
I started with a Next.js app and a UI built from small components (buttons, selects, transcript panels). The key is keeping copy centralized so it can be extracted and translated.
Tip: Avoid hard‑coding strings deep in components. Make them easy to collect.
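One minimal way to keep copy collectible is a single string table with a typed lookup. The keys and strings below are examples, not PolyDub's actual copy:

```typescript
// A minimal pattern for keeping UI copy in one place instead of
// hard-coding strings inside components. Keys and strings are examples.
const copy = {
  startBroadcast: "Start broadcast",
  pickLanguage: "Pick your language",
} as const;

type CopyKey = keyof typeof copy;

// Components ask for copy by key, so an i18n tool can later swap the
// whole table per locale without touching component code.
function t(key: CopyKey): string {
  return copy[key];
}
```

With this shape, `<button>{t("startBroadcast")}</button>` is trivially extractable, and the type system catches typo'd keys at compile time.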
Step 2: Automate UI translation
Manually translating UI strings is a time sink. I used Lingo.dev to automate extraction and generation of locale files.
What that gets you:
- Automatic string extraction from React components
- Versionable JSON locale files
- One build step to update all languages
Example flow
- Write UI in English
- Run build
- Locale files are generated
- UI is instantly multilingual
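The generated locale files can then be loaded as plain JSON tables keyed by the same string IDs. This sketch assumes a flat key structure and an English fallback; the exact file shape Lingo.dev emits may differ:

```typescript
// Hypothetical shape of generated locale files: one flat JSON table
// per language, all sharing the same string IDs.
const locales: Record<string, Record<string, string>> = {
  en: { startBroadcast: "Start broadcast" },
  es: { startBroadcast: "Iniciar transmisión" },
};

// Look up a key in the requested locale, falling back to English,
// then to the key itself so missing strings are visible, not blank.
function tr(locale: string, key: string): string {
  return locales[locale]?.[key] ?? locales.en[key] ?? key;
}
```

Versioning these JSON files in git means translation changes show up in code review like any other diff.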
Step 3: Stream audio and translate in real time
For live audio, I used WebSockets and a Node server to keep latency low. The server:
- Receives audio chunks from the speaker
- Runs speech‑to‑text (Deepgram STT)
- Translates text (Lingo.dev SDK)
- Generates speech (Deepgram TTS)
- Streams the audio to listeners
Diagram (simple + memorable)
Browser → WS Server → STT → Translate → TTS → WS → Browser
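Since audio and captions share one socket, each hop needs a small framing scheme so the client knows what it received. The envelope below is an illustrative protocol, not PolyDub's actual wire format:

```typescript
// A minimal framing scheme for the WebSocket hop: every message is a
// JSON envelope so audio and captions can share one socket.
// Field names are illustrative.
interface Envelope {
  type: "audio" | "caption";
  lang: string;    // language code of the payload
  payload: string; // base64-encoded audio, or caption text
}

function encode(msg: Envelope): string {
  return JSON.stringify(msg);
}

function decode(raw: string): Envelope {
  const msg = JSON.parse(raw);
  if (msg.type !== "audio" && msg.type !== "caption") {
    throw new Error(`unknown message type: ${msg.type}`);
  }
  return msg as Envelope;
}
```

Validating the `type` field on decode keeps a malformed or stale client from silently corrupting the stream.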
Step 4: Keep it human‑sounding
Synthetic voices can feel robotic, so I used Deepgram Aura voices for more natural delivery. This makes a huge difference for engagement.
Tip: Let users pick voices per language. It adds personality and makes the app feel premium.
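Per-language voice selection with a user override can be as simple as a map plus a fallback chain. The voice IDs below are placeholders, not real entries from Deepgram's catalogue:

```typescript
// Per-language default voices with a user override.
// Voice IDs are placeholders, not real Deepgram voice names.
const defaultVoices: Record<string, string> = {
  en: "warm-en-voice",
  es: "warm-es-voice",
};

// User choice wins; otherwise the language default; otherwise English.
function pickVoice(lang: string, userChoice?: string): string {
  return userChoice ?? defaultVoices[lang] ?? defaultVoices.en;
}
```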
Step 5: Add transcripts for trust
People trust systems more when they can see what they’re hearing. I show:
- Source transcript (what was said)
- Target transcript (what was translated)
This doubles as an accessibility feature and a debugging aid during live sessions.
The architecture
- Next.js frontend for the UI
- WebSocket server for streaming audio
- Deepgram for STT + TTS
- Lingo.dev for translations + UI i18n
What you can copy into your own app
You don’t need to build a full dubbing platform. You can still:
- Add multilingual UI in one build step
- Show real‑time translated captions
- Offer a translated audio track for live events
Wrap‑up
The goal isn’t to impress with AI — it’s to remove friction. When people can understand you instantly, you unlock a much bigger audience.