This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
I created LinguaBridge, a real-time bidirectional voice translation app. It uses AssemblyAI's Universal-Streaming API for speech-to-text (STT), Google Gemini for instant translation, and Cartesia's high-performance text-to-speech (TTS) to deliver ultra-low-latency translations, targeting sub-500ms round-trip latency.
With LinguaBridge, conversations across language barriers become natural and effortless, ideal for real-time interactions in professional, personal, and educational contexts.
This submission addresses the Real-Time Performance Voice Agent prompt with:
- Ultra-Low Latency: Sub-500ms round-trip voice translation latency.
- Streaming Speech Recognition: Instantaneous processing of spoken input using AssemblyAI's Universal-Streaming API.
- Immediate Translation: Real-time language translation via Google's Gemini Flash model.
- Natural Voice Output: Instant text-to-speech synthesis powered by Cartesia TTS, ensuring natural conversational flow.
- Multi-language Support: Seamless bidirectional translation across 12 languages, including English, Spanish, French, German, Chinese, and Arabic.
Core Problem Addressed
Effective communication across language barriers remains challenging in professional, educational, and personal contexts. LinguaBridge solves this by providing immediate, natural, and seamless voice translation, enabling effortless multilingual conversations in real-time.
Demo
Check out LinguaBridge live here:
LinguaBridge
Real-time cross-language voice translation with ultra-low latency.
Overview
LinguaBridge is a browser-based voice app that performs live bidirectional speech translation. Users select two languages (Speaker A and Speaker B). When a speaker talks, the app:
- Transcribes speech with AssemblyAI's Universal-Streaming STT
- Sends partial transcripts to Google Gemini 2.5 Flash for fast translation
- Streams the translated output through Cartesia Sonic 2 or Sonic Turbo for ultra-fast TTS playback in the listener's language
All interactions are streamed with sub-300ms latency to enable fluid cross-language voice conversations.
Setup
1. Environment Variables
Create a .env.local file in the root directory with the following variables:
```
# AssemblyAI API Key
# Get your API key from https://www.assemblyai.com/app/account
ASSEMBLYAI_API_KEY=your_assemblyai_key

# Google Gemini API Key
# Get your API key from https://aistudio.google.com/app/apikey
GEMINI_API_KEY=your_gemini_key

# Cartesia API Key
# Get your API key from https://cartesia.ai
CARTESIA_API_KEY=your_cartesia_key
```
2. Install Dependencies
```bash
npm install
```
Running the Application
LinguaBridge requires two processes…
Demo Video
Screenshots
- Landing Area
- Language Selection
- Voice Selection
- Live Transcription Area
Technical Implementation & AssemblyAI Integration
1. Real-Time Speech Processing with AssemblyAI WebSocket API
LinguaBridge leverages AssemblyAI's WebSocket API for real-time speech-to-text transcription, ensuring ultra-low latency:
```typescript
// lib/services/assemblyai-streaming.ts
export class AssemblyAIStreamingService {
  private ws: WebSocket | null = null;
  private currentLanguage = '';
  private onPartialCallback: ((text: string) => void) | null = null;

  async connect(
    language: string,
    onPartialTranscript: (text: string) => void,
  ): Promise<void> {
    this.onPartialCallback = onPartialTranscript;
    this.currentLanguage = language;
    this.disconnect(); // ensure a clean connection
    await this.createWebSocketConnection(); // connect to the AssemblyAI WebSocket proxy
  }
}
```
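The connection and message handling are elided above. A minimal sketch of what createWebSocketConnection could look like, assuming the browser connects to the local proxy described in the next section (the port is illustrative) and that the proxy forwards AssemblyAI's Universal-Streaming "Turn" messages unchanged:

```typescript
// Hypothetical sketch of the elided method, inside AssemblyAIStreamingService
private async createWebSocketConnection(): Promise<void> {
  // The proxy URL and port are illustrative; the real app would read them from config
  this.ws = new WebSocket(`ws://${window.location.hostname}:8787`);

  this.ws.onmessage = (event) => {
    const message = JSON.parse(event.data);
    // Universal-Streaming sends "Turn" messages containing the running transcript
    if (message.type === 'Turn' && message.transcript) {
      this.onPartialCallback?.(message.transcript);
    }
  };

  // Resolve once the socket is open so callers can start sending audio
  await new Promise<void>((resolve, reject) => {
    this.ws!.onopen = () => resolve();
    this.ws!.onerror = (err) => reject(err);
  });
}
```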
This implementation provides:
- Real-time transcription with partial results.
- Automatic reconnection handling.
- Language-specific configurations.
- Robust error handling.
2. Secure WebSocket Proxy for AssemblyAI Communication
To securely manage API keys and optimize performance, LinguaBridge implements a custom WebSocket proxy:
```javascript
// server.js
const { WebSocketServer, WebSocket } = require('ws');

// WebSocket server setup (the port here is illustrative)
const wss = new WebSocketServer({ port: 8787 });

wss.on('connection', (client) => {
  const upstream = new WebSocket(
    'wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&format_turns=true',
    { headers: { authorization: process.env.ASSEMBLYAI_API_KEY } }
  );

  // Forward messages from AssemblyAI to the browser client
  upstream.on('message', (data) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data.toString());
    }
  });

  // Buffer audio from the client until the upstream connection is ready
  const pending = [];
  upstream.on('open', () => {
    while (pending.length) upstream.send(pending.shift());
  });
  client.on('message', (data) => {
    if (upstream.readyState === WebSocket.OPEN) {
      upstream.send(data);
    } else {
      pending.push(data);
    }
  });
});
```
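One thing the snippet above leaves out is teardown. A minimal sketch, assuming it sits inside the same wss.on('connection', ...) handler so that client and upstream refer to the pair created there:

```javascript
// Close the paired socket when either side disconnects (inside the connection handler)
client.on('close', () => upstream.close());
upstream.on('close', () => {
  if (client.readyState === WebSocket.OPEN) {
    client.close();
  }
});
```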
This proxy ensures:
- Secure server-side API key management.
- Audio buffering during connection setup.
- Reliable and scalable communication.
3. Optimized Audio Capture with Web Audio API
LinguaBridge uses the Web Audio API and AudioWorklet for high-quality audio processing optimized for speech recognition:
```typescript
// lib/services/audio-processor.ts
export class AudioProcessor {
  private audioContext: AudioContext | null = null;

  async startCapture(onAudioData: (data: ArrayBuffer) => void): Promise<void> {
    const audioContext = await this.initializeAudioContext();
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true }
    });

    const source = audioContext.createMediaStreamSource(stream);
    await audioContext.audioWorklet.addModule('/audio-processor.js');
    const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');

    workletNode.port.onmessage = (event) => {
      if (event.data.audioData) {
        onAudioData(event.data.audioData);
      }
    };

    source.connect(workletNode);
  }
}
```
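The AudioWorklet module loaded from /audio-processor.js is not shown above. A minimal sketch of what it might contain, assuming the AudioContext is created at 16 kHz so no resampling is needed (the class name is illustrative; only the registered name 'audio-processor' has to match the snippet above):

```javascript
// public/audio-processor.js (illustrative minimal version)
class PcmCaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0];
    if (channel) {
      // Convert Float32 samples in [-1, 1] to 16-bit PCM for the STT stream
      const pcm = new Int16Array(channel.length);
      for (let i = 0; i < channel.length; i++) {
        const s = Math.max(-1, Math.min(1, channel[i]));
        pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      // Transfer the buffer to the main thread, where startCapture forwards it
      this.port.postMessage({ audioData: pcm.buffer }, [pcm.buffer]);
    }
    return true; // keep the processor alive
  }
}

registerProcessor('audio-processor', PcmCaptureProcessor);
```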
This ensures:
- Efficient audio capture at 16kHz mono.
- Built-in noise suppression and echo cancellation.
- Cross-browser compatibility.
4. Real-Time Translation via Google Gemini
LinguaBridge integrates Google's Gemini 2.5 Flash for ultra-fast translations:
```typescript
// lib/services/gemini-translation.ts
export class GeminiTranslationService {
  private model: any = null;

  constructor(apiKey: string) {
    const genAI = new GoogleGenerativeAI(apiKey);
    this.model = genAI.getGenerativeModel({
      model: 'gemini-2.5-flash',
      generationConfig: { temperature: 0.1, maxOutputTokens: 1024 }
    });
  }

  async translateStream(text: string, sourceLang: string, targetLang: string): Promise<string> {
    // optimized translation logic
  }
}
```
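The translation logic itself is elided above. A hedged sketch of what it could look like with the @google/generative-ai SDK, using a simple in-memory cache (the cache field and prompt wording are assumptions, not the project's exact logic):

```typescript
// Hypothetical sketch of the elided method; it would live inside GeminiTranslationService
private translationCache = new Map<string, string>();

async translateStream(text: string, sourceLang: string, targetLang: string): Promise<string> {
  const key = `${sourceLang}:${targetLang}:${text}`;
  const cached = this.translationCache.get(key);
  if (cached) return cached;

  // Keep the prompt short and deterministic so the model returns only the translation
  const prompt =
    `Translate the following text from ${sourceLang} to ${targetLang}. ` +
    `Return only the translation.\n\n${text}`;

  const result = await this.model.generateContent(prompt);
  const translation = result.response.text().trim();

  this.translationCache.set(key, translation);
  return translation;
}
```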
This translation approach provides:
- Sub-150ms translation latency.
- Intelligent caching and optimized prompts.
- Robust error handling.
5. High-Speed Text-to-Speech with Cartesia
Cartesia TTS is integrated for fast, natural speech synthesis:
```typescript
// lib/services/cartesia-tts.ts
export class CartesiaTTSService {
  private ws: WebSocket | null = null;

  async streamText(text: string, voiceId: string): Promise<void> {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) {
      await this.connectWebSocket(voiceId);
    }
    this.ws!.send(JSON.stringify({
      model_id: 'sonic-turbo',
      voice: { mode: 'id', id: voiceId },
      transcript: text,
      output_format: { encoding: 'pcm_s16le', sample_rate: 16000 }
    }));
  }
}
```
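Not shown above is how the PCM audio coming back from Cartesia gets played. One way to do it with the Web Audio API, assuming 16 kHz mono pcm_s16le chunks as requested in the output_format (this PcmPlayer class is an illustrative sketch, not the project's actual player):

```typescript
// Illustrative playback helper for raw 16-bit PCM chunks
export class PcmPlayer {
  private ctx = new AudioContext({ sampleRate: 16000 });
  private nextStartTime = 0;

  playChunk(chunk: ArrayBuffer): void {
    // Convert 16-bit PCM to Float32 samples for the Web Audio API
    const int16 = new Int16Array(chunk);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 0x8000;
    }

    const buffer = this.ctx.createBuffer(1, float32.length, 16000);
    buffer.copyToChannel(float32, 0);

    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);

    // Schedule chunks back-to-back so playback stays gapless
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    source.start(startAt);
    this.nextStartTime = startAt + buffer.duration;
  }
}
```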
This implementation ensures:
- Low-latency speech synthesis.
- Seamless audio playback.
- Multiple voice and language support.
6. End-to-End Real-Time Translation Pipeline
LinguaBridge orchestrates all services into a seamless, real-time pipeline:
```typescript
// hooks/use-translation.ts
export const useTranslation = () => {
  const startTranslation = async (sourceLang: string, targetLang: string, voice: string) => {
    // Connect STT first so the stream is ready before audio starts flowing
    await assemblyAI.connect(sourceLang, (transcript) => {
      gemini
        .translateStream(transcript, sourceLang, targetLang)
        .then((translation) => cartesia.streamText(translation, voice));
    });

    // Then capture microphone audio and forward it to AssemblyAI
    await audioProcessor.startCapture((audioData) => {
      assemblyAI.sendAudioData(audioData);
    });
  };

  return { startTranslation };
};
```
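A matching stop routine is not shown in the hook. A hypothetical sketch, assuming cleanup methods exist on each service (only disconnect on the AssemblyAI service appears earlier; the others are assumptions):

```typescript
// Hypothetical counterpart to startTranslation (service method names are assumed)
const stopTranslation = async () => {
  await audioProcessor.stopCapture?.(); // assumed: releases the microphone
  assemblyAI.disconnect();              // shown earlier on AssemblyAIStreamingService
  cartesia.disconnect?.();              // assumed cleanup on the TTS service
};
```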
This coordination ensures:
- Seamless real-time interaction.
- Efficient resource management.
- Dynamic error handling.
Integration Architecture
Complete LinguaBridge workflow:
Audio Capture → WebSocket Proxy → AssemblyAI STT → Gemini Translation → Cartesia TTS → Audio Playback
This architecture achieves:
- Sub-500ms end-to-end latency.
- High-performance scalability.
- Robust and maintainable integration.
7. Multi-Language Real-Time Processing
LinguaBridge supports 12 languages with streamlined switching:
```typescript
// app/page.tsx
const SUPPORTED_LANGUAGES = [
  { code: 'en', name: 'English' },
  { code: 'es', name: 'Spanish' },
  { code: 'fr', name: 'French' },
  { code: 'de', name: 'German' },
  { code: 'it', name: 'Italian' },
  { code: 'pt', name: 'Portuguese' },
  { code: 'ru', name: 'Russian' },
  { code: 'ja', name: 'Japanese' },
  { code: 'ko', name: 'Korean' },
  { code: 'zh', name: 'Chinese (Mandarin)' },
  { code: 'ar', name: 'Arabic' },
  { code: 'hi', name: 'Hindi' },
];
```
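Given this list, resolving a display name from a language code (for the UI and for building translation prompts) is straightforward; a small helper sketch (the function name is illustrative):

```typescript
// Illustrative helper built on the SUPPORTED_LANGUAGES list above
const getLanguageName = (code: string): string =>
  SUPPORTED_LANGUAGES.find((lang) => lang.code === code)?.name ?? code;

// Example: getLanguageName('zh') returns 'Chinese (Mandarin)'
```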
This ensures:
- Bidirectional multilingual translation.
- Dynamic language selection.
- Integrated voice profile management.