This is a submission for the AssemblyAI Voice Agents Challenge
What I Built
I created LinguaBridge, a real-time bidirectional voice translation app. It uses AssemblyAI's Universal-Streaming API for speech-to-text (STT), Google Gemini for instant translation, and Cartesia's high-performance text-to-speech (TTS) to deliver ultra-low-latency translations, targeting sub-500ms round-trip latency.
With LinguaBridge, conversations across language barriers become natural and effortless, ideal for real-time interactions in professional, personal, and educational contexts.
This submission addresses the Real-Time Performance Voice Agent prompt with:
- Ultra-Low Latency: Sub-500ms round-trip voice translation latency.
- Streaming Speech Recognition: Instantaneous processing of spoken input using AssemblyAI's Universal-Streaming API.
- Immediate Translation: Real-time language translation via Google's Gemini Flash model.
- Natural Voice Output: Instant text-to-speech synthesis powered by Cartesia TTS, ensuring natural conversational flow.
- Multi-language Support: Seamless bidirectional translation across 12 languages, including English, Spanish, French, German, Chinese, and Arabic.
Core Problem Addressed
Effective communication across language barriers remains challenging in professional, educational, and personal contexts. LinguaBridge solves this by providing immediate, natural, and seamless voice translation, enabling effortless multilingual conversations in real-time.
Demo
Check out LinguaBridge live here:
LinguaBridge
Real-time cross-language voice translation with ultra-low latency.
Overview
LinguaBridge is a browser-based voice app that performs live bidirectional speech translation. Users select two languages (Speaker A and Speaker B). When a speaker talks, the app:
- Transcribes speech with AssemblyAI's Universal-Streaming STT
- Sends partial transcripts to Google Gemini 2.5 Flash for fast translation
- Streams the translated output through Cartesia Sonic 2 or Sonic Turbo for ultra-fast TTS playback in the listener's language
All interactions are streamed with sub-300ms latency to enable fluid cross-language voice conversations.
Setup
1. Environment Variables
Create a .env.local file in the root directory with the following variables:
```
# AssemblyAI API Key
# Get your API key from https://www.assemblyai.com/app/account
ASSEMBLYAI_API_KEY=your_assemblyai_key

# Google Gemini API Key
# Get your API key from https://aistudio.google.com/app/apikey
GEMINI_API_KEY=your_gemini_key

# Cartesia API Key
# Get your API key from https://cartesia.ai
CARTESIA_API_KEY=your_cartesia_key
```
2. Install Dependencies
```bash
npm install
```
Running the Application
LinguaBridge requires two processes…
Demo Video
Screenshots
- Landing Area
- Language Selection
- Voice Selection
- Live Transcription Area
Technical Implementation & AssemblyAI Integration
1. Real-Time Speech Processing with AssemblyAI WebSocket API
LinguaBridge leverages AssemblyAI's WebSocket API for real-time speech-to-text transcription, ensuring ultra-low latency:
```typescript
// lib/services/assemblyai-streaming.ts
export class AssemblyAIStreamingService {
  private ws: WebSocket | null = null;
  private currentLanguage = '';
  private onPartialCallback: ((text: string) => void) | null = null;

  async connect(
    language: string,
    onPartialTranscript: (text: string) => void,
  ): Promise<void> {
    this.onPartialCallback = onPartialTranscript;
    this.currentLanguage = language;
    this.disconnect(); // ensure a clean connection
    await this.createWebSocketConnection(); // connect to the AssemblyAI WebSocket proxy
  }
}
```
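The connection and message handling are elided above. A minimal sketch of what createWebSocketConnection could look like, assuming the browser connects to the local proxy described in the next section (the port is illustrative) and that the proxy forwards AssemblyAI's Universal-Streaming "Turn" messages unchanged:

```typescript
// Hypothetical sketch of the elided method, inside AssemblyAIStreamingService
private async createWebSocketConnection(): Promise<void> {
  // The proxy URL and port are illustrative; the real app would read them from config
  this.ws = new WebSocket(`ws://${window.location.hostname}:8787`);

  this.ws.onmessage = (event) => {
    const message = JSON.parse(event.data);
    // Universal-Streaming sends "Turn" messages containing the running transcript
    if (message.type === 'Turn' && message.transcript) {
      this.onPartialCallback?.(message.transcript);
    }
  };

  // Resolve once the socket is open so callers can start sending audio
  await new Promise<void>((resolve, reject) => {
    this.ws!.onopen = () => resolve();
    this.ws!.onerror = (err) => reject(err);
  });
}
```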
This implementation provides:
- Real-time transcription with partial results.
- Automatic reconnection handling.
- Language-specific configurations.
- Robust error handling.
2. Secure WebSocket Proxy for AssemblyAI Communication
To securely manage API keys and optimize performance, LinguaBridge implements a custom WebSocket proxy:
```javascript
// server.js
const { WebSocketServer, WebSocket } = require('ws');

// WebSocket server setup (the port here is illustrative)
const wss = new WebSocketServer({ port: 8787 });

wss.on('connection', (client) => {
  const upstream = new WebSocket(
    'wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&format_turns=true',
    { headers: { authorization: process.env.ASSEMBLYAI_API_KEY } }
  );

  // Forward messages from AssemblyAI to the browser client
  upstream.on('message', (data) => {
    if (client.readyState === WebSocket.OPEN) {
      client.send(data.toString());
    }
  });

  // Buffer audio from the client until the upstream connection is ready
  const pending = [];
  upstream.on('open', () => {
    while (pending.length) upstream.send(pending.shift());
  });
  client.on('message', (data) => {
    if (upstream.readyState === WebSocket.OPEN) {
      upstream.send(data);
    } else {
      pending.push(data);
    }
  });
});
```
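One thing the snippet above leaves out is teardown. A minimal sketch, assuming it sits inside the same wss.on('connection', ...) handler so that client and upstream refer to the pair created there:

```javascript
// Close the paired socket when either side disconnects (inside the connection handler)
client.on('close', () => upstream.close());
upstream.on('close', () => {
  if (client.readyState === WebSocket.OPEN) {
    client.close();
  }
});
```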
This proxy ensures:
- Secure server-side API key management.
- Audio buffering during connection setup.
- Reliable and scalable communication.
3. Optimized Audio Capture with Web Audio API
LinguaBridge uses the Web Audio API and AudioWorklet for high-quality audio processing optimized for speech recognition:
```typescript
// lib/services/audio-processor.ts
export class AudioProcessor {
  private audioContext: AudioContext | null = null;

  async startCapture(onAudioData: (data: ArrayBuffer) => void): Promise<void> {
    const audioContext = await this.initializeAudioContext();
    const stream = await navigator.mediaDevices.getUserMedia({
      audio: { channelCount: 1, echoCancellation: true, noiseSuppression: true }
    });

    const source = audioContext.createMediaStreamSource(stream);
    await audioContext.audioWorklet.addModule('/audio-processor.js');
    const workletNode = new AudioWorkletNode(audioContext, 'audio-processor');

    workletNode.port.onmessage = (event) => {
      if (event.data.audioData) {
        onAudioData(event.data.audioData);
      }
    };

    source.connect(workletNode);
  }
}
```
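The AudioWorklet module loaded from /audio-processor.js is not shown above. A minimal sketch of what it might contain, assuming the AudioContext is created at 16 kHz so no resampling is needed (the class name is illustrative; only the registered name 'audio-processor' has to match the snippet above):

```javascript
// public/audio-processor.js (illustrative minimal version)
class PcmCaptureProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0] && inputs[0][0];
    if (channel) {
      // Convert Float32 samples in [-1, 1] to 16-bit PCM for the STT stream
      const pcm = new Int16Array(channel.length);
      for (let i = 0; i < channel.length; i++) {
        const s = Math.max(-1, Math.min(1, channel[i]));
        pcm[i] = s < 0 ? s * 0x8000 : s * 0x7fff;
      }
      // Transfer the buffer to the main thread, where startCapture forwards it
      this.port.postMessage({ audioData: pcm.buffer }, [pcm.buffer]);
    }
    return true; // keep the processor alive
  }
}

registerProcessor('audio-processor', PcmCaptureProcessor);
```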
This ensures:
- Efficient audio capture at 16kHz mono.
- Built-in noise suppression and echo cancellation.
- Cross-browser compatibility.
4. Real-Time Translation via Google Gemini
LinguaBridge integrates Google's Gemini 2.5 Flash for ultra-fast translations:
```typescript
// lib/services/gemini-translation.ts
export class GeminiTranslationService {
  private model: any = null;

  constructor(apiKey: string) {
    const genAI = new GoogleGenerativeAI(apiKey);
    this.model = genAI.getGenerativeModel({
      model: 'gemini-2.5-flash',
      generationConfig: { temperature: 0.1, maxOutputTokens: 1024 }
    });
  }

  async translateStream(text: string, sourceLang: string, targetLang: string): Promise<string> {
    // optimized translation logic
  }
}
```
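The translation logic itself is elided above. A hedged sketch of what it could look like with the @google/generative-ai SDK, using a simple in-memory cache (the cache field and prompt wording are assumptions, not the project's exact logic):

```typescript
// Hypothetical sketch of the elided method; it would live inside GeminiTranslationService
private translationCache = new Map<string, string>();

async translateStream(text: string, sourceLang: string, targetLang: string): Promise<string> {
  const key = `${sourceLang}:${targetLang}:${text}`;
  const cached = this.translationCache.get(key);
  if (cached) return cached;

  // Keep the prompt short and deterministic so the model returns only the translation
  const prompt =
    `Translate the following text from ${sourceLang} to ${targetLang}. ` +
    `Return only the translation.\n\n${text}`;

  const result = await this.model.generateContent(prompt);
  const translation = result.response.text().trim();

  this.translationCache.set(key, translation);
  return translation;
}
```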
This translation approach provides:
- Sub-150ms translation latency.
- Intelligent caching and optimized prompts.
- Robust error handling.
5. High-Speed Text-to-Speech with Cartesia
Cartesia TTS is integrated for fast, natural speech synthesis:
```typescript
// lib/services/cartesia-tts.ts
export class CartesiaTTSService {
  private ws: WebSocket | null = null;

  async streamText(text: string, voiceId: string): Promise<void> {
    if (!this.ws || this.ws.readyState !== WebSocket.OPEN) {
      await this.connectWebSocket(voiceId);
    }
    this.ws!.send(JSON.stringify({
      model_id: 'sonic-turbo',
      voice: { mode: 'id', id: voiceId },
      transcript: text,
      output_format: { encoding: 'pcm_s16le', sample_rate: 16000 }
    }));
  }
}
```
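Not shown above is how the PCM audio coming back from Cartesia gets played. One way to do it with the Web Audio API, assuming 16 kHz mono pcm_s16le chunks as requested in the output_format (this PcmPlayer class is an illustrative sketch, not the project's actual player):

```typescript
// Illustrative playback helper for raw 16-bit PCM chunks
export class PcmPlayer {
  private ctx = new AudioContext({ sampleRate: 16000 });
  private nextStartTime = 0;

  playChunk(chunk: ArrayBuffer): void {
    // Convert 16-bit PCM to Float32 samples for the Web Audio API
    const int16 = new Int16Array(chunk);
    const float32 = new Float32Array(int16.length);
    for (let i = 0; i < int16.length; i++) {
      float32[i] = int16[i] / 0x8000;
    }

    const buffer = this.ctx.createBuffer(1, float32.length, 16000);
    buffer.copyToChannel(float32, 0);

    const source = this.ctx.createBufferSource();
    source.buffer = buffer;
    source.connect(this.ctx.destination);

    // Schedule chunks back-to-back so playback stays gapless
    const startAt = Math.max(this.ctx.currentTime, this.nextStartTime);
    source.start(startAt);
    this.nextStartTime = startAt + buffer.duration;
  }
}
```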
This implementation ensures:
- Low-latency speech synthesis.
- Seamless audio playback.
- Multiple voice and language support.
6. End-to-End Real-Time Translation Pipeline
LinguaBridge orchestrates all services into a seamless, real-time pipeline:
```typescript
// hooks/use-translation.ts
export const useTranslation = () => {
  const startTranslation = async (sourceLang: string, targetLang: string, voice: string) => {
    // Connect STT first so the stream is ready before audio starts flowing
    await assemblyAI.connect(sourceLang, (transcript) => {
      gemini
        .translateStream(transcript, sourceLang, targetLang)
        .then((translation) => cartesia.streamText(translation, voice));
    });

    // Then capture microphone audio and forward it to AssemblyAI
    await audioProcessor.startCapture((audioData) => {
      assemblyAI.sendAudioData(audioData);
    });
  };

  return { startTranslation };
};
```
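A matching stop routine is not shown in the hook. A hypothetical sketch, assuming cleanup methods exist on each service (only disconnect on the AssemblyAI service appears earlier; the others are assumptions):

```typescript
// Hypothetical counterpart to startTranslation (service method names are assumed)
const stopTranslation = async () => {
  await audioProcessor.stopCapture?.(); // assumed: releases the microphone
  assemblyAI.disconnect();              // shown earlier on AssemblyAIStreamingService
  cartesia.disconnect?.();              // assumed cleanup on the TTS service
};
```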
This coordination ensures:
- Seamless real-time interaction.
- Efficient resource management.
- Dynamic error handling.
Integration Architecture
Complete LinguaBridge workflow:
Audio Capture → WebSocket Proxy → AssemblyAI STT → Gemini Translation → Cartesia TTS → Audio Playback
This architecture achieves:
- Sub-500ms end-to-end latency.
- High-performance scalability.
- Robust and maintainable integration.
7. Multi-Language Real-Time Processing
LinguaBridge supports 12 languages with streamlined switching:
```typescript
// app/page.tsx
const SUPPORTED_LANGUAGES = [
  { code: 'en', name: 'English' },
  { code: 'es', name: 'Spanish' },
  { code: 'fr', name: 'French' },
  { code: 'de', name: 'German' },
  { code: 'it', name: 'Italian' },
  { code: 'pt', name: 'Portuguese' },
  { code: 'ru', name: 'Russian' },
  { code: 'ja', name: 'Japanese' },
  { code: 'ko', name: 'Korean' },
  { code: 'zh', name: 'Chinese (Mandarin)' },
  { code: 'ar', name: 'Arabic' },
  { code: 'hi', name: 'Hindi' },
];
```
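Given this list, resolving a display name from a language code (for the UI and for building translation prompts) is straightforward; a small helper sketch (the function name is illustrative):

```typescript
// Illustrative helper built on the SUPPORTED_LANGUAGES list above
const getLanguageName = (code: string): string =>
  SUPPORTED_LANGUAGES.find((lang) => lang.code === code)?.name ?? code;

// Example: getLanguageName('zh') returns 'Chinese (Mandarin)'
```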
This ensures:
- Bidirectional multilingual translation.
- Dynamic language selection.
- Integrated voice profile management.