DEV Community

Programming Central

Posted on • Originally published at programmingcentral.hashnode.dev

Build Real-Time Voice Chat with WebSockets, LLMs, and Web Audio API

Forget clunky voice delays! This guide dives deep into building a real-time voice-to-voice communication system directly in the browser, leveraging the power of WebSockets, local Large Language Models (LLMs) like Ollama, and the Web Audio API. We’ll explore the technical challenges of low-latency audio streaming and provide a practical code example to get you started. Imagine building a conversational AI assistant that feels natural, or a collaborative voice editor with instant feedback – that’s the power of this approach.

The Challenge of Real-Time Voice Communication

Traditional web development often relies on request-response cycles, but voice communication demands something different: continuous, low-latency data flow. In human conversation, a response delay of roughly 200-500 milliseconds feels natural; exceeding one second creates a jarring, robotic experience. The core problem isn't just sending audio; it's managing a constant stream of audio data, processing it quickly, and returning a response with minimal delay. This is where standard HTTP requests fall short: each request introduces handshake and header overhead, making them unsuitable for the continuous nature of voice interaction.
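To make that constraint concrete, here is a back-of-the-envelope latency tally for one conversational turn. Every number below is an illustrative assumption for a local pipeline, not a measurement:

```typescript
// Illustrative end-to-end latency budget for one voice turn.
// All stage timings are assumptions for a fully local pipeline.
const budgetMs = {
    audioChunking: 200,   // client buffers 200ms of audio before sending
    networkRoundTrip: 5,  // localhost WebSocket, effectively negligible
    sttIncremental: 150,  // streaming STT emits a partial transcript
    llmFirstToken: 300,   // local LLM time-to-first-token
    ttsFirstFrame: 100,   // streaming TTS emits its first audio frame
};

const total = Object.values(budgetMs).reduce((a, b) => a + b, 0);
console.log(`Estimated time to first response audio: ${total}ms`); // 755ms
```

At an estimated 755 ms this sits under the one-second "jarring" threshold, but above the comfortable range, which is why every stage of the pipeline is worth optimizing.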

The Architecture: From Microphone to LLM and Back

Our solution centers around a streaming pipeline. Instead of discrete requests, we treat the conversation as a continuous flow of audio "chunks." Here's a breakdown of the key components:

  • Browser (Client): Captures audio using the Web Audio API, converts it into digital packets, and streams it to the server via WebSockets.
  • WebSocket: Provides a persistent, full-duplex communication channel for low-latency data transfer.
  • Local Backend (Server): Receives audio chunks, transcribes them to text using Speech-to-Text (STT), feeds the text to a local LLM (like those managed by Ollama), receives a text response, converts the text back to speech using Text-to-Speech (TTS), and streams the audio back to the client.
  • WebGPU (Optional): Accelerates audio processing and model inference for improved performance.

Think of it like an assembly line: the microphone is the raw material supplier, the WebSocket is the conveyor belt, the STT/LLM/TTS stack is the processing plant, and the return WebSocket is the delivery route for the finished product.

The Audio Pipeline: Web Audio API and AudioWorklets

The Web Audio API is the foundation for capturing and processing audio in the browser. We use an AudioContext to create an AudioWorklet, a specialized processor that runs on a separate thread, preventing audio processing from blocking the main UI thread. The AudioWorklet slices the incoming audio stream into small buffers (e.g., 1024 or 2048 samples) – these are our "packets."
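AudioWorklets actually receive audio in fixed 128-sample render quanta, so producing a 2048-sample packet means accumulating quanta until a chunk fills. Here is a minimal sketch of that accumulator; the worklet registration and `postMessage` wiring are omitted, and all names are illustrative:

```typescript
// Accumulates fixed-size render quanta (128 samples in the Web Audio API)
// into larger chunks suitable for network transport.
// Assumes chunkSize is a multiple of the quantum length (2048 = 16 * 128).
class ChunkAccumulator {
    private buffer: Float32Array;
    private offset = 0;

    constructor(private chunkSize: number = 2048) {
        this.buffer = new Float32Array(chunkSize);
    }

    // Feed one render quantum; returns a full chunk when ready, else null.
    push(quantum: Float32Array): Float32Array | null {
        this.buffer.set(quantum, this.offset);
        this.offset += quantum.length;
        if (this.offset >= this.chunkSize) {
            const chunk = this.buffer;
            this.buffer = new Float32Array(this.chunkSize);
            this.offset = 0;
            return chunk;
        }
        return null;
    }
}

// Inside an AudioWorkletProcessor's process() callback you would call
// accumulator.push(inputs[0][0]) and postMessage each completed chunk
// back to the main thread for the WebSocket to send.
```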

The WebSocket Bridge: Low-Latency Communication

WebSockets are crucial for bridging the browser's sandboxed environment to the local backend. Unlike HTTP, WebSockets maintain a persistent connection, eliminating the overhead of repeated handshakes. This is especially important when interacting with local LLMs, avoiding the latency of sending data to a remote cloud provider.

Our WebSocket protocol defines how we send audio data:

  • Audio Chunks: Raw binary data or Base64-encoded audio buffers.
  • Metadata: Sample rate, bit depth, and sequence numbers for reliable delivery.
  • Control Signals: Messages indicating the start/end of speech (Voice Activity Detection - VAD).
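One way to model these message types in TypeScript is a discriminated union: binary audio chunks travel as raw WebSocket frames, while metadata and control signals go as JSON. The field names below are illustrative, not a standard:

```typescript
// A sketch of the control-plane protocol as a discriminated union.
// Audio chunks are sent as separate binary frames; these JSON messages
// carry metadata and VAD control signals. All field names are assumptions.
type StreamMessage =
    | { type: 'metadata'; sampleRate: number; bitDepth: number; seq: number }
    | { type: 'speech-start' }   // VAD detected voice onset
    | { type: 'speech-end' }     // VAD detected silence
    | { type: 'error'; reason: string };

function encode(msg: StreamMessage): string {
    return JSON.stringify(msg);
}

function decode(raw: string): StreamMessage {
    return JSON.parse(raw) as StreamMessage;
}
```

Keeping control messages as JSON while audio stays binary avoids Base64 inflation on the hot path, at the cost of the receiver checking each frame's type before parsing.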

Speech-to-Text (STT) and Text-to-Speech (TTS)

Once the audio reaches the backend, it needs to be transcribed into text using STT. Much like context augmentation in RAG systems, the goal is to turn unstructured input into structured text the model can reason over; here the source is a live audio stream rather than retrieved documents. Local STT models like Whisper (run via tools such as whisper.cpp, since Ollama itself focuses on LLMs) or in-browser WASM modules handle this task.

The STT model uses a "sliding window" approach, processing audio chunks incrementally and updating transcriptions as more context becomes available. WebAssembly (WASM) can further reduce latency by running STT models directly in the browser.
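The sliding window itself can be sketched as a rolling buffer that always returns the newest samples plus trailing context from earlier chunks. The sizes below are tiny and illustrative; a real pipeline might keep several seconds of 16 kHz audio:

```typescript
// Keeps a rolling window of the most recent audio samples so each STT
// pass sees new audio plus trailing context from earlier chunks.
// windowSize is an assumption (e.g. 5s * 16000 samples in a real app).
class SlidingWindow {
    private history: number[] = [];

    constructor(private windowSize: number) {}

    // Append a chunk and return the current window to feed to the STT model.
    push(chunk: number[]): number[] {
        this.history = this.history.concat(chunk);
        if (this.history.length > this.windowSize) {
            // Drop the oldest samples, keeping only the last windowSize.
            this.history = this.history.slice(this.history.length - this.windowSize);
        }
        return this.history;
    }
}

// A real pipeline would re-transcribe the window on each push and merge
// the updated partial transcript with what was emitted previously.
```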

After the LLM generates a text response, Text-to-Speech (TTS) converts it back into audio. Streaming TTS, like VITS or FastSpeech, generates audio in small frames, sending them back to the client as soon as they're available.
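Frame-by-frame delivery can be sketched as slicing the synthesized buffer into small frames and sending each one as soon as it exists. The 20 ms frame size is an assumption:

```typescript
// Splits a synthesized audio buffer into small frames so playback can
// start before synthesis finishes. A 20ms frame at 16kHz is 320 samples.
function* frames(audio: Int16Array, frameSize = 320): Generator<Int16Array> {
    for (let i = 0; i < audio.length; i += frameSize) {
        // subarray creates a view, so no samples are copied here.
        yield audio.subarray(i, Math.min(i + frameSize, audio.length));
    }
}

// In the server loop, each frame would be sent as it is produced:
//   for (const f of frames(ttsOutput)) {
//       ws.send(Buffer.from(f.buffer, f.byteOffset, f.byteLength));
//   }
```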

Performance Optimization: WebGPU for Acceleration

For optimal performance, consider leveraging WebGPU. This modern graphics and compute API allows for parallel processing on the GPU, accelerating:

  • Audio Feature Extraction: Converting raw audio waves into spectrograms using FFT shaders.
  • Model Inference: Running STT or TTS models on the GPU.

Code Example: A Minimal Voice-to-Voice Loop

Let's build a "Hello World" example to demonstrate the core concepts. This application captures microphone audio, streams it to a local Node.js server, simulates processing, and plays back a synthetic audio response.

1. Server Code (server.ts)

// server.ts
import { WebSocketServer } from 'ws';
import * as http from 'http';

/**
 * Configuration for the audio stream.
 * 16kHz, 16-bit mono is standard for WebRTC/STT pipelines.
 */
const SAMPLE_RATE = 16000;
const CHANNELS = 1;

const server = http.createServer();
const wss = new WebSocketServer({ server });

console.log('Starting Voice-to-Voice WebSocket Server on port 8080...');

wss.on('connection', (ws) => {
    console.log('Client connected');

    ws.on('message', async (data: Buffer) => {
        // 1. RECEIVE AUDIO
        // In a real app, we would pipe this data to an STT model (e.g., Whisper).
        // Here, we simulate processing latency.
        console.log(`Received audio chunk: ${data.length} bytes`);

        // Simulate network jitter and model inference time (e.g., 150ms)
        await new Promise(resolve => setTimeout(resolve, 150));

        // 2. GENERATE RESPONSE AUDIO
        // Create a synthetic audio buffer (a simple sine wave for demonstration).
        // Duration: 1 second.
        const duration = 1;
        const numSamples = SAMPLE_RATE * duration;
        const audioBuffer = new Float32Array(numSamples);

        // Generate a 440Hz tone (A4 note)
        const frequency = 440;
        for (let i = 0; i < numSamples; i++) {
            const t = i / SAMPLE_RATE;
            audioBuffer[i] = Math.sin(2 * Math.PI * frequency * t) * 0.5; // 50% volume
        }

        // Convert Float32Array to Buffer (Int16 PCM for standard compatibility)
        const int16Buffer = new Int16Array(audioBuffer.length);
        for (let i = 0; i < audioBuffer.length; i++) {
            int16Buffer[i] = Math.max(-1, Math.min(1, audioBuffer[i])) * 0x7FFF;
        }

        const responseBuffer = Buffer.from(int16Buffer.buffer);

        // 3. SEND AUDIO BACK
        ws.send(responseBuffer);
        console.log(`Sent audio response: ${responseBuffer.length} bytes`);
    });

    ws.on('close', () => {
        console.log('Client disconnected');
    });
});

server.listen(8080);

2. Client Code (client.ts)

// client.ts

/**
 * Main application class handling the voice loop.
 */
class VoiceChatClient {
    private ws: WebSocket | null = null;
    private audioContext: AudioContext | null = null;
    private mediaRecorder: MediaRecorder | null = null;
    private audioQueue: Float32Array[] = [];
    private isPlaying: boolean = false;

    // Audio configuration
    private readonly WS_URL = 'ws://localhost:8080';
    private readonly CHUNK_SIZE_MS = 200; // Send audio every 200ms

    /**
     * Initializes the WebSocket connection and Audio Context.
     */
    public async start() {
        console.log('Initializing Voice Chat Client...');

        // 1. Setup WebSocket
        this.ws = new WebSocket(this.WS_URL);
        this.ws.binaryType = 'arraybuffer'; // Expect binary data

        this.ws.onopen = () => {
            console.log('WebSocket Connected');
            this.startMicrophone();
        };

        this.ws.onmessage = (event) => {
            // 2. Handle Incoming Audio
            this.handleIncomingAudio(event.data);
        };

        this.ws.onerror = (err) => console.error('WebSocket Error:', err);
    }

    /**
     * Captures audio from the user's microphone using MediaRecorder.
     */
    private async startMicrophone() {
        try {
            // Initialize AudioContext (requires user gesture in some browsers)
            this.audioContext = new (window.AudioContext || (window as any).webkitAudioContext)();

            const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

            // 3. Setup MediaRecorder
            // Note: 'audio/webm' or 'audio/ogg' are common browser formats.
            // For raw PCM, we might need to use AudioWorklets, but MediaRecorder is simpler for "Hello World".
            this.mediaRecorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });

            this.mediaRecorder.ondataavailable = (e) => {
                if (e.data.size > 0 && this.ws?.readyState === WebSocket.OPEN) {
                    // Convert Blob to ArrayBuffer to send over WebSocket
                    e.data.arrayBuffer().then((buffer) => {
                        this.ws!.send(buffer);
                    });
                }
            };

            // Trigger recording in chunks
            this.mediaRecorder.start(this.CHUNK_SIZE_MS);
            console.log('Microphone active. Streaming audio...');

        } catch (err) {
            console.error('Error accessing microphone:', err);
        }
    }

    /**
     * Handles the audio data received from the server.
     * @param data The raw ArrayBuffer received via WebSocket.
     */
    private async handleIncomingAudio(data: ArrayBuffer) {
        // 4. Decode Audio Data
        if (!this.audioContext) return;

        // Convert ArrayBuffer to Float32Array for Web Audio API processing
        // Assuming server sends 16-bit PCM (standard for raw audio)
        const int16Data = new Int16Array(data);
        const float32Data = new Float32Array(int16Data.length);

        for (let i = 0; i < int16Data.length; i++) {
            float32Data[i] = int16Data[i] / 32768.0; // Normalize 16-bit to float (-1 to 1)
        }

        // Add to queue to handle playback sequentially
        this.audioQueue.push(float32Data);

        if (!this.isPlaying) {
            this.playAudioQueue();
        }
    }

    /**
     * Plays the audio buffer queue sequentially to avoid glitches.
     */
    private async playAudioQueue() {
        if (!this.audioContext || this.audioQueue.length === 0) {
            this.isPlaying = false;
            return;
        }

        this.isPlaying = true;
        const audioData = this.audioQueue.shift()!;

        // Create an AudioBufferSourceNode to play the raw PCM data
        const source = this.audioContext.createBufferSource();
        const buffer = this.audioContext.createBuffer(
            1, // Mono
            audioData.length,
            16000 // Must match the server's sample rate; the browser resamples on playback
        );

        buffer.getChannelData(0).set(audioData);
        source.buffer = buffer;
        source.connect(this.audioContext.destination);

        source.onended = () => {
            // Recursive call to play next chunk in queue
            this.playAudioQueue();
        };

        source.start();
    }

    /**
     * Stops recording and closes connections.
     */
    public stop() {
        if (this.mediaRecorder) this.mediaRecorder.stop();
        if (this.ws) this.ws.close();
        if (this.audioContext) this.audioContext.close();
        console.log('Voice Chat Client stopped.');
    }
}

// Usage
// In a real app, you would attach this to a button click event.
const client = new VoiceChatClient();
// client.start(); // Uncomment to run

Conclusion: The Future of Real-Time Voice on the Web

Building real-time voice applications is no longer a futuristic dream. By combining the power of WebSockets, the Web Audio API, local LLMs, and performance optimizations like WebGPU, you can create truly immersive and interactive voice experiences directly in the browser. This opens up exciting possibilities for conversational AI, collaborative tools, and more. The key is to embrace the streaming paradigm and optimize for low latency at every stage of the pipeline.

The concepts and code demonstrated here are drawn from the comprehensive roadmap laid out in the book The Edge of AI: Local LLMs (Ollama), Transformers.js, WebGPU, and Performance Optimization (Amazon link), part of the AI with JavaScript & TypeScript Series.
The ebook is also on Leanpub.com: https://leanpub.com/EdgeOfAIJavaScriptTypeScript.

👉 Free access now to the TypeScript & AI Series on Programming Central: it includes 8 volumes, 160 chapters, and hundreds of quizzes covering every chapter.
