Imagine asking an AI a complex question and hearing it think, pausing naturally as it formulates the next thought, and speaking the answer back to you in real-time. This isn't a sci-fi movie; it's the power of streaming text-to-speech (TTS).
In modern web development, specifically within the Next.js ecosystem, bridging the gap between Large Language Models (LLMs) and spoken audio creates a revolutionary user experience. By combining the Vercel AI SDK, React Server Components (RSC), and the native Web Speech API, we can build a "talking assistant" that feels alive.
This guide explores the architecture behind real-time audio synthesis and provides a complete, copy-pasteable code example to get you started.
The Architecture: From Tokens to Audio
To build a truly responsive assistant, we must abandon the "stop-and-wait" model. If we wait for the LLM to generate a full paragraph before converting it to audio, the latency ruins the immersion.
Instead, we implement a streaming audio synthesis pipeline. Here is the theoretical breakdown of how the data flows:
- The LLM (The Composer): Generates text incrementally, one token at a time.
- The Vercel AI SDK (The Conductor): Manages the stream, pushing tokens from the server to the client instantly.
- The Web Speech API (The Instrumentalist): The browser's native synthesizer receives these tokens and converts them into sound waves immediately.
The "Orchestra" Analogy
Think of this architecture like a live orchestra performance:
- The LLM is the composer writing the score note-by-note.
- The SDK is the conductor reading the notes as they appear and cueing the musicians instantly.
- The Web Speech API is the musician playing the instrument in real-time.
The goal is near-zero-latency auditory feedback: the user hears the AI speaking within milliseconds of the first text token being generated.
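This flow can be sketched as plain TypeScript, decoupled from any UI. The names below (tokenStream, runPipeline, synthesize) are illustrative stand-ins for the SDK stream and the browser synthesizer, not real APIs:

```typescript
// Illustrative sketch of the token pipeline. `tokenStream` stands in for the
// stream the Vercel AI SDK would deliver, and `synthesize` for the client-side
// Web Speech API call. Neither name is a real API.
async function* tokenStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    // Simulate the LLM emitting tokens incrementally
    await new Promise((resolve) => setTimeout(resolve, 5));
    yield token;
  }
}

async function runPipeline(
  tokens: string[],
  synthesize: (text: string) => void
): Promise<string> {
  let transcript = '';
  for await (const token of tokenStream(tokens)) {
    transcript += token; // update the visual transcript as tokens arrive
    synthesize(token);   // hand each token to the audio layer immediately
  }
  return transcript;
}
```

Because the audio callback fires per token, the "instrumentalist" starts playing long before the "composer" has finished the score.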
The Challenge: Streaming vs. Synthesis
The Web Speech API (window.speechSynthesis) is designed to speak complete sentences. However, LLMs stream tokens (often sub-word chunks like "ing" or "pre"). If you feed every tiny token to the synthesizer as a separate utterance, the result is a robotic, stuttering mess.
To solve this, we need a Buffering Strategy:
- Lookahead Buffering: Accumulate tokens in a buffer.
- Boundary Detection: Flush the buffer to the synthesizer when a natural break is detected (punctuation such as ".", ",", or "?", or a space).
- Timeout Fallback: If the buffer gets too large or too old (e.g., older than 200ms), flush it anyway to prevent latency buildup.
This ensures the synthesizer receives intelligible phrases rather than disjointed syllables.
Implementation: The Code
Below is a self-contained Next.js Client Component. It simulates a server stream (using a mock async function) and plays the audio in real-time using the native browser API.
TalkingAssistant.tsx
'use client';
import React, { useState, useEffect, useRef } from 'react';
// Define the shape of the streamable text token
type StreamToken = {
type: 'text';
content: string;
};
export default function TalkingAssistant() {
// State for the visual UI
const [displayText, setDisplayText] = useState<string>('');
// State to track audio status
const [isSpeaking, setIsSpeaking] = useState<boolean>(false);
// Ref to buffer text tokens for smoother audio
const bufferRef = useRef<string>('');
/**
* 1. SIMULATE SERVER STREAM
* In a real app, replace this with the `useChat` hook from the Vercel AI SDK.
*/
const simulateStream = async (): Promise<void> => {
// Reset previous state
setDisplayText('');
bufferRef.current = '';
window.speechSynthesis.cancel(); // Clear any existing queue
const mockTokens = [
'Hello, ', 'developer! ', 'I am ', 'your AI assistant. ',
'I am processing ', 'your request ', 'right now. ',
'This ', 'audio ', 'is ', 'streaming ', 'in real-time.'
];
for (const token of mockTokens) {
// Simulate network latency
await new Promise(resolve => setTimeout(resolve, 300));
// 2. Update Visual State
setDisplayText(prev => prev + token);
// 3. Buffer and Speak Audio
handleAudioStream(token);
}
};
/**
* 2. AUDIO SYNTHESIS LOGIC
* Handles buffering and queuing to the Web Speech API.
*/
const handleAudioStream = (token: string) => {
if (!window.speechSynthesis) {
console.error('Web Speech API not supported.');
return;
}
// Add token to buffer
bufferRef.current += token;
// Check for natural break points (punctuation or spaces)
// In a production app, you might want a more robust regex or a timer.
const hasBreak = /[.!?]\s|,\s|\s$/.test(bufferRef.current);
if (hasBreak) {
speakText(bufferRef.current);
bufferRef.current = ''; // Clear buffer
}
};
/**
* 3. THE SPEAKER
* Creates an utterance and adds it to the browser's queue.
*/
const speakText = (text: string) => {
const utterance = new SpeechSynthesisUtterance(text);
// Optional: Select a specific voice
const voices = window.speechSynthesis.getVoices();
const preferredVoice = voices.find(v => v.lang === 'en-US');
if (preferredVoice) utterance.voice = preferredVoice;
// Event Listeners for UI Sync
utterance.onstart = () => setIsSpeaking(true);
utterance.onend = () => {
// Only set to idle if the queue is empty
// Note: `pending` and `speaking` are booleans, not counts
if (!window.speechSynthesis.pending && !window.speechSynthesis.speaking) {
setIsSpeaking(false);
}
};
window.speechSynthesis.speak(utterance);
};
// Controls
const pauseSpeech = () => {
window.speechSynthesis.pause();
setIsSpeaking(false);
};
const resumeSpeech = () => {
window.speechSynthesis.resume();
setIsSpeaking(true);
};
const stopSpeech = () => {
window.speechSynthesis.cancel();
setIsSpeaking(false);
bufferRef.current = '';
};
// Cleanup on unmount
useEffect(() => {
return () => window.speechSynthesis.cancel();
}, []);
return (
<div style={{ padding: '20px', maxWidth: '600px', margin: '0 auto', fontFamily: 'system-ui' }}>
<h2>AI Talking Assistant</h2>
{/* Visual Output */}
<div style={{
border: '1px solid #ddd',
padding: '15px',
minHeight: '80px',
marginBottom: '20px',
borderRadius: '8px',
background: '#f9f9f9'
}}>
<p style={{ color: '#333' }}>
{displayText || <span style={{ color: '#999' }}>Click "Start Stream" to begin...</span>}
</p>
</div>
{/* Controls */}
<div style={{ display: 'flex', gap: '10px', flexWrap: 'wrap' }}>
<button
onClick={simulateStream}
disabled={isSpeaking}
style={{ padding: '10px', background: '#0070f3', color: 'white', border: 'none', borderRadius: '4px', cursor: 'pointer' }}
>
Start Stream
</button>
<button
onClick={pauseSpeech}
disabled={!isSpeaking}
style={{ padding: '10px', background: '#f59e0b', color: 'white', border: 'none', borderRadius: '4px', cursor: 'pointer' }}
>
Pause
</button>
<button
onClick={resumeSpeech}
style={{ padding: '10px', background: '#10b981', color: 'white', border: 'none', borderRadius: '4px', cursor: 'pointer' }}
>
Resume
</button>
<button
onClick={stopSpeech}
style={{ padding: '10px', background: '#ef4444', color: 'white', border: 'none', borderRadius: '4px', cursor: 'pointer' }}
>
Stop
</button>
</div>
<div style={{ marginTop: '15px', fontSize: '0.85rem', color: '#666' }}>
Status: {isSpeaking ? 'Speaking...' : 'Idle'}
</div>
</div>
);
}
Key Technical Concepts Explained
1. The 'use client' Directive
The Web Speech API (window.speechSynthesis) is a browser-only API. It does not exist in the Node.js environment where server components run. Therefore, any component interacting with audio must be marked as a Client Component.
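As a defensive sketch (the canSpeak and safeSpeak names are illustrative, not part of any API), you can feature-detect before touching the synthesizer. globalThis is window in the browser, so the check simply fails anywhere else:

```typescript
// Feature-detect the browser-only API before using it. `globalThis` is
// `window` in the browser; in Node or during server rendering the check
// fails and speaking becomes a safe no-op.
const canSpeak = 'speechSynthesis' in globalThis;

function safeSpeak(text: string): boolean {
  if (!canSpeak) return false; // server or unsupported browser: do nothing
  const g = globalThis as any;
  g.speechSynthesis.speak(new g.SpeechSynthesisUtterance(text));
  return true;
}
```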
2. The SpeechSynthesisUtterance Queue
The browser handles a queue of utterances automatically. When we call window.speechSynthesis.speak(utterance), it is added to the queue.
- The Pitfall: If you fire this for every single token without buffering, you will create a cacophony of overlapping syllables.
- The Solution: Our handleAudioStream function acts as a gatekeeper. It waits for a "natural break" (like a space or punctuation) before releasing the text to the synthesizer.
3. Voice Loading Race Conditions
A common issue with the Web Speech API is that window.speechSynthesis.getVoices() returns an empty array initially. Voices load asynchronously. In a production app, you should listen for the onvoiceschanged event to ensure the voice is available before attempting to speak.
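One way to handle this, sketched with an illustrative getVoicesWhenReady helper (the structural SynthLike type exists only to keep the sketch self-contained and testable; in a real app you would pass window.speechSynthesis), is to wrap the event in a Promise:

```typescript
// Structural stand-in for the parts of SpeechSynthesis this sketch needs
type SynthLike = {
  getVoices(): { name: string; lang: string }[];
  onvoiceschanged: (() => void) | null;
};

// Resolves immediately if voices are already loaded; otherwise waits for
// the browser to fire `voiceschanged` before resolving.
function getVoicesWhenReady(
  synth: SynthLike
): Promise<{ name: string; lang: string }[]> {
  const voices = synth.getVoices();
  if (voices.length > 0) return Promise.resolve(voices); // fast path
  return new Promise((resolve) => {
    synth.onvoiceschanged = () => resolve(synth.getVoices());
  });
}
```

In the component above, you would then pick a voice with something like `(await getVoicesWhenReady(window.speechSynthesis)).find(v => v.lang === 'en-US')` before the first utterance.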
Common Pitfalls to Avoid
- Auto-play on Mount: Mobile browsers (iOS Safari) strictly block audio from playing without a direct user interaction (like a click). Never call speak() inside a useEffect without user input.
- Token Fragmentation: If your LLM streams character-by-character, your buffer logic must be smart enough to group them. A 200ms timer is a good fallback to group rapid-fire tokens.
- Memory Leaks: Always call window.speechSynthesis.cancel() in your component's cleanup function (the useEffect return) to stop audio and clear the queue when the user navigates away.
Conclusion
By decoupling the generation of text from the synthesis of audio, we can create highly responsive, accessible, and immersive web applications. The combination of Next.js RSC for keeping model calls and API keys on the server, the Vercel AI SDK for real-time data streaming, and the native Web Speech API for client-side synthesis provides a powerful, lightweight stack for building the next generation of voice interfaces.
The concepts and code demonstrated here are drawn directly from the comprehensive roadmap laid out in the book The Modern Stack: Building Generative UI with Next.js, Vercel AI SDK, and React Server Components.
Here are the volumes in the series:
- Volume 1: Building Intelligent Apps with JavaScript & TypeScript. Foundations, OpenAI API, Zod, and LangChain.js.
- Volume 2: The Modern Stack. Building Generative UI with Next.js, Vercel AI SDK, and React Server Components.
- Volume 3: Master Your Data. Production RAG, Vector Databases, and Enterprise Search with JavaScript.
- Volume 4: Autonomous Agents. Building Multi-Agent Systems and Workflows with LangGraph.js.
- Volume 5: The Edge of AI. Local LLMs (Ollama), Transformers.js, WebGPU, and Performance Optimization.
- Volume 6: The AI-Ready SaaS Boilerplate. Auth, Database with Vector Support, and Payment Stack.