# Voice AI Agents: Building Speech-to-Speech Apps with TypeScript
Voice is the most natural interface for AI. In 2026, speech-to-speech applications are transforming customer service, virtual assistants, and real-time translation. But building voice AI pipelines traditionally requires stitching together multiple SDKs: one for Speech-to-Text (STT), another for LLM inference, and a third for Text-to-Speech (TTS).
NeuroLink unifies this entire pipeline into a single TypeScript SDK.
In this guide, you'll learn how to build real-time voice AI agents using NeuroLink's streaming architecture. We'll cover speech-to-text integration, streaming LLM responses, text-to-speech synthesis, and practical patterns for production voice applications.
## Why Voice AI Is Hard (And How NeuroLink Solves It)
Building voice applications traditionally involves three disconnected systems:
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   STT API   │  →  │     LLM     │  →  │   TTS API   │
│  (Whisper)  │     │  (Various)  │     │  (Eleven)   │
└─────────────┘     └─────────────┘     └─────────────┘
       ↑                                       ↓
   Microphone                               Speaker
```
The challenges:
- Latency stacking: Each hop adds 200-500ms
- Provider fragmentation: Different APIs, auth patterns, error handling
- Streaming complexity: Interleaving audio chunks with text responses
- State management: Tracking conversation context across services
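To see why latency stacking hurts, it helps to tally a rough budget for the stitched-together pipeline. The figures below are illustrative, taken from the per-hop range quoted above, not measurements:

```typescript
// Illustrative latency budget for the traditional three-SDK pipeline.
// Figures are the 200-500ms per-hop range quoted above, not benchmarks.
type Hop = { name: string; minMs: number; maxMs: number };

const hops: Hop[] = [
  { name: "STT", minMs: 200, maxMs: 500 },
  { name: "LLM first token", minMs: 200, maxMs: 500 },
  { name: "TTS", minMs: 200, maxMs: 500 },
];

function totalLatency(hops: Hop[]): { minMs: number; maxMs: number } {
  return hops.reduce(
    (acc, h) => ({ minMs: acc.minMs + h.minMs, maxMs: acc.maxMs + h.maxMs }),
    { minMs: 0, maxMs: 0 },
  );
}

const { minMs, maxMs } = totalLatency(hops);
console.log(`End-to-end: ${minMs}-${maxMs}ms before any audio plays back`);
// → End-to-end: 600-1500ms before any audio plays back
```

A sequential pipeline pays every hop in full; a streaming pipeline overlaps them, which is the whole argument for the unified approach below.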
NeuroLink treats voice as a first-class stream, just like tokens or tool calls. The same `stream()` API handles speech input, LLM processing, and audio output.
## Architecture: The Voice Pipeline
Here's how NeuroLink simplifies voice AI:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  // Primary LLM for reasoning
  provider: "anthropic",
  model: "claude-4-sonnet",
  // Optional: add STT/TTS tools via MCP
  tools: ["speechToText", "textToSpeech"],
});

// Single pipeline: Audio → Text → AI → Audio
const pipeline = await neurolink.stream({
  input: {
    audio: audioStream, // incoming voice
    text: "Transcribe and respond naturally",
  },
  // Response includes both text AND synthesized speech
  output: { formats: ["text", "audio"] },
});
```
## Building a Voice Assistant: Complete Example
Let's build a real-time voice assistant that listens, thinks, and speaks.
### 1. Setup and Configuration
```typescript
import { NeuroLink } from "@juspay/neurolink";

// Provider options for each stage of the pipeline
interface VoiceConfig {
  sttProvider: "whisper" | "deepgram" | "assembly";
  llmProvider: "anthropic" | "openai" | "google-ai";
  ttsProvider: "elevenlabs" | "openai" | "azure";
}

const voiceAgent = new NeuroLink({
  // Core LLM configuration
  provider: "anthropic",
  model: "claude-4-sonnet",
  // Memory for multi-turn conversations
  memory: {
    enabled: true,
    backend: "redis",
    ttl: 3600, // 1-hour session
  },
  // System prompt for the voice persona
  systemPrompt: `You are a helpful voice assistant.
Keep responses concise (2-3 sentences) for natural speech.
Use conversational language. Avoid markdown or code blocks.`,
});
```
### 2. Speech-to-Text Integration
Capture audio and transcribe in real-time:
```typescript
import recorder from "node-record-lpcm16"; // ships without types; add a .d.ts shim if needed

async function* captureAudio(): AsyncIterable<Buffer> {
  // node-record-lpcm16 exposes record(), which returns an object with a stream()
  const recording = recorder.record({
    sampleRate: 16000,
    channels: 1,
    audioType: "wav",
  });
  for await (const chunk of recording.stream()) {
    yield chunk as Buffer;
  }
}

// Transcribe streaming audio
async function transcribeStream(audioStream: AsyncIterable<Buffer>) {
  const result = await voiceAgent.generate({
    input: { audio: audioStream },
    // Use the STT tool to convert speech to text
    tools: [
      {
        name: "speechToText",
        provider: "whisper",
        config: { language: "en", model: "whisper-1" },
      },
    ],
  });
  return result.toolResults.speechToText.text;
}
```
### 3. Streaming LLM Response
Process transcribed text with streaming for real-time feedback:
```typescript
async function* processVoiceQuery(transcript: string, sessionId: string) {
  const stream = await voiceAgent.stream({
    input: { text: transcript },
    // Attach session for memory/context
    session: { id: sessionId },
    // Request streamed output for real-time TTS
    output: {
      format: "stream",
      streaming: true,
    },
  });

  for await (const chunk of stream.stream) {
    if ("content" in chunk) {
      // Yield text chunks for display/processing
      yield { type: "text", content: chunk.content };
    }
    if ("toolCall" in chunk) {
      // Handle any tool invocations
      yield { type: "tool", call: chunk.toolCall };
    }
  }
}
```
### 4. Text-to-Speech Synthesis
Convert AI responses to speech with streaming audio:
```typescript
async function* speakResponse(
  textStream: AsyncIterable<string>,
  voiceId: string = "default",
): AsyncIterable<Buffer> {
  // Buffer text into sentences for natural speech
  let sentenceBuffer = "";

  for await (const text of textStream) {
    sentenceBuffer += text;

    // Extract complete sentences (terminated by . ! or ?)
    const sentences = sentenceBuffer.match(/[^.!?]+[.!?]+/g) || [];
    if (sentences.length > 0) {
      // Remove processed text from the buffer
      sentenceBuffer = sentenceBuffer.slice(sentences.join("").length);

      // Generate speech for each complete sentence
      for (const sentence of sentences) {
        const audio = await voiceAgent.generate({
          input: { text: sentence.trim() },
          tools: [
            {
              name: "textToSpeech",
              provider: "elevenlabs",
              config: {
                voiceId,
                model: "eleven_multilingual_v2",
                streaming: true,
              },
            },
          ],
        });
        if (audio.toolResults?.textToSpeech?.audio) {
          yield Buffer.from(audio.toolResults.textToSpeech.audio, "base64");
        }
      }
    }
  }

  // Flush any remaining partial sentence
  if (sentenceBuffer.trim()) {
    const finalAudio = await voiceAgent.generate({
      input: { text: sentenceBuffer.trim() },
      tools: [
        {
          name: "textToSpeech",
          provider: "elevenlabs",
          config: { voiceId, model: "eleven_multilingual_v2" },
        },
      ],
    });
    if (finalAudio.toolResults?.textToSpeech?.audio) {
      yield Buffer.from(finalAudio.toolResults.textToSpeech.audio, "base64");
    }
  }
}
```
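The sentence-buffering step above is easy to get wrong, so it is worth pulling out into a pure helper (same regex as the pipeline code, with a hypothetical `splitSentences` name) that can be checked in isolation:

```typescript
// Pure sentence splitter mirroring the buffering logic in speakResponse.
// Returns complete sentences plus whatever partial text remains buffered.
function splitSentences(buffer: string): { sentences: string[]; rest: string } {
  const sentences = buffer.match(/[^.!?]+[.!?]+/g) ?? [];
  const rest = buffer.slice(sentences.join("").length);
  return { sentences, rest };
}

const { sentences, rest } = splitSentences("Hello there. How are");
console.log(sentences); // → [ "Hello there." ]
console.log(rest);      // → " How are"
```

Keeping the splitter pure means the TTS loop only ever sees complete sentences, and the trailing fragment carries over to the next chunk untouched.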
### 5. Complete Voice Loop
Putting it all together:
```typescript
import Speaker from "speaker";

async function runVoiceAssistant() {
  const sessionId = `session-${Date.now()}`;
  const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

  console.log("🎤 Voice assistant started. Speak now...");

  // 1. Capture audio from the microphone
  const audioStream = captureAudio();

  // 2. Transcribe speech to text
  const transcript = await transcribeStream(audioStream);
  console.log(`📝 You said: "${transcript}"`);

  // 3. Process with the LLM (streaming)
  const textStream = processVoiceQuery(transcript, sessionId);

  // 4. Convert to speech (sentence-by-sentence for low latency)
  const audioOutput = speakResponse(
    (async function* () {
      for await (const chunk of textStream) {
        if (chunk.type === "text") {
          process.stdout.write(chunk.content);
          yield chunk.content;
        }
      }
    })(),
  );

  // 5. Play the audio response
  for await (const audioChunk of audioOutput) {
    speaker.write(audioChunk);
  }

  console.log("\n✅ Response complete");
}

// Run the assistant
runVoiceAssistant().catch(console.error);
```
## Advanced Patterns

### Interrupt Handling
Voice assistants need to handle interruptions gracefully:
```typescript
class InterruptibleVoiceAgent {
  private currentStream: AbortController | null = null;

  async handleVoiceInput(audioStream: AsyncIterable<Buffer>) {
    // Cancel any in-flight response
    this.currentStream?.abort();
    this.currentStream = new AbortController();

    try {
      const transcript = await this.transcribe(audioStream);

      // Check for interruption keywords
      if (this.isInterruption(transcript)) {
        await this.playInterruptionAck();
        return;
      }

      await this.generateResponse(transcript, this.currentStream.signal);
    } catch (error) {
      if ((error as Error).name !== "AbortError") throw error;
    }
  }

  private isInterruption(text: string): boolean {
    const interruptionWords = ["stop", "wait", "hold on", "nevermind"];
    return interruptionWords.some((w) => text.toLowerCase().includes(w));
  }

  // transcribe(), playInterruptionAck(), and generateResponse() elided for brevity
}
```
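The cancel-the-previous-response pattern is independent of NeuroLink and can be exercised with a plain `AbortController`. This sketch uses a hypothetical `delay` helper as a stand-in for a streaming response:

```typescript
// Minimal cancel-previous pattern: each new input aborts the in-flight response.
function delay(ms: number, signal: AbortSignal): Promise<void> {
  return new Promise((resolve, reject) => {
    if (signal.aborted) return reject(new DOMException("Aborted", "AbortError"));
    const t = setTimeout(resolve, ms);
    signal.addEventListener("abort", () => {
      clearTimeout(t);
      reject(new DOMException("Aborted", "AbortError"));
    });
  });
}

class Interruptible {
  private current: AbortController | null = null;

  async respond(label: string): Promise<string> {
    this.current?.abort();                  // interrupt whatever is still "speaking"
    this.current = new AbortController();
    try {
      await delay(50, this.current.signal); // stand-in for streaming TTS playback
      return `finished: ${label}`;
    } catch (err) {
      if ((err as Error).name === "AbortError") return `aborted: ${label}`;
      throw err;
    }
  }
}

const agent = new Interruptible();
const first = agent.respond("first");
const second = agent.respond("second"); // arrives immediately, aborts "first"
console.log(await Promise.all([first, second]));
// → [ "aborted: first", "finished: second" ]
```

Swallowing only `AbortError` matters: a genuine failure still propagates, while a deliberate interruption resolves quietly.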
### Multi-Language Voice AI
NeuroLink's unified API makes language switching seamless:
```typescript
// Per-language provider and voice selection
interface LanguageVoice {
  stt: string;
  tts: string;
  voice: string;
}

async function multilingualVoiceAgent(
  detectedLanguage: string,
  audioStream: AsyncIterable<Buffer>,
) {
  const languageConfigs: Record<string, LanguageVoice> = {
    en: { stt: "whisper", tts: "elevenlabs", voice: "Bella" },
    es: { stt: "whisper", tts: "elevenlabs", voice: "Pedro" },
    hi: { stt: "whisper", tts: "azure", voice: "hi-IN-SwaraNeural" },
  };
  const config = languageConfigs[detectedLanguage];

  return await neurolink.generate({
    input: { audio: audioStream },
    tools: [
      {
        name: "speechToText",
        config: { language: detectedLanguage },
      },
      {
        name: "textToSpeech",
        config: {
          voiceId: config.voice,
          language: detectedLanguage,
        },
      },
    ],
  });
}
```
### Voice Activity Detection (VAD)
Optimize costs by only processing speech:
```typescript
import { MicVAD } from "@ricky0123/vad-web"; // browser-only library

// Queue speech segments as the VAD emits them, then yield from the queue.
async function* voiceDetectedAudio(): AsyncIterable<Float32Array> {
  const segments: Float32Array[] = [];
  let notify: (() => void) | null = null;

  const vad = await MicVAD.new({
    onSpeechStart: () => console.log("🎙️ Speech detected"),
    onSpeechEnd: (audio) => {
      segments.push(audio); // only complete utterances are queued
      notify?.();
    },
  });
  vad.start();

  // Yield audio only when a full speech segment is available
  while (true) {
    while (segments.length === 0) {
      await new Promise<void>((resolve) => (notify = resolve));
    }
    yield segments.shift()!;
  }
}
```
## Production Considerations

### Latency Optimization
| Technique | Latency Impact |
|---|---|
| Streaming STT | -300ms |
| Sentence-level TTS | -500ms |
| Redis memory | -100ms (no context rebuild) |
| WebSocket transport | -50ms |
| Parallel TTS prefetch | -200ms |
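Of these, parallel TTS prefetch is the least obvious: synthesize upcoming sentences concurrently while playing them back in order. The ordering trick is generic and NeuroLink-independent. Kick off all requests eagerly, then await them in sequence; `fakeTts` below is a stand-in for a real synthesis call:

```typescript
// Ordered prefetch: start all TTS requests immediately, consume in order.
// fakeTts is a stand-in for a real synthesis call with variable latency.
async function fakeTts(sentence: string): Promise<string> {
  await new Promise((r) => setTimeout(r, Math.random() * 20));
  return `audio(${sentence})`;
}

async function* prefetchInOrder(
  sentences: string[],
  synth: (s: string) => Promise<string>,
): AsyncIterable<string> {
  const inflight = sentences.map(synth);   // all requests start now
  for (const p of inflight) yield await p; // playback order is preserved
}

const out: string[] = [];
for await (const clip of prefetchInOrder(["One.", "Two.", "Three."], fakeTts)) {
  out.push(clip);
}
console.log(out); // → [ "audio(One.)", "audio(Two.)", "audio(Three.)" ]
```

Even if "Three." finishes synthesizing first, playback still waits its turn, so the listener hears sentences in order while later ones are already in flight.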
### Error Handling
```typescript
const resilientVoiceAgent = new NeuroLink({
  // Auto-fallback if the primary STT provider fails
  fallback: {
    providers: ["whisper", "deepgram", "assembly"],
    strategy: "sequential",
  },
  // Retry configuration
  retry: {
    attempts: 3,
    backoff: "exponential",
    maxDelay: 5000,
  },
});
```
### Cost Tracking
Monitor voice AI costs across providers:
```typescript
neurolink.on("usage", (event) => {
  console.log(`STT: ${event.sttTokens} tokens`);
  console.log(`LLM: ${event.llmTokens} tokens`);
  console.log(`TTS: ${event.ttsCharacters} chars`);
  console.log(`Total: $${event.estimatedCost}`);
});
```
## Web Integration Example
For browser-based voice apps:
```typescript
// Server: WebSocket endpoint
import { WebSocketServer } from "ws";
import { NeuroLink } from "@juspay/neurolink";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  const agent = new NeuroLink({
    provider: "anthropic",
    model: "claude-4-sonnet",
  });

  ws.on("message", async (audioData) => {
    const result = await agent.generate({
      input: {
        audio: audioData,
        mimeType: "audio/webm",
      },
      tools: ["speechToText", "textToSpeech"],
    });
    // Send the audio response back
    ws.send(result.toolResults.textToSpeech.audio);
  });
});
```

```typescript
// Client: browser microphone streamed over WebSocket
const ws = new WebSocket("ws://localhost:8080");
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);

mediaRecorder.ondataavailable = (event) => {
  ws.send(event.data);
};
mediaRecorder.start(250); // emit audio chunks every 250ms

ws.onmessage = async (event) => {
  // The server sends synthesized audio back as a binary blob
  const audio = new Audio(URL.createObjectURL(event.data as Blob));
  await audio.play();
};
```
## Conclusion
Voice AI doesn't need to be fragmented. With NeuroLink, you build speech-to-speech applications using the same patterns as text-based AI:
- Unified API: One SDK for STT, LLM, and TTS
- Streaming native: Real-time audio processing out of the box
- Memory aware: Conversations persist across voice sessions
- Provider agnostic: Switch STT/TTS providers without rewriting code
Whether you're building a customer service bot, a voice-enabled coding assistant, or a real-time translator, NeuroLink's streaming architecture handles the complexity so you can focus on the conversation.
NeuroLink — The Universal AI SDK for TypeScript
- GitHub: github.com/juspay/neurolink
- Install: `npm install @juspay/neurolink`
- Docs: docs.neurolink.ink
- Blog: blog.neurolink.ink — 150+ technical articles