NeuroLink AI

Voice AI Agents: Building Speech-to-Speech Apps with TypeScript

Voice is the most natural interface for AI. In 2026, speech-to-speech applications are transforming customer service, virtual assistants, and real-time translation. But building voice AI pipelines traditionally requires stitching together multiple SDKs: one for Speech-to-Text (STT), another for LLM inference, and a third for Text-to-Speech (TTS).

NeuroLink unifies this entire pipeline into a single TypeScript SDK.

In this guide, you'll learn how to build real-time voice AI agents using NeuroLink's streaming architecture. We'll cover speech-to-text integration, streaming LLM responses, text-to-speech synthesis, and practical patterns for production voice applications.


Why Voice AI Is Hard (And How NeuroLink Solves It)

Building voice applications traditionally involves three disconnected systems:

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   STT API   │ →  │    LLM      │ →  │   TTS API   │
│  (Whisper)  │    │  (Various)  │    │  (Eleven)   │
└─────────────┘    └─────────────┘    └─────────────┘
      ↑                                    ↓
   Microphone                          Speaker

The challenges:

  • Latency stacking: Each hop adds 200-500ms
  • Provider fragmentation: Different APIs, auth patterns, error handling
  • Streaming complexity: Interleaving audio chunks with text responses
  • State management: Tracking conversation context across services
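
To put rough numbers on the latency point, here's a back-of-the-envelope sketch. The 200-500ms per-hop range above is an illustrative figure, not a measured benchmark:

```typescript
// Each hop (STT → LLM → TTS) adds its own network round trip. With
// three hops at 200-500ms each, stacking alone adds roughly 0.6-1.5s
// before any model "thinking time" is counted.
const hops = 3;
const perHopMs = { best: 200, worst: 500 };

const stackedMs = {
  best: hops * perHopMs.best,   // 600
  worst: hops * perHopMs.worst, // 1500
};

console.log(`Stacked pipeline latency: ${stackedMs.best}-${stackedMs.worst}ms`);
```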

NeuroLink treats voice as a first-class stream, just like tokens or tool calls. The same stream() API handles speech input, LLM processing, and audio output.


Architecture: The Voice Pipeline

Here's how NeuroLink simplifies voice AI:

import { NeuroLink } from "@juspay/neurolink";

const neurolink = new NeuroLink({
  // Primary LLM for reasoning
  provider: "anthropic",
  model: "claude-4-sonnet",

  // Optional: Add STT/TTS tools via MCP
  tools: ["speechToText", "textToSpeech"],
});

// Single pipeline: Audio → Text → AI → Audio
const pipeline = await neurolink.stream({
  input: {
    audio: audioStream,  // Incoming voice
    text: "Transcribe and respond naturally"
  },
  // Response includes both text AND synthesized speech
  output: { formats: ["text", "audio"] }
});

Building a Voice Assistant: Complete Example

Let's build a real-time voice assistant that listens, thinks, and speaks.

1. Setup and Configuration

import { NeuroLink } from "@juspay/neurolink";
import { createWriteStream } from "fs";
import { Readable } from "stream";

interface VoiceConfig {
  sttProvider: "whisper" | "deepgram" | "assembly";
  llmProvider: "anthropic" | "openai" | "google-ai";
  ttsProvider: "elevenlabs" | "openai" | "azure";
}

const voiceAgent = new NeuroLink({
  // Core LLM configuration
  provider: "anthropic",
  model: "claude-4-sonnet",

  // Memory for multi-turn conversations
  memory: {
    enabled: true,
    backend: "redis",
    ttl: 3600, // 1 hour session
  },

  // System prompt for voice persona
  systemPrompt: `You are a helpful voice assistant.
    Keep responses concise (2-3 sentences) for natural speech.
    Use conversational language. Avoid markdown or code blocks.`
});

2. Speech-to-Text Integration

Capture audio and transcribe in real-time:

import recorder from "node-record-lpcm16";

async function* captureAudio(): AsyncIterable<Buffer> {
  // node-record-lpcm16 exposes a record() function; the returned
  // recording's stream() yields raw PCM chunks
  const recording = recorder.record({
    sampleRate: 16000,
    channels: 1,
    audioType: "wav"
  });

  for await (const chunk of recording.stream()) {
    yield chunk;
  }
}

// Transcribe streaming audio
async function transcribeStream(audioStream: AsyncIterable<Buffer>) {
  const result = await voiceAgent.generate({
    input: { audio: audioStream },
    // Use STT tool to convert speech to text
    tools: [{
      name: "speechToText",
      provider: "whisper",
      config: { language: "en", model: "whisper-1" }
    }],
  });

  return result.toolResults.speechToText.text;
}

3. Streaming LLM Response

Process transcribed text with streaming for real-time feedback:

async function* processVoiceQuery(transcript: string, sessionId: string) {
  const stream = await voiceAgent.stream({
    input: { text: transcript },
    // Attach session for memory/context
    session: { id: sessionId },
    // Request structured output for voice
    output: {
      format: "stream",
      // Enable streaming for real-time TTS
      streaming: true
    }
  });

  for await (const chunk of stream.stream) {
    if ("content" in chunk) {
      // Yield text chunks for display/processing
      yield { type: "text", content: chunk.content };
    }

    if ("toolCall" in chunk) {
      // Handle any tool invocations
      yield { type: "tool", call: chunk.toolCall };
    }
  }
}

4. Text-to-Speech Synthesis

Convert AI responses to speech with streaming audio:


async function* speakResponse(
  textStream: AsyncIterable<string>,
  voiceId: string = "default"
): AsyncIterable<Buffer> {
  // Buffer text into sentences for natural speech
  let sentenceBuffer = "";

  for await (const text of textStream) {
    sentenceBuffer += text;

    // Process complete sentences
    const sentences = sentenceBuffer.match(/[^.!?]+[.!?]+/g) || [];

    if (sentences.length > 0) {
      // Remove processed text from buffer
      sentenceBuffer = sentenceBuffer.slice(
        sentences.join("").length
      );

      // Generate speech for each sentence
      for (const sentence of sentences) {
        const audio = await voiceAgent.generate({
          input: { text: sentence.trim() },
          tools: [{
            name: "textToSpeech",
            provider: "elevenlabs",
            config: {
              voiceId,
              model: "eleven_multilingual_v2",
              streaming: true
            }
          }]
        });

        if (audio.toolResults?.textToSpeech?.audio) {
          yield Buffer.from(audio.toolResults.textToSpeech.audio, "base64");
        }
      }
    }
  }

  // Process any remaining text
  if (sentenceBuffer.trim()) {
    const finalAudio = await voiceAgent.generate({
      input: { text: sentenceBuffer.trim() },
      tools: [{
        name: "textToSpeech",
        provider: "elevenlabs",
        config: { voiceId, model: "eleven_multilingual_v2" }
      }]
    });

    if (finalAudio.toolResults?.textToSpeech?.audio) {
      yield Buffer.from(finalAudio.toolResults.textToSpeech.audio, "base64");
    }
  }
}
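As a quick sanity check on the sentence-buffering logic above: the regex only emits complete sentences and leaves any trailing partial text in the buffer for the next chunk:

```typescript
// The same regex speakResponse uses to detect complete sentences
const buffer = "Hello there! How are";
const sentences = buffer.match(/[^.!?]+[.!?]+/g) || [];
const remaining = buffer.slice(sentences.join("").length);

console.log(sentences); // ["Hello there!"]
console.log(remaining); // " How are" (buffered until punctuation arrives)
```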

5. Complete Voice Loop

Putting it all together:

// `speaker` is the npm PCM playback package
import Speaker from "speaker";

async function runVoiceAssistant() {
  const sessionId = `session-${Date.now()}`;
  const speaker = new Speaker({ channels: 1, bitDepth: 16, sampleRate: 24000 });

  console.log("🎤 Voice assistant started. Speak now...");

  // 1. Capture audio from microphone
  const audioStream = captureAudio();

  // 2. Transcribe speech to text
  const transcript = await transcribeStream(audioStream);
  console.log(`📝 You said: "${transcript}"`);

  // 3. Process with LLM (streaming)
  const textStream = processVoiceQuery(transcript, sessionId);

  // 4. Convert to speech (sentence-by-sentence for low latency)
  const audioOutput = speakResponse(
    (async function* () {
      for await (const chunk of textStream) {
        if (chunk.type === "text") {
          process.stdout.write(chunk.content);
          yield chunk.content;
        }
      }
    })()
  );

  // 5. Play audio response
  for await (const audioChunk of audioOutput) {
    speaker.write(audioChunk);
  }

  console.log("\n✅ Response complete");
}

// Run the assistant
runVoiceAssistant().catch(console.error);

Advanced Patterns

Interrupt Handling

Voice assistants need to handle interruptions gracefully:

class InterruptibleVoiceAgent {
  private currentStream: AbortController | null = null;

  async handleVoiceInput(audioStream: AsyncIterable<Buffer>) {
    // Cancel previous response
    this.currentStream?.abort();
    this.currentStream = new AbortController();

    try {
      const transcript = await this.transcribe(audioStream);

      // Check for interruption keywords
      if (this.isInterruption(transcript)) {
        await this.playInterruptionAck();
        return;
      }

      // transcribe/playInterruptionAck/generateResponse are elided here
      await this.generateResponse(transcript, this.currentStream.signal);
    } catch (error) {
      // Swallow aborts from cancelled responses; rethrow everything else
      if ((error as Error).name !== "AbortError") throw error;
    }
  }

  private isInterruption(text: string): boolean {
    const interruptionWords = ["stop", "wait", "hold on", "nevermind"];
    return interruptionWords.some(w => text.toLowerCase().includes(w));
  }
}

Multi-Language Voice AI

NeuroLink's unified API makes language switching seamless:

interface LanguageConfig {
  stt: string;
  tts: string;
  voice: string;
}

async function multilingualVoiceAgent(
  detectedLanguage: string,
  audioStream: AsyncIterable<Buffer>
) {
  const languageConfigs: Record<string, LanguageConfig> = {
    "en": { stt: "whisper", tts: "elevenlabs", voice: "Bella" },
    "es": { stt: "whisper", tts: "elevenlabs", voice: "Pedro" },
    "hi": { stt: "whisper", tts: "azure", voice: "hi-IN-SwaraNeural" },
  };

  // Fall back to English if the detected language has no config
  const config = languageConfigs[detectedLanguage] ?? languageConfigs["en"];

  return await neurolink.generate({
    input: { audio: audioStream },
    tools: [
      {
        name: "speechToText",
        config: { language: detectedLanguage }
      },
      {
        name: "textToSpeech",
        config: {
          voiceId: config.voice,
          language: detectedLanguage
        }
      }
    ]
  });
}

Voice Activity Detection (VAD)

Optimize costs by only processing speech:

import { MicVAD } from "@ricky0123/vad-web";

async function* voiceDetectedAudio(): AsyncIterable<Buffer> {
  const segments: Buffer[] = [];
  let wake: (() => void) | null = null;

  const vad = await MicVAD.new({
    onSpeechStart: () => console.log("🎙️ Speech detected"),
    onSpeechEnd: (audio) => {
      // Queue the finished speech segment for the consumer below
      segments.push(Buffer.from(audio.buffer));
      wake?.();
    },
  });

  vad.start();

  // Yield audio chunks only when a complete speech segment arrives
  while (true) {
    while (segments.length === 0) {
      await new Promise<void>((resolve) => (wake = resolve));
    }
    yield segments.shift()!;
  }
}

Production Considerations

Latency Optimization

Technique               Latency Impact
---------------------   ----------------------------
Streaming STT           -300ms
Sentence-level TTS      -500ms
Redis memory            -100ms (no context rebuild)
WebSocket transport     -50ms
Parallel TTS prefetch   -200ms
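
The parallel TTS prefetch row deserves a sketch: start synthesizing sentence N+1 while sentence N is still playing, so synthesis latency hides behind playback. `synthesize` and `play` here are hypothetical stand-ins for your TTS call and audio sink:

```typescript
// Overlap TTS synthesis of the next sentence with playback of the
// current one, instead of synthesizing and playing strictly in series.
async function playWithPrefetch(
  sentences: string[],
  synthesize: (text: string) => Promise<Buffer>,
  play: (audio: Buffer) => Promise<void>
): Promise<void> {
  let next: Promise<Buffer> | null = sentences.length
    ? synthesize(sentences[0])
    : null;

  for (let i = 0; i < sentences.length; i++) {
    const audio = await next!;
    // Kick off synthesis of the following sentence before playback starts
    next = i + 1 < sentences.length ? synthesize(sentences[i + 1]) : null;
    await play(audio); // synthesis of sentence i+1 overlaps this playback
  }
}
```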

Error Handling

const resilientVoiceAgent = new NeuroLink({
  // Auto-fallback if primary STT fails
  fallback: {
    providers: ["whisper", "deepgram", "assembly"],
    strategy: "sequential"
  },

  // Retry configuration
  retry: {
    attempts: 3,
    backoff: "exponential",
    maxDelay: 5000
  }
});

Cost Tracking

Monitor voice AI costs across providers:

neurolink.on("usage", (event) => {
  console.log(`STT: ${event.sttTokens} tokens`);
  console.log(`LLM: ${event.llmTokens} tokens`);
  console.log(`TTS: ${event.ttsCharacters} chars`);
  console.log(`Total: $${event.estimatedCost}`);
});

Web Integration Example

For browser-based voice apps:

// Server: WebSocket endpoint
import { WebSocketServer } from "ws";

const wss = new WebSocketServer({ port: 8080 });

wss.on("connection", (ws) => {
  const agent = new NeuroLink({
    provider: "anthropic",
    model: "claude-4-sonnet"
  });

  ws.on("message", async (audioData) => {
    const result = await agent.generate({
      input: {
        audio: audioData,
        mimeType: "audio/webm"
      },
      tools: ["speechToText", "textToSpeech"]
    });

    // Send audio response back
    ws.send(result.toolResults.textToSpeech.audio);
  });
});
// Client: Browser microphone
const ws = new WebSocket("ws://localhost:8080");

const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const mediaRecorder = new MediaRecorder(stream);

mediaRecorder.ondataavailable = (event) => {
  ws.send(event.data);
};

// Emit audio chunks every 250ms so the server sees a steady stream
mediaRecorder.start(250);

ws.onmessage = async (event) => {
  const audio = new Audio(URL.createObjectURL(event.data));
  await audio.play();
};

Conclusion

Voice AI doesn't need to be fragmented. With NeuroLink, you build speech-to-speech applications using the same patterns as text-based AI:

  • Unified API: One SDK for STT, LLM, and TTS
  • Streaming native: Real-time audio processing out of the box
  • Memory aware: Conversations persist across voice sessions
  • Provider agnostic: Switch STT/TTS providers without rewriting code

Whether you're building a customer service bot, a voice-enabled coding assistant, or a real-time translator, NeuroLink's streaming architecture handles the complexity so you can focus on the conversation.


NeuroLink — The Universal AI SDK for TypeScript
