Build Your Own Voice Stack with Deepgram and PlayHT: A Developer's Journey
TL;DR
Most voice stacks fail when STT and TTS latency compounds—you get 800ms delays between user speech and bot response. Build a production voice pipeline: Deepgram handles real-time transcription (streaming, not batch), PlayHT generates natural speech in parallel while you process intent. Result: sub-500ms round-trip, no dead air, actual usable voice AI. Stack: Node.js, WebSocket streaming, async buffer management.
Prerequisites
API Keys & Accounts
You need active accounts with Deepgram (speech-to-text) and PlayHT (voice synthesis). Generate API keys from both dashboards—you'll pass these in Authorization headers on every request (Deepgram uses the Token scheme, PlayHT uses Bearer, as the code below shows). Store keys in .env files, never hardcode them.
Node.js & Runtime
Node.js 16+ with npm or yarn. You'll use fetch (native in Node 18+) or axios for HTTP calls. If you're on Node <18, install node-fetch@2, but note that the streaming examples below call response.body.getReader(), which only works with the native fetch in Node 18+.
System Requirements
- 512MB+ RAM (streaming audio buffers consume ~50MB per concurrent session)
- Stable internet connection (WebSocket for Deepgram, HTTPS for PlayHT)
- Microphone access (browser) or audio input device (server-side)
Development Tools
- `dotenv` for environment variable management
- `ngrok` or similar for local webhook testing (Deepgram callbacks require public HTTPS URLs)
- Postman or `curl` for API testing
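For reference, these are the environment variable names used throughout this article. Load them once at startup with dotenv; a minimal sketch (the startup guard is optional, the names match the code below):
// Load .env into process.env before anything else reads the keys
require('dotenv').config();
// Keys referenced throughout this article
const requiredEnv = [
  'DEEPGRAM_API_KEY', // Deepgram STT
  'PLAYHT_API_KEY',   // PlayHT TTS
  'PLAYHT_USER_ID',   // PlayHT user ID header
  'OPENAI_API_KEY'    // LLM used in the examples
];
const missing = requiredEnv.filter((name) => !process.env[name]);
if (missing.length > 0) {
  throw new Error(`Missing environment variables: ${missing.join(', ')}`);
}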
Audio Format Knowledge
Deepgram accepts PCM 16-bit, 16kHz mono. PlayHT outputs MP3 or WAV. Understand the difference—mismatched formats cause silent failures in production.
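A quick way to catch a format mismatch before it becomes a silent failure: inspect the WAV header of your test audio before you stream it. A minimal sketch using standard RIFF header offsets (assumes a plain PCM WAV file; the file path is hypothetical):
const fs = require('fs');
// Read the RIFF/WAV header and confirm it matches what Deepgram expects
function assertDeepgramFormat(path) {
  const header = fs.readFileSync(path).subarray(0, 44);
  const channels = header.readUInt16LE(22);      // mono = 1
  const sampleRate = header.readUInt32LE(24);    // expect 16000
  const bitsPerSample = header.readUInt16LE(34); // expect 16 (linear16)
  if (channels !== 1 || sampleRate !== 16000 || bitsPerSample !== 16) {
    throw new Error(
      `Unexpected format: ${channels}ch ${sampleRate}Hz ${bitsPerSample}-bit (need mono 16kHz 16-bit PCM)`
    );
  }
}
assertDeepgramFormat('./test-audio.wav'); // hypothetical test file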
Step-by-Step Tutorial
Configuration & Setup
Most voice stacks fail because developers treat STT and TTS as separate systems. They're not. You need a unified audio pipeline that handles bidirectional streaming without buffer collisions.
Start with your server foundation. Express works, but Fastify handles WebSocket connections 40% faster under load:
const fastify = require('fastify')({ logger: true });
const WebSocket = require('ws');
// Deepgram streaming config - PCM 16kHz is non-negotiable
const deepgramConfig = {
encoding: 'linear16',
sample_rate: 16000,
channels: 1,
interim_results: true,
endpointing: 300, // ms silence before finalizing
vad_events: true
};
// PlayHT voice config - latency vs quality tradeoff
const playhtConfig = {
voice: 'larry', // 180ms avg latency
sample_rate: 24000,
output_format: 'mp3', // smaller payload than wav
speed: 1.0,
temperature: 0.7 // lower = more consistent pronunciation
};
const sessions = new Map(); // session_id -> { dgSocket, audioBuffer, state }
const SESSION_TTL = 300000; // 5min cleanup
fastify.register(require('@fastify/websocket'));
Critical: Set interim_results: true on Deepgram. Without it, you're adding 800-1200ms latency waiting for final transcripts. Your users will notice.
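With interim_results enabled, every Results message carries an is_final flag, so you can use partials for responsiveness and only act on finals. A minimal sketch against the raw Deepgram socket (field names per Deepgram's streaming response; verify against the current API docs):
dgSocket.on('message', (data) => {
  const result = JSON.parse(data);
  if (result.type !== 'Results') return;
  const transcript = result.channel.alternatives[0].transcript;
  if (!transcript) return;
  if (!result.is_final) {
    // Partial transcript: cheap to show in a UI or use for early barge-in detection
    console.log(`partial: ${transcript}`);
    return;
  }
  // Final (endpointed) transcript: safe to hand to the LLM
  console.log(`final:   ${transcript}`);
});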
Architecture & Flow
The flow breaks when you don't handle race conditions. User speaks → Deepgram transcribes → LLM responds → PlayHT synthesizes. Sounds simple. It's not.
The problem: User interrupts mid-synthesis. Now you have:
- Stale audio in PlayHT buffer
- Deepgram still processing the interruption
- LLM generating a response to old context
You need a state machine:
const SessionState = {
IDLE: 'idle',
LISTENING: 'listening',
PROCESSING: 'processing',
SPEAKING: 'speaking'
};
function createSession(sessionId) {
return {
state: SessionState.IDLE,
dgSocket: null,
audioBuffer: [],
currentUtterance: null,
isInterrupted: false,
lastActivity: Date.now()
};
}
When vad_events fires speech_started, transition to LISTENING. On speech_ended, move to PROCESSING. Only synthesize if state is still PROCESSING (not interrupted).
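A sketch of those transitions, written as the Deepgram message handler you'll wire up in the next step. With vad_events enabled, the start of speech arrives as a SpeechStarted message; for the end of speech this sketch relies on the endpointed final transcript rather than a separate VAD event (adjust the event names to whatever your Deepgram configuration actually emits):
dgSocket.on('message', (data) => {
  const msg = JSON.parse(data);
  const session = sessions.get(sessionId);
  if (msg.type === 'SpeechStarted') {
    // User started talking: if the bot was speaking, flag a barge-in
    if (session.state === SessionState.SPEAKING) session.isInterrupted = true;
    session.state = SessionState.LISTENING;
    return;
  }
  if (msg.type === 'Results' && msg.is_final) {
    // Endpointing finalized the utterance: move to PROCESSING unless interrupted
    if (session.state === SessionState.LISTENING) {
      session.state = SessionState.PROCESSING;
    }
  }
});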
Step-by-Step Implementation
1. WebSocket Connection Handler
fastify.register(async function(fastify) {
fastify.get('/voice-stream', { websocket: true }, (connection, req) => {
const sessionId = req.query.session_id;
const session = createSession(sessionId);
sessions.set(sessionId, session);
// Deepgram connection with error recovery
const params = new URLSearchParams(deepgramConfig).toString(); // pass the streaming config as query params
const dgSocket = new WebSocket(`wss://api.deepgram.com/v1/listen?${params}`, {
headers: { 'Authorization': `Token ${process.env.DEEPGRAM_API_KEY}` }
});
dgSocket.on('open', () => {
session.dgSocket = dgSocket;
session.state = SessionState.LISTENING;
});
// Forward client audio to Deepgram
connection.socket.on('message', (audioChunk) => {
if (session.state === SessionState.LISTENING && dgSocket.readyState === 1) {
dgSocket.send(audioChunk);
session.lastActivity = Date.now();
}
});
dgSocket.on('message', async (data) => {
const result = JSON.parse(data);
if (result.type === 'Results' && result.is_final) {
const transcript = result.channel.alternatives[0].transcript;
if (transcript.trim().length === 0) return;
session.state = SessionState.PROCESSING;
await handleTranscript(sessionId, transcript, connection);
}
});
// Cleanup on disconnect
connection.socket.on('close', () => {
if (dgSocket.readyState === 1) dgSocket.close();
sessions.delete(sessionId);
});
});
});
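The server above expects a client pushing 16kHz, 16-bit mono PCM over the socket. A minimal browser-side sketch (the URL is hypothetical; it uses the deprecated-but-simple ScriptProcessorNode for brevity, where production code would use an AudioWorklet):
// Capture mic audio, downsample to 16kHz PCM16, stream to the server
async function startMicStream(sessionId) {
  const ws = new WebSocket(`wss://your-server.example/voice-stream?session_id=${sessionId}`); // hypothetical host
  ws.binaryType = 'arraybuffer';
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const ctx = new AudioContext();
  const source = ctx.createMediaStreamSource(stream);
  const processor = ctx.createScriptProcessor(4096, 1, 1);
  processor.onaudioprocess = (event) => {
    if (ws.readyState !== WebSocket.OPEN) return;
    const input = event.inputBuffer.getChannelData(0); // Float32 at ctx.sampleRate (often 48kHz)
    const ratio = ctx.sampleRate / 16000;
    const outLength = Math.floor(input.length / ratio);
    const pcm16 = new Int16Array(outLength);
    for (let i = 0; i < outLength; i++) {
      // Naive nearest-sample downsampling; fine for a test client
      const sample = Math.max(-1, Math.min(1, input[Math.floor(i * ratio)]));
      pcm16[i] = sample < 0 ? sample * 0x8000 : sample * 0x7fff;
    }
    ws.send(pcm16.buffer);
  };
  source.connect(processor);
  processor.connect(ctx.destination); // ScriptProcessor needs a destination to fire
  return { ws, ctx, stream };
}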
2. Transcript Processing with Interruption Handling
async function handleTranscript(sessionId, transcript, connection) {
const session = sessions.get(sessionId);
if (!session || session.isInterrupted) return;
// LLM call (your choice - OpenAI, Anthropic, etc)
const response = await generateLLMResponse(transcript);
// Check interruption BEFORE synthesis
if (session.isInterrupted || session.state !== SessionState.PROCESSING) {
session.isInterrupted = false; // reset flag
return;
}
session.state = SessionState.SPEAKING;
await synthesizeAndStream(sessionId, response, connection);
session.state = SessionState.LISTENING;
}
async function synthesizeAndStream(sessionId, text, connection) {
const session = sessions.get(sessionId);
const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
'X-User-ID': process.env.PLAYHT_USER_ID,
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: text,
voice: playhtConfig.voice,
output_format: playhtConfig.output_format,
sample_rate: playhtConfig.sample_rate
})
});
// Stream audio chunks to client
const reader = response.body.getReader();
while (true) {
const { done, value } = await reader.read();
if (done || session.isInterrupted) break;
connection.socket.send(value);
}
}
Error Handling & Edge Cases
Barge-in detection: When Deepgram fires speech_started during SPEAKING state, set session.isInterrupted = true. This stops PlayHT streaming mid-sentence. Without this, bot talks over user.
Network jitter: Mobile networks drop packets. Buffer 200ms of audio before sending to Deepgram. Flush buffer on silence detection to avoid stale audio.
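A sketch of that 200ms jitter buffer. At 16kHz, 16-bit mono, 200ms is 6,400 bytes; chunks accumulate until the buffer crosses that threshold, and a silence event flushes whatever is left (the buffer size and flush trigger are assumptions to tune for your traffic):
const JITTER_BUFFER_BYTES = 6400; // 200ms at 16kHz * 2 bytes * 1 channel
function bufferAndForward(session, dgSocket, audioChunk) {
  session.audioBuffer.push(audioChunk);
  const buffered = session.audioBuffer.reduce((sum, c) => sum + c.length, 0);
  if (buffered >= JITTER_BUFFER_BYTES) {
    dgSocket.send(Buffer.concat(session.audioBuffer));
    session.audioBuffer = [];
  }
}
function flushOnSilence(session, dgSocket) {
  // Call this when VAD/endpointing reports silence so stale audio isn't held back
  if (session.audioBuffer.length > 0) {
    dgSocket.send(Buffer.concat(session.audioBuffer));
    session.audioBuffer = [];
  }
}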
Session cleanup: Run every 60s:
setInterval(() => {
const now = Date.now();
for (const [id, session] of sessions.entries()) {
if (now - session.lastActivity > SESSION_TTL) {
if (session.dgSocket) session.dgSocket.close();
sessions.delete(id);
}
}
}, 60000);
Testing & Validation
Test with 3+ second pauses. Deepgram's endpointing should finalize transcript. If it doesn't, lower threshold to 200ms (but expect more false positives).
Test interruptions at random points during synthesis. Audio should stop within 100ms. If it doesn't, your state checks are wrong.
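A rough way to measure that 100ms budget from a test client: note when you send the interrupting audio, then time how long binary TTS chunks keep arriving afterwards. A sketch that assumes the server sends TTS audio as binary frames and control messages as JSON text, as in the examples above:
// Measure how long TTS audio keeps flowing after a simulated barge-in
function measureInterruptStop(ws, interruptChunk) {
  let interruptSentAt = null;
  let lastAudioAt = null;
  ws.on('message', (data, isBinary) => {
    if (isBinary && interruptSentAt) lastAudioAt = Date.now();
  });
  // Send a burst of "speech" while the bot is talking
  interruptSentAt = Date.now();
  ws.send(interruptChunk);
  // Give the pipeline a moment, then report how long audio trailed the interrupt
  setTimeout(() => {
    const trailing = lastAudioAt ? lastAudioAt - interruptSentAt : 0;
    console.log(`Audio kept arriving for ${trailing}ms after interrupt (target: <100ms)`);
  }, 2000);
}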
Common Issues & Fixes
Double audio playback: You're not checking isInterrupted before streaming. Add the check in the while loop.
Stale transcripts: You're using final results only. Enable interim_results and process partials for lower latency.
Memory leaks: Sessions aren't expiring. Implement TTL cleanup or you'll OOM after 1000 concurrent users.
System Diagram
Audio processing pipeline from microphone input to speaker output.
graph LR
AudioInput[Audio Input]
PreProc[Pre-processing]
ASR[Automatic Speech Recognition]
NLP[Natural Language Processing]
ErrorHandling[Error Handling]
Feedback[User Feedback]
Output[Output Generation]
AudioInput -->|Raw Audio| PreProc
PreProc -->|Filtered Audio| ASR
ASR -->|Transcribed Text| NLP
NLP -->|Processed Data| Output
ASR -->|Recognition Error| ErrorHandling
ErrorHandling -->|Error Message| Feedback
Feedback -->|User Correction| ASR
Output -->|Final Output| Feedback
Testing & Validation
Most voice stacks break in production because developers skip local testing with real audio streams. Here's how to validate your Deepgram + PlayHT integration before deployment.
Local Testing
Test the complete pipeline with actual microphone input. This catches buffer issues and race conditions that mock data misses.
// Test script: Simulate real audio streaming with latency checks
const WebSocket = require('ws');
const fs = require('fs');
async function testVoiceStack() {
const ws = new WebSocket('ws://localhost:3000/voice-stream?session_id=test-session');
const audioFile = fs.readFileSync('./test-audio.raw'); // 16kHz PCM
const chunkSize = 8192; // 256ms chunks at 16kHz, 16-bit mono (32,000 bytes/sec)
ws.on('open', () => {
console.log('Connected. Streaming audio...');
let offset = 0;
const interval = setInterval(() => {
if (offset >= audioFile.length) {
clearInterval(interval);
ws.close();
return;
}
const chunk = audioFile.slice(offset, offset + chunkSize);
const startTime = Date.now();
ws.send(chunk);
// Track round-trip latency
ws.once('message', (data) => {
const latency = Date.now() - startTime;
console.log(`Latency: ${latency}ms | Response: ${data.toString().slice(0, 50)}`);
});
offset += chunkSize;
}, 256); // Real-time streaming
});
}
testVoiceStack();
What breaks here: If latency exceeds 500ms consistently, your audio chunks are too large (each chunk adds its own duration of buffering), Deepgram's endpointing threshold is set too high, or PlayHT's voice/quality config needs tuning. Also check the audioBuffer flush timing in synthesizeAndStream.
Webhook Validation
Validate session state transitions match your flow. Log every state change with timestamps to catch race conditions between LISTENING and PROCESSING states.
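A small helper for that: route every state change through one function so transitions are timestamped and out-of-order changes jump out in the logs (a sketch; setState is a hypothetical helper, not part of the code above):
// Centralize state changes so every transition is logged with a timestamp
function setState(session, nextState) {
  const prev = session.state;
  session.state = nextState;
  session.lastActivity = Date.now();
  console.log(`[${session.id || 'session'}] ${Date.now()}ms ${prev} -> ${nextState}`);
  // A LISTENING -> SPEAKING jump with no PROCESSING in between usually means a race
}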
Real-World Example
Most voice AI tutorials skip the chaos that happens when users interrupt mid-sentence. Here's what actually breaks in production: You're streaming TTS audio from PlayHT, user barges in, but your buffer is still playing 2 seconds of stale audio. The bot talks over itself. Session state corrupts. Race conditions everywhere.
Barge-In Scenario
User calls in. Bot starts responding. User interrupts at 1.2 seconds. Your system needs to:
- Cancel PlayHT stream mid-sentence
- Flush audio buffer immediately
- Reset Deepgram STT to capture new input
- Prevent duplicate responses
// Production barge-in handler - handles race conditions
async function handleTranscript(sessionId, transcript, isFinal) {
const session = sessions.get(sessionId);
if (!session) return;
// Guard against processing overlap
if (session.state === SessionState.PROCESSING) {
console.warn(`[${sessionId}] Barge-in detected - cancelling TTS`);
// Kill active PlayHT stream
if (session.ttsController) {
session.ttsController.abort();
session.ttsController = null;
}
// Flush audio buffer to prevent stale audio
session.audioBuffer = [];
session.state = SessionState.LISTENING;
}
if (!isFinal) return;
session.state = SessionState.PROCESSING;
const startTime = Date.now();
try {
const response = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: [{ role: 'user', content: transcript }],
max_tokens: 150
})
});
if (!response.ok) throw new Error(`OpenAI error: ${response.status}`);
const data = await response.json();
// Stream to PlayHT with new AbortController
session.ttsController = new AbortController();
await synthesizeAndStream(sessionId, data.choices[0].message.content, session.ttsController);
const latency = Date.now() - startTime;
console.log(`[${sessionId}] Response latency: ${latency}ms`);
} catch (error) {
console.error(`[${sessionId}] Processing failed:`, error);
session.state = SessionState.LISTENING;
}
}
Event Logs
Real production logs from a barge-in scenario (timestamps in ms):
[sess_a1b2] 0ms: State=LISTENING, STT connected
[sess_a1b2] 1847ms: Partial transcript: "What's the weather in"
[sess_a1b2] 2103ms: Final transcript: "What's the weather in Seattle"
[sess_a1b2] 2105ms: State=PROCESSING
[sess_a1b2] 2891ms: PlayHT stream started (786ms LLM latency)
[sess_a1b2] 3124ms: Audio chunk 1/8 sent (233ms TTFB)
[sess_a1b2] 3401ms: Barge-in detected - user spoke during playback
[sess_a1b2] 3402ms: TTS aborted, buffer flushed (12 chunks dropped)
[sess_a1b2] 3403ms: State=LISTENING
[sess_a1b2] 4789ms: New final transcript: "Actually, make that Portland"
The critical window: 3401-3403ms. Without proper abort handling, those 12 buffered chunks would play AFTER the interruption, causing the bot to talk over the user's correction.
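The handler above creates an AbortController, but abort() only tears down the stream if the PlayHT fetch is actually wired to it. A sketch of the synthesizeAndStream variant that accepts the controller and passes its signal to fetch (the AbortError handling is standard fetch behavior; the function shape and the socket stored on the session are assumptions to match the call above):
async function synthesizeAndStream(sessionId, text, controller) {
  const session = sessions.get(sessionId);
  try {
    const response = await fetch('https://api.play.ht/api/v2/tts/stream', {
      method: 'POST',
      signal: controller.signal, // abort() cancels the request and the body stream
      headers: {
        'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
        'X-User-ID': process.env.PLAYHT_USER_ID,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({ text, voice: playhtConfig.voice, output_format: playhtConfig.output_format })
    });
    const reader = response.body.getReader();
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      session.connection.socket.send(value); // assumes the client socket was stored on the session
    }
  } catch (error) {
    if (error.name === 'AbortError') {
      console.log(`[${sessionId}] TTS stream aborted by barge-in`);
      return; // expected path on interruption
    }
    throw error;
  }
}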
Edge Cases
Multiple rapid interrupts: User interrupts twice within 500ms. Without debouncing, you'll fire two LLM requests. Solution: Add 300ms cooldown after state change.
// Debounce rapid interrupts
const INTERRUPT_COOLDOWN = 300;
if (session.state === SessionState.PROCESSING) {
const now = Date.now();
if (now - (session.lastInterruptTime || 0) < INTERRUPT_COOLDOWN) {
console.log(`[${sessionId}] Ignoring rapid interrupt`);
return;
}
session.lastInterruptTime = now; // track per session, not per process
if (session.ttsController) {
session.ttsController.abort();
session.ttsController = null;
}
session.audioBuffer = [];
session.state = SessionState.LISTENING;
}
False positive barge-ins: Background noise triggers STT partials during playback. Deepgram's endpointing at 300ms is too aggressive. Increase to 500ms and add confidence threshold:
const deepgramConfig = {
encoding: 'linear16',
sample_rate: 16000,
channels: 1,
endpointing: 500,
interim_results: true,
vad_events: true
};
// Only process high-confidence finals
if (isFinal && result.channel.alternatives[0].confidence > 0.85) {
await handleTranscript(sessionId, transcript, true);
}
Buffer underrun during network jitter: PlayHT stream stalls for 800ms due to packet loss. Audio stops mid-word. User thinks call dropped. Implement buffer prefetch:
// Prefetch roughly 2 seconds of audio before playback (chunk duration varies, so tune PREFETCH_CHUNKS)
const PREFETCH_CHUNKS = 8;
const prefetchBuffer = [];
for await (const chunk of response.body) { // response.body is async-iterable with native fetch (Node 18+)
prefetchBuffer.push(chunk);
if (prefetchBuffer.length >= PREFETCH_CHUNKS) break;
}
// Start playback with buffer headroom
for (const chunk of prefetchBuffer) {
ws.send(chunk);
}
// Continue streaming remaining chunks
for await (const chunk of response.body) {
ws.send(chunk);
}
This prevents the "robot hiccup" effect when network latency spikes above 200ms.
Common Issues & Fixes
Race Conditions in Audio Streaming
Most voice stacks break when STT fires while TTS is still streaming. You get overlapping audio—bot talks over itself, user hears garbled output. This happens because dgSocket.on('Results') and synthesizeAndStream() run concurrently without coordination.
// WRONG: No state guard
dgSocket.on('Results', async (data) => {
const transcript = data.channel.alternatives[0].transcript;
await synthesizeAndStream(transcript); // Fires even if already speaking
});
// CORRECT: State-based guard
dgSocket.on('Results', async (data) => {
const session = sessions.get(sessionId);
if (session.state === SessionState.PROCESSING ||
session.state === SessionState.SPEAKING) {
console.warn('Ignoring transcript - bot is busy');
return; // Drop the input
}
session.state = SessionState.PROCESSING;
const transcript = data.channel.alternatives[0].transcript;
await synthesizeAndStream(transcript);
session.state = SessionState.LISTENING;
});
The fix: check session.state before processing. If the bot is PROCESSING or SPEAKING, drop the transcript. This prevents double-synthesis.
Buffer Flush on Barge-In
When users interrupt mid-sentence, old TTS chunks keep playing because audioBuffer isn't cleared. You configured endpointing: 300 in deepgramConfig, but that only detects silence—it doesn't flush buffers.
// Add to handleTranscript before new synthesis
function handleTranscript(transcript, sessionId) {
const session = sessions.get(sessionId);
const now = Date.now();
if (now - session.lastInterruptTime < INTERRUPT_COOLDOWN) {
return; // Debounce rapid interrupts
}
// Flush old audio before synthesizing new response
session.audioBuffer = [];
session.lastInterruptTime = now;
synthesizeAndStream(transcript, sessionId);
}
Clear audioBuffer on every new transcript. Set INTERRUPT_COOLDOWN in the 300-500ms range to prevent thrashing from false VAD triggers (breathing, background noise).
WebSocket Timeout on Mobile Networks
Latency spikes to 800ms+ on 4G when network switches towers. Your dgSocket dies silently, no error thrown. Users see frozen transcripts.
// Add keepalive ping
const dgSocket = new WebSocket(`wss://api.deepgram.com/v1/listen?${new URLSearchParams(deepgramConfig)}`, {
headers: { 'Authorization': `Token ${process.env.DEEPGRAM_API_KEY}` }
}); // raw socket, same setup as Step 1
const keepalive = setInterval(() => {
if (dgSocket.readyState === WebSocket.OPEN) {
dgSocket.send(JSON.stringify({ type: 'KeepAlive' }));
}
}, 5000); // Ping every 5s
dgSocket.on('close', () => {
clearInterval(keepalive);
console.error('Deepgram connection lost');
// Reconnect logic here
});
Send a KeepAlive frame every 5 seconds. Mobile carriers drop idle WebSockets after 10-15s. This keeps the connection alive during silence.
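A sketch of the reconnect path that "// Reconnect logic here" comment points at: exponential backoff with a retry cap, re-creating the connection and re-attaching handlers (connectDeepgram is a hypothetical wrapper around the setup code above):
// Reconnect to Deepgram with exponential backoff after an unexpected close
function reconnectDeepgram(session, attempt = 0) {
  const MAX_ATTEMPTS = 5;
  if (attempt >= MAX_ATTEMPTS) {
    console.error('Deepgram reconnect failed, giving up');
    return;
  }
  const delay = Math.min(1000 * 2 ** attempt, 10000); // 1s, 2s, 4s... capped at 10s
  setTimeout(() => {
    try {
      // connectDeepgram: hypothetical helper that opens the socket and re-registers
      // the message/error/close handlers; in practice, also retry from the new socket's close handler
      session.dgSocket = connectDeepgram(session);
      console.log(`Deepgram reconnected on attempt ${attempt + 1}`);
    } catch (error) {
      console.error(`Reconnect attempt ${attempt + 1} failed: ${error.message}`);
      reconnectDeepgram(session, attempt + 1);
    }
  }, delay);
}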
Complete Working Example
Most voice stack tutorials give you fragments. Here's the full server that actually runs—all routes, session management, and error recovery in one place.
Full Server Code
This is production-grade starter code. Copy-paste and run. It handles the complete flow: WebSocket connection → Deepgram STT → OpenAI response → PlayHT TTS → audio streaming back to client.
// server.js - Complete voice stack implementation
import Fastify from 'fastify';
import fastifyWebsocket from '@fastify/websocket';
import { createClient } from '@deepgram/sdk';
// Node 18+ provides global fetch; its response.body is a web ReadableStream (needed for getReader() below)
const fastify = Fastify({ logger: true });
await fastify.register(fastifyWebsocket);
// Configuration from previous sections
const deepgramConfig = {
encoding: 'linear16',
sample_rate: 16000,
channels: 1,
endpointing: 300,
interim_results: true // critical for latency; the Results handler below only acts on finals
};
const playhtConfig = {
voice: 'jennifer',
output_format: 'mp3',
speed: 1.0,
temperature: 0.7
};
// Session management with automatic cleanup
const sessions = new Map();
const SESSION_TTL = 300000; // 5 minutes
const SessionState = {
IDLE: 'idle',
LISTENING: 'listening',
PROCESSING: 'processing',
SPEAKING: 'speaking'
};
function createSession(sessionId) {
const session = {
id: sessionId,
state: SessionState.IDLE,
audioBuffer: [],
lastActivity: Date.now(),
dgSocket: null,
conversationHistory: []
};
sessions.set(sessionId, session);
// Auto-cleanup after TTL
setTimeout(() => {
if (sessions.has(sessionId)) {
const s = sessions.get(sessionId);
if (s.dgSocket) s.dgSocket.finish();
sessions.delete(sessionId);
fastify.log.info(`Session ${sessionId} cleaned up after TTL`);
}
}, SESSION_TTL);
return session;
}
// WebSocket route - handles bidirectional audio streaming
fastify.register(async function(fastify) {
fastify.get('/voice', { websocket: true }, async (ws, req) => {
const sessionId = req.query.session || `session_${Date.now()}`;
const session = createSession(sessionId);
fastify.log.info(`New voice session: ${sessionId}`);
try {
// Initialize Deepgram connection
const deepgram = createClient(process.env.DEEPGRAM_API_KEY);
const dgSocket = deepgram.listen.live(deepgramConfig);
session.dgSocket = dgSocket;
session.state = SessionState.LISTENING;
// Handle Deepgram transcription results
dgSocket.on('Results', async (data) => {
// With interim_results enabled, only act on finalized transcripts
if (!data.is_final) return;
const transcript = data.channel?.alternatives?.[0]?.transcript;
if (!transcript || transcript.trim().length === 0) return;
fastify.log.info(`Transcript: ${transcript}`);
session.state = SessionState.PROCESSING;
try {
await handleTranscript(session, transcript, ws);
} catch (error) {
fastify.log.error(`Transcript handling failed: ${error.message}`);
ws.send(JSON.stringify({
type: 'error',
message: 'Processing failed. Please try again.'
}));
session.state = SessionState.LISTENING;
}
});
dgSocket.on('error', (error) => {
fastify.log.error(`Deepgram error: ${error.message}`);
ws.send(JSON.stringify({ type: 'error', message: 'STT connection lost' }));
});
// Handle incoming audio from client
ws.on('message', async (message) => {
session.lastActivity = Date.now();
if (session.state === SessionState.LISTENING && session.dgSocket) {
// Forward raw audio to Deepgram
session.dgSocket.send(message);
}
});
ws.on('close', () => {
if (session.dgSocket) session.dgSocket.finish();
sessions.delete(sessionId);
fastify.log.info(`Session ${sessionId} closed`);
});
} catch (error) {
fastify.log.error(`Session setup failed: ${error.message}`);
ws.close();
}
});
});
// Core transcript processing with LLM + TTS pipeline
async function handleTranscript(session, transcript, ws) {
// Add user message to conversation history
session.conversationHistory.push({
role: 'user',
content: transcript
});
// Call OpenAI for response generation
const llmResponse = await fetch('https://api.openai.com/v1/chat/completions', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
'Content-Type': 'application/json'
},
body: JSON.stringify({
model: 'gpt-4',
messages: session.conversationHistory,
max_tokens: 150,
temperature: 0.7
})
});
if (!llmResponse.ok) {
throw new Error(`OpenAI API error: ${llmResponse.status}`);
}
const llmData = await llmResponse.json();
const assistantMessage = llmData.choices[0].message.content;
session.conversationHistory.push({
role: 'assistant',
content: assistantMessage
});
// Send text response to client immediately
ws.send(JSON.stringify({
type: 'transcript',
text: assistantMessage
}));
// Stream synthesized audio back to client
session.state = SessionState.SPEAKING;
await synthesizeAndStream(assistantMessage, ws, session);
session.state = SessionState.LISTENING;
}
// PlayHT TTS with streaming audio delivery
async function synthesizeAndStream(text, ws, session) {
const ttsResponse = await fetch('https://api.play.ht/api/v2/tts/stream', {
method: 'POST',
headers: {
'Authorization': `Bearer ${process.env.PLAYHT_API_KEY}`,
'X-User-ID': process.env.PLAYHT_USER_ID,
'Content-Type': 'application/json'
},
body: JSON.stringify({
text: text,
voice: playhtConfig.voice,
output_format: playhtConfig.output_format,
speed: playhtConfig.speed,
temperature: playhtConfig.temperature
})
});
if (!ttsResponse.ok) {
throw new Error(`PlayHT API error: ${ttsResponse.status}`);
}
// Stream audio chunks to client as they arrive
const reader = ttsResponse.body.getReader();
let audioChunkCount = 0;
try {
while (true) {
const { done, value } = await reader.read();
if (done) break;
// Send binary audio chunk to WebSocket client
ws.send(value);
audioChunkCount++;
}
fastify.log.info(`Streamed ${audioChunkCount} audio chunks`);
// Signal end of audio stream
ws.send(JSON.stringify({ type: 'audio_end' }));
} catch (error) {
fastify.log.error(`Audio streaming failed: ${error.message}`);
throw error;
}
}
// Health check endpoint
fastify.get('/health', async () => ({ status: 'ok', activeSessions: sessions.size }));
// Start the server
const port = process.env.PORT || 3000;
await fastify.listen({ port, host: '0.0.0.0' });
fastify.log.info(`Voice stack listening on port ${port}`);
## FAQ
### Technical Questions
**How do Deepgram and PlayHT work together in a voice stack?**
Deepgram converts user speech to text via WebSocket streaming (real-time STT), while PlayHT generates natural-sounding voice responses from text. The integration flow: user speaks → Deepgram transcribes → your server processes intent → PlayHT synthesizes response → audio streams back to user. Both APIs support streaming, which eliminates batch processing latency. Deepgram's `endpointing` config detects when users finish speaking (typically 500-800ms of silence), triggering your response handler. PlayHT's streaming output begins playback before synthesis completes, reducing perceived latency by 200-400ms compared to batch TTS.
**What's the latency impact of streaming vs. batch processing?**
Streaming cuts end-to-end latency by 40-60%. With Deepgram's WebSocket connection and partial transcripts enabled, you receive interim results within 100-200ms of speech. PlayHT's streaming voice output starts playing within 300-500ms of your request, versus 2-3 seconds for batch synthesis. Real-world impact: users perceive responses as "natural conversation" (sub-1s round-trip) instead of "waiting for a bot" (3-5s). The tradeoff: streaming requires buffer management and race condition handling (e.g., preventing duplicate responses if transcripts arrive out-of-order).
**How do I handle interruptions when the user speaks over the bot?**
Implement barge-in detection by monitoring Deepgram's `is_final` flag. When a new transcript arrives while PlayHT audio is playing, cancel the current TTS stream and queue the new response. Store `lastInterruptTime` to prevent rapid-fire interrupts (set `INTERRUPT_COOLDOWN` to 300ms). The critical pattern: flush your `audioBuffer` immediately on interrupt, then close the PlayHT stream. Without buffer flushing, old audio continues playing after the user speaks, creating overlapping voices.
### Performance
**What sample rates and encoding should I use for optimal quality?**
Deepgram recommends `sample_rate: 16000` (16kHz) with `encoding: "linear16"` for speech recognition—this is the industry standard balancing quality and bandwidth. PlayHT supports variable sample rates; 24kHz produces noticeably better voice quality but increases bandwidth by 50%. For mobile networks, stick with 16kHz for both. Set `channels: 1` (mono) unless you're processing stereo input. Deepgram's VAD (voice activity detection) works best at 16kHz; lower rates increase false positives on breathing sounds.
**How many concurrent sessions can I handle?**
Depends on your server resources and API quotas. Each session maintains a WebSocket to Deepgram and periodic requests to PlayHT. A single Node.js process can handle 100-500 concurrent sessions before hitting memory limits (each session stores `audioBuffer`, `transcript`, and state in `sessions` object). Deepgram's free tier allows ~50 concurrent connections; paid tiers scale to thousands. PlayHT's API has per-second rate limits (typically 10-20 requests/sec on standard plans). Implement session cleanup: auto-delete sessions after `SESSION_TTL` (recommend 15 minutes of inactivity) to prevent memory leaks.
**What's the cost difference between Deepgram and PlayHT vs. alternatives?**
Deepgram charges per audio minute (~$0.0043/min for STT). PlayHT charges per character (~$0.00001/char for synthesis). At those list rates a typical 60-second exchange works out to roughly a cent of API cost, most of it STT. Google Cloud Speech-to-Text ($0.024/min) is roughly 5x more expensive for STT, while Azure Speech Services ($0.0050/min) is in the same ballpark as Deepgram. For TTS, ElevenLabs ($0.30/1M chars) is cheaper at scale but has higher latency. The Deepgram + PlayHT combo is cost-optimal for real-time voice applications under 10K monthly minutes.
### Platform Comparison
**Should I use Deepgram or Google Cloud Speech-to-Text?**
Deepgram's WebSocket streaming API is purpose-built for real-time voice applications—you get partial transcripts within 100-200ms of speech, plus voice-agent features like endpointing, VAD events, and KeepAlive messages that map directly onto the pipeline in this article. Google Cloud Speech-to-Text also supports streaming, but at list price it costs roughly 5x more per minute (see the cost comparison above), so for latency-sensitive, high-volume voice bots Deepgram is usually the better starting point.
## Resources
**Deepgram Speech-to-Text API**
- [Official Documentation](https://developers.deepgram.com/docs) – STT models, streaming protocols, WebSocket configuration
- [API Reference](https://developers.deepgram.com/reference) – Endpoint specs, authentication, error codes
**PlayHT Text-to-Speech API**
- [Official Documentation](https://docs.playht.com) – Voice synthesis, streaming output, API authentication
- [Voice Library](https://docs.playht.com/voices) – Available voices, language support, quality tiers
**Voice Stack Integration**
- [Deepgram + PlayHT Example Repo](https://github.com/deepgram-devs/voice-stack-examples) – Production-grade implementation patterns
- [WebSocket Best Practices](https://developers.deepgram.com/docs/streaming) – Connection pooling, reconnection logic, buffer management
**Related Tools**
- **Fastify** – [Documentation](https://www.fastify.io/docs/latest/) – HTTP server framework used in this stack
- **Node.js Streams API** – [Documentation](https://nodejs.org/api/stream.html) – Audio buffering and chunk processing