A technical walkthrough of how the ABD Assistant voice command system works end-to-end, from raw microphone bytes to tool execution.
The Core Architecture
The system has three moving parts: a browser Web Audio capture layer, an Express WebSocket relay, and OpenAI's Realtime API as the voice brain.
The browser streams PCM audio directly to OpenAI via a WebSocket that stays open for the entire session. OpenAI performs server-side voice activity detection (VAD), transcribes speech incrementally, runs its LLM over the conversation history, and streams back audio tokens as they're generated. This means no client-side silence detection, no turn-management logic, and no separate transcription step — one pipeline, fully server-driven.
Audio Capture: The Hard Part
Capturing audio correctly is where most implementations fall apart. The key constraint: OpenAI's Realtime API expects mono PCM at 24kHz, 16-bit signed integers. Browser MediaRecorder produces audio/webm or audio/opus — a completely different format. The solution is a ScriptProcessorNode (or AudioWorklet):
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (ev) => {
  const inputData = ev.inputBuffer.getChannelData(0);
  const pcmData = new Int16Array(inputData.length);
  for (let i = 0; i < inputData.length; i++) {
    const s = Math.max(-1, Math.min(1, inputData[i]));
    pcmData[i] = s < 0 ? s * 32768 : s * 32767;
  }
  const base64 = btoa(String.fromCharCode(...new Uint8Array(pcmData.buffer)));
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64 }));
};
Each chunk gets base64-encoded and sent as a WebSocket message. No buffering, no batching — continuous streaming.
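The Float32-to-Int16 conversion inside the handler above can be pulled out as a pure helper, which makes the clamping behavior easy to unit-test. This is just a refactoring sketch; the helper name is my own:

```javascript
// Sketch: standalone Float32 → 16-bit PCM conversion, same math as the
// onaudioprocess handler above. Samples outside [-1, 1] are clamped first.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // The negative side of signed 16-bit is one wider (-32768..32767),
    // hence the asymmetric scale factors.
    pcm[i] = s < 0 ? s * 32768 : s * 32767;
  }
  return pcm;
}
```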
WebSocket Relay
The OpenAI Realtime API WebSocket URL requires your API key in a header, which browser WebSocket clients can't set. A thin Express server handles the relay:
wss.on("connection", (clientWs, req) => {
  const apiKey = new URL(req.url, `http://${req.headers.host}`).searchParams.get("api_key");
  if (apiKey !== process.env.VOICE_API_KEY) { clientWs.close(4401); return; }
  const openAiWs = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime-2025-08-28", {
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
  });
  openAiWs.on("message", (data) => clientWs.send(data.toString()));
  clientWs.on("message", (data) => openAiWs.send(data.toString()));
  clientWs.on("close", () => openAiWs.close());
});
All message passing is raw — no transformation. The relay is stateless.
GA Realtime API Session Config
The Generally Available Realtime API uses a different session schema than the beta. The session.update payload must include type: "realtime" and fields are organized under audio.input and audio.output:
const sessionUpdate = {
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2025-08-28",
    audio: {
      input: {
        format: { type: "audio/pcm", rate: 24000 },
        turn_detection: { type: "server_vad", prefix_padding_ms: 1000, silence_duration_ms: 400 }
      },
      output: { format: { type: "audio/pcm", rate: 24000 }, voice: "alloy" }
    },
    output_modalities: ["audio"]
  }
};
Key fields that trip up implementations: session.type must be "realtime" (not "conversation"), modalities is not a top-level field (use output_modalities: ["audio"]), and voice lives under audio.output.
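Since these mistakes fail only at runtime, a small guard that checks a payload against the gotchas listed above before sending it can save a debugging round-trip. A minimal sketch (the function and its error messages are my own, not part of any SDK):

```javascript
// Sketch: sanity-check a session.update payload against the GA schema
// pitfalls described above, before sending it over the socket.
function validateGaSession(update) {
  const errors = [];
  const s = update.session ?? {};
  if (s.type !== "realtime") errors.push('session.type must be "realtime"');
  if ("modalities" in s) errors.push('use output_modalities, not modalities');
  if (s.voice) errors.push("voice belongs under audio.output, not session.voice");
  return errors;
}
```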
Event Handling
The GA API emits different event names than the beta. The critical ones to handle:
- conversation.item.input_audio_transcription.delta → user transcript text
- response.output_audio.delta → AI audio chunk (bytes in event.delta, not event.audio)
- response.output_audio_transcript.delta → AI transcript
- input_audio_buffer.speech_started / speech_stopped → VAD state changes
- response.done → response complete, session continues
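One way to wire these up is a single dispatcher over the event type. The handler names below are placeholders for your own UI, playback, and tool code:

```javascript
// Sketch: route GA Realtime events to handlers. The handlers object is an
// assumed shape; optional chaining lets callers supply only what they need.
function routeRealtimeEvent(event, handlers) {
  switch (event.type) {
    case "conversation.item.input_audio_transcription.delta":
      handlers.onUserTranscript?.(event.delta);
      break;
    case "response.output_audio.delta":
      handlers.onAudioChunk?.(event.delta); // base64 PCM lives in event.delta
      break;
    case "response.output_audio_transcript.delta":
      handlers.onAiTranscript?.(event.delta);
      break;
    case "input_audio_buffer.speech_started":
    case "input_audio_buffer.speech_stopped":
      handlers.onVadChange?.(event.type.endsWith("started"));
      break;
    case "response.done":
      handlers.onResponseDone?.(event.response);
      break;
  }
}
```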
Tool Calling
The Realtime API supports function calling natively within the session. Define tools in the session.update:
sessionUpdate.session.tools = toolDefinitions.map(t => ({ type: "function", ...t }));
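For context, each entry in toolDefinitions is a plain function spec with a JSON Schema for its arguments; the spread in the mapping above just adds type: "function". The set_volume tool below is a made-up example, not part of the ABD system:

```javascript
// Sketch: an assumed shape for toolDefinitions. name, description, and a
// JSON Schema under parameters, matching the mapping shown above.
const toolDefinitions = [
  {
    name: "set_volume",
    description: "Set the playback volume",
    parameters: {
      type: "object",
      properties: { level: { type: "number", minimum: 0, maximum: 100 } },
      required: ["level"]
    }
  }
];

const tools = toolDefinitions.map((t) => ({ type: "function", ...t }));
```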
When the model emits response.done with output[0].type === "function_call", execute the tool client-side, then inject the result:
// name, args, and call_id come from the function_call item in response.done;
// output must be a string (serialize objects with JSON.stringify).
const result = await executeTool(name, args);
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: { type: "function_call_output", call_id, output: result }
}));
ws.send(JSON.stringify({ type: "response.create" }));
This triggers the model to respond with speech confirming the action.
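Putting both halves together, the detection and result-injection can live in one function. This is a sketch: sendJson is an assumed wrapper around ws.send, and executeTool is your own dispatcher:

```javascript
// Sketch: on response.done, check for a function call, run it, and feed
// the result back so the model confirms the action in speech.
async function maybeHandleToolCall(event, sendJson, executeTool) {
  const item = event.response?.output?.[0];
  if (!item || item.type !== "function_call") return false;
  const args = JSON.parse(item.arguments); // arguments arrive as a JSON string
  const result = await executeTool(item.name, args);
  sendJson({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: item.call_id, output: result }
  });
  sendJson({ type: "response.create" }); // prompt the spoken confirmation
  return true;
}
```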
Audio Playback: Sequential Queue
Streaming audio chunks arrive faster than real time and would overlap if each were played the moment it arrived. A simple FIFO queue with an isPlaying flag solves this: each chunk is decoded to an AudioBuffer and played through a BufferSourceNode, and the onended callback triggers the next chunk.
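A sketch of that queue, with the decode-and-play step injected so the scheduling logic stays testable. Here playChunk is a placeholder that must return a Promise resolving when its chunk finishes (in the browser, that is where the AudioBuffer decode, BufferSourceNode, and onended wiring live):

```javascript
// Sketch: FIFO playback queue. Chunks are played strictly one at a time,
// in arrival order; new chunks enqueued mid-playback are picked up by the
// running drain loop.
function createPlaybackQueue(playChunk) {
  const queue = [];
  let isPlaying = false;
  async function drain() {
    isPlaying = true;
    while (queue.length > 0) {
      await playChunk(queue.shift());
    }
    isPlaying = false;
  }
  return {
    enqueue(chunk) {
      queue.push(chunk);
      if (!isPlaying) drain();
    }
  };
}
```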
Challenges & Trade-offs
- ScriptProcessorNode deprecation: Use AudioWorklet in production; ScriptProcessorNode still works but Chrome shows a deprecation warning.
- No multi-voice: output_modalities: ["audio"] forces a single voice model; there's no way to get text + audio simultaneously from the same model output.
- No streaming input transcript: Input transcription arrives via input_audio_transcription.delta events, but only after the utterance ends, not streamed word-by-word like the audio output.
- Tool execution timing: When a tool executes, recording pauses; the model waits for the response.create call before continuing.
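For the first trade-off, a drop-in AudioWorklet replacement for the ScriptProcessorNode capture is sketched below. The processor source is kept as a string and loaded via a Blob URL so no separate file is needed; the onFrame callback would feed the same Float32-to-PCM conversion shown earlier. Names here (pcm-capture, attachPcmCapture) are my own:

```javascript
// Sketch: AudioWorklet-based capture. The worklet forwards raw Float32
// frames to the main thread over its MessagePort.
const workletSource = `
  class PcmCaptureProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0];
      if (channel) this.port.postMessage(channel.slice(0)); // copy the frame out
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm-capture", PcmCaptureProcessor);
`;

// Main-thread side (browser only): load the module, wire up messages,
// and connect the mic source node to the worklet node.
async function attachPcmCapture(audioContext, sourceNode, onFrame) {
  const url = URL.createObjectURL(new Blob([workletSource], { type: "application/javascript" }));
  await audioContext.audioWorklet.addModule(url);
  const node = new AudioWorkletNode(audioContext, "pcm-capture");
  node.port.onmessage = (ev) => onFrame(ev.data); // ev.data is a Float32Array
  sourceNode.connect(node);
  return node;
}
```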
For Your Own Project
The only ABD-specific parts are the tool definitions and executeTool mapping — swap those for your own API surface. Everything else — audio capture, WebSocket relay, session management, audio playback — is reusable infrastructure.