A technical walkthrough of how the ABD Assistant voice command system works end-to-end, from raw microphone bytes to tool execution.
The Core Architecture
The system has three moving parts: a browser Web Audio capture layer, an Express WebSocket relay, and OpenAI's Realtime API as the voice brain.
The browser streams PCM audio directly to OpenAI via a WebSocket that stays open for the entire session. OpenAI performs server-side voice activity detection (VAD), transcribes speech incrementally, runs its LLM over the conversation history, and streams back audio tokens as they're generated. This means no client-side silence detection, no turn-management logic, and no separate transcription step — one pipeline, fully server-driven.
Audio Capture: The Hard Part
Capturing audio correctly is where most implementations fall apart. The key constraint: OpenAI's Realtime API expects mono PCM at 24kHz, 16-bit signed integers. Browser MediaRecorder produces audio/webm or audio/opus — a completely different format. The solution is a ScriptProcessorNode (or AudioWorklet):
const processor = audioContext.createScriptProcessor(4096, 1, 1);
processor.onaudioprocess = (ev) => {
  const inputData = ev.inputBuffer.getChannelData(0);
  const pcmData = new Int16Array(inputData.length);
  for (let i = 0; i < inputData.length; i++) {
    const s = Math.max(-1, Math.min(1, inputData[i]));
    pcmData[i] = s < 0 ? s * 32768 : s * 32767;
  }
  const base64 = btoa(String.fromCharCode(...new Uint8Array(pcmData.buffer)));
  ws.send(JSON.stringify({ type: "input_audio_buffer.append", audio: base64 }));
};
Each chunk gets base64-encoded and sent as a WebSocket message. No buffering, no batching — continuous streaming.
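The Float32-to-Int16 conversion inside the handler above can be pulled out as a pure helper, which makes the clamping behavior easy to unit-test. This is just a refactoring sketch; the helper name is my own:

```javascript
// Sketch: standalone Float32 → 16-bit PCM conversion, same math as the
// onaudioprocess handler above. Samples outside [-1, 1] are clamped first.
function floatTo16BitPCM(float32Samples) {
  const pcm = new Int16Array(float32Samples.length);
  for (let i = 0; i < float32Samples.length; i++) {
    const s = Math.max(-1, Math.min(1, float32Samples[i]));
    // The negative side of signed 16-bit is one wider (-32768..32767),
    // hence the asymmetric scale factors.
    pcm[i] = s < 0 ? s * 32768 : s * 32767;
  }
  return pcm;
}
```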
WebSocket Relay
The OpenAI Realtime API WebSocket URL requires your API key in a header, which browser WebSocket clients can't set. A thin Express server handles the relay:
wss.on("connection", (clientWs, req) => {
  const apiKey = new URL(req.url, `http://${req.headers.host}`).searchParams.get("api_key");
  if (apiKey !== process.env.VOICE_API_KEY) { clientWs.close(4401); return; }
  const openAiWs = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime-2025-08-28", {
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` }
  });
  openAiWs.on("message", (data) => clientWs.send(data.toString()));
  clientWs.on("message", (data) => openAiWs.send(data.toString()));
  clientWs.on("close", () => openAiWs.close());
});
All message passing is raw — no transformation. The relay is stateless.
GA Realtime API Session Config
The Generally Available Realtime API uses a different session schema than the beta. The session.update payload must include type: "realtime" and fields are organized under audio.input and audio.output:
const sessionUpdate = {
  type: "session.update",
  session: {
    type: "realtime",
    model: "gpt-realtime-2025-08-28",
    audio: {
      input: {
        format: { type: "audio/pcm", rate: 24000 },
        turn_detection: { type: "server_vad", prefix_padding_ms: 1000, silence_duration_ms: 400 }
      },
      output: { format: { type: "audio/pcm", rate: 24000 }, voice: "alloy" }
    },
    output_modalities: ["audio"]
  }
};
Key fields that trip up implementations: session.type must be "realtime" (not "conversation"), modalities is not a top-level field (use output_modalities: ["audio"]), and voice lives under audio.output.
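Since these mistakes fail only at runtime, a small guard that checks a payload against the gotchas listed above before sending it can save a debugging round-trip. A minimal sketch (the function and its error messages are my own, not part of any SDK):

```javascript
// Sketch: sanity-check a session.update payload against the GA schema
// pitfalls described above, before sending it over the socket.
function validateGaSession(update) {
  const errors = [];
  const s = update.session ?? {};
  if (s.type !== "realtime") errors.push('session.type must be "realtime"');
  if ("modalities" in s) errors.push('use output_modalities, not modalities');
  if (s.voice) errors.push("voice belongs under audio.output, not session.voice");
  return errors;
}
```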
Event Handling
The GA API emits different event names than the beta. The critical ones to handle:
- conversation.item.input_audio_transcription.delta → user transcript text
- response.output_audio.delta → AI audio chunk (bytes in event.delta, not event.audio)
- response.output_audio_transcript.delta → AI transcript
- input_audio_buffer.speech_started / speech_stopped → VAD state changes
- response.done → response complete, session continues
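One way to wire these up is a single dispatcher over the event type. The handler names below are placeholders for your own UI, playback, and tool code:

```javascript
// Sketch: route GA Realtime events to handlers. The handlers object is an
// assumed shape; optional chaining lets callers supply only what they need.
function routeRealtimeEvent(event, handlers) {
  switch (event.type) {
    case "conversation.item.input_audio_transcription.delta":
      handlers.onUserTranscript?.(event.delta);
      break;
    case "response.output_audio.delta":
      handlers.onAudioChunk?.(event.delta); // base64 PCM lives in event.delta
      break;
    case "response.output_audio_transcript.delta":
      handlers.onAiTranscript?.(event.delta);
      break;
    case "input_audio_buffer.speech_started":
    case "input_audio_buffer.speech_stopped":
      handlers.onVadChange?.(event.type.endsWith("started"));
      break;
    case "response.done":
      handlers.onResponseDone?.(event.response);
      break;
  }
}
```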
Tool Calling
The Realtime API supports function calling natively within the session. Define tools in the session.update:
sessionUpdate.session.tools = toolDefinitions.map(t => ({ type: "function", ...t }));
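For context, each entry in toolDefinitions is a plain function spec with a JSON Schema for its arguments; the spread in the mapping above just adds type: "function". The set_volume tool below is a made-up example, not part of the ABD system:

```javascript
// Sketch: an assumed shape for toolDefinitions. name, description, and a
// JSON Schema under parameters, matching the mapping shown above.
const toolDefinitions = [
  {
    name: "set_volume",
    description: "Set the playback volume",
    parameters: {
      type: "object",
      properties: { level: { type: "number", minimum: 0, maximum: 100 } },
      required: ["level"]
    }
  }
];

const tools = toolDefinitions.map((t) => ({ type: "function", ...t }));
```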
When the model emits response.done with output[0].type === "function_call", execute the tool client-side, then inject the result:
// name, args, and call_id come from the function_call item in response.done;
// output must be a string (serialize objects with JSON.stringify).
const result = await executeTool(name, args);
ws.send(JSON.stringify({
  type: "conversation.item.create",
  item: { type: "function_call_output", call_id, output: result }
}));
ws.send(JSON.stringify({ type: "response.create" }));
This triggers the model to respond with speech confirming the action.
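Putting both halves together, the detection and result-injection can live in one function. This is a sketch: sendJson is an assumed wrapper around ws.send, and executeTool is your own dispatcher:

```javascript
// Sketch: on response.done, check for a function call, run it, and feed
// the result back so the model confirms the action in speech.
async function maybeHandleToolCall(event, sendJson, executeTool) {
  const item = event.response?.output?.[0];
  if (!item || item.type !== "function_call") return false;
  const args = JSON.parse(item.arguments); // arguments arrive as a JSON string
  const result = await executeTool(item.name, args);
  sendJson({
    type: "conversation.item.create",
    item: { type: "function_call_output", call_id: item.call_id, output: result }
  });
  sendJson({ type: "response.create" }); // prompt the spoken confirmation
  return true;
}
```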
Audio Playback: Sequential Queue
Streaming audio chunks arrive faster than real time and would overlap if each were played the moment it arrived. A simple FIFO queue with an isPlaying flag solves this: each chunk is decoded to an AudioBuffer and played through a BufferSourceNode, and the onended callback triggers the next chunk.
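A sketch of that queue, with the decode-and-play step injected so the scheduling logic stays testable. Here playChunk is a placeholder that must return a Promise resolving when its chunk finishes (in the browser, that is where the AudioBuffer decode, BufferSourceNode, and onended wiring live):

```javascript
// Sketch: FIFO playback queue. Chunks are played strictly one at a time,
// in arrival order; new chunks enqueued mid-playback are picked up by the
// running drain loop.
function createPlaybackQueue(playChunk) {
  const queue = [];
  let isPlaying = false;
  async function drain() {
    isPlaying = true;
    while (queue.length > 0) {
      await playChunk(queue.shift());
    }
    isPlaying = false;
  }
  return {
    enqueue(chunk) {
      queue.push(chunk);
      if (!isPlaying) drain();
    }
  };
}
```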
Challenges & Trade-offs
- ScriptProcessorNode deprecation: Use AudioWorklet in production; ScriptProcessorNode still works but Chrome shows a deprecation warning.
- No multi-voice: output_modalities: ["audio"] forces a single voice model; there's no way to get text + audio simultaneously from the same model output.
- No streaming input transcript: Input transcription arrives via input_audio_transcription.delta events, but only after the utterance ends, not streamed word-by-word like the audio output.
- Tool execution timing: When a tool executes, recording pauses; the model waits for the response.create call before continuing.
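For the first trade-off, a drop-in AudioWorklet replacement for the ScriptProcessorNode capture is sketched below. The processor source is kept as a string and loaded via a Blob URL so no separate file is needed; the onFrame callback would feed the same Float32-to-PCM conversion shown earlier. Names here (pcm-capture, attachPcmCapture) are my own:

```javascript
// Sketch: AudioWorklet-based capture. The worklet forwards raw Float32
// frames to the main thread over its MessagePort.
const workletSource = `
  class PcmCaptureProcessor extends AudioWorkletProcessor {
    process(inputs) {
      const channel = inputs[0][0];
      if (channel) this.port.postMessage(channel.slice(0)); // copy the frame out
      return true; // keep the processor alive
    }
  }
  registerProcessor("pcm-capture", PcmCaptureProcessor);
`;

// Main-thread side (browser only): load the module, wire up messages,
// and connect the mic source node to the worklet node.
async function attachPcmCapture(audioContext, sourceNode, onFrame) {
  const url = URL.createObjectURL(new Blob([workletSource], { type: "application/javascript" }));
  await audioContext.audioWorklet.addModule(url);
  const node = new AudioWorkletNode(audioContext, "pcm-capture");
  node.port.onmessage = (ev) => onFrame(ev.data); // ev.data is a Float32Array
  sourceNode.connect(node);
  return node;
}
```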
For Your Own Project
The only ABD-specific parts are the tool definitions and executeTool mapping — swap those for your own API surface. Everything else — audio capture, WebSocket relay, session management, audio playback — is reusable infrastructure.