If you’re building an AI-powered voice feature on iOS, the kind where a user asks a question and hears the answer spoken aloud in real time, you’ve probably hit the same wall I did.
The LLM streams text back token by token. You need speech output that starts while the text is still arriving. And you need it to feel instant. Not “wait three seconds while I buffer the entire response, synthesize it into one big audio file, then play it back.”
This post is about the engineering problem hiding behind that requirement, the three failed approaches I tried, and the architecture that finally got me to sub-200ms time-to-first-audio on a real device.
The Setup
I was building a conversational iOS app. The flow looks simple on paper:
- User speaks or types a question
- App sends it to an LLM API (streaming response)
- Text chunks arrive over several seconds
- Each chunk gets sent to a cloud TTS service (ElevenLabs, Google Cloud)
- Audio comes back and plays through the speaker
Steps 1–3 are well-understood. Step 4 is trivial in isolation. The nightmare is step 5. Specifically, the gap between “audio bytes arrive from the network” and “sound comes out of the speaker” when you need it to happen continuously, in real time, without glitches.
Attempt 1: The Naive Approach (AVAudioPlayer + Temp Files)
My first instinct was the simplest possible thing:
// Don't do this
for try await audioChunk in ttsStream {
    let tempURL = FileManager.default.temporaryDirectory
        .appendingPathComponent(UUID().uuidString + ".wav")
    try writeWAVFile(audioChunk, to: tempURL)
    let player = try AVAudioPlayer(contentsOf: tempURL)
    player.play()
}
Wrong in every way that matters. Each chunk creates a new AVAudioPlayer, which means a new audio session setup, a disk write, a file read, and an audible gap between chunks. On my test device, each gap was 80–150ms. Long enough that the output sounded like a stuttering robot reading a ransom note.
Latency: ~800ms to first audio, with constant stuttering.
Attempt 2: Concatenate First, Play Later
Okay, so don’t play chunks individually. Collect all the audio, concatenate it, then play:
var allAudio = Data()
for try await chunk in ttsStream {
    allAudio.append(chunk)
}

// Now play allAudio as one continuous buffer
let player = try AVAudioPlayer(data: allAudio)
player.play()
This eliminates the stuttering but introduces a much worse problem: you wait for the entire LLM response to be synthesized before any audio plays. For a paragraph-length answer, that’s 3–8 seconds of silence while the user stares at the screen. It defeats the entire purpose of streaming.
Latency: 3–8 seconds to first audio. No stuttering, but unusable.
Attempt 3: AVAudioEngine with Manual Buffer Scheduling
The right primitive on Apple platforms is AVAudioEngine with AVAudioPlayerNode. Instead of playing discrete files, you schedule PCM buffers onto a player node, and the engine plays them back-to-back seamlessly:
let engine = AVAudioEngine()
let playerNode = AVAudioPlayerNode()
engine.attach(playerNode)
engine.connect(playerNode, to: engine.mainMixerNode, format: outputFormat)
try engine.start()
playerNode.play()

for try await chunk in ttsStream {
    let buffer = convertToAVAudioPCMBuffer(chunk, format: outputFormat)
    playerNode.scheduleBuffer(buffer)
}
Conceptually right. But the real-world implementation has four problems that aren’t obvious until you hit them:
Problem 1: Format Mismatch
TTS providers return audio in their native format, typically 16-bit PCM at 24kHz mono. AVAudioEngine’s main mixer expects 32-bit float at the device’s hardware sample rate (usually 44.1kHz or 48kHz on iOS). Schedule a buffer in the wrong format and you get either silence or a crash.
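To build intuition for what the conversion involves, here is the sample-format half of it sketched in pure Swift: scaling interleaved 16-bit integer samples into 32-bit floats in [-1.0, 1.0]. This is purely illustrative (the function name is mine); in the real pipeline AVAudioConverter does this plus the resampling from 24kHz to the hardware rate.

```swift
// Scale 16-bit PCM samples to Float32 in [-1.0, 1.0] by dividing by 32768
// (the magnitude of Int16.min), one common convention for this conversion.
func int16SamplesToFloat(_ samples: [Int16]) -> [Float] {
    samples.map { Float($0) / -Float(Int16.min) }
}
```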
You need AVAudioConverter to bridge the gap. And AVAudioConverter has a callback-based API that’s genuinely unpleasant to use correctly:
let converter = AVAudioConverter(from: providerFormat, to: mixerFormat)!

// The converter pulls data from you via a callback
var hasProvidedInput = false
let inputBlock: AVAudioConverterInputBlock = { packetCount, status in
    if hasProvidedInput {
        status.pointee = .noDataNow
        return nil
    }
    hasProvidedInput = true
    status.pointee = .haveData
    return inputBuffer
}

var error: NSError?
converter.convert(to: outputBuffer, error: &error, withInputFrom: inputBlock)
Every chunk needs this dance. Get the status flags wrong and you get silent output with no error.
Problem 2: Byte Alignment
TTS services stream audio over WebSocket or gRPC. The network doesn’t respect audio frame boundaries. You might receive 1,347 bytes in one message and 2,891 in the next. But AVAudioPCMBuffer needs data aligned to exact frame boundaries (2 bytes per frame for 16-bit mono, 4 bytes for stereo).
Feed it unaligned data and you get a garbled, crackling mess. The kind of audio artifact that makes you question whether your entire approach is wrong.
You need a byte accumulator that buffers incoming data, emits aligned chunks when enough has arrived, and correctly handles the partial frame left over at the end of the stream.
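A minimal sketch of such an accumulator, in pure Swift (the type and method names here are mine, not the library's):

```swift
import Foundation

// Buffers incoming network bytes and only releases data cut at exact
// frame boundaries; the trailing partial frame carries over to the next
// chunk instead of corrupting playback.
struct ByteAccumulator {
    let bytesPerFrame: Int   // 2 for 16-bit mono, 4 for 16-bit stereo
    private var pending = Data()

    init(bytesPerFrame: Int) {
        self.bytesPerFrame = bytesPerFrame
    }

    // Append raw bytes; returns the largest frame-aligned prefix, if any.
    mutating func consume(_ incoming: Data) -> Data? {
        pending.append(incoming)
        let alignedCount = (pending.count / bytesPerFrame) * bytesPerFrame
        guard alignedCount > 0 else { return nil }
        let aligned = Data(pending.prefix(alignedCount))
        pending.removeFirst(alignedCount)
        return aligned
    }

    // At end of stream: whatever partial frame remains (typically dropped
    // or zero-padded up to a full frame before the final schedule).
    mutating func flushRemainder() -> Data {
        defer { pending.removeAll() }
        return pending
    }
}
```

Feeding it the 1,347-byte message from the example above (with 2-byte frames) would emit 1,346 aligned bytes and hold 1 byte back for the next message.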
Problem 3: The Stuttering Watermark
If you call playerNode.play() immediately when the first buffer arrives, you’ll hear a brief burst of audio, then silence while the next network round-trip happens, then another burst. Choppy playback. Sounds terrible.
The fix is a playback watermark. Don’t start playing until you’ve buffered enough audio to stay ahead of the network. Half a second is usually enough. But implementing this correctly means tracking how much audio you’ve scheduled, when to begin playback, and handling the edge case where the entire response is shorter than your watermark.
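The bookkeeping fits in a small value type. This is a sketch with hypothetical names, not the library's API, but it captures the three requirements: track scheduled audio, fire the start signal exactly once, and handle the short-response edge case.

```swift
import Foundation

// Decides when playback may begin: either enough audio is buffered to
// ride out network jitter, or the stream ended before reaching that point.
struct PlaybackWatermark {
    let threshold: TimeInterval          // e.g. 0.5 seconds
    private(set) var buffered: TimeInterval = 0
    private(set) var started = false

    init(threshold: TimeInterval) {
        self.threshold = threshold
    }

    // Call after scheduling each buffer. Returns true exactly once,
    // at the moment playback should start.
    mutating func didSchedule(_ duration: TimeInterval) -> Bool {
        buffered += duration
        guard !started, buffered >= threshold else { return false }
        started = true
        return true
    }

    // Edge case: the whole response was shorter than the watermark,
    // so start playback with whatever arrived.
    mutating func streamEnded() -> Bool {
        guard !started, buffered > 0 else { return false }
        started = true
        return true
    }
}
```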
Problem 4: Memory Pressure and Backpressure
Here’s the subtle one. If your network connection is fast and the TTS service returns audio faster than the device plays it, the buffer queue grows without bound. On a fast connection with a long response, I measured buffer queues exceeding 50MB of PCM data. That’s enough to trigger iOS memory warnings and background termination.
You need backpressure: when the scheduled-but-unplayed audio exceeds a threshold (say, 3 seconds), pause consuming from the network stream. When playback catches up, resume. This requires coordinating between the network layer and the audio scheduling layer in a thread-safe way.
In Swift, the cleanest mechanism is CheckedContinuation. Suspend the stream consumption task when the buffer is full, and resume it from the AVAudioPlayerNode completion callback when a buffer finishes playing:
// In the buffer scheduling code:
if queuedDuration >= maxBufferedDuration {
    await withCheckedContinuation { continuation in
        self.backpressureContinuation = continuation
    }
}

// In the buffer completion callback:
func bufferCompleted(duration: TimeInterval) {
    queuedDuration -= duration
    if queuedDuration < maxBufferedDuration,
       let continuation = backpressureContinuation {
        backpressureContinuation = nil
        continuation.resume() // Unblock the stream consumer
    }
}
This mechanism is easy to get subtly wrong. Resume the continuation twice? Crash. Never resume it? The stream hangs forever. Skip thread safety? You get races between the audio render thread and your Swift Concurrency tasks.
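One shape that sidesteps all three pitfalls is to put the continuation behind an actor and make every path "take" it (read and nil it in one isolated step) before resuming. This is a sketch with a hypothetical name, assuming a single suspended consumer at a time; note that the teardown path calls the same `release()` as completion, so cancellation can't leave the consumer hung.

```swift
// Actor-isolated backpressure: completion callbacks and cancel() both
// funnel through release(), and whichever runs first takes the stored
// continuation. The loser sees nil and does nothing, so double-resume
// is impossible by construction.
actor BackpressureGate {
    private var pending: CheckedContinuation<Void, Never>?

    // Suspend the stream consumer until someone calls release().
    func wait() async {
        await withCheckedContinuation { pending = $0 }
    }

    // Called from buffer completion AND from teardown.
    func release() {
        let taken = pending
        pending = nil
        taken?.resume()
    }
}
```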
The Architecture That Works
After solving all four problems, I stepped back and looked at what I’d built. The core was a pipeline with clear stages:
Text chunks (from LLM)
↓
TTSProvider (WebSocket/gRPC → PCM bytes)
↓
AudioBufferAccumulator (byte alignment)
↓
AVAudioConverter (format conversion)
↓
AVAudioPlayerNode (scheduling + playback)
↓
Speaker
Two control mechanisms run alongside it:
- Playback watermark: Don’t start playing until 0.5s of audio is buffered
- Backpressure: Pause network consumption when buffered audio exceeds 3s
And a key insight: this pipeline has nothing to do with which TTS provider you use. ElevenLabs, Google Cloud, Amazon Polly, OpenAI, a custom server. They all produce PCM bytes. The pipeline doesn’t care where the bytes came from.
Making It Reusable
That insight led me to extract this into a library. The provider-agnostic part is a Swift actor called StreamingAudioPipeline that handles accumulation, conversion, scheduling, backpressure, and watermarking. The provider-specific part is a two-method protocol:
public protocol TTSProvider: Sendable {
    var outputFormat: AVAudioFormat { get }
    func stream(text: AsyncStream<String>) -> AsyncThrowingStream<Data, Error>
}
That’s the entire contract. Tell the pipeline what audio format you produce, and give it a function that turns streaming text into streaming PCM data. The pipeline handles everything else.
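To see the shape of that contract without pulling in AVFoundation, here is a toy stand-in for a provider's `stream(text:)` method: it turns each incoming text chunk into a `Data` payload (here just the UTF-8 bytes, standing in for the PCM audio a real adapter would get back over its WebSocket or gRPC connection). Purely illustrative; the function name is mine.

```swift
import Foundation

// Toy provider body: consume the text stream, emit one Data payload per
// chunk, and finish the audio stream when the text stream ends. A real
// adapter does the same dance with a network connection in the middle.
func fakeTTSStream(text: AsyncStream<String>) -> AsyncThrowingStream<Data, Error> {
    AsyncThrowingStream { continuation in
        Task {
            for await chunk in text {
                continuation.yield(Data(chunk.utf8))
            }
            continuation.finish()
        }
    }
}
```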
The simplest usage is a single line:
let provider = ElevenLabsTTSAdapter(configuration: config)
let controller = StreamingTTSController(provider: provider)
try await controller.speak("Hello, world!")
The speak() convenience method starts the pipeline, sends the text, signals completion, and waits for playback to finish. All in one call. For the more common LLM streaming case, there’s the manual flow:
let provider = ElevenLabsTTSAdapter(configuration: config)
let controller = StreamingTTSController(provider: provider)
try await controller.start()
// As text arrives from your LLM:
controller.yield(text: "Hello there! ")
controller.yield(text: "This is streaming playback.")
controller.finish()
await controller.waitUntilFinished()
If something goes wrong (calling start() twice, calling it after cancellation) the controller throws typed errors (StreamTTSError.alreadyStarted, .alreadyCancelled) instead of failing silently. And if you need to tear everything down immediately, say the user taps “stop,” controller.cancel() kills the WebSocket connection, stops playback, and cleans up in one call.
I packaged this as StreamTTS, an open source Swift Package with:
- StreamTTSCore: The audio pipeline and protocol. Zero external dependencies.
- StreamTTSElevenLabs: WebSocket adapter for ElevenLabs. Also zero external dependencies (uses URLSessionWebSocketTask).
- StreamTTSGoogleCloud: gRPC adapter for Google Cloud TTS.
You import only what you need. If you use ElevenLabs, you don’t pull in the gRPC dependency tree.
The ElevenLabs adapter exposes configurable voice settings (stability, similarity boost, style exaggeration, and speaker boost) through a VoiceSettings struct on the configuration, so you can tune synthesis characteristics without digging into WebSocket message internals:
var config = ElevenLabsConfiguration(apiKey: "...", voiceId: "...")
config.voiceSettings.stability = 0.7
config.voiceSettings.similarityBoost = 0.9
The pipeline itself is configurable too. If your app already manages its own AVAudioEngine, you can tell StreamTTS not to create one:
let pipelineConfig = AudioPipelineConfiguration(
    playbackWatermark: 0.3,    // Start playing after 300ms of audio
    maxBufferedDuration: 5.0,  // Allow 5s of buffer before backpressure
    managesAudioEngine: false  // I'll manage the engine myself
)
Why Swift Actors Are the Right Concurrency Primitive Here
Quick note on why the pipeline is an actor and not a class with locks.
The audio pipeline has mutable state accessed from multiple contexts: the stream consumption task writes to the buffer queue, the AVAudioPlayerNode completion callback decrements the queue from the audio render thread, and the public API reads state to check if playback is finished.
With a class, you’d need an NSLock or DispatchQueue wrapping every state access. With an actor, all state access is automatically serialized. The CheckedContinuation-based backpressure mechanism becomes trivial. You store the continuation as actor state, and the completion callback resumes it by calling an actor method.
Swift actors were essentially designed for exactly this kind of problem: mutable state shared between async tasks and callbacks. The result is code that’s both safe and readable.
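Reduced to its essentials, the pattern looks like this (hypothetical names, a sketch rather than the library's internals): all mutable bookkeeping lives inside one actor, and both the consumer task and the completion callback mutate it only through actor methods, so access is serialized without a single explicit lock.

```swift
import Foundation

// The actor owns the only mutable state; the scheduling task calls
// scheduled(_:), the playback-completion callback calls completed(_:),
// and the compiler guarantees those never race.
actor BufferLedger {
    private var queuedDuration: TimeInterval = 0

    func scheduled(_ duration: TimeInterval) {
        queuedDuration += duration
    }

    func completed(_ duration: TimeInterval) {
        queuedDuration -= duration
    }

    func currentlyQueued() -> TimeInterval {
        queuedDuration
    }
}
```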
Results
On an iPhone 15 Pro with ElevenLabs’ eleven_flash_v2_5 model (their lowest-latency option), I measured:
- Time to first audio: ~180ms from first text chunk sent to first audible output
- Inter-chunk gap: Imperceptible (< 5ms between scheduled buffers)
- Memory usage: Stable at ~2–4MB regardless of response length (backpressure working)
- Playback quality: Continuous, no stuttering, no artifacts
The 180ms breaks down roughly as: ~100ms ElevenLabs server processing, ~30ms WebSocket round trip, ~50ms buffer accumulation to hit the watermark. The pipeline itself adds negligible overhead.
What’s Next
StreamTTS currently ships with ElevenLabs and Google Cloud TTS adapters. Adding a new provider means implementing two methods, outputFormat and stream(text:). If you’re using OpenAI’s TTS, Amazon Polly, Azure Cognitive Services, or any other streaming TTS API, writing an adapter is straightforward.
The repo includes a TTSProvider protocol guide, a working SwiftUI demo app you can run immediately, and a CONTRIBUTING.md with step-by-step instructions for adding new provider adapters. Issues labeled good first issue include adapter requests for other providers. Contributions welcome.
If you’re building voice features on iOS and hitting the latency wall, give it a try. The problem is hard enough that nobody should have to solve it twice.