<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Joost Bakker</title>
    <description>The latest articles on DEV Community by Joost Bakker (@joostmbakker).</description>
    <link>https://dev.to/joostmbakker</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1943133%2Fff09ce21-a040-41d8-ac9f-f75f6a6b37bd.jpeg</url>
      <title>DEV Community: Joost Bakker</title>
      <link>https://dev.to/joostmbakker</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/joostmbakker"/>
    <language>en</language>
    <item>
      <title>How I Got Sub-200ms Time-to-First-Audio Streaming LLM Responses on iOS</title>
      <dc:creator>Joost Bakker</dc:creator>
      <pubDate>Tue, 31 Mar 2026 13:58:15 +0000</pubDate>
      <link>https://dev.to/joostmbakker/how-i-got-sub-200ms-time-to-first-audio-streaming-llm-responses-on-ios-462o</link>
      <guid>https://dev.to/joostmbakker/how-i-got-sub-200ms-time-to-first-audio-streaming-llm-responses-on-ios-462o</guid>
      <description>&lt;p&gt;If you’re building an AI-powered voice feature on iOS, the kind where a user asks a question and hears the answer spoken aloud in real time, you’ve probably hit the same wall I did.&lt;/p&gt;

&lt;p&gt;The LLM streams text back token by token. You need speech output that starts &lt;em&gt;while the text is still arriving&lt;/em&gt;. And you need it to feel instant. Not “wait three seconds while I buffer the entire response, synthesize it into one big audio file, then play it back.”&lt;/p&gt;

&lt;p&gt;This post is about the engineering problem hiding behind that requirement, the three failed approaches I tried, and the architecture that finally got me to sub-200ms time-to-first-audio on a real device.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Setup
&lt;/h2&gt;

&lt;p&gt;I was building a conversational iOS app. The flow looks simple on paper:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;User speaks or types a question&lt;/li&gt;
&lt;li&gt;App sends it to an LLM API (streaming response)&lt;/li&gt;
&lt;li&gt;Text chunks arrive over several seconds&lt;/li&gt;
&lt;li&gt;Each chunk gets sent to a cloud TTS service (ElevenLabs, Google Cloud)&lt;/li&gt;
&lt;li&gt;Audio comes back and plays through the speaker&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Steps 1–3 are well-understood, and sending text to a TTS service (step 4) is easy. Even step 5 is trivial in isolation. The nightmare is the seam between steps 4 and 5: the gap between “audio bytes arrive from the network” and “sound comes out of the speaker” when you need it to happen &lt;em&gt;continuously, in real time, without glitches&lt;/em&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 1: The Naive Approach (AVAudioPlayer + Temp Files)
&lt;/h2&gt;

&lt;p&gt;My first instinct was the simplest possible thing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Don't do this&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;audioChunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ttsStream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;tempURL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;FileManager&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;temporaryDirectory&lt;/span&gt;
        &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appendingPathComponent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;UUID&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;uuidString&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="s"&gt;".wav"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="nf"&gt;writeWAVFile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audioChunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tempURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;player&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;AVAudioPlayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;contentsOf&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tempURL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;player&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;play&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wrong in every way that matters. Each chunk creates a new &lt;code&gt;AVAudioPlayer&lt;/code&gt;, which means a new audio session setup, a disk write, a file read, and an audible gap between chunks. On my test device, each gap was 80–150ms. Long enough that the output sounded like a stuttering robot reading a ransom note.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency: ~800ms to first audio, with constant stuttering.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 2: Concatenate First, Play Later
&lt;/h2&gt;

&lt;p&gt;Okay, so don’t play chunks individually. Collect all the audio, concatenate it, then play:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;allAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ttsStream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;allAudio&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// Now play allAudio as one continuous buffer&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;player&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="kt"&gt;AVAudioPlayer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;data&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;allAudio&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;player&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;play&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This eliminates the stuttering but introduces a much worse problem: you wait for the &lt;em&gt;entire&lt;/em&gt; LLM response to be synthesized before any audio plays. For a paragraph-length answer, that’s 3–8 seconds of silence while the user stares at the screen. It defeats the entire purpose of streaming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency: 3–8 seconds to first audio. No stuttering, but unusable.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt 3: AVAudioEngine with Manual Buffer Scheduling
&lt;/h2&gt;

&lt;p&gt;The right primitive on Apple platforms is &lt;code&gt;AVAudioEngine&lt;/code&gt; with &lt;code&gt;AVAudioPlayerNode&lt;/code&gt;. Instead of playing discrete files, you schedule PCM buffers onto a player node, and the engine plays them back-to-back seamlessly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;engine&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;AVAudioEngine&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;playerNode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;AVAudioPlayerNode&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;attach&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;playerNode&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;playerNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mainMixerNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;outputFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;engine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;playerNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;play&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="n"&gt;ttsStream&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;buffer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;convertToAVAudioPCMBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;format&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;outputFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;playerNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;scheduleBuffer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Conceptually right. But the real-world implementation has four problems that aren’t obvious until you hit them:&lt;/p&gt;

&lt;h3&gt;
  
  
  Problem 1: Format Mismatch
&lt;/h3&gt;

&lt;p&gt;TTS providers return audio in their native format, typically 16-bit PCM at 24kHz mono. &lt;code&gt;AVAudioEngine&lt;/code&gt;’s main mixer expects 32-bit float at the device’s hardware sample rate (usually 44.1kHz or 48kHz on iOS). Schedule a buffer in the wrong format and you get either silence or a crash.&lt;/p&gt;
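
&lt;p&gt;To make the mismatch concrete, here’s roughly what the two formats look like. The provider format below is an assumption (16-bit PCM at 24kHz mono); check your provider’s docs, and always ask the engine for the mixer format instead of hard-coding it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import AVFoundation

let engine = AVAudioEngine()   // the same engine as in the earlier snippet

// What a typical TTS provider streams: 16-bit integer PCM, 24kHz, mono, interleaved.
// Adjust the sample rate and channel count to whatever your provider documents.
let providerFormat = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                   sampleRate: 24_000,
                                   channels: 1,
                                   interleaved: true)!

// What the engine's mixer actually wants: ask it rather than guessing.
// Usually 32-bit float at the hardware rate (44.1kHz or 48kHz).
let mixerFormat = engine.mainMixerNode.outputFormat(forBus: 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
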

&lt;p&gt;You need &lt;code&gt;AVAudioConverter&lt;/code&gt; to bridge the gap. And &lt;code&gt;AVAudioConverter&lt;/code&gt; has a callback-based API that’s genuinely unpleasant to use correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;converter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;AVAudioConverter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;from&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;providerFormat&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;mixerFormat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;

&lt;span class="c1"&gt;// The converter pulls data from you via a callback&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;hasProvidedInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;inputBlock&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AVAudioConverterInputBlock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;packetCount&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;hasProvidedInput&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pointee&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;noDataNow&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;hasProvidedInput&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pointee&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;haveData&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;inputBuffer&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="n"&gt;converter&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;outputBuffer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;withInputFrom&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;inputBlock&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every chunk needs this dance. Get the status flags wrong and you get silent output with no error.&lt;/p&gt;
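
&lt;p&gt;One thing the snippet glosses over is how &lt;code&gt;outputBuffer&lt;/code&gt; gets allocated. Its frame capacity has to account for the sample-rate change, or the converter runs out of room and you lose samples. A small sketch (the helper name is mine, not an API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import AVFoundation

// Size the output buffer by the sample-rate ratio, rounding up, so a 24kHz
// chunk isn't truncated when it gets resampled to 44.1kHz or 48kHz.
func makeOutputBuffer(for inputBuffer: AVAudioPCMBuffer,
                      convertingTo mixerFormat: AVAudioFormat) -&amp;gt; AVAudioPCMBuffer? {
    let ratio = mixerFormat.sampleRate / inputBuffer.format.sampleRate
    let capacity = AVAudioFrameCount((Double(inputBuffer.frameLength) * ratio).rounded(.up))
    return AVAudioPCMBuffer(pcmFormat: mixerFormat, frameCapacity: capacity)
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
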

&lt;h3&gt;
  
  
  Problem 2: Byte Alignment
&lt;/h3&gt;

&lt;p&gt;TTS services stream audio over WebSocket or gRPC. The network doesn’t respect audio frame boundaries. You might receive 1,347 bytes in one message and 2,891 in the next. But &lt;code&gt;AVAudioPCMBuffer&lt;/code&gt; needs data aligned to exact frame boundaries (2 bytes per frame for 16-bit mono, 4 bytes for stereo).&lt;/p&gt;

&lt;p&gt;Feed it unaligned data and you get a garbled, crackling mess. The kind of audio artifact that makes you question whether your entire approach is wrong.&lt;/p&gt;

&lt;p&gt;You need a byte accumulator that buffers incoming data, emits aligned chunks when enough has arrived, and correctly handles the partial frame left over at the end of the stream.&lt;/p&gt;
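
&lt;p&gt;A minimal sketch of that accumulator (the type and names here are illustrative, not part of any library; &lt;code&gt;bytesPerFrame&lt;/code&gt; would be 2 for 16-bit mono):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

struct PCMByteAccumulator {
    let bytesPerFrame: Int        // 2 for 16-bit mono, 4 for 16-bit stereo
    private var pending = Data()

    init(bytesPerFrame: Int) {
        self.bytesPerFrame = bytesPerFrame
    }

    // Append raw network bytes; return only the frame-aligned prefix.
    mutating func append(_ incoming: Data) -&amp;gt; Data {
        pending.append(incoming)
        let alignedCount = (pending.count / bytesPerFrame) * bytesPerFrame
        guard alignedCount &amp;gt; 0 else { return Data() }
        let aligned = Data(pending.prefix(alignedCount))
        pending.removeFirst(alignedCount)
        return aligned
    }

    // End of stream: anything still pending is a partial frame.
    // Here we simply drop it; you could also zero-pad it to a full frame.
    mutating func flush() {
        pending.removeAll()
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
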

&lt;h3&gt;
  
  
  Problem 3: The Stuttering Watermark
&lt;/h3&gt;

&lt;p&gt;If you call &lt;code&gt;playerNode.play()&lt;/code&gt; immediately when the first buffer arrives, you’ll hear a brief burst of audio, then silence while the next network round-trip happens, then another burst. Choppy playback. Sounds terrible.&lt;/p&gt;

&lt;p&gt;The fix is a &lt;em&gt;playback watermark&lt;/em&gt;. Don’t start playing until you’ve buffered enough audio to stay ahead of the network. Half a second is usually enough. But implementing this correctly means tracking how much audio you’ve scheduled, when to begin playback, and handling the edge case where the entire response is shorter than your watermark.&lt;/p&gt;
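
&lt;p&gt;The gate itself is only a few lines once you track how much audio you’ve scheduled (a sketch; the type and property names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import AVFoundation

final class PlaybackGate {
    private let playerNode: AVAudioPlayerNode
    private let playbackWatermark: TimeInterval   // seconds of audio to hold before playing
    private var bufferedDuration: TimeInterval = 0
    private var hasStartedPlayback = false

    init(playerNode: AVAudioPlayerNode, playbackWatermark: TimeInterval = 0.5) {
        self.playerNode = playerNode
        self.playbackWatermark = playbackWatermark
    }

    func schedule(_ buffer: AVAudioPCMBuffer) {
        playerNode.scheduleBuffer(buffer, completionHandler: nil)
        bufferedDuration += Double(buffer.frameLength) / buffer.format.sampleRate

        // Only start playback once we're comfortably ahead of the network.
        if !hasStartedPlayback, bufferedDuration &amp;gt;= playbackWatermark {
            playerNode.play()
            hasStartedPlayback = true
        }
    }

    // Edge case: the whole response was shorter than the watermark,
    // so the threshold is never reached. Start playback anyway.
    func streamDidFinish() {
        if !hasStartedPlayback {
            playerNode.play()
            hasStartedPlayback = true
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
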

&lt;h3&gt;
  
  
  Problem 4: Memory Pressure and Backpressure
&lt;/h3&gt;

&lt;p&gt;Here’s the subtle one. If your network connection is fast and the TTS service returns audio faster than the device plays it, the buffer queue grows without bound. On a fast connection with a long response, I measured buffer queues exceeding 50MB of PCM data. That’s enough to trigger iOS memory warnings and background termination.&lt;/p&gt;

&lt;p&gt;You need &lt;em&gt;backpressure&lt;/em&gt;: when the scheduled-but-unplayed audio exceeds a threshold (say, 3 seconds), pause consuming from the network stream. When playback catches up, resume. This requires coordinating between the network layer and the audio scheduling layer in a thread-safe way.&lt;/p&gt;

&lt;p&gt;In Swift, the cleanest mechanism is &lt;code&gt;CheckedContinuation&lt;/code&gt;. Suspend the stream consumption task when the buffer is full, and resume it from the &lt;code&gt;AVAudioPlayerNode&lt;/code&gt; completion callback when a buffer finishes playing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In the buffer scheduling code:&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queuedDuration&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;maxBufferedDuration&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;withCheckedContinuation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;continuation&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
        &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;backpressureContinuation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;continuation&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// In the buffer completion callback:&lt;/span&gt;
&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;bufferCompleted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;duration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;queuedDuration&lt;/span&gt; &lt;span class="o"&gt;-=&lt;/span&gt; &lt;span class="n"&gt;duration&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;queuedDuration&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;maxBufferedDuration&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;continuation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;backpressureContinuation&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;backpressureContinuation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;
        &lt;span class="n"&gt;continuation&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="c1"&gt;// Unblock the stream consumer&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This mechanism is easy to get subtly wrong. Resume the continuation twice? Crash. Never resume it? The stream hangs forever. Skip thread safety? You get races between the audio render thread and your Swift Concurrency tasks.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture That Works
&lt;/h2&gt;

&lt;p&gt;After solving all four problems, I stepped back and looked at what I’d built. The core was a pipeline with clear stages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Text chunks (from LLM)
    ↓
TTSProvider (WebSocket/gRPC → PCM bytes)
    ↓
AudioBufferAccumulator (byte alignment)
    ↓
AVAudioConverter (format conversion)
    ↓
AVAudioPlayerNode (scheduling + playback)
    ↓
Speaker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two control mechanisms run alongside it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Playback watermark&lt;/strong&gt;: Don’t start playing until 0.5s of audio is buffered&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backpressure&lt;/strong&gt;: Pause network consumption when buffered audio exceeds 3s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And a key insight: this pipeline has &lt;em&gt;nothing to do with which TTS provider you use&lt;/em&gt;. ElevenLabs, Google Cloud, Amazon Polly, OpenAI, a custom server. They all produce PCM bytes. The pipeline doesn’t care where the bytes came from.&lt;/p&gt;

&lt;h2&gt;
  
  
  Making It Reusable
&lt;/h2&gt;

&lt;p&gt;That insight led me to extract this into a library. The provider-agnostic part is a Swift &lt;code&gt;actor&lt;/code&gt; called &lt;code&gt;StreamingAudioPipeline&lt;/code&gt; that handles accumulation, conversion, scheduling, backpressure, and watermarking. The provider-specific part is a two-method protocol:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;protocol&lt;/span&gt; &lt;span class="kt"&gt;TTSProvider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Sendable&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;outputFormat&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AVAudioFormat&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;get&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;AsyncStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;AsyncThrowingStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kt"&gt;Data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;Error&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s the entire contract. Tell the pipeline what audio format you produce, and give it a function that turns streaming text into streaming PCM data. The pipeline handles everything else.&lt;/p&gt;

&lt;p&gt;The simplest usage is a single line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ElevenLabsTTSAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;StreamingTTSController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;speak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Hello, world!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;speak()&lt;/code&gt; convenience method starts the pipeline, sends the text, signals completion, and waits for playback to finish. All in one call. For the more common LLM streaming case, there’s the manual flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ElevenLabsTTSAdapter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;configuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;controller&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;StreamingTTSController&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;// As text arrives from your LLM:&lt;/span&gt;
&lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;yield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Hello there! "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;yield&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"This is streaming playback."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;finish&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;controller&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;waitUntilFinished&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If something goes wrong (calling &lt;code&gt;start()&lt;/code&gt; twice, calling it after cancellation) the controller throws typed errors (&lt;code&gt;StreamTTSError.alreadyStarted&lt;/code&gt;, &lt;code&gt;.alreadyCancelled&lt;/code&gt;) instead of failing silently. And if you need to tear everything down immediately, say the user taps “stop,” &lt;code&gt;controller.cancel()&lt;/code&gt; kills the WebSocket connection, stops playback, and cleans up in one call.&lt;/p&gt;
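
&lt;p&gt;A defensive call site might look something like this (a sketch based on the error cases above; the exact catch patterns depend on how &lt;code&gt;StreamTTSError&lt;/code&gt; is declared):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;do {
    try await controller.start()
} catch StreamTTSError.alreadyStarted {
    // A second start() on a running controller: safe to log and ignore.
} catch {
    // Setup or network failure: surface it to the user.
}

// User taps "stop": tear down the socket, playback, and pipeline in one call.
controller.cancel()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
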

&lt;p&gt;I packaged this as &lt;a href="https://github.com/joostmbakker/StreamTTS" rel="noopener noreferrer"&gt;StreamTTS&lt;/a&gt;, an open source Swift Package with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;StreamTTSCore&lt;/strong&gt;: The audio pipeline and protocol. Zero external dependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StreamTTSElevenLabs&lt;/strong&gt;: WebSocket adapter for ElevenLabs. Also zero external dependencies (uses &lt;code&gt;URLSessionWebSocketTask&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;StreamTTSGoogleCloud&lt;/strong&gt;: gRPC adapter for Google Cloud TTS.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You import only what you need. If you use ElevenLabs, you don’t pull in the gRPC dependency tree.&lt;/p&gt;

&lt;p&gt;The ElevenLabs adapter exposes configurable voice settings (stability, similarity boost, style exaggeration, and speaker boost) through a &lt;code&gt;VoiceSettings&lt;/code&gt; struct on the configuration, so you can tune synthesis characteristics without digging into WebSocket message internals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;ElevenLabsConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;voiceId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voiceSettings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stability&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voiceSettings&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;similarityBoost&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline itself is configurable too. If your app already manages its own &lt;code&gt;AVAudioEngine&lt;/code&gt;, you can tell StreamTTS not to create one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;pipelineConfig&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;AudioPipelineConfiguration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;playbackWatermark&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;// Start playing after 300ms of audio&lt;/span&gt;
    &lt;span class="nv"&gt;maxBufferedDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// Allow 5s of buffer before backpressure&lt;/span&gt;
    &lt;span class="nv"&gt;managesAudioEngine&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;    &lt;span class="c1"&gt;// I'll manage the engine myself&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Why Swift Actors Are the Right Concurrency Primitive Here
&lt;/h2&gt;

&lt;p&gt;Quick note on why the pipeline is an &lt;code&gt;actor&lt;/code&gt; and not a class with locks.&lt;/p&gt;

&lt;p&gt;The audio pipeline has mutable state accessed from multiple contexts: the stream consumption task writes to the buffer queue, the &lt;code&gt;AVAudioPlayerNode&lt;/code&gt; completion callback decrements the queue from the audio render thread, and the public API reads state to check if playback is finished.&lt;/p&gt;

&lt;p&gt;With a class, you’d need an &lt;code&gt;NSLock&lt;/code&gt; or &lt;code&gt;DispatchQueue&lt;/code&gt; wrapping every state access. With an actor, all state access is automatically serialized. The &lt;code&gt;CheckedContinuation&lt;/code&gt;-based backpressure mechanism becomes trivial. You store the continuation as actor state, and the completion callback resumes it by calling an actor method.&lt;/p&gt;

&lt;p&gt;Swift actors were essentially designed for exactly this kind of problem: mutable state shared between async tasks and callbacks. The result is code that’s both safe and readable.&lt;/p&gt;
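
&lt;p&gt;To make that concrete, here’s a stripped-down sketch of the shape (simplified for illustration, not the actual StreamTTS implementation):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import Foundation

actor PlaybackQueue {
    private var queuedDuration: TimeInterval = 0
    private var backpressureContinuation: CheckedContinuation&amp;lt;Void, Never&amp;gt;?
    private let maxBufferedDuration: TimeInterval = 3.0

    // Called from the stream-consumption task; suspends it when the queue is full.
    func didSchedule(duration: TimeInterval) async {
        queuedDuration += duration
        if queuedDuration &amp;gt;= maxBufferedDuration {
            await withCheckedContinuation { continuation in
                backpressureContinuation = continuation
            }
        }
    }

    // Called from the player node's completion callback (via Task { await ... }),
    // which fires on an audio-owned thread. The actor serializes the access.
    func didFinishPlaying(duration: TimeInterval) {
        queuedDuration -= duration
        if queuedDuration &amp;lt; maxBufferedDuration,
           let continuation = backpressureContinuation {
            backpressureContinuation = nil
            continuation.resume()
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
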

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;On an iPhone 15 Pro with ElevenLabs’ &lt;code&gt;eleven_flash_v2_5&lt;/code&gt; model (their lowest-latency option), I measured:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Time to first audio&lt;/strong&gt;: ~180ms from first text chunk sent to first audible output&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inter-chunk gap&lt;/strong&gt;: Imperceptible (&amp;lt; 5ms between scheduled buffers)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory usage&lt;/strong&gt;: Stable at ~2–4MB regardless of response length (backpressure working)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playback quality&lt;/strong&gt;: Continuous, no stuttering, no artifacts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 180ms breaks down roughly as: ~100ms ElevenLabs server processing, ~30ms WebSocket round trip, ~50ms buffer accumulation to hit the watermark. The pipeline itself adds negligible overhead.&lt;/p&gt;

&lt;h2&gt;
  
  
  What’s Next
&lt;/h2&gt;

&lt;p&gt;StreamTTS currently ships with ElevenLabs and Google Cloud TTS adapters. Adding a new provider means implementing the protocol’s two requirements: the &lt;code&gt;outputFormat&lt;/code&gt; property and the &lt;code&gt;stream(text:)&lt;/code&gt; method. If you’re using OpenAI’s TTS, Amazon Polly, Azure Cognitive Services, or any other streaming TTS API, writing an adapter is straightforward.&lt;/p&gt;
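
&lt;p&gt;As a rough illustration, a hypothetical adapter skeleton might look like this (everything here, including &lt;code&gt;MyProviderTTSAdapter&lt;/code&gt; and the &lt;code&gt;synthesize&lt;/code&gt; closure, is a stand-in for your provider’s own API, not something shipped with StreamTTS):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;import AVFoundation

struct MyProviderTTSAdapter: TTSProvider {
    // Hypothetical hook into your provider: one piece of text in, streamed PCM chunks out.
    let synthesize: @Sendable (String) -&amp;gt; AsyncThrowingStream&amp;lt;Data, Error&amp;gt;

    // Whatever PCM format your provider streams, e.g. 16-bit 24kHz mono.
    var outputFormat: AVAudioFormat {
        AVAudioFormat(commonFormat: .pcmFormatInt16,
                      sampleRate: 24_000,
                      channels: 1,
                      interleaved: true)!
    }

    func stream(text: AsyncStream&amp;lt;String&amp;gt;) -&amp;gt; AsyncThrowingStream&amp;lt;Data, Error&amp;gt; {
        AsyncThrowingStream { continuation in
            let task = Task {
                do {
                    for await piece in text {
                        // Forward each PCM chunk to the pipeline as it arrives.
                        for try await chunk in synthesize(piece) {
                            continuation.yield(chunk)
                        }
                    }
                    continuation.finish()
                } catch {
                    continuation.finish(throwing: error)
                }
            }
            continuation.onTermination = { _ in task.cancel() }
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
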

&lt;p&gt;The &lt;a href="https://github.com/joostmbakker/StreamTTS" rel="noopener noreferrer"&gt;repo&lt;/a&gt; includes a &lt;code&gt;TTSProvider&lt;/code&gt; protocol guide, a working SwiftUI demo app you can run immediately, and a &lt;a href="https://github.com/joostmbakker/StreamTTS/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING.md&lt;/a&gt; with step-by-step instructions for adding new provider adapters. Issues labeled &lt;code&gt;good first issue&lt;/code&gt; include adapter requests for other providers. Contributions welcome.&lt;/p&gt;

&lt;p&gt;If you’re building voice features on iOS and hitting the latency wall, give it a try. The problem is hard enough that nobody should have to solve it twice.&lt;/p&gt;

</description>
      <category>ios</category>
      <category>swift</category>
      <category>ai</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
