iOS 26 SpeechAnalyzer: what I learned wiring it to a mic

#swift #ios #programming #softwareengineering

import AVFoundation
import Speech

@Observable
@MainActor
final class LiveTranscriber {
    // Two strings on purpose. One is rewritten constantly, one is permanent.
    private(set) var volatile  = AttributedString()   // the gray, live guess
    private(set) var committed = AttributedString()   // finalized text, never rewritten

    private var analyzer: SpeechAnalyzer?
    private var transcriber: SpeechTranscriber?
    private var inputBuilder: AsyncStream<AnalyzerInput>.Continuation?
    private var analyzerFormat: AVAudioFormat?
    private var resultsTask: Task<Void, Never>?
    private let engine = AVAudioEngine()

    func start(locale: Locale = .current) async throws {
        let transcriber = SpeechTranscriber(
            locale: locale,
            transcriptionOptions: [],
            reportingOptions: [.volatileResults],   // opt in to live partials
            attributeOptions:  [.audioTimeRange]    // each run carries its audio span
        )
        self.transcriber = transcriber
        analyzer = SpeechAnalyzer(modules: [transcriber])
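        try await ensureModel(for: transcriber, locale: locale)   // download once, if missing
        // Ask for the format only after the model is in place: it is derived from installed assets,
        // so on a first run it can come back nil if you ask too early.
        analyzerFormat = await SpeechAnalyzer.bestAvailableAudioFormat(compatibleWith: [transcriber])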

        let (stream, continuation) = AsyncStream<AnalyzerInput>.makeStream()
        inputBuilder = continuation

        resultsTask = Task {
            do {
                for try await result in transcriber.results {
                    if result.isFinal {
                        committed += result.text
                        volatile = AttributedString()       // clear the guess
                    } else {
                        volatile = result.text              // replace, don't append
                    }
                }
            } catch { /* surface to the UI; a thrown result ends the stream */ }
        }

        try await analyzer?.start(inputSequence: stream)
        try startMic()   // installs the tap, converts buffers, yields AnalyzerInput
    }
}

That is the entire spine of the live dictation I wired into the iOS app I build by myself. Forty-odd lines, no third-party packages, running fully on-device on iOS 26. It took me about a day to write and the better part of a week to stop getting wrong. This post is the week, not the day.

The class is short because SpeechAnalyzer carries the weight. But "short" hid four traps that the WWDC talk and the sample code skate past, and every one of them cost me real hours. I'll walk the spine first, then open up each trap with the code that actually fixed it.

Reading the spine

SpeechTranscriber is the module that turns audio into words. I configure it with reportingOptions: [.volatileResults], which is the single line that makes the experience feel live. Leave it out and you only get finalized text, in chunks, after the recognizer has heard enough context to be confident. With it, you get a stream of fast, throwaway guesses that tighten as more audio arrives.

SpeechAnalyzer(modules:) is the session. You hand it an array of modules; here that is just the one transcriber, though a SpeechDetector for voice-activity can ride alongside it. The analyzer does not produce results itself. Each module owns its own results sequence, which is why my for try await loop reads from transcriber.results and not from the analyzer.

bestAvailableAudioFormat(compatibleWith:) returns the PCM format the model wants to be fed. Hold onto that thought; it is the second trap.

Then AsyncStream<AnalyzerInput>.makeStream() gives me a stream and a continuation. I feed audio in through the continuation; the analyzer reads from the stream. analyzer.start(inputSequence:) begins the session, and startMic() opens the tap that pushes buffers in.

That is the happy path. Here is where I actually spent the week.

Volatile results are a UI problem, not a recognition one

The first time I ran it, the transcript stuttered and doubled. "the the quick the quick brown the quick brown fox." I had reached for the obvious move and appended every result to one string.

The fix is the two-string split at the top of the class. A volatile result is a guess about the same span of audio the recognizer is still chewing on. It is meant to replace the previous guess, not extend it. A final result is the recognizer committing: this span is settled, it will never be revised. So volatile text gets assigned, final text gets appended, and the moment a final arrives I clear the volatile buffer so the same words don't show up twice.

if result.isFinal {
    committed += result.text          // permanent, append
    volatile = AttributedString()     // the guess is now redundant
} else {
    volatile = result.text            // transient, overwrite
}

In the UI I render committed in the normal text color and volatile in gray, and I insert the live text at the cursor so a memo grows in place while I talk. The gray is not decoration. It is a promise to the reader that those words might still change, and it is the difference between an interface that feels honest and one that feels broken when a word flips a half-second after it appeared. result.text is an AttributedString rather than a String so the framework can attach per-run metadata to the text, such as the audioTimeRange I asked for.
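To see the rule in isolation, here is a minimal sketch with plain Strings instead of AttributedString, so it runs without the Speech framework. TranscriptBuffer and apply(_:isFinal:) are names made up for this illustration; the sequence of guesses is invented too.

```swift
// Hypothetical stand-in for the two properties on LiveTranscriber.
struct TranscriptBuffer {
    private(set) var committed = ""   // finalized text, only ever appended to
    private(set) var volatile = ""    // the live guess, overwritten on every partial

    // What the UI would show: settled text followed by the current guess.
    var display: String { committed + volatile }

    mutating func apply(_ text: String, isFinal: Bool) {
        if isFinal {
            committed += text   // permanent, append
            volatile = ""       // the guess is now redundant
        } else {
            volatile = text     // transient, overwrite
        }
    }
}

var buffer = TranscriptBuffer()
buffer.apply("the", isFinal: false)
buffer.apply("the quick", isFinal: false)
buffer.apply("the quick brown", isFinal: false)
let whileSpeaking = buffer.display          // "the quick brown", not the doubled version
buffer.apply("The quick brown fox. ", isFinal: true)
let afterFinal = buffer.display             // "The quick brown fox. "
```

Appending every partial instead would produce exactly the "the the quick the quick brown" stutter described above.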

The microphone's format is not the analyzer's format

This is the trap that ate the most hours, because it fails silently. The tap delivers buffers in the input node's hardware format. The analyzer wants the format that bestAvailableAudioFormat handed back. Feed it the wrong one and you don't get an error. You get nothing, or you get garbage, and you sit there wondering whether the model is broken.

The microphone tap has to convert every buffer before it goes in:

private func startMic() throws {
    let input = engine.inputNode
    let micFormat = input.outputFormat(forBus: 0)   // hardware format, typically 44.1 or 48 kHz
    guard let analyzerFormat, let converter = AVAudioConverter(from: micFormat, to: analyzerFormat) else {
        throw TranscriberError.noUsableFormat
    }
    // Capture the continuation up front so the audio-thread closure never reads main-actor state.
    let continuation = inputBuilder

    input.installTap(onBus: 0, bufferSize: 4096, format: micFormat) { [weak self] buffer, _ in
        guard let self, let converted = self.convert(buffer, with: converter, to: analyzerFormat) else { return }
        continuation?.yield(AnalyzerInput(buffer: converted))
    }

    engine.prepare()
    try engine.start()
}

AnalyzerInput(buffer:) is the envelope the stream carries. The tap closure runs on an audio thread, so I keep it cheap: convert, yield, done. Nothing else belongs in there. I learned that the hard way too, by doing string work in the closure and watching the audio glitch.
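The snippets in this post throw TranscriberError without ever defining it. The post only shows two cases being thrown (noUsableFormat here, localeUnsupported in the model check further down), so the following is a guess at a minimal definition, with message strings invented for illustration.

```swift
import Foundation

// Hypothetical definition: only the two case names come from the post.
enum TranscriberError: Error, LocalizedError {
    case noUsableFormat      // no analyzer format, or no converter from the mic format
    case localeUnsupported   // the requested locale is not in supportedLocales

    // Illustrative wording; swap in whatever your UI should show.
    var errorDescription: String? {
        switch self {
        case .noUsableFormat:
            return "No audio format the analyzer can accept."
        case .localeUnsupported:
            return "This language is not supported for transcription on this device."
        }
    }
}
```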

"On-device" still means "download once"

On-device is the headline, and it is true: nothing I record leaves the phone. Because there is no metered speech API behind it, a user can talk all day and my server bill stays exactly zero. For a solo dev with no backend, that is the entire reason this feature was even thinkable. But "on-device" is not the same as "already on the device." The language model has to be present, and the first time a given locale comes up it may not be.

func ensureModel(for transcriber: SpeechTranscriber, locale: Locale) async throws {
    let wanted = locale.identifier(.bcp47)
    let supported = await SpeechTranscriber.supportedLocales
    guard supported.contains(where: { $0.identifier(.bcp47) == wanted }) else {
        throw TranscriberError.localeUnsupported
    }
    let installed = await SpeechTranscriber.installedLocales
    if installed.contains(where: { $0.identifier(.bcp47) == wanted }) { return }

    if let request = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
        try await request.downloadAndInstall()   // this is the "Preparing…" state the user sees
    }
}

Two checks, not one. supportedLocales answers "can this device ever transcribe this language" — at the time of writing the list runs to forty-some locales, from en_US to ja_JP to yue_CN. installedLocales answers "is the model on disk right now." Only when a locale is supported but not installed do I ask AssetInventory to fetch it, and that download is what surfaces as a "Preparing…" label in the app.
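The two checks collapse into a three-way decision, which is easier to reason about (and to unit test) as a pure function. This is a sketch under assumptions: ModelState and modelState(wanted:supported:installed:) are hypothetical names, the identifiers are plain BCP-47 strings rather than Locale values, and the sample sets are made up.

```swift
// Hypothetical outcome type; not part of the Speech framework.
enum ModelState: Equatable {
    case unsupported    // not in supportedLocales: never offer voice input
    case needsDownload  // supported but not on disk: show "Preparing…" and fetch
    case ready          // supported and installed: start right away
}

func modelState(wanted: String, supported: Set<String>, installed: Set<String>) -> ModelState {
    // First question: can this device ever transcribe the language?
    guard supported.contains(wanted) else { return .unsupported }
    // Second question: is the model on disk right now?
    return installed.contains(wanted) ? .ready : .needsDownload
}

// Made-up sample data for illustration.
let supported: Set<String> = ["en-US", "ja-JP"]
let installed: Set<String> = ["en-US"]

let english = modelState(wanted: "en-US", supported: supported, installed: installed)
let japanese = modelState(wanted: "ja-JP", supported: supported, installed: installed)
let klingon = modelState(wanted: "tlh", supported: supported, installed: installed)
```

Here english is .ready, japanese is .needsDownload, and klingon is .unsupported. Only the middle case should ever reach AssetInventory.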

Here is the part I genuinely like as an app author, not just an engineer: those models are shared system assets, not part of my bundle. My download size on the App Store did not move a kilobyte when I shipped this. The model lives in system storage, gets shared across every app that uses it, and updates itself out from under me when Apple improves it. I am used to features that cost binary size or cost cents-per-call. This one costs neither, and that combination is rare enough that I went back and re-read the docs twice to make sure I wasn't missing the catch.

The catch, such as it is, is availability. On a device or OS that can't run it, SpeechTranscriber.isAvailable is false, and the right move is to hide the microphone entirely rather than show a button that does nothing. Typing still works; the voice affordance simply isn't offered. Degrading by hiding beats degrading by erroring.

Two transcribers, and I'm not certain I chose right

There are actually two transcription modules in the framework. I shipped with SpeechTranscriber, which is tuned for clean, lower-overhead recognition. There is also DictationTranscriber, which adds punctuation and leans into conversational structure — the kind of thing you'd want for composing a long message out loud.

Read that back and you can see my doubt. A memo app is arguably closer to "composing a message," the textbook case for DictationTranscriber, than to "command recognition." I went with SpeechTranscriber because my memos are short, often fragments, and I'd rather under-punctuate a three-word note than have the model guess sentence boundaries that aren't there. But I hold that choice loosely. It is the first thing I'll A/B if people tell me the transcripts read like a transcript instead of like a note.

What I'd change after shipping it

The bug that survived longest into production was a lifecycle one, and it is worth flagging because it is counterintuitive. Finishing the input stream does not finish the session. Calling continuation.finish() just tells the analyzer no more audio is coming; the analyzer stays alive, holding resources, waiting. To actually wind down you have to call a finish method on the analyzer itself:

func stop() async {
    engine.stop()
    engine.inputNode.removeTap(onBus: 0)   // a second installTap on the same bus raises an exception
    inputBuilder?.finish()
    try? await analyzer?.finalizeAndFinishThroughEndOfInput()  // flush, then close for real
    resultsTask?.cancel()
}

finalizeAndFinishThroughEndOfInput() flushes whatever audio is still in flight into final results before it closes, so you don't lose the last word someone spoke. I had the engine.stop() and the stream finish() from day one and assumed that was teardown. It wasn't, and the leak only showed up after a few dozen start/stop cycles in a long session.

The other thing I'd revisit is backpressure. An AsyncStream will buffer if the analyzer falls behind the microphone, and under sustained fast speech that buffer grows. I haven't been bitten by it yet, but I've made a note to bound the stream and drop the oldest buffers rather than the newest if I ever am, because in dictation the freshest audio is the audio you most need.
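For what it's worth, AsyncStream already has a buffering policy that does this: .bufferingNewest(n) keeps at most n elements and discards the oldest when a new one arrives. Below is a small sketch that uses Ints in place of AnalyzerInput so it runs anywhere. The helper name surviving(afterYielding:keepingNewest:) is made up, and the right limit for real audio buffers is something you would have to measure.

```swift
// Yields every value before anything consumes the stream, which simulates
// a consumer that has fallen behind, then reports what is still buffered.
func surviving(afterYielding values: [Int], keepingNewest limit: Int) async -> [Int] {
    let (stream, continuation) = AsyncStream.makeStream(
        of: Int.self,
        bufferingPolicy: .bufferingNewest(limit)   // drop the oldest once the buffer is full
    )
    for value in values { continuation.yield(value) }
    continuation.finish()

    var kept: [Int] = []
    for await value in stream { kept.append(value) }
    return kept
}
```

Yielding 1 through 5 with a limit of 3 leaves [3, 4, 5]: the freshest elements survive. In the transcriber the same policy would go on the makeStream call that creates the AnalyzerInput stream, at the cost of silently losing audio whenever the limit is hit.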

The converter I hand-waved

I skipped the body of convert(_:with:to:) above so the mic section would read cleanly. Here it is, since it is the piece most likely to trip you up. It is a one-shot pull through AVAudioConverter:

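// nonisolated: this runs on the audio thread and touches no actor state.
nonisolated private func convert(_ buffer: AVAudioPCMBuffer,
                                 with converter: AVAudioConverter,
                                 to format: AVAudioFormat) -> AVAudioPCMBuffer? {
    // Scale the frame count by the sample-rate ratio, plus slack for the resampler.
    let ratio = format.sampleRate / buffer.format.sampleRate
    let capacity = AVAudioFrameCount(Double(buffer.frameLength) * ratio) + 1024
    guard let out = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: capacity) else { return nil }

    var supplied = false
    var conversionError: NSError?
    let status = converter.convert(to: out, error: &conversionError) { _, inputStatus in
        if supplied { inputStatus.pointee = .noDataNow; return nil }   // our one buffer is already handed over
        supplied = true
        inputStatus.pointee = .haveData
        return buffer
    }
    // Treat both a reported error and an .error status as a failed conversion.
    return (status != .error && conversionError == nil) ? out : nil
}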

The fiddly bit is the input block. AVAudioConverter pulls input rather than taking it, so you hand it the buffer once with .haveData, then answer .noDataNow on the next pull so it doesn't spin asking for more. The + 1024 on the capacity is slack for the resample; size it too tight and the conversion truncates. None of this is exotic, but it is exactly the kind of plumbing the headline API hides, and it is why "wire it to a mic" turned out to be the hard half of the sentence.
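The capacity arithmetic is easy to check by hand. This sketch pulls it into a standalone function with UInt32 in place of AVAudioFrameCount (which is a typealias for UInt32), so it runs without AVFoundation. The function name is made up, and the 16 kHz target is an assumption for illustration; use whatever bestAvailableAudioFormat actually returns.

```swift
// Same formula as in convert(_:with:to:): scale by the rate ratio, then add slack.
func outputCapacity(inputFrames: UInt32,
                    inputRate: Double,
                    outputRate: Double,
                    slack: UInt32 = 1024) -> UInt32 {
    let ratio = outputRate / inputRate
    // The Double-to-UInt32 conversion truncates toward zero.
    return UInt32(Double(inputFrames) * ratio) + slack
}

// A 4096-frame tap buffer, downsampled to an assumed 16 kHz analyzer format.
let from48k = outputCapacity(inputFrames: 4096, inputRate: 48_000, outputRate: 16_000)
let from44k = outputCapacity(inputFrames: 4096, inputRate: 44_100, outputRate: 16_000)
```

With a 48 kHz mic, 4096 frames shrink to 1365 and the capacity comes out at 2389. With a 44.1 kHz mic they shrink to 1486, for a capacity of 2510. The slack is generous at these sizes, which is the point: an oversized output buffer costs a little memory, an undersized one costs audio.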

If you've shipped SpeechAnalyzer against live audio, I want to compare notes on one thing specifically: did you stay on SpeechTranscriber, or did DictationTranscriber read better for free-form notes? That's the one decision I still can't defend with data.

I'm a solo iOS developer building Simple Memo. I write here every few days about the unglamorous parts of shipping alone, usually when an Apple API surprises me. The voice input this code grew into is documented on its own page.

Further reading: I later expanded these notes into a complete reference with the full SpeechSession, the AudioBufferConverter, and an SFSpeechRecognizer to SpeechAnalyzer migration table: the full SpeechAnalyzer guide.