GaltRanch

Posted on May 21 • Originally published at astrolexis.space

Live Captions Without Sending Your Voice to the Cloud: Building ClearCaps

#a11y #ios #ai #whisper

Originally published on the AstroLexis blog. Cross-posted here for the community.

My dad started losing his hearing about five years ago. Not catastrophically — just enough that family dinners turned into "what did she say?" and TV got a little louder every month. Off-the-shelf captioning apps existed but every single one required uploading audio to a vendor's cloud. For private family conversations, medical appointments, work calls — that wasn't going to fly. So I built ClearCaps. Here's the founder story and the technical pieces that make on-device live captioning actually work in 2026.

The motivating problem

Hearing loss is one of the most common chronic conditions on the planet. The WHO estimates over 430 million people worldwide live with disabling hearing loss — and that number is rising as the population ages. Most of them are not deaf; they hear, just less reliably. Sound gets muddier. Speech gets harder to parse, especially in noisy environments. Conversations become exhausting in a way that's invisible to anyone who hasn't experienced it.

The existing accessibility stack on iOS is genuinely good. Apple's Live Captions (built into iOS 16+) work in many contexts. Speech-to-text apps abound. But almost all of them have the same architecture: capture audio, send it to a server, get back text. For someone with hearing loss, this works fine in casual settings. It does not work for:

Medical appointments. HIPAA-protected health information, often deeply personal.
Therapy sessions. Same reasoning, plus the person on the other side might object to being recorded by a cloud service.
Family conversations. Nobody wants a vendor harvesting their kid's voice or their spouse's medical complaints.
Work meetings under NDA. The lawyer didn't sign off on routing audio through someone else's datacenter.
Anywhere there's no internet. Buses, trains, basements, planes, rural areas.

The market for "live captions that respect your privacy" was — for years — basically non-existent. The reason was technical: doing speech recognition well on a phone, in real time, with speaker identification and translation, wasn't feasible. The models were too big, the CPUs too slow, the batteries too weak. In 2026 that ceiling lifted.

What changed: on-device ASR finally got good

Three independent pieces of technology converged to make this viable on an iPhone:

WhisperKit. Argmax's optimized port of OpenAI's Whisper to the Apple Neural Engine. Whisper-small (244M parameters) runs in real time on any iPhone with an A14 or newer. Whisper-base is even faster. The accuracy is strikingly good: better than most cloud APIs for accented English and major non-English languages.
Apple's Translation framework. Built into iOS 17.4+, fully on-device, supports 10+ languages including English ↔ Spanish, Portuguese, French, German, Mandarin, Japanese, Korean. Latency is sub-second per sentence.
Pyannote speaker diarization, ported to Core ML. The piece that took the longest to get right.

None of these are mine. The work was integrating them — making them run together on a single iPhone, in real time, with low enough latency that the captions actually keep up with the conversation, without melting the battery in 20 minutes.

The architecture

ClearCaps splits the computation across every accelerator the chip has:

Apple Neural Engine (ANE): Whisper-small for automatic speech recognition. Runs exclusively on the ANE so it doesn't fight the GPU for memory bandwidth.
GPU: Pyannote embedder for speaker diarization. The embedder produces 256-dim vectors for short audio chunks; the GPU is the right place because the operations are big matmuls.
Audio DSP block: noise suppression, automatic gain control, acoustic echo cancellation. Apple's built-in voice processing, hardware-accelerated, doesn't touch ANE or GPU.
CPU: Pyannote segmenter, clusterer, voice activity detection, audio resampling, and SwiftUI rendering.

The split matters because if you naively run everything on the GPU, you bottleneck on memory bandwidth before you bottleneck on compute. Spreading the work across the ANE, GPU, DSP, and CPU lets each stage run on the hardware suited to it, so the chip gets much closer to its real peak throughput. An iPhone 15 Pro or newer handles the full pipeline (ASR + diarization + UI) at ~30% CPU and ~1.5W package power. That's about half what watching a YouTube video draws.
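To make the placement concrete, here is a small Python sketch of the stage-to-hardware table described above. It is an illustration, not ClearCaps source code: the stage names, the `ComputeUnit` enum, and both helper functions are hypothetical. In a real iOS app the equivalent knob is Core ML's `MLModelConfiguration.computeUnits` (for example `.cpuAndNeuralEngine` or `.cpuAndGPU`), set per model; note that Core ML always keeps the CPU available as a fallback.

```python
from dataclasses import dataclass
from enum import Enum


class ComputeUnit(Enum):
    ANE = "neural_engine"
    GPU = "gpu"
    DSP = "audio_dsp"
    CPU = "cpu"


@dataclass(frozen=True)
class Stage:
    name: str          # hypothetical stage name, for illustration only
    unit: ComputeUnit  # where the post says this stage runs


# Placement taken from the list above.
PIPELINE = [
    Stage("voice_processing", ComputeUnit.DSP),  # noise suppression, AGC, echo cancellation
    Stage("resample_and_vad", ComputeUnit.CPU),
    Stage("asr_whisper_small", ComputeUnit.ANE),
    Stage("diarization_segmenter", ComputeUnit.CPU),
    Stage("diarization_embedder", ComputeUnit.GPU),
    Stage("diarization_clusterer", ComputeUnit.CPU),
    Stage("ui_rendering", ComputeUnit.CPU),
]


def stages_by_unit(pipeline):
    """Group stage names by the compute unit they are pinned to."""
    grouped = {unit: [] for unit in ComputeUnit}
    for stage in pipeline:
        grouped[stage.unit].append(stage.name)
    return grouped


def heavy_models_share_a_unit(pipeline, heavy=("asr_whisper_small", "diarization_embedder")):
    """Return True if the two large neural nets would contend for one unit."""
    units = [stage.unit for stage in pipeline if stage.name in heavy]
    return len(set(units)) < len(units)
```

A check like `heavy_models_share_a_unit(PIPELINE)` is the kind of guard that catches the "everything on the GPU" mistake early: it returns `False` for the table above and `True` if both big models are moved to the same unit.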

The hard part: speaker diarization on-device

Automatic speech recognition has been a solved problem for cloud services since 2022 and for on-device since Whisper-small dropped. Diarization — figuring out who is speaking at any given moment — is much less mature.

The state of the art on the cloud side is pyannote.audio, a fantastic open-source library by Hervé Bredin. It's PyTorch under the hood, and the pretrained models assume you have a workstation GPU and Python at runtime. Neither of which exists on an iPhone.

Porting pyannote to run inside an iOS app required:

Converting the models to Core ML. The segmenter neural net (a 1D-CNN that ingests audio and outputs voice-activity + speaker-change probability per frame) and the embedder (which produces a 256-dim vector per active speaker segment) both convert cleanly. The clusterer is pure Python and gets reimplemented in Swift.
Streaming the inference. The pretrained pyannote models expect 10-second chunks. For live captioning, 10-second latency is unusable. We slide a 2-second window and re-cluster every 500ms. The clusters stabilize after about 3-4 seconds of speech per speaker.
Handling cold-start. The first 2-3 seconds of any conversation have no diarization data. Captions show up immediately, just with a placeholder speaker label ("Speaker 1") until the clusterer locks on.
Naming speakers. The user can tap any speaker label and rename it. "Speaker 1" becomes "Doctor Rodríguez." The rename persists for the whole session and gets re-applied if the clusterer recovers the same speaker after a gap.
The "did someone address me?" signal. ClearCaps detects when a speaker directly addresses the user (questions tagged "You" or "Bruno") and triggers a haptic. The user doesn't have to stare at the screen — they can look at the person they're with and feel a buzz when something needs their attention. This came from talking to my dad: the worst part of hearing loss in conversation isn't missing words, it's missing when someone has just asked you a question.
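The streaming, cold-start, and renaming steps above can be sketched in a few lines of Python. This is a simplified stand-in, not the ClearCaps implementation: it swaps in greedy online centroid clustering with a cosine threshold for the real clusterer, and the 16 kHz sample rate, the 0.7 threshold, and every name in the block are assumptions made for illustration. Only the 2-second window and the 500ms interval come from the post.

```python
import math

SAMPLE_RATE = 16_000    # assumed input rate
WINDOW_S = 2.0          # window length from the post
HOP_S = 0.5             # re-cluster interval from the post
SAME_SPEAKER_SIM = 0.7  # hypothetical cosine threshold, would need tuning


def sliding_windows(num_samples, rate=SAMPLE_RATE, window_s=WINDOW_S, hop_s=HOP_S):
    """Yield (start, end) sample indices of each full window."""
    window, hop = int(window_s * rate), int(hop_s * rate)
    start = 0
    while start + window <= num_samples:
        yield start, start + window
        start += hop


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class OnlineSpeakerLabels:
    """Greedy centroid clustering with user-editable display names."""

    def __init__(self, threshold=SAME_SPEAKER_SIM):
        self.threshold = threshold
        self.centroids = []  # running mean embedding per cluster
        self.counts = []     # number of embeddings absorbed by each cluster
        self.names = {}      # cluster index -> name chosen by the user

    def assign(self, embedding):
        """Return the cluster index for an embedding, creating one if needed."""
        best, best_sim = None, self.threshold
        for idx, centroid in enumerate(self.centroids):
            sim = cosine(embedding, centroid)
            if sim >= best_sim:
                best, best_sim = idx, sim
        if best is None:
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        n = self.counts[best]
        self.centroids[best] = [
            (c * n + e) / (n + 1) for c, e in zip(self.centroids[best], embedding)
        ]
        self.counts[best] = n + 1
        return best

    def rename(self, idx, name):
        """Attach a user-chosen name to a cluster for the rest of the session."""
        self.names[idx] = name

    def label(self, idx):
        """Placeholder label ("Speaker N") until the user renames the speaker."""
        return self.names.get(idx, f"Speaker {idx + 1}")
```

Because the name is stored against the cluster rather than against a stretch of audio, a speaker who goes quiet and comes back gets the same label again as long as their new embedding lands on the same centroid.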

Why on-device matters for accessibility specifically

I want to be careful here because accessibility tech often gets framed as a charity case, and that's the wrong frame. Hard-of-hearing people are paying customers. They have specific product requirements. They evaluate tools the same way anyone else evaluates tools.

The privacy-first architecture isn't a feel-good add-on for accessibility users. It's a product requirement that surfaces specifically in this market:

Medical conversations. A captioning app that requires uploading audio to a cloud service is incompatible with patient privacy expectations in most jurisdictions.
Family privacy. Spouse discussing health symptoms over dinner. Kid asking about something embarrassing at school. The captioning user doesn't want that going into anyone's training dataset.
The other person's consent. When you're using captions in a conversation, the other person hasn't consented to a cloud service capturing their voice. On-device captions sidestep this entirely, because the audio never leaves your phone.
Offline reliability. Hearing-loss users need captions most when they're most stressed, which is often in environments where Wi-Fi is bad: hospitals, public transit, large crowded events.

The first time my dad used ClearCaps in a real conversation, the thing he commented on wasn't the accuracy. It was that it kept working when the Wi-Fi flickered. That's the architectural payoff.

The AI assistant on top

ClearCaps ships with an optional AI layer on top of the captions, powered by a 3B-parameter LLM running through Apple Foundation Models on iOS 26+. The model does four things:

Cleans up the transcript. Whisper is great but it captures every "um" and "uh" and false start. The cleanup pass produces a readable version.
Summarizes long sessions. A 90-minute consultation becomes a one-page bullet summary.
Identifies speakers by context. If a speaker introduces themselves as "Doctor Rodríguez," the assistant applies that label automatically.
Visual context. Take a photo during the conversation (a whiteboard, a prescription, a slide) and the LLM describes it.

All of this runs on-device. The LLM is Apple's, the framework is Foundation Models, and there's a privacy manifest in the app bundle that auditors can verify.
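For a sense of what the cleanup pass has to do, here is a deliberately naive Python baseline. It is not how ClearCaps works: the app uses the on-device LLM, while this sketch swaps in two regular expressions. The filler list and the function name are hypothetical.

```python
import re

# Hypothetical filler list; a real one would be longer and per-language.
FILLERS = re.compile(r"\b(?:um+|uh+|er+m?)\b[,.]?\s*", re.IGNORECASE)
# An immediately repeated word ("the the"), a common false-start pattern.
REPEATED_WORD = re.compile(r"\b(\w+)(?:\s+\1\b)+", re.IGNORECASE)


def clean_transcript(text):
    """Strip filler words and collapse immediate word repetitions."""
    text = FILLERS.sub("", text)
    text = REPEATED_WORD.sub(r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()
```

The limits show up quickly: this version would also collapse a legitimate "had had," it loses sentence-initial capitalization when it removes a leading "Um," and it cannot repair a longer false start. Those are the cases where a language model earns its place.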

Where ClearCaps falls short (and where it's heading)

Honest assessment:

Heavy accents. Whisper-small degrades on heavy regional accents in Spanish (rural Caribbean, Andalusian) and English (Glaswegian, deep Southern US). Whisper-medium would help, but at 769M parameters it is roughly three times the size of Whisper-small, with a memory footprint to match.
Crosstalk in groups bigger than 4. Pyannote handles 2-4 speakers cleanly. Above that, clusters merge and split.
Sign-language input. Not in scope yet. ASL/LSE/LSA via camera is on the roadmap but the recognition stack isn't there.
iPad / Mac versions. iPhone only at launch.

The product

ClearCaps is on the App Store. iOS 26+, free download with a paid AI tier ($2.99/month or $19.99/year). The captioning itself — ASR + diarization + translation — is free forever.

I made it free for the captioning because of who the users are. Hard-of-hearing people are often on fixed incomes (older population), and the captioning is a basic accessibility tool that I felt strongly should be available without payment. The AI features are nice-to-have, not need-to-have, and that's where the monetization lives.

— Bruno Galtranch, founder, AstroLexis LLC. If you have feedback or a use case we missed: contact@astrolexis.space.
