DEV Community: nickdelv

Three Models, Zero API Calls: Real-Time Meeting Intelligence on Apple Silicon

nickdelv — Tue, 16 Jun 2026 15:12:42 +0000

Originally published at thunderkitty.app/learn

Thunder Kitty's Labs features run topic segmentation and agenda tracking live, entirely on-device — and getting a sentence-embedding model onto the Neural Engine took seven attempts and a fight with a silent CoreML bug.

Thunder Kitty 1.9.0 adds a Labs section in Settings with two experimental features: a Live Topic Timeline that segments a meeting into topics as you record, and Live Agenda Tracking that marks agenda items as they get covered. Both run in real time, entirely on your Mac.

Running them means running three models at once. The interesting part wasn't the idea — it was getting one of those models, a sentence-embedding model, onto the Neural Engine. That took seven attempts and a fight with a silent CoreML bug that produces plausible-looking garbage and no error.

This is how the features work and what broke along the way.

Where this came from

Two ideas converged.

An early user wanted a live jargon buster — not a search box (he could already ask Google or Claude), but something that would notice when a term was probably unfamiliar to him and surface the definition on its own, in real time. Separately, we'd wanted a live meeting timeline for a while: a vertical view that grows as the meeting goes, showing topic flow and recurring themes as they happen.

The common thread is timing. The meeting is happening now, so the intelligence has to happen now — not as a batch job after everyone hangs up.

The timeline and agenda tracking shipped in 1.9.0; the jargon buster is still ahead of us. All of it runs on-device, with no network and no per-call cost — the same promise as the rest of the app. Turn on airplane mode and it still works.

The architecture: three models

Different tasks need different models. Here's what runs during and after a meeting:

Model	What it does	Latency / throughput
all-mpnet-base-v2 via CoreML	Topic segmentation (which sentences belong together)	5–20ms
Apple Foundation Models	Topic labeling, utterance classification	200ms–2s
Qwen 3.5 4B / 9B via MLX (downloaded once)	Post-meeting summaries	25–35 tok/s

Models 1–2 run live during the meeting; model 3 runs after. The Neural Engine handles the embedding and labeling work, the GPU handles the summary model, and they don't fight each other for resources.

The hard part was model 1: getting the mpnet embedding model running on the Neural Engine via CoreML. What should have been routine turned into seven attempts.

Topic segmentation: why DeepTiling

Before the CoreML story, here's what the embedding model is actually doing.

Topic segmentation — deciding where one topic ends and the next begins — is an old problem. TextTiling solved a version of it in 1997 by computing word overlap between sliding windows and marking the valleys as boundaries. DeepTiling is the same algorithm with neural embeddings in place of word overlap. Swap the similarity function; keep everything else.

For each transcript line we compute a 768-dimensional embedding. For line i, we take the centroid of the embeddings of the preceding 8 lines and measure how similar line i's embedding is to it. High similarity means we're still on topic; a valley (a local minimum below a 0.12 threshold) means the topic shifted. It's simple, parallelizable, and converts cleanly to a streaming version, which is what makes the live timeline possible.
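The valley test described above can be sketched in a few lines of plain Python. This is a minimal illustration, not Thunder Kitty's implementation: the function names are hypothetical, cosine similarity is assumed as the similarity measure, and the valley rule used here (a score below 0.12 that is lower than its left neighbor and no higher than its right neighbor) is only one reading of "local minimum below a threshold". Toy 3-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import math

WINDOW = 8        # preceding lines averaged into the centroid (from the text)
THRESHOLD = 0.12  # valley threshold (from the text)


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def similarity_curve(embeddings, window=WINDOW):
    """Similarity of each line to the centroid of up to `window` preceding lines.

    The first line has no context, so its score is None.
    """
    scores = [None]
    for i in range(1, len(embeddings)):
        context = embeddings[max(0, i - window):i]
        scores.append(cosine(embeddings[i], centroid(context)))
    return scores


def find_boundaries(scores, threshold=THRESHOLD):
    """Indices where the curve has a local minimum that is below `threshold`."""
    boundaries = []
    for i in range(1, len(scores)):
        prev_score = scores[i - 1]
        next_score = scores[i + 1] if i + 1 < len(scores) else None
        below = scores[i] < threshold
        dips = prev_score is None or scores[i] < prev_score
        rises = next_score is None or scores[i] <= next_score
        if below and dips and rises:
            boundaries.append(i)
    return boundaries


# Toy data: four lines on one topic, then four on another.
lines = [[1.0, 0.0, 0.0]] * 4 + [[0.0, 1.0, 0.0]] * 4
print(find_boundaries(similarity_curve(lines)))  # prints [4]
```

A live version would need one line of lookahead before confirming a valley, since the right-hand neighbor has to arrive first.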

We tested five embedding approaches: all-mpnet-base-v2, all-MiniLM-L6-v2, nomic-embed-text-v1.5, Apple's NLEmbedding, and Apple's NLContextualEmbedding. The algorithm was identical across all five; only the embeddings changed. mpnet won clearly — sharper valleys, better separation between on-topic and off-topic similarity, more reliable boundaries.

Which is why getting mpnet onto CoreML properly was non-negotiable.

The CoreML conversion: seven attempts

This is the part worth reading closely if you convert transformers to CoreML, because the failure is silent and the warning is misleading.

The goal

Convert sentence-transformers/all-mpnet-base-v2 to a CoreML .mlpackage. Take input_ids and attention_mask, output token_embeddings, then mean-pool and L2-normalize in Swift. Target: Neural Engine, under 20ms per sentence.
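The pooling step that happens outside the model is small enough to sketch. The version below is in Python for consistency with the conversion code (the app does this in Swift), and the function names are hypothetical. It assumes the usual sentence-transformers recipe: average only the token vectors whose attention mask is 1, then scale the result to unit length.

```python
import math


def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask entry is 1, skipping padding."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    count = max(count, 1)  # guard against an all-padding input
    return [value / count for value in total]


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit length; eps avoids division by zero."""
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / max(norm, eps) for value in vector]


# Two real tokens and one padding token (mask 0), in 2 dimensions.
tokens = [[3.0, 0.0], [1.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 0]
sentence_embedding = l2_normalize(mean_pool(tokens, mask))
```

Forgetting the mask is the easy mistake here: padding vectors get averaged in and every short sentence drifts toward the same point.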

Attempt 1: the obvious approach

import torch
import coremltools as ct
traced = torch.jit.trace(wrapper, (input_ids, attention_mask))
mlmodel = ct.convert(traced, ...)

Conversion succeeded. Cosine similarity between the CoreML output and sentence-transformers: 0.17. Essentially random.

coremltools had emitted two warnings during conversion:

Core ML embedding (gather) layer does not support any inputs besides
the weights and indices. Those given will be ignored.

Translation: coremltools silently drops the position_ids from the MPNet embedding layer. With no position information, the transformer produces meaningless output. It's a known bug with no upstream fix as of coremltools 9.0, and the warning fires whether or not it actually affected your model — so you can't tell from the warning alone. The only way to know is to compare against a reference.
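A reference comparison of that kind can be very small. The sketch below assumes you have already produced two lists of sentence embeddings for the same inputs, one from sentence-transformers and one from the converted model. The function names and the 0.999 pass threshold are hypothetical choices for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def compare_to_reference(reference, candidate, min_ok=0.999):
    """Return (avg, worst, passed) for row-wise cosine similarity of two embedding sets."""
    if len(reference) != len(candidate):
        raise ValueError("embedding sets must have the same number of rows")
    sims = [cosine(r, c) for r, c in zip(reference, candidate)]
    avg = sum(sims) / len(sims)
    worst = min(sims)
    # Gate on the worst row so one broken sentence cannot hide behind a good average.
    return avg, worst, worst >= min_ok


# Toy check: identical sets pass, shuffled rows fail.
reference = [[1.0, 0.0], [0.0, 1.0]]
good = compare_to_reference(reference, [[1.0, 0.0], [0.0, 1.0]])
bad = compare_to_reference(reference, [[0.0, 1.0], [1.0, 0.0]])
```

Running this over a few dozen real sentences after every conversion attempt turns a silent failure into a loud one.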

Attempts 2–6: the graveyard

Mean pooling inside the model: coremltools crashes on dynamic integer ops in the pooling code.
ONNX as an intermediate: coremltools 8+ dropped ONNX support; onnx-coreml turned out to be a separate, long-deprecated package.
coremltools 7.x with ONNX: same problem, plus a Python 3.11 / numpy <2.0 pinning mess.
torch.export (ExportedProgram): version-format incompatibility between torch 2.7 and coremltools 8.3; coremltools 9.0 accepts the exported program but still produces garbage.
Pre-computing position embeddings as constants: kills one of the two gather warnings; cosine similarity still 0.17.

By attempt 6 every obvious culprit was gone and the output was still garbage.

Attempt 7: the breakthrough

The realization: MPNet doesn't only use position embeddings in the embedding layer. It also uses relative position bias in every attention layer — another embedding lookup, computed differently from standard BERT. The whole position-handling chain was broken, not just the embedding layer.

The fix: pre-compute everything that touches position information and bypass the model's own wiring.

import torch
import torch.nn as nn

class MPNetCoreMLWrapper(nn.Module):
    def __init__(self, model, seq_length):
        super().__init__()
        self.encoder = model.encoder
        self.word_embeddings = model.embeddings.word_embeddings
        self.layer_norm = model.embeddings.LayerNorm

        # Assumption: `model` is the Hugging Face MPNetModel, which exposes both values
        padding_idx = model.embeddings.padding_idx
        hidden_size = model.config.hidden_size

        # Pre-compute position embeddings as a constant buffer
        # (MPNet position ids start at padding_idx + 1)
        pos_ids = torch.arange(padding_idx + 1, padding_idx + 1 + seq_length)
        self.register_buffer("position_embeddings",
            model.embeddings.position_embeddings.weight[pos_ids].unsqueeze(0))

        # Pre-compute relative position bias as a constant buffer
        # (only the shape of `dummy` matters, not its values)
        dummy = torch.zeros(1, seq_length, hidden_size)
        self.register_buffer("relative_position_bias",
            model.encoder.compute_position_bias(dummy))

    def forward(self, input_ids, attention_mask):
        word_emb = self.word_embeddings(input_ids)      # This gather works
        embeddings = word_emb + self.position_embeddings # Constant add
        embeddings = self.layer_norm(embeddings)
        # ... run encoder with pre-computed position bias

Result:

CoreML vs sentence-transformers: avg=0.999985, min=0.999974
PASS — CoreML embeddings match sentence-transformers

Every segmentation boundary now matched the Python baseline exactly.

What to take from this

If you're converting a transformer to CoreML and getting low cosine similarity, the gather layer is probably dropping position information. The fix is architecture-specific: you have to understand how your model encodes position before you can pre-compute it. MPNet needed two gather ops handled (position embeddings plus relative attention bias). BERT would differ. DeBERTa (another transformer variant with its own position encoding scheme) is its own special hell.

And validate against a known-good reference before trusting anything. The conversion warnings aren't a reliable signal.

Real-time agenda tracking

With segmentation working, the second feature matches live transcript content to your pre-meeting agenda as the conversation moves, so items shift from pending to in-progress to discussed in real time.

The naive version fails immediately: when someone reads the agenda aloud at the top of the meeting, every item gets "mentioned" and a naive tracker marks them all discussed before any real discussion happens.

So the tracker uses five gates, applied in order, to avoid false positives:

Similarity threshold: the line must score ≥ 0.25 against the agenda item's embedding.
Distinctiveness: the best-matching item must beat the second-best by at least 0.05; generic lines that match everything match nothing.
Minimum matches: an item needs two distinctive matches before it goes inProgress.
Temporal spread: the first and last matching lines must be ≥ 60 seconds apart before an item is marked discussed. Reading the agenda takes ~30 seconds; real discussion spans minutes.
Speaker diversity: two distinct speakers are required. Agenda reading is one voice; discussion is back-and-forth.
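The five gates can be sketched as a small state machine. This is an illustration under stated assumptions rather than the shipping tracker: the class and function names are hypothetical, the thresholds are the ones listed above, and the speaker-diversity gate is assumed to guard the transition to discussed (the text does not say which transition it applies to). Similarity scores are passed in directly, so no embedding model is needed to run it.

```python
from dataclasses import dataclass, field

SIM_THRESHOLD = 0.25       # gate 1
DISTINCT_MARGIN = 0.05     # gate 2
MIN_MATCHES = 2            # gate 3
MIN_SPREAD_SECONDS = 60.0  # gate 4
MIN_SPEAKERS = 2           # gate 5


@dataclass
class ItemState:
    """Evidence collected so far for one agenda item."""
    match_times: list = field(default_factory=list)
    speakers: set = field(default_factory=set)

    @property
    def status(self):
        if len(self.match_times) < MIN_MATCHES:
            return "pending"
        spread = self.match_times[-1] - self.match_times[0]
        if spread >= MIN_SPREAD_SECONDS and len(self.speakers) >= MIN_SPEAKERS:
            return "discussed"
        return "inProgress"


def observe_line(states, scores, timestamp, speaker):
    """Update `states` with one transcript line.

    `scores` maps agenda item name -> similarity of this line to that item.
    Returns the matched item name, or None if gates 1 or 2 reject the line.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_item, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best_score < SIM_THRESHOLD:
        return None  # too weak to count as a match
    if best_score - runner_up < DISTINCT_MARGIN:
        return None  # generic line: matches several items about equally
    state = states.setdefault(best_item, ItemState())
    state.match_times.append(timestamp)
    state.speakers.add(speaker)
    return best_item
```

Keeping the evidence (timestamps and speakers) separate from the derived status means the thresholds can be tuned without replaying the transcript logic.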

On a 51-minute, 721-line test transcript with six agenda items: 6/6 marked discussed, no simultaneous multi-item triggers, each item firing independently with its own relevant evidence.

The live tracker is the fast, approximate pass — visual feedback while you record. The authoritative version, with full context and LLM reasoning, comes from the post-meeting pass. Keeping the live half lightweight is deliberate: the MeetMap research (ACM CSCW 2025) found that real-time meeting AI works best when it lowers in-the-moment cognitive load and leaves the user in control, rather than demanding attention mid-conversation.

Why these are in Labs

Both features shipped in 1.9.0, and both are in Labs for a reason. They work, but they're not finished.

The timeline's data layer is solid and the segmentation is accurate. The UI is still rough, and topic labels are only as good as the on-device labeling model on a given day. Agenda tracking clears the five gates well on clean transcripts, but messy audio, heavy cross-talk, or an agenda full of near-identical items will still trip it. They're opt-in because we'd rather you turn them on knowing that than have them surprise you with a sub-par experience.

The short version

Three models on Apple Silicon — an mpnet embedder on the Neural Engine, Apple Foundation Models for live labeling, and a Qwen model on the GPU for post-meeting summaries — with nothing leaving the Mac and no per-call cost. The embedder fought us for seven attempts. The rest was getting the timing right.

It's in Labs because it's early. But it runs, it's local, and it works in airplane mode like everything else.

2,000 Buffers of Nothing

nickdelv — Wed, 10 Jun 2026 17:47:34 +0000

Originally published at thunderkitty.app/learn

What I learned about macOS audio capture that I couldn't find written down anywhere — Core Audio taps, silent TCC denials, and an Info.plist key Xcode ignores.

Two days after launching Thunder Kitty, I tested a scenario I should have tested earlier: picking up a phone call on my MacBook via iPhone Continuity. My voice came through fine. The other person's audio? Completely missing from the transcript. Silent.

The fix took me down a rabbit hole through three Apple frameworks, four wrong hypotheses, and one undocumented Info.plist key. This is what I learned about macOS audio capture that I couldn't find written down anywhere.

ScreenCaptureKit can't see phone calls

When I built system audio capture for Thunder Kitty, I reached for ScreenCaptureKit. It's Apple's modern API for capturing screen and audio. You create an SCStream, set capturesAudio = true on its SCStreamConfiguration, and audio buffers show up in a delegate callback. Clean, well-documented, works great.

For Zoom, Google Meet, Teams, anything running in a browser or windowed app — ScreenCaptureKit captures the audio perfectly. It operates at the application/compositor layer.

The problem is that word: applications.

When you pick up an iPhone call on your Mac via Continuity, the call audio doesn't come from an application with a window. It comes from callservicesd — a background system daemon. FaceTime audio routes through avconferenced. Neither has any window presence. They're invisible to ScreenCaptureKit.

This isn't a permissions issue. It's not a configuration issue. The API operates at the wrong layer of the stack. ScreenCaptureKit is a net that only works on the surface; these daemons are swimming underneath.

The fix is to drop down a layer.

Core Audio taps: the right layer

CATapDescription is a Core Audio API that captures audio at the HAL — the Hardware Abstraction Layer. The HAL sits between your audio hardware and everything above it. Every sound that hits your output device passes through it, regardless of which process produced it. Browser, Zoom, callservicesd, whatever. If the bytes are flowing, the tap can see them.

The setup is more involved than ScreenCaptureKit:

Describe the tap with CATapDescription(monoGlobalTapButExcludeProcesses:), excluding your own process so you don't create a feedback loop, and create it with AudioHardwareCreateProcessTap.
Wrap it in a private aggregate device via AudioHardwareCreateAggregateDevice.
Attach an IOProc callback with AudioDeviceCreateIOProcIDWithBlock.
Start the device with AudioDeviceStart.

The aggregate device is the non-obvious piece. A process tap alone doesn't deliver audio buffers — it has to live inside an aggregate device that combines the tap with an output device, and you read from the aggregate's input stream. Most of the documentation around this lives in WWDC sessions and a few sample projects.

I wired it all up, built the app, started a recording, played a YouTube video.

Silence.

2,000 buffers of nothing

The IOProc was running. My logs showed the callback firing hundreds of times per second, delivering audio buffers on schedule. But every buffer was zeros. Two thousand buffers of zeros.

My first hypothesis was Bluetooth. When AirPods connect, macOS negotiates between two profiles: A2DP (high-quality stereo, output only) and HFP (phone-call quality, bidirectional). Opening the microphone forces a switch from A2DP to HFP, which can change the default output device mid-setup. Maybe the aggregate device was being created with a stale device reference.

I rewrote the setup to host the aggregate on the built-in MacBook speakers — always present, always stable, doesn't change when headphones connect.

Still silence.

I tested with wired headphones. No Bluetooth in the chain. Same result. The Bluetooth hypothesis was dead.

The muted speaker test

Here's where I got lucky.

Thunder Kitty supports two recording modes: "unified" (one stream where the mic picks up speakers and your voice together) and "dual stream" (separate channels for mic and system audio, used with headphones). Unified mode had been "working" — transcripts showed both sides of conversations. Dual stream was always silent.

On a hunch, I ran unified mode with the speakers muted. If the Core Audio tap was actually delivering system audio, transcription should still work — the tap captures the audio stream before it reaches the speakers, so muting the output shouldn't matter.

Two transcript lines. Both my voice. No system audio at all.

Unified mode had never actually worked. The microphone had been picking up YouTube playing through the speakers — acoustic bleed dressed up as success. The Core Audio tap had been delivering silence the whole time, in every mode, and I'd been fooling myself for days.

The Info.plist key Xcode silently ignores

macOS gates sensitive APIs through TCC (Transparency, Consent, and Control). For system audio capture, the relevant service is kTCCServiceAudioCapture, and it requires NSAudioCaptureUsageDescription in your Info.plist — the string shown in the permission dialog.

I had added this key. Or so I thought.

Xcode normally lets you set INFOPLIST_KEY_* build settings, and it injects the corresponding key into your compiled Info.plist at build time. This works for NSMicrophoneUsageDescription, NSSpeechRecognitionUsageDescription, and most other privacy keys. So I'd set INFOPLIST_KEY_NSAudioCaptureUsageDescription in my build settings and moved on.

It doesn't work for NSAudioCaptureUsageDescription. Xcode silently ignores it. The key never makes it into the compiled plist.

I verified by running plutil -p on the Info.plist inside the built app bundle. Microphone description: present. Speech recognition: present. Audio capture: gone.
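The same check is easy to script so it can run after every build. The sketch below uses Python's standard plistlib; the function names are hypothetical, and the Contents/Info.plist location is the standard layout for a macOS app bundle.

```python
import plistlib
from pathlib import Path

# Privacy keys this post relies on; adjust the list for your own app.
REQUIRED_USAGE_KEYS = [
    "NSMicrophoneUsageDescription",
    "NSSpeechRecognitionUsageDescription",
    "NSAudioCaptureUsageDescription",
]


def missing_usage_keys(plist_bytes, required=REQUIRED_USAGE_KEYS):
    """Return the required keys that are absent or empty in a plist (XML or binary)."""
    info = plistlib.loads(plist_bytes)
    return [key for key in required if not info.get(key)]


def check_app_bundle(app_path, required=REQUIRED_USAGE_KEYS):
    """Read Contents/Info.plist from a built .app bundle and report missing keys."""
    plist_path = Path(app_path) / "Contents" / "Info.plist"
    return missing_usage_keys(plist_path.read_bytes(), required)
```

Calling check_app_bundle on the built .app and failing the build when the list is non-empty would have caught the dropped key immediately.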

The fix was adding it directly to the Info.plist file:

<key>NSAudioCaptureUsageDescription</key>
<string>Thunder Kitty captures system audio to transcribe the other side of your calls.</string>

The silent denial

Here's the truly nasty part: when NSAudioCaptureUsageDescription is missing, TCC denies access silently.

AudioHardwareCreateProcessTap returns noErr. AudioHardwareCreateAggregateDevice returns noErr. AudioDeviceCreateIOProcIDWithBlock returns noErr. AudioDeviceStart returns noErr. Your IOProc callback fires on schedule. Everything looks perfect.

But every buffer is zeros.

There's no error code. No log message. No indication anything is wrong. The system hands you silence and lets you figure it out. If you're not specifically checking whether audio data is non-zero, you'll never know.
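A check like that can be a few lines. The sketch below shows only the logic, in Python rather than inside a real IOProc, and the class name and the 200-buffer limit are hypothetical. It counts consecutive buffers in which every sample is exactly zero and raises a flag once the run gets suspiciously long.

```python
class SilenceWatchdog:
    """Counts consecutive all-zero buffers and flags a suspiciously long run."""

    def __init__(self, max_silent_buffers=200):
        self.max_silent_buffers = max_silent_buffers
        self.silent_run = 0

    def feed(self, samples):
        """Record one buffer; return True once the zero run reaches the limit."""
        if any(sample != 0 for sample in samples):
            self.silent_run = 0  # real audio arrived, start counting again
        else:
            self.silent_run += 1
        return self.silent_run >= self.max_silent_buffers


watchdog = SilenceWatchdog(max_silent_buffers=3)
for buffer in ([0.0, 0.0], [0.0, 0.0], [0.0, 0.0]):
    suspicious = watchdog.feed(buffer)
```

A long run of exact zeros is a hint, not proof: nothing may be playing. It is still worth logging, because a denied tap looks identical from the callback's point of view.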

A single error from AudioHardwareCreateProcessTap saying "TCC denied" would have saved me hours. I understand the security rationale — you don't want to make it easy for malware to detect denial — but it makes legitimate development genuinely painful.

And one more gotcha: triggering the prompt

Even with the Info.plist key in place, I had one more problem. I wanted Thunder Kitty's onboarding to trigger the permission prompt before the first recording, so users could grant access without confusion.

My first attempt: create a process tap and immediately destroy it. The permission gate is on tap creation, right? Surely calling AudioHardwareCreateProcessTap will trigger the system prompt.

Nope. The tap creates "successfully" (returns noErr, as we've established it does regardless). No prompt appears.

It turns out AudioDeviceStart is the call that triggers the TCC prompt. Not creating the tap. Not creating the aggregate device. Not creating the IOProc. You have to actually start IO on a tap-backed aggregate device before macOS asks the user.

There's no requestAuthorization-style API for audio capture, the way there is for the microphone or speech recognition. You have to spin up the entire pipeline — tap, aggregate device, IOProc, start IO — wait for the system prompt, then tear it all down. Thunder Kitty's onboarding does exactly this: builds a throwaway audio pipeline, holds it for a beat, destroys it. It's the only way.

Working

After fixing the Info.plist key and rebuilding, I started a test recording and picked up a phone call on my Mac. My voice. Then the other person's voice. Then both of us, transcribed line by line in a single conversation.

The permission prompt now reads "System Audio Recording" instead of "Screen & System Audio Recording." It's a smaller ask, a less alarming privacy indicator, and it actually describes what the app does.

Six things I'd tell other Mac developers

Use Core Audio taps, not ScreenCaptureKit, if you need to hear everything. SCK is great for capturing specific applications. But if your users might be on phone calls, FaceTime, or anything that runs through a background daemon, SCK will miss it. Don't ship a transcription product that misses phone calls.

Add NSAudioCaptureUsageDescription directly to your Info.plist file. Do not rely on INFOPLIST_KEY_* build settings for this key. Xcode ignores them silently. Run plutil -p on the Info.plist inside your compiled app bundle to confirm the key is actually there.

TCC enforcement is silent. Every Core Audio API will return noErr even when permission is denied. The only way to know is that your buffers contain zeros. Build a non-zero check into your audio pipeline early, before you spend a day debugging.

AudioDeviceStart is what triggers the permission prompt. Not creating the tap, not creating the aggregate device. If you want to prompt during onboarding, you need to build and start the full pipeline, then tear it down once macOS has shown the dialog.

Start your mic engine before creating the aggregate device. If your users have AirPods, opening the mic first forces the HFP profile negotiation. If the aggregate device starts first, AirPods stay in A2DP, and your mic channel fails silently. (Another silent failure. They're a theme.)

Host the aggregate device on the built-in output device, not the default output device. The default device can change when headphones connect or Bluetooth profiles switch. Built-in speakers are always there, always stable. Pin to them and your aggregate survives device changes mid-recording.

None of this is documented in one place. I hope this post saves someone the days I spent on it.