Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit

How I built Scripta — a dual-channel meeting recorder that transcribes your mic and system audio in real-time, generates AI summaries, and never sends a byte to the cloud.


I spend 2–3 hours a day on Teams and Zoom calls. By the end of the day, I can barely remember who committed to what. I tried cloud transcription services — Otter.ai, Fireflies, Granola — but my company's security policy doesn't allow meeting audio to leave the corporate network.

So I built Scripta: an open-source macOS app that records both sides of a meeting, transcribes everything in real-time, and generates AI summaries — all running entirely on your Mac. Zero cloud requests. Zero subscriptions. Zero data exfiltration.

[Screenshot: Scripta in full mode]

GitHub: github.com/thehwang/Scripta


The Dual-Channel Problem

Most transcription apps work with a single audio stream. That's fine for podcasts, but in a meeting you have two distinct audio sources:

  • Your microphone — your voice, physically entering the mic
  • System audio — the remote participants, coming out of Teams/Zoom/Meet through the OS audio mixer

If you mix them into one stream, you lose the ability to label who said what. And if you try to run two speech recognition tasks on separate streams using Apple's SFSpeechRecognizer, you get a fun surprise: kAFAssistantErrorDomain Code=1101 — Apple's speech framework silently refuses to run two recognition tasks concurrently.

The solution I landed on uses two completely different ASR engines:

┌─────────────────┐     ┌──────────────────┐
│   Microphone     │     │  System Audio     │
│  (AVAudioEngine) │     │ (ScreenCaptureKit)│
└────────┬────────┘     └────────┬─────────┘
         │                       │
    whisper.cpp             SFSpeechRecognizer
    (Metal GPU)             (Apple on-device)
         │                       │
         └───── Transcript ──────┘
                    │
              Local Ollama LLM
                    │
              AI Summary + Chat

Mic → whisper.cpp: The Whisper model runs locally with Metal acceleration. The base model (142 MB) achieves >15x real-time on Apple Silicon — 5 seconds of audio transcribed in ~0.3 seconds.

System audio → SFSpeechRecognizer: Apple's on-device speech recognition handles the remote audio. It works well with compressed VoIP audio and doesn't compete for GPU resources with Whisper.

This hybrid approach avoids the SFSpeechRecognizer concurrency crash while keeping everything on-device.
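
For the system-audio side, the recognizer setup looks roughly like this (a sketch based on Apple's Speech framework, not Scripta's exact code):

import Speech

let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))!
let request = SFSpeechAudioBufferRecognitionRequest()
request.requiresOnDeviceRecognition = true    // keep recognition fully local
request.shouldReportPartialResults = true     // live captions

let task = recognizer.recognitionTask(with: request) { result, _ in
    if let result {
        print(result.bestTranscription.formattedString)
    }
}

// Feed it the CMSampleBuffers coming out of ScreenCaptureKit:
// request.appendAudioSampleBuffer(sampleBuffer)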


Capturing System Audio with ScreenCaptureKit

Before macOS 13, capturing system audio from a specific app required hacks: virtual audio devices like BlackHole, aggregate devices, or kernel extensions. ScreenCaptureKit changed this entirely.

The key insight: ScreenCaptureKit can capture audio only — you don't need to record the screen at all. Set the video dimensions to 2×2 pixels and enable audio:

let config = SCStreamConfiguration()
config.capturesAudio = true
config.excludesCurrentProcessAudio = true  // prevent feedback loops
config.sampleRate = 16_000
config.channelCount = 1
config.width = 2   // minimal video — we only want audio
config.height = 2

excludesCurrentProcessAudio = true is critical — without it, any sounds your app plays would get captured and create an echo loop.
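
Wiring the stream up is mostly boilerplate. Here's a rough sketch of the audio-only capture path (the class structure and handler names are my own, not Scripta's exact code):

import ScreenCaptureKit

final class SystemAudioCapturer: NSObject, SCStreamOutput {
    private var stream: SCStream?
    private let audioQueue = DispatchQueue(label: "system-audio")

    func start(with config: SCStreamConfiguration) async throws {
        let content = try await SCShareableContent.excludingDesktopWindows(false, onScreenWindowsOnly: true)
        guard let display = content.displays.first else { return }
        let filter = SCContentFilter(display: display, excludingWindows: [])

        let stream = SCStream(filter: filter, configuration: config, delegate: nil)
        try stream.addStreamOutput(self, type: .audio, sampleHandlerQueue: audioQueue)
        try await stream.startCapture()
        self.stream = stream
    }

    // Audio arrives as CMSampleBuffers; forward them to SFSpeechRecognizer and the file writer.
    func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer, of type: SCStreamOutputType) {
        guard type == .audio else { return }
        // handleSystemAudio(sampleBuffer)   // hypothetical downstream handler
    }
}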

The catch: ScreenCaptureKit requires Screen Recording permission, even though we're not recording the screen. On macOS 15, self-signed apps frequently fail to acquire this permission through the normal TCC prompt. Users often need to manually add the app in System Settings → Privacy & Security → Screen Recording. This is the single biggest friction point in the user experience, and there's no programmatic workaround.
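
The best you can do programmatically is detect the missing permission up front and point users at the right settings pane. A minimal sketch using the CoreGraphics preflight calls:

import CoreGraphics

if !CGPreflightScreenCaptureAccess() {
    // Triggers the TCC prompt (at most once); keeps returning false until the user
    // grants access under Privacy & Security → Screen Recording and relaunches the app.
    CGRequestScreenCaptureAccess()
}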


Integrating whisper.cpp into a Swift App

whisper.cpp provides a clean C API that's straightforward to bridge into Swift — no Objective-C++ needed.

Building the Static Library

The Makefile clones whisper.cpp, builds it with CMake (Metal enabled), and merges all the resulting .a files into a single static library:

cmake -B build -S vendor/whisper.cpp \
    -DCMAKE_OSX_ARCHITECTURES="arm64" \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_METAL=ON \
    -DWHISPER_BUILD_TESTS=OFF

cmake --build build --config Release

libtool -static -o libwhisper.a \
    build/src/libwhisper.a \
    build/ggml/src/libggml.a \
    build/ggml/src/libggml-base.a \
    build/ggml/src/libggml-cpu.a \
    build/ggml/src/ggml-metal/libggml-metal.a

Swift Bridging via module.modulemap

Instead of a bridging header, I used a Swift Package Manager systemLibrary target with a module.modulemap:

module CWhisper {
    header "whisper.h"
    link "whisper"
    export *
}

This lets Swift code import CWhisper directly and call whisper_init_from_file_with_params, whisper_full, etc. as regular C functions.
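
A simplified sketch of what a transcription call looks like through that module (error handling omitted; modelPath and the 16 kHz mono samples array are assumed inputs, and this is not Scripta's exact wrapper):

import CWhisper

let ctx = whisper_init_from_file_with_params(modelPath, whisper_context_default_params())

var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
params.n_threads = 4
params.print_realtime = false

samples.withUnsafeBufferPointer { buf in
    _ = whisper_full(ctx, params, buf.baseAddress, Int32(buf.count))
}

var transcript = ""
for i in 0..<whisper_full_n_segments(ctx) {
    transcript += String(cString: whisper_full_get_segment_text(ctx, i))
}
whisper_free(ctx)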

Sliding Window Transcription

Real-time transcription with Whisper requires chunking the audio stream. I use a 5-second sliding window with 1-second overlap:

let chunkDuration: TimeInterval = 5.0
let overlapDuration: TimeInterval = 1.0
let sampleRate = 16_000.0                               // Whisper expects 16 kHz mono
let chunkSamples = Int(chunkDuration * sampleRate)      // 80,000 samples
let overlapSamples = Int(overlapDuration * sampleRate)  // 16,000 samples

func processNextChunk() {
    guard sampleBuffer.count >= chunkSamples else { return }
    let chunk = Array(sampleBuffer.prefix(chunkSamples))
    // Advance by 4 seconds, keeping the last second so boundary words appear in both chunks
    sampleBuffer.removeFirst(chunkSamples - overlapSamples)
    transcribeChunk(chunk)
}

The overlap prevents words at chunk boundaries from being cut off. Each chunk is processed on a background DispatchQueue — while one chunk is being transcribed, the next is accumulating.
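
transcribeChunk just pushes the work onto that queue, roughly like this (the queue label and helper names are illustrative, not Scripta's exact code):

private let transcriptionQueue = DispatchQueue(label: "whisper-transcription", qos: .userInitiated)

func transcribeChunk(_ chunk: [Float]) {
    transcriptionQueue.async { [weak self] in
        guard let self else { return }
        let text = self.runWhisper(on: chunk)              // wraps whisper_full (hypothetical helper)
        DispatchQueue.main.async {
            self.appendTranscript(text)                    // UI update on the main thread
        }
    }
}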

Noise filtering is important: Whisper tends to hallucinate on silence, producing segments like [MUSIC], (silence), or Thank you. when there's no actual speech. A simple pattern-matching filter catches these:

static func isNoiseSegment(_ text: String) -> Bool {
    let trimmed = text.trimmingCharacters(in: .whitespacesAndNewlines)
    if trimmed.hasPrefix("[") && trimmed.hasSuffix("]") { return true }
    if trimmed.hasPrefix("(") && trimmed.hasSuffix(")") { return true }
    let noisePatterns = ["music", "silence", "blank", "no speech", "thank you"]
    return noisePatterns.contains { trimmed.lowercased().contains($0) }
}

The Voice Processing IO Saga

When you're in a meeting using speakers (not headphones), the system audio plays through the speakers and gets picked up by the microphone. The mic transcription ends up containing the remote participants' words — defeating the whole purpose of dual-channel separation.

The fix: Voice Processing IO — macOS's hardware-level acoustic echo cancellation:

try inputNode.setVoiceProcessingEnabled(true)

One line of code. Three days of debugging.

Pitfall 1: The 9-Channel Format

Enabling Voice Processing IO silently changes the microphone's output format from the expected mono/stereo to 9 channels. No documentation mentions this. My AVAudioConverter — which was converting the mic audio from its native format to mono 16kHz for Whisper — started crashing with EXC_BAD_ACCESS on the real-time audio thread.

The fix: bypass AVAudioConverter entirely. Extract channel 0 manually and resample with linear interpolation:

// `buffer` is the AVAudioPCMBuffer delivered by the mic tap
let frameCount = Int(buffer.frameLength)
guard let ch0 = buffer.floatChannelData?[0] else { return }   // take channel 0 only

// Resample to 16 kHz with simple linear interpolation
let targetRate = 16_000.0
let ratio = targetRate / buffer.format.sampleRate
var resampled = [Float](repeating: 0, count: Int(Double(frameCount) * ratio))
for i in 0..<resampled.count {
    let srcIdx = Double(i) / ratio
    let idx0 = Int(srcIdx)
    let frac = Float(srcIdx - Double(idx0))
    resampled[i] = ch0[idx0] + frac * (ch0[min(idx0 + 1, frameCount - 1)] - ch0[idx0])
}

Not the most elegant DSP, but it doesn't crash on the audio thread, which is more than AVAudioConverter can claim.

Pitfall 2: System Audio Ducking

After enabling Voice Processing IO, users reported that system volume suddenly dropped during recording. Voice Processing IO automatically ducks (reduces volume of) other audio sources to help with echo cancellation. This also affected ScreenCaptureKit's capture — the system audio recordings were nearly silent at -51 dB.

The fix (macOS 14+):

inputNode.voiceProcessingOtherAudioDuckingConfiguration =
    .init(enableAdvancedDucking: false, duckingLevel: .min)

Pitfall 3: Silent Audio Files

The same 9-channel issue that crashed AVAudioConverter for Whisper also broke audio file recording. The writeMicAudio function was using a converter to downsample the mic buffer to 1-channel AAC — but converting 9-channel real-time audio to mono AAC was silently producing empty frames. The resulting .m4a files were the right duration but contained silence (-91 dB).

The fix was the same manual channel extraction used for Whisper: extract channel 0, resample, write directly.
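
Roughly what that direct write can look like, assuming resampled holds the mono 16 kHz samples from the snippet above (the settings dictionary and buffer plumbing are my assumptions, not Scripta's exact code):

import AVFoundation

let settings: [String: Any] = [
    AVFormatIDKey: kAudioFormatMPEG4AAC,
    AVSampleRateKey: 16_000,
    AVNumberOfChannelsKey: 1
]
let file = try AVAudioFile(forWriting: outputURL, settings: settings)

let format = AVAudioFormat(standardFormatWithSampleRate: 16_000, channels: 1)!
let pcm = AVAudioPCMBuffer(pcmFormat: format, frameCapacity: AVAudioFrameCount(resampled.count))!
pcm.frameLength = AVAudioFrameCount(resampled.count)
resampled.withUnsafeBufferPointer { src in
    pcm.floatChannelData![0].update(from: src.baseAddress!, count: resampled.count)
}
try file.write(from: pcm)   // AVAudioFile handles the AAC encode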

Lessons Learned

Apple's Voice Processing IO documentation is essentially nonexistent. The 9-channel behavior, the ducking side effect, the interaction with AVAudioConverter — none of this is documented. I found most of it through crash logs and mplog() statements. If you're building anything with Voice Processing IO, budget extra time for audio format debugging.


Local AI with Ollama

For AI summaries and chat, Scripta connects to a local Ollama instance. The integration is deliberately simple — a POST request to localhost:11434:

// Streaming summary generation
let request = OllamaRequest(
    model: modelName,
    prompt: "Summarize this meeting transcript...\n\n\(transcript)",
    stream: true
)

The response streams token-by-token, displayed in real-time in the UI. After the summary completes, users can ask follow-up questions through the Ask AI chat panel — multi-turn conversations with the transcript as system context.
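
Under the hood it's just URLSession hitting Ollama's /api/generate endpoint and decoding newline-delimited JSON. A sketch (the OllamaChunk struct and function shape are my own, not Scripta's exact code):

import Foundation

// Each streamed line is a JSON object; "response" carries the next token.
struct OllamaChunk: Codable {
    let response: String
    let done: Bool
}

func streamSummary(prompt: String, model: String, onToken: @escaping (String) -> Void) async throws {
    var request = URLRequest(url: URL(string: "http://localhost:11434/api/generate")!)
    request.httpMethod = "POST"
    request.setValue("application/json", forHTTPHeaderField: "Content-Type")
    let body: [String: Any] = ["model": model, "prompt": prompt, "stream": true]
    request.httpBody = try JSONSerialization.data(withJSONObject: body)

    let (bytes, _) = try await URLSession.shared.bytes(for: request)
    for try await line in bytes.lines {
        guard let data = line.data(using: .utf8),
              let chunk = try? JSONDecoder().decode(OllamaChunk.self, from: data) else { continue }
        onToken(chunk.response)
        if chunk.done { break }
    }
}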

The default model is qwen2.5:3b — small enough to run on any Apple Silicon Mac, multilingual, and produces surprisingly good meeting summaries. The install script handles Ollama installation, service startup, and model download automatically.


UX: Two Display Modes

Scripta offers two modes for different workflows:

Full mode is the main interface — transcript panel, AI summary, chat sidebar, recording controls, translation settings. This is where you review meetings after they end.

Minimal mode is a floating caption bar that stays on top of other windows. During a meeting, you switch to minimal mode and keep working while live captions scroll through.

The mic mute button works like Teams/Zoom — instant toggle, no pipeline teardown. The audio engine keeps running; the mute flag simply tells the tap callback to skip forwarding samples to Whisper and the audio writer.
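
In the tap callback, that check looks roughly like this (helper names are illustrative):

inputNode.installTap(onBus: 0, bufferSize: 4096, format: nil) { [weak self] buffer, _ in
    guard let self, !self.isMicMuted else { return }   // muted: drop samples, engine keeps running
    self.feedWhisperBuffer(buffer)    // sliding-window transcription
    self.writeMicAudio(buffer)        // .m4a recording
}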


Distribution Without the App Store

Scripta uses ScreenCaptureKit, communicates with Ollama on localhost, and links against a custom whisper.cpp static library — none of which are allowed under App Store sandboxing rules.

Instead, I distribute through GitHub Releases:

  • GitHub Actions CI builds for macOS 14 and macOS 15, signs with ad-hoc (codesign --sign "-")
  • curl | bash installer downloads the latest release, runs xattr -cr to clear the Gatekeeper quarantine flag, installs Ollama, pulls the AI model, and downloads the Whisper model
  • One command: curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

The xattr -cr step is what makes ad-hoc signed apps work without a paid Apple Developer ID. It clears the com.apple.quarantine extended attribute that macOS adds to downloaded files. Combined with the ad-hoc signature (which satisfies code integrity checks), this lets the app run without the "unidentified developer" warning.


What's Next

A few things I want to build:

  • Speaker diarization — cluster voice embeddings to distinguish Speaker 1, 2, 3 instead of just "Remote"
  • In-app auto-update — check GitHub Releases API on launch, download and replace via install script
  • Whisper model selection — let users choose between tiny (fast, less accurate) and small/medium (slower, better)
  • Export formats — SRT subtitles, JSON with timestamps, integration with note-taking apps

Try It

Scripta is open-source under the MIT license.

Install:

curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash

GitHub: github.com/thehwang/Scripta

If you find it useful, a star on GitHub would mean a lot. Issues, PRs, and feedback are all welcome.


Built on macOS with Swift, whisper.cpp, ScreenCaptureKit, SFSpeechRecognizer, and Ollama. No cloud required.
