<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: thehwang</title>
    <description>The latest articles on DEV Community by thehwang (@thehwang).</description>
    <link>https://dev.to/thehwang</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3923429%2F6a0a283b-ca79-41a4-90ca-9bbb2e4d8bfd.png</url>
      <title>DEV Community: thehwang</title>
      <link>https://dev.to/thehwang</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/thehwang"/>
    <language>en</language>
    <item>
      <title>Building a 100% Local Meeting Transcription App for macOS with whisper.cpp and ScreenCaptureKit</title>
      <dc:creator>thehwang</dc:creator>
      <pubDate>Tue, 12 May 2026 14:17:01 +0000</pubDate>
      <link>https://dev.to/thehwang/building-a-100-local-meeting-transcription-app-for-macos-with-whispercpp-and-screencapturekit-33m7</link>
      <guid>https://dev.to/thehwang/building-a-100-local-meeting-transcription-app-for-macos-with-whispercpp-and-screencapturekit-33m7</guid>
      <description>&lt;p&gt;&lt;em&gt;How I built Scripta — a dual-channel meeting recorder that transcribes your mic and system audio in real-time, generates AI summaries, and never sends a byte to the cloud.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;I spend 2–3 hours a day on Teams and Zoom calls. By the end of the day, I can barely remember who committed to what. I tried cloud transcription services — Otter.ai, Fireflies, Granola — but my company's security policy doesn't allow meeting audio to leave the corporate network.&lt;/p&gt;

&lt;p&gt;So I built &lt;strong&gt;Scripta&lt;/strong&gt;: an open-source macOS app that records both sides of a meeting, transcribes everything in real-time, and generates AI summaries — all running entirely on your Mac. Zero cloud requests. Zero subscriptions. Zero data exfiltration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/screenshots%2Ffull_mode.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/screenshots%2Ffull_mode.png" alt="Scripta full mode" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;github.com/thehwang/Scripta&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;The Dual-Channel Problem&lt;/h2&gt;

&lt;p&gt;Most transcription apps work with a single audio stream. That's fine for podcasts, but in a meeting you have two distinct audio sources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Your microphone&lt;/strong&gt; — your voice, physically entering the mic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System audio&lt;/strong&gt; — the remote participants, coming out of Teams/Zoom/Meet through the OS audio mixer&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you mix them into one stream, you lose the ability to label who said what. And if you try to run two speech recognition tasks on separate streams using Apple's &lt;code&gt;SFSpeechRecognizer&lt;/code&gt;, you get a fun surprise: &lt;code&gt;kAFAssistantErrorDomain Code=1101&lt;/code&gt; — Apple's speech framework silently refuses to run two recognition tasks concurrently.&lt;/p&gt;

&lt;p&gt;The solution I landed on uses two completely different ASR engines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌─────────────────┐     ┌──────────────────┐
│   Microphone     │     │  System Audio     │
│  (AVAudioEngine) │     │ (ScreenCaptureKit)│
└────────┬────────┘     └────────┬─────────┘
         │                       │
    whisper.cpp             SFSpeechRecognizer
    (Metal GPU)             (Apple on-device)
         │                       │
         └───── Transcript ──────┘
                    │
              Local Ollama LLM
                    │
              AI Summary + Chat
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Mic → whisper.cpp&lt;/strong&gt;: The Whisper model runs locally with Metal acceleration. The &lt;code&gt;base&lt;/code&gt; model (142 MB) achieves &amp;gt;15x real-time on Apple Silicon — 5 seconds of audio transcribed in ~0.3 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;System audio → SFSpeechRecognizer&lt;/strong&gt;: Apple's on-device speech recognition handles the remote audio. It works well with compressed VoIP audio and doesn't compete for GPU resources with Whisper.&lt;/p&gt;

&lt;p&gt;This hybrid approach avoids the &lt;code&gt;SFSpeechRecognizer&lt;/code&gt; concurrency crash while keeping everything on-device.&lt;/p&gt;
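&lt;p&gt;For reference, a minimal sketch of what the &lt;code&gt;SFSpeechRecognizer&lt;/code&gt; side can look like (class and callback names here are illustrative, not necessarily Scripta's):&lt;/p&gt;

```swift
import Speech
import AVFoundation

// Sketch: a single on-device recognition task fed by the system-audio stream.
final class SystemAudioTranscriber {
    private let recognizer = SFSpeechRecognizer(locale: Locale(identifier: "en-US"))
    private var request: SFSpeechAudioBufferRecognitionRequest?
    private var task: SFSpeechRecognitionTask?

    func start(onText: @escaping (String) -> Void) {
        let request = SFSpeechAudioBufferRecognitionRequest()
        request.requiresOnDeviceRecognition = true   // audio never leaves the Mac
        request.shouldReportPartialResults = true    // live captions
        self.request = request
        task = recognizer?.recognitionTask(with: request) { result, _ in
            if let result { onText(result.bestTranscription.formattedString) }
        }
    }

    // Called with each PCM buffer converted from ScreenCaptureKit's output.
    func append(_ buffer: AVAudioPCMBuffer) {
        request?.append(buffer)
    }

    func stop() {
        request?.endAudio()
        task?.cancel()
    }
}
```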




&lt;h2&gt;Capturing System Audio with ScreenCaptureKit&lt;/h2&gt;

&lt;p&gt;Before macOS 13, capturing system audio from a specific app required hacks: virtual audio devices like BlackHole, aggregate devices, or kernel extensions. ScreenCaptureKit changed this entirely.&lt;/p&gt;

&lt;p&gt;The key insight: ScreenCaptureKit can capture &lt;strong&gt;audio only&lt;/strong&gt; — you don't need to record the screen at all. Set the video dimensions to 2×2 pixels and enable audio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;SCStreamConfiguration&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;capturesAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;excludesCurrentProcessAudio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;  &lt;span class="c1"&gt;// prevent feedback loops&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampleRate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16_000&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;channelCount&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;width&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;   &lt;span class="c1"&gt;// minimal video — we only want audio&lt;/span&gt;
&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;height&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;excludesCurrentProcessAudio = true&lt;/code&gt; is critical — without it, any sounds your app plays would get captured and create an echo loop.&lt;/p&gt;
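&lt;p&gt;Wiring that configuration into a running stream looks roughly like this sketch (a minimal version under my own names; Scripta's actual capture class is more involved):&lt;/p&gt;

```swift
import ScreenCaptureKit

// Sketch: start an audio-only SCStream and receive its sample buffers.
final class SystemAudioCapture: NSObject, SCStreamOutput {
    private var stream: SCStream?
    private let queue = DispatchQueue(label: "audio.capture")

    func start(with config: SCStreamConfiguration) async throws {
        // Capture the whole display; the 2x2 video frames are simply discarded.
        let content = try await SCShareableContent.excludingDesktopWindows(
            false, onScreenWindowsOnly: true)
        guard let display = content.displays.first else { return }
        let filter = SCContentFilter(display: display, excludingWindows: [])

        let stream = SCStream(filter: filter, configuration: config, delegate: nil)
        try stream.addStreamOutput(self, type: .audio, sampleHandlerQueue: queue)
        try await stream.startCapture()
        self.stream = stream
    }

    // Audio arrives as CMSampleBuffers in the configured 16 kHz mono format.
    func stream(_ stream: SCStream, didOutputSampleBuffer sampleBuffer: CMSampleBuffer,
                of type: SCStreamOutputType) {
        guard type == .audio, sampleBuffer.isValid else { return }
        // Convert to PCM here and forward to the recognizer / file writer.
    }
}
```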

&lt;p&gt;The catch: ScreenCaptureKit requires &lt;strong&gt;Screen Recording&lt;/strong&gt; permission, even though we're not recording the screen. On macOS 15, self-signed apps frequently fail to acquire this permission through the normal TCC prompt. Users often need to manually add the app in System Settings → Privacy &amp;amp; Security → Screen Recording. This is the single biggest friction point in the user experience, and there's no programmatic workaround.&lt;/p&gt;
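&lt;p&gt;There's no way to grant the permission programmatically, but you can at least detect and request it up front. A sketch using the CoreGraphics preflight calls (one possible approach, not necessarily what Scripta does):&lt;/p&gt;

```swift
import CoreGraphics

// Detect Screen Recording permission before starting capture, so failure is a
// clear message instead of a silent, empty audio stream.
func ensureScreenCapturePermission() -> Bool {
    if CGPreflightScreenCaptureAccess() { return true }
    // Shows the TCC prompt at most once; afterwards it just returns the stored
    // decision, which is why a denied user must visit System Settings manually.
    return CGRequestScreenCaptureAccess()
}
```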




&lt;h2&gt;Integrating whisper.cpp into a Swift App&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/ggerganov/whisper.cpp" rel="noopener noreferrer"&gt;whisper.cpp&lt;/a&gt; provides a clean C API that's straightforward to bridge into Swift — no Objective-C++ needed.&lt;/p&gt;

&lt;h3&gt;Building the Static Library&lt;/h3&gt;

&lt;p&gt;The Makefile clones whisper.cpp, builds it with CMake (Metal enabled), and merges all the resulting &lt;code&gt;.a&lt;/code&gt; files into a single static library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-S&lt;/span&gt; vendor/whisper.cpp &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DCMAKE_OSX_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"arm64"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DBUILD_SHARED_LIBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DGGML_METAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-DWHISPER_BUILD_TESTS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF

cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release

libtool &lt;span class="nt"&gt;-static&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; libwhisper.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/src/libwhisper.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml-base.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/libggml-cpu.a &lt;span class="se"&gt;\&lt;/span&gt;
    build/ggml/src/ggml-metal/libggml-metal.a
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Swift Bridging via module.modulemap&lt;/h3&gt;

&lt;p&gt;Instead of a bridging header, I used a Swift Package Manager &lt;code&gt;systemLibrary&lt;/code&gt; target with a &lt;code&gt;module.modulemap&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="n"&gt;CWhisper&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;header&lt;/span&gt; &lt;span class="s"&gt;"whisper.h"&lt;/span&gt;
    &lt;span class="n"&gt;link&lt;/span&gt; &lt;span class="s"&gt;"whisper"&lt;/span&gt;
    &lt;span class="n"&gt;export&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets Swift code &lt;code&gt;import CWhisper&lt;/code&gt; directly and call &lt;code&gt;whisper_init_from_file_with_params&lt;/code&gt;, &lt;code&gt;whisper_full&lt;/code&gt;, etc. as regular C functions.&lt;/p&gt;
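&lt;p&gt;Put together, a single transcription call from Swift can be sketched like this (error handling trimmed, field names from whisper.cpp's &lt;code&gt;whisper.h&lt;/code&gt;; the wrapper function itself is mine):&lt;/p&gt;

```swift
import CWhisper  // the systemLibrary target described above

// Sketch: load a model, run one pass over 16 kHz mono Float32 PCM,
// and concatenate the decoded segments.
func transcribe(samples: [Float], modelPath: String) -> String {
    let cparams = whisper_context_default_params()
    guard let ctx = whisper_init_from_file_with_params(modelPath, cparams) else {
        return ""
    }
    defer { whisper_free(ctx) }

    var params = whisper_full_default_params(WHISPER_SAMPLING_GREEDY)
    params.print_progress = false
    params.no_timestamps = true

    let status = samples.withUnsafeBufferPointer { buf in
        whisper_full(ctx, params, buf.baseAddress, Int32(buf.count))
    }
    guard status == 0 else { return "" }

    return (0..<whisper_full_n_segments(ctx))
        .compactMap { whisper_full_get_segment_text(ctx, $0).map(String.init(cString:)) }
        .joined()
}
```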

&lt;h3&gt;Sliding Window Transcription&lt;/h3&gt;

&lt;p&gt;Real-time transcription with Whisper requires chunking the audio stream. I use a &lt;strong&gt;5-second sliding window with 1-second overlap&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;chunkDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;5.0&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;overlapDuration&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;TimeInterval&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

&lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;processNextChunk&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sampleBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;prefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkSamples&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;sampleBuffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;removeFirst&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunkSamples&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;overlapSamples&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;transcribeChunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The overlap prevents words at chunk boundaries from being cut off. Each chunk is processed on a background &lt;code&gt;DispatchQueue&lt;/code&gt; — while one chunk is being transcribed, the next is accumulating.&lt;/p&gt;
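&lt;p&gt;In sample terms at 16 kHz, a chunk is 80,000 samples and each call advances the buffer by 64,000, keeping 16,000 samples of overlap. A framework-free sketch of that bookkeeping (a simplification of the code above, not Scripta's exact buffer type):&lt;/p&gt;

```swift
// Sketch of the 5 s window / 1 s overlap arithmetic at 16 kHz mono.
struct SlidingWindow {
    let chunkSamples = 5 * 16_000      // 80,000 samples per chunk
    let overlapSamples = 1 * 16_000    // 16,000 samples carried over
    private(set) var buffer: [Float] = []

    mutating func append(_ samples: [Float]) {
        buffer.append(contentsOf: samples)
    }

    // Returns the next 5 s chunk, keeping the last 1 s in the buffer.
    mutating func nextChunk() -> [Float]? {
        guard buffer.count >= chunkSamples else { return nil }
        let chunk = Array(buffer.prefix(chunkSamples))
        buffer.removeFirst(chunkSamples - overlapSamples)  // advance by 4 s
        return chunk
    }
}
```

&lt;p&gt;Feeding 6 seconds of audio yields one 80,000-sample chunk and leaves 2 seconds (the 1 s overlap plus 1 s of new audio) waiting in the buffer.&lt;/p&gt;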

&lt;p&gt;Noise filtering is important: Whisper tends to hallucinate on silence, producing segments like &lt;code&gt;[MUSIC]&lt;/code&gt;, &lt;code&gt;(silence)&lt;/code&gt;, or &lt;code&gt;Thank you.&lt;/code&gt; when there's no actual speech. A simple pattern-matching filter catches these:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;static&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;isNoiseSegment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="nv"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;Bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;trimmed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trimmingCharacters&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;in&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;whitespacesAndNewlines&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"["&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSuffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"]"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasPrefix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"("&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;hasSuffix&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;")"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;noisePatterns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"music"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"silence"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"blank"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"no speech"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"thank you"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;noisePatterns&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;contains&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;trimmed&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lowercased&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;The Voice Processing IO Saga&lt;/h2&gt;

&lt;p&gt;When you're in a meeting using speakers (not headphones), the system audio plays through the speakers and gets picked up by the microphone. The mic transcription ends up containing the remote participant's words — defeating the whole purpose of dual-channel separation.&lt;/p&gt;

&lt;p&gt;The fix: &lt;strong&gt;Voice Processing IO&lt;/strong&gt; — macOS's hardware-level acoustic echo cancellation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="n"&gt;inputNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setVoiceProcessingEnabled&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One line of code. Three days of debugging.&lt;/p&gt;

&lt;h3&gt;Pitfall 1: The 9-Channel Format&lt;/h3&gt;

&lt;p&gt;Enabling Voice Processing IO silently changes the microphone's output format from the expected mono/stereo to &lt;strong&gt;9 channels&lt;/strong&gt;. No documentation mentions this. My &lt;code&gt;AVAudioConverter&lt;/code&gt; — which was converting the mic audio from its native format to mono 16kHz for Whisper — started crashing with &lt;code&gt;EXC_BAD_ACCESS&lt;/code&gt; on the real-time audio thread.&lt;/p&gt;

&lt;p&gt;The fix: bypass &lt;code&gt;AVAudioConverter&lt;/code&gt; entirely. Extract channel 0 manually and resample with linear interpolation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;guard&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ch0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;floatChannelData&lt;/span&gt;&lt;span class="p"&gt;?[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;ratio&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;targetRate&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;buffer&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;format&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sampleRate&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;resampled&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;](&lt;/span&gt;&lt;span class="nv"&gt;repeating&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;count&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;frameCount&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;..&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;srcIdx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;ratio&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;idx0&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcIdx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Float&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;srcIdx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="kt"&gt;Double&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;resampled&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;frac&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frameCount&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;ch0&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;idx0&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not the most elegant DSP, but it doesn't crash on the audio thread, which is more than &lt;code&gt;AVAudioConverter&lt;/code&gt; can claim.&lt;/p&gt;

&lt;h3&gt;Pitfall 2: System Audio Ducking&lt;/h3&gt;

&lt;p&gt;After enabling Voice Processing IO, users reported that system volume suddenly dropped during recording. Voice Processing IO automatically &lt;strong&gt;ducks&lt;/strong&gt; (reduces volume of) other audio sources to help with echo cancellation. This also affected ScreenCaptureKit's capture — the system audio recordings were nearly silent at -51 dB.&lt;/p&gt;

&lt;p&gt;The fix (macOS 14+):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="n"&gt;inputNode&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voiceProcessingOtherAudioDuckingConfiguration&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;init&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;enableAdvancedDucking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;duckingLevel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;min&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Pitfall 3: Silent Audio Files&lt;/h3&gt;

&lt;p&gt;The same 9-channel issue that crashed &lt;code&gt;AVAudioConverter&lt;/code&gt; for Whisper also broke audio file recording. The &lt;code&gt;writeMicAudio&lt;/code&gt; function was using a converter to downsample the mic buffer to 1-channel AAC — but converting 9-channel real-time audio to mono AAC was silently producing empty frames. The resulting &lt;code&gt;.m4a&lt;/code&gt; files were the right duration but contained silence (-91 dB).&lt;/p&gt;

&lt;p&gt;The fix was the same manual channel extraction used for Whisper: extract channel 0, resample, write directly.&lt;/p&gt;
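&lt;p&gt;Sketched out, that write path hands &lt;code&gt;AVAudioFile&lt;/code&gt; plain mono PCM and lets it do the AAC encoding internally (helper names are mine, and assume the 16 kHz samples were already extracted as above):&lt;/p&gt;

```swift
import AVFoundation

// Sketch: write already-extracted mono 16 kHz samples straight to an .m4a,
// sidestepping AVAudioConverter on the 9-channel mic buffer entirely.
func makeMicWriter(url: URL) throws -> AVAudioFile {
    let settings: [String: Any] = [
        AVFormatIDKey: kAudioFormatMPEG4AAC,
        AVSampleRateKey: 16_000,
        AVNumberOfChannelsKey: 1,
    ]
    return try AVAudioFile(forWriting: url, settings: settings)
}

func write(samples: [Float], to file: AVAudioFile) throws {
    let format = AVAudioFormat(standardFormatWithSampleRate: 16_000, channels: 1)!
    let buffer = AVAudioPCMBuffer(pcmFormat: format,
                                  frameCapacity: AVAudioFrameCount(samples.count))!
    buffer.frameLength = AVAudioFrameCount(samples.count)
    samples.withUnsafeBufferPointer {
        buffer.floatChannelData![0].update(from: $0.baseAddress!, count: samples.count)
    }
    try file.write(from: buffer)  // AVAudioFile encodes PCM -> AAC internally
}
```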

&lt;h3&gt;Lessons Learned&lt;/h3&gt;

&lt;p&gt;Apple's Voice Processing IO documentation is essentially nonexistent. The 9-channel behavior, the ducking side effect, the interaction with &lt;code&gt;AVAudioConverter&lt;/code&gt; — none of this is documented. I found most of it through crash logs and &lt;code&gt;mplog()&lt;/code&gt; statements. If you're building anything with Voice Processing IO, budget extra time for audio format debugging.&lt;/p&gt;




&lt;h2&gt;Local AI with Ollama&lt;/h2&gt;

&lt;p&gt;For AI summaries and chat, Scripta connects to a local &lt;a href="https://ollama.com" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; instance. The integration is deliberately simple — a POST request to &lt;code&gt;localhost:11434&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Streaming summary generation&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;request&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;OllamaRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nv"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;modelName&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Summarize this meeting transcript...&lt;/span&gt;&lt;span class="se"&gt;\n\n\(&lt;/span&gt;&lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nv"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response streams token-by-token, displayed in real-time in the UI. After the summary completes, users can ask follow-up questions through the Ask AI chat panel — multi-turn conversations with the transcript as system context.&lt;/p&gt;
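&lt;p&gt;Ollama streams newline-delimited JSON, one object per token. A framework-free sketch of accumulating the stream (the field names follow Ollama's &lt;code&gt;/api/generate&lt;/code&gt; response format; the helper function is my own):&lt;/p&gt;

```swift
import Foundation

// Each line of Ollama's streaming /api/generate output looks like:
//   {"model":"qwen2.5:3b","response":"Hello","done":false}
struct OllamaChunk: Decodable {
    let response: String
    let done: Bool
}

// Accumulate the streamed tokens into the final summary text.
func collectTokens(fromNDJSON text: String) -> String {
    let decoder = JSONDecoder()
    return text.split(separator: "\n")
        .compactMap { try? decoder.decode(OllamaChunk.self, from: Data($0.utf8)) }
        .map(\.response)
        .joined()
}
```

&lt;p&gt;In the app the same decode runs per line as bytes arrive, so each token can be appended to the UI immediately.&lt;/p&gt;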

&lt;p&gt;The default model is &lt;code&gt;qwen2.5:3b&lt;/code&gt; — small enough to run on any Apple Silicon Mac, multilingual, and surprisingly good at meeting summaries. The install script handles Ollama installation, service startup, and model download automatically.&lt;/p&gt;




&lt;h2&gt;UX: Two Display Modes&lt;/h2&gt;

&lt;p&gt;Scripta offers two modes for different workflows:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Full mode&lt;/strong&gt; is the main interface — transcript panel, AI summary, chat sidebar, recording controls, translation settings. This is where you review meetings after they end.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy6rl505cte97ur1gwuy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcy6rl505cte97ur1gwuy.png" alt=" " width="800" height="540"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal mode&lt;/strong&gt; is a floating caption bar that stays on top of other windows. During a meeting, you switch to minimal mode and keep working while live captions scroll through:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6whmc59upgc8xintucnc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6whmc59upgc8xintucnc.png" alt=" " width="800" height="320"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The mic mute button works like Teams/Zoom — instant toggle, no pipeline teardown. The audio engine keeps running; the mute flag simply tells the tap callback to skip forwarding samples to Whisper and the audio writer.&lt;/p&gt;
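&lt;p&gt;The gate itself can be as simple as this framework-free sketch (names are mine, not Scripta's):&lt;/p&gt;

```swift
import Foundation

// The tap callback stays installed; muting just drops samples at the gate.
final class MicGate {
    private let lock = NSLock()
    private var muted = false

    func setMuted(_ value: Bool) {
        lock.lock(); muted = value; lock.unlock()
    }

    // Called from the audio tap for every buffer.
    func forward(_ samples: [Float], to sink: ([Float]) -> Void) {
        lock.lock(); let skip = muted; lock.unlock()
        guard !skip else { return }   // engine keeps running, samples dropped
        sink(samples)
    }
}
```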




&lt;h2&gt;Distribution Without the App Store&lt;/h2&gt;

&lt;p&gt;Scripta uses ScreenCaptureKit, communicates with Ollama on localhost, and links against a custom whisper.cpp static library — none of which are allowed under App Store sandboxing rules.&lt;/p&gt;

&lt;p&gt;Instead, I distribute through GitHub Releases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Actions CI&lt;/strong&gt; builds for macOS 14 and macOS 15 and signs ad-hoc (&lt;code&gt;codesign --sign "-"&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;curl | bash&lt;/code&gt; installer&lt;/strong&gt; downloads the latest release, runs &lt;code&gt;xattr -cr&lt;/code&gt; to clear the Gatekeeper quarantine flag, installs Ollama, pulls the AI model, and downloads the Whisper model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One command&lt;/strong&gt;: &lt;code&gt;curl -fsSL https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;xattr -cr&lt;/code&gt; step is what makes ad-hoc signed apps work without a paid Apple Developer ID. It clears the &lt;code&gt;com.apple.quarantine&lt;/code&gt; extended attribute that macOS adds to downloaded files. Combined with the ad-hoc signature (which satisfies code integrity checks), this lets the app run without the "unidentified developer" warning.&lt;/p&gt;




&lt;h2&gt;What's Next&lt;/h2&gt;

&lt;p&gt;A few things I want to build:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Speaker diarization&lt;/strong&gt; — cluster voice embeddings to distinguish Speaker 1, 2, 3 instead of just "Remote"&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In-app auto-update&lt;/strong&gt; — check GitHub Releases API on launch, download and replace via install script&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Whisper model selection&lt;/strong&gt; — let users choose between &lt;code&gt;tiny&lt;/code&gt; (fast, less accurate) and &lt;code&gt;small&lt;/code&gt;/&lt;code&gt;medium&lt;/code&gt; (slower, better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export formats&lt;/strong&gt; — SRT subtitles, JSON with timestamps, integration with note-taking apps&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Try It&lt;/h2&gt;

&lt;p&gt;Scripta is open-source under the MIT license.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Install&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://raw.githubusercontent.com/thehwang/Scripta/main/scripts/install.sh | bash
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub&lt;/strong&gt;: &lt;a href="https://github.com/thehwang/Scripta" rel="noopener noreferrer"&gt;github.com/thehwang/Scripta&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you find it useful, a star on GitHub would mean a lot. Issues, PRs, and feedback are all welcome.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built on macOS with Swift, whisper.cpp, ScreenCaptureKit, SFSpeechRecognizer, and Ollama. No cloud required.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>swift</category>
      <category>opensource</category>
      <category>ai</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
