
yamayu-dev

Building on-device Video Notes in a macOS app

Building on-device Video Notes: SpeechAnalyzer, Foundation Models, and libmpv in a shipping macOS app

My macOS video player, Reel, has a feature called Video Notes: pick a local video, click Generate, and get a timestamped transcript, an optional translation, and a structured summary — entirely on-device. No API keys, no upload, works on an mkv file of a two-hour lecture.

The pipeline is four stages, each on a different technology:

video file (mp4/mov/mkv/webm/…)
   │  libmpv, ao=pcm            → 16 kHz mono WAV      (no Apple Intelligence needed)
   ▼
SpeechAnalyzer (macOS 26)       → timestamped segments  (no Apple Intelligence needed)
   ▼
Translation framework (optional)→ EN↔JA translation     (no Apple Intelligence needed)
   ▼
Foundation Models               → structured summary    (requires Apple Intelligence)

Apple's 2025 APIs demo beautifully. Shipping them is a different story — the interesting bugs are all in the last stage. Here's what each stage looks like in production, and every workaround I needed.


Stage 1: audio extraction with headless libmpv (ao=pcm)

Why not AVFoundation? Because it isn't enough for a local-video library: mkv and webm — both common in the wild — won't open at all. Reel already ships libmpv for playback, and libmpv has a trick that makes it a universal audio extractor without any encoder: the PCM audio output.

guard let h = mpv_create() else { return nil }
func opt(_ k: String, _ v: String) { _ = mpv_set_option_string(h, k, v) }
opt("terminal", "no"); opt("config", "no"); opt("msg-level", "all=no")
opt("vid", "no"); opt("vo", "null")          // headless: no video, no window
opt("ao", "pcm")                             // "play" audio into a WAV file
opt("ao-pcm-file", out.path)
opt("ao-pcm-waveheader", "yes")
opt("audio-samplerate", "16000")             // what speech models want
opt("audio-channels", "mono")
guard mpv_initialize(h) >= 0 else { mpv_terminate_destroy(h); return nil }
// Pseudocode: mpv_command(h, ["loadfile", url.path]). The real call takes a
// NULL-terminated C string array. Then pump events until END_FILE.

You "play" the file with video disabled and the audio device replaced by a WAV writer; decoding runs much faster than realtime. Anything mpv can demux — mkv, webm, avi, and the other containers AVFoundation rejects — becomes a 16 kHz mono WAV. Two production notes:

  • Pump mpv_wait_event until MPV_EVENT_END_FILE, with a deadline. A pathological input that never emits END_FILE would otherwise hang your pipeline forever.
  • Sanity-check the output (exists, > 1 KB) before declaring success; a video with no audio track "succeeds" into an empty file.
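
Both production notes can be sketched in plain Swift. This is a minimal illustration, not Reel's code: `PumpEvent`, `pumpUntilEndFile`, and `hasPlausibleAudio` are hypothetical names, and the `nextEvent` closure stands in for a wrapper around `mpv_wait_event` called with a short per-call timeout.

```swift
import Foundation

/// Simplified stand-in for the libmpv event IDs this loop cares about (hypothetical).
enum PumpEvent { case endFile, shutdown, other }

enum PumpOutcome: Equatable { case finished, timedOut }

/// Polls `nextEvent` until it reports end-of-file or shutdown, or until
/// `timeout` seconds have elapsed. In a real pipeline `nextEvent` would wrap
/// `mpv_wait_event` with a short timeout so the deadline is rechecked regularly.
func pumpUntilEndFile(timeout: TimeInterval, nextEvent: () -> PumpEvent) -> PumpOutcome {
    let deadline = Date().addingTimeInterval(timeout)
    while Date() < deadline {
        switch nextEvent() {
        case .endFile, .shutdown:
            return .finished
        case .other:
            continue
        }
    }
    return .timedOut
}

/// True only when the file exists and is larger than `minimumBytes`.
func hasPlausibleAudio(at url: URL, minimumBytes: Int = 1024) -> Bool {
    guard let attributes = try? FileManager.default.attributesOfItem(atPath: url.path),
          let size = (attributes[.size] as? NSNumber)?.intValue else {
        return false
    }
    return size > minimumBytes
}
```

A caller would treat `.timedOut` as a failed extraction, destroy the mpv handle, and only move on to transcription when `hasPlausibleAudio(at:)` returns true.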

Stage 2: transcription with SpeechAnalyzer (macOS 26)

SpeechAnalyzer / SpeechTranscriber is the new speech stack — on-device, fast (it chewed through long files far quicker than realtime in my testing), and notably it does not require Apple Intelligence, just a model download. Three things the sample code doesn't emphasize:

1. Ask for timestamps via attributeOptions. I want a transcript that seeks the player when you click a line, which means per-segment timing:

let transcriber = SpeechTranscriber(
    locale: locale,
    transcriptionOptions: [],
    reportingOptions: [],
    attributeOptions: [.audioTimeRange]   // ← per-run timing, for click-to-seek
)

The timing comes back as attributed-string runs; read the first run carrying an audioTimeRange to get a segment's start time.
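
A rough sketch of that lookup follows. It assumes the timing attribute is readable on each run as `audioTimeRange` and holds a `CMTimeRange`; `startSeconds(of:)` is a hypothetical helper name, so check the accessor against the SDK you build with.

```swift
import CoreMedia
import Foundation
import Speech

/// Start time, in seconds, of the first run that carries timing information.
/// Assumption: the Speech attribute scope exposes the range as `run.audioTimeRange`.
func startSeconds(of text: AttributedString) -> Double? {
    for run in text.runs {
        if let range = run.audioTimeRange {
            return range.start.seconds
        }
    }
    return nil   // no run in this segment carried timing
}
```

Storing the result next to each segment's plain text is enough for click-to-seek: the click handler only needs a number of seconds to hand to the player.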

2. Model download is your job to surface. First use of a locale requires the asset:

if let request = try await AssetInventory.assetInstallationRequest(supporting: [transcriber]) {
    try await request.downloadAndInstall()
}

This can take a while on first run — show progress, don't just spin.

3. Consume results concurrently with feeding the file. transcriber.results is an async sequence; collect it in a Task while analyzer.analyzeSequence(from:) consumes the audio file, then finalizeAndFinish(through:) the last sample. Collecting after the fact deadlocks; the collector must already be draining.
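
Put together, the ordering looks roughly like this. Treat it as a sketch under assumptions rather than a definitive implementation: the function name `transcribe(_:locale:)` is hypothetical, the input is assumed to be an `AVAudioFile` opened on the extracted WAV, and the `cancelAndFinishNow()` fallback for an empty file reflects my reading of the Speech API.

```swift
import AVFoundation
import Speech

/// Transcribes `audioFile` and returns one attributed string per result.
func transcribe(_ audioFile: AVAudioFile, locale: Locale) async throws -> [AttributedString] {
    let transcriber = SpeechTranscriber(
        locale: locale,
        transcriptionOptions: [],
        reportingOptions: [],
        attributeOptions: [.audioTimeRange]
    )
    let analyzer = SpeechAnalyzer(modules: [transcriber])

    // Start draining results BEFORE any audio is fed to the analyzer.
    let collector = Task {
        var segments: [AttributedString] = []
        for try await result in transcriber.results {
            segments.append(result.text)
        }
        return segments
    }

    // Feed the whole file, then tell the analyzer where the audio ends.
    if let lastSample = try await analyzer.analyzeSequence(from: audioFile) {
        try await analyzer.finalizeAndFinish(through: lastSample)
    } else {
        // Assumption: nothing was analyzed, so end the session immediately.
        await analyzer.cancelAndFinishNow()
    }

    // The results sequence ends once the analyzer finishes, so this returns.
    return try await collector.value
}
```

The important part is the order: the `Task` that iterates `transcriber.results` exists before `analyzeSequence(from:)` is awaited.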

Stage 3: translation — the framework that only lives inside SwiftUI

The Translation framework does on-device EN↔JA (and more) with no Apple Intelligence requirement. Its one architectural surprise: you can't just instantiate a translation session in your service layer. A TranslationSession is only vended through SwiftUI's .translationTask modifier, so the translation stage is driven from the view/model layer, not from the same stateless service as the rest of the pipeline. Plan your architecture around that asymmetry; my pipeline service does extraction/transcription/summary, and the model that owns the SwiftUI surface drives translation.
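
A minimal sketch of the SwiftUI side, assuming EN to JA and a plain array of segment strings. `TranslateSegmentsView` and its properties are hypothetical names for illustration; the session arrives through `.translationTask` exactly as described above.

```swift
import SwiftUI
import Translation

struct TranslateSegmentsView: View {
    let segments: [String]   // plain-text transcript segments (hypothetical input)
    @State private var configuration: TranslationSession.Configuration?
    @State private var translated: [String] = []

    var body: some View {
        VStack(alignment: .leading) {
            Button("Translate") {
                // Setting a non-nil configuration triggers the translation task.
                configuration = TranslationSession.Configuration(
                    source: Locale.Language(identifier: "en"),
                    target: Locale.Language(identifier: "ja")
                )
            }
            ForEach(translated, id: \.self) { line in
                Text(line)
            }
        }
        .translationTask(configuration) { session in
            var output: [String] = []
            for segment in segments {
                // Stop at the first failure instead of showing a partial mix.
                guard let response = try? await session.translate(segment) else { break }
                output.append(response.targetText)
            }
            translated = output
        }
    }
}
```

In a larger app the closure would forward the session to the model object that owns the transcript, which keeps the view thin while still respecting where the session comes from.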

Stage 4: summarization with Foundation Models — where the production issues showed up

This is the stage that "works in the demo" and then fails four different ways on real content. Every workaround below is in shipping code.

4a. The default guardrails refuse ordinary content

My first end-to-end test summarized a tourism clip fine, then refused an ordinary interview video. The default safety guardrails false-positive on perfectly normal human conversation — and summarizing the user's own file is exactly the use case Apple provides relaxed guardrails for:

private static let summaryModel =
    SystemLanguageModel(guardrails: .permissiveContentTransformations)

Switching to .permissiveContentTransformations (the mode intended for content-transformation tasks over user-provided material) cut the false blocks dramatically.

4b. Tell guardrail refusals apart from real failures

When the model does refuse, users deserve a different message than "generation failed":

} catch let error as LanguageModelSession.GenerationError {
    switch error {
    case .guardrailViolation, .refusal: return .blocked   // won't change on retry
    default:                            return .none
    }
}

.blocked gets its own UI string; retrying a safety block is pointless, so don't.
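
For context, here is how the relaxed guardrails and the error split might sit in one call. This is a sketch, not the shipping code: `SummaryFailure` and `summarize(_:instructions:)` are hypothetical names, and the initializer shapes reflect my understanding of the FoundationModels API.

```swift
import FoundationModels

/// Hypothetical failure type: a safety block versus everything else.
enum SummaryFailure: Error { case blocked, other }

func summarize(_ text: String, instructions: String) async -> Result<String, SummaryFailure> {
    // Relaxed guardrails for transforming the user's own content.
    let model = SystemLanguageModel(guardrails: .permissiveContentTransformations)
    let session = LanguageModelSession(model: model, instructions: instructions)
    do {
        let response = try await session.respond(to: text)
        return .success(response.content)
    } catch let error as LanguageModelSession.GenerationError {
        switch error {
        case .guardrailViolation, .refusal:
            return .failure(.blocked)   // retrying will not change the outcome
        default:
            return .failure(.other)
        }
    } catch {
        return .failure(.other)
    }
}
```

The UI layer can then map `.blocked` to its own message and reserve the generic failure string for `.other`.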

4c. The context window is small — map-reduce long transcripts

A single respond(to:) with a long transcript just throws: context window exceeded, no summary for exactly the videos that need one most. The fix is the classic map-reduce: chunk the transcript (~6,000 characters per request), summarize each chunk with a neutral prompt, then summarize the concatenated partial summaries with the user's preferences applied to the final pass. The chunker splits on whitespace, which means word boundaries in space-separated languages. Japanese has no internal spaces, but the transcript is assembled by joining timestamped segments with spaces, so it still chunks correctly — on segment boundaries. If even the combined partials exceed the budget, ship the concatenation rather than fail.
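
The chunk-then-reduce flow can be sketched without any model at all. In this illustration `chunk` and `summarizeLong` are hypothetical names, and the `summarize` closure stands in for a model call; it is synchronous here only to keep the sketch short, whereas a real call would be `async`.

```swift
import Foundation

/// Splits `text` on whitespace and regroups the pieces into chunks of at most
/// `limit` characters, rejoined with single spaces. A single piece longer than
/// `limit` becomes its own oversized chunk rather than being cut in the middle.
func chunk(_ text: String, limit: Int = 6_000) -> [String] {
    var chunks: [String] = []
    var current = ""
    var currentCount = 0
    for piece in text.split(whereSeparator: { $0.isWhitespace }) {
        let pieceCount = piece.count
        if currentCount > 0 && currentCount + 1 + pieceCount > limit {
            chunks.append(current)
            current = ""
            currentCount = 0
        }
        if currentCount > 0 {
            current.append(" ")
            currentCount += 1
        }
        current.append(contentsOf: piece)
        currentCount += pieceCount
    }
    if currentCount > 0 { chunks.append(current) }
    return chunks
}

/// Map-reduce over chunks. `summarize(text, isFinalPass)` is a stand-in for the
/// model call; only the final pass would apply the user's preferences.
func summarizeLong(_ transcript: String,
                   limit: Int = 6_000,
                   summarize: (_ text: String, _ isFinalPass: Bool) throws -> String) throws -> String {
    let chunks = chunk(transcript, limit: limit)
    guard chunks.count > 1 else { return try summarize(transcript, true) }

    var partials: [String] = []
    for piece in chunks {
        partials.append(try summarize(piece, false))   // neutral per-chunk pass
    }
    let combined = partials.joined(separator: "\n")

    // If the partial summaries are still over budget, ship them as they are.
    guard combined.count <= limit else { return combined }
    return try summarize(combined, true)
}
```

Because the transcript is built by joining segments with spaces, the whitespace split lands on word boundaries for English and on segment boundaries for Japanese, as described above.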

4d. The on-device model gets stuck in repetition loops

Occasionally the model emits the same sentence over and over — a classic greedy-decoding failure. Three defenses, all needed:

  1. A little temperature. GenerationOptions(temperature: 0.6) avoids most loops outright.
  2. Collapse what still repeats. Post-process: dedupe identical adjacent lines, then identical adjacent sentences within a line.
  3. Detect degenerate output and retry once. If, after collapsing, more than half the sentences are duplicates, the model looped — regenerate once, then give up rather than show garbage:
private static func isDegenerate(_ text: String) -> Bool {
    let units = text.split(whereSeparator: { $0 == "\n" || $0 == "。" || $0 == "." })
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }
    guard units.count >= 3 else { return false }
    return Set(units).count * 2 < units.count   // over half are duplicates
}
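
Defenses 2 and 3 fit together as below. This is a sketch with hypothetical names (`collapseSentences`, `collapseRepeats`, `generateWithRetry`); `isDegenerate` is repeated as a free function so the block stands alone, and `generate` stands in for the model call. The sentence splitting is deliberately naive and only knows "." and "。".

```swift
import Foundation

/// Same check as above, repeated as a free function so this sketch is self-contained.
func isDegenerate(_ text: String) -> Bool {
    let units = text.split(whereSeparator: { $0 == "\n" || $0 == "。" || $0 == "." })
        .map { $0.trimmingCharacters(in: .whitespaces) }
        .filter { !$0.isEmpty }
    guard units.count >= 3 else { return false }
    return Set(units).count * 2 < units.count
}

/// Drops a sentence when it is identical to the sentence just before it.
func collapseSentences(in line: String) -> String {
    var kept: [String] = []   // sentences as written, leading spaces included
    var current = ""
    func flush() {
        let trimmed = current.trimmingCharacters(in: .whitespaces)
        let previous = kept.last?.trimmingCharacters(in: .whitespaces)
        if !trimmed.isEmpty && trimmed != previous { kept.append(current) }
        current = ""
    }
    for character in line {
        current.append(character)
        if character == "." || character == "。" { flush() }
    }
    flush()
    return kept.joined()
}

/// Dedupes identical adjacent lines first, then adjacent sentences within each line.
func collapseRepeats(_ text: String) -> String {
    var lines: [String] = []
    for line in text.components(separatedBy: "\n") where line != lines.last {
        lines.append(line)
    }
    return lines.map(collapseSentences(in:)).joined(separator: "\n")
}

/// Generates, collapses, and retries once if the result still looks looped.
/// Returns nil when both attempts are degenerate.
func generateWithRetry(generate: () throws -> String) throws -> String? {
    for _ in 0..<2 {
        let text = collapseRepeats(try generate())
        if !isDegenerate(text) { return text }
    }
    return nil
}
```

A `nil` result is the "give up" branch: the caller shows a failure message instead of the looped text.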

4e. Smaller lessons from the same stage

  • Plain text beat Guided Generation for this task. Structured (guided) output occasionally came back malformed; a strictly-specified plain-text format ("1–2 sentence overview, blank line, 3–6 bullets") parses trivially and never broke.
  • Don't summarize the un-summarizable. Below ~140 characters of transcript, a "summary" just restates the input; skip the stage and point users at the transcript.
  • Cross-language output needs shouting. Producing a Japanese summary of an English transcript works, but only if the instructions say, in effect, CRITICAL: write the ENTIRE summary, including headings, in Japanese — a polite request gets you mixed-language output.
  • The transcript is noisy input — say so. One instruction line ("the transcript is auto-generated; silently correct obvious proper-noun misrecognitions") noticeably improves summaries of technical content.
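
To show how little parsing the plain-text format needs, here is a sketch. `ParsedSummary`, `parseSummary`, and `shouldSummarize` are hypothetical names, and the bullet markers ("- " or "• ") are an assumption, since the exact marker is not specified above.

```swift
import Foundation

struct ParsedSummary: Equatable {
    var overview: String
    var bullets: [String]
}

/// Skip the summary stage for very short transcripts (roughly 140 characters).
func shouldSummarize(_ transcript: String, minimumLength: Int = 140) -> Bool {
    transcript.count >= minimumLength
}

/// Parses "overview, blank line, 3 to 6 bullets". Returns nil if the shape is off.
func parseSummary(_ text: String) -> ParsedSummary? {
    let lines = text.trimmingCharacters(in: .whitespacesAndNewlines)
        .split(separator: "\n", omittingEmptySubsequences: false)
        .map { $0.trimmingCharacters(in: .whitespaces) }

    // The first blank line separates the overview from the bullets.
    guard let gap = lines.firstIndex(of: "") else { return nil }
    let overview = lines[..<gap].joined(separator: " ")

    // Assumed bullet markers: "- " or "• ".
    let bullets = lines[(gap + 1)...]
        .filter { $0.hasPrefix("- ") || $0.hasPrefix("• ") }
        .map { String($0.dropFirst(2)) }

    guard !overview.isEmpty, (3...6).contains(bullets.count) else { return nil }
    return ParsedSummary(overview: overview, bullets: bullets)
}
```

A `nil` from `parseSummary` can be treated like any other generation failure, which keeps the happy path free of partial results.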

Degrade gracefully

SystemLanguageModel.default.availability tells you whether Apple Intelligence is on. When it isn't, only the summary stage dies — transcription, translation, and frame captures all still work, so the feature stays useful on machines without Apple Intelligence. Check availability up front and disable exactly one stage, not the feature.
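
One way to express "disable exactly one stage" is sketched below. The `Stage` enum and both functions are hypothetical; only the `availability` check comes from the framework.

```swift
import FoundationModels

enum Stage: CaseIterable { case extraction, transcription, translation, summary }

/// True when the on-device model reports itself as available.
func summaryModelAvailable() -> Bool {
    if case .available = SystemLanguageModel.default.availability { return true }
    return false
}

/// Every stage except `summary` stays enabled regardless of Apple Intelligence.
func enabledStages(summaryAvailable: Bool) -> [Stage] {
    Stage.allCases.filter { $0 != .summary || summaryAvailable }
}

// Usage: compute once up front and let the UI hide only the summary controls.
let stages = enabledStages(summaryAvailable: summaryModelAvailable())
```

Keeping the availability check separate from the stage filter means the filter can be exercised with a plain Boolean on machines that do not have Apple Intelligence.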


Assembly notes

  • Generate once, persist, reload instantly. The pipeline takes seconds to minutes; results are written as JSON (plus captured frames) in Application Support, keyed by a path-derived stable video ID. Regeneration is explicit.
  • Every stage is independent. Extraction/transcription need nothing from Apple Intelligence; translation is optional and on-demand; summary is the only gated stage. Model the pipeline that way and each capability degrades separately.
  • Frame captures reuse the player's engine. Timestamped captures come from the same headless-libmpv screenshot path the player uses for thumbnails, so "capture this moment" costs nothing extra. The end goal is a portable Markdown export — summary, highlights with images, transcript — that opens in Obsidian or GitHub.

The takeaway

On-device AI on the Mac in 2026 is genuinely shippable, but the difficulty is inverted from what you'd expect: the AI part (transcription quality, summary quality) mostly just works, while the production engineering — guardrail false positives, context budgeting, repetition loops, framework lifecycle quirks, graceful degradation — is where the real work lives. None of it is hard once you know it's coming. Now you do.

I built this for Reel, a local video player for macOS, available on the Mac App Store. The companion piece — getting libmpv itself through App Store review — is here.
