We were paying for Notta to transcribe Korean meetings. The Korean accuracy on technical terms was consistently bad — we were spending more time fixing transcripts than just writing notes by hand.
So we built a local Whisper pipeline. Turns out it beats the paid service on Korean accuracy.
📚 Full writeup: https://treesoop.com/blog/whisper-transcription-local-korean-stt-2026
🔧 GitHub: https://github.com/treesoop/whisper_transcription
Setup
Audio → ffmpeg preprocessing → Whisper (large-v3) → sentence boundary post-processing → markdown
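The repo has the real implementation; here is a minimal sketch of that flow, assuming the `openai-whisper` Python package and ffmpeg on PATH. The band-pass filter settings are illustrative, not the repo's exact values.

```python
import subprocess

def preprocess(src: str, dst: str = "clean.wav") -> str:
    """Resample to 16 kHz mono with a light band-pass noise filter
    (80 Hz - 8 kHz here is an assumed choice, not the repo's)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-af", "highpass=f=80,lowpass=f=8000", dst],
        check=True,
    )
    return dst

def segments_to_markdown(segments) -> str:
    """Render Whisper segments as timestamped markdown bullets."""
    lines = []
    for seg in segments:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"- **[{m:02d}:{s:02d}]** {seg['text'].strip()}")
    return "\n".join(lines)

if __name__ == "__main__":
    import whisper  # pip install openai-whisper

    wav = preprocess("meeting.m4a")
    model = whisper.load_model("large-v3")
    result = model.transcribe(wav, language="ko")
    print(segments_to_markdown(result["segments"]))
```

Passing `language="ko"` skips Whisper's language detection pass, which saves a little time on long recordings.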
Key decisions:
- Whisper large-v3 for Korean technical vocabulary accuracy. base/small/medium all struggle with domain-specific terms.
- ffmpeg preprocessing — 16kHz sample rate, light noise filter. Measurable accuracy bump.
- Sentence boundary post-processing — Whisper outputs long monologues. We re-chunk using commas, conjunctions, and timestamps.
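The re-chunking step can be sketched as a length-gated splitter. The connective endings and the 60-character threshold below are illustrative assumptions; the actual pipeline also uses segment timestamps, which this sketch omits.

```python
import re

def rechunk(text: str, max_len: int = 60) -> list[str]:
    """Break an over-long Whisper segment into readable chunks.

    Heuristic: allow a break after commas and after common Korean
    connective endings ('-지만', '-는데'), but only emit a chunk once
    it has grown past max_len characters.
    """
    # Candidate break points: after a comma, or after a connective
    # ending followed by whitespace.
    parts = re.split(r"(?<=[,，])\s*|(?<=지만)\s+|(?<=는데)\s+", text)
    chunks, cur = [], ""
    for part in parts:
        cur = f"{cur} {part}".strip() if cur else part
        if len(cur) >= max_len:
            chunks.append(cur)
            cur = ""
    if cur:
        chunks.append(cur)
    return chunks
```

Each chunk then becomes its own markdown line, so a five-minute monologue doesn't land in the notes as a single wall of text.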
Results (30-min Korean meeting)
- Technical term accuracy: noticeably better than paid service
- Processing speed on M1 Pro: faster than realtime
- Cost: zero
- Security: entirely local, no cloud transmission
Why local matters
Most of our use cases can't legally send audio to the cloud:
- Customer meeting recordings (NDA)
- Legal/medical meetings (privacy laws)
- Strategy meetings (trade secrets)
- R&D discussions (IP)
A local-only pipeline removes those concerns entirely.
About VibeVoice
We tested it, but it didn't run stably on Apple Silicon, so we skipped it for this release. We'll revisit once Apple Silicon compatibility improves.
TreeSoop context
We also have a commercial Korean STT product called Asimula with domain-specific fine-tuning for medical/legal. This OSS pipeline is a good starting point if you want to validate basic Whisper quality before investing in domain tuning.
- MIT licensed
- Apple Silicon optimized (M1/M2/M3/M4)
- See repo for setup
More from TreeSoop: ai-news-mcp, hwp-mcp, claude2codex