We were paying for Notta to transcribe Korean meetings. The Korean accuracy on technical terms was consistently bad — we were spending more time fixing transcripts than just writing notes by hand.
So we built a local Whisper pipeline. Turns out it beats the paid service on Korean accuracy.
📚 Full writeup: https://treesoop.com/blog/whisper-transcription-local-korean-stt-2026
🔧 GitHub: https://github.com/treesoop/whisper_transcription
Setup
Audio → ffmpeg preprocessing → Whisper (large-v3) → sentence boundary post-processing → markdown
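The repo has the real implementation; here is a minimal sketch of that flow, assuming the `openai-whisper` Python package and ffmpeg on PATH. The band-pass filter settings are illustrative, not the repo's exact values.

```python
import subprocess

def preprocess(src: str, dst: str = "clean.wav") -> str:
    """Resample to 16 kHz mono with a light band-pass noise filter
    (80 Hz - 8 kHz here is an assumed choice, not the repo's)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
         "-af", "highpass=f=80,lowpass=f=8000", dst],
        check=True,
    )
    return dst

def segments_to_markdown(segments) -> str:
    """Render Whisper segments as timestamped markdown bullets."""
    lines = []
    for seg in segments:
        m, s = divmod(int(seg["start"]), 60)
        lines.append(f"- **[{m:02d}:{s:02d}]** {seg['text'].strip()}")
    return "\n".join(lines)

if __name__ == "__main__":
    import whisper  # pip install openai-whisper

    wav = preprocess("meeting.m4a")
    model = whisper.load_model("large-v3")
    result = model.transcribe(wav, language="ko")
    print(segments_to_markdown(result["segments"]))
```

Passing `language="ko"` skips Whisper's language detection pass, which saves a little time on long recordings.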
Key decisions:
- Whisper large-v3 for Korean technical vocabulary accuracy. base/small/medium all struggle with domain-specific terms.
- ffmpeg preprocessing — 16kHz sample rate, light noise filter. Measurable accuracy bump.
- Sentence boundary post-processing — Whisper outputs long monologues. We re-chunk using commas, conjunctions, and timestamps.
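The re-chunking step can be sketched as a length-gated splitter. The connective endings and the 60-character threshold below are illustrative assumptions; the actual pipeline also uses segment timestamps, which this sketch omits.

```python
import re

def rechunk(text: str, max_len: int = 60) -> list[str]:
    """Break an over-long Whisper segment into readable chunks.

    Heuristic: allow a break after commas and after common Korean
    connective endings ('-지만', '-는데'), but only emit a chunk once
    it has grown past max_len characters.
    """
    # Candidate break points: after a comma, or after a connective
    # ending followed by whitespace.
    parts = re.split(r"(?<=[,，])\s*|(?<=지만)\s+|(?<=는데)\s+", text)
    chunks, cur = [], ""
    for part in parts:
        cur = f"{cur} {part}".strip() if cur else part
        if len(cur) >= max_len:
            chunks.append(cur)
            cur = ""
    if cur:
        chunks.append(cur)
    return chunks
```

Each chunk then becomes its own markdown line, so a five-minute monologue doesn't land in the notes as a single wall of text.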
Results (30-min Korean meeting)
- Technical term accuracy: noticeably better than paid service
- Processing speed on M1 Pro: faster than realtime
- Cost: zero
- Security: entirely local, no cloud transmission
Why local matters
Most of our use cases can't legally send audio to the cloud:
- Customer meeting recordings (NDA)
- Legal/medical meetings (privacy laws)
- Strategy meetings (trade secrets)
- R&D discussions (IP)
A local-only pipeline removes those concerns entirely.
About VibeVoice
We tested it, but it didn't run stably on Apple Silicon, so we skipped it for this release. We'll revisit once Apple Silicon compatibility improves.
TreeSoop context
We also have a commercial Korean STT product called Asimula with domain-specific fine-tuning for medical/legal. This OSS pipeline is a good starting point if you want to validate basic Whisper quality before investing in domain tuning.
- MIT licensed
- Apple Silicon optimized (M1/M2/M3/M4)
- See repo for setup
More from TreeSoop: ai-news-mcp, hwp-mcp, claude2codex