Cohere released their new ASR model on March 26 with a 5.42% Word Error Rate on the LibriSpeech test-clean benchmark. That's a noticeable improvement over Whisper-large-v3 (~5.7%), and given it's open-source under a permissive license, I spent the last two weeks running it through real-world audio to see if the benchmark numbers translate.
The short answer: yes for clean studio audio, partially for noisy real-world recordings, and not yet for code-switched conversations.
## What's actually new
Cohere's transcribe model departs from Whisper's architecture: it's still an encoder-decoder transformer, but with a lighter decoder. Key claims from the release notes:
- 5.42% WER on LibriSpeech test-clean
- Roughly 30% faster inference than Whisper-large-v3 at similar batch sizes
- Released with weights + inference code (not API-only)
- Supports streaming via chunked inference
The "30% faster" caveat: this assumes you're running on the same hardware Cohere benchmarked. Real-world speedup on consumer GPUs (RTX 4070, M-series Macs) varied from 1.1x to 1.6x in my tests, mostly due to memory bandwidth differences.
## What I tested
I built a small benchmark suite of audio files split across four categories:
- Studio podcast clips (clean, single speaker, professional mic) - 12 files, 60 sec each
- Zoom meeting recordings (multi-speaker, occasional crosstalk, average mic) - 8 files, 90-120 sec each
- Phone call recordings (8kHz, compression artifacts, mobile mic) - 6 files, 30-60 sec each
- Code-switched audio (English-Mandarin, English-Spanish) - 5 files, 60-90 sec each
Ground truth transcripts came from a mix of human transcription (paid) and existing high-quality automatic transcripts that I manually corrected.
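For scoring, something like the jiwer library gets you corpus-level WER in a few lines. A minimal sketch of that kind of scoring loop, assuming one reference and one hypothesis `.txt` file per clip with matching names (the directory layout and normalization choices are my assumptions, not anything from the release):

```python
# Minimal WER scoring sketch. Requires: pip install jiwer
from pathlib import Path

import jiwer

# Normalize both sides so casing/punctuation differences don't inflate WER.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

def wer_for_category(ref_dir: str, hyp_dir: str) -> float:
    """Corpus-level WER over paired transcripts in two directories."""
    refs, hyps = [], []
    for ref_path in sorted(Path(ref_dir).glob("*.txt")):
        refs.append(normalize(ref_path.read_text()))
        hyps.append(normalize((Path(hyp_dir) / ref_path.name).read_text()))
    # jiwer computes WER across all pairs at once.
    return jiwer.wer(refs, hyps)

print(f"studio WER: {wer_for_category('refs/studio', 'hyps/studio'):.1%}")
```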
## Results
Numbers below are WER percentages (lower is better); the Δ column is Cohere minus Whisper in percentage points, so negative means Cohere is more accurate:
| Category | Cohere Transcribe | Whisper-large-v3 | Δ (Cohere - Whisper) |
|---|---|---|---|
| Studio podcast | 4.1% | 5.2% | -1.1% |
| Zoom meeting | 7.8% | 8.6% | -0.8% |
| Phone call (8kHz) | 14.2% | 13.5% | +0.7% |
| Code-switched | 19.6% | 12.4% | +7.2% |
Three takeaways:
- Studio + meeting audio: Cohere wins by 0.8-1.1 points of absolute WER. Noticeable but not transformative.
- Phone audio: Cohere is slightly worse. The training data appears to skew toward 16kHz+ recordings, so quality drops off faster on 8kHz phone audio than it does for Whisper, which has explicit phone-quality augmentation in its training mix.
- Code-switched audio: Cohere is significantly worse. Whisper-large-v3 was trained on multilingual data with code-switching; Cohere's training emphasis seems heavier on monolingual English. If your use case involves bilingual speakers, Whisper still wins.
## Latency comparison
Inference speed mattered for me because I'm building a small note-transcription tool. Average wall-clock time to transcribe 60 seconds of audio on an RTX 4070 (12GB):
- Cohere Transcribe (default chunking): 4.2 seconds
- Whisper-large-v3 (CTranslate2): 6.1 seconds
- Whisper-large-v3 (vanilla PyTorch): 11.8 seconds
With streaming enabled, Cohere's time to first usable token dropped further, to about 1.8 seconds vs. around 2.5 seconds for Whisper-streaming. The "30% faster" claim is roughly accurate for batched inference; the streaming gap is closer to 25-30%.
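If you want to reproduce the batch-path numbers, the shape of the harness matters more than the exact model call. A minimal sketch using faster-whisper's real API; the same `timed()` wrapper works for any transcribe callable, including whatever entry point Cohere's released inference code exposes:

```python
# Wall-clock timing sketch for the batch (non-streaming) path.
# faster-whisper is the CTranslate2 backend (pip install faster-whisper).
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def timed(fn, *args, warmup=1, runs=5):
    """Median wall-clock seconds over `runs`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def whisper_transcribe(path):
    # transcribe() returns a lazy generator; exhaust it so the timing
    # covers the full decode, not just model setup.
    segments, _info = model.transcribe(path)
    return " ".join(seg.text for seg in segments)

print(f"whisper-large-v3 (CTranslate2): {timed(whisper_transcribe, 'clip_60s.wav'):.1f}s")
```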
## Where it fits in the OSS speech stack
A practical framework for picking a model in April 2026:
- Studio podcasts, audiobooks, clean single-speaker audio: Cohere Transcribe wins on accuracy + speed
- Multi-speaker meetings (Zoom/Meet/Teams): Cohere has slight edge but both work; pick based on infra preference
- Phone audio, telephony, voicemail: Whisper-large-v3 still has the edge from telephony augmentation
- Code-switched / multilingual / bilingual conversations: Whisper, no question
- Real-time streaming UX (sub-2-sec first token): Cohere's streaming is meaningfully better
If you only have one model deployed, Whisper-large-v3 is still the safer default for general use because of the multilingual coverage. If you can deploy two, swap to Cohere for clean English audio paths.
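If you do deploy both, the dispatch logic can stay boring. A hypothetical sketch of that routing; the metadata fields and model names are placeholders for whatever your pipeline already tracks per clip, not real APIs:

```python
# Hypothetical two-model router: clean English audio goes to Cohere,
# telephony and code-switched audio stay on Whisper.
from dataclasses import dataclass

@dataclass
class ClipMetadata:
    language: str        # e.g. "en", or "en+zh" for code-switched audio
    sample_rate_hz: int  # 8000 for telephony, 16000+ for most other sources

def pick_model(meta: ClipMetadata) -> str:
    if meta.sample_rate_hz <= 8000 or "+" in meta.language:
        return "whisper-large-v3"   # telephony / code-switching (see table)
    if meta.language == "en":
        return "cohere-transcribe"  # clean English path: faster, more accurate
    return "whisper-large-v3"       # safer multilingual default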
## Quick implementation notes
Running Cohere Transcribe locally took about 30 minutes from clone to first transcription. Notes from setup:
- The default inference script assumes CUDA. CPU fallback works but is roughly 8-12x slower
- Batch size affects memory more than throughput in my testing - I got best throughput at batch_size=4 on a 12GB GPU
- Streaming mode requires an explicit chunk_length parameter; it defaults to 30 seconds and can go down to 5 seconds for lower latency at the cost of slightly higher WER (see the sketch after this list)
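To make the chunk_length trade-off concrete, here's a hypothetical sketch of a chunked-streaming loop. The `transcribe_chunk` method is a stand-in, not Cohere's actual API; only the chunking arithmetic is meant literally:

```python
# Hypothetical chunked-streaming loop; `transcribe_chunk` is a placeholder.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 5  # lower latency than the 30s default, slightly higher WER

def stream_transcribe(model, audio: np.ndarray):
    """Yield a partial transcript after each fixed-length chunk."""
    chunk_len = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start : start + chunk_len]
        # Each chunk is decoded independently here; a production version
        # would carry decoder context across chunk boundaries.
        yield model.transcribe_chunk(chunk)
```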
Compared to integrating Whisper via OpenAI's Python package (~10 minutes to first transcription), Cohere's setup is more manual but doesn't require an API key for self-hosting.
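For reference, the Whisper path that ~10-minute figure refers to is essentially this (real openai-whisper API):

```python
# The Whisper baseline: pip install openai-whisper, then three lines.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("clip_60s.wav")
print(result["text"])
```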
## What I'm watching next
A few open questions I want to test when I have time:
- Long-form audio (30+ min): Both models drift on long audio without explicit chunking. Cohere's streaming mode might handle this better but I haven't measured.
- Domain-specific fine-tuning: Cohere's open weights make fine-tuning easier than Whisper if you have labeled audio in your vertical (legal, medical, technical podcasts).
- Distillation: Whisper has community distilled variants (Distil-Whisper, faster-whisper). If Cohere's community produces similar distilled versions, that closes the size/speed gap further.
- Voice activity detection (VAD) integration: Whisper has well-tested integrations with Silero VAD and pyannote; Cohere's ecosystem is younger (minimal Silero example after this list)
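For context on what "well-tested" looks like on the Whisper side, Silero's documented torch.hub entry point is about this much code; feeding the resulting speech segments into either ASR model is the wiring you'd still build yourself:

```python
# Silero VAD via its documented torch.hub entry point. Passing the detected
# segments into an ASR model is the part Cohere's ecosystem lacks wrappers for.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("clip_60s.wav", sampling_rate=16_000)
# Returns [{'start': sample_idx, 'end': sample_idx}, ...] for speech regions.
timestamps = get_speech_timestamps(wav, model, sampling_rate=16_000)
print(f"{len(timestamps)} speech segments detected")
```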
## Closing thought
Cohere Transcribe is a solid drop-in replacement for Whisper in clean-audio paths, with meaningful inference speed gains. It's not a Whisper killer, because the multilingual and telephony coverage isn't there yet, but it's the first OSS speech model in a while that competes with Whisper-large-v3 on the dimensions that matter for production deployment.
If you're shipping anything that touches audio, it's worth running your own benchmark on your actual audio distribution. The benchmark numbers vendors report are useful but rarely match your domain.