Cohere released their new ASR model on March 26 with a 5.42% Word Error Rate on the LibriSpeech test-clean benchmark. That's a noticeable improvement over Whisper-large-v3 (~5.7%), and given it's open-source under a permissive license, I spent the last two weeks running it through real-world audio to see if the benchmark numbers translate.
The short answer: yes for clean studio audio, partially for noisy real-world recordings, and not yet for code-switched conversations.
## What's actually new
Cohere's transcribe model departs from Whisper's architecture: it's still an encoder-decoder transformer, but with a lighter decoder. Key claims from the release notes:
- 5.42% WER on LibriSpeech test-clean
- Roughly 30% faster inference than Whisper-large-v3 at similar batch sizes
- Released with weights + inference code (not API-only)
- Supports streaming via chunked inference
The "30% faster" caveat: this assumes you're running on the same hardware Cohere benchmarked. Real-world speedup on consumer GPUs (RTX 4070, M-series Macs) varied from 1.1x to 1.6x in my tests, mostly due to memory bandwidth differences.
## What I tested
I built a small benchmark suite of audio files split across four categories:
- Studio podcast clips (clean, single speaker, professional mic) - 12 files, 60 sec each
- Zoom meeting recordings (multi-speaker, occasional crosstalk, average mic) - 8 files, 90-120 sec each
- Phone call recordings (8kHz, compression artifacts, mobile mic) - 6 files, 30-60 sec each
- Code-switched audio (English-Mandarin, English-Spanish) - 5 files, 60-90 sec each
Ground truth transcripts came from a mix of human transcription (paid) and existing high-quality automatic transcripts that I manually corrected.
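For scoring, something like the jiwer library gets you corpus-level WER in a few lines. A minimal sketch of that kind of scoring loop, assuming one reference and one hypothesis `.txt` file per clip with matching names (the directory layout and normalization choices are my assumptions, not anything from the release):

```python
# Minimal WER scoring sketch. Requires: pip install jiwer
from pathlib import Path

import jiwer

# Normalize both sides so casing/punctuation differences don't inflate WER.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

def wer_for_category(ref_dir: str, hyp_dir: str) -> float:
    """Corpus-level WER over paired transcripts in two directories."""
    refs, hyps = [], []
    for ref_path in sorted(Path(ref_dir).glob("*.txt")):
        refs.append(normalize(ref_path.read_text()))
        hyps.append(normalize((Path(hyp_dir) / ref_path.name).read_text()))
    # jiwer computes WER across all pairs at once.
    return jiwer.wer(refs, hyps)

print(f"studio WER: {wer_for_category('refs/studio', 'hyps/studio'):.1%}")
```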
## Results
Numbers below are WER percentages (lower is better); the Δ column is Cohere minus Whisper in percentage points, so negative means Cohere is more accurate:
| Category | Cohere Transcribe | Whisper-large-v3 | Δ (Cohere - Whisper) |
|---|---|---|---|
| Studio podcast | 4.1% | 5.2% | -1.1% |
| Zoom meeting | 7.8% | 8.6% | -0.8% |
| Phone call (8kHz) | 14.2% | 13.5% | +0.7% |
| Code-switched | 19.6% | 12.4% | +7.2% |
Three takeaways:
- Studio + meeting audio: Cohere wins by 0.8-1.1 points of absolute WER. Noticeable but not transformative.
- Phone audio: Cohere is slightly worse. The training data appears to skew toward 16kHz+ recordings, so quality drops off faster on 8kHz phone audio than it does for Whisper, which has explicit phone-quality augmentation in its training mix.
- Code-switched audio: Cohere is significantly worse. Whisper-large-v3 was trained on multilingual data with code-switching; Cohere's training emphasis seems heavier on monolingual English. If your use case involves bilingual speakers, Whisper still wins.
## Latency comparison
Inference speed mattered for me because I'm building a small note-transcription tool. Average wall-clock time to transcribe 60 seconds of audio on an RTX 4070 (12GB):
- Cohere Transcribe (default chunking): 4.2 seconds
- Whisper-large-v3 (CTranslate2): 6.1 seconds
- Whisper-large-v3 (vanilla PyTorch): 11.8 seconds
With streaming enabled, Cohere's time to first usable token dropped further, to about 1.8 seconds vs. around 2.5 seconds for Whisper-streaming. The "30% faster" claim is roughly accurate for batched inference; the streaming gap is closer to 25-30%.
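If you want to reproduce the batch-path numbers, the shape of the harness matters more than the exact model call. A minimal sketch using faster-whisper's real API; the same `timed()` wrapper works for any transcribe callable, including whatever entry point Cohere's released inference code exposes:

```python
# Wall-clock timing sketch for the batch (non-streaming) path.
# faster-whisper is the CTranslate2 backend (pip install faster-whisper).
import time

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cuda", compute_type="float16")

def timed(fn, *args, warmup=1, runs=5):
    """Median wall-clock seconds over `runs`, after `warmup` untimed calls."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn(*args)
        samples.append(time.perf_counter() - start)
    return sorted(samples)[len(samples) // 2]

def whisper_transcribe(path):
    # transcribe() returns a lazy generator; exhaust it so the timing
    # covers the full decode, not just model setup.
    segments, _info = model.transcribe(path)
    return " ".join(seg.text for seg in segments)

print(f"whisper-large-v3 (CTranslate2): {timed(whisper_transcribe, 'clip_60s.wav'):.1f}s")
```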
## Where it fits in the OSS speech stack
A practical framework for picking a model in April 2026:
- Studio podcasts, audiobooks, clean single-speaker audio: Cohere Transcribe wins on accuracy + speed
- Multi-speaker meetings (Zoom/Meet/Teams): Cohere has slight edge but both work; pick based on infra preference
- Phone audio, telephony, voicemail: Whisper-large-v3 still has the edge from telephony augmentation
- Code-switched / multilingual / bilingual conversations: Whisper, no question
- Real-time streaming UX (sub-2-sec first token): Cohere's streaming is meaningfully better
If you only have one model deployed, Whisper-large-v3 is still the safer default for general use because of the multilingual coverage. If you can deploy two, swap to Cohere for clean English audio paths.
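If you do deploy both, the dispatch logic can stay boring. A hypothetical sketch of that routing; the metadata fields and model names are placeholders for whatever your pipeline already tracks per clip, not real APIs:

```python
# Hypothetical two-model router: clean English audio goes to Cohere,
# telephony and code-switched audio stay on Whisper.
from dataclasses import dataclass

@dataclass
class ClipMetadata:
    language: str        # e.g. "en", or "en+zh" for code-switched audio
    sample_rate_hz: int  # 8000 for telephony, 16000+ for most other sources

def pick_model(meta: ClipMetadata) -> str:
    if meta.sample_rate_hz <= 8000 or "+" in meta.language:
        return "whisper-large-v3"   # telephony / code-switching (see table)
    if meta.language == "en":
        return "cohere-transcribe"  # clean English path: faster, more accurate
    return "whisper-large-v3"       # safer multilingual default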
## Quick implementation notes
Running Cohere Transcribe locally took about 30 minutes from clone to first transcription. Notes from setup:
- The default inference script assumes CUDA. CPU fallback works but is roughly 8-12x slower
- Batch size affects memory more than throughput in my testing - I got best throughput at batch_size=4 on a 12GB GPU
- Streaming mode requires an explicit chunk_length parameter; it defaults to 30 seconds and can go down to 5 seconds for lower latency at the cost of slightly higher WER (see the sketch after this list)
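To make the chunk_length trade-off concrete, here's a hypothetical sketch of a chunked-streaming loop. The `transcribe_chunk` method is a stand-in, not Cohere's actual API; only the chunking arithmetic is meant literally:

```python
# Hypothetical chunked-streaming loop; `transcribe_chunk` is a placeholder.
import numpy as np

SAMPLE_RATE = 16_000
CHUNK_SECONDS = 5  # lower latency than the 30s default, slightly higher WER

def stream_transcribe(model, audio: np.ndarray):
    """Yield a partial transcript after each fixed-length chunk."""
    chunk_len = CHUNK_SECONDS * SAMPLE_RATE
    for start in range(0, len(audio), chunk_len):
        chunk = audio[start : start + chunk_len]
        # Each chunk is decoded independently here; a production version
        # would carry decoder context across chunk boundaries.
        yield model.transcribe_chunk(chunk)
```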
Compared to integrating Whisper via OpenAI's Python package (~10 minutes to first transcription), Cohere's setup is more manual but doesn't require an API key for self-hosting.
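For reference, the Whisper path that ~10-minute figure refers to is essentially this (real openai-whisper API):

```python
# The Whisper baseline: pip install openai-whisper, then three lines.
import whisper

model = whisper.load_model("large-v3")
result = model.transcribe("clip_60s.wav")
print(result["text"])
```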
## What I'm watching next
A few open questions I want to test when I have time:
- Long-form audio (30+ min): Both models drift on long audio without explicit chunking. Cohere's streaming mode might handle this better but I haven't measured.
- Domain-specific fine-tuning: Cohere's open weights make fine-tuning easier than Whisper if you have labeled audio in your vertical (legal, medical, technical podcasts).
- Distillation: Whisper has community distilled variants (Distil-Whisper, faster-whisper). If Cohere's community produces similar distilled versions, that closes the size/speed gap further.
- Voice activity detection (VAD) integration: Whisper has well-tested integrations with Silero VAD and pyannote; Cohere's ecosystem is younger (minimal Silero example after this list)
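For context on what "well-tested" looks like on the Whisper side, Silero's documented torch.hub entry point is about this much code; feeding the resulting speech segments into either ASR model is the wiring you'd still build yourself:

```python
# Silero VAD via its documented torch.hub entry point. Passing the detected
# segments into an ASR model is the part Cohere's ecosystem lacks wrappers for.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

wav = read_audio("clip_60s.wav", sampling_rate=16_000)
# Returns [{'start': sample_idx, 'end': sample_idx}, ...] for speech regions.
timestamps = get_speech_timestamps(wav, model, sampling_rate=16_000)
print(f"{len(timestamps)} speech segments detected")
```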
## Closing thought
Cohere Transcribe is a solid drop-in replacement for Whisper in clean-audio paths, with meaningful inference speed gains. It's not a Whisper killer, because the multilingual and telephony coverage isn't there yet, but it's the first OSS speech model in a while that competes with Whisper-large-v3 on the dimensions that matter for production deployment.
If you're shipping anything that touches audio, it's worth running your own benchmark on your actual audio distribution. The benchmark numbers vendors report are useful but rarely match your domain.