This is an excellent deep dive — the VAD comparison table alone is worth bookmarking.
One practical insight from building real-time voice agents: VAD threshold tuning is where most people lose hours. Silero's default settings work great in quiet rooms, but in real-world conditions (call centers, mobile phones in noisy environments, speakerphone in a car) you'll need to tune min_silence_duration_ms and speech_pad_ms aggressively.
A pattern that's worked well for us: start with aggressive endpointing (short silence threshold ~300ms) for responsive agents, then dynamically adjust based on detected background noise levels. If the audio energy baseline is high, extend the silence threshold to avoid cutting users off mid-thought.
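To make that concrete, here's a rough sketch of the idea in plain Python. It isn't our production code and it isn't part of Silero's API: the class name, the dBFS cutoffs, the 900 ms ceiling, and the smoothing factor are all made-up values for illustration, and it assumes frames arrive as float samples in [-1, 1] with a speech/non-speech flag from whatever VAD you use.

```python
import math
from dataclasses import dataclass


@dataclass
class AdaptiveEndpointer:
    """Hypothetical helper: scales the end-of-turn silence threshold with noise."""
    base_silence_ms: int = 300       # aggressive default from the comment above
    max_silence_ms: int = 900        # assumed ceiling for very noisy audio
    quiet_dbfs: float = -60.0        # assumed noise floor of a quiet room
    noisy_dbfs: float = -30.0        # assumed noise floor of a loud environment
    smoothing: float = 0.05          # weight of each new frame in the running average
    noise_floor_dbfs: float = -60.0  # running estimate, starts at "quiet"

    def update_noise_floor(self, frame, is_speech):
        """Fold the energy of a non-speech frame into the noise-floor estimate."""
        if is_speech or not frame:
            return
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        level_dbfs = 20 * math.log10(max(rms, 1e-6))
        self.noise_floor_dbfs += self.smoothing * (level_dbfs - self.noise_floor_dbfs)

    def silence_threshold_ms(self):
        """Interpolate linearly between the base and max silence thresholds."""
        span = self.noisy_dbfs - self.quiet_dbfs
        t = (self.noise_floor_dbfs - self.quiet_dbfs) / span
        t = min(1.0, max(0.0, t))
        return round(self.base_silence_ms + t * (self.max_silence_ms - self.base_silence_ms))
```

You'd call update_noise_floor on every frame the VAD marks as non-speech, then read silence_threshold_ms() whenever you decide whether a pause is long enough to end the turn. Only updating on non-speech frames keeps the user's own voice from inflating the baseline.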
Also worth noting for anyone building multilingual agents: Silero VAD's "6000+ languages" claim is technically true but performance degrades significantly on tonal languages and languages with longer natural pauses (like Japanese). Worth running your own evaluation on your target language before committing.
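If you want a starting point for that evaluation, something like the sketch below is enough to compare VAD output against hand-labeled speech segments. The function names and the 10 ms frame size are my own choices for illustration, and it assumes segments are (start_ms, end_ms) pairs. It's a frame-level precision/recall check, not a standard benchmark.

```python
def to_frames(segments, total_ms, frame_ms=10):
    """Mark each frame True if any (start_ms, end_ms) segment covers it."""
    n = total_ms // frame_ms
    frames = [False] * n
    for start, end in segments:
        first = max(0, start // frame_ms)
        last = min(n, -(-end // frame_ms))  # ceiling division
        for i in range(first, last):
            frames[i] = True
    return frames


def vad_scores(reference, predicted, total_ms, frame_ms=10):
    """Frame-level precision and recall, plus how much labeled speech was missed."""
    ref = to_frames(reference, total_ms, frame_ms)
    hyp = to_frames(predicted, total_ms, frame_ms)
    tp = sum(1 for r, h in zip(ref, hyp) if r and h)
    fp = sum(1 for r, h in zip(ref, hyp) if h and not r)
    fn = sum(1 for r, h in zip(ref, hyp) if r and not h)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "clipped_ms": fn * frame_ms,  # labeled speech the VAD did not detect
    }
```

For languages with long natural pauses, clipped_ms and recall are the numbers I'd watch, since they show how much real speech the VAD is dropping.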
For the STT piece, I'd add that Nvidia's Parakeet TDT2 (mentioned in the table but buried) is becoming the go-to for English: it's beating Whisper large-v3 on most benchmarks while being significantly faster. The catch is that it's English-only, so if you need multilingual support, you're still looking at Whisper or specialized models.