DEV Community

nareshipme
Whisper Hallucination on Silence: Why Your Transcript Loops the Same Phrase


The Problem

Have you ever transcribed audio with Whisper (or any ASR model) and gotten something like this in the output:

"Hello, welcome to our show. The weather is nice today."
"Hello, welcome to our show. The weather is nice today."
"Hello, welcome to our show. The weather is nice today."
"Hello, welcome to our show. The weather is nice today."
...

Repeating the same phrase 50+ times, even though the audio was clearly different?
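One quick way to flag this in a finished transcript is to collapse consecutive duplicate segments and count the run lengths — a long run of identical text is a strong hallucination signal. This is my own sketch of a cleanup heuristic (the function name and shapes are made up, not part of any Whisper API):

```javascript
// Collapse consecutive duplicate transcript segments into { text, count }
// entries. A high count is a strong hint the model was looping on silence.
function collapseRepeats(segments) {
  const result = [];
  for (const text of segments) {
    const last = result[result.length - 1];
    if (last && last.text === text) {
      last.count += 1;
    } else {
      result.push({ text, count: 1 });
    }
  }
  return result;
}

// Example: a 4x hallucinated loop collapses to a single entry with count = 4
const segments = [
  "Hello, welcome to our show. The weather is nice today.",
  "Hello, welcome to our show. The weather is nice today.",
  "Hello, welcome to our show. The weather is nice today.",
  "Hello, welcome to our show. The weather is nice today.",
  "Thanks for joining us.",
];
console.log(collapseRepeats(segments));
// → [{ text: "Hello, welcome ...", count: 4 }, { text: "Thanks for joining us.", count: 1 }]
```

This doesn't fix the root cause, but it makes loops easy to spot (and strip) in post-processing.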

What's Happening

This is called hallucination, and it happens when the model encounters silence, background noise, or low-confidence audio segments. Here's why:

1. Silence Gets Interpreted as Speech

When there's silence (or very quiet audio), the encoder produces features that don't match any speech the model saw in training. The decoder is autoregressive — it conditions on the text it has already emitted (Whisper even carries the previous segment forward as a prompt by default) — so with no acoustic evidence to anchor it, the most probable continuation is often whatever it just said. That's why the same phrase loops.

2. Background Music Triggers Loops

Live streams, podcasts, or interviews often have background music or ambient noise. Whisper (and other models) struggle with non-speech audio. The model sees "something happening" and tries to match it to its training data — which is speech — so it loops phrases.

3. No Confidence Threshold

Whisper emits text for every audio window, even when confidence is low. It does expose per-segment signals like no_speech_prob and avg_logprob, but the default pipeline has no built-in VAD (voice activity detection) step that skips silent or background-only segments.
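You can impose a threshold yourself. Whisper's verbose output really does include no_speech_prob and avg_logprob per segment; the filter below is my own sketch, and the threshold values are illustrative defaults you'd tune for your audio:

```javascript
// Drop Whisper segments that look like silence or low-confidence decoding.
// no_speech_prob: model's estimate that the window contains no speech.
// avg_logprob: average token log-probability for the segment (more negative = less confident).
function filterSegments(segments, { maxNoSpeechProb = 0.6, minAvgLogprob = -1.0 } = {}) {
  return segments.filter(
    (s) => s.no_speech_prob < maxNoSpeechProb && s.avg_logprob > minAvgLogprob
  );
}

// Example with hand-made segment objects:
const segments = [
  { text: "Hello, welcome to our show.", no_speech_prob: 0.05, avg_logprob: -0.3 },
  { text: "Hello, welcome to our show.", no_speech_prob: 0.92, avg_logprob: -0.4 }, // likely silence
  { text: "mumble mumble",               no_speech_prob: 0.10, avg_logprob: -2.1 }, // low confidence
];
console.log(filterSegments(segments));
// → keeps only the first segment
```

This is a blunt instrument — aggressive thresholds will also drop quiet but genuine speech — but it cuts the worst hallucinated runs.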

The Workaround: Use Indian-Language-First Models

For content in Indian languages (Telugu, Tamil, Hindi, etc.), Sarvam AI's Saarika/Saaras models are purpose-built for Indian audio patterns. They handle:

  • Indian accents and code-switching
  • Noisy, real-world audio
  • Silence, with far less hallucination

We migrated from Whisper to Sarvam for transcription, and the hallucination issue disappeared.

Long-Term Fix: Add VAD

The proper solution is to add voice activity detection before transcription:

// Pseudo-code for VAD preprocessing — "VAD" and "whisper" stand in for
// whatever VAD library and transcription client you actually use
const vad = new VAD();
const speechSegments = await vad.detect(audioBuffer);

// Only transcribe the detected speech segments
let fullTranscript = "";
for (const segment of speechSegments) {
  const transcript = await whisper.transcribe(segment);
  fullTranscript += transcript + " ";
}

This skips silence and background-only segments entirely, preventing hallucinations.
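If you don't want to pull in a VAD library just to experiment, a crude energy gate shows the idea. Real VADs (Silero VAD, WebRTC VAD) use trained models and are far more robust; this is just a minimal sketch over raw PCM samples, with made-up names and an illustrative threshold:

```javascript
// Minimal energy-based "VAD": split PCM samples (floats in [-1, 1]) into
// fixed-size frames and keep frames whose RMS energy exceeds a threshold.
// Frames below the threshold are treated as silence and skipped.
function detectSpeechFrames(samples, frameSize = 480, threshold = 0.01) {
  const speechFrames = [];
  for (let start = 0; start + frameSize <= samples.length; start += frameSize) {
    let sumSq = 0;
    for (let i = start; i < start + frameSize; i++) {
      sumSq += samples[i] * samples[i];
    }
    const rms = Math.sqrt(sumSq / frameSize);
    if (rms >= threshold) {
      speechFrames.push({ start, end: start + frameSize, rms });
    }
  }
  return speechFrames;
}

// Pure silence produces no frames; a loud tone produces frames for every window.
const silence = new Array(960).fill(0);
console.log(detectSpeechFrames(silence).length); // → 0
```

An energy gate will misfire on loud background music (the exact case described above), which is why model-based VADs are the proper long-term fix — but it already stops silence from reaching the transcriber.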

The Lesson

Don't trust ASR models with silence or background noise. Always preprocess with VAD, or use models trained for your specific audio domain (like Sarvam for Indian languages).

Have you experienced ASR hallucinations? What worked for you?
