If you self-host media and lean on Whisper (via subgen, Bazarr's Whisper provider, whatever) to generate subtitles for the stuff no one has subs for, you've probably seen this: a show that's entirely in English comes back with Japanese subtitles. Or a dubbed film gets transcribed in the original language, which nobody in your house speaks. And it cost you a 40-minute GPU run to get the wrong answer.
I kept hitting it, dug into why, and the cause is almost funny once you see it.
The 30-second trap
Whisper detects language from a single short window, by default the first 30 seconds of the audio, and trusts whatever it hears there. That's fine for a podcast. It's a trap for media, because the opening 30 seconds of a file are very often not representative of the actual dialogue:
An English dub whose cold open is the original-language theme song or opening narration.
A foreign film with a silent or music-only intro.
Anime, where the OP is in Japanese over an otherwise English-dubbed episode.
Detect on that window and you confidently get the wrong language, then commit a full transcription run to it. Garbage out, GPU time wasted, and your "wanted subtitles" list quietly fills up with junk.
The fix: sample more, trust less
The change that fixed it for me wasn't a better model, it was a better sampling strategy. Instead of one 30-second window at the start:
Pull a few chunks from across the whole file, not just the opening.
Run language detection on each chunk independently.
Vote, conservatively. If the chunks don't agree, or confidence is low, don't trust the result.
Confidence-gate the decision. A low-confidence detection is treated as "unknown / needs a human or a different signal," not as fact.
The cheap part is that detection on a handful of short chunks is way faster than a full transcription, so you spend a few seconds to avoid burning 40 minutes on the wrong language. The conservative voting is the important bit: it's better to say "I'm not sure" and skip than to be confidently wrong and pollute the library.
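Here's a minimal sketch of the sampling and voting steps, to make the idea concrete. It is not the code from the repo: the names `chunk_offsets`, `vote`, and `Detection` are made up for illustration, and the 0.6 thresholds are placeholder assumptions, not tuned values. The detector itself is left out; the sketch assumes something upstream has already returned a language and a confidence for each chunk.

```python
from collections import Counter
from typing import List, NamedTuple, Optional


class Detection(NamedTuple):
    language: str      # language code reported for one chunk, e.g. "en"
    confidence: float  # detector's probability for that language, 0.0 to 1.0


def chunk_offsets(duration_s: float, n_chunks: int = 5, chunk_s: float = 30.0) -> List[float]:
    """Start times (in seconds) for chunks spread evenly across the file."""
    usable = max(duration_s - chunk_s, 0.0)
    if n_chunks <= 1 or usable == 0.0:
        return [0.0]
    # Centre each chunk in its own slice of the file, so no chunk
    # starts at the very beginning (where the theme song lives).
    return [round(usable * (i + 0.5) / n_chunks, 1) for i in range(n_chunks)]


def vote(
    detections: List[Detection],
    min_confidence: float = 0.6,  # assumed threshold, tune for your library
    min_agreement: float = 0.6,   # assumed threshold, tune for your library
) -> Optional[str]:
    """Return the agreed language, or None when the result is not trustworthy."""
    confident = [d for d in detections if d.confidence >= min_confidence]
    if not confident:
        return None
    winner, count = Counter(d.language for d in confident).most_common(1)[0]
    # Agreement is measured against ALL chunks, so low-confidence chunks
    # count against the winner instead of being quietly ignored.
    if count / len(detections) < min_agreement:
        return None
    return winner
```

With this shape, an English dub with a Japanese opening gets outvoted by the rest of the episode, while a genuinely mixed or noisy file returns `None` and gets skipped rather than transcribed in a guessed language.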
The bigger principle: measure, don't parrot
That detection fix is one instance of a pattern I ended up building the whole tool around: don't trust metadata, verify it.
The *arr stack is full of fields that claim things. Audio language tags lie. "Wanted" lists include files that already have an embedded sub. A subtitle's filename says one thing, the stream says another. If you act on the claim, you end up doing wasted or wrong work. If you probe the file first and act on what's actually there, you don't.
So before anything is called a real gap, the file gets verified on disk (ffprobe: what audio is actually in here, is there already an embedded sub). Only verified rows become actionable. Un-probed ones wait, clearly marked, instead of being silently surfaced as confident gaps. Same philosophy as the language detection: spend a little to verify, so you don't spend a lot being wrong.
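As a rough sketch of what that probe step can look like (again, illustrative and not the repo's code): ffprobe can dump a file's streams as JSON, and a small function can check whether a subtitle stream in the wanted language is already embedded. The function names here are hypothetical. Note that the language values read here are still tags, so for audio they only tell you what the file claims; what the probe does confirm is which streams physically exist.

```python
import json
import subprocess
from typing import Any, Dict, List


def probe_streams(path: str) -> Dict[str, Any]:
    """Run ffprobe on a file and return its stream description as a dict."""
    result = subprocess.run(
        ["ffprobe", "-v", "error", "-print_format", "json", "-show_streams", path],
        capture_output=True,
        text=True,
        check=True,  # raise if ffprobe fails, so the row stays un-probed
    )
    return json.loads(result.stdout)


def summarize(probe: Dict[str, Any]) -> Dict[str, List[str]]:
    """List the language tags of the audio and subtitle streams actually present."""
    summary: Dict[str, List[str]] = {"audio": [], "subtitle": []}
    for stream in probe.get("streams", []):
        kind = stream.get("codec_type")
        if kind in summary:
            # Untagged streams are recorded as "und" (undetermined).
            summary[kind].append(stream.get("tags", {}).get("language", "und"))
    return summary


def is_real_gap(probe: Dict[str, Any], wanted_language: str) -> bool:
    """True only if no embedded subtitle stream carries the wanted language tag."""
    return wanted_language not in summarize(probe)["subtitle"]
```

Usage would be something like `is_real_gap(probe_streams("/media/show/ep01.mkv"), "eng")`, where the path is a placeholder. ffprobe typically reports three-letter codes such as `eng`, so the wanted language needs to be in the same form.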
Where this lives
I packaged all of this into a self-hosted tool called subarr, which sits beside Bazarr and subgen (it doesn't replace either, it's the coordination layer and the interface they don't have). The backend is Python/FastAPI, and the detection runs through a patched subgen. It's MIT-licensed, runs in Docker, and will run on a Pi if you're patient.
Repo's here if you want to poke at the detection code or just yell at my approach: https://github.com/coaxk/subarr
Genuinely curious whether others have hit the 30-second trap and solved it differently. The conservative-voting threshold especially feels like something people will have strong opinions on.