Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

How I Stopped Empty Tray Captures From Reaching Whisper in Yapper

## The bug was not transcription quality

The first sign of trouble was not in the output text. It was in the workflow. Yapper tray mode would finish a recording, hand it off, and still proceed toward transcription even when the capture had no speech worth sending upstream. That is a bad use of a speech-to-text pipeline, because the expensive part of the system should only run when the input has a real chance of producing text.

I knew immediately that this was not a model problem. Whisper was doing exactly what I asked it to do. The mistake was earlier: I was asking too often. In a background tray app, that distinction matters a lot. A single empty capture is annoying. A stream of them turns into wasted requests, noisy logs, and a system that feels busy without being useful.

That is why the fix lives in the tray experience and in the settings UI, not in the transcription layer itself. In app/settings_ui.py, Yapper now exposes a Voice Activity Detection section with a toggle and a threshold slider. The copy under that section says what the feature is for in plain language: "Skip API calls when no speech is detected (reduces costs)." That is the behavior I wanted, and it is the behavior I wired the app around.
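The two levers can be modeled as a tiny settings object. This is an illustrative sketch, not the actual code in app/settings_ui.py; the class and field names are assumptions.

```python
from dataclasses import dataclass

@dataclass
class VadSettings:
    """The two VAD levers the settings panel exposes (names are illustrative)."""
    enabled: bool = False          # the on/off toggle
    threshold_db: float = -35.0    # the slider value, in dB

# The default mirrors the shipped behavior: VAD off until the user opts in,
# with the -35 dB threshold ready as the starting point.
settings = VadSettings(enabled=True)
print(settings.threshold_db)
```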

The important change is not that Yapper became smarter. It became more selective. Tray mode should not treat every completed recording as evidence that the user meant to speak. If the capture does not cross the VAD bar, the app should stop there.

```mermaid
flowchart TD
  mic[Microphone capture] --> tray[Tray mode recorder]
  tray --> vad[VAD setting and threshold]
  vad -->|speech detected| whisper[Whisper API request]
  vad -->|no speech| skip[Skip API call]
  whisper --> output[Transcript output]
```



## What the settings panel actually changed

The most visible part of this work is the VAD block in the settings window. I added it there on purpose, because this kind of behavior should be tunable by the person using the app. Different microphones behave differently. Different rooms behave differently. A laptop mic three feet away from your mouth is a very different problem from a desk mic in a quiet office.

The settings UI gives the user two levers. The first is a simple enable switch for Voice Activity Detection. The second is the threshold slider, which defaults to `-35 dB`. That default is a practical middle ground. It is low enough to avoid treating every tiny room noise as speech, but not so strict that normal conversational speech gets discarded on a quiet mic.
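To make the threshold concrete, here is how a capture's loudness can be measured against it. `rms_dbfs` is a hypothetical helper, not Yapper's implementation; it assumes float samples normalized to [-1.0, 1.0], with 0 dB meaning full scale.

```python
import math

def rms_dbfs(samples):
    """RMS level of a capture in dBFS (0 dB = full scale).

    Illustrative helper: `samples` are floats in [-1.0, 1.0].
    """
    if not samples:
        return float("-inf")
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20 * math.log10(rms) if rms > 0 else float("-inf")

# Room tone sits well below the -35 dB default; normal speech rises above it.
quiet = [0.001] * 1000   # ~ -60 dBFS of ambient hum
speech = [0.1] * 1000    # ~ -20 dBFS, a typical speaking level
print(rms_dbfs(quiet) > -35.0, rms_dbfs(speech) > -35.0)  # False True
```

The two constants bracket the default nicely: a capture around -60 dBFS is discarded, one around -20 dBFS passes, which is exactly the middle-ground behavior the slider default is aiming for.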

I like that the setting is exposed where the rest of the audio controls live. It makes the behavior legible. If a user says, 'Yapper keeps transcribing nothing,' the answer is not hidden in a private branch or some obscure debug command. The answer is visible in the settings panel: enable VAD, tune the threshold, and let the app decide whether a capture is worth sending.

That is what makes the feature feel like part of the product instead of a patch. The UI does not just describe the behavior after the fact; it advertises the rule the pipeline follows. If no speech is detected, the request path never starts.

## The choice I made in tray mode

Tray mode is the version of Yapper that keeps running in the background. It is the default mode, and it is the one that has to make judgment calls continuously. That is why VAD matters more there than in a single-shot recording flow. In tray mode, the app is always available, always listening for the next command, and always one accidental trigger away from doing unnecessary work.

That background posture changes the economics of every recording. A console mode that records once on a hotkey can tolerate a little more waste because the user is already intentionally entering a capture. Tray mode does not get that luxury. It sees more ambient noise, more partial utterances, more false starts, and more cases where the microphone is open but the person has not actually spoken yet.

So I put the decision before the request. That is the whole point. When the settings say the app will skip API calls when no speech is detected, the tray loop has to honor that promise. There is no value in a recording path that cheerfully hands empty audio to Whisper and then hopes the model will make sense of it. The correct answer is to stop earlier.
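The ordering described above can be sketched as a single gate in front of the request path. Everything here is a hypothetical stand-in for the real code: `has_speech` mirrors the RMS-versus-threshold check, and `send_to_whisper` is a placeholder for the actual Whisper client call.

```python
import math

def has_speech(samples, threshold_db):
    """True when the capture's RMS level crosses the VAD threshold (sketch)."""
    if not samples:
        return False
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    level_db = 20 * math.log10(rms) if rms > 0 else float("-inf")
    return level_db >= threshold_db

def send_to_whisper(samples):
    """Placeholder for the real API request."""
    return "<transcript>"

def handle_capture(samples, enabled=True, threshold_db=-35.0):
    """Decide before the request: skip silence, transcribe speech."""
    if enabled and not has_speech(samples, threshold_db):
        return None                    # no API call is ever made
    return send_to_whisper(samples)    # only speech-bearing audio gets this far

print(handle_capture([0.0] * 1000))   # None: silence never leaves the app
print(handle_capture([0.2] * 1000))   # '<transcript>': speech goes through
```

The point of the sketch is the ordering, not the details: the VAD check sits strictly before the request path, so an empty capture returns `None` without touching the network.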

That choice also makes the system easier to reason about. When the request path only starts after a speech check, the logs become cleaner and the behavior becomes predictable. If the app transcribes something, there was speech. If it did not, the app did not waste time pretending otherwise.

## Why -35 dB is not a magic number

The default threshold matters, but not because it is somehow perfect. It is useful because it gives the user a starting point that works well across a lot of ordinary setups. I chose `-35 dB` because it sits in the right part of the range for typical desktop speech capture: sensitive enough to catch normal speaking voices, but conservative enough to ignore background hum and room tone.

That threshold is also where the tradeoff becomes visible. If I move it too low, the app starts treating noise as speech. That means more useless requests, more false positives, and more noise in the transcript history. If I move it too high, the app starts missing quiet speech, softer voices, and mics that sit a little too far away. The threshold is not just a number; it is a decision about what kind of environment the app should tolerate.

That is why I wanted the setting exposed instead of hardcoded. Different users will land in different places. Some people run Yapper on a quiet desktop with a close mic. Others run it on a laptop in a shared room. A threshold that feels right in one setup can feel wrong in the other. The slider makes the behavior adjustable without making the rest of the app complicated.

I think of the threshold as part of the app's operating profile. Once the app has a sensible default, the user can move it only when the default is not matching reality. That keeps the common case simple and the uncommon case editable.

## The real cost of empty captures

The obvious cost of empty captures is money. Every unnecessary API call is work I do not need to pay for. But the less obvious cost is friction. A background transcription app that keeps spending time on silence feels noisy even when the bill is small. It gives the impression that the system is active when it is not actually helping.

That is why this feature improved more than just cost. It improved the feel of the app.

Before the VAD path, the system had a habit of treating completed audio as if completion itself were enough reason to continue. It was a procedural bug disguised as progress. After the VAD check, the app became much more disciplined. A capture now has to earn its way into the transcription path.

That shift matters in practice because it eliminates several kinds of waste at once — empty recordings never become API requests, the logs stop filling with pointless transcription attempts, users stop wondering why the app reacted to silence, and the tray workflow becomes something you can actually trust.

The best part is that the behavior is easy to explain. The app is not trying to be clever. It is just refusing to send obviously bad input to Whisper. That simplicity is exactly what I wanted.

## Why I kept the control visible

I could have buried the VAD behavior in a private setting and left it alone. I did not want that. A setting like this belongs in the UI because it is not an implementation detail. It changes how the app behaves in the real world.

When I expose the threshold and the toggle, I give the user the same tuning surface I use when I test the app on different hardware. That matters because desktop audio is messy. Microphone gain differs. Room acoustics differ. Background noise differs. Even the same machine can behave differently depending on whether it is running on battery, plugged in, or sitting next to a loud fan.

The UI copy is also part of the product promise. The section does not describe a hidden optimization trick. It says exactly what the behavior is for: it skips API calls when no speech is detected. That wording matters because it tells the user what the system is protecting them from. It is not trying to detect language, intent, or semantics. It is only deciding whether the capture contains enough evidence of speech to justify the next step.

That is the right boundary. The more the UI matches the actual behavior, the easier it is to tune and trust.

## The decision point that changed the app

Tray mode needed one thing: a better decision point. Not a bigger model, not a fancier prompt — just a clear line between 'there is speech here' and 'there is nothing worth sending.' That line is now visible in the settings, obvious in the flow, and easy to adjust when a microphone or room changes. Once it was in place, the recorder could keep doing its job, the transcription layer only saw inputs that had a reason to exist, and the app finally matched the promise in the settings panel: if there is no speech, Yapper skips the call and moves on.


---

🎧 **Listen to the audiobook** — [Spotify](https://open.spotify.com/show/4ABVd5yDVfbX9HlV5JjT7D) · [Google Play](https://play.google.com/store/audiobooks/details/How_to_Architect_an_Enterprise_AI_System_And_Why_t?id=AQAAAECafz8_tM&hl=en) · [All platforms](https://www.craftedbydaniel.com/audiobook)
🎬 [Watch the visual overviews on YouTube](https://youtube.com/playlist?list=PLRteDbGJPYDb9XNjecvHplGlgW7tIv_q6)
📖 [Read the full 13-part series with AI assistant](https://www.craftedbydaniel.com/premium-access?from=%2Fblog%2Fseries%2Fhow-to-architect-an-enterprise-ai-system-and-why-the-engineer-still-matters)
