<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Davut Akça</title>
    <description>The latest articles on DEV Community by Davut Akça (@davutakca).</description>
    <link>https://dev.to/davutakca</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F4001492%2F16b1725e-6861-4e79-b360-5c60fc1a6015.png</url>
      <title>DEV Community: Davut Akça</title>
      <link>https://dev.to/davutakca</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/davutakca"/>
    <language>en</language>
    <item>
      <title>Translating Windows system audio in real time — driverless, with no virtual cable</title>
      <dc:creator>Davut Akça</dc:creator>
      <pubDate>Thu, 25 Jun 2026 03:54:23 +0000</pubDate>
      <link>https://dev.to/davutakca/translating-windows-system-audio-in-real-time-driverless-with-no-virtual-cable-2842</link>
      <guid>https://dev.to/davutakca/translating-windows-system-audio-in-real-time-driverless-with-no-virtual-cable-2842</guid>
      <description>&lt;p&gt;I build Voxis, an open-source Windows app that translates whatever your system is playing — a video, a game, the other side of a call — and plays the translation back as spoken voice, a few seconds behind the speaker. No subtitles, no virtual audio cable, no bot joining your meeting.&lt;/p&gt;

&lt;p&gt;The "no virtual cable" part is the bit worth writing about. Almost every system-audio tool on Windows tells you to install VB-CABLE or VoiceMeeter, or to drop a bot into your call. Voxis doesn't, for incoming audio. This post covers how that capture engine works and the sharp edges I hit building it in Python.&lt;/p&gt;

&lt;p&gt;I'll be specific about what's hard and honest about what's not mine to fix.&lt;/p&gt;

&lt;h2&gt;The goal&lt;/h2&gt;

&lt;p&gt;Read the exact audio the user is hearing — the post-mix system output — at 16 kHz mono, and do it without installing anything. Then stream it to a translation model and play the result back, all while the original keeps playing underneath.&lt;/p&gt;

&lt;p&gt;Three constraints fall out of that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Driverless.&lt;/strong&gt; If it needs a reboot and a driver, it's not zero-setup.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No self-feedback.&lt;/strong&gt; The app plays translated audio &lt;em&gt;into the same system mix it's capturing&lt;/em&gt;. Naively, it would capture its own voice and translate the translation. That has to be impossible by construction, not patched with an echo gate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Realtime-safe.&lt;/strong&gt; Capture can't stall. If the downstream VAD or garbage collector hiccups, the WASAPI ring buffer must not overflow.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;WASAPI process-loopback: capturing the mix, minus yourself&lt;/h2&gt;

&lt;p&gt;Windows 10 version 2004 added the &lt;strong&gt;ApplicationLoopback&lt;/strong&gt; API — a way to activate an &lt;code&gt;IAudioClient&lt;/code&gt; in loopback mode scoped to a process tree, either &lt;em&gt;including&lt;/em&gt; only that tree or &lt;em&gt;excluding&lt;/em&gt; it. Excluding our own process tree is exactly what constraint #2 needs: the captured mix is everything the user hears, with Voxis's own output removed.&lt;/p&gt;

&lt;p&gt;You don't get this client from the normal &lt;code&gt;IMMDeviceEnumerator&lt;/code&gt; path. You activate it by name through &lt;code&gt;ActivateAudioInterfaceAsync&lt;/code&gt;, passing the loopback parameters in a &lt;code&gt;PROPVARIANT&lt;/code&gt; carrying a &lt;code&gt;BLOB&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AUDIOCLIENT_ACTIVATION_PARAMS&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ActivationType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessLoopbackParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TargetProcessId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;my_pid&lt;/span&gt;
&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessLoopbackParams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ProcessLoopbackMode&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; \
    &lt;span class="n"&gt;PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE&lt;/span&gt;

&lt;span class="n"&gt;pv&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;PROPVARIANT&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;vt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VT_BLOB&lt;/span&gt;
&lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cbSize&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sizeof&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pv&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;blob&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;pBlobData&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ctypes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cast&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;byref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;c_void_p&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The device name is the magic string &lt;code&gt;VAD\Process_Loopback&lt;/code&gt;. The activation is asynchronous: you hand &lt;code&gt;ActivateAudioInterfaceAsync&lt;/code&gt; a completion handler and wait for it to fire.&lt;/p&gt;
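
&lt;p&gt;For readers who haven't declared these structures before, here is a minimal &lt;code&gt;ctypes&lt;/code&gt; sketch of what the snippet above fills in. The layout and enum values follow the Windows SDK header &lt;code&gt;audioclientactivationparams.h&lt;/code&gt;; the union field name &lt;code&gt;u&lt;/code&gt; mirrors the snippet, the helper &lt;code&gt;make_exclude_params&lt;/code&gt; is hypothetical, and none of this is necessarily how Voxis declares them.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ctypes
import os

# Enum values as defined in the Windows SDK.
AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK = 1
PROCESS_LOOPBACK_MODE_INCLUDE_TARGET_PROCESS_TREE = 0
PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE = 1


class AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS(ctypes.Structure):
    _fields_ = [
        ("TargetProcessId", ctypes.c_uint32),     # DWORD
        ("ProcessLoopbackMode", ctypes.c_int32),  # enum
    ]


class _ActivationUnion(ctypes.Union):
    _fields_ = [("ProcessLoopbackParams", AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS)]


class AUDIOCLIENT_ACTIVATION_PARAMS(ctypes.Structure):
    _fields_ = [
        ("ActivationType", ctypes.c_int32),       # enum
        ("u", _ActivationUnion),                  # anonymous union in the C header
    ]


def make_exclude_params(pid):
    """Build params that capture everything except `pid` and its children."""
    params = AUDIOCLIENT_ACTIVATION_PARAMS()
    params.ActivationType = AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK
    params.u.ProcessLoopbackParams.TargetProcessId = pid
    params.u.ProcessLoopbackParams.ProcessLoopbackMode = (
        PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE
    )
    return params


params = make_exclude_params(os.getpid())
print(ctypes.sizeof(params))  # 12 bytes: three 32-bit fields
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Fixed-width &lt;code&gt;c_uint32&lt;/code&gt; and &lt;code&gt;c_int32&lt;/code&gt; are used here so the blob size is the same regardless of how the host platform sizes &lt;code&gt;long&lt;/code&gt;.&lt;/p&gt;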

&lt;h3&gt;The IAgileObject trap&lt;/h3&gt;

&lt;p&gt;Here's the one that cost me an afternoon. The completion handler is a COM object you implement yourself (in Python, via &lt;code&gt;comtypes.COMObject&lt;/code&gt;). If it only implements &lt;code&gt;IActivateAudioInterfaceCompletionHandler&lt;/code&gt;, &lt;code&gt;ActivateAudioInterfaceAsync&lt;/code&gt; returns &lt;code&gt;E_ILLEGAL_METHOD_CALL&lt;/code&gt; and nothing tells you why.&lt;/p&gt;

&lt;p&gt;The fix: the handler must &lt;em&gt;also&lt;/em&gt; implement &lt;code&gt;IAgileObject&lt;/code&gt; — a marker interface with no methods that declares the object as apartment-agnostic. Add it to the COM interface list and the activation succeeds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;_Handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COMObject&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;_com_interfaces_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;IActivateAudioInterfaceCompletionHandler&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IAgileObject&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;IAgileObject&lt;/code&gt; has an empty method list — it's purely a "you may call me from any apartment" promise. WASAPI refuses to proceed without it.&lt;/p&gt;

&lt;h3&gt;Asking for the format you actually want&lt;/h3&gt;

&lt;p&gt;The other nicety: WASAPI lets you &lt;code&gt;Initialize&lt;/code&gt; the loopback client with the exact &lt;code&gt;WAVEFORMATEX&lt;/code&gt; you want. I request 16 kHz, mono, 16-bit PCM directly — which happens to be exactly what the translation model wants as input — so there's &lt;strong&gt;no resampling step&lt;/strong&gt; in the hot path:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
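&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;wfx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;WAVEFORMATEX&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wFormatTag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;WAVE_FORMAT_PCM&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nChannels&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nSamplesPerSec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16000&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wBitsPerSample&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;16&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nBlockAlign&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nChannels&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;wBitsPerSample&lt;/span&gt; &lt;span class="o"&gt;//&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;  &lt;span class="c1"&gt;# bytes per frame&lt;/span&gt;
&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nAvgBytesPerSec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nSamplesPerSec&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nBlockAlign&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Initialize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AUDCLNT_SHAREMODE_SHARED&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AUDCLNT_STREAMFLAGS_LOOPBACK&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="mi"&gt;2_000_000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;byref&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;wfx&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;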
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;2_000_000&lt;/code&gt; is a 200 ms buffer in 100-ns units.&lt;/p&gt;
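
&lt;p&gt;The arithmetic is worth spelling out once. This small sketch (the helper names are mine, not from the Voxis source) converts milliseconds to the 100-ns &lt;code&gt;REFERENCE_TIME&lt;/code&gt; units and shows how many bytes of 16 kHz mono 16-bit PCM that buffer holds:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;HNS_PER_MS = 10_000  # REFERENCE_TIME counts 100-ns units: 10,000 per millisecond


def ms_to_reference_time(ms):
    """Convert milliseconds to the 100-ns units WASAPI expects."""
    return ms * HNS_PER_MS


def buffer_bytes(ms, sample_rate=16_000, channels=1, bits_per_sample=16):
    """Bytes of PCM that fit in a buffer of the given duration."""
    block_align = channels * bits_per_sample // 8  # bytes per frame
    return sample_rate * ms // 1000 * block_align


print(ms_to_reference_time(200))  # 2000000
print(buffer_bytes(200))          # 6400
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;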

&lt;h2&gt;Keeping capture realtime-safe&lt;/h2&gt;

&lt;p&gt;A loopback capture loop has one job it must never miss: call &lt;code&gt;GetBuffer&lt;/code&gt;, copy the bytes, call &lt;code&gt;ReleaseBuffer&lt;/code&gt;. If &lt;code&gt;ReleaseBuffer&lt;/code&gt; is late because something downstream is slow, the ring overflows and you get glitches.&lt;/p&gt;

&lt;p&gt;So capture and processing are split across two threads with a bounded queue between them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Capture thread:&lt;/strong&gt; &lt;code&gt;GetNextPacketSize&lt;/code&gt; → &lt;code&gt;GetBuffer&lt;/code&gt; → copy into a numpy array → &lt;code&gt;ReleaseBuffer&lt;/code&gt; → append to a deque. That's all it does. It never runs the VAD or the network code.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processor thread:&lt;/strong&gt; drains the deque and runs the (sometimes slow) per-chunk callback — VAD gating, then handoff to the translator.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The queue is a &lt;code&gt;collections.deque(maxlen=N)&lt;/code&gt; — &lt;strong&gt;drop-oldest by construction&lt;/strong&gt;. If the processor falls behind, old audio is dropped to bound latency rather than letting the capture thread block. A GC pause or a VAD stall in the consumer therefore can never delay &lt;code&gt;ReleaseBuffer&lt;/code&gt;. This is the single most important design decision in the capture path, and it's three lines of code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;collections&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;deque&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxlen&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# bounded; ~a buffer's worth of packets
# capture thread:
&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# never blocks; oldest is discarded under pressure
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
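
&lt;p&gt;To see the drop-oldest behaviour in isolation, here is a self-contained sketch of the hand-off. The &lt;code&gt;ChunkQueue&lt;/code&gt; class and its method names are hypothetical, and integers stand in for audio packets; only the &lt;code&gt;deque(maxlen=N)&lt;/code&gt; idea comes from the code above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import collections
import threading


class ChunkQueue:
    """Bounded hand-off between a capture thread and a processor thread."""

    def __init__(self, maxlen):
        self._chunks = collections.deque(maxlen=maxlen)
        self._ready = threading.Event()

    def push(self, chunk):
        # Called by the capture thread. append() on a full deque evicts the
        # oldest item instead of blocking, so this returns immediately.
        self._chunks.append(chunk)
        self._ready.set()

    def drain(self, timeout=0.1):
        # Called by the processor thread. Waits briefly for data, then
        # takes everything that is queued right now.
        self._ready.wait(timeout)
        self._ready.clear()
        drained = []
        while True:
            try:
                drained.append(self._chunks.popleft())
            except IndexError:
                return drained


queue = ChunkQueue(maxlen=3)
for packet in range(5):   # the consumer is "stalled" during these pushes
    queue.push(packet)
print(queue.drain())      # [2, 3, 4]: packets 0 and 1 were dropped
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trade-off is explicit: under pressure the consumer loses the oldest audio, but the producer never waits on it.&lt;/p&gt;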



&lt;h2&gt;Ducking without touching the audio&lt;/h2&gt;

&lt;p&gt;When the translation speaks, you want the original quieter so the two voices don't fight. The tempting approach is to mix — capture the audio, attenuate it, play it back yourself. But then you own playback, latency, and device routing for every app on the system.&lt;/p&gt;

&lt;p&gt;Instead, Voxis ducks at the &lt;em&gt;source&lt;/em&gt; using the Windows &lt;strong&gt;session-volume API&lt;/strong&gt; (&lt;code&gt;ISimpleAudioVolume&lt;/code&gt; via pycaw): turn down the audio session of the app that's playing, not the bytes in our pipeline. The original keeps playing through its own path, untouched except for its level, and pops back up when the translation stops. No mixing, no added latency on the original, nothing to route.&lt;/p&gt;
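
&lt;p&gt;A rough sketch of session-level ducking with pycaw, assuming its &lt;code&gt;AudioUtilities.GetAllSessions()&lt;/code&gt; helper and the &lt;code&gt;GetMasterVolume&lt;/code&gt; / &lt;code&gt;SetMasterVolume&lt;/code&gt; methods of &lt;code&gt;ISimpleAudioVolume&lt;/code&gt;. The function names, the -12 dB default, and the choice to duck every session other than our own are illustrative assumptions, not the Voxis implementation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def db_to_scalar(db):
    """Convert a gain in decibels to a linear volume multiplier."""
    return 10 ** (db / 20)


def duck_other_sessions(own_pid, duck_db=-12.0):
    """Lower every audio session except our own; return a restore callback."""
    from pycaw.pycaw import AudioUtilities  # Windows only, imported lazily

    saved = []
    for session in AudioUtilities.GetAllSessions():
        if session.ProcessId == own_pid:
            continue  # never duck the translated voice itself
        volume = session.SimpleAudioVolume
        original = volume.GetMasterVolume()
        volume.SetMasterVolume(original * db_to_scalar(duck_db), None)
        saved.append((volume, original))

    def restore():
        # Put each session back at the level it had before ducking.
        for volume, original in saved:
            volume.SetMasterVolume(original, None)

    return restore
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Calling &lt;code&gt;restore = duck_other_sessions(os.getpid())&lt;/code&gt; when translated speech starts and &lt;code&gt;restore()&lt;/code&gt; when it stops would give the "pops back up" behaviour described above.&lt;/p&gt;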

&lt;p&gt;(There's a second capture path for people who &lt;em&gt;do&lt;/em&gt; install a virtual cable, where Voxis can do real M/S center-suppression to duck dialogue while preserving stereo music — but that's opt-in, and the driverless path above is the default.)&lt;/p&gt;

&lt;h2&gt;The latency I don't control, and the bit I do&lt;/h2&gt;

&lt;p&gt;People always ask why it's not instant. The honest answer:&lt;/p&gt;

&lt;p&gt;The translation model is a &lt;strong&gt;native simultaneous interpreter&lt;/strong&gt;. Fed a continuous stream, it translates as the speaker talks and self-balances quality against sync, staying a few seconds behind — that ear-voice span is by design (it waits for enough context to translate a clause correctly), and it is &lt;strong&gt;not a knob the client can turn&lt;/strong&gt;. There's no "go faster" setting.&lt;/p&gt;

&lt;p&gt;What I &lt;em&gt;can&lt;/em&gt; do is not add latency on top:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Warm the connection&lt;/strong&gt; before capture starts, so the first sentence doesn't pay for the cold WebSocket handshake.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Disable WebSocket per-message compression&lt;/strong&gt; — it's pure overhead for PCM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Send a continuous stream&lt;/strong&gt;, not client-side endpointing. The model owns its own endpointing; bracketing turns from the client only fights it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pin the VAD to CPU.&lt;/strong&gt; Silero VAD at batch size 1 is lower-latency on CPU than paying for a host↔device round-trip, and it avoids a CUDA-DLL probe stall on machines without a GPU.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bound the input queue&lt;/strong&gt; with drop-oldest semantics, so a slow moment never snowballs into a growing backlog.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;None of these touch the model's core lag. I think it's better to say that clearly than to imply a few client tweaks made it real-time.&lt;/p&gt;

&lt;h2&gt;Open-core, and why the boundary is enforced in CI&lt;/h2&gt;

&lt;p&gt;Voxis is open-core. The engine is on GitHub and runs BYOK — bring your own Gemini key, stored encrypted on your machine and bound to your Windows account. The open-source build makes &lt;strong&gt;no calls to my backend&lt;/strong&gt;: no auth, no quota, no telemetry, no usage reporting. The only network it touches is the Gemini WebSocket your own key opens.&lt;/p&gt;

&lt;p&gt;That's easy to claim and easy to break by accident. So the public-repo boundary is policed by a release-hygiene script wired into CI and a pre-push hook: it rejects any closed-core path, any live-secret signature, and any unguarded import of the closed package. A clean run is a release precondition. The separation is a property the build &lt;em&gt;proves&lt;/em&gt;, not a promise in a README.&lt;/p&gt;
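
&lt;p&gt;To make the three checks concrete, here is a minimal sketch of what such a hygiene script could look like. Everything named in it is an assumption for illustration: the &lt;code&gt;closed_core/&lt;/code&gt; path prefix, the &lt;code&gt;voxis_cloud&lt;/code&gt; package name, and the choice to treat any import inside a &lt;code&gt;try&lt;/code&gt; statement as "guarded". The secret pattern is the commonly used shape of a Google API key. The real script in the repo may work quite differently.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import ast
import re

CLOSED_PREFIXES = ("closed_core/",)  # hypothetical closed-core path
CLOSED_PACKAGE = "voxis_cloud"       # hypothetical closed package name
# Google API keys start with "AIza" followed by 35 URL-safe characters.
SECRET_RE = re.compile(r"AIza[0-9A-Za-z_\-]{35}")


def _unguarded_imports(tree):
    """Yield line numbers of closed-package imports outside any try statement."""
    def visit(node, guarded):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        if not guarded:
            for name in names:
                if name.split(".")[0] == CLOSED_PACKAGE:
                    yield node.lineno
        # Coarse rule: anything nested under a try statement counts as guarded.
        inside_try = guarded or isinstance(node, ast.Try)
        for child in ast.iter_child_nodes(node):
            yield from visit(child, inside_try)
    yield from visit(tree, False)


def check_file(path, source):
    """Return a list of human-readable violations for one tracked file."""
    problems = []
    if path.startswith(CLOSED_PREFIXES):
        problems.append(f"{path}: closed-core path in public tree")
    if SECRET_RE.search(source):
        problems.append(f"{path}: looks like a live API key")
    if path.endswith(".py"):
        for line in _unguarded_imports(ast.parse(source)):
            problems.append(f"{path}:{line}: unguarded import of {CLOSED_PACKAGE}")
    return problems
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;A CI job or pre-push hook would run something like &lt;code&gt;check_file&lt;/code&gt; over every tracked file and fail if any list comes back non-empty.&lt;/p&gt;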

&lt;h2&gt;What it doesn't do (yet)&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Windows only.&lt;/strong&gt; The whole capture story is a Windows-specific WASAPI feature. Other platforms would need a different capture strategy entirely.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini-dependent.&lt;/strong&gt; It's built on one provider's live translate model. If that model changes, Voxis changes with it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Outgoing meeting audio needs a virtual mic.&lt;/strong&gt; Sending &lt;em&gt;your&lt;/em&gt; translated voice into a call means presenting a microphone the meeting app can select, and Windows only lets a virtual audio driver do that. Incoming translation needs nothing extra; without a cable, Voxis falls back to listen-only.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try it / read it&lt;/h2&gt;

&lt;p&gt;The engine, the loopback code, and the CI boundary are all in the repo: &lt;strong&gt;&lt;a href="https://github.com/DavutAkca/voxislive" rel="noopener noreferrer"&gt;https://github.com/DavutAkca/voxislive&lt;/a&gt;&lt;/strong&gt; (PolyForm Noncommercial).&lt;/p&gt;

&lt;p&gt;If you've shipped WASAPI loopback from a managed/scripted language, I'd genuinely like to compare notes on the activation handler and the agile-object requirement — drop a comment.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
