Translating Windows system audio in real time — driverless, with no virtual cable

Davut Akça — Thu, 25 Jun 2026 03:54:23 +0000

I build Voxis, an open-source Windows app that translates whatever your system is playing — a video, a game, the other side of a call — and plays the translation back as spoken voice, a few seconds behind the speaker. No subtitles, no virtual audio cable, no bot joining your meeting.

The "no virtual cable" part is the bit worth writing about. Almost every system-audio tool on Windows tells you to install VB-CABLE or VoiceMeeter, or to drop a bot into your call. Voxis doesn't, for incoming audio. This post is how that capture engine works, and the sharp edges I hit building it in Python.

I'll be specific about what's hard and honest about what's not mine to fix.

The goal

Read the exact audio the user is hearing — the post-mix system output — at 16 kHz mono, and do it without installing anything. Then stream it to a translation model and play the result back, all while the original keeps playing underneath.

Three constraints fall out of that:

Driverless. If it needs a reboot and a driver, it's not zero-setup.
No self-feedback. The app plays translated audio into the same system mix it's capturing. Naively, it would capture its own voice and translate the translation. That has to be impossible by construction, not patched with an echo gate.
Realtime-safe. Capture can't stall. If the downstream VAD or garbage collector hiccups, the WASAPI ring buffer must not overflow.

WASAPI process-loopback: capturing the mix, minus yourself

Windows 10 version 2004 added the ApplicationLoopback API — a way to activate an IAudioClient in loopback mode scoped to a process tree, either including only that tree or excluding it. Excluding our own process tree is exactly what constraint #2 needs: the captured mix is everything the user hears, with Voxis's own output removed.

You don't get this client from the normal IMMDeviceEnumerator path. You activate it by name through ActivateAudioInterfaceAsync, passing the loopback parameters in a PROPVARIANT carrying a BLOB:

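import ctypes
import os
from ctypes import byref, c_void_p, sizeof

my_pid = os.getpid()

# Capture the full mix, minus anything rendered by our own process tree.
params = AUDIOCLIENT_ACTIVATION_PARAMS()
params.ActivationType = AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK
params.u.ProcessLoopbackParams.TargetProcessId = my_pid
params.u.ProcessLoopbackParams.ProcessLoopbackMode = \
    PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE

# Pack the parameters into a PROPVARIANT of type VT_BLOB.
pv = PROPVARIANT()
pv.vt = VT_BLOB
pv.blob.cbSize = sizeof(params)
pv.blob.pBlobData = ctypes.cast(byref(params), c_void_p)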

The device name is the magic string VAD\Process_Loopback. The activation is asynchronous: you hand ActivateAudioInterfaceAsync a completion handler and wait for it to fire.
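For readers following along in ctypes, the structures that the snippet above fills in can be declared roughly like this. It is a minimal sketch, not the Voxis source: the field layout and enum values mirror the Windows SDK header audioclientactivationparams.h, the union is named `u` only because the snippet above accesses it that way (the SDK union is anonymous), `exclude_process_tree` is a hypothetical helper, and PROPVARIANT is left out.

```python
import ctypes

# Enum values as declared in audioclientactivationparams.h.
AUDIOCLIENT_ACTIVATION_TYPE_DEFAULT = 0
AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK = 1
PROCESS_LOOPBACK_MODE_INCLUDE_TARGET_PROCESS_TREE = 0
PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE = 1


class AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS(ctypes.Structure):
    _fields_ = [
        ("TargetProcessId", ctypes.c_uint32),     # DWORD
        ("ProcessLoopbackMode", ctypes.c_int32),  # PROCESS_LOOPBACK_MODE enum
    ]


class _ActivationUnion(ctypes.Union):
    # Anonymous in the SDK header; it has a single member.
    _fields_ = [("ProcessLoopbackParams", AUDIOCLIENT_PROCESS_LOOPBACK_PARAMS)]


class AUDIOCLIENT_ACTIVATION_PARAMS(ctypes.Structure):
    _fields_ = [
        ("ActivationType", ctypes.c_int32),       # AUDIOCLIENT_ACTIVATION_TYPE enum
        ("u", _ActivationUnion),
    ]


def exclude_process_tree(pid: int) -> AUDIOCLIENT_ACTIVATION_PARAMS:
    """Hypothetical helper: params for 'the system mix minus pid and its children'."""
    params = AUDIOCLIENT_ACTIVATION_PARAMS()
    params.ActivationType = AUDIOCLIENT_ACTIVATION_TYPE_PROCESS_LOOPBACK
    params.u.ProcessLoopbackParams.TargetProcessId = pid
    params.u.ProcessLoopbackParams.ProcessLoopbackMode = (
        PROCESS_LOOPBACK_MODE_EXCLUDE_TARGET_PROCESS_TREE
    )
    return params
```

The structure is three 32-bit fields, so `ctypes.sizeof` should report 12 bytes; that is a quick sanity check before handing the blob to the activation call.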

The IAgileObject trap

Here's the one that cost me an afternoon. The completion handler is a COM object you implement yourself (in Python, via comtypes.COMObject). If it only implements IActivateAudioInterfaceCompletionHandler, ActivateAudioInterfaceAsync returns E_ILLEGAL_METHOD_CALL and nothing tells you why.

The fix: the handler must also implement IAgileObject — a marker interface with no methods that declares the object as apartment-agnostic. Add it to the COM interface list and the activation succeeds:

class _Handler(COMObject):
    _com_interfaces_ = [IActivateAudioInterfaceCompletionHandler, IAgileObject]

Because the interface is only a promise that the object may be called from any apartment, there is nothing else to implement. WASAPI simply refuses to proceed without it.

Asking for the format you actually want

The other nicety: WASAPI lets you Initialize the loopback client with the exact WAVEFORMATEX you want. I request 16 kHz, mono, 16-bit PCM directly — which happens to be exactly what the translation model wants as input — so there's no resampling step in the hot path:

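wfx = WAVEFORMATEX()
wfx.wFormatTag = WAVE_FORMAT_PCM
wfx.nChannels = 1
wfx.nSamplesPerSec = 16000
wfx.wBitsPerSample = 16
wfx.nBlockAlign = wfx.nChannels * wfx.wBitsPerSample // 8   # bytes per frame
wfx.nAvgBytesPerSec = wfx.nSamplesPerSec * wfx.nBlockAlign  # bytes per second
client.Initialize(AUDCLNT_SHAREMODE_SHARED, AUDCLNT_STREAMFLAGS_LOOPBACK,
                  2_000_000, 0, byref(wfx), None)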

That 2_000_000 is a 200 ms buffer in 100-ns units.
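The arithmetic behind those numbers is small enough to check by hand. The sketch below only illustrates the unit conversion and the derived PCM fields; the function names are made up for this example and are not part of WASAPI or Voxis.

```python
REFTIMES_PER_MS = 10_000  # REFERENCE_TIME counts 100-ns ticks: 10,000 per millisecond


def ms_to_reference_time(ms: int) -> int:
    """Convert milliseconds to the 100-ns units Initialize expects."""
    return ms * REFTIMES_PER_MS


def pcm_layout(sample_rate: int, channels: int, bits_per_sample: int) -> dict:
    """Derive the two dependent WAVEFORMATEX fields for plain PCM."""
    block_align = channels * bits_per_sample // 8
    return {
        "nBlockAlign": block_align,
        "nAvgBytesPerSec": sample_rate * block_align,
    }


def buffer_bytes(duration_ms: int, sample_rate: int, channels: int,
                 bits_per_sample: int) -> int:
    """Bytes of audio that fit in a buffer of the requested duration."""
    frames = sample_rate * duration_ms // 1000
    return frames * pcm_layout(sample_rate, channels, bits_per_sample)["nBlockAlign"]
```

For the format above, 200 ms works out to 3,200 frames, or 6,400 bytes, of 16 kHz mono 16-bit audio. The requested duration is a minimum, so the buffer WASAPI actually allocates may be larger.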

Keeping capture realtime-safe

A loopback capture loop has one job it must never miss: call GetBuffer, copy the bytes, call ReleaseBuffer. If ReleaseBuffer is late because something downstream is slow, the ring overflows and you get glitches.

So capture and processing are split across two threads with a bounded queue between them:

Capture thread: GetNextPacketSize → GetBuffer → copy into a numpy array → ReleaseBuffer → append to a deque. That's all it does. It never runs the VAD or the network code.
Processor thread: drains the deque and runs the (sometimes slow) per-chunk callback — VAD gating, then handoff to the translator.

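The queue is a collections.deque(maxlen=N), which is drop-oldest by construction. If the processor falls behind, old audio is dropped to bound latency rather than letting the capture thread block. A VAD stall in the consumer therefore can never make the capture thread wait on the queue, so it never delays ReleaseBuffer that way. (One caveat: in CPython a garbage-collection pass holds the GIL, so a long GC pause briefly stops every thread, capture included; the 200 ms WASAPI buffer is the slack that absorbs it.) This is the single most important design decision in the capture path, and it's three lines of code.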

self._queue = collections.deque(maxlen=64)   # bounded; ~a buffer's worth of packets
# capture thread:
self._queue.append(x)        # never blocks; oldest is discarded under pressure
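To see the two-thread split end to end without a Windows machine, here is a minimal sketch under stated assumptions: `read_packet` is a hypothetical stand-in for the GetNextPacketSize, GetBuffer, copy, ReleaseBuffer sequence and returns None when the source is finished, and `CaptureQueue`, `capture_loop` and `processor_loop` are names invented for this example rather than the Voxis classes.

```python
import collections
import threading
import time


class CaptureQueue:
    """Drop-oldest hand-off between a capture thread and a processor thread."""

    def __init__(self, maxlen: int = 64):
        self._queue = collections.deque(maxlen=maxlen)

    def push(self, packet) -> None:
        # Never blocks: a full deque discards its oldest entry on append.
        self._queue.append(packet)

    def pop(self, timeout: float = 0.05):
        """Return the oldest queued packet, or None if none arrives in time."""
        deadline = time.monotonic() + timeout
        while True:
            try:
                return self._queue.popleft()
            except IndexError:
                if time.monotonic() >= deadline:
                    return None
                time.sleep(0.001)


def capture_loop(read_packet, queue: CaptureQueue, stop: threading.Event) -> None:
    """Copy packets into the queue and do nothing else."""
    while not stop.is_set():
        packet = read_packet()
        if packet is None:  # stand-in source is exhausted
            return
        queue.push(packet)


def processor_loop(queue: CaptureQueue, on_chunk, stop: threading.Event) -> None:
    """Drain the queue and run the (possibly slow) per-chunk callback."""
    while True:
        packet = queue.pop()
        if packet is not None:
            on_chunk(packet)
        elif stop.is_set():
            return
```

The sketch polls the deque instead of using queue.Queue because the point is that the producer side takes no lock it could wait on; single appends and pops on a deque are documented as thread-safe. Whatever the consumer does, the packets it does see arrive in order, and the newest packet is never the one dropped.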

Ducking without touching the audio

When the translation speaks, you want the original quieter so the two voices don't fight. The tempting approach is to mix — capture the audio, attenuate it, play it back yourself. But then you own playback, latency, and device routing for every app on the system.

Instead, Voxis ducks at the source using the Windows session-volume API (ISimpleAudioVolume via pycaw): turn down the audio session of the app that's playing, not the bytes in our pipeline. The original keeps playing through its own path, untouched except for its level, and pops back up when the translation stops. No mixing, no added latency on the original, nothing to route.
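The duck-and-restore bookkeeping can be sketched independently of COM. This is an illustration, not the Voxis implementation: `SessionDucker` and `FakeSessionVolume` are hypothetical names, and the only real API surface assumed is the pair of ISimpleAudioVolume methods GetMasterVolume and SetMasterVolume, which with pycaw would be reached through each session's SimpleAudioVolume attribute.

```python
class SessionDucker:
    """Lower other apps' session volume while translated speech plays."""

    def __init__(self, volumes, duck_ratio: float = 0.25):
        # `volumes`: objects exposing GetMasterVolume() and
        # SetMasterVolume(level, event_context), as ISimpleAudioVolume does.
        self._volumes = list(volumes)
        self._duck_ratio = duck_ratio
        self._saved = {}  # id(volume) -> (volume, level before ducking)

    def duck(self) -> None:
        for volume in self._volumes:
            if id(volume) in self._saved:
                continue  # already ducked; don't compound the attenuation
            level = volume.GetMasterVolume()
            self._saved[id(volume)] = (volume, level)
            volume.SetMasterVolume(level * self._duck_ratio, None)

    def restore(self) -> None:
        for volume, level in self._saved.values():
            volume.SetMasterVolume(level, None)
        self._saved.clear()


class FakeSessionVolume:
    """Stand-in for a real session volume, for trying the sketch off Windows."""

    def __init__(self, level: float):
        self.level = level

    def GetMasterVolume(self) -> float:
        return self.level

    def SetMasterVolume(self, level: float, event_context) -> None:
        self.level = level
```

Remembering the pre-duck level per session matters because the user may have set each app to a different volume; restoring to 1.0 would be a visible side effect.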

(There's a second capture path for people who do install a virtual cable, where Voxis can do real M/S center-suppression to duck dialogue while preserving stereo music — but that's opt-in, and the driverless path above is the default.)
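For readers unfamiliar with mid/side processing, the textbook form of center suppression looks like the sketch below. It is an assumption about the general technique, not a description of what Voxis does on that path; `suppress_center` and `center_gain` are hypothetical names, and plain lists are used so the example has no dependencies.

```python
def suppress_center(left, right, center_gain: float = 0.3):
    """Attenuate the mid (L+R)/2 component and leave the side (L-R)/2 intact.

    Dialogue is usually mixed to the center, so it lives mostly in the mid
    signal, while wide stereo content lives mostly in the side signal.
    """
    out_left, out_right = [], []
    for l, r in zip(left, right):
        mid = (l + r) / 2.0
        side = (l - r) / 2.0
        mid *= center_gain          # duck whatever is common to both channels
        out_left.append(mid + side)
        out_right.append(mid - side)
    return out_left, out_right
```

A sample that is identical in both channels is scaled by `center_gain`, while a sample that is equal and opposite in the two channels passes through unchanged.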

The latency I don't control — and the bit I do

People always ask why it's not instant. The honest answer has two parts:

The translation model is a native simultaneous interpreter. Fed a continuous stream, it translates as the speaker talks and self-balances quality against sync, staying a few seconds behind — that ear-voice span is by design (it waits for enough context to translate a clause correctly), and it is not a knob the client can turn. There's no "go faster" setting.

What I can do is not add latency on top:

Warm the connection before capture starts, so the first sentence doesn't pay for the cold WebSocket handshake.
Disable WebSocket per-message compression — it's pure overhead for PCM.
Send a continuous stream rather than doing endpointing on the client. The model owns its own endpointing; bracketing turns from the client only fights it.
Pin the VAD to CPU. Silero VAD at batch size 1 is lower-latency on CPU than paying for a host↔device round-trip, and it avoids a CUDA-DLL probe stall on machines without a GPU.
Bound the input queue with a drop-oldest policy, so a slow moment never snowballs into a growing backlog.

None of these touch the model's core lag. I think it's better to say that clearly than to imply a few client tweaks made it real-time.

Open-core, and why the boundary is enforced in CI

Voxis is open-core. The engine is on GitHub and runs BYOK — bring your own Gemini key, stored encrypted on your machine and bound to your Windows account. The open-source build makes no calls to my backend: no auth, no quota, no telemetry, no usage reporting. The only network it touches is the Gemini WebSocket your own key opens.

That's easy to claim and easy to break by accident. So the public-repo boundary is policed by a release-hygiene script wired into CI and a pre-push hook: it rejects any closed-core path, any live-secret signature, and any unguarded import of the closed package. A clean run is a release precondition. The separation is a property the build proves, not a promise in a README.
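As a rough picture of what such a check can look like, here is a minimal sketch. Every specific in it is an assumption made for illustration: the `closed_core` package and path names are hypothetical, the secret pattern is just the shape commonly seen for Google API keys, and "guarded" is simplified to "directly inside a try: block". The real script in the repo may work quite differently.

```python
import re

# All patterns below are hypothetical placeholders, not the real Voxis rules.
CLOSED_PATH_PREFIXES = ("closed_core/",)
SECRET_SIGNATURES = [re.compile(r"AIza[0-9A-Za-z_\-]{35}")]
CLOSED_IMPORT = re.compile(r"^\s*(?:from|import)\s+closed_core\b")


def _indent(line: str) -> int:
    return len(line) - len(line.lstrip())


def _is_guarded(lines, index: int) -> bool:
    """True if the nearest less-indented line above `index` is `try:`."""
    for prev in reversed(lines[:index]):
        if prev.strip() and _indent(prev) < _indent(lines[index]):
            return prev.strip() == "try:"
    return False


def find_violations(files: dict) -> list:
    """Return (path, line_number, reason) tuples for a {path: text} mapping."""
    violations = []
    for path, text in sorted(files.items()):
        if path.startswith(CLOSED_PATH_PREFIXES):
            violations.append((path, 0, "closed-core path"))
            continue
        lines = text.splitlines()
        for number, line in enumerate(lines, start=1):
            if any(sig.search(line) for sig in SECRET_SIGNATURES):
                violations.append((path, number, "secret signature"))
            if CLOSED_IMPORT.match(line) and not _is_guarded(lines, number - 1):
                violations.append((path, number, "unguarded closed import"))
    return violations
```

In CI and in a pre-push hook, a wrapper would feed this the tracked files and exit non-zero when the returned list is not empty, which is what turns the boundary into a release precondition.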

What it doesn't do (yet)

Windows only. The whole capture story is a Windows-specific WASAPI feature. Other platforms would need a different capture strategy entirely.
Gemini-dependent. It's built on one provider's live translate model. If that model changes, Voxis changes with it.
Outgoing meeting audio needs a virtual mic. Sending your translated voice into a call means presenting a microphone the meeting app can select, and Windows only lets a virtual audio driver do that. Incoming translation needs nothing; without a cable, the outgoing direction is unavailable and Voxis stays listen-only.

Try it / read it

The engine, the loopback code, and the CI boundary are all in the repo: https://github.com/DavutAkca/voxislive (PolyForm Noncommercial).

If you've shipped WASAPI loopback from a managed/scripted language, I'd genuinely like to compare notes on the activation handler and the agile-object requirement — drop a comment.

DEV Community: Davut Akça