The hardest part of my AI meeting app was the audio, not the AI

#ai #webdev #electron #career

I'm building an AI copilot for live calls. I assumed the AI would be the hard part. It wasn't — it was getting clean audio off the machine. Here's what I wish I'd known.

The constraint: no bot in the call
Most meeting AIs join your call as a participant ("X's Notetaker has joined"). I didn't want that. The goal: capture both sides of a conversation with nothing in the meeting — just the local device.

Two streams:

Your mic (you): easy.
The system audio (everyone else): the hard, platform-specific part.

The lessons that cost me weeks

Don't turn on system "voice processing."
The OS has a comms/voice mode that looks helpful: echo cancellation and gain control for free. Enable it and it globally drops your speaker volume and stacks auto-gain on your mic. It's easy to miss when you test alone on your own machine; it only shows up when you're in a real call and the other person sounds quiet or pumped. Leave it off.
Don't stack echo cancellation / AGC.
The meeting app (Zoom/Meet/Teams) already runs AEC + AGC + noise suppression. Add your own on the same stream and the two fight over the reference signal — quality drops. Process a separate copy, never what the meeting app hears.
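For the browser-based capture path, one way to avoid adding a second layer of processing is to ask for an unprocessed stream. This is a rough sketch, not what my app ships: `rawMicConstraints` is a made-up helper name, and while `echoCancellation`, `noiseSuppression`, and `autoGainControl` are standard media track constraints, browsers treat them as requests and may not honor every one.

```typescript
// Hypothetical helper: constraints for a capture stream the app processes
// itself, with the browser's built-in audio processing requested off.
interface RawAudioConstraints {
  audio: {
    echoCancellation: boolean;
    noiseSuppression: boolean;
    autoGainControl: boolean;
    channelCount: number;
  };
  video: false;
}

function rawMicConstraints(): RawAudioConstraints {
  return {
    audio: {
      echoCancellation: false, // the meeting app already runs AEC
      noiseSuppression: false, // ...and noise suppression
      autoGainControl: false,  // ...and AGC
      channelCount: 1,
    },
    video: false,
  };
}

// In a renderer process this would be passed to:
//   navigator.mediaDevices.getUserMedia(rawMicConstraints())
```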
Run capture in a separate process.
I moved native capture into a small helper that pipes raw PCM over IPC. That isolates the heavy native work, survives a crash (the app re-spawns the helper instead of going down with it), and keeps the UI thread free.
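One detail on the receiving side: bytes read from a pipe arrive in chunks of arbitrary size, so they have to be regrouped into fixed-size frames before anything downstream can use them. A minimal sketch of that step follows. `PcmFramer` and the 640-byte frame size are illustrative assumptions, not my helper's actual protocol.

```typescript
// Regroups an arbitrary-boundary byte stream into fixed-size PCM frames.
class PcmFramer {
  private pending: Uint8Array = new Uint8Array(0);
  private readonly frameBytes: number;

  constructor(frameBytes: number) {
    if (!Number.isInteger(frameBytes) || frameBytes <= 0) {
      throw new Error("frameBytes must be a positive integer");
    }
    this.frameBytes = frameBytes;
  }

  // Feed one chunk from the pipe; returns every complete frame now available.
  push(chunk: Uint8Array): Uint8Array[] {
    // Prepend whatever was left over from the previous chunk.
    const joined = new Uint8Array(this.pending.length + chunk.length);
    joined.set(this.pending, 0);
    joined.set(chunk, this.pending.length);

    const frames: Uint8Array[] = [];
    let offset = 0;
    while (joined.length - offset >= this.frameBytes) {
      frames.push(joined.slice(offset, offset + this.frameBytes));
      offset += this.frameBytes;
    }
    // Keep the incomplete tail for the next call.
    this.pending = joined.slice(offset);
    return frames;
  }
}

// Assumed format: 20 ms of 16 kHz, 16-bit mono = 320 samples * 2 bytes.
const framer = new PcmFramer(640);
```

In the main process, each chunk read from the helper's output would go through `framer.push(...)`, and only whole frames move on.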
"Ready" means frames are flowing — not a status flag.
My first version trusted a one-shot "capture started" signal. It lied: after sleep/wake or a slow cold start, the flag fired but no audio came. Ground truth = actual frames arriving. Gate on liveness, not a boolean.
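Gating on liveness can be as small as remembering when the last frame arrived. This is a sketch under my own naming: `FrameLiveness` is hypothetical, and the 500 ms window is a placeholder to tune, not a measured value.

```typescript
// Treats capture as "ready" only while frames keep arriving.
class FrameLiveness {
  private lastFrameAt: number | null = null;
  private readonly staleAfterMs: number;

  constructor(staleAfterMs: number) {
    this.staleAfterMs = staleAfterMs;
  }

  // Call from the frame handler every time real audio arrives.
  noteFrame(nowMs: number): void {
    this.lastFrameAt = nowMs;
  }

  // Live only if at least one frame arrived within the staleness window.
  isLive(nowMs: number): boolean {
    return (
      this.lastFrameAt !== null &&
      nowMs - this.lastFrameAt <= this.staleAfterMs
    );
  }
}

// Placeholder threshold: no frame for 500 ms means capture is not ready.
const nativeLiveness = new FrameLiveness(500);
```

A "capture started" event never touches this class. Only `noteFrame` does, so a flag that fires with no audio behind it leaves `isLive` false.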
Keep a fallback path — but tear it down when native wakes.
I run a browser-based fallback when the native helper isn't ready. The bug: leaving it running alongside native produced double-captured audio. Make the switch reversible, and kill the fallback fully once native frames arrive.
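Here is roughly how that switch can look when exactly one source is allowed to be active at a time. It's a sketch: `SourceSwitch` and the `FallbackCapture` interface are invented for illustration, and `nativeLive` stands for whatever signal tells you native frames are currently arriving.

```typescript
type Source = "native" | "fallback";

// Hypothetical handle for the browser-based capture path.
interface FallbackCapture {
  start(): void;
  stop(): void;
}

class SourceSwitch {
  private active: Source = "fallback";
  private readonly fallback: FallbackCapture;

  constructor(fallback: FallbackCapture) {
    this.fallback = fallback;
    this.fallback.start(); // begin on the fallback until native proves itself
  }

  // Call periodically with whether native frames are currently arriving.
  update(nativeLive: boolean): Source {
    if (nativeLive && this.active === "fallback") {
      this.fallback.stop(); // tear down fully so nothing is captured twice
      this.active = "native";
    } else if (!nativeLive && this.active === "native") {
      this.fallback.start(); // reversible: native went quiet again
      this.active = "fallback";
    }
    return this.active;
  }

  // Frames from the inactive source are dropped rather than mixed in.
  accept(from: Source): boolean {
    return from === this.active;
  }
}
```

The `accept` check is a second guard: even if a late frame from the old source slips through during the handover, it never reaches the transcript.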
Normalize early.
Downsample each stream to 16 kHz mono before transcription, and keep the two parties on separate channels so "them" and "you" never blur in the transcript. It's cheap at capture time and painful to retrofit later.
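A minimal sketch of that normalization, assuming both streams arrive as 48 kHz mono floats in the range -1 to 1 (so the downsample factor is 3). The function names are mine, and block averaging is only a crude low-pass filter; a proper resampler would do better.

```typescript
// Downsample by an integer factor, averaging each block of input samples.
function downsampleByAveraging(input: Float32Array, factor: number): Float32Array {
  if (!Number.isInteger(factor) || factor < 1) {
    throw new Error("factor must be a positive integer");
  }
  const outLength = Math.floor(input.length / factor);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    let sum = 0;
    for (let j = 0; j < factor; j++) {
      sum += input[i * factor + j] ?? 0;
    }
    out[i] = sum / factor;
  }
  return out;
}

// Convert floats in [-1, 1] to 16-bit PCM, clamping anything out of range.
function toInt16(samples: Float32Array): Int16Array {
  const out = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i] ?? 0));
    out[i] = Math.round(clamped * 32767);
  }
  return out;
}

// Interleave two mono streams: channel 0 = you (mic), channel 1 = them (system).
function interleave(you: Int16Array, them: Int16Array): Int16Array {
  const frames = Math.min(you.length, them.length);
  const out = new Int16Array(frames * 2);
  for (let i = 0; i < frames; i++) {
    out[2 * i] = you[i] ?? 0;
    out[2 * i + 1] = them[i] ?? 0;
  }
  return out;
}
```

For example, `interleave(toInt16(downsampleByAveraging(mic, 3)), toInt16(downsampleByAveraging(system, 3)))` gives one 16 kHz two-channel buffer where each party stays on its own channel.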

Takeaway
The AI/LLM layer had the most tutorials and fewest surprises. The audio layer — OS quirks, processing that stacks invisibly, startup races — was where every real bug lived. Building anything that listens to a call? Budget your time accordingly.

Happy to go deeper in the comments — especially how others handle AEC/AGC stacking cross-platform.

(Building this as TryCuebird — a real-time interview copilot. The capture lessons above are the generalizable bits.)