Microsoft Teams has a lot of AI features now. Copilot can summarize meetings after the fact. It can generate action items. It can transcribe. What it cannot do — at least not in any form that worked for us — is let you summon an AI agent mid-meeting with a wake word and have a real conversation, with the AI hearing everyone, speaking back, and remembering context from the whole call.
We wanted that. So we built it.
The result is an open-source voice pipeline that integrates a Claude-backed AI agent into live Teams meetings. You say a wake word, the AI activates, hears your question in context, and responds in natural speech — all in real time, with echo cancellation so it doesn't confuse its own voice for input.
GitHub: teams-meeting-agent-public
Why not just use the Teams API?
Teams does have a calling API. You can add bots to meetings. But the meeting bot API is designed for structured integrations — transcription services, recording bots, meeting note-takers. The audio pipeline it exposes is not well-suited for a low-latency conversational agent that needs to hear all participants, do speaker identification, respond in under two seconds, and handle interruption.
We also have specific constraints: we work across a Mac (where the Teams client runs) and a Linux GPU server (where we run inference). The GPU server is where we want the speech recognition — running faster-whisper on CUDA is significantly faster and more accurate than anything you can do on a Mac in real time. That means we need a bridge between two machines, with audio flowing across an SSH tunnel.
The path of least resistance turned out to be: use PulseAudio virtual devices on the Linux side to intercept Teams audio, do all processing there, and build a custom WebSocket relay to coordinate everything.
Architecture overview
Mac (Teams client)                    Linux GPU Server
┌─────────────────────┐               ┌──────────────────────────────────┐
│                     │               │                                  │
│  Microsoft Teams    │◄──────────────│  teams_speaker (null-sink)       │
│  (speaker output)   │               │  teams_virtual_mic (null-sink)   │
│  (mic input)        │               │  teams_mic_input (virtual-source)│
│                     │               │                                  │
│  bridge.py          │◄─WebSocket────│  ws_relay.py (port 8765)         │
│  (wake word,        │───speak cmd──►│                                  │
│   transcript buf)   │               │  stt_pipeline.py                 │
│                     │               │  (faster-whisper + VAD)          │
│  OpenClaw Agent     │               │                                  │
│  (Claude + memory)  │               │  tts_pipeline.py                 │
│                     │               │  (Edge-TTS → PulseAudio)         │
└─────────────────────┘               │                                  │
          │                           │  speaker_id.py                   │
          └────────SSH tunnel─────────│  (ECAPA-TDNN voiceprints)        │
                   (port 8765)        └──────────────────────────────────┘
The flow for a single exchange:
- Teams audio plays through teams_speaker (a PulseAudio null-sink)
- The STT pipeline captures the .monitor stream and runs VAD + speaker ID + whisper
- The transcript is sent via WebSocket to the Mac bridge
- The bridge detects the wake word, buffers context, and sends it to the OpenClaw agent over HTTP
- The agent generates a reply, and the bridge sends a speak command back via WebSocket
- The TTS pipeline synthesizes with Edge-TTS and streams to teams_virtual_mic
- Teams hears the AI speaking through teams_mic_input
Everything except the agent LLM call runs locally. STT is on-device CUDA, TTS streams within 200ms of the first Edge-TTS chunk, and the whole round-trip from wake word to first spoken word is typically under two seconds.
The PulseAudio trick
The whole system depends on a PulseAudio setup that most people haven't seen before. On the Linux server, we create three virtual audio devices:
pactl load-module module-null-sink \
sink_name=teams_speaker \
sink_properties=device.description=Teams_Speaker
pactl load-module module-null-sink \
sink_name=teams_virtual_mic \
sink_properties=device.description=Teams_Virtual_Mic
pactl load-module module-virtual-source \
source_name=teams_mic_input \
master=teams_virtual_mic.monitor \
source_properties=device.description=Teams_Mic_Input
teams_speaker is a null-sink — audio goes in and plays to nothing. But null-sinks in PulseAudio automatically create a .monitor source that exposes the audio as a readable stream. So by setting Teams' speaker output to Teams_Speaker, we get teams_speaker.monitor — a real-time PCM stream of everything Teams is playing, including all meeting participants. The STT pipeline reads from this.
teams_virtual_mic and teams_mic_input work the same way in reverse. The TTS pipeline writes synthesized speech to teams_virtual_mic. The module-virtual-source wraps that sink's monitor as a proper source (teams_mic_input), which we set as Teams' microphone input. So when the AI "speaks", Teams hears it as a microphone signal.
This is entirely transparent to Teams. It doesn't know it's talking to virtual devices. No API access required. No bot registration. The Linux server just appears to be a meeting participant with a very smart microphone.
There's one complication: the virtual devices need to be created inside the Chrome Remote Desktop (CRD) PulseAudio session, not the system PulseAudio. CRD runs its own isolated PulseAudio daemon with a non-standard socket path. The startup script detects this automatically:
PULSE_PATH=$(ssh "${SSH_ALIAS}" \
"cat /proc/\$(pgrep -u \$USER pulseaudio | tail -1)/environ 2>/dev/null \
| tr '\0' '\n' | grep PULSE_RUNTIME_PATH | cut -d= -f2")
It reads the environment of the running PulseAudio process to find the socket path, then exports PULSE_SERVER=unix:${PULSE_PATH}/native before every pactl call. Everything else just works.
STT pipeline: faster-whisper + VAD + echo cancellation
The STT pipeline is where most of the interesting signal processing happens. It runs on the Linux server and has four responsibilities: capture audio, detect speech boundaries, transcribe, and suppress its own voice.
Audio capture uses PulseAudio's parec to stream raw PCM from teams_speaker.monitor at 16kHz mono (the format faster-whisper expects). Audio comes in as continuous 30ms chunks.
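Under the hood this is just a parec subprocess read in fixed-size chunks. A minimal sketch, with the device name from the setup above; the chunking math assumes s16le mono at 16 kHz, and the repo's actual capture code may differ:

```python
# Sketch: spawn parec against the null-sink monitor and yield 30 ms chunks.
import subprocess

RATE = 16000                                # faster-whisper's expected rate
CHUNK_MS = 30
CHUNK_BYTES = RATE * 2 * CHUNK_MS // 1000   # 2 bytes per s16le sample

def capture_chunks(device="teams_speaker.monitor"):
    proc = subprocess.Popen(
        ["parec", f"--device={device}", "--format=s16le",
         f"--rate={RATE}", "--channels=1"],
        stdout=subprocess.PIPE)
    try:
        while True:
            chunk = proc.stdout.read(CHUNK_BYTES)
            if not chunk:
                break
            yield chunk
    finally:
        proc.terminate()
```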
Voice activity detection uses Silero VAD. The raw audio stream is chunked into 512-sample frames and fed to the model. The VAD runs fast enough that it doesn't add perceptible latency. Speech segments are accumulated until a silence gap triggers a transcription.
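The segment-boundary logic is simple enough to sketch in pure Python. The frame size and silence gap below are illustrative values, not the repo's:

```python
# Accumulate frames flagged as speech; emit a segment once the silence
# gap exceeds a threshold. A real pipeline would get the is_speech flag
# from Silero VAD per 512-sample frame.
FRAME_MS = 32                  # 512 samples at 16 kHz
SILENCE_GAP_MS = 600           # assumed gap before we transcribe

def segments(frames):
    """frames: iterable of (pcm_bytes, is_speech) pairs."""
    buf, silence_ms = [], 0
    for pcm, is_speech in frames:
        if is_speech:
            buf.append(pcm)
            silence_ms = 0
        elif buf:
            silence_ms += FRAME_MS
            if silence_ms >= SILENCE_GAP_MS:
                yield b"".join(buf)
                buf, silence_ms = [], 0
    if buf:
        yield b"".join(buf)
```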
Transcription uses faster-whisper with distil-large-v3 on CUDA. Distil-large-v3 is a distilled version of Whisper large-v3 — comparable accuracy, roughly 5x faster. For Chinese-English mixed meetings (our primary use case), it handles code-switching without needing language hints.
Echo cancellation is the part that took the most tuning. Without it, the AI would transcribe its own TTS output, which creates feedback loops where it hears itself speaking and tries to respond. The solution is a SpeakingState object that coordinates between the TTS and STT pipelines:
import time

ECHO_TAIL_SUPPRESS_SEC = 0.8  # tail window after TTS stops (seconds)

class SpeakingState:
    def __init__(self):
        self._speaking = False
        self._tail_suppress_until = 0.0

    def set_speaking(self, val: bool):
        self._speaking = val
        if not val:
            # TTS just finished: keep suppressing while buffered audio drains
            self._tail_suppress_until = time.time() + ECHO_TAIL_SUPPRESS_SEC

    def is_suppressed(self) -> bool:
        return self._speaking or time.time() < self._tail_suppress_until
When TTS starts playing, set_speaking(True) is called. STT checks is_suppressed() before processing any VAD-triggered segment. After TTS finishes, suppression continues for a configurable tail window (we use 0.8 seconds) to catch audio still draining through the PulseAudio buffer.
Barge-in detection is the other half of the interruption story. When a human speaks while the AI is talking, we want to stop the AI mid-sentence. This is done with energy-based detection rather than VAD, because VAD is too slow for a real-time interrupt trigger:
# Fast 200ms window vs slow 3.2s baseline
fast_energy = rms(audio[-200ms:])
slow_energy = rms(audio[-3200ms:])
if fast_energy > slow_energy * BARGE_IN_RATIO:
trigger_interrupt()
If the fast window's energy spikes above a ratio of the slow baseline, it signals a barge-in. The TTS pipeline receives an interrupt command and stops playback immediately.
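Here is a runnable version of that check, assuming s16le mono audio at 16 kHz; BARGE_IN_RATIO is a placeholder value you would tune per environment:

```python
# Energy-based barge-in: compare RMS of the last 200 ms against the RMS
# of the last 3.2 s baseline.
import array, math

RATE = 16000
BARGE_IN_RATIO = 3.0  # assumed value, not from the repo

def rms(pcm: bytes) -> float:
    samples = array.array("h", pcm)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def is_barge_in(audio: bytes) -> bool:
    fast = rms(audio[-int(0.2 * RATE) * 2:])   # last 200 ms
    slow = rms(audio[-int(3.2 * RATE) * 2:])   # 3.2 s baseline
    return slow > 0 and fast > slow * BARGE_IN_RATIO
```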
Speaker identification: voiceprint matching
Not everything said in a meeting should reach the AI. You might want only the meeting host to be able to invoke the agent, or only people on your team. Speaker identification solves this.
The implementation uses speechbrain's ECAPA-TDNN model — a speaker verification model that produces 192-dimensional speaker embeddings. For each audio segment, we extract an embedding and match it against a set of registered voiceprints using cosine similarity:
similarity = cosine_similarity(embedding, voiceprint)
if similarity > SPEAKER_MATCH_THRESHOLD: # 0.30 by default
return "matched_speaker_name"
The threshold of 0.30 is intentionally low — we'd rather have a false positive (recognizing an unknown speaker as known) than miss a legitimate user. For a trust boundary where only certain people can invoke the agent, you'd raise this.
The model is language-agnostic. It works equally well for Chinese and English speakers, which matters for our meetings. Inference runs in under 10ms on CUDA, so it adds negligible latency to the transcription pipeline.
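The matching step itself reduces to cosine similarity against each registered embedding. A dependency-free sketch; identify and the voiceprint dict are illustrative names, not the repo's API:

```python
# Pick the best-scoring registered voiceprint above the threshold,
# or None for an unknown speaker. Works for any embedding dimension.
import math

SPEAKER_MATCH_THRESHOLD = 0.30

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def identify(embedding, voiceprints):
    best_name, best_sim = None, SPEAKER_MATCH_THRESHOLD
    for name, vp in voiceprints.items():
        sim = cosine_similarity(embedding, vp)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name
```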
Speaker identity flows into the WebSocket relay as a trust layer. Transcripts are tagged with either verified (matched a known voiceprint) or untrusted (unknown speaker). The bridge uses this to filter who can invoke the wake word.
Wake word detection and the bridge
The bridge runs on the Mac. Its job is to sit between the Linux server and the OpenClaw agent, making routing decisions about what gets sent where.
When a transcript arrives from the Linux server, the bridge checks for wake words using regex pattern matching. The wake word list is configurable — in our setup it includes "hey claude", "hey agent", and a few Chinese equivalents. The bridge also supports a "presentation mode" where the wake word requirement is relaxed and all transcripts flow through.
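A minimal version of that regex check might look like this; the word list and presentation-mode flag are simplified from the configurable setup described above:

```python
# Case-insensitive wake-word matching over a configurable word list.
import re

WAKE_WORDS = ["hey claude", "hey agent"]  # subset of the configured list
WAKE_RE = re.compile("|".join(re.escape(w) for w in WAKE_WORDS), re.IGNORECASE)

def has_wake_word(transcript: str, presentation_mode: bool = False) -> bool:
    # In presentation mode every transcript flows through.
    return presentation_mode or bool(WAKE_RE.search(transcript))
```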
When a wake word is detected, the bridge transitions to "engaged" mode: it buffers subsequent transcripts, accumulates context from multiple speakers, and flushes the buffer to the OpenClaw agent's HTTP API when there's a natural pause. This means the AI doesn't just hear one sentence — it gets a multi-turn context window of what was being discussed when it was summoned.
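The buffering behaviour can be sketched as a small accumulator that flushes once the gap since the last transcript exceeds a pause threshold; the 2-second value here is an assumption, not the repo's:

```python
# Accumulate speaker-tagged transcripts; flush as one context block
# after a natural pause.
import time

PAUSE_SEC = 2.0  # assumed pause threshold

class ContextBuffer:
    def __init__(self):
        self.lines = []
        self.last_at = 0.0

    def add(self, speaker, text, now=None):
        self.last_at = time.time() if now is None else now
        self.lines.append(f"{speaker}: {text}")

    def flush_if_paused(self, now=None):
        now = time.time() if now is None else now
        if self.lines and now - self.last_at >= PAUSE_SEC:
            block, self.lines = "\n".join(self.lines), []
            return block
        return None
```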
The bridge also handles the response path. When OpenClaw generates a reply, the bridge sends a speak command back to the Linux server's WebSocket relay:
{"cmd": "speak", "text": "Here's the answer to your question..."}
The TTS pipeline picks this up, synthesizes it, and plays it through the virtual microphone.
Connection resilience is important here — meetings can last hours. The bridge implements automatic reconnection with exponential backoff and heartbeat pings to detect silent disconnections before they cause dropped transcripts.
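The backoff schedule itself is a few lines; the base, factor, and cap below are assumed values:

```python
# Exponential backoff schedule for reconnect attempts, capped at max_delay.
def backoff_delays(base=1.0, factor=2.0, max_delay=30.0):
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, max_delay)
```

In practice you would add jitter to each delay so multiple clients don't reconnect in lockstep.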
TTS pipeline: streaming synthesis
Edge-TTS is a free, high-quality TTS service that produces natural-sounding speech. The catch is that it returns compressed MP3 rather than the raw PCM PulseAudio wants, and a naive integration waits for the whole clip to download and decode before anything plays.
We avoid that with chunked decoding. The edge-tts Python library exposes the MP3 data as a stream while synthesis is still in progress, so we pipe it through ffmpeg to convert to PCM on the fly and write to PulseAudio as chunks arrive:
async for chunk in communicate.stream():
if chunk["type"] == "audio":
process.stdin.write(chunk["data"]) # ffmpeg stdin
# ffmpeg is already decoding and writing to PulseAudio
The result is that the first audible audio plays within 200-300ms of the speak command arriving. For a meeting context where people are already talking and waiting for a response, this latency is barely perceptible.
The TTS pipeline integrates with the same SpeakingState used by STT. Before writing each chunk, it checks for an interrupt signal. If barge-in was detected while audio was being generated, playback stops mid-sentence and the pipeline sends an acknowledgment back to the bridge so the agent knows the response was cut short.
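The interrupt check reduces to polling a flag between chunk writes. A sketch using a threading.Event as the barge-in signal, where write_chunk stands in for the ffmpeg/PulseAudio sink:

```python
# Interrupt-aware playback: stop between chunks when barge-in is flagged.
import threading

def play_chunks(chunks, write_chunk, interrupt):
    """Returns True if playback completed, False if it was cut short."""
    for chunk in chunks:
        if interrupt.is_set():
            return False
        write_chunk(chunk)
    return True
```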
One-command startup
The entire system — SSH tunnel, remote PulseAudio setup, both processes in tmux sessions — starts with a single script:
./start_meeting.sh
The script handles the full orchestration:
- Check if SSH tunnel is already running on port 8765; create it if not
- SSH into the Linux server, detect the CRD PulseAudio socket path
- Load the three virtual audio devices (idempotent — skips if already loaded)
- Write a launcher script on the remote to avoid shell-escaping issues with the socket path
- Start main_linux.py in a remote tmux session (teams-voice)
- Poll until the WebSocket is accepting connections
- Start bridge.py in a local tmux session (teams-bridge)
After startup, the output shows exactly what's running and what to configure in Teams:
══════════════════════════════════════════════════
Teams Voice Agent — RUNNING
══════════════════════════════════════════════════
Session key : voice-meeting-20260313-1430
Bridge tmux : teams-bridge (local)
Remote tmux : teams-voice (on gpu-server)
Tunnel : localhost:8765 → gpu-server:8765
⚠️ Set Teams audio: Speaker=Teams_Speaker, Mic=Teams_Mic_Input
══════════════════════════════════════════════════
The only manual step is setting Teams' audio devices to the virtual ones. After that, everything is hands-free.
In practice
In a typical meeting, the pipeline is completely silent until activated. STT is running continuously, speaker ID is tagging everyone, but nothing flows to the agent. The AI is listening but not present.
When someone says "hey claude, can you look something up", the bridge catches the wake word, buffers the next few sentences of context, and sends the whole thing to the agent. The agent responds in under two seconds, the voice comes through everyone's speakers (since the AI is speaking through the virtual microphone), and then the pipeline goes quiet again.
We use it primarily for knowledge retrieval during technical discussions — pulling up documentation, cross-referencing notes, summarizing what was decided earlier in the call. The agent has access to the OpenClaw memory system, so it can retrieve context from past meetings about the same project. This is the part that still surprises people in meetings: the AI not only answers the question but references a decision from last week's call.
Barge-in works better than expected. If someone starts talking while the AI is mid-response, it stops within about half a second. There's a brief artifact from the audio that was already in the PulseAudio buffer, but it's not disruptive.
The main limitation is that it requires Chrome Remote Desktop to be running on the Linux server. CRD creates the PulseAudio environment that the virtual devices live in. Without it, you'd need to adapt the startup script to work with whatever PulseAudio setup you have. The core pipeline code doesn't care — it just needs the right device names to exist.
What we'd do differently
The current architecture puts all the signal processing on the Linux server, which makes sense for us but isn't universal. If you have a GPU-capable Mac or just want a simpler setup, the STT pipeline could run locally using MLX-Whisper — roughly the same accuracy on Apple Silicon, no remote server needed.
Speaker identification is currently based on pre-registered voiceprints. A more robust approach would be to do diarization-style "who spoke when" identification without requiring enrollment, using something like pyannote. This would let the agent attribute meeting contributions even for people who haven't registered.
The wake word system is regex-based, which works but is brittle to accents and speech recognition errors. A proper wake word model (like openWakeWord) would be more reliable, especially for non-English activations.
Takeaway
The interesting engineering here isn't the AI part — Claude handles that. It's the audio plumbing: getting the right audio to the right place at the right time, without the AI confusing its own voice for input, without adding enough latency to make conversation awkward.
PulseAudio virtual devices are genuinely powerful for this kind of audio routing. The null-sink + monitor pattern is a clean way to intercept audio streams without patching into any application's internal pipeline. More people should know it exists.
The full source is at github.com/QiushiWu95/teams-meeting-agent-public. The README has setup instructions for the hardware configuration we use (GPU server + Mac over Tailscale), but the pipeline itself should adapt to other setups with moderate effort.
This article was originally published on claw-stack.com. We're building an open-source AI agent runtime — check out the docs or GitHub.