I recently finished building AI-RTC-Agent, an open-source real-time voice assistant workspace. It handles low-latency audio streaming and voice activity segmentation, and it executes local tools (like search, email, and calendar) while keeping the voice stream steady.
Here is the GitHub repository if you want to check out the code or run it locally:
https://github.com/zkzkGamal/AI-RTC-Agent
The Architecture
The project is split into four decoupled services to keep CPU-heavy tasks from blocking the audio processing loop:
- React Client: A Vite frontend that manages the microphone with the browser's RTCPeerConnection API and handles half-duplex turn control to prevent audio feedback.
- WebRTC Audio Processor: An asynchronous Python backend using aiortc and webrtcvad. It downsamples 48 kHz audio to 16 kHz for voice activity detection and segments user speech.
- FastAPI Orchestrator: Uses LangGraph to manage intent routing and conversation state.
- FastMCP Server: Runs a warm-booted Whisper model locally for speech-to-text (STT) and exposes search and Google API tools.
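To make the audio processor's job more concrete, here is a minimal sketch of the downsample-and-segment step. It is not the code from the repo: the decimation is a naive average of every three samples (a real pipeline would use a proper resampler), `is_speech` is an injected callable standing in for something like `webrtcvad.Vad(...).is_speech(frame, 16000)`, and the 30 ms frame size and `hangover_frames` value are assumptions chosen for illustration.

```python
import array
from typing import Callable, Iterable, List

SRC_RATE = 48_000
DST_RATE = 16_000
FRAME_MS = 30  # assumed frame length; webrtcvad accepts 10, 20, or 30 ms frames
FRAME_BYTES = DST_RATE * FRAME_MS // 1000 * 2  # 480 samples of 16-bit PCM = 960 bytes


def downsample_48k_to_16k(pcm: bytes) -> bytes:
    """Naive 3:1 decimation of 16-bit mono PCM (averages each group of 3 samples)."""
    samples = array.array("h")
    samples.frombytes(pcm)  # expects an even number of bytes
    usable = len(samples) - len(samples) % 3
    out = array.array(
        "h",
        ((samples[i] + samples[i + 1] + samples[i + 2]) // 3 for i in range(0, usable, 3)),
    )
    return out.tobytes()


def split_frames(pcm16k: bytes) -> List[bytes]:
    """Cut 16 kHz PCM into fixed-size frames, dropping any trailing partial frame."""
    last_start = len(pcm16k) - FRAME_BYTES
    return [pcm16k[i:i + FRAME_BYTES] for i in range(0, last_start + 1, FRAME_BYTES)]


def segment_speech(
    frames: Iterable[bytes],
    is_speech: Callable[[bytes], bool],
    hangover_frames: int = 10,
) -> List[bytes]:
    """Group frames into utterances; an utterance ends after `hangover_frames` silent frames."""
    segments: List[bytes] = []
    current: List[bytes] = []
    pending: List[bytes] = []  # silent frames that may turn out to be a short pause
    for frame in frames:
        if is_speech(frame):
            current.extend(pending)  # the pause was short, keep it inside the utterance
            pending.clear()
            current.append(frame)
        elif current:
            pending.append(frame)
            if len(pending) >= hangover_frames:
                segments.append(b"".join(current))
                current, pending = [], []
    if current:
        segments.append(b"".join(current))
    return segments
```

Each returned segment is one utterance of 16 kHz PCM that could then be handed to the STT service.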
Decoupling the WebRTC connection from the transcription and tool execution was critical. If the thread running the audio ingestion gets blocked by a transcription job or a web search, the audio stream drops frames. Offloading transcription and tool calls to the FastMCP instance keeps that work off the audio thread.
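The same idea can be shown in miniature with plain asyncio. This is only an illustration under stated assumptions, not how the repo does it: `blocking_transcribe` is a hypothetical stand-in for a slow STT or tool call (the project sends that work to a separate service, whereas this sketch uses a worker thread), and the frame timing is simulated with `asyncio.sleep`. It requires Python 3.9+ for `asyncio.to_thread`.

```python
import asyncio
import time
from typing import Callable, List, Tuple


def blocking_transcribe(segment: bytes) -> str:
    """Hypothetical stand-in for a slow, blocking STT or tool call."""
    time.sleep(0.2)
    return f"<{len(segment)} bytes transcribed>"


async def ingest_audio(
    n_frames: int,
    frame_interval: float,
    on_segment: Callable[[bytes], None],
) -> int:
    """Simulated audio loop: it must keep ticking even while a transcription runs."""
    handled = 0
    for i in range(n_frames):
        await asyncio.sleep(frame_interval)  # simulates waiting for the next frame
        handled += 1
        if i == 0:
            on_segment(b"\x00" * 960)  # pretend the first frame closed an utterance
    return handled


async def main() -> Tuple[int, List[str]]:
    jobs: List[asyncio.Task] = []

    def on_segment(segment: bytes) -> None:
        # Schedule the slow work without awaiting it inside the audio loop.
        jobs.append(asyncio.create_task(asyncio.to_thread(blocking_transcribe, segment)))

    handled = await ingest_audio(20, 0.01, on_segment)
    transcripts = await asyncio.gather(*jobs)
    return handled, list(transcripts)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_transcribe` directly inside `ingest_audio` would stall the loop for the full 200 ms, which is the dropped-frames problem described above.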
Dynamic Model Switching
You can configure the system to use different LLMs and STT models by modifying the .env file. The orchestrator supports swapping the main language model among Ollama (for running local models like Qwen), OpenAI, and Google Gemini.
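As a rough sketch of what env-driven provider selection can look like: the variable names `LLM_PROVIDER` and `LLM_MODEL`, the provider keys, and the `LLMConfig` type are all assumptions made up for this example, so check the repo's .env template for the real keys. The sketch also assumes the .env file has already been loaded into the process environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Provider keys are assumed names for the three backends mentioned above.
SUPPORTED_PROVIDERS = {"ollama", "openai", "gemini"}


@dataclass(frozen=True)
class LLMConfig:
    provider: str
    model: str


def load_llm_config(env: Optional[Mapping[str, str]] = None) -> LLMConfig:
    """Read the provider and model from environment-style settings and validate them."""
    env = os.environ if env is None else env
    provider = env.get("LLM_PROVIDER", "").strip().lower()  # hypothetical variable name
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(
            f"LLM_PROVIDER must be one of {sorted(SUPPORTED_PROVIDERS)}, got {provider!r}"
        )
    model = env.get("LLM_MODEL", "").strip()  # hypothetical variable name
    if not model:
        raise ValueError("LLM_MODEL must be set")
    return LLMConfig(provider=provider, model=model)
```

Validating once at startup means a typo in the .env file fails fast instead of surfacing in the middle of a conversation.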
Service-to-Service Security
To secure communication between local microservices without the overhead of a database, I implemented a custom authentication middleware based on time-locked tokens. The client and servers each calculate a token from the current Unix time, bucketed into a sliding window of 5 seconds. The receiving service verifies the signature using its own clock, so auth stays stateless as long as the system clocks are synchronized.
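One common way to build a scheme like this is an HMAC over the current time window. The sketch below is an illustration, not the middleware from the repo: the shared secret, the choice of HMAC-SHA256, the function names, and the tolerance of one adjacent window for clock skew are all assumptions.

```python
import hashlib
import hmac
import time
from typing import Optional

WINDOW_SECONDS = 5  # window length taken from the description above


def _sign(secret: bytes, window: int) -> str:
    return hmac.new(secret, str(window).encode("ascii"), hashlib.sha256).hexdigest()


def make_token(secret: bytes, now: Optional[float] = None) -> str:
    """Derive a token from the shared secret and the current 5-second window."""
    now = time.time() if now is None else now
    return _sign(secret, int(now // WINDOW_SECONDS))


def verify_token(
    secret: bytes,
    token: str,
    now: Optional[float] = None,
    skew_windows: int = 1,  # assumed tolerance for small clock drift
) -> bool:
    """Recompute the token for the current and adjacent windows; nothing is stored."""
    now = time.time() if now is None else now
    current = int(now // WINDOW_SECONDS)
    for offset in range(-skew_windows, skew_windows + 1):
        expected = _sign(secret, current + offset)
        if hmac.compare_digest(expected, token):  # constant-time comparison
            return True
    return False
```

A caller would attach `make_token(secret)` to each request, and the receiver would call `verify_token(secret, token)`. With these settings a token is accepted for somewhere between 5 and 10 seconds after it is issued, and a captured token can be replayed until it expires.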
UI Feedback
To keep the UX responsive while tools are running, the FastAPI agent broadcasts Socket.IO events (like tool_start and tool_finished). The React frontend immediately displays indicators showing what the agent is doing (such as calling the DuckDuckGo search tool) before streaming the voice response back.
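A small sketch of how the start and finish events can be paired around a tool call. The event names `tool_start` and `tool_finished` come from the description above; everything else is an assumption for illustration, including the payload fields, the `tool_events` helper, and the injected `emit` callable (in a real server this would presumably be something like python-socketio's `AsyncServer.emit`).

```python
import asyncio
from contextlib import asynccontextmanager
from typing import Any, AsyncIterator, Awaitable, Callable, Dict

Emit = Callable[[str, Dict[str, Any]], Awaitable[None]]


@asynccontextmanager
async def tool_events(emit: Emit, tool_name: str) -> AsyncIterator[None]:
    """Emit tool_start before the wrapped block and tool_finished after it, even on error."""
    await emit("tool_start", {"tool": tool_name})
    status = "ok"
    try:
        yield
    except Exception:
        status = "error"
        raise
    finally:
        await emit("tool_finished", {"tool": tool_name, "status": status})


async def run_search(emit: Emit, query: str) -> str:
    async with tool_events(emit, "duckduckgo_search"):
        await asyncio.sleep(0)  # placeholder for the real tool call
        return f"results for {query!r}"
```

Wrapping the call in a context manager guarantees the frontend always receives a matching `tool_finished`, so an indicator cannot get stuck if a tool raises.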
Feel free to check out the setup instructions and run start.sh to test it out. I would love to get your feedback on the architecture.