I recently built a production-grade real-time Voice AI workspace from scratch. While the whole system has many moving parts, two components required the most careful engineering: the authentication middleware between services and the Speech-to-Text (STT) pipeline.
Here’s exactly how I approached and solved both.
The Middleware Problem
I needed two local microservices — a WebRTC audio server and a FastMCP server — to communicate securely.
I didn’t want to introduce a database, Redis, or any hardcoded secrets. The solution had to be lightweight, stateless, and still reasonably secure for internal communication.
So I built a dynamic time-locked API key generator.
How it works:
Both services independently calculate the same cryptographic key using the current UTC timestamp.
- Take the current timestamp
- Floor-divide it by 5 seconds to get the index of the current window
- Generate a deterministic key from that value
If a request arrives outside the valid 5-second window, it is immediately rejected.
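The steps above can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: the post does not say which primitive derives the key, so HMAC-SHA256 is an assumption, as are the names `window_index`, `derive_key`, and `is_valid`. It also assumes both services receive a shared secret at startup (for example from an environment variable rather than the source code), because a key derived from the timestamp alone could be recomputed by anyone who knows the algorithm.

```python
import hashlib
import hmac
import time
from typing import Optional

WINDOW_SECONDS = 5  # window length stated in the post


def window_index(now: float) -> int:
    """Floor-divide a Unix timestamp (seconds) by the window length."""
    return int(now // WINDOW_SECONDS)


def derive_key(secret: bytes, now: Optional[float] = None) -> str:
    """Derive the key for the window containing `now` (default: current time)."""
    if now is None:
        now = time.time()
    message = str(window_index(now)).encode("ascii")
    # Assumption: HMAC-SHA256 keyed with a shared secret.
    # The post does not name the primitive it uses.
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def is_valid(secret: bytes, presented: str, now: Optional[float] = None) -> bool:
    """Accept only a key derived in the same window as `now`."""
    expected = derive_key(secret, now)
    # compare_digest avoids leaking information through comparison timing.
    return hmac.compare_digest(expected.encode(), presented.encode())


if __name__ == "__main__":
    # Hypothetical demo value; a real deployment would load this at startup.
    secret = b"example-secret-for-demo-only"
    key = derive_key(secret, now=1_000.0)
    print(is_valid(secret, key, now=1_004.9))  # same window as the key
    print(is_valid(secret, key, now=1_005.0))  # next window
```

One consequence of strict windows is worth keeping in mind: a key derived just before a window boundary is rejected if the request lands just after it, so the two services need closely synchronized clocks.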
This approach gives me:
- No shared state
- No persistent storage
- No single point of failure
- Automatic key rotation every 5 seconds
The Real-Time STT Pipeline
I wanted low-latency transcription with zero cold starts and no HTTP polling.
Here’s the exact flow I created:
- Browser captures audio at 48kHz via WebRTC
- Audio is downsampled to 16kHz
- Voice Activity Detection (VAD) runs on 30ms frames, in lockstep with the incoming audio
- 2.0 seconds of continuous silence marks a speech boundary
- Audio segment is sent to the FastMCP server
- Whisper "small" model is preloaded at boot (zero cold starts)
- Transcription result is pushed back to the React frontend over WebRTC DataChannel
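The first steps of this flow come down to simple arithmetic: 48kHz to 16kHz is a factor of 3, and a 30ms frame at 16kHz is 480 samples. The sketch below shows that in plain Python. It is an illustration under assumptions, not the project's resampler: averaging groups of three samples is only a crude low-pass filter, and the names `downsample` and `frames` are hypothetical.

```python
from typing import List, Sequence

SOURCE_RATE = 48_000  # browser capture rate from the post
TARGET_RATE = 16_000  # rate after downsampling
FRAME_MS = 30         # VAD frame length from the post

DECIMATION = SOURCE_RATE // TARGET_RATE         # 3 input samples per output sample
FRAME_SAMPLES = TARGET_RATE * FRAME_MS // 1000  # 480 samples per 30ms frame


def downsample(samples: Sequence[float]) -> List[float]:
    """Average each group of 3 input samples into one output sample.

    Assumption for illustration: averaging stands in for a proper
    anti-aliasing filter. A trailing partial group is dropped.
    """
    usable = len(samples) - len(samples) % DECIMATION
    return [
        sum(samples[i:i + DECIMATION]) / DECIMATION
        for i in range(0, usable, DECIMATION)
    ]


def frames(samples: Sequence[float]) -> List[Sequence[float]]:
    """Split 16kHz audio into complete 30ms frames, dropping a trailing partial frame."""
    usable = len(samples) - len(samples) % FRAME_SAMPLES
    return [samples[i:i + FRAME_SAMPLES] for i in range(0, usable, FRAME_SAMPLES)]


if __name__ == "__main__":
    one_second = [0.0] * SOURCE_RATE
    low = downsample(one_second)
    print(len(low), len(frames(low)))
```

A production pipeline would normally use a dedicated resampling routine with a real low-pass filter, since naive decimation lets high-frequency content alias into the speech band.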
This gives a real-time feel, with sub-second latency in most cases once the 2.0-second silence boundary has closed a segment.
Architecture Flow
Why This Design Works Well
- Completely stateless middleware removes infrastructure complexity.
- Preloading Whisper eliminates cold start delays.
- Using WebRTC DataChannel for transcription delivery removes polling overhead.
- VAD segmentation and MCP tooling are cleanly separated, so each concern stays in its own layer.
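To make the VAD segmentation point concrete, here is a sketch of the 2.0-second silence rule from the pipeline, written so the speech detector is a plug-in function. Everything named here is an assumption for illustration: `energy_vad` is a placeholder amplitude check with a hypothetical threshold, not the VAD the project uses, and `segments` is a hypothetical helper. Because 2000ms is not a whole number of 30ms frames, the sketch rounds up to 67 frames.

```python
import math
from typing import Callable, Iterable, Iterator, List, Sequence

FRAME_MS = 30       # VAD frame length from the post
SILENCE_MS = 2_000  # silence that marks a speech boundary, from the post
# 2000 / 30 is not a whole number, so round up: 67 frames (2010 ms).
SILENCE_FRAMES = math.ceil(SILENCE_MS / FRAME_MS)

Frame = Sequence[float]


def energy_vad(frame: Frame, threshold: float = 0.01) -> bool:
    """Placeholder VAD: mean absolute amplitude above a hypothetical threshold."""
    if not frame:
        return False
    return sum(abs(s) for s in frame) / len(frame) > threshold


def segments(
    stream: Iterable[Frame],
    is_speech: Callable[[Frame], bool] = energy_vad,
) -> Iterator[List[Frame]]:
    """Yield one list of frames per utterance, closed after SILENCE_FRAMES silent frames."""
    current: List[Frame] = []
    silent_run = 0
    for frame in stream:
        if is_speech(frame):
            current.append(frame)
            silent_run = 0
        elif current:
            # Keep short pauses inside the utterance until the boundary is reached.
            current.append(frame)
            silent_run += 1
            if silent_run >= SILENCE_FRAMES:
                # Drop the trailing silence before handing the segment on.
                yield current[:-silent_run]
                current = []
                silent_run = 0
    # Audio left over when the stream ends is not yielded here,
    # because no silence boundary was observed for it.


if __name__ == "__main__":
    loud, quiet = [0.5] * 480, [0.0] * 480
    stream = [quiet] * 5 + [loud] * 10 + [quiet] * SILENCE_FRAMES
    print([len(seg) for seg in segments(stream)])
```

Passing the detector in as `is_speech` keeps the boundary logic independent of whichever VAD is used, so the threshold or the model can change without touching the segmenter.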
The full project is open source and meant to serve as an educational blueprint for developers working with WebRTC, MCP, and real-time AI.
Repository: https://lnkd.in/dFbE44e3
Contributions are welcome — especially on the agent routing and LLM orchestration layer that’s currently in progress.
Let me know in the comments if you’d like me to dive deeper into any specific part (VAD tuning, Whisper post-processing, rate limiting, etc.).
