Building Zero-Shared-State Auth Middleware and Real-Time Whisper STT Pipeline for Voice AI

#webrtc #fastmcp #ai

I recently built a production-grade real-time Voice AI workspace from scratch. While the whole system has many moving parts, two components required the most careful engineering: the authentication middleware between services and the Speech-to-Text (STT) pipeline.

Here’s exactly how I approached and solved both.

The Middleware Problem

I needed two local microservices — a WebRTC audio server and a FastMCP server — to communicate securely.

I didn’t want to introduce a database, Redis, or any hardcoded secrets. The solution had to be lightweight, stateless, and still reasonably secure for internal communication.

So I built a dynamic time-locked API key generator.

How it works:

Both services independently calculate the same cryptographic key using the current UTC timestamp.

Take the current timestamp
Integer-divide it by 5 to get the index of the current 5-second epoch window
Generate a deterministic key from that value

If a request arrives outside the valid 5-second window, it is immediately rejected.
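The steps above can be sketched in a few lines of Python. This is only an illustration: the post does not say which hash is used or what exactly goes into it, so SHA-256, the `context` label, and the function names `window_index`, `derive_key`, and `is_valid` are all assumptions. Since hardcoded secrets are ruled out, the sketch derives the key from the window index and a fixed label only.

```python
import hashlib
import hmac
import time
from typing import Optional

WINDOW_SECONDS = 5  # window length from the post


def window_index(now: Optional[float] = None) -> int:
    """Index of the current 5-second window, counted from the Unix epoch (UTC)."""
    ts = time.time() if now is None else now
    return int(ts // WINDOW_SECONDS)


def derive_key(now: Optional[float] = None,
               context: bytes = b"voice-ai-internal") -> str:
    """Deterministic key for the window containing `now`.

    SHA-256 and the `context` label are assumptions made for this sketch.
    """
    idx = window_index(now)
    return hashlib.sha256(context + idx.to_bytes(8, "big")).hexdigest()


def is_valid(presented: str, now: Optional[float] = None) -> bool:
    """Accept the key only if it matches the key for the current window."""
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(presented, derive_key(now))
```

Under these assumptions the caller would attach `derive_key()` to each request and the receiver would call `is_valid()` on arrival. Both sides reach the same value without exchanging anything, provided their clocks agree on the current window.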

This approach gives me:

No shared state
No persistent storage
No single point of failure
Automatic key rotation every 5 seconds

The Real-Time STT Pipeline

I wanted low-latency transcription with zero cold starts and no HTTP polling.

Here’s the exact flow I created:

Browser captures audio at 48 kHz via WebRTC
Audio is downsampled to 16 kHz
Voice Activity Detection (VAD) runs in lockstep on 30 ms frames
2.0 seconds of continuous silence marks a speech boundary
Audio segment is sent to the FastMCP server
Whisper "small" model is preloaded at boot (zero cold starts)
Transcription result is pushed back to the React frontend over WebRTC DataChannel

This gives a true real-time feeling with sub-second end-to-end latency in most cases.
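The segmentation part of the flow (downsample, 30 ms VAD frames, 2.0 s silence boundary) could look roughly like the sketch below. The post does not name the VAD or the resampler it uses, so the RMS energy check, the `ENERGY_THRESHOLD` value, the averaging downsampler, and all function names here are stand-in assumptions chosen to keep the example dependency-free.

```python
import math
from array import array
from typing import Iterable, Iterator

SAMPLE_RATE = 16_000
FRAME_MS = 30
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000      # 480 samples per 30 ms frame
SILENCE_FRAMES = math.ceil(2.0 * 1000 / FRAME_MS)   # 67 frames, about 2.0 s
ENERGY_THRESHOLD = 500.0  # hypothetical RMS cutoff for 16-bit PCM


def downsample_48k_to_16k(samples: array) -> array:
    """Average each group of 3 samples: a crude low-pass plus 3:1 decimation."""
    usable = len(samples) - len(samples) % 3
    return array("h", (sum(samples[i:i + 3]) // 3 for i in range(0, usable, 3)))


def is_speech(frame: array) -> bool:
    """Stand-in VAD: treat a frame as speech if its RMS energy is high enough."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms >= ENERGY_THRESHOLD


def segment(frames: Iterable[array]) -> Iterator[array]:
    """Yield one utterance each time 2.0 s of continuous silence follows speech.

    Every frame is assumed to hold exactly FRAME_SAMPLES samples.
    """
    buffered = array("h")
    silent_run = 0
    heard_speech = False
    for frame in frames:
        if is_speech(frame):
            heard_speech = True
            silent_run = 0
            buffered.extend(frame)
        elif heard_speech:
            silent_run += 1
            buffered.extend(frame)
            if silent_run >= SILENCE_FRAMES:
                # Drop the trailing silence before handing the segment on
                yield buffered[:len(buffered) - silent_run * FRAME_SAMPLES]
                buffered = array("h")
                silent_run = 0
                heard_speech = False
```

Each array yielded by `segment()` corresponds to one audio segment that would be sent on for transcription. Leading silence is skipped, and pauses shorter than 2.0 seconds stay inside the utterance.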

Architecture Flow

Why This Design Works Well

Completely stateless middleware removes infrastructure complexity.
Preloading Whisper eliminates cold start delays.
Using WebRTC DataChannel for transcription delivery removes polling overhead.
VAD segmentation and MCP tooling keep concerns clearly separated.
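The preloading point can be illustrated with a small wrapper that loads the model once when the server starts. The post does not say which Whisper implementation is used, so the `openai-whisper` package is an assumption here, and `Transcriber` and `load_whisper_small` are hypothetical names for this sketch.

```python
from typing import Any, Callable


def load_whisper_small() -> Any:
    # Imported lazily so this module can be imported without the dependency.
    import whisper  # assumption: the openai-whisper package
    return whisper.load_model("small")


class Transcriber:
    """Holds one model instance for the lifetime of the server process."""

    def __init__(self, loader: Callable[[], Any] = load_whisper_small) -> None:
        # The load cost is paid once here, not on the first request
        self._model = loader()

    def transcribe(self, audio: Any) -> str:
        # openai-whisper accepts a file path or a float32 NumPy array
        # of 16 kHz mono samples and returns a dict with a "text" key
        result = self._model.transcribe(audio)
        return result["text"].strip()
```

Creating a single `Transcriber()` at startup, before the server begins accepting segments, means every request reuses the already loaded model.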

The full project is open source and meant to serve as an educational blueprint for developers working with WebRTC, MCP, and real-time AI.

Repository: https://lnkd.in/dFbE44e3

Contributions are welcome — especially on the agent routing and LLM orchestration layer that’s currently in progress.

Let me know in the comments if you’d like me to dive deeper into any specific part (VAD tuning, Whisper post-processing, rate limiting, etc.).
