zkaria gamal

Building Zero-Shared-State Auth Middleware and Real-Time Whisper STT Pipeline for Voice AI

I recently built a production-grade real-time Voice AI workspace from scratch. While the whole system has many moving parts, two components required the most careful engineering: the authentication middleware between services and the Speech-to-Text (STT) pipeline.

Here’s exactly how I approached and solved both.

The Middleware Problem

I needed two local microservices, a WebRTC audio server and a FastMCP server, to communicate securely.

I didn’t want to introduce a database, Redis, or any hardcoded secrets. The solution had to be lightweight, stateless, and still reasonably secure for internal communication.

So I built a dynamic time-locked API key generator.

How it works:

Both services independently calculate the same cryptographic key using the current UTC timestamp.

  • Take the current UTC timestamp in seconds
  • Integer-divide it by the 5-second window length to get a window index
  • Derive a deterministic key from that index

A key computed in a different 5-second window no longer matches, so a request that arrives outside the valid window is rejected immediately.
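The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual code: the function names `derive_key` and `is_valid` are hypothetical, and the use of SHA-256 is an assumption, since the article does not name the hash. The derivation uses only the window index, in line with the no-shared-secret constraint described earlier.

```python
import hashlib
import hmac
import time
from typing import Optional

WINDOW_SECONDS = 5  # window length taken from the article


def derive_key(now: Optional[float] = None, window: int = WINDOW_SECONDS) -> str:
    """Derive the key for the window that contains `now` (Unix epoch seconds)."""
    if now is None:
        now = time.time()  # epoch seconds are UTC based on every host
    index = int(now // window)  # identical for every instant inside one window
    # Assumption: SHA-256 over the window index; the hash choice is illustrative.
    return hashlib.sha256(str(index).encode("ascii")).hexdigest()


def is_valid(presented_key: str, now: Optional[float] = None) -> bool:
    """Accept the key only if it matches the receiver's current window."""
    expected = derive_key(now)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(presented_key.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    key = derive_key()
    print(key, is_valid(key))
```

In this strict form the sender and receiver must land in the same window, so a key minted in the last instant of one window and checked at the start of the next would be rejected. The sketch therefore assumes both local services read closely synchronized clocks.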

This approach gives me:

  • No shared state
  • No persistent storage
  • No single point of failure
  • Automatic key rotation every 5 seconds

The Real-Time STT Pipeline

I wanted low-latency transcription with zero cold starts and no HTTP polling.

Here’s the exact flow I created:

  • Browser captures audio at 48 kHz via WebRTC
  • Audio is downsampled to 16 kHz
  • Voice Activity Detection (VAD) runs in lockstep on 30 ms frames
  • 2.0 seconds of continuous silence marks a speech boundary
  • Audio segment is sent to the FastMCP server
  • Whisper "small" model is preloaded at boot (zero cold starts)
  • Transcription result is pushed back to the React frontend over a WebRTC DataChannel
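To make the downsampling and segmentation steps concrete, here is a rough pure-Python sketch. The numbers (48 kHz to 16 kHz, 30 ms frames, 2.0 s of silence) come from the flow above. Everything else is an assumption for illustration: the names `downsample_by_3`, `energy_vad`, and `Segmenter` are hypothetical, the averaging decimator stands in for a proper anti-aliasing resampler, and the energy threshold stands in for whatever VAD the project actually uses.

```python
import math
from typing import Callable, List, Optional, Sequence

SAMPLE_RATE = 16_000     # Hz after downsampling (from the article)
FRAME_MS = 30            # VAD frame length (from the article)
SILENCE_SECONDS = 2.0    # silence that closes a segment (from the article)

FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000                 # 480 samples
SILENCE_FRAMES = math.ceil(SILENCE_SECONDS * 1000 / FRAME_MS)  # 67 frames


def downsample_by_3(samples: Sequence[float]) -> List[float]:
    """Crude 48 kHz -> 16 kHz step: average each complete group of 3 samples."""
    usable = len(samples) - len(samples) % 3
    return [sum(samples[i:i + 3]) / 3 for i in range(0, usable, 3)]


def energy_vad(frame: Sequence[float], threshold: float = 0.01) -> bool:
    """Hypothetical stand-in VAD: mean squared amplitude above a threshold."""
    return sum(s * s for s in frame) / len(frame) > threshold


class Segmenter:
    """Buffers frames and emits a segment after enough continuous silence."""

    def __init__(self, is_speech: Callable[[Sequence[float]], bool] = energy_vad,
                 silence_frames: int = SILENCE_FRAMES) -> None:
        self.is_speech = is_speech
        self.silence_frames = max(1, silence_frames)
        self._frames: List[Sequence[float]] = []
        self._silent_run = 0
        self._heard_speech = False

    def push(self, frame: Sequence[float]) -> Optional[List[float]]:
        """Feed one frame; return the finished segment, or None if still open."""
        if self.is_speech(frame):
            self._heard_speech = True
            self._silent_run = 0
        elif self._heard_speech:
            self._silent_run += 1
        else:
            return None  # leading silence: nothing worth buffering yet
        self._frames.append(frame)
        if self._silent_run < self.silence_frames:
            return None
        # Boundary reached: drop the trailing silence and flatten the rest.
        speech_frames = self._frames[:-self._silent_run]
        segment = [s for f in speech_frames for s in f]
        self._frames, self._silent_run, self._heard_speech = [], 0, False
        return segment
```

Each returned segment is what would be handed to the FastMCP server for transcription. Passing the VAD in as a callable keeps the boundary logic independent of the detector, so a different VAD can be swapped in without touching the segmenter.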

This gives a real-time feel: in most cases the transcript reaches the frontend less than a second after a speech boundary is detected.

Architecture Flow

[Figure: architecture flow diagram]

Why This Design Works Well

  • Completely stateless middleware removes infrastructure complexity.
  • Preloading Whisper eliminates cold start delays.
  • Using a WebRTC DataChannel for transcription delivery removes polling overhead.
  • Clear separation of concerns with VAD segmentation and MCP tooling.
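The preloading point can be illustrated with a small wrapper that loads the model once, at construction time, and reuses it for every request. This is a sketch under assumptions: the class name `PreloadedTranscriber` is hypothetical, and the loader is injected so the pattern does not depend on a specific Whisper package. If the project uses the `openai-whisper` package (not confirmed by the article), the loader could be `lambda: whisper.load_model("small")`, whose `transcribe` method returns a dict with a `"text"` key.

```python
from typing import Any, Callable


class PreloadedTranscriber:
    """Loads the model eagerly so no request ever pays the load cost."""

    def __init__(self, load_model: Callable[[], Any]) -> None:
        # Runs once, at service boot, before any audio segment arrives.
        self._model = load_model()

    def transcribe(self, audio: Any) -> str:
        # Assumption: the model exposes transcribe(audio) -> {"text": ...}.
        result = self._model.transcribe(audio)
        return result["text"].strip()


if __name__ == "__main__":
    class EchoModel:
        """Tiny fake model used only to demonstrate the call pattern."""

        def transcribe(self, audio: Any) -> dict:
            return {"text": f" {len(audio)} samples "}

    stt = PreloadedTranscriber(EchoModel)  # created at startup
    print(stt.transcribe([0.0] * 480))     # reused for each segment
```

Creating the instance at module import or in the server's startup hook moves the load time into boot, which is the trade the article describes: a slower start in exchange for no cold start on the first request.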

The full project is open source and meant to serve as an educational blueprint for developers working with WebRTC, MCP, and real-time AI.

Repository: https://lnkd.in/dFbE44e3

Contributions are welcome, especially on the agent routing and LLM orchestration layer that’s currently in progress.

Let me know in the comments if you’d like me to dive deeper into any specific part (VAD tuning, Whisper post-processing, rate limiting, etc.).
