The auth problem nobody talks about when running AI microservices locally

Most voice AI tutorials assume you have an API key and a cloud endpoint. Mine had to run on the machine in front of me — 4GB GPU, no cloud, no managed auth layer.

That constraint forced me to solve a problem I hadn't seen written about anywhere: how do you authenticate requests between two local Python processes without a database, a shared secret sitting in your repo, or a session manager?

This is the story of what I built and the auth protocol I had to design from scratch.

What I Built

AI-RTC-Agent is a fully local real-time voice agent. The architecture has four isolated layers:

React client — captures mic audio, streams it over WebRTC using native RTCPeerConnection
Python WebRTC server — receives 48kHz PCM frames, runs VAD, segments utterances
FastMCP server — runs Whisper small for STT, plus email, calendar, and search tools
Agent layer — LLM intent routing with adapters for OpenAI, Gemini, and local Ollama

The data flow looks like this:

Browser mic → WebRTC (48kHz PCM) → VAD segmentation → FastMCP (Whisper STT) → transcript back over WebRTC DataChannel

No HTTP round-trip on the return path. The transcript is pushed directly over a WebRTC DataChannel, which keeps latency tight.

The Audio Pipeline

Before we get to auth, the VAD pipeline is worth explaining because it drives the whole segmentation design.

The browser streams 48kHz mono PCM. The server runs webrtcvad on a 16kHz copy of that stream (the library accepts 8, 16, 32, or 48kHz input; this project uses 16kHz), so every incoming frame gets decimated in lockstep. But here's the thing: a quick decimation that is good enough for speech/non-speech decisions is not the audio you want Whisper to transcribe. So the system maintains two separate buffers from the same stream:

A 16kHz buffer evaluated by webrtcvad on a 300ms sliding window at aggressiveness 3
A raw 48kHz buffer that accumulates the actual speech frames for Whisper
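
To make the two-buffer idea concrete, here is a minimal sketch, assuming 16-bit mono PCM frames and a naive keep-every-third-sample decimation. The names `decimate_pcm16` and `DualBuffer` are hypothetical and not taken from the repo, and the real server may well filter before downsampling.

```python
import array

SOURCE_RATE = 48_000
VAD_RATE = 16_000
DECIMATION = SOURCE_RATE // VAD_RATE  # 3


def decimate_pcm16(frame_48k: bytes, factor: int = DECIMATION) -> bytes:
    """Keep every `factor`-th 16-bit sample (naive, no anti-alias filter)."""
    samples = array.array("h")
    samples.frombytes(frame_48k)
    return samples[::factor].tobytes()


class DualBuffer:
    """Holds a 16kHz copy for the VAD and the untouched 48kHz audio for STT."""

    def __init__(self) -> None:
        self.vad_16k = bytearray()
        self.raw_48k = bytearray()

    def push(self, frame_48k: bytes) -> None:
        # Both buffers advance together, so they always cover the same time span.
        self.raw_48k.extend(frame_48k)
        self.vad_16k.extend(decimate_pcm16(frame_48k))
```

Because both buffers are fed from the same `push` call, a speech segment found in the 16kHz copy maps directly onto the matching span of 48kHz audio.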

When the VAD detects 2 consecutive seconds of silence, the utterance is considered complete. The raw 48kHz buffer gets wrapped with a WAV header, encoded to base64, and sent to the FastMCP server for transcription.
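
The end-of-utterance rule and the packaging step could look roughly like this. It is a sketch using only the standard library: the 30ms frame duration and the names `UtteranceTracker` and `pcm_to_wav_base64` are assumptions, and the per-frame speech decisions are expected to come from the VAD.

```python
import base64
import io
import wave

FRAME_MS = 30              # assumed VAD frame duration
SILENCE_LIMIT_MS = 2000    # 2 seconds of silence closes the utterance


class UtteranceTracker:
    """Counts trailing silence and signals when an utterance is complete."""

    def __init__(self) -> None:
        self.in_speech = False
        self.silence_ms = 0

    def update(self, is_speech: bool) -> bool:
        """Feed one VAD decision; return True when the utterance just ended."""
        if is_speech:
            self.in_speech = True
            self.silence_ms = 0
            return False
        if not self.in_speech:
            return False  # silence before any speech is ignored
        self.silence_ms += FRAME_MS
        if self.silence_ms >= SILENCE_LIMIT_MS:
            self.in_speech = False
            self.silence_ms = 0
            return True
        return False


def pcm_to_wav_base64(pcm: bytes, sample_rate: int = 48_000) -> str:
    """Wrap raw 16-bit mono PCM in a WAV container and base64-encode it."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return base64.b64encode(buf.getvalue()).decode("ascii")
```

When `update` returns True, the accumulated 48kHz bytes would be passed to `pcm_to_wav_base64` and the resulting string sent to the transcription service.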

Whisper is preloaded as a module-level singleton at server boot via a LoadModelService — so there's no cold-start penalty on the first utterance.
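
A preload-once service along these lines is one way to get that behavior. This is a sketch, not the repo's actual LoadModelService: the `loader` argument and `_load_stub` are stand-ins so the example runs without Whisper installed.

```python
import threading
from typing import Any, Callable, Optional


class LoadModelService:
    """Loads a model once and hands the same instance to every caller."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._model: Optional[Any] = None
        self._lock = threading.Lock()

    def get(self) -> Any:
        # Double-checked locking: only the first caller pays the load cost.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model


def _load_stub() -> dict:
    # Stand-in for the real loader, e.g. whisper.load_model("small").
    return {"name": "small"}


# Module-level singleton: importing this module triggers the load at boot.
MODEL_SERVICE = LoadModelService(_load_stub)
MODEL_SERVICE.get()
```

Request handlers would then call `MODEL_SERVICE.get()` and always receive the already-loaded instance.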

The Auth Problem

Here's where it gets interesting.

The WebRTC server and the FastMCP server are two separate processes communicating over localhost HTTP. In production you'd put a reverse proxy in front, use mTLS, or drop a secret into a secrets manager. But this is a local developer workspace — no infrastructure, no ops, no database.

The naive solution is a static API key in .env. The problem: static keys sit in config files, get committed to repos, and never rotate. Even locally, it's a bad habit to build into an open-source blueprint.

I needed something that:

Required zero database
Left zero static credentials in the source files
Was stateless — no session syncing between processes
Was time-limited — a captured key shouldn't be reusable

The Solution: Deterministic Timestamp-Based Auth

Both processes independently run the same algorithm:

import datetime
from zoneinfo import ZoneInfo


class api_key_generator:
    def __init__(self, expire_time: int = 5):
        self.expire_time = expire_time  # window length in seconds

    def create_value(self) -> int:
        # Index of the current fixed 5-second window since the Unix epoch
        now_utc = datetime.datetime.now(tz=ZoneInfo("UTC"))
        return int(now_utc.timestamp()) // self.expire_time

    def generate_api_key(self) -> str:
        # generate_suffix / generate_prefix are methods of this class, not shown here
        timestamp = self.create_value()
        suffix = self.generate_suffix(timestamp)
        prefix = self.generate_prefix(timestamp)
        return f"{suffix}_{timestamp}_{prefix}"

The prefix and suffix are derived from deterministic functions over the timestamp — math.sqrt(math.log10(timestamp)) — so both sides can independently compute the expected key for any given 5-second window.

How validation works:

1. The WebRTC server generates a key, attaches it as an X-API-Key header, and sends the audio payload
2. The FastMCP middleware intercepts the request, extracts the header, and parses the embedded timestamp
3. The middleware independently generates the expected key for that timestamp window
4. If the strings match and the timestamp is within one grace interval (5 seconds), the request is authenticated
5. If the key is replayed outside the window, it is rejected
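
The generate-and-validate round trip can be sketched end to end as below. The prefix and suffix formulas are hypothetical stand-ins, since the article only says they are built from math.sqrt(math.log10(timestamp)). The names `make_key` and `is_valid` and the `grace_windows` parameter are assumptions too.

```python
import hmac
import math
import time
from typing import Optional

WINDOW_SECONDS = 5


def current_window(now: Optional[float] = None) -> int:
    """Index of the fixed 5-second window that contains `now`."""
    now = time.time() if now is None else now
    return int(now) // WINDOW_SECONDS


def make_key(window: int) -> str:
    # Hypothetical derivation: split the fractional digits of
    # sqrt(log10(window)) into a prefix and a suffix.
    digits = f"{math.sqrt(math.log10(window)):.15f}".split(".")[1]
    prefix, suffix = digits[:7], digits[7:]
    return f"{suffix}_{window}_{prefix}"


def is_valid(key: str, now: Optional[float] = None, grace_windows: int = 1) -> bool:
    if not key.isascii():
        return False
    parts = key.split("_")
    if len(parts) != 3 or not parts[1].isdigit():
        return False
    window = int(parts[1])
    if window < 1:
        return False  # log10 is undefined for 0
    # Reject keys whose window is too far from the receiver's clock.
    if abs(current_window(now) - window) > grace_windows:
        return False
    # Recompute the expected key and compare in constant time.
    return hmac.compare_digest(key, make_key(window))
```

On the sending side the header would be built with `make_key(current_window())`, and the middleware would call `is_valid` on the received value. The range check runs before the key is recomputed, so stale or malformed input is rejected cheaply.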

What this gives you:

No database, no credential storage
Keys expire automatically every 5 seconds
A captured key is useless after the window closes
Both processes stay fully stateless

The MCP Layer

The FastMCP server is the heavy-lifting microservice. Beyond Whisper STT it exposes:

Mail tools — SMTP send/reply with thread headers (In-Reply-To, References)
Calendar tools — Google Calendar API with .ics fallback if OAuth isn't configured
Search tools — DuckDuckGo with a token-bucket rate limiter (1.0 req/sec)
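
For readers unfamiliar with token buckets, a minimal version at 1.0 req/sec might look like this. The class and parameter names are assumptions, and the injectable `clock` exists only to make the limiter easy to test.

```python
import time
from typing import Callable


class TokenBucket:
    """Allows `rate` requests per second with bursts up to `capacity`."""

    def __init__(
        self,
        rate: float = 1.0,
        capacity: float = 1.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = capacity
        self._last = clock()

    def try_acquire(self) -> bool:
        """Take one token if available; never blocks."""
        now = self._clock()
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= 1.0:
            self._tokens -= 1.0
            return True
        return False
```

A search tool would call `try_acquire()` before each outbound request and return a rate-limit error (or wait) when it comes back False.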

Every tool response goes through a unified response parser: ok(data, message), err(message, code), paginated(items, total) — consistent shape across the whole layer, which makes testing clean.
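
One possible shape for those three helpers is sketched below. The function names come from the article, but the dictionary keys (`success`, `data`, `message`, `code`) and the default error code are assumptions.

```python
from typing import Any, Dict, List


def ok(data: Any = None, message: str = "") -> Dict[str, Any]:
    """Successful result with an optional payload."""
    return {"success": True, "data": data, "message": message, "code": None}


def err(message: str, code: str = "error") -> Dict[str, Any]:
    """Failed result; `code` is a short machine-readable identifier."""
    return {"success": False, "data": None, "message": message, "code": code}


def paginated(items: List[Any], total: int) -> Dict[str, Any]:
    """Successful result carrying one page of items plus the overall total."""
    return ok({"items": items, "total": total, "count": len(items)})
```

Since every helper returns the same top-level keys, a test can assert on `result["success"]` without caring which tool produced the response.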

Testing

Both the VAD server and the MCP tool layer have full pytest suites:

# Test the FastMCP tools
cd mcp && pytest tests/ -v

# Test the WebRTC VAD server
cd server && pytest tests/ -v

The MCP tests cover transcription accuracy, SMTP reply threading, calendar parsing, and rate limiter behavior.

Running It Locally

Requirements: Python 3.10+, Node.js 18+, ffmpeg, and a 4GB GPU (or CPU, slower).

# 1. Install Python deps
pip install -r requirements.txt

# 2. Start FastMCP server
cd mcp && python main.py        # localhost:8005

# 3. Start WebRTC backend
cd server && python main.py     # localhost:8080

# 4. Start React client
cd client && npm install && npm run dev   # localhost:5173

Speak into the mic, pause for 2 seconds, and the transcript appears in the dashboard.

Repo

github.com/zkzkGamal/AI-RTC-Agent

MIT licensed. Issues, PRs, and questions on the auth protocol or VAD pipeline are all welcome. I'm especially curious whether anyone has seen a cleaner approach to the zero-database local auth problem.