DEV Community

Bettaher


How I built a production-hardened LLM API with HMAC-signed outputs and 30-pattern injection detection

I've been building on top of LLMs for a while, and one thing bothered me: nobody signs their outputs.
You call an AI API, get back text, and you trust it. But what if something in the chain mutated that text? A caching layer, a CDN, a reverse proxy doing something unexpected? You'd never know.
So I built OMEGA ARCHITECT — a FastAPI-based AI API that signs every response with HMAC-SHA256 and runs every input through 30 injection detection patterns before it ever reaches the model. Here's what I learned.

Why HMAC on LLM outputs?
Most APIs sign requests (inbound). HMAC on responses (outbound) is rare.
The threat model: your LLM returns deterministic, structured text. If a middleware layer, a cache, or an active network attacker modifies that response, your user gets different content than what your server generated. With no signature, they can't detect this.
The fix is simple: after generating your final response, compute HMAC-SHA256(signing_key, response_body) and include it in the response. The client recomputes and compares. Tampering becomes detectable.
One non-obvious detail: sign after truncation, not before. If you truncate long outputs before returning them, compute the HMAC on the truncated bytes. Otherwise you'll have verification failures on responses that hit your length limit.
```python
import hmac
import hashlib

def sign_output(content: str, key: str) -> str:
    return hmac.new(
        key.encode(),
        content.encode(),
        hashlib.sha256
    ).hexdigest()
```
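Client-side verification is the mirror image: recompute the digest over the exact bytes received and compare in constant time. A minimal sketch (the key-distribution scheme and field names are up to you; this just shows the check):

```python
import hmac
import hashlib

def verify_output(content: str, key: str, signature: str) -> bool:
    # Recompute the HMAC over the exact bytes the client received
    expected = hmac.new(key.encode(), content.encode(), hashlib.sha256).hexdigest()
    # compare_digest avoids leaking the match position via timing
    return hmac.compare_digest(expected, signature)
```

Use `hmac.compare_digest` rather than `==` so the comparison time doesn't depend on how many leading characters match.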

30-Pattern Injection Detection
Before any user input reaches Groq, it passes through four categories of pattern matching:
Prompt injection (15 patterns)
The obvious ones: "ignore previous instructions", "act as", "roleplay as", "forget your system prompt". But also the sneakier ones: Unicode homoglyph substitution.
Someone might write ɪɢɴᴏʀᴇ (IPA Small Caps) instead of IGNORE to bypass keyword filters. These characters look similar to Latin letters but have completely different Unicode codepoints.
Defense: strip zero-width characters and apply NFKC normalization first, then pattern match. NFKC folds many compatibility lookalikes (fullwidth, superscript, ligature forms), but characters with no compatibility mapping, such as small capitals and Cyrillic lookalikes, survive normalization and need explicit character checks.
```python
import unicodedata

def normalize_input(text: str) -> str:
    # Strip zero-width characters used to split keywords
    text = text.replace('\u200b', '').replace('\u200c', '').replace('\u200d', '')
    # Fold compatibility characters to canonical form
    return unicodedata.normalize('NFKC', text)
```
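A quick check of what NFKC buys you, and what it doesn't (the helper is repeated so the snippet runs standalone):

```python
import unicodedata

def normalize_input(text: str) -> str:
    # Same normalization as above: strip zero-width chars, then NFKC-fold
    for zw in ('\u200b', '\u200c', '\u200d'):
        text = text.replace(zw, '')
    return unicodedata.normalize('NFKC', text)

print(normalize_input('ＩＧＮＯＲＥ'))          # fullwidth letters fold to ASCII: IGNORE
print(normalize_input('IG\u200bNORE'))          # zero-width space stripped: IGNORE
print(normalize_input('ignоre') == 'ignore')    # Cyrillic 'о' (U+043E) survives NFKC: False
```

The last line is why normalization alone isn't enough: Cyrillic lookalikes have no compatibility decomposition, so a separate script check is still needed.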
Mixed script detection
Cyrillic characters mixed with Latin is a classic homoglyph attack. The word looks like English but contains е (Cyrillic) instead of e (Latin). Run a script detection check after normalization:
```python
def _detect_mixed_scripts(text: str) -> bool:
    # ASCII letters only (a raw '\u0041'..'\u007a' range would also match [\]^_`)
    has_latin = any('A' <= c <= 'Z' or 'a' <= c <= 'z' for c in text)
    has_cyrillic = any('\u0400' <= c <= '\u04ff' for c in text)
    return has_latin and has_cyrillic
```
SQL injection (7 patterns)
There's no database on the path from user input to LLM call in my setup. But someone sending SELECT * FROM users or DROP TABLE isn't lost — they're probing. I reject it with 422 and log the attempt.
XSS and OS commands (8 patterns)
`<script>` tags, `javascript:` URIs, `rm -rf`, `format c:` — classic fuzzer signatures. If you're seeing these, someone is running automated tooling against your endpoint.

Rate Limiting Behind a Reverse Proxy
SlowAPI (the FastAPI rate-limiting library) uses the client IP as the default key. This breaks badly behind a reverse proxy or load balancer.
The issue: every request appears to come from the proxy IP. One unlucky user exhausts the limit for everyone sharing that proxy.
Fix: extract the real IP from X-Forwarded-For. I take the first entry (leftmost = the original client), with one caveat: the leftmost entry is client-supplied and spoofable, so if you control the proxy chain, trusting only the entries your own proxies append is stricter.

```python
def get_real_client_ip(request: Request) -> str:
    forwarded_for = request.headers.get("X-Forwarded-For", "")
    if forwarded_for:
        return forwarded_for.split(",")[0].strip()
    return request.client.host
```

Timeout Strategy for LLM Backends
My first deployment had a 10-second timeout on all requests. It worked fine until I tested with complex prompts: the Groq API call took 45 seconds and every request timed out.
The solution: separate timeout constants.

```python
TIMEOUT = 10        # Standard endpoints
VALID_TIMEOUT = 90  # LLM-path endpoints
```

Use the standard timeout for health checks, auth validation, everything synchronous. Reserve the extended timeout exclusively for the endpoints that actually call the model.

Client Fingerprinting Without PII
I wanted session-level anomaly detection without storing personally identifiable information.

```python
import hashlib

def get_client_fingerprint(request: Request) -> str:
    ip = get_real_client_ip(request)
    ua = request.headers.get("User-Agent", "")
    accept = request.headers.get("Accept", "")
    raw = f"{ip}|{ua}|{accept}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]
```

16 hex characters. Stable across a session. Not reversible to PII without the original inputs.
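As a concrete check, here is the same idea with the request fields passed as plain strings, so the snippet runs without FastAPI (the field choice mirrors the middleware: IP, User-Agent, Accept):

```python
import hashlib

def client_fingerprint(ip: str, user_agent: str, accept: str) -> str:
    # Hash of (ip, UA, Accept): stable per client, not reversible without the inputs
    raw = f"{ip}|{user_agent}|{accept}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

fp = client_fingerprint("203.0.113.7", "Mozilla/5.0", "application/json")
print(len(fp))  # 16
```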
Good enough to detect the same client hammering different endpoints.

The Server Header Gotcha
I run a HardenedShieldMiddleware that sets security headers on every response. I initially left the Server header empty, thinking that would suppress the default.
It doesn't. Uvicorn writes its own Server: uvicorn header if you don't set one explicitly. The fix: set it explicitly in your middleware.

```python
response.headers["Server"] = "OMEGA"
```

Small thing, but leaking your server software in production is unnecessary information for an attacker.

Deployment Reality
Running on Render's free tier during bootstrapping. The cold-start problem is real: the first request after inactivity takes 30-60 seconds. For a demo endpoint, this is acceptable. For paid users, it's not.
My upgrade trigger: first paying customer → Render Starter ($7/month, no cold starts) + Groq paid tier.
The economic logic: don't spend money proving there's demand. Spend money after demand is proven.

What I'd Do Differently
Structured output enforcement from the start. I added output schema validation late. It should be the first thing you build — before any of the injection detection, before the HMAC. If you can't guarantee output structure, you can't reliably sign it.
Test the scanner against the scanner. I wrote a security scanner (final_audit.py) to test the API. Halfway through, I realized the scanner's own test payloads contained the same injection patterns it was testing for. Isolate your test tooling from your application code completely.
Document the timeout separately. Every time I looked at the codebase, I second-guessed the 90-second timeout.
Now it has a comment explaining exactly why it's 90 and not 10.

The API
The API is live on RapidAPI with a free tier if you want to test it: OMEGA ARCHITECT on RapidAPI
Demo endpoint (no auth, rate limited):

```bash
curl -X POST https://omega-architect-api.onrender.com/demo \
  -H "Content-Type: application/json" \
  -d '{"instruction": "FastAPI + JWT + PostgreSQL. Include Dockerfile."}'
```

The response includes hmac_sha256 — you can verify it with the signing approach described above.

All code in this article is simplified for readability. Full implementation on GitHub.

Discussion on Hacker News: https://news.ycombinator.com/item?id=47565934
