I open-sourced llm0-gateway recently — a Go binary that puts one OpenAI-compatible endpoint in front of OpenAI, Anthropic, Gemini, and local Ollama. MIT licensed. Single binary plus Postgres + Redis.
The technically interesting bits are how it stays fast: 3 ms p50 cache-hit latency, ~1,672 req/s sustained throughput, and 1–2 Redis round trips on the hot path, all measured on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet.
This post walks through the architecture decisions that got those numbers. Expect Lua scripts, a pgvector query, and an honest discussion of where I overstated things and got corrected by a Redis engineer.
The naive approach (and why it's slow)
A typical LLM gateway request needs to do six things:
- Authenticate the API key
- Check the per-API-key rate limit
- Check the per-project spend cap
- Look up exact-match cache
- (Maybe) check semantic cache
- (Maybe) forward to the upstream model
The naive approach is six serial Redis GETs. At ~1 ms each in Docker on a typical cloud VM, that's 5–10 ms gone before the request leaves the gateway: half of a p50 latency budget spent on bookkeeping.
There's also a correctness problem. Rate-limit and spend-cap checks have a TOCTOU race:
read counter → check threshold → write counter
Two simultaneous requests can both pass the check and both increment, exceeding the cap.
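The interleaving can be replayed deterministically in plain Go. This is a minimal sketch, not gateway code: the `counter` type stands in for a Redis key, and `admitNaive` / `admitAtomic` are hypothetical names. It shows two requests arriving when one slot is left.

```go
package main

import "fmt"

// counter stands in for a Redis key holding the number of requests used.
type counter struct{ used int }

// admitNaive replays the racy interleaving: both requests read the counter
// before either one increments it.
func admitNaive(start, limit int) (admitted, final int) {
	c := counter{used: start}
	readA := c.used // request A: read
	readB := c.used // request B: read (A has not written yet)
	if readA < limit { // request A: check, then increment
		c.used++
		admitted++
	}
	if readB < limit { // request B: check against a stale value
		c.used++
		admitted++
	}
	return admitted, c.used
}

// admitAtomic runs check-and-increment as one indivisible step per request,
// which is what a Lua script provides inside Redis.
func admitAtomic(start, limit int) (admitted, final int) {
	c := counter{used: start}
	for i := 0; i < 2; i++ {
		if c.used < limit {
			c.used++
			admitted++
		}
	}
	return admitted, c.used
}

func main() {
	fmt.Println(admitNaive(9, 10))  // 2 11: both admitted, cap exceeded
	fmt.Println(admitAtomic(9, 10)) // 1 10: second request denied
}
```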
The fix to both — speed and correctness — is to move the hot work into Redis itself, as atomic Lua scripts.
Token-bucket rate limiting in Lua
Token bucket: each API key has a bucket with a capacity (say, 60 tokens) that refills at a fixed rate (say, 1 per second). Each request takes 1 token. If the bucket is empty, the request is denied.
Here's the Lua script:
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3])
local requested = tonumber(ARGV[4]) or 1
local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])
if tokens then
tokens = math.min(capacity, tokens + ((now - last_refill) * refill_rate * 0.001))
else
tokens = capacity
end
local allowed = 0
if tokens >= requested then
tokens = tokens - requested
allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now) -- HMSET is deprecated; HSET takes multiple fields since Redis 4.0
redis.call('EXPIRE', KEYS[1], 900)
return {allowed, math.floor(tokens), 0}
What this gets you:
- Atomic. The entire script runs as one Redis command. No interleaving with other commands on the same instance.
- One round trip. Read state, compute new state, write state — one network call.
- No TOCTOU. Check and decrement happen in the same operation.
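The refill arithmetic is easier to sanity-check outside Redis. Below is a rough Go port of the same math, assuming the script's units (timestamps in milliseconds, refill rate in tokens per second). The `bucket` type and `take` function are illustrative names, not part of the gateway.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors the two hash fields the Lua script stores.
type bucket struct {
	tokens       float64
	lastRefillMs int64
	exists       bool // false models a missing Redis key
}

// take applies the same refill-then-spend arithmetic as the script.
// refillRate is tokens per second; nowMs is a Unix timestamp in milliseconds.
func take(b bucket, nowMs int64, capacity, refillRate, requested float64) (bucket, bool) {
	tokens := capacity // a missing key starts as a full bucket
	if b.exists {
		elapsedSec := float64(nowMs-b.lastRefillMs) * 0.001
		tokens = math.Min(capacity, b.tokens+elapsedSec*refillRate)
	}
	allowed := tokens >= requested
	if allowed {
		tokens -= requested
	}
	return bucket{tokens: tokens, lastRefillMs: nowMs, exists: true}, allowed
}

func main() {
	var b bucket
	var ok bool
	// Four requests at t=0 against a 3-token bucket: the fourth is denied.
	for i := 0; i < 4; i++ {
		b, ok = take(b, 0, 3, 1, 1)
		fmt.Println(i, ok)
	}
	// 2.5 seconds later about 2.5 tokens have refilled, so this one passes.
	b, ok = take(b, 2500, 3, 1, 1)
	fmt.Println(ok, b.tokens)
}
```

Note that the bucket refills continuously rather than in whole-second ticks, which is why the script stores a fractional token count and only floors it in the return value.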
Calling it from Go (using go-redis):
const rateLimitScript = `[lua source above]`
result, err := rdb.Eval(ctx, rateLimitScript, []string{key},
time.Now().UnixMilli(), capacity, refillRate, 1).Result()
But Eval sends the full script source on every call, and Redis has to hash it to find the cached compiled copy. Pre-load the script once and switch to EvalSha, which sends only the 40-character SHA1 digest:
sha, err := rdb.ScriptLoad(ctx, rateLimitScript).Result()
// ... store sha somewhere accessible ...
result, err := rdb.EvalSha(ctx, sha, []string{key},
time.Now().UnixMilli(), capacity, refillRate, 1).Result()
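If the script is missing from Redis's script cache (rare, but it happens after a SCRIPT FLUSH or a server restart), EVALSHA fails with a NOSCRIPT error and the call has to be retried with Eval. In go-redis, wrapping the source with redis.NewScript and calling the script's Run() method handles this automatically: it tries EVALSHA first and falls back to EVAL on NOSCRIPT, so you get EVALSHA's speed with EVAL's reliability.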
Atomic spend-cap enforcement
Same pattern for the per-project spend cap. Get current spend, check threshold, increment if under cap — atomically.
local cost_usd = tonumber(ARGV[1])
local monthly_cap = tonumber(ARGV[2])
local current_spend = tonumber(redis.call('GET', KEYS[1])) or 0
if current_spend + cost_usd > monthly_cap then
return {0, current_spend, monthly_cap} -- blocked
end
local new_spend = redis.call('INCRBYFLOAT', KEYS[1], cost_usd)
redis.call('EXPIRE', KEYS[1], 2678400) -- 31 days
return {1, tonumber(new_spend), monthly_cap} -- allowed
Returns {allowed, current_spend, cap} so the gateway can include current_spend in response headers for client-side observability.
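The 31-day expiry is deliberate. Keys are named spend:project:{id}:{YYYY-MM}, so a new month starts on a fresh counter automatically, and the expiry cleans up the previous month's key within 31 days of its last write. Monthly counters reset cleanly without needing a scheduled job to clear them.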
The same pattern extends to per-end-user spend caps: pass X-Customer-ID on a request, and the gateway enforces daily and monthly USD limits per downstream user. Two overflow behaviors — block returns 429 with a retry_after, downgrade automatically routes to a cheaper model (e.g. gpt-4o → gpt-4o-mini). Same atomic Lua pattern, different key namespace.
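As a rough illustration of the two overflow behaviors, here is a sketch of the decision in Go. Only the gpt-4o to gpt-4o-mini pair and the 429 status come from the description above. The `route` function, the `cheaper` table, and the fallback to 429 when no cheaper model is known are all assumptions.

```go
package main

import "fmt"

// cheaper maps a model to its downgrade target. Only the gpt-4o pair comes
// from the post; the table is a hypothetical stand-in for gateway config.
var cheaper = map[string]string{"gpt-4o": "gpt-4o-mini"}

// route decides what to do with a request given the customer's spend so far.
// It returns the HTTP status and the model to call ("" when blocked).
func route(spent, cost, limit float64, behavior, model string) (int, string) {
	if spent+cost <= limit {
		return 200, model // under the cap: pass through unchanged
	}
	if behavior == "downgrade" {
		if alt, ok := cheaper[model]; ok {
			return 200, alt
		}
	}
	return 429, "" // "block", or nothing cheaper to fall back to (assumed)
}

func main() {
	fmt.Println(route(5, 1, 10, "block", "gpt-4o"))       // under cap, unchanged
	fmt.Println(route(9.5, 1, 10, "block", "gpt-4o"))     // over cap, blocked
	fmt.Println(route(9.5, 1, 10, "downgrade", "gpt-4o")) // over cap, downgraded
}
```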
Two-tier exact cache: Redis hot, Postgres warm
LLM responses are deterministic enough that exact-match caching pays off — same prompt + model + parameters = same answer. Why pay the upstream provider twice?
Cache key: SHA256(project_id + provider + model + sorted_messages_json).
Cache value: the full JSON response from the model.
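One way to compute such a key in Go is sketched below. The `message` struct and the NUL separator between fields are assumptions added here (the separator keeps adjacent fields from running together). "Sorted" is read as "a stable JSON encoding", which `json.Marshal` gives for structs because fields are emitted in declaration order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// message is a simplified chat message; the real request type is assumed
// to carry more fields.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// cacheKey hashes project, provider, model, and the encoded messages.
func cacheKey(projectID, provider, model string, msgs []message) (string, error) {
	body, err := json.Marshal(msgs) // struct fields marshal in a fixed order
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, part := range [][]byte{[]byte(projectID), []byte(provider), []byte(model), body} {
		h.Write(part)
		h.Write([]byte{0}) // separator so ("ab", "c") and ("a", "bc") differ
	}
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	msgs := []message{{Role: "user", Content: "what is the capital of france?"}}
	key, err := cacheKey("proj-1", "openai", "gpt-4o-mini", msgs)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(key), key)
}
```

Any request parameter that changes the answer (temperature, max tokens, tools) would also need to go into the hash, otherwise two different requests would share a cache entry.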
Storage in two tiers:
| Tier | Storage | TTL | Latency |
|---|---|---|---|
| Hot | Redis | configurable (default 1 hour) | <1 ms |
| Warm | Postgres | same TTL, survives Redis eviction/restart | ~5 ms |
On read:
func (c *Cache) Get(ctx context.Context, key string) ([]byte, bool, error) {
// Try Redis first
cached, err := c.redis.Get(ctx, key).Bytes()
if err == nil {
return cached, true, nil
}
if err != redis.Nil {
return nil, false, err
}
// Fall back to Postgres
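var response []byte
var expiresAt time.Time
err = c.db.QueryRowContext(ctx,
"SELECT cached_response, expires_at FROM exact_cache WHERE cache_key = $1 AND expires_at > NOW()",
key).Scan(&response, &expiresAt)
if err == sql.ErrNoRows {
return nil, false, nil
}
if err != nil {
return nil, false, err
}
// Promote to Redis on a Postgres hit, keeping the row's remaining lifetime
// instead of restarting a full hour. Promotion is best effort.
if ttl := time.Until(expiresAt); ttl > 0 {
_ = c.redis.Set(ctx, key, response, ttl).Err()
}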
return response, true, nil
}
On write: Redis SET synchronous, Postgres INSERT async (off the hot path) so writes don't slow down the response.
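A minimal sketch of that write path, using in-memory maps in place of Redis and Postgres so it runs on its own. The `writer` type, the buffered queue, and the choice to drop the warm write when the queue is full are assumptions for illustration, not a description of the gateway's internals.

```go
package main

import (
	"fmt"
	"sync"
)

type entry struct {
	key   string
	value []byte
}

// writer holds in-memory stand-ins for both tiers plus a queue feeding the
// warm tier from a background goroutine.
type writer struct {
	mu    sync.Mutex
	hot   map[string][]byte // stands in for Redis
	warm  map[string][]byte // stands in for Postgres
	queue chan entry
	done  chan struct{}
}

func newWriter(buf int) *writer {
	w := &writer{
		hot:   map[string][]byte{},
		warm:  map[string][]byte{},
		queue: make(chan entry, buf),
		done:  make(chan struct{}),
	}
	go func() {
		defer close(w.done)
		for e := range w.queue {
			w.mu.Lock()
			w.warm[e.key] = e.value // real code would INSERT into Postgres here
			w.mu.Unlock()
		}
	}()
	return w
}

// Put writes the hot tier inline and hands the warm write to the worker.
// If the queue is full the warm write is dropped rather than blocking.
func (w *writer) Put(key string, value []byte) (queued bool) {
	w.mu.Lock()
	w.hot[key] = value
	w.mu.Unlock()
	select {
	case w.queue <- entry{key, value}:
		return true
	default:
		return false
	}
}

// Close drains the queue and waits for the worker to finish.
func (w *writer) Close() {
	close(w.queue)
	<-w.done
}

func main() {
	w := newWriter(16)
	fmt.Println(w.Put("k", []byte("v"))) // true: hot tier written, warm write queued
	w.Close()
	fmt.Println(string(w.warm["k"])) // v
}
```

The trade-off is that a crash between the two writes loses the warm copy of that entry, which is acceptable for a cache because the next miss simply repopulates it.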
This gives you Redis-fast hits for recent prompts and Postgres-durable hits for older ones. It survives restarts too: after Redis comes back empty, the hot tier rebuilds naturally from production traffic as users repeat queries and Postgres hits get promoted.
Semantic cache: pgvector catches paraphrases
Exact-match caching catches "what is the capital of france?" twice. It misses "tell me france's capital city" — same question, different hash.
For that, you need vectors.
The gateway runs all-MiniLM-L6-v2 in a sidecar container — 22M-param model, 384 dimensions, CPU-only, ~20–40 ms per embedding. Storage is pgvector — the Postgres extension. No separate vector database to operate.
Schema:
CREATE EXTENSION vector;
CREATE TABLE semantic_cache (
id UUID PRIMARY KEY,
project_id UUID NOT NULL,
provider TEXT NOT NULL,
model TEXT NOT NULL,
embedding vector(384),
cached_response JSONB,
expires_at TIMESTAMPTZ
);
CREATE INDEX semantic_cache_hnsw_idx
ON semantic_cache
USING hnsw (embedding vector_cosine_ops);
Lookup query (configurable similarity threshold per project, default 0.95):
SELECT cached_response, 1 - (embedding <=> $1) AS similarity
FROM semantic_cache
WHERE project_id = $2
AND provider = $3
AND model = $4
AND expires_at > NOW()
AND 1 - (embedding <=> $1) > $5
ORDER BY embedding <=> $1
LIMIT 1;
<=> is pgvector's cosine distance operator. 1 - (embedding <=> $1) converts that to similarity for the threshold comparison.
In Go (using pgvector-go):
embedding, err := embeddingClient.GenerateEmbedding(ctx, prompt)
if err != nil {
return nil, false, err
}
pgvec := pgvector.NewVector(embedding)
var cached json.RawMessage
var sim float32
err = db.QueryRowContext(ctx, query, pgvec, projectID, provider, model, threshold).
Scan(&cached, &sim)
Numbers: paraphrased queries hit at ~0.95 similarity in ~40 ms, $0 cost. The bulk of that 40 ms is the embedding inference — pgvector itself is 5–8 ms.
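The distance-to-similarity conversion in the lookup query is easy to check by hand. Here is a small Go sketch of the same math, assuming equal-length, non-zero vectors; the function names are made up for the example, and the two-dimensional vectors are toy values rather than real embeddings.

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance computes 1 - cos(a, b), the quantity pgvector's <=>
// operator returns. It assumes equal-length, non-zero vectors.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// similarity is the 1 - distance expression used in the SQL threshold check.
func similarity(a, b []float64) float64 {
	return 1 - cosineDistance(a, b)
}

func main() {
	query := []float64{1, 0}
	fmt.Println(similarity(query, []float64{1, 0}))          // identical direction
	fmt.Println(similarity(query, []float64{1, 0.1}) > 0.95) // close: passes a 0.95 threshold
	fmt.Println(similarity(query, []float64{1, 1}) > 0.95)   // about 0.71: rejected
}
```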
Want a different embedder?
The embedding service contract is intentionally small:
POST /embed
request: { "texts": ["..."] }
response: { "embeddings": [[...]], "dimensions": 384 }
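You can swap in OpenAI's text-embedding-3-small with a tiny adapter shim (OpenAI's endpoint is /v1/embeddings with a different request shape, so it's not literally a one-line change, but it's ~30 lines of Go). The model returns 1536 dimensions by default, so either request 384 through its dimensions parameter or widen the vector(384) column, and expect to rebuild the semantic cache since vectors from different models aren't comparable. At $0.02 per million tokens, that's essentially free at any sane traffic level.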
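The translation half of such a shim might look like the sketch below. It covers only the JSON reshaping in both directions; the HTTP call, auth header, and error handling for the upstream request are left out. The type and function names are hypothetical; the OpenAI field names (`model`, `input`, `dimensions`, `data[].index`, `data[].embedding`) follow its embeddings API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Gateway-side contract, as described in the post.
type embedRequest struct {
	Texts []string `json:"texts"`
}
type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
	Dimensions int         `json:"dimensions"`
}

// OpenAI-side shapes for POST /v1/embeddings.
type openAIRequest struct {
	Model      string   `json:"model"`
	Input      []string `json:"input"`
	Dimensions int      `json:"dimensions,omitempty"`
}
type openAIResponse struct {
	Data []struct {
		Index     int       `json:"index"`
		Embedding []float32 `json:"embedding"`
	} `json:"data"`
}

// toOpenAI rewrites a gateway /embed body into an OpenAI request body.
func toOpenAI(body []byte, model string, dims int) ([]byte, error) {
	var in embedRequest
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, err
	}
	return json.Marshal(openAIRequest{Model: model, Input: in.Texts, Dimensions: dims})
}

// fromOpenAI rewrites an OpenAI response body into the gateway's shape,
// placing each embedding at the position given by its index field.
func fromOpenAI(body []byte) ([]byte, error) {
	var in openAIResponse
	if err := json.Unmarshal(body, &in); err != nil {
		return nil, err
	}
	out := embedResponse{Embeddings: make([][]float32, len(in.Data))}
	for _, d := range in.Data {
		if d.Index < 0 || d.Index >= len(in.Data) {
			return nil, fmt.Errorf("unexpected embedding index %d", d.Index)
		}
		out.Embeddings[d.Index] = d.Embedding
	}
	if len(out.Embeddings) > 0 {
		out.Dimensions = len(out.Embeddings[0])
	}
	return json.Marshal(out)
}

func main() {
	req, _ := toOpenAI([]byte(`{"texts":["hello"]}`), "text-embedding-3-small", 384)
	fmt.Println(string(req))
	resp, _ := fromOpenAI([]byte(`{"data":[{"index":0,"embedding":[0.5,0.25]}]}`))
	fmt.Println(string(resp))
}
```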
A note on Redis Search vs pgvector
When I first wrote about this, I claimed Redis vector search would block rate-limit Lua because Redis is single-threaded. A Redis engineer correctly pointed out this is wrong — RediSearch runs indexing and query execution off the main thread (worker thread pool since 2.4+, multi-threaded vector search since 2.6+). It won't block.
The defensible reasons to default to pgvector for the gateway are different ones:
- The latency floor is the embedding (~30 ms), not the vector search (5–8 ms). Even if RediSearch were instant, you'd save 5–8 ms — a 1.2x improvement on the cache-hit path, not a 10x one.
- Operational footprint. The gateway already operates Postgres for metadata/logs. Adding a Redis Stack instance just for vectors means a separate datastore, separate licensing considerations (post-2024 SSPL/RSAL), separate config.
- Migration is decoupled. The POST /embed contract abstracts the embedder, and the vector-store layer is similarly abstracted. Switching to RediSearch later if pgvector becomes a bottleneck is a swap, not a rewrite.
I'm benchmarking this for a follow-up post.
EVALSHA pre-loading
There's one more optimization worth calling out. Redis caches compiled Lua scripts by SHA1 hash. Calling via EVALSHA(hash) is cheaper than EVAL(full_script) because the client sends a 40-character digest instead of the whole script body on every call.
On gateway startup, I pre-load all three scripts and store their SHAs:
type RedisClient struct {
rdb *redis.Client
rateLimitSHA string
spendCheckSHA string
customerSpendSHA string
}
func (c *RedisClient) loadScripts(ctx context.Context) error {
sha, err := c.rdb.ScriptLoad(ctx, rateLimitLuaScript).Result()
if err != nil {
return err
}
c.rateLimitSHA = sha
// ... repeat for the other two ...
return nil
}
If a script goes missing from the cache (which can happen on SCRIPT FLUSH or restart), EVALSHA fails with a NOSCRIPT error and the call is retried with EVAL, which re-loads it. go-redis automates this fallback through the Run() method on its redis.Script type.
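The fallback itself is small enough to simulate without a Redis server. The sketch below uses a hypothetical `fakeRedis` that models only the script cache (keyed by SHA1, as in Redis), and a `run` helper that mirrors the try-EVALSHA-then-EVAL logic. None of these names are go-redis APIs.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
)

var errNoScript = errors.New("NOSCRIPT No matching script")

// fakeRedis models only the script cache and counts calls of each kind.
type fakeRedis struct {
	cache           map[string]string
	evals, evalShas int
}

// digest returns the hex SHA1 that identifies a script.
func digest(src string) string {
	sum := sha1.Sum([]byte(src))
	return hex.EncodeToString(sum[:])
}

// Eval runs a script from source and caches it as a side effect.
func (r *fakeRedis) Eval(src string) (string, error) {
	r.evals++
	r.cache[digest(src)] = src
	return "ok", nil
}

// EvalSha runs a script by digest and fails if it is not cached.
func (r *fakeRedis) EvalSha(sha string) (string, error) {
	r.evalShas++
	if _, ok := r.cache[sha]; !ok {
		return "", errNoScript
	}
	return "ok", nil
}

// ScriptFlush empties the cache, as SCRIPT FLUSH or a restart would.
func (r *fakeRedis) ScriptFlush() { r.cache = map[string]string{} }

// run tries the cheap call first and falls back to EVAL only on NOSCRIPT.
func run(r *fakeRedis, src string) (string, error) {
	res, err := r.EvalSha(digest(src))
	if errors.Is(err, errNoScript) {
		return r.Eval(src)
	}
	return res, err
}

func main() {
	r := &fakeRedis{cache: map[string]string{}}
	run(r, "return 1") // cold cache: EVALSHA misses, EVAL loads the script
	run(r, "return 1") // warm cache: EVALSHA only
	r.ScriptFlush()
	run(r, "return 1") // flushed: falls back to EVAL once more
	fmt.Println(r.evalShas, r.evals) // 3 2
}
```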
Putting it together: 1–2 round trips on the hot path
A fully-configured cache-hit request runs through:
1. Auth lookup: Redis GET (cached for the API-key TTL), 1 round trip.
2. Rate-limit + spend-cap: one or two Lua scripts depending on which checks are configured, 1–2 round trips.
3. Exact cache GET: Redis, 1 round trip.
4. Response write: stream out.
On the fast-fail path (rate-limited or over the spend cap), we short-circuit at step 2 with a 429 — no exact cache lookup, no upstream call, no response generation. That's the property that keeps a single gateway instance stable during abuse bursts: a runaway client or credential leak can't meaningfully consume gateway CPU because each DENY takes ~2 ms of work and 0 provider cost.
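The short-circuit is the important part, and it can be sketched with the stores replaced by plain functions. Everything here is hypothetical scaffolding: the `checks` struct, the `handle` function, the 401 for a failed auth lookup, and the status 0 used to mean "forwarded upstream" are assumptions made for the example.

```go
package main

import "fmt"

// checks bundles the three lookups as plain functions so the control flow
// can be exercised without Redis.
type checks struct {
	auth   func(apiKey string) bool
	limits func(apiKey string) bool // rate limit + spend cap scripts
	cache  func(prompt string) ([]byte, bool)
}

// handle returns the status, the body, and the list of steps that ran.
// Status 0 means the gateway did not answer itself and would call upstream.
func handle(c checks, apiKey, prompt string) (int, []byte, []string) {
	steps := []string{"auth"}
	if !c.auth(apiKey) {
		return 401, nil, steps
	}
	steps = append(steps, "limits")
	if !c.limits(apiKey) {
		return 429, nil, steps // fast-fail: no cache lookup, no upstream call
	}
	steps = append(steps, "cache")
	if body, hit := c.cache(prompt); hit {
		return 200, body, steps
	}
	steps = append(steps, "upstream")
	return 0, nil, steps
}

func main() {
	allow := func(string) bool { return true }
	deny := func(string) bool { return false }
	hit := func(string) ([]byte, bool) { return []byte("cached"), true }

	status, _, steps := handle(checks{auth: allow, limits: deny, cache: hit}, "key", "hi")
	fmt.Println(status, steps) // 429 [auth limits]

	status, body, steps := handle(checks{auth: allow, limits: allow, cache: hit}, "key", "hi")
	fmt.Println(status, string(body), steps) // 200 cached [auth limits cache]
}
```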
Numbers
Measured on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet, server-side from gateway_logs.latency_ms (excludes client RTT):
| Response | p50 | p95 | p99 |
|---|---|---|---|
| 200 cache hit | 3 ms | 12 ms | 23 ms |
| 429 rate-limit | ~2 ms | ~6 ms | — |
Sustained throughput: ~1,672 req/s on a single instance.
Methodology: hey at concurrency 20, 200 requests, pre-warmed cache, then SQL query against the gateway's own log table for the actual server-side distribution. Bench script in the repo at bench/load_test.sh — reproducible in ~5 minutes with docker compose up -d postgres redis + go run ./cmd/gateway + the bench script.
The reason I quote server-side numbers and not hey numbers is that hey includes network RTT, OS scheduler noise, and TCP overhead that doesn't reflect how the gateway itself performs. The README has the full breakdown including 2 vCPU and MacBook M4 comparisons — short version: Docker Desktop on macOS is slower than a 4 vCPU Linux droplet because every Redis round trip pays a 1–2 ms VM-network tax. Production numbers match the droplet rows, not the laptop row.
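For readers who want to recompute the distribution from exported latency values, here is a nearest-rank percentile in Go. It is a sketch of one common definition, not the SQL used against gateway_logs, and the sample data is synthetic.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the nearest-rank percentile (0 < p <= 100) of samples.
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return math.NaN()
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	rank := int(math.Ceil(p * float64(len(sorted)) / 100))
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Synthetic latencies 1..200 ms, one per request.
	samples := make([]float64, 0, 200)
	for i := 1; i <= 200; i++ {
		samples = append(samples, float64(i))
	}
	fmt.Println(percentile(samples, 50), percentile(samples, 95), percentile(samples, 99)) // 100 190 198
}
```

With 200 samples the p99 is the 198th sorted value, so only two requests sit above it. That is why a single slow request can move the tail noticeably between runs.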
Caveats worth reading
- Sample size is small (~80 samples per run for p99). Enough to be directionally right, not tight enough to publish ±0.5 ms. Quote a range, not a single point.
- p99 is GC- and connection-warmup-bound, not CPU-bound. Throwing more hardware at it won't reliably push p99 below ~15 ms without GC tuning (GOGC=200+) and pool pre-warming.
- These are cache-hit numbers. Cache misses are dominated by upstream provider latency (gpt-4o-mini ≈ 300–800 ms to OpenAI). That's not gateway overhead.
What's shipping today (v0.1.x)
- OpenAI-compatible /v1/chat/completions and /v1/models.
- SSE streaming across all four providers (OpenAI, Anthropic, Gemini, Ollama). Chunks normalized to a single OpenAI-compatible shape regardless of upstream, with a trailing metadata frame carrying rounded cost_usd, usage, latency_ms, and provider before [DONE]. Ollama's empty role-only chunks are filtered by default.
- Automatic cross-provider failover on 429/5xx/timeout/connection errors. Four modes (cloud_first, local_first, local_only, cloud_only); the same env var flips between them.
- Per-API-key token-bucket rate limits (atomic Redis Lua).
- Per-project hard monthly spend caps with pre-request cost estimation, which blocks runaway prompts before the LLM call.
- Per-customer spend caps with block or downgrade behavior on overflow.
- Local Ollama support: any pulled model auto-routable, $0 cost, full tier substitution for cloud failover.
- Exact + semantic caching with per-project toggles and thresholds.
- Customer labels (X-LLM0-*) stored as JSONB on every log row for downstream analytics.
What's not done yet
- Only four providers. Bedrock, Azure OpenAI, Groq, Together, DeepSeek, xAI are on the roadmap.
- No Prometheus /metrics endpoint yet. Observability today is gateway_logs in Postgres + response headers.
- No structured logging yet; log/slog is planned.
- No hosted/managed version yet. Self-hosted works today; a managed product gets built only if there's enough real demand (waitlist is for validation, not gating).
- Handler-level tests incomplete. Benchmarks cover the hot path; mock-based handler tests are in progress.
Why I built this
I wanted something fast, where I owned the provider keys, with spend limits per user or per AI agent, automatic failover across providers, and real cost savings from caching, both exact-match and semantic.
There are plenty of LLM gateways out there — OpenRouter, Helicone, Portkey, LiteLLM, Bifrost. They're good. But each one I evaluated was missing at least one of those, and the self-hosted BYOK angle in particular wasn't well covered: your own provider keys, your data never touches a third-party cloud, no markup, no shared rate limits with other tenants.
The performance work above is what makes the self-hosted story actually viable. If your gateway adds 50 ms of overhead, you're not really shipping a "lightweight" anything.
Code & links
MIT licensed. Runs from docker compose up.
If you spot something wrong, open an issue or comment below.
