Mushfiq Rahman

Posted on May 27 • Edited on May 28

Building a fast LLM gateway in Go: Lua + pgvector

#llm #go #redis #systemdesign

I open-sourced llm0 recently — a Go binary that puts one OpenAI-compatible endpoint in front of OpenAI, Anthropic, Gemini, and local Ollama. MIT licensed. Single binary plus Postgres + Redis.

The technically interesting bits are how it stays fast: 3 ms p50 cache-hit latency, ~1,672 req/s sustained throughput, 1–2 Redis round trips on the hot path on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet.

This post walks through the architecture decisions that got those numbers. Expect Lua scripts, a pgvector query, and an honest discussion of where I overstated things and got corrected by a Redis engineer.

The naive approach (and why it's slow)

A typical LLM gateway request needs to do six things:

Authenticate the API key
Check the per-API-key rate limit
Check the per-project spend cap
Look up exact-match cache
(Maybe) check semantic cache
(Maybe) forward to the upstream model

The naive approach is six serial Redis GETs. At ~1 ms each in Docker on a typical cloud VM, that's 5–10 ms gone before the request leaves the gateway: roughly half of a p50 latency budget spent on bookkeeping.

There's also a correctness problem. Rate-limit and spend-cap checks have a time-of-check to time-of-use (TOCTOU) race:

read counter → check threshold → write counter

Two simultaneous requests can both pass the check and both increment, exceeding the cap.
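To make the race concrete, here is a minimal sketch that replays one unlucky interleaving deterministically. The counter type and helper names are hypothetical stand-ins for a Redis key and two separate commands, not the gateway's code.

```go
package main

import "fmt"

// counter stands in for a Redis key holding the request count
// (a hypothetical in-memory model, not the gateway's code).
type counter struct{ value int }

// check is the "read counter → check threshold" half of the sequence.
func check(c *counter, limit int) bool { return c.value < limit }

// increment is the "write counter" half, issued as a separate command.
func increment(c *counter) { c.value++ }

// simulateRace replays one interleaving of two requests, A and B:
// both run their check before either one writes.
func simulateRace(limit, start int) int {
	c := &counter{value: start}
	okA := check(c, limit) // A sees start < limit
	okB := check(c, limit) // B sees the same stale value
	if okA {
		increment(c)
	}
	if okB {
		increment(c)
	}
	return c.value
}

func main() {
	// One slot left under a limit of 60, yet both requests are admitted.
	fmt.Println(simulateRace(60, 59)) // prints 61
}
```

With real network latency between the check and the write, the window is wide enough for this to happen under ordinary concurrent load.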

The fix to both — speed and correctness — is to move the hot work into Redis itself, as atomic Lua scripts.

Token-bucket rate limiting in Lua

Token bucket: each API key has a bucket with a capacity (say, 60 tokens) that refills at a fixed rate (say, 1 per second). Each request takes 1 token. If the bucket is empty, the request is denied.

Here's the Lua script:

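-- ARGV: now (ms since epoch), capacity, refill rate (tokens/second),
-- tokens requested (defaults to 1).
local now = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local refill_rate = tonumber(ARGV[3])
local requested = tonumber(ARGV[4]) or 1

local bucket = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(bucket[1])
local last_refill = tonumber(bucket[2])

if tokens and last_refill then
    -- Refill for the elapsed time: now is in ms, refill_rate is per second.
    -- Clamp at 0 in case a caller's clock is behind the stored timestamp.
    local elapsed_ms = math.max(0, now - last_refill)
    tokens = math.min(capacity, tokens + (elapsed_ms * refill_rate * 0.001))
else
    -- First request for this key (or the key expired): start with a full bucket.
    tokens = capacity
end

local allowed = 0
if tokens >= requested then
    tokens = tokens - requested
    allowed = 1
end

-- HSET takes multiple field/value pairs since Redis 4.0; HMSET is deprecated.
redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', now)
redis.call('EXPIRE', KEYS[1], 900)  -- idle buckets expire after 15 minutes

return {allowed, math.floor(tokens), 0}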

What this gets you:

Atomic. The entire script runs as one Redis command. No interleaving with other commands on the same instance.
One round trip. Read state, compute new state, write state — one network call.
No TOCTOU. Check and decrement happen in the same operation.
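The refill arithmetic is easy to get subtly wrong, so it can help to mirror it in a pure Go function and unit-test it without Redis. This is a sketch under that assumption: the bucket struct and the take function are hypothetical names, and the logic follows the Lua script above (milliseconds for time, tokens per second for the refill rate).

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors the two hash fields the Lua script stores.
// exists is false when the key is missing (first request or expired).
type bucket struct {
	tokens     float64
	lastRefill int64 // ms since epoch
	exists     bool
}

// take applies one request to the bucket and returns the new state plus
// whether the request was allowed. It mirrors the script's steps:
// refill for elapsed time, cap at capacity, then try to deduct.
func take(b bucket, nowMs int64, capacity, refillPerSec, requested float64) (bucket, bool) {
	tokens := capacity // a missing key starts as a full bucket
	if b.exists {
		elapsedMs := float64(nowMs - b.lastRefill)
		tokens = math.Min(capacity, b.tokens+elapsedMs*refillPerSec*0.001)
	}
	allowed := false
	if tokens >= requested {
		tokens -= requested
		allowed = true
	}
	return bucket{tokens: tokens, lastRefill: nowMs, exists: true}, allowed
}

func main() {
	var b bucket
	var ok bool
	// Capacity 2, refill 1 token/second: two requests pass, the third is denied.
	for i := 0; i < 3; i++ {
		b, ok = take(b, 0, 2, 1, 1)
		fmt.Println("t=0ms allowed:", ok)
	}
	// 1.5 seconds later the bucket has refilled enough for one more request.
	b, ok = take(b, 1500, 2, 1, 1)
	fmt.Println("t=1500ms allowed:", ok, "remaining:", b.tokens)
}
```

Note that a denied request still writes the bucket back with the current timestamp, exactly as the script does, so refill accounting stays continuous.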

Calling it from Go (using go-redis):

const rateLimitScript = `[lua source above]`

result, err := rdb.Eval(ctx, rateLimitScript, []string{key},
    time.Now().UnixMilli(), capacity, refillRate, 1).Result()

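But Eval sends the full script source over the wire on every call. Redis caches the compiled script by its SHA1 digest, so the repeated cost is bandwidth and hashing rather than recompilation, and it grows with script size. Pre-load the script once and switch to EvalSha: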

sha, err := rdb.ScriptLoad(ctx, rateLimitScript).Result()
// ... store sha somewhere accessible ...

result, err := rdb.EvalSha(ctx, sha, []string{key},
    time.Now().UnixMilli(), capacity, refillRate, 1).Result()

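If the script is missing from Redis's script cache (rare, but it happens after a SCRIPT FLUSH or a server restart), EVALSHA fails with a NOSCRIPT error and you retry with Eval. go-redis's redis.Script type handles this automatically: create it once with redis.NewScript and call its Run() method, which tries EVALSHA first and falls back to EVAL on NOSCRIPT. You get EVALSHA's speed with EVAL's reliability.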
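The fallback logic itself is small. Here is a sketch of it against a hypothetical in-memory stand-in for the server's script cache (fakeRedis and its methods are illustrative, not go-redis APIs), so the behavior can be exercised without a running Redis.

```go
package main

import (
	"crypto/sha1"
	"encoding/hex"
	"errors"
	"fmt"
)

// errNoScript models the NOSCRIPT error Redis returns for an unknown SHA.
var errNoScript = errors.New("NOSCRIPT No matching script")

// fakeRedis is a hypothetical stand-in for the server-side script cache.
type fakeRedis struct {
	scripts   map[string]string
	evalCalls int
}

func newFakeRedis() *fakeRedis { return &fakeRedis{scripts: map[string]string{}} }

// digest returns the hex SHA1 of the script source, which is how Redis
// names cached scripts.
func digest(src string) string {
	sum := sha1.Sum([]byte(src))
	return hex.EncodeToString(sum[:])
}

func (f *fakeRedis) evalSha(sha string) (string, error) {
	if _, ok := f.scripts[sha]; !ok {
		return "", errNoScript
	}
	return "ok", nil
}

// eval runs the script and caches it as a side effect, as EVAL does.
func (f *fakeRedis) eval(src string) (string, error) {
	f.evalCalls++
	f.scripts[digest(src)] = src
	return "ok", nil
}

func (f *fakeRedis) scriptFlush() { f.scripts = map[string]string{} }

// run tries the cheap path first and only ships the full source when the
// server reports that it does not know the script.
func run(f *fakeRedis, src string) (string, error) {
	res, err := f.evalSha(digest(src))
	if errors.Is(err, errNoScript) {
		return f.eval(src)
	}
	return res, err
}

func main() {
	f := newFakeRedis()
	const src = "return 1"
	for i := 0; i < 3; i++ {
		if i == 2 {
			f.scriptFlush() // simulate SCRIPT FLUSH before the third call
		}
		if _, err := run(f, src); err != nil {
			panic(err)
		}
	}
	fmt.Println("EVAL fallbacks:", f.evalCalls)
}
```

The full source crosses the wire only on the first call and on the call right after the flush; every other call sends just the 40-character digest.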

Atomic spend-cap enforcement

Same pattern for the per-project spend cap. Get current spend, check threshold, increment if under cap — atomically.

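local cost_usd = tonumber(ARGV[1])
local monthly_cap = tonumber(ARGV[2])

local current_spend = tonumber(redis.call('GET', KEYS[1])) or 0

-- Redis truncates Lua numbers in replies to integers, so USD amounts are
-- returned as strings to keep the fractional part.
if current_spend + cost_usd > monthly_cap then
    return {0, tostring(current_spend), tostring(monthly_cap)}  -- blocked
end

-- INCRBYFLOAT already replies with a string, so new_spend is passed through.
local new_spend = redis.call('INCRBYFLOAT', KEYS[1], cost_usd)
redis.call('EXPIRE', KEYS[1], 2678400)  -- 31 days

return {1, new_spend, tostring(monthly_cap)}  -- allowed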

Returns {allowed, current_spend, cap} so the gateway can include current_spend in response headers for client-side observability.

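The 31-day expiry is deliberate. The month is part of the key name (spend:project:{id}:{YYYY-MM}), so a new calendar month starts on a fresh key and the counter resets without a scheduled job. The TTL only cleans up old keys: because EXPIRE is refreshed on every allowed request, a month's key disappears 31 days after its last write.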
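A minimal sketch of how the month-scoped key could be built on the Go side. The function names are hypothetical, and using UTC for the month boundary is an assumption; the post does not say which time zone the gateway uses.

```go
package main

import (
	"fmt"
	"time"
)

// spendKey builds the per-project, per-month counter key in the
// spend:project:{id}:{YYYY-MM} shape described above.
// Assumption: the month boundary is evaluated in UTC.
func spendKey(projectID string, t time.Time) string {
	return fmt.Sprintf("spend:project:%s:%s", projectID, t.UTC().Format("2006-01"))
}

// utcDate is a small helper for building example timestamps.
func utcDate(year, month, day int) time.Time {
	return time.Date(year, time.Month(month), day, 0, 0, 0, 0, time.UTC)
}

func main() {
	// The last day of May and the first day of June land on different keys,
	// which is what resets the counter.
	fmt.Println(spendKey("p1", utcDate(2025, 5, 31)))
	fmt.Println(spendKey("p1", utcDate(2025, 6, 1)))
}
```

Because the key changes with the month, no request ever has to reset a counter; it simply starts writing to a key that does not exist yet.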

The same pattern extends to per-end-user spend caps: pass X-Customer-ID on a request, and the gateway enforces daily and monthly USD limits per downstream user. There are two overflow behaviors: block returns a 429 with a retry_after, and downgrade automatically routes to a cheaper model (e.g. gpt-4o → gpt-4o-mini). Same atomic Lua pattern, different key namespace.

Two-tier exact cache: Redis hot, Postgres warm

LLM responses are deterministic enough that exact-match caching pays off — same prompt + model + parameters = same answer. Why pay the upstream provider twice?

Cache key: SHA256(project_id + provider + model + sorted_messages_json).
Cache value: the full JSON response from the model.
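One way to compute such a key, as a sketch. The message struct, the function name, and the zero-byte separator between fields are assumptions for illustration; the separator is there so that ("ab", "c") and ("a", "bc") cannot hash to the same key.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// message is a hypothetical minimal chat message shape.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

// cacheKey hashes project, provider, model, and the JSON-encoded messages.
// json.Marshal emits struct fields in declaration order, so equal inputs
// always serialize to the same bytes.
func cacheKey(projectID, provider, model string, msgs []message) (string, error) {
	body, err := json.Marshal(msgs)
	if err != nil {
		return "", err
	}
	h := sha256.New()
	for _, part := range []string{projectID, provider, model} {
		h.Write([]byte(part))
		h.Write([]byte{0}) // separator (an assumption, see lead-in)
	}
	h.Write(body)
	return hex.EncodeToString(h.Sum(nil)), nil
}

func main() {
	msgs := []message{{Role: "user", Content: "what is the capital of france?"}}
	key, err := cacheKey("proj-1", "openai", "gpt-4o-mini", msgs)
	if err != nil {
		panic(err)
	}
	fmt.Println(key)
}
```

Any request parameter that changes the answer (temperature, for example) would also need to feed into the hash, otherwise two different requests could share one cached response.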

Storage in two tiers:

Tier	Storage	TTL	Latency
Hot	Redis	configurable (default 1 hour)	<1 ms
Warm	Postgres	same TTL, survives Redis eviction/restart	~5 ms

On read:

func (c *Cache) Get(ctx context.Context, key string) ([]byte, bool, error) {
    // Try Redis first
    cached, err := c.redis.Get(ctx, key).Bytes()
    if err == nil {
        return cached, true, nil
    }
    if err != redis.Nil {
        return nil, false, err
    }

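    // Fall back to Postgres
    var response []byte
    var expiresAt time.Time
    err = c.db.QueryRowContext(ctx,
        "SELECT cached_response, expires_at FROM exact_cache WHERE cache_key = $1 AND expires_at > NOW()",
        key).Scan(&response, &expiresAt)
    if err == sql.ErrNoRows {
        return nil, false, nil
    }
    if err != nil {
        return nil, false, err
    }

    // Promote to Redis on a Postgres hit, using the row's remaining lifetime so
    // promotion never keeps an entry alive past its expires_at. Best-effort:
    // a failed SET only means the next lookup falls through to Postgres again.
    if ttl := time.Until(expiresAt); ttl > 0 {
        _ = c.redis.Set(ctx, key, response, ttl).Err()
    }
    return response, true, nil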
}

On write: Redis SET synchronous, Postgres INSERT async (off the hot path) so writes don't slow down the response.
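The write path can be sketched with two interchangeable stores. Everything here is hypothetical scaffolding: memStore stands in for both Redis and Postgres, and twoTierWriter is an illustrative name. The point is the ordering: the hot write blocks the caller, the warm write does not.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// store is the minimal write interface both tiers share in this sketch.
type store interface {
	Put(key string, val []byte) error
}

// memStore is an in-memory stand-in for Redis or Postgres.
type memStore struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newMemStore() *memStore { return &memStore{data: map[string][]byte{}} }

func (m *memStore) Put(key string, val []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = val
	return nil
}

func (m *memStore) Has(key string) bool {
	m.mu.Lock()
	defer m.mu.Unlock()
	_, ok := m.data[key]
	return ok
}

// twoTierWriter writes the hot tier synchronously and the warm tier in a
// background goroutine, so the response is not delayed by the slower store.
type twoTierWriter struct {
	hot, warm store
	pending   sync.WaitGroup
}

func (w *twoTierWriter) Write(key string, val []byte) error {
	if err := w.hot.Put(key, val); err != nil {
		return err
	}
	w.pending.Add(1)
	go func() {
		defer w.pending.Done()
		if err := w.warm.Put(key, val); err != nil {
			log.Printf("warm-tier write failed for %q: %v", key, err)
		}
	}()
	return nil
}

// Drain waits for in-flight warm writes, e.g. during graceful shutdown.
func (w *twoTierWriter) Drain() { w.pending.Wait() }

func main() {
	hot, warm := newMemStore(), newMemStore()
	w := &twoTierWriter{hot: hot, warm: warm}
	if err := w.Write("k", []byte("v")); err != nil {
		log.Fatal(err)
	}
	w.Drain()
	fmt.Println(hot.Has("k"), warm.Has("k"))
}
```

The trade-off is that a crash between the two writes loses the warm copy of that entry. For a cache that is acceptable: the entry is simply recomputed on a later miss.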

This gives you Redis-fast hits for recent prompts and Postgres-durable hits for older ones. It also survives restarts: after a Redis restart, the hot tier rebuilds naturally from production traffic as repeated queries are promoted out of Postgres.

Semantic cache: pgvector catches paraphrases

Exact-match caching catches "what is the capital of france?" twice. It misses "tell me france's capital city" — same question, different hash.

For that, you need vectors.

The gateway runs all-MiniLM-L6-v2 in a sidecar container — 22M-param model, 384 dimensions, CPU-only, ~20–40 ms per embedding. Storage is pgvector — the Postgres extension. No separate vector database to operate.

Schema:

CREATE EXTENSION vector;

CREATE TABLE semantic_cache (
    id UUID PRIMARY KEY,
    project_id UUID NOT NULL,
    provider TEXT NOT NULL,
    model TEXT NOT NULL,
    embedding vector(384),
    cached_response JSONB,
    expires_at TIMESTAMPTZ
);

CREATE INDEX semantic_cache_hnsw_idx
    ON semantic_cache
    USING hnsw (embedding vector_cosine_ops);

Lookup query (configurable similarity threshold per project, default 0.95):

SELECT cached_response, 1 - (embedding <=> $1) AS similarity
FROM semantic_cache
WHERE project_id = $2
  AND provider = $3
  AND model = $4
  AND expires_at > NOW()
  AND 1 - (embedding <=> $1) > $5
ORDER BY embedding <=> $1
LIMIT 1;

<=> is pgvector's cosine distance operator. 1 - (embedding <=> $1) converts that to similarity for the threshold comparison.

In Go (using pgvector-go):

embedding, err := embeddingClient.GenerateEmbedding(ctx, prompt)
if err != nil {
    return nil, false, err
}

pgvec := pgvector.NewVector(embedding)

var cached json.RawMessage
var sim float32
err = db.QueryRowContext(ctx, query, pgvec, projectID, provider, model, threshold).
    Scan(&cached, &sim)

Numbers: paraphrased queries hit at ~0.95 similarity in ~40 ms, $0 cost. The bulk of that 40 ms is the embedding inference — pgvector itself is 5–8 ms.
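For intuition about the threshold comparison in the lookup query, here is the same math in plain Go: cosine similarity, the distance pgvector's operator returns (1 minus similarity), and the hit test. Function names are hypothetical; in the gateway this computation happens inside Postgres.

```go
package main

import (
	"errors"
	"fmt"
	"math"
)

// cosineSimilarity returns dot(a, b) / (|a| * |b|).
func cosineSimilarity(a, b []float64) (float64, error) {
	if len(a) != len(b) || len(a) == 0 {
		return 0, errors.New("vectors must be non-empty and the same length")
	}
	var dot, normA, normB float64
	for i := range a {
		dot += a[i] * b[i]
		normA += a[i] * a[i]
		normB += b[i] * b[i]
	}
	if normA == 0 || normB == 0 {
		return 0, errors.New("cosine similarity is undefined for a zero vector")
	}
	return dot / (math.Sqrt(normA) * math.Sqrt(normB)), nil
}

// cosineDistance is the quantity the distance operator yields: 1 - similarity.
func cosineDistance(a, b []float64) (float64, error) {
	sim, err := cosineSimilarity(a, b)
	if err != nil {
		return 0, err
	}
	return 1 - sim, nil
}

// isHit mirrors the query's filter: 1 - distance > threshold.
func isHit(a, b []float64, threshold float64) (bool, error) {
	dist, err := cosineDistance(a, b)
	if err != nil {
		return false, err
	}
	return 1-dist > threshold, nil
}

func main() {
	query := []float64{1, 0}
	same := []float64{2, 0}      // same direction, different length
	unrelated := []float64{0, 1} // orthogonal
	hit, _ := isHit(query, same, 0.95)
	miss, _ := isHit(query, unrelated, 0.95)
	fmt.Println(hit, miss)
}
```

Cosine similarity ignores vector length and looks only at direction, which is why a scaled copy of a vector still counts as a perfect match.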

Want a different embedder?

The embedding service contract is intentionally small:

POST /embed
request:  { "texts": ["..."] }
response: { "embeddings": [[...]], "dimensions": 384 }
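A client for that contract fits in a few lines of standard-library Go. This is a sketch: the struct and function names are assumptions, and newFakeEmbedder is a hypothetical test double that returns zero vectors so the example runs without the real sidecar.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// embedRequest and embedResponse follow the POST /embed contract above.
type embedRequest struct {
	Texts []string `json:"texts"`
}

type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
	Dimensions int         `json:"dimensions"`
}

// embed posts texts to baseURL+"/embed" and decodes the response.
func embed(baseURL string, texts []string) (*embedResponse, error) {
	body, err := json.Marshal(embedRequest{Texts: texts})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(baseURL+"/embed", "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("embed: unexpected status %d", resp.StatusCode)
	}
	var out embedResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	if len(out.Embeddings) != len(texts) {
		return nil, fmt.Errorf("embed: got %d vectors for %d texts", len(out.Embeddings), len(texts))
	}
	return &out, nil
}

// newFakeEmbedder is a test double: it returns a 384-dim zero vector per text.
func newFakeEmbedder() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var req embedRequest
		if r.URL.Path != "/embed" || json.NewDecoder(r.Body).Decode(&req) != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		vecs := make([][]float32, len(req.Texts))
		for i := range vecs {
			vecs[i] = make([]float32, 384)
		}
		json.NewEncoder(w).Encode(embedResponse{Embeddings: vecs, Dimensions: 384})
	}))
}

func main() {
	srv := newFakeEmbedder()
	defer srv.Close()
	out, err := embed(srv.URL, []string{"tell me france's capital city"})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(out.Embeddings), out.Dimensions)
}
```

Checking that the number of vectors matches the number of texts catches a misbehaving embedder before a wrong-length result reaches the vector store.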

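You can swap in OpenAI's text-embedding-3-small with a tiny adapter shim (OpenAI's endpoint is /v1/embeddings with a different request shape, so it's not literally a one-line change, but it's ~30 lines of Go). At $0.02 per million tokens, that's essentially free at any sane traffic level. One thing to plan for: text-embedding-3-small returns 1536 dimensions by default, so the shim either requests 384 through the API's dimensions parameter or the embedding column changes from vector(384). Vectors from different models aren't comparable either, so existing semantic-cache rows have to be dropped or re-embedded after a swap.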
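The translation half of such a shim might look like the sketch below. The request and response shapes follow OpenAI's embeddings API (model, input, and optional dimensions in the request, a data array of indexed embeddings in the response). The function names are hypothetical, and the HTTP call and API-key handling are left out.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// openAIEmbeddingRequest is the body sent to /v1/embeddings.
type openAIEmbeddingRequest struct {
	Model      string   `json:"model"`
	Input      []string `json:"input"`
	Dimensions int      `json:"dimensions,omitempty"`
}

// openAIEmbeddingResponse holds the fields of the reply this shim needs.
type openAIEmbeddingResponse struct {
	Data []struct {
		Index     int       `json:"index"`
		Embedding []float32 `json:"embedding"`
	} `json:"data"`
}

// embedResponse is the gateway-side shape from the POST /embed contract.
type embedResponse struct {
	Embeddings [][]float32 `json:"embeddings"`
	Dimensions int         `json:"dimensions"`
}

// toOpenAI maps the gateway's texts to an OpenAI request. dims asks the API
// for shortened vectors; 0 omits the field and keeps the model's default.
func toOpenAI(texts []string, dims int) openAIEmbeddingRequest {
	return openAIEmbeddingRequest{Model: "text-embedding-3-small", Input: texts, Dimensions: dims}
}

// fromOpenAI maps an OpenAI reply back, placing each vector by its index so
// the output order matches the input texts.
func fromOpenAI(raw []byte) (embedResponse, error) {
	var in openAIEmbeddingResponse
	if err := json.Unmarshal(raw, &in); err != nil {
		return embedResponse{}, err
	}
	if len(in.Data) == 0 {
		return embedResponse{}, errors.New("no embeddings in response")
	}
	out := embedResponse{Embeddings: make([][]float32, len(in.Data))}
	for _, d := range in.Data {
		if d.Index < 0 || d.Index >= len(out.Embeddings) {
			return embedResponse{}, fmt.Errorf("embedding index %d out of range", d.Index)
		}
		out.Embeddings[d.Index] = d.Embedding
	}
	out.Dimensions = len(out.Embeddings[0])
	return out, nil
}

func main() {
	body, err := json.Marshal(toOpenAI([]string{"hello"}, 384))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```

Placing vectors by their index field rather than by array position keeps the output aligned with the input texts even if the reply arrives in a different order.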

A note on Redis Search vs pgvector

When I first wrote about this, I claimed Redis vector search would block rate-limit Lua because Redis is single-threaded. A Redis engineer correctly pointed out this is wrong — RediSearch runs indexing and query execution off the main thread (worker thread pool since 2.4+, multi-threaded vector search since 2.6+). It won't block.

The defensible reasons to default to pgvector for the gateway are different ones:

The latency floor is the embedding (~30 ms), not the vector search (5–8 ms). Even if RediSearch were instant, you'd save 5–8 ms — a 1.2x improvement on the cache-hit path, not a 10x one.
Operational footprint. The gateway already operates Postgres for metadata/logs. Adding a Redis Stack instance just for vectors means a separate datastore, separate licensing considerations (post-2024 SSPL/RSAL), separate config.
Migration is decoupled. The POST /embed contract abstracts the embedder, and the vector-store layer is similarly abstracted. Switching to RediSearch later if pgvector becomes a bottleneck is a swap, not a rewrite.

I'm benchmarking this for a follow-up post.

EVALSHA pre-loading

There's one more optimization worth calling out. Redis caches Lua scripts by SHA1 hash. Calling via EVALSHA(hash) is cheaper than EVAL(full_script) because the client sends a 40-character digest instead of the full script source on every call, and Redis doesn't have to hash the body to find the cached script.

On gateway startup, I pre-load all three scripts and store their SHAs:

type RedisClient struct {
    rdb              *redis.Client
    rateLimitSHA     string
    spendCheckSHA    string
    customerSpendSHA string
}

func (c *RedisClient) loadScripts(ctx context.Context) error {
    sha, err := c.rdb.ScriptLoad(ctx, rateLimitLuaScript).Result()
    if err != nil {
        return err
    }
    c.rateLimitSHA = sha
    // ... repeat for the other two ...
    return nil
}

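If a script is gone from the cache (which can happen on SCRIPT FLUSH or restart), the EVALSHA call fails with a NOSCRIPT error and the client retries with EVAL, which re-loads it. With stored SHAs and direct EvalSha calls, that retry is your code's job; it is fully automatic if you wrap each script in go-redis's redis.Script and call its Run() method.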

Putting it together: 1–2 round trips on the hot path

A fully-configured cache-hit request runs through:

Auth lookup — Redis GET (cached for the API-key TTL): 1 round trip.
Rate-limit + spend-cap — one or two Lua scripts depending on which checks are configured: 1–2 round trips.
Exact cache GET — Redis: 1 round trip.
Response write — stream out.

On the fast-fail path (rate-limited or over the spend cap), we short-circuit at step 2 with a 429 — no exact cache lookup, no upstream call, no response generation. That's the property that keeps a single gateway instance stable during abuse bursts: a runaway client or credential leak can't meaningfully consume gateway CPU because each DENY takes ~2 ms of work and 0 provider cost.
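The ordering that makes the fast-fail path cheap can be sketched as a handler that counts how often each expensive step runs. The gateway struct, its maps, and handle are all hypothetical; the real checks are the Redis calls listed above.

```go
package main

import "fmt"

// gateway is an in-memory model of the request pipeline. The maps stand in
// for the auth lookup, the Lua limit checks, and the exact cache.
type gateway struct {
	validKeys     map[string]bool
	overLimit     map[string]bool // keys denied by rate limit or spend cap
	cache         map[string]string
	cacheLookups  int
	upstreamCalls int
}

// handle runs the steps in order and returns as soon as one of them fails.
func (g *gateway) handle(apiKey, prompt string) (int, string) {
	if !g.validKeys[apiKey] {
		return 401, "invalid API key"
	}
	// Limits are checked before any cache or upstream work, so a denied
	// request costs only the steps above this line.
	if g.overLimit[apiKey] {
		return 429, "rate limited or over spend cap"
	}
	g.cacheLookups++
	if body, ok := g.cache[prompt]; ok {
		return 200, body
	}
	g.upstreamCalls++
	body := "upstream answer to: " + prompt
	g.cache[prompt] = body
	return 200, body
}

func main() {
	g := &gateway{
		validKeys: map[string]bool{"good": true, "abusive": true},
		overLimit: map[string]bool{"abusive": true},
		cache:     map[string]string{},
	}
	for i := 0; i < 1000; i++ {
		g.handle("abusive", "spam") // every one of these stops at the limit check
	}
	status, _ := g.handle("good", "hello")
	fmt.Println(status, g.cacheLookups, g.upstreamCalls)
}
```

A thousand denied requests leave both counters untouched, so the cost of a burst scales with the cheapest step rather than the most expensive one.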

Numbers

Measured on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet, server-side from gateway_logs.latency_ms (excludes client RTT):

Response	p50	p95	p99
200 cache hit	3 ms	12 ms	23 ms
429 rate-limit	~2 ms	~6 ms	—

Sustained throughput: ~1,672 req/s on a single instance.

Methodology: hey at concurrency 20, 200 requests, pre-warmed cache, then SQL query against the gateway's own log table for the actual server-side distribution. Bench script in the repo at bench/load_test.sh — reproducible in ~5 minutes with docker compose up -d postgres redis + go run ./cmd/gateway + the bench script.

The reason I quote server-side numbers and not hey numbers is that hey includes network RTT, OS scheduler noise, and TCP overhead that doesn't reflect how the gateway itself performs. The README has the full breakdown including 2 vCPU and MacBook M4 comparisons — short version: Docker Desktop on macOS is slower than a 4 vCPU Linux droplet because every Redis round trip pays a 1–2 ms VM-network tax. Production numbers match the droplet rows, not the laptop row.
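If you would rather compute the distribution in Go than in SQL, a nearest-rank percentile over the logged latencies is enough. This is a sketch with hypothetical function names; the nearest-rank definition is an assumption, since the post does not say which percentile method the SQL query uses.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// rankFor returns the 1-based nearest-rank position for percentile p over n
// samples, i.e. ceil(p/100 * n), computed in integers to avoid float rounding.
func rankFor(n, p int) int { return (p*n + 99) / 100 }

// percentile returns the nearest-rank p-th percentile (p in 1..100).
// It sorts a copy, so the caller's slice is left untouched.
func percentile(samples []float64, p int) (float64, error) {
	if len(samples) == 0 {
		return 0, errors.New("no samples")
	}
	if p < 1 || p > 100 {
		return 0, errors.New("p must be between 1 and 100")
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	return sorted[rankFor(len(sorted), p)-1], nil
}

// samplesAbove reports how many observations lie above the p-th percentile
// cutoff, a quick check on how much data a tail number rests on.
func samplesAbove(n, p int) int { return n - rankFor(n, p) }

func main() {
	// Synthetic latencies of 1..200 ms, standing in for gateway_logs.latency_ms.
	latencies := make([]float64, 200)
	for i := range latencies {
		latencies[i] = float64(i + 1)
	}
	for _, p := range []int{50, 95, 99} {
		v, err := percentile(latencies, p)
		if err != nil {
			panic(err)
		}
		fmt.Printf("p%d = %.0f ms (%d samples above)\n", p, v, samplesAbove(len(latencies), p))
	}
}
```

The samplesAbove figure is worth printing next to any tail percentile: in a small run only a handful of requests sit above the p99 cutoff, so a single slow request can move the number noticeably.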

Caveats worth reading

Sample size is small (~80 samples per run for p99). Enough to be directionally right, not tight enough to publish ±0.5 ms. Quote a range, not a single point.
p99 is GC- and connection-warmup-bound, not CPU-bound. Throwing more hardware at it won't reliably push p99 below ~15 ms without GC tuning (GOGC=200+) and pool pre-warming.
These are cache-hit numbers. Cache misses are dominated by upstream provider latency (gpt-4o-mini ≈ 300–800 ms to OpenAI). That's not gateway overhead.

What's shipping today (v0.1.x)

OpenAI-compatible /v1/chat/completions and /v1/models
SSE streaming across all four providers — OpenAI, Anthropic, Gemini, Ollama. Chunks normalized to a single OpenAI-compatible shape regardless of upstream, with a trailing metadata frame carrying rounded cost_usd, usage, latency_ms, and provider before [DONE]. Ollama's empty role-only chunks are filtered by default.
Automatic cross-provider failover on 429/5xx/timeout/connection errors. Four modes (cloud_first, local_first, local_only, cloud_only) — same env var flips between them.
Per-API-key token-bucket rate limits (atomic Redis Lua).
Per-project hard monthly spend caps with pre-request cost estimation — blocks runaway prompts before the LLM call.
Per-customer spend caps with block or downgrade behavior on overflow.
Local Ollama support — any pulled model auto-routable, $0 cost, full tier substitution for cloud failover.
Exact + semantic caching with per-project toggles and thresholds.
Customer labels (X-LLM0-*) stored as JSONB on every log row for downstream analytics.

What's not done yet

Only four providers. Bedrock, Azure OpenAI, Groq, Together, DeepSeek, xAI are on the roadmap.
No Prometheus /metrics endpoint yet. Observability today is gateway_logs in Postgres + response headers.
No structured logging yet — log/slog is planned.
No hosted/managed version yet. Self-hosted works today; a managed product gets built only if there's enough real demand (waitlist is for validation, not gating).
Handler-level tests incomplete. Benchmarks cover the hot path; mock-based handler tests are in progress.

Why I built this

I wanted something fast, where I owned the provider keys, with spend limits per user or per AI agent, automatic failover across providers, and real cost savings from caching, both exact-match and semantic.

There are plenty of LLM gateways out there — OpenRouter, Helicone, Portkey, LiteLLM, Bifrost. They're good. But each one I evaluated was missing at least one of those, and the self-hosted BYOK angle in particular wasn't well covered: your own provider keys, your data never touches a third-party cloud, no markup, no shared rate limits with other tenants.

The performance work above is what makes the self-hosted story actually viable. If your gateway adds 50 ms of overhead, you're not really shipping a "lightweight" anything.

Code & links

Repo: https://github.com/llm0ai/llm0
Site: https://llm0.ai

MIT licensed. Runs from docker compose up.

If you spot something wrong, open an issue or comment below.

Top comments (1)

Harjot Singh • May 31

One endpoint in front of OpenAI, Anthropic, Gemini, and Ollama is the right abstraction, and the 3ms cache-hit p50 is the number that matters because it makes the gateway pay for itself. The architectural point I'd highlight for readers: a gateway like this is where cost and routing discipline actually become enforceable. Once every call flows through one place, you can do the things that are impossible when each service calls providers directly, route per-task to the cheapest model that clears the bar, fall back across providers on outage, and cache semantically so a near-duplicate query never hits a paid model at all. The pgvector semantic cache is the sleeper feature: exact-match caching helps a little, but semantic cache (this question is close enough to one I answered) is where the real spend savings hide, with the caveat that close-enough has to be tuned carefully or you serve a confidently-wrong cached answer to a subtly different question. The honest-where-I-overstated-it note is the most credible part of the whole post. This single-chokepoint-for-routing-and-cost is exactly how I think about provider orchestration in Moonshift. How do you tune the semantic cache similarity threshold to avoid false-positive hits on questions that look alike but need different answers?