<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mushfiq Rahman</title>
    <description>The latest articles on DEV Community by Mushfiq Rahman (@mushfiq_rahmanmushfiq_).</description>
    <link>https://dev.to/mushfiq_rahmanmushfiq_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2673135%2F479c0c48-2870-4d6a-95e3-2c4eb4d7630a.png</url>
      <title>DEV Community: Mushfiq Rahman</title>
      <link>https://dev.to/mushfiq_rahmanmushfiq_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mushfiq_rahmanmushfiq_"/>
    <language>en</language>
    <item>
      <title>Building a fast LLM gateway in Go: Lua + pgvector</title>
      <dc:creator>Mushfiq Rahman</dc:creator>
      <pubDate>Wed, 27 May 2026 13:00:00 +0000</pubDate>
      <link>https://dev.to/mushfiq_rahmanmushfiq_/building-a-fast-llm-gateway-in-go-lua-pgvector-1ea0</link>
      <guid>https://dev.to/mushfiq_rahmanmushfiq_/building-a-fast-llm-gateway-in-go-lua-pgvector-1ea0</guid>
      <description>&lt;p&gt;I open-sourced &lt;a href="https://github.com/mrmushfiq/llm0-gateway" rel="noopener noreferrer"&gt;llm0-gateway&lt;/a&gt; recently — a Go binary that puts one OpenAI-compatible endpoint in front of OpenAI, Anthropic, Gemini, and local Ollama. MIT licensed. Single binary plus Postgres + Redis.&lt;/p&gt;

&lt;p&gt;The technically interesting bits are how it stays fast: &lt;strong&gt;3 ms p50 cache-hit latency, ~1,672 req/s sustained throughput, 1–2 Redis round trips on the hot path&lt;/strong&gt; on a DigitalOcean 4 vCPU / 8 GB shared Linux droplet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp40dn5x41zpryhxk45n.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdp40dn5x41zpryhxk45n.jpeg" alt="LLM-Gateway" width="800" height="730"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This post walks through the architecture decisions that got those numbers. Expect Lua scripts, a &lt;code&gt;pgvector&lt;/code&gt; query, and an honest discussion of where I overstated things and got corrected by a Redis engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The naive approach (and why it's slow)
&lt;/h2&gt;

&lt;p&gt;A typical LLM gateway request needs to do six things:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Authenticate the API key&lt;/li&gt;
&lt;li&gt;Check the per-API-key rate limit&lt;/li&gt;
&lt;li&gt;Check the per-project spend cap&lt;/li&gt;
&lt;li&gt;Look up exact-match cache&lt;/li&gt;
&lt;li&gt;(Maybe) check semantic cache&lt;/li&gt;
&lt;li&gt;(Maybe) forward to the upstream model&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The naive approach is six serial Redis GETs. At ~1 ms each in Docker on a typical cloud VM, that's 5–10 ms gone before the request leaves the gateway. Half a 50th-percentile latency budget consumed on bookkeeping.&lt;/p&gt;

&lt;p&gt;There's also a correctness problem. Rate-limit and spend-cap checks have a TOCTOU race:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;read counter → check threshold → write counter
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two simultaneous requests can both pass the check and both increment, exceeding the cap.&lt;/p&gt;

&lt;p&gt;The fix to both — speed and correctness — is to move the hot work into Redis itself, as atomic Lua scripts.&lt;/p&gt;

&lt;h2&gt;
  
  
  Token-bucket rate limiting in Lua
&lt;/h2&gt;

&lt;p&gt;Token bucket: each API key has a bucket with a capacity (say, 60 tokens) that refills at a fixed rate (say, 1 per second). Each request takes 1 token. If the bucket is empty, the request is denied.&lt;/p&gt;

&lt;p&gt;Here's the Lua script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;refill_rate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;requested&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'HMGET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;'tokens'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'last_refill'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;last_refill&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;math.min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;now&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;last_refill&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;refill_rate&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;001&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;requested&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;requested&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'HMSET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="s1"&gt;'tokens'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'last_refill'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;now&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'EXPIRE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;900&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;math.floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;What this gets you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Atomic.&lt;/strong&gt; The entire script runs as one Redis command. No interleaving with other commands on the same instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One round trip.&lt;/strong&gt; Read state, compute new state, write state — one network call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No TOCTOU.&lt;/strong&gt; Check and decrement happen in the same operation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Calling it from Go (using &lt;code&gt;go-redis&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;const&lt;/span&gt; &lt;span class="n"&gt;rateLimitScript&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;`[lua source above]`&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rateLimitScript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnixMilli&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refillRate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But &lt;code&gt;Eval&lt;/code&gt; re-parses the script on every call, which is slow. Pre-load the script once and switch to &lt;code&gt;EvalSha&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ScriptLoad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rateLimitScript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="c"&gt;// ... store sha somewhere accessible ...&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;EvalSha&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;UnixMilli&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;capacity&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;refillRate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If Redis evicts the script from its script cache (rare, but possible on a &lt;code&gt;FLUSHALL&lt;/code&gt;), retry with &lt;code&gt;Eval&lt;/code&gt;. The &lt;code&gt;go-redis&lt;/code&gt; client's &lt;code&gt;Run()&lt;/code&gt; method handles this automatically — pre-load once, call &lt;code&gt;Run()&lt;/code&gt;, get EVALSHA's speed with EVAL's reliability.&lt;/p&gt;

&lt;h2&gt;
  
  
  Atomic spend-cap enforcement
&lt;/h2&gt;

&lt;p&gt;Same pattern for the per-project spend cap. Get current spend, check threshold, increment if under cap — atomically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight lua"&gt;&lt;code&gt;&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;monthly_cap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARGV&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;current_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'GET'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;current_spend&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;monthly_cap&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;current_spend&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;monthly_cap&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;-- blocked&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;

&lt;span class="kd"&gt;local&lt;/span&gt; &lt;span class="n"&gt;new_spend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'INCRBYFLOAT'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;cost_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'EXPIRE'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;KEYS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;2678400&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;-- 31 days&lt;/span&gt;

&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;tonumber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_spend&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;monthly_cap&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;-- allowed&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Returns &lt;code&gt;{allowed, current_spend, cap}&lt;/code&gt; so the gateway can include &lt;code&gt;current_spend&lt;/code&gt; in response headers for client-side observability.&lt;/p&gt;

&lt;p&gt;The 31-day expiry is deliberate — &lt;code&gt;spend:project:{id}:{YYYY-MM}&lt;/code&gt; keys auto-evict after a calendar month rolls over, so monthly counters reset cleanly without needing a scheduled job to clear them.&lt;/p&gt;

&lt;p&gt;The same pattern extends to &lt;strong&gt;per-end-user spend caps&lt;/strong&gt;: pass &lt;code&gt;X-Customer-ID&lt;/code&gt; on a request, and the gateway enforces daily and monthly USD limits per downstream user. Two overflow behaviors — &lt;code&gt;block&lt;/code&gt; returns 429 with a &lt;code&gt;retry_after&lt;/code&gt;, &lt;code&gt;downgrade&lt;/code&gt; automatically routes to a cheaper model (e.g. &lt;code&gt;gpt-4o&lt;/code&gt; → &lt;code&gt;gpt-4o-mini&lt;/code&gt;). Same atomic Lua pattern, different key namespace.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two-tier exact cache: Redis hot, Postgres warm
&lt;/h2&gt;

&lt;p&gt;LLM responses are deterministic enough that exact-match caching pays off — same prompt + model + parameters = same answer. Why pay the upstream provider twice?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cache key:&lt;/strong&gt; &lt;code&gt;SHA256(project_id + provider + model + sorted_messages_json)&lt;/code&gt;.&lt;br&gt;
&lt;strong&gt;Cache value:&lt;/strong&gt; the full JSON response from the model.&lt;/p&gt;

&lt;p&gt;Storage in two tiers:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tier&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;TTL&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Hot&lt;/td&gt;
&lt;td&gt;Redis&lt;/td&gt;
&lt;td&gt;configurable (default 1 hour)&lt;/td&gt;
&lt;td&gt;&amp;lt;1 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Warm&lt;/td&gt;
&lt;td&gt;Postgres&lt;/td&gt;
&lt;td&gt;same TTL, survives Redis eviction/restart&lt;/td&gt;
&lt;td&gt;~5 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;On read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;([]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c"&gt;// Try Redis first&lt;/span&gt;
    &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Bytes&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Fall back to Postgres&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryRowContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"SELECT cached_response FROM exact_cache WHERE cache_key = $1 AND expires_at &amp;gt; NOW()"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;sql&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ErrNoRows&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c"&gt;// Promote to Redis on Postgres hit&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Hour&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On write: Redis SET synchronous, Postgres INSERT async (off the hot path) so writes don't slow down the response.&lt;/p&gt;

&lt;p&gt;This gives you Redis-fast hits for recent prompts and Postgres-durable hits for older ones. Restart-survivable too: rebuilding the warm tier from production traffic happens naturally as users repeat queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic cache: pgvector catches paraphrases
&lt;/h2&gt;

&lt;p&gt;Exact-match caching catches &lt;code&gt;"what is the capital of france?"&lt;/code&gt; twice. It misses &lt;code&gt;"tell me france's capital city"&lt;/code&gt; — same question, different hash.&lt;/p&gt;

&lt;p&gt;For that, you need vectors.&lt;/p&gt;

&lt;p&gt;The gateway runs &lt;code&gt;all-MiniLM-L6-v2&lt;/code&gt; in a sidecar container — 22M-param model, 384 dimensions, CPU-only, ~20–40 ms per embedding. Storage is &lt;code&gt;pgvector&lt;/code&gt; — the Postgres extension. No separate vector database to operate.&lt;/p&gt;

&lt;p&gt;Schema:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;EXTENSION&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;semantic_cache&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="n"&gt;UUID&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;384&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="n"&gt;cached_response&lt;/span&gt; &lt;span class="n"&gt;JSONB&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="n"&gt;TIMESTAMPTZ&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;INDEX&lt;/span&gt; &lt;span class="n"&gt;semantic_cache_hnsw_idx&lt;/span&gt;
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;semantic_cache&lt;/span&gt;
    &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="n"&gt;hnsw&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="n"&gt;vector_cosine_ops&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lookup query (configurable similarity threshold per project, default 0.95):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;cached_response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;similarity&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;semantic_cache&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;project_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;expires_at&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;NOW&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&amp;gt;&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;&amp;lt;=&amp;gt;&lt;/code&gt; is pgvector's cosine distance operator. &lt;code&gt;1 - (embedding &amp;lt;=&amp;gt; $1)&lt;/code&gt; converts that to similarity for the threshold comparison.&lt;/p&gt;

&lt;p&gt;In Go (using &lt;code&gt;pgvector-go&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;embeddingClient&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GenerateEmbedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;pgvec&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;pgvector&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;NewVector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;RawMessage&lt;/span&gt;
&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;sim&lt;/span&gt; &lt;span class="kt"&gt;float32&lt;/span&gt;
&lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;QueryRowContext&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pgvec&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;projectID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;
    &lt;span class="n"&gt;Scan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;sim&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Numbers: paraphrased queries hit at ~0.95 similarity in &lt;strong&gt;~40 ms, $0 cost&lt;/strong&gt;. The bulk of that 40 ms is the embedding inference — pgvector itself is 5–8 ms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Want a different embedder?
&lt;/h3&gt;

&lt;p&gt;The embedding service contract is intentionally small:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;POST /embed
request:  { "texts": ["..."] }
response: { "embeddings": [[...]], "dimensions": 384 }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can swap in OpenAI's &lt;code&gt;text-embedding-3-small&lt;/code&gt; with a tiny adapter shim (OpenAI's endpoint is &lt;code&gt;/v1/embeddings&lt;/code&gt; with a different request shape, so it's not literally a one-line change, but it's ~30 lines of Go). At $0.02 per million tokens, that's essentially free at any sane traffic level.&lt;/p&gt;

&lt;h3&gt;
  
  
  A note on Redis Search vs pgvector
&lt;/h3&gt;

&lt;p&gt;When I first wrote about this, I claimed Redis vector search would block rate-limit Lua because Redis is single-threaded. &lt;strong&gt;A Redis engineer correctly pointed out this is wrong&lt;/strong&gt; — RediSearch runs indexing and query execution off the main thread (worker thread pool since 2.4+, multi-threaded vector search since 2.6+). It won't block.&lt;/p&gt;

&lt;p&gt;The defensible reasons to default to pgvector for the gateway are different ones:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The latency floor is the embedding (~30 ms), not the vector search (5–8 ms).&lt;/strong&gt; Even if RediSearch were instant, you'd save 5–8 ms — a 1.2x improvement on the cache-hit path, not a 10x one.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operational footprint.&lt;/strong&gt; The gateway already operates Postgres for metadata/logs. Adding a Redis Stack instance just for vectors means a separate datastore, separate licensing considerations (post-2024 SSPL/RSAL), separate config.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Migration is decoupled.&lt;/strong&gt; The &lt;code&gt;POST /embed&lt;/code&gt; contract abstracts the embedder, and the vector-store layer is similarly abstracted. Switching to RediSearch later if pgvector becomes a bottleneck is a swap, not a rewrite.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I'm benchmarking this for a follow-up post.&lt;/p&gt;

&lt;h2&gt;
  
  
  EVALSHA pre-loading
&lt;/h2&gt;

&lt;p&gt;There's one more optimization worth calling out. Redis caches Lua scripts by SHA hash. Calling via &lt;code&gt;EVALSHA(hash)&lt;/code&gt; is faster than &lt;code&gt;EVAL(full_script)&lt;/code&gt; because Redis doesn't have to re-parse the script every call.&lt;/p&gt;

&lt;p&gt;On gateway startup, I pre-load all three scripts and store their SHAs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;type&lt;/span&gt; &lt;span class="n"&gt;RedisClient&lt;/span&gt; &lt;span class="k"&gt;struct&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;rdb&lt;/span&gt;              &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;
    &lt;span class="n"&gt;rateLimitSHA&lt;/span&gt;     &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;spendCheckSHA&lt;/span&gt;    &lt;span class="kt"&gt;string&lt;/span&gt;
    &lt;span class="n"&gt;customerSpendSHA&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;RedisClient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;loadScripts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;error&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;sha&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rdb&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ScriptLoad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rateLimitLuaScript&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;err&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rateLimitSHA&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sha&lt;/span&gt;
    &lt;span class="c"&gt;// ... repeat for the other two ...&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a script gets evicted (which can happen on &lt;code&gt;SCRIPT FLUSH&lt;/code&gt; or restart), the call fails and the client retries with &lt;code&gt;EVAL&lt;/code&gt;, which re-loads it. This is fully automatic with &lt;code&gt;go-redis&lt;/code&gt;'s &lt;code&gt;Run()&lt;/code&gt; method.&lt;/p&gt;

&lt;h2&gt;
  
  
  Putting it together: 1–2 round trips on the hot path
&lt;/h2&gt;

&lt;p&gt;A fully-configured cache-hit request runs through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Auth lookup&lt;/strong&gt; — Redis GET (cached for the API-key TTL): 1 round trip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate-limit + spend-cap&lt;/strong&gt; — one or two Lua scripts depending on which checks are configured: 1–2 round trips.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact cache GET&lt;/strong&gt; — Redis: 1 round trip.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Response write&lt;/strong&gt; — stream out.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On the &lt;strong&gt;fast-fail path&lt;/strong&gt; (rate-limited or over the spend cap), we short-circuit at step 2 with a 429 — no exact cache lookup, no upstream call, no response generation. That's the property that keeps a single gateway instance stable during abuse bursts: a runaway client or credential leak can't meaningfully consume gateway CPU because each &lt;code&gt;DENY&lt;/code&gt; takes ~2 ms of work and 0 provider cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Numbers
&lt;/h2&gt;

&lt;p&gt;Measured on a &lt;strong&gt;DigitalOcean 4 vCPU / 8 GB shared Linux droplet&lt;/strong&gt;, server-side from &lt;code&gt;gateway_logs.latency_ms&lt;/code&gt; (excludes client RTT):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Response&lt;/th&gt;
&lt;th&gt;p50&lt;/th&gt;
&lt;th&gt;p95&lt;/th&gt;
&lt;th&gt;p99&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;200 cache hit&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;3 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12 ms&lt;/td&gt;
&lt;td&gt;23 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;429 rate-limit&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~2 ms&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~6 ms&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Sustained throughput: &lt;strong&gt;~1,672 req/s&lt;/strong&gt; on a single instance.&lt;/p&gt;

&lt;p&gt;Methodology: &lt;code&gt;hey&lt;/code&gt; at concurrency 20, 200 requests, pre-warmed cache, then SQL query against the gateway's own log table for the actual server-side distribution. Bench script in the repo at &lt;code&gt;bench/load_test.sh&lt;/code&gt; — reproducible in ~5 minutes with &lt;code&gt;docker compose up -d postgres redis&lt;/code&gt; + &lt;code&gt;go run ./cmd/gateway&lt;/code&gt; + the bench script.&lt;/p&gt;

&lt;p&gt;The reason I quote server-side numbers and not &lt;code&gt;hey&lt;/code&gt; numbers is that &lt;code&gt;hey&lt;/code&gt; includes network RTT, OS scheduler noise, and TCP overhead that doesn't reflect how the gateway itself performs. The README has the full breakdown including 2 vCPU and MacBook M4 comparisons — short version: &lt;strong&gt;Docker Desktop on macOS is slower than a 4 vCPU Linux droplet&lt;/strong&gt; because every Redis round trip pays a 1–2 ms VM-network tax. Production numbers match the droplet rows, not the laptop row.&lt;/p&gt;

&lt;h3&gt;
  
  
  Caveats worth reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sample size is small&lt;/strong&gt; (~80 samples per run for p99). Enough to be directionally right, not tight enough to publish ±0.5 ms. Quote a range, not a single point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;p99 is GC- and connection-warmup-bound, not CPU-bound.&lt;/strong&gt; Throwing more hardware at it won't reliably push p99 below ~15 ms without GC tuning (&lt;code&gt;GOGC=200+&lt;/code&gt;) and pool pre-warming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;These are cache-hit numbers.&lt;/strong&gt; Cache misses are dominated by upstream provider latency (&lt;code&gt;gpt-4o-mini&lt;/code&gt; ≈ 300–800 ms to OpenAI). That's not gateway overhead.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's shipping today (v0.1.x)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;OpenAI-compatible &lt;code&gt;/v1/chat/completions&lt;/code&gt; and &lt;code&gt;/v1/models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SSE streaming across all four providers&lt;/strong&gt; — OpenAI, Anthropic, Gemini, Ollama. Chunks normalized to a single OpenAI-compatible shape regardless of upstream, with a trailing metadata frame carrying rounded &lt;code&gt;cost_usd&lt;/code&gt;, &lt;code&gt;usage&lt;/code&gt;, &lt;code&gt;latency_ms&lt;/code&gt;, and &lt;code&gt;provider&lt;/code&gt; before &lt;code&gt;[DONE]&lt;/code&gt;. Ollama's empty role-only chunks are filtered by default.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automatic cross-provider failover&lt;/strong&gt; on 429/5xx/timeout/connection errors. Four modes (&lt;code&gt;cloud_first&lt;/code&gt;, &lt;code&gt;local_first&lt;/code&gt;, &lt;code&gt;local_only&lt;/code&gt;, &lt;code&gt;cloud_only&lt;/code&gt;) — same env var flips between them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-API-key token-bucket rate limits&lt;/strong&gt; (atomic Redis Lua).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-project hard monthly spend caps&lt;/strong&gt; with pre-request cost estimation — blocks runaway prompts before the LLM call.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-customer spend caps&lt;/strong&gt; with &lt;code&gt;block&lt;/code&gt; or &lt;code&gt;downgrade&lt;/code&gt; behavior on overflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local Ollama support&lt;/strong&gt; — any pulled model auto-routable, $0 cost, full tier substitution for cloud failover.&lt;/li&gt;
&lt;li&gt;Exact + semantic caching with per-project toggles and thresholds.&lt;/li&gt;
&lt;li&gt;Customer labels (&lt;code&gt;X-LLM0-*&lt;/code&gt;) stored as JSONB on every log row for downstream analytics.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's not done yet
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Only four providers.&lt;/strong&gt; Bedrock, Azure OpenAI, Groq, Together, DeepSeek, xAI are on the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No Prometheus &lt;code&gt;/metrics&lt;/code&gt; endpoint yet.&lt;/strong&gt; Observability today is &lt;code&gt;gateway_logs&lt;/code&gt; in Postgres + response headers.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No structured logging yet&lt;/strong&gt; — &lt;code&gt;log/slog&lt;/code&gt; is planned.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No hosted/managed version yet.&lt;/strong&gt; Self-hosted works today; a managed product gets built only if there's enough real demand (waitlist is for validation, not gating).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Handler-level tests incomplete.&lt;/strong&gt; Benchmarks cover the hot path; mock-based handler tests are in progress.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Why I built this
&lt;/h2&gt;

&lt;p&gt;I wanted something fast, where I owned the provider keys, with per-user or ai agents spend limits, automatic failover across providers, and real cost savings from caching — both exact-match and semantic.&lt;/p&gt;

&lt;p&gt;There are plenty of LLM gateways out there — OpenRouter, Helicone, Portkey, LiteLLM, Bifrost. They're good. But each one I evaluated was missing at least one of those, and the self-hosted BYOK angle in particular wasn't well covered: your own provider keys, your data never touches a third-party cloud, no markup, no shared rate limits with other tenants.&lt;/p&gt;

&lt;p&gt;The performance work above is what makes the self-hosted story actually viable. If your gateway adds 50 ms of overhead, you're not really shipping a "lightweight" anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Code &amp;amp; links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/mrmushfiq/llm0-gateway" rel="noopener noreferrer"&gt;https://github.com/mrmushfiq/llm0-gateway&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Site:&lt;/strong&gt; &lt;a href="https://llm0.ai" rel="noopener noreferrer"&gt;https://llm0.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;MIT licensed. Runs from &lt;code&gt;docker compose up&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you spot something wrong, open an issue or comment below.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>go</category>
      <category>redis</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
