A team I worked with shipped their first LLM feature in two weeks. Six weeks later, they got a $47,000 OpenAI bill — for a free tier product.
The post-mortem found three things: one tenant ran a script that retried failed requests indefinitely, another had a buggy prompt that asked the model to "respond in ten thousand tokens," and a third was just abusive — they had discovered the API key was effectively unlimited and were running batch jobs through it.
There was no rate limit. No per-tenant budget. No cost ceiling. No audit trail. Just direct SDK calls from the application code straight to OpenAI.
If your team is shipping LLM features the same way, this post is for you. We will walk through a runnable Spring Boot LLM Gateway that sits between your clients and the provider, enforcing API keys, rate limits, token budgets, caching, and audit logging — the controls you need before going to production, not after.
Full source code, Docker Compose stack, and 9 execution screenshots are at exesolution.com. This post covers the architecture and the key design decisions.
What Direct SDK Usage Doesn't Give You
When your application code calls OpenAI directly, every request looks the same to the provider. They see one API key, one source, one bill. You can't:
Scope keys per tenant. A single shared key means one bad tenant takes down the whole product. Rotation is impossible without a coordinated multi-deploy.
Cap spend per tenant. Without a gateway, you find out you have blown the monthly budget when the invoice arrives. You can't throttle in real time.
Block runaway responses. A buggy prompt asking for 10,000 tokens executes happily. The provider does not know it is wrong; you only know after the fact.
Cache deterministic calls. Identical requests with temperature=0 are paid for every time. There is no shared cache layer because there is no shared layer at all.
Audit anything. When a customer complains "your AI gave me wrong information," you cannot reconstruct what was sent, what came back, or what model was used. The data is in OpenAI's logs, which you cannot query.
A gateway is the standard fix. The question is what controls it actually enforces.
The Gateway Architecture
The request pipeline has eight stages, each enforcing one specific concern:
```
Client
  POST /v1/chat/completions
  Authorization: Bearer <tenant_api_key>
        |
        v
Stage 1: Authentication      -> hashed key lookup, tenant resolution
Stage 2: Input normalization -> canonicalize model/params, count bytes
Stage 3: Policy decision     -> ALLOW / DEGRADE / BLOCK
Stage 4: Quota enforcement   -> rate limit + budget check (Redis)
Stage 5: Cache lookup        -> only if temperature=0 and policy allows
Stage 6: Provider call       -> bounded timeout, circuit breaker
Stage 7: Response filtering  -> strip provider metadata, redact PII
Stage 8: Audit + rollup      -> write to PostgreSQL, increment counters
        |
        v
Client receives response
```
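The stages above can be wired as a simple filter chain that short-circuits on a BLOCK decision. This is a minimal sketch, not the project's actual classes; `Pipeline`, `Ctx`, and `Decision` are illustrative names:

```java
import java.util.List;
import java.util.function.Function;

// Minimal sketch of the eight-stage pipeline as a chain of functions.
// Each stage transforms the request context or marks it BLOCK, which
// short-circuits the rest of the chain. All names are illustrative.
public class Pipeline {
    enum Decision { ALLOW, DEGRADE, BLOCK }

    record Ctx(String tenantId, String model, Decision decision) {
        Ctx with(Decision d) { return new Ctx(tenantId, model, d); }
    }

    private final List<Function<Ctx, Ctx>> stages;

    Pipeline(List<Function<Ctx, Ctx>> stages) { this.stages = stages; }

    Ctx run(Ctx ctx) {
        for (Function<Ctx, Ctx> stage : stages) {
            ctx = stage.apply(ctx);
            if (ctx.decision() == Decision.BLOCK) break; // stop the chain
        }
        return ctx;
    }
}
```

The point of the shape is that a BLOCK at stage 3 means stages 4 through 8 never run, so a rejected request never reaches the provider.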
The architecture has three storage components:
PostgreSQL holds the durable state: tenants, hashed API keys, policies, audit logs, daily usage rollups. Everything that survives a restart.
Redis holds the hot path: per-tenant rate limit counters, in-flight request semaphores, optional response cache. Optional but strongly recommended for any meaningful QPS.
Stateless gateway instances sit behind a load balancer. All state lives in PostgreSQL and Redis, so you can scale horizontally without coordination.
The Three Enforcement Modes
This is the design decision that makes or breaks the gateway. Most teams default to either "block everything that exceeds limits" or "log everything but never block." Both are wrong in different ways.
The gateway supports three modes, configured per tenant per policy:
HARD — Reject the request when the limit is hit. Returns 429 (rate limit) or 402 (budget exhausted) with a reason code. Use for tenants on metered plans where overage isn't allowed.
SOFT — Degrade the request instead of rejecting it. The gateway rewrites the request: switches to a cheaper model, lowers max_tokens, tightens parameters. The user gets a response — just not the premium-quality one. Use during traffic spikes where degraded service is better than a 4xx.
OBSERVE — Allow the request but flag it in the audit log. Critical for rolling out a new policy: you see exactly which tenants would have been blocked or degraded, without actually impacting them. Validate the policy with real traffic before flipping to HARD.
The OBSERVE mode is the practical one. You are never going to get policy thresholds right on the first try. Setting them, running in OBSERVE for two weeks, reviewing the would-have-blocked traffic, then switching to HARD or SOFT is the only safe rollout path.
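The mode-to-action mapping is small enough to sketch directly. This is an illustrative sketch, not the real policy engine; the enum and method names are assumptions:

```java
// Sketch of the per-request decision once a limit is exceeded.
// Names are illustrative; the real policy engine carries reason
// codes and degrade targets alongside the action.
public class Enforcement {
    enum Mode { HARD, SOFT, OBSERVE }
    enum Action { ALLOW, REJECT, DEGRADE, ALLOW_AND_FLAG }

    static Action decide(Mode mode, boolean limitExceeded) {
        if (!limitExceeded) return Action.ALLOW;
        return switch (mode) {
            case HARD    -> Action.REJECT;         // 429 / 402 with reason code
            case SOFT    -> Action.DEGRADE;        // cheaper model, lower max_tokens
            case OBSERVE -> Action.ALLOW_AND_FLAG; // serve normally, flag in audit log
        };
    }
}
```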
Data Model
Five tables cover the durable state.
`tenants`

```
id, name, status (ACTIVE/SUSPENDED), created_at
```

`api_keys` — keys are never stored in plaintext

```
id, tenant_id, key_hash, scopes, status,
created_at, last_used_at, rotated_at
```

`policies` — one row per tenant

```
tenant_id,
allowed_models (json),
max_prompt_bytes, max_input_tokens, max_output_tokens,
rate_limit_rps, max_inflight,
daily_budget_usd, monthly_budget_usd,
daily_token_cap, monthly_token_cap,
enforcement_mode (HARD/SOFT/OBSERVE),
redact_mode (NONE/BASIC/STRICT)
```

`usage_rollup_daily` — append-only counters, fast to aggregate

```
tenant_id, date,
requests, tokens_in, tokens_out, cost_usd_est, blocked_requests
```

`audit_log` — one row per request

```
request_id, tenant_id, key_id, model,
request_ts, latency_ms, tokens_in, tokens_out, cost_usd_est,
decision (ALLOW/BLOCK/DEGRADE), reason_code,
trace_id,
prompt_redacted, response_redacted  -- nullable, policy-driven
```
The split between usage_rollup_daily and audit_log matters. The rollup is queried in the hot path on every request to check budget; it is small and indexed by (tenant_id, date). The audit log is much larger but only queried during incident investigation. Don't merge them.
API Key Handling
Three rules, no exceptions.
Keys are hashed at rest. SHA-256 with a server-side salt shared by all gateway instances (the instances are stateless, so any of them must be able to verify any key). Constant-time comparison on lookup. The raw key is shown to the user once, at creation time, and then never again. If they lose it, they rotate it.
The Authorization header is never logged. Every audit entry references key_id (the database primary key), not the actual key value. Logs that capture HTTP requests have an explicit filter for the Authorization header.
Rotation is graceful. When a tenant rotates a key, the new key becomes active immediately. The old key continues working for a configurable grace period (default 24 hours) so deployments can roll out without downtime, then is automatically revoked.
This is straightforward Spring Security with a custom AuthenticationProvider. Nothing fancy — just disciplined.
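The hash-and-verify path can be sketched in a few lines of plain Java. The class and method names are illustrative, and the salt here stands in for a deployment-wide secret:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of hashed-key storage and verification. The salt stands in
// for a deployment-wide secret; names are illustrative.
public class ApiKeyHasher {
    static byte[] hash(String rawKey, String salt) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(salt.getBytes(StandardCharsets.UTF_8));
            return md.digest(rawKey.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always present
        }
    }

    // MessageDigest.isEqual compares in constant time, avoiding timing leaks.
    static boolean verify(String presentedKey, byte[] storedHash, String salt) {
        return MessageDigest.isEqual(hash(presentedKey, salt), storedHash);
    }

    static String toHex(byte[] h) { return HexFormat.of().formatHex(h); }
}
```

The one non-obvious choice is `MessageDigest.isEqual` instead of `Arrays.equals`: the former does not bail out on the first mismatched byte, so lookup timing leaks nothing about the stored hash.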
Rate Limiting and Budget Enforcement
Both run in Redis, and both follow the same pattern: a counter updated atomically on each request and checked against the policy threshold.
Rate limiting is per-tenant requests-per-second using a token bucket algorithm. The bucket size and refill rate come from the tenant's policy. A separate semaphore counter enforces max_inflight so a single tenant cannot queue thousands of concurrent requests.
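The bucket logic itself is small. Here is an in-memory sketch of it; in the gateway the same arithmetic lives in an atomic Redis script so that all stateless instances share one bucket per tenant (the class name and constructor are illustrative):

```java
// In-memory sketch of the token bucket the gateway runs in Redis.
// In production this arithmetic lives in an atomic Lua script so
// concurrent gateway instances share one bucket per tenant.
public class TokenBucket {
    private final double capacity;      // burst size, from tenant policy
    private final double refillPerSec;  // steady rate (rate_limit_rps)
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(double capacity, double refillPerSec, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSec = refillPerSec;
        this.tokens = capacity;          // start full: allow an initial burst
        this.lastRefillNanos = nowNanos;
    }

    // Returns true if the request may proceed, consuming one token.
    synchronized boolean tryAcquire(long nowNanos) {
        double elapsedSec = (nowNanos - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSec * refillPerSec);
        lastRefillNanos = nowNanos;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

Refilling lazily on each call (rather than on a timer) is what makes the Redis version cheap: one script execution per request, no background jobs.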
Budget enforcement is more interesting because the cost is not known until the response comes back. The flow:
- Before the call: estimate the cost using the request's `max_tokens` parameter and the configured price-per-token table, then check the estimate against the remaining budget.
- If the estimate would exceed the budget: apply HARD/SOFT/OBSERVE per the enforcement mode.
- After the call: parse the actual `usage` object from the provider response, compute the actual cost, and update `usage_rollup_daily` with the real number.
The pre-call estimate prevents a single 10,000-token request from blowing the monthly budget. The post-call true-up keeps the rollup accurate. The two-step approach is the only way to get both safety and accuracy.
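A sketch of the two steps, with made-up illustrative prices (the real gateway reads these from its versioned price table; all names here are assumptions):

```java
import java.util.Map;

// Sketch of the two-step budget check: a conservative pre-call estimate
// based on max_tokens, then a post-call true-up from the provider's
// actual usage. Prices are illustrative, not real OpenAI pricing.
public class BudgetCheck {
    // USD per 1K tokens, keyed by model (illustrative values only).
    static final Map<String, Double> PRICE_PER_1K_IN  = Map.of("gpt-4o-mini", 0.00015);
    static final Map<String, Double> PRICE_PER_1K_OUT = Map.of("gpt-4o-mini", 0.0006);

    // Worst case before the call: assume the response uses all of max_tokens.
    static double estimateUsd(String model, int promptTokens, int maxTokens) {
        return promptTokens / 1000.0 * PRICE_PER_1K_IN.get(model)
             + maxTokens    / 1000.0 * PRICE_PER_1K_OUT.get(model);
    }

    static boolean withinBudget(double spentTodayUsd, double dailyBudgetUsd,
                                double estimateUsd) {
        return spentTodayUsd + estimateUsd <= dailyBudgetUsd;
    }

    // After the call: recompute from the provider's actual token counts
    // and write this number, not the estimate, into usage_rollup_daily.
    static double actualUsd(String model, int tokensIn, int tokensOut) {
        return tokensIn  / 1000.0 * PRICE_PER_1K_IN.get(model)
             + tokensOut / 1000.0 * PRICE_PER_1K_OUT.get(model);
    }
}
```

Because the estimate assumes the full `max_tokens` is consumed, it is always greater than or equal to the actual cost, which is exactly the safety property the pre-call check needs.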
Caching
Caching LLM responses is dangerous if you are not careful. Two requests that look identical can have different intended outputs because of upstream context the gateway cannot see. So the cache only activates when:
- The policy explicitly allows caching for this route, AND
- `temperature=0` (deterministic output), AND
- The cache key includes `tenant_id + model + canonicalized prompt + relevant params`
The tenant_id in the cache key prevents cross-tenant leakage even if two tenants happen to send identical prompts. TTL is configured per route — short for personalized routes, long for generic prompts.
Every cache hit is recorded in the audit log with cache_hit=true. This matters for billing: cached responses do not incur provider cost, so the rollup correctly shows zero cost for those requests.
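Cache-key construction is the part worth getting exactly right. A minimal sketch, with illustrative names, that enforces both conditions from above:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

// Sketch of cache-key construction. Including tenant_id makes
// cross-tenant cache hits impossible even for identical prompts.
// The prompt is assumed to be canonicalized upstream; names are illustrative.
public class CacheKey {
    static String of(String tenantId, String model,
                     String canonicalPrompt, double temperature) {
        if (temperature != 0.0) {
            throw new IllegalArgumentException("cache only for temperature=0");
        }
        // '|' separators prevent ambiguous concatenations like ("ab","c") vs ("a","bc").
        String material = tenantId + "|" + model + "|" + canonicalPrompt + "|t=0";
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(material.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Hashing the material (rather than using it raw) keeps Redis keys fixed-length regardless of prompt size.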
Failure Modes
This is the section most gateway tutorials skip, and it is the section that determines whether the gateway is actually production-ready.
Provider outage (5xx, timeout) — Bounded retry (1-2 attempts) on transient errors only. Circuit breaker (Resilience4j) sheds load when the provider is consistently failing. Optional fallback: degrade to a cheaper alternative model.
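The project gets its circuit breaker from Resilience4j; as a rough illustration of the state machine such a breaker implements (closed, open, half-open), here is a hand-rolled sketch. This is not the Resilience4j API, and all names are illustrative:

```java
// Hand-rolled sketch of the circuit-breaker state machine the gateway
// gets from Resilience4j: after N consecutive failures the circuit
// opens and sheds load; after a cool-down it half-opens to probe.
public class Breaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final long openMillis;      // cool-down before probing again
    private int consecutiveFailures = 0;
    private long openedAt = 0;
    private State state = State.CLOSED;

    Breaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    synchronized boolean allowRequest(long nowMillis) {
        if (state == State.OPEN && nowMillis - openedAt >= openMillis) {
            state = State.HALF_OPEN; // let one probe request through
        }
        return state != State.OPEN;
    }

    synchronized void onSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void onFailure(long nowMillis) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;
            openedAt = nowMillis;
        }
    }
}
```

The key behavior: while open, requests fail fast at the gateway instead of stacking up in timeouts against a dead provider.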
Redis unavailable — Configurable behavior:
- HARD-FAIL: block all requests until Redis recovers (strict, but predictable)
- SOFT-FAIL: allow requests but log `quota_unavailable` (risky — tenants can exceed budgets undetected)
The default is HARD-FAIL. SOFT-FAIL is only appropriate when paired with strict per-instance rate limiting as a fallback.
Budget calculation drift — The pre-call estimate uses an approximate token count. The post-call true-up uses the provider's actual usage field. Daily rollups reconcile based on actuals. The price table is versioned, so historical audit records remain accurate even after pricing changes.
Key leakage — Hashed keys at rest, fast rotation, per-key rate limits as a circuit breaker if anomalous traffic is detected on a single key.
Running It
```bash
docker compose up -d
```
This brings up: Spring Boot gateway, PostgreSQL, Redis, and a mock provider for testing without burning real OpenAI tokens.
Bootstrap a tenant:
```bash
curl -s -X POST http://localhost:8080/admin/tenants \
  -H "Content-Type: application/json" \
  -d '{"name":"team-a"}'
```
Issue an API key:
```bash
curl -s -X POST http://localhost:8080/admin/tenants/<tenant-id>/keys
```
The response shows the raw key once. Save it. You won't see it again.
Set a policy with a low budget for testing:
```bash
curl -s -X PUT http://localhost:8080/admin/tenants/<tenant-id>/policy \
  -H "Content-Type: application/json" \
  -d '{
    "allowedModels": ["gpt-4o-mini"],
    "maxOutputTokens": 200,
    "rateLimitRps": 5,
    "dailyBudgetUsd": 0.10,
    "enforcementMode": "HARD"
  }'
```
Call the gateway:
```bash
curl -s -X POST http://localhost:8080/v1/chat/completions \
  -H "Authorization: Bearer sk-tenant-XXXXX" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "messages": [{"role":"user","content":"Hello"}]
  }'
```
Trigger a budget block by running the call in a loop until the daily limit is hit:
```bash
for i in {1..50}; do
  curl -s -X POST http://localhost:8080/v1/chat/completions \
    -H "Authorization: Bearer sk-tenant-XXXXX" \
    -H "Content-Type: application/json" \
    -d '{"model":"gpt-4o-mini","messages":[{"role":"user","content":"test"}]}'
done
```
Eventually you will see `BUDGET_EXCEEDED` in the response. Then inspect the audit log:
```bash
curl -s -u admin:admin-secret \
  "http://localhost:8080/admin/audit?tenantId=<tenant-id>&limit=10"
```
Each entry shows tokens, cost, decision (ALLOW/BLOCK), and a reason code.
What's in the Full Solution
The verified solution at exesolution.com contains everything to run this from scratch:
- Complete Spring Boot project: gateway controller, policy engine, rate limiter, audit writer, admin endpoints
- PostgreSQL schema with Flyway migrations for all 5 tables, including indexes for the hot-path queries
- Redis-backed token bucket implementation and in-flight semaphore
- Spring Security configuration: API key authentication for tenant routes, HTTP Basic for admin routes
- Docker Compose stack: gateway + PostgreSQL + Redis + mock provider
- Configurable price table for cost estimation across multiple models
- 9 evidence screenshots: build, startup, health, create tenant, issue API key, update policy, tenant call, admin visibility, usage dashboard
Full solution + runnable code + evidence at exesolution.com
Free registration required to access the code bundle and evidence images.
Where This Pays Off
The gateway pattern adds development time upfront, no question. The case for it gets clearer as you scale:
- The first time a tenant burns through their monthly budget in a day and you can throttle them in real time without redeploying — instead of finding out from the invoice.
- The first time a customer reports "your AI gave me wrong information" and you can reconstruct the exact request from the audit log in 30 seconds.
- The first time you rotate a leaked key without coordinating a multi-service deploy.
- The first time OBSERVE mode tells you a new policy would have blocked 12% of legitimate traffic, before you flip it to HARD.
If you are shipping LLM features in Spring Boot, the gateway is not a nice-to-have. It is the layer that lets you sleep at night.
Have questions about a specific part of the pipeline — rate limiting algorithm, audit log schema, key rotation flow? Drop a comment below.