Your LLM API request passes through 7 layers before it reaches OpenAI. Authentication. Rate limiting. Cache lookup. Model routing. The upstream call itself. Fallback logic. Logging and cost attribution. Most teams have no idea what happens in between — or that the entire round trip adds less than 50 milliseconds.
This post breaks down every layer of an LLM proxy, what each one costs in latency, and why those 47 milliseconds determine whether your AI infrastructure scales — or quietly bankrupts you.
TL;DR
- An LLM proxy intercepts your API request and passes it through 7 processing layers in under 50ms — adding auth, caching, routing, failover, and cost tracking that the provider API doesn't give you.
- Proxy overhead (3-50ms) is under 3% of total request time. The cost of not having a proxy — untracked spend, zero failover, no per-feature attribution — is far higher.
- The setup is one line of code: change your `base_url`. Everything else stays the same.
What Is an LLM Proxy (and Why Should a CTO Care)?
An LLM proxy sits between your application code and the LLM provider. Your app sends requests to the proxy URL instead of directly to api.openai.com. The proxy handles everything else: authentication, routing, caching, logging, failover.
Think of it as an API gateway — but AI-aware. Traditional gateways (Kong, Nginx) understand HTTP. An LLM proxy understands tokens, models, prompt structure, and cost-per-request. It can make routing decisions based on task complexity, enforce per-team budget limits, and detect that 30% of your requests are semantically identical and cacheable.
The setup is one line of code:
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — same SDK, same code, different base URL
client = OpenAI(
    api_key="sk-...",
    base_url="https://proxy.preto.ai/v1",
)
```
Everything downstream — your prompts, your response handling, your error handling — stays the same. The proxy is transparent to your application code.
The 7 Layers Your Request Passes Through
Here's what happens in those 47 milliseconds, layer by layer.
Layer 1: Ingress and Authentication (~2-5ms)
The proxy receives your HTTP request and validates the API key. But unlike a direct OpenAI call, the key maps to an internal identity: a team, a project, a budget. Your upstream provider keys are never exposed to application code.
One leaked key doesn't compromise your entire OpenAI account — it compromises one team's allocation with a hard spending cap.
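A minimal sketch of that key-to-identity mapping, assuming an in-memory table for illustration (a real proxy would back this with a database or Redis, and the key names and caps here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class KeyIdentity:
    team: str
    project: str
    monthly_cap_usd: float
    spent_usd: float = 0.0

# Virtual keys issued by the proxy; upstream provider keys never leave it.
KEY_TABLE = {
    "pk-team-search-01": KeyIdentity("search", "autocomplete", monthly_cap_usd=500.0),
    "pk-team-support-01": KeyIdentity("support", "chatbot", monthly_cap_usd=2000.0),
}

def authenticate(api_key: str) -> KeyIdentity:
    """Resolve a virtual key to a team/project identity with a hard spend cap."""
    identity = KEY_TABLE.get(api_key)
    if identity is None:
        raise PermissionError("unknown key")
    if identity.spent_usd >= identity.monthly_cap_usd:
        raise PermissionError(f"budget exhausted for team {identity.team}")
    return identity
```

The point of the indirection: revoking or capping `pk-team-search-01` touches one row, not your provider account.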
Layer 2: Rate Limiting and Budget Enforcement (~1-3ms)
Before the request goes anywhere, the proxy checks two things: Is this user within their rate limit? Is their team within its budget?
Smart proxies enforce token-level rate limits, not just request-level — because one 100K-context request is not the same as one 500-token classification. Budget checks happen in-memory (synced with Redis every ~10ms) so they don't block the request path.
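Here is a sketch of a token-level limiter using a sliding 60-second window, in-memory and single-node for illustration (a production proxy would sync this state across instances, e.g. via Redis):

```python
import time

class TokenRateLimiter:
    """Sliding-window rate limiter measured in tokens, not requests."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window = []  # (timestamp, tokens) pairs within the last 60s

    def allow(self, tokens: int, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop usage older than the 60-second window.
        self.window = [(t, n) for t, n in self.window if now - t < 60]
        used = sum(n for _, n in self.window)
        if used + tokens > self.limit:
            return False  # one 100K-context request can exhaust the budget alone
        self.window.append((now, tokens))
        return True
```

Counting tokens instead of requests is what makes a 100K-context call and a 500-token classification cost what they actually cost.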
Layer 3: Cache Lookup (~1-8ms; hit returns in <5ms, saving 500ms-5s)
The proxy checks whether it has seen this request — or one semantically similar — before.
Exact caching hashes the prompt and returns an identical response.
Semantic caching generates an embedding, computes cosine similarity against recent requests, and returns a cached response if similarity exceeds a threshold.
A cache hit skips the LLM entirely: response in under 5ms instead of 2-5 seconds. In production, hit rates range from 20% to 45% depending on the use case — even 20% is a meaningful cost reduction.
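A minimal semantic-cache sketch: the `embed` function is a stand-in for a real embedding model, and the 0.92 similarity threshold is illustrative, not a recommendation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed        # any text -> vector function
        self.threshold = threshold
        self.entries = []         # (embedding, cached response) pairs

    def get(self, prompt: str):
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]        # semantically similar request seen before
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```

A real implementation would use an approximate-nearest-neighbor index rather than a linear scan, and expire entries, but the lookup logic is the same.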
Layer 4: Routing and Model Selection (~1-3ms)
If the request isn't cached, the proxy decides where to send it. Simple routing forwards to the model specified in the request. Advanced routing makes a decision: load balance across multiple Azure OpenAI deployments, select a cheaper model for simple tasks, or route based on headers or request patterns.
Cost-based routing — sending classification tasks to GPT-5 Mini instead of GPT-5 — can cut 80% of cost on affected requests with no accuracy loss.
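Cost-based routing can be as simple as a lookup on task type. A sketch, where the task taxonomy and the downgrade rule are illustrative assumptions:

```python
# Hypothetical task categories that are cheap-model-safe.
SIMPLE_TASKS = {"classification", "extraction", "sentiment"}

def route(task: str, requested_model: str = "gpt-5") -> str:
    """Downgrade simple tasks to the cheaper model; pass everything else through."""
    if task in SIMPLE_TASKS and requested_model == "gpt-5":
        return "gpt-5-mini"
    return requested_model
```

Real routers layer more signals on top (prompt length, headers, past accuracy per task), but the decision shape is this: classify the request, then pick the cheapest model that can handle it.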
Layer 5: Upstream Call + Streaming (~500ms-5,000ms)
The proxy forwards the request to the selected provider with the upstream API key. For streaming responses (stream: true), the proxy pipes tokens back to your application as they arrive — the client starts receiving output before the full response is generated.
The proxy also enforces request timeouts, killing requests that exceed a duration threshold before they waste tokens.
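A sketch of that deadline enforcement, wrapping any iterator of chunks (the chunk source stands in for a provider stream):

```python
import time

def stream_with_deadline(chunks, max_seconds: float):
    """Pipe tokens through as they arrive, aborting once the wall-clock deadline passes."""
    deadline = time.monotonic() + max_seconds
    for chunk in chunks:
        if time.monotonic() > deadline:
            # Stop before forwarding more output: every extra token is billed.
            raise TimeoutError(f"request exceeded {max_seconds}s, aborting")
        yield chunk
```

Because the wrapper is itself a generator, the client still receives the first tokens immediately; the deadline only cuts off runaway tails.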
Layer 6: Fallback and Retry (~0ms unless triggered: then 100-500ms)
If the primary provider returns a 429 (rate limit), 503 (service unavailable), or times out, the proxy retries with exponential backoff — then falls back to the next provider in the chain.
GPT-5 fails? Route to Claude Sonnet. Claude is down? Try Gemini Pro.
Circuit breakers monitor error rates per provider: when a provider crosses a failure threshold, it's automatically removed from the rotation and re-tested after a cooldown period. Teams running this report 99.97% effective uptime despite individual provider outages, with failover in milliseconds instead of the 5+ minutes it takes to update a hard-coded API key.
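A minimal circuit breaker plus fallback chain. The cooldown re-test is omitted for brevity, and the provider names and failure threshold are illustrative:

```python
class CircuitBreaker:
    """Remove a provider from rotation after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.failures = {}  # provider -> consecutive failure count

    def record(self, provider: str, ok: bool):
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1

    def available(self, provider: str) -> bool:
        return self.failures.get(provider, 0) < self.failure_threshold

def call_with_fallback(chain, breaker, send):
    """Try each healthy provider in order; `send` returns a response or raises."""
    for provider in chain:
        if not breaker.available(provider):
            continue  # circuit open: skip without wasting a request
        try:
            response = send(provider)
            breaker.record(provider, ok=True)
            return response
        except Exception:
            breaker.record(provider, ok=False)
    raise RuntimeError("all providers failed")
```

A production breaker would also track a cooldown timestamp and let one probe request through after it expires, closing the circuit again if the probe succeeds.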
Layer 7: Logging, Cost Attribution, and Response (~2-5ms, async)
As the response streams back, the proxy calculates cost (input tokens × input price + output tokens × output price), tags the request with team/feature/environment metadata, and ships the log to your observability backend.
This happens asynchronously — the client gets the response immediately. The log includes: model used, tokens consumed, cost, latency, cache hit/miss, which feature triggered it, and whether the request fell back to a secondary provider.
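A sketch of that async logging path: cost is computed inline, but the log entry goes onto a queue and a background worker ships it, so the client response is never blocked. Prices, field names, and the print-as-backend are all illustrative stand-ins.

```python
import json
import queue
import threading

PRICES = {"gpt-5": (1.25, 5.00)}  # (input, output) USD per 1M tokens

log_queue = queue.Queue()

def record_request(model, input_tokens, output_tokens, team, feature,
                   latency_ms, cache_hit):
    in_price, out_price = PRICES[model]
    cost = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    # Enqueue and return immediately; the worker ships logs off the hot path.
    log_queue.put({
        "model": model, "cost_usd": round(cost, 6),
        "team": team, "feature": feature,
        "latency_ms": latency_ms, "cache_hit": cache_hit,
    })
    return cost

def log_worker():
    while True:
        entry = log_queue.get()
        print(json.dumps(entry))  # stand-in for an observability backend
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()
```

For the document's example request (500 input + 300 output tokens on GPT-5), this works out to about $0.0021 per call, which is the per-request figure the savings math below is built on.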
47ms in Context: Why Proxy Overhead Doesn't Matter (and When It Does)
The proxy adds 7-25ms to a request that takes 500ms-5,000ms from the LLM itself. That's 0.5-3% overhead. For most teams, this is noise.
| Scenario | LLM Latency | Proxy Overhead | % Impact |
|---|---|---|---|
| Standard completion (GPT-5, 500 tokens out) | ~2,000ms | ~20ms | 1.0% |
| Streaming first token (TTFT) | ~300ms | ~20ms | 6.7% |
| Cache hit (semantic match) | <5ms | ~8ms | 160%* |
| Long-form generation (2K tokens) | ~8,000ms | ~20ms | 0.25% |
| Mini model classification | ~400ms | ~20ms | 5.0% |
*The cache hit row looks alarming — but the total response time is 13ms instead of 2,000ms. Your user got a response 150x faster.
The only scenario where proxy latency is a real concern: real-time applications with sub-100ms requirements and no caching benefit — voice AI, game NPCs, live translation. For these, a Rust or Go proxy (under 1ms overhead) is the right choice. For everything else, the 20ms is the best trade in your stack.
Proxy Architecture Patterns: Forward, Reverse, and Sidecar
Not all proxies work the same way. The architecture pattern determines your failure modes, your latency profile, and what features you can use.
Forward Proxy (Client-Side Integration)
Your application points at the proxy URL. The proxy forwards requests to the provider. This is the most common pattern (Portkey, LiteLLM, Preto). You get the full feature set: caching, routing, failover, cost tracking. The trade-off: the proxy is in the critical path.
Reverse Proxy (Edge-Deployed)
The proxy runs at the edge (e.g., Cloudflare Workers), intercepting requests globally with minimal latency. Helicone uses this pattern. Low latency from geographic proximity, but limited by what you can run in an edge function.
Sidecar / Async Observer
The proxy doesn't sit in the request path at all. Instead, it observes traffic after the fact — through SDK hooks, log tailing, or provider API polling. Langfuse advocates this approach. Zero latency impact, no single point of failure — but you lose caching, real-time routing, and failover.
The honest trade-off: A synchronous proxy creates a dependency. Run it as a horizontally scaled service behind a load balancer, with health checks and automatic instance replacement. Keep a direct-to-provider fallback for critical paths. This is standard infrastructure — the same way you'd deploy any API gateway.
What Proxy Overhead Actually Costs in Dollars
The proxy adds latency. It also saves money. Here's the math for a team running 100,000 LLM requests per day on GPT-5 ($1.25/1M input, $5.00/1M output) with an average of 500 input + 300 output tokens per request.
Monthly LLM spend without a proxy: $6,450/month
What the proxy saves:
- Semantic caching (30% hit rate): -$1,935/month
- Cost-based routing (40% of requests downgraded to GPT-5 Mini): -$1,548/month
- Budget enforcement (prevents 2 runaway features/quarter): -$800-2,000/quarter
- Automatic failover (avoids 3 provider outages/quarter): prevents 4-12 hours of downtime
Net result: $3,483/month in direct savings, plus avoided downtime. The proxy pays for itself in the first week.
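A quick sanity check of the base figure, assuming a 30-day month (the small gap versus the $6,450 above comes down to the day count used):

```python
requests_per_day = 100_000
input_tokens, output_tokens = 500, 300
input_price, output_price = 1.25, 5.00  # USD per 1M tokens

per_request = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
monthly_base = per_request * requests_per_day * 30   # ~ $6,375 with 30 days

cache_savings = monthly_base * 0.30   # 30% semantic cache hit rate
```

Either way, the two biggest levers (caching and routing) together recover roughly half the base spend before budget enforcement and failover are even counted.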
The Real Cost of Not Having a Proxy
Without a proxy, you have:
- No per-feature cost attribution. OpenAI gives you two fields for attribution: `user` and `project`. That's it. You can't see which feature is responsible for 60% of your bill.
- No automatic failover. When OpenAI goes down — and it does, multiple times per quarter — every AI feature in your product goes down with it. Manual failover takes 5+ minutes. At 3am, nobody is watching.
- No caching layer. Identical requests hit the LLM every time. The average production app sends 15-30% duplicate or near-duplicate requests.
- No budget enforcement. A new feature ships with a prompt that generates 2,000 output tokens per request instead of 300. Nobody notices until the monthly bill arrives 3x higher than expected.
The average production app we onboard discovers that 18% of its requests are cacheable on day one.
Build vs. Buy: The Decision Framework
Building a production-grade LLM proxy is a 6-12 month engineering effort. Based on published estimates:
- Core gateway (routing, auth, failover): $200K-$300K in engineering time
- Observability (logging, dashboards, alerting): $100K-$150K
- Prompt management UI: $100K-$150K
- Compliance and security (SOC 2, HIPAA): $50K-$100K/year ongoing
Total first-year investment: $450K-$700K, plus 6-12 months before your AI features ship with production-grade infrastructure.
One real case study: a team replaced their custom LLM manager with a managed proxy and removed 11,005 lines of code across 112 files.
Build if: LLM routing is your core product differentiator, you have unique compliance requirements, or your scale requires custom optimizations.
Buy if: You want to ship AI features this month, your engineering team should be building product not infrastructure, and your LLM spend is between $1K and $100K/month.
Latency Benchmarks by Implementation Language
| Proxy | Language | Overhead | Throughput | Note |
|---|---|---|---|---|
| Bifrost | Go | ~11μs at 5K RPS | 5,000+ RPS | Pure routing, no observability platform |
| TensorZero | Rust | <1ms P99 | 10,000 QPS | Built-in A/B testing |
| Helicone | Rust | ~1-5ms P95 | ~10,000 RPS | Edge-deployed on Cloudflare Workers |
| Portkey | Managed | <10ms | 1,000 RPS | Full-featured: guardrails, prompt mgmt |
| LiteLLM | Python | 3-50ms | 1,000 QPS | Most flexible (100+ providers) |
Rust and Go proxies handle 5-10x more throughput with 10-100x less overhead than Python. But LiteLLM has the largest provider coverage. For most teams under 1,000 RPS, the language doesn't matter. At 5,000+ RPS, it's the first thing that matters.
When You Don't Need a Proxy
Skip the proxy if:
- You're calling one model, from one service, at low volume
- Your LLM spend is under $500/month
- You need observability but not routing (an async observer works fine)
- You're still prototyping
Add the proxy when you have multiple models, multiple teams, real money at stake, and no visibility into where it's going.
We're building Preto.ai — LLM cost optimization that sits in your proxy layer. If you're evaluating options, the full build vs. buy decision checklist (12 questions, PDF) is linked below.