Your LLM API request passes through 7 layers before it reaches OpenAI. Authentication. Rate limiting. Cache lookup. Model routing. The upstream call itself. Fallback logic. Logging and cost attribution. Most teams have no idea what happens in between — or that the entire round trip adds less than 50 milliseconds.
This post breaks down every layer of an LLM proxy, what each one costs in latency, and why those 47 milliseconds determine whether your AI infrastructure scales — or quietly bankrupts you.
TL;DR
- An LLM proxy intercepts your API request and passes it through 7 processing layers in under 50ms — adding auth, caching, routing, failover, and cost tracking that the provider API doesn't give you.
- Proxy overhead (3-50ms) is under 3% of total request time. The cost of not having a proxy — untracked spend, zero failover, no per-feature attribution — is far higher.
- The setup is one line of code: change your `base_url`. Everything else stays the same.
What Is an LLM Proxy (and Why Should a CTO Care)?
An LLM proxy sits between your application code and the LLM provider. Your app sends requests to the proxy URL instead of directly to api.openai.com. The proxy handles everything else: authentication, routing, caching, logging, failover.
Think of it as an API gateway — but AI-aware. Traditional gateways (Kong, Nginx) understand HTTP. An LLM proxy understands tokens, models, prompt structure, and cost-per-request. It can make routing decisions based on task complexity, enforce per-team budget limits, and detect that 30% of your requests are semantically identical and cacheable.
The setup is one line of code:
```python
from openai import OpenAI

# Before
client = OpenAI(api_key="sk-...")

# After — same SDK, same code, different base URL
client = OpenAI(
    api_key="sk-...",
    base_url="https://proxy.preto.ai/v1",
)
```
Everything downstream — your prompts, your response handling, your error handling — stays the same. The proxy is transparent to your application code.
The 7 Layers Your Request Passes Through
Here's what happens in those 47 milliseconds, layer by layer.
Layer 1: Ingress and Authentication (~2-5ms)
The proxy receives your HTTP request and validates the API key. But unlike a direct OpenAI call, the key maps to an internal identity: a team, a project, a budget. Your upstream provider keys are never exposed to application code.
One leaked key doesn't compromise your entire OpenAI account — it compromises one team's allocation with a hard spending cap.
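A minimal sketch of that key-to-identity mapping, assuming an in-memory table for illustration (a real proxy would back this with a database or Redis, and the key names and caps here are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class KeyIdentity:
    team: str
    project: str
    monthly_cap_usd: float
    spent_usd: float = 0.0

# Virtual keys issued by the proxy; upstream provider keys never leave it.
KEY_TABLE = {
    "pk-team-search-01": KeyIdentity("search", "autocomplete", monthly_cap_usd=500.0),
    "pk-team-support-01": KeyIdentity("support", "chatbot", monthly_cap_usd=2000.0),
}

def authenticate(api_key: str) -> KeyIdentity:
    """Resolve a virtual key to a team/project identity with a hard spend cap."""
    identity = KEY_TABLE.get(api_key)
    if identity is None:
        raise PermissionError("unknown key")
    if identity.spent_usd >= identity.monthly_cap_usd:
        raise PermissionError(f"budget exhausted for team {identity.team}")
    return identity
```

The point of the indirection: revoking or capping `pk-team-search-01` touches one row, not your provider account.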
Layer 2: Rate Limiting and Budget Enforcement (~1-3ms)
Before the request goes anywhere, the proxy checks two things: Is this user within their rate limit? Is their team within its budget?
Smart proxies enforce token-level rate limits, not just request-level — because one 100K-context request is not the same as one 500-token classification. Budget checks happen in-memory (synced with Redis every ~10ms) so they don't block the request path.
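Here is a sketch of a token-level limiter using a sliding 60-second window, in-memory and single-node for illustration (a production proxy would sync this state across instances, e.g. via Redis):

```python
import time

class TokenRateLimiter:
    """Sliding-window rate limiter measured in tokens, not requests."""

    def __init__(self, tokens_per_minute: int):
        self.limit = tokens_per_minute
        self.window = []  # (timestamp, tokens) pairs within the last 60s

    def allow(self, tokens: int, now=None) -> bool:
        now = time.monotonic() if now is None else now
        # Drop usage older than the 60-second window.
        self.window = [(t, n) for t, n in self.window if now - t < 60]
        used = sum(n for _, n in self.window)
        if used + tokens > self.limit:
            return False  # one 100K-context request can exhaust the budget alone
        self.window.append((now, tokens))
        return True
```

Counting tokens instead of requests is what makes a 100K-context call and a 500-token classification cost what they actually cost.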
Layer 3: Cache Lookup (~1-8ms; hit returns in <5ms, saving 500ms-5s)
The proxy checks whether it has seen this request — or one semantically similar — before.
Exact caching hashes the prompt and returns an identical response.
Semantic caching generates an embedding, computes cosine similarity against recent requests, and returns a cached response if similarity exceeds a threshold.
A cache hit skips the LLM entirely: response in under 5ms instead of 2-5 seconds. In production, hit rates range from 20% to 45% depending on the use case — even 20% is a meaningful cost reduction.
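A minimal semantic-cache sketch: the `embed` function is a stand-in for a real embedding model, and the 0.92 similarity threshold is illustrative, not a recommendation.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed        # any text -> vector function
        self.threshold = threshold
        self.entries = []         # (embedding, cached response) pairs

    def get(self, prompt: str):
        vec = self.embed(prompt)
        best = max(self.entries, key=lambda e: cosine(vec, e[0]), default=None)
        if best and cosine(vec, best[0]) >= self.threshold:
            return best[1]        # semantically similar request seen before
        return None

    def put(self, prompt: str, response: str):
        self.entries.append((self.embed(prompt), response))
```

A real implementation would use an approximate-nearest-neighbor index rather than a linear scan, and expire entries, but the lookup logic is the same.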
Layer 4: Routing and Model Selection (~1-3ms)
If the request isn't cached, the proxy decides where to send it. Simple routing forwards to the model specified in the request. Advanced routing makes a decision: load balance across multiple Azure OpenAI deployments, select a cheaper model for simple tasks, or route based on headers or request patterns.
Cost-based routing — sending classification tasks to GPT-5 Mini instead of GPT-5 — can cut 80% of cost on affected requests with no accuracy loss.
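Cost-based routing can be as simple as a lookup on task type. A sketch, where the task taxonomy and the downgrade rule are illustrative assumptions:

```python
# Hypothetical task categories that are cheap-model-safe.
SIMPLE_TASKS = {"classification", "extraction", "sentiment"}

def route(task: str, requested_model: str = "gpt-5") -> str:
    """Downgrade simple tasks to the cheaper model; pass everything else through."""
    if task in SIMPLE_TASKS and requested_model == "gpt-5":
        return "gpt-5-mini"
    return requested_model
```

Real routers layer more signals on top (prompt length, headers, past accuracy per task), but the decision shape is this: classify the request, then pick the cheapest model that can handle it.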
Layer 5: Upstream Call + Streaming (~500ms-5,000ms)
The proxy forwards the request to the selected provider with the upstream API key. For streaming responses (stream: true), the proxy pipes tokens back to your application as they arrive — the client starts receiving output before the full response is generated.
The proxy also enforces request timeouts, killing requests that exceed a duration threshold before they waste tokens.
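A sketch of that deadline enforcement, wrapping any iterator of chunks (the chunk source stands in for a provider stream):

```python
import time

def stream_with_deadline(chunks, max_seconds: float):
    """Pipe tokens through as they arrive, aborting once the wall-clock deadline passes."""
    deadline = time.monotonic() + max_seconds
    for chunk in chunks:
        if time.monotonic() > deadline:
            # Stop before forwarding more output: every extra token is billed.
            raise TimeoutError(f"request exceeded {max_seconds}s, aborting")
        yield chunk
```

Because the wrapper is itself a generator, the client still receives the first tokens immediately; the deadline only cuts off runaway tails.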
Layer 6: Fallback and Retry (~0ms unless triggered: then 100-500ms)
If the primary provider returns a 429 (rate limit), 503 (service unavailable), or times out, the proxy retries with exponential backoff — then falls back to the next provider in the chain.
GPT-5 fails? Route to Claude Sonnet. Claude is down? Try Gemini Pro.
Circuit breakers monitor error rates per provider: when a provider crosses a failure threshold, it's automatically removed from the rotation and re-tested after a cooldown period. Teams running this report 99.97% effective uptime despite individual provider outages, with failover in milliseconds instead of the 5+ minutes it takes to update a hard-coded API key.
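A minimal circuit breaker plus fallback chain. The cooldown re-test is omitted for brevity, and the provider names and failure threshold are illustrative:

```python
class CircuitBreaker:
    """Remove a provider from rotation after N consecutive failures."""

    def __init__(self, failure_threshold: int = 5):
        self.failure_threshold = failure_threshold
        self.failures = {}  # provider -> consecutive failure count

    def record(self, provider: str, ok: bool):
        self.failures[provider] = 0 if ok else self.failures.get(provider, 0) + 1

    def available(self, provider: str) -> bool:
        return self.failures.get(provider, 0) < self.failure_threshold

def call_with_fallback(chain, breaker, send):
    """Try each healthy provider in order; `send` returns a response or raises."""
    for provider in chain:
        if not breaker.available(provider):
            continue  # circuit open: skip without wasting a request
        try:
            response = send(provider)
            breaker.record(provider, ok=True)
            return response
        except Exception:
            breaker.record(provider, ok=False)
    raise RuntimeError("all providers failed")
```

A production breaker would also track a cooldown timestamp and let one probe request through after it expires, closing the circuit again if the probe succeeds.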
Layer 7: Logging, Cost Attribution, and Response (~2-5ms, async)
As the response streams back, the proxy calculates cost (input tokens × input price + output tokens × output price), tags the request with team/feature/environment metadata, and ships the log to your observability backend.
This happens asynchronously — the client gets the response immediately. The log includes: model used, tokens consumed, cost, latency, cache hit/miss, which feature triggered it, and whether the request fell back to a secondary provider.
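A sketch of that async logging path: cost is computed inline, but the log entry goes onto a queue and a background worker ships it, so the client response is never blocked. Prices, field names, and the print-as-backend are all illustrative stand-ins.

```python
import json
import queue
import threading

PRICES = {"gpt-5": (1.25, 5.00)}  # (input, output) USD per 1M tokens

log_queue = queue.Queue()

def record_request(model, input_tokens, output_tokens, team, feature,
                   latency_ms, cache_hit):
    in_price, out_price = PRICES[model]
    cost = input_tokens / 1e6 * in_price + output_tokens / 1e6 * out_price
    # Enqueue and return immediately; the worker ships logs off the hot path.
    log_queue.put({
        "model": model, "cost_usd": round(cost, 6),
        "team": team, "feature": feature,
        "latency_ms": latency_ms, "cache_hit": cache_hit,
    })
    return cost

def log_worker():
    while True:
        entry = log_queue.get()
        print(json.dumps(entry))  # stand-in for an observability backend
        log_queue.task_done()

threading.Thread(target=log_worker, daemon=True).start()
```

For the document's example request (500 input + 300 output tokens on GPT-5), this works out to about $0.0021 per call, which is the per-request figure the savings math below is built on.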
47ms in Context: Why Proxy Overhead Doesn't Matter (and When It Does)
The proxy adds 7-25ms to a request that takes 500ms-5,000ms from the LLM itself. That's 0.5-3% overhead. For most teams, this is noise.
| Scenario | LLM Latency | Proxy Overhead | % Impact |
|---|---|---|---|
| Standard completion (GPT-5, 500 tokens out) | ~2,000ms | ~20ms | 1.0% |
| Streaming first token (TTFT) | ~300ms | ~20ms | 6.7% |
| Cache hit (semantic match) | <5ms | ~8ms | 160%* |
| Long-form generation (2K tokens) | ~8,000ms | ~20ms | 0.25% |
| Mini model classification | ~400ms | ~20ms | 5.0% |
*The cache hit row looks alarming — but the total response time is 13ms instead of 2,000ms. Your user got a response 150x faster.
The only scenario where proxy latency is a real concern: real-time applications with sub-100ms requirements and no caching benefit — voice AI, game NPCs, live translation. For these, a Rust or Go proxy (under 1ms overhead) is the right choice. For everything else, the 20ms is the best trade in your stack.
Proxy Architecture Patterns: Forward, Reverse, and Sidecar
Not all proxies work the same way. The architecture pattern determines your failure modes, your latency profile, and what features you can use.
Forward Proxy (Client-Side Integration)
Your application points at the proxy URL. The proxy forwards requests to the provider. This is the most common pattern (Portkey, LiteLLM, Preto). You get the full feature set: caching, routing, failover, cost tracking. The trade-off: the proxy is in the critical path.
Reverse Proxy (Edge-Deployed)
The proxy runs at the edge (e.g., Cloudflare Workers), intercepting requests globally with minimal latency. Helicone uses this pattern. Low latency from geographic proximity, but limited by what you can run in an edge function.
Sidecar / Async Observer
The proxy doesn't sit in the request path at all. Instead, it observes traffic after the fact — through SDK hooks, log tailing, or provider API polling. Langfuse advocates this approach. Zero latency impact, no single point of failure — but you lose caching, real-time routing, and failover.
The honest trade-off: A synchronous proxy creates a dependency. Run it as a horizontally scaled service behind a load balancer, with health checks and automatic instance replacement. Keep a direct-to-provider fallback for critical paths. This is standard infrastructure — the same way you'd deploy any API gateway.
What Proxy Overhead Actually Costs in Dollars
The proxy adds latency. It also saves money. Here's the math for a team running 100,000 LLM requests per day on GPT-5 ($1.25/1M input, $5.00/1M output) with an average of 500 input + 300 output tokens per request.
Monthly LLM spend without a proxy: $6,450/month
What the proxy saves:
- Semantic caching (30% hit rate): -$1,935/month
- Cost-based routing (40% of requests downgraded to GPT-5 Mini): -$1,548/month
- Budget enforcement (prevents 2 runaway features/quarter): -$800-2,000/quarter
- Automatic failover (avoids 3 provider outages/quarter): prevents 4-12 hours of downtime
Net result: $3,483/month in direct savings, plus avoided downtime. The proxy pays for itself in the first week.
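A quick sanity check of the base figure, assuming a 30-day month (the small gap versus the $6,450 above comes down to the day count used):

```python
requests_per_day = 100_000
input_tokens, output_tokens = 500, 300
input_price, output_price = 1.25, 5.00  # USD per 1M tokens

per_request = input_tokens / 1e6 * input_price + output_tokens / 1e6 * output_price
monthly_base = per_request * requests_per_day * 30   # ~ $6,375 with 30 days

cache_savings = monthly_base * 0.30   # 30% semantic cache hit rate
```

Either way, the two biggest levers (caching and routing) together recover roughly half the base spend before budget enforcement and failover are even counted.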
The Real Cost of Not Having a Proxy
Without a proxy, you have:
- No per-feature cost attribution. OpenAI gives you two fields for attribution: `user` and `project`. That's it. You can't see which feature is responsible for 60% of your bill.
- No automatic failover. When OpenAI goes down — and it does, multiple times per quarter — every AI feature in your product goes down with it. Manual failover takes 5+ minutes. At 3am, nobody is watching.
- No caching layer. Identical requests hit the LLM every time. The average production app sends 15-30% duplicate or near-duplicate requests.
- No budget enforcement. A new feature ships with a prompt that generates 2,000 output tokens per request instead of 300. Nobody notices until the monthly bill arrives 3x higher than expected.
The average production app we onboard discovers that 18% of its requests are cacheable on day one.
Build vs. Buy: The Decision Framework
Building a production-grade LLM proxy is a 6-12 month engineering effort. Based on published estimates:
- Core gateway (routing, auth, failover): $200K-$300K in engineering time
- Observability (logging, dashboards, alerting): $100K-$150K
- Prompt management UI: $100K-$150K
- Compliance and security (SOC 2, HIPAA): $50K-$100K/year ongoing
Total first-year investment: $450K-$700K, plus 6-12 months before your AI features ship with production-grade infrastructure.
One real case study: a team replaced their custom LLM manager with a managed proxy and removed 11,005 lines of code across 112 files.
Build if: LLM routing is your core product differentiator, you have unique compliance requirements, or your scale requires custom optimizations.
Buy if: You want to ship AI features this month, your engineering team should be building product not infrastructure, and your LLM spend is between $1K and $100K/month.
Latency Benchmarks by Implementation Language
| Proxy | Language | Overhead | Throughput | Note |
|---|---|---|---|---|
| Bifrost | Go | ~11μs at 5K RPS | 5,000+ RPS | Pure routing, no observability platform |
| TensorZero | Rust | <1ms P99 | 10,000 QPS | Built-in A/B testing |
| Helicone | Rust | ~1-5ms P95 | ~10,000 RPS | Edge-deployed on Cloudflare Workers |
| Portkey | Managed | <10ms | 1,000 RPS | Full-featured: guardrails, prompt mgmt |
| LiteLLM | Python | 3-50ms | 1,000 QPS | Most flexible (100+ providers) |
Rust and Go proxies handle 5-10x more throughput with 10-100x less overhead than Python. But LiteLLM has the largest provider coverage. For most teams under 1,000 RPS, the language doesn't matter. At 5,000+ RPS, it's the first thing that matters.
When You Don't Need a Proxy
Skip the proxy if:
- You're calling one model, from one service, at low volume
- Your LLM spend is under $500/month
- You need observability but not routing (an async observer works fine)
- You're still prototyping
Add the proxy when you have multiple models, multiple teams, real money at stake, and no visibility into where it's going.
We're building Preto.ai — LLM cost optimization that sits in your proxy layer. If you're evaluating options, the full build vs. buy decision checklist (12 questions, PDF) is linked below.