We built our LLM proxy in Go. Not Rust. Not Python. Here's the engineering trade-off nobody talks about: the language that's fastest in benchmarks isn't always the language that ships the fastest product.
TL;DR
- Go handles 5,000+ RPS with ~11 microseconds of overhead per request — more than enough for 99% of LLM proxy workloads.
- Rust is faster (sub-1ms P99 at 10K QPS), but the development velocity trade-off isn't worth it unless you're building for hyperscale.
- Python (LiteLLM) hits a wall at ~1,000 QPS due to the GIL — fine for prototyping, problematic for production traffic.
The Three Contenders
When we started building Preto's proxy layer, we had three options on the table. Each had a strong case.
Python was the obvious first choice. The LLM ecosystem lives in Python. LiteLLM — the most popular open-source proxy — is Python. Every provider SDK is Python-first. We could ship a working proxy in a weekend.
Rust was the performance choice. TensorZero and Helicone both use Rust. Sub-millisecond P99 latency at 10,000 QPS. Memory safety guarantees. If we wanted to claim "the fastest proxy," Rust was the path.
Go was the pragmatic choice. Bifrost (the open-source proxy that benchmarks 50x faster than LiteLLM) is written in Go. Goroutines make concurrent streaming connections trivial. The standard library includes a production-grade HTTP server. And we could hire for it.
The Benchmark That Settled the Python Question
We ruled Python out first — not because it's slow in theory, but because it's slow in practice at our target scale.
LiteLLM's own published benchmarks tell the story:
- At 500 RPS: Stable. ~40ms overhead. Acceptable.
- At 1,000 RPS: Memory climbs to 4GB+. Latency variance increases.
- At 2,000 RPS: Timeouts start. Memory hits 8GB+. Requests fail.
The culprit is Python's Global Interpreter Lock. An LLM proxy is fundamentally a concurrent I/O problem — you're holding thousands of open streaming connections simultaneously. Python's asyncio helps, but the GIL still serializes CPU-bound work: JSON parsing, token counting, cost calculation, log serialization. Under load, these add up.
LiteLLM's team knows this. They've announced a Rust sidecar to handle the hot path. That's telling — even the most popular Python proxy is moving critical code out of Python.
Note: Python isn't wrong — it's wrong for this. If your LLM traffic is under 500 RPS and you need maximum provider coverage, LiteLLM is a solid choice. It supports 100+ providers with battle-tested adapters. The performance ceiling only matters if you're going to hit it.
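For contrast, the CPU-bound bookkeeping the GIL serializes — JSON parsing, token counting — parallelizes freely across cores in Go. A minimal sketch (countTokens and its crude whitespace tokenizer are illustrative, not anyone's production code):

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"sync"
)

// countTokens parses a response chunk and does a rough token count —
// the kind of per-request bookkeeping an LLM proxy runs constantly.
func countTokens(payload []byte) (int, error) {
	var msg struct{ Content string }
	if err := json.Unmarshal(payload, &msg); err != nil {
		return 0, err
	}
	return len(strings.Fields(msg.Content)), nil // crude whitespace tokenizer
}

func main() {
	payloads := make([][]byte, 1000)
	for i := range payloads {
		payloads[i] = []byte(`{"content":"the quick brown fox"}`)
	}

	var wg sync.WaitGroup
	counts := make([]int, len(payloads))
	for i, p := range payloads {
		wg.Add(1)
		go func(i int, p []byte) { // each chunk parses on whatever core is free
			defer wg.Done()
			counts[i], _ = countTokens(p)
		}(i, p)
	}
	wg.Wait()
	fmt.Println(counts[0]) // 4
}
```

No lock serializes these goroutines; the Go scheduler spreads them across OS threads.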
Go vs. Rust: Where the Decision Gets Interesting
With Python out, the real comparison begins. Here's what we measured and researched:
| Dimension | Go | Rust |
|---|---|---|
| Proxy overhead | ~11μs at 5K RPS | <1ms P99 at 10K QPS |
| Max throughput (single instance) | 5,000+ RPS | 10,000+ QPS |
| Memory under load | ~200MB at 5K RPS | ~50MB at 10K QPS |
| Concurrency model | Goroutines (lightweight) | async/await (Tokio) |
| Streaming HTTP support | stdlib net/http | hyper/axum (good, more code) |
| Time to implement proxy MVP | ~2 weeks | ~5-6 weeks |
| Hiring pool | Large (DevOps, backend) | Small (systems specialists) |
| Compile times | ~5 seconds | ~2-5 minutes |
| Binary size | ~15MB | ~8MB |
The performance numbers are close enough to not matter for our use case. The development velocity numbers are not.
The Factor That Made It Obvious: Goroutines and Streaming
An LLM proxy's core job is holding thousands of concurrent HTTP connections open while streaming tokens back to clients. This is where Go's goroutine model shines.
In Go, every incoming request gets its own goroutine. Streaming the response is straightforward:
```go
func proxyHandler(w http.ResponseWriter, r *http.Request) {
	// Forward to the upstream LLM provider (upstreamReq is built from r
	// earlier in the chain; elided here)
	resp, err := http.DefaultClient.Do(upstreamReq)
	if err != nil {
		handleFallback(w, r) // try the next provider
		return
	}
	defer resp.Body.Close()

	// Streaming requires an http.Flusher; check instead of assuming
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// Stream tokens back as they arrive
	buf := make([]byte, 4096)
	for {
		n, err := resp.Body.Read(buf)
		if n > 0 {
			w.Write(buf[:n])
			flusher.Flush() // push the chunk to the client immediately

			// Cost tracking runs off the request path; copy the chunk
			// because buf is reused on the next Read
			chunk := append([]byte(nil), buf[:n]...)
			go trackTokens(chunk)
		}
		if err != nil {
			break
		}
	}
}
```
That's the core loop. In Rust, the equivalent code involves async/await, Pin<Box<dyn Stream>>, lifetime annotations, and careful ownership management. It's not harder conceptually — it's harder in practice, every time you refactor or add a new feature.
When your proxy needs to add a new middleware layer — say, budget enforcement before routing — the Go version is a new function in the chain. The Rust version often requires restructuring lifetimes and trait bounds across multiple files.
The Real-World Request Lifecycle in Our Go Proxy
Here's how a request flows through our stack, with timing at each stage:
1. TLS termination + HTTP parse — handled by Go's net/http server. ~1ms.
2. API key lookup + team resolution — in-memory map with Redis sync every 10ms. ~0.5ms.
3. Rate limit check — token-bucket algorithm in a goroutine-safe map. ~0.1ms.
4. Budget enforcement — check the team's monthly spend against its cap. ~0.2ms.
5. Cache probe — SHA-256 hash of prompt + model + params, checked against a local cache with Redis fallback. ~1-3ms.
6. Route selection — match model to upstream endpoint, apply load-balancing weights. ~0.1ms.
7. Upstream call + streaming — a goroutine holds the connection and pipes data: chunks back. 500ms-5,000ms (the LLM).
8. Async logging — cost calculation and log entry shipped to ClickHouse via a buffered channel. ~0ms on the request path (fires in a background goroutine).
Total proxy overhead: ~5-8ms. The LLM takes 500-5,000ms. Our proxy is under 1% of total request time.
What We'd Choose Rust For
This isn't a "Go is better than Rust" argument. It's a "Go is better for our constraints" argument. We'd choose Rust if:
- We needed to handle 10,000+ QPS on a single instance. At that scale, Rust's zero-cost abstractions and lack of GC pauses become meaningful.
- Memory was a hard constraint. Rust's 50MB footprint vs. Go's 200MB matters if you're running on edge nodes or embedded devices.
- The proxy was the entire product. If our company was an LLM proxy company, spending 3x longer on the core engine is justified. Our proxy is infrastructure — the product is cost intelligence built on top.
TensorZero made the right call choosing Rust — their proxy IS the product, they need built-in A/B testing at wire speed, and they're targeting the highest-throughput tier. Helicone made the right call choosing Rust — they run on Cloudflare Workers at the edge, where memory and cold start time matter.
For a cost intelligence platform where the proxy is the data collection layer? Go is the right tool.
Lessons From 6 Months in Production
Three things surprised us after shipping:
1. Garbage collection pauses are a non-issue. Go's GC has improved dramatically. At 3,000 RPS, our P99 GC pause is under 500 microseconds. We were prepared to tune GOGC — we never needed to.
2. The standard library HTTP server is production-ready. We started with Go's net/http and never moved to a framework. It handles keep-alive, connection pooling, graceful shutdown, and HTTP/2 out of the box. One less dependency.
3. Goroutine leaks are the real danger. Early on, we had a bug where failed upstream connections weren't properly closed, leaking goroutines. runtime.NumGoroutine() caught it — but only after goroutine count climbed from 200 to 45,000 over a weekend. Monitor goroutine count as a first-class metric from day one.
The Build vs. Buy Question
If you're evaluating whether to build your own proxy or use a managed solution, the math is sobering: a production-grade proxy is a 6-12 month engineering effort, roughly $450K-$700K in first-year engineering time when you include observability, a management UI, and compliance work.
One team we onboarded had built their own LLM manager — a reasonable decision at the time. When they migrated to a managed proxy, they removed 11,005 lines of code across 112 files.
Build if LLM routing is your core product differentiator. Buy if you want to ship AI features this month.
We're building Preto.ai — LLM cost optimization that sits in your proxy layer. Free for up to 10K requests. See what your LLM spend actually looks like.