DEV Community

Cover image for LiteLLM Is Moving to Rust. Here's What the Benchmarks Look Like.
Paul Twist
Paul Twist

Posted on • Originally published at docs.litellm.ai

LiteLLM Is Moving to Rust. Here's What the Benchmarks Look Like.

I run LiteLLM as my AI gateway. 100+ providers, one OpenAI-compatible API. It works, it scales, I like it. But after a year of pushing traffic through the Python proxy, one thing kept bugging me: memory.

Under concurrent load, the Python proxy peaks around 359MB. Multiply that across pods, regions, retries. OOM kills at the worst possible time. You know the feeling.

LiteLLM just announced they're migrating the entire hot path to Rust. Not a rewrite. Not a v2. Same config.yaml, same database, same API. The runtime underneath just gets faster.

I went through their benchmark numbers. They look real.

The numbers

Rust gateway LiteLLM Python
Per-request overhead ~0.05ms ~7.5ms
Throughput (50 concurrent) 6,782 req/s 453 req/s
Peak memory under load 31.7MB 358.9MB

15x throughput. 11x less memory. 150x lower per-request overhead. The harness is checked into the repo so you can reproduce it yourself.

For most workloads, gateway overhead is noise compared to model latency. A Claude call takes 500ms to 30s. Adding 7ms vs 0.05ms, who cares. But for high-throughput stuff like classification batches, embeddings at scale, or coding agents hammering completions, it adds up fast.

How they're doing it

The migration is a clean four-stage plan:

Stage 0: Pure Python (today)
Stage 1: Rust core via PyO3, Python still does I/O
Stage 2: FastAPI thin shell, entire hot path in Rust
Stage 3: Pure Rust server (axum), Python plugins in sidecar
Enter fullscreen mode Exit fullscreen mode

What I like about this approach: they're not flipping everything at once. Each route moves individually. OCR first (smallest surface, no streaming). Then /v1/messages (adds streaming). Then /chat/completions (largest param surface). One provider at a time, parity check gates every step.

The Rust core is pure transforms. It turns your request into a provider request, turns the response back, handles stream chunks, counts tokens. No sockets, no secrets, no database access. Python keeps doing I/O until Stage 3. Clean separation.

Timeline

Aug 15  - litellm.ocr() → Rust
Sep 1   - /messages, /chat/completions → Rust
Sep 15  - Router (load balancing, fallbacks, retries) → Rust
Dec 1   - Full server: axum replaces FastAPI
Enter fullscreen mode Exit fullscreen mode

What stays the same

Everything you care about:

  • Same config.yaml
  • Same database and schema
  • Same client API, same request/response shapes
  • Same providers, routing, keys
  • Custom Python plugins keep working in the sidecar

You deploy the Rust binary, it uses ~65MB of memory, overhead stays under 1ms. Nothing in your setup changes.

Why this matters

The "Python is slow" argument against LiteLLM was always a stretch. Gateway overhead is maybe 0.3% of total latency on a typical LLM call. Most of the time you're waiting on the model, not the proxy.

But now even that argument is gone. Sub-1ms overhead, 32MB memory, 6,782 req/s on a single instance. Good luck finding a lighter gateway that still covers 100+ providers.

Full architecture diagrams and the reproducible benchmark setup are in the announcement: docs.litellm.ai/blog/litellm-rust-launch

Curious if anyone else is running their AI gateway through Rust. What's your setup look like?

Top comments (1)

Collapse
 
topstar_ai profile image
Luis

This post is basically about moving LiteLLM toward a Rust-based architecture and what the benchmark differences look like between the Python gateway vs a Rust implementation.

The core idea (and why it’s getting attention) is pretty simple:

LiteLLM has become a common “LLM gateway layer” in Python, but at scale it starts to show typical Python-infrastructure limits (latency overhead, concurrency constraints, memory pressure). The Rust migration argument is that rewriting or replacing parts of it in Rust can dramatically reduce overhead and improve throughput in production systems.

From the broader benchmarking direction in the ecosystem, the claims usually revolve around:

Lower per-request latency due to compiled execution
Much higher throughput under load
Smaller memory footprint and more predictable performance
Tradeoff: more complex implementation and less flexibility compared to Python-first tooling

This fits into a wider trend right now where LLM infrastructure is splitting into two layers:

Python: orchestration, experimentation, agent logic
Rust/Go: inference gateways, routing layers, and high-throughput serving infrastructure

The important nuance is that most of these posts (including this kind of benchmark-heavy migration narrative) are showing relative system design advantages, not “Rust magically replaces LiteLLM everywhere.” In practice, teams usually end up with hybrid stacks rather than full rewrites.

So the real signal here isn’t just “Rust is faster,” but:
LLM infrastructure is maturing into performance-critical systems where language choice at the gateway layer actually matters.

That’s the underlying shift this post is pointing at 🤝