DEV Community

Dipankar Sarkar

I put a Rust layer under LiteLLM. Here is where it actually helped (and where it did not)

LiteLLM is the glue a lot of us reach for when an app has to talk to more than one
model provider. One interface, dozens of backends. It is great. But once you run it
under real load, the hot paths stop being the model call and start being the
plumbing around it: connection pooling, rate limiting, token counting on big
inputs. That plumbing is pure Python, and it shows.

So I built fast-litellm: a drop-in Rust acceleration layer that swaps the hot
paths out for PyO3 extensions and falls back to Python everywhere else.

The honest benchmark table

I am going to lead with the numbers, including the ones that did not go my way.
These compare production-grade Python (thread-safe) against the Rust versions:

Component                                   Result
Connection pool                             3.2x faster (lock-free DashMap)
Rate limiting                               1.6x faster (atomic ops)
Large-text token counting                   1.5-1.7x faster
High-cardinality rate limits (1000+ keys)   42x less memory
Small-text token counting                   0.5x (Python wins; FFI overhead dominates)
Routing with complex Python objects         0.4x (Python wins)
Those last two rows are the important part. Crossing the Python/Rust boundary is not
free. For a 12-token chat message, the FFI overhead is bigger than the work you
saved, so Rust loses. Anyone who tells you their native extension is faster at
everything is not measuring the small cases.
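
A rough way to reason about that crossover is a two-term cost model: every call pays a fixed boundary cost, plus a per-token cost that scales with the input. The sketch below is an illustration only, not the project's benchmark code. The function names are hypothetical and the nanosecond figures are invented placeholders, so substitute your own measurements.

```python
def call_cost_ns(n_tokens, fixed_ns, per_token_ns):
    """Cost of one call: a fixed part plus work that scales with input size."""
    return fixed_ns + per_token_ns * n_tokens


def speedup(n_tokens, py_per_token_ns, native_per_token_ns, ffi_fixed_ns):
    """Python time divided by native time; a value below 1.0 means Python wins."""
    py = call_cost_ns(n_tokens, 0.0, py_per_token_ns)
    native = call_cost_ns(n_tokens, ffi_fixed_ns, native_per_token_ns)
    return py / native


def break_even_tokens(py_per_token_ns, native_per_token_ns, ffi_fixed_ns):
    """Input size above which the native path starts to win, or None if it never does."""
    saved_per_token = py_per_token_ns - native_per_token_ns
    if saved_per_token <= 0:
        return None
    return ffi_fixed_ns / saved_per_token


# Placeholder numbers, NOT measurements: 100 ns/token in Python,
# 50 ns/token natively, 2000 ns fixed cost per boundary crossing.
PY_NS, NATIVE_NS, FFI_NS = 100.0, 50.0, 2000.0

small = speedup(12, PY_NS, NATIVE_NS, FFI_NS)       # below 1.0: Python wins
large = speedup(10_000, PY_NS, NATIVE_NS, FFI_NS)   # above 1.0: native wins
crossover = break_even_tokens(PY_NS, NATIVE_NS, FFI_NS)
```

With these placeholder inputs the model puts the crossover at 40 tokens, so a 12-token message sits on the losing side. The real threshold depends entirely on the measured overhead of your own boundary.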

Where it wins, it wins because of data structures, not because "Rust is fast":
lock-free DashMap for concurrent connection state, and a memory layout for
high-cardinality rate limiting that holds 1000+ unique keys in a fraction of the
Python footprint. 42x less memory is a data-structure story, not a language story.
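
The layout point can be seen even without leaving Python. The sketch below is an assumption-laden illustration, not fast-litellm's actual Rust layout: it compares one dict per key against the same state packed into two typed arrays, and measures both with `tracemalloc`. The builder names and the `user-N` keys are hypothetical.

```python
import tracemalloc
from array import array


def allocated_bytes(build, n):
    """Bytes still allocated by build(n) while its result is alive."""
    tracemalloc.start()
    state = build(n)
    current, _peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    del state
    return current


def per_key_objects(n):
    # One dict per key, each holding a boxed int and a boxed float.
    return {f"user-{i}": {"count": i, "window_start": float(i)} for i in range(n)}


def packed_columns(n):
    # One slot index per key; the state itself lives in two flat typed arrays.
    index = {f"user-{i}": i for i in range(n)}
    counts = array("Q", range(n))                     # unsigned 64-bit counters
    starts = array("d", (float(i) for i in range(n)))  # 64-bit float timestamps
    return index, counts, starts


naive = allocated_bytes(per_key_objects, 1000)
packed = allocated_bytes(packed_columns, 1000)
```

The packed version drops the per-key dict and the boxed float, which is where most of the saving comes from. The exact ratio varies by Python version, and a native layout can go further by avoiding boxed keys as well.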

Drop-in, or it does not get used

The design constraint I cared about most: nobody rewrites their app to try this.

import fast_litellm  # accelerates LiteLLM automatically
import litellm

response = litellm.completion(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Hello!"}],
)

One import before litellm. It monkeypatches the hot paths, and every accelerated
component has an automatic fallback to the original Python if anything looks off.
Feature flags let you roll it out to a percentage of traffic first. If you are
running the proxy under gunicorn, a two-line app.py with --preload does it.
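
For readers who have not seen the pattern, here is a minimal sketch of a monkeypatch with fallback and percentage rollout. It is not fast-litellm's code or API: `install_with_fallback`, `stand_in`, and both token counters are hypothetical names, and a `SimpleNamespace` stands in for the module being patched.

```python
import functools
import random
from types import SimpleNamespace


def install_with_fallback(module, name, fast, rollout_percent=100, rng=random.random):
    """Replace module.<name> with a wrapper that tries `fast` for a share of calls.

    Returns the original function so the patch can be undone.
    """
    original = getattr(module, name)

    @functools.wraps(original)
    def patched(*args, **kwargs):
        # rng() is in [0, 1), so 0 disables the fast path and 100 always uses it.
        if rng() * 100 < rollout_percent:
            try:
                return fast(*args, **kwargs)
            except Exception:
                # Anything unexpected in the fast path: use the original instead.
                pass
        return original(*args, **kwargs)

    setattr(module, name, patched)
    return original


def python_count_tokens(text):
    # Stand-in for the pure-Python hot path (naive whitespace split).
    return len(text.split())


def broken_fast_count_tokens(text):
    # Stand-in for a native path that fails at runtime.
    raise RuntimeError("native extension unavailable")


stand_in = SimpleNamespace(count_tokens=python_count_tokens)
install_with_fallback(stand_in, "count_tokens", broken_fast_count_tokens)
result = stand_in.count_tokens("falls back to Python")
```

Callers keep using `stand_in.count_tokens` unchanged, which is the whole point of a drop-in. A production version would also want to log or count the fallbacks so a silently failing fast path does not go unnoticed.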

What I would take from this

  • Profile before you port. The win was in three specific hot paths, not "the code."
  • Measure the small inputs too. FFI overhead is real and it will embarrass you.
  • Make it a drop-in or it dies on the vine. Zero-config plus automatic fallback is what makes a native accelerator safe to actually ship.
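
As a starting point for the first bullet, the standard library's `cProfile` is enough to see which functions dominate a request. This is a generic sketch, not how the benchmarks in the repo were produced; `handle_request` and `count_tokens` are hypothetical stand-ins for your own code.

```python
import cProfile
import io
import pstats


def count_tokens(text):
    # Stand-in hot path: naive whitespace tokenization.
    return len(text.split())


def handle_request(text):
    # Stand-in request handler that hits the hot path repeatedly.
    return sum(count_tokens(text) for _ in range(200))


def profile_top(func, *args, limit=5):
    """Run func under cProfile; return its result and the top entries by cumulative time."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    out = io.StringIO()
    pstats.Stats(profiler, stream=out).sort_stats("cumulative").print_stats(limit)
    return result, out.getvalue()


total, report = profile_top(handle_request, "a b c " * 1000)
```

Print `report` and read it top-down: the functions that own most of the cumulative time are the porting candidates, and everything else can stay in Python.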

Code, full benchmark breakdown, and the PyO3 architecture are here:
https://github.com/neul-labs/fast-litellm

If you run LiteLLM at any real volume, I would love to know which path is your
bottleneck. Kick the tyres, issues welcome.
