A YC founder's recent LinkedIn post calling to "avoid LiteLLM as much as possible" sparked a conversation that's been brewing in AI engineering circles for months. The post detailed systematic failures across core features - from broken rate limiting to incorrect token counting - and concluded with a stark admission: "The long-term maintenance cost and risk outweigh the benefits."
This isn't an isolated complaint. Production teams are quietly migrating away from LiteLLM, and the technical reasons why reveal important lessons about building reliable AI infrastructure.
The Core Problem: Architectural Debt in Production Systems
LiteLLM became popular because it solved an immediate problem: routing requests to multiple LLM providers through a single interface. For prototyping and development, it works. The issues emerge at scale.
The documented failures from the YC founder's team:
- Proxy calls to AI providers: Fundamental routing broken in production
- TPM rate limiting: Confuses requests per minute (RPM) with tokens per minute (TPM) - a catastrophic error when providers bill by tokens
- Per-user budget settings: Non-functional governance features
- Token counting for billing: Mismatches with actual provider billing
- High-volume API scaling: Performance degradation under load
- Short-lived API keys: Security features broken
These aren't edge cases. They're core features failing in production.
Why This Keeps Happening: The Python Problem
LiteLLM is written in Python, which introduces inherent constraints for high-throughput proxy applications:
1. The Global Interpreter Lock (GIL)
Python's GIL prevents true parallelism. When handling concurrent LLM requests, you can't utilize all CPU cores effectively. Teams work around this by spinning up multiple worker processes, which introduces memory overhead and coordination complexity.
2. Runtime Overhead
Every request goes through Python's interpreter. For a proxy that should add minimal latency, this overhead compounds at scale. Benchmarks show LiteLLM adding ~500μs of overhead per request - before considering network calls to providers.
3. Memory Management
Python's dynamic memory allocation and garbage collection create unpredictable performance characteristics. The YC founder mentioned maintaining an internal fork - a common pattern when dealing with memory leaks and performance regressions in Python-based infrastructure.
4. Type Safety
Python's dynamic typing makes it easy to introduce bugs in critical paths. When TPM limits get confused with RPM limits, that's the kind of unit mix-up a statically typed language can catch at compile time, provided token counts and request counts are modeled as distinct types.
The Performance Gap: What Proper Architecture Delivers
When we built Bifrost, we chose Go specifically to avoid these architectural constraints. The performance difference isn't incremental - it's structural.
Benchmark Results (AWS t3.medium, 1K RPS):
| Metric | LiteLLM | Bifrost | Improvement |
|---|---|---|---|
| P99 Latency | 90.7 s | 1.68 s | 54× faster |
| Added Overhead | ~500 μs | 59 μs | 8× lower |
| Memory Usage | 372 MB (growing) | 120 MB (stable) | 3× more efficient |
| Success Rate @ 5K RPS | Degrades | 100% | Handles 16× more load |
| Uptime Without Restart | 6–8 hours | 30+ days | Continuous operation |
These aren't theoretical numbers. They're measured in production environments handling enterprise-scale LLM traffic.
Why Go Makes the Difference
Goroutines vs. Threading
Go's goroutine model enables true concurrency without the overhead of OS threads or the constraints of Python's GIL. You can handle thousands of concurrent LLM requests on a single instance without the architectural gymnastics required in Python.
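As a minimal illustration of that point (not Bifrost's actual code, and the URL is a placeholder), here's how a Go proxy can fan out concurrent upstream calls with goroutines, using a buffered channel as a semaphore to bound in-flight requests:

```go
package main

import (
	"fmt"
	"net/http"
	"sync"
	"time"
)

// forwardRequest stands in for a proxied call to an upstream LLM provider.
// In a real gateway this would stream the provider's response back to the client.
func forwardRequest(client *http.Client, id int) error {
	req, err := http.NewRequest(http.MethodGet, "https://example.com/health", nil)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	fmt.Printf("request %d -> %s\n", id, resp.Status)
	return nil
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}

	// Bound concurrent upstream calls with a semaphore; goroutines themselves
	// are cheap (a few KB of stack each), so thousands can wait concurrently
	// without worker processes or GIL workarounds.
	sem := make(chan struct{}, 32)
	var wg sync.WaitGroup

	for i := 0; i < 100; i++ {
		wg.Add(1)
		sem <- struct{}{} // acquire a slot
		go func(id int) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			if err := forwardRequest(client, id); err != nil {
				fmt.Printf("request %d failed: %v\n", id, err)
			}
		}(i)
	}
	wg.Wait()
}
```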
Static Typing and Compilation
The TPM/RPM confusion that slips through in LiteLLM can be caught at compile time in Go. When token counts and request counts are modeled as distinct types, rate limiting logic that distinguishes between them isn't just tested - it's enforced by the compiler.
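A hedged sketch of that idea (illustrative only, not Bifrost's internals): give tokens and requests their own named types so a TPM value can't silently stand in for an RPM value.

```go
package main

import "fmt"

// Distinct named types: the compiler rejects mixing them without an explicit conversion.
type Tokens int64
type Requests int64

type RateLimit struct {
	TPM Tokens   // tokens per minute
	RPM Requests // requests per minute
}

// allowTokens checks a token budget; passing a Requests value here will not compile.
func allowTokens(used, limit Tokens) bool {
	return used <= limit
}

func main() {
	limit := RateLimit{TPM: 100_000, RPM: 600}

	var usedTokens Tokens = 42_000
	fmt.Println(allowTokens(usedTokens, limit.TPM)) // true

	// The TPM/RPM mix-up described above becomes a compile error:
	// allowTokens(usedTokens, limit.RPM) // cannot use limit.RPM (type Requests) as Tokens
}
```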
Predictable Performance
Go's garbage collector is optimized for low-latency applications. Memory usage stays flat under load. No sawtooth patterns, no periodic worker restarts, no memory leaks that require maintaining internal forks.
Single Binary Deployment
LiteLLM requires Python runtime, pip dependencies, version management, and hope that nothing breaks in production. Bifrost compiles to a single static binary. No runtime, no dependencies, no surprises.
What Actually Works in Production
The features the YC founder listed as broken in LiteLLM are fundamental requirements for production LLM infrastructure:
Rate Limiting (Done Correctly)
Bifrost implements token-aware rate limiting that correctly tracks TPM and RPM separately. When you configure a 100K TPM limit, it enforces tokens per minute - not requests per minute multiplied by some average token count.
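To make the distinction concrete, here's a minimal fixed-window limiter that tracks the two budgets independently. It's a sketch of the concept under simplified assumptions (single instance, one-minute windows), not Bifrost's implementation:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// windowLimiter enforces separate per-minute budgets for requests and tokens.
type windowLimiter struct {
	mu          sync.Mutex
	windowStart time.Time
	tokensUsed  int64
	requests    int64
	tpmLimit    int64
	rpmLimit    int64
}

func newWindowLimiter(tpm, rpm int64) *windowLimiter {
	return &windowLimiter{windowStart: time.Now(), tpmLimit: tpm, rpmLimit: rpm}
}

// Allow reports whether a request estimated to consume `tokens` fits in the
// current one-minute window, checking both limits independently.
func (l *windowLimiter) Allow(tokens int64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()

	if time.Since(l.windowStart) >= time.Minute {
		l.windowStart = time.Now()
		l.tokensUsed, l.requests = 0, 0
	}
	if l.requests+1 > l.rpmLimit || l.tokensUsed+tokens > l.tpmLimit {
		return false
	}
	l.requests++
	l.tokensUsed += tokens
	return true
}

func main() {
	lim := newWindowLimiter(100_000, 600) // 100K TPM, 600 RPM

	fmt.Println(lim.Allow(60_000)) // true: 60K tokens, 1 request
	fmt.Println(lim.Allow(60_000)) // false: would exceed 100K TPM even though RPM is fine
}
```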
Accurate Token Counting
Token counts match provider billing because we use the same tokenization libraries as the providers. No surprises in your bill from miscounted tokens.
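One practical pattern here (illustrative, not a description of Bifrost's internals) is to treat the provider's own `usage` block in the response as the source of truth for billing, and use local estimates only for pre-flight checks:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// usage mirrors the usage block most OpenAI-compatible providers return.
type usage struct {
	PromptTokens     int64 `json:"prompt_tokens"`
	CompletionTokens int64 `json:"completion_tokens"`
	TotalTokens      int64 `json:"total_tokens"`
}

// reconcile compares a local pre-flight estimate with what the provider billed.
// Persisting the provider-reported number keeps cost tracking aligned with the invoice.
func reconcile(estimated int64, providerJSON []byte) (billed int64, drift int64, err error) {
	var resp struct {
		Usage usage `json:"usage"`
	}
	if err := json.Unmarshal(providerJSON, &resp); err != nil {
		return 0, 0, err
	}
	return resp.Usage.TotalTokens, resp.Usage.TotalTokens - estimated, nil
}

func main() {
	// Hypothetical provider response fragment.
	raw := []byte(`{"usage":{"prompt_tokens":812,"completion_tokens":143,"total_tokens":955}}`)

	billed, drift, err := reconcile(900, raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("billed=%d tokens, estimate drift=%+d\n", billed, drift)
}
```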
Per-Key Budget Management
Virtual keys with actual budget enforcement. Track spending per team, per user, per application. Get alerts before hitting limits, not after burning through budget.
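A minimal sketch of the idea, assuming a virtual key carries its own spend counter, budget, and alert threshold (names and fields are illustrative, not Bifrost's API):

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// virtualKey tracks spend against a budget; a real gateway would persist this
// and attribute it per team, per user, or per application.
type virtualKey struct {
	mu         sync.Mutex
	name       string
	spentUSD   float64
	budgetUSD  float64
	alertRatio float64 // e.g. 0.8 = warn at 80% of budget
}

var errBudgetExceeded = errors.New("budget exceeded")

// Charge records the cost of a completed request, warns near the limit,
// and rejects spend once the budget is exhausted.
func (k *virtualKey) Charge(costUSD float64) error {
	k.mu.Lock()
	defer k.mu.Unlock()

	if k.spentUSD+costUSD > k.budgetUSD {
		return errBudgetExceeded
	}
	k.spentUSD += costUSD
	if k.spentUSD >= k.alertRatio*k.budgetUSD {
		fmt.Printf("[alert] key %q at %.0f%% of budget\n", k.name, 100*k.spentUSD/k.budgetUSD)
	}
	return nil
}

func main() {
	key := &virtualKey{name: "team-search", budgetUSD: 100, alertRatio: 0.8}

	fmt.Println(key.Charge(75)) // ok
	fmt.Println(key.Charge(10)) // ok, crosses 80% and triggers the alert
	fmt.Println(key.Charge(20)) // rejected: budget exceeded
}
```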
Semantic Caching
This is where architecture really matters. Semantic caching requires embedding generation, vector similarity search, and cache invalidation - all on the hot path.
In Python, this adds hundreds of milliseconds. In Go, it adds ~40μs while delivering 40-60% cost reduction through intelligent cache hits.
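A toy illustration of the lookup step (assuming an embedding already exists for the incoming prompt; this is not Bifrost's implementation): compare the query embedding against cached entries with cosine similarity and serve a hit above a threshold.

```go
package main

import (
	"fmt"
	"math"
)

// cacheEntry pairs a stored response with the embedding of the prompt that produced it.
type cacheEntry struct {
	embedding []float64
	response  string
}

// cosine computes cosine similarity between two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// lookup returns the best cached response whose similarity clears the threshold.
// A production gateway would query an indexed vector store instead of scanning linearly.
func lookup(query []float64, cache []cacheEntry, threshold float64) (string, bool) {
	best, bestScore := "", threshold
	for _, e := range cache {
		if s := cosine(query, e.embedding); s >= bestScore {
			best, bestScore = e.response, s
		}
	}
	return best, best != ""
}

func main() {
	cache := []cacheEntry{
		{embedding: []float64{0.9, 0.1, 0.0}, response: "cached answer about refunds"},
		{embedding: []float64{0.0, 0.2, 0.9}, response: "cached answer about shipping"},
	}
	query := []float64{0.88, 0.15, 0.05} // embedding of a semantically similar prompt

	if resp, ok := lookup(query, cache, 0.95); ok {
		fmt.Println("cache hit:", resp)
	} else {
		fmt.Println("cache miss: call the provider")
	}
}
```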
Automatic Failover
When OpenAI rate limits you or Anthropic has an outage, Bifrost automatically routes to backup providers. The logic is simple enough to be reliable and fast enough to not add latency.
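A stripped-down illustration of that routing idea (provider names, endpoints, and the request body are placeholders, not Bifrost's configuration): try providers in priority order and fall through on rate limits, server errors, or network failures.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"strings"
	"time"
)

// provider is a placeholder for an upstream LLM endpoint.
type provider struct {
	name    string
	baseURL string
}

var errAllProvidersFailed = errors.New("all providers failed")

// complete tries each provider in order, moving to the next one on
// rate limiting (429), server errors (5xx), or connection failures.
func complete(client *http.Client, providers []provider, prompt string) (string, error) {
	for _, p := range providers {
		body := strings.NewReader(fmt.Sprintf(`{"prompt": %q}`, prompt))
		resp, err := client.Post(p.baseURL+"/v1/completions", "application/json", body)
		if err != nil {
			fmt.Printf("%s unreachable, failing over: %v\n", p.name, err)
			continue
		}
		resp.Body.Close() // sketch only: a real gateway would decode the completion first
		if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
			fmt.Printf("%s returned %d, failing over\n", p.name, resp.StatusCode)
			continue
		}
		return p.name, nil
	}
	return "", errAllProvidersFailed
}

func main() {
	client := &http.Client{Timeout: 5 * time.Second}
	providers := []provider{
		{name: "primary", baseURL: "https://api.primary.example"},   // placeholder URL
		{name: "fallback", baseURL: "https://api.fallback.example"}, // placeholder URL
	}

	served, err := complete(client, providers, "hello")
	fmt.Println(served, err)
}
```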
The Migration Path
The YC founder listed several alternatives they're evaluating. Here's the honest comparison:
Bifrost (Go-based): Production-grade performance, semantic caching, proper governance, single binary deployment. Open source with enterprise features available.
TensorZero (Rust-based): Excellent performance, more focused on experimentation and data collection than pure gateway functionality.
Keywords AI: Hosted solution, good for teams that don't want to manage infrastructure but need basic gateway features.
Vercel AI Gateway: Optimized for Vercel ecosystem, reliability-focused but limited governance features.
Build Your Own: Several YC companies have done this. Makes sense if you have specific requirements and engineering resources. Significant ongoing maintenance burden.
What Not to Use: The same post warned against Portkey after a founder lost $10K/day due to misconfigured cache headers - another example of how subtle bugs in gateway infrastructure create outsized production impact.
The Real Question: Build vs. Buy (The Right Thing)
The YC founder mentioned that several YC companies built their own gateways. This makes sense when you have unique requirements and dedicated infrastructure teams.
But there's a middle path: use open source infrastructure that's properly architected, then customize for your specific needs.
Bifrost is open source specifically because we've seen too many teams waste engineering resources either:
- Fighting with buggy Python-based gateways in production
- Rebuilding gateway infrastructure from scratch
The codebase is straightforward Go. If you need custom behavior, fork it and modify. The architecture is solid enough that you're not inheriting technical debt.
What This Means for AI Infrastructure
The LiteLLM situation illustrates a broader pattern in AI tooling: rapid development in Python creates immediate functionality, but architectural constraints create long-term production problems.
As AI applications move from proof-of-concept to production scale, infrastructure requirements change:
- Development: Any language that ships features quickly
- Production: Languages that handle concurrency, manage memory predictably, and enforce correctness through type systems
This isn't about Python vs. Go as programming languages. It's about choosing the right tool for infrastructure that sits in the critical path of every LLM request your application makes.
The Path Forward
If you're currently using LiteLLM in production and seeing reliability issues, the migration path is straightforward:
- Benchmark your current performance: Measure actual latency, token counting accuracy, rate limit behavior
- Test alternatives: Spin up Bifrost (or another option) in parallel and route a small percentage of traffic to it (see the sketch after this list)
- Compare results: Look at latency overhead, success rates, cost tracking accuracy
- Migrate incrementally: Move production traffic over gradually, monitor for issues
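For the parallel-testing step, a canary split can be as simple as a weighted random pick between the incumbent gateway and the candidate. The sketch below is illustrative; the endpoints are placeholders, not real deployment URLs:

```go
package main

import (
	"fmt"
	"math/rand"
)

// gateway is a placeholder for a configured LLM gateway endpoint.
type gateway struct {
	name    string
	baseURL string
}

// pick sends roughly canaryPercent of traffic to the candidate gateway and
// the rest to the incumbent, so both can be compared on live requests.
func pick(incumbent, candidate gateway, canaryPercent int) gateway {
	if rand.Intn(100) < canaryPercent {
		return candidate
	}
	return incumbent
}

func main() {
	incumbent := gateway{name: "litellm", baseURL: "http://litellm.internal:4000"} // placeholder
	candidate := gateway{name: "bifrost", baseURL: "http://bifrost.internal:8080"} // placeholder

	counts := map[string]int{}
	for i := 0; i < 10_000; i++ {
		counts[pick(incumbent, candidate, 5).name]++ // 5% canary
	}
	fmt.Println(counts) // roughly 95% litellm, 5% bifrost
}
```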
The YC founder's post resonated because many teams are experiencing these problems silently. They assume they're configuring something wrong, or that "this is just how it is" with LLM infrastructure.
It's not. Production LLM gateways can be fast, reliable, and actually implement the features they claim to provide.
Try Bifrost:
- GitHub: https://github.com/maximhq/bifrost
- Documentation: https://docs.getbifrost.ai
- Benchmarks: https://docs.getbifrost.ai/benchmarking/getting-started
The infrastructure layer for LLM applications is too critical to accept broken rate limiting, incorrect token counting, and unpredictable failures. Production systems deserve better.
