The Problem: API Sprawl
Three months ago, my AI project looked like this:
```python
# 3 different SDKs, 3 different auth patterns
import openai
import anthropic
from deepseek import DeepSeekClient  # proprietary SDK
```
Each provider had different:
- rate limit strategies
- retry logic
- error response formats
- context window limits
My llm_handler.py was 400+ lines of glue code. Every new model meant another SDK, another set of edge cases.
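To give a flavor of that glue code, here is a minimal sketch of a provider-agnostic retry wrapper with exponential backoff. `TransientError`, `call_with_retry`, and the delay values are hypothetical names and numbers picked for this example, not part of any SDK. In a real handler, each provider's own rate-limit and timeout exceptions would first have to be mapped onto something like `TransientError`.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a provider's rate-limit or timeout error."""


def call_with_retry(fn, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fn(), retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # waits 0.5s, then 1s, then 2s, ...
```

Passing `sleep` as a parameter keeps the wrapper easy to test without real waiting.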
The Solution: A Gateway Approach
I evaluated four major LLM gateways:
| Gateway | Latency Overhead | Model Count | SDK Compatibility | Self-Host Required |
|---|---|---|---|---|
| OpenRouter | ~160-200ms | 200+ | OpenAI only | No |
| LiteLLM | ~120-150ms | 100+ | OpenAI + Anthropic | Yes |
| LobeHub | ~100-140ms | 50+ | OpenAI only | No |
| NovaStack | ~70-90ms | ~20 | OpenAI + Anthropic | No |
Key insight: for most applications, you only need 3-5 models. A smaller, curated list with lower latency beats 200 models with 200ms overhead.
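To make that trade-off concrete, the comparison can be written as a filter over the table. This is only a sketch: the `Gateway` dataclass and `shortlist` function are hypothetical names, the figures are copied from the comparison table (upper end of each latency range, lower bound of each model count), and the default thresholds are assumptions you would tune to your own budget.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str
    max_overhead_ms: int  # upper end of the reported overhead range
    model_count: int      # approximate catalogue size
    self_host: bool       # True if you have to run it yourself


# Figures taken from the comparison table.
GATEWAYS = [
    Gateway("OpenRouter", 200, 200, False),
    Gateway("LiteLLM", 150, 100, True),
    Gateway("LobeHub", 140, 50, False),
    Gateway("NovaStack", 90, 20, False),
]


def shortlist(gateways, max_overhead_ms=100, min_models=5, allow_self_host=False):
    """Keep gateways that fit the latency budget and still cover enough models."""
    return [
        g.name
        for g in gateways
        if g.max_overhead_ms <= max_overhead_ms
        and g.model_count >= min_models
        and (allow_self_host or not g.self_host)
    ]


print(shortlist(GATEWAYS))  # ['NovaStack']
```

Loosening the budget to 150ms and allowing self-hosting brings LiteLLM and LobeHub back into the list, which is a quick way to see how sensitive the choice is to the latency threshold.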
My Setup
Here's the before/after:
Before: separate clients for each provider
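```python
# Before: 3 different clients
from openai import OpenAI
from anthropic import Anthropic
from deepseek import DeepSeekClient  # proprietary SDK

client_openai = OpenAI(api_key="sk-...")
client_anthropic = Anthropic(api_key="sk-...")
client_deepseek = DeepSeekClient(api_key="sk-...")
```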
After: single gateway
```python
# After: one client, all models
from openai import OpenAI

client = OpenAI(
    # The SDK appends /chat/completions itself, so base_url stops at /v1
    base_url="https://www.novapai.ai/v1",
    api_key="your-key",
)

# Same endpoint, different models
response = client.chat.completions.create(
    model="DeepSeek-V4-Pro",  # or "Kimi-2.6", "MiniMax-m3", "Qwen3-235B"
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
```
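Because every model sits behind the same client, choosing one becomes a plain string lookup. The sketch below is hypothetical: the task labels and the `MODEL_FOR_TASK` mapping are assumptions made up for illustration (only the model names come from the snippet above, and the pairing is not a recommendation), and `build_request` simply assembles the keyword arguments you would pass to `client.chat.completions.create`.

```python
# Hypothetical task-to-model mapping; the model names are the ones used above.
MODEL_FOR_TASK = {
    "reasoning": "DeepSeek-V4-Pro",
    "long_context": "Kimi-2.6",
    "general": "Qwen3-235B",
}
DEFAULT_MODEL = "Qwen3-235B"  # assumed fallback for unknown task labels


def build_request(task, prompt):
    """Return kwargs for client.chat.completions.create(**kwargs)."""
    return {
        "model": MODEL_FOR_TASK.get(task, DEFAULT_MODEL),
        "messages": [{"role": "user", "content": prompt}],
    }


request = build_request("reasoning", "Explain quantum computing")
print(request["model"])  # DeepSeek-V4-Pro
```

With a real client this would be used as `client.chat.completions.create(**request)`.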
JavaScript Integration
```javascript
import OpenAI from 'openai';

const client = new OpenAI({
  // The SDK appends /chat/completions itself, so baseURL stops at /v1
  baseURL: 'https://www.novapai.ai/v1',
  apiKey: process.env.NOVA_API_KEY,
});

// Claude on the same endpoint? Yes (shown here in OpenAI chat format):
const response = await client.chat.completions.create({
  model: 'Claude-3.5-Sonnet', // the gateway accepts Anthropic message formats too
  messages: [
    { role: 'system', content: 'You are a helpful assistant' },
    { role: 'user', content: 'Hello' },
  ],
});
```
Real-World Performance
I ran 1,000 requests per model through each gateway:
| Metric | OpenRouter | LiteLLM | NovaStack |
|---|---|---|---|
| P50 Latency | 152ms | 118ms | 74ms |
| P95 Latency | 310ms | 245ms | 168ms |
| Error Rate | 0.8% | 1.2% | 0.4% |
NovaStack's gateway sits in US-West with direct peering to Chinese model providers, which likely explains its lower and more consistent latency.
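For anyone who wants to reproduce this kind of table, P50/P95 latency and error rate can be computed from raw samples roughly as follows. This is a sketch of one common approach (nearest-rank percentiles over successful requests), not necessarily the exact method behind the numbers above. `percentile`, `summarize`, and the sample data are made up for illustration.

```python
import math


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = math.ceil(pct / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def summarize(results):
    """results: list of (latency_ms, ok) tuples, one per request."""
    latencies = sorted(ms for ms, ok in results if ok)
    errors = sum(1 for _, ok in results if not ok)
    return {
        "p50_ms": percentile(latencies, 50),
        "p95_ms": percentile(latencies, 95),
        "error_rate": errors / len(results),
    }


# Made-up sample: 19 successful requests (60ms to 240ms) and 1 failure.
sample = [(float(ms), True) for ms in range(60, 250, 10)] + [(0.0, False)]
print(summarize(sample))
```

Failed requests are excluded from the latency percentiles here and counted only in the error rate. Whether to include them is a methodology choice worth stating alongside any benchmark.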
Cost Savings
Routing reasoning tasks to DeepSeek-V4-Pro cut my monthly bill by 55%.
78% of my requests now go to lower-cost models, for tasks that don't need GPT-4-class reasoning.
The gateway's cost/request tracking helped identify where I was overpaying.
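If your gateway doesn't expose this kind of tracking, a rough version is easy to approximate client-side. The sketch below is not the gateway's own feature: `CostTracker` is a hypothetical helper, and the values in `PRICE_PER_1K_TOKENS` are placeholders (assumptions, not real rates) that you would replace with your provider's current pricing.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens; replace with real rates.
PRICE_PER_1K_TOKENS = {
    "DeepSeek-V4-Pro": 0.002,
    "GPT-4": 0.03,
}


class CostTracker:
    """Accumulates token usage and estimated spend per model."""

    def __init__(self, prices):
        self.prices = prices
        self.tokens = defaultdict(int)

    def record(self, model, total_tokens):
        """Add one request's token count to the model's running total."""
        self.tokens[model] += total_tokens

    def cost(self, model):
        """Estimated spend in USD for everything recorded for this model."""
        return self.tokens[model] / 1000 * self.prices[model]

    def report(self):
        return {model: round(self.cost(model), 4) for model in self.tokens}


tracker = CostTracker(PRICE_PER_1K_TOKENS)
tracker.record("GPT-4", 2000)
tracker.record("DeepSeek-V4-Pro", 2000)
print(tracker.report())
```

With the OpenAI SDK, the token count would typically come from `response.usage.total_tokens`, assuming the gateway returns a usage object in its responses.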
Trade-offs
NovaStack offers ~20 model variants compared to OpenRouter's 200+, and it is missing some niche models (Mistral variants, Cohere).
The $10 free credit on new accounts let me test without commitment.
Final Verdict
If you're building production LLM apps and want to:
- Reduce SDK clutter
- Keep gateway latency overhead under 100ms
- Use Chinese models without dealing with separate integrations

then NovaStack is worth a look. Not affiliated, just a developer who spent 3 weeks on this rabbit hole.
Open to questions about routing logic or benchmark methodology.