Alex Chen

How I Built a Multi-LLM API Gateway with Smart Load Balancing

## The Problem

Like many indie developers, I've been building small AI-powered projects over the past year. And like many of you, I kept running into the same frustrating issues:

- **Rate limiting**: `429 Too Many Requests` became a daily sight
- **Multiple API keys**: one for GPT, one for Claude, one for Gemini... managing them all was a mess
- **Regional restrictions**: certain models simply weren't available from my location
- **Unpredictable costs**: hard to track spending across different providers

Every time I hit one of these walls, I'd spend hours debugging infrastructure instead of building actual features. That's when I decided to solve this once and for all.

## The Solution

I built **ourhubapi.com**, a unified API gateway that acts as a smart relay between your application and multiple LLM providers.

Here's the core idea:

```text
[Your App] --> [Single API Endpoint] --> [Smart Router] --> [GPT/Claude/Gemini/...]
                                                |
                                                --> [Auto-failover when rate-limited]
```

Instead of calling each provider directly, your app talks to **one endpoint**. The gateway handles everything else behind the scenes.

## Key Technical Decisions

### 1. Smart Load Balancing

The most critical feature is automatic failover. When one upstream account hits a rate limit, the router immediately switches to another available account. Your app only sees an error if every upstream for that model is rate-limited at the same time.

Here's a simplified version of the routing logic:

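```python
def route_request(model, messages):
    # Try each upstream account that can currently serve this model, in order.
    upstreams = get_available_upstreams(model)
    for upstream in upstreams:
        try:
            return upstream.call(messages)
        except RateLimitError:
            # Take this upstream out of rotation and move on to the next one.
            mark_rate_limited(upstream)
            continue
    # Every upstream was rate-limited, so surface a single error to the caller.
    raise AllUpstreamsBusy()
```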
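The snippet above leans on helpers it doesn't show. To make the failover loop something you can run locally, here is a minimal, self-contained sketch. Everything beyond the names in the snippet is an assumption for illustration, not the gateway's actual code: the `Upstream` and `Router` classes, the `fail_times` test hook, and the 30-second cooldown are all hypothetical.

```python
import time
from dataclasses import dataclass


class RateLimitError(Exception):
    """Raised by an upstream when the provider answers with a 429."""


class AllUpstreamsBusy(Exception):
    """Raised when no upstream could serve the request."""


@dataclass
class Upstream:
    name: str
    models: set
    fail_times: int = 0         # hypothetical test hook: calls that should raise
    limited_until: float = 0.0  # monotonic timestamp; 0.0 means available

    def call(self, messages):
        # Stand-in for a real provider call.
        if self.fail_times > 0:
            self.fail_times -= 1
            raise RateLimitError(self.name)
        return {"upstream": self.name, "echo": messages[-1]["content"]}


COOLDOWN_SECONDS = 30.0  # assumed value, not taken from the article


class Router:
    def __init__(self, upstreams, clock=time.monotonic):
        self.upstreams = list(upstreams)
        self.clock = clock

    def get_available_upstreams(self, model):
        # Skip upstreams that are still cooling down after a rate limit.
        now = self.clock()
        return [u for u in self.upstreams
                if model in u.models and u.limited_until <= now]

    def mark_rate_limited(self, upstream):
        upstream.limited_until = self.clock() + COOLDOWN_SECONDS

    def route_request(self, model, messages):
        for upstream in self.get_available_upstreams(model):
            try:
                return upstream.call(messages)
            except RateLimitError:
                self.mark_rate_limited(upstream)
        raise AllUpstreamsBusy(model)


router = Router([
    Upstream("account-a", {"gpt-4o"}, fail_times=1),  # first call is rate-limited
    Upstream("account-b", {"gpt-4o"}),
])
reply = router.route_request("gpt-4o", [{"role": "user", "content": "Hello!"}])
print(reply["upstream"])  # account-b
```

A cooldown timestamp is one simple way to implement `mark_rate_limited`: the rate-limited account drops out of `get_available_upstreams` until the cooldown expires, so later requests don't waste a round trip on it.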


### 2. Drop-in OpenAI SDK Compatibility

The API follows the OpenAI SDK's request format, so switching only touches the client constructor: swap in your gateway key and add `base_url`.


```python
from openai import OpenAI

# Before: calling OpenAI directly
client = OpenAI(api_key="sk-...")

# After: routing through the gateway
client = OpenAI(
    api_key="your-ourhubapi-key",
    base_url="https://api.ourhubapi.com/v1"
)

# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
```
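If you're curious what "OpenAI-compatible" means on the wire, the SDK call above boils down to a `POST` to `{base_url}/chat/completions` with a bearer token and a JSON body. The sketch below builds that request with the standard library and deliberately doesn't send it. The `build_chat_request` helper is a hypothetical name, and it assumes the gateway accepts the standard OpenAI request shape, as the article states.

```python
import json
import urllib.request

BASE_URL = "https://api.ourhubapi.com/v1"  # base URL from the article
API_KEY = "your-ourhubapi-key"             # placeholder key from the article


def build_chat_request(model, messages, base_url=BASE_URL, api_key=API_KEY):
    """Build (but do not send) an OpenAI-style chat completions request."""
    body = json.dumps({"model": model, "messages": messages}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


req = build_chat_request("gpt-4o", [{"role": "user", "content": "Hello!"}])
print(req.get_method(), req.full_url)
# POST https://api.ourhubapi.com/v1/chat/completions
```

Because only the base URL and key differ from a direct OpenAI call, any client that lets you override those two values should work the same way.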




### 3. Usage Quotas per API Key

For small teams, cost control is essential. Each API key can have:

- Spending caps (daily / monthly)
- Rate limits (requests per minute)
- Model access control (enable only what the team needs)

This way, you can give keys to team members without worrying about surprise bills.
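To make those three checks concrete, here is a rough sketch of how a per-key policy could be enforced before a request is forwarded. It is an illustration under assumptions, not the gateway's real implementation: `KeyPolicy`, `check_request`, and the field names are hypothetical, only the daily cap is modeled (no monthly cap or midnight reset), and the rate limit uses a simple 60-second sliding window.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class KeyPolicy:
    allowed_models: set
    daily_cap_usd: float
    requests_per_minute: int
    spent_today_usd: float = 0.0
    recent_requests: deque = field(default_factory=deque)  # timestamps in seconds


def check_request(policy, model, estimated_cost_usd, now=None):
    """Return (allowed, reason). The request is recorded only when allowed."""
    now = time.monotonic() if now is None else now

    # 1. Model access control
    if model not in policy.allowed_models:
        return False, "model not enabled for this key"

    # 2. Spending cap
    if policy.spent_today_usd + estimated_cost_usd > policy.daily_cap_usd:
        return False, "daily spending cap reached"

    # 3. Rate limit: drop timestamps older than 60 seconds, then count the rest.
    while policy.recent_requests and now - policy.recent_requests[0] >= 60:
        policy.recent_requests.popleft()
    if len(policy.recent_requests) >= policy.requests_per_minute:
        return False, "rate limit exceeded"

    policy.recent_requests.append(now)
    policy.spent_today_usd += estimated_cost_usd
    return True, "ok"


policy = KeyPolicy(allowed_models={"gpt-4o"}, daily_cap_usd=1.00, requests_per_minute=2)
print(check_request(policy, "gpt-4o", 0.10, now=0.0))  # (True, 'ok')
print(check_request(policy, "claude", 0.10, now=1.0))  # (False, 'model not enabled for this key')
```

Checking access and budget before touching the rate-limit window means a rejected request doesn't consume one of the key's per-minute slots.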

## Why Not Just Use the Official APIs?

A fair question. If you're using a single model with low traffic, the official API might work fine. But a middleware layer becomes genuinely useful once you:

- Need multiple models in one project
- Hit rate limits during development
- Want predictable costs across a team

It's the same reason we use load balancers for web servers: redundancy and simplicity.

## What I Learned

Building this taught me a lot about:

- Handling distributed rate limits gracefully
- Designing APIs that developers actually want to use
- The importance of "it just works" over feature overload

## Try It Out

The service is live at ourhubapi.com. I'd love to hear your feedback: what features would make this useful for your own projects?

This is very much a v1, built by a developer for developers. If you have thoughts, criticisms, or feature requests, drop a comment below. I'm reading every single one.