<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alex Chen</title>
    <description>The latest articles on DEV Community by Alex Chen (@onebuilds).</description>
    <link>https://dev.to/onebuilds</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3964230%2F0d9dacea-93a7-42cd-9f58-4ac61c64316b.png</url>
      <title>DEV Community: Alex Chen</title>
      <link>https://dev.to/onebuilds</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/onebuilds"/>
    <language>en</language>
    <item>
      <title>How I Built a Multi-LLM API Gateway with Smart Load Balancing</title>
      <dc:creator>Alex Chen</dc:creator>
      <pubDate>Tue, 02 Jun 2026 11:02:09 +0000</pubDate>
      <link>https://dev.to/onebuilds/how-i-built-a-multi-llm-api-gateway-with-smart-load-balancing-4fce</link>
      <guid>https://dev.to/onebuilds/how-i-built-a-multi-llm-api-gateway-with-smart-load-balancing-4fce</guid>
      <description>&lt;h2&gt;The Problem&lt;/h2&gt;

&lt;p&gt;Like many indie developers, I've been building small AI-powered projects over the past year. And like many of you, I kept running into the same frustrating issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Rate limiting&lt;/strong&gt;: &lt;code&gt;429 Too Many Requests&lt;/code&gt; became a daily sight&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Multiple API keys&lt;/strong&gt;: one for GPT, one for Claude, one for Gemini... managing them all was a mess&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Regional restrictions&lt;/strong&gt;: certain models simply weren't available from my location&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Unpredictable costs&lt;/strong&gt;: hard to track spending across different providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Every time I hit one of these walls, I'd spend hours debugging infrastructure instead of building actual features. That's when I decided to solve this once and for all.&lt;/p&gt;

&lt;h2&gt;The Solution&lt;/h2&gt;

&lt;p&gt;I built &lt;strong&gt;ourhubapi.com&lt;/strong&gt;, a unified API gateway that acts as a smart relay between your application and multiple LLM providers.&lt;/p&gt;

&lt;p&gt;Here's the core idea:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Your App] --&amp;gt; [Single API Endpoint] --&amp;gt; [Smart Router] --&amp;gt; [GPT/Claude/Gemini/...]
                                               |
                                               --&amp;gt; [Auto-failover when rate-limited]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Instead of calling each provider directly, your app talks to &lt;strong&gt;one endpoint&lt;/strong&gt;. The gateway handles everything else behind the scenes.&lt;/p&gt;

&lt;h2&gt;Key Technical Decisions&lt;/h2&gt;

&lt;h3&gt;1. Smart Load Balancing&lt;/h3&gt;

&lt;p&gt;The most critical feature is automatic failover. When one upstream account hits a rate limit, the router retries the request on another available account. As long as at least one account still has capacity, your app never sees a &lt;code&gt;429&lt;/code&gt; error.&lt;/p&gt;

&lt;p&gt;Here's a simplified version of the routing logic:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def route_request(model, messages):
    # Helpers and exception types are defined elsewhere in the gateway.
    upstreams = get_available_upstreams(model)
    for upstream in upstreams:
        try:
            # The first upstream that answers wins.
            return upstream.call(messages)
        except RateLimitError:
            # Take this upstream out of rotation and try the next one.
            mark_rate_limited(upstream)
    # Every candidate was rate-limited (or none was available).
    raise AllUpstreamsBusy()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
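
&lt;p&gt;The routing function leans on two helpers it doesn't show. Here is a rough sketch of how they could work, assuming a fixed cooldown per rate-limited upstream. Everything in it is an assumption for illustration rather than the gateway's real code: the in-memory dictionary, the &lt;code&gt;UPSTREAMS&lt;/code&gt; table, the account names, the 30-second window, and the use of plain strings in place of upstream objects.&lt;/p&gt;

```python
import time

# Hypothetical in-memory registry: upstream name mapped to the time
# at which it may be tried again. Not the gateway's real data model.
_cooldown_until = {}

# Assumed pool of upstream accounts per model (illustrative names only).
UPSTREAMS = {"gpt-4o": ["account-a", "account-b", "account-c"]}

COOLDOWN_SECONDS = 30.0  # assumed fixed back-off window


def mark_rate_limited(upstream, now=None):
    """Take an upstream out of rotation for COOLDOWN_SECONDS."""
    now = time.monotonic() if now is None else now
    _cooldown_until[upstream] = now + COOLDOWN_SECONDS


def get_available_upstreams(model, now=None):
    """Return the upstreams for a model whose cooldown has expired."""
    now = time.monotonic() if now is None else now
    return [
        name
        for name in UPSTREAMS.get(model, [])
        if now >= _cooldown_until.get(name, float("-inf"))
    ]
```

&lt;p&gt;The optional &lt;code&gt;now&lt;/code&gt; argument only exists to make the behavior easy to test. With several gateway processes, this state would have to live in a shared store instead of a module-level dictionary.&lt;/p&gt;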

&lt;h3&gt;2. Drop-in OpenAI SDK Compatibility&lt;/h3&gt;

&lt;p&gt;The API is fully compatible with the OpenAI SDK format. Switching only touches the client constructor: swap in your gateway key and add a &lt;code&gt;base_url&lt;/code&gt;.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from openai import OpenAI

# Before: calling OpenAI directly
client = OpenAI(api_key="sk-...")

# After: routing through the gateway
client = OpenAI(
    api_key="your-ourhubapi-key",
    base_url="https://api.ourhubapi.com/v1"
)

# Everything else stays the same
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello!"}]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;3. Usage Quotas per API Key&lt;/h3&gt;

&lt;p&gt;For small teams, cost control is essential. Each API key can have:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spending caps (daily / monthly)&lt;/li&gt;
&lt;li&gt;Rate limits (requests per minute)&lt;/li&gt;
&lt;li&gt;Model access control (enable only what the team needs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This way, you can give keys to team members without worrying about surprise bills.&lt;/p&gt;

&lt;h2&gt;Why Not Just Use the Official APIs?&lt;/h2&gt;

&lt;p&gt;A fair question. If you're using a single model with low traffic, the official API might work fine. But a middleware layer becomes genuinely useful once you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Need multiple models in one project&lt;/li&gt;
&lt;li&gt;Hit rate limits during development&lt;/li&gt;
&lt;li&gt;Want predictable costs across a team&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It's the same reason we use load balancers for web servers: redundancy and simplicity.&lt;/p&gt;

&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;p&gt;Building this taught me a lot about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Handling distributed rate limits gracefully&lt;/li&gt;
&lt;li&gt;Designing APIs that developers actually want to use&lt;/li&gt;
&lt;li&gt;The importance of "it just works" over feature overload&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;Try It Out&lt;/h2&gt;

&lt;p&gt;The service is live at ourhubapi.com. I'd love to hear your feedback: what features would make this useful for your own projects?&lt;/p&gt;

&lt;p&gt;This is very much a v1, built by a developer for developers. If you have thoughts, criticisms, or feature requests, drop a comment below. I'm reading every single one.&lt;/p&gt;
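
&lt;p&gt;One more thing for anyone curious about the per-key quotas from section 3: below is a minimal sketch of what such a check might look like. It is not the gateway's actual implementation. &lt;code&gt;KeyPolicy&lt;/code&gt;, &lt;code&gt;check_request&lt;/code&gt;, the field names, and the reason strings are all made up for illustration, the caller is assumed to supply a cost estimate and a timestamp in seconds, and only a daily cap is modeled (a monthly cap would follow the same pattern).&lt;/p&gt;

```python
from dataclasses import dataclass, field


@dataclass
class KeyPolicy:
    """Hypothetical per-key limits; field names are illustrative."""
    daily_cap_usd: float
    requests_per_minute: int
    allowed_models: frozenset
    spent_today_usd: float = 0.0
    recent_request_times: list = field(default_factory=list)


def check_request(policy, model, estimated_cost_usd, now):
    """Return (allowed, reason) for one incoming request.

    `now` is a timestamp in seconds supplied by the caller.
    """
    # Model access control: only explicitly enabled models pass.
    if model not in policy.allowed_models:
        return False, "model not enabled for this key"

    # Sliding one-minute window: keep timestamps from the last 60 seconds.
    window = [t for t in policy.recent_request_times if t > now - 60]
    policy.recent_request_times = window
    if len(window) >= policy.requests_per_minute:
        return False, "rate limit reached"

    # Spending cap: reject if this request would push the key over budget.
    if policy.spent_today_usd + estimated_cost_usd > policy.daily_cap_usd:
        return False, "daily spending cap reached"

    # Record the request only once every check has passed.
    policy.recent_request_times.append(now)
    policy.spent_today_usd += estimated_cost_usd
    return True, "ok"
```

&lt;p&gt;Checking model access first keeps a disallowed request from consuming rate-limit or budget headroom. Resetting &lt;code&gt;spent_today_usd&lt;/code&gt; at the start of each day is left out of the sketch.&lt;/p&gt;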

</description>
      <category>ai</category>
      <category>devops</category>
      <category>webdev</category>
      <category>python</category>
    </item>
  </channel>
</rss>
