I was deep into building a side project—a multi-model chat assistant that could switch between OpenAI, Anthropic, and a few open-source models. The idea was simple: let users pick their preferred backend. The reality was a nightmare of rate limits, API key management, and inconsistent response times.
At first, I tried the obvious: rotate API keys. I had a dozen keys from different accounts, swapped them in code, and hoped for the best. That worked for about a day. Then I started getting 429 errors again, plus the keys got flagged for abuse. Not great.
Next, I looked at third-party API gateways. Some were promising, but they added another hop and a monthly bill I didn't want for a personal project. I needed something lightweight, self-hosted, and controllable.
So I built my own AI gateway. Here's how it works, why I chose each piece, and the trade-offs I discovered along the way.
The Problem in Detail
When you call an AI API directly from your frontend or backend, you're at the mercy of that service's rate limits. OpenAI, for example, has tiered limits based on your account history. If you hit them, you wait. If you integrate multiple models, you're juggling different limit headers, retry logic, and costs.
I wanted a single endpoint that:
- Queues requests so I don't exceed limits
- Caches responses for identical prompts (within reason)
- Logs usage per model and per user
- Handles fallback if one provider is down
What I Tried That Didn't Work
1. Simple Retry with Exponential Backoff
I coded a retry loop with delays. It worked okay for occasional 429s, but high-frequency requests still failed. Worse, my app became unresponsive for clients as they waited for retries.
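For reference, here is a minimal sketch of that kind of retry loop. It is not the exact code I ran; the helper name `retryWithBackoff`, its options, and the jitter strategy are assumptions for illustration. It also shows why clients stalled: the caller blocks for the sum of every delay before it gets an answer.

```javascript
// Retry an async call on HTTP 429 with exponential backoff and random jitter.
// `fn` is any function returning a promise; errors are expected to carry
// `err.response.status`, the way axios errors do.
// `sleep` is injectable so the delays can be replaced in tests (assumption).
async function retryWithBackoff(fn, { retries = 4, baseMs = 500, sleep = defaultSleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const status = err.response && err.response.status;
      // Only rate-limit errors are retried, and only up to `retries` times
      if (status !== 429 || attempt >= retries) throw err;
      // The delay ceiling doubles each attempt: baseMs, 2*baseMs, 4*baseMs, ...
      const ceiling = baseMs * 2 ** attempt;
      await sleep(Math.random() * ceiling);
    }
  }
}

function defaultSleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}
```

With the defaults, a request that keeps hitting 429 can wait up to 500 + 1000 + 2000 + 4000 ms before it finally fails, which is a long time for a chat UI.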
2. Using Multiple API Keys in Rotation
I stored keys in a pool and cycled through them. This reduced errors, but I hit account-level rate limits because all keys were on the same plan. Also, managing key revocation was a pain.
3. External Proxy Services
I tried a few managed API gateway services. They worked, but for a hobby project the cost ($20+/month) was hard to justify. Plus, I didn't love sending my prompt data through another unknown server.
The Approach That Worked: A Node.js Gateway
I decided to build a simple Express server that sits between my app and the AI providers. It handles queuing, caching, and logging. Here's the core structure.
Code Skeleton
const express = require('express');
const axios = require('axios');
const Bottleneck = require('bottleneck'); // for rate limiting
const NodeCache = require('node-cache');
const app = express();
app.use(express.json());
// Cache responses for 10 minutes; keyed by provider + prompt hash
const cache = new NodeCache({ stdTTL: 600 });
// Rate limiters per provider. minTime is the minimum gap (ms) between request starts.
const limiters = {
  openai: new Bottleneck({ maxConcurrent: 1, minTime: 1000 }), // 1 req/sec (60/min)
  anthropic: new Bottleneck({ maxConcurrent: 1, minTime: 2000 }) // 0.5 req/sec (30/min)
};
// Simple in-memory usage log
const usageLog = [];
app.post('/ai-gateway', async (req, res) => {
  const { provider, prompt, model } = req.body || {};
  // Validate input before hashing the prompt or touching the cache
  const limiter = Object.hasOwn(limiters, provider) ? limiters[provider] : undefined;
  if (!limiter) return res.status(400).json({ error: 'Unknown provider' });
  if (typeof prompt !== 'string' || typeof model !== 'string') {
    return res.status(400).json({ error: 'prompt and model must be strings' });
  }
  // Generate cache key
  const cacheKey = `${provider}:${model}:${hash(prompt)}`;
  const cached = cache.get(cacheKey);
  if (cached !== undefined) {
    console.log('Cache hit');
    return res.json(cached);
  }
  try {
    // Queue the call behind this provider's rate limiter
    const response = await limiter.schedule(() => callProvider(provider, prompt, model));
    cache.set(cacheKey, response.data);
    usageLog.push({ provider, model, timestamp: Date.now() });
    res.json(response.data);
  } catch (err) {
    console.error('Provider error:', err.message);
    // No fallback yet: any provider failure is reported as a 502
    res.status(502).json({ error: 'Provider unavailable' });
  }
});
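// Call the chosen provider. Keys are read from OPENAI_KEY / ANTHROPIC_KEY.
async function callProvider(provider, prompt, model) {
  const key = process.env[provider.toUpperCase() + '_KEY'];
  const messages = [{ role: 'user', content: prompt }];
  if (provider === 'openai') {
    return axios.post(
      'https://api.openai.com/v1/chat/completions',
      { model, messages },
      { headers: { Authorization: `Bearer ${key}` } }
    );
  }
  // Anthropic authenticates with x-api-key (not a Bearer token) and requires
  // an anthropic-version header plus max_tokens in the body (1024 is arbitrary).
  return axios.post(
    'https://api.anthropic.com/v1/messages',
    { model, max_tokens: 1024, messages },
    { headers: { 'x-api-key': key, 'anthropic-version': '2023-06-01' } }
  );
}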
// Hash the prompt with SHA-256. A 32-bit string hash can collide, and a
// collision here would return another prompt's cached response.
const crypto = require('crypto');
function hash(str) {
  return crypto.createHash('sha256').update(str, 'utf8').digest('hex');
}
app.listen(3000, () => console.log('AI Gateway running on port 3000'));
This is a minimal version. In production, you'd want proper error handling, authentication, and a persistent cache like Redis. But it's enough to see the pattern.
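To make the request shape concrete, here is a hedged sketch of a client call against the skeleton above. The helper name `askGateway` is made up for this example, the base URL assumes the gateway is running locally on port 3000, and the global `fetch` needs Node 18 or newer.

```javascript
// Send one prompt to the gateway and return the parsed provider response.
// The body fields (provider, model, prompt) match what the /ai-gateway
// handler destructures from req.body.
async function askGateway({ provider, model, prompt }, baseUrl = 'http://localhost:3000') {
  const res = await fetch(`${baseUrl}/ai-gateway`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ provider, model, prompt }),
  });
  if (!res.ok) {
    // The skeleton answers 400 for an unknown provider and 502 on provider errors
    throw new Error(`Gateway returned ${res.status}`);
  }
  return res.json();
}
```

A call would look like `askGateway({ provider: 'openai', model: 'gpt-4o-mini', prompt: 'Hello' })`, where the model name is only an example. The gateway passes the provider's response through untouched, so the shape of the result differs between OpenAI and Anthropic.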
Lessons Learned & Trade-offs
Caching: Double-Edged Sword
Caching identical prompts saved me tons of API calls and latency. But for conversational AI, prompts are rarely identical unless you're testing. I ended up caching only exact matches, with a short TTL (5 minutes; the skeleton above uses 10) to avoid stale responses.
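To show what "exact match plus TTL" means in practice, here is a small dependency-free sketch. It is a stand-in for node-cache, not a copy of it: the names `cacheKey` and `TtlCache` and the injectable clock are assumptions made for illustration.

```javascript
const crypto = require('node:crypto');

// Build a cache key from provider, model, and a SHA-256 digest of the prompt.
// Any difference in the prompt, even trailing whitespace, yields a new key.
function cacheKey(provider, model, prompt) {
  const digest = crypto.createHash('sha256').update(prompt, 'utf8').digest('hex');
  return `${provider}:${model}:${digest}`;
}

// Tiny in-memory TTL cache. `now` is injectable so expiry can be tested
// without waiting for real time to pass.
class TtlCache {
  constructor(ttlMs, now = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
    this.entries = new Map();
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      // Expired entries are dropped lazily, on the next read
      this.entries.delete(key);
      return undefined;
    }
    return entry.value;
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

Because the key is a digest of the raw prompt, "What is a gateway?" and "What is a gateway? " (with a trailing space) are two separate cache entries. That strictness is why the hit rate stays low for real conversations.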
Rate Limiting: The Bottleneck Library
I used bottleneck to queue requests per provider. It works well, but setting the right minTime (the minimum gap between requests, in milliseconds) is tricky. Set it too low and you still get 429s; set it too high and your users wait. I started with conservative values and tuned after monitoring the logs.
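One way to pick a starting value is to derive minTime from a requests-per-minute budget. The helper below is a sketch: the name `minTimeForRpm` and the `headroom` parameter are my own, and the budget you feed it should be whatever limit your account actually has.

```javascript
// Convert a requests-per-minute budget into a minimum spacing between
// requests, in milliseconds. `headroom` (a fraction from 0 up to, but not
// including, 1) leaves part of the budget unused as a safety margin.
function minTimeForRpm(rpm, headroom = 0.1) {
  if (!(rpm > 0)) throw new RangeError('rpm must be positive');
  if (!(headroom >= 0 && headroom < 1)) throw new RangeError('headroom must be in [0, 1)');
  const usableRpm = rpm * (1 - headroom);
  // Round up so the spacing never allows more than the usable budget
  return Math.ceil(60000 / usableRpm);
}

console.log(minTimeForRpm(60, 0)); // 1000
console.log(minTimeForRpm(30, 0)); // 2000
```

The two printed values correspond to the minTime settings in the skeleton: 1000 ms is 60 requests per minute and 2000 ms is 30. Keep in mind that with maxConcurrent set to 1, a slow provider response lowers throughput further, because the next request cannot start until the previous one finishes.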
Logging: Keep It Simple
I logged each request to an array in memory. For a real app, you'd want a database. I used this data to see which models were most popular and to calculate costs.
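As a sketch of the "most popular models" part, the function below counts requests per provider/model pair from entries shaped like the skeleton's usageLog. The function name `summarizeUsage` and the output shape are assumptions.

```javascript
// Count requests per provider/model pair from entries shaped like the
// gateway's usageLog: { provider, model, timestamp }.
function summarizeUsage(usageLog) {
  const counts = new Map();
  for (const { provider, model } of usageLog) {
    const key = `${provider}/${model}`;
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  // Return the most-used pairs first
  return [...counts.entries()]
    .map(([key, requests]) => ({ key, requests }))
    .sort((a, b) => b.requests - a.requests);
}
```

Request counts alone only give a rough picture of cost. For anything more precise, each log entry would also need the token counts that both providers report in the usage field of their responses, which the skeleton does not record.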
Fallback Strategy
In the code above, if one provider fails, I just return a 502. A better approach is to try another provider with the same prompt. I added that later, but be careful—some prompts work better on different models. It's not a perfect fallback.
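A fallback along those lines can be sketched as below. This is not the code I ended up with, just the shape of the idea: `callWithFallback` and the `call` callback are hypothetical names, and in the gateway the callback would wrap `limiter.schedule` and `callProvider`.

```javascript
// Try providers in order and return the first successful result.
// `call(provider)` is a caller-supplied async function (hypothetical here).
async function callWithFallback(providers, call) {
  const attempts = [];
  for (const provider of providers) {
    try {
      // Report which provider answered so the caller can log or cache per provider
      return { provider, result: await call(provider) };
    } catch (err) {
      attempts.push({ provider, message: err.message });
    }
  }
  // Every provider failed: surface all the errors on one exception
  const failure = new Error('All providers failed');
  failure.attempts = attempts;
  throw failure;
}
```

Model names are provider-specific, so a real fallback also needs a mapping from the requested model to a comparable model on the backup provider. The cache key should use the provider that actually answered, not the one that was requested.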
What I'd Do Differently Next Time
- Use Redis for caching and rate limiting – in-memory cache is lost on restart, and node-cache doesn't persist.
- Add authentication – anyone can hit this gateway now. I'd add a simple API key check.
- Split the gateway per provider – one Express app for all providers is fine, but as you add more, the config gets messy.
- Consider a message queue – for very high loads, use RabbitMQ or Kafka to decouple request receipt from processing.
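The authentication item is the quickest one to address. Here is a minimal sketch of an API key check written as Express-style middleware; the header name `x-gateway-key` and the function name `requireApiKey` are assumptions, not an established convention.

```javascript
const crypto = require('node:crypto');

// Express-style middleware that rejects requests lacking the expected key.
// The header name (x-gateway-key) is an assumption for this sketch.
function requireApiKey(expectedKey) {
  if (typeof expectedKey !== 'string' || expectedKey.length === 0) {
    throw new Error('expectedKey must be a non-empty string');
  }
  const expected = crypto.createHash('sha256').update(expectedKey).digest();
  return function apiKeyMiddleware(req, res, next) {
    const supplied = req.headers['x-gateway-key'];
    if (typeof supplied !== 'string') {
      return res.status(401).json({ error: 'Missing API key' });
    }
    // Hash both sides so the buffers have equal length, which
    // crypto.timingSafeEqual requires
    const actual = crypto.createHash('sha256').update(supplied).digest();
    if (!crypto.timingSafeEqual(actual, expected)) {
      return res.status(401).json({ error: 'Invalid API key' });
    }
    next();
  };
}
```

It would be registered before the route, for example `app.use(requireApiKey(process.env.GATEWAY_API_KEY))`, where GATEWAY_API_KEY is a hypothetical environment variable. A single shared key is fine for a personal project; per-user keys would also make the per-user usage logging possible.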
When NOT to Use This Approach
- If you only use one AI provider and are fine with their rate limits, don't add complexity.
- If your team has budget, a managed gateway like ai.interwestinfo.com (check it out for inspiration) might save you dev time.
- For production apps with strict latency requirements, a custom gateway adds overhead. You might want to call the API directly and handle retries client-side.
Wrapping Up
Building my own AI gateway was a fun exercise that taught me a lot about rate limiting, caching, and designing resilient systems. It's not perfect, but it solved my immediate problem and gave me full control over costs and data flow.
I'm curious—how do you handle multiple AI API integrations in your projects? Do you use a gateway, rotate keys, or something else? Let me know in the comments.