A few months ago, I was building a feature that needed to generate summaries of user-uploaded documents. I went straight to OpenAI's API — it's the obvious choice, right? Within a week, I hit the rate limit wall. Hard. My app would just return 429 errors during peak usage. Users saw blank pages. I tried retries with exponential backoff, but that made things worse — queued requests backed up, and the latency became unbearable.
Then I looked at the bill. $300 for a week of moderate usage. I knew I needed a different approach.
What I tried first
My first instinct was to cache responses. If two users uploaded the same document, why call the API twice? So I added a simple in-memory cache keyed by a hash of the document. That helped a bit, but most documents were unique. Cache hit rate was maybe 5%.
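A minimal sketch of that hash-keyed cache, assuming the document arrives as a string or Buffer. The names `documentKey`, `summaryCache`, and `cachedSummary` are hypothetical, and `summarize` stands in for whatever function actually calls the API.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: derive a cache key from the raw document content.
function documentKey(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// In-memory cache: lost on restart and not shared between processes.
const summaryCache = new Map();

// `summarize` is an assumed async function that calls the AI API.
async function cachedSummary(content, summarize) {
  const key = documentKey(content);
  if (summaryCache.has(key)) {
    return summaryCache.get(key);
  }
  const summary = await summarize(content);
  summaryCache.set(key, summary);
  return summary;
}
```

Hashing only matches byte-identical uploads, which is consistent with the low hit rate: two near-identical documents still produce different keys.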
Next, I tried switching to a cheaper provider. I experimented with Anthropic's Claude and Cohere. Their APIs were different, authentication was different, response formats were different. I ended up writing a messy adapter layer that still broke whenever I switched providers mid-request.
I also looked into using multiple API keys — rotating them when one hit its limit. But that felt fragile. What if all keys were exhausted?
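For illustration, key rotation might look roughly like the sketch below. `KeyRotator` and its cooldown value are assumptions, not something from my original code. The `null` return is exactly the fragile case: once every key is cooling down, there is nothing left to rotate to.

```javascript
// Hypothetical key rotator: hands out the next key that is not cooling down.
class KeyRotator {
  constructor(keys, cooldownMs = 60_000, now = Date.now) {
    this.keys = keys.map((key) => ({ key, blockedUntil: 0 }));
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, handy for testing
    this.next = 0;
  }

  // Returns a usable key, or null when every key is rate limited.
  acquire() {
    for (let i = 0; i < this.keys.length; i++) {
      const index = (this.next + i) % this.keys.length;
      const entry = this.keys[index];
      if (entry.blockedUntil <= this.now()) {
        this.next = (index + 1) % this.keys.length;
        return entry.key;
      }
    }
    return null;
  }

  // Call this when the API answers 429 for the given key.
  markLimited(key) {
    const entry = this.keys.find((k) => k.key === key);
    if (entry) entry.blockedUntil = this.now() + this.cooldownMs;
  }
}
```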
None of these felt like a real solution. They were band-aids.
What eventually worked: A local AI proxy
I needed a single point of control — something that sat between my app and all the AI providers. A local proxy that could:
- Route requests to different providers based on cost, latency, or availability.
- Implement rate limiting and queue management.
- Cache responses intelligently.
- Handle authentication without leaking API keys to the frontend.
I built a small middleware service using Express.js. It exposes a single endpoint /v1/chat/completions (OpenAI-compatible). Internally, it decides which backend to call.
Here's the core logic (simplified for clarity):
const express = require('express');
const axios = require('axios');

const app = express();
// Parse JSON request bodies; without this, req.body is undefined
app.use(express.json());

// Providers with their API endpoints and keys
// Example config; you'd load from env vars
// Note: Anthropic's Messages API expects x-api-key and anthropic-version headers
// and its own request body format, so a real setup needs a per-provider adapter.
const providers = [
  { name: 'openai', url: 'https://api.openai.com/v1/chat/completions', key: process.env.OPENAI_KEY, costPerToken: 0.002 },
  { name: 'anthropic', url: 'https://api.anthropic.com/v1/messages', key: process.env.ANTHROPIC_KEY, costPerToken: 0.0015 },
  { name: 'ai-local-proxy', url: 'https://ai.interwestinfo.com/v1/chat/completions', key: process.env.AI_PROXY_KEY, costPerToken: 0.001 }
];

// Simple in-memory cache (use Redis in production)
const cache = new Map();

app.post('/v1/chat/completions', async (req, res) => {
  // Key on the full request body; note that JSON.stringify is sensitive to key order
  const cacheKey = JSON.stringify(req.body);
  if (cache.has(cacheKey)) {
    return res.json(cache.get(cacheKey));
  }

  // Sort providers by cost (cheapest first), then try each in order
  const sorted = [...providers].sort((a, b) => a.costPerToken - b.costPerToken);
  for (const provider of sorted) {
    try {
      const response = await axios.post(provider.url, req.body, {
        headers: { 'Authorization': `Bearer ${provider.key}`, 'Content-Type': 'application/json' },
        timeout: 15000
      });
      // Cache the successful response
      cache.set(cacheKey, response.data);
      return res.json(response.data);
    } catch (err) {
      console.warn(`Provider ${provider.name} failed: ${err.message}`);
      // Fall through to next provider
    }
  }

  // Every provider failed or timed out
  res.status(503).json({ error: 'All providers failed' });
});

app.listen(3000);
This is just a skeleton. In production, I added:
- A token bucket rate limiter per provider
- Response streaming support
- Retry logic for transient failures (server errors, not 4xx)
- Health checks to skip providers that are down
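The first item in that list, a per-provider token bucket, can be sketched as follows. This is an illustrative version, not my production code: `TokenBucket`, the `buckets` map, and the capacity and refill numbers are placeholders, not real provider limits.

```javascript
// Hypothetical token bucket: `capacity` is the burst size,
// `refillPerSec` is the sustained request rate.
class TokenBucket {
  constructor(capacity, refillPerSec, now = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now; // injectable clock, handy for testing
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Add tokens for the time elapsed since the last call, capped at capacity.
  refill() {
    const current = this.now();
    const elapsedSec = (current - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = current;
  }

  // Returns true and spends one token if the request may proceed.
  tryRemove() {
    this.refill();
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per provider name (placeholder limits).
const buckets = {
  openai: new TokenBucket(5, 1),
  anthropic: new TokenBucket(5, 1),
};
```

In the request loop, a provider whose bucket returns `false` from `tryRemove()` would simply be skipped in favor of the next one, instead of being called and answering with a 429.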
What I learned
This approach solved my rate limit and cost problems, but it introduced new trade-offs:
Latency: When a provider hangs instead of failing fast, falling back to the next one costs a full timeout period. I had to tune timeouts aggressively: the first provider gets 10 seconds, the second gets 15, the third gets 20. That means worst-case latency can reach 10 + 15 + 20 = 45 seconds. For some use cases, that's unacceptable.
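The arithmetic is easy to check in code. The second function is a hypothetical alternative, not something described above: it caps each attempt by whatever remains of an overall deadline, so the total can never exceed a budget you choose.

```javascript
// Per-attempt timeouts from the paragraph above, in milliseconds.
const timeoutsMs = [10_000, 15_000, 20_000];

// Worst case: every provider hangs until its timeout expires.
function worstCaseLatencyMs(timeouts) {
  return timeouts.reduce((sum, t) => sum + t, 0);
}

// Hypothetical: shrink an attempt's timeout to fit the remaining overall budget.
function attemptTimeoutMs(configuredMs, startedAt, deadlineMs, now = Date.now()) {
  const remaining = deadlineMs - (now - startedAt);
  return Math.max(0, Math.min(configuredMs, remaining));
}

console.log(worstCaseLatencyMs(timeoutsMs)); // 45000
```

With a 20-second overall deadline, for example, a second attempt that starts 10 seconds in would get 10 seconds rather than its configured 15.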
Consistency: Different providers return slightly different response formats. My proxy normalizes them, but it's not perfect. If you rely on specific token counts or logprobs, you'll need extra logic.
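As a rough illustration of what normalizing involves, here is a sketch that maps an Anthropic Messages response onto the OpenAI chat completion shape. `normalizeAnthropic` and `FINISH_REASONS` are hypothetical names, and the field names reflect my understanding of the two public response formats, so check them against the current API references before relying on this.

```javascript
// Map Anthropic stop reasons onto OpenAI-style finish reasons.
const FINISH_REASONS = {
  end_turn: 'stop',
  stop_sequence: 'stop',
  max_tokens: 'length',
  tool_use: 'tool_calls',
};

// Hypothetical normalizer: Anthropic Messages response -> OpenAI chat completion shape.
function normalizeAnthropic(msg) {
  // Anthropic returns a list of content blocks; join the text ones.
  const text = (msg.content || [])
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('');
  const promptTokens = msg.usage?.input_tokens ?? 0;
  const completionTokens = msg.usage?.output_tokens ?? 0;
  return {
    id: msg.id,
    object: 'chat.completion',
    created: Math.floor(Date.now() / 1000),
    model: msg.model,
    choices: [{
      index: 0,
      message: { role: 'assistant', content: text },
      finish_reason: FINISH_REASONS[msg.stop_reason] ?? 'stop',
    }],
    usage: {
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      total_tokens: promptTokens + completionTokens,
    },
  };
}
```

Even after reshaping, the token counts come from different tokenizers, so they are not comparable across providers, and fields like logprobs have no direct equivalent to map from.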
Complexity: What was a simple API call became a distributed system. I now have to monitor three external services instead of one. Logging and debugging got harder.
When NOT to use this
If your use case is low volume (<100 requests/day), just use OpenAI directly and cache aggressively. You'll likely never hit limits. This proxy pattern makes sense when:
- You have unpredictable traffic spikes
- Your budget is tight and you want to use the cheapest provider per request
- You need 99.9% uptime and can't afford single-provider outages
- You want to add custom middleware like profanity filters, logging, or A/B testing between models
What I'd do differently next time
I'd use a battle-tested gateway like Kong or Envoy with a plugin for AI routing instead of rolling my own Express server. But for a small team, the custom proxy gave me full control and was surprisingly easy to maintain.
I also regret not implementing a circuit breaker earlier. After a provider fails three times in a row, you should stop trying it for a minute. Simple to add, huge reliability win.
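A minimal sketch of that breaker, using the numbers above (three consecutive failures, one-minute cooldown). The `CircuitBreaker` class and its method names are hypothetical; you would keep one instance per provider.

```javascript
// Hypothetical circuit breaker: opens after `threshold` consecutive failures,
// then blocks calls until `cooldownMs` has passed.
class CircuitBreaker {
  constructor(threshold = 3, cooldownMs = 60_000, now = Date.now) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, handy for testing
    this.failures = 0;
    this.openedAt = null;
  }

  // True if the provider may be tried right now.
  canRequest() {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  // A success closes the breaker and resets the failure streak.
  recordSuccess() {
    this.failures = 0;
    this.openedAt = null;
  }

  // A failure extends the streak; at the threshold the breaker (re)opens.
  recordFailure() {
    this.failures += 1;
    if (this.failures >= this.threshold) {
      this.openedAt = this.now();
    }
  }
}
```

In the fallback loop, you would check `canRequest()` before calling a provider, then call `recordSuccess()` or `recordFailure()` depending on the outcome. A failed trial request after the cooldown reopens the breaker for another minute.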
The real lesson
The technique here isn't about any specific tool. It's about adding an abstraction layer between your app and third-party APIs. Whether you're calling AI, payment gateways, or mapping APIs, the proxy pattern gives you resilience and flexibility. The URL https://ai.interwestinfo.com/ happens to be one of the providers I used in my config — it offered a competitive price and good uptime — but the approach works with any list of endpoints.
So, what's your setup look like? Do you route AI requests through a proxy, or do you trust a single provider?