How I stopped worrying about OpenAI rate limits (and costs)

#ai #api #webdev #tutorial

A few months ago, I was building a feature that needed to generate summaries of user-uploaded documents. I went straight to OpenAI's API — it's the obvious choice, right? Within a week, I hit the rate limit wall. Hard. My app would just return 429 errors during peak usage. Users saw blank pages. I tried retries with exponential backoff, but that made things worse — queued requests backed up, and the latency became unbearable.

Then I looked at the bill. $300 for a week of moderate usage. I knew I needed a different approach.

What I tried first

My first instinct was to cache responses. If two users uploaded the same document, why call the API twice? So I added a simple in-memory cache keyed by a hash of the document. That helped a bit, but most documents were unique. Cache hit rate was maybe 5%.
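
For reference, here is a minimal sketch of that kind of content-hash cache. It is not the exact code from my app: documentKey, summaryCache, and cachedSummary are names made up for illustration, and summarize stands in for whatever async function actually calls the AI provider.

```javascript
const crypto = require('node:crypto');

// Hypothetical helper: derive a stable cache key from the document content.
function documentKey(content) {
  return crypto.createHash('sha256').update(content).digest('hex');
}

// In-memory cache: lives in one process and is lost on restart.
const summaryCache = new Map();

// `summarize` is an assumed async function that calls the AI provider.
async function cachedSummary(content, summarize) {
  const key = documentKey(content);
  if (summaryCache.has(key)) {
    return summaryCache.get(key); // identical document seen before
  }
  const summary = await summarize(content);
  summaryCache.set(key, summary);
  return summary;
}
```

A cache like this only pays off for byte-identical uploads, which is why the hit rate stays low when most documents are unique.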

Next, I tried switching to a cheaper provider. I experimented with Anthropic's Claude and Cohere. Their APIs were different, authentication was different, response formats were different. I ended up writing a messy adapter layer that still broke whenever I switched providers mid-request.
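
To make the mismatch concrete, here is a rough sketch of translating an OpenAI-style chat request into the shape Anthropic's Messages API expects. The function names are hypothetical, the 1024 default for max_tokens is an arbitrary choice, and it assumes plain string message content (no images or tool calls).

```javascript
// Hypothetical helper: translate an OpenAI-style chat request body into an
// Anthropic Messages API request body. The target model is passed in because
// model identifiers do not carry over between providers.
function toAnthropicRequest(openaiBody, anthropicModel) {
  const messages = openaiBody.messages || [];
  // Anthropic takes the system prompt as a top-level field, not as a message.
  const system = messages
    .filter((m) => m.role === 'system')
    .map((m) => m.content)
    .join('\n');
  const request = {
    model: anthropicModel,
    // max_tokens is required by Anthropic; 1024 is an arbitrary fallback.
    max_tokens: openaiBody.max_tokens ?? 1024,
    messages: messages
      .filter((m) => m.role !== 'system')
      .map((m) => ({ role: m.role, content: m.content })),
  };
  if (system) request.system = system;
  return request;
}

// Auth differs too: an x-api-key header plus a version header, not a Bearer token.
function anthropicHeaders(apiKey) {
  return {
    'x-api-key': apiKey,
    'anthropic-version': '2023-06-01',
    'content-type': 'application/json',
  };
}
```

Even this small translation has to make decisions (where the system prompt goes, what to do without max_tokens), which is how an adapter layer gets messy fast.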

I also looked into using multiple API keys — rotating them when one hit its limit. But that felt fragile. What if all keys were exhausted?

None of these felt like a real solution. They were band-aids.

What eventually worked: A local AI proxy

I needed a single point of control — something that sat between my app and all the AI providers. A local proxy that could:

Route requests to different providers based on cost, latency, or availability.
Implement rate limiting and queue management.
Cache responses intelligently.
Handle authentication without leaking API keys to the frontend.

I built a small middleware service using Express.js. It exposes a single endpoint /v1/chat/completions (OpenAI-compatible). Internally, it decides which backend to call.

Here's the core logic (simplified for clarity):

const express = require('express');
const axios = require('axios');
const app = express();
app.use(express.json()); // Parse JSON bodies; without this, req.body is undefined

// Providers with their API endpoints and keys (example config; load keys from env vars).
// The costPerToken values are illustrative and only used to rank providers.
// Auth schemes differ per provider, so each entry builds its own headers.
const bearer = (key) => ({ 'Authorization': `Bearer ${key}`, 'Content-Type': 'application/json' });
const providers = [
  { name: 'openai', url: 'https://api.openai.com/v1/chat/completions', key: process.env.OPENAI_KEY, costPerToken: 0.002, headers: bearer },
  // Anthropic expects x-api-key plus a version header, and its Messages API takes a
  // different request body, so a real proxy must also translate req.body for it.
  { name: 'anthropic', url: 'https://api.anthropic.com/v1/messages', key: process.env.ANTHROPIC_KEY, costPerToken: 0.0015,
    headers: (key) => ({ 'x-api-key': key, 'anthropic-version': '2023-06-01', 'Content-Type': 'application/json' }) },
  { name: 'ai-local-proxy', url: 'https://ai.interwestinfo.com/v1/chat/completions', key: process.env.AI_PROXY_KEY, costPerToken: 0.001, headers: bearer }
];

// Simple in-memory cache (use Redis in production)
const cache = new Map();

app.post('/v1/chat/completions', async (req, res) => {
  const cacheKey = JSON.stringify(req.body);
  if (cache.has(cacheKey)) {
    return res.json(cache.get(cacheKey));
  }

  // Sort providers by cost, then try each in order
  const sorted = [...providers].sort((a, b) => a.costPerToken - b.costPerToken);

  for (const provider of sorted) {
    try {
      const response = await axios.post(provider.url, req.body, {
        headers: provider.headers(provider.key),
        timeout: 15000
      });
      // Cache the successful response
      cache.set(cacheKey, response.data);
      return res.json(response.data);
    } catch (err) {
      console.warn(`Provider ${provider.name} failed: ${err.message}`);
      // Fall through to next provider
    }
  }

  res.status(503).json({ error: 'All providers failed' });
});

app.listen(3000);

This is just a skeleton. In production, I added:

A token bucket rate limiter per provider
Response streaming support
Retry logic for transient failures (server errors, not 4xx)
Health checks to skip providers that are down
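
To give a flavor of the first item, here is a minimal token bucket sketch. It is an illustration rather than my production code: the class name, the per-provider limits, and the injectable now parameter (there so the logic can be tested without waiting) are all assumptions.

```javascript
// Minimal token bucket: holds up to `capacity` tokens and refills
// continuously at `refillPerSecond`. Each outgoing request costs one token.
class TokenBucket {
  constructor(capacity, refillPerSecond, now = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full
    this.lastRefill = now;
  }

  // Returns true and consumes a token if a request may go out at time `now`.
  tryRemove(now = Date.now()) {
    const elapsedSeconds = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefill = now;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// One bucket per provider name. These limits are made-up numbers;
// real ones depend on each provider's account tier.
const buckets = new Map([
  ['openai', new TokenBucket(60, 1)],
  ['anthropic', new TokenBucket(50, 1)],
]);
```

In the routing loop, a provider whose bucket returns false from tryRemove() can simply be skipped, so the proxy never sends a request it already expects to come back as a 429.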

What I learned

This approach solved my rate limit and cost problems, but it introduced new trade-offs:

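Latency: When a provider hangs instead of failing fast, falling back costs a full timeout period before the next provider is even tried. I had to tune timeouts aggressively: the first provider gets 10 seconds, the second 15, the third 20 (the skeleton above uses a flat 15). That puts worst-case latency at 10 + 15 + 20 = 45 seconds before the proxy gives up with a 503. For some use cases, that's unacceptable.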
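
The arithmetic is easy to sketch, along with one possible mitigation: an overall deadline that shrinks later attempts. The timeout values come from the paragraph above; timeoutsWithinBudget and the 30-second budget are hypothetical additions, not something my proxy shipped with.

```javascript
// Per-attempt timeouts from the text, in milliseconds.
const attemptTimeoutsMs = [10000, 15000, 20000];

// Worst case: every provider hangs until its timeout fires.
function worstCaseLatencyMs(timeouts) {
  return timeouts.reduce((sum, t) => sum + t, 0);
}

// Hypothetical mitigation: cap each attempt so the total never exceeds budgetMs.
// Attempts that would start after the budget is spent are dropped.
function timeoutsWithinBudget(timeouts, budgetMs) {
  let remaining = budgetMs;
  const capped = [];
  for (const t of timeouts) {
    if (remaining <= 0) break;
    const allowed = Math.min(t, remaining);
    capped.push(allowed);
    remaining -= allowed;
  }
  return capped;
}

console.log(worstCaseLatencyMs(attemptTimeoutsMs)); // 45000
console.log(timeoutsWithinBudget(attemptTimeoutsMs, 30000)); // [ 10000, 15000, 5000 ]
```

The trade-off is that the last provider gets less time to answer, so a tight budget effectively turns it into a best-effort fallback.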

Consistency: Different providers return slightly different response formats. My proxy normalizes them, but it's not perfect. If you rely on specific token counts or logprobs, you'll need extra logic.
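
As an illustration of what that normalization involves, here is a simplified sketch that maps an Anthropic Messages API response onto the OpenAI chat-completion shape. The function name is hypothetical, it handles text blocks only, and the finish_reason mapping is deliberately coarse (tool use is not covered).

```javascript
// Map an Anthropic Messages API response onto the OpenAI chat-completion shape.
// Simplified: text content only, coarse finish_reason mapping.
function normalizeAnthropic(data) {
  // Anthropic returns content as an array of typed blocks; join the text ones.
  const text = (data.content || [])
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('');
  const usage = data.usage || {};
  const promptTokens = usage.input_tokens ?? 0;
  const completionTokens = usage.output_tokens ?? 0;
  return {
    id: data.id,
    object: 'chat.completion',
    model: data.model,
    choices: [{
      index: 0,
      message: { role: 'assistant', content: text },
      finish_reason: data.stop_reason === 'max_tokens' ? 'length' : 'stop',
    }],
    usage: {
      prompt_tokens: promptTokens,
      completion_tokens: completionTokens,
      total_tokens: promptTokens + completionTokens,
    },
  };
}
```

Note that the usage numbers are copied across as-is. Providers tokenize differently, so the counts are not comparable between backends even after the field names line up.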

Complexity: What was a simple API call became a distributed system. I now have to monitor three external services instead of one. Logging and debugging got harder.

When NOT to use this

If your use case is low volume (<100 requests/day), just use OpenAI directly and cache aggressively. You'll likely never hit limits. This proxy pattern makes sense when:

You have unpredictable traffic spikes
Your budget is tight and you want to use the cheapest provider per request
You need 99.9% uptime and can't afford single-provider outages
You want to add custom middleware like profanity filters, logging, or A/B testing between models

What I'd do differently next time

I'd use a battle-tested gateway like Kong or Envoy with a plugin for AI routing instead of rolling my own Express server. But for a small team, the custom proxy gave me full control and was surprisingly easy to maintain.

I also regret not implementing a circuit breaker earlier. After a provider fails three times in a row, you should stop trying it for a minute. Simple to add, huge reliability win.
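
Here is roughly what I mean, as a minimal sketch. The class and method names are made up, the thresholds match the numbers above (three failures, one minute), and now is injectable only so the behavior can be checked without waiting.

```javascript
// Minimal circuit breaker: after `failureThreshold` consecutive failures,
// block the provider for `cooldownMs`, then allow a trial request.
class CircuitBreaker {
  constructor(failureThreshold = 3, cooldownMs = 60000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
    this.consecutiveFailures = 0;
    this.openedAt = null; // time the breaker tripped, or null while closed
  }

  // True if the provider may be tried at time `now`.
  allowRequest(now = Date.now()) {
    if (this.openedAt === null) return true;
    return now - this.openedAt >= this.cooldownMs; // cooldown over: allow a trial
  }

  recordSuccess() {
    this.consecutiveFailures = 0;
    this.openedAt = null;
  }

  recordFailure(now = Date.now()) {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) {
      this.openedAt = now; // trip the breaker, or restart the cooldown
    }
  }
}
```

In the proxy loop, that would mean keeping one breaker per provider, skipping any provider whose allowRequest() returns false, and calling recordSuccess() or recordFailure() after each attempt.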

The real lesson

The technique here isn't about any specific tool. It's about adding an abstraction layer between your app and third-party APIs. Whether you're calling AI, payment gateways, or mapping APIs, the proxy pattern gives you resilience and flexibility. The URL https://ai.interwestinfo.com/ happens to be one of the providers I used in my config — it offered a competitive price and good uptime — but the approach works with any list of endpoints.

So, what does your setup look like? Do you route AI requests through a proxy, or do you trust a single provider?