zhongqiyue

Posted on Jun 14

How I Built a Secure AI API Proxy Without Losing My Sanity

#ai #api #webdev #tutorial

I’ve been integrating AI APIs into side projects for a few years now. Every time, I hit the same wall: I want to expose some AI-powered endpoint to my frontend, but I absolutely cannot put the API key in the client. The obvious answer is a backend proxy. But the first few times I built one, I ended up with something messy, insecure, or expensive.

Here’s what actually worked after a lot of trial and error.

The Problem

I was building a little tool that lets users ask questions about documentation. I needed to call an AI API (like OpenAI or Claude) from the frontend. Straightforward, right? Not quite.

API keys in the client: No way. Anyone can inspect network requests and steal your key.
Rate limits: Free tiers get hammered fast, and I didn’t want a runaway bill.
CORS and timeouts: Direct calls from the browser sometimes fail with CORS errors or time out unexpectedly.
Prompt injection: If I didn’t sanitize user input, they could trick the AI into leaking system prompts or doing dangerous things.

So I needed a proxy. Not just any proxy — one that was secure, cost-aware, and easy to maintain.

What I Tried That Didn’t Work

1. A Simple Express Route

I started with the laziest possible solution:

app.get('/api/ai', async (req, res) => {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ /* ... */ })
  });
  res.json(await response.json());
});

This works, but then everyone on my team had access to the same key. No logging, no rate limiting, no cost tracking. A single bug in the frontend could send a million requests and drain my account.

2. Using a Serverless Function

Next I tried a Vercel Edge Function. Cool until I realized that cold starts made it unpredictable, and debugging was a nightmare. Plus, I couldn’t easily log usage.

3. Pre‑built API Gateways

I looked at Kong, Tyk, and even my cloud provider’s API Gateway. Too much overhead. I just wanted to call one AI endpoint, not build a full enterprise solution.

What Eventually Worked: A Minimalist Proxy with Rate Limiting, Caching, and Token Tracking

I built a tiny Node.js Express server with four key features:

Request validation – sanitize user prompts and limit max tokens.
Rate limiting per IP – using express-rate-limit.
Response caching – identical prompts get cached for 5 minutes to save money.
Cost logging – track how many tokens we use and estimated cost.

Here’s the core structure:

const express = require('express');
const rateLimit = require('express-rate-limit');
const NodeCache = require('node-cache');

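const app = express();
app.use(express.json()); // parse JSON bodies so req.body is populated
const cache = new NodeCache({ stdTTL: 300 }); // entries expire after 5 minutes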

// Rate limiter: 10 requests per minute per IP
const limiter = rateLimit({
  windowMs: 60 * 1000,
  max: 10,
  message: 'Too many requests, please slow down.'
});
app.use('/api/ai', limiter);

app.post('/api/ai', async (req, res) => {
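  const { prompt } = req.body ?? {};
  // Reject missing, non-string, blank, or oversized prompts
  if (typeof prompt !== 'string' || !prompt.trim() || prompt.length > 1000) {
    return res.status(400).json({ error: 'Invalid prompt' });
  }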

  // Cache check
  const cacheKey = prompt.trim().toLowerCase();
  const cached = cache.get(cacheKey);
  if (cached) {
    console.log('Cache hit');
    return res.json(cached);
  }

  try {
    const response = await fetch('https://api.openai.com/v1/chat/completions', {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${process.env.OPENAI_API_KEY}`,
        'Content-Type': 'application/json'
      },
      body: JSON.stringify({
        model: 'gpt-4o-mini',
        messages: [{ role: 'user', content: prompt }],
        max_tokens: 150
      })
    });

    if (!response.ok) {
      console.error('AI API error:', response.status);
      return res.status(502).json({ error: 'AI service unavailable' });
    }

    const data = await response.json();
    const tokensUsed = data.usage?.total_tokens || 0;
    console.log(`Tokens used: ${tokensUsed}`);

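    // Cost estimation (example rates: gpt-4o-mini at $0.15 per 1M input tokens, $0.60 per 1M output tokens)
    const inputTokens = data.usage?.prompt_tokens || 0;
    const outputTokens = data.usage?.completion_tokens || 0;
    const estimatedCost = (inputTokens * 0.15 + outputTokens * 0.60) / 1000000; // still a ballpark
    console.log(`Estimated cost: $${estimatedCost.toFixed(5)}`);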

    // Cache the result
    cache.set(cacheKey, data);

    res.json(data);
  } catch (error) {
    console.error('Proxy error:', error);
    res.status(500).json({ error: 'Internal server error' });
  }
});

app.listen(3000, () => console.log('AI proxy running on port 3000'));
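
For completeness, here is a minimal sketch of how a frontend might call this proxy. The helper name `askAi` and the injectable `fetchImpl` parameter are assumptions made for illustration (the injection only exists so the function can be exercised without a network). It also assumes the proxy forwards the provider's chat completion response unchanged, as the handler above does.

```javascript
// Hypothetical browser-side helper: sends the prompt to the proxy, never to the AI provider.
// `fetchImpl` defaults to the global fetch and can be swapped for a stub in tests.
async function askAi(prompt, fetchImpl = fetch) {
  const res = await fetchImpl('/api/ai', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt })
  });

  if (!res.ok) {
    // Covers validation errors (400), rate limiting (429), and upstream failures (502/500)
    throw new Error(`Proxy request failed with status ${res.status}`);
  }

  const data = await res.json();
  // Chat completion responses carry the text in choices[0].message.content
  return data.choices?.[0]?.message?.content ?? '';
}
```

The API key never appears in this code; the browser only ever sees the `/api/ai` route.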

Lessons Learned

Rate limiting is non-negotiable. Even if you trust your users, you don’t trust the internet. Free tiers get abused.
Cache aggressively, but be smart about it. Identical prompts from different users should not hit the API again. But if your app is real‑time (e.g., chat), caching might hurt freshness. In that case, skip cache or reduce TTL.
Log everything (but don’t store raw prompts if they’re sensitive). I log token counts and response status, but I hash the prompt before logging to avoid storing user data.
Don’t just forward the request. Validate and sanitize. Strip any system role overrides the user might try to inject.
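
The last two lessons can be sketched in a few lines. This is one possible approach, not necessarily the exact code behind the proxy: `buildMessages` and `hashPrompt` are hypothetical helper names, and the system prompt is a placeholder passed in by the caller. The idea is to accept only a plain string from the client and build the `messages` array on the server, so any roles the client tries to send are simply never forwarded.

```javascript
const crypto = require('node:crypto');

const MAX_PROMPT_LENGTH = 1000; // same limit as the handler above

// Hypothetical helper: returns a server-built messages array, or null if the
// body is invalid. Anything else in the body (e.g. a client-supplied
// `messages` array with a system role) is ignored.
function buildMessages(body, systemPrompt) {
  const prompt = body && body.prompt;
  if (typeof prompt !== 'string') return null;

  const trimmed = prompt.trim();
  if (trimmed.length === 0 || trimmed.length > MAX_PROMPT_LENGTH) return null;

  return [
    { role: 'system', content: systemPrompt },
    { role: 'user', content: trimmed }
  ];
}

// Hypothetical helper: SHA-256 hex digest of the prompt, so log lines can be
// correlated without storing the raw text.
function hashPrompt(prompt) {
  return crypto.createHash('sha256').update(prompt, 'utf8').digest('hex');
}
```

In the handler, this would look something like `console.log({ promptHash: hashPrompt(prompt), tokensUsed })`. Two caveats: this only blocks role overrides in the request payload, not injection attempts written inside the prompt text itself, and an unsalted hash of a short, guessable prompt can still be matched by guessing, so it reduces exposure rather than anonymizing.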

Trade-Offs & When NOT to Use This Approach

This proxy is single‑node. If you need high availability, you’ll want a load balancer or a more scalable solution like a serverless proxy on Cloudflare Workers (which can also cache at the edge).
If you’re using streaming (SSE), caching doesn’t work well. You’ll need to handle that differently.
The cost estimation is ballpark. For precise billing, use the AI provider’s dashboard – don’t rely on your own maths.

What I’d Do Differently Next Time

Start with a Cloudflare Worker instead of Express. It’s free for low usage, has built‑in caching and rate limiting, and runs globally. The code would be similar but in the Workers runtime.
Use a queue for non‑real‑time requests. For batch processing, I’d put prompts in a queue (like Bull or AWS SQS) and have a worker consume them with a concurrency limit. This avoids overwhelming the API and makes costs predictable.
Set up alerts. I’d use a webhook or an email service (e.g., SendGrid) to notify me if the error rate spikes or if daily token usage exceeds a threshold.
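
To make the queue idea concrete, here is a rough in-memory sketch of a worker with a concurrency limit. It is a stand-in for illustration only: a real setup with Bull or SQS would persist jobs and survive restarts, which this does not. The name `createLimitedQueue` is hypothetical.

```javascript
// In-memory stand-in for a real queue: runs at most `concurrency` jobs at a
// time and settles each returned promise when its job finishes.
function createLimitedQueue(concurrency) {
  const pending = []; // jobs waiting for a free slot
  let active = 0;     // jobs currently running

  function next() {
    if (active >= concurrency || pending.length === 0) return;
    const { job, resolve, reject } = pending.shift();
    active++;
    Promise.resolve()
      .then(job)              // run the job; sync throws become rejections
      .then(resolve, reject)  // hand the outcome back to the caller
      .finally(() => {
        active--;
        next();               // a slot is free, start the next job if any
      });
  }

  // enqueue(job) accepts a function that returns a value or a promise
  return function enqueue(job) {
    return new Promise((resolve, reject) => {
      pending.push({ job, resolve, reject });
      next();
    });
  };
}
```

Usage would be along the lines of `const enqueue = createLimitedQueue(2); enqueue(() => callAi(prompt))`, where `callAi` is whatever function wraps the upstream request. Capping concurrency bounds how fast tokens can be spent, which is what makes the cost more predictable.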

Final Thoughts

Building an AI proxy isn’t rocket science, but the details matter. The code above got me from zero to a usable, secure endpoint in about an hour. Since then I’ve refined it with more middleware – JWT authentication, user‑specific rate limits, and a simple dashboard for cost tracking.
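
As an illustration of what a user-specific rate limit can look like, here is a small fixed-window counter keyed by user ID. This is a sketch under assumptions, not the middleware actually used: `createUserLimiter` is a hypothetical name, the state lives in memory on one node, and the user ID is assumed to come from an already-verified JWT.

```javascript
// Fixed-window limiter keyed by user ID. `now` is injectable so the window
// logic can be tested without waiting for real time to pass.
function createUserLimiter({ windowMs, max, now = Date.now }) {
  const windows = new Map(); // userId -> { start, count }

  // Returns true if the request is allowed, false if the user is over the limit.
  return function allow(userId) {
    const t = now();
    const entry = windows.get(userId);

    // First request from this user, or the previous window has expired
    if (!entry || t - entry.start >= windowMs) {
      windows.set(userId, { start: t, count: 1 });
      return true;
    }

    if (entry.count >= max) return false;
    entry.count++;
    return true;
  };
}
```

In an Express middleware this might be used as `if (!allow(req.user.id)) return res.status(429).json({ error: 'Too many requests' })`, where `req.user` is assumed to be set by the authentication step. The map is never pruned here, so a long-running server would want to evict stale entries.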

Your turn: what’s your current setup for exposing AI APIs safely? Have you tried a different stack or found a clever caching trick? I’d love to hear what’s working for you.

DEV Community