DEV Community

zhongqiyue

Building a Serverless Proxy for AI APIs: Lessons Learned

Last month, I needed to add an AI-powered chatbot to my side project. The requirement sounded simple: take user messages, send them to an LLM API (like GPT), stream the response back to the frontend, and keep the API key safe. Easy, right?

I thought so too – until I actually tried to wire it up from the browser. That’s when the real headache started.

The CORS Wall

Like many frontend developers, my first instinct was to call the AI API directly from JavaScript: fetch the endpoint, pass my API key, and read the response. But the API didn’t allow cross-origin (CORS) requests from the browser, and even if it had, shipping the API key in the client-side bundle would let anyone who opens DevTools copy it. So that was a dead end.

The Simple Node Server

Next, I spun up a small Express server. A few lines later, I had a /chat endpoint that proxied requests to the AI API. The API key stayed server-side, CORS was handled by the server, and everything worked… until I had to deploy it.

I needed something cheap and scalable. A full-blown VPS felt like overkill. I tried running the Express app on a cheap cloud server, but then I had to worry about uptime, process management, and restarting on crashes. Not fun for a weekend project.

I looked into services like Interwest AI (they offer a unified AI API with built-in proxy logic) – but I wanted full control over the request/response pipeline, and I didn’t want to rely on another dependency for such a small thing. Plus, I was curious if I could build something lean myself.

Enter Serverless Functions

I realised a serverless function (on Vercel, Netlify, or AWS Lambda) could act as a perfect lightweight proxy. No server to manage, no idle costs, and I could secure the API key using environment variables. I chose Vercel because I already host my frontend there, but the same approach works anywhere.

Here’s the core of the solution:

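// api/chat.js (Vercel serverless function)
import { OpenAIStream, StreamingTextResponse } from 'ai';
import { Configuration, OpenAIApi } from 'openai-edge';

// Named `configuration` so it doesn't clash with the `config` export below.
const configuration = new Configuration({
  apiKey: process.env.OPENAI_API_KEY, // stays server-side, never sent to the browser
});
const openai = new OpenAIApi(configuration);

// Tells Vercel to run this function on the Edge runtime.
export const config = {
  runtime: 'edge',
};

export default async function handler(req) {
  const { messages } = await req.json();

  // Ask the upstream API for a streamed completion.
  const response = await openai.createChatCompletion({
    model: 'gpt-3.5-turbo',
    stream: true,
    messages,
  });

  // Convert the upstream response into a text stream for the frontend.
  const stream = OpenAIStream(response);
  return new StreamingTextResponse(stream);
}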

That’s it. The function receives user messages, calls the AI API with streaming enabled, and returns a streamed response to the frontend. The ai SDK from Vercel makes streaming trivial – but you can also implement it manually using the standard ReadableStream API.
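For the manual route, here is a minimal sketch built only on the standard ReadableStream, TextEncoder, and Response APIs. The helper name streamTextResponse and the fakeTokens generator are my own stand-ins for illustration, not part of the ai SDK or of my actual handler; in a real function the chunks would come from the upstream API response.

```javascript
// Hypothetical helper: wraps any async iterable of text chunks
// in a streamed Response using only standard Web APIs.
function streamTextResponse(chunks) {
  const encoder = new TextEncoder();
  const iterator = chunks[Symbol.asyncIterator]();

  const body = new ReadableStream({
    // pull() runs each time the consumer is ready for more data.
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
      } else {
        controller.enqueue(encoder.encode(value));
      }
    },
    // cancel() runs if the client disconnects; stop the upstream source too.
    async cancel() {
      if (iterator.return) await iterator.return();
    },
  });

  return new Response(body, {
    headers: { 'Content-Type': 'text/plain; charset=utf-8' },
  });
}

// Stand-in for the model's token stream (an assumption for this demo).
async function* fakeTokens() {
  yield 'Hello';
  yield ', ';
  yield 'world';
}
```

Calling streamTextResponse(fakeTokens()) returns a Response whose body delivers the chunks as they are produced. Using pull() rather than pushing everything in start() means the stream only reads from the source as fast as the client consumes it.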

On the frontend, I used the same ai library to consume the stream:

import { useChat } from 'ai/react';

function Chat() {
  const { messages, input, handleInputChange, handleSubmit } = useChat({
    api: '/api/chat',
  });

  return (
    <div>
      {messages.map(m => (
        <div key={m.id}>{m.role}: {m.content}</div>
      ))}
      <form onSubmit={handleSubmit}>
        <input value={input} onChange={handleInputChange} />
      </form>
    </div>
  );
}

It felt almost too easy – and that’s the point. The heavy lifting is done by the serverless platform and the AI SDK. My only job was connecting the dots.

Lessons Learned (the hard way)

While this approach works beautifully for low-traffic projects, there are some sharp edges I discovered:

  • Cold starts: Serverless functions can have a few hundred milliseconds of cold start latency. For a chat app, that initial delay is noticeable. Vercel’s Edge Functions help reduce this, but it’s not zero.
  • Timeouts: Many serverless providers cap function execution time (e.g., 10 seconds on Vercel’s Hobby plan when I built this; check the current limits). Streaming helps because the response starts immediately, but if the AI takes too long to generate the first token, you might hit the timeout.
  • Rate limiting: Without any throttling, a user could spam the endpoint and rack up your API bill. I later added a simple rate limiter backed by Upstash Redis (any Redis-backed store works) to prevent abuse. A purely in-memory counter is not enough here, because serverless instances don’t share memory.
  • Error handling: The AI API can return errors (quota exceeded, invalid model, etc.). My initial version just forwarded the error, which led to confusing frontend messages. I added better error formatting and a fallback response.
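To make the rate-limiting point concrete, here is a minimal fixed-window sketch. It is not the code I shipped: createRateLimiter, createMemoryStore, and the store.incr(key, ttlMs) interface are hypothetical names for illustration, and the Map-backed store exists only so the example runs on its own. In a deployment, incr would be backed by Redis (an INCR plus an expiry) so that all instances share the counters.

```javascript
// Fixed-window rate limiter sketch. `store` is a hypothetical interface with
// an async incr(key, ttlMs) that returns the updated count for that key.
function createRateLimiter(store, { limit, windowMs }) {
  return async function isAllowed(clientId, now = Date.now()) {
    // All requests in the same time window share one counter key.
    const windowId = Math.floor(now / windowMs);
    const count = await store.incr(`rl:${clientId}:${windowId}`, windowMs);
    return count <= limit;
  };
}

// Demo-only store backed by a Map. Serverless instances do not share memory,
// so a real deployment would implement incr() on top of Redis instead.
function createMemoryStore() {
  const counts = new Map();
  return {
    async incr(key, ttlMs) {
      const next = (counts.get(key) || 0) + 1;
      counts.set(key, next);
      if (next === 1) {
        // Drop the counter once its window has passed.
        const timer = setTimeout(() => counts.delete(key), ttlMs);
        if (typeof timer.unref === 'function') timer.unref();
      }
      return next;
    },
  };
}
```

In the handler, the idea would be to call isAllowed before contacting the AI API and return a 429 response when it yields false. How to derive clientId (an IP header, a session ID) is left open here.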

What I’d Do Differently Next Time

If I were to rebuild this today, I’d:

  • Use a queue for heavy traffic: If many users hit the endpoint simultaneously, serverless functions can scale, but each request triggers a fresh API call. A queue (like BullMQ with Redis) would batch requests or deduplicate identical prompts.
  • Add caching for common prompts: Some users ask the same questions. Caching the AI response for a few minutes would save money and speed up replies.
  • Consider a dedicated worker for production: For a high-traffic app, a persistent Node.js process (deployed on Fly.io or Railway) might be simpler to debug and tune than serverless functions with their quirky limits.
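The caching and deduplication ideas can be sketched without a queue. This is a rough illustration under stated assumptions, not something I have running: createCachedCompletion is a made-up name, fetchCompletion is a hypothetical async function that calls the AI API and resolves to the full reply text (so this variant caches whole replies, not streams), and the Map lives in one instance's memory only.

```javascript
// Sketch: TTL cache plus in-flight deduplication for identical prompts.
// `fetchCompletion` is a hypothetical async function that calls the AI API
// and resolves to the full reply text.
function createCachedCompletion(fetchCompletion, { ttlMs, now = Date.now }) {
  const cache = new Map();    // key -> { value, expiresAt }
  const inFlight = new Map(); // key -> pending Promise

  return async function getCompletion(messages) {
    // Identical message arrays serialize to the same key.
    const key = JSON.stringify(messages);

    const hit = cache.get(key);
    if (hit && hit.expiresAt > now()) return hit.value;

    // Reuse the pending call if an identical prompt is already in progress.
    if (inFlight.has(key)) return inFlight.get(key);

    const pending = fetchCompletion(messages)
      .then((value) => {
        cache.set(key, { value, expiresAt: now() + ttlMs });
        return value;
      })
      .finally(() => inFlight.delete(key));

    inFlight.set(key, pending);
    return pending;
  };
}
```

Two concurrent calls with the same messages trigger a single upstream request, and later calls within ttlMs are served from the cache. Expired entries are only overwritten, never evicted, so a longer-lived version would need a size cap or a shared store such as Redis.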

When to Use This Approach (and When Not To)

This serverless proxy is perfect for:

  • Small side projects or MVPs
  • Prototyping AI features quickly
  • Apps with light to moderate traffic

Avoid it if:

  • You need sub-100ms response times at the p99
  • You have millions of requests per day (cost efficiency becomes tricky)
  • You want full control over HTTP/2 multiplexing and connection pooling

Wrapping Up

Building a serverless proxy for AI APIs gave me what I needed: a setup that keeps the API key secure, sidesteps CORS, and deploys in minutes. It’s not a silver bullet, but for most indie devs and small teams, it’s more than enough.

Have you found a better approach for handling AI API calls in production? I’d love to hear what’s working for you.
