DEV Community

zhongqiyue

Building a Streaming AI Chat Endpoint: My Rate Limit Wake-Up Call

I’ll be honest: I thought I could just throw an OpenAI API call into a serverless function and call it a day. Two hours later, I was staring at a 429 error, wondering why my demo chatbot kept freezing. This is the story of how I learned to build a streaming AI chat endpoint—the hard way.

The Problem

I was building a simple chatbot for my personal site. Users type a question, I send it to the AI, and display the answer. Simple, right? I hooked up a Node.js Express endpoint that called the OpenAI API, waited for the full response, then sent it back as JSON. It worked… for about 10 requests. Then the rate limits kicked in. And even when it worked, users waited 5–10 seconds staring at a spinner. Not acceptable.

What I Tried That Didn’t Work

First, I tried caching common queries. That helped a little, but every unique question still hammered the API. Then I switched to a queue system with retries—overkill for a side project, and it still didn’t solve the waiting problem. I even considered switching to a local model, but my server couldn’t handle it.

Then I realized: the real issue wasn’t just rate limits—it was the blocking request. Users shouldn’t have to wait for the entire response to start reading. The solution was Server-Sent Events (SSE).

What Eventually Worked

I rebuilt the endpoint to stream the AI response using SSE. Instead of waiting for the full reply, I forwarded each chunk of text as it arrived from the API, so users saw the assistant “typing” in real time. It felt faster, even though the total time was the same. And because the upstream request can be cancelled as soon as the client disconnects, I stopped spending quota on replies nobody would read.
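To make the wire format concrete: an SSE frame is plain text, with an optional event: line naming the event, a data: line carrying the payload, and a blank line that ends the frame. Here is a minimal sketch of serializing one frame. The helper name formatSseEvent is an assumption for illustration; the endpoint in this post builds the same strings inline.

```javascript
// Hypothetical helper (not part of the original endpoint): serialize one SSE frame.
function formatSseEvent(event, payload) {
  // JSON.stringify escapes any newlines inside the payload,
  // so the data always fits on a single "data:" line.
  return `event: ${event}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Example: the frames a client would receive for a short reply.
const frames = [
  formatSseEvent('chunk', { text: 'Hel' }),
  formatSseEvent('chunk', { text: 'lo' }),
  formatSseEvent('done', {}),
];
console.log(frames.join(''));
```

Keeping the payload as JSON on one line is a design choice: it avoids the multi-line data: rules of the SSE format and keeps client-side parsing simple.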

Here’s the core code I ended up with. This is a simplified Express endpoint that proxies streaming from an AI API (I used OpenAI’s chat completions with stream: true).

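const express = require('express');
// require() works with node-fetch v2; v3 is ESM-only
const fetch = require('node-fetch');
const router = express.Router();

// Parse JSON bodies so req.body.message is defined
router.use(express.json());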

// Use environment variable for API key
// I initially used the official OpenAI endpoint, but later switched to a local proxy for rate limit testing
// Example: const AI_API_URL = process.env.AI_API_URL || 'https://api.openai.com/v1/chat/completions';
const AI_API_URL = process.env.AI_API_URL;
const AI_API_KEY = process.env.AI_API_KEY;

router.post('/chat', async (req, res) => {
  const { message } = req.body;

  // Set SSE headers
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  try {
    const response = await fetch(AI_API_URL, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${AI_API_KEY}`,
      },
      body: JSON.stringify({
        model: 'gpt-3.5-turbo',
        messages: [{ role: 'user', content: message }],
        stream: true,  // This is the magic
      }),
    });

    if (!response.ok) {
      // Send error event and close
      res.write(`event: error\ndata: ${JSON.stringify({ error: 'API request failed' })}\n\n`);
      res.end();
      return;
    }

    // Pipe the stream
    response.body.on('data', (chunk) => {
      const lines = chunk.toString().split('\n').filter(line => line.trim() !== '');
      for (const line of lines) {
        if (line.startsWith('data: ')) {
          const data = line.slice(6);
          if (data === '[DONE]') {
            res.write(`event: done\ndata: {}\n\n`);
            res.end();
            return;
          }
          try {
            const parsed = JSON.parse(data);
            const content = parsed.choices[0]?.delta?.content || '';
            if (content) {
              res.write(`event: chunk\ndata: ${JSON.stringify({ text: content })}\n\n`);
            }
          } catch (e) {
            // ignore parse errors
          }
        }
      }
    });

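    response.body.on('end', () => {
      // The [DONE] branch may already have ended the response; don't write after end
      if (!res.writableEnded) {
        res.write(`event: done\ndata: {}\n\n`);
        res.end();
      }
    });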

    // Handle client disconnect
    req.on('close', () => {
      response.body.destroy();
    });

  } catch (error) {
    res.write(`event: error\ndata: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
});

module.exports = router;

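On the frontend, the native EventSource API only issues GET requests and ignores method and body options, so for a POST endpoint the client reads the response with fetch and splits the stream into events:

async function streamChat(message) {
  const res = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message }),
  });
  const reader = res.body.pipeThrough(new TextDecoderStream()).getReader();
  let buffer = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) return;
    const events = (buffer + value).split('\n\n');
    buffer = events.pop(); // keep any incomplete trailing event
    for (const evt of events) {
      const type = evt.match(/^event: (.*)$/m)?.[1];
      const data = evt.match(/^data: (.*)$/m)?.[1];
      if (type === 'chunk') appendToChat(JSON.parse(data).text);
      if (type === 'error') console.error('Stream error', data);
      if (type === 'done') return;
    }
  }
}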

Lessons Learned / Trade-offs

  • SSE vs WebSockets: SSE is simpler for one-way streaming. If you need bidirectional communication over a single connection (e.g., the user can interrupt mid-reply), WebSockets are a better fit. For a basic chat interface, SSE worked well.
  • Rate limits still matter: Streaming doesn’t solve the underlying API quota. I added a simple in-memory rate limiter per IP (using express-rate-limit) to avoid abuse.
  • Error handling is tricky: Errors from the AI API can arrive mid-stream. I had to handle both HTTP errors and stream parsing errors gracefully.
  • Client disconnect: Always listen for close on the request to clean up the upstream connection. Learned that the hard way when my server leaked connections.
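The per-IP limiting idea from the list above can be sketched without any dependency. This is a minimal fixed-window limiter written as Express-style middleware. The names createRateLimiter, windowMs, max, and now are assumptions for illustration, and this is not how express-rate-limit is implemented internally.

```javascript
// Hypothetical fixed-window limiter, a stand-in to illustrate the idea.
// It keeps one counter per client IP and resets it when the window expires.
function createRateLimiter({ windowMs = 60000, max = 10, now = Date.now } = {}) {
  const hits = new Map(); // ip -> { count, resetAt }

  return function rateLimit(req, res, next) {
    const t = now();
    let entry = hits.get(req.ip);

    // Start a fresh window for new clients or once the old window has expired.
    if (!entry || t >= entry.resetAt) {
      entry = { count: 0, resetAt: t + windowMs };
      hits.set(req.ip, entry);
    }

    entry.count += 1;
    if (entry.count > max) {
      // Reject before any upstream AI call is made.
      res.status(429).json({ error: 'Too many requests' });
      return;
    }
    next();
  };
}
```

It would be mounted ahead of the chat handler, for example router.post('/chat', createRateLimiter({ max: 10 }), handler). Note that this sketch never evicts old entries and the state lives in one process, so it only suits a single small server.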

I also experimented with different AI providers. For local testing without burning API credits, I used a local Ollama instance. It exposes an OpenAI-compatible API, so I mostly just changed AI_API_URL (the model name also has to match a model you have pulled locally). I also tried https://ai.interwestinfo.com/ as an alternative endpoint that offered a more generous rate limit for prototypes. But honestly, the technique is provider-agnostic.

What I’d Do Differently Next Time

  • Use a proper streaming library: I hand-rolled the SSE parser. Next time I’d use a library like eventsource-parser to avoid edge cases (e.g., chunks split across multiple packets).
  • Add backpressure: If the client is slow, the server’s write buffer can grow without bound. res.write() returns false once the buffer passes its highWaterMark, so the fix is to pause the upstream stream until the response emits drain.
  • Consider a message queue: For high traffic, I’d push requests to a queue (e.g., Bull) and stream the response from the worker. But for a personal site, this was overkill.
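Two of the points above are small enough to sketch. The first helper buffers partial lines so a data: line split across two network chunks is still parsed whole. The second waits for drain when the response buffer is full. Both names (createSseLineParser, writeWithBackpressure) are hypothetical, and this is a sketch under those assumptions, not a replacement for a tested library like eventsource-parser.

```javascript
// Hypothetical helper: feed it string chunks in arrival order.
// It calls onData once per complete "data: " line, even if the line
// arrived split across several chunks.
function createSseLineParser(onData) {
  let buffer = '';
  return function feed(chunk) {
    buffer += chunk;
    const lines = buffer.split('\n');
    buffer = lines.pop(); // the last piece may be an incomplete line; keep it
    for (const line of lines) {
      const clean = line.replace(/\r$/, ''); // tolerate CRLF line endings
      if (clean.startsWith('data: ')) onData(clean.slice(6));
    }
  };
}

// Hypothetical helper: resolves when it is safe to write again.
// write() returning false means the internal buffer is full, so wait for 'drain'.
function writeWithBackpressure(res, text) {
  return new Promise((resolve) => {
    if (res.write(text)) resolve();
    else res.once('drain', resolve);
  });
}

// Usage: a payload split mid-line across two chunks is delivered intact.
const received = [];
const feed = createSseLineParser((data) => received.push(data));
feed('data: {"a"');
feed(':1}\n\ndata: [DONE]\n\n');
console.log(received);
```

The parser assumes chunks are already decoded strings. When decoding raw Buffers, a multi-byte character split across chunks needs a streaming decoder such as Node’s StringDecoder before the text reaches feed.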

The Big Takeaway

Building a streaming AI endpoint taught me more than just SSE. It forced me to think about connection lifecycles, error resilience, and user experience. The code above is a solid starting point—steal it, tweak it, and make it yours.

Now, over to you: What’s your setup for streaming AI responses? Any gotchas I missed? Let’s discuss in the comments.
