zhongqiyue

Posted on Jun 9

Streaming AI Responses in a Serverless World: What I Learned the Hard Way

#ai #webdev #javascript #tutorial

I love building small web apps. You know the kind – side projects that start with a single npm create and end up consuming every weekend for a month. A few months ago, I decided to build a simple dashboard that used an AI model to generate summaries from user notes. Nothing fancy: drop a note, get a bullet-point summary back.

But then the reality of serverless architectures hit me. And the AI API response time. And the connection drops. And the user staring at a spinner for 15 seconds. That’s the problem I want to talk about today.

The Real Problem

My backend was a single Vercel serverless function (Node.js). I’d call the AI API, wait for the entire response, then send it to the client. Simple, right? Except AI models, especially the larger ones, can take 10–20 seconds to return a full response. During that time, the serverless function is billed per execution millisecond, and the user sees a loading spinner that feels like an eternity.

I remember the first time I tested it with a 5-paragraph note. I clicked the button, got myself a glass of water, came back, and the spinner was still spinning. Not acceptable.

What I Tried That Didn't Work

Optimizing the prompt? I tried shorter prompts, but the model still needed time to generate.

Increasing timeout? Serverless functions have hard limits (on Vercel at the time, 60s on Pro and 10s on Hobby). A longer limit only stops the request from being killed; the user still waits just as long.

Showing a “Processing…” indicator? That’s just cosmetic. The user still waits.

Fetching in chunks? Some providers support streaming, but I had to figure out how to pipe that through a serverless function. And then the client had to handle a stream. More complexity.

Background job (queue) approach? That would mean setting up a queue (e.g., Bull, SQS) and polling from the frontend. For a side project? Too much infra.

I needed a middle ground: something that felt responsive, didn’t require a full re-architecture, and still used serverless functions (because I deploy on Vercel for free).

What Eventually Worked: Streaming from Serverless + Server-Sent Events

I decided to embrace streaming from the AI provider and forward that stream to the client using Server-Sent Events (SSE). The trick: serverless functions on Vercel (and similar platforms) can keep the connection open and stream data as it arrives. Here’s how I did it.

Step 1: Pick an AI provider that supports streaming

Most modern providers (OpenAI, Anthropic, and many others) support streaming completions. Instead of waiting for the full response, they send chunks as tokens are generated. In my case, I used the OpenAI API with stream: true.
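To make "chunks" concrete, here is a minimal sketch of pulling the text fragment out of each streamed chunk. The helper name extractText and the fake chunk list are made up for illustration. The chunk shapes follow the OpenAI Chat Completions and Anthropic Messages streaming formats as I understand them, so check your provider's docs before relying on them.

```javascript
// Hypothetical helper: return the text carried by one streamed chunk, or ''.
function extractText(provider, chunk) {
  switch (provider) {
    case 'openai':
      // Assumed shape: { choices: [{ delta: { content: 'Hel' } }] }
      return chunk.choices?.[0]?.delta?.content ?? '';
    case 'anthropic':
      // Assumed shape: { type: 'content_block_delta', delta: { type: 'text_delta', text: 'Hel' } }
      if (chunk.type === 'content_block_delta' && chunk.delta?.type === 'text_delta') {
        return chunk.delta.text;
      }
      return ''; // events such as message_start or message_stop carry no text
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}

// Fake chunks standing in for a real stream: role first, then text, then a stop marker.
const fakeChunks = [
  { choices: [{ delta: { role: 'assistant' } }] },
  { choices: [{ delta: { content: 'Hel' } }] },
  { choices: [{ delta: { content: 'lo' } }] },
  { choices: [{ delta: {}, finish_reason: 'stop' }] },
];

const summary = fakeChunks.map((c) => extractText('openai', c)).join('');
console.log(summary); // "Hello"
```

Not every chunk carries text, which is why the server code later guards against empty content before writing anything.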

Step 2: Create a serverless endpoint that streams

I used Next.js API routes (Node.js runtime), but the same idea works with any Node.js serverless function that lets you write to the response incrementally. Here’s the core code:

// pages/api/summarize.js
import OpenAI from 'openai';

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

export default async function handler(req, res) {
  if (req.method !== 'POST') {
    return res.status(405).end();
  }

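  const { text } = req.body ?? {};
  // Reject empty input before opening the stream or calling the paid API
  if (typeof text !== 'string' || !text.trim()) {
    return res.status(400).json({ error: 'Missing "text" in request body' });
  }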

  // Set headers for SSE
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.flushHeaders(); // send headers to client

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [{ role: 'user', content: `Summarize: ${text}` }],
      stream: true,
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        // Send each chunk as an SSE event; JSON-encoding keeps newlines in the text from breaking the framing
        res.write(`data: ${JSON.stringify({ content })}\n\n`);
      }
    }

    // Send a final event to signal completion
    res.write(`data: [DONE]\n\n`);
    res.end();
  } catch (error) {
    // If stream fails, send error event
    res.write(`data: ${JSON.stringify({ error: error.message })}\n\n`);
    res.end();
  }
}

This keeps the function alive while tokens arrive. Each chunk is sent as an SSE event with a data: line. The client can then listen to these events and update the UI incrementally.
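The wire format is simple enough to sketch on its own. An SSE event is one or more data: lines followed by a blank line, so a raw newline inside the payload would end the event early. The helper name sseEvent below is hypothetical; it only shows why the handler JSON-encodes each chunk.

```javascript
// Hypothetical helper: frame one payload as a single Server-Sent Event.
// JSON.stringify escapes newlines inside strings, so the payload stays on one line.
function sseEvent(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Model output often contains newlines (bullet lists, paragraphs).
const frame = sseEvent({ content: 'line one\nline two' });

// Print the frame with escapes visible so the framing is easy to inspect.
console.log(JSON.stringify(frame));
```

The only real newlines in the frame are the two at the end, which is what tells the client the event is complete.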

Step 3: Client-side consumption with fetch and a stream reader

On the frontend, my first instinct was EventSource, but the standard EventSource only issues GET requests and this endpoint expects a POST body. So I used fetch and read the response body as a stream instead. Here’s a clean client-side helper:

async function streamSummary(text, onChunk, onDone, onError) {
  const response = await fetch('/api/summarize', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text }),
  });

  if (!response.ok) {
    onError(new Error('Network response not ok'));
    return;
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    // Parse SSE lines: "data: {...}"
    const lines = buffer.split('\n');
    buffer = lines.pop(); // keep incomplete line

    for (const line of lines) {
      if (line.startsWith('data: ')) {
        const data = line.slice(6);
        if (data === '[DONE]') {
          onDone();
          return;
        }
        try {
          const parsed = JSON.parse(data);
          if (parsed.error) {
            onError(new Error(parsed.error));
            return;
          }
          onChunk(parsed.content);
        } catch (e) {
          // ignore malformed chunks
        }
      }
    }
  }
  onDone();
}
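The buffering step in the helper above is the part that is easiest to get wrong, so here is a sketch of it as a pure function that can be tested without a network. The name parseSse is hypothetical. It assumes the same simplified framing as the server code (single-line "data: " payloads), not the full SSE grammar.

```javascript
// Hypothetical pure helper: combine the leftover buffer with newly decoded text,
// return every complete `data: ` payload plus the new leftover.
function parseSse(buffer, text) {
  const lines = (buffer + text).split('\n');
  const rest = lines.pop(); // the last piece may be an incomplete line
  const payloads = [];
  for (const line of lines) {
    if (line.startsWith('data: ')) payloads.push(line.slice(6));
  }
  return { payloads, rest };
}

// A network read can end in the middle of an event:
const first = parseSse('', 'data: {"content":"Hel"}\n\ndata: {"con');
const second = parseSse(first.rest, 'tent":"lo"}\n\ndata: [DONE]\n\n');

console.log(first.payloads, JSON.stringify(first.rest)); // one payload, half an event left over
console.log(second.payloads, JSON.stringify(second.rest)); // two payloads, nothing left over
```

Without the leftover buffer, the second half of the split event would fail JSON.parse and be silently dropped by the catch block.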

Then in my React component:

const [summary, setSummary] = useState('');
const [loading, setLoading] = useState(false);

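async function handleSummarize(note) {
  setSummary('');
  setLoading(true);
  try {
    // Await so that a rejected fetch or read lands in the catch below
    await streamSummary(
      note,
      (chunk) => setSummary((prev) => prev + chunk),
      () => setLoading(false),
      (err) => {
        setLoading(false);
        console.error(err);
      }
    );
  } catch (err) {
    // Network failure or a connection dropped mid-stream
    setLoading(false);
    console.error(err);
  }
}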

Boom. The user sees the summary appearing word by word, often within a second of clicking. The perceived performance improved dramatically.

Lessons Learned & Trade-offs

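Serverless timeout still applies – If generation runs past the function timeout (e.g., 10s on hobby plan), the platform cuts the request off. Before any output has been sent, that shows up as a 504; once the stream has started, the client more likely just sees the connection end without the [DONE] event. I mitigated this by using a faster model or switching to a more generous plan. Some providers also have a “keep-alive” option for streams.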
Error handling is trickier – If the stream fails mid-way, you need a strategy: show partial results? Retry? I opted to show what we got and highlight that it’s incomplete.
Cold starts – On first request after inactivity, the function may take extra seconds. The stream still works, but the first chunk is delayed. I prewarm the function with a simple ping every 5 minutes (hacky but works).
Security – You’re exposing an endpoint that calls an expensive API. Add rate limiting or authentication (e.g., user token) to prevent abuse.
Not all AI providers stream the same – Check their documentation. Some send raw tokens, others send JSON each time.
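For the error-handling point, one way to keep partial results and flag them is a small reducer. This is a sketch, not the code from my app: the state shape, the status names, and the action types are all made up for illustration. It should plug into React's useReducer, but it is plain JavaScript.

```javascript
// Hypothetical state: status is 'idle' | 'streaming' | 'done' | 'incomplete' | 'failed'.
const initialState = { text: '', status: 'idle', error: null };

function streamReducer(state, action) {
  switch (action.type) {
    case 'start':
      return { text: '', status: 'streaming', error: null };
    case 'chunk':
      return { ...state, text: state.text + action.content };
    case 'done':
      return { ...state, status: 'done' };
    case 'error':
      // Keep whatever text arrived; the status tells the UI to label it as partial.
      return {
        ...state,
        status: state.text ? 'incomplete' : 'failed',
        error: action.message,
      };
    default:
      return state;
  }
}

// Simulate a stream that dies after the first bullet point.
const actions = [
  { type: 'start' },
  { type: 'chunk', content: '- First point' },
  { type: 'error', message: 'stream closed early' },
];
const finalState = actions.reduce(streamReducer, initialState);
console.log(finalState.status, JSON.stringify(finalState.text));
```

In the component, onChunk, onDone, and onError would each dispatch the matching action, and the UI can show a "summary may be incomplete" note whenever the status is 'incomplete'.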

What I’d Do Differently Next Time

If I were building this today, I’d probably look into edge functions (e.g., Cloudflare Workers or Vercel Edge) that have lower overhead for streaming. They can also handle SSE natively without the Node.js buffering weirdness. Also, I’d consider a dedicated WebSocket endpoint for bidirectional streaming, but that’s overkill for a simple summary app.
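As a rough idea of what the edge version could look like, here is a sketch built only on Web-standard APIs (Response, ReadableStream, TextEncoder), which edge runtimes and Node 18+ expose as globals. I haven't deployed this. The name sseResponse and the fakeTokens generator are hypothetical stand-ins for a real handler and a real provider stream.

```javascript
// Hypothetical helper: turn an async iterable of text fragments into an SSE Response.
function sseResponse(fragments) {
  const encoder = new TextEncoder();
  const body = new ReadableStream({
    async start(controller) {
      try {
        for await (const content of fragments) {
          controller.enqueue(encoder.encode(`data: ${JSON.stringify({ content })}\n\n`));
        }
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      } catch (err) {
        // Same error event shape as the Node.js handler in Step 2
        const message = err instanceof Error ? err.message : String(err);
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ error: message })}\n\n`));
      } finally {
        controller.close();
      }
    },
  });
  return new Response(body, {
    headers: { 'Content-Type': 'text/event-stream', 'Cache-Control': 'no-cache' },
  });
}

// Stand-in for a provider stream, used only to exercise the helper.
async function* fakeTokens() {
  yield 'Hel';
  yield 'lo';
}
```

An edge-style handler would return sseResponse(...) with fragments taken from the provider stream, and the client helper from Step 3 should work unchanged because the event format is the same. This sketch pushes fragments as fast as they arrive and does not handle backpressure.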

When NOT to Use This Approach

If your AI response is always tiny (e.g., a yes/no answer), streaming adds complexity with no benefit. Also, if you’re on a platform that doesn’t support streaming responses (like basic Netlify functions before their streaming support), you’ll need a different plan.

A Note on Tools

I built this for my side project, but while researching I stumbled upon a service called Interwest AI (ai.interwestinfo.com) that seemed to abstract away some of this streaming and serverless complexity. I haven’t used it yet, but it’s worth keeping an eye on if you want something more turnkey. The approach I showed here is completely DIY and works with any provider.

So that’s my story. Streaming AI responses from serverless functions turned my app from a “please wait” nightmare into something that feels alive. The code above is copy-pasteable – just add your own API key and adjust the prompt.

What does your setup look like? Are you streaming AI responses, or still doing the old-school wait-for-all approach?

DEV Community