DEV Community

zhongqiyue

Taming AI Latency: Streaming Responses with Server-Sent Events

I was proud of my AI-powered autocomplete feature. Then users started complaining about the lag.

Every keystroke triggered a request to an AI model, and waiting 2–3 seconds for a full JSON response made the UI feel broken. I tried debouncing, caching, even showing a spinner — nothing fixed the core problem: the user had to wait for the entire answer before seeing anything.

Here’s how I finally solved it using Server-Sent Events (SSE) to stream AI responses chunk by chunk.


The problem: Perceived performance is everything

I was building a search bar that gave semantic suggestions using a large language model. The model itself answered in a second or two, but the whole flow was blocking, so nothing reached the browser until the full response was ready:

  1. User types a few characters
  2. 300ms debounce
  3. HTTP POST with the partial query
  4. Server calls AI API (1-2 seconds)
  5. Server returns full response
  6. Client renders

Steps 4 and 5 together meant the user saw nothing for 1-2 seconds, on top of the debounce. Even with a loading spinner, it felt sluggish.

I tried optimistic UI (showing cached results) but that defeated the purpose of asking the model. I tried parallel requests, but the bottleneck was the AI API itself.

I needed a way to show partial results as they arrived.


What I tried first (and why it didn’t work)

Polling

I considered having the server return a job ID, then polling for results. But polling adds unnecessary HTTP overhead and introduces a tradeoff between latency and server load.
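To put rough numbers on that tradeoff, here is a back-of-the-envelope sketch. `pollingCost` is a hypothetical helper, and it assumes the first poll fires one interval after the job starts and that every poll is answered instantly.

```javascript
// Hypothetical helper: estimates what polling costs for a single job.
// Assumes polls fire at intervalMs, 2 * intervalMs, ... after the job starts.
function pollingCost(jobDurationMs, intervalMs) {
  // The first poll that lands at or after the job's finish time sees the result
  const requests = Math.max(1, Math.ceil(jobDurationMs / intervalMs));
  // Time the finished result sits on the server before a poll picks it up
  const addedLatencyMs = requests * intervalMs - jobDurationMs;
  return { requests, addedLatencyMs };
}

// A 1.5 s job polled every second: 2 requests, result seen 500 ms late
console.log(pollingCost(1500, 1000));
// The same job polled every 200 ms: 8 requests, result seen 100 ms late
console.log(pollingCost(1500, 200));
```

A short interval keeps the added latency low but multiplies requests; a long interval does the opposite. Either way the user still sees nothing until the job is finished.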

WebSockets

WebSockets work, but they’re heavy for a simple one-way stream. I’d have to manage connection state, reconnection, and use a library on both ends.

Chunked HTTP response with fetch

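You can read the response body as a stream with fetch(), and that is the transport I ended up using. On its own, though, a raw chunked response has no message framing: chunk boundaries fall wherever the network puts them, so if you don’t parse it properly, you’ll lose partial data. It needed a message format on top.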


What finally worked: Server-Sent Events (SSE)

SSE is a standard where the server sends text events over a single long-lived HTTP connection. The client uses the EventSource API (or fetch with manual parsing for more control).

It’s perfect for AI streaming because:

  • It’s one-way (server → client)
  • It’s text-based (easy to send JSON chunks)
  • It automatically reconnects if the connection drops (if using EventSource)
  • No extra libraries on the server — just set headers
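To make the wire format concrete, here is a minimal sketch of how a single event is framed. `formatSseEvent` is a hypothetical helper (not part of Express or any library), and the `token` event name in the comment is only an illustration.

```javascript
// Hypothetical helper: frames a payload as one SSE event.
// Each line of the payload gets its own "data:" field, an optional
// "event:" field names the event, and a blank line ends the event.
function formatSseEvent(payload, eventName) {
  const body = typeof payload === 'string' ? payload : JSON.stringify(payload);
  let out = '';
  if (eventName) out += `event: ${eventName}\n`;
  for (const line of body.split('\n')) {
    out += `data: ${line}\n`;
  }
  return out + '\n';
}

// formatSseEvent({ text: 'hel' })  produces  'data: {"text":"hel"}\n\n'
// formatSseEvent('a\nb', 'token')  produces  'event: token\ndata: a\ndata: b\n\n'
```

On the server this would be used as `res.write(formatSseEvent({ text }))`. Splitting on newlines matters because a raw newline inside a `data:` field would otherwise be read as the end of that field.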

Here’s how I implemented it.

Server side: Node.js with Express

// server.js
import express from 'express';
import fetch from 'node-fetch';

const app = express();

app.get('/stream-suggestions', async (req, res) => {
  const { q } = req.query;
  if (!q) return res.status(400).end();

  // Set SSE headers
  res.writeHead(200, {
    'Content-Type': 'text/event-stream',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
  });

  // Cancel the upstream AI request if the browser disconnects
  const controller = new AbortController();
  res.on('close', () => controller.abort());

  try {
    // AI API call (example using a streaming endpoint)
    // In my case I used the service at https://ai.interwestinfo.com/
    const response = await fetch('https://ai.interwestinfo.com/stream', {
      method: 'POST',
      body: JSON.stringify({ prompt: q }),
      headers: { 'Content-Type': 'application/json' },
      signal: controller.signal,
    });
    if (!response.ok) throw new Error(`AI API returned ${response.status}`);

    // Stream the AI response to the client.
    // The AI API sends JSON lines; we wrap each chunk in an SSE event.
    // A chunk is an arbitrary slice of bytes, not necessarily one full line,
    // so in real code you'd buffer, parse, and accumulate tokens.
    for await (const chunk of response.body) {
      res.write(`data: ${JSON.stringify({ text: chunk.toString() })}\n\n`);
    }
    res.write('data: [DONE]\n\n');
  } catch (err) {
    // Aborts are expected when the client disconnects; log anything else
    if (!controller.signal.aborted) console.error('Upstream stream failed:', err);
  } finally {
    res.end();
  }
});

app.listen(3000);

Note: The exact parsing depends on your AI provider. Some send tokens as lines, others send JSON. You may need to buffer and split.
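As one way to do that buffering, here is a small sketch that turns arbitrary chunks into complete lines. `createLineSplitter` is a hypothetical helper, and it assumes the provider sends newline-delimited text such as JSON lines; adjust the delimiter for other formats.

```javascript
// Hypothetical helper: collects chunks and emits only complete lines.
function createLineSplitter(onLine) {
  // A streaming decoder copes with multi-byte characters split across chunks
  const decoder = new TextDecoder();
  let buffer = '';
  return {
    push(chunk) {
      buffer += typeof chunk === 'string'
        ? chunk
        : decoder.decode(chunk, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop(); // the last piece may be an incomplete line
      for (const line of lines) {
        if (line.trim() !== '') onLine(line);
      }
    },
    // Call once the upstream ends to emit any final unterminated line
    flush() {
      buffer += decoder.decode();
      if (buffer.trim() !== '') onLine(buffer);
      buffer = '';
    },
  };
}
```

In the route handler, each chunk from the upstream body would go through `push`, with `onLine` parsing the line and writing one SSE event per token, and `flush` called when the upstream stream ends.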

Client side: Using the Fetch API (more control than EventSource)

// client.js
async function getSuggestions(query) {
  const response = await fetch(`/stream-suggestions?q=${encodeURIComponent(query)}`);
  if (!response.ok || !response.body) {
    throw new Error(`Stream request failed: ${response.status}`);
  }
  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  clearAutocomplete(); // start each query from an empty suggestion

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    // SSE events are separated by double newlines
    const events = buffer.split('\n\n');
    // Keep the last incomplete chunk in buffer
    buffer = events.pop() || '';

    for (const event of events) {
      if (event.startsWith('data: ')) {
        const jsonStr = event.slice(6);
        if (jsonStr === '[DONE]') return; // server signalled completion
        try {
          const { text } = JSON.parse(jsonStr);
          // Update UI incrementally
          if (typeof text === 'string') updateAutocomplete(text);
        } catch (e) {
          console.warn('Failed to parse chunk:', e);
        }
      }
    }
  }
}

function clearAutocomplete() {
  document.getElementById('suggestions').textContent = '';
}

function updateAutocomplete(text) {
  const el = document.getElementById('suggestions');
  // Each event carries only the newest piece of text, so append it.
  // textContent (not innerHTML) keeps model output from being parsed as HTML.
  el.textContent += text;
}

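Important: The network can split an event anywhere, so the client must never try to parse a partial chunk as JSON. That is what the buffer is for: append every decoded chunk, split on the blank line that ends an SSE event, and parse only complete events. In practice, the AI might send one token at a time, so you also need to accumulate the parsed tokens to build the full suggestion. I used a simple state machine: collect text, parse when a full event has arrived.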
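That state machine can be pulled out of the read loop and tested without a browser. The sketch below is one possible shape; `createSseAccumulator` is a hypothetical name, and it assumes the server sends `data:` events carrying `{"text": ...}` plus a final `[DONE]` sentinel, as in the server example.

```javascript
// Hypothetical helper: feed it decoded text chunks; it parses complete
// SSE events and reports the accumulated suggestion after each one.
function createSseAccumulator(onUpdate) {
  let buffer = '';
  let fullText = '';
  let finished = false;
  return {
    feed(textChunk) {
      // Normalise CRLF so both line-ending styles split the same way
      buffer = (buffer + textChunk).replace(/\r\n/g, '\n');
      const events = buffer.split('\n\n');
      buffer = events.pop(); // keep the trailing, possibly incomplete event
      for (const event of events) {
        // Join all data fields of the event; ignore other fields
        const data = event
          .split('\n')
          .filter((line) => line.startsWith('data:'))
          .map((line) => line.slice(5).replace(/^ /, ''))
          .join('\n');
        if (data === '') continue;
        if (data === '[DONE]') { finished = true; continue; }
        try {
          const { text } = JSON.parse(data);
          if (typeof text === 'string') {
            fullText += text;
            onUpdate(fullText);
          }
        } catch {
          // A complete event with a malformed payload: skip it
        }
      }
    },
    isFinished() {
      return finished;
    },
  };
}
```

Inside the read loop this would replace the inline parsing: `acc.feed(decoder.decode(value, { stream: true }))`, with `onUpdate` writing the accumulated text into the suggestions element.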


How it changed the UX

With streaming, the first word appeared in under 300ms. Users saw the suggestion building letter by letter. The perceived latency vanished — they felt the system was thinking with them, not making them wait.

The autocomplete became a real-time experience. Metrics improved:

  • Time to first visual response: 2.1s → 0.3s
  • User engagement (keystrokes per session): up 40%
  • Complaints disappeared

Trade-offs and limitations

Complexity

SSE adds a few moving parts: you have to handle connection drops, reconnection, and buffer parsing. Browser EventSource does reconnection for you, but it can only make GET requests and cannot set custom headers. That’s why I used fetch with a reader, at the cost of handling reconnection myself.
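With fetch, reconnection is something you write yourself. Here is a minimal retry sketch with exponential backoff. `backoffDelay` and `streamWithRetry` are hypothetical names, the default delays are arbitrary, and `connectOnce` stands in for whatever function opens and reads the stream (for example, a wrapper around `getSuggestions`).

```javascript
// Hypothetical helper: 500 ms, 1 s, 2 s, ... capped at maxMs
function backoffDelay(attempt, baseMs = 500, maxMs = 8000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Hypothetical helper: retries connectOnce until it resolves or attempts run out.
// `sleep` is injectable so the logic can be tested without real timers.
async function streamWithRetry(connectOnce, { maxAttempts = 5, sleep } = {}) {
  const wait = sleep || ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await connectOnce(attempt);
    } catch (err) {
      // Give up after the last attempt; otherwise back off and try again
      if (attempt === maxAttempts - 1) throw err;
      await wait(backoffDelay(attempt));
    }
  }
}
```

Note that a retry here restarts the stream from the beginning, so the UI should clear any partial output first. Resuming mid-stream would need server support, which this sketch does not assume.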

Backpressure

If the AI is faster than the client can render, you’ll queue up events. In my case, the DOM update was cheap, so it wasn’t a problem. But for heavy rendering, consider batching events.
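One way to batch is to coalesce everything that arrives between two paints into a single render call. The sketch below assumes append-style updates; `createBatcher` and its `schedule` parameter are hypothetical names, and the 16 ms fallback is only a rough stand-in for one frame.

```javascript
// Hypothetical helper: merges text that arrives between frames into one render.
function createBatcher(render, schedule) {
  const run = schedule || (typeof requestAnimationFrame === 'function'
    ? (fn) => requestAnimationFrame(fn)
    : (fn) => setTimeout(fn, 16)); // fallback outside the browser
  let pending = '';
  let scheduled = false;
  return function enqueue(text) {
    pending += text;
    if (scheduled) return; // a flush is already queued for this frame
    scheduled = true;
    run(() => {
      const batch = pending;
      pending = '';
      scheduled = false;
      render(batch);
    });
  };
}
```

In the client, `const enqueue = createBatcher(updateAutocomplete)` would be created once, and the read loop would call `enqueue(text)` instead of touching the DOM for every event.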

Browser support

SSE is well supported (IE11 is dead). But response.body.getReader() requires modern browsers. For older browsers, fall back to polling or use a polyfill.
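A feature check lets the client decide at runtime whether to stream or fall back. This is a rough sketch: `supportsStreamingFetch` is a hypothetical helper, and the `scope` parameter exists only so the check can be exercised outside a browser.

```javascript
// Hypothetical helper: true when the environment can read a fetch
// response body incrementally and decode it as text.
function supportsStreamingFetch(scope = globalThis) {
  return (
    typeof scope.fetch === 'function' &&
    typeof scope.ReadableStream === 'function' &&
    typeof scope.Response === 'function' &&
    'body' in scope.Response.prototype &&
    typeof scope.TextDecoder === 'function'
  );
}
```

If it returns false, the client can fall back to a plain request that waits for the full response, or to polling.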

When NOT to use streaming

  • If your AI response is always fast (under 100ms), the overhead of SSE isn’t worth it.
  • If you need bidirectional communication (e.g., the client sends data mid-stream), use WebSockets.
  • If your backend can’t stream (e.g., a serverless platform or proxy that buffers the whole response before sending it), you’ll need a different architecture.

What I’d do differently next time

I’d use a library like @microsoft/fetch-event-source which provides an EventSource-like API on top of fetch. It handles reconnection, backoff, and parsing automatically. But for learning purposes, raw fetch is fine.

I’d also design the server to emit progress events (e.g., data: {"partial": "hello"}) rather than raw token chunks. That way the client doesn’t need to guess the structure.
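A sketch of what that could look like on the server, assuming tokens arrive one at a time. `createProgressEmitter` is a hypothetical helper, and the `done` field is my own addition for illustration; only the `partial` field comes from the idea above.

```javascript
// Hypothetical helper: emits the suggestion so far after every token,
// so the client can replace its text instead of stitching tokens together.
function createProgressEmitter(write) {
  let partial = '';
  return {
    token(t) {
      partial += t;
      write(`data: ${JSON.stringify({ partial })}\n\n`);
    },
    done() {
      // Final event repeats the full text and marks completion
      write(`data: ${JSON.stringify({ partial, done: true })}\n\n`);
    },
  };
}
```

In the route handler this would be created as `createProgressEmitter((s) => res.write(s))`. Sending the cumulative text costs more bytes per event, which seems acceptable for short suggestions, and in exchange the client logic shrinks to "show the latest `partial`".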


Lessons learned

Streaming isn’t just about speed — it’s about perception. A slow-but-progressive UI beats a fast-but-static one every time. Users would rather see the answer appear word by word than wait for the full thing.

SSE gave me that control without heavy infrastructure. Next time you build an AI feature that feels sluggish, think about streaming first.


What’s your approach to handling AI latency? Have you tried streaming, or do you still rely on full responses? I’d love to hear about your experience in the comments.
