Mattias chaw

Posted on Jun 23

Streaming LLM Responses With Server-Sent Events: Build a Real-Time AI Chat Interface

#webdev #programming #ai #javascript

If you have ever stared at a loading spinner for 15 seconds waiting for an LLM to finish generating a 2,000-word response, you already know why streaming matters. Users do not care about your backend architecture — they care about seeing words appear.

Server-Sent Events (SSE) is the standard way to stream LLM output to browsers. Not WebSockets. Not polling. SSE. In this guide, we build a production-grade streaming chat interface that works with any OpenAI-compatible API endpoint.

Why SSE, Not WebSockets?

WebSockets are bidirectional and powerful, but they are also overkill for LLM streaming. The data flow is one-directional: the server pushes tokens, the client reads them. SSE gives you exactly that with less complexity:

Auto-reconnection built into the browser
Simple HTTP — no upgrade handshake, no protocol negotiation
Works through proxies without special configuration
EventSource API is native to every modern browser

The only real limitation is that SSE is one-way (server → client), which is precisely what we need.
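
To make "simple HTTP" concrete, it helps to see what an SSE stream looks like on the wire. The sketch below builds frames by hand with a hypothetical helper, formatSseEvent (it is not part of any library). The field names event, id, retry, and data come from the SSE specification, and a blank line ends each event.

```javascript
// Hypothetical helper for illustration: serializes one SSE event as text.
function formatSseEvent({ data, event, id, retry }) {
  const lines = [];
  if (event) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  if (retry !== undefined) lines.push(`retry: ${retry}`);
  // A payload containing newlines must be sent as several data: lines
  for (const part of String(data).split("\n")) {
    lines.push(`data: ${part}`);
  }
  // A blank line marks the end of the event
  return lines.join("\n") + "\n\n";
}

const frame = formatSseEvent({
  event: "token",
  id: 7,
  data: JSON.stringify({ content: "Hi" }),
});
console.log(JSON.stringify(frame));
```

The id field is what EventSource sends back in the Last-Event-ID header when it reconnects, and retry sets the reconnection delay in milliseconds.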

The Backend: Node.js + Express

Here is a minimal streaming endpoint that proxies requests to an OpenAI-compatible API:

const express = require("express");
const app = express();

app.use(express.json());

app.post("/api/chat", async (req, res) => {
  const { messages, model } = req.body;

  // Required SSE headers
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");
  res.setHeader("Connection", "keep-alive");
  res.setHeader("X-Accel-Buffering", "no"); // Disable nginx buffering

  try {
    const response = await fetch("https://api.aiwave.live/v1/chat/completions", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        Authorization: `Bearer ${process.env.AI_API_KEY}`,
      },
      body: JSON.stringify({
        model: model || "deepseek-chat",
        messages,
        stream: true,
      }),
    });

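    // Nothing has been written yet, so an upstream failure (for example 429)
    // can still be forwarded as a real HTTP status instead of an empty stream
    if (!response.ok) {
      const retryAfter = response.headers.get("Retry-After");
      if (retryAfter) res.setHeader("Retry-After", retryAfter);
      res.status(response.status);
      return; // the finally block ends the response
    }

    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = ""; // holds a trailing partial line between reads

    while (true) {
      const { done, value } = await reader.read();
      if (done) break;

      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split("\n");
      buffer = lines.pop(); // keep the incomplete last line for the next read

      for (const line of lines) {
        if (!line.startsWith("data: ")) continue;
        const data = line.slice(6).trim();
        if (data === "[DONE]") {
          res.write("event: done\ndata: {}\n\n");
          continue;
        }

        try {
          const parsed = JSON.parse(data);
          const content = parsed.choices?.[0]?.delta?.content;
          if (content) {
            res.write(`data: ${JSON.stringify({ content })}\n\n`);
          }
        } catch {
          // Skip lines whose payload is not valid JSON
        }
      }
    }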
  } catch (err) {
    res.write(`event: error\ndata: ${JSON.stringify({ error: err.message })}\n\n`);
  } finally {
    res.end();
  }
});

app.listen(3000, () => console.log("Server running on :3000"));

A few things worth noting:

The X-Accel-Buffering: no header is critical if you are behind nginx. Without it, nginx buffers the entire response and your users see nothing until the LLM finishes.
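Network chunks do not line up with SSE line boundaries, so a single data: line can arrive split across two reads. The parser has to carry the trailing partial line over to the next read; otherwise that token is silently dropped.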
The [DONE] sentinel is how OpenAI-compatible APIs signal end-of-stream.
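
The chunk-boundary issue is easy to underestimate, so here is a minimal sketch of a line buffer that holds back an incomplete trailing line until a later read completes it. The name createLineBuffer is hypothetical, and the sample chunks are made up for illustration.

```javascript
// Hypothetical helper: turns arbitrary text chunks into complete lines.
function createLineBuffer() {
  let pending = ""; // text after the last newline seen so far
  return {
    push(chunk) {
      const parts = (pending + chunk).split("\n");
      pending = parts.pop(); // the last piece may be an incomplete line
      return parts;
    },
    flush() {
      // Call once the stream ends to retrieve any unterminated final line
      const rest = pending;
      pending = "";
      return rest ? [rest] : [];
    },
  };
}

// Illustrative input: one data: line split across two network chunks
const lineBuffer = createLineBuffer();
const seen = [];
for (const chunk of ['data: {"con', 'tent":"Hel"}\n\ndata: [DONE]\n']) {
  for (const line of lineBuffer.push(chunk)) {
    if (line.startsWith("data: ")) seen.push(line.slice(6));
  }
}
console.log(seen);
```

Splitting each chunk on its own would have produced two fragments that both fail JSON.parse. With the buffer, the payload is reassembled before it is parsed.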

The Frontend: Vanilla Fetch Streaming

You do not need a framework. Because EventSource only supports GET requests and the chat endpoint expects a POST, the client below uses the Fetch API with a ReadableStream instead:

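async function streamChat(userMessage, onToken) {
  let emit = onToken;
  if (!emit) {
    // No handler supplied: render tokens into a new assistant bubble
    const output = document.getElementById("chat-output");
    const bubble = document.createElement("div");
    bubble.className = "assistant-message";
    output.appendChild(bubble);
    emit = (text) => { bubble.textContent += text; };
  }

  const response = await fetch("/api/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      model: "deepseek-chat",
      messages: [
        { role: "system", content: "You are a helpful coding assistant." },
        { role: "user", content: userMessage },
      ],
    }),
  });

  // Surface HTTP failures instead of trying to read an error body as a stream
  if (!response.ok) {
    throw new Error(`Chat request failed with status ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = "";

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;

    buffer += decoder.decode(value, { stream: true });
    const events = buffer.split("\n\n"); // a blank line ends each SSE event
    buffer = events.pop(); // keep the incomplete last event for the next read

    for (const evt of events) {
      if (evt.startsWith("event: done")) continue;

      const dataMatch = evt.match(/^data: (.+)$/m);
      if (!dataMatch) continue; // comments such as heartbeats have no data line

      let payload;
      try {
        payload = JSON.parse(dataMatch[1]);
      } catch {
        continue; // ignore events whose payload is not valid JSON
      }

      // The server reports failures as "event: error"; throw so callers can react
      if (evt.startsWith("event: error")) throw new Error(payload.error);
      if (payload.content) emit(payload.content);
    }
  }
}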
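
The regular expression above only reads the first data: line of each event. For comparison, a slightly fuller parser might look like the sketch below. The name parseSseEvent is hypothetical; the behavior follows the SSE rules for comments, the event field, and multi-line data, and it deliberately leaves out id and retry.

```javascript
// Hypothetical helper: parses one event block (the text between blank lines).
function parseSseEvent(block) {
  let event = "message"; // default event type per the SSE specification
  const dataLines = [];
  for (const line of block.split("\n")) {
    if (line === "" || line.startsWith(":")) continue; // blank or comment line
    const colon = line.indexOf(":");
    const field = colon === -1 ? line : line.slice(0, colon);
    let value = colon === -1 ? "" : line.slice(colon + 1);
    if (value.startsWith(" ")) value = value.slice(1); // one optional leading space
    if (field === "event") event = value;
    else if (field === "data") dataLines.push(value);
  }
  // Blocks without any data line (for example heartbeats) yield null
  if (dataLines.length === 0) return null;
  return { event, data: dataLines.join("\n") };
}

console.log(parseSseEvent('data: {"content":"Hi"}'));
console.log(parseSseEvent("event: done\ndata: {}"));
console.log(parseSseEvent(": heartbeat"));
```

Returning the event name alongside the data lets the caller branch on done and error explicitly instead of relying on string prefixes.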

Handling Reconnection and Errors

Production systems need more than the happy path. Here are the three failure modes that will bite you:

1. Network Drops

SSE auto-reconnects natively with EventSource, but fetch streaming does not. If the connection drops mid-stream, you need manual recovery:

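const MAX_RETRIES = 3;

async function resilientStream(message, onToken) {
  // Keep the attempt counter local so concurrent chats do not share retry state
  for (let attempt = 0; ; attempt++) {
    try {
      // Each retry restarts generation from the beginning, so the caller
      // should discard partial output before tokens start arriving again
      return await streamChat(message, onToken);
    } catch (err) {
      if (attempt >= MAX_RETRIES) throw err;
      // Exponential backoff: 2s, 4s, 8s, capped at 10s
      const backoff = Math.min(1000 * 2 ** (attempt + 1), 10000);
      await new Promise((r) => setTimeout(r, backoff));
    }
  }
}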

2. Rate Limiting

APIs return 429 Too Many Requests when you exceed rate limits. Always check the Retry-After header:

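// Inside streamChat, right after the fetch call
if (response.status === 429) {
  // Retry-After may be a number of seconds or an HTTP date;
  // fall back to 5 seconds when it is missing or not a plain number
  const seconds = Number.parseInt(response.headers.get("Retry-After") ?? "", 10);
  const wait = Number.isNaN(seconds) ? 5 : seconds;
  await new Promise((r) => setTimeout(r, wait * 1000));
  return streamChat(userMessage);
}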
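
Retry-After is not always a number: HTTP also allows it to be a date. A small helper that accepts both forms might look like the sketch below. The name retryAfterMs and the 5-second fallback are assumptions chosen to match the snippet above, not part of any API.

```javascript
// Hypothetical helper: converts a Retry-After header value to milliseconds.
// The header may hold delay seconds ("120") or an HTTP date.
function retryAfterMs(headerValue, nowMs = Date.now(), fallbackMs = 5000) {
  if (!headerValue) return fallbackMs;
  const value = headerValue.trim();
  if (/^\d+$/.test(value)) return Number(value) * 1000;
  const dateMs = Date.parse(value);
  if (Number.isNaN(dateMs)) return fallbackMs;
  return Math.max(0, dateMs - nowMs); // never wait a negative amount
}

// Fixed reference time so the example is deterministic
const now = Date.parse("Wed, 21 Oct 2015 07:28:00 GMT");
console.log(retryAfterMs("120", now));
console.log(retryAfterMs("Wed, 21 Oct 2015 07:28:30 GMT", now));
console.log(retryAfterMs(null, now));
```

In the browser this would be called as retryAfterMs(response.headers.get("Retry-After")), with the result passed straight to setTimeout.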

3. Heartbeat Keep-Alive

Some load balancers and CDN proxies (Cloudflare, AWS ALB) kill idle connections after 60–100 seconds. If your LLM takes a long time to generate the first token — common with reasoning models — the proxy may close the connection before any data flows.

Fix it with periodic heartbeats:

// Server-side: send a comment every 15 seconds
const heartbeat = setInterval(() => {
  res.write(": heartbeat\n\n");
}, 15000);

res.on("close", () => clearInterval(heartbeat));

Comments (lines starting with :) are part of the SSE spec and are silently ignored by the client, but they keep the TCP connection alive.

Choosing the Right Model for Streaming

Not all models stream equally well. Throughput matters because users perceive anything below ~20 tokens/second as sluggish. For cost-effective streaming, aiwave.live provides unified access to 50+ models including DeepSeek, GLM, Qwen, and ERNIE, all through a single OpenAI-compatible endpoint, so switching models only requires changing the model name in the request.

Some models worth comparing for streaming throughput:

Model	Avg Tokens/sec	Context Window (tokens)	Cost (USD per 1M input tokens)
DeepSeek Chat	~60	128K	$0.27
GLM-4 Flash	~80	128K	$0.10
Qwen Plus	~50	131K	$0.40

Prices are approximate as of mid-2026 and vary by provider.
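
Throughput also varies with provider load, so it can be worth measuring on your own traffic. The sketch below is one way to do that. createStreamMeter is a hypothetical helper, and it counts streamed chunks as a rough stand-in for tokens, since a chunk is not guaranteed to hold exactly one token. The clock is injectable only so the example runs deterministically.

```javascript
// Hypothetical helper: tracks time to first chunk and streaming rate.
function createStreamMeter(now = () => Date.now()) {
  const startedAt = now();
  let firstAt = null;
  let lastAt = null;
  let chunks = 0;
  return {
    record() {
      lastAt = now();
      if (firstAt === null) firstAt = lastAt;
      chunks++;
    },
    summary() {
      if (firstAt === null) {
        return { chunks: 0, timeToFirstMs: null, chunksPerSecond: 0 };
      }
      const streamingMs = lastAt - firstAt;
      return {
        chunks,
        timeToFirstMs: firstAt - startedAt,
        // Rate over the streaming window, excluding the wait for the first chunk
        chunksPerSecond: streamingMs > 0 ? ((chunks - 1) * 1000) / streamingMs : 0,
      };
    },
  };
}

// Illustrative run with a fake clock: first chunk after 800 ms, then one every 20 ms
const ticks = [0, 800, 820, 840, 860];
let tick = 0;
const meter = createStreamMeter(() => ticks[tick++]);
for (let i = 0; i < 4; i++) meter.record();
console.log(meter.summary());
```

In a real client, record() would be called once per received content event, for example from the onToken callback.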

Putting It All Together

Here is the minimal HTML page that ties everything together:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <title>AI Chat Stream</title>
  <style>
    body { font-family: system-ui; max-width: 720px; margin: 40px auto; padding: 0 16px; }
    #chat-output { min-height: 300px; border: 1px solid #ddd; padding: 16px; border-radius: 8px; }
    .assistant-message { margin: 8px 0; line-height: 1.6; }
    .user-message { font-weight: 600; color: #555; }
    input { width: 70%; padding: 10px; border: 1px solid #ccc; border-radius: 4px; }
    button { padding: 10px 20px; background: #333; color: white; border: none; border-radius: 4px; cursor: pointer; }
  </style>
</head>
<body>
  <div id="chat-output"></div>
  <input id="user-input" placeholder="Ask anything..." />
  <button onclick="send()">Send</button>
  <script>
    async function send() {
      const input = document.getElementById("user-input");
      const msg = input.value.trim();
      if (!msg) return;
      input.value = "";

      const output = document.getElementById("chat-output");
      const userDiv = document.createElement("div");
      userDiv.className = "user-message";
      userDiv.textContent = msg;
      output.appendChild(userDiv);

      const bubble = document.createElement("div");
      bubble.className = "assistant-message";
      output.appendChild(bubble);

      const res = await fetch("/api/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({
          model: "deepseek-chat",
          messages: [{ role: "user", content: msg }],
        }),
      });

      const reader = res.body.getReader();
      const decoder = new TextDecoder();
      let buffer = "";

      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
        buffer += decoder.decode(value, { stream: true });
        const events = buffer.split("\n\n");
        buffer = events.pop();
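        for (const evt of events) {
          const m = evt.match(/^data: (.+)$/m);
          if (!m) continue;
          try {
            // The done and error events carry no content field; skip them
            // so the string "undefined" is never appended to the bubble
            const { content } = JSON.parse(m[1]);
            if (content) bubble.textContent += content;
          } catch {}
        }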
      }
    }
  </script>
</body>
</html>

Key Takeaways

SSE is the right tool for LLM streaming — simpler than WebSockets, native browser support.
Use fetch + ReadableStream instead of EventSource when you need POST requests.
Heartbeats prevent proxy timeouts — send an SSE comment every 15 seconds.
Always handle reconnection with exponential backoff.
Model throughput matters — compare tokens/sec, not just price.

Streaming is the single biggest UX improvement you can make to an AI-powered application. The difference between waiting 10 seconds for a full response and watching words appear in real-time is the difference between a tool that feels alive and one that feels broken.

Happy streaming!

Top comments (1)

Alex Shev • Jun 23

SSE is still a clean primitive for LLM streaming because the interaction is mostly one-way and progressive. The product detail is handling cancellation, partial output, retries, and final state without confusing the user.