zhongqiyue

Posted on Jun 2

Fixing Real-Time AI Chat Latency in a Browser App

#ai #webdev #javascript #tutorial

You know that feeling when you show a working prototype to a friend, they type a question, and then… everyone just stares at the spinner for six seconds? That was me last month. I was building a small AI assistant for a side project—nothing fancy, just a chat widget that answered questions about my documentation. I thought I was done. I thought it was good. Then real users hit the endpoint.

The Problem: Spinners Kill Conversations

The initial implementation was naive: wait for the whole LLM response (often 10–20 seconds), then render it. Locally, with cached data, that felt fine. In production with GPT-4, each call felt like a loading screen from the 90s. Users typed a message, saw the spinner, got distracted, and never came back. The bounce rate was brutal.

I tried a few things:

Hitting a cheaper model (LLaMA 3 via Groq) – faster, but the quality drop wasn’t acceptable for my use case.
Pre-caching common questions – helped a little, but every new query was back to the grind.
Adding a “thinking” animation – cosmetic only; people still left.

The real fix wasn’t about hiding latency. It was about streaming the response token-by-token, so the user sees text appear immediately, even if the full response takes time.

What Actually Works: Server-Sent Events + Streaming API

Most modern LLM APIs (OpenAI, Anthropic, and even self-hosted local models) support streaming via Server-Sent Events (SSE). Instead of waiting for the full JSON body, you receive a stream of events—each containing a token or a chunk of text. The browser’s EventSource or the Fetch API’s ReadableStream can process these chunks and update the UI in real time.
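To make the wire format concrete: an SSE stream is plain text where each event is one or more `data:` lines followed by a blank line. Here's a rough sketch of a parser for that shape. It's deliberately simplified (it ignores the `event:`, `id:`, and `retry:` fields and assumes `\n` line endings), and `parseSseEvents` is a name made up for this example, not part of any library.

```javascript
// Simplified SSE parser: splits a text buffer into complete events.
// Returns the payloads of complete events plus any unfinished tail,
// which the caller should prepend to the next network chunk.
function parseSseEvents(buffer) {
  const frames = buffer.split('\n\n');
  const rest = frames.pop(); // last piece may be an incomplete event
  const events = [];
  for (const frame of frames) {
    const data = frame
      .split('\n')
      .filter((line) => line.startsWith('data:'))
      .map((line) => line.slice(5).replace(/^ /, '')) // strip one optional leading space
      .join('\n');
    if (data) events.push(data);
  }
  return { events, rest };
}

// Two complete events and one that is still arriving.
const { events, rest } = parseSseEvents(
  'data: {"text":"Hel"}\n\ndata: {"text":"lo"}\n\ndata: {"te'
);
console.log(events); // events: ['{"text":"Hel"}', '{"text":"lo"}']
console.log(rest);   // rest: the unfinished 'data: {"te' fragment
```

The important part is the `rest` value: network chunks can split an event anywhere, so whatever is left over has to be carried into the next read.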

Here’s the core approach I landed on:

Backend: Forward the LLM’s streaming response to the client as an SSE stream.
Frontend: Read the stream chunk by chunk, appending text to the chat bubble as it arrives.
UX: Show a typing indicator while waiting for the first token, then switch to streaming text.
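The UX step is the one that is easiest to get subtly wrong, so here is a minimal sketch of it. `createStreamingBubble` is a hypothetical helper, and `bubble` is assumed to be anything with a `textContent` property (a DOM element in the browser, a plain object in a test).

```javascript
// Hypothetical helper: shows a placeholder until the first token arrives,
// then switches to appending streamed text.
function createStreamingBubble(bubble, placeholder = '…') {
  let started = false;
  bubble.textContent = placeholder; // typing indicator until the first token
  return {
    append(token) {
      if (!token) return;
      if (!started) {
        bubble.textContent = ''; // first token: drop the indicator
        started = true;
      }
      bubble.textContent += token;
    },
    fail(message) {
      // Before any text: replace the indicator. After: keep the partial answer.
      bubble.textContent = started ? `${bubble.textContent} (${message})` : message;
    },
  };
}

const fakeBubble = { textContent: '' };
const view = createStreamingBubble(fakeBubble);
view.append('Hello');
view.append(', world');
console.log(fakeBubble.textContent); // Hello, world
```

Keeping this logic apart from the network code means the stream reader only ever calls `append` or `fail`.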

Backend (Node.js with Express)

// server.js
import express from 'express';
import { OpenAI } from 'openai';

const app = express();
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

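app.get('/chat', async (req, res) => {
  const { message } = req.query;
  // Reject empty or malformed input before opening a stream
  if (typeof message !== 'string' || !message.trim()) {
    return res.status(400).json({ error: 'message is required' });
  }
  // SSE headers: keep the connection open and prevent caching of the stream
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');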

  try {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [{ role: 'user', content: message }],
      stream: true
    });

    for await (const chunk of stream) {
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) {
        res.write(`data: ${JSON.stringify({ text: content })}\n\n`);
      }
    }
    res.write('data: [DONE]\n\n');
    res.end();
  } catch (err) {
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});

app.listen(3000);

Frontend (Vanilla JavaScript)

<!-- index.html -->
<div id="chat"></div>
<input id="input" />
<button id="send">Send</button>

<script>
  const chat = document.getElementById('chat');
  const input = document.getElementById('input');

  document.getElementById('send').onclick = async () => {
    const msg = input.value;
    input.value = '';
    addBubble(msg, 'user');
    const bubble = addBubble('', 'bot');

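    const response = await fetch(`/chat?message=${encodeURIComponent(msg)}`);
    if (!response.ok || !response.body) {
      bubble.textContent = 'Something went wrong. Please try again.';
      return;
    }
    const reader = response.body.getReader();
    const decoder = new TextDecoder();
    let buffer = '';
    let finished = false;

    while (!finished) {
      const { done, value } = await reader.read();
      if (done) break;
      buffer += decoder.decode(value, { stream: true });
      const lines = buffer.split('\n');
      buffer = lines.pop(); // keep the trailing incomplete line for the next read

      for (const line of lines) {
        if (!line.startsWith('data: ')) continue;
        const data = line.slice(6);
        if (data === '[DONE]') {
          finished = true; // a bare break would only exit this inner loop
          break;
        }
        try {
          const parsed = JSON.parse(data);
          if (parsed.error) {
            bubble.textContent += ` [error: ${parsed.error}]`;
          } else if (parsed.text) {
            bubble.textContent += parsed.text;
          }
        } catch (e) {
          // skip malformed events instead of aborting the whole stream
        }
      }
    }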
  };

  function addBubble(text, role) {
    const div = document.createElement('div');
    div.className = role;
    div.textContent = text;
    chat.appendChild(div);
    return div;
  }
</script>

This changed everything. The first token arrives in under a second, and the user sees text growing word by word. The perceived latency dropped from “forever” to “immediate.”
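If you want a number instead of a feeling, time-to-first-token is the thing to measure. A small sketch, assuming the stream is any async iterable of text chunks; `timeStream` and its injectable `now` clock are made up for this example.

```javascript
// Hypothetical helper: consumes an async iterable of text chunks and reports
// how long the first non-empty chunk and the full response took.
async function timeStream(chunks, now = () => performance.now()) {
  const start = now();
  let firstTokenMs = null;
  let text = '';
  for await (const chunk of chunks) {
    if (chunk && firstTokenMs === null) firstTokenMs = now() - start;
    text += chunk;
  }
  return { firstTokenMs, totalMs: now() - start, text };
}
```

Comparing `firstTokenMs` with `totalMs` shows how much of the wait streaming hides: the total stays the same, but the user only experiences the first number as dead time.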

Lessons Learned (and Trade-offs)

1. Streaming is not free

UI complexity: You now have to handle partial responses, mid-stream errors, and reconnection logic. If the connection drops mid-response, you either lose the whole answer or implement resume logic.
Cost: Streaming doesn’t reduce token count, so you still pay for the full output. What it buys you is lower perceived latency, not a smaller bill.
Cancellation: On the backend, if the client closes the connection, you need to abort the LLM stream to avoid paying for tokens nobody will read. I used req.on('close', () => stream.controller.abort()).
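Here's a rough sketch of how the abort-on-disconnect piece can fit around the forwarding loop. It is an illustration under assumptions, not the exact production code: `forwardUntilClosed` is a made-up name, `res` only needs `on`, `write`, and `end`, and `stream` is assumed to be async-iterable with a `controller` (an AbortController), as in the one-liner above. The sketch listens on `res` rather than `req`; either way, check which `close` event your Node version fires on an actual disconnect.

```javascript
// Sketch: forward tokens to the client and cancel the upstream LLM request
// as soon as the client goes away.
async function forwardUntilClosed(stream, res) {
  let clientGone = false;
  res.on('close', () => {
    clientGone = true;
    stream.controller.abort(); // stop generating tokens nobody will read
  });

  try {
    for await (const chunk of stream) {
      if (clientGone) break;
      const content = chunk.choices[0]?.delta?.content || '';
      if (content) res.write(`data: ${JSON.stringify({ text: content })}\n\n`);
    }
    if (!clientGone) res.write('data: [DONE]\n\n');
  } catch (err) {
    // An error caused by our own abort is expected; rethrow anything else.
    if (!clientGone) throw err;
  } finally {
    if (!clientGone) res.end();
  }
}
```

The `clientGone` flag matters because aborting usually surfaces as a thrown error inside the loop, and you don't want to report that as a real failure or write to a socket that is already closed.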

2. Not every use case needs streaming

For short, factual answers (like a calculator or a weather lookup), the overhead of SSE might not be worth it; a batch response is fine.
For long-form content, code generation, or creative writing, streaming is a game-changer.

3. Consider a dedicated service

I eventually switched to a managed streaming proxy (like the one at https://ai.interwestinfo.com/ – they handle SSE formatting, caching, and abort logic) because my backend went from 20 lines to 200 once I added error handling, rate limiting, and reconnection. But rolling your own for a small project is totally viable.

What I’d Do Differently Next Time

Start with streaming from day one. I wasted a week optimizing batch latency that didn’t matter.
Use a progressive enhancement approach: Show a quick cached greeting while the real stream warms up.
Add a “copy to clipboard” button for streaming responses—users often want to share the full answer after it arrives.
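The copy button is the quickest of these to sketch. This is one possible shape, not a finished component: `makeCopyHandler` and `isDone` are hypothetical names, and the clipboard is passed in so the handler can be exercised outside a browser. In a page it falls back to `navigator.clipboard`, which needs a secure context (HTTPS or localhost).

```javascript
// Hypothetical helper: returns a click handler that copies the bubble's text
// once the stream has finished.
function makeCopyHandler(bubble, isDone, clipboard = navigator.clipboard) {
  return async () => {
    if (!isDone()) return false; // don't copy a half-finished answer
    await clipboard.writeText(bubble.textContent);
    return true;
  };
}

// In the browser (copyButton and streamFinished are placeholders):
// copyButton.onclick = makeCopyHandler(bubble, () => streamFinished);
```

Gating on `isDone` keeps users from sharing an answer that is still being written.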

Over to You

Have you hit the latency wall with AI APIs? What does your streaming setup look like: SSE, WebSockets, or something else? I’d love to hear what worked (or didn’t) in your projects.

DEV Community