Streaming LLM Responses with SSE: Build a Real-Time AI Chat UI in 60 Lines
If you've used ChatGPT, you've seen it: that satisfying typewriter effect where tokens stream in word by word. Behind the curtain, that's Server-Sent Events (SSE) — one of the most underused web standards powering modern AI interfaces.
Most tutorials treat SSE like a novelty. In production AI apps, it's table stakes. Users won't wait 8 seconds staring at a spinner while your model generates a 500-word response. They want to see the first token in under 300ms and watch the rest flow.
Today we'll build a complete streaming chat interface — frontend and backend — in about 60 lines of code. And we'll test it against real Chinese LLMs that cost 90% less than GPT-4o.
Why SSE Over WebSockets for LLM Streaming?
Before we code, let's settle this. Developers often reach for WebSockets when they hear "real-time." For LLM streaming, SSE is almost always the better choice:
| Feature | SSE | WebSocket |
|---|---|---|
| Direction | Server → Client (one-way) | Bidirectional |
| Protocol | Plain HTTP | Own protocol after an HTTP Upgrade handshake |
| Auto-reconnect | Built-in (via EventSource) | Manual |
| Proxy-friendly | Yes | Often problematic |
| Complexity | ~10 lines | ~50+ lines |
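LLM streaming is inherently one-way: the server pushes tokens, the client receives them. You don't need bidirectional communication for that. SSE runs over plain HTTP, so it needs no protocol upgrade and passes through most proxies and load balancers (a buffering proxy may need to be told not to buffer the response). The browser's EventSource API also reconnects automatically. It only issues GET requests, though, so the POST-based client in this article reads the stream with fetch() and would have to handle retries itself.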
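For reference, the wire format is small enough to produce by hand. Here is a minimal sketch of a formatter, assuming only the standard `event`, `id`, and `data` fields. `formatSseEvent` is a hypothetical helper name, not part of Express or any library:

```javascript
// Hypothetical helper: serialize one Server-Sent Event frame.
// Each field goes on its own "name: value" line, multi-line data is sent
// as several "data:" lines, and a blank line terminates the event.
function formatSseEvent(data, { event, id } = {}) {
  let frame = '';
  if (event) frame += `event: ${event}\n`;
  if (id) frame += `id: ${id}\n`;
  for (const line of String(data).split(/\r?\n/)) {
    frame += `data: ${line}\n`;
  }
  return frame + '\n';
}

// Example: a server would call res.write(formatSseEvent(JSON.stringify({ token: 'Hi' })))
```

An OpenAI-style stream uses only the `data:` field, one JSON object per event.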
The Backend: 25 Lines of Node.js
Here's a complete Express endpoint that streams responses from any OpenAI-compatible API:
const express = require('express'); // Node 18+ is needed for the built-in fetch
const app = express();
app.use(express.json());

app.post('/api/chat', async (req, res) => {
  // SSE response headers
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');

  const response = await fetch('https://aiwave.live/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${process.env.AIWAVE_API_KEY}`
    },
    body: JSON.stringify({
      model: req.body.model || 'deepseek-chat',
      messages: req.body.messages,
      stream: true
    })
  });

  // Nothing has been written yet, so an upstream failure can still be a JSON error
  if (!response.ok || !response.body) {
    return res.status(502).json({ error: `Upstream returned ${response.status}` });
  }

  // An OpenAI-compatible stream is already SSE-formatted ("data: {...}" events
  // ending with "data: [DONE]"), so forward the bytes unchanged. Wrapping each
  // chunk in another "data: " prefix would corrupt the events.
  const reader = response.body.getReader();
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    res.write(value);
  }
  res.end();
});

app.listen(3000);
That's it. 25 lines. The endpoint receives a chat request, forwards it to the LLM API with stream: true, and pipes the streamed chunks directly to the client via SSE.
Notice we're using https://aiwave.live/v1/chat/completions as our endpoint — it's an OpenAI-compatible gateway that provides access to 50+ Chinese AI models including DeepSeek, GLM-5, Qwen, and ERNIE through a single API. If you're already using the OpenAI SDK, you just change the baseURL and you're done.
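The same idea applies without the SDK: only the base URL and the API key differ between OpenAI-compatible providers. The sketch below assumes that; the `PROVIDERS` map and `buildChatRequest` are hypothetical names made up for this example.

```javascript
// Base URLs of two OpenAI-compatible providers. The aiwave.live URL is the
// one used in this article; the map itself is an illustrative assumption.
const PROVIDERS = new Map([
  ['openai', 'https://api.openai.com/v1'],
  ['aiwave', 'https://aiwave.live/v1'],
]);

// Hypothetical helper: build the (url, init) pair to pass to fetch().
function buildChatRequest(provider, apiKey, body) {
  const baseURL = PROVIDERS.get(provider);
  if (!baseURL) throw new Error(`Unknown provider: ${provider}`);
  return {
    url: `${baseURL}/chat/completions`,
    init: {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        'Authorization': `Bearer ${apiKey}`,
      },
      // Always request a streamed response
      body: JSON.stringify({ ...body, stream: true }),
    },
  };
}

// Usage inside the endpoint:
//   const { url, init } = buildChatRequest('aiwave', process.env.AIWAVE_API_KEY,
//     { model: 'deepseek-chat', messages: req.body.messages });
//   const response = await fetch(url, init);
```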
The Frontend: 35 Lines of Vanilla JS
No React. No build step. No dependencies. Just an HTML file:
<!DOCTYPE html>
<html>
<body>
<div id="chat" style="font-family:monospace;max-width:600px;margin:50px auto;"></div>
<input id="input" placeholder="Ask anything..." style="width:500px;padding:10px;"
onkeydown="if(event.key==='Enter')send()">
<script>
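const chat = document.getElementById('chat');
const input = document.getElementById('input');
const chatHistory = []; // earlier turns, sent along with every request

async function send() {
  const msg = input.value.trim();
  if (!msg) return;
  input.value = '';

  // Add the user's text as a text node so it is never parsed as HTML
  const userEl = document.createElement('p');
  userEl.innerHTML = '<b>You:</b> ';
  userEl.append(msg);
  chat.appendChild(userEl);

  const replyEl = document.createElement('p');
  replyEl.innerHTML = '<b>AI:</b> ';
  const typing = document.createElement('span');
  replyEl.appendChild(typing);
  chat.appendChild(replyEl);

  const userMsg = { role: 'user', content: msg };
  const response = await fetch('/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ messages: [...chatHistory, userMsg] })
  });
  if (!response.ok) {
    typing.textContent = `Request failed (${response.status})`;
    return;
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let fullText = '';
  let buffer = ''; // holds a partial line when an event is split across chunks
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries
    buffer += decoder.decode(value, { stream: true });
    const lines = buffer.split('\n');
    buffer = lines.pop(); // the last piece is incomplete until its newline arrives
    for (const line of lines) {
      if (!line.startsWith('data: ')) continue;
      const data = line.slice(6).trim();
      if (data === '[DONE]') continue;
      try {
        const json = JSON.parse(data);
        fullText += json.choices?.[0]?.delta?.content || '';
        typing.textContent = fullText;
      } catch (e) {
        // Skip lines that are not valid JSON
      }
    }
  }
  chatHistory.push(userMsg, { role: 'assistant', content: fullText });
}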
</script>
</body>
</html>
That's the whole frontend, a few dozen lines. fetch() exposes the response body as a ReadableStream of raw bytes, and the loop does the SSE parsing by hand: each chunk is decoded, split into data: lines, and parsed as JSON, and the delta.content token is appended to the displayed text.
Making It Production-Ready
The minimal version above works, but production needs a few additions:
1. Error Handling
app.post('/api/chat', async (req, res) => {
  try {
    // ... streaming logic
  } catch (err) {
    // The SSE headers may already be sent, so report the failure as a final event
    res.write(`data: ${JSON.stringify({ error: err.message })}\n\n`);
    res.end();
  }
});
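Whether a proper HTTP status code can still be sent depends on whether anything has been written to the response yet. A minimal sketch of a helper that picks the right form, using Node's `res.headersSent` and `res.writableEnded` plus Express's `res.status().json()`. The name `sendStreamError` is hypothetical:

```javascript
// Hypothetical helper: report an error in whichever form the response can still carry.
function sendStreamError(res, status, message) {
  if (res.writableEnded) return; // response already finished, nothing left to send
  if (!res.headersSent) {
    // Nothing written yet: a regular JSON error with a real status code works
    res.status(status).json({ error: message });
    return;
  }
  // Headers (and possibly tokens) are already out: send a final SSE event instead
  res.write(`data: ${JSON.stringify({ error: message })}\n\n`);
  res.end();
}

// Usage inside the catch block: sendStreamError(res, 502, err.message);
```

The client then needs to check both `response.ok` and any `error` field in the streamed events.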
2. Request Validation
Always validate the model name against an allowlist. Never pass user input directly to your API provider:
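const ALLOWED_MODELS = ['deepseek-chat', 'glm-5', 'qwen-max', 'ernie-4.0-turbo'];
// Apply the endpoint's default first, so requests that omit the model still pass
const model = req.body.model || 'deepseek-chat';
if (!ALLOWED_MODELS.includes(model)) {
  return res.status(400).json({ error: 'Invalid model' });
}
// Send `model` (not req.body.model) in the upstream request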
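The messages array is user input too, and it deserves the same treatment as the model name. A minimal sketch of a validator follows. The role list and the size limits are assumptions chosen for illustration, and `validateMessages` is a hypothetical helper name:

```javascript
// Assumption: a public chat UI only needs these two roles from the client.
const ALLOWED_ROLES = new Set(['user', 'assistant']);
const MAX_MESSAGES = 50;        // assumed limit on conversation length
const MAX_CONTENT_CHARS = 8000; // assumed limit per message

// Returns null when the input is acceptable, otherwise an error string.
function validateMessages(messages) {
  if (!Array.isArray(messages) || messages.length === 0) {
    return 'messages must be a non-empty array';
  }
  if (messages.length > MAX_MESSAGES) return 'too many messages';
  for (const m of messages) {
    if (m === null || typeof m !== 'object') return 'each message must be an object';
    if (!ALLOWED_ROLES.has(m.role)) return 'unsupported role';
    if (typeof m.content !== 'string' || m.content.length === 0) {
      return 'content must be a non-empty string';
    }
    if (m.content.length > MAX_CONTENT_CHARS) return 'content is too long';
  }
  return null;
}

// Usage inside the endpoint:
//   const problem = validateMessages(req.body.messages);
//   if (problem) return res.status(400).json({ error: problem });
```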
3. Rate Limiting
LLM streaming requests are long-lived connections. A single user can exhaust your connection pool fast:
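const activeStreams = new Map(); // client IP -> number of open streams

app.post('/api/chat', async (req, res) => {
  const userId = req.ip; // behind a reverse proxy, configure Express "trust proxy" first
  const current = activeStreams.get(userId) || 0;
  if (current >= 3) {
    return res.status(429).json({ error: 'Too many concurrent streams' });
  }
  activeStreams.set(userId, current + 1);
  res.on('close', () => {
    // Decrement, and drop the entry at zero so the Map does not grow forever
    const remaining = (activeStreams.get(userId) || 1) - 1;
    if (remaining > 0) activeStreams.set(userId, remaining);
    else activeStreams.delete(userId);
  });
  // ... streaming logic
});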
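The same `close` event can also cancel the upstream request when the browser disconnects mid-stream, so the server stops reading tokens nobody will see. This is a minimal sketch using the standard `AbortController`, with `abortOnClientClose` as a hypothetical helper name. Whether the provider also stops generating (and billing) on abort depends on the provider.

```javascript
// Hypothetical helper: returns an AbortController that fires when the client
// connection closes before the response was finished with res.end().
function abortOnClientClose(res) {
  const controller = new AbortController();
  res.on('close', () => {
    if (!res.writableEnded) controller.abort();
  });
  return controller;
}

// Usage inside the endpoint:
//   const controller = abortOnClientClose(res);
//   const response = await fetch(url, { ...options, signal: controller.signal });
// After an abort, the pending reader.read() rejects, so wrap the loop in try/catch.
```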
Why Chinese Models Are Perfect for Streaming
Here's something most developers don't realize: Chinese LLMs aren't just cheaper — they often have lower time-to-first-token (TTFT) than Western alternatives. DeepSeek V4 Pro delivers first tokens in ~200ms, compared to GPT-4o's ~600ms.
For streaming UX, TTFT matters more than total generation time. Users perceive a response as "fast" if the first token appears quickly, even if total generation takes 5+ seconds. That's why routing your streaming endpoints through a service like aiwave.live — which offers all major Chinese models behind one API with $5 free credits — gives you both cost and latency advantages.
Quick Comparison: TTFT Across Models
| Model | TTFT (avg) | Cost / 1M tokens |
|---|---|---|
| GPT-4o | ~600ms | $2.50 |
| Claude 4 Opus | ~800ms | $3.00 |
| DeepSeek V4 Pro | ~200ms | $0.14 |
| GLM-5 | ~250ms | $0.20 |
| Qwen-Max | ~300ms | $0.40 |
The latency difference is immediately noticeable in a streaming UI. Try it yourself — the same prompt will start typing noticeably faster on DeepSeek than on GPT-4o.
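To check the numbers against your own setup, you can time the stream in the browser. The sketch below assumes the recorder is created just before `fetch()` is called, and that `onToken` is called with every `delta.content` value. `createTtftRecorder` is a hypothetical helper, and the figures it reports include your own network latency:

```javascript
// Hypothetical helper: records time-to-first-token and total streaming time.
// `now` is injectable so the logic can be tested with a fake clock.
function createTtftRecorder(now = () => performance.now()) {
  const start = now();
  let firstTokenAt = null;
  let chunks = 0;
  return {
    onToken(token) {
      if (!token) return; // ignore empty deltas (e.g. role-only chunks)
      if (firstTokenAt === null) firstTokenAt = now();
      chunks += 1;
    },
    finish() {
      const end = now();
      return {
        ttftMs: firstTokenAt === null ? null : firstTokenAt - start,
        totalMs: end - start,
        chunks,
      };
    },
  };
}

// Usage in the frontend loop:
//   const timer = createTtftRecorder();      // right before fetch()
//   timer.onToken(token);                    // for each parsed token
//   console.log(timer.finish());             // after the stream ends
```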
Putting It All Together
Here's the complete architecture:
Browser → fetch() with ReadableStream
↓ SSE
Your Server (Express)
↓ HTTP streaming
LLM API (aiwave.live/v1/chat/completions)
↓ Token-by-token
Model inference (DeepSeek / GLM-5 / Qwen)
The beauty of this architecture is its simplicity. No WebSocket libraries, no socket.io, no SignalR. Just HTTP responses with the text/event-stream content type, read on the client with fetch() and a small parsing loop.
Key Takeaways
- SSE beats WebSocket for LLM streaming: simpler, proxy-friendly, and auto-reconnecting when consumed through EventSource
- About 60 lines get you a working chat UI: vanilla JS, no build step
- TTFT matters more than total latency for streaming UX, and Chinese models win here
- Always validate and rate-limit: streaming endpoints are expensive
- OpenAI-compatible APIs make switching trivial: just change the baseURL
If you want to try this with real Chinese AI models, you can sign up at aiwave.live and get $5 in free credits, enough for millions of tokens of experimentation at the prices listed above. The API is fully OpenAI-compatible, so the code above works without modification.
Happy streaming! 🚀