I was building an internal documentation assistant for my team. You know the drill: a chatbot that answers questions about our codebase by pulling relevant context from a vector database and sending it to an LLM. I set up the backend in Python, used a decent model via an API (shoutout to interwestinfo.com for the reliable endpoint), and wired it all up. Simple, right?
Then came the first real test: someone asked a question that required a long, thoughtful answer. The response took over 30 seconds. The user stared at a blank chat bubble, refreshing the page, wondering if the app had crashed. Not a great experience.
I needed to stream the tokens back as they were generated, so the user could read along. This is the classic “chat UI” pattern. But implementing it turned into a rabbit hole of half-baked solutions.
What I Tried That Didn’t Work
1. Polling
My first idea: make the LLM call, store the partial result in Redis, and have the frontend poll every second. This was ugly. The prediction endpoint returned the full response eventually, so I needed to change the backend to write tokens piece by piece. Polling also meant 30-ish HTTP requests per message, which felt wasteful. And the UI was jerky – updates came in bursts, not smoothly.
2. WebSockets
WebSockets seemed like the obvious choice. I wrote a FastAPI WebSocket endpoint, opened a connection, and streamed tokens frame by frame. This worked… except for one thing: my deployment environment (a low-budget VPS behind a load balancer) had aggressive idle timeouts. The connection would drop after 60 seconds, and reconnecting with WebSockets required manual logic. Also, half the libraries in my stack didn't support WebSockets easily – my auth middleware, for instance, expected HTTP requests.
But the real pain: WebSockets are bidirectional. I didn't need bidirectional. I just needed the server to push data to the client. WebSockets felt like overkill.
3. Long Polling (Bad Idea)
Yeah, I tried that too. The server would hold the response open and flush chunks. But HTTP/1.1 connections have issues with that, and my framework (Flask at the time) didn't handle it gracefully without monkey-patching. I gave up after two hours of “connection closed” errors.
What Eventually Worked: Server-Sent Events (SSE)
I had used SSE before for real-time tweets, but never for AI streaming. SSE is a standard (part of HTML5) where the server sends a stream of events over a single, long-lived HTTP connection. The client uses the EventSource API. It’s unidirectional (server → client), which is exactly what I needed.
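To make the wire format concrete, here is a minimal sketch of how a single event is framed. The helper name `format_sse_event` is made up for this post and is not part of any library; it follows the framing rules of the SSE standard: one `data:` line per line of payload, an optional `event:` name, and a blank line to end the event.

```python
from typing import Optional


def format_sse_event(data: str, event: Optional[str] = None) -> str:
    """Frame one Server-Sent Event (hypothetical helper, not a library API)."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # A payload with embedded newlines is sent as several `data:` lines;
    # the client joins them back together with "\n".
    for chunk in data.split("\n"):
        lines.append(f"data: {chunk}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


print(repr(format_sse_event("Hello")))
print(repr(format_sse_event("line one\nline two", event="token")))
```

The second call shows why the format matters for LLM output: a payload containing a newline needs one `data:` line per line, otherwise the client sees a truncated event.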
FastAPI can serve SSE through a plain StreamingResponse: set the media type to text/event-stream and format each chunk as an SSE event yourself. Here’s the backend code that made my UX smooth again:
import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()


async def generate_tokens(prompt: str):
    # Assume get_llm_response is an async generator that yields tokens
    # (e.g., wrapping OpenAI's streaming API with `stream=True`).
    async for token in get_llm_response(prompt, stream=True):
        # Caveat: a token containing "\n" would break this simple framing.
        yield f"data: {token}\n\n"
        await asyncio.sleep(0.01)  # simulate latency (demo only)
    # Sentinel event so the client knows the answer is complete.
    yield "data: [DONE]\n\n"


@app.post("/chat")
async def chat(request: Request):
    body = await request.json()
    prompt = body["message"]
    return StreamingResponse(generate_tokens(prompt), media_type="text/event-stream")
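The endpoint assumes a `get_llm_response` async generator that this post never defines. For trying things out locally without a real model, a stand-in along these lines should be enough. The canned reply, the delay, and the `collect` helper are invented for illustration; a real implementation would iterate over the provider's streaming response instead.

```python
import asyncio
from typing import AsyncIterator


async def get_llm_response(prompt: str, stream: bool = True) -> AsyncIterator[str]:
    """Stand-in for a real LLM client: yields a canned reply word by word.

    `stream` is accepted only to match the call site; it is ignored here.
    """
    reply = f"You asked: {prompt}. This is a placeholder answer."
    for word in reply.split(" "):
        await asyncio.sleep(0.01)  # pretend the model needs time per token
        yield word + " "


async def collect(prompt: str) -> str:
    """Drain the generator the same way the streaming endpoint would."""
    return "".join([token async for token in get_llm_response(prompt)])


if __name__ == "__main__":
    print(asyncio.run(collect("how do I deploy?")))
```

Because it has the same shape as the assumed generator, the stub can be swapped for a real client later without touching the endpoint.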
My first attempt at the frontend looked trivial:
const eventSource = new EventSource('/chat', {
  method: 'POST',
  body: JSON.stringify({ message: userInput })
  // Not valid: EventSource ignores these options and always sends a GET
});
Wait, that's the trickiest part. The EventSource API only supports GET requests, and my chat endpoint needs a POST with the prompt. That left two options: refactor to a GET endpoint that accepts the prompt as a query parameter (ugly and limited), or write a small wrapper that uses fetch to POST and then reads the response body as a stream manually.
I went with fetch + ReadableStream for more control:
async function startStream(prompt) {
  const response = await fetch('/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ message: prompt })
  });
  if (!response.ok) {
    throw new Error(`Chat request failed: ${response.status}`);
  }

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  let buffer = '';

  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffer += decoder.decode(value, { stream: true });

    // Split by SSE format "data: ...\n\n"
    const parts = buffer.split('\n\n');
    buffer = parts.pop(); // keep incomplete chunk

    for (const part of parts) {
      // No trim() here: it would strip whitespace that belongs to the token
      if (part.startsWith('data: ')) {
        const token = part.slice(6);
        if (token === '[DONE]') {
          return; // stream finished
        }
        appendToken(token);
      }
    }
  }
}
This works well. No WebSocket library, no custom reconnection protocol, just plain HTTP. If the connection drops (e.g., a timeout), the fetch or the pending reader.read() rejects, and I can retry with a new request; unlike EventSource, fetch does not reconnect on its own. The UX is fluid: tokens appear as they're generated.
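The buffering step is the easiest part to get wrong, so here is the same idea as a Python sketch that can be tested without a browser. `SSEBuffer` is a hypothetical name, and it handles only the `data: ` lines used in this post, not the full SSE grammar (no `event:`, `id:`, or `retry:` fields).

```python
from typing import List


class SSEBuffer:
    """Incrementally extract `data: ` payloads from arbitrarily split chunks."""

    def __init__(self) -> None:
        self._buffer = ""

    def feed(self, chunk: str) -> List[str]:
        """Add a decoded chunk and return the payloads of all complete events."""
        self._buffer += chunk
        # Events end with a blank line; the last piece may be incomplete,
        # so it goes back into the buffer for the next call.
        *complete, self._buffer = self._buffer.split("\n\n")
        payloads = []
        for event in complete:
            if event.startswith("data: "):
                # Slice instead of strip() so whitespace inside tokens survives.
                payloads.append(event[len("data: "):])
        return payloads


parser = SSEBuffer()
tokens = []
# Network chunks rarely line up with event boundaries.
for chunk in ["data: Hel", "lo\n\ndata:  wor", "ld\n\ndata: [DONE]\n", "\n"]:
    tokens.extend(parser.feed(chunk))
print(tokens)  # ['Hello', ' world', '[DONE]']
```

Note that the leading space of the second token survives, and that an event terminator split across two chunks is still recognized.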
Lessons Learned & Trade-offs
- SSE is simple for server-to-client streaming. If you need bidirectional (like a multiplayer game), WebSockets are better.
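- The EventSource API is limited to GET. Workaround: use fetch with a ReadableStream.
- SSE works over HTTP/1.1 and HTTP/2. The app server needs no special config, though a reverse proxy that buffers responses (nginx does by default) has to be told to pass the stream through unbuffered.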
- Browser support is universal (IE is dead, Edge and Safari work fine).
- Backpressure: if the client is slow, the server just buffers – but with token streaming, tokens are small so that's rarely a problem.
- Security: SSE connections are just HTTP; same CORS and auth rules apply. I passed a token in the POST body, not the URL.
One downside: SSE events carry only UTF-8 text, so structured data has to be serialized by hand (JSON is the usual choice), whereas WebSocket frames can also be binary. But for plain text tokens, it's ideal.
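One way to carry structured data, and to keep tokens that contain newlines intact, is to JSON-encode each payload. This is a sketch of a convention, not something the SSE standard requires. The field names `token` and `done` are invented for the example, and the browser side would need a matching `JSON.parse` on each `data:` payload.

```python
import json


def encode_event(token: str, done: bool = False) -> str:
    """Serialize one token as a single-line JSON SSE event."""
    # json.dumps escapes newlines as \n, so the payload never spans lines.
    payload = json.dumps({"token": token, "done": done})
    return f"data: {payload}\n\n"


def decode_event(frame: str) -> dict:
    """Inverse of encode_event for one complete frame."""
    return json.loads(frame[len("data: "):].rstrip("\n"))


frame = encode_event("first line\nsecond line")
# Only the terminating blank line contains real newline characters.
assert "\n" not in frame[:-2]
print(decode_event(frame)["token"] == "first line\nsecond line")  # True
```

With this convention the `[DONE]` sentinel can become `{"done": true}`, so a model that happens to emit the literal text `[DONE]` no longer ends the stream early.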
What I'd Do Differently Next Time
I would skip the WebSocket experiment entirely. For chat apps, LLM streaming, or any real-time data that flows one way (e.g., notifications, logs), SSE is the right tool. Next time, I'd also build a small abstraction over the fetch + ReadableStream to handle reconnection automatically (exponential backoff, etc.).
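As a rough idea of what that retry policy could look like, here is exponential backoff with full jitter. It is written in Python to keep it testable; the same arithmetic ports directly to a browser wrapper. The function name, base delay, and cap are arbitrary choices for this sketch.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    rng = rng or random
    # Double the ceiling on every attempt, but never exceed the cap.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly in [0, ceiling] so that many clients
    # dropped at the same moment do not all retry in lockstep.
    return rng.uniform(0, ceiling)


for attempt in range(5):
    ceiling = min(30.0, 0.5 * 2 ** attempt)
    print(f"attempt {attempt}: wait up to {ceiling:.1f}s, "
          f"chose {backoff_delay(attempt):.2f}s")
```

One caveat with the endpoint as written in this post: a retry re-sends the prompt, so the answer starts over from the first token rather than resuming where the stream broke.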
Also, I'd check if my LLM provider supports SSE out of the box. Some do (OpenAI's data: [DONE] format is already SSE-compatible). Others, like the one I used from interwestinfo.com, return tokens via a custom endpoint, but I can easily wrap that as an async generator.
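That provider's actual API isn't shown in this post, so the following is only a generic sketch. It assumes the client hands back a blocking iterator of tokens (the `fake_blocking_client` below is entirely made up) and adapts it into an async generator by pulling each item in a worker thread, so the event loop stays free for other requests. It relies on `asyncio.to_thread`, which needs Python 3.9 or newer.

```python
import asyncio
from typing import AsyncIterator, Iterable, Iterator

_END = object()  # sentinel marking iterator exhaustion


async def iterate_in_thread(blocking_tokens: Iterable[str]) -> AsyncIterator[str]:
    """Adapt a blocking token iterator into an async generator."""
    iterator = iter(blocking_tokens)
    while True:
        # next() may block on network I/O, so run it off the event loop.
        # The default value avoids raising StopIteration inside the thread.
        token = await asyncio.to_thread(next, iterator, _END)
        if token is _END:
            break
        yield token


def fake_blocking_client(prompt: str) -> Iterator[str]:
    """Made-up stand-in for a provider SDK that yields tokens synchronously."""
    yield from ["Echo", ": ", prompt]


async def main() -> str:
    tokens = iterate_in_thread(fake_blocking_client("hi"))
    return "".join([token async for token in tokens])


print(asyncio.run(main()))  # Echo: hi
```

If the provider already exposes an async client, none of this is needed and the tokens can be yielded directly.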
Your Turn
Have you built a streaming AI UI? Did you use SSE, WebSockets, or something else? I’m curious how you handled reconnection and error states. Share your setup – I learn a lot from these discussions.