I spent three weeks building what I thought would be a simple AI chat interface. You know the drill: user types a question, AI streams back a response word by word. I had the API key, I had the SDK, I had the confidence. Two days in, I was drowning in half-finished sentences and dead connections.
Let me walk you through the mess I made—and how I finally got streaming to work reliably.
The Problem: Chunks That Never End
My app was straightforward: a Python FastAPI backend that called an AI API (I was using a service from interwestinfo at the time) and streamed the response back to a React frontend via Server-Sent Events. The first version worked great for short queries. But as soon as users asked multi-paragraph questions or tried to generate code, responses would cut off mid-word. The frontend would show "The capital of France is Par" and then hang.
I checked my logs: no errors. The server just stopped sending chunks. The AI API had returned a complete response on its end—my code had swallowed the last few bytes.
What I Tried First (and Failed)
Longer Timeouts
First instinct: "The API is slow, just increase the timeout." I bumped the read timeout from 30 seconds to 120 seconds. It didn't help. The problem wasn't that the API took too long—it was that my code thought the stream had ended when it hadn't.
Retry Logic
I wrapped the request in a retry loop with exponential backoff. Now instead of missing chunks, I got duplicate chunks. Users saw "ParParis is the capital" because the second request re-sent the first chunk.
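The duplicate text comes from retrying after part of the response has already reached the user. One way around it, sketched here under assumptions, is to retry only while nothing has been emitted yet. The names `stream_with_safe_retry` and `open_stream` are hypothetical, and the attempt count and backoff values are arbitrary.

```python
import asyncio
from typing import AsyncIterator, Callable


async def stream_with_safe_retry(
    open_stream: Callable[[], AsyncIterator[str]],  # hypothetical factory: starts a fresh request
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> AsyncIterator[str]:
    """Retry a streaming call only if it failed before producing any output."""
    for attempt in range(1, max_attempts + 1):
        emitted = False
        try:
            async for chunk in open_stream():
                emitted = True
                yield chunk
            return  # stream finished normally
        except Exception:
            # Once the user has seen text, a retry would replay it, so give up instead.
            if emitted or attempt == max_attempts:
                raise
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))
```

A failure in the middle of a stream is re-raised to the caller, which can then decide whether to show a "response interrupted" message instead of silently restarting.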
Manual Buffering
I tried collecting all chunks in a buffer and only flushing after a delay. This turned streaming into batching—the whole point of streaming (low latency) was gone.
What Actually Worked: Proper Chunk Assembly
The root cause was subtle: the AI API I was using (and many others) doesn't send chunks as bare JSON. It sends Server-Sent Events: each chunk is a line that starts with "data: " followed by a JSON payload, and events are separated by blank lines. My parser was reading line by line but closed the connection too early, treating a fixed number of lines or an empty line as the end of the stream. The fix was to wait for the actual stream termination signal.
I'll show you the code that finally worked:
```python
import codecs
import logging
from typing import AsyncGenerator

import httpx

logger = logging.getLogger(__name__)


async def stream_ai_response(prompt: str, api_key: str, endpoint: str) -> AsyncGenerator[str, None]:
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream",
    }
    payload = {
        "model": "gpt-4",  # or whatever model
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }
    # Incremental decoder: a multi-byte UTF-8 character can be split across two chunks
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    async with httpx.AsyncClient(timeout=httpx.Timeout(60.0, read=120.0)) as client:
        async with client.stream("POST", endpoint, json=payload, headers=headers) as response:
            response.raise_for_status()
            buffer = ""  # holds the trailing, not-yet-complete line
            async for chunk in response.aiter_bytes():
                buffer += decoder.decode(chunk)
                # Process every complete line currently in the buffer
                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.rstrip("\r")  # tolerate CRLF line endings
                    if line.startswith("data: "):
                        data = line[6:]
                        if data == "[DONE]":
                            return  # explicit end-of-stream sentinel
                        if data:
                            yield data
                    elif line.strip():
                        # Not a data line (e.g. "event:" or a comment): log it and keep going
                        logger.warning("Unexpected line: %s", line)
            # Stream closed without [DONE]: flush a final line that had no trailing newline
            buffer += decoder.decode(b"", final=True)
            tail = buffer.rstrip("\r")
            if tail.startswith("data: ") and tail[6:] not in ("", "[DONE]"):
                yield tail[6:]
```
Key changes:
- I switched from `aiter_lines()` to `aiter_bytes()` and manage the line buffer manually. This keeps line splitting in my own code, so an incomplete line waits in the buffer until the rest of it arrives.
- I look for the `[DONE]` sentinel (specific to OpenAI-like APIs) to know exactly when to stop.
- I keep the connection open longer than I think is necessary.
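To see why the buffering matters, here is a small network-free sketch of the same line assembly, fed with a stream that is cut in awkward places. The function name `iter_sse_data` and the sample payload are made up for illustration; the real payloads would be JSON.

```python
import codecs
from typing import Iterable, Iterator


def iter_sse_data(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each "data: " line, however the bytes were split."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.rstrip("\r")
            if not line.startswith("data: "):
                continue  # blank separators and other fields are skipped in this sketch
            data = line[6:]
            if data == "[DONE]":
                return
            if data:
                yield data


# The same stream, cut mid-line and mid-character ("é" is two bytes in UTF-8).
wire = "data: Par\n\ndata: is café\n\ndata: [DONE]\n\n".encode("utf-8")
cut = wire.index("é".encode("utf-8")) + 1
chunks = [wire[:7], wire[7:cut], wire[cut:]]
print(list(iter_sse_data(chunks)))  # ['Par', 'is café']
```

Neither the split line nor the split character is lost, because nothing is emitted until a full line has arrived.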
The Hardest Lesson: Backpressure
Once the stream worked reliably, I hit a new problem: the client couldn't keep up. The browser's EventSource handler would get overwhelmed when chunks came faster than the UI could re-render. My server was sending 50 chunks per second; the frontend was updating a React state variable 50 times per second. The browser tab froze.
My solution: throttle on the server side. I added a small delay between chunks if the client's buffer was growing. This required a different approach—acknowledgment from the client—but that's a story for another day.
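The acknowledgment scheme is not shown here. As a simpler stand-in, and not the approach described above, the server can coalesce chunks so the client receives fewer, larger events. This is a minimal sketch; `coalesce` and its `interval` parameter are hypothetical names.

```python
import time
from typing import AsyncIterator, List


async def coalesce(source: AsyncIterator[str], interval: float = 0.1) -> AsyncIterator[str]:
    """Merge chunks so that at most about one event is emitted per `interval` seconds."""
    pending: List[str] = []
    last_flush = time.monotonic()
    async for chunk in source:
        pending.append(chunk)
        now = time.monotonic()
        if now - last_flush >= interval:
            yield "".join(pending)
            pending.clear()
            last_flush = now
    if pending:
        yield "".join(pending)  # flush whatever is left when the source ends
```

With `interval=0.1` the frontend sees roughly 10 updates per second instead of 50. The trade-off is that text only flushes when the next chunk arrives, so a slow upstream can hold the last few words a little longer than the interval.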
Trade-offs and Alternatives
This approach works for most modern AI APIs (OpenAI, Anthropic, and the interwestinfo endpoint I tested). But there are trade-offs:
- Memory: The buffer can grow if a chunk is huge. I limit the line size to 10KB.
- Latency: Waiting for `\n` adds a tiny delay. For real-time speech, you need character-level streaming.
- Complexity: Manual byte-parsing is fragile. One vendor uses `"data:\n"` without a space; another uses `"event: data\ndata: {...}\n\n"`. You end up writing vendor-specific parsers.
If you're working with a single API, their SDK often handles this. The problem is when you want to support multiple providers—then you need a unified streaming abstraction.
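Some of the vendor differences in the Complexity point disappear if lines are parsed by the general SSE rules (field name, colon, optional single space, value) instead of matching the literal prefix "data: ". The sketch below also applies the 10KB line limit from the Memory point. `parse_sse_line`, `LineTooLong`, and `MAX_LINE_BYTES` are hypothetical names.

```python
from typing import Optional, Tuple

MAX_LINE_BYTES = 10 * 1024  # the 10KB line limit mentioned in the trade-offs


class LineTooLong(Exception):
    """Raised when a single line exceeds MAX_LINE_BYTES."""


def parse_sse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one SSE line into (field, value); return None for blanks and comments."""
    if len(line.encode("utf-8")) > MAX_LINE_BYTES:
        raise LineTooLong(f"line exceeds {MAX_LINE_BYTES} bytes")
    if not line or line.startswith(":"):
        return None  # blank line (event separator) or comment / keep-alive
    field, sep, value = line.partition(":")
    if sep and value.startswith(" "):
        value = value[1:]  # the SSE format strips exactly one leading space
    return field, value
```

Both `"data: {...}"` and `"data:{...}"` come back as `("data", "{...}")`, and an `event:` line becomes an ordinary field instead of an "unexpected line". A real memory cap would also need to check the unterminated buffer, since an oversized line never reaches this function until its newline arrives.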
What I'd Do Differently
- Read the streaming spec upfront. Don't assume `\n\n` means the end. Look for the explicit termination event.
- Use a library for SSE parsing. The Python `sse-starlette` package handles most of this for server-side generation. For the client side, `EventSource` is good but has limitations.
- Instrument everything. Before fixing, I added metrics: chunk count, per-chunk delay, total bytes. Data beats guessing.
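The three metrics from the last point can be collected with a thin pass-through wrapper around any chunk stream. This is a sketch of one way to do it, not the instrumentation actually used; `StreamStats` and `measured` are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from typing import AsyncIterator, List


@dataclass
class StreamStats:
    chunks: int = 0
    total_bytes: int = 0
    gaps: List[float] = field(default_factory=list)  # seconds between consecutive chunks


async def measured(source: AsyncIterator[str], stats: StreamStats) -> AsyncIterator[str]:
    """Pass chunks through unchanged while recording count, size, and timing."""
    previous = time.monotonic()
    async for chunk in source:
        now = time.monotonic()
        stats.chunks += 1
        stats.total_bytes += len(chunk.encode("utf-8"))
        stats.gaps.append(now - previous)
        previous = now
        yield chunk
```

The first entry in `gaps` is the time to the first chunk. Later entries include however long the consumer took to handle the previous chunk, so a slow client shows up in the numbers too.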
Final Thoughts
Streaming AI responses sounds like a solved problem, but the devil is in the chunk boundaries. Every streaming protocol I've seen has its own quirks—extra newlines, unexpected bytes, or missing sentinels. The only universal truth is: never trust a stream to end gracefully.
Now I'm curious—what's your weirdest streaming bug? Did you ever get a response that ended with half a word?