zhongqiyue

Posted on Jun 4

Debugging AI Streaming: A Tale of Chunks and Timeouts

#ai #python #webdev #api

I spent three weeks building what I thought would be a simple AI chat interface. You know the drill: user types a question, AI streams back a response word by word. I had the API key, I had the SDK, I had the confidence. Two days in, I was drowning in half-finished sentences and dead connections.

Let me walk you through the mess I made—and how I finally got streaming to work reliably.

The Problem: Chunks That Never End

My app was straightforward: a Python FastAPI backend that called an AI API (I was using a service from interwestinfo at the time) and streamed the response back to a React frontend via Server-Sent Events. The first version worked great for short queries. But as soon as users asked multi-paragraph questions or tried to generate code, responses would cut off mid-word. The frontend would show "The capital of France is Par" and then hang.

I checked my logs: no errors. The server just stopped sending chunks. The AI API had returned a complete response on its end—my code had swallowed the last few bytes.

What I Tried First (and Failed)

Longer Timeouts

First instinct: "The API is slow, just increase the timeout." I bumped the read timeout from 30 seconds to 120 seconds. It didn't help. The problem wasn't that the API took too long—it was that my code thought the stream had ended when it hadn't.

Retry Logic

I wrapped the request in a retry loop with exponential backoff. Now instead of missing chunks, I got duplicate chunks. Users saw "ParParis is the capital" because the second request re-sent the first chunk.
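A minimal sketch of why the duplicates appear and one way to avoid them: retrying is only safe while nothing has reached the caller yet. The names here (stream_with_safe_retry, open_stream) are hypothetical, and ConnectionError stands in for whatever transport error your HTTP client raises (with httpx that would likely be httpx.TransportError).

```python
import asyncio
from typing import AsyncIterator, Callable


async def stream_with_safe_retry(
    open_stream: Callable[[], AsyncIterator[str]],  # hypothetical factory: starts a fresh request
    max_attempts: int = 3,
    base_delay: float = 0.5,
) -> AsyncIterator[str]:
    """Retry a streaming request, but only if no chunk has been emitted yet."""
    for attempt in range(max_attempts):
        emitted = False
        try:
            async for chunk in open_stream():
                emitted = True
                yield chunk
            return  # stream finished normally
        except ConnectionError:
            # A new request restarts from the first chunk, so retrying after
            # something was already emitted would replay it to the user.
            if emitted or attempt == max_attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff
```

If the connection drops mid-stream, this version surfaces the error instead of silently restarting, which leaves it to the caller to decide whether to discard the partial text and start over.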

Manual Buffering

I tried collecting all chunks in a buffer and only flushing after a delay. This turned streaming into batching—the whole point of streaming (low latency) was gone.

What Actually Worked: Proper Chunk Assembly

The root cause was subtle: the AI API I was using (and many others) doesn't send chunks as pure JSON. It sends one JSON object per line, wrapped in the Server-Sent Events format, where each payload line starts with "data: ". My parser was reading line by line but closed the connection as soon as a certain number of lines had arrived, assuming the stream was complete. The fix was to wait for the actual stream termination signal instead of inferring the end from a line count or an empty line.

I'll show you the code that finally worked:

import codecs
import logging
from typing import AsyncGenerator

import httpx

logger = logging.getLogger(__name__)

async def stream_ai_response(prompt: str, api_key: str, endpoint: str) -> AsyncGenerator[str, None]:
    headers = {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
        "Accept": "text/event-stream"
    }
    payload = {
        "model": "gpt-4",  # or whatever model
        "messages": [{"role": "user", "content": prompt}],
        "stream": True
    }
    # Decode incrementally: a multi-byte UTF-8 character can be split across chunks
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")

    async with httpx.AsyncClient(timeout=httpx.Timeout(60.0, read=120.0)) as client:
        async with client.stream("POST", endpoint, json=payload, headers=headers) as response:
            response.raise_for_status()
            # Buffer for incomplete lines
            buffer = ""
            async for chunk in response.aiter_bytes():
                buffer += decoder.decode(chunk)
                # Process complete lines from buffer
                while "\n" in buffer:
                    line, buffer = buffer.split("\n", 1)
                    line = line.rstrip("\r")  # tolerate CRLF line endings
                    if line.startswith("data:"):
                        # The space after "data:" is optional
                        data = line[6:] if line.startswith("data: ") else line[5:]
                        if data == "[DONE]":
                            return  # Explicit end-of-stream sentinel
                        if not data:
                            continue
                        yield data
                    elif line.strip() == "":
                        continue  # Blank line: event separator
                    else:
                        # Not a data line (e.g. "event:" or an error body)
                        # Log it and keep going
                        logger.warning("Unexpected line: %r", line)
            # Stream closed without [DONE]: flush a final line that lacks its newline
            buffer += decoder.decode(b"", final=True)
            tail = buffer.strip()
            if tail.startswith("data:"):
                data = tail[5:].strip()
                if data and data != "[DONE]":
                    yield data

Key changes:

I switched from aiter_lines() to aiter_bytes() and manage the line buffer manually. That keeps line assembly under my control: a line split across two network chunks stays in the buffer until its newline arrives.
I look for the [DONE] sentinel (specific to OpenAI-like APIs) to know exactly when to stop.
I keep the connection open until the sentinel arrives or the server closes the stream, with a read timeout (120 seconds) longer than I think is necessary.
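To make the chunk-boundary issue concrete, here is a synchronous sketch of the same line assembly, fed hand-made chunks. The function name iter_sse_data and the sample bytes are made up for illustration; real chunk boundaries depend on the network and can fall anywhere, including in the middle of the [DONE] sentinel.

```python
import codecs
from typing import Iterable, Iterator


def iter_sse_data(chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield the payload of each "data:" line, stopping at the [DONE] sentinel."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    buffer = ""
    for chunk in chunks:
        buffer += decoder.decode(chunk)
        # Only complete lines leave the buffer; a partial line waits for more bytes.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            line = line.rstrip("\r")
            if not line.startswith("data:"):
                continue  # blank separators and other fields are skipped here
            data = line[5:]
            if data.startswith(" "):
                data = data[1:]
            if data == "[DONE]":
                return
            if data:
                yield data


# The first payload and the sentinel are both split across chunk boundaries.
chunks = [b'data: {"text": "Par', b'is"}\n\ndata: [DO', b'NE]\n\n']
print(list(iter_sse_data(chunks)))  # ['{"text": "Paris"}']
```

A parser that treated each chunk as a complete message would have emitted "Par" and stopped, which is exactly the symptom from the start of this post.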

The Hardest Lesson: Backpressure

Once the stream worked reliably, I hit a new problem: the client couldn't keep up. The browser's EventSource handler fired faster than the UI could re-render. My server was sending 50 chunks per second, so the frontend was updating a React state variable 50 times per second. The browser tab froze.

My solution: throttle on the server side. I added a small delay between chunks if the client's buffer was growing. This required a different approach—acknowledgment from the client—but that's a story for another day.
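The acknowledgment-based version isn't shown here. As a simpler stand-in, the sketch below caps the update rate by coalescing chunks on the server so the client receives fewer, larger messages. The name coalesce and the 50 ms default are assumptions for illustration, not the approach described above.

```python
import time
from typing import AsyncIterator, List


async def coalesce(source: AsyncIterator[str], min_interval: float = 0.05) -> AsyncIterator[str]:
    """Merge chunks so that at most one message is emitted per min_interval seconds."""
    pending: List[str] = []
    last_flush = time.monotonic()
    async for chunk in source:
        pending.append(chunk)
        now = time.monotonic()
        if now - last_flush >= min_interval:
            yield "".join(pending)
            pending.clear()
            last_flush = now
    # Emit whatever is left once the upstream stream ends.
    if pending:
        yield "".join(pending)
```

One limitation to keep in mind: a flush only happens when a new chunk arrives, so if the upstream pauses, buffered text waits until the next chunk or the end of the stream. A timer-based flush would fix that at the cost of more code.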

Trade-offs and Alternatives

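The buffering approach works for most modern AI APIs (OpenAI, Anthropic, and the interwestinfo endpoint I tested). The end-of-stream check is the part that varies: Anthropic's stream, for example, ends with a message_stop event rather than [DONE]. There are trade-offs too: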

Memory: The buffer can grow if a chunk is huge. I limit the line size to 10KB.
Latency: Waiting for \n adds a tiny delay. For real-time speech, you need character-level streaming.
Complexity: Manual byte-parsing is fragile. One vendor uses "data:\n" without a space; another uses "event: data\ndata: {...}\n\n". You end up writing vendor-specific parsers.
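Much of the vendor variation above goes away if each line is split into a field and a value the way the SSE format defines it: the space after the colon is optional, and lines starting with a colon are comments. A rough sketch follows, with a hypothetical parse_sse_line helper and a size cap that approximates the 10KB limit by counting characters rather than bytes.

```python
from typing import Optional, Tuple

MAX_LINE_CHARS = 10 * 1024  # assumption: approximates the 10KB cap, in characters


def parse_sse_line(line: str) -> Optional[Tuple[str, str]]:
    """Split one SSE line into (field, value); return None for blank lines and comments."""
    if len(line) > MAX_LINE_CHARS:
        raise ValueError("SSE line exceeds size limit")
    line = line.rstrip("\r\n")
    if not line or line.startswith(":"):
        return None
    # A line without a colon is a field name with an empty value.
    field, _, value = line.partition(":")
    if value.startswith(" "):
        value = value[1:]  # exactly one leading space is dropped
    return field, value


print(parse_sse_line("data: {}"))     # ('data', '{}')
print(parse_sse_line("data:{}"))      # ('data', '{}')
print(parse_sse_line("event: data"))  # ('event', 'data')
```

This only normalizes the line syntax. Which event names matter and what marks the end of a stream still differ per vendor, so a thin provider-specific layer on top is hard to avoid.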

If you're working with a single API, their SDK often handles this. The problem is when you want to support multiple providers—then you need a unified streaming abstraction.

What I'd Do Differently

Read the streaming spec upfront. Don't assume \n\n means the end: in SSE, a blank line only ends one event, not the stream. Look for the explicit termination event.
Use a library for SSE handling. The Python sse-starlette package handles most of this for server-side generation. For the client side, the browser's EventSource is good but limited: it only issues GET requests and cannot set custom headers.
Instrument everything. Before fixing, I added metrics: chunk count, per-chunk delay, total bytes. Data beats guessing.
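The metrics in the last point can be collected with a thin wrapper around any chunk stream. This is a sketch under assumptions: StreamStats and instrumented are hypothetical names, and the measured delay includes time the consumer spends between chunks, not just network wait.

```python
import time
from dataclasses import dataclass, field
from typing import AsyncIterator, List


@dataclass
class StreamStats:
    chunk_count: int = 0
    total_bytes: int = 0
    delays: List[float] = field(default_factory=list)  # seconds before each chunk arrived


async def instrumented(source: AsyncIterator[str], stats: StreamStats) -> AsyncIterator[str]:
    """Pass chunks through unchanged while recording count, size, and timing."""
    last = time.monotonic()
    async for chunk in source:
        now = time.monotonic()
        stats.delays.append(now - last)
        last = now
        stats.chunk_count += 1
        stats.total_bytes += len(chunk.encode("utf-8"))
        yield chunk
```

Comparing chunk_count and total_bytes between the upstream API and what the frontend received is a quick way to tell whether bytes are being lost in the parser or further down the line.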

Final Thoughts

Streaming AI responses sounds like a solved problem, but the devil is in the chunk boundaries. Every streaming protocol I've seen has its own quirks—extra newlines, unexpected bytes, or missing sentinels. The only universal truth is: never trust a stream to end gracefully.

Now I'm curious—what's your weirdest streaming bug? Did you ever get a response that ended with half a word?

DEV Community