OpenAI streaming looks simple from the outside. Set stream: true, iterate the response, pipe it to the client. One afternoon of work.
Then you ship it. A client disconnects mid-generation and you eat 2,000 tokens nobody received. A slow mobile client causes your proxy's memory to climb. An OpenAI rate limit hits after you've already sent a 200. Here's what actually happens when you proxy SSE at scale — and the Go patterns that fix each failure mode.
TL;DR:
- There are four production failure modes: chunk boundary corruption, token leaks on disconnect, unbounded buffering under backpressure, and mid-stream errors after a 200
- Each has a clean fix in ~50 lines of Go
- All four are running in production at Preto at 5,000+ streaming req/s, <50ms p95 overhead
How LLM SSE Actually Works
OpenAI's streaming response is plain HTTP with Content-Type: text/event-stream. The body is a sequence of newline-delimited frames:
```
data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Hello"},...}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":" world"},...}]}

data: [DONE]
```
Each event is data: {JSON}\n\n. The stream ends with data: [DONE]\n\n. A proxy needs to read this upstream, optionally inspect frames, and write them downstream without adding latency. Here's what breaks.
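To make the framing concrete, here's a minimal sketch of pulling the content delta out of a single data: line. The field names match the chat-completions chunks shown above; deltaContent is a hypothetical helper, not part of any SDK:

```go
package main

import (
	"encoding/json"
	"strings"
)

// deltaContent extracts choices[0].delta.content from one SSE line.
// It returns "" for [DONE], non-data lines, and frames without content.
func deltaContent(line string) string {
	payload, ok := strings.CutPrefix(line, "data: ")
	if !ok || payload == "[DONE]" {
		return ""
	}
	var frame struct {
		Choices []struct {
			Delta struct {
				Content string `json:"content"`
			} `json:"delta"`
		} `json:"choices"`
	}
	if err := json.Unmarshal([]byte(payload), &frame); err != nil || len(frame.Choices) == 0 {
		return ""
	}
	return frame.Choices[0].Delta.Content
}
```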
Failure 1: Chunk Boundary Corruption
TCP delivers what it wants. A single SSE event may arrive split across multiple reads, or multiple events may arrive in one read.
If you need to inspect frames (extract token counts, inject headers, filter events), a naive fixed-buffer read corrupts the framing. Use bufio.Scanner with a custom SSE split function:
```go
func proxySSE(upstream io.ReadCloser, w http.ResponseWriter, onEvent func([]byte)) {
	scanner := bufio.NewScanner(upstream)
	// raise the 64KB default token limit so large events (long tool-call
	// deltas) don't abort the stream
	scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
	scanner.Split(scanSSEEvents)
	flusher := w.(http.Flusher)
	for scanner.Scan() {
		line := scanner.Bytes()
		w.Write(line)
		w.Write([]byte("\n\n"))
		flusher.Flush() // push each event immediately — don't buffer
		if onEvent != nil && bytes.HasPrefix(line, []byte("data: ")) {
			onEvent(line[6:]) // strip "data: " prefix
		}
	}
}

// Split on double-newline (SSE event boundary)
func scanSSEEvents(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if atEOF && len(data) == 0 {
		return 0, nil, nil
	}
	if i := bytes.Index(data, []byte("\n\n")); i >= 0 {
		return i + 2, bytes.TrimRight(data[:i], "\n"), nil
	}
	if atEOF {
		return len(data), bytes.TrimRight(data, "\n"), nil
	}
	return 0, nil, nil
}
```
The flusher.Flush() call on every event is critical. Without it, Go's http.ResponseWriter buffers writes and the client gets batched chunks — defeating streaming entirely.
Failure 2: Token Leaks on Client Disconnect
This is the most expensive failure mode.
When a client closes the connection mid-stream — tab closed, app backgrounded, network timeout — a naive proxy keeps the upstream OpenAI request running. OpenAI finishes generating the full completion. You're billed for every token.
At 1,000 req/s with a 5% disconnect rate and 500 average output tokens: 25,000 wasted tokens per second — hundreds of dollars per day at GPT-4o-mini pricing, more at GPT-4o.
The fix is Go context propagation. When a client disconnects, Go's net/http server cancels r.Context(). Pass that context to the upstream call:
```go
func (p *Proxy) handleStream(w http.ResponseWriter, r *http.Request) {
	// r.Context() is cancelled when the client disconnects
	ctx := r.Context()
	upstreamReq, _ := http.NewRequestWithContext(ctx, "POST",
		"https://api.openai.com/v1/chat/completions",
		r.Body,
	)
	upstreamReq.Header = r.Header.Clone()

	resp, err := p.client.Do(upstreamReq)
	if err != nil {
		if errors.Is(err, context.Canceled) {
			return // client disconnected — upstream cancelled, no leak
		}
		http.Error(w, "upstream error", 502)
		return
	}
	defer resp.Body.Close()
	// ... proxy the stream
}
```
The key change is http.NewRequestWithContext instead of http.NewRequest. When the context is cancelled, Go's HTTP client aborts the upstream TCP connection. OpenAI stops generating. You stop paying.
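To see the mechanism in isolation, here's a self-contained sketch using an httptest server as a stand-in for api.openai.com: cancelling the request context makes client.Do return context.Canceled instead of waiting out the response.

```go
package main

import (
	"context"
	"errors"
	"net/http"
	"net/http/httptest"
	"time"
)

// upstreamAborted simulates a client disconnect: the fake upstream
// stalls for 2s, the request context is cancelled after 50ms, and
// the function reports whether Do failed with context.Canceled.
func upstreamAborted() bool {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case <-r.Context().Done(): // the server side sees the abort too
		case <-time.After(2 * time.Second):
		}
	}))
	defer srv.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	go func() {
		time.Sleep(50 * time.Millisecond)
		cancel() // stands in for the client closing its connection
	}()

	req, err := http.NewRequestWithContext(ctx, "GET", srv.URL, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err == nil {
		resp.Body.Close()
		return false
	}
	return errors.Is(err, context.Canceled)
}
```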
Failure 3: Backpressure and Unbounded Buffering
OpenAI streams at ~50–100 tokens/second. A fast client reads fine. A slow client — one processing each chunk before reading the next, or on a congested connection — falls behind.
The kernel's socket buffers are bounded (typically 64–256KB per connection). When they fill, Write() blocks, which blocks your upstream reader, which stalls the TCP window, which pauses OpenAI. The stream stalls but the connection stays open, holding goroutines and sockets indefinitely.
The dangerous alternative is an unbounded in-memory buffer. Under load with many slow clients, OOM.
Our fix: a bounded channel between reader and writer goroutines, with a timeout:
```go
func (p *Proxy) streamWithBackpressure(ctx context.Context,
	upstream io.ReadCloser, w http.ResponseWriter) {

	eventCh := make(chan []byte, 64) // max 64 events buffered
	flusher := w.(http.Flusher)

	go func() {
		defer close(eventCh)
		scanner := bufio.NewScanner(upstream)
		scanner.Split(scanSSEEvents)
		for scanner.Scan() {
			select {
			case eventCh <- append([]byte{}, scanner.Bytes()...):
			case <-ctx.Done():
				return
			case <-time.After(5 * time.Second):
				return // client too slow — abort cleanly
			}
		}
	}()

	for event := range eventCh {
		w.Write(event)
		w.Write([]byte("\n\n"))
		flusher.Flush()
	}
}
```
If the downstream writer doesn't consume an event within 5 seconds, the reader exits, closes the channel, and the writer loop exits cleanly. On client disconnect, context cancellation aborts the upstream request; on a slow-client timeout, the handler's deferred resp.Body.Close() does.
Failure 4: Mid-Stream Errors After HTTP 200
HTTP status codes are sent before the body. Once you've written 200 OK and started streaming, you cannot send a 429 or 503 if something goes wrong. This happens: OpenAI sends a 200 header, begins streaming, then hits an internal rate limit mid-generation. The stream truncates. The client sees success with an incomplete response — and typically retries, paying for partial tokens twice.
The fix: send an in-band error event before closing:
```go
func writeSSEError(w http.ResponseWriter, code, message string) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		return
	}
	// json.Marshal guards against quotes or newlines in the message
	// corrupting the frame
	payload, _ := json.Marshal(map[string]any{
		"error": map[string]string{"code": code, "message": message},
	})
	fmt.Fprintf(w, "data: %s\n\n", payload)
	flusher.Flush()
}

// After your scanner loop:
if err := scanner.Err(); err != nil {
	if !errors.Is(err, context.Canceled) {
		writeSSEError(w, "stream_error", "upstream stream interrupted")
	}
}
```
Clients should check every data: payload for an error key, not just the HTTP status code. A truncated stream with an in-band error should retry differently from a clean completion.
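On the client side, that check is a few lines. A sketch matching the error shape writeSSEError emits above (sseErrorCode is a hypothetical helper):

```go
package main

import "encoding/json"

// sseErrorCode reports whether one SSE data payload is an in-band
// error frame, and if so returns its code so the client can choose
// a retry strategy.
func sseErrorCode(payload []byte) (code string, isErr bool) {
	var frame struct {
		Error *struct {
			Code string `json:"code"`
		} `json:"error"`
	}
	if err := json.Unmarshal(payload, &frame); err != nil || frame.Error == nil {
		return "", false
	}
	return frame.Error.Code, true
}
```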
Bonus: Cost Tracking Without Blocking
You can't calculate output-token cost until the stream ends, and by default OpenAI's streaming responses omit usage entirely; without the final count you'd have to run your own token counter and still be estimating mid-stream.
Solution: pass stream_options: {"include_usage": true} in the request body. OpenAI sends a final usage chunk before [DONE]:
```
data: {"choices":[],"usage":{"prompt_tokens":142,"completion_tokens":387}}

data: [DONE]
```
In your onEvent handler, watch for the usage field. When it appears, fire the log entry into your async channel (see: how we log at sub-50ms with ClickHouse) and pass the chunk through unmodified. Zero latency added.
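A sketch of that usage extraction, assuming the frame shape shown above (usageFromEvent is our helper name, not an SDK function):

```go
package main

import "encoding/json"

// usageFromEvent returns the token counts when a frame carries the
// final usage block requested via stream_options.include_usage.
func usageFromEvent(payload []byte) (prompt, completion int, ok bool) {
	var frame struct {
		Usage *struct {
			PromptTokens     int `json:"prompt_tokens"`
			CompletionTokens int `json:"completion_tokens"`
		} `json:"usage"`
	}
	if err := json.Unmarshal(payload, &frame); err != nil || frame.Usage == nil {
		return 0, 0, false
	}
	return frame.Usage.PromptTokens, frame.Usage.CompletionTokens, true
}
```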
Summary
| Failure | Root Cause | Fix |
|---|---|---|
| Chunk corruption | Fixed-buffer reads split events | bufio.Scanner with SSE split func |
| Token leak | Upstream outlives client | NewRequestWithContext(r.Context()) |
| Backpressure / OOM | Unbounded buffer | Bounded channel + write timeout |
| Mid-stream errors | Can't change 200 status | In-band error event before close |
All four are in production in Preto's Go proxy. If you want to see the full latency breakdown — proxy overhead vs. provider TTFT — on your own traffic, Preto is free up to 10K requests.
Building Preto.ai — LLM cost intelligence that runs on this proxy stack.