DEV Community

gauravdagde

Posted on • Originally published at preto.ai

Streaming SSE Proxying for LLM APIs: The Hard Parts

OpenAI streaming looks simple from the outside. Set stream: true, iterate the response, pipe it to the client. One afternoon of work.

Then you ship it. A client disconnects mid-generation and you eat 2,000 tokens nobody received. A slow mobile client causes your proxy's memory to climb. An OpenAI rate limit hits after you've already sent a 200. Here's what actually happens when you proxy SSE at scale — and the Go patterns that fix each failure mode.

TL;DR:

  • There are four production failure modes: chunk boundary corruption, token leaks on disconnect, unbounded buffering under backpressure, and mid-stream errors after a 200
  • Each has a clean fix in ~50 lines of Go
  • All four are running in production at Preto at 5,000+ streaming req/s, <50ms p95 overhead

How LLM SSE Actually Works

OpenAI's streaming response is plain HTTP with Content-Type: text/event-stream. The body is a sequence of newline-delimited frames:

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":"Hello"},...}]}

data: {"id":"chatcmpl-...","choices":[{"delta":{"content":" world"},...}]}

data: [DONE]

Each event is data: {JSON}\n\n. The stream ends with data: [DONE]\n\n. A proxy needs to read this upstream, optionally inspect frames, and write them downstream without adding latency. Here's what breaks.


Failure 1: Chunk Boundary Corruption

TCP delivers what it wants. A single SSE event may arrive split across multiple reads, or multiple events may arrive in one read.

If you need to inspect frames (extract token counts, inject headers, filter events), a naive fixed-buffer read corrupts the framing. Use bufio.Scanner with a custom SSE split function:

func proxySSE(upstream io.ReadCloser, w http.ResponseWriter, onEvent func([]byte)) {
    scanner := bufio.NewScanner(upstream)
    scanner.Split(scanSSEEvents)
    flusher := w.(http.Flusher)

    for scanner.Scan() {
        line := scanner.Bytes()
        w.Write(line)
        w.Write([]byte("\n\n"))
        flusher.Flush() // push each event immediately — don't buffer

        if onEvent != nil && bytes.HasPrefix(line, []byte("data: ")) {
            onEvent(line[6:]) // strip "data: " prefix
        }
    }
}

// Split on double-newline (SSE event boundary)
func scanSSEEvents(data []byte, atEOF bool) (advance int, token []byte, err error) {
    if atEOF && len(data) == 0 {
        return 0, nil, nil
    }
    if i := bytes.Index(data, []byte("\n\n")); i >= 0 {
        return i + 2, bytes.TrimRight(data[:i], "\n"), nil
    }
    if atEOF {
        return len(data), bytes.TrimRight(data, "\n"), nil
    }
    return 0, nil, nil
}

The flusher.Flush() call on every event is critical. Without it, Go's http.ResponseWriter buffers writes and the client gets batched chunks — defeating streaming entirely.


Failure 2: Token Leaks on Client Disconnect

This is the most expensive failure mode.

When a client closes the connection mid-stream — tab closed, app backgrounded, network timeout — a naive proxy keeps the upstream OpenAI request running. OpenAI finishes generating the full completion. You're billed for every token.

At 1,000 req/s with a 5% disconnect rate and 500 average output tokens: 25,000 wasted tokens per second — hundreds of dollars per day at GPT-4o-mini pricing, more at GPT-4o.

The fix is Go context propagation. When a client disconnects, Go's net/http server cancels r.Context(). Pass that context to the upstream call:

func (p *Proxy) handleStream(w http.ResponseWriter, r *http.Request) {
    // r.Context() is cancelled when the client disconnects
    ctx := r.Context()

    upstreamReq, _ := http.NewRequestWithContext(ctx, "POST",
        "https://api.openai.com/v1/chat/completions",
        r.Body,
    )
    upstreamReq.Header = r.Header.Clone()

    resp, err := p.client.Do(upstreamReq)
    if err != nil {
        if errors.Is(err, context.Canceled) {
            return // client disconnected — upstream cancelled, no leak
        }
        http.Error(w, "upstream error", 502)
        return
    }
    defer resp.Body.Close()

    // ... proxy the stream
}

The one-line change is http.NewRequestWithContext instead of http.NewRequest. When the context is cancelled, Go's HTTP client aborts the upstream TCP connection. OpenAI stops generating. You stop paying.


Failure 3: Backpressure and Unbounded Buffering

OpenAI streams at ~50–100 tokens/second. A fast client reads fine. A slow client — one processing each chunk before reading the next, or on a congested connection — falls behind.

Kernel socket buffers are bounded (typically ~64–256KB per connection). When they fill, Write() blocks, which blocks your upstream reader, which stalls the TCP window, which pauses OpenAI. The stream stalls but the connection stays open — holding resources indefinitely.

The dangerous alternative is an unbounded in-memory buffer. Under load with many slow clients, OOM.

Our fix: a bounded channel between reader and writer goroutines, with a timeout:

func (p *Proxy) streamWithBackpressure(ctx context.Context,
    upstream io.ReadCloser, w http.ResponseWriter) {

    eventCh := make(chan []byte, 64) // max 64 events buffered
    flusher := w.(http.Flusher)

    go func() {
        defer close(eventCh)
        defer upstream.Close() // abort the upstream read on any exit path
        scanner := bufio.NewScanner(upstream)
        scanner.Split(scanSSEEvents)
        for scanner.Scan() {
            select {
            case eventCh <- append([]byte{}, scanner.Bytes()...):
            case <-ctx.Done():
                return
            case <-time.After(5 * time.Second):
                return // client too slow — abort cleanly
            }
        }
    }()

    for event := range eventCh {
        w.Write(event)
        w.Write([]byte("\n\n"))
        flusher.Flush()
    }
}

If the downstream writer doesn't consume an event within 5 seconds, the reader exits, closes the channel (and the upstream body, which terminates the upstream connection), and the writer loop drains and returns cleanly. Client disconnects take the ctx.Done() path and get the same cleanup.


Failure 4: Mid-Stream Errors After HTTP 200

HTTP status codes are sent before the body. Once you've written 200 OK and started streaming, you cannot send a 429 or 503 if something goes wrong. This happens: OpenAI sends a 200 header, begins streaming, then hits an internal rate limit mid-generation. The stream truncates. The client sees success with an incomplete response — and typically retries, paying for partial tokens twice.

The fix: send an in-band error event before closing:

func writeSSEError(w http.ResponseWriter, code, message string) {
    flusher, ok := w.(http.Flusher)
    if !ok {
        return
    }
    // code and message are assumed to be JSON-safe literals; use
    // encoding/json instead of Fprintf for arbitrary strings.
    fmt.Fprintf(w,
        "data: {\"error\":{\"code\":\"%s\",\"message\":\"%s\"}}\n\n",
        code, message,
    )
    flusher.Flush()
}

// After your scanner loop:
if err := scanner.Err(); err != nil {
    if !errors.Is(err, context.Canceled) {
        writeSSEError(w, "stream_error", "upstream stream interrupted")
    }
}

Clients should check every data: payload for an error key, not just the HTTP status code. A truncated stream with an in-band error should retry differently from a clean completion.
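On the consuming side, that check is a small JSON probe per frame. A sketch matching the error shape above (the helper name is ours, not from any client SDK):

```go
package main

import "encoding/json"

// isErrorFrame reports whether an SSE "data:" payload carries an
// in-band error object rather than a normal delta chunk. Non-JSON
// payloads like "[DONE]" are not error frames.
func isErrorFrame(payload []byte) bool {
	var probe struct {
		Error *struct {
			Code    string `json:"code"`
			Message string `json:"message"`
		} `json:"error"`
	}
	return json.Unmarshal(payload, &probe) == nil && probe.Error != nil
}
```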


Bonus: Cost Tracking Without Blocking

You can't calculate output token cost until the stream ends. Without the final count, you'd need to build your own token counter — and you'd still be estimating mid-stream.

Solution: pass stream_options: {"include_usage": true} in the request body. OpenAI sends a final usage chunk before [DONE]:

data: {"choices":[],"usage":{"prompt_tokens":142,"completion_tokens":387}}

data: [DONE]

In your onEvent handler, watch for the usage field. When it appears, fire the log entry into your async channel (see: how we log at sub-50ms with ClickHouse) and pass the chunk through unmodified. Zero latency added.
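A sketch of the extraction half of that handler, assuming the usage shape shown above (the struct and helper names are ours; the async logging channel is left out):

```go
package main

import "encoding/json"

// Usage mirrors the fields of OpenAI's final usage chunk.
type Usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
}

// extractUsage returns the usage struct if this frame carries one.
// An onEvent handler would fire the result into an async logging
// channel and pass the frame through unmodified.
func extractUsage(payload []byte) (Usage, bool) {
	var chunk struct {
		Usage *Usage `json:"usage"`
	}
	if json.Unmarshal(payload, &chunk) != nil || chunk.Usage == nil {
		return Usage{}, false
	}
	return *chunk.Usage, true
}
```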


Summary

| Failure | Root Cause | Fix |
| --- | --- | --- |
| Chunk corruption | Fixed-buffer reads split events | bufio.Scanner with SSE split func |
| Token leak | Upstream outlives client | NewRequestWithContext(r.Context()) |
| Backpressure / OOM | Unbounded buffer | Bounded channel + write timeout |
| Mid-stream errors | Can't change 200 status | In-band error event before close |

All four are in production in Preto's Go proxy. If you want to see the full latency breakdown — proxy overhead vs. provider TTFT — on your own traffic, Preto is free up to 10K requests.


Building Preto.ai — LLM cost intelligence that runs on this proxy stack.
