Ravi Patel

Posted on Jun 8 • Originally published at ssimplifi.com

The hidden cost of streaming LLMs: caches you can't use, bills you don't expect, and complexity you don't need

#llm #streaming #costoptimization #ux

Streaming is the default in modern LLM applications, mostly because the canonical OpenAI ChatGPT UX trained users to expect tokens appearing word-by-word. That visual feedback is real — perceived latency drops dramatically when the first token arrives in 200ms instead of waiting 2 seconds for the whole response. But the costs of streaming are systematically under-counted. Streaming defeats response caching on the way out, creates billing surprises when cancellations happen, complicates failover and observability, and is operationally messier than the buffered alternative for most workloads outside chat UIs. This post walks through the actual costs — they're not trivial — and the workloads where streaming is still worth it. Most production teams default to streaming reflexively; many shouldn't.

The parent guide LLM cost reduction covers the broader cost-reduction context; this article is the technique-specific argument for being more deliberate about when streaming makes sense.

What streaming actually does

A non-streaming LLM request sends the prompt, waits for the full response to generate, and returns the response as a single JSON object. Latency: the full time-to-last-token (typically 500-2000ms for a sentence-length response, up to tens of seconds for long-form output).

A streaming request sends the prompt, then receives the response in Server-Sent Events chunks as the model generates tokens. Each chunk contains a small slice of the response (typically 1-5 tokens). Latency to first token: usually 200-500ms. Latency to last token: the same as non-streaming (the model isn't generating faster — you're just receiving partial results sooner).

The UX win is real on workloads where users read the response as it arrives. Chat interfaces, code completion UIs, anything where the user is watching the screen while tokens land. The tradeoff is everything underneath.
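To make the two latency numbers concrete, here is a minimal sketch that simulates a model emitting tokens at a fixed pace and measures both. The `fake_model` generator and its delays are illustrative assumptions, not measurements from any provider.

```python
import time
from typing import Iterator, List, Tuple


def fake_model(tokens: List[str], first_token_delay: float,
               per_token_delay: float) -> Iterator[str]:
    """Hypothetical stand-in for a provider: yields tokens at a fixed pace."""
    time.sleep(first_token_delay)
    for i, tok in enumerate(tokens):
        if i > 0:
            time.sleep(per_token_delay)
        yield tok


def measure(stream: Iterator[str]) -> Tuple[float, float, str]:
    """Return (time_to_first_token, time_to_last_token, full_text)."""
    start = time.monotonic()
    first = None
    parts = []
    for tok in stream:
        if first is None:
            first = time.monotonic() - start
        parts.append(tok)
    total = time.monotonic() - start
    return (first if first is not None else total), total, "".join(parts)


ttft, total, text = measure(fake_model(["Hel", "lo", "!"], 0.02, 0.005))
print(f"first token after {ttft:.3f}s, last token after {total:.3f}s: {text}")
```

A buffered caller only ever observes `total`; a streaming caller can start rendering at `ttft`. The gap between the two is the entire UX benefit.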

Hidden cost #1 — Caching becomes structurally harder

Response-level caching (the 3-layer cache stack that catches 30-60% of LLM traffic on workloads where it applies) operates on complete responses. The cache stores (fingerprint, response) pairs; on a hit, it returns the stored response.

Streaming complicates this in two directions.

On the way in (cache lookup): the cache lookup happens before any model call. If the cache hits, the typical pattern is to serve the cached response as non-streaming JSON regardless of the request's stream=true flag. This works but breaks the visual expectation — the user-facing client expects an SSE stream and gets a JSON blob. Some clients handle this gracefully; many don't. Workarounds include "fake-streaming" the cached response (chunk it artificially into SSE events to match the expected format), which works but adds complexity.
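The fake-streaming workaround can be sketched in a few lines. The chunk payload below is a simplified version of the OpenAI-style streaming shape (real chunks also carry fields such as `id`, `model`, and `finish_reason`), and the `fake_stream` name and `chunk_chars` parameter are hypothetical.

```python
import json
from typing import Iterator


def fake_stream(cached_text: str, chunk_chars: int = 16) -> Iterator[str]:
    """Re-emit a cached completion as a sequence of SSE events."""
    for i in range(0, len(cached_text), chunk_chars):
        piece = cached_text[i:i + chunk_chars]
        # Simplified chunk shape; an assumption, not a full provider payload.
        payload = {"choices": [{"index": 0, "delta": {"content": piece}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    # OpenAI-style streams end with a literal [DONE] sentinel.
    yield "data: [DONE]\n\n"


for event in fake_stream("hello world", chunk_chars=4):
    print(event, end="")
```

Whether to add an artificial delay between events is a product decision; without one, the client receives every chunk at once and the "stream" is only a format shim.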

On the way out (cache store): the cache write happens after the model generates the response. For streaming requests this means buffering the entire stream before storing — you can't cache a partial response. Two failure modes:

If the client disconnects mid-stream (closed tab, network drop, application timeout), the cache write doesn't happen. Subsequent identical requests miss the cache that should have been populated.
The cache write adds a few milliseconds of latency at end-of-stream, which can affect SSE close timing on flaky clients.

Both issues are solvable but require careful engineering. The non-streaming alternative just works: complete response in, fingerprint, store, return. No edge cases.
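A minimal sketch of the buffer-then-store pattern, assuming an in-memory dict as the cache and a plain iterable of text chunks as the upstream. The function name and signature are illustrative.

```python
from typing import Dict, Iterable, Iterator


def stream_and_cache(key: str, upstream: Iterable[str],
                     cache: Dict[str, str]) -> Iterator[str]:
    """Relay chunks to the caller; write to the cache only if the stream completes."""
    buffer = []
    for chunk in upstream:
        buffer.append(chunk)
        # If the consumer disconnects here (the generator is closed),
        # or the upstream raises, the cache write below never runs.
        yield chunk
    cache[key] = "".join(buffer)


cache: Dict[str, str] = {}
print(list(stream_and_cache("fp-1", ["Hel", "lo"], cache)), cache)

abandoned: Dict[str, str] = {}
stream = stream_and_cache("fp-2", ["Hel", "lo"], abandoned)
next(stream)    # client reads one chunk...
stream.close()  # ...then disconnects
print(abandoned)  # nothing was cached
```

One way to close the missed-write gap is to keep draining the upstream after the client leaves, but that means paying for tokens nobody reads, which is its own cost.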

Hidden cost #2 — Billing surprises from cancellation

Provider billing on streaming is usually pay-for-what-you-generate. If the model generates 200 tokens before the client disconnects, you pay for 200 tokens — even though the client only saw 100 of them.

The math gets uncomfortable in three scenarios:

Scenario A: User navigates away mid-response. Common in chat UIs. The user asks a question, sees the response starting, decides they don't need it, and closes the tab. The model keeps generating until the gateway notices the disconnection and propagates the cancel, which takes ~200-500ms in typical setups. You pay for that 200-500ms of token generation (sometimes 50-150 tokens) even though the user never read them.

Scenario B: Application timeout under provider slowness. Application sets a 10-second timeout on a streaming request. Provider is slow today; first token arrives in 4 seconds, response is still being generated at the 10-second mark. Application disconnects. Provider keeps generating until disconnection propagates. You pay for tokens you never received.

Scenario C: Streaming with speculative routing or fan-out. Some patterns (Prism's speculative routing, OpenRouter's Fusion) fire multiple provider calls in parallel and take the first response. When the first response wins, the others are cancelled — but cancellation isn't instant. The losers keep generating for some milliseconds, and you pay for those wasted tokens. The Prism v1.5 speculative routing cost analysis puts this at ~1.3x token cost on average; the streaming version of the pattern adds ~10-20% on top of that because the cancellation propagates more slowly on SSE connections than on JSON request/response.

The non-streaming alternative is cleaner: you either get the full response or you get an error. No partial billing for tokens you didn't use.
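The overage in Scenario A is just cancel-propagation delay multiplied by generation speed. A rough sketch of the arithmetic follows; the 250 tokens/second decode rate and the $10 per million output tokens price are placeholder assumptions, so substitute your own provider's numbers.

```python
def wasted_tokens(cancel_delay_ms: float, tokens_per_second: float) -> float:
    """Tokens generated between client disconnect and provider-side cancel."""
    return cancel_delay_ms * tokens_per_second / 1000.0


def wasted_cost_usd(tokens: float, usd_per_million_output_tokens: float) -> float:
    """Cost of those tokens at a given output-token price."""
    return tokens * usd_per_million_output_tokens / 1_000_000


# Assumed decode rate: 250 tokens/second (hypothetical).
low = wasted_tokens(200, 250)   # 50.0 tokens
high = wasted_tokens(500, 250)  # 125.0 tokens
print(low, high, wasted_cost_usd(high, 10.0))
```

Per request the number is tiny; it only becomes visible when multiplied by the abandonment rate of a busy chat product.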

Hidden cost #3 — Failover and reliability complications

Provider failover is the discipline of automatically retrying a request against a different provider when the first one fails or times out. Non-streaming failover is straightforward: the request fails (5xx, timeout, connection drop), the gateway retries against an alternate provider, returns the second provider's response.

Streaming failover is operationally messy.

Mid-stream failover is essentially unworkable in practice. If a provider drops the connection mid-stream after returning the first 50 tokens, what do you do? Restart on a different provider — but the user has already seen those 50 tokens. The fresh stream from the new provider will repeat or contradict them. The cleanest answer is "fail the request and let the application retry," which the application probably wasn't expecting in the middle of an apparent-success stream.

Most production gateways skip mid-stream failover entirely and only failover on connection-establishment errors (initial 5xx, initial timeout). Streams that drop mid-flight propagate the error to the client, which has to handle it. This is correct behaviour but means streaming requests get less reliability cover than non-streaming requests get from the same gateway.
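The "failover only before the first chunk" rule can be sketched as a generator wrapper. This is an illustration of the policy described above, not any particular gateway's implementation; `MidStreamError` and the provider callables are hypothetical.

```python
from typing import Callable, Iterator, Optional, Sequence


class MidStreamError(RuntimeError):
    """Raised when a stream fails after some output was already delivered."""


def stream_with_failover(
    providers: Sequence[Callable[[], Iterator[str]]],
) -> Iterator[str]:
    last_error: Optional[Exception] = None
    for open_stream in providers:
        delivered = False
        try:
            for chunk in open_stream():
                delivered = True
                yield chunk
            return  # stream completed normally
        except Exception as exc:
            if delivered:
                # The caller has already seen tokens; retrying elsewhere
                # would repeat or contradict them, so surface the error.
                raise MidStreamError("provider failed mid-stream") from exc
            last_error = exc  # nothing sent yet, safe to try the next provider
    raise RuntimeError("all providers failed before the first chunk") from last_error


def down() -> Iterator[str]:
    raise ConnectionError("initial 5xx")


def healthy() -> Iterator[str]:
    return iter(["Hel", "lo"])


print(list(stream_with_failover([down, healthy])))
```

The `delivered` flag is the whole policy: once it flips, the request is outside failover cover and the client has to handle the error.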

Provider-health observation is also messier on streams. A stream that delivers slowly is still "succeeding" until the gateway times out or the stream completes. Distinguishing "slow provider" from "healthy provider with a long response" requires more careful instrumentation than non-streaming, where latency is just end - start.
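One possible instrumentation, sketched under the assumption that chunks arrive through a Python iterable: record time to first chunk and the longest wait on the upstream between chunks, instead of a single end minus start. The `StreamTimings` and `timed` names are illustrative, and the clock is injectable so the logic can be tested without sleeping.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional


@dataclass
class StreamTimings:
    ttft: Optional[float] = None  # seconds until the first chunk
    max_gap: float = 0.0          # longest wait for any later chunk
    chunks: int = 0


def timed(upstream: Iterable[str], timings: StreamTimings,
          clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Relay chunks while recording how long each one took to arrive."""
    it = iter(upstream)
    start = clock()
    while True:
        wait_start = clock()  # taken just before pulling, so consumer time is excluded
        try:
            chunk = next(it)
        except StopIteration:
            return
        now = clock()
        if timings.ttft is None:
            timings.ttft = now - start
        else:
            timings.max_gap = max(timings.max_gap, now - wait_start)
        timings.chunks += 1
        yield chunk


stats = StreamTimings()
print("".join(timed(["Hel", "lo"], stats)), stats)
```

A long response from a healthy provider shows many chunks with a small `max_gap`; a struggling provider shows a large `ttft` or a large `max_gap`. Alert thresholds are workload-specific and not something this sketch decides.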

Hidden cost #4 — Observability gets harder

Per-request observability — the kind that drives LLM observability decisions and FinOps attribution — depends on knowing what happened on each request. Non-streaming makes this easy: when the response returns, you have token counts, cost, latency, status, all in one place.

Streaming defers most of this to the final usage chunk of the stream (with stream_options.include_usage set on OpenAI; analogous configs on other providers). If the stream is interrupted before the final chunk arrives, you don't get the usage block, and you have to estimate token counts from the partial content received. The numbers in your dashboards drift from the numbers on your provider bill.

The mitigations are real but add complexity. Application-layer token counting at the chunk level. Reconciliation jobs that compare gateway-side estimates with provider-side billing. The non-streaming alternative just doesn't have this problem.
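A sketch of the fallback accounting, assuming OpenAI-style SSE lines where the final chunk carries a `usage` object when `stream_options.include_usage` is set. The four-characters-per-token estimate is a crude placeholder assumption; a real tokenizer would give closer numbers.

```python
import json
from typing import Iterable, Tuple


def account_tokens(sse_lines: Iterable[str]) -> Tuple[int, bool]:
    """Return (completion_tokens, is_exact) for a possibly truncated stream."""
    text_parts = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        data = line[len("data: "):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        usage = chunk.get("usage")
        if usage:
            # Provider-reported numbers: these match the bill.
            return usage["completion_tokens"], True
        for choice in chunk.get("choices", []):
            text_parts.append(choice.get("delta", {}).get("content") or "")
    # No usage chunk arrived: estimate from the text that was received.
    text = "".join(text_parts)
    estimate = max(1, len(text) // 4) if text else 0  # rough heuristic, an assumption
    return estimate, False


lines = [
    'data: {"choices":[{"delta":{"content":"Hello there"}}],"usage":null}',
    'data: {"choices":[],"usage":{"prompt_tokens":5,"completion_tokens":2,"total_tokens":7}}',
    "data: [DONE]",
]
print(account_tokens(lines))      # complete stream
print(account_tokens(lines[:1]))  # interrupted before the usage chunk
```

Storing the `is_exact` flag alongside the count makes the later reconciliation job simpler: only the estimated rows need to be compared against the provider bill.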

Hidden cost #5 — Client-side complexity is real

The often-skipped cost: every consumer of the streaming response needs to handle SSE parsing, partial-chunk JSON, mid-stream errors, connection-drop recovery, and the bookkeeping to assemble the partial chunks into the final response. Each of these is a few lines of code per language; collectively they're a non-trivial surface to maintain.

For first-party clients you control (your own web app, your own mobile app), this is fine: write it once, ship it. For third-party integrations (webhooks, customer code samples, SDK consumers), every additional consumer pays the streaming-complexity tax. SDKs that abstract this away help; SDKs that don't leave it as an exercise for the reader.

Non-streaming is a request/response pattern that every HTTP client understands. Streaming is a protocol overlay that every consumer has to implement correctly.
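For a sense of the surface area, here is a deliberately small sketch of what each streaming consumer has to do: reassemble events that were split across network reads, parse each payload, and treat a missing terminator as an error. It assumes OpenAI-style chunks and only handles `\n\n` event delimiters with single-line `data:` fields; the full SSE format allows more than that.

```python
import json
from typing import Iterable, Iterator


def iter_sse_data(byte_chunks: Iterable[bytes]) -> Iterator[str]:
    """Yield each event's data payload, tolerating events split across reads."""
    buffer = b""
    for chunk in byte_chunks:
        buffer += chunk
        while b"\n\n" in buffer:
            raw_event, buffer = buffer.split(b"\n\n", 1)
            # Decode only complete events, so split multi-byte characters are safe.
            for line in raw_event.decode("utf-8").splitlines():
                if line.startswith("data:"):
                    yield line[len("data:"):].strip()


def assemble(byte_chunks: Iterable[bytes]) -> str:
    """Join streamed deltas into the final text; fail if the stream was cut short."""
    parts = []
    for data in iter_sse_data(byte_chunks):
        if data == "[DONE]":
            return "".join(parts)
        for choice in json.loads(data).get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    raise ConnectionError("stream ended before [DONE]")


wire = (b'data: {"choices":[{"delta":{"content":"Hel"}}]}\n\n'
        b'data: {"choices":[{"delta":{"content":"lo"}}]}\n\n'
        b'data: [DONE]\n\n')
print(assemble([wire[:20], wire[20:70], wire[70:]]))
```

The non-streaming equivalent is a single JSON parse of the response body.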

When streaming is actually worth it

Three workload categories where the UX benefit outweighs the costs:

1. Interactive chat UIs with human users watching. The OpenAI ChatGPT pattern. First-token latency matters; users read as the response arrives. Worth streaming. The costs (cache complexity, cancellation billing, etc.) are accepted as the price of the UX.

2. Long-form content generation where the user is actively reading. Article generation, long-form summaries, multi-paragraph explanations where waiting for the full response would feel wrong. The "watching the model think" UX has value here.

3. Code completion / inline assistants. Cursor, GitHub Copilot, similar tools where partial tokens appear inline as the user types. First-token latency dominates the user experience; non-streaming would feel sluggish.

For these categories, the engineering effort to handle the hidden costs is worth it. The hidden costs are real but bounded; the UX benefit is also real and bounded.

When streaming probably isn't worth it

Five workload categories where streaming is the default but probably shouldn't be:

1. Backend integrations / webhooks. No human is watching. The downstream service is going to wait for the full response anyway before processing it. Streaming adds complexity for zero perceptible benefit. Use non-streaming.

2. Async pipelines (queue-driven, batch-driven). Same reason. The pipeline doesn't care about first-token latency; it cares about total throughput. Non-streaming is structurally simpler.

3. Structured-output workloads. JSON-mode requests, function-calling responses, anything where the consumer is going to parse the response as a whole. Partial JSON is unhelpful; you can't parse it until the closing brace arrives. Non-streaming is the right shape.

4. Evaluation runs / benchmarks / cron-scheduled work. No human watching, predictable patterns, often cacheable. Streaming makes caching harder for no UX benefit. Non-streaming is the right shape.

5. Mobile push notifications / SMS / email content generation. The end-user never sees the streaming; they see the final content delivered through a different channel. The streaming protocol is dead weight.

The pattern across all five: no human is watching the response stream as it lands. When there's no UX benefit, the costs are pure overhead.

What about "first-token latency matters for our agents"?

Common counter-argument: agent workloads need fast first-token to feel responsive. The honest answer is "sometimes yes, mostly no."

Agent workloads typically involve multiple LLM calls per user action (call 1: plan the agent step; call 2: execute via tool; call 3: synthesise result). The user experience is driven by the total time across all calls, not by the first-token latency of any individual call. Streaming each call delivers tokens that the downstream code parses and acts on; the user sees the result of the final agent action, not the intermediate tokens. Streaming intermediate calls adds complexity without affecting user-perceived speed.

The exception: the final LLM call in an agent flow, the one that produces the user-visible response, may benefit from streaming if that response is going straight to a chat UI. The earlier calls in the flow don't benefit and shouldn't stream.

The pattern: stream only the calls whose output goes directly to a human watching the response. Buffer everything else.
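As a sketch of that rule in an agent loop: the `complete(prompt, stream)` callable below is a hypothetical wrapper around whatever LLM client you use, returning an iterator of text chunks (a single chunk when `stream` is false). The three-step flow mirrors the plan, execute, synthesise example above.

```python
from typing import Callable, Iterator

# Hypothetical client wrapper: complete(prompt, stream) -> iterator of text chunks.
Complete = Callable[[str, bool], Iterator[str]]


def run_agent(question: str, complete: Complete) -> Iterator[str]:
    """Buffer the intermediate calls; stream only the user-visible answer."""
    # Intermediate calls: code consumes the whole output, so no streaming.
    plan = "".join(complete(f"Plan the steps for: {question}", False))
    tool_result = "".join(complete(f"Execute this plan: {plan}", False))
    # Final call: its output goes straight to the person watching.
    yield from complete(f"Answer the user using: {tool_result}", True)


def demo_complete(prompt: str, stream: bool) -> Iterator[str]:
    """Stand-in model for the example: echoes a fixed reply."""
    return iter(["All ", "done."]) if stream else iter(["ok"])


print("".join(run_agent("book a flight", demo_complete)))
```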

How Prism handles streaming

Prism supports streaming on the chat completions endpoint. The mechanics worth knowing:

Cache hits return as non-streaming JSON. When a request with stream=true hits the cache, Prism returns the cached response as a single JSON object, not as an SSE stream. Client code that expects SSE may need to handle the alternative shape. This is documented and is the right default.
Cache writes happen at end-of-stream. Streaming responses are buffered before storing. Partial streams (errored, disconnected) are never cached.
Failover happens only at connection establishment. Mid-stream failover is not attempted; an error mid-stream propagates to the client.
Speculative routing is disabled on streaming requests on sport mode. The fan-out complexity isn't worth the latency hedging benefit when the stream is already delivering tokens incrementally.
Token counts in the usage block come at the end of the stream (standard OpenAI behaviour). If a streaming request is interrupted, the final usage chunk may not arrive and the gateway-side accounting falls back to estimating from the received content.

The pattern Prism recommends: stream the workloads where humans are actively reading the response; use non-streaming everywhere else. The savings on the non-streaming slice from better caching, cleaner billing, and simpler failover are meaningful.

Decision framework

If you're evaluating whether to use streaming for a specific workload:

Is a human watching the response land in real time? Yes → consider streaming. No → don't stream.
Does the downstream consumer parse partial content? No (waits for full response before processing) → non-streaming is structurally simpler.
Is the workload cacheable? Yes + high cache hit rate → non-streaming preserves caching cleanly; streaming adds edge cases.
Does the workload involve multiple chained LLM calls? Yes → stream only the final user-facing call; buffer intermediate calls.
Is the workload reliability-critical? Yes → non-streaming has cleaner failover; mid-stream failures are harder to recover from.
Default to non-streaming, opt in to streaming. Reverse of the common pattern, but matches the cost/benefit better for most workloads.
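The checklist can be written down as a small function, which is a handy way to force the question at code-review time. This encoding is one interpretation: it treats the first, second, and fourth questions as hard gates and the cacheability and reliability questions as caveats, which is a simplification of the framework rather than part of it. All names are illustrative.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Workload:
    human_watching: bool                # is someone reading tokens as they land?
    consumer_uses_partial_output: bool  # does the consumer act on partial content?
    high_cache_hit_rate: bool
    user_facing_call: bool              # False for intermediate calls in a chain
    reliability_critical: bool


def should_stream(w: Workload) -> Tuple[bool, List[str]]:
    """Default to non-streaming; opt in only when the gates all pass."""
    if not w.human_watching:
        return False, ["no human is watching the response land"]
    if not w.user_facing_call:
        return False, ["intermediate call in a chain; buffer it"]
    if not w.consumer_uses_partial_output:
        return False, ["consumer waits for the full response anyway"]
    caveats = []
    if w.high_cache_hit_rate:
        caveats.append("streaming adds cache edge cases")
    if w.reliability_critical:
        caveats.append("mid-stream failures are harder to recover from")
    return True, caveats


chat_ui = Workload(True, True, False, True, False)
webhook = Workload(False, False, True, True, True)
print(should_stream(chat_ui), should_stream(webhook))
```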

The streaming-by-default reflex is a UX convention from ChatGPT that doesn't map to the operational realities of every workload. Be deliberate about which workloads actually benefit; default the rest to non-streaming.

Where to go next

For the broader cost-reduction context: LLM cost reduction playbook and the ranked top-5: LLM cost reduction techniques ranked by ROI.

For the caching layer that streaming complicates: AI API caching, exact vs semantic caching for LLMs.

For the routing + failover discipline that streaming makes harder: task-type routing, multi-provider failover, speculative routing.

For modelling streaming impact on your specific workload: savings calculator (toggle streaming-vs-buffered to compare).

FAQ

Doesn't every LLM application stream? Why is this a discussion?

ChatGPT trained the default. The OpenAI playground streams by default; tutorials stream by default; SDK examples stream by default. The reflex is to inherit that default without revisiting whether your specific workload benefits. For chat UIs the default is right; for the bulk of backend LLM workloads (data pipelines, async generation, structured output, evaluation runs) it isn't.

What's the latency penalty of switching to non-streaming?

Time-to-first-token goes from 200-500ms (streamed) to whatever the time-to-last-token is (500-2000ms typically). For non-watching workloads this is invisible — the consumer was going to wait for the full response anyway. For watching workloads it's the cost of switching, and it's a real cost (which is why those workloads keep streaming).

Can I opt into non-streaming on a per-request basis?

Yes, on every gateway and provider. Set stream: false in the request. The default tends to be false in most SDKs; streaming is opt-in via setting stream: true. The "streaming everywhere" pattern comes from explicit choice in application code, not from a default.

Does the OpenAI Batch API solve some of this?

For async workloads, yes — Batch API is non-streaming by design and gets a 50% discount. If your workload was already eligible for async processing, Batch + non-streaming is the right combination.

What about partial JSON parsing for structured outputs?

Partial JSON parsing is technically possible (libraries like ijson exist) but operationally fragile. Most production code waits for the closing brace before parsing. Streaming structured-output workloads optimises for a benefit no one consumes.
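A quick way to see the problem with the standard library: every proper prefix of a JSON object fails to parse, so a consumer using `json.loads` gets nothing usable until the last chunk arrives. The payload below is made up for the example.

```python
import json
from typing import Any, Iterable, Optional


def try_parse(buffer: str) -> Optional[Any]:
    """Return the parsed value, or None if the buffer is not yet valid JSON."""
    try:
        return json.loads(buffer)
    except json.JSONDecodeError:
        return None


def collect_structured(chunks: Iterable[str]) -> Any:
    """What most production code does: buffer everything, parse once."""
    return json.loads("".join(chunks))


full = '{"city": "Paris", "units": "metric"}'
unparseable = sum(try_parse(full[:i]) is None for i in range(len(full)))
print(f"{unparseable} of {len(full)} prefixes fail to parse")
print(collect_structured([full[:10], full[10:]]))
```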

Do streaming and prompt caching interact badly?

Mostly no. Both Anthropic and OpenAI support prompt caching on streaming responses. The cached-token counts (cache_read_input_tokens on Anthropic, cached_tokens on OpenAI) are reported in the stream's usage data. The interaction is fine; it just requires the consumer to actually read that usage data to record the savings.

Is there a "fake streaming" pattern for cached responses?

Yes — chunk the cached response into SSE events client-side, deliver them with a small inter-chunk delay to mimic streaming. Useful when the client expects streaming and changing the protocol is more expensive than faking it. Most gateways don't do this by default; Prism doesn't.

Does streaming actually cost more in raw provider billing?

Slightly, due to cancellation overhead: the loser tokens from disconnected or cancelled streams are billed. Empirically the overhead is small (1-5% on streaming-heavy workloads). The bigger costs are downstream: cache complexity, failover limitations, harder observability. The raw billing isn't the headline; the operational tax is.

The streaming-by-default reflex costs real money on workloads where the UX benefit doesn't exist. The LLM cost reduction playbook covers the broader discipline; the savings calculator models the workload-specific impact.

Top comments (1)

Josh Green • Jun 8

The caching point is the one that really bit me. I had an agent pipeline where I was regenerating the system prompt on every call and could not figure out why costs were higher than expected. Took way too long to realise that streaming meant none of the prefix caching was actually kicking in. For anything agent-based where the same system prompt repeats constantly, batching requests makes a huge difference.