Token paint with backpressure stops fast streams from crashing slow UIs
Partial tool_use blocks let you show "querying X..." 200ms into a call
AbortController cancels mid-stream so a stopped reply costs ~12% not 100%
Chunked persistence means a refresh resumes mid-thought instead of replaying
Last month I shipped an agent that streamed a 4,000-token answer into a React chat. The first version froze the tab. The second version dropped citations halfway through. The third version, the one I actually use now, runs five patterns I want to write down before I forget why each one exists.
This is not a tutorial. These are the streaming patterns I reach for when I build production Anthropic SDK apps, with the tradeoff each one solves. I am writing them down in the order I add them to a new project, because the order matters more than people admit. Skip the first one and the rest cannot save you.
The Two UX Patterns That Make Streaming Feel Alive
Streaming a response is easy. Streaming a response that feels like a thought is harder. Two patterns carry most of the weight.
Token-by-token paint with backpressure. The default streaming loop pushes every content_block_delta straight into state. That works until the model outpaces your renderer, which on a slow phone happens around 80 tokens per second. The fix is a small queue that flushes on requestAnimationFrame instead of on every event.
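let pending = ''
for await (const ev of stream) {
  // Only text deltas carry .text; tool input arrives as input_json_delta
  if (ev.type === 'content_block_delta' && ev.delta.type === 'text_delta') {
    pending += ev.delta.text
    // schedulePaint is expected to coalesce repeat calls into one callback per frame
    schedulePaint(() => { flush(pending); pending = '' })
  }
}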
That schedulePaint is the whole trick. One paint per frame, batched. The stream still arrives in real time, but the UI thread breathes. I dropped a Pixel 6a tab freeze from 4 seconds to zero by adding this single queue.
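A minimal sketch of what a helper in the spirit of schedulePaint could look like, assuming a browser with requestAnimationFrame. The names createPaintQueue, push, and drain are hypothetical, not from the SDK or from the snippet above. The scheduler is injectable so the queue can be exercised outside a browser.

```javascript
// Hypothetical helper: coalesces streamed text into at most one flush per frame.
// `schedule` defaults to requestAnimationFrame (browser only); pass your own in tests.
function createPaintQueue(flush, schedule = (cb) => requestAnimationFrame(cb)) {
  let pending = ''
  let scheduled = false

  function flushPending() {
    const batch = pending
    pending = ''
    if (batch) flush(batch)
  }

  return {
    // Called for every text delta; schedules a frame only if none is pending.
    push(text) {
      pending += text
      if (scheduled) return
      scheduled = true
      schedule(() => {
        scheduled = false
        flushPending()
      })
    },
    // Paints leftovers synchronously, e.g. when the stream ends or is aborted.
    drain: flushPending,
  }
}
```

With this shape, the stream loop only calls push(ev.delta.text), and the end-of-stream handler calls drain() so the last partial frame is not lost.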
Streaming tool calls. The richer move, the one that actually impresses people, is showing the partial tool_use block as it arrives. Claude streams the tool name first, in the content_block_start event, then the input JSON as a series of partial fragments (input_json_delta events). If you wait for the full block you waste 800ms of staring at a spinner. If you parse the partial, you can render "Claude is querying the calendar..." inside 200ms.
if (ev.type === 'content_block_start' && ev.content_block.type === 'tool_use') {
  showStatus(`Querying ${ev.content_block.name}...`)
}
I use this on every agent that touches an MCP server. The status line updates the second the tool name lands, before a single argument has streamed. The perceived speed delta is huge, even when the actual latency is identical. For more on the tool side of this, see the Claude MCP Servers Practical Guide.
One small refinement. When the input JSON starts streaming, I also surface the first parsed key. So instead of "Querying calendar..." the status sharpens into "Querying calendar for next week..." once the date range arrives. That extra word of context turns a spinner into a sentence, and sentences are what users trust.
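One way to sketch that refinement: accumulate the partial_json fragments from input_json_delta events into a buffer and pull out the first complete string field. The helpers firstStringField and toolStatus are hypothetical names, and the regex only handles a top-level object whose first value is a string, which is an assumption about the tool schema.

```javascript
// Returns the first complete "key": "string value" pair in a partial JSON
// buffer, or null if the first value has not finished streaming yet.
function firstStringField(partialJson) {
  const pair = /^\s*\{\s*"((?:[^"\\]|\\.)*)"\s*:\s*"((?:[^"\\]|\\.)*)"/.exec(partialJson)
  if (!pair) return null
  try {
    // Re-parse each captured body as a JSON string so escapes are decoded.
    return { key: JSON.parse(`"${pair[1]}"`), value: JSON.parse(`"${pair[2]}"`) }
  } catch {
    return null // malformed escape sequence; keep the generic status
  }
}

// Builds the status line, sharpening it once the first argument is readable.
function toolStatus(toolName, partialJson) {
  const field = firstStringField(partialJson)
  return field
    ? `Querying ${toolName} for ${field.value}...`
    : `Querying ${toolName}...`
}
```

In the stream loop you would append ev.delta.partial_json to a per-block buffer and call toolStatus(name, buffer) on each fragment. Until the first value closes its quote, the status stays generic.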
The Citation Pattern Nobody Talks About
When you turn on grounded responses with citations, the SDK streams citation blocks interleaved with text. The naive renderer dumps the text first and the footnote markers at the end, which produces a wall of unmarked prose followed by a bibliography that no one reads.
The better pattern is to render citations inline as they arrive, with placeholder superscripts that resolve when the citation block lands.
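// Citations arrive as a delta type on content_block_delta, not as a top-level event
if (ev.type === 'content_block_delta' && ev.delta.type === 'citations_delta') {
  const ref = registerCitation(ev.delta.citation)
  appendInline(`[${ref.id}]`)
}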
Two reasons this matters. First, reading comprehension stays high because the footnote sits next to its claim. Second, the user sees evidence accumulating, which builds trust during a long stream. I built this for a research tool and watched the average session length jump from 90 seconds to almost 4 minutes. People scroll back to verify citations only if the citations are findable.
The implementation detail that bit me is ordering. Citation deltas can arrive slightly before or after the text they refer to, depending on how the model emitted them. I keep a small queue keyed by character offset and resolve placeholders when both halves are present. Without that queue you get superscripts pointing at the wrong sentence, which is worse than no citations at all.
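A rough sketch of such a queue, under a loud assumption: the caller already knows the character offset in the rendered text where each marker belongs. How that offset is derived depends on your renderer and is not shown. CitationAnchors and its methods are hypothetical names, not SDK APIs.

```javascript
// Hypothetical sketch: holds citation markers until the text they anchor to exists.
class CitationAnchors {
  constructor() {
    this.text = ''
    this.pending = []  // citations whose offset is still beyond the text so far
    this.resolved = [] // { offset, id } pairs that are safe to render
  }

  addText(chunk) {
    this.text += chunk
    this._settle()
  }

  addCitation(offset, id) {
    this.pending.push({ offset, id })
    this._settle()
  }

  // Moves every citation whose anchor text has arrived from pending to resolved.
  _settle() {
    const stillWaiting = []
    for (const c of this.pending) {
      if (c.offset <= this.text.length) this.resolved.push(c)
      else stillWaiting.push(c)
    }
    this.pending = stillWaiting
    this.resolved.sort((a, b) => a.offset - b.offset)
  }

  // Returns the text with [id] markers spliced in at their offsets.
  render() {
    let out = ''
    let cursor = 0
    for (const { offset, id } of this.resolved) {
      out += this.text.slice(cursor, offset) + `[${id}]`
      cursor = offset
    }
    return out + this.text.slice(cursor)
  }
}
```

The point of the two lists is that a marker never renders until its text exists, whichever half arrives first.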
If you are wiring citations into a chat that also speaks, the same pattern works for ElevenLabs voice output. Queue the sentence, resolve the citation, then send the cleaned text to TTS. Voice users want the citation read after the claim, not dropped into the audio mid-sentence.
The Cost Pattern: Mid-Stream Cancellation
This is the pattern that pays for itself in a week.
Every production chat needs a stop button. Without proper cancellation, hitting stop just hides the UI while the model keeps generating, and you keep paying for tokens you will never show. On a 4,000-token reply at Sonnet pricing, that is real money over a month of bored users.
AbortController is the lever. The SDK accepts a signal, and on abort the upstream HTTP connection closes within the next chunk boundary.
const ac = new AbortController()
stopButton.onclick = () => ac.abort()
const stream = client.messages.stream({ ... }, { signal: ac.signal })
The catch is cleanup. When the signal fires, you need to flush the pending queue one last time, mark the message as canceled in your store, and persist what you have so far. I forgot the persist step the first time and ended up with half-painted messages that disappeared on refresh, which feels like a bug even when it is technically just an aborted stream.
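The cleanup order can be sketched like this. It assumes stream is any async iterable of text chunks (a stand-in for the SDK stream), and paint and persist are hypothetical callbacks for your UI and your store. Whether iteration ends quietly or throws on abort, the partial text is still persisted with a canceled status.

```javascript
// Hypothetical sketch: consume a stream, and on abort keep what already arrived.
async function consumeWithCancel(stream, signal, { paint, persist }) {
  let text = ''
  try {
    for await (const chunk of stream) {
      text += chunk
      paint(text)
    }
  } catch (err) {
    // Only swallow the error if it was caused by our own abort.
    if (!signal.aborted) throw err
  }
  const status = signal.aborted ? 'canceled' : 'complete'
  paint(text)                     // final flush of whatever is pending
  await persist({ text, status }) // survive a refresh, even when canceled
  return { text, status }
}
```

A caller would create the AbortController, hand ac.signal to both the SDK request and this function, and wire ac.abort() to the stop button.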
The numbers are clean. Across the agents I run, the average cancellation lands 12% of the way through the stream. So a hard-stop user costs me 12% of a full reply's output tokens instead of 100%; the input tokens are billed either way. Over thousands of sessions that adds up faster than any caching trick. If you want to stack savings, pair this with prompt caching, which I wrote about in the 1-hour caching April update.
One subtle thing. Do not abort inside the async iterator with a thrown error if you can avoid it, because some SDK versions swallow the partial queue. Abort via the signal, let the iterator exit cleanly, then flush.
The other thing worth doing is showing the user what you canceled. A faint label like "stopped after 247 tokens" reads as competent, not broken. Hiding the abort behind a blank message reads as a bug. Same engineering, opposite UX.
The Reliability Pattern (The One That Bit Me)
Replay-safe persistence. This is the pattern I learned the hard way after a mid-stream deploy nuked a user's 3,000-token answer that had been painting for 18 seconds.
The fix is to chunk the streamed response into durable storage every N tokens or every M milliseconds, whichever comes first. On reconnect, you replay the persisted chunks instantly, then resume the stream from the model's next token if the request is still alive, or simply present the partial as final if it is not.
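let queued = ''
// The SDK's MessageStream emits 'text' with each new text delta
stream.on('text', (delta) => {
  queued += delta
  if (queued.length > 200) flushChunk()
})
// Persist the tail too, or the last few characters never reach storage
stream.on('end', flushChunk)

function flushChunk() {
  if (!queued) return
  db.appendChunk(messageId, queued)
  queued = ''
}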
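The snippet above triggers on size. A sketch that covers both the size and the time threshold might look like the following, where createChunker, write, and the option names are hypothetical. Two simplifications to note: the time check runs only when new text arrives rather than on a timer, and the size limit counts characters, not tokens. Each chunk is tagged with its UTF-8 byte offset.

```javascript
const encoder = new TextEncoder()

// Hypothetical sketch: buffers text and writes a chunk when either limit is hit.
function createChunker(write, { maxChars = 200, maxMs = 500, now = Date.now } = {}) {
  let queued = ''
  let byteOffset = 0 // UTF-8 byte position of the next chunk within the message
  let lastFlush = now()

  function flush() {
    if (!queued) return
    write({ byteOffset, text: queued })
    byteOffset += encoder.encode(queued).length
    queued = ''
    lastFlush = now()
  }

  return {
    push(text) {
      queued += text
      // Whichever comes first: enough characters, or enough time since last write.
      if (queued.length >= maxChars || now() - lastFlush >= maxMs) flush()
    },
    flush, // call once more when the stream ends or is aborted
  }
}
```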
I write to IndexedDB on the client and Postgres on the server, mirrored. The client write is instant. The server write batches every 500ms. On refresh, the client reads its own log first, paints the message in one frame, and only then asks the server if the stream is still resumable.
What this kills is the "did it save?" anxiety. The user reloads, the answer is there, every token. They never know the connection broke. This is the same trust contract that good editors built decades ago, and it is the bar AI chat needs to clear before it stops feeling beta.
The thing to watch is double-writes. If both the client and server persist independently and you replay both, you can end up with the same token twice. I key chunks by (messageId, byteOffset) and dedupe on read. The byte offset matters more than a chunk index because retries on flaky networks can resend a chunk with a different sequence number but the same content range. Trust the offset, not the counter.
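Dedupe-on-read can be sketched in a few lines. This assumes each stored chunk has the shape { messageId, byteOffset, text } (a hypothetical record layout) and that two chunks with the same offset carry the same content range; overlapping chunks with different boundaries are not handled here.

```javascript
// Hypothetical sketch: merge client and server logs, keeping one chunk per offset.
function assembleMessage(chunks, messageId) {
  const byOffset = new Map()
  for (const chunk of chunks) {
    if (chunk.messageId !== messageId) continue
    // First write wins; a resent chunk at the same offset is ignored.
    if (!byOffset.has(chunk.byteOffset)) byOffset.set(chunk.byteOffset, chunk.text)
  }
  return [...byOffset.entries()]
    .sort((a, b) => a[0] - b[0]) // replay in stream order, not arrival order
    .map(([, text]) => text)
    .join('')
}
```

Because the key is the offset, a retry that arrives with a new sequence number but the same range collapses into the chunk already stored.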
For broader patterns on running SDK apps in production, my Claude Agent SDK production notes cover the surrounding architecture.
Bottom Line
Streaming is the difference between an AI app that feels alive and one that feels like a form submission. Five patterns get me there. Backpressure keeps slow phones from melting. Partial tool_use parsing makes agents feel intentional. Inline citations build trust. AbortController saves real money. Chunked persistence makes refreshes invisible.
None of these are clever. They are the boring infrastructure that lets the interesting parts of a Claude app feel finished. I add all five to any new agent on day one, before I worry about prompts, retrieval, or evals, because the UX cost of getting them wrong compounds faster than any model upgrade can fix.
If you want the full stack of patterns I use across SDK apps, MCP servers, prompt caching, and agent loops, I keep them collected on the Claude Blueprint page. Start with the two UX patterns. The rest will tell you when you need them.
This article contains affiliate links. If you sign up through them, I may earn a small commission at no extra cost to you. (Ad)