We've all watched tokens stream in from an LLM, appearing one at a time as if the model were typing. If you used the Anthropic SDK's ....
The delta type check is the one that bites people building tool-call UIs: they filter for text_delta just fine, then add tool use, and suddenly the input_json_delta events are silently dropped. I saw this in production, where a UI worked for months until a prompt started routing to a tool.
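A minimal TypeScript sketch of a delta handler that doesn't silently drop tool-use deltas. It assumes events have already been parsed from SSE into objects; the types are a hand-written, partial model of Anthropic's content_block_delta events, and newBuffers/applyDelta are hypothetical names. Unknown delta types are recorded rather than ignored, so a new one surfaces instead of vanishing.

```typescript
// Hand-written, partial model of Anthropic streaming events (assumption:
// only the fields this sketch reads are declared).
interface StreamDelta {
  type: string;
  text?: string; // present on text_delta
  partial_json?: string; // present on input_json_delta
}

interface ContentBlockDeltaEvent {
  type: "content_block_delta";
  index: number; // which content block this delta belongs to
  delta: StreamDelta;
}

// Per-block accumulators, keyed by content block index.
interface BlockBuffers {
  text: Map<number, string>;
  toolJson: Map<number, string>;
  unhandled: string[]; // delta types seen but not handled
}

function newBuffers(): BlockBuffers {
  return { text: new Map(), toolJson: new Map(), unhandled: [] };
}

function applyDelta(buf: BlockBuffers, ev: ContentBlockDeltaEvent): void {
  const { index, delta } = ev;
  switch (delta.type) {
    case "text_delta":
      buf.text.set(index, (buf.text.get(index) ?? "") + (delta.text ?? ""));
      break;
    case "input_json_delta":
      // Fragments are not valid JSON on their own: concatenate them and
      // parse only after the block has finished streaming.
      buf.toolJson.set(index, (buf.toolJson.get(index) ?? "") + (delta.partial_json ?? ""));
      break;
    default:
      // Record instead of dropping, so a new delta type is visible.
      buf.unhandled.push(delta.type);
  }
}
```

Once the block finishes, JSON.parse(buf.toolJson.get(index)) gives the tool input; a non-empty unhandled list is a cheap signal that the UI is missing a case.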
The pop trick for incomplete SSE frames is clean. One extra pattern: a buffer that keeps growing past a threshold without ever hitting \n\n is usually a stalled connection, not a slow model. An explicit stall timeout beats waiting on the fetch signal.
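A sketch of the split-and-pop frame buffer with the size threshold and a stall check added. The class name, the 64 KiB limit and the 15-second stall window are illustrative assumptions, not values from any SDK. The caller is expected to poll isStalled() (for example on an interval) and abort the request itself when it returns true.

```typescript
// Hypothetical helper: splits a decoded text stream into SSE frames.
class SseFrameBuffer {
  private buffer = "";
  private lastFrameAt: number;
  private readonly maxBufferChars: number;
  private readonly stallMs: number;

  constructor(maxBufferChars = 64 * 1024, stallMs = 15_000, now = Date.now()) {
    this.maxBufferChars = maxBufferChars; // assumed threshold
    this.stallMs = stallMs; // assumed stall window
    this.lastFrameAt = now;
  }

  // Returns the complete frames in this chunk; keeps any partial frame.
  push(chunk: string, now = Date.now()): string[] {
    // Normalize CRLF on the whole buffer so a \r\n split across chunks is caught.
    this.buffer = (this.buffer + chunk).replace(/\r\n/g, "\n");
    const parts = this.buffer.split("\n\n");
    // The last element is "" or an incomplete frame: keep it for next time.
    this.buffer = parts.pop() ?? "";
    if (parts.length > 0) this.lastFrameAt = now;
    if (this.buffer.length > this.maxBufferChars) {
      throw new Error(`SSE buffer exceeded ${this.maxBufferChars} chars without a frame boundary`);
    }
    return parts.filter((p) => p.length > 0);
  }

  // True when no complete frame has arrived within the stall window.
  isStalled(now = Date.now()): boolean {
    return now - this.lastFrameAt > this.stallMs;
  }
}
```

Progress is measured in completed frames rather than bytes, so a connection that trickles bytes but never finishes a frame still counts as stalled. This only handles \n and \r\n line endings; the SSE spec also allows a lone \r.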
Are you planning to cover streaming with tool calls? The delta types diverge pretty sharply from pure text streaming.
Hi @mudassirworks,
Yes, I'll be covering that in the upcoming posts.
Great follow-up, Jasmin!
The GIFs make the streaming flow really easy to understand, especially how you handle the chunks and combine them properly.
Loved seeing the delta.content part explained clearly. Streaming definitely makes the UX feel much more responsive.
Looking forward to the tool calling and memory parts next. Keep these coming! 👏
Thanks again, @tahosin.
Glad you liked this one too; that's very motivating. No pressure, but I’ll make sure to keep the momentum going! 😁
Silent recovery on stream errors is the sneaky footgun: the output comes back wrong, but no exception is thrown.
So true, @itskondrat. The silence is the dangerous part. I started treating a missing finish_reason as a hard error instead of assuming that no exception means it worked. Truncation shouldn’t get to fail quietly.
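A sketch of that check for an OpenAI-style Chat Completions stream, assuming the chunks have already been parsed from SSE. ChatChunk is a hand-written partial type and collectStream is a hypothetical name. The stream is accepted only if some chunk carried a finish_reason.

```typescript
// Hand-written, partial model of a Chat Completions stream chunk.
interface ChatChunk {
  choices: Array<{
    delta: { content?: string | null };
    finish_reason: string | null;
  }>;
}

function collectStream(chunks: Iterable<ChatChunk>): { text: string; finishReason: string } {
  let text = "";
  let finishReason: string | null = null;
  for (const chunk of chunks) {
    const choice = chunk.choices[0];
    if (!choice) continue; // e.g. a usage-only chunk with an empty choices array
    text += choice.delta.content ?? "";
    if (choice.finish_reason) finishReason = choice.finish_reason;
  }
  if (finishReason === null) {
    // No transport error is not proof of success: the stream was cut off.
    throw new Error(`stream ended without finish_reason after ${text.length} chars`);
  }
  return { text, finishReason };
}
```

Returning finishReason lets the caller go one step further and treat "length" (the max-token cutoff) as truncation too, rather than rendering it as a complete answer.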
Streaming is one of those UX details that quietly makes or breaks an AI product: the same response feels 3x faster when it streams than when it lands all at once, even at identical total latency. The gotchas people hit are handling partial or aborted streams, backpressure, and rendering markdown/code incrementally without flicker. It's worth getting right once and reusing. I lean on streamed output in Moonshift so users watch the agent working in real time instead of a spinner; perceived progress is half the experience. Did you hit the partial-markdown-rendering problem, or keep it plain text?
Agreed, @harjjotsinghh.
Yeah, the perceived-speed thing is wild: same latency, totally different feel. A spinner just can't compete with watching tokens land.
I kept these GIFs plain text on purpose, to keep the focus on the streaming itself. But partial markdown is a tricky one for sure; flickering, half-done code blocks are a pain. How do you handle it in Moonshift: buffer until the block closes, or re-parse each chunk?
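One way to do "buffer until the block closes", sketched under simple assumptions: only backtick fences at the start of a line (after optional indentation) are recognized, while tilde fences, fences nested in lists, and a fence marker that is still half-streamed are ignored. splitRenderable is a hypothetical helper: render stable as markdown and show pending as plain text until its fence closes.

```typescript
// Built with repeat() so the literal can never close an enclosing fence.
const FENCE = "`".repeat(3);

// Splits streamed markdown into a prefix that is safe to render and a tail
// that starts at a still-open code fence.
function splitRenderable(text: string): { stable: string; pending: string } {
  let offset = 0;
  let openFenceAt = -1; // char offset of the open fence line, or -1 if none
  for (const line of text.split("\n")) {
    if (line.trimStart().startsWith(FENCE)) {
      // Toggle: a fence line either opens a block or closes the open one.
      openFenceAt = openFenceAt === -1 ? offset : -1;
    }
    offset += line.length + 1; // +1 for the "\n" removed by split
  }
  if (openFenceAt === -1) return { stable: text, pending: "" };
  return { stable: text.slice(0, openFenceAt), pending: text.slice(openFenceAt) };
}
```

Because the split is recomputed from the full text on every chunk, it needs no extra state; the cost is a linear scan per chunk, which is usually negligible next to rendering.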
Nice work! The GIF format actually makes the streaming concept click way faster than reading the docs. One thing I'd add: when you're switching between OpenAI and Anthropic, the streaming format differs (OpenAI sends chunks with the text in choices[0].delta.content, while Anthropic sends typed events such as content_block_delta, grouped into content blocks). It's a subtle difference that can break your UI if you're not handling both. Definitely worth a follow-up GIF comparing the two!
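A small normalizer for the text part of both formats, assuming each SSE data payload has already been JSON-parsed (OpenAI's final data: [DONE] sentinel is not JSON, so it has to be checked before parsing). The input union and extractText are hypothetical names, and only plain text deltas are handled; tool-call deltas also differ between the two APIs.

```typescript
// Hand-written, partial shapes of one parsed payload from each provider.
interface OpenAIChunk {
  choices?: Array<{ delta?: { content?: string | null } }>;
}
interface AnthropicEvent {
  type: string;
  delta?: { type?: string; text?: string };
}

type ProviderPayload =
  | { provider: "openai"; chunk: OpenAIChunk }
  | { provider: "anthropic"; event: AnthropicEvent };

// Returns the text carried by one payload, or "" if it carries none.
function extractText(p: ProviderPayload): string {
  if (p.provider === "openai") {
    return p.chunk.choices?.[0]?.delta?.content ?? "";
  }
  // Anthropic: only text_delta inside content_block_delta carries text;
  // message_start, content_block_start, ping, etc. carry none.
  if (p.event.type === "content_block_delta" && p.event.delta?.type === "text_delta") {
    return p.event.delta.text ?? "";
  }
  return "";
}
```

With this in place the UI only ever appends strings, and provider differences stay in one function.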
Thanks, Felix!
That's a great point; I'll definitely look into it. Your feedback really helps a lot! 🙂
Great article! I also just started my web dev journey as a beginner. Learning so much!
Thanks, @devdatta_gawali_28!
Glad you liked it.