Gtio

I needed resumable LLM streams in Go — so I built streamhub

If you've built anything that streams LLM responses over SSE, you've probably hit this: the user refreshes the page, or their network blips, or the load balancer routes the reconnect to a different instance — and the stream is just gone. The generation keeps burning tokens on your backend, but the client sees nothing.

In the JS/TS world this is mostly solved: Vercel shipped resumable-stream, there's ai-resumable-stream, and Ably has a whole token streaming product. But if your backend is in Go? I couldn't find anything comparable.

I ran into this while working on a project where the LLM worker and the HTTP handler live in different processes. I needed something that:

  • persists chunks so reconnecting clients can replay what they missed
  • delivers cancel signals across instances (user clicks "stop" on one node, generation stops on another)
  • prevents duplicate producers (two requests racing to start the same session)

So I built streamhub.

How it works

Two Redis primitives, that's it:

  • Redis Streams store chunks. New subscribers read history first, then get live data.
  • Redis Pub/Sub carries cancel signals. Fast, fire-and-forget.

Each producer gets a generation ID that acts as a fencing token — if a stale producer tries to write after losing ownership, the writes are rejected.

What the code looks like

Producer side:

stream, created, err := hub.Register("chat:123", func() {
    // called when someone cancels this session
})
if err != nil {
    return // registration failed; log or surface err here
}
if !created {
    return // another instance already owns this
}
defer stream.Close()

stream.Publish("hello")
stream.Publish(" world")

Consumer side (can be a completely different process):

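// check once up front instead of risking a panic inside the loop
flusher, ok := w.(http.Flusher)
if !ok {
    http.Error(w, "streaming unsupported", http.StatusInternalServerError)
    return
}

stream := hub.Get("chat:123")
chunks, unsubscribe := stream.Subscribe(128)
defer unsubscribe()

for chunk := range chunks {
    // replays existing chunks first, then streams live
    fmt.Fprint(w, chunk)
    flusher.Flush()
}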
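
For intuition about the "replay first, then live" behavior, here is a minimal in-memory sketch of that handoff. It makes no claim about how streamhub does it on top of Redis Streams: the `broker` type is hypothetical, and dropping a subscriber whose buffer is full is just one possible policy chosen to keep the example short.

```go
package main

import (
	"fmt"
	"sync"
)

// broker is a hypothetical in-memory stand-in for one session's stream.
type broker struct {
	mu      sync.Mutex
	history []string            // every chunk published so far
	subs    map[int]chan string // live subscribers
	nextID  int
}

func newBroker() *broker { return &broker{subs: make(map[int]chan string)} }

// Publish records the chunk and fans it out to live subscribers.
func (b *broker) Publish(chunk string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.history = append(b.history, chunk)
	for id, ch := range b.subs {
		select {
		case ch <- chunk:
		default:
			// Buffer full: drop the slow subscriber so the producer never blocks.
			close(ch)
			delete(b.subs, id)
		}
	}
}

// Subscribe copies the history into the channel and registers the subscriber
// under the same lock, so no chunk can fall between replay and live delivery.
func (b *broker) Subscribe(buf int) (<-chan string, func()) {
	b.mu.Lock()
	defer b.mu.Unlock()
	ch := make(chan string, len(b.history)+buf)
	for _, c := range b.history {
		ch <- c
	}
	id := b.nextID
	b.nextID++
	b.subs[id] = ch
	return ch, func() {
		b.mu.Lock()
		defer b.mu.Unlock()
		if _, ok := b.subs[id]; ok {
			delete(b.subs, id)
			close(ch)
		}
	}
}

func main() {
	b := newBroker()
	b.Publish("hello")
	ch, unsubscribe := b.Subscribe(8) // joins after the first chunk
	b.Publish(" world")
	unsubscribe() // closes ch so the range loop below terminates
	for c := range ch {
		fmt.Print(c)
	}
	fmt.Println()
}
```

The detail that matters is that replay and registration happen atomically with respect to `Publish`. Doing them as two separate steps is where gaps and duplicates usually creep in.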

Cancel from anywhere:

hub.Get("chat:123").Cancel()

Why not just use X?

"Just use Redis Streams directly" — you can, but you'll end up reimplementing subscriber fan-out, replay-then-live handoff, generation fencing, and the cancel side-channel. That's what streamhub is.

"Use Centrifuge/Centrifugo" — great project, but it's a full real-time messaging framework. If all you need is to make your LLM streams durable, it's a lot of surface area.

"Use vercel/resumable-stream" — TypeScript only, tightly coupled to the Vercel AI SDK.

Status

Early days. The API surface might still change. If you're dealing with this same problem in Go, I'd appreciate feedback: github.com/gtoxlili/streamhub

Top comments (1)

Zak

Hey, Zak here. I work on the AI streaming product at Ably. Nice writeup and thanks for the callout! It's good to see LLM response streaming getting some proper attention rather than being treated as a solved problem. Looks like you've tackled a lot of the problems in streamhub already.

SSE is one of those transports that looks resumable on the surface (there's a Last-Event-ID header in the spec) but the AI project demos and examples almost universally skip over what actually needs to happen on the server for that header to mean anything. Most of the "here's how to stream from an LLM" tutorials stop at the point where tokens are sent to the browser, and quietly assume nothing ever goes wrong with the HTTP connection. The bits that get hard once you take resume seriously:

  • Ordering. Chunks have to arrive in the right sequence even across a reconnect that lands on a different instance.
  • Compaction. Streamed tokens eventually need to become a single coherent response. You don't want to store individual tokens forever, especially once generation has completed and you can just serve the full thing.
  • Resuming from a specific point. Not "replay everything," but "I last saw chunk 47, give me 48 onwards." Requires a cursor the client sends back and a server that honors it.
  • Fan-in as well as fan-out. When you've got multiple agents producing into the same session (a planner and a tool-caller, or parallel sub-agents) you're not just streaming one producer to many consumers, you're also merging many producers into one ordered stream. Interleaving, attribution and ordering across producers all become things you have to handle.

And then there's the fact that SSE is unidirectional. So anything that needs to flow back to the producer (like steering messages mid-generation, or cancellation and interrupts) has to be sent on an out-of-band channel. Which is fine until you need to figure out which instance is actually running the generation you want to cancel, across an autoscaling group. The routing problem is a real problem, and a lot of folks resort to writing cancellation messages to the datastore, and polling for them. These are the problems we're solving in the Ably product, so I'd love to show you the Go SDKs we're working on if you're up for it.