TL;DR: AI agent sessions break in production in four specific ways — and none of them are model problems. This post covers the failure modes, why the standard Redis buffer workaround has sharp edges, and what a proper session layer actually needs to provide.
Take any AI agent demo from the last six months. It works. Now ship it to real users on real networks, real devices, real attention spans. A meaningful share of those users will never finish their first conversation cleanly. Not because the model gave a bad answer. Because the connection dropped, the tab refreshed, the phone took over from the laptop, or the spinner kept spinning forever.
We interviewed 38 companies building AI products at scale and evaluated 37 vendors across the AI infrastructure landscape. Almost everyone is hitting the same wall. None of the problems are model problems. And there isn't a layer in the stack today that solves them by default.
Here's what's actually breaking.
The four failure modes
1. Streams break and you lose the live state
HTTP streaming over SSE works fine in development. In production, every hop between server and user has its own timeout:
- AWS ALB kills idle connections after 60 seconds by default
- Cloudflare returns a 524 after ~100 seconds for proxied origins
- Istio/Envoy default to a 5-minute stream idle timeout
- Corporate proxies buffer un-chunked text/event-stream responses (no Content-Length)
- Mobile carriers rebind NAT entries on idle TCP flows
- Browsers throttle background tabs
When any of these fire, you can replay completed state from a buffer if you wrote one. The bit the user was actually watching — the live stream — is gone.
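For the idle-timeout cases specifically, the SSE format itself offers two small tools: comment lines (starting with a colon) that keep bytes flowing, and an id field that the browser echoes back as the Last-Event-ID header when it reconnects. Here is a minimal sketch of both. The helper names are hypothetical, and it assumes event ids are integer serials starting at 1. It does nothing for the refresh or device-switch cases.

```python
def sse_event(event_id: int, data: str) -> str:
    """Encode one SSE event, tagged with an id so a client can resume from it."""
    lines = [f"id: {event_id}"]
    # The SSE format needs one "data:" field per line of payload.
    lines += [f"data: {part}" for part in data.split("\n")]
    return "\n".join(lines) + "\n\n"


def sse_keepalive() -> str:
    """A comment line: ignored by clients, but it resets idle timers along the path."""
    return ": keepalive\n\n"


def parse_last_event_id(headers: dict[str, str]) -> int:
    """Return the last serial the client saw, or 0 if it has seen nothing (assumed convention)."""
    value = next((v for k, v in headers.items() if k.lower() == "last-event-id"), "")
    return int(value) if value.isdigit() else 0


frame = sse_event(3, "first line\nsecond line")
resume_from = parse_last_event_id({"Last-Event-ID": "3"})
```

Sending a keepalive more often than the shortest timeout on the path (60 seconds for a default ALB) keeps the connection from being classed as idle. Resuming still depends on the server having kept the events the client missed.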
2. Sessions belong to the browser tab, not the user
Almost every agent framework is point-to-point: one connection, one device. Switch from laptop to phone and the conversation doesn't follow. Refresh the tab and the live stream is gone.
This isn't a framework failure — it's an HTTP constraint. Vercel and TanStack have both shipped connection adapter interfaces specifically so a different transport can be plugged in. Of the 37 vendors we evaluated, 32 have no multi-device fan-out for AI sessions at all.
3. Users can't interrupt mid-stream
Once an agent starts generating, HTTP gives you no clean way to route a new instruction back to the running agent. The request is in flight. The response stream is one-directional.
We spoke to one of the largest customer support platforms in the world. They disabled all user input while the agent was responding — handling interruption reliably was technically too difficult with SSE. You've felt this: watching the agent go down the wrong path, unable to stop it.
Coding agents like Claude Code are the leading indicator. Once users get used to interrupting mid-stream, they'll expect it from every agent product they touch.
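To make the shape of the problem concrete, here is a rough sketch of what interruption needs on the agent side: a second channel (here an in-process queue standing in for whatever transport carries the instruction) that the generation loop checks between tokens. The function name and the token list are hypothetical stand-ins for a real model stream.

```python
import queue
from typing import Iterable, Iterator


def generate(tokens: Iterable[str], inbox: "queue.Queue[str]") -> Iterator[str]:
    """Yield tokens, stopping as soon as an instruction arrives in the inbox."""
    for token in tokens:
        try:
            instruction = inbox.get_nowait()  # non-blocking check between tokens
        except queue.Empty:
            instruction = None
        if instruction is not None:
            # A real agent would re-plan here; this sketch just stops.
            yield f"[interrupted: {instruction}]"
            return
        yield token


inbox: "queue.Queue[str]" = queue.Queue()
stream = generate(["Plan", " A", " in", " detail"], inbox)
first = next(stream)           # the agent starts responding
inbox.put("use plan B")        # the user interrupts mid-stream
rest = list(stream)            # generation stops at the next token boundary
```

The loop itself is easy. The hard part in production is getting the instruction from the user's device to the one process that owns this inbox, which plain request/response HTTP does not give you.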
4. Agents fail silently
From the client's perspective, an idle SSE connection looks identical to a dead one. When an agent crashes, stalls, or loses its connection, the client can't tell the difference between a thinking agent, a stalled agent, and a dead agent — three completely different states, indistinguishable on screen.
33 of the 37 vendors we evaluated have no agent health signal at the infrastructure level.
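One way to tell the three states apart is to have the agent publish a heartbeat separately from its output, then classify on the client or in the infrastructure. The sketch below assumes exactly that. The field names and both timeout values are illustrative assumptions, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class AgentSignals:
    last_heartbeat_at: float  # when the agent last reported it was alive (seconds)
    last_progress_at: float   # when it last emitted a token or tool result (seconds)


def classify(
    now: float,
    signals: AgentSignals,
    heartbeat_timeout: float = 10.0,   # assumed: heartbeats arrive every few seconds
    progress_timeout: float = 60.0,    # assumed: a minute of silence is too long
) -> str:
    """Return "dead", "stalled", or "thinking" from two timestamps."""
    if now - signals.last_heartbeat_at > heartbeat_timeout:
        return "dead"        # nothing at all from the agent process
    if now - signals.last_progress_at > progress_timeout:
        return "stalled"     # alive, but not getting anywhere
    return "thinking"        # alive and recently productive


state = classify(now=100.0, signals=AgentSignals(last_heartbeat_at=95.0, last_progress_at=30.0))
```

With only the token stream to go on, all three cases look like "no new bytes". The second signal is what makes them distinguishable.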
What most teams build instead
The pattern is consistent across teams. Engineers add a Redis buffer between agent and client for live stream replay on reconnect. They build polling or queueing so a new instruction can find the right running agent. They add fan-out for multi-device.
Vercel's lead maintainer put it plainly in a widely referenced GitHub issue: "to solve this we would need to have a channel to the server that allows transporting that information. WebSockets are one option."
A Pydantic AI user on Hacker News summed up the result as "a lot of glue."
Every serious production team independently arrives at the same conclusion: generation has to be decoupled from delivery. Most end up building their own version of the same architecture.
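The core of "decoupling generation from delivery" is small: the agent appends to a log and never waits on a client, while clients read from whatever serial they last saw. Below is a minimal in-memory sketch. The class and method names are hypothetical, and in the setups described above a Redis list or stream would take the place of the Python list.

```python
import threading


class StreamLog:
    """Append-only log of output chunks, addressed by serial (starting at 1)."""

    def __init__(self) -> None:
        self._entries: list[str] = []
        self._lock = threading.Lock()

    def append(self, chunk: str) -> int:
        """Called by the agent. Returns the serial assigned to this chunk."""
        with self._lock:
            self._entries.append(chunk)
            return len(self._entries)

    def read_after(self, serial: int) -> list[tuple[int, str]]:
        """Called by a client. Returns every chunk with a serial greater than `serial`."""
        start = max(serial, 0)
        with self._lock:
            return list(enumerate(self._entries[start:], start=start + 1))


log = StreamLog()
log.append("Hel")
log.append("lo")
seen = log.read_after(0)       # a client connects and catches up
log.append(" world")           # the agent keeps going while the client is offline
missed = log.read_after(2)     # the client reconnects from its last serial
```

This covers replay only. Routing instructions back to the agent and fanning out to several devices are separate pieces, which is where the glue accumulates.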
Why reconnection alone isn't enough
The instinct is to treat this as a streaming reliability problem — reconnects, timeouts, duplicate tokens. That's part of it, but only part.
The real category is what we call durable sessions: a persistent, addressable connection between agents and users that outlives any individual connection, device, or participant.
- Disconnect and reconnect → session is still there
- Switch devices → session follows you
- Agent crashes and respawns → session survives
This is different from durable execution, where a tool like Temporal makes the backend crash-proof. Durable sessions make the user experience crash-proof. Both matter, and each solves a different half of the problem.
In practice:
- Reconnection becomes reattachment — client requests from its last serial, gets everything missed in order
- Device switch is just another subscriber joining the same channel
- Multi-agent coordination is fan-in to a shared session
- Presence lets the agent know when no one's watching (pause expensive work)
- Organisation-side handover — a supervisor joins a live session on a different device, hours later, with full context
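Here is a toy, single-process sketch of three of the properties above: reattachment from a serial, a second device joining as just another subscriber, and presence as a check the agent can make. It is an illustration under stated assumptions, not Ably's API or anyone's production design. Every name in it is hypothetical, and it ignores threading, persistence, and authentication.

```python
from typing import Callable

Message = tuple[int, str]  # (serial, payload)


class Session:
    """A session that outlives any one subscriber."""

    def __init__(self) -> None:
        self._history: list[str] = []
        self._subscribers: dict[str, Callable[[Message], None]] = {}

    def publish(self, payload: str) -> int:
        """Record a message and fan it out to everyone currently attached."""
        self._history.append(payload)
        serial = len(self._history)
        for deliver in list(self._subscribers.values()):
            deliver((serial, payload))
        return serial

    def attach(self, client_id: str, deliver: Callable[[Message], None], last_serial: int = 0) -> None:
        """Replay everything after `last_serial` in order, then subscribe to live messages."""
        start = max(last_serial, 0)
        for serial, payload in enumerate(self._history[start:], start=start + 1):
            deliver((serial, payload))
        self._subscribers[client_id] = deliver

    def detach(self, client_id: str) -> None:
        self._subscribers.pop(client_id, None)

    def has_audience(self) -> bool:
        """Presence: lets the agent decide whether to pause expensive work."""
        return bool(self._subscribers)


session = Session()
laptop: list[Message] = []
phone: list[Message] = []

session.attach("laptop", laptop.append)
session.publish("one")
session.detach("laptop")               # the laptop connection drops
nobody_watching = not session.has_audience()
session.publish("two")                 # the agent keeps working regardless
session.attach("phone", phone.append)  # the same user picks up on another device
session.publish("three")
```

After this runs, the phone holds all three messages in order even though it joined late, and the laptop holds only the first. The session, not the connection, is the unit that persists.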
Where this is going
The frontier labs spend tens of millions of engineering dollars building this layer themselves. Everyone else either accepts the broken experience or burns engineering cycles rebuilding fragments of it.
The delivery problem is where the work is now. The model is fine. The session is what breaks.
Ably AI Transport is the session layer for this gap. The docs go deeper on the session model.
Which of these failure modes have you hit? Have you disabled user input during agent responses as a workaround?



