WebSockets make agentic products feel dramatically better in the first demo. The agent streams earlier, tool calls look alive instead of stalled, and the whole system starts feeling less like “submit prompt, wait, poll, repeat” and more like a continuous loop.
That speedup is real. So is the complexity bill.
The minute you move agent loops onto persistent connections, you stop operating in a world where each interaction has a clean request boundary. State starts leaking into connection lifetime, retries stop being obvious, caches become harder to trust, and debugging turns from “what happened in this request?” into “what state was this workflow carrying when that event arrived?”
That is the real shape of agentic WebSocket tradeoffs: you gain responsiveness by giving up some explicitness.
For some products, that is absolutely the right deal. For others, teams are paying architectural rent they do not yet need. The mistake is not using WebSockets. The mistake is using them as if lower latency is a free upgrade instead of a state-model change.
The performance win is obvious because request boundaries are slow for agents
Classic request-response flows are fine for ordinary CRUD apps. They are awkward for agents because agents do not just answer. They plan, call tools, wait on tools, continue reasoning, stream partial output, and sometimes ask for human confirmation mid-flight.
In a stateless loop, every phase boundary creates friction:
- re-sending context
- re-authenticating and reloading session state
- polling for tool completion
- serializing partial progress into coarse API responses
- treating intermediate reasoning as repeated round trips
That overhead does not just waste milliseconds. It changes how interactive the product can feel.
Why agent loops benefit more than ordinary chat
Plain chat mostly benefits from token streaming. Agentic systems benefit from streaming and orchestration continuity.
A single agent turn can involve:
- user input arrives
- model decides to call a tool
- tool starts and reports progress
- tool finishes and returns data
- model continues from updated context
- agent emits partial answer
- user interrupts or steers the run
If each of those transitions has to cross a hard request boundary, the product feels mechanical. With a persistent socket, those boundaries soften. The loop stays warm.
That is why WebSockets feel so compelling in agent products: they do not merely accelerate text output. They reduce orchestration dead air.
The first speed trap
Because the first user-visible improvement is so strong, teams quickly start putting more responsibility into the live connection than it should carry.
That is usually where the trouble begins.
The hard part is not the socket. It is the hidden state model
A WebSocket by itself is not scary. The risky part is what teams start assuming once a connection stays open.
Request-response systems force explicitness. Each request has to carry what matters. That is sometimes inefficient, but it makes reasoning easier.
Persistent connections tempt teams to do the opposite. They let session state accumulate informally inside the live loop:
- pending tool decisions
- partial plans
- in-memory conversation deltas
- optimistic UI assumptions
- connection-scoped caches
- auth or capability state that quietly outlives its intended boundary
This is where the debugging model changes.
In a request-response system, you ask:
What input produced this response?
In a WebSocket-driven agent system, you start asking:
What sequence of socket events, workflow states, and in-flight mutations produced this moment?
That is a much harder question.
Request boundaries used to protect you
Teams often underestimate how much safety came from boring statelessness.
Hard request boundaries naturally encourage:
- explicit payloads
- simpler audit trails
- easier replay during debugging
- clearer auth checks
- stronger idempotency habits
- cleaner failure boundaries
When you move to persistent connections, none of that disappears automatically. It just stops being free.
If you do not rebuild those protections intentionally, the system will still work on the happy path but become slippery under load, reconnects, and multi-client usage.
Concurrency gets worse because the connection is not the workflow
This is the most important architectural distinction in the whole topic:
A connection is not a workflow.
The socket is only a transport channel. The workflow is the durable unit of meaning.
Teams that blur those two eventually get burned.
Why the single-user mental model breaks down
The intuitive picture is simple: one user opens one socket and one agent loop runs across it.
Real systems are not that clean.
You may have:
- the same user in multiple tabs
- the same conversation resumed from desktop and mobile
- a reconnect while tools are still running
- server-side retries racing with live client state
- multiple UI panels subscribed to the same workflow stream
Once that happens, the socket stops being a trustworthy identity anchor.
Failure modes that come from conflating transport with task state
When connection identity and workflow identity get mixed together, you start seeing bugs like:
- tool calls firing twice after reconnect
- final output arriving on one tab while another still thinks the run is in progress
- a cancellation event closing the stream but not actually stopping tool execution
- stale client state overwriting newer persisted workflow state
- duplicate “completion” handling because two listeners believed they owned the run
These are not exotic edge cases. They are normal outcomes once an interactive system has more than one consumer path.
Make workflow identity explicit
A safer event model separates the workflow from the transport immediately.
```json
{
  "workflow_id": "wf_812",
  "turn_id": "turn_19",
  "connection_id": "conn_44",
  "event_type": "tool_started",
  "sequence": 128,
  "state_version": 7
}
```
Now the connection is just where the event traveled. The workflow is the actual source of truth.
That distinction makes reconnect, duplication handling, and multi-tab rendering much easier to reason about.
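On the consumer side, that means de-duplication keys on workflow identity and sequence, never on the connection. A minimal sketch (the names here are illustrative, not a prescribed API):

```ts
// Illustrative sketch: route and de-duplicate incoming events by workflow
// identity and sequence number, never by the connection they arrived on.
type WorkflowEvent = {
  workflowId: string;
  connectionId: string; // transport detail only, never used as an identity anchor
  sequence: number;
  eventType: string;
  stateVersion: number;
};

// Last applied sequence per workflow, shared by every tab, panel, or reconnect.
const lastAppliedSequence = new Map<string, number>();

function shouldApply(event: WorkflowEvent): boolean {
  const last = lastAppliedSequence.get(event.workflowId) ?? -1;
  if (event.sequence <= last) {
    return false; // duplicate or stale delivery, e.g. a replay after reconnect
  }
  lastAppliedSequence.set(event.workflowId, event.sequence);
  return true;
}
```

The same check works whether the event arrived on the original socket, a reconnected one, or a second tab.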
Caching gets more fragile because live state and durable state diverge
Caching is already hard in distributed systems. Agentic WebSocket systems make it weirder because the product often mixes:
- persisted workflow state
- streaming partial output
- tool artifacts
- frontend store snapshots
- server-side caches for retrieval or planning context
In a request-response system, caches usually sit around stable request boundaries. In a live agent loop, state may be mutating continuously while clients are also caching earlier snapshots.
That means a cache can be structurally valid and temporally misleading.
The most common caching mistake in live agent UIs
A frontend stores “the latest known run state” locally and treats it as authoritative, even though the real workflow is still evolving through live events and background tool completions.
Then you get symptoms like:
- a restored tab that misses the last tool result
- a UI that thinks the workflow is complete because the token stream ended
- a cached transcript that does not include post-tool synthesis
- a resumed session that replays stale partial text as if it were final
This is not just a frontend bug. It is a mismatch between live stream semantics and durable workflow semantics.
Separate three kinds of state
A more stable model is to split state into layers:
Durable workflow state
The authoritative state of the run:
- workflow status
- completed tool calls
- persisted checkpoints
- final artifacts
- cancellation and completion status
Ephemeral event stream state
The transient live layer:
- token chunks
- progress updates
- tool-start and tool-finish events
- optimistic UI hints
- heartbeat-style live signals
Derived presentation state
What the UI renders from combining the durable base with recent stream events.
This split makes it easier to answer a critical question: what should survive reconnect, reload, or multi-client replay?
Usually the answer is not “everything that came over the socket.”
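In code, that layering might look something like the following sketch; the field names are assumptions for illustration, not from the post.

```ts
// Durable workflow state: persisted server-side, survives reconnects and reloads.
interface DurableWorkflowState {
  workflowId: string;
  status: 'running' | 'completed' | 'cancelled' | 'failed';
  stateVersion: number;
  completedToolCalls: string[];
  finalArtifactId?: string;
}

// Ephemeral stream state: cheap to throw away and rebuild after a reconnect.
interface StreamState {
  pendingTokens: string[];
  activeTool?: string;
  lastSequence: number;
}

// Derived presentation state: computed from the other two, never stored as truth.
function deriveView(durable: DurableWorkflowState, stream: StreamState) {
  return {
    status: durable.status,
    transcriptTail: stream.pendingTokens.join(''),
    showToolProgress: durable.status === 'running' && stream.activeTool !== undefined,
  };
}
```

Only the durable layer gets to answer what actually happened; the other two are rebuilt from it whenever needed.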
A simple event contract helps
```ts
type AgentEvent =
  | { type: 'token'; workflowId: string; sequence: number; text: string }
  | { type: 'tool_started'; workflowId: string; sequence: number; tool: string }
  | { type: 'tool_finished'; workflowId: string; sequence: number; tool: string; resultRef: string }
  | { type: 'checkpoint'; workflowId: string; sequence: number; stateVersion: number }
  | { type: 'completed'; workflowId: string; sequence: number; finalArtifactId: string };
```
The key idea is not TypeScript elegance. It is that stream events and durable checkpoints are not the same thing.
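A sketch of how a client might consume that contract, reusing the durable and stream layers from the sketch above: token events only touch the ephemeral layer, and only checkpoints and completion advance durable state.

```ts
// Illustrative consumer: stream events touch the ephemeral layer, while
// checkpoints and completion are the only events that advance durable state.
function applyEvent(
  durable: DurableWorkflowState,
  stream: StreamState,
  event: AgentEvent
): void {
  if (event.sequence <= stream.lastSequence) return; // drop duplicates and replays
  stream.lastSequence = event.sequence;

  switch (event.type) {
    case 'token':
      stream.pendingTokens.push(event.text); // ephemeral only, lost on reload
      break;
    case 'tool_started':
      stream.activeTool = event.tool;
      break;
    case 'tool_finished':
      stream.activeTool = undefined;
      durable.completedToolCalls.push(event.resultRef); // mirrors what the server persisted
      break;
    case 'checkpoint':
      durable.stateVersion = event.stateVersion; // only checkpoints move the durable version
      break;
    case 'completed':
      durable.status = 'completed';
      durable.finalArtifactId = event.finalArtifactId;
      break;
  }
}
```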
Debugging gets much worse unless you log the workflow, not just the transport
A lot of teams add WebSockets and keep HTTP-shaped observability. That is not enough.
They log:
- socket open/close
- server exceptions
- maybe provider latency
- maybe some tool errors
What they do not log well is the workflow progression itself.
That gap is why live agent bugs become painful to explain.
You can often tell that the socket stayed open and that the model responded. You still cannot answer:
- what the workflow believed at each stage
- whether the client missed a checkpoint event
- whether reconnect created duplicate subscribers
- whether retry logic re-executed a step already completed in the durable state
- which state version the UI rendered when it offered the next action
What to trace instead
For WebSocket-driven agent systems, structured tracing should include:
- workflow ID
- turn ID
- connection ID when relevant
- sequence number
- state version
- tool call IDs
- retry and reconnect markers
- cancellation intent versus cancellation completion
- finalization decisions
That gives you a narrative of the run instead of a pile of transport crumbs.
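The exact shape matters less than having one. As an illustration, a workflow-level trace entry might carry fields like these (the names are assumptions, not a standard):

```ts
// Illustrative shape for a workflow-level trace entry; the fields mirror the
// list above, and the names are assumptions rather than a standard.
interface WorkflowTraceEntry {
  workflowId: string;
  turnId: string;
  connectionId?: string; // only when the transport detail is actually relevant
  sequence: number;
  stateVersion: number;
  toolCallId?: string;
  phase: 'step' | 'retry' | 'reconnect' | 'cancel_requested' | 'cancel_confirmed' | 'finalized';
  message: string;
  timestamp: string;
}

// Example: a retry that was skipped because the durable state already had the result.
const entry: WorkflowTraceEntry = {
  workflowId: 'wf_812',
  turnId: 'turn_19',
  sequence: 131,
  stateVersion: 8,
  toolCallId: 'tool_call_5',
  phase: 'retry',
  message: 'retry skipped: tool result already persisted at state_version 8',
  timestamp: new Date().toISOString(),
};
```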
The difference between transport logs and workflow logs
A transport log tells you that a tool_finished event was emitted.
A workflow log tells you:
- which workflow emitted it
- which checkpoint preceded it
- whether that tool result was already persisted
- whether the completion path ran once or twice
- whether the client that saw it was current or stale
That second layer is what makes complex systems operable.
Cancellation and retry semantics become design decisions, not implementation details
This is another place where stateless systems were simpler than they looked.
In an HTTP-style system, cancel often means abort the request. Retry often means make the request again.
In a persistent agent loop, those words stop being precise.
What exactly does cancel mean?
When a user presses stop, are they trying to cancel:
- token streaming only?
- the current model step?
- queued tool calls?
- the entire workflow?
- background continuation after disconnect?
If you have not defined this clearly, different parts of the system will interpret cancellation differently.
That leads to ugly user experiences where:
- the stream stops but the tools keep running
- the UI says canceled but a completion arrives later
- one tab stops the run while another still shows it active
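One way to remove the ambiguity is to make the scope part of the cancellation contract itself, and to report intent and completion separately. A sketch, with names that are assumptions:

```ts
// Hypothetical cancellation contract: the client says exactly what it is
// cancelling, and the server reports intent and completion separately.
type CancelScope =
  | 'stream_only'   // stop sending tokens, keep the workflow running
  | 'current_step'  // abort the current model step or tool call
  | 'queued_tools'  // drop tool calls that have not started yet
  | 'workflow';     // stop everything, including background continuation

interface CancelRequest {
  workflowId: string;
  scope: CancelScope;
}

interface CancelStatus {
  workflowId: string;
  scope: CancelScope;
  intentRecorded: boolean;   // the cancel was received and persisted
  executionStopped: boolean; // the affected work has actually stopped
}
```

With that split, "the UI says canceled but a completion arrives later" becomes a visible intermediate state (intent recorded, execution not yet stopped) instead of a mystery.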
Retry is just as ambiguous
If a workflow partially completed and then broke, what should retry do?
- rerun the whole turn?
- rerun only the failed tool?
- restart synthesis from the last persisted checkpoint?
- create a fresh workflow linked to the old one?
Without durable checkpoints, most systems end up with only two options: start over or guess.
That is not a strong production model.
Checkpoints make retries less destructive
If the workflow persists stages like:
- planning complete
- tool A complete
- tool B failed retryably
- synthesis not started
then a retry can target the real failure boundary.
That is far better than replaying the whole loop and hoping side effects remain idempotent.
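A small sketch of what that targeting can look like, using the stages above as illustrative names:

```ts
// Sketch of checkpoint-driven retry targeting; the stage names are illustrative.
type Stage = 'planning' | 'tool_a' | 'tool_b' | 'synthesis';

interface Checkpoint {
  stage: Stage;
  status: 'complete' | 'failed_retryable' | 'failed_fatal' | 'not_started';
}

function retryTarget(checkpoints: Checkpoint[]): Stage | 'nothing_to_retry' | 'abort' {
  for (const cp of checkpoints) {
    if (cp.status === 'failed_fatal') return 'abort';
    if (cp.status === 'failed_retryable' || cp.status === 'not_started') {
      return cp.stage; // resume at the real failure boundary, not at the start
    }
  }
  return 'nothing_to_retry'; // every persisted stage already completed
}
```

For the stages listed above, this resolves to tool B: planning and tool A stay untouched, and synthesis runs only after the retry succeeds.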
WebSockets are worth it when the product is truly interactive
This is where teams need more discipline. Not every agent feature needs a persistent live loop.
Some do. Many do not.
Strong-fit cases
WebSockets usually earn their complexity when you need:
- live token streaming with interruption
- visible multi-step tool progress
- human-in-the-loop steering during execution
- collaborative views watching the same workflow
- low-latency back-and-forth between model and user
In these cases, persistent transport changes the actual value of the product.
Weak-fit cases
They are much less compelling when the task is basically:
- submit work
- wait
- fetch the result later
For long-running background jobs with loose interactivity, a durable queue plus polling or Server-Sent Events may be easier to operate and good enough for users.
This is the judgment call many teams skip. They adopt WebSockets because agent products look more modern with sockets, not because the workflow truly demands that shape.
The safest architecture is durable workflow, disposable socket
If I had to compress the whole topic into one recommendation, it would be this:
Design the workflow so the socket can vanish at any moment without corrupting the task.
That means:
- workflow state is persisted independently of the connection
- tool execution is tied to workflow identity, not socket lifetime
- live events have sequence numbers
- reconnect is treated as normal, not exceptional
- the UI can rebuild from durable state plus recent events
- final completion is explicit, not inferred from stream silence
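Under those rules, reconnect becomes a routine path. A minimal sketch of a resume flow, assuming a snapshot endpoint and sequence-based replay (both the endpoint and the query parameters are assumptions):

```ts
// Minimal resume flow under those rules; the snapshot endpoint and the
// after_sequence query parameter are assumptions for illustration.
async function resumeWorkflow(workflowId: string, lastSequence: number) {
  // 1. Rebuild from the durable snapshot; the old socket is irrelevant.
  const durable: DurableWorkflowState = await fetch(`/api/workflows/${workflowId}`)
    .then((r) => r.json());

  // 2. If the run finished while this client was away, no socket is needed at all.
  if (durable.status !== 'running') {
    return { durable, socket: undefined };
  }

  // 3. Otherwise open a fresh connection and subscribe from the last applied
  //    sequence, so missed events are replayed instead of silently lost.
  const socket = new WebSocket(
    `wss://example.invalid/ws?workflow=${workflowId}&after_sequence=${lastSequence}`
  );
  return { durable, socket };
}
```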
A good split of responsibilities
A mature setup usually looks like this:
- workflow coordinator owns state transitions
- tool execution layer owns idempotency and side effects
- event emitter broadcasts live progress
- WebSocket transport delivers updates and user steering
- frontend store reconciles live events with persisted checkpoints
This is more deliberate than keeping everything inside a live session object. It is also much more survivable once concurrency becomes real.
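One possible shape for that split, expressed as interfaces; this is a sketch of the boundaries, not a prescribed API.

```ts
interface WorkflowCoordinator {
  // Owns state transitions: applies an event and returns the new durable state.
  apply(workflowId: string, event: AgentEvent): Promise<DurableWorkflowState>;
}

interface ToolExecutionLayer {
  // Idempotency keys tie side effects to the workflow, not to any connection.
  run(workflowId: string, toolCallId: string, args: unknown): Promise<string>; // returns a resultRef
}

interface ProgressEmitter {
  // Broadcasts live progress to every current subscriber: tabs, panels, reconnects.
  emit(workflowId: string, event: AgentEvent): void;
}
```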
What to avoid
Be careful with designs where:
- active socket state is the only source of in-progress truth
- reconnect silently creates shadow runs
- tool outcomes exist only as stream events with no durable checkpoint
- completion is inferred because the stream ended instead of because the workflow closed explicitly
Those systems feel great in demos and become deeply confusing in production.
The real tradeoff is speed versus explicitness
That is the honest summary.
WebSockets make agentic workflows faster because they remove a lot of coordination overhead and let the loop stay hot between steps. But they also make the system harder to reason about because request boundaries no longer force explicit state transitions for you.
So the right question is not “should agent systems use WebSockets?” It is:
Where is lower latency valuable enough that you are willing to rebuild explicitness in other layers?
For highly interactive agent loops, the answer is often yes.
For simpler asynchronous flows, maybe not.
The practical decision rule is this:
Use WebSockets to improve transport, not to avoid designing a durable workflow model.
If you keep the workflow explicit and the socket disposable, you can capture most of the speed upside without making the system impossible to debug.
If you let the live connection become the workflow, the agent will absolutely feel faster right up until your team has to explain why one client saw a different truth than the durable system of record everyone thought they were building.
Read the full post on QCode: https://qcode.in/agentic-workflows-get-faster-with-websockets-but-harder-to-reason-about/