Voice AI works — until you interrupt it
Modern voice AI looks impressive.
Speech recognition is accurate.
LLMs are fluent.
Text-to-speech sounds natural.
But many real-time voice systems fail at a very human interaction:
interruption.
As soon as a user talks over the system — to stop it, redirect it, or correct it — things start to break.
Audio may stop, but generation continues.
State becomes inconsistent.
The system behaves unpredictably.
This is often blamed on latency or model quality.
In practice, it’s neither.
Interruptibility is not an audio feature
In real conversations, interruption is normal.
People interrupt:
mid-sentence
to change direction
to stop the other party immediately
Supporting this in software reveals a key insight:
Interruptibility is not an audio feature.
It’s a system-level contract.
The real questions are:
Who owns execution right now?
When is interruption legal?
What happens to in-flight work?
Those are architectural concerns.
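One way to make that contract concrete is an explicit interface that the rest of the system must go through. The sketch below is purely illustrative (the names `ExecutionContract`, `Turn`, and the method signatures are assumptions, not an existing API); the point is that each of those questions gets exactly one answer, in one place.

```typescript
// Illustrative sketch of the contract (names and signatures are assumptions,
// not an existing API). Each question gets exactly one answer, in one place.

type Turn = { id: string; startedAt: number };

interface ExecutionContract {
  // Who owns execution right now?
  currentOwner(): "user" | "assistant" | "nobody";

  // When is interruption legal?
  canInterrupt(turn: Turn): boolean;

  // What happens to in-flight work? Resolves only after cleanup completes.
  cancel(turn: Turn): Promise<void>;
}
```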
Why WebRTC is necessary — but insufficient
WebRTC is excellent at:
capturing audio
transporting audio
playing audio
But it does not decide:
whether the system should be speaking
whether generation must be canceled
how cleanup should happen
When behavior decisions are embedded in:
WebRTC callbacks
async / Promise chains
loosely coordinated flags
the system becomes timing-dependent.
It may work — until concurrency and streaming expose race conditions.
This is not a WebRTC issue.
It’s a missing control plane.
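A hedged sketch of that failure mode, with hypothetical helpers (`onUserSpeech`, `speakChunk`) standing in for whatever wires voice activity detection and TTS playback into your app; none of this is a real WebRTC API. The decision to cancel lives in a transport callback and a polled flag, so the outcome depends on which async step happens to run first.

```typescript
// Anti-pattern sketch (hypothetical helpers, not real WebRTC APIs):
// a transport callback flips a flag that a streaming loop polls,
// so correctness depends on which async step happens to run first.

declare function onUserSpeech(handler: () => void): void;   // e.g. wired to VAD
declare function speakChunk(token: string): Promise<void>;  // e.g. feeds TTS playback

let cancelRequested = false;

onUserSpeech(() => {
  // A transport event is making a behavioral decision.
  cancelRequested = true;
});

async function playResponse(tokens: AsyncIterable<string>) {
  for await (const token of tokens) {
    // The flag is only observed between awaits: audio that speakChunk already
    // queued keeps playing, and nothing here explicitly cancels the upstream
    // generation request.
    if (cancelRequested) break;
    await speakChunk(token);
  }
  // If another turn has started in the meantime, this reset silently clears
  // a cancellation that was meant for it.
  cancelRequested = false;
}
```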
The missing control plane: an explicit state machine
Reliable real-time voice systems share a common trait:
there is exactly one place that decides behavior.
That place:
enumerates system states
defines legal transitions
treats interruption as a designed transition, not an exception
owns cancellation and cleanup
In other words: an explicit state machine.
Not implicit flags.
Not “who resolves first.”
Not timing assumptions.
The state machine becomes the System Anchor.
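A minimal sketch of such a state machine in TypeScript. The states, events, and the `VoiceController` class are illustrative assumptions, not a prescribed design; a real system would have more states and attach concrete cancellation and cleanup work to each transition.

```typescript
// Minimal control-plane sketch. States and events are illustrative;
// the essential property is that every legal transition is enumerated
// and interruption is one of them, not an exception path.

type State = "idle" | "listening" | "thinking" | "speaking" | "cancelling";

type VoiceEvent =
  | { type: "USER_STARTED_SPEAKING" }  // e.g. raised by VAD on the WebRTC stream
  | { type: "USER_FINISHED" }
  | { type: "RESPONSE_READY" }
  | { type: "PLAYBACK_DONE" }
  | { type: "CLEANUP_DONE" };

// Anything not listed here is an illegal transition and is rejected.
const transitions: Record<State, Partial<Record<VoiceEvent["type"], State>>> = {
  idle:       { USER_STARTED_SPEAKING: "listening" },
  listening:  { USER_FINISHED: "thinking" },
  thinking:   { RESPONSE_READY: "speaking",
                USER_STARTED_SPEAKING: "cancelling" },  // interruption, by design
  speaking:   { PLAYBACK_DONE: "idle",
                USER_STARTED_SPEAKING: "cancelling" },  // interruption, by design
  cancelling: { CLEANUP_DONE: "listening" },
};

class VoiceController {
  private state: State = "idle";

  // onEnter is where cancellation and cleanup for each state live.
  constructor(private onEnter: (state: State) => void) {}

  dispatch(event: VoiceEvent): void {
    const next = transitions[this.state][event.type];
    if (next === undefined) return;  // illegal for the current state: ignore
    this.state = next;
    this.onEnter(next);
  }
}
```

Transport callbacks, token streams, and timers never change behavior directly; they only emit events into `dispatch`, so an event that arrives in the wrong state is simply rejected instead of racing.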
A practical mental model
Think of a voice AI system like this:
LLM → brain
WebRTC → sensory & motor system
State machine → spine
Without the spine, the brain can be smart —
but behavior becomes uncoordinated and unsafe.
The system may still talk, but it can’t behave reliably.
Why this matters more than model quality
Models will keep improving.
But no model can compensate for:
undefined execution ownership
hidden state
non-deterministic cancellation
If your voice system can’t:
stop cleanly
recover after interruption
resume without corruption
the problem is architectural, not algorithmic.
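One way to make "stop cleanly" deterministic is to give each turn a single cancellation token owned by the control plane. A sketch, assuming the LLM and TTS streaming clients accept a standard `AbortSignal` (the `streamLLM` and `streamTTS` wrappers are hypothetical):

```typescript
// Sketch: the control plane owns exactly one AbortController per turn,
// so "cancel" has one meaning and one owner. streamLLM and streamTTS are
// hypothetical wrappers around signal-aware streaming clients.

declare function streamLLM(prompt: string, signal: AbortSignal): AsyncIterable<string>;
declare function streamTTS(tokens: AsyncIterable<string>, signal: AbortSignal): Promise<void>;

class TurnExecutor {
  private controller: AbortController | null = null;

  async start(prompt: string): Promise<void> {
    this.controller = new AbortController();
    const { signal } = this.controller;
    try {
      // Both stages share the same signal, so one abort stops the whole turn.
      await streamTTS(streamLLM(prompt, signal), signal);
    } catch (err) {
      if (!signal.aborted) throw err;  // a real failure, not an interruption
    }
  }

  // Called by the state machine on the transition into "cancelling".
  cancel(): void {
    this.controller?.abort();
    this.controller = null;
  }
}
```

The state machine calls `cancel()` on the transition into its cancelling state, and the turn finishes through exactly one path, so nothing is left half-finished to corrupt the next turn.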
Takeaway
Interruptible voice AI is not an AI breakthrough.
It’s a systems engineering decision.
Once you treat real-time voice interaction as a living execution system on a time axis,
state machines stop being “old-school design” and become inevitable.
Not because they’re elegant —
but because they survive concurrency, streaming, and interruption.
Top comments (2)
Really, this all highlights the deeper problem with LLMs, which is that they're effectively stateless and can only carry a pseudo-state through their output token stream. Since the output tokens are what's directly converted to audio, "interrupting" the LLM necessarily means interrupting the token stream, which means the LLM is effectively "frozen" until an external system re-initializes it with a new prompt.
I imagine solving this problem in a robust and reliable manner will require a new AI architecture that's more than just a stateless math function: something that's always on and handles "context" as continuous state that updates over time, as opposed to an external collection of tokens that gets reprocessed at each inference step.
This resonates a lot.
What breaks on interruption isn’t audio or inference — it’s ownership. Once generation starts, most systems have no single authority that can say “this execution is no longer valid.”
Treating interruption as a first-class state transition (instead of a side effect of audio events or async cancellation) is the key insight here. Without an explicit execution spine, you end up with overlapping responsibilities between WebRTC callbacks, streaming tokens, and application logic — which explains the non-determinism people observe.
I especially like the “control plane” framing. Models will get better, but without a deterministic place that owns state, cancellation, and cleanup, real-time systems will always feel brittle under concurrency.
This is one of those problems that looks like “AI UX” on the surface, but is very clearly systems engineering underneath.