# Voice AI works — until you interrupt it
Modern voice AI looks impressive.
Speech recognition is accurate.
LLMs are fluent.
Text-to-speech sounds natural.
But many real-time voice systems fail at a very human interaction:
interruption.
As soon as a user talks over the system — to stop it, redirect it, or correct it — things start to break.
Audio may stop, but generation continues.
State becomes inconsistent.
The system behaves unpredictably.
This is often blamed on latency or model quality.
In practice, it’s neither.
## Interruptibility is not an audio feature
In real conversations, interruption is normal.
People interrupt:

- mid-sentence
- to change direction
- to stop the other party immediately
Supporting this in software reveals a key insight:
Interruptibility is not an audio feature.
It’s a system-level contract.
The real questions are:

- Who owns execution right now?
- When is interruption legal?
- What happens to in-flight work?
Those are architectural concerns.
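Made concrete, that contract is small enough to write down. Here is a minimal TypeScript sketch; the names (`TurnOwner`, `InterruptDecision`, `VoiceSessionContract`) are hypothetical, not taken from any particular framework.

```typescript
// A sketch of the contract, not a library API. All names are hypothetical.

type TurnOwner = "user" | "assistant" | "nobody";

interface InterruptDecision {
  allowed: boolean;        // is interruption legal in the current state?
  cancelInFlight: boolean; // must pending generation and synthesis be canceled?
}

interface VoiceSessionContract {
  // Who owns execution right now?
  currentOwner(): TurnOwner;

  // When is interruption legal?
  onUserSpeech(): InterruptDecision;

  // What happens to in-flight work?
  cancelInFlightWork(): Promise<void>;
}
```

Nothing here mentions audio codecs or transport. That is the point: the contract is about behavior, not sound.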
## Why WebRTC is necessary — but insufficient
WebRTC is excellent at:

- capturing audio
- transporting audio
- playing audio

But it does not decide:

- whether the system should be speaking
- whether generation must be canceled
- how cleanup should happen
When behavior decisions are embedded in:

- WebRTC callbacks
- async / Promise chains
- loosely coordinated flags

the system becomes timing-dependent.
It may work — until concurrency and streaming expose race conditions.
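As an illustration (not code from any real system), this is roughly what that flag-based coordination looks like in TypeScript; every name is hypothetical. The decision to stop is split between a callback and a loop, so the outcome depends on which async step happens to run first.

```typescript
// Illustrative anti-pattern: behavior decided by loosely coordinated flags.
// All names here are hypothetical.

declare function playAudio(chunk: string): Promise<void>;

let isSpeaking = false;
let shouldCancel = false;

// WebRTC-side callback: user audio arrives while the assistant is talking.
function onUserAudio(): void {
  if (isSpeaking) {
    shouldCancel = true; // hope the pipeline notices in time
  }
}

// Generation/TTS pipeline: checks the flag "often enough".
async function speak(chunks: AsyncIterable<string>): Promise<void> {
  isSpeaking = true;
  for await (const chunk of chunks) {
    if (shouldCancel) break; // may fire one chunk too late
    await playAudio(chunk);  // onUserAudio can run between any two awaits
  }
  isSpeaking = false;
  shouldCancel = false; // if a new turn already started, this clobbers its state
}
```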
This is not a WebRTC issue.
It’s a missing control plane.
## The missing control plane: an explicit state machine
Reliable real-time voice systems share a common trait:
there is exactly one place that decides behavior.
That place:

- enumerates system states
- defines legal transitions
- treats interruption as a designed transition, not an exception
- owns cancellation and cleanup
In other words: an explicit state machine.
Not implicit flags.
Not “who resolves first.”
Not timing assumptions.
The state machine becomes the System Anchor.
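Here is a minimal sketch of that anchor in TypeScript, assuming hypothetical state and event names; a production system would add states for tool calls, error recovery, and re-prompting, but the shape stays the same.

```typescript
// A minimal sketch; state and event names are hypothetical.

type State = "idle" | "listening" | "thinking" | "speaking";
type Event =
  | "userStartedSpeaking"
  | "userStoppedSpeaking"
  | "responseReady"
  | "playbackFinished";

// Legal transitions are enumerated; anything else is rejected, not guessed at.
const transitions: Record<State, Partial<Record<Event, State>>> = {
  idle:      { userStartedSpeaking: "listening" },
  listening: { userStoppedSpeaking: "thinking" },
  thinking:  { responseReady: "speaking",
               userStartedSpeaking: "listening" }, // interruption is a designed transition
  speaking:  { playbackFinished: "idle",
               userStartedSpeaking: "listening" }, // barge-in is a designed transition too
};

class VoiceSession {
  private state: State = "idle";
  private inFlight: AbortController | null = null;

  // Call when starting generation/synthesis; pass the signal to the LLM/TTS calls.
  beginWork(): AbortSignal {
    this.inFlight = new AbortController();
    return this.inFlight.signal;
  }

  // The single place that decides behavior.
  dispatch(event: Event): void {
    const next = transitions[this.state][event];
    if (next === undefined) return; // illegal transition: ignore it, don't improvise

    // Cancellation and cleanup are owned here, not in callbacks.
    if (event === "userStartedSpeaking" && this.state !== "idle") {
      this.inFlight?.abort();
      this.inFlight = null;
    }
    this.state = next;
  }
}
```

The point is not this particular table. It is that every transition, including interruption, is enumerated in one place, and cancellation happens only as part of a legal transition.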
## A practical mental model
Think of a voice AI system like this:

- LLM → brain
- WebRTC → sensory & motor system
- State machine → spine
Without the spine, the brain can be smart —
but behavior becomes uncoordinated and unsafe.
The system may still talk, but it can’t behave reliably.
## Why this matters more than model quality
Models will keep improving.
But no model can compensate for:

- undefined execution ownership
- hidden state
- non-deterministic cancellation
If your voice system can’t:

- stop cleanly
- recover after interruption
- resume without corruption

the problem is architectural, not algorithmic.
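Concretely, "stop cleanly" means cancellation is a first-class path through the pipeline with its own cleanup, not a side effect you hope happens. A sketch using the standard fetch/AbortSignal pattern follows; the endpoint and the playback hand-off are assumptions for illustration.

```typescript
// Sketch: one cancellation path, driven by the state machine's AbortSignal.
// The endpoint and the playback hand-off are hypothetical.

declare function enqueueForPlayback(chunk: Uint8Array): void;

async function streamResponse(prompt: string, signal: AbortSignal): Promise<void> {
  const res = await fetch("/llm/stream", {
    method: "POST",
    body: JSON.stringify({ prompt }),
    signal, // aborting the controller cancels the in-flight request
  });

  const reader = res.body!.getReader();
  try {
    while (!signal.aborted) {
      const { done, value } = await reader.read();
      if (done || !value) break;
      enqueueForPlayback(value); // hand the chunk to TTS / playback
    }
  } catch (err) {
    // Interruption is an expected outcome, not an error.
    if ((err as Error).name !== "AbortError") throw err;
  } finally {
    await reader.cancel().catch(() => {}); // cleanup runs on every path
  }
}
```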
## Takeaway
Interruptible voice AI is not an AI breakthrough.
It’s a systems engineering decision.
Once you treat real-time voice interaction as a living execution system on a time axis,
state machines stop being “old-school design” and become inevitable.
Not because they’re elegant —
but because they survive concurrency, streaming, and interruption.