Real-time voice AI demos are easy to make look impressive.
Production voice systems are a different problem.
The demo usually has one clean interaction, one happy path, and no load. Production has packet jitter, user interruptions, reconnects, flaky speech recognition, delayed TTS, provider failures, and all the little timing issues that make a conversation feel robotic if you do not design for them upfront.
This is how I think about building real-time voice AI applications with LiveKit and FastAPI in a way that can actually ship.
It is less about one framework trick and more about architecture decisions: where state lives, where latency accumulates, how retries work, and what to observe before users tell you something feels off.
Reference architecture at a glance
A practical voice AI stack has a few clear layers:
- Client: browser or mobile app that captures mic input, streams audio, and plays synthesized responses
- Voice session layer: session identity, auth tokens, connection lifecycle, and per-user context
- LiveKit room: low-latency media transport and participant coordination
- STT pipeline: speech-to-text with partial and final transcripts
- LLM orchestration: prompt construction, tool calls, memory policy, and response shaping
- TTS pipeline: text-to-speech chunks streamed back to the user
- Backend APIs: FastAPI services handling session state, business actions, and persistence
- Observability: metrics, traces, logs, and replay signals for latency and failure analysis
I try to keep each layer independently testable.
If the orchestration logic can only run when a full audio room is active, debugging becomes painful fast.
Client and session boundaries
The client should do as little product logic as possible.
It captures audio, handles UI state, and relays events. It should not decide authorization or business outcomes.
For every voice session, I prefer generating a short-lived token on the backend, scoped to a single room and participant role. That keeps room access controlled and avoids broad credentials leaking into the frontend.
I also keep a server-side session record keyed by a stable session ID.
That record can track:
- user ID, if authenticated
- room ID
- started timestamp
- current mode
- latest orchestration state
- current turn state
- reconnect status
When a user reconnects, the backend can recover context without guessing from client memory.
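A minimal sketch of such a session record, using an in-memory dictionary as a stand-in for Redis or a database table. The class names, field names, and default values are illustrative assumptions, not a prescribed schema.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VoiceSession:
    """Server-side record for one voice session (field names are illustrative)."""
    room_id: str
    user_id: Optional[str] = None  # None for unauthenticated sessions
    session_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    mode: str = "listening"
    orchestration_state: dict = field(default_factory=dict)
    turn_state: str = "idle"
    reconnecting: bool = False


class SessionStore:
    """In-memory stand-in for a shared store such as Redis or Postgres."""

    def __init__(self) -> None:
        self._sessions = {}

    def create(self, room_id: str, user_id: Optional[str] = None) -> VoiceSession:
        session = VoiceSession(room_id=room_id, user_id=user_id)
        self._sessions[session.session_id] = session
        return session

    def mark_disconnected(self, session_id: str) -> None:
        self._sessions[session_id].reconnecting = True

    def resume(self, session_id: str) -> Optional[VoiceSession]:
        """Return the stored context on reconnect, or None for an unknown session."""
        session = self._sessions.get(session_id)
        if session is not None:
            session.reconnecting = False
        return session
```

Because the record lives on the server, a reconnecting client only needs to present its session ID. Everything else is recovered from the store.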
That matters because voice sessions are not just request-response flows. They are live interactions with timing, interruptions, and state.
LiveKit room design decisions
LiveKit gives you the real-time substrate, but you still need conventions.
For most assistant-style experiences, I prefer one room per active user session. It keeps event scopes clear and reduces accidental cross-talk.
If the use case requires multiple participants, such as a user, an AI agent, and a supervisor, define explicit participant metadata and role handling early.
Two patterns help a lot:
- Use data channels for control events like interrupt, mute, handoff, and system status
- Treat room events as first-class telemetry, not just infrastructure logs
Join, leave, track publish, track unpublish, reconnect, and bitrate drops are all product signals.
That event data becomes crucial when someone says, "The assistant started speaking over me."
At that point, you need to know whether the issue was model latency, voice activity detection timing, TTS cancellation, or a reconnect edge case.
STT: partials, finals, and confidence handling
Speech-to-text should emit partial transcripts quickly for responsiveness.
But downstream business logic should usually wait for final segments, or at least for transcripts that clear a confidence threshold.
If you run every partial transcript through your orchestration loop, you create race conditions and noisy model calls.
I usually think about transcript events in explicit states:
- partial: render to UI, but do not commit to durable context
- final: append to conversation history and trigger orchestration
- revised final: patch a previous segment if the provider corrects recognition
This makes transcript behaviour deterministic and easier to test.
It also avoids subtle bugs where the assistant answers a phrase the user never actually said in the final transcript.
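The three states can be sketched as a small buffer that decides what is committed and when orchestration should run. The class and field names are hypothetical, and treating a revised final as a silent patch that does not re-trigger orchestration is an assumption for this example.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptEvent:
    segment_id: str
    kind: str  # "partial", "final", or "revised_final"
    text: str


@dataclass
class TranscriptBuffer:
    live_partial: str = ""                          # shown in the UI only
    committed: dict = field(default_factory=dict)   # segment_id -> final text, in order

    def apply(self, event: TranscriptEvent) -> bool:
        """Apply one event; return True when orchestration should be triggered."""
        if event.kind == "partial":
            self.live_partial = event.text  # never committed to history
            return False
        if event.kind == "final":
            self.committed[event.segment_id] = event.text
            self.live_partial = ""
            return True
        if event.kind == "revised_final":
            if event.segment_id not in self.committed:
                raise KeyError(f"no final segment to revise: {event.segment_id}")
            self.committed[event.segment_id] = event.text  # patch in place
            return False
        raise ValueError(f"unknown transcript kind: {event.kind}")

    def history(self) -> list:
        return list(self.committed.values())
```

Since the buffer is plain data with no audio dependency, the partial/final/revised behaviour can be unit-tested with a scripted list of events.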
LLM orchestration in FastAPI
For orchestration, I usually expose a FastAPI endpoint or event handler that receives normalized transcript events and returns structured actions rather than raw prose.
The action envelope might include:
- assistant text
- tool calls
- state updates
- UI directives
- follow-up prompts
- confirmation requests
- escalation signals
When teams skip this structure, the orchestration layer turns into prompt glue and ad-hoc branching.
I prefer strict schemas, even when the model itself is flexible.
With schema-first orchestration, you can validate outputs, reject malformed actions, and retry safely without duplicating side effects.
This is where FastAPI works well for me: clear request models, async handling, and a straightforward way to compose tool integrations while keeping the contract explicit.
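Below is a minimal sketch of a strict action envelope and a parser that rejects malformed model output. It uses stdlib dataclasses so it runs anywhere. In a FastAPI service the same contract would more likely be expressed as Pydantic models. The field names are assumptions that cover a subset of the envelope described above.

```python
from dataclasses import dataclass, field

ALLOWED_KEYS = {"assistant_text", "tool_calls", "state_updates", "escalate"}


class MalformedAction(ValueError):
    """Raised when model output does not match the envelope contract."""


@dataclass(frozen=True)
class ToolCall:
    name: str
    arguments: dict


@dataclass(frozen=True)
class ActionEnvelope:
    assistant_text: str = ""
    tool_calls: tuple = ()
    state_updates: dict = field(default_factory=dict)
    escalate: bool = False


def parse_action(raw: object) -> ActionEnvelope:
    """Validate a decoded model response; raise MalformedAction on any mismatch."""
    if not isinstance(raw, dict):
        raise MalformedAction("envelope must be an object")
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise MalformedAction(f"unknown keys: {sorted(unknown)}")
    text = raw.get("assistant_text", "")
    if not isinstance(text, str):
        raise MalformedAction("assistant_text must be a string")
    items = raw.get("tool_calls", [])
    if not isinstance(items, list):
        raise MalformedAction("tool_calls must be a list")
    calls = []
    for item in items:
        if (not isinstance(item, dict) or not isinstance(item.get("name"), str)
                or not isinstance(item.get("arguments", {}), dict)):
            raise MalformedAction("each tool call needs a string name and object arguments")
        calls.append(ToolCall(item["name"], item.get("arguments", {})))
    updates = raw.get("state_updates", {})
    if not isinstance(updates, dict):
        raise MalformedAction("state_updates must be an object")
    escalate = raw.get("escalate", False)
    if not isinstance(escalate, bool):
        raise MalformedAction("escalate must be a boolean")
    return ActionEnvelope(text, tuple(calls), updates, escalate)
```

Because parsing happens before any tool runs, a `MalformedAction` can be retried against the model without risking a duplicated side effect.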
Latency budgets and interruption behaviour
Voice UX is mostly latency UX.
If a response arrives late, users interrupt.
If interruption handling is weak, trust drops quickly.
I like setting a practical latency budget per turn and breaking it down by stage:
- STT latency
- orchestration latency
- tool call latency
- TTS startup time
- time to first audio byte
- total turn completion time
Once you have per-stage timings, optimization decisions become much clearer.
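Per-stage timing does not need much machinery. The sketch below records elapsed time per stage with a context manager and reports which stages exceeded their budget. The class name, stage names, and budget numbers are placeholders for illustration, not recommended targets.

```python
import time
from contextlib import contextmanager


class TurnTimings:
    """Collects per-stage wall-clock timings for a single turn."""

    def __init__(self, budget_ms: dict) -> None:
        self.budget_ms = budget_ms
        self.elapsed_ms = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Accumulate so a stage entered twice (e.g. two tool calls) is summed.
            spent = (time.perf_counter() - start) * 1000.0
            self.elapsed_ms[name] = self.elapsed_ms.get(name, 0.0) + spent

    def over_budget(self) -> list:
        """Names of stages whose elapsed time exceeded their budget."""
        return sorted(
            name for name, ms in self.elapsed_ms.items()
            if ms > self.budget_ms.get(name, float("inf"))
        )

    def total_ms(self) -> float:
        return sum(self.elapsed_ms.values())
```

A turn handler would wrap each step, for example `with timings.stage("stt"): ...`, and emit `elapsed_ms` and `over_budget()` alongside the session and turn IDs.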
Interruption should be supported as a first-class control flow:
- client sends interrupt event immediately
- current TTS stream is cancelled
- orchestrator marks the prior response as interrupted
- next turn continues from a clean state
Without explicit cancellation semantics, the system often continues generating text in the background and then leaks stale context into the next turn.
That is one of the fastest ways to make a voice assistant feel broken.
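The cancellation steps above can be sketched with plain `asyncio` tasks. The TTS stream here is simulated with `asyncio.sleep`, and the class and status names are assumptions. What matters is that the playback task is cancelled and awaited before the turn is marked interrupted.

```python
import asyncio


class TurnPlayback:
    """Tracks the TTS task for the current turn so it can be cancelled cleanly."""

    def __init__(self) -> None:
        self.task = None
        self.status = "idle"
        self.spoken = []

    async def _speak(self, chunks: list) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # stand-in for streaming one TTS chunk
            self.spoken.append(chunk)
        self.status = "completed"

    def start(self, chunks: list) -> None:
        """Begin a new turn from a clean state."""
        self.spoken = []
        self.status = "speaking"
        self.task = asyncio.create_task(self._speak(chunks))

    async def interrupt(self) -> None:
        """Cancel in-flight playback and wait until it has actually stopped."""
        if self.task is None or self.task.done():
            return
        self.task.cancel()
        try:
            await self.task
        except asyncio.CancelledError:
            pass  # expected: the playback task was cancelled
        self.status = "interrupted"


async def demo() -> TurnPlayback:
    playback = TurnPlayback()
    playback.start([f"chunk-{i}" for i in range(10)])
    await asyncio.sleep(0.025)  # user starts talking mid-response
    await playback.interrupt()
    return playback
```

Running `asyncio.run(demo())` leaves the playback marked as interrupted with only the first few chunks spoken. The same pattern applies to an in-flight LLM generation task.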
Retries and idempotency
Retries are inevitable with external STT, LLM, and TTS providers.
The critical point is making retries safe.
I like attaching idempotency keys to orchestration turns and tool executions, then persisting turn state transitions:
- received
- processing
- completed
- failed
- cancelled
If a timeout occurs and the client retries, the backend can return an existing result or resume from a known step instead of replaying side effects.
This matters a lot when tool calls trigger real actions, such as booking, messaging, account changes, database writes, or payment-related workflows.
A voice system should feel conversational, but the backend still needs to behave like a reliable distributed system.
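A minimal sketch of idempotent turn execution, assuming an in-memory dictionary in place of a durable store. The class names are hypothetical, and allowing a failed turn to be re-run on retry is a policy choice made for this example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TurnRecord:
    state: str = "received"  # received -> processing -> completed | failed
    result: Any = None
    error: Optional[str] = None


class TurnExecutor:
    """Runs each side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        self._turns = {}

    def state_of(self, key: str) -> Optional[str]:
        record = self._turns.get(key)
        return record.state if record else None

    def run(self, key: str, action: Callable[[], Any]) -> Any:
        record = self._turns.get(key)
        if record is not None and record.state == "completed":
            return record.result  # replay the stored result, no side effects
        if record is not None and record.state == "processing":
            raise RuntimeError("turn already in progress")
        record = self._turns.setdefault(key, TurnRecord())
        record.state = "processing"
        try:
            record.result = action()
        except Exception as exc:
            record.state = "failed"
            record.error = str(exc)
            raise
        record.state = "completed"
        record.error = None
        return record.result
```

A client retry that reuses the same key after a timeout gets the stored result back instead of triggering a second booking or payment.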
Observability that actually helps
For voice systems, aggregate uptime is not enough.
I care about metrics that explain the actual user experience:
- end-to-end turn latency percentiles
- time to first transcript token
- time to final transcript
- time to first audio byte from TTS
- interrupt rate per session
- provider error codes by stage
- reconnect frequency
- average recovery time after a reconnect
- cancelled turn count
- failed orchestration count
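For the latency percentiles in that list, a nearest-rank calculation over recorded turn latencies is enough to get started. This is a sketch with a hypothetical function name. A metrics backend would normally compute percentiles from histograms instead.

```python
def percentile_nearest_rank(samples_ms: list, p: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples_ms)
    rank = (p * len(ordered) + 99) // 100  # integer ceil(p * n / 100)
    return ordered[rank - 1]
```

For the samples `[120, 80, 400, 95, 150]`, the p50 is 120 and the p95 is 400, which shows why a mean would hide the one slow turn a user actually noticed.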
I also like structured logs keyed by session ID and turn ID, so a problematic interaction can be reconstructed quickly.
You do not always need to store raw audio to debug effectively. In many cases, replaying timeline events is enough, and it is much safer from a privacy perspective.
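Here is a small sketch of timeline replay from structured logs, with no audio involved. The event names and the list used as a log sink are assumptions. In practice the lines would come from your log store.

```python
import json


def log_event(sink: list, session_id: str, turn_id: str, event: str,
              t_ms: int, **fields) -> None:
    """Append one structured log line keyed by session and turn."""
    record = {"session_id": session_id, "turn_id": turn_id,
              "event": event, "t_ms": t_ms, **fields}
    sink.append(json.dumps(record, sort_keys=True))


def replay_turn(lines: list, session_id: str, turn_id: str) -> list:
    """Rebuild one turn as (offset_ms, event) pairs ordered by time."""
    events = [json.loads(line) for line in lines]
    selected = [e for e in events
                if e["session_id"] == session_id and e["turn_id"] == turn_id]
    selected.sort(key=lambda e: e["t_ms"])
    return [(e["t_ms"] - selected[0]["t_ms"], e["event"]) for e in selected]
```

A replayed turn that reads "final transcript, TTS first byte, interrupt received 900 ms later, TTS cancelled" usually answers the "it talked over me" question without touching raw audio.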
Deployment and scaling notes
On deployment, I treat the voice path as a latency-sensitive service tier.
Keep orchestration workers close to your media region where possible, and avoid unnecessary synchronous hops.
FastAPI services running behind autoscaling can work well, but cold starts and noisy neighbours still matter for interactive voice.
I usually separate concerns into at least two deployables:
- API/session control
- orchestration workers
That gives you flexibility to scale orchestration independently when usage spikes.
If you rely on Kubernetes, readiness checks should validate downstream dependency health rather than only process liveness.
For a normal API, a process being alive may be enough to accept traffic. For real-time voice, a process that cannot reach STT, TTS, or the orchestration layer is not actually ready.
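A dependency-aware readiness check can be sketched as a function that runs a probe per downstream service under a timeout. The probe callables and the `(status_code, body)` return shape are assumptions. In a FastAPI service this would sit behind whatever route your Kubernetes readiness probe calls.

```python
import asyncio


async def check_readiness(probes: dict, timeout_s: float = 0.5) -> tuple:
    """Run every dependency probe concurrently; ready only if all succeed in time.

    `probes` maps a dependency name to an async callable that raises on failure.
    Returns (http_status, body) so a route handler can pass it straight through.
    """

    async def run(name, probe):
        try:
            await asyncio.wait_for(probe(), timeout=timeout_s)
            return name, "ok"
        except asyncio.TimeoutError:
            return name, "timeout"
        except Exception as exc:  # any other failure means the dependency is unhealthy
            return name, f"error: {type(exc).__name__}"

    pairs = await asyncio.gather(*(run(name, probe) for name, probe in probes.items()))
    results = dict(pairs)
    ready = all(status == "ok" for status in results.values())
    return (200 if ready else 503), {"ready": ready, "dependencies": results}
```

Keeping the timeout short matters here: a readiness check that hangs on a slow provider is itself a reliability problem.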
Security and privacy baseline
Voice products can capture sensitive data by accident.
Even if your first release is small, it is worth setting conservative boundaries early:
- short retention for raw transcripts unless explicitly needed
- redaction for known sensitive fields in logs
- scoped credentials for room tokens and provider APIs
- clear user controls for muting
- clear consent flows
- deletion controls
- limited access to session logs
- separation between debugging metadata and sensitive content
These boundaries are much easier to set early than to retrofit later.
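Log redaction can start as a small function applied before any record is written. The key names and regular expressions below are illustrative assumptions and will not catch every sensitive value, so treat this as a baseline rather than a guarantee.

```python
import re

# Hypothetical keys whose values are always dropped from logs.
SENSITIVE_KEYS = {"transcript", "email", "phone"}

# Pattern-based redaction for free-text fields.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\b\d(?:[ -]?\d){12,15}\b"), "[CARD]"),  # 13 to 16 digit runs
]


def redact_text(text: str) -> str:
    for pattern, replacement in REDACTIONS:
        text = pattern.sub(replacement, text)
    return text


def redact_record(record: dict) -> dict:
    """Return a copy of a log record that is safe to write."""
    clean = {}
    for key, value in record.items():
        if key in SENSITIVE_KEYS:
            clean[key] = "[REDACTED]"
        elif isinstance(value, dict):
            clean[key] = redact_record(value)
        elif isinstance(value, str):
            clean[key] = redact_text(value)
        else:
            clean[key] = value
    return clean
```

Session IDs, turn IDs, and timings pass through untouched, which keeps debugging metadata separate from sensitive content.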
Closing thoughts
Building real-time voice AI applications is not only an LLM problem.
It is a systems problem spanning networking, state management, retries, observability, security, and product interaction design.
LiveKit and FastAPI make an effective foundation, but the quality comes from how you define boundaries and failure behaviour.
For me, the winning pattern is simple:
- predictable contracts
- explicit state
- tight latency feedback loops
- safe retries
- clear cancellation semantics
- observability from day one
That is what keeps the experience feeling conversational while still operating like production software.