Joshua Fields

Posted on Jun 23

Building Real-Time Voice AI Applications with LiveKit and FastAPI


Real-time voice AI demos are easy to make look impressive.

Production voice systems are a different problem.

The demo usually has one clean interaction, one happy path, and no load. Production has packet jitter, user interruptions, reconnects, flaky speech recognition, delayed TTS, provider failures, and all the little timing issues that make a conversation feel robotic if you do not design for them up front.

This is how I think about building real-time voice AI applications with LiveKit and FastAPI in a way that can actually ship.

It is less about one framework trick and more about architecture decisions: where state lives, where latency accumulates, how retries work, and what to observe before users tell you something feels off.

Reference architecture at a glance

A practical voice AI stack has a few clear layers:

Client: browser or mobile app that captures mic input, streams audio, and plays synthesized responses
Voice session layer: session identity, auth tokens, connection lifecycle, and per-user context
LiveKit room: low-latency media transport and participant coordination
STT pipeline: speech-to-text with partial and final transcripts
LLM orchestration: prompt construction, tool calls, memory policy, and response shaping
TTS pipeline: text-to-speech chunks streamed back to the user
Backend APIs: FastAPI services handling session state, business actions, and persistence
Observability: metrics, traces, logs, and replay signals for latency and failure analysis

I try to keep each layer independently testable.

If the orchestration logic can only run when a full audio room is active, debugging becomes painful fast.

Client and session boundaries

The client should do as little product logic as possible.

It captures audio, handles UI state, and relays events. It should not decide authorization or business outcomes.

For every voice session, I prefer generating a short-lived token on the backend, scoped to a single room and participant role. That keeps room access controlled and avoids broad credentials leaking into the frontend.

I also keep a server-side session record keyed by a stable session ID.

That record can track:

user ID, if authenticated
room ID
started timestamp
current mode
latest orchestration state
current turn state
reconnect status

When a user reconnects, the backend can recover context without guessing from client memory.
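As a rough illustration of that session record, here is a minimal sketch using only the standard library. The names `VoiceSession` and `SessionStore` are hypothetical, and the in-memory dict is an assumption standing in for whatever shared store (Redis, a database table) a real deployment would use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class VoiceSession:
    """Server-side record for one voice session (field names are illustrative)."""
    session_id: str
    room_id: str
    user_id: Optional[str] = None  # None when the user is not authenticated
    started_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    mode: str = "assistant"
    orchestration_state: dict = field(default_factory=dict)
    turn_state: str = "idle"
    reconnecting: bool = False


class SessionStore:
    """In-memory stand-in for a shared store; not safe across processes."""

    def __init__(self) -> None:
        self._sessions = {}  # session_id -> VoiceSession

    def create(self, session_id: str, room_id: str, user_id: Optional[str] = None) -> VoiceSession:
        session = VoiceSession(session_id=session_id, room_id=room_id, user_id=user_id)
        self._sessions[session_id] = session
        return session

    def mark_disconnected(self, session_id: str) -> None:
        self._sessions[session_id].reconnecting = True

    def resume(self, session_id: str) -> Optional[VoiceSession]:
        """Return the stored session on reconnect, or None if it is unknown."""
        session = self._sessions.get(session_id)
        if session is not None:
            session.reconnecting = False
        return session
```

The point of `resume` is that the reconnecting client only needs to present the session ID; turn state and orchestration state come from the server-side record.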

That matters because voice sessions are not just request-response flows. They are live interactions with timing, interruptions, and state.

LiveKit room design decisions

LiveKit gives you the real-time substrate, but you still need conventions.

For most assistant-style experiences, I prefer one room per active user session. It keeps event scopes clear and reduces accidental cross-talk.

If the use case requires multiple participants, such as a user, an AI agent, and a supervisor, define explicit participant metadata and role handling early.

Two patterns help a lot:

Use data channels for control events like interrupt, mute, handoff, and system status
Treat room events as first-class telemetry, not just infrastructure logs

Join, leave, track publish, track unpublish, reconnect, and bitrate drops are all product signals.
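One way to keep control events consistent is a small typed envelope that both sides encode and decode. The sketch below is an assumption about what such an envelope could look like, not a LiveKit-defined format: `ControlEvent`, `encode_control`, and `decode_control` are hypothetical names, and the resulting bytes would be handed to whatever data-publishing call your LiveKit SDK exposes.

```python
import json
import time
from dataclasses import asdict, dataclass

# Control event names taken from the list above; extend as needed.
CONTROL_TYPES = {"interrupt", "mute", "handoff", "system_status"}


@dataclass(frozen=True)
class ControlEvent:
    type: str
    session_id: str
    sent_at_ms: int
    payload: dict


def encode_control(event_type: str, session_id: str, payload: dict = None) -> bytes:
    """Serialize a control event to bytes suitable for a data channel."""
    if event_type not in CONTROL_TYPES:
        raise ValueError(f"unknown control event: {event_type}")
    event = ControlEvent(event_type, session_id, int(time.time() * 1000), payload or {})
    return json.dumps(asdict(event)).encode("utf-8")


def decode_control(raw: bytes) -> ControlEvent:
    """Parse and validate bytes received from a data channel."""
    data = json.loads(raw.decode("utf-8"))
    if data.get("type") not in CONTROL_TYPES:
        raise ValueError(f"unknown control event: {data.get('type')}")
    return ControlEvent(
        type=data["type"],
        session_id=data["session_id"],
        sent_at_ms=int(data["sent_at_ms"]),
        payload=data.get("payload", {}),
    )
```

Carrying `session_id` and a send timestamp in every event also makes these messages usable as telemetry later, since they can be lined up against room events for the same session.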

That event data becomes crucial when someone says:

"The assistant started speaking over me."

At that point, you need to know whether the issue was model latency, voice activity detection timing, TTS cancellation, or a reconnect edge case.

STT: partials, finals, and confidence handling

Speech-to-text should emit partial transcripts quickly for responsiveness.

But downstream business logic should usually wait for a final segment, or at least for a transcript that clears a confidence threshold you have chosen.

If you run every partial transcript through your orchestration loop, you create race conditions and noisy model calls.

I usually think about transcript events in explicit states:

partial: render to UI, but do not commit to durable context
final: append to conversation history and trigger orchestration
revised final: patch a previous segment if the provider corrects recognition

This makes transcript behaviour deterministic and easier to test.

It also avoids subtle bugs where the assistant answers a phrase the user never actually said in the final transcript.
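The three transcript states can be sketched as a small buffer that decides when orchestration should run. This is a minimal illustration under assumptions: `TranscriptBuffer` and the event kind strings are hypothetical, providers differ in how they identify segments, and whether a revised final should re-trigger orchestration is a product decision (here it does not).

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    partial_text: str = ""                          # shown in the UI only
    segments: dict = field(default_factory=dict)    # segment_id -> committed text
    order: list = field(default_factory=list)       # segment ids in arrival order

    def apply(self, kind: str, segment_id: str, text: str) -> bool:
        """Apply one transcript event; return True if orchestration should run."""
        if kind == "partial":
            # Render-only: never touches durable history.
            self.partial_text = text
            return False
        if kind == "final":
            if segment_id in self.segments:
                return False  # duplicate delivery, already committed
            self.segments[segment_id] = text
            self.order.append(segment_id)
            self.partial_text = ""
            return True
        if kind == "revised_final":
            if segment_id not in self.segments:
                raise KeyError(f"cannot revise unknown segment: {segment_id}")
            # Patch the earlier segment in place, keeping its position.
            self.segments[segment_id] = text
            return False
        raise ValueError(f"unknown transcript event kind: {kind}")

    def history(self) -> list:
        return [self.segments[segment_id] for segment_id in self.order]
```

Because `apply` is a pure state transition with a boolean result, it can be unit tested with recorded event sequences and no audio room at all.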

LLM orchestration in FastAPI

For orchestration, I usually expose a FastAPI endpoint or event handler that receives normalized transcript events and returns structured actions rather than raw prose.

The action envelope might include:

assistant text
tool calls
state updates
UI directives
follow-up prompts
confirmation requests
escalation signals

When teams skip this structure, the orchestration layer turns into prompt glue and ad-hoc branching.

I prefer strict schemas, even when the model itself is flexible.

With schema-first orchestration, you can validate outputs, reject malformed actions, and retry safely without duplicating side effects.

This is where FastAPI works well for me: clear request models, async handling, and a straightforward way to compose tool integrations while keeping the contract explicit.
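To make the schema-first idea concrete, here is a minimal validation sketch. In a FastAPI service this would normally be a Pydantic model; the version below uses only dataclasses so it runs on its own. The field names and `parse_envelope` are hypothetical, and the envelope covers only a subset of the fields listed above.

```python
from dataclasses import dataclass, field

ALLOWED_KEYS = {"assistant_text", "tool_calls", "state_updates", "escalate"}


@dataclass
class ToolCall:
    name: str
    arguments: dict


@dataclass
class ActionEnvelope:
    assistant_text: str = ""
    tool_calls: list = field(default_factory=list)
    state_updates: dict = field(default_factory=dict)
    escalate: bool = False


def parse_envelope(raw: dict) -> ActionEnvelope:
    """Validate model output before anything acts on it; raise ValueError if malformed."""
    unknown = set(raw) - ALLOWED_KEYS
    if unknown:
        raise ValueError(f"unexpected keys: {sorted(unknown)}")
    text = raw.get("assistant_text", "")
    if not isinstance(text, str):
        raise ValueError("assistant_text must be a string")
    raw_calls = raw.get("tool_calls", [])
    if not isinstance(raw_calls, list):
        raise ValueError("tool_calls must be a list")
    calls = []
    for item in raw_calls:
        if not isinstance(item, dict) or not isinstance(item.get("name"), str):
            raise ValueError("each tool call needs a string name")
        arguments = item.get("arguments", {})
        if not isinstance(arguments, dict):
            raise ValueError("tool call arguments must be an object")
        calls.append(ToolCall(item["name"], arguments))
    updates = raw.get("state_updates", {})
    if not isinstance(updates, dict):
        raise ValueError("state_updates must be an object")
    escalate = raw.get("escalate", False)
    if not isinstance(escalate, bool):
        raise ValueError("escalate must be a boolean")
    return ActionEnvelope(text, calls, updates, escalate)
```

Validation happens before any tool runs, so a malformed response can be rejected and the model call retried without side effects having occurred.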

Latency budgets and interruption behaviour

Voice UX is mostly latency UX.

If a response arrives late, users interrupt.

If interruption handling is weak, trust drops quickly.

I like setting a practical latency budget per turn and breaking it down by stage:

STT latency
orchestration latency
tool call latency
TTS startup time
time to first audio byte
total turn completion time

Once you have per-stage timings, optimization decisions become much clearer.
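A per-stage budget can be tracked with a small timing helper like the sketch below. `TurnTimings` is a hypothetical name, and the budget values in the usage example are placeholders rather than recommendations.

```python
import time
from contextlib import contextmanager


class TurnTimings:
    """Collects wall-clock time per stage for one turn and compares it to a budget."""

    def __init__(self, budget_ms: dict) -> None:
        self.budget_ms = budget_ms
        self.elapsed_ms = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Recorded even if the stage raises, so failures still show up in timings.
            self.elapsed_ms[name] = (time.perf_counter() - start) * 1000.0

    def over_budget(self) -> dict:
        """Stages whose elapsed time exceeded their budget (unbudgeted stages never do)."""
        return {
            name: ms
            for name, ms in self.elapsed_ms.items()
            if ms > self.budget_ms.get(name, float("inf"))
        }

    def total_ms(self) -> float:
        return sum(self.elapsed_ms.values())


# Usage with placeholder budgets (milliseconds):
timings = TurnTimings({"stt": 300.0, "orchestration": 500.0, "tts_startup": 200.0})
with timings.stage("stt"):
    pass  # call the STT provider here
```

Summing stages only approximates total turn time when stages run sequentially; if STT, orchestration, and TTS overlap through streaming, time to first audio byte needs its own measurement.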

Interruption should be supported as a first-class control flow:

client sends interrupt event immediately
current TTS stream is cancelled
orchestrator marks the prior response as interrupted
next turn continues from a clean state

Without explicit cancellation semantics, the system often continues generating text in the background and then leaks stale context into the next turn.

That is one of the fastest ways to make a voice assistant feel broken.
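The cancellation steps above can be sketched with `asyncio` task cancellation. This is an illustration under assumptions: `TurnPlayback` is a hypothetical name, and the `asyncio.sleep` call stands in for streaming one TTS audio chunk to the room.

```python
import asyncio


class TurnPlayback:
    """Tracks one assistant response so it can be cancelled mid-stream."""

    def __init__(self) -> None:
        self.spoken = []       # chunks actually delivered to the user
        self.status = "idle"
        self._task = None

    async def _stream(self, chunks) -> None:
        for chunk in chunks:
            await asyncio.sleep(0.01)  # stand-in for sending one audio chunk
            self.spoken.append(chunk)
        self.status = "completed"

    def start(self, chunks) -> None:
        self.status = "speaking"
        self._task = asyncio.create_task(self._stream(chunks))

    async def interrupt(self) -> None:
        """Cancel the in-flight stream and wait until it has actually stopped."""
        if self._task is not None and not self._task.done():
            self._task.cancel()
            try:
                await self._task
            except asyncio.CancelledError:
                pass
            self.status = "interrupted"


async def demo() -> TurnPlayback:
    playback = TurnPlayback()
    playback.start(["Your ", "booking ", "is ", "confirmed."])
    await asyncio.sleep(0.025)   # user starts talking partway through
    await playback.interrupt()
    return playback
```

Recording `spoken` separately from the generated text matters for the next turn: the conversation history should reflect what the user actually heard, not everything the model produced.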

Retries and idempotency

Retries are inevitable with external STT, LLM, and TTS providers.

The critical point is making retries safe.

I like attaching idempotency keys to orchestration turns and tool executions, then persisting turn state transitions:

received
processing
completed
failed
cancelled

If a timeout occurs and the client retries, the backend can return an existing result or resume from a known step instead of replaying side effects.

This matters a lot when tool calls trigger real actions, such as booking, messaging, account changes, database writes, or payment-related workflows.

A voice system should feel conversational, but the backend still needs to behave like a reliable distributed system.
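Here is a minimal sketch of idempotent turn execution using those states. `TurnStore` and `run_once` are hypothetical names, the dict is a stand-in for durable storage, and the sketch assumes a single process; a real implementation would need an atomic check-and-set in the backing store.

```python
from enum import Enum


class TurnState(str, Enum):
    RECEIVED = "received"
    PROCESSING = "processing"
    COMPLETED = "completed"
    FAILED = "failed"
    CANCELLED = "cancelled"


class TurnStore:
    """In-memory stand-in for persisted turn state, keyed by idempotency key."""

    def __init__(self) -> None:
        self._turns = {}  # key -> {"state": TurnState, "result": object}

    def run_once(self, key: str, action):
        """Run action at most once per key; replay the stored result on retries."""
        record = self._turns.get(key)
        if record is not None and record["state"] is TurnState.COMPLETED:
            return record["result"]  # retry after success: no side effects replayed
        if record is not None and record["state"] is TurnState.PROCESSING:
            raise RuntimeError(f"turn {key} is already in progress")
        record = {"state": TurnState.RECEIVED, "result": None}
        self._turns[key] = record
        record["state"] = TurnState.PROCESSING
        try:
            record["result"] = action()
        except Exception:
            record["state"] = TurnState.FAILED  # a later retry may run the action again
            raise
        record["state"] = TurnState.COMPLETED
        return record["result"]

    def state(self, key: str) -> TurnState:
        return self._turns[key]["state"]
```

Note that retrying a failed turn re-runs the action in this sketch, which is only safe if the action itself is idempotent or failed before its side effect; tool executions that touch bookings or payments usually deserve their own key passed through to the downstream system.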

Observability that actually helps

For voice systems, aggregate uptime is not enough.

I care about metrics that explain the actual user experience:

end-to-end turn latency percentiles
time to first transcript token
time to final transcript
time to first audio byte from TTS
interrupt rate per session
provider error codes by stage
reconnect frequency
average recovery time
cancelled turn count
failed orchestration count

I also like structured logs keyed by session ID and turn ID, so a problematic interaction can be reconstructed quickly.

You do not always need to store raw audio to debug effectively. In many cases, replaying timeline events is enough, and it is much safer from a privacy perspective.
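As a sketch of what timeline-style logging can look like, the helpers below emit one JSON line per event and rebuild the timeline for a single turn. `log_event` and `timeline` are hypothetical names, and the event and field names are illustrative; no audio or transcript text is stored.

```python
import json
import logging
import time

logger = logging.getLogger("voice.timeline")


def log_event(session_id: str, turn_id: str, stage: str, event: str, **fields) -> str:
    """Emit one structured log line and return it (handy for tests and replay)."""
    record = {
        "ts_ms": int(time.time() * 1000),
        "session_id": session_id,
        "turn_id": turn_id,
        "stage": stage,
        "event": event,
        **fields,
    }
    line = json.dumps(record, sort_keys=True)
    logger.info(line)
    return line


def timeline(lines, session_id: str, turn_id: str) -> list:
    """Rebuild the ordered event timeline for one turn from raw log lines."""
    events = [json.loads(line) for line in lines]
    selected = [
        e for e in events
        if e["session_id"] == session_id and e["turn_id"] == turn_id
    ]
    # sorted() is stable, so events with the same timestamp keep their log order.
    return sorted(selected, key=lambda e: e["ts_ms"])
```

With every line carrying both IDs, questions like "why did the assistant talk over the user in this turn" become a filter and a sort rather than a search through unstructured logs.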

Deployment and scaling notes

On deployment, I treat the voice path as a latency-sensitive service tier.

Keep orchestration workers close to your media region where possible, and avoid unnecessary synchronous hops.

FastAPI services running behind autoscaling can work well, but cold starts and noisy neighbours still matter for interactive voice.

I usually separate concerns into at least two deployables:

API/session control
orchestration workers

That gives you flexibility to scale orchestration independently when usage spikes.

If you run on Kubernetes, readiness checks should validate the health of downstream dependencies rather than only process liveness.

For a normal API, a process being alive may be enough to accept traffic. For real-time voice, a process that cannot reach STT, TTS, or the orchestration layer is not actually ready.
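A dependency-aware readiness check might look like the sketch below. The probe callables are assumptions: each would be a cheap async call to your STT, TTS, or LLM provider, and a FastAPI readiness route would return a 503 whenever `is_ready` is false. The route itself is left out so the example runs with the standard library alone.

```python
import asyncio


async def check_dependencies(probes: dict, timeout_s: float = 1.0) -> dict:
    """Run all probes concurrently; a probe passes if it returns without raising in time."""

    async def run(name, probe):
        try:
            await asyncio.wait_for(probe(), timeout=timeout_s)
            return name, True
        except Exception:
            # Covers provider errors and timeouts; cancellation still propagates.
            return name, False

    results = await asyncio.gather(*(run(name, probe) for name, probe in probes.items()))
    return dict(results)


def is_ready(results: dict) -> bool:
    """The instance should only accept voice traffic when every dependency passed."""
    return all(results.values())
```

Keeping the timeout short matters here: a readiness endpoint that hangs on a slow provider can itself cause the probe to fail for the wrong reason, so each probe should be cheaper than a real request.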