Building a real-time voice-agent runtime in Rust: no GIL, one binary, 2,000 calls a box

#agents #ai #performance #rust

We built Flowcat, an Apache-2.0 native-Rust runtime for real-time voice AI agents
(phone + WebRTC), as a clean-room counterpart to the architecture of pipecat
(Python). This post is about the Rust-specific design decisions that let one
process hold a flat ~0.6 ms p99 from 10 to 2,000 concurrent calls on a single box
— and, honestly, about where that number does not mean what it looks like.

Repo: https://github.com/AreevAI/flowcat

THE PROBLEM

A voice agent carries a call through a pipeline:

transport in  ->  VAD / turn-taking  ->  STT  ->  LLM  ->  TTS  ->  transport out

(or a single speech-to-speech model in the middle). At 50 audio frames/sec per
call, every stage touches every frame. Run a few hundred concurrent calls and the
runtime's per-frame overhead — not the AI — becomes the thing that stalls.

In Python that overhead is real: the GIL serializes frame routing onto one core,
so you scale by running one process per core (~14 on a 16-vCPU box), each with its
own memory baseline and connection pools. And GC pauses show up as tail latency.

THE RUST DESIGN

Three decisions did most of the work.

Each pipeline stage is its own tokio task behind a bounded channel.
Backpressure is just the channel filling up — no manual flow control.

// Each FrameProcessor owns the receive half of a bounded mpsc.
// A full channel naturally back-pressures the upstream stage.
let (tx, rx) = tokio::sync::mpsc::channel::<Frame>(CAP);

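
To make the backpressure behavior concrete, here is a minimal, dependency-free sketch. It is not Flowcat's code: it swaps in std's blocking sync_channel for tokio's bounded mpsc so it runs with no crates, and the Frame type and both function names are hypothetical. The first function shows a full queue refusing further sends; the second shows a producer that simply blocks until the consumer catches up.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};
use std::thread;

/// Hypothetical stand-in for the runtime's frame type.
#[derive(Debug, PartialEq)]
pub struct Frame(pub u32);

/// Pushes up to `n` frames into a bounded channel of capacity `cap` while
/// nothing is receiving. Returns how many sends were accepted before the
/// channel reported `Full`.
pub fn accepted_before_full(cap: usize, n: u32) -> u32 {
    // `_rx` keeps the receiver alive; it just never reads.
    let (tx, _rx) = sync_channel::<Frame>(cap);
    let mut accepted = 0;
    for i in 0..n {
        match tx.try_send(Frame(i)) {
            Ok(()) => accepted += 1,
            // A full queue is the backpressure signal. An async stage would
            // wait here (awaiting the send in tokio) instead of giving up.
            Err(TrySendError::Full(_)) => break,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    accepted
}

/// Runs a producer thread against a consumer through a bounded channel and
/// returns the frame ids in the order the consumer saw them.
pub fn pipe_through(cap: usize, n: u32) -> Vec<u32> {
    let (tx, rx) = sync_channel::<Frame>(cap);
    let producer = thread::spawn(move || {
        for i in 0..n {
            // Blocks whenever `cap` frames are already queued.
            tx.send(Frame(i)).expect("receiver dropped");
        }
        // `tx` is dropped here, which ends the consumer's iterator.
    });
    let seen: Vec<u32> = rx.iter().map(|Frame(id)| id).collect();
    producer.join().expect("producer panicked");
    seen
}
```

With a capacity of 4 and no reader, only 4 sends are accepted; with a reader attached, all frames arrive in order no matter how small the capacity is. The producer is paced by the consumer, with no explicit flow-control code.
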
The hot audio frame is an Arc, so each hop moves a pointer, not PCM.
A 20 ms mu-law frame is small, but copying it across 7 stages x 50 fps x N
calls adds up fast. Cloning an Arc is a refcount bump.

enum Frame {
    Audio(Arc<...>),  // clone = refcount bump, not a buffer copy
    Text(...),
    Control(...),
    System(...),      // Start / Stop / Cancel / Interruption
}

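
A small sketch of what "clone = refcount bump" means in practice. The payload type (Arc<[u8]>), the 160-byte frame size (20 ms of 8 kHz, 8-bit mu-law, an assumed telephony format), and the fan_out helper are all assumptions for illustration, not Flowcat's actual definitions.

```rust
use std::sync::Arc;

/// 20 ms of 8 kHz, 8-bit mu-law audio is 160 bytes (assumed format).
pub const FRAME_BYTES: usize = 160;

/// Hypothetical frame type; the real payload types are not shown in the post.
#[derive(Clone)]
pub enum Frame {
    Audio(Arc<[u8]>),
    Text(String),
}

/// Hands one audio frame to `hops` stages by cloning it, then reports
/// (strong refcount on the buffer, whether every clone shares that buffer).
pub fn fan_out(hops: usize) -> (usize, bool) {
    // One heap allocation for the PCM bytes; 0xFF is mu-law digital silence.
    let payload: Arc<[u8]> = Arc::from(vec![0xFFu8; FRAME_BYTES]);
    let frame = Frame::Audio(Arc::clone(&payload));

    // Each stage gets its own `Frame` value, but no bytes are copied.
    let copies: Vec<Frame> = (0..hops).map(|_| frame.clone()).collect();

    let shared = copies.iter().all(|f| match f {
        Frame::Audio(buf) => Arc::ptr_eq(buf, &payload),
        Frame::Text(_) => false,
    });
    // Count = `payload` + `frame` + one per clone.
    (Arc::strong_count(&payload), shared)
}
```

Calling fan_out(7) yields a refcount of 9 and confirms that all seven clones point at the same allocation, so the per-hop cost is an atomic increment rather than a 160-byte copy.
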
System frames jump the queue. Start / Cancel / Interruption / End ride a
separate priority channel and invoke start()/stop() lifecycle hooks, so an
interruption (caller barges in) isn't stuck behind a backlog of audio frames
in the normal queue.
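
The queue-jumping idea can be sketched without any dependencies. This is an illustration under assumptions, not Flowcat's implementation: it uses two std channels and checks the system one first, where an async runtime would more likely use something like tokio's select! with the biased option. The Frame variants and function names here are hypothetical.

```rust
use std::sync::mpsc::{channel, Receiver};

/// Hypothetical, simplified frame type for this sketch.
#[derive(Debug, PartialEq)]
pub enum Frame {
    Audio(u32),
    System(&'static str),
}

/// Returns the next frame to process, always draining the system queue
/// before looking at the normal queue. Returns None when both are empty.
pub fn next_frame(system: &Receiver<Frame>, normal: &Receiver<Frame>) -> Option<Frame> {
    if let Ok(frame) = system.try_recv() {
        return Some(frame);
    }
    normal.try_recv().ok()
}

/// Queues three audio frames, then an interruption, and returns the order
/// in which a stage would actually process them.
pub fn processing_order() -> Vec<Frame> {
    let (sys_tx, sys_rx) = channel();
    let (tx, rx) = channel();

    for i in 0..3 {
        tx.send(Frame::Audio(i)).expect("normal queue closed");
    }
    // Sent last, but it rides the priority channel.
    sys_tx
        .send(Frame::System("Interruption"))
        .expect("system queue closed");

    let mut order = Vec::new();
    while let Some(frame) = next_frame(&sys_rx, &rx) {
        order.push(frame);
    }
    order
}
```

Even though the interruption is enqueued after three audio frames, it is processed first; the audio backlog then drains in its original order.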

No GC means no pause; no GIL means one process uses every core. Measured, one
process scales 8.4x across 14 cores.

THE BENCHMARK

Identical Rust WebSocket + mu-law load generator, full-duplex echo, 50 fps/call,
10 s/point, on one Azure Standard_FX16mds_v2 (16 vCPU). pipecat is given its fair
multiprocess deployment (12 workers, SO_REUSEPORT, one per core) — not a single
process. p99 round-trip latency:

Concurrent calls    Flowcat (1 process)    pipecat (12 workers)
50                  0.39 ms                1.13 ms
100                 0.51 ms                33 ms
250                 0.59 ms                51 ms
500                 0.51 ms                843 ms
1000                0.47 ms                5,673 ms (77% throughput)
2000                0.61 ms                5,074 ms (41% throughput, connections failing)

Per-frame framework routing measured ~0.20 µs in Rust vs ~106 µs in Python; RAM
per idle session ~19.6 KB vs up to ~1 MB.
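
To put the per-frame routing figures in perspective, here is a back-of-the-envelope sketch that converts them into CPU cores consumed by routing alone. It assumes the measured cost applies once per frame per routing hop and counts one direction only; the function name and the hops parameter are hypothetical, and the result is a rough estimate rather than a measurement.

```rust
/// Estimated CPU cores spent purely on frame routing.
///
/// `calls`   concurrent calls
/// `fps`     audio frames per second per call (50 in the post)
/// `hops`    routing hops each frame takes (an assumption, not measured)
/// `cost_us` routing cost per frame per hop, in microseconds
pub fn routing_cores(calls: u64, fps: u64, hops: u64, cost_us: f64) -> f64 {
    let frames_per_sec = (calls * fps * hops) as f64;
    // Microseconds of CPU per wall-clock second, converted to cores.
    frames_per_sec * cost_us / 1_000_000.0
}
```

At 2,000 calls x 50 fps with a single hop, that is 100,000 frames per second. At ~0.20 µs per frame routing costs about 0.02 of a core; at ~106 µs it costs about 10.6 cores, which is most of a 16-vCPU box before any audio or socket work is done.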

The tail is the real story. Even at 10 calls, pipecat's p50/p90/p99 are all
sub-millisecond — but its p99.9 is 102 ms and max is 163 ms. About 1 frame in
1,000 eats a GC/GIL stall, which for real-time audio is an audible glitch.
Multiprocess spreads that jitter across workers; it doesn't remove it, because
it's intrinsic to each Python pipeline.
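
Why a clean p99 can hide an ugly p99.9 is easiest to see on synthetic data. The sketch below uses made-up samples (not the benchmark's): 10,000 latencies of 500 µs each, with 15 of them replaced by a 100 ms stall. The nearest-rank percentile helper and both function names are illustrative assumptions.

```rust
/// Nearest-rank percentile over an ascending-sorted, non-empty slice.
/// `permille` is the percentile in tenths of a percent: 500 = p50,
/// 990 = p99, 999 = p99.9, 1000 = max.
pub fn percentile(sorted_us: &[u64], permille: usize) -> u64 {
    assert!(!sorted_us.is_empty() && (1..=1000).contains(&permille));
    // Integer ceiling of permille/1000 * len, to avoid float rounding.
    let rank = (permille * sorted_us.len() + 999) / 1000;
    sorted_us[rank - 1]
}

/// Synthetic latencies in microseconds: 10,000 samples at 500 us, except
/// every 667th sample (15 in total) stalls for 100 ms.
pub fn synthetic_samples() -> Vec<u64> {
    let mut samples: Vec<u64> = (0..10_000)
        .map(|i| if i % 667 == 0 { 100_000 } else { 500 })
        .collect();
    samples.sort_unstable();
    samples
}
```

On these samples p50 and p99 both come out at 500 µs, while p99.9 and max are 100,000 µs. A stall that hits roughly one or two frames per thousand is invisible at p99, which is why the tail percentiles are worth reporting for real-time audio.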

WHAT THIS DOES NOT MEAN (the honest part)

That 0.6 ms is runtime/framework overhead, not end-to-end conversational latency.
What a caller hears is dominated by your STT/LLM/TTS providers (hundreds of ms) —
Rust can't change that. The claim is narrower and more useful: the runtime itself
never becomes the bottleneck or the source of a stall.

Also: the ~525x per-frame framework-routing ratio compresses hard once real
shared I/O (mu-law encode/decode, socket syscalls) is added, because that work is
near-identical in both languages. The realistic end-to-end density win is single-
to low-double-digit x, not 525x. The latency table above already includes that
shared I/O: it measures the full echo round trip through the runtime, not the
framework floor.

TRY IT

The whole benchmark kit is reproducible:

docker compose -f bench/compose.yml up --build   # on a 16-vCPU VM

Default build pulls zero provider/network deps — every provider and transport is
a dep:-gated Cargo feature. And you don't have to write Rust to use it: run
flowcat-server from a YAML config and talk to an agent in your browser.

Repo, full percentile distributions, and methodology:
https://github.com/AreevAI/flowcat

Apache-2.0, pre-1.0, built in the open. Feedback and provider PRs welcome.

Disclosure: this writeup was drafted with LLM assistance and edited by the Flowcat
maintainers; the benchmark numbers are from the reproducible kit in the repo.
pipecat is an independent open-source project used here as an architecture
reference and benchmark baseline; Flowcat is not affiliated with or endorsed by
Daily or the pipecat project.
