We built Flowcat, an Apache-2.0 native-Rust runtime for real-time voice AI agents
(phone + WebRTC), as a clean-room counterpart to the architecture of pipecat
(Python). This post is about the Rust-specific design decisions that let one
process hold a flat ~0.6 ms p99 from 10 to 2,000 concurrent calls on a single box
— and, honestly, about where that number does not mean what it looks like.
Repo: https://github.com/AreevAI/flowcat
THE PROBLEM
A voice agent carries a call through a pipeline:
    transport in -> VAD / turn-taking -> STT -> LLM -> TTS -> transport out
(or a single speech-to-speech model in the middle). At 50 audio frames/sec per
call, every stage touches every frame. Run a few hundred concurrent calls and the
runtime's per-frame overhead — not the AI — becomes the thing that stalls.
In Python that overhead is real: the GIL serializes frame routing onto one core,
so you scale by running one process per core (~14 on a 16-vCPU box), each with its
own memory baseline and connection pools. And GC pauses show up as tail latency.
THE RUST DESIGN
Three decisions did most of the work.
- Each pipeline stage is its own tokio task behind a bounded channel.
  Backpressure is just the channel filling up; there is no manual flow control.

      // Each FrameProcessor owns the receive half of a bounded mpsc.
      // A full channel naturally back-pressures the upstream stage.
      let (tx, rx) = tokio::sync::mpsc::channel::<Frame>(CAP);

- The hot audio frame is an Arc, so each hop moves a pointer, not PCM.
  A 20 ms mu-law frame is small, but copying it across 7 stages x 50 fps x N
  calls adds up fast. Cloning an Arc is a refcount bump.

      enum Frame {
          Audio(Arc<...>), // clone = refcount bump, not a buffer copy
          Text(...),
          Control(...),
          System(...),     // Start / Stop / Cancel / Interruption
      }

- System frames jump the queue. Start / Cancel / Interruption / End ride a
  separate priority channel and invoke start()/stop() lifecycle hooks, so an
  interruption (caller barges in) isn't stuck behind a backlog of audio frames
  in the normal queue.
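To make the three decisions concrete, here is a minimal sketch that is not Flowcat's actual code. It uses only the standard library so it compiles without dependencies: `std::sync::mpsc::sync_channel` stands in for the bounded tokio channel, and polling the system channel first stands in for whatever priority mechanism the runtime really uses (in async code, a `tokio::select!` with `biased;` would be one way to get the same ordering). The `Frame` variants, the `Arc<[u8]>` payload, `next_frame`, `demo`, and the 160-byte frame size (20 ms of mu-law at an assumed 8 kHz) are all assumptions made for illustration.

```rust
use std::sync::mpsc::{sync_channel, Receiver, TrySendError};
use std::sync::Arc;

// Hypothetical frame type for this sketch; the Arc<[u8]> payload is an assumption.
#[derive(Debug)]
enum Frame {
    Audio(Arc<[u8]>), // clone = refcount bump, not a buffer copy
    Interruption,     // stands in for the system-frame variants
}

/// Pick a stage's next frame: the priority channel is polled first, so a
/// system frame never waits behind queued audio.
fn next_frame(system: &Receiver<Frame>, normal: &Receiver<Frame>) -> Option<Frame> {
    system.try_recv().ok().or_else(|| normal.try_recv().ok())
}

/// Returns (backpressured, jumped_queue, shared_buffer).
fn demo() -> (bool, bool, bool) {
    const CAP: usize = 2; // tiny capacity so the queue fills immediately
    let (tx, rx) = sync_channel::<Frame>(CAP);
    let (sys_tx, sys_rx) = sync_channel::<Frame>(CAP);

    // 160 bytes = 20 ms of mu-law, assuming 8 kHz sampling.
    let pcm: Arc<[u8]> = Arc::from(vec![0xFF_u8; 160]);

    // Fill the normal queue; every queued frame shares the same buffer.
    for _ in 0..CAP {
        tx.try_send(Frame::Audio(Arc::clone(&pcm))).unwrap();
    }
    // A full bounded channel refuses the next frame: that is the backpressure signal.
    let backpressured = matches!(
        tx.try_send(Frame::Audio(Arc::clone(&pcm))),
        Err(TrySendError::Full(_))
    );

    // The interruption is sent after the audio but delivered before it.
    sys_tx.try_send(Frame::Interruption).unwrap();
    let jumped_queue = matches!(next_frame(&sys_rx, &rx), Some(Frame::Interruption));

    // The next frame is audio, and it points at the original allocation.
    let shared_buffer = match next_frame(&sys_rx, &rx) {
        Some(Frame::Audio(buf)) => Arc::ptr_eq(&buf, &pcm),
        _ => false,
    };
    (backpressured, jumped_queue, shared_buffer)
}
```

Calling `demo()` exercises all three properties in order. In an async runtime the full-channel case would suspend the sending task at `send().await` rather than return an error, which is what lets backpressure propagate upstream without any explicit flow-control code.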
No GC means no pause; no GIL means one process uses every core. Measured, one
process scales 8.4x across 14 cores.
THE BENCHMARK
Identical Rust WebSocket + mu-law load generator, full-duplex echo, 50 fps/call,
10 s/point, on one Azure Standard_FX16mds_v2 (16 vCPU). pipecat is given its fair
multiprocess deployment (12 workers, SO_REUSEPORT, one per core) — not a single
process. p99 round-trip latency:
| Concurrent calls | Flowcat (1 process) | pipecat (12 workers) |
|---|---|---|
| 50 | 0.39 ms | 1.13 ms |
| 100 | 0.51 ms | 33 ms |
| 250 | 0.59 ms | 51 ms |
| 500 | 0.51 ms | 843 ms |
| 1000 | 0.47 ms | 5,673 ms (77% throughput) |
| 2000 | 0.61 ms | 5,074 ms (41%, conns failing) |
Per-frame framework routing measured ~0.20 us in Rust vs ~106 us in Python; RAM
per idle session ~19.6 KB vs up to ~1 MB.
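For a rough sense of what those per-frame figures imply, the sketch below computes how many calls one core could carry if framework routing were the only cost. It assumes the quoted figure is the total routing cost per frame (not per stage hop) and ignores codec, socket, and provider work entirely, so treat the result as a ceiling set by routing alone, not a capacity prediction. The function name is ours.

```rust
/// Upper bound on concurrent calls per core if per-frame routing were the
/// only CPU cost. `per_frame_us` is routing time per frame in microseconds.
fn routing_bound_calls_per_core(per_frame_us: f64, fps: f64) -> f64 {
    // One core has 1,000,000 us of CPU time per second; each call
    // consumes per_frame_us * fps of it.
    1_000_000.0 / (per_frame_us * fps)
}
```

With the rounded numbers above at 50 fps, `routing_bound_calls_per_core(106.0, 50.0)` comes to roughly 189 calls per core and `routing_bound_calls_per_core(0.20, 50.0)` to roughly 100,000. The ratio of the rounded inputs is 530x, in line with the ~525x framework-routing ratio quoted later.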
The tail is the real story. Even at 10 calls, pipecat's p50/p90/p99 are all
sub-millisecond — but its p99.9 is 102 ms and max is 163 ms. About 1 frame in
1,000 eats a GC/GIL stall, which for real-time audio is an audible glitch.
Multiprocess spreads that jitter across workers; it doesn't remove it, because
it's intrinsic to each Python pipeline.
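To see how p99 can look healthy while p99.9 does not, here is a small sketch using a nearest-rank percentile over synthetic samples. The data is made up for illustration (10,000 round trips, 15 of them stalled at 102 ms) and is not taken from the benchmark; `percentile` and `synthetic_samples` are hypothetical helper names.

```rust
/// Nearest-rank percentile over a non-empty, ascending-sorted slice.
/// `per_mille` is the percentile in tenths of a percent (990 = p99, 999 = p99.9).
fn percentile(sorted_ms: &[f64], per_mille: usize) -> f64 {
    // Integer ceil(n * per_mille / 1000) avoids floating-point rank errors.
    let rank = (sorted_ms.len() * per_mille + 999) / 1000;
    sorted_ms[rank.max(1) - 1]
}

/// Synthetic data only: 9,985 fast round trips and 15 stalled ones.
fn synthetic_samples() -> Vec<f64> {
    let mut samples = vec![0.5; 9_985];
    samples.extend(vec![102.0; 15]);
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    samples
}
```

On this data `percentile(&samples, 990)` is 0.5 ms while `percentile(&samples, 999)` is 102 ms: the stalls are too rare to move p99 but dominate everything past it. At 50 fps, one stalled frame in 1,000 works out to one glitch roughly every 20 seconds on each call.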
WHAT THIS DOES NOT MEAN (the honest part)
That 0.6 ms is runtime/framework overhead, not end-to-end conversational latency.
What a caller hears is dominated by your STT/LLM/TTS providers (hundreds of ms) —
Rust can't change that. The claim is narrower and more useful: the runtime itself
never becomes the bottleneck or the source of a stall.
Also: the ~525x per-frame framework-routing ratio compresses hard once real
shared I/O (mu-law encode/decode, socket syscalls) is added, because that work is
near-identical in both languages. The realistic end-to-end density win is single-
to low-double-digit x, not 525x. The latency table above measures the full
WebSocket echo round trip, not just the framework-routing floor.
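The compression is easy to see with a little arithmetic. The sketch below adds the same shared per-frame cost to both sides and recomputes the ratio. The shared-cost values are hypothetical, chosen only to show the shape of the curve; we are not claiming they match the measured codec or syscall cost.

```rust
/// Per-frame cost ratio once an identical shared cost (codec, syscalls)
/// is added to both runtimes. All arguments are in microseconds.
fn density_ratio(python_us: f64, rust_us: f64, shared_us: f64) -> f64 {
    (python_us + shared_us) / (rust_us + shared_us)
}
```

With the rounded routing figures (106 us and 0.20 us), a shared cost of 0 us gives 530x, 10 us gives about 11.4x, and 50 us gives about 3.1x. Any shared cost that is large next to 0.20 us pulls the ratio down into the single- to low-double-digit range.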
TRY IT
The whole benchmark kit is reproducible:
    docker compose -f bench/compose.yml up --build   # on a 16-vCPU VM
Default build pulls zero provider/network deps — every provider and transport is
a dep:-gated Cargo feature. And you don't have to write Rust to use it: run
flowcat-server from a YAML config and talk to an agent in your browser.
Repo, full percentile distributions, and methodology:
https://github.com/AreevAI/flowcat
Apache-2.0, pre-1.0, built in the open. Feedback and provider PRs welcome.
Disclosure: this writeup was drafted with LLM assistance and edited by the Flowcat
maintainers; the benchmark numbers are from the reproducible kit in the repo.
pipecat is an independent open-source project used here as an architecture
reference and benchmark baseline; Flowcat is not affiliated with or endorsed by
Daily or the pipecat project.