
Nelson

Posted on • Originally published at github.com

We Open-Sourced Our Production Voice AI Stack (Rust Runtime, Sub-Second Latency)

TL;DR — We open-sourced Feros, a full Voice Agent OS you can self-host in one docker compose up. It has a Rust voice engine for sub-second latency, a Python control plane, a Next.js dashboard, and an AI builder that writes your agent for you. Apache 2.0.

⭐ If this looks useful, star us on GitHub — it's how others find the project.


The voice AI tax is real — and we got tired of paying it

If you've shipped a voice agent at any non-trivial scale, you've hit the wall:

  • Managed platforms (Vapi, Retell) are magical to start with and brutal to scale. Per-minute billing that looks like pocket change at 1,000 calls becomes a six-figure line item at 100,000. And if you're in healthcare, fintech, or anything with data residency requirements? "We handle it in our cloud" isn't good enough.

  • Low-level frameworks (Pipecat, LiveKit) give you the Lego bricks but not the house. You spend three weeks plumbing VAD → STT → LLM → TTS before writing a single line of actual agent logic. Then you maintain that plumbing forever.

  • Visual node builders (older-generation platforms) make you hand-wire every branch, intent, and call flow in a drag-and-drop UI. It gets unmaintainable the moment your agent needs to do anything non-trivial.

We built Feros to collapse all three layers into one self-hostable system that doesn't make you choose between speed, cost, and control.


What Feros actually is

Feros is a Voice Agent OS — a complete, production-ready stack that handles everything from the WebRTC/telephony layer to the agent builder UI.

```
Browser / Phone
       │
  voice-server   ← Rust: telephony gateway, WebSocket router
       │
  voice-engine   ← Rust: VAD → STT → LLM → TTS orchestration
       │
  studio-api     ← Python (FastAPI): agent config, sessions, evals
       │
  studio-web     ← Next.js: dashboard, AI builder, live call monitor
```

Every component is swappable. STT vendor going down? Change one config line. Want to use a local Whisper instance to eliminate STT costs entirely? There's an optional self-hosted inference stack included.
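To give a feel for what "change one config line" means in practice, here's a hypothetical `.env` fragment — the variable names are ours for illustration, not necessarily the shipped ones:

```shell
# Hypothetical provider config (illustrative names, check .env.example for the real ones).
STT_PROVIDER=deepgram        # swap to a local Whisper backend to eliminate STT costs
TTS_PROVIDER=elevenlabs
LLM_PROVIDER=openai
LLM_MODEL=gpt-4o-mini
```

The point is that provider choice lives in configuration, not code, so a vendor outage is a redeploy rather than a rewrite.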


The voice engine is Rust — and yes, that matters

The hot path — VAD detection, streaming STT, LLM inference, TTS synthesis, audio mixing — runs entirely in a Tokio async runtime written in Rust.

Why Rust here specifically?

Latency predictability. GC pauses in the hot path are not a latency spike you can explain away. At 20ms audio frames, a 50ms GC pause is audible and destroys the "natural conversation" illusion. Rust has no garbage collector, so the hot path's timing stays deterministic.
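The back-of-envelope arithmetic behind that claim:

```python
# A 50 ms GC pause at a 20 ms frame cadence blows through multiple frame
# deadlines in a row — enough to produce an audible gap or stutter.
FRAME_MS = 20
GC_PAUSE_MS = 50

frames_missed = -(-GC_PAUSE_MS // FRAME_MS)  # ceiling division
print(frames_missed)  # → 3
```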

Memory safety without overhead. A live call session manages multiple async streams simultaneously — inbound audio chunks, STT partial results, LLM streaming tokens, TTS audio segments, WebRTC pacing. Getting these wrong means memory corruption or deadlocks. Rust's ownership model enforces correct concurrency at compile time.
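The shape of that concurrency — independent stages connected by channels, each running as its own task — can be sketched in Python with `asyncio`. This is illustrative only: the real engine is Rust/Tokio, and the stage names and transforms here are ours.

```python
import asyncio

# Toy pipeline mirroring the VAD → STT → LLM hot-path shape: each stage
# reads from an inbound queue and writes to an outbound one, so stages
# run concurrently the way Tokio tasks do in the real engine.

async def stage(inbox, outbox, transform):
    while True:
        item = await inbox.get()
        if item is None:              # sentinel: propagate shutdown downstream
            await outbox.put(None)
            return
        await outbox.put(transform(item))

async def run_pipeline(frames):
    q_vad, q_stt, q_llm, q_out = (asyncio.Queue() for _ in range(4))
    tasks = [
        asyncio.create_task(stage(q_vad, q_stt, lambda f: f)),            # VAD: pass speech frames through
        asyncio.create_task(stage(q_stt, q_llm, lambda f: f"text({f})")), # STT: frames → transcript
        asyncio.create_task(stage(q_llm, q_out, lambda t: f"reply({t})")),# LLM: transcript → response
    ]
    for f in frames:
        await q_vad.put(f)
    await q_vad.put(None)
    out = []
    while (item := await q_out.get()) is not None:
        out.append(item)
    await asyncio.gather(*tasks)
    return out

results = asyncio.run(run_pipeline(["frame1", "frame2"]))
print(results)  # → ['reply(text(frame1))', 'reply(text(frame2))']
```

In Rust the equivalent mistakes — a stage reading a buffer another stage is still writing, or two tasks deadlocking on each other's channels — are largely caught at compile time rather than on a live call.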

The number we care about: response latency — the time from when the user stops speaking to when the agent starts responding. Past a few hundred milliseconds the conversation stops feeling natural, so the Feros pipeline is optimized end to end to keep that gap under a second, and we're constantly pushing it lower.


The AI builder is the part that might surprise you

Instead of dragging nodes in a canvas, you describe your agent in plain language. The AI builder reads your intent and autonomously provisions:

  • The system prompt
  • Tool definitions (CRM lookups, calendar booking, webhook calls, etc.)
  • Routing logic between conversation states

This isn't a gimmick — it's genuinely the fastest path from "I need a voice agent that books appointments and checks account status" to a working, testable agent. You still have full access to the underlying configuration and can edit anything the AI generated.
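To make that concrete, here is a plausible shape for what the builder emits — the field names are illustrative, not the actual Feros schema:

```python
# Hypothetical agent config as the AI builder might generate it.
# Everything is plain data, so you can inspect and edit any of it afterwards.
agent_config = {
    "system_prompt": "You book appointments and check account status for callers.",
    "tools": [
        {"name": "crm_lookup", "type": "webhook", "url": "https://example.com/crm"},
        {"name": "book_appointment", "type": "calendar"},
    ],
    "states": {
        "greet":    {"next": ["identify"]},
        "identify": {"next": ["book", "account_status"]},
    },
}

print(sorted(agent_config.keys()))  # → ['states', 'system_prompt', 'tools']
```

The win is that the AI writes the first draft of all three layers — prompt, tools, routing — and you review data instead of hand-wiring nodes.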


One command to run the whole stack

```
git clone https://github.com/ferosai/feros.git
cd feros
cp .env.example .env
docker compose up -d
```

Open http://localhost:3000. That's the full stack:

| Service | URL |
| --- | --- |
| Studio Web | http://localhost:3000 |
| Studio API | http://localhost:8000 |
| Voice Server | http://localhost:8300 |

We publish pre-built multi-arch images so the default path doesn't require compiling Rust locally. If you need to build from source (e.g., you're modifying the engine):

```
docker compose -f docker-compose.yml -f docker-compose.source.yml up -d --build
```

The integrations layer: your secrets stay yours

Every third-party integration — CRMs, calendars, webhooks — goes through an encrypted credential vault. Secrets are encrypted at rest and decrypted only inside the runtime. They never hit external audit logs or managed cloud infrastructure in plaintext. This was a non-negotiable for our early enterprise users.
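A minimal sketch of the seal-at-rest / open-in-runtime pattern, assuming nothing about the real vault's implementation. The HMAC-derived XOR keystream here is a stand-in for a proper authenticated cipher (AES-GCM or similar) — do not use it for actual secrets:

```python
import base64
import hashlib
import hmac
import os

def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    # Derive a pseudorandom keystream from key + nonce (illustrative only).
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]

def seal(key: bytes, secret: str) -> str:
    """Encrypt a secret for storage; this blob is what lands in the database."""
    nonce = os.urandom(16)
    ct = bytes(a ^ b for a, b in zip(secret.encode(), _keystream(key, nonce, len(secret))))
    return base64.b64encode(nonce + ct).decode()

def open_sealed(key: bytes, blob: str) -> str:
    """Decrypt a stored secret — only ever called inside the runtime."""
    raw = base64.b64decode(blob)
    nonce, ct = raw[:16], raw[16:]
    return bytes(a ^ b for a, b in zip(ct, _keystream(key, nonce, len(ct)))).decode()

key = os.urandom(32)                 # in this model, the key lives only in the runtime
stored = seal(key, "crm-api-token")  # ciphertext at rest; plaintext never leaves the process
assert open_sealed(key, stored) == "crm-api-token"
```

The property that matters is the boundary: plaintext exists only inside the running process, never in the database, logs, or any managed cloud.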


What we're building next

The roadmap is public and tracked in the repo:

  • Outbound calls — agent-initiated dialing with retry and scheduling
  • Dynamic Agent Variables — resolve runtime context at session start for personalized conversations
  • Gemini Live native audio — end-to-end multimodal backend (actively in progress)
  • Direct PSTN via SIP — eliminating the Twilio/Telnyx dependency entirely
  • Agent-to-agent evaluation — a tester agent calls your target agent over live audio to evaluate regressions
  • Evaluation replay — run historical transcripts against new agent versions
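To make the Dynamic Agent Variables item concrete, here's one plausible shape — the names and lookup are guesses for illustration, not the shipped API:

```python
# Hypothetical: resolve per-caller context at session start and splice it
# into the agent's prompt template. The CRM is stubbed with a dict.
template = "You are speaking with {caller_name}, a {plan} customer since {since}."

def resolve_variables(caller_id: str) -> dict:
    crm = {"+15550100": {"caller_name": "Ada", "plan": "Pro", "since": "2021"}}
    return crm.get(caller_id, {"caller_name": "there", "plan": "Free", "since": "today"})

prompt = template.format(**resolve_variables("+15550100"))
print(prompt)  # → "You are speaking with Ada, a Pro customer since 2021."
```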

Why open source, and why Apache 2.0?

Because the voice AI infrastructure layer should not be a moat. It should be a foundation.

We've been on the receiving end of per-minute pricing that punished growth, "enterprise plans" that required a sales call before you could see a price, and APIs that broke in production with no recourse. We built the thing we wanted to exist.

Apache 2.0 means you can self-host it, build products on it, and modify it without legal friction.


Stack summary (for the skim readers)

| Layer | Technology |
| --- | --- |
| Voice Engine | Rust / Tokio |
| Voice Server | Rust |
| Control Plane | Python / FastAPI |
| Dashboard | Next.js / TypeScript |
| Database | PostgreSQL |
| Inference (optional) | Whisper + Fish TTS on GPU |
| Protocol | Protobuf over WebSocket + WebRTC |
| License | Apache 2.0 |

Give it a spin

```
git clone https://github.com/ferosai/feros.git
```

We're actively building in public. If you run into anything, open an issue. If you have a provider or integration you need, open a discussion before implementing — we want to make sure the architecture stays coherent as the project grows.

If this is interesting to you: ⭐ on GitHub helps more people find it. That's the whole ask.

github.com/ferosai/feros


What voice AI problems are you dealing with right now? Cloud costs? Latency? Data residency? Drop it in the comments — we're actively informing the roadmap from real use cases.
