<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: fraser sequeira</title>
    <description>The latest articles on DEV Community by fraser sequeira (@fraser_sequeira_19d159328).</description>
    <link>https://dev.to/fraser_sequeira_19d159328</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819143%2Febe0f5b7-982f-4677-a1b2-026bd3dd0762.jpg</url>
      <title>DEV Community: fraser sequeira</title>
      <link>https://dev.to/fraser_sequeira_19d159328</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fraser_sequeira_19d159328"/>
    <language>en</language>
    <item>
      <title>Building Voice Agents with Rooh</title>
      <dc:creator>fraser sequeira</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:42:19 +0000</pubDate>
      <link>https://dev.to/fraser_sequeira_19d159328/building-voice-agents-with-rooh-1nkf</link>
      <guid>https://dev.to/fraser_sequeira_19d159328/building-voice-agents-with-rooh-1nkf</guid>
      <description>&lt;p&gt;&lt;em&gt;The soul of a voice agent is not in its speech synthesis or its language model. It is in the space between the silence it chooses to honour.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A prevailing shortcoming of today’s voice agents is deceptively simple: they don’t truly understand the rhythm of human conversation.&lt;/p&gt;

&lt;p&gt;I experienced this firsthand during a skill-based interview conducted entirely by a voice agent. The agent was articulate, its questions were well-formed, and the experience began promisingly. But the moment I paused, not because I was finished but because the question demanded deliberation, the agent interpreted my silence as a completed response and barrelled into the next question. What followed was a cascade of half-formed answers, each truncated prematurely. By the end of the session, the agent had dutifully catalogued a series of fragments, and I was left with a genuinely dispiriting experience.&lt;/p&gt;

&lt;p&gt;This is not an edge case. It is the default behaviour of most voice pipelines today. Silence is treated as a terminal signal rather than what it often is: the cognitive pause between thought and articulation.&lt;/p&gt;

&lt;p&gt;As this domain matures (agents conversing with humans, agents orchestrating other agents, agents mediating multi-party workflows), the bar for conversational intelligence must rise accordingly. We need voice pipelines that are not merely transactional, but empathetic, patient, and context-aware. Pipelines that understand that a pause is not always an ending.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://github.com/RoohAI/roohai-framework" rel="noopener noreferrer"&gt;Rooh&lt;/a&gt; comes in. The name means “soul”, and that’s precisely what it aspires to give your voice agents. Rooh is an open-source Python framework for building real-time voice pipelines that supports both edge deployment in fully offline mode and cloud-based inference. It is not opinionated about which models you use; it is opinionated about giving you the architectural primitives to build agents that genuinely listen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/roohai/" rel="noopener noreferrer"&gt;Rooh&lt;/a&gt; orchestrates the entire voice pipeline flow. Each stage is a swappable abstraction backed by a registry of concrete implementations: Deepgram or Whisper for STT, Claude or GPT-4o or Ollama for LLM, Cartesia or Piper for TTS. You choose the providers; Rooh handles the wiring, the streaming overlap, the barge-in detection, and the lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Initializing the environment
&lt;/h2&gt;

&lt;p&gt;RoohAI requires Python 3.11+. Let’s start by creating a virtual environment in which we’ll install Rooh and all its necessary dependencies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3.13 -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Rooh installation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install roohai&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That single &lt;code&gt;pip install&lt;/code&gt; gives you every built-in STT, TTS, LLM, and VAD provider. No extras, no conditional dependencies to chase. For NVIDIA NeMo models (Canary, Parakeet), there is an optional extra.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Starting the Server
&lt;/h2&gt;

&lt;p&gt;Rooh ships with a FastAPI-backed server and a browser-based UI. Launch it with a single command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;roohai&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once running, open &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. You are greeted with a dark-themed interface, a conversation panel, a sidebar listing your agents, and a button to create new ones. No build step, no frontend toolchain. It simply works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix9yud154a9axinv9brx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix9yud154a9axinv9brx.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: The Agent Builder Wizard: No Code Required
&lt;/h2&gt;

&lt;p&gt;The fastest path to a working voice agent is the Agent Builder Wizard, a guided, multi-step flow accessible directly from the browser. You do not need to write a single line of code or touch a YAML file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wizard walks you through four steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;STT&lt;/strong&gt;: Choose your speech recognition provider. For this walkthrough, select Deepgram Nova-3 (cloud, streaming, highest accuracy).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt;: Choose your language model. Select Amazon Bedrock and pick the Claude model you want, for instance global.anthropic.claude-haiku-4-5-20251001-v1:0 for a fast, cost-effective global endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTS&lt;/strong&gt;: Choose your voice. Select Cartesia Sonic for natural, low-latency cloud synthesis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review &amp;amp; Create&lt;/strong&gt;: Name your agent, write a system prompt that defines its personality, and configure the pipeline settings below:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VAD Sensitivity&lt;/strong&gt;: A slider from 0.0 to 1.0. Lower values are more sensitive (picks up quiet speech and echo); higher values require louder, more distinct speech. A default of 0.70 works well on speakers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transport&lt;/strong&gt;: Choose between &lt;strong&gt;WebSocket&lt;/strong&gt; and &lt;strong&gt;WebRTC&lt;/strong&gt;. WebRTC offers lower steady-state latency (~500ms) with the native Opus codec, though the initial handshake takes slightly longer than a WebSocket connection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hit &lt;strong&gt;Create Agent&lt;/strong&gt;, and that is it. Under the hood, the wizard translates your selections into a YAML configuration file stored in &lt;code&gt;~/.roohai/agents/&lt;/code&gt;, which the &lt;code&gt;Rooh&lt;/code&gt; pipeline class reads at activation time. For most production use cases, the wizard-generated configuration is all you need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-03-11T10:15:04.551587+00:00'&lt;/span&gt;
&lt;span class="na"&gt;llm_streaming&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;bedrock-claude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;auth_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock-claude&lt;/span&gt;
    &lt;span class="na"&gt;model_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;global.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm&lt;/span&gt;
  &lt;span class="na"&gt;cartesia&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cartesia&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;en&lt;/span&gt;
    &lt;span class="na"&gt;model_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonic-2&lt;/span&gt;
    &lt;span class="na"&gt;sample_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;24000'&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tts&lt;/span&gt;
    &lt;span class="na"&gt;voice_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a0e99841-438c-4a64-b679-ae501e7d6091&lt;/span&gt;
  &lt;span class="na"&gt;deepgram-nova-3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepgram-nova-3&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;en&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nova-3&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stt&lt;/span&gt;
  &lt;span class="na"&gt;silero&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;silero&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vad&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Voice-Agent-DCC&lt;/span&gt;
&lt;span class="na"&gt;pipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock-claude&lt;/span&gt;
  &lt;span class="na"&gt;stt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepgram-nova-3&lt;/span&gt;
  &lt;span class="na"&gt;tts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cartesia&lt;/span&gt;
  &lt;span class="na"&gt;vad&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;silero&lt;/span&gt;
  &lt;span class="na"&gt;vad_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;
&lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;You will be helpful and concise&lt;/span&gt;
&lt;span class="na"&gt;transport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;websocket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dive Deep: The Rooh Builder API
&lt;/h2&gt;

&lt;p&gt;The wizard is the quickest route, but when you need programmatic control (dynamic configuration, custom hooks, CI/CD-driven deployments, or embedding Rooh inside a larger application), the fluent Builder API lets you construct pipelines entirely in Python. No YAML files, no server UI, just code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: A Cloud-Powered Voice Agent&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roohai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Rooh&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cartesia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;79a125e8-cd45-4c13-8a67-188112f4dd22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a friendly voice assistant. Keep responses concise &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and conversational — they will be spoken aloud.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key aspects of the code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;vad_threshold(0.7)&lt;/strong&gt;: The sensitivity dial. Lower values make the VAD more trigger-happy (it detects speech more readily); higher values require more confidence. 0.7 is a sensible default for quiet environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;barge_in(True)&lt;/strong&gt;: Enables interruption. If the user starts speaking while the agent is still talking, the pipeline cancels in-progress TTS, sends an interrupt signal to the client, and immediately begins processing the new utterance. This is what makes a voice agent feel responsive rather than robotic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;silence_duration(1000)&lt;/strong&gt;: The patience parameter, measured in milliseconds. After the VAD detects silence, the pipeline waits this long before concluding the user has finished speaking. At 1000ms, it is adequate for casual conversation. For interviews or complex domains where users need thinking time, you would increase this substantially: 2000ms, 3000ms, or more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;system_prompt(…)&lt;/strong&gt;: The behavioural directive passed to the LLM on every turn. This shapes the agent’s personality, verbosity, and domain focus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;pipeline.load()&lt;/strong&gt;: Loads all configured models into memory. For cloud models (Deepgram, Bedrock, Cartesia), this initialises API clients. For local models (Whisper, Piper, Silero), this downloads weights on first use and loads them onto the device.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
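
&lt;p&gt;The patience parameter is the one to reach for in the interview scenario from the introduction. As a configuration sketch reusing the Builder API shown above (the provider choices, voice ID, and the 3000ms value are illustrative, not prescriptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from roohai import Rooh

# A more patient pipeline for interview-style conversations:
# give the speaker three full seconds of silence before the
# pipeline treats the utterance as complete.
interview_pipeline = (
    Rooh.builder()
    .stt("deepgram", model="nova-3", language="en")
    .llm("bedrock", model_id="anthropic.claude-3-haiku-20240307-v1:0", region="us-west-2")
    .tts("cartesia", voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22")
    .vad("silero")
    .vad_threshold(0.7)
    .barge_in(True)
    .silence_duration(3000)  # milliseconds of silence before end-of-turn
    .system_prompt("You are a calm interviewer. Never rush the candidate.")
    .build()
)
interview_pipeline.load()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Barge-in stays enabled, so the longer silence window does not make the agent feel sluggish: the user can always interrupt.&lt;/p&gt;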

&lt;p&gt;&lt;strong&gt;Example: A Fully Offline Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not every deployment has internet access. Medical devices, factory floors, classified environments: these demand pipelines that run entirely on the edge. Rooh handles this with the same Builder API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/whisper-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;piper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_US-amy-medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant running locally. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Keep responses under two sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys. No network calls. Whisper runs inference locally via HuggingFace Transformers, Ollama serves any open model (Llama 3, Mistral, Gemma, Qwen) on localhost, and Piper synthesises speech using lightweight ONNX models that download once and run offline thereafter.&lt;/p&gt;

&lt;p&gt;The tradeoff is latency: local LLMs on CPU are measurably slower than a Bedrock API call. But the privacy and availability guarantees are absolute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Provider Ecosystem&lt;/strong&gt;&lt;br&gt;
Rooh ships with a curated set of built-in providers. Each is referenced by a string name in the Builder API, and each accepts provider-specific keyword arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Speech-to-Text&lt;/span&gt;

| Provider | Kind | Default Model | Key Parameters |
|----------|------|---------------|----------------|
| &lt;span class="sb"&gt;`"deepgram"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`nova-3`&lt;/span&gt; | &lt;span class="sb"&gt;`model`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"whisper"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`openai/whisper-tiny`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"wav2vec2"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`facebook/wav2vec2-base-960h`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"nvidia-parakeet"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`nvidia/parakeet-tdt-0.6b-v2`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deepgram is the only STT provider with real-time streaming: interim transcriptions appear as the user speaks, and utterance boundaries are detected server-side. The batch providers (Whisper, Wav2Vec2, NVIDIA) accumulate audio during speech and transcribe once silence is detected.&lt;br&gt;
&lt;/p&gt;
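
&lt;p&gt;Rooh’s internal scheduling isn’t shown here, but the accumulate-then-transcribe pattern the batch providers follow can be sketched in a few lines of plain Python (the function and variable names are illustrative, not Rooh APIs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def run_batch_stt(frames, is_speech, transcribe):
    """Accumulate audio frames while the VAD reports speech;
    transcribe each completed utterance once silence follows."""
    buffer, transcripts = [], []
    for frame, speaking in zip(frames, is_speech):
        if speaking:
            buffer.append(frame)
        elif buffer:
            # Silence after speech: the utterance is complete.
            transcripts.append(transcribe(buffer))
            buffer = []
    if buffer:  # flush a trailing utterance at end of stream
        transcripts.append(transcribe(buffer))
    return transcripts
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With text tokens standing in for audio frames and a join standing in for the model, two utterances separated by silence yield two transcripts; the streaming case differs in that partial results would be emitted inside the loop rather than at utterance boundaries.&lt;/p&gt;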

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Large Language Models&lt;/span&gt;

| Provider | Kind | Default Model | Key Parameters |
|----------|------|---------------|----------------|
| &lt;span class="sb"&gt;`"bedrock"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`anthropic.claude-3-haiku-20240307-v1:0`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`region`&lt;/span&gt;, &lt;span class="sb"&gt;`auth_mode`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"openai"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`gpt-4o`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt;, &lt;span class="sb"&gt;`base_url`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"anthropic"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`claude-sonnet-4-20250514`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"gemini"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`gemini-2.5-flash`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"ollama"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`llama3`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`host`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"local"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`TinyLlama/TinyLlama-1.1B-Chat-v1.0`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Text-to-Speech&lt;/span&gt;

| Provider | Kind | Default Voice/Model | Key Parameters |
|----------|------|---------------------|----------------|
| &lt;span class="sb"&gt;`"cartesia"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`sonic-2`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`voice_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt;, &lt;span class="sb"&gt;`sample_rate`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"deepgram-tts"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`aura-2-thalia-en`&lt;/span&gt; | &lt;span class="sb"&gt;`model`&lt;/span&gt;, &lt;span class="sb"&gt;`sample_rate`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"piper"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`en_US-lessac-medium`&lt;/span&gt; | &lt;span class="sb"&gt;`voice`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"speecht5"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`microsoft/speecht5_tts`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`vocoder`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"bark"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`suno/bark-small`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extensibility: Bringing Your Own Models&lt;/strong&gt;&lt;br&gt;
The built-in providers cover the most common use cases, but production systems often require bespoke integrations: a proprietary STT engine, a fine-tuned TTS model, a domain-specific LLM behind a custom API.&lt;/p&gt;

&lt;p&gt;Rooh’s architecture is designed for this. Every model is a subclass of one of four abstract base classes (&lt;code&gt;STTModel&lt;/code&gt;, &lt;code&gt;TTSModel&lt;/code&gt;, &lt;code&gt;LLMModel&lt;/code&gt;, &lt;code&gt;VADModel&lt;/code&gt;), each requiring just four methods. And as of v0.1.7, the Builder API supports &lt;strong&gt;custom model&lt;/strong&gt; classes directly, with no framework modifications, no forking, and no monkey-patching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roohai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TTSModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Custom TTS provider using ElevenLabs API.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;META&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ElevenLabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ultra-realistic voice synthesis.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voice_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;elevenlabs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElevenLabs&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ELEVENLABS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ElevenLabs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;
        &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_to_speech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eleven_multilingual_v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_loaded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# Use it alongside built-in providers — no framework changes
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;custom_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elevenlabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pNInz6obpgDQGcFmaJgB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;custom_tts()&lt;/code&gt; method (and its siblings &lt;code&gt;custom_stt()&lt;/code&gt;, &lt;code&gt;custom_llm()&lt;/code&gt;, &lt;code&gt;custom_vad()&lt;/code&gt;) accepts a name, a class, and any constructor kwargs. It registers the class in Rooh’s global model registry, making it a first-class citizen eligible for hot-swapping, visible in the catalog API, and tracked by pipeline metrics.&lt;/p&gt;

&lt;p&gt;This is the extensibility contract: implement four methods, pass the class to the Builder, and you are done.&lt;/p&gt;
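&lt;p&gt;As a minimal sketch of that contract (the &lt;code&gt;NullTTS&lt;/code&gt; name and its silence-emitting behaviour are invented for illustration, not part of Rooh; a real provider would return an &lt;code&gt;np.ndarray&lt;/code&gt; rather than a plain list), a provider that satisfies all four methods can be this small:&lt;/p&gt;

```python
class NullTTS:
    """Sketch of the four-method provider contract: load, synthesize,
    unload, is_loaded. NullTTS is a hypothetical stand-in that emits
    silence, handy for pipeline tests with no network access."""

    META = {
        "display_name": "Null",
        "description": "Emits silence.",
        "kind": "local",
    }

    def __init__(self, sample_rate=16000, **kwargs):
        self.sample_rate = sample_rate
        self._loaded = False

    def load(self):
        # Nothing to acquire for this stub; a real provider would
        # create its client or load model weights here.
        self._loaded = True

    def synthesize(self, text):
        # One second of silence per call; a plain list of floats
        # keeps this sketch stdlib-only.
        return [0.0] * self.sample_rate, self.sample_rate

    def unload(self):
        self._loaded = False

    @property
    def is_loaded(self):
        return self._loaded
```

&lt;p&gt;Registering it would then be the same one-liner as above, e.g. &lt;code&gt;.custom_tts("null", NullTTS)&lt;/code&gt;.&lt;/p&gt;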

&lt;h2&gt;
  
  
  The Patience Problem, Revisited
&lt;/h2&gt;

&lt;p&gt;Remember the interview agent from the opening? Let us solve that problem properly.&lt;/p&gt;

&lt;p&gt;The naive approach is to increase &lt;code&gt;silence_duration&lt;/code&gt;: wait longer before concluding the user is done. But silence duration alone is a blunt instrument. A 3-second threshold helps with thinking pauses, but it also introduces a 3-second delay after every genuinely completed answer. The user finishes speaking, then sits in awkward silence for three seconds before the agent responds. That is not empathy; it is a different flavour of frustration.&lt;/p&gt;

&lt;p&gt;The elegant solution combines two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;generous silence threshold&lt;/strong&gt; to avoid premature truncation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-driven completeness evaluation&lt;/strong&gt; to determine whether a pause is a thinking pause or a genuine end-of-turn&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
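
&lt;p&gt;Mechanism 2 can be as small as a one-word classification prompt sent to a fast model. The prompt template and parser below are a hypothetical sketch (Rooh does not prescribe them), but they show the shape of the judgement:&lt;/p&gt;

```python
COMPLETENESS_PROMPT = """You are judging a spoken interview answer.
Question: {question}
Answer so far: {answer}

Reply with exactly one word: COMPLETE if the answer substantively
addresses the question, or INCOMPLETE if the speaker seems mid-thought."""


def build_completeness_prompt(question, answer):
    # Fill the template; the actual LLM call is provider-specific
    # and omitted here.
    return COMPLETENESS_PROMPT.format(question=question, answer=answer)


def parse_verdict(llm_reply):
    # Tolerate casing and whitespace. Defaulting to "incomplete" for
    # anything unexpected errs on the side of patience, not truncation.
    return llm_reply.strip().upper().startswith("COMPLETE")
```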

&lt;p&gt;Rooh’s hook system makes this possible. Instead of routing transcribed text directly to an LLM for response generation, you intercept it with a hook that evaluates the answer’s completeness:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cartesia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;79a125e8-cd45-4c13-8a67-188112f4dd22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait 3 seconds — give the user room to think
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Let them resume after a filler
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interview_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a patient, professional interviewer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;interview_hook&lt;/code&gt; receives the transcribed text and the &lt;code&gt;session_id&lt;/code&gt; (a UUID unique to each connection). It maintains per-session state — conversation history, the current question, and critically, a &lt;code&gt;pending_partial&lt;/code&gt; buffer. When the LLM judges an answer as incomplete, the hook responds with a brief filler (“take your time”, “mm-hmm”) and stores the partial answer. When the user resumes speaking (triggering barge-in after the filler), the next invocation concatenates the continuation with the stored partial and evaluates the combined answer.&lt;/p&gt;
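
&lt;p&gt;In outline, that state machine can be sketched as follows. The &lt;code&gt;(text, session_id)&lt;/code&gt; signature and the return dictionary are assumptions for illustration, and the LLM completeness call is stubbed with a crude heuristic:&lt;/p&gt;

```python
# Hypothetical sketch of the interview hook's per-session state.
# Substitute Rooh's actual hook API and a real LLM completeness call.

SESSIONS = {}  # maps session_id to {"pending_partial": ..., "history": ...}


def looks_complete(text):
    # Placeholder for the LLM completeness judgement: treat trailing
    # ellipses and fillers as mid-thought.
    return not text.rstrip().endswith(("...", "um", "uh"))


def interview_hook(text, session_id):
    state = SESSIONS.setdefault(
        session_id, {"pending_partial": "", "history": []}
    )
    # Prepend any partial stored from an earlier, unfinished turn.
    combined = (state["pending_partial"] + " " + text).strip()
    if looks_complete(combined):
        state["pending_partial"] = ""
        state["history"].append(combined)
        return {"respond": True, "text": combined}
    # Incomplete: buffer the partial and emit an encouraging filler.
    state["pending_partial"] = combined
    return {"respond": False, "filler": "take your time"}
```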

&lt;p&gt;The result is an interviewer that genuinely listens. It pauses when you pause. It encourages when you hesitate. It advances only when your answer is substantively complete. The conversational rhythm feels human because the pipeline is modelling human conversational norms rather than optimising for throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a voice agent that merely speaks is table stakes. Building one that listens, that respects the cadence of human thought, that distinguishes a thinking pause from a completed utterance, that knows when to wait and when to respond: that is the harder, more consequential problem.&lt;/p&gt;

&lt;p&gt;Rooh does not solve this problem for you. It gives you the architectural primitives to solve it yourself: swappable models, configurable silence thresholds, barge-in control, LLM hooks for custom logic, per-session state via &lt;code&gt;session_id&lt;/code&gt;, and an extensibility model that lets you bring any provider into the pipeline without forking the framework.&lt;/p&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/roohai/roohai-framework" rel="noopener noreferrer"&gt;https://github.com/roohai/roohai-framework&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>We’ve all had that awkward conversation with a voice bot where you pause for one second to collect your thoughts, and it immediately interrupts you. Link to the blog
https://medium.com/@frasersequeira/building-voice-agents-with-rooh-b4ece2abbb14</title>
      <dc:creator>fraser sequeira</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:23:47 +0000</pubDate>
      <link>https://dev.to/fraser_sequeira_19d159328/weve-all-had-that-awkward-conversation-with-a-voice-bot-where-you-pause-for-one-second-to-collect-485l</link>
      <guid>https://dev.to/fraser_sequeira_19d159328/weve-all-had-that-awkward-conversation-with-a-voice-bot-where-you-pause-for-one-second-to-collect-485l</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@frasersequeira/building-voice-agents-with-rooh-b4ece2abbb14" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
