<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: fraser sequeira</title>
    <description>The latest articles on DEV Community by fraser sequeira (@fraser_sequeira_19d159328).</description>
    <link>https://dev.to/fraser_sequeira_19d159328</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3819143%2Febe0f5b7-982f-4677-a1b2-026bd3dd0762.jpg</url>
      <title>DEV Community: fraser sequeira</title>
      <link>https://dev.to/fraser_sequeira_19d159328</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/fraser_sequeira_19d159328"/>
    <language>en</language>
    <item>
      <title>Building Voice Agents with Rooh</title>
      <dc:creator>fraser sequeira</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:42:19 +0000</pubDate>
      <link>https://dev.to/fraser_sequeira_19d159328/building-voice-agents-with-rooh-1nkf</link>
      <guid>https://dev.to/fraser_sequeira_19d159328/building-voice-agents-with-rooh-1nkf</guid>
      <description>&lt;p&gt;&lt;em&gt;The soul of a voice agent is not in its speech synthesis or its language model. It is in the space between the silence it chooses to honour.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A prevailing shortcoming of today’s voice agents is deceptively simple: they don’t truly understand the rhythm of human conversation.&lt;/p&gt;

&lt;p&gt;I experienced this firsthand during a skill-based interview conducted entirely by a voice agent. The agent was articulate, its questions were well-formed, and the experience began promisingly. But the moment I paused, not because I was finished but because the question demanded deliberation, the agent interpreted my silence as a completed response and barrelled into the next question. What followed was a cascade of half-formed answers, each truncated prematurely. By the end of the session, the agent had dutifully catalogued a series of fragments, and I was left with a genuinely dispiriting experience.&lt;/p&gt;

&lt;p&gt;This is not an edge case. It is the default behaviour of most voice pipelines today. Silence is treated as a terminal signal rather than what it often is: the cognitive pause between thought and articulation.&lt;/p&gt;

&lt;p&gt;As this domain matures (agents conversing with humans, agents orchestrating other agents, agents mediating multi-party workflows), the bar for conversational intelligence must rise accordingly. We need voice pipelines that are not merely transactional, but empathetic, patient, and context-aware. Pipelines that understand that a pause is not always an ending.&lt;/p&gt;

&lt;p&gt;That’s where &lt;a href="https://github.com/RoohAI/roohai-framework" rel="noopener noreferrer"&gt;Rooh&lt;/a&gt; comes in. The name means “soul”, and that’s precisely what it aspires to give your voice agents. Rooh is an open-source Python framework for building real-time voice pipelines that supports both edge deployment in fully offline mode and cloud-based inference. It is not opinionated about which models you use; it is opinionated about giving you the architectural primitives to build agents that genuinely listen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://pypi.org/project/roohai/" rel="noopener noreferrer"&gt;Rooh&lt;/a&gt; orchestrates the entire voice pipeline flow. Each stage is a swappable abstraction backed by a registry of concrete implementations: Deepgram or Whisper for STT, Claude or GPT-4o or Ollama for LLM, Cartesia or Piper for TTS. You choose the providers; Rooh handles the wiring, the streaming overlap, the barge-in detection, and the lifecycle management.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Initializing the environment
&lt;/h2&gt;

&lt;p&gt;RoohAI requires Python 3.11+. Let’s start by creating a virtual environment in which we’ll install Rooh and all its necessary dependencies:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;python3.13 -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate&lt;/code&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Rooh installation
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;pip install roohai&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;That single &lt;code&gt;pip install&lt;/code&gt; gives you every built-in STT, TTS, LLM, and VAD provider. No extras, no conditional dependencies to chase. For NVIDIA NeMo models (Canary, Parakeet), there is an optional extra.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Starting the Server
&lt;/h2&gt;

&lt;p&gt;Rooh ships with a FastAPI-backed server and a browser-based UI. Launch it with a single command:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;roohai&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Once running, open &lt;a href="http://localhost:8000" rel="noopener noreferrer"&gt;http://localhost:8000&lt;/a&gt;. You are greeted with a dark-themed interface, a conversation panel, a sidebar listing your agents, and a button to create new ones. No build step, no frontend toolchain. It simply works.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix9yud154a9axinv9brx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fix9yud154a9axinv9brx.png" alt=" " width="800" height="414"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: The Agent Builder Wizard: No Code Required
&lt;/h2&gt;

&lt;p&gt;The fastest path to a working voice agent is the Agent Builder Wizard, a guided, multi-step flow accessible directly from the browser. You do not need to write a single line of code or touch a YAML file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The wizard walks you through four steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;STT&lt;/strong&gt;: Choose your speech recognition provider. For this walkthrough, select Deepgram Nova-3 (cloud, streaming, highest accuracy).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM&lt;/strong&gt;: Choose your language model. Select Amazon Bedrock and pick the Claude model you want, for instance global.anthropic.claude-haiku-4-5-20251001-v1:0 for a fast, cost-effective global endpoint.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;TTS&lt;/strong&gt;: Choose your voice. Select Cartesia Sonic for natural, low-latency cloud synthesis.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Review &amp;amp; Create&lt;/strong&gt;: Name your agent, write a system prompt that defines its personality, and configure the pipeline settings below:&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;VAD Sensitivity&lt;/strong&gt;: A slider from 0.0 to 1.0. Lower values are more sensitive (picks up quiet speech and echo); higher values require louder, more distinct speech. A default of 0.70 works well on speakers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Transport&lt;/strong&gt;: Choose between &lt;strong&gt;WebSocket&lt;/strong&gt; and &lt;strong&gt;WebRTC&lt;/strong&gt;. WebRTC offers lower steady-state latency (~500ms) with the native Opus codec, though the initial handshake takes slightly longer than a WebSocket connection.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hit &lt;strong&gt;Create Agent&lt;/strong&gt;, and that is it. Under the hood, the wizard translates your selections into a YAML configuration file stored in &lt;code&gt;~/.roohai/agents/&lt;/code&gt;, which the &lt;code&gt;Rooh&lt;/code&gt; pipeline class reads at activation time. For most production use cases, the wizard-generated configuration is all you need.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;created_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2026-03-11T10:15:04.551587+00:00'&lt;/span&gt;
&lt;span class="na"&gt;llm_streaming&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="na"&gt;models&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;bedrock-claude&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;auth_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api_key&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock-claude&lt;/span&gt;
    &lt;span class="na"&gt;model_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;global.anthropic.claude-haiku-4-5-20251001-v1:0&lt;/span&gt;
    &lt;span class="na"&gt;region&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;us-east-1&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm&lt;/span&gt;
  &lt;span class="na"&gt;cartesia&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cartesia&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;en&lt;/span&gt;
    &lt;span class="na"&gt;model_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sonic-2&lt;/span&gt;
    &lt;span class="na"&gt;sample_rate&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;24000'&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;tts&lt;/span&gt;
    &lt;span class="na"&gt;voice_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;a0e99841-438c-4a64-b679-ae501e7d6091&lt;/span&gt;
  &lt;span class="na"&gt;deepgram-nova-3&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepgram-nova-3&lt;/span&gt;
    &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;en&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;nova-3&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stt&lt;/span&gt;
  &lt;span class="na"&gt;silero&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;silero&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;vad&lt;/span&gt;
&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Voice-Agent-DCC&lt;/span&gt;
&lt;span class="na"&gt;pipeline&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;llm&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bedrock-claude&lt;/span&gt;
  &lt;span class="na"&gt;stt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deepgram-nova-3&lt;/span&gt;
  &lt;span class="na"&gt;tts&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;cartesia&lt;/span&gt;
  &lt;span class="na"&gt;vad&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;silero&lt;/span&gt;
  &lt;span class="na"&gt;vad_threshold&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0.7&lt;/span&gt;
&lt;span class="na"&gt;status&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;active&lt;/span&gt;
&lt;span class="na"&gt;system_prompt&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;You will be helpful and concise&lt;/span&gt;
&lt;span class="na"&gt;transport&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;websocket&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Dive Deep: The Rooh Builder API
&lt;/h2&gt;

&lt;p&gt;The wizard is the quickest route, but when you need programmatic control (dynamic configuration, custom hooks, CI/CD-driven deployments, or embedding Rooh inside a larger application), the fluent Builder API lets you construct pipelines entirely in Python. No YAML files, no server UI, just code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example: A Cloud-Powered Voice Agent&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roohai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Rooh&lt;/span&gt;

&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;region&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;us-west-2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cartesia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;79a125e8-cd45-4c13-8a67-188112f4dd22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a friendly voice assistant. Keep responses concise &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;and conversational — they will be spoken aloud.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Key aspects of the code&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;vad_threshold(0.7)&lt;/strong&gt;: The sensitivity dial. Lower values make the VAD more trigger-happy (it detects speech more readily); higher values require more confidence. 0.7 is a sensible default for quiet environments.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;barge_in(True)&lt;/strong&gt;: Enables interruption. If the user starts speaking while the agent is still talking, the pipeline cancels in-progress TTS, sends an interrupt signal to the client, and immediately begins processing the new utterance. This is what makes a voice agent feel responsive rather than robotic.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;silence_duration(1000)&lt;/strong&gt;: The patience parameter, measured in milliseconds. After the VAD detects silence, the pipeline waits this long before concluding the user has finished speaking. At 1000ms, it is adequate for casual conversation. For interviews or complex domains where users need thinking time, you would increase this substantially: 2000ms, 3000ms, or more.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;system_prompt(…)&lt;/strong&gt;: The behavioural directive passed to the LLM on every turn. This shapes the agent’s personality, verbosity, and domain focus.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;pipeline.load()&lt;/strong&gt;: Loads all configured models into memory. For cloud models (Deepgram, Bedrock, Cartesia), this initialises API clients. For local models (Whisper, Piper, Silero), this downloads weights on first use and loads them onto the device.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
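
&lt;p&gt;The patience parameter is the one to reach for in the interview scenario from the introduction. As a configuration sketch reusing the Builder API shown above (the provider choices, voice ID, and the 3000ms value are illustrative, not prescriptions):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from roohai import Rooh

# A more patient pipeline for interview-style conversations:
# give the speaker three full seconds of silence before the
# pipeline treats the utterance as complete.
interview_pipeline = (
    Rooh.builder()
    .stt("deepgram", model="nova-3", language="en")
    .llm("bedrock", model_id="anthropic.claude-3-haiku-20240307-v1:0", region="us-west-2")
    .tts("cartesia", voice_id="79a125e8-cd45-4c13-8a67-188112f4dd22")
    .vad("silero")
    .vad_threshold(0.7)
    .barge_in(True)
    .silence_duration(3000)  # milliseconds of silence before end-of-turn
    .system_prompt("You are a calm interviewer. Never rush the candidate.")
    .build()
)
interview_pipeline.load()
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;Barge-in stays enabled, so the longer silence window does not make the agent feel sluggish: the user can always interrupt.&lt;/p&gt;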

&lt;p&gt;&lt;strong&gt;Example: A Fully Offline Agent&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Not every deployment has internet access. Medical devices, factory floors, classified environments: these demand pipelines that run entirely on the edge. Rooh handles this with the same Builder API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;whisper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/whisper-small&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ollama&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemma3:4b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;piper&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;en_US-amy-medium&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad_threshold&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1200&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant running locally. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Keep responses under two sentences.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys. No network calls. Whisper runs inference locally via HuggingFace Transformers, Ollama serves any open model (Llama 3, Mistral, Gemma, Qwen) on localhost, and Piper synthesises speech using lightweight ONNX models that download once and run offline thereafter.&lt;/p&gt;

&lt;p&gt;The tradeoff is latency: local LLMs on CPU are measurably slower than a Bedrock API call. But the privacy and availability guarantees are absolute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Understanding the Provider Ecosystem&lt;/strong&gt;&lt;br&gt;
Rooh ships with a curated set of built-in providers. Each is referenced by a string name in the Builder API, and each accepts provider-specific keyword arguments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Speech-to-Text&lt;/span&gt;

| Provider | Kind | Default Model | Key Parameters |
|----------|------|---------------|----------------|
| &lt;span class="sb"&gt;`"deepgram"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`nova-3`&lt;/span&gt; | &lt;span class="sb"&gt;`model`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"whisper"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`openai/whisper-tiny`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"wav2vec2"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`facebook/wav2vec2-base-960h`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"nvidia-parakeet"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`nvidia/parakeet-tdt-0.6b-v2`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Deepgram is the only STT provider with real-time streaming: interim transcriptions appear as the user speaks, and utterance boundaries are detected server-side. The batch providers (Whisper, Wav2Vec2, NVIDIA) accumulate audio during speech and transcribe once silence is detected.&lt;br&gt;
&lt;/p&gt;
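
&lt;p&gt;Rooh’s internal scheduling isn’t shown here, but the accumulate-then-transcribe pattern the batch providers follow can be sketched in a few lines of plain Python (the function and variable names are illustrative, not Rooh APIs):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def run_batch_stt(frames, is_speech, transcribe):
    """Accumulate audio frames while the VAD reports speech;
    transcribe each completed utterance once silence follows."""
    buffer, transcripts = [], []
    for frame, speaking in zip(frames, is_speech):
        if speaking:
            buffer.append(frame)
        elif buffer:
            # Silence after speech: the utterance is complete.
            transcripts.append(transcribe(buffer))
            buffer = []
    if buffer:  # flush a trailing utterance at end of stream
        transcripts.append(transcribe(buffer))
    return transcripts
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;With text tokens standing in for audio frames and a join standing in for the model, two utterances separated by silence yield two transcripts; the streaming case differs in that partial results would be emitted inside the loop rather than at utterance boundaries.&lt;/p&gt;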

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Large Language Models&lt;/span&gt;

| Provider | Kind | Default Model | Key Parameters |
|----------|------|---------------|----------------|
| &lt;span class="sb"&gt;`"bedrock"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`anthropic.claude-3-haiku-20240307-v1:0`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`region`&lt;/span&gt;, &lt;span class="sb"&gt;`auth_mode`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"openai"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`gpt-4o`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt;, &lt;span class="sb"&gt;`base_url`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"anthropic"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`claude-sonnet-4-20250514`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"gemini"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`gemini-2.5-flash`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"ollama"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`llama3`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`host`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"local"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`TinyLlama/TinyLlama-1.1B-Chat-v1.0`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="gu"&gt;### Text-to-Speech&lt;/span&gt;

| Provider | Kind | Default Voice/Model | Key Parameters |
|----------|------|---------------------|----------------|
| &lt;span class="sb"&gt;`"cartesia"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`sonic-2`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`voice_id`&lt;/span&gt;, &lt;span class="sb"&gt;`language`&lt;/span&gt;, &lt;span class="sb"&gt;`sample_rate`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"deepgram-tts"`&lt;/span&gt; | Cloud | &lt;span class="sb"&gt;`aura-2-thalia-en`&lt;/span&gt; | &lt;span class="sb"&gt;`model`&lt;/span&gt;, &lt;span class="sb"&gt;`sample_rate`&lt;/span&gt;, &lt;span class="sb"&gt;`api_key`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"piper"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`en_US-lessac-medium`&lt;/span&gt; | &lt;span class="sb"&gt;`voice`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"speecht5"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`microsoft/speecht5_tts`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt;, &lt;span class="sb"&gt;`vocoder`&lt;/span&gt; |
| &lt;span class="sb"&gt;`"bark"`&lt;/span&gt; | Local | &lt;span class="sb"&gt;`suno/bark-small`&lt;/span&gt; | &lt;span class="sb"&gt;`model_id`&lt;/span&gt; |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Extensibility: Bringing Your Own Models&lt;/strong&gt;&lt;br&gt;
The built-in providers cover the most common use cases, but production systems often require bespoke integrations: a proprietary STT engine, a fine-tuned TTS model, a domain-specific LLM behind a custom API.&lt;/p&gt;

&lt;p&gt;Rooh’s architecture is designed for this. Every model is a subclass of one of four abstract base classes (&lt;code&gt;STTModel&lt;/code&gt;, &lt;code&gt;TTSModel&lt;/code&gt;, &lt;code&gt;LLMModel&lt;/code&gt;, &lt;code&gt;VADModel&lt;/code&gt;), each requiring just four methods. And as of v0.1.7, the Builder API supports &lt;strong&gt;custom model&lt;/strong&gt; classes directly, with no framework modifications, no forking, and no monkey-patching:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;roohai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TTSModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TTSModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Custom TTS provider using ElevenLabs API.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;META&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;display_name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ElevenLabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ultra-realistic voice synthesis.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kind&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cloud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;default&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voice_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;elevenlabs&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ElevenLabs&lt;/span&gt;
        &lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_api_key&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ELEVENLABS_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ElevenLabs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;synthesize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;soundfile&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;
        &lt;span class="n"&gt;audio_bytes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text_to_speech&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eleven_multilingual_v2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;io&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;b&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;audio_bytes&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sr&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;unload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="nd"&gt;@property&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;is_loaded&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_client&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;


&lt;span class="c1"&gt;# Use it alongside built-in providers — no framework changes
&lt;/span&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4-20250514&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;custom_tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;elevenlabs&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ElevenLabsTTS&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pNInz6obpgDQGcFmaJgB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful voice assistant.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;custom_tts()&lt;/code&gt; method (and its siblings &lt;code&gt;custom_stt()&lt;/code&gt;, &lt;code&gt;custom_llm()&lt;/code&gt;, &lt;code&gt;custom_vad()&lt;/code&gt;) accepts a name, a class, and any constructor kwargs. It registers the class in Rooh’s global model registry, making it a first-class citizen eligible for hot-swapping, visible in the catalog API, and tracked by pipeline metrics.&lt;/p&gt;

&lt;p&gt;This is the extensibility contract: implement four methods, pass the class to the Builder, and you are done.&lt;/p&gt;
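&lt;p&gt;As a minimal sketch of that contract (the &lt;code&gt;NullTTS&lt;/code&gt; name and its silence-emitting behaviour are invented for illustration, not part of Rooh; a real provider would return an &lt;code&gt;np.ndarray&lt;/code&gt; rather than a plain list), a provider that satisfies all four methods can be this small:&lt;/p&gt;

```python
class NullTTS:
    """Sketch of the four-method provider contract: load, synthesize,
    unload, is_loaded. NullTTS is a hypothetical stand-in that emits
    silence, handy for pipeline tests with no network access."""

    META = {
        "display_name": "Null",
        "description": "Emits silence.",
        "kind": "local",
    }

    def __init__(self, sample_rate=16000, **kwargs):
        self.sample_rate = sample_rate
        self._loaded = False

    def load(self):
        # Nothing to acquire for this stub; a real provider would
        # create its client or load model weights here.
        self._loaded = True

    def synthesize(self, text):
        # One second of silence per call; a plain list of floats
        # keeps this sketch stdlib-only.
        return [0.0] * self.sample_rate, self.sample_rate

    def unload(self):
        self._loaded = False

    @property
    def is_loaded(self):
        return self._loaded
```

&lt;p&gt;Registering it would then be the same one-liner as above, e.g. &lt;code&gt;.custom_tts("null", NullTTS)&lt;/code&gt;.&lt;/p&gt;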

&lt;h2&gt;
  
  
  The Patience Problem, Revisited
&lt;/h2&gt;

&lt;p&gt;Remember the interview agent from the opening? Let us solve that problem properly.&lt;/p&gt;

&lt;p&gt;The naive approach is to increase &lt;code&gt;silence_duration&lt;/code&gt;: wait longer before concluding the user is done. But silence duration alone is a blunt instrument. A 3-second threshold helps with thinking pauses, but it also introduces a 3-second delay after every genuinely completed answer. The user finishes speaking, then sits in awkward silence for three seconds before the agent responds. That is not empathy; it is a different flavour of frustration.&lt;/p&gt;

&lt;p&gt;The elegant solution combines two mechanisms:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A &lt;strong&gt;generous silence threshold&lt;/strong&gt; to avoid premature truncation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LLM-driven completeness evaluation&lt;/strong&gt; to determine whether a pause is a thinking pause or a genuine end-of-turn&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
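
&lt;p&gt;Mechanism 2 can be as small as a one-word classification prompt sent to a fast model. The prompt template and parser below are a hypothetical sketch (Rooh does not prescribe them), but they show the shape of the judgement:&lt;/p&gt;

```python
COMPLETENESS_PROMPT = """You are judging a spoken interview answer.
Question: {question}
Answer so far: {answer}

Reply with exactly one word: COMPLETE if the answer substantively
addresses the question, or INCOMPLETE if the speaker seems mid-thought."""


def build_completeness_prompt(question, answer):
    # Fill the template; the actual LLM call is provider-specific
    # and omitted here.
    return COMPLETENESS_PROMPT.format(question=question, answer=answer)


def parse_verdict(llm_reply):
    # Tolerate casing and whitespace. Defaulting to "incomplete" for
    # anything unexpected errs on the side of patience, not truncation.
    return llm_reply.strip().upper().startswith("COMPLETE")
```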

&lt;p&gt;Rooh’s hook system makes this possible. Instead of routing transcribed text directly to an LLM for response generation, you intercept it with a hook that evaluates the answer’s completeness:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Rooh&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;deepgram&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nova-3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-haiku-20240307-v1:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cartesia&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;voice_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;79a125e8-cd45-4c13-8a67-188112f4dd22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vad&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;silero&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;silence_duration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Wait 3 seconds — give the user room to think
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;barge_in&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;          &lt;span class="c1"&gt;# Let them resume after a filler
&lt;/span&gt;    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm_hook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;interview_hook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a patient, professional interviewer.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;build&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;interview_hook&lt;/code&gt; receives the transcribed text and the &lt;code&gt;session_id&lt;/code&gt; (a UUID unique to each connection). It maintains per-session state — conversation history, the current question, and critically, a &lt;code&gt;pending_partial&lt;/code&gt; buffer. When the LLM judges an answer as incomplete, the hook responds with a brief filler (“take your time”, “mm-hmm”) and stores the partial answer. When the user resumes speaking (triggering barge-in after the filler), the next invocation concatenates the continuation with the stored partial and evaluates the combined answer.&lt;/p&gt;
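
&lt;p&gt;In outline, that state machine can be sketched as follows. The &lt;code&gt;(text, session_id)&lt;/code&gt; signature and the return dictionary are assumptions for illustration, and the LLM completeness call is stubbed with a crude heuristic:&lt;/p&gt;

```python
# Hypothetical sketch of the interview hook's per-session state.
# Substitute Rooh's actual hook API and a real LLM completeness call.

SESSIONS = {}  # maps session_id to {"pending_partial": ..., "history": ...}


def looks_complete(text):
    # Placeholder for the LLM completeness judgement: treat trailing
    # ellipses and fillers as mid-thought.
    return not text.rstrip().endswith(("...", "um", "uh"))


def interview_hook(text, session_id):
    state = SESSIONS.setdefault(
        session_id, {"pending_partial": "", "history": []}
    )
    # Prepend any partial stored from an earlier, unfinished turn.
    combined = (state["pending_partial"] + " " + text).strip()
    if looks_complete(combined):
        state["pending_partial"] = ""
        state["history"].append(combined)
        return {"respond": True, "text": combined}
    # Incomplete: buffer the partial and emit an encouraging filler.
    state["pending_partial"] = combined
    return {"respond": False, "filler": "take your time"}
```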

&lt;p&gt;The result is an interviewer that genuinely listens. It pauses when you pause. It encourages when you hesitate. It advances only when your answer is substantively complete. The conversational rhythm feels human because the pipeline is modelling human conversational norms rather than optimising for throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a voice agent that merely speaks is table stakes. Building one that listens, that respects the cadence of human thought, that distinguishes a thinking pause from a completed utterance, that knows when to wait and when to respond: that is the harder, more consequential problem.&lt;/p&gt;

&lt;p&gt;Rooh does not solve this problem for you. It gives you the architectural primitives to solve it yourself: swappable models, configurable silence thresholds, barge-in control, LLM hooks for custom logic, per-session state via &lt;code&gt;session_id&lt;/code&gt;, and an extensibility model that lets you bring any provider into the pipeline without forking the framework.&lt;/p&gt;

&lt;p&gt;GitHub repo: &lt;a href="https://github.com/roohai/roohai-framework" rel="noopener noreferrer"&gt;https://github.com/roohai/roohai-framework&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>aws</category>
      <category>programming</category>
      <category>python</category>
    </item>
    <item>
      <title>We’ve all had that awkward conversation with a voice bot where you pause for one second to collect your thoughts, and it immediately interrupts you. Link to the blog
https://medium.com/@frasersequeira/building-voice-agents-with-rooh-b4ece2abbb14</title>
      <dc:creator>fraser sequeira</dc:creator>
      <pubDate>Wed, 11 Mar 2026 23:23:47 +0000</pubDate>
      <link>https://dev.to/fraser_sequeira_19d159328/weve-all-had-that-awkward-conversation-with-a-voice-bot-where-you-pause-for-one-second-to-collect-485l</link>
      <guid>https://dev.to/fraser_sequeira_19d159328/weve-all-had-that-awkward-conversation-with-a-voice-bot-where-you-pause-for-one-second-to-collect-485l</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://medium.com/@frasersequeira/building-voice-agents-with-rooh-b4ece2abbb14" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;medium.com&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
