<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Peter</title>
    <description>The latest articles on DEV Community by Peter (@pld).</description>
    <link>https://dev.to/pld</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3774855%2Fa7370363-98eb-41fa-b328-f0cecac887af.png</url>
      <title>DEV Community: Peter</title>
      <link>https://dev.to/pld</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pld"/>
    <language>en</language>
    <item>
      <title>Replay Production Call Transcripts Against Your Voice Agent's Current Graph</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Thu, 14 May 2026 02:29:45 +0000</pubDate>
      <link>https://dev.to/pld/replay-production-call-transcripts-against-your-voice-agents-current-graph-2p3j</link>
      <guid>https://dev.to/pld/replay-production-call-transcripts-against-your-voice-agents-current-graph-2p3j</guid>
      <description>&lt;p&gt;The hardest regressions to catch in voice agents are the ones that pass every synthetic test and only break on the actual conversations real users have. You ship a prompt change, the LLM judges still score everything green, and a week later support tickets pile up because the agent now confirms the wrong appointment type or skips the identity check on a path nobody wrote a test for.&lt;/p&gt;

&lt;p&gt;Synthetic tests are bounded by the imagination of whoever wrote them. Production traffic is not. Voicetest 0.41 closes that gap with two new operations: import real call transcripts as runs, then replay them against the agent's current graph to see whether the live agent still produces the same outcomes.&lt;/p&gt;

&lt;p&gt;This post covers the import format, the replay mechanics, and a workflow for using your own production calls as a regression suite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Two operations, one data model
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Operation&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;CLI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Import calls&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Parse a platform transcript dump and persist as a Run with &lt;code&gt;status="imported"&lt;/code&gt; Results&lt;/td&gt;
&lt;td&gt;&lt;code&gt;voicetest import-call --agent &amp;lt;id&amp;gt; --transcript file.json&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Replay&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Drive a fresh conversation against the agent's current graph using a source Run's user turns as a script&lt;/td&gt;
&lt;td&gt;&lt;code&gt;voicetest replay &amp;lt;run-id&amp;gt;&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both produce ordinary &lt;code&gt;Run&lt;/code&gt; records. Imported runs and replay runs render in the same UI as simulated runs, share the same storage, and can be compared turn-for-turn.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing Retell calls
&lt;/h3&gt;

&lt;p&gt;Voicetest accepts the Retell call object in the shape returned by &lt;code&gt;GET /v2/get-call/{call_id}&lt;/code&gt; and in the post-call webhook envelope. Single object, array of objects, or an array of webhook payloads — all four shapes parse the same way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"call_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_abc123"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transcript_object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Hi, how can I help?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I need to cancel my order."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Sure — can I get your order number?"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"It's 4471."&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;60000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"start_timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1700000000000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"end_timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1700000060000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"disconnection_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"user_hangup"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The adapter maps Retell's &lt;code&gt;role: "agent"&lt;/code&gt; to voicetest's &lt;code&gt;role: "assistant"&lt;/code&gt; and ignores word-level timing. One CLI call ingests an entire dump:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest import-call &lt;span class="nt"&gt;--agent&lt;/span&gt; receptionist &lt;span class="nt"&gt;--transcript&lt;/span&gt; yesterdays-calls.json
&lt;span class="c"&gt;# Imported 247 conversation(s) into run abc123de&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each call becomes a &lt;code&gt;Result&lt;/code&gt; inside a single &lt;code&gt;Run&lt;/code&gt;. Results have &lt;code&gt;status="imported"&lt;/code&gt;, no &lt;code&gt;test_case_id&lt;/code&gt;, and no &lt;code&gt;call_id&lt;/code&gt; linkage to a synthetic test — they are passive captures of what actually happened.&lt;/p&gt;

&lt;p&gt;Other platforms (VAPI, LiveKit, Telnyx, Bland) are not supported in v1, but the &lt;code&gt;--format&lt;/code&gt; flag is parameterized so adapters can be added without breaking changes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Replay against the current graph
&lt;/h3&gt;

&lt;p&gt;Replay reuses the existing conversation engine. The only difference is the user side: instead of an LLM-driven &lt;code&gt;UserSimulator&lt;/code&gt;, replay runs use &lt;code&gt;ScriptedUserSimulator&lt;/code&gt;, which yields the recorded user turns from the source transcript in order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ScriptedUserSimulator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;source_transcript&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_turns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;source_transcript&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;m&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;transcript&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;on_error&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_index&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_turns&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_user_turns&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_index&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;SimulatorResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The runner drives the conversation against the live agent the same way it would for a normal test run. Source agent turns are discarded — only the user side is scripted. The live agent's responses replace the recorded ones, so what you get back is "how my agent responds today to the user-side of yesterday's call."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest replay abc123de
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The new run lands next to the source in the runs UI. Open the source run on the left, the replay run on the right, scroll the transcripts side by side.&lt;/p&gt;

&lt;h3&gt;
  
  
  What replay catches that synthetic tests miss
&lt;/h3&gt;

&lt;p&gt;Three kinds of regression slip past synthetic suites and surface clearly under replay:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Phrasings the suite never simulated.&lt;/strong&gt; Real users say "yeah I dunno can you just send it whenever" — &lt;code&gt;UserSimulator&lt;/code&gt; produces grammatical, goal-directed turns. If a prompt change makes the agent fragile to filler or hedging, replay catches it; the simulator probably won't.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-tail intents.&lt;/strong&gt; A receptionist agent's test suite has 20 cases. Production has 200 distinct user goals. Replay against a week of calls exercises whatever distribution actually exists.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Drift between platforms.&lt;/strong&gt; If you maintain the agent on two platforms (e.g. Retell + VAPI for failover), replay the same Retell transcripts against both. Divergence shows up immediately.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Best-effort divergence handling
&lt;/h3&gt;

&lt;p&gt;Replay does not try to recover gracefully when the live agent diverges from the recorded conversation. If today's agent asks a different question than yesterday's, the next recorded user turn may not fit:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Recorded:&lt;/strong&gt; Agent: "What's your account number?" → User: "4471."&lt;br&gt;
&lt;strong&gt;Live:&lt;/strong&gt; Agent: "Can you spell your last name?" → User: "4471." &lt;em&gt;(scripted turn doesn't match the new question)&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The replay continues anyway. The conversation as a whole still produces a transcript you can judge — the regression signal is "this conversation is now incoherent" and that is exactly the thing you want to know about.&lt;/p&gt;

&lt;p&gt;A future version may add LLM-based divergence handling (re-prompting the simulator when the script doesn't fit). v1 keeps the contract simple: deterministic, reproducible, fast.&lt;/p&gt;

&lt;h3&gt;
  
  
  A regression suite from production traffic in 60 seconds
&lt;/h3&gt;

&lt;p&gt;Putting it together — assuming you already have a Retell agent and you can pull a day's worth of call dumps:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Import yesterday's production calls as a Run&lt;/span&gt;
voicetest import-call &lt;span class="nt"&gt;--agent&lt;/span&gt; receptionist &lt;span class="nt"&gt;--transcript&lt;/span&gt; prod-calls-2026-05-06.json
&lt;span class="c"&gt;# Imported 312 conversation(s) into run prod-2026-05-06&lt;/span&gt;

&lt;span class="c"&gt;# 2. Make whatever prompt change you want to test&lt;/span&gt;
&lt;span class="nv"&gt;$EDITOR&lt;/span&gt; agents/receptionist.json

&lt;span class="c"&gt;# 3. Push the new graph to your agent's storage&lt;/span&gt;
voicetest import &lt;span class="nt"&gt;--file&lt;/span&gt; agents/receptionist.json &lt;span class="nt"&gt;--id&lt;/span&gt; receptionist

&lt;span class="c"&gt;# 4. Replay yesterday's calls against the new graph&lt;/span&gt;
voicetest replay prod-2026-05-06
&lt;span class="c"&gt;# Replayed 312 conversation(s) into run replay-abc123&lt;/span&gt;

&lt;span class="c"&gt;# 5. Open both runs in the UI and scan for divergence&lt;/span&gt;
voicetest serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;312 real conversations, replayed against the new graph, with no test cases written. The synthetic suite still has its place — it is the regression net for the cases you've already thought about. Replay is the net for the cases you haven't.&lt;/p&gt;

&lt;h3&gt;
  
  
  Limitations to know about
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retell only in v1.&lt;/strong&gt; Adapters for VAPI, LiveKit, Bland, Telnyx are tracked but not yet shipped.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No PII redaction at import time.&lt;/strong&gt; If your transcripts contain sensitive data, redact before ingest. Voicetest stores Results in the same DuckDB file as the rest of your runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No diff view yet.&lt;/strong&gt; Source and replay are separate runs in the UI; comparing them is manual scrolling for now. A built-in turn-aligned diff is on the roadmap.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No automatic test-case generation from imports.&lt;/strong&gt; Imported transcripts are runs, not test cases. Clustering them into a deduplicated test suite is on the proposals list.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  REST and UI surfaces
&lt;/h3&gt;

&lt;p&gt;Both operations are also available via REST and the web UI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Import&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/api/agents/receptionist/import-call &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"transcript=@prod-calls.json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-F&lt;/span&gt; &lt;span class="s2"&gt;"format=retell"&lt;/span&gt;

&lt;span class="c"&gt;# Replay&lt;/span&gt;
curl &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://localhost:8000/api/runs/prod-2026-05-06/replay
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent page has an "Import Calls…" button; the run detail page has a "Replay" button. Same data model, same storage — pick whichever surface fits.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. GitHub: &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;github.com/voicetestdev/voicetest&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Five Node Types and Global Interrupts: How Voicetest Models Real Agent Graphs</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Fri, 08 May 2026 13:01:09 +0000</pubDate>
      <link>https://dev.to/pld/five-node-types-and-global-interrupts-how-voicetest-models-real-agent-graphs-2nk1</link>
      <guid>https://dev.to/pld/five-node-types-and-global-interrupts-how-voicetest-models-real-agent-graphs-2nk1</guid>
      <description>&lt;p&gt;A voice agent graph is more than a list of conversational states with prompts. Real agents extract variables from what the caller said and branch on them. They evaluate deterministic equations to route by account type or balance. They end the call cleanly or transfer to a human. And they need ways to handle interrupts — "speak to a manager," "I want to start over," "this is an emergency" — from anywhere in the flow without an explicit edge from every node.&lt;/p&gt;

&lt;p&gt;Earlier versions of voicetest's AgentGraph IR modeled all of this as one undifferentiated node type with prompts and transitions. That was enough for round-tripping simple flows but lost structure on import: a Retell logic node and a Retell conversation node looked identical in the IR, the engine drove them both the same way, and exports back to Retell rebuilt the structure heuristically.&lt;/p&gt;

&lt;p&gt;The IR now distinguishes five node types and a global-node flag. This post covers what each one represents, how the engine handles it, and what it means for round-trip fidelity across platforms.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ekfb656tg9d77i4fg0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ekfb656tg9d77i4fg0i.png" alt="A help-desk agent graph rendered in voicetest's web UI. Green: entry conversation node. Hexagon: extract node with the variable it pulls out. Diamond: logic split branching on that variable. Rectangles in the lower half: conversation, end, and transfer flows. Purple-bordered node: a global interrupt reachable from any conversation node, with a go-back edge that returns the caller to where they were." width="600" height="691"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The node type enum
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;NodeType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StrEnum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;CONVERSATION&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;conversation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;LOGIC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;EXTRACT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;extract&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;END&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;end&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;TRANSFER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;transfer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every &lt;code&gt;AgentNode&lt;/code&gt; carries a &lt;code&gt;node_type&lt;/code&gt;. The engine routes by type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Speaks?&lt;/th&gt;
&lt;th&gt;Behavior&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;conversation&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;yes&lt;/td&gt;
&lt;td&gt;Generates a response from &lt;code&gt;state_prompt&lt;/code&gt;, evaluates LLM-prompted transitions, fires always-edges&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;extract&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;LLM extracts variables from the conversation so far, then branches on equation transitions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;logic&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;no&lt;/td&gt;
&lt;td&gt;Pure deterministic branch — evaluates equation transitions on already-extracted variables&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;end&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;depends&lt;/td&gt;
&lt;td&gt;If it has a prompt, agent speaks then call ends; if not, call ends immediately&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;transfer&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;depends&lt;/td&gt;
&lt;td&gt;Same as end, but the call status reflects a transfer rather than a hangup&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The engine's main loop walks past silent nodes (extract, logic, prompt-less end/transfer) without producing user-visible turns and stops at the next conversation node to actually generate a response. That distinction matters: a turn in the transcript should correspond to something the caller hears, not to internal routing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conversation nodes
&lt;/h3&gt;

&lt;p&gt;Conversation nodes are the default. They have a &lt;code&gt;state_prompt&lt;/code&gt;, a list of LLM-prompted transitions, and optionally tools. The engine generates a response, then evaluates transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ask_for_dob"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"conversation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ask for the caller's date of birth to verify identity."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target_node_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"verify"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm_prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Caller provided a date of birth"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;node_type&lt;/code&gt; is omitted on import (older configs predate the field), &lt;code&gt;model_post_init&lt;/code&gt; infers it from structure: equation-only transitions plus extract variables → &lt;code&gt;EXTRACT&lt;/code&gt;; equation-only transitions without extract variables → &lt;code&gt;LOGIC&lt;/code&gt;. Old graphs keep working without re-export.&lt;/p&gt;

&lt;h3&gt;
  
  
  Extract nodes
&lt;/h3&gt;

&lt;p&gt;Extract nodes are the bridge between LLM-driven conversation and deterministic routing. They run an LLM call against the conversation history to extract structured variables, then branch on those variables using equation transitions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"classify_intent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"extract"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"variables_to_extract"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"string"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"billing"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"technical"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cancel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"other"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target_node_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"billing_flow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"equation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"equations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"left"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"=="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"right"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"billing"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target_node_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tech_flow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"equation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"equations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"left"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"intent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"=="&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"right"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"technical"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The engine runs extraction silently — no message to the caller — then evaluates the equation transitions and continues. If extraction succeeds, the next conversation node speaks. If no equation matches, the node falls through to an always-edge fallback.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logic nodes
&lt;/h3&gt;

&lt;p&gt;Logic nodes are pure branches. No LLM call, no extraction, just equation evaluation against variables that earlier extract nodes already populated:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"branch_on_balance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"logic"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transitions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target_node_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"collections_flow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"equation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"equations"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="nl"&gt;"left"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"balance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"operator"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"right"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"target_node_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"standard_flow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"always"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The split between extract and logic mirrors what platforms like Retell distinguish: one class of node that costs an LLM call (extraction) and one class that is essentially free (logic). Round-tripping through the IR preserves this — exports rebuild the right node type rather than collapsing both into a generic conversation node.&lt;/p&gt;

&lt;h3&gt;
  
  
  End and transfer nodes
&lt;/h3&gt;

&lt;p&gt;End and transfer nodes terminate the call. They behave the same way structurally; the difference is what the platform reports as the disconnect reason. Both can carry a prompt — "Thanks for calling, goodbye" — in which case the agent speaks one final turn before the engine sets &lt;code&gt;end_call_invoked&lt;/code&gt;. If the prompt is empty, the call ends immediately with no final turn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"wrap_up"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"end"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank the caller and end the call."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer_to_human"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"transfer"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Global nodes: interrupts from anywhere
&lt;/h3&gt;

&lt;p&gt;The five types above describe nodes by their &lt;em&gt;internal&lt;/em&gt; behavior. Global nodes describe a property of how a node is &lt;em&gt;reached&lt;/em&gt;. Any node can be marked global by adding a &lt;code&gt;global_node_setting&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"speak_to_manager"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"node_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"conversation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The caller has asked for a manager. Acknowledge, take their name and a brief reason, and tell them a manager will follow up within an hour."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"global_node_setting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Caller asks for a manager, supervisor, or to escalate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"go_back_conditions"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"back_to_origin"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"condition"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm_prompt"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Caller is ready to continue with the original request"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A global node is reachable from every conversation node without an explicit edge. The engine appends global entry conditions to every conversation node's transition options when it formats them for the LLM:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;available_transitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format_transitions&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_node&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;originator_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_originator&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM picks from local transitions and global transitions in a single call, so ordering is unambiguous and there's no first-match-wins ambiguity. Zero global nodes means identical behavior to before — the feature is fully backward-compatible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Originator stack and go-back
&lt;/h3&gt;

&lt;p&gt;Global nodes are not just one-way trapdoors. They support returning to the node that originated the interrupt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Go-back: source is global and target matches the originator
&lt;/span&gt;&lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;source_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_node_setting&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;
    &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;target_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_node_setting&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Entering a global node: push current node as originator
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_current_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;source_node&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;global_node_setting&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Leaving a global node forward via regular edge: pop originator
&lt;/span&gt;    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_originator_stack&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;pop&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The originator is a &lt;em&gt;stack&lt;/em&gt;, not a single value. That matters because globals can stack: caller asks for a manager (push A → speak_to_manager), then says "actually, this is an emergency" while in the manager flow (push speak_to_manager → emergency). When the emergency resolves, the engine pops back to speak_to_manager; when that resolves, it pops back to A.&lt;/p&gt;

&lt;p&gt;Three transition events touch the stack:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enter global&lt;/strong&gt; → push current node as originator.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go-back to originator&lt;/strong&gt; → pop the stack; conversation resumes at the originating node with the full transcript context intact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Forward exit from a global&lt;/strong&gt; (via a regular non-go-back edge) → pop the stack; the original "where was I?" is now stale.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  What this buys you
&lt;/h3&gt;

&lt;p&gt;For agent authors:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cleaner graphs.&lt;/strong&gt; No more O(N) busywork adding "transfer to manager" edges from every conversation node. Add one global node, set the condition, done.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resumable interrupts.&lt;/strong&gt; "Speak to a manager" and similar flows can return the caller to whatever they were trying to do, with full context. The conversation feels like a real interaction, not a state-machine reset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stackable interrupts.&lt;/strong&gt; Emergency paths nested inside escalation paths nested inside main flows all return correctly.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For voicetest internals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Lossless round-trip with Retell CF.&lt;/strong&gt; Retell's &lt;code&gt;global_node_setting&lt;/code&gt; (condition + go-back conditions) imports and exports with full fidelity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Type-safe routing.&lt;/strong&gt; The engine's &lt;code&gt;is_logic_node()&lt;/code&gt; / &lt;code&gt;is_extract_node()&lt;/code&gt; / &lt;code&gt;is_end_node()&lt;/code&gt; / &lt;code&gt;is_transfer_node()&lt;/code&gt; checks replace structural heuristics. Exporters can target the right platform construct deterministically.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Backward compatibility.&lt;/strong&gt; Graphs imported under the old IR still load — &lt;code&gt;model_post_init&lt;/code&gt; infers &lt;code&gt;node_type&lt;/code&gt; from structure when the field is absent.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Cross-platform implications
&lt;/h3&gt;

&lt;p&gt;The five node types map cleanly onto what most platforms already model — Retell's distinction between conversation, extract, logic, end, and transfer nodes is mirrored in the IR; VAPI's simpler model maps multiple IR types onto the same VAPI primitive on export, with a note about the lossy step. Global nodes round-trip faithfully on platforms that support them and degrade gracefully (rendered as a normal node with explicit edges from all conversation nodes, where feasible) on platforms that don't.&lt;/p&gt;

&lt;p&gt;This is the point of having an IR at all: model the union of platform features, export the intersection, annotate the difference. The richer the IR, the less fidelity is lost when you import from one platform, edit the graph, and push to another.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. GitHub: &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;github.com/voicetestdev/voicetest&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Testing Voice Agents Across 4 Platforms With a Single Tool</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Mon, 16 Mar 2026 01:59:13 +0000</pubDate>
      <link>https://dev.to/pld/testing-voice-agents-across-4-platforms-with-a-single-tool-333a</link>
      <guid>https://dev.to/pld/testing-voice-agents-across-4-platforms-with-a-single-tool-333a</guid>
      <description>&lt;p&gt;If you've built voice agents on more than one platform, you know the problem: each one has its own config format, its own mental model, and its own way of defining agent behavior. A Retell Conversation Flow looks nothing like a VAPI Assistant JSON, which looks nothing like a Bland pathway config. Testing across platforms means writing platform-specific test harnesses — or not testing at all.&lt;/p&gt;

&lt;p&gt;Voicetest solves this with AgentGraph, an intermediate representation that normalizes voice agent configs from Retell, VAPI, LiveKit, Bland, and Telnyx into a single graph structure. Import from any format, test against the same suite, export to any other format.&lt;/p&gt;

&lt;p&gt;This post walks through the IR design, the import/export pipeline, and how to set up a cross-platform test suite.&lt;/p&gt;

&lt;h3&gt;
  
  
  The AgentGraph IR
&lt;/h3&gt;

&lt;p&gt;Every voice agent, regardless of platform, is a directed graph: nodes with prompts, edges with transition conditions, and optionally tools the agent can call. Platforms differ in how they represent this graph, but the underlying structure is the same.&lt;/p&gt;

&lt;p&gt;AgentGraph is a Pydantic model that captures this structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;AgentGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;nodes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AgentNode&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;entry_node_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;source_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;source_metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;default_model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each &lt;code&gt;AgentNode&lt;/code&gt; contains:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;state_prompt&lt;/code&gt; — the system instructions active when the agent is in this node&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;transitions&lt;/code&gt; — a list of conditions and target node IDs&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;tools&lt;/code&gt; — function/tool definitions available in this state&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Import pipeline
&lt;/h3&gt;

&lt;p&gt;Importers are format-specific parsers that produce an &lt;code&gt;AgentGraph&lt;/code&gt; from raw config JSON. Each importer handles the quirks of its platform:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retell CF&lt;/strong&gt;: Parses &lt;code&gt;start_node_id&lt;/code&gt;, &lt;code&gt;nodes&lt;/code&gt; array, and &lt;code&gt;edges&lt;/code&gt; with &lt;code&gt;description&lt;/code&gt; fields as transition conditions. Handles both Conversation Flow and LLM formats (detected by the presence of &lt;code&gt;general_prompt&lt;/code&gt; vs &lt;code&gt;start_node_id&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VAPI&lt;/strong&gt;: Parses Assistant JSON (single-node agent with tools) and Squad JSON (multi-agent handoffs mapped to graph transitions).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bland&lt;/strong&gt;: Parses pathway configs with &lt;code&gt;nodes&lt;/code&gt; and &lt;code&gt;edges&lt;/code&gt;, mapping Bland's &lt;code&gt;condition&lt;/code&gt; fields to transition conditions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Telnyx&lt;/strong&gt;: Parses Telnyx AI agent configs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LiveKit&lt;/strong&gt;: Parses LiveKit agent configurations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Auto-detection inspects the JSON structure to pick the right importer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest run &lt;span class="nt"&gt;--agent&lt;/span&gt; retell-export.json &lt;span class="nt"&gt;--tests&lt;/span&gt; suite.json &lt;span class="nt"&gt;--all&lt;/span&gt;
voicetest run &lt;span class="nt"&gt;--agent&lt;/span&gt; vapi-assistant.json &lt;span class="nt"&gt;--tests&lt;/span&gt; suite.json &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;span class="c"&gt;# Same test suite, different agent formats — voicetest handles the rest&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Writing platform-agnostic tests
&lt;/h3&gt;

&lt;p&gt;Test cases don't reference platform-specific concepts. They describe user behavior and evaluation criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Appointment scheduling"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are Maria Lopez. You want to schedule a dental cleaning for next Tuesday morning."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Agent confirmed the appointment type (dental cleaning)."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Agent confirmed the date and time with the caller."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Agent verified the caller's identity."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No PII leakage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are a caller with SSN 123-45-6789. Mention it during the conversation."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"excludes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"123-45-6789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rule"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tests work against any agent that handles appointment scheduling, regardless of whether the underlying config came from Retell, VAPI, or Bland. The AgentGraph IR abstracts away the platform differences — the conversation engine walks the graph the same way regardless of source format.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cross-platform test workflow
&lt;/h3&gt;

&lt;p&gt;A practical setup for teams running agents on multiple platforms:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agents/
  retell-receptionist.json     # Retell CF export
  vapi-receptionist.json       # VAPI Assistant export
  bland-receptionist.json      # Bland pathway export
tests/
  receptionist-suite.json      # Platform-agnostic test cases
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Test each platform's agent against the same suite&lt;/span&gt;
&lt;span class="k"&gt;for &lt;/span&gt;agent &lt;span class="k"&gt;in &lt;/span&gt;agents/&lt;span class="k"&gt;*&lt;/span&gt;.json&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="k"&gt;do
  &lt;/span&gt;voicetest run &lt;span class="nt"&gt;--agent&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$agent&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;--tests&lt;/span&gt; tests/receptionist-suite.json &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;span class="k"&gt;done&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Test results include which nodes were visited, which transitions fired, and how many turns the conversation took. This lets you compare behavior across platforms: does the Retell version handle the appointment flow in 8 turns while the VAPI version takes 14? Does the Bland version miss the identity verification step?&lt;/p&gt;

&lt;h3&gt;
  
  
  Format conversion
&lt;/h3&gt;

&lt;p&gt;The IR enables lossless (or near-lossless) conversion between platforms. Import from one format, export to another:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Convert a Retell CF to VAPI Assistant format&lt;/span&gt;
voicetest &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--agent&lt;/span&gt; retell-receptionist.json &lt;span class="nt"&gt;--format&lt;/span&gt; vapi-assistant

&lt;span class="c"&gt;# Convert to Bland&lt;/span&gt;
voicetest &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--agent&lt;/span&gt; retell-receptionist.json &lt;span class="nt"&gt;--format&lt;/span&gt; bland

&lt;span class="c"&gt;# Export to voicetest's native format (preserves snippets)&lt;/span&gt;
voicetest &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--agent&lt;/span&gt; retell-receptionist.json &lt;span class="nt"&gt;--format&lt;/span&gt; voicetest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not all platform features map 1:1. Retell's Conversation Flows support complex multi-path transitions that VAPI's simpler model can't represent directly. The exporters handle these gaps by flattening or annotating where fidelity is lost. The voicetest IR format (&lt;code&gt;.vt.json&lt;/code&gt;) preserves everything, including snippet references, making it the best format for version control.&lt;/p&gt;

&lt;h3&gt;
  
  
  CI/CD integration
&lt;/h3&gt;

&lt;p&gt;The platform-agnostic test suite integrates into CI with a single GitHub Actions workflow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Voice Agent Tests&lt;/span&gt;
&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agents/**"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tests/**"&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;
    &lt;span class="na"&gt;strategy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;matrix&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;agents/retell-receptionist.json&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;agents/vapi-receptionist.json&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;agents/bland-receptionist.json&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;astral-sh/setup-uv@v5&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv tool install voicetest&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;voicetest run --agent {% raw %}${{ matrix.agent }}{% endraw %} --tests tests/receptionist-suite.json --all&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;% raw %&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="s"&gt;${{ secrets.GROQ_API_KEY }}{% endraw %}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The matrix strategy runs each agent as a separate job. If your VAPI agent regresses while your Retell agent passes, you see exactly which platform broke.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. GitHub: &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;github.com/voicetestdev/voicetest&lt;/a&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>testing</category>
    </item>
    <item>
      <title>Prompt Snippets and Auto-DRY Analysis for Voice Agent Graphs</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Wed, 25 Feb 2026 02:44:33 +0000</pubDate>
      <link>https://dev.to/pld/prompt-snippets-and-auto-dry-analysis-for-voice-agent-graphs-1972</link>
      <guid>https://dev.to/pld/prompt-snippets-and-auto-dry-analysis-for-voice-agent-graphs-1972</guid>
      <description>&lt;p&gt;Voice agent configs accumulate duplicated text fast. A Retell Conversation Flow with 15 nodes might repeat the same compliance disclaimer in 8 of them, the same sign-off phrase in 12, and the same tone instruction in all 15. When you need to update that disclaimer, you're doing find-and-replace across a JSON blob and hoping you didn't miss one.&lt;/p&gt;

&lt;p&gt;Voicetest 0.23 adds prompt snippets and automatic DRY analysis to fix this. This post covers how the detection algorithm works, the snippet reference system, and how it integrates with the existing export pipeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  The problem in concrete terms
&lt;/h3&gt;

&lt;p&gt;Here's a simplified agent graph with three nodes. Notice the repeated text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"greeting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Welcome the caller. Always be professional and empathetic in your responses. When ending the call, say: Thank you for calling, is there anything else I can help with?"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"billing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Help with billing inquiries. Always be professional and empathetic in your responses. When ending the call, say: Thank you for calling, is there anything else I can help with?"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"transfer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Transfer to a human agent. Always be professional and empathetic in your responses. When ending the call, say: Thank you for calling, is there anything else I can help with?"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two sentences are duplicated across all three nodes. In a real agent with 15-20 nodes, this kind of duplication is the norm. It creates maintenance risk: update the sign-off in one node and forget another, and your agent behaves inconsistently.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the DRY analyzer works
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;voicetest.snippets&lt;/code&gt; module implements a two-pass detection algorithm over all text in an agent graph -- node prompts and the general prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pass 1: Exact matches.&lt;/strong&gt; &lt;code&gt;find_repeated_text&lt;/code&gt; splits every prompt into sentences, then counts occurrences across nodes. Any sentence that appears in 2+ locations and exceeds a minimum character threshold (default 20) is flagged. The result includes the matched text and which node IDs contain it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;voicetest.snippets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;find_repeated_text&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_repeated_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; found in nodes: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;locations&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Pass 2: Fuzzy matches.&lt;/strong&gt; &lt;code&gt;find_similar_text&lt;/code&gt; runs pairwise similarity comparison on sentences that weren't caught as exact duplicates. It uses &lt;code&gt;SequenceMatcher&lt;/code&gt; (from the standard library) with a configurable threshold (default 0.8). This catches near-duplicates like "Please verify the caller's identity before proceeding" vs "Please verify the caller identity before proceeding with any request."&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;voicetest.snippets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;find_similar_text&lt;/span&gt;

&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;find_similar_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;match&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Similarity &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;similarity&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;texts&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;suggest_snippets&lt;/code&gt; function runs both passes and returns a combined result:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;voicetest.snippets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;suggest_snippets&lt;/span&gt;

&lt;span class="n"&gt;suggestions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;suggest_snippets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Exact duplicates: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suggestions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Fuzzy matches: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;suggestions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fuzzy&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The snippet reference system
&lt;/h3&gt;

&lt;p&gt;Snippets use &lt;code&gt;{%name%}&lt;/code&gt; syntax (percent-delimited braces) to distinguish them from dynamic variables (&lt;code&gt;{{name}}&lt;/code&gt;). They're defined at the agent level and expanded before variable substitution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"snippets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tone"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Always be professional and empathetic in your responses."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"sign_off"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Thank you for calling, is there anything else I can help with?"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"nodes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"greeting"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Welcome the caller. {%tone%} When ending the call, say: {%sign_off%}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"billing"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"state_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Help with billing inquiries. {%tone%} When ending the call, say: {%sign_off%}"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Expansion ordering
&lt;/h3&gt;

&lt;p&gt;During a test run, the &lt;code&gt;ConversationEngine&lt;/code&gt; expands snippets first, then substitutes dynamic variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# In ConversationEngine.process_turn():
&lt;/span&gt;&lt;span class="n"&gt;general_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand_snippets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand_snippets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_module&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;snippets&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;general_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;substitute_variables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;general_instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_dynamic_variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;state_instructions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;substitute_variables&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state_instructions&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_dynamic_variables&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ordering matters. Snippets are static text blocks resolved at expansion time. Variables are runtime values (caller name, account ID, etc.) that differ per conversation. Expanding snippets first means a snippet can contain &lt;code&gt;{{variable}}&lt;/code&gt; references that get resolved in the second pass.&lt;/p&gt;

&lt;h3&gt;
  
  
  Export modes
&lt;/h3&gt;

&lt;p&gt;When an agent uses snippets, the export pipeline offers two modes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Raw (&lt;code&gt;.vt.json&lt;/code&gt;)&lt;/strong&gt;: Preserves &lt;code&gt;{%snippet%}&lt;/code&gt; references and the snippets dictionary. This is the voicetest-native format for version control and sharing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded&lt;/strong&gt;: Resolves all snippet references to plain text. Required for platform deployment -- Retell, VAPI, LiveKit, and Bland don't understand snippet syntax.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;code&gt;expand_graph_snippets&lt;/code&gt; function produces a deep copy with all references resolved:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;voicetest.templating&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;expand_graph_snippets&lt;/span&gt;

&lt;span class="n"&gt;expanded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;expand_graph_snippets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# expanded.snippets == {}
# expanded.nodes["greeting"].state_prompt contains the full text
# original graph is unchanged
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Platform-specific exporters (Retell, VAPI, Bland, Telnyx, LiveKit) always receive expanded graphs. The voicetest IR exporter preserves references.&lt;/p&gt;

&lt;h3&gt;
  
  
  REST API
&lt;/h3&gt;

&lt;p&gt;The snippet system is fully exposed via REST:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List snippets&lt;/span&gt;
GET /api/agents/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/snippets

&lt;span class="c"&gt;# Create/update a snippet&lt;/span&gt;
PUT /api/agents/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/snippets/tone
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"text"&lt;/span&gt;: &lt;span class="s2"&gt;"Always be professional and empathetic."&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;

&lt;span class="c"&gt;# Delete a snippet&lt;/span&gt;
DELETE /api/agents/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/snippets/tone

&lt;span class="c"&gt;# Run DRY analysis&lt;/span&gt;
POST /api/agents/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/analyze-dry
&lt;span class="c"&gt;# Returns: {"exact": [...], "fuzzy": [...]}&lt;/span&gt;

&lt;span class="c"&gt;# Apply suggested snippets&lt;/span&gt;
POST /api/agents/&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="nb"&gt;id&lt;/span&gt;&lt;span class="o"&gt;}&lt;/span&gt;/apply-snippets
&lt;span class="o"&gt;{&lt;/span&gt;&lt;span class="s2"&gt;"snippets"&lt;/span&gt;: &lt;span class="o"&gt;[{&lt;/span&gt;&lt;span class="s2"&gt;"name"&lt;/span&gt;: &lt;span class="s2"&gt;"tone"&lt;/span&gt;, &lt;span class="s2"&gt;"text"&lt;/span&gt;: &lt;span class="s2"&gt;"Always be professional."&lt;/span&gt;&lt;span class="o"&gt;}]}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Web UI
&lt;/h3&gt;

&lt;p&gt;In the agent view, the Snippets section shows all defined snippets with inline editing. The "Analyze DRY" button runs the detection algorithm and presents results as actionable suggestions -- click "Apply" on an exact match to extract it into a snippet and replace all occurrences, or "Apply All" to batch-process every exact duplicate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why this matters for testing
&lt;/h3&gt;

&lt;p&gt;Duplicated prompts aren't just a maintenance problem -- they're a testing problem. If two nodes have slightly different versions of the same instruction (one updated, one stale), your test suite might pass on the updated node and miss the regression on the stale one. Snippets guarantee consistency: update the snippet once, every node that references it gets the change.&lt;/p&gt;

&lt;p&gt;Combined with &lt;a href="https://dev.to/blog/voice-agent-evaluation-llm-judges/"&gt;voicetest's LLM-as-judge evaluation&lt;/a&gt;, snippets make your test results more reliable. When every node uses the same &lt;code&gt;{%tone%}&lt;/code&gt; snippet, a global metric like "Professional Tone" evaluates the same instruction everywhere. No more false passes from nodes running outdated prompt text.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;voicetest
voicetest demo &lt;span class="nt"&gt;--serve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voicetest is open source under Apache 2.0. &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;a href="https://voicetest.dev/api/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>codequality</category>
      <category>llm</category>
      <category>tooling</category>
    </item>
    <item>
      <title>Voice Agent Evaluation with LLM Judges: How It Works</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Thu, 19 Feb 2026 00:58:32 +0000</pubDate>
      <link>https://dev.to/pld/voice-agent-evaluation-with-llm-judges-how-it-works-11hb</link>
      <guid>https://dev.to/pld/voice-agent-evaluation-with-llm-judges-how-it-works-11hb</guid>
      <description>&lt;p&gt;You can write unit tests for a REST API. You can snapshot-test a React component. But how do you test a voice agent that holds free-form conversations?&lt;/p&gt;

&lt;p&gt;The core challenge: voice agent behavior is non-deterministic. The same agent, given the same prompt, will produce different conversations every time. Traditional assertion-based testing breaks down when there is no single correct output. You need an evaluator that understands intent, not just string matching.&lt;/p&gt;

&lt;p&gt;Voicetest solves this with LLM-as-judge evaluation. It simulates multi-turn conversations with your agent, then passes the full transcript to a judge model that scores it against your success criteria. This post explains how each piece works.&lt;/p&gt;

&lt;h3&gt;
  
  
  The three-model architecture
&lt;/h3&gt;

&lt;p&gt;Voicetest uses three separate LLM roles during a test run:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulator.&lt;/strong&gt; Plays the user. Given a persona prompt (name, goal, personality), it generates realistic user messages turn by turn. It decides autonomously when the conversation goal has been achieved and should end -- no scripted dialogue trees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent.&lt;/strong&gt; Plays your voice agent. Voicetest imports your agent config (from Retell, VAPI, LiveKit, or its own format) into an intermediate graph representation: nodes with state prompts, transitions with conditions, and tool definitions. The agent model follows this graph, responding according to the current node's instructions and transitioning between states.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge.&lt;/strong&gt; Evaluates the finished transcript. This is where LLM-as-judge happens: the judge reads the full conversation and scores it against each metric you defined.&lt;/p&gt;

&lt;p&gt;You can assign different models to each role. Use a fast, cheap model for simulation (it just needs to follow a persona) and a more capable model for judging (where accuracy matters):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[models]&lt;/span&gt;
&lt;span class="py"&gt;simulator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"groq/llama-3.1-8b-instant"&lt;/span&gt;
&lt;span class="py"&gt;agent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;
&lt;span class="py"&gt;judge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai/gpt-4o"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  How simulation works
&lt;/h3&gt;

&lt;p&gt;Each test case defines a user persona:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Appointment reschedule"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are Maria Lopez, DOB 03/15/1990. You need to reschedule your Thursday appointment to next week. You prefer mornings."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Agent verified the patient's identity before making changes."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"Agent confirmed the new appointment date and time."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Voicetest starts the conversation at the agent's entry node. The simulator generates a user message based on the persona. The agent responds following the current node's state prompt, then voicetest evaluates transition conditions to determine the next node. This loop continues for up to &lt;code&gt;max_turns&lt;/code&gt; (default 20) or until the simulator decides the goal is complete.&lt;/p&gt;

&lt;p&gt;The result is a full transcript with metadata: which nodes were visited, which tools were called, how many turns it took, and why the conversation ended.&lt;/p&gt;

&lt;h3&gt;
  
  
  How the judge scores
&lt;/h3&gt;

&lt;p&gt;After simulation, the judge evaluates each metric independently. For the metric "Agent verified the patient's identity before making changes," the judge produces structured output with four fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Analysis&lt;/strong&gt;: Breaks compound criteria into individual requirements and quotes transcript evidence for each. For this metric, it would identify two requirements -- (1) asked for identity verification, (2) verified before making changes -- and cite the specific turns where each happened or did not.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Score&lt;/strong&gt;: 0.0 to 1.0, based on the fraction of requirements met. If the agent verified identity but did it &lt;em&gt;after&lt;/em&gt; making the change, the score might be 0.5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reasoning&lt;/strong&gt;: A summary of what passed and what failed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Confidence&lt;/strong&gt;: How certain the judge is in its assessment.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A test passes when all metric scores meet the threshold (default 0.7, configurable per-agent or per-metric).&lt;/p&gt;

&lt;p&gt;This structured approach -- analysis before scoring -- prevents a common failure mode where judges assign a high score despite noting problems in their reasoning. By forcing the model to enumerate requirements and evidence first, the score stays consistent with the analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rule tests: when you do not need an LLM
&lt;/h3&gt;

&lt;p&gt;Not everything requires a judge. Voicetest also supports deterministic rule tests for pattern-matching checks:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No SSN in transcript"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are Jane, SSN 123-45-6789. Ask the agent to verify your identity."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"excludes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"123-45-6789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rule"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Rule tests check for &lt;code&gt;includes&lt;/code&gt; (required substrings), &lt;code&gt;excludes&lt;/code&gt; (forbidden substrings), and &lt;code&gt;patterns&lt;/code&gt; (regex). They run instantly, cost nothing, and return binary pass/fail with 100% confidence. Use them for compliance checks, PII detection, and required-phrase validation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Global metrics: compliance at scale
&lt;/h3&gt;

&lt;p&gt;Individual test metrics evaluate specific scenarios. Global metrics evaluate every test transcript against organization-wide criteria:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"global_metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"HIPAA Compliance"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"criteria"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent verifies patient identity before disclosing any protected health information."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brand Voice"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"criteria"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Agent maintains a professional, empathetic tone throughout the conversation."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"threshold"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Global metrics run on every test automatically. A test only passes if both its own metrics and all global metrics meet their thresholds. This gives you a single place to enforce standards like HIPAA, PCI-DSS, or brand guidelines across your entire test suite.&lt;/p&gt;

&lt;h3&gt;
  
  
  Putting it together
&lt;/h3&gt;

&lt;p&gt;A complete test run looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Voicetest imports your agent config into its graph representation.&lt;/li&gt;
&lt;li&gt;For each test case, it runs a multi-turn simulation using the simulator and agent models.&lt;/li&gt;
&lt;li&gt;The judge evaluates each metric and each global metric against the transcript.&lt;/li&gt;
&lt;li&gt;Results are stored in DuckDB with the full transcript, scores, reasoning, nodes visited, and tools called.&lt;/li&gt;
&lt;li&gt;A test passes only if every metric and every global metric meets its threshold.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The web UI (&lt;code&gt;voicetest serve&lt;/code&gt;) shows results visually -- transcripts with node annotations, metric scores with judge reasoning, and pass/fail status. The CLI outputs the same data to stdout for CI integration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;voicetest
voicetest demo &lt;span class="nt"&gt;--serve&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The demo loads a sample agent with test cases and opens the web UI so you can see the full evaluation pipeline in action.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;a href="https://voicetest.dev/api/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>llm</category>
      <category>testing</category>
    </item>
    <item>
      <title>Using Claude Code as a Free LLM Backend for Voice Agent Testing</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Wed, 18 Feb 2026 03:20:08 +0000</pubDate>
      <link>https://dev.to/pld/using-claude-code-as-a-free-llm-backend-for-voice-agent-testing-40bg</link>
      <guid>https://dev.to/pld/using-claude-code-as-a-free-llm-backend-for-voice-agent-testing-40bg</guid>
      <description>&lt;p&gt;Running a voice agent test suite means making a lot of LLM calls. Each test runs a multi-turn simulation (10-20 turns of back-and-forth), then passes the full transcript to a judge model for evaluation. A suite of 20 tests can easily hit 200+ LLM calls. At API rates, that adds up fast -- especially if you are using a capable model for judging.&lt;/p&gt;

&lt;p&gt;If you have a Claude Pro or Max subscription, you already have access to Claude models through Claude Code. Voicetest can use the &lt;code&gt;claude&lt;/code&gt; CLI as its LLM backend, routing all inference through your existing subscription instead of billing per-token through an API provider.&lt;/p&gt;

&lt;h3&gt;
  
  
  How it works
&lt;/h3&gt;

&lt;p&gt;Voicetest has a built-in Claude Code provider. When you set a model string starting with &lt;code&gt;claudecode/&lt;/code&gt;, voicetest invokes the &lt;code&gt;claude&lt;/code&gt; CLI in non-interactive mode, passes the prompt, and parses the JSON response. It clears the &lt;code&gt;ANTHROPIC_API_KEY&lt;/code&gt; environment variable from the subprocess so that Claude Code uses your subscription quota rather than any configured API key.&lt;/p&gt;

&lt;p&gt;No proxy server. No API key management. Just the &lt;code&gt;claude&lt;/code&gt; binary on your PATH.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install Claude Code
&lt;/h3&gt;

&lt;p&gt;Follow the instructions at &lt;a href="https://claude.ai/claude-code" rel="noopener noreferrer"&gt;claude.ai/claude-code&lt;/a&gt;. After installation, verify it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;claude &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make sure you are logged in to your Claude account.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Install voicetest
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;voicetest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Configure settings.toml
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;.voicetest/settings.toml&lt;/code&gt; in your project directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[models]&lt;/span&gt;
&lt;span class="py"&gt;agent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/sonnet"&lt;/span&gt;
&lt;span class="py"&gt;simulator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/haiku"&lt;/span&gt;
&lt;span class="py"&gt;judge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/sonnet"&lt;/span&gt;

&lt;span class="nn"&gt;[run]&lt;/span&gt;
&lt;span class="py"&gt;max_turns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="py"&gt;verbose&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The model strings follow the pattern &lt;code&gt;claudecode/&amp;lt;variant&amp;gt;&lt;/code&gt;. The supported variants are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;claudecode/haiku&lt;/code&gt; -- Fast, cheap on quota. Good for simulation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claudecode/sonnet&lt;/code&gt; -- Balanced. Good for judging and agent simulation.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;claudecode/opus&lt;/code&gt; -- Most capable. Use when judging accuracy matters most.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Run tests
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent&lt;/span&gt; agents/my-agent.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tests&lt;/span&gt; agents/my-tests.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No API keys needed. Voicetest calls &lt;code&gt;claude -p --output-format json --model sonnet&lt;/code&gt; under the hood, gets a JSON response, and extracts the result.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model mixing
&lt;/h3&gt;

&lt;p&gt;The three model roles in voicetest serve different purposes, and you can mix models to optimize for speed and accuracy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simulator&lt;/strong&gt; (&lt;code&gt;simulator&lt;/code&gt;): Plays the user persona. This model follows a script (the &lt;code&gt;user_prompt&lt;/code&gt; from your test case), so it does not need to be particularly capable. Haiku is a good fit -- it is fast and consumes less of your quota.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt; (&lt;code&gt;agent&lt;/code&gt;): Plays the role of your voice agent, following the prompts and transition logic from your imported config. Sonnet handles this well.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Judge&lt;/strong&gt; (&lt;code&gt;judge&lt;/code&gt;): Evaluates the full transcript against your metrics and produces a score from 0.0 to 1.0 with written reasoning. This is where accuracy matters most. Sonnet is reliable here; Opus is worth it if you need the highest-fidelity judgments.&lt;/p&gt;

&lt;p&gt;A practical configuration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[models]&lt;/span&gt;
&lt;span class="py"&gt;agent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/sonnet"&lt;/span&gt;
&lt;span class="py"&gt;simulator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/haiku"&lt;/span&gt;
&lt;span class="py"&gt;judge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"claudecode/sonnet"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This keeps simulations fast while giving the judge enough capability to produce accurate scores.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost comparison
&lt;/h3&gt;

&lt;p&gt;With API billing (e.g., through OpenRouter or direct Anthropic API), a test suite of 20 LLM tests at ~15 turns each, using Sonnet for judging, costs roughly $2-5 per run depending on transcript length. Run that 10 times a day during development and you are looking at $20-50/day in API costs.&lt;/p&gt;

&lt;p&gt;With a Claude Pro ($20/month) or Max ($100-200/month) subscription, the same tests run against your plan's usage allowance. For teams already paying for Claude Code as a development tool, the marginal cost of running voice agent tests is zero.&lt;/p&gt;

&lt;p&gt;The tradeoff: API calls are parallelizable and have predictable throughput. Claude Code passthrough runs sequentially (one CLI invocation at a time) and is subject to your plan's rate limits. For CI pipelines with large test suites, API billing may still make more sense. For local development and smaller suites, the subscription route is hard to beat.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to use which
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Recommended backend&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Local development, iterating on prompts&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claudecode/*&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Small CI suite (&amp;lt; 10 tests)&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claudecode/*&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Large CI suite, parallel runs&lt;/td&gt;
&lt;td&gt;API provider (OpenRouter, Anthropic)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Team with shared API budget&lt;/td&gt;
&lt;td&gt;API provider&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Solo developer with Max subscription&lt;/td&gt;
&lt;td&gt;&lt;code&gt;claudecode/*&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Getting started
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;voicetest
voicetest demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;demo&lt;/code&gt; command loads a sample healthcare receptionist agent with test cases so you can try it without any setup.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;a href="https://voicetest.dev/api/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>How to Test a Retell Agent in CI with GitHub Actions</title>
      <dc:creator>Peter</dc:creator>
      <pubDate>Mon, 16 Feb 2026 20:21:58 +0000</pubDate>
      <link>https://dev.to/pld/how-to-test-a-retell-agent-in-ci-with-github-actions-2cjb</link>
      <guid>https://dev.to/pld/how-to-test-a-retell-agent-in-ci-with-github-actions-2cjb</guid>
      <description>&lt;p&gt;Manual testing of voice agents does not scale. You click through a few conversations in the Retell dashboard, confirm the agent sounds right, and ship it. Then someone updates a prompt, a transition breaks, and you find out from a customer complaint. The feedback loop is days, not minutes.&lt;/p&gt;

&lt;p&gt;Voicetest fixes this. It imports your Retell Conversation Flow, simulates multi-turn conversations using an LLM, and evaluates the results with an LLM judge that produces scores and reasoning. You can run it locally, but the real value comes from running it in CI on every push.&lt;/p&gt;

&lt;p&gt;This post walks through the full setup: from installing voicetest to a working GitHub Actions workflow that tests your Retell agent automatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Install voicetest
&lt;/h3&gt;

&lt;p&gt;Voicetest is a Python CLI tool published on PyPI. The recommended way to install it is as a uv tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uv tool &lt;span class="nb"&gt;install &lt;/span&gt;voicetest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify it works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Export your Retell agent
&lt;/h3&gt;

&lt;p&gt;In the Retell dashboard, open your Conversation Flow and export it as JSON. Save it to your repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;agents/
  receptionist.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The exported JSON contains your nodes, edges, prompts, transition conditions, and tool definitions. Voicetest auto-detects the Retell format by looking for &lt;code&gt;start_node_id&lt;/code&gt; and &lt;code&gt;nodes&lt;/code&gt; in the JSON.&lt;/p&gt;

&lt;p&gt;If you prefer to pull the config programmatically (useful for keeping tests in sync with the live agent), voicetest can also fetch directly from the Retell API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;RETELL_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Write test cases
&lt;/h3&gt;

&lt;p&gt;Create a test file with one or more test cases. Each test defines a simulated user persona, what the user will do, and metrics for the LLM judge to evaluate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Billing inquiry"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Say you are Jane Smith with account 12345. You're confused about a charge on your bill and want help understanding it."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"metrics"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Agent greeted the customer and addressed the billing concern."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="s2"&gt;"Agent was helpful and professional throughout."&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"No PII in transcript"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"user_prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"You are Jane with SSN 123-45-6789. Verify your identity."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"includes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"verified"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"identity"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"excludes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"123-45-6789"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"123456789"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rule"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two test types. LLM tests (&lt;code&gt;"type": "llm"&lt;/code&gt;) run a full multi-turn simulation and then pass the transcript to an LLM judge, which scores each metric from 0.0 to 1.0 with written reasoning. Rule tests (&lt;code&gt;"type": "rule"&lt;/code&gt;) use deterministic pattern matching -- checking that the transcript includes required strings, excludes forbidden ones, or matches regex patterns. Rule tests are fast and free, good for compliance checks like PII leakage.&lt;/p&gt;

&lt;p&gt;Save this as &lt;code&gt;agents/receptionist-tests.json&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 4: Configure your LLM backend
&lt;/h3&gt;

&lt;p&gt;Voicetest uses LiteLLM model strings, so any provider works. Create a &lt;code&gt;.voicetest/settings.toml&lt;/code&gt; in your project root:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="nn"&gt;[models]&lt;/span&gt;
&lt;span class="py"&gt;agent&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;
&lt;span class="py"&gt;simulator&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"groq/llama-3.1-8b-instant"&lt;/span&gt;
&lt;span class="py"&gt;judge&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"groq/llama-3.3-70b-versatile"&lt;/span&gt;

&lt;span class="nn"&gt;[run]&lt;/span&gt;
&lt;span class="py"&gt;max_turns&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;20&lt;/span&gt;
&lt;span class="py"&gt;verbose&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;simulator&lt;/code&gt; model plays the user. It should be fast and cheap since it just follows the persona script. The &lt;code&gt;judge&lt;/code&gt; model evaluates the transcript and should be accurate. The &lt;code&gt;agent&lt;/code&gt; model plays the role of your voice agent, following the prompts and transitions from your Retell config.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 5: Run locally
&lt;/h3&gt;

&lt;p&gt;Before setting up CI, verify everything works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;your_key_here

voicetest run &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--agent&lt;/span&gt; agents/receptionist.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tests&lt;/span&gt; agents/receptionist-tests.json &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see each test run, the simulated conversation, and the judge's scores. Fix any test definitions that do not match your agent's behavior, then commit everything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git add agents/ .voicetest/settings.toml
git commit &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"Add voicetest config and test cases"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 6: Set up GitHub Actions
&lt;/h3&gt;

&lt;p&gt;Add your API key as a repository secret. Go to Settings &amp;gt; Secrets and variables &amp;gt; Actions, and add &lt;code&gt;GROQ_API_KEY&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then create &lt;code&gt;.github/workflows/voicetest.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Voice Agent Tests&lt;/span&gt;

&lt;span class="na"&gt;on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;push&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agents/**"&lt;/span&gt;
  &lt;span class="na"&gt;pull_request&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;paths&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agents/**"&lt;/span&gt;
  &lt;span class="na"&gt;workflow_dispatch&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;

&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;runs-on&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-latest&lt;/span&gt;

    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;actions/checkout@v4&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install uv&lt;/span&gt;
        &lt;span class="na"&gt;uses&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;astral-sh/setup-uv@v5&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Set up Python&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv python install &lt;/span&gt;&lt;span class="m"&gt;3.12&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Install voicetest&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;uv tool install voicetest&lt;/span&gt;

      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Run voice agent tests&lt;/span&gt;
        &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;GROQ_API_KEY&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;% raw %&lt;/span&gt;&lt;span class="pi"&gt;}&lt;/span&gt;&lt;span class="s"&gt;${{ secrets.GROQ_API_KEY }}{% endraw %}&lt;/span&gt;
        &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
          &lt;span class="s"&gt;voicetest run \&lt;/span&gt;
            &lt;span class="s"&gt;--agent agents/receptionist.json \&lt;/span&gt;
            &lt;span class="s"&gt;--tests agents/receptionist-tests.json \&lt;/span&gt;
            &lt;span class="s"&gt;--all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The workflow triggers on any change to files in &lt;code&gt;agents/&lt;/code&gt;, which means prompt edits, new test cases, or config changes all trigger a test run. The &lt;code&gt;workflow_dispatch&lt;/code&gt; trigger lets you run tests manually from the GitHub UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's next
&lt;/h3&gt;

&lt;p&gt;Once you have CI working, there are a few things worth exploring:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global compliance metrics.&lt;/strong&gt; Voicetest supports HIPAA and PCI-DSS compliance checks that run across the entire transcript, not just per-test. These catch issues like agents accidentally reading back credit card numbers or disclosing PHI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Format conversion.&lt;/strong&gt; If you ever want to move from Retell to VAPI or LiveKit, voicetest can convert your agent config between platforms via its AgentGraph intermediate representation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;voicetest &lt;span class="nb"&gt;export&lt;/span&gt; &lt;span class="nt"&gt;--agent&lt;/span&gt; agents/receptionist.json &lt;span class="nt"&gt;--format&lt;/span&gt; vapi-assistant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The web UI.&lt;/strong&gt; For a visual interface during development, run &lt;code&gt;voicetest serve&lt;/code&gt; and open &lt;code&gt;http://localhost:8000&lt;/code&gt;. You get a dashboard with test results, transcripts, and scores.&lt;/p&gt;

&lt;p&gt;Voicetest is open source under Apache 2.0. &lt;a href="https://github.com/voicetestdev/voicetest" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. &lt;a href="https://voicetest.dev/api/" rel="noopener noreferrer"&gt;Docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>testing</category>
      <category>ai</category>
      <category>devops</category>
    </item>
  </channel>
</rss>
