<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dima Statz</title>
    <description>The latest articles on DEV Community by Dima Statz (@dimastatz).</description>
    <link>https://dev.to/dimastatz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F201303%2Fc5c36ca4-1d87-47aa-88ff-d2f32dfa6918.jpeg</url>
      <title>DEV Community: Dima Statz</title>
      <link>https://dev.to/dimastatz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dimastatz"/>
    <language>en</language>
    <item>
      <title>Your AI Voice Agent Is a Black Box. Here's How to Open It.</title>
      <dc:creator>Dima Statz</dc:creator>
      <pubDate>Sat, 27 Jun 2026 04:39:16 +0000</pubDate>
      <link>https://dev.to/dimastatz/your-ai-voice-agent-is-a-black-box-heres-how-to-open-it-41kc</link>
      <guid>https://dev.to/dimastatz/your-ai-voice-agent-is-a-black-box-heres-how-to-open-it-41kc</guid>
      <description>&lt;p&gt;When your AI agent types, you can see everything it does. LangChain traces every&lt;br&gt;
step, LangSmith replays every run, OpenTelemetry hangs spans off each call. You&lt;br&gt;
know what the model saw, what it said, how long it took, and what it cost.&lt;/p&gt;

&lt;p&gt;The moment that same agent picks up a phone, the lights go out.&lt;/p&gt;

&lt;p&gt;A voice agent's entire interaction lives inside an &lt;code&gt;.mp3&lt;/code&gt;. The transcript, the&lt;br&gt;
customer's mood, the awkward four-second silence, the moment it talked over the&lt;br&gt;
caller, the point where the conversation went sideways — all of it is in there.&lt;br&gt;
But to your existing observability stack, that file is opaque. LangSmith sees the&lt;br&gt;
tokens you fed the LLM; it does not see the audio that reached a human ear.&lt;/p&gt;

&lt;p&gt;So most teams do the only thing they can: they listen to a handful of calls by&lt;br&gt;
hand and hope the sample is representative. That doesn't scale, and it misses the&lt;br&gt;
thing that makes voice agents hard — &lt;strong&gt;their behavior drifts.&lt;/strong&gt; You tweak a&lt;br&gt;
prompt, swap a model, change a TTS voice, and the agent gets subtly slower,&lt;br&gt;
colder, or starts missing intents. No unit test catches it, because the&lt;br&gt;
regression lives in the audio.&lt;/p&gt;

&lt;p&gt;This series is about closing that gap. In this first post I'll lay out the mental&lt;br&gt;
model; the next two get hands-on with a tricky signal-extraction problem and with&lt;br&gt;
wiring voice signals into CI.&lt;/p&gt;
&lt;h2&gt;The artifact is richer than you think&lt;/h2&gt;

&lt;p&gt;Here's what's actually recoverable from a single call recording:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Transcript&lt;/strong&gt; — what was said, by whom, with timestamps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality&lt;/strong&gt; — silence gaps, interruptions, speaking pace, pitch variance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sentiment&lt;/strong&gt; — the caller's mood, and where it shifted.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt; — how long each stage (STT, LLM, TTS) took to respond.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — what the call cost, attributed per stage.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Events&lt;/strong&gt; — the detected intent, whether the caller dropped off, compliance flags.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's a lot of signal locked inside one file. Prying it loose means bolting&lt;br&gt;
together speech recognition, speaker separation, audio analysis, a sentiment&lt;br&gt;
model, and a pricing sheet, and then maintaining all of it. That is why teams end&lt;br&gt;
up rebuilding the same pipeline from scratch at every company.&lt;/p&gt;
&lt;h2&gt;Two ways to pull meaning out of audio&lt;/h2&gt;

&lt;p&gt;The key insight that makes this tractable: there are really &lt;strong&gt;two different&lt;br&gt;
kinds of question&lt;/strong&gt; you can ask of audio, and they want two different tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Measure it — classical signal processing.&lt;/strong&gt; Deterministic math run straight&lt;br&gt;
on the waveform: energy, pitch, the length of a silence. Cheap, repeatable, no&lt;br&gt;
training data. It shines for physical questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How long was the pause?&lt;/li&gt;
&lt;li&gt;How fast did someone speak?&lt;/li&gt;
&lt;li&gt;Is this voice high-pitched or low?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You &lt;em&gt;measure&lt;/em&gt; the answer instead of guessing at it.&lt;/p&gt;
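
&lt;p&gt;To make "measure it" concrete, here is a minimal sketch of the pause question, using only the Python standard library. It is not how AudioTrace does it: the function names, the 20 ms frame size, and the 0.01 energy threshold are all assumptions picked for illustration, and the "call" is a synthetic tone with a two-second gap rather than real speech.&lt;/p&gt;

```python
import math


def longest_silence_seconds(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return the longest run of low-energy frames, in seconds.

    frame_ms and threshold are illustrative defaults, not tuned values.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    longest = current = 0
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        # Root-mean-square energy of this frame.
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if threshold > rms:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * frame_ms / 1000


def synthetic_call(sample_rate=8000):
    """One second of a 220 Hz tone, two seconds of silence, one more second of tone."""
    tone = [0.5 * math.sin(2 * math.pi * 220 * n / sample_rate)
            for n in range(sample_rate)]
    silence = [0.0] * (2 * sample_rate)
    return tone + silence + tone


SAMPLE_RATE = 8000
pause = longest_silence_seconds(synthetic_call(SAMPLE_RATE), SAMPLE_RATE)
print(pause)  # 2.0
```

&lt;p&gt;Real recordings have background noise, so the threshold would need calibrating per channel, but the shape of the computation stays the same: frame the waveform, compute energy, count.&lt;/p&gt;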

&lt;p&gt;&lt;strong&gt;2. Estimate it — learned models.&lt;/strong&gt; Statistical systems like Whisper or a&lt;br&gt;
sentiment classifier that have ingested enormous amounts of data and &lt;em&gt;estimate&lt;/em&gt;&lt;br&gt;
an answer. They own everything that turns on meaning rather than physics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What words were said?&lt;/li&gt;
&lt;li&gt;Who is speaking?&lt;/li&gt;
&lt;li&gt;Is the caller upset?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No hand-written rule survives real speech here — you need a model.&lt;/p&gt;

&lt;p&gt;Most of the craft is knowing which question belongs to which bucket: reach for a&lt;br&gt;
model to &lt;strong&gt;estimate meaning&lt;/strong&gt;, for signal processing to &lt;strong&gt;measure physics&lt;/strong&gt;. (In&lt;br&gt;
the next post you'll see that when a model isn't available, a measurement can&lt;br&gt;
sometimes stand in for it — that turns out to be a surprisingly useful trick.)&lt;/p&gt;
&lt;h2&gt;One report, split along that line&lt;/h2&gt;

&lt;p&gt;I packaged this into a small open-source library called&lt;br&gt;
&lt;a href="https://github.com/dimastatz/audiotrace" rel="noopener noreferrer"&gt;AudioTrace&lt;/a&gt;. You hand it a recording;&lt;br&gt;
it hands back one structured, typed report — split along exactly that&lt;br&gt;
measure-vs-estimate line. The acoustic layer (silence, pace, pitch) comes from signal&lt;br&gt;
processing; the semantic layer (transcript, sentiment, intent) comes from models.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;audiotrace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;audiotrace&lt;/span&gt;

&lt;span class="n"&gt;report&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;audiotrace&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;analyze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;audio&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;call_recording.wav&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v2.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;provider&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;vapi&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;overall_score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;        &lt;span class="c1"&gt;# 0.87
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;quality&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;speaking_pace_wpm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;     &lt;span class="c1"&gt;# 168.0
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sentiment&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;caller_frustration&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;latency&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_ms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;              &lt;span class="c1"&gt;# 4200
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;drop_off&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;               &lt;span class="c1"&gt;# False
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;report&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cost&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;total_usd&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;                &lt;span class="c1"&gt;# 0.063
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The return value is a Pydantic &lt;code&gt;CallReport&lt;/code&gt;, so it's typed, validated, and trivial&lt;br&gt;
to serialize. You can emit it as OpenTelemetry spans, hang it off your LangChain&lt;br&gt;
and LangSmith traces, or assert on it in a CI check — which is exactly where this&lt;br&gt;
series is headed.&lt;/p&gt;
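
&lt;p&gt;As a rough sketch of what "assert on it in a CI check" could look like, the snippet below compares a few report fields against thresholds. Everything here is hypothetical: &lt;code&gt;ReportSnapshot&lt;/code&gt; is a plain dataclass standing in for the fields shown above, not the real &lt;code&gt;CallReport&lt;/code&gt;, and &lt;code&gt;find_regressions&lt;/code&gt; and its limits are made-up names and numbers chosen for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class ReportSnapshot:
    """Hypothetical stand-in for a few report fields (not the real CallReport)."""
    speaking_pace_wpm: float
    total_latency_ms: int
    caller_frustration: bool
    drop_off: bool


def find_regressions(report, max_latency_ms=5000, pace_range=(120.0, 190.0)):
    """Return human-readable threshold violations; an empty list means pass.

    The default limits are illustrative assumptions, not recommendations.
    """
    slowest, fastest = pace_range
    problems = []
    if report.total_latency_ms > max_latency_ms:
        problems.append(
            f"latency {report.total_latency_ms} ms exceeds {max_latency_ms} ms"
        )
    if slowest > report.speaking_pace_wpm or report.speaking_pace_wpm > fastest:
        problems.append(
            f"pace {report.speaking_pace_wpm} wpm outside {slowest}-{fastest}"
        )
    if report.caller_frustration:
        problems.append("caller frustration detected")
    if report.drop_off:
        problems.append("caller dropped off")
    return problems


# Values for the passing case mirror the sample output shown earlier.
baseline = ReportSnapshot(168.0, 4200, False, False)
regressed = ReportSnapshot(168.0, 6100, True, False)

baseline_problems = find_regressions(baseline)
regressed_problems = find_regressions(regressed)
print(baseline_problems)
print(regressed_problems)
```

&lt;p&gt;In a pipeline, a non-empty list would fail the build, and the messages tell you which signal moved.&lt;/p&gt;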
&lt;h2&gt;One decision shaped everything: it runs locally&lt;/h2&gt;

&lt;p&gt;Call recordings are about as sensitive as data gets. So AudioTrace runs entirely&lt;br&gt;
on your machine — no audio leaves the box, and the open models download once.&lt;br&gt;
Privacy here shouldn't be an upgrade you pay for; it should be the default.&lt;/p&gt;
&lt;h2&gt;What's next&lt;/h2&gt;

&lt;p&gt;The two-layer model sounds tidy, but the interesting part is what happens when&lt;br&gt;
the "right" tool isn't available. In the next post I'll walk through a concrete&lt;br&gt;
example: labeling &lt;strong&gt;who is speaking&lt;/strong&gt; without the gated model everyone reaches&lt;br&gt;
for — and why a few dozen lines of pitch measurement beat it for the common case.&lt;/p&gt;

&lt;p&gt;If you want to poke at it now:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;audiotrace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;⭐ The repo is at &lt;a href="https://github.com/dimastatz/audiotrace" rel="noopener noreferrer"&gt;github.com/dimastatz/audiotrace&lt;/a&gt;.&lt;br&gt;
Issues and PRs welcome — it's early, and provider integrations are exactly the&lt;br&gt;
kind of contribution that helps most.&lt;/p&gt;

&lt;p&gt;Keep building!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
