DEV Community: Dima Statz

A GitHub Action That Fails the Build When Your Voice Agent Gets Worse

Dima Statz — Sat, 11 Jul 2026 03:39:10 +0000

In the last post we put call audio in the test suite: treat a few golden
recordings as fixtures, analyze them on every change, and assert on the report so
a regression turns the build red. That works — but it's a pile of pytest you
have to write and maintain.

This post skips the boilerplate. There's now a GitHub Action that does the whole
thing: the AudioTrace regression gate.
Point it at a folder of recordings and a committed baseline, and it fails the
build when call quality regresses on latency, sentiment, drop-off, cost, or
compliance. It runs entirely on open models, so CI needs no secrets and makes no
calls to a hosted API.

Here's the whole thing, start to finish.

The idea in one sentence

Voice agents drift in ways no unit test catches — a prompt tweak makes the agent
slower, a new TTS voice makes it colder, a refactor drops a required disclosure.
So we commit a baseline of what "good" sounds like, and gate every PR
against it.

Step 1 — Commit a baseline

Grab a handful of representative call recordings — a happy path, a frustrated
caller, a compliance-heavy call — and drop them in your repo (say tests/calls/).
Then generate a baseline locally:

pip install audiotrace          # FFmpeg must be on your system
audiotrace baseline tests/calls -o baseline.json

This analyzes every recording and writes baseline.json — the committed snapshot
of "good." Commit it alongside your code:

git add tests/calls baseline.json
git commit -m "Add voice-quality baseline"

That's your golden master. When you intentionally improve the agent, you
regenerate and re-commit it — same workflow as snapshot testing.

Step 2 — Add the Action

Drop this into .github/workflows/voice-quality.yml:

name: Voice quality
on: [pull_request]

jobs:
  audiotrace:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: dimastatz/audiotrace@v1
        with:
          calls: tests/calls
          baseline: baseline.json

Two required inputs — where the recordings live and where the baseline lives.
That's it. The Action installs AudioTrace + FFmpeg, re-analyzes the calls,
compares each against the baseline, and exits non-zero on any out-of-tolerance
regression.

Step 3 — Watch it catch a regression

Open a PR that changes a prompt or swaps the model. On the next run the gate
re-scores your golden calls and, if the agent got measurably worse, the check
fails with a summary like:

FAIL  frustrated_customer.wav
  Response latency (p95)  1,820ms → 2,540ms   (+39%, allowed +15%)
  Sentiment               0.12 → -0.20        (Δ0.32, allowed 0.10)

1 of 3 calls regressed.

A green build means the change was safe to ship. A red build means you made the
agent slower or colder before a customer felt it.

Every run also uploads a per-call HTML + JSON report as a build artifact —
even on failure — so you can open the report and see exactly what moved.

Inputs

Input	Required	Default	What it does
`calls`	yes	—	Directory of golden call recordings to gate.
`baseline`	yes	`baseline.json`	The committed baseline to compare against.
`report-dir`	no	`audiotrace-report`	Where per-call HTML + JSON reports are written.
`version`	no	`audiotrace`	pip spec to pin AudioTrace (e.g. `audiotrace==1.2.2`).
`python-version`	no	`3.12`	Python version the gate runs on.

Tuning the tolerances

Conversational signals wiggle run to run, so the gate ships with sane per-metric
tolerances — a band the number can move within before it counts as a regression:

Metric	Tolerance
Quality score	±0.05
Sentiment	±0.10
Response p95	+15%
Cost	+20%
Interruptions	+1
Frustration / drop-off / compliance	zero — any regression fails

New recordings that aren't in the baseline yet are skipped, not failed, so
adding fixtures never breaks the build.
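
One way to picture how a gate like this might apply the tolerances above is the sketch below. It is not AudioTrace's actual implementation: the metric keys (`response_p95_ms`, `cost_usd`, and so on), the `TOLERANCES` layout, and the `regressions` and `gate` helpers are hypothetical names chosen for illustration, and only the tolerance values come from the table. It also assumes the ± bands are one-directional (only a move in the worse direction counts) and leaves out the zero-tolerance flags for brevity.

```python
# Hypothetical sketch of a per-metric tolerance check; names are illustrative,
# not AudioTrace's real schema.

# kind: "rel_up"   -> value may grow by at most `tol` (fraction of baseline)
#       "abs_up"   -> value may grow by at most `tol` (absolute)
#       "abs_down" -> value may drop by at most `tol` (absolute)
TOLERANCES = {
    "quality_score": ("abs_down", 0.05),
    "sentiment": ("abs_down", 0.10),
    "response_p95_ms": ("rel_up", 0.15),
    "cost_usd": ("rel_up", 0.20),
    "interruptions": ("abs_up", 1),
}


def regressions(baseline: dict, current: dict) -> list[str]:
    """Return the metrics in `current` that moved past their tolerance."""
    failed = []
    for metric, (kind, tol) in TOLERANCES.items():
        if metric not in baseline or metric not in current:
            continue  # nothing to compare against, so skip rather than fail
        old, new = baseline[metric], current[metric]
        if kind == "rel_up":
            bad = new > old * (1 + tol)
        elif kind == "abs_up":
            bad = new > old + tol
        else:  # "abs_down"
            bad = new < old - tol
        if bad:
            failed.append(metric)
    return failed


def gate(baselines: dict, reports: dict) -> int:
    """Exit code for the run: 1 if any call already in the baseline regressed."""
    failures = {
        name: regressions(baselines[name], report)
        for name, report in reports.items()
        if name in baselines  # new recordings are skipped, not failed
    }
    return 1 if any(failures.values()) else 0
```

Fed the numbers from the failing example earlier (p95 latency 1,820 ms to 2,540 ms, sentiment 0.12 to -0.20), `regressions` flags both metrics and `gate` returns 1.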

Why this runs in CI at all

The trick that makes this practical: audiotrace.analyze() runs locally on open
models (Whisper, pyannote, librosa), with no hosted API, no key, and no per-call bill. So
the gate needs nothing but your recordings and a runner. That's what lets it live
in pull_request CI instead of a nightly job behind a secret.

Try it

Action: AudioTrace regression gate on the Marketplace
Library: pip install audiotrace · github.com/dimastatz/audiotrace

Point it at three recordings and a baseline, open a PR, and watch a bad prompt
change go red. That's regression testing for the part of your product that used
to be a black box.

Fail the Build When Your Voice Agent Gets Worse

Dima Statz — Sat, 11 Jul 2026 03:38:05 +0000

In this series we've turned a raw call recording into a structured CallReport
(post 1) and looked at how to extract signals cheaply enough to run on every
call (post 2). Now the payoff: using those signals to stop regressions
before they ship.

A voice agent's behavior drifts. You change a prompt, swap a model, pick a new TTS
voice — and the agent gets subtly slower to respond, colder in tone, or starts
skipping a required disclosure. None of that shows up in a normal test suite,
because the regression lives in the audio. So let's put the audio in the test
suite.

The idea: golden recordings as test fixtures

Treat a small set of representative call recordings as fixtures. On every change,
analyze them and assert on the report. If a prompt change pushes a number past a
threshold, the build goes red — same as any other failing test.

import audiotrace
import pytest

# A few representative calls checked into the repo (or pulled from storage).
GOLDEN_CALLS = [
    "tests/calls/happy_path.wav",
    "tests/calls/frustrated_customer.wav",
    "tests/calls/compliance_heavy.wav",
]


@pytest.mark.parametrize("path", GOLDEN_CALLS)
def test_call_quality_does_not_regress(path):
    report = audiotrace.analyze(path, num_speakers=2)

    # Latency: the agent must stay responsive.
    assert report.latency.total_ms < 6000, "agent got too slow"

    # Quality: overall score must stay healthy.
    assert report.quality.overall_score >= 0.80

    # The agent shouldn't be talking over the caller.
    assert report.quality.interruptions <= 2


def test_required_disclosure_present():
    report = audiotrace.analyze("tests/calls/compliance_heavy.wav")
    # Compliance flags surface disclosures that are required but missing.
    assert "missing_disclosure" not in report.events.compliance_flags


def test_agent_does_not_frustrate_callers():
    report = audiotrace.analyze("tests/calls/happy_path.wav")
    assert report.sentiment.caller_frustration is False
    assert report.sentiment.overall >= 0.0  # net-neutral-or-better tone

Because analyze() runs locally with no API calls, this works in CI with no
secrets and no hosted service. The recordings and the open models are all you need.

Catching drift, not just hard failures

Absolute thresholds catch cliffs. To catch drift, compare against a baseline you
commit alongside the code:

import json
import audiotrace

def snapshot(path):
    r = audiotrace.analyze(path, num_speakers=2)
    return {
        "pace_wpm": r.quality.speaking_pace_wpm,
        "overall": r.quality.overall_score,
        "latency_ms": r.latency.total_ms,
        "sentiment": r.sentiment.overall,
    }

def test_no_drift_from_baseline():
    # Use a context manager so the file handle is closed deterministically.
    with open("tests/baseline.json") as f:
        baseline = json.load(f)
    current = snapshot("tests/calls/happy_path.wav")

    # Latency may not grow more than 15% vs. the committed baseline.
    assert current["latency_ms"] <= baseline["latency_ms"] * 1.15
    # Tone may not drop more than 0.1 absolute.
    assert current["sentiment"] >= baseline["sentiment"] - 0.1

When you intentionally improve the agent, you regenerate baseline.json and
commit it — the same workflow as snapshot testing.
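
Regenerating the baseline can be a small script. The sketch below is one hedged way to do it: `write_baseline` is a hypothetical helper, and it takes the snapshot function as a parameter so the `snapshot()` defined above can be passed in. Note that it keys the file by recording path, which differs from the single-call layout the test above reads, so that test would need to index into it by path.

```python
import json
from pathlib import Path
from typing import Callable


def write_baseline(paths: list[str], snapshot_fn: Callable[[str], dict],
                   out_path: str = "tests/baseline.json") -> dict:
    """Snapshot every golden call and write the results as stable JSON."""
    baseline = {path: snapshot_fn(path) for path in paths}
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    # sort_keys + indent keep the committed file diff-friendly across runs.
    out.write_text(json.dumps(baseline, indent=2, sort_keys=True) + "\n")
    return baseline
```

Called as `write_baseline(GOLDEN_CALLS, snapshot)`, it produces a file you can review in the PR diff before committing, which makes an intentional change to "good" visible to reviewers.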

Emit it as OpenTelemetry spans

CI catches regressions before they ship; observability catches what happens in
production. The CallReport maps cleanly onto OpenTelemetry, so voice-call
signals sit right next to the rest of your traces:

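import time

from opentelemetry import trace
import audiotrace

tracer = trace.get_tracer("audiotrace")

def trace_call(path: str):
    report = audiotrace.analyze(path)
    # OpenTelemetry timestamps are nanoseconds since the Unix epoch, so anchor
    # the call in wall-clock time (here: as if the call had just ended).
    call_start_ns = time.time_ns() - int(report.media.duration_ms * 1_000_000)
    with tracer.start_as_current_span("voice_call", start_time=call_start_ns) as span:
        span.set_attribute("call.duration_ms", report.media.duration_ms)
        span.set_attribute("call.quality_score", report.quality.overall_score)
        span.set_attribute("call.caller_frustrated", report.sentiment.caller_frustration)
        span.set_attribute("call.cost_usd", report.cost.total_usd)
        span.set_attribute("call.outcome", report.events.outcome)

        # The latency waterfall becomes child spans (STT, LLM, TTS, ...).
        # Assumption: start_ms is an offset from the start of the call.
        for stage in report.latency.waterfall:
            start_ns = call_start_ns + int(stage.start_ms * 1_000_000)
            # Assumption: each stage also exposes duration_ms; adjust to the real field.
            end_ns = start_ns + int(stage.duration_ms * 1_000_000)
            child = tracer.start_span(stage.name, start_time=start_ns)
            child.end(end_time=end_ns)
    return report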

Hang it off your LangChain / LangSmith traces

If you already trace your agent's reasoning in LangSmith, AudioTrace fills in the
half it can't see — what actually reached the caller's ear. Attach the report to
the run as metadata so the audio signals live next to the token-level trace:

from langsmith import Client
import audiotrace

client = Client()

def attach_audio_signals(run_id: str, recording: str):
    report = audiotrace.analyze(recording)
    client.update_run(
        run_id,
        extra={
            "audio": {
                "quality_score": report.quality.overall_score,
                "caller_frustration": report.sentiment.caller_frustration,
                "speaking_pace_wpm": report.quality.speaking_pace_wpm,
                "drop_off": report.events.drop_off,
                "total_cost_usd": report.cost.total_usd,
            }
        },
    )

Now a single LangSmith run shows both what the model thought and how the call
sounded — and the same signals that flag a bad call in production are the
examples you feed back in to fine-tune the next, better agent.

Wrapping the series

Three ideas, one thread:

A voice call is a rich artifact your token-level tooling can't read — so turn it into a typed CallReport.
Split the work by measure vs. estimate, and don't reach for a big model when a cheap measurement will do.
Put those signals where they pay off: red builds on regressions and spans/traces in production.

A lot of progress in AI isn't a new model — it's packaging hard-won engineering
into something others can pip install. That's what AudioTrace is trying to be
for voice agents.

pip install audiotrace

⭐ Repo: github.com/dimastatz/audiotrace —
it's early, and provider integrations + richer compliance checks are exactly where
contributions help most.

Keep building!

Measure, Don't Estimate: Labeling Speakers Without a Gated Model

Dima Statz — Sat, 11 Jul 2026 03:06:02 +0000

In the first post I argued there are two ways to pull meaning out of audio:
measure it with signal processing, or estimate it with a model. This post
is the story of a problem where the obvious move was to estimate — and where
measuring turned out to be better.

The problem: labeling who is speaking. A transcript that says "Agent: …" and
"Customer: …" is far more useful than an undifferentiated wall of text. Splitting
a conversation by speaker is called diarization.

The obvious tool, and why it hurt

The strong, well-known tool for diarization is pyannote. It's genuinely good.
It is also gated: to run it you need a Hugging Face account, an access token, and
to accept a license agreement before the weights will download.

That's fine for a production deployment. It's a terrible first impression for
someone who just pip install-ed your library and wants to see it work. Without a
token, every single turn comes back labeled "unknown". The newcomer's first run
is a wall of unknown: … and they bounce.

So I wanted a default path that works with zero setup, and lets you opt into
pyannote when you have a token and want the best quality.

Shortcut #1: just alternate speakers (this fails)

My first instinct was the dumbest possible heuristic: in a two-party call, the
speakers take turns, so just alternate Agent, Customer, Agent, Customer…

It fell apart immediately. Speech recognizers like Whisper segment on
sentences, not speakers. So the agent's multi-sentence greeting —

"Hi there! Thanks for calling. How can I help you today?"

— gets split into three segments, and the naive alternator flip-flops the label
mid-utterance:

Agent:    Hi there!
Customer: Thanks for calling.
Agent:    How can I help you today?

Garbage. The structure I assumed (one segment per speaker turn) simply isn't there.

Shortcut #2: ask what signal is actually present

Instead of forcing a model-shaped solution, I asked: what's physically in the
audio that distinguishes these two speakers?

In a typical support call, the agent and the customer have noticeably
different voice pitch. That's a physical property of the waveform — exactly the
kind of thing signal processing measures cheaply and exactly.

So the approach becomes:

For each transcribed segment, measure its average pitch (fundamental frequency) using an audio library I already had as a dependency.
Cluster the segments into two groups by pitch.
The low-pitch cluster is one speaker, the high-pitch cluster is the other.

The core of it is just a measurement plus a 2-way split:

import librosa
import numpy as np

def segment_pitch(y: np.ndarray, sr: int) -> float:
    """Mean fundamental frequency (Hz) of one transcript segment."""
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=float(librosa.note_to_hz("C2")),
        fmax=float(librosa.note_to_hz("C7")),
        sr=sr,
    )
    voiced = f0[voiced_flag]
    return float(np.nanmean(voiced)) if voiced.size else 0.0


def assign_speakers(pitches: list[float], labels=("AI Agent", "Customer")):
    """Split segments into two speakers by a pitch threshold."""
    valid = [p for p in pitches if p > 0]
    if not valid:
        return ["unknown"] * len(pitches)
    threshold = float(np.median(valid))
    # Lower-pitched cluster -> first label, higher -> second.
    return [
        labels[0] if (p > 0 and p <= threshold) else
        labels[1] if p > 0 else "unknown"
        for p in pitches
    ]

A few dozen lines. No new dependency. No token. And the labels come out right for
the common case — a plain measurement standing in for a model I couldn't
assume the user had.
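
With per-segment labels in hand, the sentence-level fragments that broke the alternating heuristic can be folded back into speaker turns. A minimal sketch, assuming each segment is a `(speaker, text)` pair; `merge_turns` is a hypothetical helper, not part of AudioTrace, and the customer line is made-up sample data.

```python
from itertools import groupby


def merge_turns(segments: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Collapse consecutive segments with the same speaker into one turn."""
    turns = []
    for speaker, group in groupby(segments, key=lambda seg: seg[0]):
        turns.append((speaker, " ".join(text for _, text in group)))
    return turns


segments = [
    ("AI Agent", "Hi there!"),
    ("AI Agent", "Thanks for calling."),
    ("AI Agent", "How can I help you today?"),
    ("Customer", "My order never arrived."),
]
for speaker, text in merge_turns(segments):
    print(f"{speaker}: {text}")
```

The three greeting segments come back as a single agent turn, which is the structure the alternating heuristic wrongly assumed was already there.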

It isn't magic — and that's the point

Two similar voices (two men, two women, a deep-voiced customer) can fool the pitch
split. With a token, pyannote still does better, and it handles three-plus
speakers, overlapping speech, and edge cases this never will. So AudioTrace keeps
both paths:

import audiotrace

# Default: zero-setup, infer speakers by pitch.
report = audiotrace.analyze("call.wav", diarize=False, num_speakers=2)

# Best quality: opt into pyannote with a token.
report = audiotrace.analyze("call.wav", hf_token="hf_...")
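
Since similar voices can fool the pitch split, it can help to check how far apart the two groups actually are before trusting the labels. The sketch below is one hedged way to do that, using the same median split as `assign_speakers`. The function names and the 3-semitone cutoff are assumptions for illustration, not AudioTrace behavior, and the cutoff would need tuning on real calls.

```python
import math
from statistics import mean, median


def pitch_separation_semitones(pitches: list[float]) -> float:
    """Gap, in semitones, between the mean pitch of the low and high groups."""
    valid = [p for p in pitches if p > 0]  # 0.0 marks segments with no voiced audio
    if len(valid) < 2:
        return 0.0
    threshold = median(valid)
    low = [p for p in valid if p <= threshold]
    high = [p for p in valid if p > threshold]
    if not high:  # every segment has the same pitch: no split at all
        return 0.0
    # An octave (a doubling of frequency) is 12 semitones.
    return 12 * math.log2(mean(high) / mean(low))


def split_is_trustworthy(pitches: list[float], min_semitones: float = 3.0) -> bool:
    """Hypothetical guard: only trust the labels if the groups are far apart."""
    return pitch_separation_semitones(pitches) >= min_semitones
```

When the guard returns False, falling back to "unknown" labels or to the pyannote path is arguably more honest than emitting confident but arbitrary speaker names.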

The lesson I keep relearning: we grab the biggest model out of habit. A
careful look at the data often points to something lighter, cheaper, and easier
to reason about. "What signal is actually there?" is a more useful question than
"which model should I download?"

That's also a practical observability principle. The cheap, deterministic
measurement runs in milliseconds with no GPU, which means you can run it on
every call — and the things you can afford to run on every call are the things
that actually catch regressions.

What's next

We now have a structured CallReport with speakers, quality, sentiment, latency,
and cost. In the final post I'll wire it into CI: fail the build when a prompt
change makes the agent slower, colder, or less compliant, and emit the signals
as OpenTelemetry spans alongside your LangChain / LangSmith traces.

pip install audiotrace

⭐ Repo: github.com/dimastatz/audiotrace

Keep building!

Your AI Voice Agent Is a Black Box. Here's How to Open It.

Dima Statz — Sat, 27 Jun 2026 04:39:16 +0000

When your AI agent types, you can see everything it does. LangChain traces every
step, LangSmith replays every run, OpenTelemetry hangs spans off each call. You
know what the model saw, what it said, how long it took, and what it cost.

The moment that same agent picks up a phone, the lights go out.

A voice agent's entire interaction lives inside an .mp3. The transcript, the
customer's mood, the awkward four-second silence, the moment it talked over the
caller, the point where the conversation went sideways — all of it is in there.
But to your existing observability stack, that file is opaque. LangSmith sees the
tokens you fed the LLM; it does not see the audio that reached a human ear.

So most teams do the only thing they can: they listen to a handful of calls by
hand and hope the sample is representative. That doesn't scale, and it misses the
thing that makes voice agents hard — their behavior drifts. You tweak a
prompt, swap a model, change a TTS voice, and the agent gets subtly slower,
colder, or starts missing intents. No unit test catches it, because the
regression lives in the audio.

This series is about closing that gap. In this first post I'll lay out the mental
model; the next two get hands-on with a tricky signal-extraction problem and with
wiring voice signals into CI.

The artifact is richer than you think

Here's what's actually recoverable from a single call recording:

Transcript — what was said, by whom, with timestamps.
Quality — silence gaps, interruptions, speaking pace, pitch variance.
Sentiment — the caller's mood, and where it shifted.
Latency — how long each stage (STT, LLM, TTS) took to respond.
Cost — what the call cost, attributed per stage.
Events — the detected intent, whether the caller dropped off, compliance flags.

That's a lot of signal locked inside one file. The reason teams rebuild this from
scratch at every company is that prying it loose means bolting together speech
recognition, speaker separation, audio analysis, a sentiment model, and a pricing
sheet — and then maintaining all of it.

Two ways to pull meaning out of audio

The key insight that makes this tractable: there are really two different
kinds of question you can ask of audio, and they want two different tools.

1. Measure it — classical signal processing. Deterministic math run straight
on the waveform: energy, pitch, the length of a silence. Cheap, exact, no
training data. It shines for physical questions:

How long was the pause?
How fast did someone speak?
Is this voice high-pitched or low?

You measure the answer instead of guessing at it.
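
As a concrete instance of measuring, the sketch below finds the longest pause in a mono waveform by thresholding per-frame RMS energy. It illustrates the idea and is not AudioTrace's code: the function name, the 20 ms frame, and the 0.01 threshold are assumptions you would tune for real recordings.

```python
import math


def longest_silence_ms(samples: list[float], sr: int,
                       frame_ms: int = 20, rms_threshold: float = 0.01) -> float:
    """Length of the longest run of low-energy frames, in milliseconds."""
    frame_len = max(1, sr * frame_ms // 1000)
    longest = current = 0
    # Walk the signal in non-overlapping frames; a trailing partial frame is ignored.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        if rms < rms_threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest * frame_len * 1000 / sr


# Synthetic check: 0.5 s of a 200 Hz tone, 1.0 s of silence, 0.5 s of tone.
sr = 8000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr // 2)]
signal = tone + [0.0] * sr + tone
print(longest_silence_ms(signal, sr))
```

There is no training data and no model here: the same input always gives the same answer, which is what makes this kind of measurement cheap enough to run on every call.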

2. Estimate it — learned models. Statistical systems like Whisper or a
sentiment classifier that have ingested enormous amounts of data and estimate
an answer. They own everything that turns on meaning rather than physics:

What words were said?
Who is speaking?
Is the caller upset?

No hand-written rule survives real speech here — you need a model.

Most of the craft is knowing which question belongs to which bucket: reach for a
model to estimate meaning, for signal processing to measure physics. (In
the next post you'll see that when a model isn't available, a measurement can
sometimes stand in for it — that turns out to be a surprisingly useful trick.)

One report, split along that line

I packaged this into a small open-source library called AudioTrace. You hand it
a recording; it hands back one structured, typed report, split along exactly that
measure-vs-estimate line. The acoustic layer (silence, pace, pitch) is signal
processing; the semantic layer (transcript, sentiment, intent) is models.

pip install audiotrace

import audiotrace

report = audiotrace.analyze(
    audio="call_recording.wav",
    metadata={"agent_version": "v2.1", "provider": "vapi"},
)

print(report.quality.overall_score)        # 0.87
print(report.quality.speaking_pace_wpm)     # 168.0
print(report.sentiment.caller_frustration)  # False
print(report.latency.total_ms)              # 4200
print(report.events.drop_off)               # False
print(report.cost.total_usd)                # 0.063

The return value is a Pydantic CallReport, so it's typed, validated, and trivial
to serialize. You can emit it as OpenTelemetry spans, hang it off your LangChain
and LangSmith traces, or assert on it in a CI check — which is exactly where this
series is headed.

One decision shaped everything: it runs locally

Call recordings are about as sensitive as data gets. So AudioTrace runs entirely
on your machine — no audio leaves the box, and the open models download once.
Privacy here shouldn't be an upgrade you pay for; it should be the default.

What's next

The two-layer model sounds tidy, but the interesting part is what happens when
the "right" tool isn't available. In the next post I'll walk through a concrete
example: labeling who is speaking without the gated model everyone reaches
for, and why a few dozen lines of pitch measurement make a better zero-setup
default for the common case.

If you want to poke at it now:

pip install audiotrace

⭐ The repo is at github.com/dimastatz/audiotrace.
Issues and PRs welcome — it's early, and provider integrations are exactly the
kind of contribution that helps most.

Keep building!