ashczar

How I Built an Automated Testing Harness for Voice AI Agents

Voice AI is having a moment. GPT-powered phone bots, IVR systems, and voice assistants are being deployed faster than ever. But there's a problem nobody talks about (or maybe they just haven't reached my ears yet): how do you test them?

You can unit test your backend. You can integration test your API. But how do you know your voice AI agent actually says the right thing, in the right order, with acceptable latency — before your users find out it doesn't?

That's the problem I set out to solve. The result is MockingJay — an open source testing harness for voice AI agents.


The Problem

At 10,000 calls a day, you can't listen to them all. And manual testing doesn't scale. Every time you update your agent's model, prompt, or logic, you're flying blind.

What I wanted was something like Jest or pytest but for voice AI. Write your test scenarios once, run them on every deploy, get a pass/fail result with metrics.


What MockingJay Does

MockingJay is not a voice AI agent. It's the tool you use to test one. Your agent lives separately: it could be a GPT-powered phone bot, a Twilio-based IVR, or any HTTP endpoint. MockingJay plugs into it and tests it:

mockingjay run

   🐦 MockingJay - Starting tests...

     [1/3] basic-greeting ✓ PASS (latency: 103ms)
     [2/3] appointment-booking ✓ PASS (latency: 98ms)
     [3/3] business-hours ✓ PASS (latency: 112ms)

   📊 Results:
     Pass rate: 100%  Avg latency: 104ms

   💬 Conversation Intelligence:
     Intent accuracy: 100%  Context retention: 100%

You write scenarios in a simple YAML file:

   version: 1

   agent:
     endpoint: "http://localhost:9000/call"

   scenarios:
     - name: "basic-greeting"
       steps:
         - say: "Hello"
           expect: "greeting"

     - name: "appointment-booking"
       steps:
         - say: "I want to book an appointment"
           expect: "booking_intent"
         - say: "Tomorrow at 7pm"
           expect: "confirmation"

MockingJay sends each say to your agent's HTTP endpoint and checks that the returned intent matches what you defined in expect. Multi-turn conversations, latency tracking, drop-off detection: all built in.
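To make the contract concrete, here's a toy agent endpoint that MockingJay could test. The JSON field names and the keyword-matching "NLU" are my illustration only; the real request/response schema lives in the repo's examples/voice-server:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// callRequest and callResponse are illustrative shapes, not the
// repo's actual schema.
type callRequest struct {
	Text string `json:"text"`
}

type callResponse struct {
	Intent string `json:"intent"`
}

// classify is a stand-in for your agent's real NLU/LLM logic.
func classify(text string) string {
	t := strings.ToLower(text)
	switch {
	case strings.Contains(t, "hello"):
		return "greeting"
	case strings.Contains(t, "appointment"):
		return "booking_intent"
	default:
		return "unknown"
	}
}

// handleCall decodes the utterance and returns the detected intent.
func handleCall(w http.ResponseWriter, r *http.Request) {
	var req callRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	json.NewEncoder(w).Encode(callResponse{Intent: classify(req.Text)})
}

func main() {
	// Exercise the endpoint once, the way the harness would.
	srv := httptest.NewServer(http.HandlerFunc(handleCall))
	defer srv.Close()

	body, _ := json.Marshal(callRequest{Text: "I want to book an appointment"})
	resp, _ := http.Post(srv.URL, "application/json", bytes.NewReader(body))
	var out callResponse
	json.NewDecoder(resp.Body).Decode(&out)
	fmt.Println(out.Intent) // booking_intent
}
```

As long as your agent speaks this kind of request/response contract, MockingJay doesn't care what's behind it.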


The Full Testing Loop

HTTP testing covers your agent's logic. But voice AI has another layer: the actual phone call. Does it connect? Does the TTS sound right? Does the recording capture what was said?

MockingJay handles that too:

Make a real call via Twilio and record it:
mockingjay call --to +12345678 --webhook https://your-agent.com/voice --record

Transcribe the recording with Deepgram:
mockingjay transcribe --file recording.wav --api-url http://localhost:8080

Or do all three in one command:
mockingjay calltest --to +12345678 --webhook https://your-agent.com/voice --expect "hello"

The calltest command chains call → transcribe → validate → report automatically. It checks that the transcript contains your expected phrase and saves the result to the dashboard.
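The chain itself is simple to picture. Here's a sketch of the calltest flow with Twilio and Deepgram stubbed out; the function names are mine, and the real commands talk to those services over HTTP:

```go
package main

import (
	"fmt"
	"strings"
)

// placeCall stands in for the Twilio integration: dial the number,
// point it at the agent's webhook, and get back a recording.
func placeCall(to, webhook string) (recording string) {
	return "recording.wav" // stub
}

// transcribe stands in for the Deepgram integration.
func transcribe(recording string) string {
	return "Hello, thanks for calling!" // stub
}

// validate checks the transcript contains the expected phrase,
// case-insensitively.
func validate(transcript, expect string) bool {
	return strings.Contains(strings.ToLower(transcript), strings.ToLower(expect))
}

func main() {
	rec := placeCall("+12345678", "https://your-agent.com/voice")
	transcript := transcribe(rec)
	fmt.Println(validate(transcript, "hello")) // true
}
```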


The Architecture

The stack is deliberately simple:

  • CLI — Go, Cobra. Parallel test execution with goroutines.
  • Backend — Go, SQLite. REST API for storing results.
  • Frontend — Next.js. Dashboard with 5 tabs: Metrics, Results, A/B Tests, Transcriptions, Health.
  • Integrations — Twilio for real calls, Deepgram for transcription.

The CLI is the core. The backend and dashboard are optional (useful for tracking trends over time); you can run mockingjay run standalone with no infrastructure.


A/B Testing Agent Versions

One feature I'm particularly happy with is the A/B tester. Before promoting a new model or prompt, you can run both versions against the same scenarios:

mockingjay ab -c ab-test.yaml

   🔬 A/B Test: v1-baseline vs v2-new-model

     Metric              v1-baseline  v2-new-model   Delta
     Avg Latency (ms)        105.0         89.0  ✓  -16.0
     Pass Rate (%)           100.0        100.0       +0.0

     🏆 Winner: v2-new-model

Production Monitoring

Once your agent is live, mockingjay monitor runs your scenarios on a schedule and fires a Slack alert if pass rate drops below a threshold:

mockingjay monitor --interval 300 --threshold 90 --alert-webhook https://hooks.slack.com/xxx
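Under the hood, a monitor like this boils down to a pass-rate check plus a webhook POST. A minimal sketch of that logic (shouldAlert and sendSlackAlert are my names, not MockingJay's; Slack incoming webhooks accept a {"text": "..."} payload):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// shouldAlert returns true when the pass rate drops below the
// threshold percentage.
func shouldAlert(passed, total int, thresholdPct float64) bool {
	return float64(passed)/float64(total)*100 < thresholdPct
}

// sendSlackAlert posts a simple message to a Slack incoming-webhook URL.
func sendSlackAlert(webhook, msg string) error {
	body, _ := json.Marshal(map[string]string{"text": msg})
	resp, err := http.Post(webhook, "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	passed, total := 8, 10 // an 80% pass rate from the latest run
	if shouldAlert(passed, total, 90) {
		fmt.Println("pass rate below threshold, alerting")
		// sendSlackAlert("https://hooks.slack.com/xxx",
		//     "MockingJay: pass rate 80% dropped below 90%")
	}
}
```

On a schedule, this check just runs inside a ticker loop every --interval seconds.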


What I Learned

Test your contracts, not your implementation. MockingJay tests the HTTP contract between your test harness and your agent. That's the right boundary: it doesn't care how your agent works internally, only what it returns.

Real calls are a different test. HTTP testing and phone call testing answer different questions. HTTP tells you if your logic is right. A real call tells you if your TTS, telephony, and recording pipeline work end-to-end. You need both.

Interfaces matter more than you think. The refactor that made MockingJay more testable was adding voice.Caller, reporter.Reporter, and repository.Repository interfaces. Suddenly every component could be swapped or mocked independently.
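To make that concrete, here's a stripped-down version of the pattern. The method signatures are illustrative, not the actual interfaces from the repo:

```go
package main

import "fmt"

// Caller abstracts placing a phone call (the role voice.Caller plays
// in MockingJay); real runs inject a Twilio-backed implementation.
type Caller interface {
	Call(to, webhook string) (recordingURL string, err error)
}

// mockCaller records calls instead of hitting Twilio.
type mockCaller struct {
	calls []string
}

func (m *mockCaller) Call(to, webhook string) (string, error) {
	m.calls = append(m.calls, to)
	return "mock-recording.wav", nil
}

// runCallTest depends only on the interface, so tests can inject a
// mock while production code injects the real thing.
func runCallTest(c Caller, to, webhook string) (string, error) {
	return c.Call(to, webhook)
}

func main() {
	mc := &mockCaller{}
	rec, _ := runCallTest(mc, "+12345678", "https://your-agent.com/voice")
	fmt.Println(rec, len(mc.calls)) // mock-recording.wav 1
}
```

The same trick applies to reporter.Reporter and repository.Repository: once the seam is an interface, every component can be exercised in isolation.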


Try It

MockingJay is open source:

👉 https://github.com/ashczar77/mockingjay

   git clone https://github.com/ashczar77/mockingjay.git
   cd mockingjay/cli && go build -o mockingjay
   cd ../examples/voice-server && go run main.go
   ../../cli/mockingjay run

If you're building voice AI and this solves a problem you have, I'd love to hear from you. Open an issue, submit a PR, or just leave a comment below.

Thank you for reading my pitiful first attempt at writing an article. Happy building...
