
Jason Shouldice


What 500 Cold Call Recordings Taught Us About Building AI Dialers

Most AI voice products start in a conference room. Someone writes what they think a good cold call sounds like, feeds it into a language model, and ships it. The result sounds like a script being read — because it is.

We started differently. We pulled 500+ real cold call recordings from a live roofing appointment-setting campaign running through VICIdial. Five human agents, same lists, same dialer settings. We transcribed every call, tagged every disposition, and went line by line through the conversations that actually booked appointments to figure out why they worked.

The insight was simple: don't invent a playbook. Clone the one that's already winning.

Phone Audio Breaks Standard Transcription

Whisper was trained on podcasts, audiobooks, and YouTube — clean audio in controlled environments. VICIdial recordings are narrow-band 8kHz G.711 with background noise, crosstalk, and cell phone compression. Feed that raw into Whisper and you get hallucinated Bible verses and confident-sounding garbage.

We built the pipeline locally with faster-whisper. large-v3 was the only model that consistently got it right. Here's the config that made it work:

from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, info = model.transcribe(
    audio_path,
    beam_size=10,
    initial_prompt="This is a phone call about roofing. "
        "Terms: siding, fascia, eaves, soffit, flashing, ridge cap.",
    vad_filter=True,
    vad_parameters=dict(min_silence_duration_ms=500),
    condition_on_previous_text=False,
)

Every parameter exists for a reason. beam_size=10 gives the decoder room to explore candidates instead of committing greedily. The initial_prompt seeds domain vocabulary — without it, Whisper hears "soffit" and writes "saw fit" every time. vad_filter engages Silero VAD to skip silence and prevent hallucination. condition_on_previous_text=False stops one bad segment from poisoning everything downstream.

Before Whisper sees anything, every recording passes through an ffmpeg preprocessing pipeline:

ffmpeg -i /var/spool/asterisk/monitor/recording.wav \
  -af "highpass=f=300,lowpass=f=3400,volume=2.0" \
  -ar 16000 -ac 1 processed.wav

That upsampling from 8kHz to 16kHz — giving the model something closer to what it was trained on — measurably reduced the word error rate.
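In practice the two stages chain together per file. Here's a minimal sketch of the batch wrapper, assuming recordings sit in the standard Asterisk monitor directory and reusing the transcription config shown above:

import subprocess
from pathlib import Path

MONITOR_DIR = Path("/var/spool/asterisk/monitor")  # default VICIdial recording location

def preprocess(src: Path) -> Path:
    """Band-pass to the telephony range, boost gain, resample to 16kHz mono."""
    dst = src.with_suffix(".processed.wav")
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-af", "highpass=f=300,lowpass=f=3400,volume=2.0",
         "-ar", "16000", "-ac", "1", str(dst)],
        check=True, capture_output=True,
    )
    return dst

for recording in sorted(MONITOR_DIR.glob("*.wav")):
    if recording.name.endswith(".processed.wav"):
        continue  # skip files we already produced
    cleaned = preprocess(recording)
    # feed `cleaned` into model.transcribe() with the config shown earlier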

Context Turns Words Into Intelligence

A raw transcript is just words. Every VICIdial recording filename contains the lead's phone number, and the recording_log table carries the lead_id; either one keys directly into the vicidial_list table. We automated the matching:

SELECT vl.first_name, vl.last_name, vl.address1, vl.city,
       vl.state, vl.phone_number, vl.status,
       rl.filename, rl.length_in_sec, rl.start_time
FROM recording_log rl
JOIN vicidial_list vl ON rl.lead_id = vl.lead_id
WHERE rl.start_time >= '2026-01-01'
ORDER BY rl.start_time;

That gave us the lead's name, address, insurance status, and full call history for every recording. When Whisper wrote "123 Birch Wood Lane" and the lead record showed "123 Birchwood Drive," the correction was automatic.
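The correction itself doesn't need anything fancy. A minimal sketch of the idea using the standard library's difflib, where the 0.75 threshold is an illustrative choice rather than a tuned value:

import difflib

def correct_against_lead(transcribed: str, crm_value: str, threshold: float = 0.75) -> str:
    """Prefer the CRM value when the transcribed entity is close enough to it."""
    ratio = difflib.SequenceMatcher(
        None, transcribed.lower(), crm_value.lower()
    ).ratio()
    return crm_value if ratio >= threshold else transcribed

# "123 Birch Wood Lane" vs "123 Birchwood Drive" scores ~0.79,
# so the lead record wins and the transcript gets normalized.
address = correct_against_lead("123 Birch Wood Lane", "123 Birchwood Drive")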

The most valuable context was call history. A first-contact cold call and a fifth redial to the same prospect are different conversations with different dynamics. We tagged every call accordingly, because analyzing a redial with first-contact expectations produces garbage insights.
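One way to derive that tag automatically, assuming the standard vicidial_log schema (one row per dial attempt, keyed by lead_id) and a DB-API cursor:

ATTEMPT_SQL = """
    SELECT COUNT(*) FROM vicidial_log
    WHERE lead_id = %s AND call_date < %s
"""

def tag_attempt(cursor, lead_id, start_time):
    """Label a recording as first contact or redial N, based on prior dials."""
    cursor.execute(ATTEMPT_SQL, (lead_id, start_time))
    prior_attempts = cursor.fetchone()[0]
    return "first_contact" if prior_attempts == 0 else f"redial_{prior_attempts + 1}"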

What Makes a Call Convert

Every call got tagged: WIN (appointment booked) or LOSS with a sub-reason. Then we extracted the exact moment a prospect commits — not the general vibe, the precise words and timestamp. Commitment sounds like quiet surrender: "Yeah, that'd be fine" at 1:47. "Sure, go ahead and put me down" at 2:12.
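Mechanically, this is a scan over diarized, timestamped turns. A simplified sketch (the phrase list here is illustrative; the real list was mined from the tagged calls) that also grabs the agent turn immediately before the commitment, which is what the next step analyzes:

COMMITMENT_PHRASES = [
    "that'd be fine",
    "go ahead and put me down",
    "sounds good",
    "that works",
]

def find_commitment(turns):
    """Return (timestamp, prospect text, preceding agent text) for the first commitment."""
    for i, turn in enumerate(turns):
        if turn["speaker"] != "prospect":
            continue
        if any(phrase in turn["text"].lower() for phrase in COMMITMENT_PHRASES):
            setup = turns[i - 1]["text"] if i > 0 else None  # what the agent said right before
            return turn["start"], turn["text"], setup
    return None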

We mapped what the agent said immediately before each turning point. The difference between a booked appointment and a burned lead is often a single sentence. Prospect says "I'm not really interested." Agent A responds with a redirect to the free estimate and an either/or scheduling question. Prospect books. Agent B says "Oh, okay, well if you change your mind..." Call over.

Top performers absorbed two to three rejections per booked call. They didn't hear "no" as a stop sign. They heard it as a request for more information.

The Top Closer Followed the Script Least

We built per-agent performance profiles from timing data across every completed call.

The top agent converted at 23.7% with 78% script adherence. The worst agent converted at 11.2% with 93% script adherence. That's not a coincidence — the top performer deviated because they were reading the prospect and adjusting. The worst performer clung to the script because they weren't.

Talk-to-listen ratio told the same story. The worst performer talked 71% of the time. The top performer talked 57%. The top agent's average pause before responding was 1.8 seconds — long enough to feel considered. The worst performer's 0.7-second gap wasn't efficient; it was interrupting. Call review confirmed they were stepping on prospect sentences 34% of the time.

Silence after the ask was another separator. The top performer waited 3.1 seconds after requesting the booking before speaking again. The worst performer waited 1.1 seconds and immediately started soft-selling again, undercutting their own close.
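All of those numbers fall out of the same diarized turns. A minimal sketch of the per-call computation, assuming each turn carries speaker, start, and end fields in seconds:

def timing_profile(turns):
    """Talk-to-listen ratio, response gaps, and interruptions for one non-empty call."""
    agent_time = sum(t["end"] - t["start"] for t in turns if t["speaker"] == "agent")
    total_time = sum(t["end"] - t["start"] for t in turns)
    # Gap between a prospect finishing and the agent starting; negative means
    # the agent started talking before the prospect stopped.
    gaps = [
        b["start"] - a["end"]
        for a, b in zip(turns, turns[1:])
        if a["speaker"] == "prospect" and b["speaker"] == "agent"
    ]
    positive_gaps = [g for g in gaps if g >= 0]
    return {
        "talk_ratio": agent_time / total_time,
        "avg_response_gap": sum(positive_gaps) / len(positive_gaps) if positive_gaps else None,
        "interruption_rate": sum(1 for g in gaps if g < 0) / len(gaps) if gaps else 0.0,
    }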

Building the Agent From Data, Not Theory

Every design decision came from the dataset. The system prompt was modeled on the top performer's pacing, sentence length, and tone calibration. The conversation flow logic follows the golden path — the most common route through a conversation that ends in a booked appointment, with four or five critical branch points where word choice determines the outcome.
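Those branch points translate directly into explicit states. A simplified sketch of the shape of that flow logic, with illustrative state and signal names:

GOLDEN_PATH = {
    "opener":           {"engaged": "qualify", "objection": "handle_objection"},
    "qualify":          {"qualified": "ask_for_booking", "objection": "handle_objection"},
    "handle_objection": {"recovered": "ask_for_booking", "hard_no": "polite_exit"},
    "ask_for_booking":  {"committed": "confirm_details", "objection": "handle_objection"},
}

def next_state(current, prospect_signal):
    # Unexpected signals route to objection handling rather than derailing the call.
    return GOLDEN_PATH.get(current, {}).get(prospect_signal, "handle_objection")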

The objection handling module was built from the real objection map. The written playbook had 15 objection-response pairs. The calls surfaced 43 distinct objections. That gap — between what you think prospects say and what they actually say — is where most AI voice agents fall apart.

"Not interested" appeared in 22% of all calls. It's recoverable about 15% of the time, but only with a specific pattern: acknowledge, pivot to the free inspection, offer a specific day. Any variation including the word "just" killed the call.

The AI doesn't try to replace every interaction. High-value prospect signals, emotional escalation, compliance-sensitive topics, or explicit requests for a human trigger warm transfers with full context summaries.
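The handoff rule itself stays deliberately simple. A minimal sketch, with illustrative signal names:

ESCALATION_TRIGGERS = {
    "asked_for_human",
    "emotional_escalation",
    "compliance_topic",     # e.g. DNC requests or insurance claim specifics
    "high_value_signal",
}

def should_transfer(detected_signals):
    """Warm-transfer when any detected signal is on the escalation list."""
    return bool(set(detected_signals) & ESCALATION_TRIGGERS)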

The Math That Matters

A fully loaded human agent runs $15-$25/hour. After AMD filtering, no-answers, and voicemails, that agent has 15-25 actual conversations per hour. Cost per live conversation: $0.60 to $1.67, before turnover, sick days, and training ramp.
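The per-conversation range is just the two spreads divided against each other, best case against worst:

# Worked version of the human-agent arithmetic above.
hourly_cost = (15.00, 25.00)     # fully loaded $/hour
convos_per_hour = (15, 25)       # live conversations after AMD/no-answer filtering

cheapest = hourly_cost[0] / convos_per_hour[1]   # 15 / 25  = $0.60 per conversation
priciest = hourly_cost[1] / convos_per_hour[0]   # 25 / 15 ≈ $1.67 per conversation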

An AI voice agent costs a fraction of a cent per second. It runs 24/7, handles unlimited concurrent calls, and performs at the level of your best agent on every call. For most operations we've tested, the AI delivers appointments at 40-60% lower cost — and that's before the 24/7 availability capturing leads that would otherwise hit voicemail at 8pm.

It doesn't need to be better than your best agent. It needs to be better than your average agent, available around the clock, at a fraction of the cost.


Originally published at ViciStack
