A live prospect picks up, says "Hello?" and your AMD decides it heard a voicemail greeting. Call dropped. Lead burned. At 50 agents doing 200+ dials per hour, an 18% false positive rate means you're losing 300+ live connections per day. That's real money disappearing into bad parameter tuning.
Most VICIdial admins set AMD parameters once and never touch them again. They run the defaults that were calibrated for Asterisk 1.4-era landlines and wonder why accuracy is terrible on modern cell phone traffic through VoIP carriers. Let me walk through the methodology that consistently brings false positive rates from 15-25% down to 3-5% across over 100 VICIdial deployments.
How AMD Actually Decides
VICIdial's AMD runs through Asterisk's detection engine. It listens to initial audio after the call connects, tracks sound events (what it calls "words"), measures silence gaps, and makes a binary call: human or machine. The logic is simple -- humans typically say one or two short words ("Hello?" "Yeah?"). Answering machines play longer greetings with more words. The engine counts and measures to distinguish between them.
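To make the counting logic concrete, here's a toy model in Python -- purely illustrative, since the real engine (Asterisk's app_amd) works frame by frame on live audio rather than on pre-segmented events. The parameter names mirror the settings covered below:

```python
# Toy model of the word-counting decision -- illustration only. The real
# engine processes live audio frames; here the call is pre-segmented into
# (kind, duration_ms) events so the four parameters are easy to see in action.

def classify(segments, max_word_ms=2000, min_word_ms=120,
             between_words_silence_ms=70, max_words=3):
    """segments: list of ("sound"|"silence", duration_ms) after connect."""
    words = 0
    current_word_ms = 0
    for kind, ms in segments:
        if kind == "sound":
            current_word_ms += ms
            if current_word_ms > max_word_ms:
                return "MACHINE"           # one sound event ran too long
        elif ms >= between_words_silence_ms:
            if current_word_ms >= min_word_ms:
                words += 1                 # long-enough burst counts as a word
                if words > max_words:
                    return "MACHINE"       # too many words for a live answer
            current_word_ms = 0            # the silence gap closes the word
        # silences shorter than the gap threshold leave the word open
    if current_word_ms >= min_word_ms:
        words += 1                         # trailing sound is a final word
    return "MACHINE" if words > max_words else "HUMAN"

print(classify([("sound", 400), ("silence", 1500)]))   # "Hello?" -> HUMAN
print(classify([("sound", 350), ("silence", 90)] * 6)) # greeting -> MACHINE
```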
The problem is that the default parameters were built for a world of landlines and clean audio. Today's calls traverse VoIP networks with variable latency, hit cell phones with background noise, and reach voicemail systems that have gotten shorter and more conversational. The defaults produce 15-25% false positive rates on most modern campaigns.
The Four Parameters That Control Everything
maximum_word_length (start at 2000ms) -- How long a single sound event can last before it's flagged as machine speech. Too tight and you catch slow-speaking humans. Too loose and blended machine greetings slip through. The default of 5000ms is way too permissive for modern traffic.
minimum_word_length (start at 120ms) -- The minimum duration to count as a "word." Below this, sounds are treated as noise. On noisy cell phone connections, push this to 140-160ms to filter out background sounds that inflate word counts. On clean SIP trunks to mostly landlines, 100ms works fine.
between_words_silence (start at 70ms) -- This is arguably the most impactful parameter. It defines the minimum silence gap that must elapse before the next sound counts as a new "word." If it's too short, brief pauses inside a single word register as word boundaries, splitting one word into several -- a human's "Hello?" picks up extra words and starts looking like a machine greeting. If it's too long, the real gaps between words go undetected and the system merges multiple words into one long "word" -- a machine greeting can slip under the word-count limit, or a human greeting can blow past maximum_word_length. This parameter is highly carrier-dependent. We've seen cases where switching to a different SIP carrier required a 30ms adjustment to this single value.
maximum_number_of_words (start at 3) -- How many words are allowed before a machine classification. Most humans answer with 1-2 words. Most voicemail greetings contain 8-20. A threshold of 3 catches the majority of machines while allowing for humans who say "Hello, this is Mike" (3 words). For B2B campaigns where people often answer with their name and company ("Mike Johnson, Acme Corp"), consider raising this to 4. For residential campaigns where people tend to just say "Hello," you could tighten to 2 -- but test carefully.
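Pulling those starting points together, here's a sketch of named parameter profiles. The profile names and the dict layout are mine, not VICIdial's -- treat these as tuning seeds, not finished settings:

```python
# The starting values above, consolidated into named profiles (names and
# layout are illustrative assumptions, not VICIdial conventions).
AMD_PROFILES = {
    "clean_landline": {                  # clean SIP trunks, landline-heavy lists
        "maximum_word_length":    2000,  # ms
        "minimum_word_length":     100,  # ms; clean audio tolerates less
        "between_words_silence":    70,  # ms
        "maximum_number_of_words":   3,
    },
    "noisy_cell": {                      # cell-heavy lists, background noise
        "maximum_word_length":    2000,
        "minimum_word_length":     150,  # 140-160ms filters noise bursts
        "between_words_silence":    70,
        "maximum_number_of_words":   3,
    },
    "b2b": {                             # "Mike Johnson, Acme Corp" answers
        "maximum_word_length":    2000,
        "minimum_word_length":     120,
        "between_words_silence":    70,
        "maximum_number_of_words":   4,  # allow name-plus-company greetings
    },
}
```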
Per-Carrier Tuning: Where the Real Gains Are
Here's where most admins leave massive performance on the table. They set AMD parameters globally and call it done. But different SIP carriers have radically different audio characteristics that directly affect detection accuracy.
Each carrier has its own codec negotiation preferences (G.711 ulaw vs G.729 vs Opus), jitter buffer implementation, silence suppression behavior (VAD), post-dial delay, and comfort noise generation. A carrier using aggressive VAD strips silence from the audio stream, destroying between_words_silence detection. A carrier with comfort noise generation injects artificial background sound that gets counted as speech.
The approach:
Step 1: List every SIP trunk you route outbound calls through. Check Admin > Carriers in VICIdial.
Step 2: Tag calls by carrier. Use the carrier_id field or parse the channel column in your logs to identify which trunk handled each call.
Step 3: Pull AMD outcomes per carrier. Query vicidial_log joined with vicidial_carrier_log to see machine classification rates per trunk (a query sketch follows this list).
Step 4: Compare against baselines. US B2C campaigns typically see 40-60% actual answering machines. If a carrier shows 70%+, AMD is over-classifying on that trunk. Under 30%, it's under-classifying.
Step 5: Tune per carrier. VICIdial lets you set AMD parameters at the campaign level. If specific campaigns route through specific carriers, you effectively get per-carrier tuning. For more granular control, modify the Asterisk dialplan to set AMD parameters dynamically based on the outbound trunk before the AMD() application runs.
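A minimal sketch of steps 2 through 4 in Python. It assumes the usual VICIdial schema (vicidial_log.status, vicidial_carrier_log.channel, joined on uniqueid) and the stock cron/1234 database credentials -- verify table and column names, swap in your own credentials, adjust MACHINE_STATUSES to your AMD dispositions, and adapt the channel parsing to your trunk naming:

```python
# Per-carrier AMD machine-rate report -- a sketch under the assumptions
# named above; verify everything against your own install.
import pymysql  # pip install pymysql

SQL = """
SELECT cl.channel, vl.status
FROM vicidial_log vl
JOIN vicidial_carrier_log cl ON cl.uniqueid = vl.uniqueid
WHERE vl.call_date >= NOW() - INTERVAL 1 DAY
"""

MACHINE_STATUSES = {"A", "AA", "AM"}  # adjust to your AMD dispositions

def carrier_from_channel(channel: str) -> str:
    # "SIP/trunk-east-000abc12" -> "trunk-east"; adapt to your trunk naming.
    try:
        return channel.split("/", 1)[1].rsplit("-", 1)[0]
    except (AttributeError, IndexError):
        return "unknown"

conn = pymysql.connect(host="localhost", user="cron",
                       password="1234", database="asterisk")
rates = {}  # carrier -> [machine_count, total_count]
with conn.cursor() as cur:
    cur.execute(SQL)
    for channel, status in cur.fetchall():
        counts = rates.setdefault(carrier_from_channel(channel), [0, 0])
        counts[0] += status in MACHINE_STATUSES
        counts[1] += 1

# US B2C baseline is roughly 40-60% machines; flag trunks outside 30-70%.
for carrier, (m, t) in sorted(rates.items()):
    pct = 100.0 * m / t
    flag = "" if 30 <= pct <= 70 else "  <-- investigate"
    print(f"{carrier:20s} {m:6d}/{t:<6d}  machine rate {pct:5.1f}%{flag}")
```

Any trunk flagged outside the 30-70% band is the one that gets a per-carrier parameter review first.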
Adaptive Thresholds for Sub-5% False Positives
Static tuning gets you from 20% down to maybe 8-10%. To push below 5%, you need adaptive thresholds that respond to changing conditions throughout the day.
The feedback loop concept: agents disposition calls, indicating whether AMD's classification matched reality. A monitoring script polls the database every few minutes and calculates the current false positive rate. When the rate drifts above a threshold (say 8%), the script automatically loosens parameters (increases between_words_silence by 5ms, capped at 120ms). When it drops below 3%, it tightens slightly (decreases by 3ms, floored at 40ms).
A production system should also rate-limit adjustments (no more than one change per 15 minutes) and include safeguards against oscillation.
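Here's a sketch of that controller in Python. The two database hooks are stubs to wire into your own install; the thresholds, step sizes, floor, cap, and rate limit are the numbers from the two paragraphs above. Note the dead band between 3% and 8% and the asymmetric steps (loosen by 5ms, tighten by 3ms) -- both act as oscillation safeguards:

```python
# Adaptive between_words_silence controller -- a sketch of the feedback
# loop described above, with stubbed database hooks.
import time

FP_LOOSEN_ABOVE = 0.08      # loosen when FP rate drifts above 8%
FP_TIGHTEN_BELOW = 0.03     # tighten when it falls below 3%
BWS_MIN, BWS_MAX = 40, 120  # ms floor and cap
STEP_UP, STEP_DOWN = 5, 3   # ms per adjustment
MIN_INTERVAL = 15 * 60      # at most one change per 15 minutes

def current_fp_rate() -> float:
    """Stub: FP rate from recent agent dispositions vs AMD results."""
    raise NotImplementedError

def apply_between_words_silence(ms: int) -> None:
    """Stub: push the new value into the campaign's AMD settings."""
    raise NotImplementedError

bws = 70                    # starting value from the parameter guide
last_change = 0.0
while True:
    fp = current_fp_rate()
    if time.time() - last_change >= MIN_INTERVAL:
        if fp > FP_LOOSEN_ABOVE and bws < BWS_MAX:
            bws = min(bws + STEP_UP, BWS_MAX)      # loosen detection
            apply_between_words_silence(bws)
            last_change = time.time()
        elif fp < FP_TIGHTEN_BELOW and bws > BWS_MIN:
            bws = max(bws - STEP_DOWN, BWS_MIN)    # tighten detection
            apply_between_words_silence(bws)
            last_change = time.time()
    time.sleep(300)         # poll every few minutes
```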
Time-of-Day Patterns
AMD accuracy varies throughout the day in predictable ways. Morning hours (8-10 AM) hit more voicemail because people are commuting or in meetings -- tighter detection works better. Afternoon (11 AM - 4 PM) is balanced. Evening hours (4-8 PM) get more live answers but with background noise (TV, kids, cooking) -- more permissive parameters reduce false positives from noisy environments.
Maintain separate parameter profiles for different time blocks. A cron job that swaps profiles at the boundaries handles this automatically.
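A minimal profile-selector sketch to run from cron at the block boundaries. The profile names are placeholders; feed the output into whatever script applies your campaign settings:

```python
# Time-of-day AMD profile selector -- a sketch with placeholder profile
# names. Example crontab entry for the block boundaries:
#   0 8,11,16 * * * /usr/local/bin/swap_amd_profile.py
from datetime import datetime

TIME_BLOCKS = [                  # (start_hour, profile) in local time
    (8,  "morning_tight"),       # 8-11 AM: heavy voicemail, tighter detection
    (11, "midday_balanced"),     # 11 AM - 4 PM: balanced traffic
    (16, "evening_permissive"),  # 4-8 PM: live but noisy answers
]

def profile_for(hour: int) -> str:
    name = "midday_balanced"     # fallback outside the defined blocks
    for start, profile in TIME_BLOCKS:
        if hour >= start:
            name = profile       # blocks are ascending; last match wins
    return name

if __name__ == "__main__":
    # Print the active profile; pipe into whatever applies campaign settings.
    print(profile_for(datetime.now().hour))
```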
Testing with Real Recordings
You cannot tune AMD blind. You need ground truth.
Build a test corpus: pull 200-300 recordings from /var/spool/asterisk/monitorDONE/. Get 100 that AMD classified as machine (status AA), 100 classified as human (status AL), and 50-100 where the agent dispositioned differently than AMD predicted. Manually listen to the first 3 seconds of each and label it HUMAN or MACHINE.
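A sketch for generating the labeling sheet. The AMD status lookup is a stub (on a real install, map recording filename to call through the recording_log and vicidial_log tables), and this version takes a simple random sample -- stratify by AMD status once the lookup is wired in, per the counts above:

```python
# Labeling-sheet generator for the AMD test corpus -- a sketch with a
# stubbed status lookup; verify paths and tables on your own install.
import csv
import random
from pathlib import Path

REC_DIR = Path("/var/spool/asterisk/monitorDONE")

def amd_status_for(recording: Path) -> str:
    """Stub: return the AMD status (AA, AL, ...) recorded for this call."""
    raise NotImplementedError

wavs = sorted(REC_DIR.glob("*.wav"))
sample = random.sample(wavs, min(300, len(wavs)))  # 200-300 per the text
with open("amd_corpus.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "amd_status", "truth"])  # truth: HUMAN/MACHINE
    for rec in sample:
        # Listen to the first ~3 seconds of each file, then fill in `truth`.
        writer.writerow([rec.name, amd_status_for(rec), ""])
```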
Then batch-test different parameter combinations against the labeled corpus. For each combination, build a confusion matrix:
                   Predicted Human      Predicted Machine
Actual Human       True Negative        FALSE POSITIVE (bad!)
Actual Machine     False Negative       True Positive
False positives are the priority -- those are lost live connections. A 5% false positive rate on 1000 daily live answers means 50 lost conversations. False negatives (machines reaching agents) waste 3-5 seconds per call. Annoying, but not nearly as expensive.
Find the parameter sweet spot where false positives sit under 5% without letting too many machines through. Never tune based on fewer than 500 calls per parameter set -- small samples produce misleading results.
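Scoring one parameter combination against the labeled corpus might look like this -- a sketch where `results` pairs each hand-assigned truth label with the classification the candidate parameters produced on that recording ("positive" meaning a machine classification, matching the matrix above):

```python
# Confusion-matrix scorer for one parameter combination -- a sketch.

def score(results):
    """results: iterable of (truth, predicted), each 'HUMAN' or 'MACHINE'."""
    fp = fn = tp = tn = 0
    for truth, pred in results:
        if truth == "HUMAN" and pred == "MACHINE":
            fp += 1            # live person dropped -- the costly error
        elif truth == "MACHINE" and pred == "HUMAN":
            fn += 1            # machine reached an agent -- the cheap error
        elif truth == "MACHINE":
            tp += 1            # machine correctly caught
        else:
            tn += 1            # human correctly passed to an agent
    humans = fp + tn
    fp_rate = fp / humans if humans else 0.0
    return {"fp_rate": fp_rate, "fp": fp, "fn": fn, "tp": tp, "tn": tn}

# Keep only parameter sets with fp_rate < 0.05, then pick the one that
# also minimizes false negatives -- and never trust runs under 500 calls.
```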
Real-World Before and After
Typical results on a 50-agent floor:
Before tuning (default parameters): 12,000 daily dials, 4,800 answered, 18.5% false positive rate, 533 live connections lost per day, 62% agent talk time utilization.
After per-carrier tuning + adaptive thresholds: Same dial volume, 4.2% false positive rate, 111 live connections lost per day, 78% utilization.
The difference: 422 additional live connections per day reaching agents. At a conservative $3 per live connection value, that's $1,266 per day or roughly $28K per month in recovered revenue -- from parameter tuning alone. No new hardware, no new software, just better configuration.
Common Pitfalls
Tuning during peak hours. Always test parameter changes during low-volume periods first. A bad change during peak dialing can spike false positives across thousands of calls before you catch it.
Ignoring codec changes. If you switch carriers or change codec preferences, re-tune AMD. G.729 compression changes audio characteristics enough to shift word boundary detection by 20-30ms.
Not accounting for carrier pre-greetings. Some cell carriers play a brief tone or message ("The subscriber you are calling...") before the actual person answers. This gets counted as machine speech. Work with your carriers to understand their pre-connect audio behavior.
Over-tuning on small samples. Never tune based on fewer than 500 calls per parameter set. Statistical noise in small samples leads to parameters that perform well on the test set but poorly in production.
Forgetting compliance. If AMD drops a call classified as machine but it was actually a human, you just made a "dead air" call that counts against your abandoned call rate. FTC limits abandoned calls to 3% of answered calls. False positives can push you over this limit without you realizing it.
ViciStack deploys per-carrier AMD tuning, adaptive monitoring, and recording-level analysis for every client. Our median improvement: 18% false positives down to 4.1%. Get a free AMD analysis -- we'll measure your current false positive rate and show you what's recoverable.
Originally published at https://vicistack.com/blog/vicidial-amd-false-positive-reduction/