How to Stop Losing $1,250/Hour When Your SIP Trunk Goes Down

#voip #asterisk #sysadmin #devops

Your primary SIP carrier goes down during peak dialing hours. Fifty agents are sitting idle. At a fully loaded cost of $25/hour per agent, you're burning $1,250 per hour in payroll alone — not counting the revenue those agents would have generated.

This happens more often than you'd expect. Carriers have outages. They run out of capacity. They return 503s because their upstream route is congested. And VICIdial, out of the box, does nothing about it. The call fails, gets logged as an error, and the lead goes back in the hopper. No automatic retry through an alternate carrier. No health monitoring. No dynamic routing based on carrier performance.

Here's how to build real failover at the Asterisk layer.

Why VICIdial's Default Carrier Setup Fails You

VICIdial's admin interface lets you define multiple carriers and assign them to campaigns. But the default behavior has three critical gaps:

No automatic failover — if Carrier A returns a 503, VICIdial doesn't retry through Carrier B. The call simply fails.
No health monitoring — you discover a carrier is down when agents complain about no calls, or when you notice the dial rate has cratered. By then you've lost 30-60 minutes.
Static routing — calls go to whichever carrier is assigned to the campaign, with no dynamic adjustment based on performance, latency, or capacity.

To fix these, you need to work in the Asterisk dialplan and sip.conf, not just the VICIdial admin panel.

Step 1: Configure Trunks with Health Probing

Define each carrier in sip.conf with health checking enabled:

; /etc/asterisk/sip.conf

[carrier-primary]
type=peer
host=sip.primarycarrier.com
port=5060
username=your_account_id
secret=your_password
fromuser=your_account_id
fromdomain=sip.primarycarrier.com
insecure=invite,port
qualify=yes
qualifyfreq=30
dtmfmode=rfc2833
disallow=all
allow=ulaw
allow=g729
context=from-carrier-primary
nat=force_rport,comedia

[carrier-secondary]
type=peer
host=sip.secondarycarrier.com
; ... same structure, different credentials
qualify=yes
qualifyfreq=30
nat=force_rport,comedia

[carrier-tertiary]
type=peer
host=sip.tertiarycarrier.com
; ... emergency fallback
qualify=yes
qualifyfreq=30
nat=force_rport,comedia

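The key settings: qualify=yes with qualifyfreq=30 sends a SIP OPTIONS request every 30 seconds. If the carrier stops responding, Asterisk marks it UNREACHABLE, and a Dial() to an UNREACHABLE peer fails right away with CHANUNAVAIL instead of waiting on a dead carrier. That status is what the failover dialplan in Step 2 keys on. The nat=force_rport,comedia setting is essential for most VoIP deployments behind NAT; without it, one-way audio and registration failures are common.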

For more aggressive monitoring, reduce qualifyfreq to 15 seconds. But some carriers rate-limit OPTIONS requests and may temporarily block your IP if you probe too frequently.
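After reloading SIP, it helps to confirm that Asterisk is actually probing each trunk. The following is a minimal sketch, assuming the Status line format that chan_sip prints for "sip show peer" (for example "Status : OK (23 ms)"). The helper name peer_state is hypothetical and not part of Asterisk or VICIdial.

```shell
#!/bin/bash
# peer_state: read `sip show peer <name>` output on stdin and print the
# qualify state (OK, LAGGED, UNREACHABLE, UNKNOWN, ...).
# Prints NOSTATUS when no Status line is present (e.g. the peer does not exist).
peer_state() {
    local state
    state=$(awk -F': *' '/^ *Status/ { split($2, f, " "); print f[1]; exit }')
    echo "${state:-NOSTATUS}"
}

# Intended use on the Asterisk host (not executed here):
#   for t in carrier-primary carrier-secondary carrier-tertiary; do
#       printf '%s: ' "$t"
#       asterisk -rx "sip show peer $t" | peer_state
#   done
```

A trunk that reports anything other than OK right after a reload is worth investigating before you rely on it as a failover target.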

Step 2: Build the Failover Dialplan

The dialplan is where the actual failover logic lives. You build a cascading context that tries carriers in sequence when one fails:

; /etc/asterisk/extensions_custom.conf

[outbound-failover]
; Try primary carrier first
exten => _1NXXNXXXXXX,1,NoOp(Outbound call to ${EXTEN} - trying primary)
exten => _1NXXNXXXXXX,n,Set(TRUNK_TRIED=primary)
exten => _1NXXNXXXXXX,n,Dial(SIP/${EXTEN}@carrier-primary,60,tT)
exten => _1NXXNXXXXXX,n,NoOp(Primary result: ${DIALSTATUS})

; Failover on trunk-level failures only
exten => _1NXXNXXXXXX,n,GotoIf($["${DIALSTATUS}" = "CHANUNAVAIL"]?try_secondary)
exten => _1NXXNXXXXXX,n,GotoIf($["${DIALSTATUS}" = "CONGESTION"]?try_secondary)
exten => _1NXXNXXXXXX,n,Goto(done)

; Try secondary
exten => _1NXXNXXXXXX,n(try_secondary),NoOp(Trying secondary carrier)
exten => _1NXXNXXXXXX,n,Set(TRUNK_TRIED=${TRUNK_TRIED},secondary)
exten => _1NXXNXXXXXX,n,Dial(SIP/${EXTEN}@carrier-secondary,60,tT)
exten => _1NXXNXXXXXX,n,GotoIf($["${DIALSTATUS}" = "CHANUNAVAIL"]?try_tertiary)
exten => _1NXXNXXXXXX,n,GotoIf($["${DIALSTATUS}" = "CONGESTION"]?try_tertiary)
exten => _1NXXNXXXXXX,n,Goto(done)

; Try tertiary
exten => _1NXXNXXXXXX,n(try_tertiary),NoOp(Trying tertiary carrier)
exten => _1NXXNXXXXXX,n,Set(TRUNK_TRIED=${TRUNK_TRIED},tertiary)
exten => _1NXXNXXXXXX,n,Dial(SIP/${EXTEN}@carrier-tertiary,60,tT)

exten => _1NXXNXXXXXX,n(done),NoOp(Final: ${DIALSTATUS} via ${TRUNK_TRIED})
exten => _1NXXNXXXXXX,n,Hangup()

Critical detail: Only fail over on CHANUNAVAIL and CONGESTION. These mean the trunk itself is down or overloaded. Never fail over on BUSY or NOANSWER; those mean the called party is unavailable, not the carrier. Failing over on BUSY or NOANSWER would redial the same number through each carrier in turn, so a single dial attempt could ring the same lead up to three times back to back.

DIALSTATUS	Meaning	Should Failover?
CHANUNAVAIL	Trunk unreachable	Yes
CONGESTION	Network congestion / SIP 503	Yes
BUSY	Called party busy	No
NOANSWER	Called party didn't answer	No
ANSWER	Call connected	No
CANCEL	Caller hung up	No

Step 3: Integrate with VICIdial

To route VICIdial's outbound calls through your failover dialplan without breaking call tracking, configure your carrier's dialplan entry in Admin > Carriers to point to your custom context:

exten => _1NXXNXXXXXX,1,Goto(outbound-failover,${EXTEN},1)

This tells VICIdial to route all outbound calls through your failover context, which handles the cascading carrier logic transparently.

Building an External Health Monitor

Asterisk's qualify only checks if the carrier responds to OPTIONS requests — it doesn't verify the carrier can actually route calls. Build an external monitor that checks three things:

SIP OPTIONS response (is the carrier alive?)
Response latency (is it degraded — over 500ms?)
Recent call success rate from your vicidial_log (are calls completing?)

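Run it via cron every 2 minutes. Alert when the primary is down and escalate when all carriers have issues. The script below covers the first two checks; the call success rate needs a query against your own call logs and is not included here:

#!/bin/bash
# carrier_health_monitor.sh
# Covers checks 1 and 2: qualify state and OPTIONS round-trip time.

LOG=/var/log/carrier_health.log
ALERT_TO=ops@yourdomain.com

# Returns 0 = healthy, 1 = degraded (slow replies), 2 = down or state unknown
check_carrier() {
    local name=$1 status latency
    status=$(asterisk -rx "sip show peer $name" | grep "Status")

    # Anything that is not OK or LAGGED counts as down: UNREACHABLE, UNKNOWN,
    # or an empty result because the peer is missing or Asterisk is not running.
    if ! echo "$status" | grep -qE "OK|LAGGED"; then
        echo "$(date) CRITICAL: $name is not reachable (${status:-no status})" >> "$LOG"
        return 2
    fi

    latency=$(echo "$status" | grep -oP '\d+(?= ms)')
    if [ -n "$latency" ] && [ "$latency" -gt 500 ]; then
        echo "$(date) WARNING: $name latency ${latency}ms" >> "$LOG"
        return 1
    fi
    return 0
}

check_carrier "carrier-primary";   PRIMARY=$?
check_carrier "carrier-secondary"; SECONDARY=$?
check_carrier "carrier-tertiary";  TERTIARY=$?

if [ "$PRIMARY" -ne 0 ] && [ "$SECONDARY" -ne 0 ] && [ "$TERTIARY" -ne 0 ]; then
    # Escalate: no carrier is fully healthy
    echo "All SIP carriers are down or degraded" | \
        mail -s "ESCALATION: all SIP carriers have issues" "$ALERT_TO"
elif [ "$PRIMARY" -eq 2 ]; then
    echo "Primary carrier UNREACHABLE - failover active" | \
        mail -s "CRITICAL: Primary SIP carrier down" "$ALERT_TO"
fi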

Proactive Rate-Based Failover

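The dialplan-level failover retries each individual call. But if your primary carrier's failure rate jumps from 5% to 40% within a 5-minute window, you don't want every call to pay for a failed attempt on the primary before falling back. A fast 503 costs little, but a carrier that stops answering INVITEs can hold each call until the SIP transaction times out, roughly 32 seconds with chan_sip's default timers.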

A cron script running every minute can check the failure rate over the last 5 minutes from vicidial_log. If it exceeds your threshold (we use 30%), it deactivates the primary carrier in VICIdial's carrier configuration and activates the secondary. This shifts routing at the VICIdial level so the dialplan never even tries the failing carrier.

When the primary recovers (detected by your health monitor), reverse the change and resume normal routing.
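The threshold check itself is simple enough to sketch. The function below is an illustration only, not the script described above: the name rate_decision and the MIN_SAMPLE floor are assumptions added here, and the two counts are expected to come from your own query over the last 5 minutes of call records.

```shell
#!/bin/bash
# Decide whether the primary carrier's recent failure rate warrants a switch.
# Arguments: total attempts, failed attempts, threshold percent (default 30).
# Prints "switch" or "stay". Integer math only, so 30 means 30%.
MIN_SAMPLE=20   # assumed floor: too few calls in the window to judge

rate_decision() {
    local total=$1 failed=$2 threshold=${3:-30}
    if [ "$total" -lt "$MIN_SAMPLE" ]; then
        echo "stay"
        return
    fi
    # failed/total > threshold/100, rearranged to avoid division
    if [ $(( failed * 100 )) -gt $(( total * threshold )) ]; then
        echo "switch"
    else
        echo "stay"
    fi
}

# Example: 80 failures out of 200 attempts is 40%, above the 30% threshold.
rate_decision 200 80
```

The minimum sample guard keeps a quiet period, where two failures out of three calls would read as 66%, from flipping carriers. Using a lower threshold for switching back than for switching away would also reduce flapping, though that is a design choice beyond what is described here.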

Capacity Planning

Failover only works if your backup carriers can absorb the traffic. Rules of thumb:

Secondary carrier should have contracts and capacity for at least 80% of primary's concurrent channels
Tertiary should handle at least 50%
For a 100-agent center doing 200 dials/agent/hour, you need roughly 150-200 concurrent channels at peak
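The channel estimate above follows from dial rate multiplied by how long each attempt holds a channel. Here is a rough worked version, assuming an average of 30 seconds per attempt across ringing, answered, and failed calls. That figure is an assumption for illustration; measure your own from call logs.

```shell
#!/bin/bash
# Rough concurrent-channel estimate: arrival rate x average time each dial
# attempt holds a channel (Little's law), using integer arithmetic.
AGENTS=100
DIALS_PER_AGENT_HOUR=200
AVG_HOLD_SECONDS=30      # assumption: mean channel occupancy per attempt

dials_per_hour=$(( AGENTS * DIALS_PER_AGENT_HOUR ))
primary_channels=$(( dials_per_hour * AVG_HOLD_SECONDS / 3600 ))
secondary_channels=$(( primary_channels * 80 / 100 ))   # 80% rule of thumb
tertiary_channels=$(( primary_channels * 50 / 100 ))    # 50% rule of thumb

echo "dials/hour: $dials_per_hour"
echo "primary:    $primary_channels"
echo "secondary:  $secondary_channels"
echo "tertiary:   $tertiary_channels"
```

With these inputs the primary needs about 166 channels, which sits inside the 150-200 range, and the secondary and tertiary need about 132 and 83. An average hold time of 27 to 36 seconds spans the whole 150-200 range, so the estimate is sensitive to that one number.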

Make sure maxcalls in asterisk.conf is set high enough to handle total capacity across all carriers:

[options]
maxcalls = 500
maxload = 5.0

Testing Without Disrupting Production

Never test failover by killing a live carrier connection. Three safe approaches:

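1. iptables block: iptables -A OUTPUT -d sip.primarycarrier.com -p udp --dport 5060 -j DROP simulates a network-level outage. Watch Asterisk detect the carrier as unreachable, verify calls route to secondary, then remove the rule by running the same command with -D in place of -A. Note that iptables resolves the hostname once, when the rule is added, so a carrier with several IP addresses gets one rule per address.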

2. Point to a dead IP: Temporarily change the primary carrier's host in sip.conf to 192.0.2.1 (part of TEST-NET-1, a block reserved for documentation and not routed on the public Internet), run sip reload, test failover, then restore from backup and reload again.

3. Test campaign: Create a dedicated campaign with a small list of your own numbers. Route it through the failover dialplan and simulate failures while monitoring the flow.
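For the iptables approach, the main risk is forgetting to remove the rule. A small wrapper can add the block, wait, and remove it even if the script exits early. This is a sketch under stated assumptions: the function names, the DRY_RUN switch, and the 120-second default are all hypothetical, and it defaults to printing the commands rather than running them.

```shell
#!/bin/bash
# Simulate a primary-carrier outage for a fixed time, then always clean up.
# DRY_RUN=1 (the default here) only prints the commands.
DRY_RUN=${DRY_RUN:-1}
CARRIER_HOST=${CARRIER_HOST:-sip.primarycarrier.com}

run() {
    if [ "$DRY_RUN" = "1" ]; then echo "+ $*"; else "$@"; fi
}

block_rule() {   # $1 is -A (add the rule) or -D (delete it)
    run iptables "$1" OUTPUT -d "$CARRIER_HOST" -p udp --dport 5060 -j DROP
}

simulate_outage() {
    local seconds=${1:-120}
    trap 'block_rule -D' EXIT   # remove the rule if the script exits early
    block_rule -A
    run sleep "$seconds"
    block_rule -D
    trap - EXIT
}

# Usage (as root, in a maintenance window): DRY_RUN=0 simulate_outage 120
```

Pick a duration longer than qualifyfreq so Asterisk has time to mark the peer UNREACHABLE before the rule is removed.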

After testing, query vicidial_carrier_log to confirm calls actually failed over:

SELECT
    DATE_FORMAT(cl.call_date, '%H:%i') as minute,
    cl.channel,
    COUNT(*) as calls,
    SUM(CASE WHEN cl.dialstatus = 'ANSWER' THEN 1 ELSE 0 END) as successful
FROM vicidial_carrier_log cl
WHERE cl.call_date BETWEEN '2026-03-19 02:00:00' AND '2026-03-19 02:30:00'
GROUP BY minute, cl.channel
ORDER BY minute;
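As a cross-check on the Asterisk side, the NoOp lines in the failover context ("Trying secondary carrier", "Trying tertiary carrier") are logged once per failover. The sketch below counts them. It assumes verbose output at level 3 or higher is written to a log file, and the helper name count_failovers is hypothetical.

```shell
#!/bin/bash
# Count how many calls fell through to the secondary and tertiary carriers,
# based on the NoOp text used in [outbound-failover].
# Reads Asterisk log text on stdin.
count_failovers() {
    grep -oE 'Trying (secondary|tertiary) carrier' \
        | awk '{ n[$2]++ }
               END { printf "secondary=%d tertiary=%d\n", n["secondary"], n["tertiary"] }'
}

# Intended use (the log path is an assumption; it depends on logger.conf):
#   count_failovers < /var/log/asterisk/full
```

If the secondary count stays at zero during a simulated outage, the calls are not reaching the failover context at all, which points at the carrier dialplan entry from Step 3 rather than at the trunks.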

When to Add Kamailio

For 100+ agent centers, Asterisk-level failover may not be fast enough. Kamailio sitting in front of Asterisk as a SIP proxy provides sub-second failover (reroutes within milliseconds), weighted traffic distribution (70% to cheapest carrier, 20% to second, 10% to third), and transaction-level failover where the retry happens within the same SIP transaction, transparent to Asterisk.

The full Kamailio configuration, weighted distribution setup, and VICIdial integration details are in the complete failover guide at ViciStack.

Originally published at https://vicistack.com/blog/vicidial-sip-trunk-failover/