Jason Shouldice

Posted on Mar 20 • Edited on Mar 25 • Originally published at vicistack.com

The VICIdial DR Plan I Wish I'd Had at 2 AM

#voip #asterisk #sysadmin #devops

It's 2:14 AM. Your phone is buzzing. The database server just died — failed RAID controller, bad firmware update, doesn't matter. 200 agents across three time zones are logging in at 8 AM, your dialer manager is already texting you, and every second of downtime is burning revenue.

I've been in that scenario more times than I'd like to admit — both on our own infrastructure and rescuing other operators who called us in a panic. I've watched a $40 RAID battery destroy an entire day's production. I've seen a DROP TABLE command typed into the wrong terminal wipe out a campaign mid-shift. And I've seen operators with proper DR in place recover from total server loss in under 15 minutes while agents never noticed.

The difference is preparation, not luck.

Why VICIdial DR Is Uniquely Hard

VICIdial isn't a stateless web app you redeploy from a container image. It's a real-time telephony platform with specific characteristics that break standard IT disaster recovery playbooks:

The database is a single point of failure by design. Every server in a VICIdial cluster connects to one MySQL instance. There's no built-in clustering, no automatic failover, no Galera-style multi-master. If the database goes down, every agent screen freezes, every dial stops, every call in progress gets orphaned.

Live call state is ephemeral. When Asterisk crashes or a telephony server reboots, every active call on that server drops instantly. There's no way to "resume" a call — the RTP streams, channel state, and conference bridges are gone. Your DR plan needs to account for this reality.

Recordings have legal retention requirements. Call recordings often need 3-7 years of retention depending on industry and jurisdiction. Losing recordings isn't just operational — it's a compliance liability that can result in fines.

Time sensitivity is extreme. A 200-agent call center losing 30 minutes of production at peak hours loses thousands in revenue and potentially hundreds of leads that will never be contactable again.

Set Two Numbers Before You Configure Anything

Everything flows from these two decisions:

Recovery Time Objective (RTO): How long can your call center be completely down before the business impact becomes unacceptable? This is a business question, not a technical one. Ask your operations manager, not your sysadmin.

Recovery Point Objective (RPO): How much data can you afford to lose? If the database dies right now, what's the oldest acceptable backup? The last 5 minutes of dispositions? The last hour? The last 24 hours?

Scenario	Typical RTO	Typical RPO	DR Strategy
Small (< 50 agents), single shift	2-4 hours	1 hour	Nightly backups + manual restore
Mid-size (50-200 agents), multi-shift	15-30 minutes	5 minutes	MySQL replication + keepalived failover
Large (200-500 agents), 16+ hours/day	< 5 minutes	Near-zero	Master-master replication + automated failover + hot standby
Enterprise / 24x7 (500+ agents)	< 1 minute	Zero	Multi-cluster with geographic redundancy

The money math: A 200-agent outbound operation at $15/hour/agent loses $3,000 for every hour of downtime in direct labor alone — not counting lost leads, missed SLAs, or attrition. A proper DR setup costs $100-300/month in additional infrastructure. The payback period is literally one incident.
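
That arithmetic is easy to rerun with your own headcount and wage. A minimal sketch, assuming direct labor is the only cost you count; the function name downtime_cost is illustrative and not part of VICIdial:

```shell
#!/bin/bash
# Direct labor cost of an outage, in whole dollars.
# Usage: downtime_cost AGENTS HOURLY_RATE MINUTES_DOWN
downtime_cost() {
    local agents=$1 rate=$2 minutes=$3
    # agents * dollars per hour * minutes down / 60
    echo $(( agents * rate * minutes / 60 ))
}

downtime_cost 200 15 60   # the example above: prints 3000
downtime_cost 200 15 30   # a 30-minute outage: prints 1500
```

Lost leads and missed SLAs come on top of this, so treat the result as a floor.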

MySQL Replication: The Foundation

Your database is the single most important component to protect. If you lose a telephony server, 20 agents have a bad day. If you lose the database without a replica, your entire operation stops.

Master-Slave Replication

This is the minimum viable DR strategy for any deployment above 30 agents. The master processes all writes and streams its binary log to one or more slaves that replay writes in near-real-time.

On the master (/etc/my.cnf):

[mysqld]
server-id = 1
log-bin = /var/log/mysql/mysql-bin
binlog-format = MIXED
expire_logs_days = 7
max_binlog_size = 256M
sync_binlog = 1
binlog-do-db = asterisk

On the slave:

[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay
read_only = 1
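# log-slave-updates only takes effect when binary logging is enabled
log-bin = /var/log/mysql/mysql-bin
log-slave-updates = 1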
replicate-do-db = asterisk

Set up the replication user on the master, dump the database with --master-data=2, transfer to the slave, import, configure CHANGE MASTER TO with the correct log file and position, and start the slave. Verify with SHOW SLAVE STATUS\G — you want Slave_IO_Running: Yes, Slave_SQL_Running: Yes, and Seconds_Behind_Master: 0.
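
The step people get wrong under pressure is copying the log file and position. With --master-data=2, mysqldump writes both into the dump header as a commented-out CHANGE MASTER TO line, so they can be extracted rather than retyped. A sketch, assuming the header follows that usual layout; the function names, host address, user name, and dump path are placeholders, and MASTER_PASSWORD is deliberately left out:

```shell
#!/bin/bash
# Print "LOG_FILE LOG_POS" from the header of a --master-data=2 dump on stdin.
binlog_coords_from_dump() {
    sed -n "s/^-- CHANGE MASTER TO MASTER_LOG_FILE='\([^']*\)', MASTER_LOG_POS=\([0-9]*\);.*/\1 \2/p" | head -n 1
}

# Build the CHANGE MASTER TO statement to run on the slave.
# Usage: change_master_sql MASTER_HOST REPL_USER LOG_FILE LOG_POS
change_master_sql() {
    printf "CHANGE MASTER TO MASTER_HOST='%s', MASTER_USER='%s', MASTER_LOG_FILE='%s', MASTER_LOG_POS=%s;\n" \
        "$1" "$2" "$3" "$4"
}

# Example on the slave, after importing the dump (add MASTER_PASSWORD first):
#   set -- $(gunzip -c /backup/asterisk_dump.sql.gz | head -n 60 | binlog_coords_from_dump)
#   change_master_sql 10.0.0.10 repl "$1" "$2"      # review, then pipe to mysql
#   mysql -e "START SLAVE; SHOW SLAVE STATUS\G"
```

Printing the statement before running it gives you one last chance to sanity-check the coordinates.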

What master-slave gives you: A warm standby database seconds behind production. If the master dies, promote the slave, repoint your cluster's $VARDB_server in every server's astguiclient.conf, restart services. Recovery: 5-15 minutes with a practiced procedure, 30-60 minutes first time.

What master-slave doesn't give you: Automatic failover. Someone has to detect the failure, promote the slave, and reconfigure every cluster node.

Master-Master Replication

Master-master means both MySQL instances are able to accept writes, which enables automated failover because either server can serve as primary at any time.

This is significantly more dangerous than master-slave. Write conflicts, auto-increment collisions, and split-brain scenarios are real risks. The key is ensuring only one master receives writes at any given time — the second master is a hot standby.

The critical configuration: set auto_increment_increment = 2 and auto_increment_offset to different values (1 and 2) on each server. Master 1 generates odd IDs, Master 2 generates even IDs. This eliminates the most common source of replication conflicts in VICIdial's auto-increment tables.
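
To see why the two servers can never hand out the same ID, it helps to list the values each one generates. A small illustration, assuming both servers start from empty tables; auto_increment_ids is a hypothetical helper that only mimics the arithmetic and does not talk to MySQL:

```shell
#!/bin/bash
# List the first COUNT auto-increment values a server would hand out.
# Usage: auto_increment_ids OFFSET INCREMENT COUNT
auto_increment_ids() {
    local offset=$1 increment=$2 count=$3 i=0 ids=""
    while [ "$i" -lt "$count" ]; do
        ids="$ids $(( offset + i * increment ))"
        i=$(( i + 1 ))
    done
    echo $ids   # unquoted on purpose: drops the leading space
}

auto_increment_ids 1 2 5   # Master 1 (offset 1): prints 1 3 5 7 9
auto_increment_ids 2 2 5   # Master 2 (offset 2): prints 2 4 6 8 10
```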

Automated Failover with Keepalived

Manual failover means waking someone up at 2 AM, SSH-ing into servers, running commands, and hoping they don't make a mistake under pressure. Keepalived uses VRRP to manage a floating IP — a virtual IP (VIP) that automatically moves from primary to backup if the primary becomes unreachable.

Your cluster nodes don't connect to the database server's real IP. They connect to the VIP. If the primary dies, keepalived moves the VIP to the standby within 1-3 seconds of detecting the failure. Every cluster node's MySQL connections fail and reconnect to the same VIP, which is now answered by the standby. Total interruption: under 10 seconds.

Keepalived Configuration

On the primary database server (/etc/keepalived/keepalived.conf):

vrrp_script chk_mysql {
    script "/usr/local/bin/check_mysql.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_MYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1

    virtual_ipaddress {
        192.168.1.100/24
    }

    track_script {
        chk_mysql
    }

    notify_master "/usr/local/bin/failover_promote.sh"
}

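On the standby: same config but state BACKUP and priority 90. One thing to watch: keepalived runs notify_master on every transition to MASTER, including when it first starts on the primary, so make sure the promotion script is harmless when it runs there.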

Health Check and Promotion Scripts

The health check runs every 2 seconds: is MySQL running? Accepting connections? Replication healthy?

#!/bin/bash
# /usr/local/bin/check_mysql.sh
if ! systemctl is-active --quiet mariadb; then exit 1; fi
if ! mysqladmin ping --connect-timeout=3 &>/dev/null; then exit 1; fi

# No -N here: skipping column names can strip the field labels the grep below relies on
SLAVE_STATUS=$(mysql -e "SHOW SLAVE STATUS\G" 2>/dev/null)
if [ -n "$SLAVE_STATUS" ]; then
    SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
    if [ "$SQL_RUNNING" != "Yes" ]; then exit 1; fi
fi
exit 0

The fall 3 parameter means three consecutive failures (6 seconds) before failover triggers — preventing false positives from momentary stalls during report generation or log archiving.
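
The numbers in the keepalived config only work together if the failing primary ends up below the standby. A quick way to check your own values, assuming keepalived's usual behavior of adding a negative weight to the priority while the tracked script is failing; both function names are illustrative:

```shell
#!/bin/bash
# Priority a node advertises: base priority, plus the (negative) weight
# while its tracked script is failing.
# Usage: effective_priority BASE WEIGHT SCRIPT_OK(1|0)
effective_priority() {
    local base=$1 weight=$2 ok=$3
    if [ "$ok" -eq 1 ]; then echo "$base"; else echo $(( base + weight )); fi
}

# Seconds of consecutive failed checks before the script is marked down.
# Usage: detection_window INTERVAL FALL
detection_window() {
    local interval=$1 fall=$2
    echo $(( interval * fall ))
}

primary=$(effective_priority 100 -20 0)      # MySQL check failing on the primary
standby=$(effective_priority 90 -20 1)       # standby healthy
echo "primary=$primary standby=$standby"     # prints primary=80 standby=90
echo "detection: $(detection_window 2 3)s"   # prints detection: 6s
```

If you ever change the priorities, rerun the check: a weight too small to drop the primary below the standby means the VIP never moves when MySQL dies but the host stays up.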

The promotion script stops replication, removes read-only, logs the binary log position, and sends an alert:

#!/bin/bash
# /usr/local/bin/failover_promote.sh
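mysql -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0;"
# Record where this server's binary log stood at promotion (log path is an example)
mysql -e "SHOW MASTER STATUS\G" >> /var/log/vicidial_failover.log 2>&1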
echo "VICIdial DB failover at $(date). This server is now primary." | \
    mail -s "CRITICAL: VICIdial Database Failover" ops@yourdomain.com

Every astguiclient.conf across the cluster should point $VARDB_server at the VIP (192.168.1.100), not any server's real IP. Same for MySQL connection strings in custom scripts, API integrations, and cron jobs. Miss one and that component breaks during failover while everything else works — maddening to debug under pressure.
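
Auditing this by eye across a dozen servers is where the stragglers slip through. A minimal check, assuming astguiclient.conf uses its usual "VARDB_server => value" layout; points_at_vip is a hypothetical helper name:

```shell
#!/bin/bash
# Succeeds only if FILE sets VARDB_server to exactly VIP.
# Usage: points_at_vip FILE VIP
points_at_vip() {
    local file=$1 vip=$2 value
    value=$(sed -n 's/^VARDB_server[[:space:]]*=>[[:space:]]*\([^[:space:]]*\).*/\1/p' "$file" | head -n 1)
    [ "$value" = "$vip" ]
}

# Example: run on each node (or over ssh) and flag the ones that were missed.
#   points_at_vip /etc/astguiclient.conf 192.168.1.100 || echo "$(hostname): not using the VIP"
```

This only covers astguiclient.conf. Custom scripts and cron jobs with hard-coded hosts still need a grep of their own.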

Common Pitfalls in Master-Master Replication

Master-master is the standard for production VICIdial HA, and VICIdial's storage engine raises the stakes: its MyISAM tables don't support transactions, so there's no rollback safety net if replication breaks.

Setting auto_increment_increment = 2 with a different auto_increment_offset on each server (1 and 2) matters most in VICIdial's high-write tables like vicidial_log, call_log, and vicidial_agent_log. With Master 1 generating odd-numbered IDs (1, 3, 5...) and Master 2 generating even-numbered IDs (2, 4, 6...), rows inserted on either server cannot collide on an auto-increment key.

Both servers must be configured as slaves of each other — run CHANGE MASTER TO on each server pointing to the other. The key operating principle: only one master should receive writes at any given time. The second master is a hot standby that can accept writes instantly if the primary fails, but under normal operation it only replays the primary's binary log.

Replication Monitoring

Check replication health regularly. Seconds_Behind_Master should be close to zero on the standby. If it's climbing, the standby hardware can't keep up with write volume, usually because of a disk I/O bottleneck that calls for SSDs or, for any InnoDB tables, a higher innodb_io_capacity.

On the replica, run SHOW SLAVE STATUS\G and check that Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If either shows No, replication is broken and needs manual intervention before a failover can succeed. Automate this check with a monitoring script that alerts you immediately when replication breaks.

Telephony Server Failover

Database failover gets the attention, but telephony server failures are more common. Hard drives die, kernel panics happen, Asterisk has its own collection of segfaults and memory leaks.

VICIdial's cluster architecture handles this better than most people realize. When a telephony server drops, AST_VDadapt.pl notices within 30-60 seconds and stops assigning calls. The adaptive algorithm redistributes dial capacity across remaining servers. If you have 10 telephony servers and one dies, you lose 10% of agent capacity until it comes back. The other 90% continues without interruption.

The critical exception: the primary dialer — the server running keepalive flags 5 and 7 (adaptive predictive algorithm and fill dialer). If that server dies, the entire cluster stops auto-dialing. Keep a scripted promotion procedure for the backup dialer. But do not fully automate it — a network partition could trick automation into running two primary dialers simultaneously, causing double-dialed leads and erratic dial levels. This requires a human check.
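
Part of that human check is confirming which node currently carries flags 5 and 7. A sketch of that lookup, assuming the flags live on the VARactive_keepalives line of astguiclient.conf; runs_primary_dialer is an illustrative name:

```shell
#!/bin/bash
# Succeeds if FILE's VARactive_keepalives includes flag 5 or flag 7.
# Usage: runs_primary_dialer FILE
runs_primary_dialer() {
    local file=$1 flags
    flags=$(sed -n 's/^VARactive_keepalives[[:space:]]*=>[[:space:]]*\([^[:space:]]*\).*/\1/p' "$file" | head -n 1)
    case "$flags" in
        *5*|*7*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example: run on every telephony server before promoting a backup dialer.
#   runs_primary_dialer /etc/astguiclient.conf && echo "$(hostname) is configured as primary dialer"
```

If more than one reachable node answers yes, stop and sort that out before promoting anything.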

For operations where every agent-minute counts, keep one telephony server running with VICIdial installed, configured, and registered but with no agents assigned (keepalive flags 12368). If any server fails, reassign its agents to the spare. Recovery: under 2 minutes.

Recording Backup Strategy

A 200-agent operation generates 1-2 TB of recordings per month. Losing them means regulatory fines, lost evidence in disputes, and compliance audit failures.

Three tiers:

Tier 1: Local disk (hours) — Asterisk writes to the telephony server. Local storage is a buffer, not a backup.

Tier 2: Archive server (weeks to months) — FTP pipeline moves processed recordings to a dedicated archive server. Mount via NFS to web servers for playback.

Tier 3: Cloud storage (years) — S3 or S3-compatible storage for long-term retention. Use STANDARD_IA for recordings older than a day (~$19/month for 1.5 TB). Glacier Deep Archive for recordings older than a year (~$1.50/month for 1.5 TB).

Agent Count	Monthly Recordings	S3 Standard IA	Glacier Deep Archive
50 agents	~250 GB	~$3/month	~$0.25/month
200 agents	~1.5 TB	~$19/month	~$1.50/month
500 agents	~4 TB	~$50/month	~$4/month
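
To redo the table for your own volume, multiply gigabytes stored by a per-GB monthly price. A rough sketch: the 0.0125 figure is an assumed Standard-IA price chosen because it reproduces the table above, the bucket name is made up, and current pricing for your region should be checked before budgeting:

```shell
#!/bin/bash
# Monthly storage cost in dollars.
# Usage: monthly_storage_cost GIGABYTES PRICE_PER_GB_MONTH
monthly_storage_cost() {
    awk -v gb="$1" -v price="$2" 'BEGIN { printf "%.2f\n", gb * price }'
}

monthly_storage_cost 1500 0.0125   # 1.5 TB at the assumed Standard-IA rate: prints 18.75
monthly_storage_cost 4000 0.0125   # 4 TB at the same rate: prints 50.00

# Uploading a recording straight into that storage class (bucket name is hypothetical):
#   aws s3 cp recording.wav s3://example-recordings/2026/03/ --storage-class STANDARD_IA
```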

Point-in-Time Recovery: The Difference Between Losing a Day and Losing 30 Seconds

This is what separates "we lost a day of data" from "we lost 30 seconds of data." Point-in-time recovery (PITR) combines a full backup with binary log replay to restore the database to any specific moment.

The prerequisite is binary logging — which you already have if you configured replication. The recovery procedure:

# 1. Someone dropped vicidial_list at 14:32:15

# 2. Restore the most recent full backup
gunzip < /backup/mysql/vicidial_20260318_020000.sql.gz | mysql asterisk

# 3. Replay binary logs from backup to one second before disaster
mysqlbinlog --start-datetime="2026-03-18 02:00:00" \
            --stop-datetime="2026-03-18 14:32:14" \
            /var/log/mysql/mysql-bin.000042 \
            /var/log/mysql/mysql-bin.000043 | mysql asterisk

The --stop-datetime is set to one second before the disaster. This replays every write between backup and bad event, without replaying the bad event itself. Store binary logs on a separate physical disk from your data directory — if the data disk fails, binary logs survive and you can do PITR from backup plus surviving logs.
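
Datetimes only resolve to the second, so legitimate writes that landed in the same second as the bad statement are lost. When that matters, you can stop on the exact event instead. A sketch, assuming the usual mysqlbinlog text output where each event is preceded by a "# at N" line; position_before is a hypothetical helper:

```shell
#!/bin/bash
# Read mysqlbinlog text output on stdin and print the start position of the
# first event whose text contains PATTERN (case-insensitive).
# Usage: position_before PATTERN
position_before() {
    awk -v pat="$1" '
        /^# at [0-9]+$/                  { pos = $3 }
        index(toupper($0), toupper(pat)) { print pos; exit }
    '
}

# Example:
#   mysqlbinlog /var/log/mysql/mysql-bin.000043 | position_before "DROP TABLE"
# Then replay with --stop-position=<that number> in place of --stop-datetime.
```

mysqlbinlog stops at the first event whose position is equal to or greater than --stop-position, so passing the start of the bad event replays everything before it and nothing after. Read the surrounding output yourself before trusting the match, since the pattern could also appear in an earlier, harmless statement.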

Database Backups vs. Replication

Replication protects you from hardware failure. Backups protect you from everything else — accidental deletes, corrupted tables, bad upgrades, ransomware. Replication is not a backup. If someone runs DELETE FROM vicidial_list WHERE 1=1 on the master, that delete replicates to the slave instantly. Both copies are empty.

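Run automated daily backups with mysqldump --single-transaction --master-data=2. The --master-data=2 flag embeds the binary log position, enabling point-in-time recovery: you can restore the backup and then replay binary logs up to one second before the disaster. One caveat: --single-transaction only guarantees a consistent snapshot for InnoDB tables. MyISAM tables can still change while the dump runs, so schedule the dump for a quiet window.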

Verify backup size isn't suspiciously small (corrupt or truncated). Sync to offsite storage. Keep 14 days of retention.

The Config Files Everyone Forgets

Beyond MySQL, these files need backup:

/etc/astguiclient.conf — per-server VICIdial config
/etc/asterisk/ — Asterisk configuration
/etc/my.cnf — MySQL configuration
/etc/keepalived/keepalived.conf — failover config
/usr/share/astguiclient/ — VICIdial Perl scripts and custom modifications
/var/www/html/agc/ — agent interface customizations
SSL certificates and carrier trunk credentials

Keep a Git repo with all config files. After any change, commit and push. If you need to rebuild from scratch, the entire history is versioned.
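
Collecting the files into the repo is the part worth scripting, because the commit only helps if nothing was left out. A minimal sketch; snapshot_configs is a hypothetical helper, the repo path is made up, and the repo should be private because these files contain credentials:

```shell
#!/bin/bash
# Copy each existing PATH into DEST, keeping its full directory layout.
# Usage: snapshot_configs DEST PATH...
snapshot_configs() {
    local dest=$1 path
    shift
    for path in "$@"; do
        [ -e "$path" ] || continue                 # skip files this server lacks
        mkdir -p "$dest$(dirname "$path")"
        cp -Rp "$path" "$dest$(dirname "$path")/"
    done
}

# Example:
#   cd /srv/vicidial-config
#   snapshot_configs . /etc/astguiclient.conf /etc/asterisk /etc/my.cnf /etc/keepalived/keepalived.conf
#   git add -A && git commit -m "config snapshot $(hostname) $(date +%F)" && git push
```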

The DR Runbook: Scenarios That Actually Happen

A runbook isn't documentation you read — it's a procedure you execute at 2 AM when your brain is at 40% capacity.

Scenario 1: Database Server Total Failure (Automated Failover Active)

Expected RTO: < 10 seconds (automated) + 5 minutes (verification)

1. Keepalived detects failure, VIP migrates to standby automatically
2. Promotion script executes: standby stops replication, accepts writes
3. Verify: ip addr show | grep 192.168.1.100 (VIP is on new primary), SHOW SLAVE STATUS\G (should be empty), check vicidial_live_agents count
4. Monitor agent reconnection: brief "time synchronization" errors, recovery within 30-60 seconds
5. Notify operations: "DB failover occurred. Service restored. Investigating root cause."
6. Root cause analysis after service is stable
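
For the VIP check in the verification step, a slightly stricter test than a bare grep avoids matching a longer address that merely starts with the VIP. A sketch, assuming the one-line-per-address output of ip -o -4 addr show; holds_vip is an illustrative name:

```shell
#!/bin/bash
# Reads `ip -o -4 addr show` output on stdin; succeeds if VIP is bound locally.
# Usage: ip -o -4 addr show | holds_vip VIP
holds_vip() {
    grep -qF "inet $1/"
}

# Example on the server you expect to be primary:
#   ip -o -4 addr show | holds_vip 192.168.1.100 && echo "VIP is on this host"
```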

Scenario 2: Database Server Failure (No Automated Failover)

Expected RTO: 5-15 minutes (practiced), 30-60 minutes (first time)

1. Confirm master is actually down (not just network partition)
2. On the slave: stop replication, remove read-only, note binary log position
3. On every cluster node: update $VARDB_server in astguiclient.conf to the slave's IP
4. Restart keepalive processes on every cluster node
5. Verify agents are reconnecting
6. Plan master rebuild during next maintenance window
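
Editing astguiclient.conf by hand on every node is the slowest and most error-prone step in this scenario. A sketch of scripting it, assuming the file's usual "VARDB_server => value" layout; repoint_db_server is a hypothetical helper and the address in the example is a placeholder for your slave's IP:

```shell
#!/bin/bash
# Rewrite VARDB_server in FILE to NEW_HOST, leaving the old version in FILE.bak.
# Usage: repoint_db_server FILE NEW_HOST
repoint_db_server() {
    local file=$1 new_host=$2
    sed -i.bak "s/^\(VARDB_server[[:space:]]*=>[[:space:]]*\).*/\1$new_host/" "$file"
}

# Example, run on each cluster node:
#   repoint_db_server /etc/astguiclient.conf 192.168.1.12
#   grep '^VARDB_server' /etc/astguiclient.conf   # confirm before restarting anything
```

The .bak copy makes it quick to point the node back once the original master is rebuilt.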

Scenario 3: Telephony Server Crash

Expected RTO: automatic (30-60 seconds for remaining cluster to absorb)

1. VICIdial auto-detects missing server within 30-60 seconds
2. Affected agents see errors, need to re-login
3. If primary dialer (flags 5/7): manually promote backup using the promotion script after confirming the primary is truly down
4. If spare server available: reassign affected agents through admin interface (2 minutes)

What to Back Up Beyond MySQL

Path	What It Is	Frequency
/etc/astguiclient.conf	Per-server VICIdial config	On change
/etc/asterisk/	Asterisk configuration	On change
/etc/my.cnf	MySQL configuration	On change
/etc/keepalived/keepalived.conf	Failover config	On change
/usr/share/astguiclient/	VICIdial Perl scripts + mods	Weekly
/var/www/html/agc/	Agent interface customizations	Weekly
SSL certificates	WebRTC/HTTPS	On renewal

Keep all config files in a Git repo. After any change, commit and push. If you need to rebuild a server from scratch, the entire configuration history is versioned and available.

Automated Daily Backups

Set up a cron job that runs at 2 AM:

#!/bin/bash
# /usr/local/bin/vicidial_backup.sh
set -o pipefail   # a mysqldump failure must not be hidden by gzip succeeding

BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=14
BACKUP_FILE="$BACKUP_DIR/vicidial_${DATE}.sql.gz"

mkdir -p "$BACKUP_DIR"

# Dump the asterisk database with the binlog position embedded as a comment.
# --single-transaction gives a consistent snapshot for InnoDB tables only.
if ! mysqldump -u root \
    --single-transaction \
    --routines --triggers --events \
    --master-data=2 --flush-logs \
    asterisk | gzip > "$BACKUP_FILE"; then
    echo "ERROR: mysqldump failed, check $BACKUP_FILE" | \
        mail -s "VICIdial Backup FAILED" ops@yourdomain.com
    exit 1
fi

# Verify backup isn't zero-size or truncated
BACKUP_SIZE=$(stat -c%s "$BACKUP_FILE")
if [ "$BACKUP_SIZE" -lt 1000000 ]; then
    echo "WARNING: Backup suspiciously small ($BACKUP_SIZE bytes)" | \
        mail -s "VICIdial Backup Warning" ops@yourdomain.com
    exit 1
fi

# Clean up old backups
find "$BACKUP_DIR" -name "vicidial_*.sql.gz" -mtime +"$RETENTION_DAYS" -delete

# Sync to offsite
rsync -az "$BACKUP_FILE" backup_user@remote_ip:/offsite_backup/

The --master-data=2 flag embeds the binary log position as a comment in the dump file. This is essential for point-in-time recovery — you need to know exactly where the backup's snapshot ends and the binary logs should begin.
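
A size threshold catches empty files but not a dump that was cut off partway through. A cheap extra check, assuming mysqldump's default behavior of ending the file with a "-- Dump completed" comment; dump_is_complete is a hypothetical helper:

```shell
#!/bin/bash
# Succeeds if FILE is a valid gzip archive whose last line is mysqldump's
# "-- Dump completed" trailer.
# Usage: dump_is_complete FILE.sql.gz
dump_is_complete() {
    local file=$1
    gzip -t "$file" 2>/dev/null || return 1
    gzip -dc "$file" | tail -n 1 | grep -q '^-- Dump completed'
}

# Example, right after the dump is written:
#   dump_is_complete "/backup/mysql/vicidial_20260318_020000.sql.gz" || echo "backup incomplete"
```

This still is not a restore test. It only tells you the dump ran to the end.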

Test Your DR Plan

A DR plan you've never tested is a hypothesis, not a plan. Schedule quarterly failover tests during low-traffic hours. Actually trigger the failover — not just a tabletop exercise. Time the recovery. Document what went wrong. Update procedures based on what you learn. The 2 AM scenario is not the time to discover your promotion script has a typo or that someone changed the VIP address six months ago without updating the keepalived config.
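
Timing the recovery is easier if a script does the watching while you trigger the failure. A minimal sketch that polls a command once a second and reports how long it took to succeed; seconds_until_ok is an illustrative name, and the mysqladmin probe against the VIP is just one possible check:

```shell
#!/bin/bash
# Poll a command once a second until it succeeds or TIMEOUT seconds pass.
# Prints elapsed seconds; returns non-zero on timeout.
# Usage: seconds_until_ok TIMEOUT COMMAND [ARGS...]
seconds_until_ok() {
    local timeout=$1 start elapsed ok
    shift
    start=$(date +%s)
    while :; do
        if "$@" >/dev/null 2>&1; then ok=0; else ok=1; fi
        elapsed=$(( $(date +%s) - start ))
        if [ "$ok" -eq 0 ]; then echo "$elapsed"; return 0; fi
        if [ "$elapsed" -ge "$timeout" ]; then echo "$elapsed"; return 1; fi
        sleep 1
    done
}

# Example: start this, then stop MySQL on the primary and watch the number.
#   seconds_until_ok 120 mysqladmin -h 192.168.1.100 --connect-timeout=3 ping
```

Write the measured number into the runbook next to the expected RTO so the next test has something to compare against.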

Monitoring Replication Health

Set up a cron job that checks replication status every 5 minutes and alerts on failure:

#!/bin/bash
# No -N here: the greps below need the field labels in the \G output
SLAVE_STATUS=$(mysql -e "SHOW SLAVE STATUS\G" 2>/dev/null)
IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Master:" | awk '{print $2}')

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
    echo "REPLICATION BROKEN: IO=$IO_RUNNING SQL=$SQL_RUNNING" | \
        mail -s "CRITICAL: VICIdial Replication Failure" ops@yourdomain.com
fi

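# Seconds_Behind_Master is NULL while replication is stopped, so only compare real numbers
if [[ "$LAG" =~ ^[0-9]+$ ]] && [ "$LAG" -gt 30 ]; then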
    echo "REPLICATION LAG: ${LAG} seconds behind master" | \
        mail -s "WARNING: VICIdial Replication Lag" ops@yourdomain.com
fi

Broken replication that goes undetected for hours means your "standby" database is no longer current — if you fail over to it, you lose every write since replication broke. This monitoring script is the cheapest insurance in your entire DR stack.

The difference between a minor hiccup and a catastrophic outage is whether you built and tested the plan before you needed it. ViciStack ships every cluster with keepalived, automated promotion, recording backup, and alerting pre-configured — the database fails over before your phone finishes buzzing.

Originally published at https://vicistack.com/blog/vicidial-disaster-recovery/
