It's 2:14 AM. Your phone is buzzing. The database server just died — failed RAID controller, bad firmware update, doesn't matter. 200 agents across three time zones are logging in at 8 AM, your dialer manager is already texting you, and every second of downtime is burning revenue.
I've been in that scenario more times than I'd like to admit — both on our own infrastructure and rescuing other operators who called us in a panic. I've watched a $40 RAID battery destroy an entire day's production. I've seen a DROP TABLE command typed into the wrong terminal wipe out a campaign mid-shift. And I've seen operators with proper DR in place recover from total server loss in under 15 minutes while agents never noticed.
The difference is preparation, not luck.
Why VICIdial DR Is Uniquely Hard
VICIdial isn't a stateless web app you redeploy from a container image. It's a real-time telephony platform with specific characteristics that break standard IT disaster recovery playbooks:
The database is a single point of failure by design. Every server in a VICIdial cluster connects to one MySQL instance. There's no built-in clustering, no automatic failover, no Galera-style multi-master. If the database goes down, every agent screen freezes, every dial stops, every call in progress gets orphaned.
Live call state is ephemeral. When Asterisk crashes or a telephony server reboots, every active call on that server drops instantly. There's no way to "resume" a call — the RTP streams, channel state, and conference bridges are gone. Your DR plan needs to account for this reality.
Recordings have legal retention requirements. Call recordings often need 3-7 years of retention depending on industry and jurisdiction. Losing recordings isn't just operational — it's a compliance liability that can result in fines.
Time sensitivity is extreme. A 200-agent call center losing 30 minutes of production at peak hours loses thousands in revenue and potentially hundreds of leads that will never be contactable again.
Set Two Numbers Before You Configure Anything
Everything flows from these two decisions:
Recovery Time Objective (RTO): How long can your call center be completely down before the business impact becomes unacceptable? This is a business question, not a technical one. Ask your operations manager, not your sysadmin.
Recovery Point Objective (RPO): How much data can you afford to lose? If the database dies right now, what's the oldest acceptable backup? The last 5 minutes of dispositions? The last hour? The last 24 hours?
| Scenario | Typical RTO | Typical RPO | DR Strategy |
|---|---|---|---|
| Small (< 50 agents), single shift | 2-4 hours | 1 hour | Nightly backups + manual restore |
| Mid-size (50-200 agents), multi-shift | 15-30 minutes | 5 minutes | MySQL replication + keepalived failover |
| Large (200-500 agents), 16+ hours/day | < 5 minutes | Near-zero | Master-master replication + automated failover + hot standby |
| Enterprise / 24x7 (500+ agents) | < 1 minute | Zero | Multi-cluster with geographic redundancy |
The money math: A 200-agent outbound operation at $15/hour/agent loses $3,000 for every hour of downtime in direct labor alone — not counting lost leads, missed SLAs, or attrition. A proper DR setup costs $100-300/month in additional infrastructure. The payback period is literally one incident.
MySQL Replication: The Foundation
Your database is the single most important component to protect. If you lose a telephony server, 20 agents have a bad day. If you lose the database without a replica, your entire operation stops.
Master-Slave Replication
This is the minimum viable DR strategy for any deployment above 30 agents. The master processes all writes and streams its binary log to one or more slaves that replay writes in near-real-time.
On the master (/etc/my.cnf):
```ini
[mysqld]
server-id = 1
log-bin = /var/log/mysql/mysql-bin
binlog-format = MIXED
expire_logs_days = 7
max_binlog_size = 256M
sync_binlog = 1
binlog-do-db = asterisk
```
On the slave:
```ini
[mysqld]
server-id = 2
relay-log = /var/log/mysql/mysql-relay
read_only = 1
log-slave-updates = 1
replicate-do-db = asterisk
```
Set up the replication user on the master, dump the database with --master-data=2, transfer to the slave, import, configure CHANGE MASTER TO with the correct log file and position, and start the slave. Verify with SHOW SLAVE STATUS\G — you want Slave_IO_Running: Yes, Slave_SQL_Running: Yes, and Seconds_Behind_Master: 0.
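That sequence, sketched as commands. This is an ops sketch, not a drop-in script: the `repl` user, password, hostnames, and the log file/position values are placeholders to substitute from your own environment.

```bash
# On the master: create a replication user (placeholder host and password)
mysql -e "CREATE USER 'repl'@'192.168.1.11' IDENTIFIED BY 'CHANGE_ME';"
mysql -e "GRANT REPLICATION SLAVE ON *.* TO 'repl'@'192.168.1.11';"

# Dump with the binlog position embedded, then ship it to the slave
mysqldump -u root --single-transaction --master-data=2 asterisk \
    | gzip > /tmp/asterisk.sql.gz
scp /tmp/asterisk.sql.gz root@192.168.1.11:/tmp/

# On the slave: import, point at the master, start replication
gunzip < /tmp/asterisk.sql.gz | mysql asterisk
mysql -e "CHANGE MASTER TO
    MASTER_HOST='192.168.1.10', MASTER_USER='repl', MASTER_PASSWORD='CHANGE_ME',
    MASTER_LOG_FILE='mysql-bin.000001', MASTER_LOG_POS=154;"  # values from the dump header
mysql -e "START SLAVE;"
mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running|Seconds_Behind_Master'
```

If both `Running` fields report Yes and `Seconds_Behind_Master` is 0, replication is live.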
What master-slave gives you: A warm standby database seconds behind production. If the master dies, promote the slave, repoint your cluster's $VARDB_server in every server's astguiclient.conf, restart services. Recovery: 5-15 minutes with a practiced procedure, 30-60 minutes first time.
What master-slave doesn't give you: Automatic failover. Someone has to detect the failure, promote the slave, and reconfigure every cluster node.
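Detection and promotion stay manual, but the reconfiguration step can be scripted in advance. A minimal sketch, assuming the `VARDB_server => x.x.x.x` line format used in astguiclient.conf (verify against your own file) and SSH access to each node:

```bash
#!/bin/bash
# Rewrite a node's astguiclient.conf to point at the promoted slave.
# ASSUMPTION: the config line looks like "VARDB_server => 10.0.0.4".
repoint_conf() {  # usage: repoint_conf <conf_file> <new_db_ip>
    sed -i "s/^VARDB_server.*/VARDB_server => $2/" "$1"
}

# Across the cluster you would run this on every node, e.g.:
#   for node in 10.0.0.21 10.0.0.22 10.0.0.23; do
#       ssh root@"$node" "sed -i 's/^VARDB_server.*/VARDB_server => 192.168.1.11/' \
#           /etc/astguiclient.conf"
#   done
```

Follow it with a keepalive-process restart on each node, exactly as the manual procedure requires.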
Master-Master Replication
Master-master means both MySQL instances accept writes simultaneously, enabling automated failover because either server can serve as primary at any time.
This is significantly more dangerous than master-slave: write conflicts, auto-increment collisions, and split-brain scenarios are all real risks. The key operating principle is that only one master receives writes at any given time; the second master is a hot standby. The auto-increment offsets that prevent the most common replication conflicts are covered in the pitfalls section below.
Automated Failover with Keepalived
Manual failover means waking someone up at 2 AM, SSH-ing into servers, running commands, and hoping they don't make a mistake under pressure. Keepalived uses VRRP to manage a floating IP — a virtual IP (VIP) that automatically moves from primary to backup if the primary becomes unreachable.
Your cluster nodes don't connect to the database server's real IP. They connect to the VIP. If the primary dies, keepalived moves the VIP to the standby within 1-3 seconds. Every cluster node's MySQL connections fail and reconnect — to the same VIP, now resolving to the standby. Total interruption: under 10 seconds.
Keepalived Configuration
On the primary database server (/etc/keepalived/keepalived.conf):
```
vrrp_script chk_mysql {
    script "/usr/local/bin/check_mysql.sh"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_MYSQL {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    virtual_ipaddress {
        192.168.1.100/24
    }
    track_script {
        chk_mysql
    }
    notify_master "/usr/local/bin/failover_promote.sh"
}
```
On the standby: same config but state BACKUP and priority 90.
Health Check and Promotion Scripts
The health check runs every 2 seconds: is MySQL running? Accepting connections? Replication healthy?
```bash
#!/bin/bash
# /usr/local/bin/check_mysql.sh
if ! systemctl is-active --quiet mariadb; then exit 1; fi
if ! mysqladmin ping --connect-timeout=3 &>/dev/null; then exit 1; fi

SLAVE_STATUS=$(mysql -N -e "SHOW SLAVE STATUS\G" 2>/dev/null)
if [ -n "$SLAVE_STATUS" ]; then
    SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
    if [ "$SQL_RUNNING" != "Yes" ]; then exit 1; fi
fi
exit 0
```
The fall 3 parameter means three consecutive failures (6 seconds) before failover triggers — preventing false positives from momentary stalls during report generation or log archiving.
The promotion script stops replication, removes read-only, logs the binary log position, and sends an alert:
```bash
#!/bin/bash
# /usr/local/bin/failover_promote.sh
mysql -e "STOP SLAVE; RESET SLAVE ALL; SET GLOBAL read_only = 0;"
echo "VICIdial DB failover at $(date). This server is now primary." | \
    mail -s "CRITICAL: VICIdial Database Failover" ops@yourdomain.com
```
Every astguiclient.conf across the cluster should point $VARDB_server at the VIP (192.168.1.100), not any server's real IP. Same for MySQL connection strings in custom scripts, API integrations, and cron jobs. Miss one and that component breaks during failover while everything else works — maddening to debug under pressure.
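One way to catch strays before an incident is a sweep that flags any config still carrying a database server's real IP instead of the VIP. A hedged sketch; the IPs and the file list are placeholders for your own deployment:

```bash
#!/bin/bash
# Flag config lines that reference a DB server's real IP instead of the VIP.
# PLACEHOLDERS: adjust VIP, REAL_IPS, and the file list for your cluster.
VIP="192.168.1.100"
REAL_IPS="192.168.1.10 192.168.1.11"

audit_file() {  # prints offending lines; returns nonzero if any are found
    local file=$1 found=0
    for ip in $REAL_IPS; do
        # -F fixed string, -w so 192.168.1.10 does not match 192.168.1.100
        if grep -Fwn "$ip" "$file" 2>/dev/null; then found=1; fi
    done
    return $found
}

# for f in /etc/astguiclient.conf /usr/local/bin/*.sh /etc/cron.d/*; do
#     audit_file "$f" || echo "FIX: $f bypasses the VIP ($VIP)"
# done
```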
Common Pitfalls in Master-Master Replication
Master-master is the standard for production VICIdial HA, but it demands discipline. VICIdial's MyISAM tables don't support transactions, meaning there's no rollback safety net if replication breaks.
The critical configuration: set auto_increment_increment = 2 and give each server a different auto_increment_offset (1 and 2). Master 1 generates odd-numbered IDs (1, 3, 5...) and Master 2 generates even-numbered IDs (2, 4, 6...), which eliminates auto-increment collisions in VICIdial's high-write tables like vicidial_log, call_log, and vicidial_agent_log.
Both servers must be configured as slaves of each other — run CHANGE MASTER TO on each server pointing to the other. Under normal operation the hot standby only replays the primary's binary log; it can begin accepting writes the instant the primary fails.
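In my.cnf form, a config fragment per the scheme above (only the offset line differs between the two servers):

```ini
# Master 1 (/etc/my.cnf) -- generates odd IDs
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 1

# Master 2 (/etc/my.cnf) -- generates even IDs
[mysqld]
auto_increment_increment = 2
auto_increment_offset    = 2
```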
Replication Monitoring
Check replication health regularly. Seconds_Behind_Master should be close to zero on the standby. If it's climbing, the standby hardware can't keep up with write volume — usually a disk I/O bottleneck that needs SSDs or increased innodb_io_capacity.
On the replica, run SHOW SLAVE STATUS\G and check that Slave_IO_Running: Yes and Slave_SQL_Running: Yes. If either shows No, replication is broken and needs manual intervention before a failover can succeed. Automate this check with a monitoring script that alerts you immediately when replication breaks.
Telephony Server Failover
Database failover gets the attention, but telephony server failures are more common. Hard drives die, kernel panics happen, Asterisk has its own collection of segfaults and memory leaks.
VICIdial's cluster architecture handles this better than most people realize. When a telephony server drops, AST_VDadapt.pl notices within 30-60 seconds and stops assigning calls. The adaptive algorithm redistributes dial capacity across remaining servers. If you have 10 telephony servers and one dies, you lose 10% of agent capacity until it comes back. The other 90% continues without interruption.
The critical exception: the primary dialer — the server running keepalive flags 5 and 7 (adaptive predictive algorithm and fill dialer). If that server dies, the entire cluster stops auto-dialing. Keep a scripted promotion procedure for the backup dialer. But do not fully automate it — a network partition could trick automation into running two primary dialers simultaneously, causing double-dialed leads and erratic dial levels. This requires a human check.
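For reference, the promotion itself can be kept to a couple of prepared statements. This is a sketch built on assumptions you must verify first: that your cluster stores keepalive flags in the asterisk database's servers table (active_keepalives column, visible under Admin > Servers), and that the flag strings below match your own assignments.

```bash
#!/bin/bash
# promote_dialer.sh -- run BY HAND only after confirming the old primary
# dialer is truly down (never from automation; see the split-brain caveat).
# PLACEHOLDERS: server IPs and flag strings -- verify the active_keepalives
# values in your own asterisk.servers table before relying on this.
FAILED_IP="10.0.0.31"
BACKUP_IP="10.0.0.32"

mysql asterisk -e "
    UPDATE servers SET active_keepalives='123468'
     WHERE server_ip='${FAILED_IP}';   -- strip flags 5 and 7 from the dead server
    UPDATE servers SET active_keepalives='12345678'
     WHERE server_ip='${BACKUP_IP}';   -- backup now runs the adaptive + fill dialers
"
```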
For operations where every agent-minute counts, keep one telephony server running with VICIdial installed, configured, and registered but with no agents assigned (keepalive flags 12368). If any server fails, reassign its agents to the spare. Recovery: under 2 minutes.
Recording Backup Strategy
A 200-agent operation generates 1-2 TB of recordings per month. Losing them means regulatory fines, lost evidence in disputes, and compliance audit failures.
Three tiers:
Tier 1: Local disk (hours) — Asterisk writes to the telephony server. Local storage is a buffer, not a backup.
Tier 2: Archive server (weeks to months) — FTP pipeline moves processed recordings to a dedicated archive server. Mount via NFS to web servers for playback.
Tier 3: Cloud storage (years) — S3 or S3-compatible storage for long-term retention. Use STANDARD_IA for recordings older than a day (~$19/month for 1.5 TB). Glacier Deep Archive for recordings older than a year (~$1.50/month for 1.5 TB).
| Agent Count | Monthly Recordings | S3 Standard IA | Glacier Deep Archive |
|---|---|---|---|
| 50 agents | ~250 GB | ~$3/month | ~$0.25/month |
| 200 agents | ~1.5 TB | ~$19/month | ~$1.50/month |
| 500 agents | ~4 TB | ~$50/month | ~$4/month |
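If Tier 3 lives in S3, the age-based tiering can be encoded as a lifecycle rule rather than scripted. A sketch with a placeholder bucket and prefix and illustrative cutoffs; note that S3 will not lifecycle-transition an object to STANDARD_IA before it is 30 days old, so tune the numbers to your retention rules.

```bash
# Write the lifecycle policy (placeholder prefix; adjust Days values)
cat > lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "recording-tiering",
      "Status": "Enabled",
      "Filter": {"Prefix": "recordings/"},
      "Transitions": [
        {"Days": 30, "StorageClass": "STANDARD_IA"},
        {"Days": 365, "StorageClass": "DEEP_ARCHIVE"}
      ]
    }
  ]
}
EOF

# Apply it (placeholder bucket name):
#   aws s3api put-bucket-lifecycle-configuration \
#       --bucket YOUR-RECORDINGS-BUCKET \
#       --lifecycle-configuration file://lifecycle.json
```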
Point-in-Time Recovery: The Difference Between Losing a Day and Losing 30 Seconds
Point-in-time recovery (PITR) combines a full backup with binary log replay to restore the database to any specific moment.
The prerequisite is binary logging — which you already have if you configured replication. The recovery procedure:
```bash
# 1. Someone dropped vicidial_list at 14:32:15
# 2. Restore the most recent full backup
gunzip < /backup/mysql/vicidial_20260318_020000.sql.gz | mysql asterisk

# 3. Replay binary logs from backup time to one second before the disaster
mysqlbinlog --start-datetime="2026-03-18 02:00:00" \
            --stop-datetime="2026-03-18 14:32:14" \
            /var/log/mysql/mysql-bin.000042 \
            /var/log/mysql/mysql-bin.000043 | mysql asterisk
```
The --stop-datetime is set to one second before the disaster. This replays every write between backup and bad event, without replaying the bad event itself. Store binary logs on a separate physical disk from your data directory — if the data disk fails, binary logs survive and you can do PITR from backup plus surviving logs.
Database Backups vs. Replication
Replication protects you from hardware failure. Backups protect you from everything else — accidental deletes, corrupted tables, bad upgrades, ransomware. Replication is not a backup. If someone runs DELETE FROM vicidial_list WHERE 1=1 on the master, that delete replicates to the slave instantly. Both copies are empty.
Run automated daily backups with mysqldump --single-transaction --master-data=2. The --master-data=2 flag embeds the binary log position, enabling point-in-time recovery — you can restore the backup and then replay binary logs up to one second before the disaster.
Verify backup size isn't suspiciously small (corrupt or truncated). Sync to offsite storage. Keep 14 days of retention.
The Config Files Everyone Forgets
Beyond MySQL, these files need backup:
- `/etc/astguiclient.conf` — per-server VICIdial config
- `/etc/asterisk/` — Asterisk configuration
- `/etc/my.cnf` — MySQL configuration
- `/etc/keepalived/keepalived.conf` — failover config
- `/usr/share/astguiclient/` — VICIdial Perl scripts and custom modifications
- `/var/www/html/agc/` — agent interface customizations
- SSL certificates and carrier trunk credentials
Keep a Git repo with all config files. After any change, commit and push. If you need to rebuild from scratch, the entire history is versioned.
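A minimal version of that habit as a script, assuming the repo directory has already been initialized (git init, remote added) and the file paths mirror the list above:

```bash
#!/bin/bash
# Snapshot key configs into a git repo; commit only when something changed.
# ASSUMPTION: <repo_dir> is an existing, initialized git repository.
snapshot_configs() {  # usage: snapshot_configs <repo_dir> <file...>
    local repo=$1; shift
    local f
    for f in "$@"; do
        [ -e "$f" ] && cp -a "$f" "$repo/"
    done
    git -C "$repo" add -A
    # commit only when the staged content actually differs
    git -C "$repo" diff --cached --quiet || \
        git -C "$repo" commit -q -m "config snapshot $(date +%F_%T)"
}

# snapshot_configs /root/vicidial-config \
#     /etc/astguiclient.conf /etc/my.cnf /etc/keepalived/keepalived.conf
```

Run it from cron or a shell alias after every change; a `git push` to an offsite remote completes the loop.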
The DR Runbook: Scenarios That Actually Happen
A runbook isn't documentation you read — it's a procedure you execute at 2 AM when your brain is at 40% capacity.
Scenario 1: Database Server Total Failure (Automated Failover Active)
Expected RTO: < 10 seconds (automated) + 5 minutes (verification)
- Keepalived detects failure, VIP migrates to standby automatically
- Promotion script executes — standby stops replication, accepts writes
- Verify: `ip addr show | grep 192.168.1.100` (VIP is on new primary), `SHOW SLAVE STATUS\G` (should be empty), check `vicidial_live_agents` count
- Monitor agent reconnection — brief "time synchronization" errors, recovery within 30-60 seconds
- Notify operations: "DB failover occurred. Service restored. Investigating root cause."
- Root cause analysis after service is stable
Scenario 2: Database Server Failure (No Automated Failover)
Expected RTO: 5-15 minutes (practiced), 30-60 minutes (first time)
- Confirm master is actually down (not just network partition)
- On the slave: stop replication, remove read-only, note binary log position
- On every cluster node: update `$VARDB_server` in `astguiclient.conf` to the slave's IP
- Restart keepalive processes on every cluster node
- Verify agents are reconnecting
- Plan master rebuild during next maintenance window
Scenario 3: Telephony Server Crash
Expected RTO: automatic (30-60 seconds for remaining cluster to absorb)
- VICIdial auto-detects missing server within 30-60 seconds
- Affected agents see errors, need to re-login
- If primary dialer (flags 5/7): manually promote backup using the promotion script after confirming the primary is truly down
- If spare server available: reassign affected agents through admin interface (2 minutes)
What to Back Up Beyond MySQL
| Path | What It Is | Frequency |
|---|---|---|
| `/etc/astguiclient.conf` | Per-server VICIdial config | On change |
| `/etc/asterisk/` | Asterisk configuration | On change |
| `/etc/my.cnf` | MySQL configuration | On change |
| `/etc/keepalived/keepalived.conf` | Failover config | On change |
| `/usr/share/astguiclient/` | VICIdial Perl scripts + mods | Weekly |
| `/var/www/html/agc/` | Agent interface customizations | Weekly |
| SSL certificates | WebRTC/HTTPS | On renewal |
Automated Daily Backups
Set up a cron job that runs at 2 AM:
```bash
#!/bin/bash
# /usr/local/bin/vicidial_backup.sh
BACKUP_DIR="/backup/mysql"
DATE=$(date +%Y%m%d_%H%M%S)
RETENTION_DAYS=14

mkdir -p $BACKUP_DIR

mysqldump -u root \
    --single-transaction \
    --routines --triggers --events \
    --master-data=2 --flush-logs \
    asterisk | gzip > $BACKUP_DIR/vicidial_${DATE}.sql.gz

# Verify backup isn't zero-size or corrupt
BACKUP_SIZE=$(stat -c%s "$BACKUP_DIR/vicidial_${DATE}.sql.gz")
if [ "$BACKUP_SIZE" -lt 1000000 ]; then
    echo "WARNING: Backup suspiciously small ($BACKUP_SIZE bytes)" | \
        mail -s "VICIdial Backup Warning" ops@yourdomain.com
    exit 1
fi

# Clean up old backups
find $BACKUP_DIR -name "vicidial_*.sql.gz" -mtime +$RETENTION_DAYS -delete

# Sync to offsite
rsync -az $BACKUP_DIR/vicidial_${DATE}.sql.gz backup_user@remote_ip:/offsite_backup/
```
The --master-data=2 flag embeds the binary log position as a comment in the dump file. This is essential for point-in-time recovery — you need to know exactly where the backup's snapshot ends and the binary logs should begin.
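To read those coordinates back out of a dump, something like this works against the MariaDB/MySQL 5.x-style header; newer MySQL releases may emit CHANGE REPLICATION SOURCE TO instead, in which case the pattern needs adjusting:

```bash
#!/bin/bash
# Extract MASTER_LOG_FILE and MASTER_LOG_POS from a --master-data=2 dump.
# The dump contains a commented line like:
#   -- CHANGE MASTER TO MASTER_LOG_FILE='mysql-bin.000042', MASTER_LOG_POS=154;
binlog_coords() {  # usage: binlog_coords <dump.sql>  ->  "file position"
    grep -m1 -e '-- CHANGE MASTER TO' "$1" | \
        sed -E "s/.*MASTER_LOG_FILE='([^']+)', MASTER_LOG_POS=([0-9]+);.*/\1 \2/"
}

# Replay from the exact snapshot point instead of a guessed datetime, e.g.:
#   read FILE POS < <(binlog_coords /backup/mysql/latest.sql)
#   mysqlbinlog --start-position="$POS" /var/log/mysql/"$FILE" ... | mysql asterisk
```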
Test Your DR Plan
A DR plan you've never tested is a hypothesis, not a plan. Schedule quarterly failover tests during low-traffic hours. Actually trigger the failover — not just a tabletop exercise. Time the recovery. Document what went wrong. Update procedures based on what you learn. The 2 AM scenario is not the time to discover your promotion script has a typo or that someone changed the VIP address six months ago without updating the keepalived config.
Monitoring Replication Health
Set up a cron job that checks replication status every 5 minutes and alerts on failure:
```bash
#!/bin/bash
# Replication health check -- run from cron every 5 minutes
SLAVE_STATUS=$(mysql -N -e "SHOW SLAVE STATUS\G" 2>/dev/null)
IO_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_IO_Running:" | awk '{print $2}')
SQL_RUNNING=$(echo "$SLAVE_STATUS" | grep "Slave_SQL_Running:" | awk '{print $2}')
LAG=$(echo "$SLAVE_STATUS" | grep "Seconds_Behind_Master:" | awk '{print $2}')

if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
    echo "REPLICATION BROKEN: IO=$IO_RUNNING SQL=$SQL_RUNNING" | \
        mail -s "CRITICAL: VICIdial Replication Failure" ops@yourdomain.com
fi

# Seconds_Behind_Master reads NULL while replication is broken, so guard
# the numeric comparison
if [ -n "$LAG" ] && [ "$LAG" != "NULL" ] && [ "$LAG" -gt 30 ]; then
    echo "REPLICATION LAG: ${LAG} seconds behind master" | \
        mail -s "WARNING: VICIdial Replication Lag" ops@yourdomain.com
fi
```
Broken replication that goes undetected for hours means your "standby" database is no longer current — if you fail over to it, you lose every write since replication broke. This monitoring script is the cheapest insurance in your entire DR stack.
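Wiring both jobs into cron, as a config fragment. The filenames are the install paths assumed in this article, with the monitoring script above saved as check_replication.sh:

```
# /etc/cron.d/vicidial-dr
# Nightly full dump at 2 AM
0 2 * * *    root  /usr/local/bin/vicidial_backup.sh
# Replication health check every 5 minutes
*/5 * * * *  root  /usr/local/bin/check_replication.sh
```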
The difference between a minor hiccup and a catastrophic outage is whether you built and tested the plan before you needed it. ViciStack ships every cluster with keepalived, automated promotion, recording backup, and alerting pre-configured — the database fails over before your phone finishes buzzing.
Originally published at https://vicistack.com/blog/vicidial-disaster-recovery/