VICIdial Clustering: What Breaks at Scale and How to Fix It

#voip #asterisk #sysadmin #devops

Your single-server VICIdial box just hit a wall. Agents get "time synchronization" errors, the hopper empties faster than it fills, and your real-time report loads like it's 2003. You've Googled "VICIdial cluster setup" and found a 15-year-old PDF, some AI-generated marketing content, and fragmented forum threads.

The core problem: Asterisk doesn't scale vertically. A 64-core server chokes at the same agent count as a quad-core. A single all-in-one VICIdial server maxes out around 20-25 agents for predictive outbound. That's not hardware — it's architecture. The solution is more servers, each doing one job well.

The Four Server Roles

Database server. The brain. Every agent login, disposition, hopper query, and real-time report flows through one MySQL instance. VICIdial uses MyISAM exclusively — not InnoDB. The VICIdial creator has been clear about this: "We do not recommend using InnoDB under any circumstances." Use SSDs at minimum, NVMe for 200+ agents, and an LSI Logic MegaRAID controller specifically. Not 3ware, not Adaptec.

Telephony servers. Each one runs Asterisk and handles calls for a slice of your agents. Golden rule: 20 agents per telephony server for predictive outbound. You can push to 25 on a high-clock quad-core. If you're running answering machine detection (AMD), cut that nearly in half, because AMD's loopback trunk architecture doubles the channel count.

The critical spec is single-thread CPU performance, not core count. A 4-core at 4.5 GHz beats an 8-core at 2.5 GHz.
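
As a rough sketch of that sizing rule: the function below turns an agent count into a telephony server count. The function name and parameters are made up for illustration, and "nearly in half" for AMD is modelled as exactly half, which is an assumption rather than a measured figure.

```python
import math

def telephony_servers(agents, agents_per_server=20, amd=False):
    """Estimate telephony servers needed for predictive outbound.

    agents_per_server: 20 is the rule of thumb, 25 on a high-clock quad-core.
    amd: answering machine detection roughly halves per-server capacity
         (assumption: modelled here as exactly half).
    """
    per_server = agents_per_server // 2 if amd else agents_per_server
    return math.ceil(agents / per_server)

print(telephony_servers(100))            # 5
print(telephony_servers(100, 25))        # 4
print(telephony_servers(100, amd=True))  # 10
```

Treat the result as a starting point for load testing, not a guarantee.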

Web servers. Stateless HTTP handlers — Apache serving the agent interface and admin panels. One web server handles about 150 agents, dropping to 75 with SSL/TLS. This is the one role where virtualization is genuinely fine.
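
The same back-of-envelope math works for the web tier. This is a minimal sketch using only the two per-server figures quoted above; the function name is hypothetical and real capacity depends on your Apache and PHP tuning.

```python
import math

def web_servers(agents, tls=True):
    """Estimate web servers: ~150 agents each over plain HTTP, ~75 with SSL/TLS."""
    per_server = 75 if tls else 150
    return math.ceil(agents / per_server)

print(web_servers(200, tls=False))  # 2
print(web_servers(200, tls=True))   # 3
```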

Archive server. Each telephony server records calls locally. Without a central archive server, recordings scatter across every dialer and playback links randomly return "Object not found."

The Keepalive Flags That Break Every New Cluster

VICIdial uses numbered flags in /etc/astguiclient.conf to control which background processes run on each server. Get these wrong and you'll see duplicate calls, erratic dial levels, or a system that silently stops dialing.

The critical rule: flags 5 and 7 must run on exactly ONE telephony server. Flag 5 is the adaptive predictive algorithm — it calculates dial levels for the entire cluster. Flag 7 is the fill/balance dialer. Run either on two servers simultaneously and you get conflicting dial-level calculations, double-dialed leads, and behavior that'll make you question reality.

Set VARactive_keepalives by role. The database server gets X, meaning no keepalive processes, and so do the web servers. Regular telephony servers get 1238. Exactly one telephony server, your "primary" dialer, gets 123456789.
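
One way to catch a bad flag layout before it bites is to write the intended assignment down and check it. The sketch below is not a VICIdial tool: the hostnames and the function are hypothetical, and it only checks the "exactly one server" rule for flags 5 and 7 against values you type in by hand.

```python
def check_keepalives(cluster):
    """cluster maps hostname -> that server's VARactive_keepalives value.

    Returns a list of problems; an empty list means flags 5 and 7
    each run on exactly one server.
    """
    problems = []
    for flag in "57":
        owners = [host for host, flags in cluster.items() if flag in flags]
        if len(owners) != 1:
            problems.append(f"flag {flag} runs on {len(owners)} servers: {owners}")
    return problems

# Hypothetical five-node cluster using the values described above.
cluster = {
    "db1": "X",
    "web1": "X",
    "dialer1": "123456789",  # the one primary dialer
    "dialer2": "1238",
    "dialer3": "1238",
}
print(check_keepalives(cluster))  # []
```

Copying the primary dialer's config onto a second dialer is the classic mistake, and this check reports both flags as duplicated in that case.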

MySQL Tuning That Prevents 3 AM Meltdowns

Add skip-name-resolve to your my.cnf. Without it, MySQL does a reverse DNS lookup on every connection. In a cluster where everything is hammering the database, those lookups create a connection backlog that looks like "too many connections" — except raising max_connections doesn't fix it. This single line has saved more VICIdial clusters than any other config change.

Set max_connections = 4000, key_buffer_size = 640M (4096M for enterprise), table_open_cache = 8192, and query_cache_size = 0 (write invalidation overhead exceeds the benefit for VICIdial's access pattern). Enable concurrent_insert = 2 for MyISAM.
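
To keep those values consistent across rebuilds, it can help to generate the fragment instead of hand-editing it. This is a minimal sketch that renders the settings above as a [mysqld] section. The helper name is made up, the values are the non-enterprise ones from this article, and it assumes a MySQL or MariaDB version that still accepts query_cache_size (MySQL 8.0 removed the query cache).

```python
# Values taken from the text above; None marks a bare flag with no value.
settings = {
    "skip-name-resolve": None,
    "max_connections": 4000,
    "key_buffer_size": "640M",
    "table_open_cache": 8192,
    "query_cache_size": 0,
    "concurrent_insert": 2,
}

def render_mysqld(settings):
    """Render a dict of options as a [mysqld] section for my.cnf."""
    lines = ["[mysqld]"]
    for key, value in settings.items():
        lines.append(key if value is None else f"{key} = {value}")
    return "\n".join(lines)

print(render_mysqld(settings))
```

Merge the output into your existing my.cnf by hand rather than overwriting the file, since your distribution's defaults live there too.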

Above 200 agents, convert vicidial_live_agents to the MEMORY engine. This table tracks every agent's real-time state and gets hammered by every dialer, web server, and real-time report simultaneously. On MyISAM, table-level locks create contention. On MEMORY, it runs from RAM. Forum users consistently describe the difference as "night and day."

Archive your logs or die. vicidial_log, call_log, and vicidial_carrier_log grow indefinitely. Run ADMIN_archive_log_tables.pl --daily at 1 AM. Un-archived tables are the #1 cause of cascading failures: slow queries cause table locks, table locks cause "too many connections" errors, and those cause a cluster-wide outage.

The "No Audio Between Servers" Problem

This is the #1 reported cluster issue. Calls ring, agent picks up, silence. But only on calls crossing servers — same-server calls work fine.

Fix checklist in order of likelihood:

1. Add externip and localnet to sip.conf on every dialer. Without these, Asterisk advertises its internal IP in SDP packets and the remote party sends RTP to an unreachable address.
2. Open UDP 10000-20000 bidirectionally between ALL nodes. SIP on 5060 handles signaling. Actual audio travels on random high UDP ports.
3. Set canreinvite=no and directmedia=no so Asterisk keeps media flowing through itself.
4. Use a private switch between servers with zero firewall. Public NICs face carriers and agents. Private NICs face each other on an unfiltered gigabit switch. This eliminates the entire class of inter-server audio problems.
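
The sip.conf items on that checklist are easy to audit with a script. The sketch below is a simplified check, not an Asterisk parser: it ignores section boundaries and treats the whole file as one flat list of key = value lines, and the function name and sample addresses are hypothetical.

```python
def audit_sip_conf(text):
    """Return a list of problems with the NAT and media settings in sip.conf text."""
    seen = {}
    for raw in text.splitlines():
        line = raw.split(";", 1)[0].strip()  # ';' starts a comment in Asterisk configs
        if "=" not in line:
            continue  # skips blank lines and [section] headers
        key, value = line.split("=", 1)
        seen[key.strip().lower()] = value.strip().lower()

    problems = [f"{key} is not set" for key in ("externip", "localnet") if key not in seen]
    for key in ("canreinvite", "directmedia"):
        if seen.get(key) != "no":
            problems.append(f"{key} should be 'no'")
    return problems

# Hypothetical sample: directmedia was forgotten.
sample = """
[general]
externip = 203.0.113.10      ; example public address
localnet = 10.0.0.0/255.255.255.0
canreinvite = no
"""
print(audit_sip_conf(sample))
```

Run it against the sip.conf from every dialer. A clean result only covers the config side, so you still need to confirm the UDP port range is open between nodes.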

Capacity Planning From Real Deployments

Scale        DB                  Web                Telephony   Archive   Total servers
50 agents    1                   1 (shared w/ DB)   2-3         1         4-5
100 agents   1 dedicated         1-2                4-5         1         7-9
200 agents   1 beefy             2-3                8-10        1         12-15
500 agents   1 maxed + 1 slave   4-5                20-25       2         28-33

Around 125 agents, MyISAM table-lock contention on vicidial_live_agents starts causing cascading issues, which is why the MEMORY table conversion becomes essential as you approach 200. Above 300 agents with aggressive dial ratios, consider splitting into multiple independent clusters rather than scaling one further. One DB outage taking out 300+ agents is scary math.

Virtualization: Settled

The VICIdial creator: "You will actually need more hardware and spend more money when you virtualize. Best case scenario is 50% of normal bare metal capacity."

The nuance: dedicated cloud instances work. Deployments of over 100 agents have been confirmed on AWS EC2 dedicated hosts. What matters is real hardware with no noisy neighbors. Shared/burstable instances fail because DAHDI needs precise 1ms kernel timer ticks.

Telephony servers: bare metal or dedicated cloud. Database: bare metal preferred. Web servers: virtualization fine. Archive: anything with enough storage.

A full 500-agent cluster on Hetzner bare metal costs roughly $860/month for 14 servers — $1.72 per agent per month for compute. Compare that to hosted dialers at $120-225/agent/month. ViciStack ships pre-configured bare metal clusters with every optimization described here already baked in.

Originally published at https://vicistack.com/blog/vicidial-cluster-guide/