Jason Shouldice

Posted on Mar 22 • Edited on Mar 25 • Originally published at vicistack.com

Your VICIdial Server Is Throttling Itself: OS and Database Tuning for High-Agent Deployments

#voip #asterisk #sysadmin #devops

I've debugged this exact scenario a dozen times: four telephony servers, dedicated database box, clean VICIdial install. Everything runs great at 50 agents. At 150, real-time reports start lagging. At 300, calls drop during peak hours. By 400, the telephony servers pin at 100% CPU every afternoon and screen sessions silently die.

The hardware is fine. VICIdial's code handles scale. The problem is that nobody tuned anything underneath it: the kernel is still set for a general-purpose web server, MySQL is running defaults, and Apache keeps spawning processes until the kernel's OOM killer steps in.

The Swap Death Spiral

This is the most common VICIdial outage I see, and it has nothing to do with VICIdial. Here's the sequence:

1. Database server is using 28 GB of 32 GB RAM, running fine
2. Linux decides to swap application memory to make room for filesystem cache
3. MyISAM reads now compete with swap I/O
4. MySQL slows down
5. Perl daemons queue up waiting for database responses
6. Memory usage spikes from queued processes
7. More swapping
8. Within minutes: swap death spiral, frozen agents, dropped calls

The fix is one sysctl setting:

vm.swappiness = 10

This tells the kernel to strongly prefer dropping filesystem cache over swapping application memory. On a dedicated VICIdial box, you can go as low as 1. Put it in /etc/sysctl.d/99-vicidial.conf and run sysctl --system.
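To see whether a box is already sliding into this pattern, compare the live swappiness value with how much swap is actually in use. The sketch below is a minimal check, not part of VICIdial; the function names are my own, and it assumes a Linux host exposing /proc/meminfo and /proc/sys/vm/swappiness.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: first numeric value}."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def swap_used_kb(meminfo: dict[str, int]) -> int:
    """Swap in use = SwapTotal - SwapFree (both reported in kB)."""
    return meminfo.get("SwapTotal", 0) - meminfo.get("SwapFree", 0)


if __name__ == "__main__":
    meminfo_path = Path("/proc/meminfo")
    swappiness_path = Path("/proc/sys/vm/swappiness")
    # Only touch /proc when it is actually there (i.e. on Linux)
    if meminfo_path.exists() and swappiness_path.exists():
        used = swap_used_kb(parse_meminfo(meminfo_path.read_text()))
        print(f"vm.swappiness = {swappiness_path.read_text().strip()}")
        print(f"swap in use   = {used // 1024} MiB")
```

Run it from cron during peak hours: swap usage that keeps climbing on the database box while swappiness is still at the distribution default is the early warning.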

OS Kernel Tuning

VICIdial is not a general-purpose workload. It opens thousands of file descriptors, maintains thousands of concurrent network sockets, and moves massive volumes of small UDP packets. Default kernel parameters throttle all of this.

Create /etc/sysctl.d/99-vicidial.conf:

# System-wide file handle ceiling (the per-process default of 1024 is raised separately via ulimits)
fs.file-max = 524288
fs.inotify.max_user_watches = 131072

# Connection tracking for SIP/RTP
net.netfilter.nf_conntrack_max = 262144

# Socket backlog — prevents SIP connection drops during spikes
net.core.somaxconn = 8192
net.core.netdev_max_backlog = 16384

# UDP buffer sizes for SIP/RTP traffic
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 1048576

# TCP tuning for agent web sessions and database connections
net.ipv4.tcp_rmem = 4096 1048576 16777216
net.ipv4.tcp_wmem = 4096 1048576 16777216
net.ipv4.tcp_max_tw_buckets = 65536
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 300
net.ipv4.tcp_max_syn_backlog = 8192

# Memory management
vm.swappiness = 10
vm.dirty_ratio = 20
vm.dirty_background_ratio = 5

# Shared memory for Asterisk and MySQL
# shmmax is in bytes; shmall is in pages (1048576 x 4 KiB pages = 4 GiB)
kernel.shmmax = 4294967296
kernel.shmall = 1048576

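The connection tracking table (nf_conntrack_max) is the one that bites you first. The default (often 65536, though the kernel scales it with RAM) gets exhausted with high call volume, and once the table is full the kernel drops packets for new connections. Asterisk logs nothing about it; calls just randomly fail to connect. The one reliable clue is "nf_conntrack: table full, dropping packet" in dmesg. Note that the net.netfilter keys only exist once the nf_conntrack module is loaded.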
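Because a full table gives so little warning, it is worth watching how full it is. Here is a rough sketch that reads the kernel's current entry count and the configured maximum; the 80% alert threshold and the function names are assumptions of mine, not VICIdial defaults.

```python
from pathlib import Path

PROC_DIR = Path("/proc/sys/net/netfilter")


def conntrack_usage(count: int, maximum: int) -> float:
    """Fraction of the connection tracking table currently in use."""
    if maximum <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / maximum


def read_int(path: Path) -> int:
    """Read a single integer from a /proc file."""
    return int(path.read_text().strip())


if __name__ == "__main__":
    count_file = PROC_DIR / "nf_conntrack_count"
    max_file = PROC_DIR / "nf_conntrack_max"
    if count_file.exists() and max_file.exists():
        usage = conntrack_usage(read_int(count_file), read_int(max_file))
        # 0.8 is an arbitrary alert threshold chosen for this sketch
        status = "WARNING" if usage >= 0.8 else "ok"
        print(f"conntrack table {usage:.1%} full ({status})")
    else:
        print("nf_conntrack not loaded; nothing to check")
```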

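The tcp_tw_reuse setting addresses connection exhaustion toward MySQL. VICIdial's Perl daemons create and destroy TCP connections to MySQL constantly, and every closed connection leaves a socket sitting in TIME_WAIT. tcp_tw_reuse lets the kernel reuse those sockets for new outgoing connections, so it takes effect on the hosts that initiate the connections (telephony and web servers), not on the database box. Without it, connection failures spike at scale.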
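Settings in /etc/sysctl.d only help if they actually took effect. The following sketch compares a sysctl.d file against the live values under /proc/sys; the helper names are hypothetical, and it assumes plain key = value lines like the ones in the file above.

```python
from pathlib import Path
from typing import Optional


def parse_sysctl_conf(text: str) -> dict[str, str]:
    """Return {key: value} from sysctl.d-style text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            # Collapse runs of whitespace so multi-value keys compare cleanly
            settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key: str) -> Path:
    """Map a sysctl key such as vm.swappiness to its /proc/sys file."""
    return Path("/proc/sys") / key.replace(".", "/")


def drift(settings: dict[str, str]) -> dict[str, tuple[str, Optional[str]]]:
    """Return {key: (wanted, live)} for keys whose live value differs.

    A live value of None means the key does not exist on this host,
    for example because the owning kernel module is not loaded.
    """
    out = {}
    for key, wanted in settings.items():
        path = proc_path(key)
        live = " ".join(path.read_text().split()) if path.exists() else None
        if live != wanted:
            out[key] = (wanted, live)
    return out


if __name__ == "__main__":
    conf = Path("/etc/sysctl.d/99-vicidial.conf")
    if conf.exists():
        for key, (wanted, live) in drift(parse_sysctl_conf(conf.read_text())).items():
            print(f"{key}: wanted {wanted!r}, live {live!r}")
```

Any line it prints is a setting that was rejected, overridden by a later file, or belongs to a module that is not loaded yet.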

ulimits and systemd Overrides

The system-wide fs.file-max sets the ceiling, but each process has its own limits too:

# /etc/security/limits.conf
asterisk  soft  nofile  65536
asterisk  hard  nofile  131072
mysql     soft  nofile  65536
mysql     hard  nofile  131072

limits.conf is applied by pam_limits at login, so services started by systemd normally never see it. Create overrides:

# For Asterisk
mkdir -p /etc/systemd/system/asterisk.service.d/
cat > /etc/systemd/system/asterisk.service.d/limits.conf << EOF
[Service]
LimitNOFILE=131072
LimitNPROC=65536
LimitMEMLOCK=infinity
EOF

# For MariaDB/MySQL
mkdir -p /etc/systemd/system/mariadb.service.d/
cat > /etc/systemd/system/mariadb.service.d/limits.conf << EOF
[Service]
LimitNOFILE=131072
LimitNPROC=65536
EOF

systemctl daemon-reload

Verify after restart:

grep "Max open files" /proc/$(pidof -s asterisk)/limits
# Should show 131072, not 1024
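If you want the same check inside a monitoring script, the limits file is easy to parse. This is a small sketch; the function name is my own, and the values are returned as strings because the kernel may report "unlimited".

```python
import re
from pathlib import Path

LIMIT_RE = re.compile(r"^Max open files\s+(\S+)\s+(\S+)", re.MULTILINE)


def open_file_limits(limits_text: str) -> tuple[str, str]:
    """Return (soft, hard) 'Max open files' values from /proc/<pid>/limits text."""
    match = LIMIT_RE.search(limits_text)
    if match is None:
        raise ValueError("no 'Max open files' row found")
    return match.group(1), match.group(2)


if __name__ == "__main__":
    # /proc/self/limits shows this script's own limits; swap in the
    # Asterisk or MariaDB PID to check the service instead.
    own = Path("/proc/self/limits")
    if own.exists():
        soft, hard = open_file_limits(own.read_text())
        print(f"soft={soft} hard={hard}")
```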

CPU Governor

Modern servers ship with power-saving CPU governors. On a telephony server, the CPU scales down during quiet periods and takes milliseconds to ramp back up. Those milliseconds cause jitter on active calls.

for cpu in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$cpu"
done

# Make persistent via tuned (RHEL/CentOS)
tuned-adm profile throughput-performance

On bare-metal, disable C-states and P-states in the BIOS too. SpeedStep is great for power bills, terrible for real-time telephony.
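Governors have a habit of reverting after kernel updates or profile changes, so it helps to audit them. A minimal sketch follows, assuming the standard cpufreq sysfs layout; the function names are hypothetical, and the cpu_root parameter exists only so the check can be pointed at a test directory.

```python
from pathlib import Path


def read_governors(cpu_root: str = "/sys/devices/system/cpu") -> dict[str, str]:
    """Return {cpu name: governor} for every CPU exposing cpufreq."""
    governors = {}
    pattern = "cpu[0-9]*/cpufreq/scaling_governor"
    for path in sorted(Path(cpu_root).glob(pattern)):
        # path is .../cpuN/cpufreq/scaling_governor, so two levels up is cpuN
        governors[path.parent.parent.name] = path.read_text().strip()
    return governors


def not_performance(governors: dict[str, str]) -> list[str]:
    """List CPUs whose governor is anything other than 'performance'."""
    return sorted(cpu for cpu, gov in governors.items() if gov != "performance")


if __name__ == "__main__":
    offenders = not_performance(read_governors())
    if offenders:
        print("CPUs not on the performance governor:", ", ".join(offenders))
```

An empty result on a virtual machine usually just means the hypervisor hides cpufreq from the guest.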

Asterisk Optimization

Module Cleanup

VICIdial only uses SIP channels and Local channels. Every other loaded module is memory waste:

; /etc/asterisk/modules.conf
[modules]
autoload=yes
noload => chan_iax2.so    ; only if no trunk or dialplan entry uses IAX2 (check inter-server trunks first)
noload => chan_skinny.so
noload => chan_unistim.so
noload => chan_mgcp.so
noload => chan_oss.so
noload => chan_alsa.so
noload => app_voicemail.so
noload => app_queue.so
noload => app_followme.so
noload => codec_speex.so
noload => codec_ilbc.so
noload => codec_g726.so
noload => codec_lpc10.so

On a 500-agent telephony server, unloading unused modules frees enough memory for 50-100 more concurrent channels.

In asterisk.conf:

[options]
verbose = 1
debug = 0
cache_record_files = yes
record_cache_dir = /tmp
maxcalls = 2048
maxfiles = 131072

SIP Channel Tuning

For high-volume outbound dialing, tune sip.conf:

[general]
timer1 = 500
timerb = 16000        ; give up on INVITEs that get no response after 16s (default is 64 x timer1 = 32s)
qualifyfreq = 60      ; reduce qualify frequency
compactheaders = yes   ; reduces packet size at 1000+ concurrent calls
srvlookup = no        ; disable DNS SRV for trunks with known IPs
rtptimeout = 60
rtpholdtimeout = 300

RTP Port Range

Each active channel uses one RTP port plus the adjacent port for RTCP. At 1,500 concurrent channels, you need at least 3,000 ports:

; /etc/asterisk/rtp.conf
[general]
rtpstart = 10000
rtpend = 30000
strictrtp = yes

Whatever range you set here must be open in your firewall for UDP. Mismatched port ranges between rtp.conf and iptables cause one-way audio that only appears under load.
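To sanity-check a range before committing it to both rtp.conf and the firewall, the arithmetic can be sketched like this. It assumes each channel reserves a pair of ports (RTP plus RTCP), which is how Asterisk normally allocates them; the function name is mine.

```python
def rtp_channel_capacity(rtpstart: int, rtpend: int, ports_per_channel: int = 2) -> int:
    """Number of concurrent channels a UDP port range can carry."""
    if rtpend < rtpstart:
        raise ValueError("rtpend must not be below rtpstart")
    total_ports = rtpend - rtpstart + 1  # the range is inclusive
    return total_ports // ports_per_channel


if __name__ == "__main__":
    capacity = rtp_channel_capacity(10000, 30000)
    print(f"10000-30000 carries up to {capacity} channels")
```

The 10000-30000 range above works out to 10,000 channels, far more than a single telephony server will carry, so the firewall rule is the part that usually goes wrong.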

Codec Choice

G.711 is the only correct choice for VICIdial outbound dialing. G.729 cuts per-call bandwidth by roughly two-thirds once packet headers are counted, but requires software transcoding per channel.

Codec	Channels per CPU Core
G.711 (passthrough)	150-200
G.729 (transcoding)	15-25

A 16-core telephony server handles ~1,500 G.711 channels or ~200 G.729 channels. Unless bandwidth is genuinely constrained (it isn't on any modern data center connection), use G.711.
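The bandwidth side of that trade-off is easy to put numbers on. The sketch below assumes 20 ms packetization and counts IPv4, UDP, and RTP headers only (no Ethernet framing); the payload sizes are the standard ones for each codec, and the function name is my own.

```python
HEADER_BYTES = 40  # IPv4 (20) + UDP (8) + RTP (12)

G711_PAYLOAD = 160  # bytes per 20 ms at 64 kbps
G729_PAYLOAD = 20   # bytes per 20 ms at 8 kbps


def call_kbps(payload_bytes: int, packet_ms: int = 20) -> float:
    """One-way IP bandwidth of a single call leg, in kbps."""
    packets_per_sec = 1000 / packet_ms
    return (payload_bytes + HEADER_BYTES) * 8 * packets_per_sec / 1000


if __name__ == "__main__":
    for name, payload in (("G.711", G711_PAYLOAD), ("G.729", G729_PAYLOAD)):
        per_call = call_kbps(payload)
        total_mbps = 1500 * per_call / 1000
        print(f"{name}: {per_call:.0f} kbps per call, {total_mbps:.0f} Mbps at 1,500 calls")
```

Under these assumptions 1,500 G.711 calls need about 120 Mbps in each direction, which a gigabit uplink absorbs easily. That is why the CPU cost of G.729 is the number that matters.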

MySQL/MariaDB Tuning

Turn Off the Query Cache

This is counterintuitive but critical. VICIdial writes to its core tables multiple times per second. Every write invalidates every cached query for that table. The cache becomes high-overhead bookkeeping that caches results only to immediately throw them away. The cache lock becomes a contention point.

[mysqld]
# MariaDB and MySQL 5.7 or earlier; MySQL 8.0 removed the query cache and rejects these variables
query_cache_size = 0
query_cache_type = 0

Every experienced VICIdial admin arrives at the same conclusion. Turn it off.

Buffer Pool Sizing

The relationship between agent count and database load is non-linear. At 500 agents, the database does 25-30x the work of 50 agents because concurrent queries create lock contention that multiplies.

Agents	key_buffer_size	max_heap_table_size	sort_buffer_size	max_connections
50	512M	64M	4M	2000
100	1024M	128M	8M	3000
200	2048M	128M	8M	4000
300	3072M	256M	16M	5000
500+	4096M	256M	16M	6000

The max_connections number surprises people. A single telephony server holds 40-60 active database connections during peak. A web server handling 100 agent sessions holds 100-200 connections. At 500 agents across 4 telephony servers and 2 web servers, you're at 600-1,000 active connections routinely, with spikes to 2,000+ during shift changes.
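Those per-host figures can be turned into a rough estimator for your own cluster. This sketch uses only the ranges quoted above (40-60 connections per telephony server, 1-2 per agent web session); the function name is hypothetical, and shift-change spikes are not modeled.

```python
def connection_range(
    telephony_servers: int,
    agents: int,
    per_telephony: tuple[int, int] = (40, 60),
    per_agent: tuple[int, int] = (1, 2),
) -> tuple[int, int]:
    """Return (low, high) steady-state database connection estimates."""
    low = telephony_servers * per_telephony[0] + agents * per_agent[0]
    high = telephony_servers * per_telephony[1] + agents * per_agent[1]
    return low, high


if __name__ == "__main__":
    low, high = connection_range(telephony_servers=4, agents=500)
    print(f"expect roughly {low}-{high} active connections before spikes")
```

For 4 telephony servers and 500 agents this gives 660-1,240, the same ballpark as the routine load described above. Leave several times that as headroom in max_connections.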

MEMORY Tables

VICIdial's hottest tables — vicidial_live_agents, vicidial_auto_calls, vicidial_hopper — should run on the MEMORY storage engine. MEMORY tables live entirely in RAM and reduce lock duration from milliseconds to microseconds.

Set max_heap_table_size high enough that you never hit the ceiling. When a MEMORY table hits the limit, INSERTs fail. In VICIdial, that means agents can't log in or calls won't dial. The error log shows "table is full" messages. 256M is safe for 500-agent deployments.

Concurrent Insert Optimization

MyISAM supports concurrent inserts — new rows inserted at the end of a table while SELECTs read from it, without table-level locking:

[mysqld]
concurrent_insert = 2

Value 2 means always insert at end of table regardless of gaps from deleted rows. VICIdial tables frequently have gaps from archival and hopper drains, so concurrent_insert = 2 is necessary.

Web Server: Apache vs Nginx

The Apache Problem

Default Apache MaxClients is 256. Each Apache+mod_php process with VICIdial uses 30-60 MB of RAM. At 500 concurrent agent connections, you need 500 children. That's 15-30 GB of RAM just for the web server.

Tuned Apache for 500 agents:

# Apache 2.4 calls these MaxRequestWorkers and MaxConnectionsPerChild; the old names still work
<IfModule prefork.c>
    StartServers         50
    MinSpareServers      25
    MaxSpareServers      75
    ServerLimit         600
    MaxClients          600
    MaxRequestsPerChild  10000
</IfModule>

KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5

Memory budget: 600 children x 50 MB = 30 GB. In a cluster, splitting across two web servers cuts this in half.
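The same budget works in reverse: given the RAM you have, how many children can you afford? A quick sketch, using decimal gigabytes to match the figures above; the function names and the reserved_gb parameter (memory held back for the OS and anything else on the box) are assumptions of mine.

```python
def prefork_memory_gb(children: int, mb_per_child: float) -> float:
    """Worst-case RAM for a prefork pool, in decimal GB."""
    return children * mb_per_child / 1000


def max_children(ram_gb: float, mb_per_child: float, reserved_gb: float = 0) -> int:
    """Largest MaxClients value that fits in RAM after the reserve."""
    usable_mb = (ram_gb - reserved_gb) * 1000
    return int(usable_mb // mb_per_child)


if __name__ == "__main__":
    print(f"600 children at 50 MB: {prefork_memory_gb(600, 50):.0f} GB")
    print(f"32 GB box, 4 GB reserved: MaxClients <= {max_children(32, 50, reserved_gb=4)}")
```

Measure the real per-child size on your own server first (resident memory of a busy httpd process); the 30-60 MB range is wide enough to change the answer by a factor of two.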

Nginx + PHP-FPM: The Better Architecture

Nginx handles HTTP connections at 2-10 KB each (vs. 30-60 MB per Apache child). PHP-FPM manages a fixed pool of workers.

# /etc/nginx/conf.d/vicidial.conf
server {
    listen 80;
    server_name dialer.example.com;
    root /var/www/html;

    location ~* \.(js|css|png|jpg|gif|ico)$ {
        expires 7d;
        add_header Cache-Control "public, immutable";
        access_log off;
    }

    location ~ \.php$ {
        fastcgi_pass unix:/run/php-fpm/www.sock;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
        fastcgi_keep_conn on;
        fastcgi_read_timeout 120;
    }

    gzip on;
    # text/html is always compressed; listing it again only triggers a duplicate MIME type warning
    gzip_types text/css application/javascript text/xml;
}

PHP-FPM pool configuration:

; /etc/php-fpm.d/www.conf
pm = static
pm.max_children = 128
pm.max_requests = 10000

php_admin_value[memory_limit] = 256M
php_admin_value[opcache.enable] = 1
php_admin_value[opcache.memory_consumption] = 128
php_admin_value[opcache.validate_timestamps] = 0

At 500 agents polling every second, 128 FPM workers provide ~8,500 req/sec throughput (each poll takes ~15ms). You need 500 req/sec minimum, so there's massive headroom for report queries and burst traffic.
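Here is the arithmetic behind those numbers as a sketch you can rerun with your own measurements. It treats every request as taking the same service time, which is a simplification; the function names are mine.

```python
def fpm_capacity(workers: int, service_time_s: float) -> float:
    """Maximum requests per second a static FPM pool can sustain."""
    return workers / service_time_s


def busy_workers(req_per_sec: float, service_time_s: float) -> float:
    """Average number of workers occupied at a given request rate."""
    return req_per_sec * service_time_s


if __name__ == "__main__":
    capacity = fpm_capacity(128, 0.015)
    busy = busy_workers(500, 0.015)
    print(f"capacity: {capacity:.0f} req/sec")
    print(f"500 req/sec keeps about {busy:.1f} of 128 workers busy")
```

With 15 ms polls, 500 agents occupy only about 7.5 workers on average. The remaining workers are what absorb slow report queries, which can hold a worker for seconds rather than milliseconds.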

For new deployments at 200+ agents, Nginx + PHP-FPM uses 60-70% less memory. The caveat: VICIdial's .htaccess files need to be converted to Nginx location/rewrite directives. Test thoroughly in staging.

Process Management

VICIdial runs core processes in GNU screen sessions. When a process dies, keepalive scripts are supposed to restart it. In practice, screen sessions die silently at scale and nobody notices until calls stop routing.

Monitor screen session health as part of your operations routine. Check that all expected processes are running. If you're above 200 agents, consider wrapping critical processes in systemd services with automatic restart instead of relying on screen + cron keepalives.
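A basic health check can be as simple as comparing the output of screen -ls against the sessions you expect. The sketch below is one way to do it; the session names in EXPECTED are illustrative assumptions, so replace them with whatever your own servers show when healthy.

```python
import re
import shutil
import subprocess

# Matches lines like "\t2211.ASTupdate\t(Detached)" in `screen -ls` output
SESSION_RE = re.compile(r"^[ \t]*\d+\.(\S+)", re.MULTILINE)

# Hypothetical list for this sketch; use the names from a healthy server.
EXPECTED = {"ASTupdate", "ASTsend", "ASTlisten", "ASTVDauto"}


def session_names(screen_ls_output: str) -> set[str]:
    """Extract session names from `screen -ls` output."""
    return set(SESSION_RE.findall(screen_ls_output))


def missing_sessions(screen_ls_output: str, expected: set[str]) -> list[str]:
    """Return expected sessions that are not currently running."""
    return sorted(set(expected) - session_names(screen_ls_output))


if __name__ == "__main__":
    if shutil.which("screen"):
        # screen -ls can exit non-zero when no sessions exist, so no check=True
        result = subprocess.run(["screen", "-ls"], capture_output=True, text=True)
        gone = missing_sessions(result.stdout, EXPECTED)
        if gone:
            print("missing screen sessions:", ", ".join(gone))
```

Run it from cron as the same user that owns the screen sessions and send any output to your alerting channel, since screen -ls only lists sessions for the calling user.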

The Bottom Line

These changes — kernel tuning, file descriptor limits, connection tracking, Asterisk module cleanup, G.711 codecs, MySQL query cache off, buffer pool sizing, MEMORY tables, web server optimization, CPU governor — are what separate a VICIdial install that runs from one that holds under load.

None of them are VICIdial bugs. They're all infrastructure configuration that the install guide skips because it's designed for general-purpose environments, not 500-agent telephony platforms.

For the complete guide including capacity planning formulas, MEMORY table conversion procedures, and monitoring setup, see the full performance tuning guide at ViciStack.

Originally published at https://vicistack.com/blog/vicidial-performance-tuning/
