
Jason Shouldice

Originally published at vicistack.com

Stop SSH-ing Into Your Asterisk Box: Build a Real Observability Stack

For most of my career running Asterisk in production, "monitoring" meant SSH into the box, run asterisk -rx "core show channels", squint at the output, and hope the number of active channels looked right. Maybe check /var/log/asterisk/full when something broke. Maybe not. That approach stopped being acceptable around the time we crossed 50,000 daily calls across a 4-server cluster. When a SIP trunk goes down at 2 PM on a Tuesday and 300 agents go idle, you need to know in seconds, not whenever someone notices the real-time report looks weird.

This is how to build actual observability for Asterisk: metrics collection with OpenTelemetry, storage in Prometheus, visualization in Grafana, and distributed tracing for individual call flows.

Why OpenTelemetry Over Custom Scripts

You could skip OTel entirely. Install prometheus-node-exporter, write a bash script that scrapes asterisk -rx output into Prometheus-formatted metrics, and call it done. I've done exactly that. It works. It's also fragile, custom, and doesn't scale.

OpenTelemetry gives you three things that roll-your-own monitoring doesn't. First, vendor-neutral collection — the OTel Collector speaks StatsD, Prometheus, OTLP, syslog, and dozens of other formats. Asterisk's built-in res_statsd module pushes metrics via StatsD. AMI events can be forwarded as structured logs. You configure receivers, not parsers.

Second, processing pipelines. OTel lets you filter, transform, aggregate, and route telemetry before it hits your backend. Drop debug-level events but keep warnings. Add a cluster_name attribute to every metric. Sample 10% of traces for non-error calls. All configurable in the collector YAML.

Third, multi-backend export. Send metrics to Prometheus, traces to Jaeger or Tempo, logs to Loki — from one collector instance. If you ever swap backends, you change one exporter config. Nothing on the Asterisk side changes.

For a single Asterisk box doing 5,000 calls a day, a custom Prometheus scraper is probably fine. OTel shines when you have multiple servers, multiple signal types, or when you're tired of maintaining shell scripts that break when the CLI output format changes between Asterisk versions.

The Architecture

Asterisk Server:
  res_statsd ------> OTel Collector ------> Prometheus
  AMI Events ------> ami-otel-bridge -----> OTel Collector
  CDR/CEL ----------> MySQL ------------> mysqld_exporter --> Prometheus

All three feed into Grafana for dashboards and alerting.
Traces go to Jaeger or Tempo for call flow analysis.

The Two Metric Sources

Asterisk's res_statsd module. Built into Asterisk since version 13. Enable it in /etc/asterisk/statsd.conf by pointing it at the OTel Collector's StatsD receiver on 127.0.0.1:8125. It emits gauges for active channel count (total and by type), registered endpoints (online and offline), active bridges, and bridge channel counts. Updated every 10 seconds.
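A minimal statsd.conf along those lines might look like the following (the prefix value is our choice, not a default; adjust to your naming scheme):

```ini
; /etc/asterisk/statsd.conf
[general]
enabled = yes               ; turn the module on
server = 127.0.0.1:8125     ; the OTel Collector's StatsD receiver
prefix = asterisk           ; prepended to every metric name (illustrative)
```

Reload with asterisk -rx "module reload res_statsd.so" and metrics start flowing on the next update cycle.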

AMI event bridge. A Python daemon that connects to the Asterisk Manager Interface on port 5038, listens for events — Newchannel, Hangup, PeerStatus, Join, Leave, DialBegin, DialEnd — and pushes them to the OTel Collector as both metrics and trace spans. This gives you the call-level telemetry that res_statsd doesn't provide: calls per minute, active call gauge, call duration histograms, SIP registration events, queue caller counts, and disposition breakdowns.
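At its core, the bridge is a loop that reads newline-delimited "Key: Value" blocks off the AMI socket and turns them into metric updates. A dependency-free sketch of that parsing and counting core (function and class names here are illustrative, not the real bridge's API):

```python
from collections import Counter

def parse_ami_event(raw: str) -> dict:
    """Parse one AMI event block ("Key: Value" lines, blank-line terminated)."""
    event = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition(": ")
        if key:
            event[key.strip()] = value.strip()
    return event

class CallMetrics:
    """Tallies call-level counters the way the bridge would before exporting them."""
    def __init__(self):
        self.active_calls = 0
        self.dispositions = Counter()

    def handle(self, event: dict):
        kind = event.get("Event")
        if kind == "Newchannel":
            self.active_calls += 1
        elif kind == "Hangup":
            self.active_calls = max(0, self.active_calls - 1)
            self.dispositions[event.get("Cause-txt", "Unknown")] += 1
```

The real daemon wraps this in a socket reader against port 5038 and pushes the counters to the collector over OTLP on every flush interval.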

The OTel Collector runs as a sidecar systemd service on each Asterisk server. It receives from both sources via the StatsD and OTLP receivers, adds resource attributes (hostname, cluster name, server role), batches the data, and exports metrics to Prometheus on port 8889.
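A pared-down collector config for that pipeline might look like this, assuming the contrib distribution (which includes the StatsD receiver); the attribute values are placeholders for your environment:

```yaml
receivers:
  statsd:
    endpoint: 127.0.0.1:8125      # res_statsd pushes here
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317  # the AMI bridge pushes here

processors:
  resource:
    attributes:
      - key: cluster_name
        value: vicidial-prod       # placeholder
        action: insert
  batch: {}

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889        # Prometheus scrapes this

service:
  pipelines:
    metrics:
      receivers: [statsd, otlp]
      processors: [resource, batch]
      exporters: [prometheus]
```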

Prometheus Recording Rules

Raw metrics get you started. Derived metrics are where the real value lives. Set up recording rules for:

  • Calls per minute (cluster-wide): sum(rate(asterisk_calls_total[5m])) * 60
  • Average and P95 call duration from the histogram buckets
  • Channel utilization per server: active channels divided by (registered endpoints × 2)
  • SIP registration churn rate: abs(rate(asterisk_sip_registrations[5m])) — a spike means phones are flapping
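The list above translates into a rules file roughly like this (the metric names follow the article's examples but are assumptions; check them against what your collector actually exports):

```yaml
groups:
  - name: asterisk-derived
    interval: 30s
    rules:
      - record: cluster:calls_per_minute
        expr: sum(rate(asterisk_calls_total[5m])) * 60
      - record: job:call_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(asterisk_call_duration_seconds_bucket[5m])) by (le))
      - record: instance:channel_utilization:ratio
        expr: asterisk_channels_active / (asterisk_endpoints_registered * 2)
      - record: instance:sip_registration_churn:rate
        expr: abs(rate(asterisk_sip_registrations[5m]))
```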

The Dashboards That Matter

Cluster Overview (put this on a wall TV): active channels as a stat panel with sparkline, calls per minute as a time series split by server, channel utilization per server as a gauge (green under 70%, yellow 70-85%, red above 85%), SIP registrations as a stat panel, and queue callers waiting as a time series colored by queue.

SIP Health: registration status by peer in a table (online = green cell, offline = red cell), registration event rate as a time series (spikes = phones flapping, usually a network problem), and active channels by type in a pie chart. In a healthy VICIdial system, you see mostly PJSIP channels for agent phones and trunk calls, some Local channels for internal routing, and IAX2 for cluster inter-server traffic.

Per-Server Deep Dive: CPU, memory, active channels, load average, network I/O, disk I/O — all from the same server, correlated on the same time axis. The magic correlation: CPU spikes with channel increase = normal load. CPU spikes without channel increase = runaway AGI script, a MySQL query from hell, or a cron job that shouldn't be running during peak hours. This dashboard saves more troubleshooting time than anything else you'll build.

Distributed Tracing for Call Flows

This is what most Asterisk monitoring setups miss entirely. Metrics tell you that something happened. Traces tell you why.

The AMI bridge creates a parent span for each call (keyed on Asterisk's Uniqueid), then child spans for key events: dial attempts, queue waits, agent delivery. In Jaeger or Tempo, a single inbound call trace shows every phase with timing:

[asterisk.call] --- 145.2s total
 |-- [asterisk.dial] --- 0.8s (to queue)
 |-- [asterisk.queue.wait] --- 12.4s (INBOUND_SALES queue)
 |-- [asterisk.dial] --- 1.2s (to agent SIP/agent42)
 +-- [asterisk.call] ends --- hangup cause: Normal Clearing
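The correlation logic behind this is simple: the first event for a Uniqueid opens the parent span, subsequent events open and close children, and the Hangup closes everything. A dependency-free sketch of that bookkeeping (the real bridge emits OpenTelemetry spans instead of these plain records; all names here are illustrative):

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Span:
    name: str
    start: float
    end: Optional[float] = None
    children: list = field(default_factory=list)

class CallTracer:
    """Groups AMI events into one span tree per Asterisk Uniqueid."""
    def __init__(self):
        self.calls: dict = {}

    def on_event(self, event: dict, now: Optional[float] = None):
        now = now if now is not None else time.time()
        uid = event.get("Uniqueid", "")
        kind = event.get("Event")
        if kind == "Newchannel":
            self.calls[uid] = Span("asterisk.call", now)
        elif kind == "DialBegin" and uid in self.calls:
            self.calls[uid].children.append(Span("asterisk.dial", now))
        elif kind == "DialEnd" and uid in self.calls:
            for child in self.calls[uid].children:
                if child.name == "asterisk.dial" and child.end is None:
                    child.end = now
        elif kind == "Hangup" and uid in self.calls:
            self.calls[uid].end = now
            return self.calls.pop(uid)  # finished trace, ready to export
```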

When a caller reports "I waited 3 minutes and then got disconnected," you pull up that exact call's trace and see every hop. Was it 12 seconds in queue or 120? Did the agent's phone ring? Did the dial timeout? The trace answers all of it without grepping through 20 GB of log files.

The Five Alerts That Cover 90% of Overnight Incidents

  1. TrunkDown — zero active channels on a server for 2+ minutes (critical)
  2. ChannelExhaustion — utilization above 85% for 5+ minutes (warning)
  3. RegistrationStorm — SIP registration churn above 5/sec for 3+ minutes (warning — phones flapping, usually network instability)
  4. QueueBackup — more than 15 callers waiting in any queue for 2+ minutes (warning)
  5. NoCalls — zero calls per minute during business hours for 10+ minutes (critical)

Wire these into Alertmanager, route to Slack or PagerDuty. The TrunkDown alert alone will save you from the next "nobody noticed the carrier went down at 2 AM" incident.
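As Prometheus alerting rules, the five above look roughly like this (metric names mirror the earlier examples and are assumptions; tune thresholds to your traffic, and note the business-hours gating for NoCalls is left out for brevity):

```yaml
groups:
  - name: asterisk-alerts
    rules:
      - alert: TrunkDown
        expr: asterisk_channels_active == 0
        for: 2m
        labels: {severity: critical}
      - alert: ChannelExhaustion
        expr: asterisk_channels_active / (asterisk_endpoints_registered * 2) > 0.85
        for: 5m
        labels: {severity: warning}
      - alert: RegistrationStorm
        expr: abs(rate(asterisk_sip_registrations[5m])) > 5
        for: 3m
        labels: {severity: warning}
      - alert: QueueBackup
        expr: asterisk_queue_callers_waiting > 15
        for: 2m
        labels: {severity: warning}
      - alert: NoCalls
        expr: sum(rate(asterisk_calls_total[5m])) * 60 == 0
        for: 10m
        labels: {severity: critical}
```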

The 3 AM Scenario: Why This Pays for Itself

It's 3:17 AM. PagerDuty wakes you up: AsteriskTrunkDown on dialer02.

Before the observability stack: SSH into dialer02, run sip show peers, stare at output, grep through logs trying to figure out when it broke, call the carrier, wait on hold for 20 minutes.

With the stack: open Grafana on your phone. Cluster Overview shows dialer02 at zero channels, dialer01 and dialer03 healthy. SIP Health dashboard shows the trunk to Carrier A went offline at 3:04 AM, with a registration flapping pattern starting at 3:01 AM — three minutes of register/unregister cycles before it gave up. Host metrics show dialer02's network I/O dropped to zero on eth1 (the trunk interface) at 3:04 AM. CPU and memory are fine. It's a network link failure, not a server or carrier issue.

You call the NOC, not the carrier. They find the switch port went down. Fix applied. Trunk comes back. Total resolution: 12 minutes instead of 45.

That single incident saves more time than the entire observability stack took to build.

CDR-Based Metrics

The AMI bridge captures real-time events, but CDRs (call detail records) give you the complete picture after calls end. For a Prometheus-native approach, write a small Python exporter that queries CDR data every 60 seconds and exposes it as Prometheus metrics: total calls by disposition, average duration histogram, and answer-seizure ratio (ASR). Scrape it on a dedicated port. This gives you historical trend data that the real-time AMI metrics don't capture — particularly useful for weekly and monthly reporting dashboards.
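The exporter's core is just aggregation over the last minute of CDR rows. A sketch of that computation, independent of the database and HTTP layers (the column names follow Asterisk's stock cdr table; the function name is ours):

```python
from collections import Counter

def summarize_cdr(rows):
    """rows: iterable of (disposition, billsec) tuples from the cdr table.

    Returns counts per disposition, the answer-seizure ratio (ASR:
    answered calls / total attempts), and average answered duration."""
    dispositions = Counter()
    answered_seconds = 0
    for disposition, billsec in rows:
        dispositions[disposition] += 1
        if disposition == "ANSWERED":
            answered_seconds += billsec
    total = sum(dispositions.values())
    answered = dispositions["ANSWERED"]
    return {
        "dispositions": dict(dispositions),
        "asr": answered / total if total else 0.0,
        "avg_duration": answered_seconds / answered if answered else 0.0,
    }
```

Wrap this in a prometheus_client HTTP server and a periodic SELECT against the last minute of records, and you have the exporter.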

Performance Impact and Gotchas

On a production VICIdial cluster processing 200,000+ daily calls: res_statsd adds roughly 0.1% CPU overhead. The AMI bridge uses about 15MB RAM and negligible CPU. The OTel Collector uses 50-100MB RAM depending on pipeline complexity. Total overhead is less than a single poorly-written AGI script adds per call. If you're worried about performance, profile your AGI scripts first — that's where the real CPU waste lives.

One critical caveat: avoid high-cardinality metric labels. Prometheus stores a separate time series for every unique label combination, so labels like full phone numbers, channel IDs, or caller names make memory usage grow in proportion to call volume. Stick to low-cardinality values: server name, channel type, queue name, disposition status.
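A quick back-of-the-envelope on series counts shows the difference (all numbers hypothetical):

```python
# Each unique label combination is a separate Prometheus time series.
servers, channel_types, queues = 4, 3, 10
low_cardinality_series = servers * channel_types * queues   # fixed, stays at 120

calls_per_day = 200_000
# Labeling by phone number creates a fresh series per unique caller:
high_cardinality_series = servers * calls_per_day           # 800,000 and growing daily
```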

Another gotcha: make sure your AMI user has the right event filters. If you subscribe to all events, the AMI bridge will drown in DTMF and VarSet events that you don't need. Filter to call,agent,cdr in the AMI login action.
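In manager.conf, that filter belongs on the bridge's user entry (the username and secret below are placeholders):

```ini
; /etc/asterisk/manager.conf
[otel-bridge]
secret = changeme          ; placeholder -- use a real secret
read = call,agent,cdr      ; only the event classes the bridge needs
write =                    ; the bridge never originates actions
```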

Getting Started: The Minimum Viable Setup

If you want to get something useful running in a day, here's the minimal path:

  1. Install the OTel Collector and configure the StatsD receiver on port 8125
  2. Enable res_statsd in Asterisk pointing at 127.0.0.1:8125
  3. Configure the Prometheus exporter in the OTel Collector on port 8889
  4. Add the scrape target to your Prometheus config
  5. Import or build the Cluster Overview dashboard in Grafana with three panels: active channels (stat), channels over time (timeseries), and endpoints online (stat)
  6. Add the TrunkDown alert to Prometheus alerting rules
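For step 4, the scrape target is just the collector's Prometheus exporter port on each Asterisk server, for example:

```yaml
scrape_configs:
  - job_name: asterisk-otel
    scrape_interval: 15s
    static_configs:
      - targets:
          - dialer01:8889
          - dialer02:8889
```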

That gives you real-time channel visibility, registration monitoring, and the single most important alert. Total setup time: 2-4 hours if you already have Prometheus and Grafana running. Add the AMI bridge for call-level metrics in week two. Add tracing when you need to debug a specific call quality problem.

The monitoring stack takes a day to deploy. Building the habit of looking at dashboards before they page you — that takes longer, and matters more. But the first time the TrunkDown alert fires at 3 AM and you diagnose the issue from your phone in 5 minutes instead of SSH-ing around for 30 minutes, you'll wonder why you didn't set this up years ago.

For the complete implementation with the full OTel Collector YAML configuration, the AMI bridge Python script with trace span instrumentation, Prometheus recording rules and alerting rules, Grafana dashboard panel specifications, and the CDR exporter script, see the full guide at ViciStack.
