DEV Community

Dialphone Limited

The VoIP Monitoring Stack I Wish I Had Set Up From Day One

Three years into managing VoIP infrastructure, I rebuilt our entire monitoring stack from scratch. The old approach — checking if the PBX process was running and calling it monitored — missed every real outage we had. Here is the stack I wish I had deployed on day one.

What Actually Needs Monitoring

Most teams monitor the VoIP server. That is like monitoring your web server's CPU and declaring your website works. You need to monitor the call experience, not the infrastructure.

| Layer | What to Monitor | Why |
|---|---|---|
| Network | Jitter, packet loss, per-hop latency | Call quality degrades before infrastructure fails |
| SIP | Registration rate, INVITE response times, error codes | Detect authentication and routing issues |
| RTP | MOS scores, codec negotiation failures, SRTP errors | Direct measure of call quality |
| Application | Active calls, queue depth, abandoned calls | Business impact metrics |
| Endpoint | Phone registration status, firmware version, reboot count | Catch hardware failures before users report them |

The Stack

1. Network Layer: Continuous SIP probing

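I run synthetic SIP OPTIONS probes every 60 seconds from each office to our VoIP provider. Each probe records the signaling round-trip time, and an unanswered probe counts as loss, so I get a continuous latency and packet loss signal before users notice a problem.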

# Simplified SIP OPTIONS probe
import socket, time, uuid

def sip_probe(target, port=5060):
    # Unique Call-ID and Via branch per probe; RFC 3261 branches start with "z9hG4bK"
    call_id = f"probe-{int(time.time() * 1000)}@monitor"
    branch = f"z9hG4bK{uuid.uuid4().hex[:16]}"
    probe = (
        f"OPTIONS sip:ping@{target} SIP/2.0\r\n"
        f"Via: SIP/2.0/UDP monitor:5060;branch={branch}\r\n"
        "From: <sip:monitor@probe>;tag=probe123\r\n"
        f"To: <sip:ping@{target}>\r\n"
        f"Call-ID: {call_id}\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    )

    # The "with" block closes the UDP socket even when the probe times out
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        start = time.perf_counter()
        sock.sendto(probe.encode(), (target, port))
        try:
            data, _ = sock.recvfrom(4096)
            rtt = (time.perf_counter() - start) * 1000
            # errors="replace" avoids a crash if the 50-byte slice cuts a character
            return dict(rtt_ms=round(rtt, 2), response=data[:50].decode(errors="replace"))
        except socket.timeout:
            return dict(rtt_ms=None, response="TIMEOUT")
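
A single probe result says little on its own; the useful signal comes from a window of them. The sketch below is one way to turn a list of results shaped like the return value of `sip_probe` into loss, average round-trip time, and a rough jitter figure. The function name `summarize_probes` and the jitter definition (mean absolute difference between consecutive answered probes) are my assumptions for illustration, not RFC 3550 interarrival jitter.

```python
from statistics import mean

def summarize_probes(results):
    """Summarize a window of probe results (dicts with an 'rtt_ms' key)."""
    if not results:
        return dict(loss_pct=None, avg_rtt_ms=None, jitter_ms=None)

    # A probe that timed out has rtt_ms=None and counts as lost
    rtts = [r["rtt_ms"] for r in results if r["rtt_ms"] is not None]
    loss_pct = 100.0 * (len(results) - len(rtts)) / len(results)

    avg_rtt = round(mean(rtts), 2) if rtts else None

    # Rough jitter proxy: mean absolute change between consecutive answered probes
    deltas = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    jitter = round(mean(deltas), 2) if deltas else None

    return dict(loss_pct=round(loss_pct, 2), avg_rtt_ms=avg_rtt, jitter_ms=jitter)

# Hypothetical sample window: three answered probes and one timeout
window = [
    dict(rtt_ms=20.0, response="SIP/2.0 200 OK"),
    dict(rtt_ms=24.0, response="SIP/2.0 200 OK"),
    dict(rtt_ms=None, response="TIMEOUT"),
    dict(rtt_ms=22.0, response="SIP/2.0 200 OK"),
]
print(summarize_probes(window))
```

At one probe per minute the loss percentage is coarse, so a short window mostly tells you whether the path is up; a longer window is needed before the number means much.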

2. Call Quality: Real-time MOS scoring

Every call gets a MOS (Mean Opinion Score) calculated from RTP statistics. We alert when the rolling average drops below 3.5.
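
For readers who have to compute MOS themselves, here is a minimal sketch of a widely used simplification of the ITU-T G.107 E-model. It is not necessarily what our provider runs. The function name `estimate_mos`, the jitter weighting, and the 2.5-points-per-percent loss penalty are assumptions taken from that common heuristic; only the final R-factor-to-MOS conversion comes from G.107 itself.

```python
def estimate_mos(latency_ms, jitter_ms, loss_pct):
    """Approximate MOS from one-way latency, jitter, and packet loss."""
    # Heuristic: jitter hurts more than raw latency, so weight it double
    # and add a fixed allowance for codec/processing delay (assumed values)
    effective_latency = latency_ms + 2 * jitter_ms + 10

    # R-factor starts at 93.2 and falls faster once latency passes 160 ms
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10

    # Assumed penalty: 2.5 R-factor points per percent of packet loss
    r -= 2.5 * loss_pct
    r = max(0.0, min(100.0, r))

    # G.107 conversion from R-factor to MOS (yields 1.0 at R=0, 4.5 at R=100)
    mos = 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
    return round(mos, 2)

# Hypothetical inputs: a clean path versus a congested one
print(estimate_mos(latency_ms=20, jitter_ms=2, loss_pct=0))
print(estimate_mos(latency_ms=150, jitter_ms=30, loss_pct=5))
```

The inputs would normally come from RTCP reports or from RTP sequence numbers and timestamps; this sketch takes them as already measured.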

| MOS Range | Quality | Action |
|---|---|---|
| 4.0 - 5.0 | Good to Excellent | No action |
| 3.5 - 4.0 | Acceptable | Investigate trending |
| 3.0 - 3.5 | Poor | Escalate to network team |
| Below 3.0 | Unacceptable | Emergency response |
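
The table and the rolling average can be sketched together in a few lines. The names `classify_mos` and `RollingMos` are hypothetical, the 300-second window is my assumption (borrowed from the 5-minute alert rule), and because the table's ranges share their boundary values, this sketch assigns each boundary to the better band.

```python
from collections import deque

def classify_mos(mos):
    """Map a MOS value to the action column of the table."""
    if mos >= 4.0:
        return "No action"
    if mos >= 3.5:
        return "Investigate trending"
    if mos >= 3.0:
        return "Escalate to network team"
    return "Emergency response"

class RollingMos:
    """Time-windowed average of per-call MOS samples."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.samples = deque()  # (timestamp_s, mos), oldest first

    def add(self, timestamp_s, mos):
        self.samples.append((timestamp_s, mos))
        # Drop samples that have aged out of the window
        while self.samples and self.samples[0][0] <= timestamp_s - self.window_s:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(m for _, m in self.samples) / len(self.samples)

# Hypothetical samples: the first one ages out once the third arrives
rolling = RollingMos(window_s=300)
rolling.add(0, 4.2)
rolling.add(100, 3.0)
rolling.add(350, 3.2)
print(rolling.average(), classify_mos(rolling.average()))
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the class easy to test and lets it replay historical call records.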

3. Alerting Rules

The critical alerts that actually wake me up:

  1. SIP registration failure rate > 5% — Something is wrong with authentication or network
  2. Average MOS < 3.5 for 5 minutes — Call quality degraded
  3. Packet loss > 1% sustained — Network issue affecting voice
  4. Active calls drop > 20% in 60 seconds — Mass call failure event
  5. Queue abandoned rate > 15% — Customers are hanging up

Everything else is a warning, not a page.
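
To make the five paging rules concrete, here is a minimal sketch that evaluates them against one metrics snapshot. The `Snapshot` fields and the `paging_alerts` function are hypothetical names; the sketch assumes the MOS figure is already a 5-minute average and the packet loss figure is already averaged over whatever period you treat as "sustained".

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    reg_failure_pct: float     # SIP registration failure rate, percent
    avg_mos_5m: float          # average MOS over the last 5 minutes
    packet_loss_pct: float     # sustained packet loss, percent
    active_calls_now: int
    active_calls_60s_ago: int
    queue_abandon_pct: float   # abandoned calls as a percent of queued calls

def paging_alerts(s):
    """Return the list of page-worthy alerts for one snapshot."""
    alerts = []
    if s.reg_failure_pct > 5:
        alerts.append("SIP registration failure rate > 5%")
    if s.avg_mos_5m < 3.5:
        alerts.append("Average MOS < 3.5 for 5 minutes")
    if s.packet_loss_pct > 1:
        alerts.append("Packet loss > 1% sustained")
    # Skip the drop check when there were no calls to lose
    if s.active_calls_60s_ago > 0:
        drop_pct = 100 * (s.active_calls_60s_ago - s.active_calls_now) / s.active_calls_60s_ago
        if drop_pct > 20:
            alerts.append("Active calls dropped > 20% in 60 seconds")
    if s.queue_abandon_pct > 15:
        alerts.append("Queue abandoned rate > 15%")
    return alerts

# Hypothetical snapshot during a bad network event
incident = Snapshot(
    reg_failure_pct=8.0, avg_mos_5m=3.2, packet_loss_pct=2.5,
    active_calls_now=60, active_calls_60s_ago=100, queue_abandon_pct=20.0,
)
for alert in paging_alerts(incident):
    print("PAGE:", alert)
```

In practice these thresholds would live in whatever alerting system you already run; the point of the sketch is that each page maps to one comparison on one aggregated metric.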

What I Stopped Monitoring

  • CPU/memory on the PBX (unless it is self-hosted) — this is the provider's problem
  • Individual phone registration events — too noisy, aggregate is what matters
  • Call duration distribution — interesting for analytics, useless for alerting
  • Voicemail storage usage — never once caused an actual incident

The Result

Before this stack: average incident detection time was 45 minutes (user reports a problem, IT investigates, confirms it is real).

After: average detection time is 90 seconds (synthetic probe fails, alert fires, on-call responds).

Companies such as VestaCall (https://vestacall.com) that prioritize uptime over features provide built-in call quality analytics and real-time MOS scoring, which saved us from building the RTP analysis layer ourselves.
