The VoIP Monitoring Stack I Wish I Had Set Up From Day One

#voip #monitoring #devops #observability

Three years into managing VoIP infrastructure, I rebuilt our entire monitoring stack from scratch. The old approach — checking if the PBX process was running and calling it monitored — missed every real outage we had. Here is the stack I wish I had deployed on day one.

What Actually Needs Monitoring

Most teams monitor the VoIP server. That is like monitoring your web server's CPU and declaring your website works. You need to monitor the call experience, not the infrastructure.

Layer	What to Monitor	Why
Network	Jitter, packet loss, latency per-hop	Call quality degrades before infrastructure fails
SIP	Registration rate, INVITE response times, error codes	Detect authentication and routing issues
RTP	MOS scores, codec negotiation failures, SRTP errors	Direct measure of call quality
Application	Active calls, queue depth, abandoned calls	Business impact metrics
Endpoint	Phone registration status, firmware version, reboot count	Catch hardware failures before users report
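
Jitter from the table above is commonly reported using the interarrival jitter estimator defined in RFC 3550, which smooths the change in packet transit time with a gain of 1/16. The sketch below assumes you already have send and receive timestamps per RTP packet in the same units; the function name and input shape are illustrative, not taken from any particular tool.

```python
def interarrival_jitter(packets):
    """Running jitter estimate in the style of RFC 3550, section 6.4.1.

    packets: iterable of (sent_ts, received_ts) pairs in the same time unit.
    Clock offset between sender and receiver cancels out, because only the
    difference between consecutive transit times is used.
    """
    jitter = 0.0
    prev_transit = None
    for sent, received in packets:
        transit = received - sent
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            # Exponential smoothing with gain 1/16, as the RFC specifies
            jitter += (d - jitter) / 16.0
        prev_transit = transit
    return jitter


# Three packets sent 20 ms apart; the third arrives 5 ms later than expected
sample = [(0, 50), (20, 70), (40, 95)]
print(interarrival_jitter(sample))
```

Because the estimate is smoothed, a single late packet moves it only slightly; sustained variation is what pushes it up.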

The Stack

1. Network Layer: Continuous SIP probing

I run synthetic SIP OPTIONS probes every 60 seconds from each office to our VoIP provider. Each probe records round-trip time, and unanswered probes serve as a proxy for packet loss, so problems on the signaling path show up before users notice them.

# Simplified SIP OPTIONS probe
import socket
import time
import uuid

def sip_probe(target, port=5060):
    # Unique Call-ID and Via branch per probe; RFC 3261 branches start with "z9hG4bK"
    call_id = f"probe-{int(time.time() * 1000)}@monitor"
    branch = f"z9hG4bK{uuid.uuid4().hex[:16]}"
    probe = (
        f"OPTIONS sip:ping@{target} SIP/2.0\r\n"
        # rport (RFC 3581) asks the server to reply to this socket's source port
        f"Via: SIP/2.0/UDP monitor:5060;branch={branch};rport\r\n"
        "From: <sip:monitor@probe>;tag=probe123\r\n"
        f"To: <sip:ping@{target}>\r\n"
        f"Call-ID: {call_id}\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    )

    # The with-block closes the socket even when the probe times out
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        start = time.perf_counter()
        sock.sendto(probe.encode(), (target, port))
        try:
            data, _ = sock.recvfrom(4096)
            rtt = (time.perf_counter() - start) * 1000
            # First 50 bytes cover the status line, e.g. "SIP/2.0 200 OK"
            return dict(rtt_ms=round(rtt, 2), response=data[:50].decode(errors="replace"))
        except socket.timeout:
            return dict(rtt_ms=None, response="TIMEOUT")
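
A single probe result is not very useful on its own; what matters is the trend across a window. Here is a minimal sketch of a scheduling loop that calls a probe function on an interval and summarizes the most recent results. It assumes each result is a dict with an `rtt_ms` key that is `None` on timeout, like the return value of `sip_probe` above. The names `summarize_probes` and `run_probe_loop`, the window size, and the `cycles` and `sleep` parameters are all illustrative choices.

```python
import statistics
import time
from collections import deque


def summarize_probes(results):
    """Summarize probe results; a timeout (rtt_ms is None) counts as lost."""
    results = list(results)
    rtts = [r["rtt_ms"] for r in results if r["rtt_ms"] is not None]
    total = len(results)
    return {
        "probes": total,
        "loss_pct": round(100.0 * (total - len(rtts)) / total, 2) if total else None,
        "rtt_avg_ms": round(statistics.mean(rtts), 2) if rtts else None,
        "rtt_max_ms": max(rtts) if rtts else None,
    }


def run_probe_loop(probe, target, interval_s=60, window=10, cycles=None, sleep=time.sleep):
    """Call probe(target) every interval_s seconds and yield a rolling summary.

    cycles=None runs forever; a number stops after that many probes.
    """
    recent = deque(maxlen=window)  # keeps only the last `window` results
    done = 0
    while cycles is None or done < cycles:
        recent.append(probe(target))
        yield summarize_probes(recent)
        done += 1
        sleep(interval_s)
```

Passing the probe in as a parameter keeps the loop testable with a fake probe. In production you would iterate over `run_probe_loop(sip_probe, your_provider_host)` and ship each summary to whatever metrics backend you use. Note that the loss figure here measures unanswered SIP probes, which approximates but does not equal RTP packet loss.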

2. Call Quality: Real-time MOS scoring

Every call gets a MOS (Mean Opinion Score) calculated from RTP statistics. We alert when the rolling average drops below 3.5.

MOS Range	Quality	Action
4.0 - 5.0	Good to Excellent	No action
3.5 - 4.0	Acceptable	Investigate trending
3.0 - 3.5	Poor	Escalate to network team
Below 3.0	Unacceptable	Emergency response
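
For readers who want to see how a MOS estimate can be derived from RTP statistics, here is a rough sketch. The R-factor step is a widely circulated simplification of the ITU-T G.107 E-model, not the full model, and its constants (the 10 ms codec allowance, the 160 ms knee, the 2.5 penalty per percent of loss) should be treated as assumptions. It is not necessarily what any provider's analytics use. The R-to-MOS conversion is the standard G.107 mapping, and the action labels come from the table above.

```python
def r_factor(latency_ms, jitter_ms, loss_pct):
    """Approximate R-factor from one-way latency, jitter, and packet loss.

    Simplified heuristic: jitter is weighted double and 10 ms is added
    as an assumed codec/processing delay.
    """
    effective_latency = latency_ms + 2.0 * jitter_ms + 10.0
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    r -= 2.5 * loss_pct
    return max(0.0, min(100.0, r))


def mos_from_r(r):
    """Standard G.107 mapping from R-factor to estimated MOS."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


def mos_action(mos):
    """Map a MOS value onto the action table above."""
    if mos >= 4.0:
        return "No action"
    if mos >= 3.5:
        return "Investigate trending"
    if mos >= 3.0:
        return "Escalate to network team"
    return "Emergency response"


for stats in [(20, 2, 0.0), (150, 30, 5.0), (300, 50, 10.0)]:
    mos = mos_from_r(r_factor(*stats))
    print(stats, round(mos, 2), mos_action(mos))
```

One consequence of the G.107 mapping is that estimated MOS tops out around 4.5, so the "4.0 - 5.0" band is in practice 4.0 to roughly 4.4 for calls scored this way.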

3. Alerting Rules

The critical alerts that actually wake me up:

SIP registration failure rate > 5% — Something is wrong with authentication or network
Average MOS < 3.5 for 5 minutes — Call quality degraded
Packet loss > 1% sustained — Network issue affecting voice
Active calls drop > 20% in 60 seconds — Mass call failure event
Queue abandoned rate > 15% — Customers are hanging up

Everything else is a warning, not a page.
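
The five paging rules above can be sketched as a single evaluation function. This is an illustration only: the `Snapshot` fields are hypothetical names, and it assumes your metrics pipeline has already done the windowing (for example, that `avg_mos_5m` is a 5-minute average and `packet_loss_pct` is averaged over whatever period you consider "sustained"). In a real deployment these rules would more likely live in your alerting system's own rule language.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical pre-aggregated metrics for one evaluation cycle."""
    registration_failure_pct: float
    avg_mos_5m: float
    packet_loss_pct: float
    active_calls_now: int
    active_calls_60s_ago: int
    queue_abandon_pct: float


def paging_alerts(s):
    """Return the names of the paging rules this snapshot violates."""
    alerts = []
    if s.registration_failure_pct > 5:
        alerts.append("sip_registration_failures")
    if s.avg_mos_5m < 3.5:
        alerts.append("mos_degraded")
    if s.packet_loss_pct > 1:
        alerts.append("packet_loss")
    # Guard against division by zero when there were no calls a minute ago
    if s.active_calls_60s_ago > 0:
        drop_pct = 100.0 * (s.active_calls_60s_ago - s.active_calls_now) / s.active_calls_60s_ago
        if drop_pct > 20:
            alerts.append("mass_call_drop")
    if s.queue_abandon_pct > 15:
        alerts.append("queue_abandonment")
    return alerts


snapshot = Snapshot(
    registration_failure_pct=2.0,
    avg_mos_5m=3.4,
    packet_loss_pct=0.5,
    active_calls_now=70,
    active_calls_60s_ago=100,
    queue_abandon_pct=10.0,
)
print(paging_alerts(snapshot))
```

An empty list means nothing pages. Anything you track beyond these five rules would feed a separate, lower-severity warning path.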

What I Stopped Monitoring

CPU/memory on the PBX (unless it is self-hosted) — this is the provider's problem
Individual phone registration events — too noisy, aggregate is what matters
Call duration distribution — interesting for analytics, useless for alerting
Voicemail storage usage — never once caused an actual incident

The Result

Before this stack: average incident detection time was 45 minutes (user reports a problem, IT investigates, confirms it is real).

After: average detection time is 90 seconds (synthetic probe fails, alert fires, on-call responds).

Companies such as VestaCall (https://vestacall.com) that prioritize uptime over features provide built-in call quality analytics and real-time MOS scoring, which saved us from building the RTP analysis layer ourselves.

Disclosure: I work on platform systems at DialPhone. Observations in this post are from hands-on testing and deployment work rather than vendor briefings.