Three years into managing VoIP infrastructure, I rebuilt our entire monitoring stack from scratch. The old approach — checking if the PBX process was running and calling it monitored — missed every real outage we had. Here is the stack I wish I had deployed on day one.
What Actually Needs Monitoring
Most teams monitor the VoIP server. That is like monitoring your web server's CPU and declaring your website works. You need to monitor the call experience, not the infrastructure.
| Layer | What to Monitor | Why |
|---|---|---|
| Network | Jitter, packet loss, per-hop latency | Call quality degrades before infrastructure fails |
| SIP | Registration rate, INVITE response times, error codes | Detect authentication and routing issues |
| RTP | MOS scores, codec negotiation failures, SRTP errors | Direct measure of call quality |
| Application | Active calls, queue depth, abandoned calls | Business impact metrics |
| Endpoint | Phone registration status, firmware version, reboot count | Catch hardware failures before users report them |
The Stack
1. Network Layer: Continuous SIP probing
I run synthetic SIP OPTIONS probes every 60 seconds from each office to our VoIP provider. Each probe records a round-trip time, and a probe that times out counts as loss, so we get a continuous latency and loss signal before users notice a problem.
```python
# Simplified SIP OPTIONS probe
import socket
import time


def sip_probe(target, port=5060):
    """Send one SIP OPTIONS request over UDP and time the first reply."""
    # Target and timestamp are filled in at runtime so each probe is unique
    stamp = int(time.time() * 1000)
    probe = (
        f"OPTIONS sip:ping@{target} SIP/2.0\r\n"
        # RFC 3261 requires a Via branch parameter starting with z9hG4bK
        f"Via: SIP/2.0/UDP monitor:5060;branch=z9hG4bK{stamp}\r\n"
        "From: <sip:monitor@probe>;tag=probe123\r\n"
        f"To: <sip:ping@{target}>\r\n"
        f"Call-ID: probe-{stamp}@monitor\r\n"
        "CSeq: 1 OPTIONS\r\n"
        "Max-Forwards: 70\r\n"
        "Content-Length: 0\r\n\r\n"
    )
    # The context manager closes the socket even if the probe times out
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(5)
        start = time.perf_counter()
        sock.sendto(probe.encode(), (target, port))
        try:
            data, _ = sock.recvfrom(4096)
        except socket.timeout:
            return dict(rtt_ms=None, response="TIMEOUT")
        rtt = (time.perf_counter() - start) * 1000
        # Keep only the start of the reply (the status line) for logging
        return dict(rtt_ms=round(rtt, 2), response=data[:50].decode(errors="replace"))
```
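A single probe result is not very useful on its own. The sketch below shows one way a window of results could be rolled up into loss, latency, and jitter figures. The function name `summarize_probes` and the output keys are hypothetical, the jitter value is only a rough proxy (mean difference between consecutive successful RTTs), and probe loss approximates, but is not the same as, RTP packet loss on the media path.

```python
from statistics import mean


def summarize_probes(results):
    """Summarize a window of sip_probe()-style result dicts (hypothetical helper)."""
    rtts = [r["rtt_ms"] for r in results if r["rtt_ms"] is not None]
    lost = len(results) - len(rtts)  # probes that timed out
    # Mean absolute difference between consecutive successful RTTs,
    # used here as a rough jitter proxy
    deltas = [abs(b - a) for a, b in zip(rtts, rtts[1:])]
    return {
        "probes": len(results),
        "loss_pct": round(100 * lost / len(results), 2) if results else None,
        "avg_rtt_ms": round(mean(rtts), 2) if rtts else None,
        "jitter_ms": round(mean(deltas), 2) if deltas else None,
    }


# Example window: three answered probes and one timeout
window = [
    {"rtt_ms": 20.0, "response": "SIP/2.0 200 OK"},
    {"rtt_ms": 30.0, "response": "SIP/2.0 200 OK"},
    {"rtt_ms": None, "response": "TIMEOUT"},
    {"rtt_ms": 25.0, "response": "SIP/2.0 200 OK"},
]
print(summarize_probes(window))
```

With one probe per minute, a window of this kind fills slowly, so a real deployment would probably probe more often or aggregate across offices before alerting on the loss figure.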
2. Call Quality: Real-time MOS scoring
Every call gets a MOS (Mean Opinion Score) calculated from RTP statistics. We alert when the rolling average drops below 3.5.
| MOS Range | Quality | Action |
|---|---|---|
| 4.0 - 5.0 | Good to Excellent | No action |
| 3.5 - 4.0 | Acceptable | Investigate trending |
| 3.0 - 3.5 | Poor | Escalate to network team |
| Below 3.0 | Unacceptable | Emergency response |
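The post does not say how the MOS is derived, so the following is only a sketch of one widely used approach: a simplified E-model heuristic that turns latency, jitter, and packet loss into an R-factor, then converts R to MOS with the ITU-T G.107 formula. The jitter and loss weights are rule-of-thumb values, not part of the standard, and `estimate_mos` and `classify_mos` are hypothetical names. Because the bands in the table share their boundary values, this sketch assumes a boundary score belongs to the better band.

```python
def estimate_mos(latency_ms, jitter_ms, loss_pct):
    """Rough MOS estimate from network stats (simplified E-model heuristic)."""
    # Fold jitter into an "effective latency"; these weights are a common
    # rule of thumb, not part of the ITU-T standard
    effective_latency = latency_ms + 2 * jitter_ms + 10
    if effective_latency < 160:
        r = 93.2 - effective_latency / 40
    else:
        r = 93.2 - (effective_latency - 120) / 10
    r -= 2.5 * loss_pct          # penalize packet loss
    r = max(0.0, min(100.0, r))  # keep R within 0..100
    # R-factor to MOS conversion from ITU-T G.107
    mos = 1 + 0.035 * r + 7e-6 * r * (r - 60) * (100 - r)
    return round(mos, 2)


def classify_mos(mos):
    """Map a MOS value to the action bands in the table above."""
    if mos >= 4.0:
        return "No action"
    if mos >= 3.5:
        return "Investigate trending"
    if mos >= 3.0:
        return "Escalate to network team"
    return "Emergency response"


# Example: a clean path versus a congested one
for stats in [(20, 5, 0.0), (150, 40, 5.0)]:
    score = estimate_mos(*stats)
    print(stats, score, classify_mos(score))
```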
3. Alerting Rules
The critical alerts that actually wake me up:
- SIP registration failure rate > 5% — Something is wrong with authentication or network
- Average MOS < 3.5 for 5 minutes — Call quality degraded
- Packet loss > 1% sustained — Network issue affecting voice
- Active calls drop > 20% in 60 seconds — Mass call failure event
- Queue abandoned rate > 15% — Customers are hanging up
Everything else is a warning, not a page.
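The paging rules above could be expressed roughly as follows. This is a sketch under stated assumptions rather than the configuration used here: the metric names in the snapshot dict, the `RollingAverage` class, and `page_worthy_alerts` are all hypothetical, and the "for 5 minutes" MOS condition is read as a 5-minute rolling average.

```python
from collections import deque


class RollingAverage:
    """Average of the samples seen within the last `window_s` seconds."""

    def __init__(self, window_s=300):
        self.window_s = window_s
        self.samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp, value):
        self.samples.append((timestamp, value))
        # Drop samples that have aged out of the window
        while self.samples and self.samples[0][0] <= timestamp - self.window_s:
            self.samples.popleft()

    def average(self):
        if not self.samples:
            return None
        return sum(v for _, v in self.samples) / len(self.samples)


def page_worthy_alerts(m):
    """Return the names of the paging rules that fire for one metrics snapshot."""
    rules = {
        "sip_registration_failures": m["reg_failure_pct"] > 5,
        "low_mos": m["avg_mos_5m"] < 3.5,
        "packet_loss": m["packet_loss_pct"] > 1,
        "active_call_drop": m["active_calls_drop_pct_60s"] > 20,
        "queue_abandonment": m["queue_abandon_pct"] > 15,
    }
    return [name for name, firing in rules.items() if firing]


# Example: feed per-minute MOS samples, then evaluate one snapshot
mos = RollingAverage(window_s=300)
for t, score in [(0, 4.2), (60, 3.4), (120, 3.2), (180, 3.1), (240, 3.3)]:
    mos.add(t, score)

snapshot = {
    "reg_failure_pct": 1.0,
    "avg_mos_5m": mos.average(),
    "packet_loss_pct": 2.5,
    "active_calls_drop_pct_60s": 0.0,
    "queue_abandon_pct": 4.0,
}
print(page_worthy_alerts(snapshot))
```

Keeping the rules in one small table like this makes it easy to see that only five conditions can page, which is the point of the list.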
What I Stopped Monitoring
- CPU/memory on the PBX (unless it is self-hosted) — this is the provider's problem
- Individual phone registration events — too noisy, aggregate is what matters
- Call duration distribution — interesting for analytics, useless for alerting
- Voicemail storage usage — never once caused an actual incident
The Result
Before this stack: average incident detection time was 45 minutes (user reports a problem, IT investigates, confirms it is real).
After: average detection time is 90 seconds (synthetic probe fails, alert fires, on-call responds).
Companies such as VestaCall (https://vestacall.com) that prioritize uptime over features provide built-in call quality analytics and real-time MOS scoring, which saved us from building the RTP analysis layer ourselves.