Every VoIP provider claims 99.99% uptime. Most cannot explain how they achieve it. Here is a look inside the infrastructure that keeps DialPhone running — and the engineering decisions behind it.
The Architecture
Customer Office
    |
Internet (dual ISP recommended)
    |
DialPhone Edge PoP (nearest location)
    |
    +--- SBC Cluster (Session Border Controllers)
    |      Active-Active, geo-redundant
    |
    +--- Media Servers (RTP processing)
    |      Distributed per-region
    |
    +--- Signalling Servers (SIP processing)
    |      Active-Active, database-replicated
    |
    +--- Recording Storage (encrypted)
           AES-256, geo-replicated
The Three Rules of VoIP Reliability
Rule 1: No Single Points of Failure
Every component in the stack has a redundant pair:
| Component | Primary | Redundant | Failover Time |
|---|---|---|---|
| SBC | London | Amsterdam | < 3 seconds |
| SIP proxy | Active | Active (both serve traffic) | 0 seconds |
| Database | Primary | Hot standby | < 5 seconds |
| Recording storage | Region A | Replicated to Region B | Transparent |
| DNS | Multiple providers | Anycast | Transparent |
Rule 2: Active-Active Beats Active-Passive
Active-passive means one system sits idle until the primary fails. The problem: when you need the passive system, it has not handled real traffic in months. Will it actually work?
DialPhone runs active-active: both data centres handle real calls simultaneously. If London goes down, Amsterdam already has warm caches, active sessions, and proven capacity. Failover is not "starting up a cold system" — it is "one of two running systems takes 100% instead of 50%."
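The "100% instead of 50%" point carries a capacity requirement: active-active only fails over cleanly if the surviving site has headroom for the whole load. Here is a minimal sketch of that check. The site names match the article, but the call volumes, capacities, and the `survives_site_loss` helper are hypothetical, not DialPhone's actual figures or tooling.

```python
def survives_site_loss(site_loads, site_capacity):
    """Return True if, for every single-site failure, the remaining
    sites have enough capacity to carry the total current load."""
    total_load = sum(site_loads.values())
    for failed_site in site_loads:
        remaining_capacity = sum(
            cap for name, cap in site_capacity.items() if name != failed_site
        )
        if total_load > remaining_capacity:
            return False
    return True


# Hypothetical numbers: concurrent calls per site and per-site capacity.
loads = {"london": 4000, "amsterdam": 4000}
capacity = {"london": 10000, "amsterdam": 10000}

print(survives_site_loss(loads, capacity))
```

With two equally sized sites, this boils down to keeping each site under 50% utilisation in normal operation. Run it at 60% each and the check fails, because one site cannot carry 120% of its own capacity.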
Rule 3: Test Failures Constantly
We run chaos engineering on our voice infrastructure:
- Monthly: simulate complete data centre failure
- Weekly: kill random SBC instances and verify calls continue
- Daily: synthetic call testing from 20 locations worldwide
- Continuous: latency and packet loss monitoring to every customer endpoint
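To give a feel for the simplest layer of synthetic testing, here is a rough sketch of a SIP OPTIONS liveness probe over UDP. This is an illustration, not DialPhone's actual test harness: the `build_options`, `parse_status`, and `probe` helpers and the `sbc.example.net` hostname are assumptions. A real synthetic call test would go further, completing an INVITE and measuring the RTP stream.

```python
import socket
import uuid


def build_options(target_host, local_host, local_port=5060):
    """Build a minimal SIP OPTIONS request (header set per RFC 3261)."""
    # RFC 3261 requires the Via branch to start with the magic cookie "z9hG4bK".
    branch = "z9hG4bK" + uuid.uuid4().hex
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {local_host}:{local_port};branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{local_host}>;tag={uuid.uuid4().hex[:8]}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {uuid.uuid4().hex}@{local_host}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status(response):
    """Extract the numeric status code from a SIP response status line."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = status_line.split(" ", 2)
    if len(parts) < 2 or parts[0] != "SIP/2.0":
        raise ValueError(f"not a SIP response: {status_line!r}")
    return int(parts[1])


def probe(target_host, target_port=5060, timeout=2.0):
    """Send one OPTIONS request; return the status code, or None on no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect((target_host, target_port))
            local_host, local_port = sock.getsockname()
            sock.send(build_options(target_host, local_host, local_port))
            return parse_status(sock.recv(4096))
        except OSError:
            # Covers timeouts, refused ports, and DNS failures.
            return None


# Example usage (hypothetical hostname, not executed here):
# status = probe("sbc.example.net")
```

Any SIP response shows the signalling path is alive. A scheduler would typically alert when `probe` returns `None` or a 5xx code from several locations at once.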
What 99.99% Actually Means
| Uptime | Downtime per year | DialPhone measured downtime |
|---|---|---|
| 99.9% | 8.76 hours | n/a |
| 99.99% | 52.6 minutes | n/a |
| 99.999% | 5.26 minutes | n/a |
| DialPhone actual (99.991%) | n/a | 47 minutes |
Our measured downtime over the past 24 months: 3 incidents totalling 47 minutes. Counted against a single year's budget, that is 99.991% uptime. Not perfect, but each incident has a published postmortem on our status page.
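The downtime figures are simple arithmetic, and it is worth being able to reproduce them when a provider quotes a percentage. Here is a small sketch. The helper names are made up for illustration, and the incident durations are the three listed in this article.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(uptime_percent, window_minutes=MINUTES_PER_YEAR):
    """Minutes of downtime allowed by an uptime target over a window."""
    return window_minutes * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes, window_minutes=MINUTES_PER_YEAR):
    """Uptime percentage implied by a downtime total over a window."""
    return 100 * (1 - downtime_minutes / window_minutes)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% allows {downtime_budget_minutes(target):.2f} min/year")

incident_minutes = [18, 22, 7]  # the three incidents listed in this article
total = sum(incident_minutes)

print(f"total downtime: {total} min")
print(f"against one year:  {uptime_percent(total):.4f}%")
print(f"against 24 months: {uptime_percent(total, 2 * MINUTES_PER_YEAR):.4f}%")
```

The measurement window matters. Forty-seven minutes against a one-year budget works out to roughly 99.991%, while the same 47 minutes spread over a full 24 months is closer to 99.9955%. When comparing providers, ask which window their percentage uses.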
The Incidents We Had (Honest Disclosure)
| Date | Duration | Cause | Impact | Fix |
|---|---|---|---|---|
| Mar 2025 | 18 min | BGP route leak from upstream provider | 12% of customers lost connectivity | Automated BGP failover now triggers in < 60s |
| Aug 2025 | 22 min | Database replication lag during upgrade | New registrations failed, existing calls unaffected | Blue-green deployment process added |
| Jan 2026 | 7 min | DDoS attack on SIP infrastructure | Inbound calls delayed for some customers | Scrubbing capacity increased 4x |
We publish these because hiding incidents erodes trust. Every provider has them. The difference is how quickly you recover and whether you prevent recurrence.
Why This Matters for Your Business
Ask any VoIP provider these three questions:
- "What was your longest outage in the past 24 months?"
- "Can I see the postmortem?"
- "What did you change to prevent it from happening again?"
If they cannot answer all three, their 99.99% claim is marketing, not engineering.
DialPhone publishes real-time system status and full postmortems for every incident. Transparency is not optional when your customers depend on you for every business call.