How DialPhone Achieves 99.99% Uptime: The Architecture Behind the Scenes

#voip #architecture #reliability #devops

Every VoIP provider claims 99.99% uptime. Most cannot explain how they achieve it. Here is a look inside the infrastructure that keeps DialPhone running — and the engineering decisions behind it.

The Architecture

Customer Office
    |
    Internet (dual ISP recommended)
    |
    DialPhone Edge PoP (nearest location)
    |
    +--- SBC Cluster (Session Border Controllers)
    |    Active-Active, geo-redundant
    |
    +--- Media Servers (RTP processing)
    |    Distributed per-region
    |
    +--- Signalling Servers (SIP processing)
    |    Active-Active, database-replicated
    |
    +--- Recording Storage (encrypted)
         AES-256, geo-replicated

The Three Rules of VoIP Reliability

Rule 1: No Single Points of Failure

Every component in the stack has a redundant counterpart:

Component	Primary	Redundant	Failover Time
SBC	London	Amsterdam	< 3 seconds
SIP proxy	Active	Active (both serve traffic)	0 seconds
Database	Primary	Hot standby	< 5 seconds
Recording storage	Region A	Replicated to Region B	Transparent
DNS	Multiple providers	Anycast	Transparent
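
To make the failover column concrete, here is a minimal sketch of how a client or edge component might choose between two redundant sites based on health checks. The `Site` type, the `pick_site` function, and the `MAX_MISSED` threshold are hypothetical names chosen for illustration; they are not DialPhone's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical threshold: consecutive failed health checks before a site is skipped.
MAX_MISSED = 2


@dataclass
class Site:
    name: str
    missed_checks: int  # consecutive failed health checks


def pick_site(sites: list[Site]) -> Optional[Site]:
    """Return the first site, in preference order, that still looks healthy."""
    for site in sites:
        if site.missed_checks < MAX_MISSED:
            return site
    return None  # every site is down: nothing left to fail over to


sites = [Site("London", 0), Site("Amsterdam", 0)]
print(pick_site(sites).name)  # London

sites[0].missed_checks = 2    # London stops answering health checks
print(pick_site(sites).name)  # Amsterdam
```

In a sketch like this, failover time is bounded by the health-check interval multiplied by the threshold, so tighter failover targets need more frequent checks.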

Rule 2: Active-Active Beats Active-Passive

Active-passive means one system sits idle until the primary fails. The problem: by the time you need the passive system, it may not have handled real traffic in months, so nobody knows whether it will actually work.

DialPhone runs active-active: both data centres handle real calls simultaneously. If London goes down, Amsterdam already has warm caches, active sessions, and proven capacity. Failover is not "starting up a cold system" — it is "one of two running systems takes 100% instead of 50%."
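
The "100% instead of 50%" point has a capacity-planning consequence that a short calculation makes explicit. The sketch below assumes traffic splits evenly across sites and that all sites have equal capacity; both assumptions, and the function names, are illustrative rather than taken from DialPhone's configuration.

```python
def max_safe_utilisation(sites: int, failures_tolerated: int = 1) -> float:
    """Highest per-site utilisation that still lets the survivors carry all traffic."""
    if not 0 <= failures_tolerated < sites:
        raise ValueError("must tolerate fewer failures than there are sites")
    return (sites - failures_tolerated) / sites


def utilisation_after_failure(per_site: float, sites: int, failed: int = 1) -> float:
    """Per-site utilisation once the failed sites' traffic is spread over the survivors."""
    if not 0 <= failed < sites:
        raise ValueError("at least one site must survive")
    return per_site * sites / (sites - failed)


print(max_safe_utilisation(2))            # 0.5: each of two sites may run at most half full
print(utilisation_after_failure(0.5, 2))  # 1.0: the survivor is exactly at capacity
print(utilisation_after_failure(0.6, 2))  # 1.2: the survivor would be overloaded
```

Under these assumptions, a two-site active-active pair only fails over cleanly if each site normally runs at or below half of its capacity.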

Rule 3: Test Failures Constantly

We run chaos engineering and continuous testing on our voice infrastructure:

Monthly: simulate complete data centre failure
Weekly: kill random SBC instances and verify calls continue
Daily: synthetic call testing from 20 locations worldwide
Continuous: latency and packet loss monitoring to every customer endpoint
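
As an illustration of the daily synthetic call tests, the sketch below evaluates one round of probe results and decides whether to raise an alert. The `ProbeResult` fields, the 2000 ms setup threshold, the two-location alert rule, and the location names are all assumptions made for this example, not details of DialPhone's monitoring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    location: str
    connected: bool
    setup_ms: Optional[float]  # call setup time; None when the call never connected


def failing_locations(results: list[ProbeResult], max_setup_ms: float = 2000.0) -> list[str]:
    """Names of probe locations whose test call failed or set up too slowly."""
    return [
        r.location
        for r in results
        if not r.connected or r.setup_ms is None or r.setup_ms > max_setup_ms
    ]


def should_alert(results: list[ProbeResult], min_failing: int = 2) -> bool:
    """Alert only when several locations fail at once, to filter out one flaky probe."""
    return len(failing_locations(results)) >= min_failing


results = [
    ProbeResult("Frankfurt", True, 180.0),
    ProbeResult("Singapore", False, None),
    ProbeResult("Sao Paulo", True, 2500.0),
]
print(failing_locations(results))  # ['Singapore', 'Sao Paulo']
print(should_alert(results))       # True
```

Requiring more than one failing location is a design choice: it trades a slightly slower alert for fewer false alarms caused by a single probe's local network.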

What 99.99% Actually Means

Uptime	Downtime Per Year
99.9%	8.76 hours
99.99%	52.6 minutes
99.999%	5.26 minutes
DialPhone measured	47 minutes total over 24 months

Over the past 24 months we had 3 incidents totalling 47 minutes of downtime. Not perfect. Even if all 47 minutes are charged to a single year, that is 99.991% uptime; averaged over the full 24 months it is about 99.996%. Each incident has a published postmortem on our status page.
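
The arithmetic behind these figures is easy to reproduce. The sketch below assumes a 365-day year; the function names are illustrative, and the incident durations are the three listed in this post.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, assuming a 365-day year


def downtime_budget_minutes(uptime_percent: float, years: float = 1.0) -> float:
    """Minutes of downtime allowed by an uptime target over the given period."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR * years


def measured_uptime_percent(downtime_minutes: float, years: float = 1.0) -> float:
    """Uptime percentage implied by a given amount of downtime over the period."""
    return 100 * (1 - downtime_minutes / (MINUTES_PER_YEAR * years))


incident_minutes = [18, 22, 7]  # the three incidents listed in this post
total = sum(incident_minutes)

print(round(downtime_budget_minutes(99.9) / 60, 2))        # 8.76 (hours)
print(round(downtime_budget_minutes(99.99), 1))            # 52.6 (minutes)
print(round(downtime_budget_minutes(99.999), 2))           # 5.26 (minutes)
print(total)                                               # 47
print(round(measured_uptime_percent(total, years=1), 3))   # 99.991
print(round(measured_uptime_percent(total, years=2), 4))   # 99.9955
```

The measurement window matters: the same 47 minutes gives 99.991% when counted against one year and roughly 99.9955% when counted against 24 months, so it is worth asking any provider which window their headline figure uses.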

The Incidents We Had (Honest Disclosure)

Date	Duration	Cause	Impact	Fix
Mar 2025	18 min	BGP route leak from upstream provider	12% of customers lost connectivity	Automated BGP failover now triggers in < 60s
Aug 2025	22 min	Database replication lag during upgrade	New registrations failed, existing calls unaffected	Blue-green deployment process added
Jan 2026	7 min	DDoS attack on SIP infrastructure	Inbound calls delayed for some customers	Scrubbing capacity increased 4x

We publish these because hiding incidents erodes trust. Every provider has them. The difference is how quickly you recover and whether you prevent recurrence.

Why This Matters for Your Business

Ask any VoIP provider these three questions:

"What was your longest outage in the past 24 months?"
"Can I see the postmortem?"
"What did you change to prevent it from happening again?"

If they cannot answer all three, their 99.99% claim is marketing, not engineering.

DialPhone publishes real-time system status and full postmortems for every incident. Transparency is not optional when your customers depend on you for every business call.