DEV Community

Dialphone Limited

How DialPhone Handles 99.99% Uptime — The Architecture Behind the Scenes

Every VoIP provider claims 99.99% uptime. Most cannot explain how they achieve it. Here is a look inside the infrastructure that keeps DialPhone running — and the engineering decisions behind it.

The Architecture

Customer Office
    |
    Internet (dual ISP recommended)
    |
    DialPhone Edge PoP (nearest location)
    |
    +--- SBC Cluster (Session Border Controllers)
    |    Active-Active, geo-redundant
    |
    +--- Media Servers (RTP processing)
    |    Distributed per-region
    |
    +--- Signalling Servers (SIP processing)
    |    Active-Active, database-replicated
    |
    +--- Recording Storage (encrypted)
         AES-256, geo-replicated

The Three Rules of VoIP Reliability

Rule 1: No Single Points of Failure

Every component in the stack has a redundant pair:

| Component | Primary | Redundant | Failover time |
|---|---|---|---|
| SBC | London | Amsterdam | < 3 seconds |
| SIP proxy | Active | Active (both serve traffic) | 0 seconds |
| Database | Primary | Hot standby | < 5 seconds |
| Recording storage | Region A | Replicated to Region B | Transparent |
| DNS | Multiple providers | Anycast | Transparent |
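
A rule like this is easy to state and easy to let slip, so it helps to check it mechanically. The following is a minimal sketch that encodes the rows of the table above as data and flags any component with no redundant counterpart listed. The `Component` class and `single_points_of_failure` function are hypothetical names used for illustration, not part of DialPhone's tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    primary: str
    redundant: str
    failover: str  # failover time exactly as stated in the table


# Rows copied from the table above.
STACK = [
    Component("SBC", "London", "Amsterdam", "< 3 seconds"),
    Component("SIP proxy", "Active", "Active (both serve traffic)", "0 seconds"),
    Component("Database", "Primary", "Hot standby", "< 5 seconds"),
    Component("Recording storage", "Region A", "Replicated to Region B", "Transparent"),
    Component("DNS", "Multiple providers", "Anycast", "Transparent"),
]


def single_points_of_failure(stack):
    """Return the names of components with no redundant counterpart listed."""
    return [c.name for c in stack if not c.redundant.strip()]


if __name__ == "__main__":
    print(single_points_of_failure(STACK))
```

A check like this could run in CI against an infrastructure inventory, so that a newly added component without a redundant counterpart fails the build rather than being discovered during an outage.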

Rule 2: Active-Active Beats Active-Passive

Active-passive means one system sits idle until the primary fails. The problem: when you need the passive system, it has not handled real traffic in months. Will it actually work?

DialPhone runs active-active: both data centres handle real calls simultaneously. If London goes down, Amsterdam already has warm caches, active sessions, and proven capacity. Failover is not "starting up a cold system" — it is "one of two running systems takes 100% instead of 50%."
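
The capacity consequence of that last sentence can be written down directly. This is a minimal sketch, assuming traffic is split evenly across identically sized sites and that load is expressed as a fraction of one site's capacity; the function names are hypothetical.

```python
def max_safe_utilisation(sites: int, failures_tolerated: int = 1) -> float:
    """Highest per-site load (as a fraction of one site's capacity) at which
    the surviving sites can still absorb all traffic after failures."""
    if sites <= failures_tolerated:
        raise ValueError("need more sites than tolerated failures")
    return (sites - failures_tolerated) / sites


def load_after_failure(per_site_load: float, sites: int, failed: int = 1) -> float:
    """Per-site load on the survivors once `failed` sites drop out."""
    survivors = sites - failed
    if survivors <= 0:
        raise ValueError("no surviving sites")
    return per_site_load * sites / survivors


if __name__ == "__main__":
    # Two sites, each at half capacity: the survivor ends up at full capacity.
    print(max_safe_utilisation(2))
    print(load_after_failure(0.5, sites=2))
```

With two sites, neither may run above 50% in normal operation if one is to carry everything alone. Adding a third site raises that ceiling to about 67%, which is one reason active-active designs often grow beyond two locations.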

Rule 3: Test Failures Constantly

We run chaos engineering on our voice infrastructure:

  • Monthly: simulate complete data centre failure
  • Weekly: kill random SBC instances and verify calls continue
  • Daily: synthetic call testing from 20 locations worldwide
  • Continuous: latency and packet loss monitoring to every customer endpoint
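
As an illustration of what a synthetic probe can look like, here is a minimal sketch that builds a SIP `OPTIONS` request and parses the status line of a reply. Note the substitution: the article describes synthetic call testing, which would involve full `INVITE` dialogs and media, whereas `OPTIONS` is a lighter reachability check used here to keep the example short. The host names are placeholders and the function names are hypothetical.

```python
import uuid


def build_options_probe(target_host: str, probe_host: str) -> bytes:
    """Build a minimal SIP OPTIONS request (RFC 3261) as raw bytes."""
    branch = "z9hG4bK" + uuid.uuid4().hex  # RFC 3261 magic cookie prefix
    tag = uuid.uuid4().hex[:8]
    call_id = f"{uuid.uuid4().hex}@{probe_host}"
    lines = [
        f"OPTIONS sip:{target_host} SIP/2.0",
        f"Via: SIP/2.0/UDP {probe_host}:5060;branch={branch}",
        "Max-Forwards: 70",
        f"From: <sip:probe@{probe_host}>;tag={tag}",
        f"To: <sip:{target_host}>",
        f"Call-ID: {call_id}",
        "CSeq: 1 OPTIONS",
        "Content-Length: 0",
        "",  # blank line terminates the headers
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_code(response: bytes) -> int:
    """Extract the numeric status code from a SIP response's first line."""
    status_line = response.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    version, code, *_ = status_line.split(" ", 2)
    if version != "SIP/2.0":
        raise ValueError(f"not a SIP response: {status_line!r}")
    return int(code)


if __name__ == "__main__":
    # Placeholder hosts under the reserved .invalid TLD.
    print(build_options_probe("sbc.example.invalid", "probe.example.invalid").decode())
```

A real probe would send these bytes over UDP or TLS to the SBC, time the round trip, and treat a timeout or a non-2xx status as a failure from that location.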

What 99.99% Actually Means

| Uptime | Downtime per year | DialPhone's measured |
|---|---|---|
| 99.9% | 8.76 hours | |
| 99.99% | 52.6 minutes | |
| 99.999% | 5.26 minutes | |
| DialPhone actual | | 47 minutes (99.991%) |

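Over the past 24 months we had 3 incidents totalling 47 minutes of downtime. Counted against a single year, that is 99.991% uptime; averaged over the full 24 months it works out to roughly 99.9955%. Not perfect, but each incident has a published postmortem on our status page.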
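
The arithmetic behind these figures is short enough to check by hand or in a few lines of code. The sketch below assumes a 365-day year and takes the incident durations from the incident table in this article; the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_budget_minutes(uptime_percent: float,
                            window_minutes: int = MINUTES_PER_YEAR) -> float:
    """Allowed downtime, in minutes, for a given uptime target over a window."""
    return window_minutes * (1 - uptime_percent / 100)


def uptime_percent(downtime_minutes: float,
                   window_minutes: int = MINUTES_PER_YEAR) -> float:
    """Achieved uptime, in percent, for a given amount of downtime."""
    return 100 * (1 - downtime_minutes / window_minutes)


# Incident durations in minutes, from the incident table in this article.
incident_minutes = [18, 22, 7]
total_downtime = sum(incident_minutes)  # 47

if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget_minutes(target):.2f} min/year")
    print(f"one-year window:  {uptime_percent(total_downtime):.4f}%")
    print(f"24-month window:  {uptime_percent(total_downtime, 2 * MINUTES_PER_YEAR):.4f}%")
```

The measurement window matters as much as the downtime itself: the same 47 minutes yields a different percentage depending on whether it is divided by 12 or 24 months, so it is worth asking any provider which window their figure uses.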

The Incidents We Had (Honest Disclosure)

| Date | Duration | Cause | Impact | Fix |
|---|---|---|---|---|
| Mar 2025 | 18 min | BGP route leak from upstream provider | 12% of customers lost connectivity | Automated BGP failover now triggers in < 60 s |
| Aug 2025 | 22 min | Database replication lag during upgrade | New registrations failed; existing calls unaffected | Blue-green deployment process added |
| Jan 2026 | 7 min | DDoS attack on SIP infrastructure | Inbound calls delayed for some customers | Scrubbing capacity increased 4x |

We publish these because hiding incidents erodes trust. Every provider has them. The difference is how quickly you recover and whether you prevent recurrence.

Why This Matters for Your Business

Ask any VoIP provider these 3 questions:

  1. "What was your longest outage in the past 24 months?"
  2. "Can I see the postmortem?"
  3. "What did you change to prevent it from happening again?"

If they cannot answer all three, their 99.99% claim is marketing, not engineering.

DialPhone publishes real-time system status and full postmortems for every incident. Transparency is not optional when your customers depend on you for every business call.
