DEV Community

Dialphone Limited


How DialPhone Handles Your Worst Day: Our Incident Response Process

Every VoIP provider will have an incident eventually. The question is not whether one happens, but how the provider responds when it does.

Here is DialPhone's incident response process, documented publicly because we believe transparency builds more trust than pretending incidents do not happen.

Severity Classification

| Severity | Definition | Example | Response Target |
| --- | --- | --- | --- |
| P1 (Critical) | Service down for multiple customers | Complete outage, no calls | 5 minutes to acknowledge, 30 minutes to mitigate |
| P2 (Major) | Service degraded for multiple customers | Call quality below MOS 3.5 | 15 minutes to acknowledge, 2 hours to resolve |
| P3 (Minor) | Service affected for single customer | One customer's recording not working | 1 hour to acknowledge, 8 hours to resolve |
| P4 (Low) | Cosmetic or non-urgent | Dashboard display error | 4 hours to acknowledge, 48 hours to resolve |
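To make the classification concrete, here is a minimal sketch of how the table could be encoded and checked. The names `Severity`, `ResponseTarget`, `classify`, and `breached` are illustrative assumptions, not DialPhone's actual tooling, and the classification rule is only one plausible reading of the definitions above.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum


class Severity(Enum):
    P1 = "Critical"
    P2 = "Major"
    P3 = "Minor"
    P4 = "Low"


@dataclass(frozen=True)
class ResponseTarget:
    acknowledge: timedelta
    fix: timedelta  # "mitigate" for P1, "resolve" for P2 to P4


# Targets copied from the severity table.
TARGETS = {
    Severity.P1: ResponseTarget(timedelta(minutes=5), timedelta(minutes=30)),
    Severity.P2: ResponseTarget(timedelta(minutes=15), timedelta(hours=2)),
    Severity.P3: ResponseTarget(timedelta(hours=1), timedelta(hours=8)),
    Severity.P4: ResponseTarget(timedelta(hours=4), timedelta(hours=48)),
}


def classify(customers_affected: int, service_down: bool, service_degraded: bool) -> Severity:
    """Map scope and impact onto a severity (hypothetical rule, assumed for illustration)."""
    if customers_affected > 1 and service_down:
        return Severity.P1
    if customers_affected > 1 and service_degraded:
        return Severity.P2
    if customers_affected == 1 and (service_down or service_degraded):
        return Severity.P3
    return Severity.P4


def breached(severity: Severity, time_to_ack: timedelta, time_to_fix: timedelta) -> bool:
    """Return True if either response target for this severity was missed."""
    target = TARGETS[severity]
    return time_to_ack > target.acknowledge or time_to_fix > target.fix
```

For example, `classify(40, True, False)` yields `Severity.P1`, and a P1 acknowledged in 4 minutes but mitigated in 31 counts as a breach.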

What Happens During a P1 (The Worst Day)

Minute 0-5: Detection

Our monitoring catches it before customers call:

  • Synthetic call testing from 20 UK locations every 60 seconds
  • SIP registration success rate monitored per second
  • RTP quality metrics aggregated per minute
  • Customer-facing status page auto-updates

Target: Detect within 60 seconds. Acknowledge on status page within 5 minutes.
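As a rough illustration of how per-second registration data can turn into an alert inside the 60-second target, the sketch below flags an incident when the success rate over a sliding window drops below a threshold. The class name, the 30-second window, and the 95% threshold are assumptions chosen for the example, not DialPhone's production values.

```python
from collections import deque


class SuccessRateDetector:
    """Flag an incident when the windowed success rate falls below a threshold."""

    def __init__(self, window_seconds: int = 30, threshold: float = 0.95):
        # One (succeeded, attempted) pair per second; old samples fall off the left.
        self.samples = deque(maxlen=window_seconds)
        self.threshold = threshold

    def observe(self, succeeded: int, attempted: int) -> bool:
        """Record one second of data and return True if an alert should fire."""
        self.samples.append((succeeded, attempted))
        if len(self.samples) < self.samples.maxlen:
            return False  # window not full yet, so no decision
        ok = sum(s for s, _ in self.samples)
        total = sum(a for _, a in self.samples)
        if total == 0:
            return False  # no attempts at all; a real system may alert on this too
        return ok / total < self.threshold
```

A shorter window reacts faster but is noisier; the window length has to stay well under 60 seconds for the detection target to be reachable.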

Minute 5-15: Assessment

On-call engineer (24/7 rotation, UK-based) assesses:

  • Scope: how many customers affected?
  • Impact: complete outage or degraded service?
  • Root cause hypothesis: network, application, or infrastructure?

Status page updated with scope and estimated time to resolution.

Minute 15-30: Mitigation

Immediate actions to restore service:

  • If data centre issue: fail over to secondary DC (active-active, < 3 seconds)
  • If application issue: restart affected services, roll back recent changes
  • If network issue: reroute traffic through backup paths

Target: Service restored within 30 minutes for P1.
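The mitigation choices above amount to a small playbook keyed on the root cause hypothesis. A minimal sketch, assuming hypothetical names (`Cause`, `PLAYBOOK`, `next_steps`) and measuring time from the start of the incident:

```python
from enum import Enum


class Cause(Enum):
    DATA_CENTRE = "data centre"
    APPLICATION = "application"
    NETWORK = "network"


# Actions taken from the mitigation list, in the order given there.
PLAYBOOK = {
    Cause.DATA_CENTRE: ["fail over to secondary DC"],
    Cause.APPLICATION: ["restart affected services", "roll back recent changes"],
    Cause.NETWORK: ["reroute traffic through backup paths"],
}

P1_MITIGATION_TARGET_MIN = 30  # from the P1 response target


def next_steps(cause: Cause, minutes_elapsed: float) -> dict:
    """Return the actions for a hypothesis and the minutes left before the P1 target."""
    return {
        "actions": PLAYBOOK[cause],
        "minutes_left": max(0.0, P1_MITIGATION_TARGET_MIN - minutes_elapsed),
    }
```

Calling `next_steps(Cause.NETWORK, 14)` returns the rerouting action with 16 minutes left on the clock.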

Minute 30-60: Confirmation

  • Verify service is restored for all affected customers
  • Monitor for recurrence
  • Notify affected customers by email with incident summary
  • Status page updated to "resolved" with timeline

Hour 1-72: Postmortem

Within 72 hours of resolution, we publish a postmortem containing:

  1. Timeline: Minute-by-minute account of what happened
  2. Root cause: Technical explanation of why it happened
  3. Impact: Number of customers affected, duration, call statistics
  4. Resolution: What we did to fix it
  5. Prevention: What we are changing to prevent recurrence

Postmortems are published on our status page. No spin. No minimising. If we messed up, we say so.
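The P1 timeline described above can be read as a simple lookup from elapsed time to phase. The sketch below is illustrative only: the function name `phase_at` is an assumption, and the boundaries are taken directly from the section headings.

```python
from bisect import bisect_right

# Phase start offsets in minutes since the incident began, from the headings above.
PHASE_STARTS = [0, 5, 15, 30, 60]
PHASE_NAMES = ["Detection", "Assessment", "Mitigation", "Confirmation", "Postmortem"]


def phase_at(minutes_elapsed: float) -> str:
    """Return the phase of the P1 process that applies at a given elapsed time."""
    if minutes_elapsed < 0:
        raise ValueError("minutes_elapsed must be non-negative")
    # bisect_right counts how many phase starts are <= minutes_elapsed.
    return PHASE_NAMES[bisect_right(PHASE_STARTS, minutes_elapsed) - 1]
```

These are target windows rather than fixed slots: if the cause is obvious, mitigation can start well before minute 15.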

Real Postmortem Example

Incident: March 2025 — 18-minute partial outage

| Field | Detail |
| --- | --- |
| Duration | 18 minutes |
| Customers affected | 12% (geographic: London region) |
| Impact | Inbound calls to affected customers failed; outbound and inter-office calls unaffected |
| Root cause | Upstream BGP route leak from transit provider caused London PoP to become unreachable |
| Detection time | 47 seconds (automated monitoring) |
| Mitigation | Traffic rerouted through Manchester PoP at minute 14 |
| Resolution | Transit provider corrected BGP announcement at minute 18 |
| Prevention | Added automated BGP anomaly detection with sub-60-second rerouting |
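The figures in this postmortem can be checked against the P1 targets mechanically. The sketch below uses only the times stated in the example; the type and function names are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class IncidentTimeline:
    detected_after: timedelta   # time from incident start to detection
    mitigated_after: timedelta  # time from incident start to mitigation
    resolved_after: timedelta   # time from incident start to resolution


DETECTION_TARGET = timedelta(seconds=60)
MITIGATION_TARGET = timedelta(minutes=30)


def check_targets(t: IncidentTimeline) -> dict:
    """Compare an incident's timeline with the P1 detection and mitigation targets."""
    return {
        "detected_in_time": t.detected_after <= DETECTION_TARGET,
        "mitigated_in_time": t.mitigated_after <= MITIGATION_TARGET,
    }


def detection_to_mitigation(t: IncidentTimeline) -> timedelta:
    """Time between detecting the incident and mitigating it."""
    return t.mitigated_after - t.detected_after


# Values from the March 2025 example.
march_2025 = IncidentTimeline(
    detected_after=timedelta(seconds=47),
    mitigated_after=timedelta(minutes=14),
    resolved_after=timedelta(minutes=18),
)
```

For this incident both targets are met, and the detection-to-mitigation time works out to 13 minutes 13 seconds.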

Our Track Record

| Year | P1 Incidents | Total Downtime | Measured Uptime |
| --- | --- | --- | --- |
| 2024 | 2 | 25 minutes | 99.995% |
| 2025 | 3 | 47 minutes | 99.991% |
| 2026 (Q1) | 0 | 0 minutes | 100% |
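The uptime percentages follow from the downtime figures by straightforward arithmetic. The sketch below assumes uptime is measured over the full calendar year and that each minute of P1 downtime counts in full; the function name is illustrative.

```python
import calendar


def uptime_percent(year: int, downtime_minutes: float) -> float:
    """Uptime as a percentage of all minutes in the given calendar year."""
    days = 366 if calendar.isleap(year) else 365
    total_minutes = days * 24 * 60
    return 100.0 * (1 - downtime_minutes / total_minutes)


print(f"{uptime_percent(2024, 25):.3f}%")  # 99.995%
print(f"{uptime_percent(2025, 47):.3f}%")  # 99.991%
```

Note that 2024 is a leap year, so its 25 minutes are divided by 527,040 minutes rather than 525,600.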

We are not perfect. 47 minutes of downtime in 2025 is 47 minutes too many. But we detected every incident in under 60 seconds, mitigated within 30 minutes, and published full postmortems within 72 hours.

What to Ask Any Provider

  1. How many P1 incidents did you have last year?
  2. What was your longest outage?
  3. Can I see a postmortem from a recent incident?
  4. What is your detection-to-mitigation time?
  5. Do you have a public status page with history?

If they cannot answer all five, their reliability story is marketing, not engineering.

DialPhone answers all five publicly, because your business depends on us answering calls and you deserve to know exactly how seriously we take that responsibility.
