TechLogStack

Posted on Jun 23 • Originally published at techlogstack.com on Oct 1, 2025

600 Calls. 13 Hours. 4 Dead. Inside the Optus Emergency Routing Collapse.

#reliability #devops #cloud #webdev

13 hours during which emergency calling was unavailable
600 failed Triple Zero emergency calls that never reached dispatchers
4 states and territories affected (NT, SA, WA, and NSW)
4 confirmed deaths linked to the outage window
2-second automatic failover SLA implemented in the postmortem fix

On September 18, 2025, Optus engineering teams initiated what should have been a standard firewall upgrade during a scheduled maintenance window. Within minutes, a configuration error silently severed the routing path for Triple Zero, Australia's national emergency services line. When the primary path collapsed, there was no automated failover to a secondary route, and hundreds of callers heard nothing but silence. It took thirteen hours of manual diagnostics amid cascading failures before emergency communications were restored, and the outage triggered a nationwide regulatory reckoning.

The Story

At 12:30 AM AEST on September 18, 2025, Optus began a routine firewall upgrade. Firewall updates occur hundreds of times a year across any large telecommunications network, but this specific push carried an undetected configuration error. Within minutes, customers across the Northern Territory, South Australia, Western Australia, and New South Wales lost the ability to reach emergency dispatchers.

The engineering failure was compounded by a severe architectural blind spot: the absence of an automated backup mechanism. For 13 hours, emergency calls were dropped silently without triggering any secondary route. By the time engineers isolated the misconfiguration and pushed a manual recovery fix, approximately 600 emergency calls had failed, and four people had lost their lives during the communication blackout.

Problem

Firewall configuration error severs emergency routing path

A routine infrastructure upgrade introduced a broken routing rule inside the firewall layer, completely dropping traffic directed toward the emergency services dispatch network.

Cause

Absence of an automated secondary failover path

The call routing layer (the infrastructure passing mobile calls to emergency dispatchers) lacked an automated secondary route, preventing the system from automatically bypassing the broken firewall.

Solution

Manual rule isolation and corrective configuration deployment

Network engineers manually traced the blocked call paths to the newly upgraded firewall, isolated the invalid rule, and deployed a targeted corrective update to restore traffic.

Result

Full restoration after 13 hours and incoming federal investigations

Emergency services re-established connectivity after a 13-hour blackout, prompting Australian telecommunications regulators to open strict audits into redundancy across all national carriers.

The Fix

Automated PSAP Failover and Pre-Deployment Simulation Gate

The remediation required shifting from manual incident response to an automated, resilient topology that protects the emergency path from upstream firewall failures.

Automated PSAP Routing — Multi-path routing rules that bypass primary infrastructure if a handshake drops.
Pre-flight Live Simulation — Mandatory canary testing utilizing simulated emergency call traffic on isolated staging networks.
Regulated Testing SLA — Tightened infrastructure boundaries satisfying newly mandated federal telecommunications limits.

#!/bin/bash
# Verify primary PSAP path health and fail over to the secondary route on heartbeat loss
PRIMARY_PSAP_IP="10.0.10.5"
SECONDARY_PSAP_ROUTE="10.0.20.5"

# Succeeds only if the primary health endpoint answers within 2 seconds
check_health() {
    curl --silent --fail --max-time 2 --output /dev/null "http://${PRIMARY_PSAP_IP}/health"
}

if ! check_health; then
    echo "Primary routing path failed. Diverting emergency traffic to backup..."
    ip route replace default via "${SECONDARY_PSAP_ROUTE}"
fi

Following the hotfix, engineers verified the routing paths by sending synthetic call traffic through all four affected regional gateways before clearing the maintenance block.
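The verification step above can be sketched as a gate that blocks the change unless every regional gateway passes a synthetic call probe. This is a minimal sketch, not the operator's actual tooling: the `probe_region` stub, the `FAILING_REGIONS` variable, and the `run_gate` function are hypothetical names introduced for illustration, and a real probe would place a test call through each gateway rather than read a variable.

```shell
#!/bin/bash
# Hypothetical pre-deployment gate: every regional gateway must pass a
# synthetic emergency-call probe before the maintenance block is cleared.

# Region labels come from the article; everything else here is an assumption.
REGIONS=("NT" "SA" "WA" "NSW")

# probe_region REGION -> exit 0 if a synthetic test call completes.
# Stub: any region listed in the space-separated FAILING_REGIONS variable
# fails, so the sketch runs offline.
probe_region() {
    local region="$1"
    case " ${FAILING_REGIONS:-} " in
        *" ${region} "*) return 1 ;;
        *) return 0 ;;
    esac
}

# run_gate -> prints one line per region, returns non-zero if any probe fails.
run_gate() {
    local failed=0 region
    for region in "${REGIONS[@]}"; do
        if probe_region "$region"; then
            echo "PASS ${region}"
        else
            echo "FAIL ${region}"
            failed=1
        fi
    done
    return "$failed"
}

if run_gate; then
    echo "Gate passed: maintenance block may be cleared"
else
    echo "Gate failed: keep the maintenance block and roll back the change"
fi
```

The gate checks every region instead of stopping at the first failure, so a single run reports the full set of broken gateways.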

Architecture

The emergency infrastructure relies on direct peering with the Public Safety Answering Point (PSAP) infrastructure. A single firewall misconfiguration broke the primary transit link.

Before: Single-Path Firewall Topology

View interactive diagram on TechLogStack →

Interactive diagrams with full source links available on TechLogStack.

After: Dual-Path Auto-Failover Topology

View interactive diagram on TechLogStack →
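In the absence of the diagram, the dual-path idea can be approximated in a few lines: keep an ordered list of paths and route over the first one that passes its health probe. This is a rough sketch under stated assumptions. The addresses reuse the example values from the script in The Fix, while `PATHS`, `DOWN_ADDRESSES`, `path_is_healthy`, and `select_path` are hypothetical names, and the probe is a stub so the sketch runs without a network.

```shell
#!/bin/bash
# Hypothetical dual-path selection: use the first path whose probe succeeds.

# Ordered by preference. Addresses reuse the article's example values.
PATHS=("primary:10.0.10.5" "secondary:10.0.20.5")

# path_is_healthy ADDRESS -> exit 0 if the path answers its health probe.
# Stub: any address listed in the space-separated DOWN_ADDRESSES variable
# is treated as unhealthy.
path_is_healthy() {
    case " ${DOWN_ADDRESSES:-} " in
        *" $1 "*) return 1 ;;
        *) return 0 ;;
    esac
}

# select_path -> prints "name address" of the first healthy path, or fails.
select_path() {
    local entry name address
    for entry in "${PATHS[@]}"; do
        name="${entry%%:*}"
        address="${entry#*:}"
        if path_is_healthy "$address"; then
            echo "${name} ${address}"
            return 0
        fi
    done
    echo "no healthy path" >&2
    return 1
}

select_path   # prints "primary 10.0.10.5" when nothing is marked down
```

Returning a failure when no path is healthy matters as much as the failover itself, because that is the condition that should page a human immediately.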


Metric	Before Fix	After Fix	Improvement
Failover Trigger	Manual Intervention	Automated	No manual step required
Failover Convergence Time	13 Hours	< 2 Seconds	> 99.99% reduction
Pre-flight Call Validation	None (Prose Check)	Simulated Live Call	Gate applied to every change
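The "> 99.99%" figure in the table follows directly from the two convergence times. As a quick check, taking 2 seconds as the upper bound of the target, the reduction works out as below. The function name `reduction_percent` is introduced here for illustration only.

```shell
#!/bin/bash
# Percentage reduction in failover time, using the table's figures.
before_seconds=$((13 * 60 * 60))   # 13 hours = 46800 seconds
after_seconds=2                    # upper bound of the "< 2 Seconds" target

# reduction_percent BEFORE AFTER -> prints (1 - AFTER / BEFORE) * 100
reduction_percent() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.4f\n", (1 - after / before) * 100 }'
}

reduction_percent "$before_seconds" "$after_seconds"   # prints 99.9957
```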

Lessons

Emergency call routing is not configurable the way application routing is. It demands completely isolated test environments and a mandatory live-call simulation gate before any firewall rule changes hit production systems.
Automatic failover for critical safety infrastructure cannot be optional. A manual recovery runbook that spans 13 hours is not a viable fallback strategy—it represents a systemic architecture failure with fatal human costs.
Regulatory compliance defines the baseline floor, not the architectural ceiling. Systems that fulfil a legal obligation to carry emergency traffic must be designed with the same high availability and redundancy as your most profitable commercial service.
A silent failure should trigger an alert within seconds, not hours. Telemetry must actively monitor for the suspicious absence of expected baseline call volumes (dead silence monitoring), rather than relying purely on the presence of explicit system error logs.
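The last lesson can be made concrete with a small check that compares observed call volume against an expected baseline and alerts on a shortfall, whether or not any error was logged. This is a minimal sketch: the function name `is_silent`, the baseline of 40 calls per window, and the 25% threshold are illustrative assumptions, not figures from the incident.

```shell
#!/bin/bash
# Hypothetical dead-silence check: alert when observed emergency-call volume
# falls far below the expected baseline, even if no errors were logged.

# is_silent OBSERVED BASELINE MIN_PERCENT
# Returns 0 (silent, should alert) when OBSERVED is below MIN_PERCENT of
# BASELINE. Integer arithmetic only: observed * 100 < baseline * min_percent.
is_silent() {
    local observed="$1" baseline="$2" min_percent="$3"
    (( observed * 100 < baseline * min_percent ))
}

# Illustrative window: expect about 40 calls, alert below 25% of that.
if is_silent 0 40 25; then
    echo "ALERT: emergency call volume is 0, expected about 40"
fi
```

A baseline of zero never alerts in this sketch, so a real deployment would need a sensible per-region, per-time-of-day baseline for the comparison to mean anything.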

Engineering Glossary

Call routing layer — The network infrastructure component responsible for evaluating telecommunications traffic destinations and binding calls to active carrier or emergency service endpoints.

Dead silence monitoring — An alerting paradigm that monitors for the complete absence of expected traffic or signals within a specific time window, used when broken components drop traffic cleanly without generating error metrics.

Public Safety Answering Point (PSAP) — The central operational agency or dispatch center responsible for receiving national emergency calls and routing them directly to police, fire, or ambulance services.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.
