Azure Front Door didn't fail over during a real multi-region DR drill. Here's what went wrong, how we fixed it, and how to design reliable failover.
Table of Contents
- The Story / Background
- Core Concepts: How Azure Front Door Failover Really Works
- Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR
- Architecture Diagram
- Best Practices for Azure Front Door Multi-Region DR
- Common Pitfalls (and How to Avoid Them)
- FAQ
- Conclusion
- References
A few quarters ago we ran what we thought would be a routine multi-region DR game day on Azure. The plan was simple: simulate a primary region failure, watch Azure Front Door detect the issue, fail over to the secondary region, and go for coffee feeling smug.
Instead, Front Door stared at our "dead" region and kept happily sending it traffic. Users got timeouts. Dashboards lit up. Our DR runbooks suddenly looked very theoretical. I'll walk through what actually happened, how we debugged it, and the patterns I use now whenever I put Azure Front Door in front of multi-region workloads.
The Story / Background
The architecture we thought we had
This was a fairly typical enterprise setup:
- Front door / CDN: Azure Front Door Standard/Premium with WAF
- Two Azure regions:
- Region A (primary) – AKS + internal Application Gateway, Azure SQL with geo-replica
- Region B (secondary) – warm standby AKS + App Gateway, Azure SQL geo-replica
- Routing mode: Active-passive (priority routing) in Front Door
- Health probes: Configured at the origin group level to hit `/health` on each region's App Gateway
- Infra-as-Code: Terraform for Front Door, AKS, App Gateway, SQL, and plumbing
- Observability: Azure Monitor, Log Analytics, Application Insights, plus synthetic checks from multiple locations
On paper, this ticked all the boxes: multi-region, DR runbooks, IaC, WAF in front, and tests.
The drill
The DR playbook was:
- Simulate a partial outage in Region A.
- Observe Front Door marking the primary origin unhealthy.
- Confirm automatic failover to Region B.
- Run smoke tests and declare the drill successful.
Simulation method: we applied a deny rule via a network security group (NSG) on the primary App Gateway subnet to effectively blackhole traffic from Front Door, mimicking a critical failure in the app tier.
What actually happened
- Front Door did not immediately fail over.
- Users got intermittent timeouts and 5xx errors, and traffic kept hitting Region A long enough that, had this been real, it would have been a production-level incident.
- Our synthetic checks (which hit the Front Door endpoint) kept reporting "green" for several minutes.
- Logs seemed contradictory: App Gateway showed traffic drops; Front Door metrics looked almost normal.
It took a painful hour-plus of log diving and config reviews to realize:
- Our health probe path `/health` was still responding `200 OK` from a separate "status" service that hadn't been affected by the simulated failure.
- The probe interval and sample size made failover slower than our target RTO.
- Some internal services were bypassing Front Door and talking directly to Region A's private endpoints, so even if Front Door had failed over, we still had partial breakage.
The short version: the app died, but the health probes didn't. And Front Door did exactly what we told it to do, not what we thought we configured.
Core Concepts: How Azure Front Door Failover Really Works
Let's unpack what matters for Azure Front Door in a multi-region DR setup.
Origin groups, priorities, and routing
In Azure Front Door Standard/Premium:
- You define origin groups (backend pools).
- Within a group, each origin (Region A, Region B) can have:
- A priority (for active-passive)
- A weight (for active-active / traffic split)
- Front Door sends traffic to the healthy origin with the lowest priority value (1 = most preferred).
- If that origin becomes unhealthy, traffic fails over to the next-priority origin (a toy sketch of this selection logic follows below).
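To make that concrete, here's a toy Python model of the selection behavior described above. It's purely illustrative (plain Python, not an Azure API), but it captures the rule: healthy origins only, lowest priority value wins, and weights split traffic among ties.

```python
from dataclasses import dataclass

@dataclass
class Origin:
    name: str
    priority: int  # lower value = more preferred
    weight: int
    healthy: bool

def select_origins(origins: list[Origin]) -> list[Origin]:
    """Return the origins that would receive traffic: all healthy
    origins sharing the lowest priority value. Traffic is then split
    among them proportionally to weight."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return []  # total outage: nothing left to route to
    best = min(o.priority for o in healthy)
    return [o for o in healthy if o.priority == best]

# Active-passive: Region A is preferred while healthy; once its probes
# fail, Region B (priority 2) starts receiving all traffic.
pool = [
    Origin("app-region-a", priority=1, weight=1000, healthy=False),
    Origin("app-region-b", priority=2, weight=1000, healthy=True),
]
print([o.name for o in select_origins(pool)])  # ['app-region-b']
```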
The word "healthy" hides a lot of detail.
Health probes and what "healthy" really means
Health probes are where most DR drills go to die:
- Probes are configured per origin group with:
- Protocol & port (HTTP/HTTPS, 80/443, etc.)
- Path (e.g., `/healthz`, `/live`, `/ready`)
- Interval & sample size
- Front Door considers an origin healthy if it gets enough 2xx/3xx responses from the probe within the configured sample window.
- It considers an origin unhealthy after enough failures/timeouts in that window.
Key gotchas:
- If your probe hits a different component than your critical path (e.g., a static health page, a separate sidecar), you'll see green while users are screaming.
- If the probe is too forgiving (long intervals, large sample size), failover is slower than your RTO.
- If the probe path is behind aggressive caching or a CDN rule, Front Door might be probing a cached thing, not your real app.
Active-active vs active-passive in DR context
- Active-passive (priority routing)
- Simpler mental model: Region A is primary, Region B is standby.
- Good when your data tier or regulatory constraints make multi-master tricky.
- Active-active (latency / weighted)
- Better utilization and resilience, but more complex for stateful workloads.
- Requires careful handling for session affinity, data consistency, and rollouts.
Front Door supports both via routing rules and origin group configuration, but DR behavior and testing strategy differ.
Data tier is not Front Door's job
Front Door only handles HTTP(S) routing. Your data layer is your responsibility:
- Azure SQL with active geo-replication or auto-failover groups
- Cosmos DB with multi-region writes
- Redis with geo-replication or region-local caches
- Storage accounts with RA-GRS or dual-write patterns
If your data tier can't fail over fast enough, Front Door can swap regions all day and users will still see errors or stale data.
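One way to keep the app itself indifferent to SQL failovers is to point connection strings at an auto-failover group's listener endpoints, whose DNS names stay stable when the primary moves between regions. A minimal sketch in Python, assuming a hypothetical failover group name; the stdlib TCP check is only a reachability smoke test, not a substitute for a real database login:

```python
import socket

# Hypothetical failover group name. With Azure SQL auto-failover groups,
# the listener DNS names stay stable across failovers, so app config
# shouldn't need to change when the primary region moves.
FAILOVER_GROUP = "fg-contoso-app"
READ_WRITE_LISTENER = f"{FAILOVER_GROUP}.database.windows.net"
READ_ONLY_LISTENER = f"{FAILOVER_GROUP}.secondary.database.windows.net"

def tcp_reachable(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Cheap reachability check against a SQL listener endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for listener in (READ_WRITE_LISTENER, READ_ONLY_LISTENER):
    state = "reachable" if tcp_reachable(listener) else "unreachable"
    print(f"{listener}: {state}")
```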
Observability for failover
For real DR:
- Azure Monitor & Log Analytics for Front Door metrics and logs
- Application Insights for dependency failures, response times, distributed tracing
- Synthetic tests (multi-region) that hit the Front Door endpoint with app-level expectations
- End-to-end dashboards showing:
- Front Door health vs backend health
- Per-region error rates
- Failover events and timings (a metrics-query sketch follows this list)
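If you want to pull origin health programmatically (for dashboards or post-drill reports), a sketch along these lines using the `azure-monitor-query` SDK can work. The resource ID is a placeholder, and the `OriginHealthPercentage` metric name is an assumption to verify against the metrics your profile actually emits:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Placeholder resource ID; substitute your subscription and profile.
FRONT_DOOR_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-network-prod"
    "/providers/Microsoft.Cdn/profiles/fd-prod-profile"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    FRONT_DOOR_ID,
    metric_names=["OriginHealthPercentage"],  # assumed metric name; verify
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=["Average"],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.average is not None:
                print(f"{point.timestamp}  origin health: {point.average:.0f}%")
```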
Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR
1. Define RTO/RPO and failure modes
Before YAML and Terraform, write down:
- RTO – how fast must failover complete?
- RPO – how much data loss can you tolerate?
- Failure modes you care about:
- Region outage
- App tier outage
- Partial dependency outage (e.g., DB or cache)
- Front Door misconfig / WAF block
Agree on these targets with product, business, and security. DR that only works for "region disappeared" but not "DB is slow" is half a solution.
2. Design origin groups and health probe strategy
For an active-passive setup:
- Single origin group with two origins: `app-region-a`, `app-region-b`.
- Use priority: Region A = 1, Region B = 2.
- Configure probes to hit a realistic but cheap path, e.g. a `/readyz` endpoint that:
- Checks the app's critical dependencies (DB, cache, queue) at a lightweight level.
- Returns non-2xx when something essential is broken (a minimal endpoint sketch follows this list).
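Here's a minimal `/readyz` sketch using only the Python standard library. The dependency checks are stubs; you'd replace them with real, short-timeout calls against the region's DB and cache:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    # Stub: in a real service, run a cheap query (e.g., SELECT 1)
    # against this region's database with a short timeout.
    return True

def check_cache() -> bool:
    # Stub: e.g., a Redis PING with a short timeout.
    return True

CHECKS = {"database": check_database, "cache": check_cache}

class ReadyzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        results = {name: check() for name, check in CHECKS.items()}
        healthy = all(results.values())
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        # Non-2xx when anything essential is broken, so the Front Door
        # probe sees the same failure users would.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "no-store")  # keep probes out of caches
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadyzHandler).serve_forever()
```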
3. Implement with Terraform (example)
Here's a simplified Terraform snippet for Azure Front Door Standard/Premium with two origins and a health probe tuned for DR:
```hcl
# Resource Group
resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "East US"
}

# Azure Front Door Profile
resource "azurerm_cdn_frontdoor_profile" "prod" {
  name                = "fd-prod-profile"
  resource_group_name = azurerm_resource_group.network.name
  sku_name            = "Standard_AzureFrontDoor"

  tags = {
    environment = "production"
    purpose     = "multi-region-dr"
  }
}

# Front Door Endpoint
resource "azurerm_cdn_frontdoor_endpoint" "prod" {
  name                     = "fd-prod-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id

  tags = {
    environment = "production"
  }
}

# Origin Group with Health Probes
resource "azurerm_cdn_frontdoor_origin_group" "app" {
  name                     = "og-app-multiregion"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id
  session_affinity_enabled = false

  health_probe {
    interval_in_seconds = 15
    path                = "/readyz"
    protocol            = "Https"
    request_type        = "GET"
  }

  load_balancing {
    additional_latency_in_milliseconds = 0
    successful_samples_required        = 3
    sample_size                        = 4
  }
}

# Primary Origin (Region A)
resource "azurerm_cdn_frontdoor_origin" "app_region_a" {
  name                           = "app-region-a"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-eastus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 1
  weight                         = 1000
  enabled                        = true
  certificate_name_check_enabled = true
}

# Secondary Origin (Region B)
resource "azurerm_cdn_frontdoor_origin" "app_region_b" {
  name                           = "app-region-b"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-westus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 2
  weight                         = 1000
  enabled                        = true
  certificate_name_check_enabled = true
}

# Route to map requests to origin group
resource "azurerm_cdn_frontdoor_route" "app_route" {
  name                          = "app-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.prod.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.app.id

  # Required by the azurerm provider (also establishes dependency ordering)
  cdn_frontdoor_origin_ids = [
    azurerm_cdn_frontdoor_origin.app_region_a.id,
    azurerm_cdn_frontdoor_origin.app_region_b.id,
  ]

  patterns_to_match      = ["/*"]
  supported_protocols    = ["Http", "Https"]
  https_redirect_enabled = true
  forwarding_protocol    = "HttpsOnly"
  link_to_default_domain = true
}
```
4. Build DR-aware pipelines and configuration management
- Treat Front Door config as code (Terraform/Bicep).
- Protect it with:
- Pull requests and mandatory reviews.
- Policy checks (e.g., a rule that every origin group must have a health probe configured).
- Automated validation in a non-prod "chaos" environment.
- Build pipelines that can:
- Temporarily disable an origin (simulated outage).
- Flip priorities if you need a manual failover.
Example Azure CLI snippet to temporarily disable Region A origin:
```bash
#!/bin/bash
# Disable primary origin for DR testing
az afd origin update \
  --resource-group rg-network-prod \
  --profile-name fd-prod-profile \
  --origin-group-name og-app-multiregion \
  --origin-name app-region-a \
  --enabled-state Disabled

echo "Origin app-region-a has been disabled. Traffic should fail over to app-region-b."

# Monitor failover progress
echo "Monitoring Front Door metrics for 5 minutes..."
sleep 300

# Re-enable origin after test
read -p "Re-enable primary origin? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  az afd origin update \
    --resource-group rg-network-prod \
    --profile-name fd-prod-profile \
    --origin-group-name og-app-multiregion \
    --origin-name app-region-a \
    --enabled-state Enabled
  echo "Origin app-region-a has been re-enabled."
fi
```
Use this in non-prod to safely observe Front Door's behavior.
5. Implement synthetic tests and dashboards
- Create synthetic tests that:
- Hit `https://app.contoso.com/healthcheck-end-to-end`
- Validate response code, body, and latency
- Run from multiple Azure regions (or external providers)
- Build dashboards that show, per region:
- Front Door origin health state
- App response times
- Error rates and timeouts
Ensure your on-call runbook includes how to read these graphs during a DR event.
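A synthetic check with app-level expectations can start as small as this stdlib-only Python sketch. The URL matches the example above; the body marker and latency budget are illustrative assumptions, so set them to whatever your app actually guarantees:

```python
import time
import urllib.request

URL = "https://app.contoso.com/healthcheck-end-to-end"
EXPECTED_MARKER = b'"status":"ok"'  # assumed response marker
MAX_LATENCY_SECONDS = 2.0           # illustrative latency budget

def run_synthetic_check() -> bool:
    """Validate status code, body content, and latency in one pass."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            body = resp.read()
            latency = time.monotonic() - start
            ok = (
                resp.status == 200
                and EXPECTED_MARKER in body
                and latency <= MAX_LATENCY_SECONDS
            )
            print(f"status={resp.status} latency={latency:.2f}s ok={ok}")
            return ok
    except Exception as exc:  # timeouts, DNS/TLS errors, non-2xx responses
        print(f"check failed: {exc}")
        return False

if __name__ == "__main__":
    run_synthetic_check()
```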
6. Run regular DR drills and chaos tests
Treat DR like CI:
- Schedule recurring game days (quarterly is a good start).
- Test different failure modes: origin disabled, DB unavailable, cache down, WAF rule gone wild.
- Time how long:
- Front Door takes to mark the origin unhealthy.
- Users experience degraded performance.
- The team takes to declare failover complete.
Capture and track those as SLOs for DR.
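A simple way to capture those timings during a drill is to poll the Front Door endpoint and log when the outage starts and ends. A rough sketch, reusing the hypothetical endpoint from earlier:

```python
import time
import urllib.request

URL = "https://app.contoso.com/healthcheck-end-to-end"  # hypothetical endpoint
POLL_SECONDS = 5

def healthy() -> bool:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

# Run this while the drill injects the failure: it records when the
# user-visible outage begins and when failover restores service.
outage_start = None
print("polling... Ctrl+C to stop")
while True:
    now = time.strftime("%H:%M:%S")
    if not healthy():
        if outage_start is None:
            outage_start = time.monotonic()
            print(f"{now} first failure observed")
    elif outage_start is not None:
        downtime = time.monotonic() - outage_start
        print(f"{now} recovered; user-visible downtime ~{downtime:.0f}s")
        outage_start = None
    time.sleep(POLL_SECONDS)
```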
Architecture Diagram
The multi-region Azure Front Door DR architecture discussed in this post breaks down as follows:
Key Components:
- Azure Front Door acts as the global load balancer with WAF protection
- Priority-based routing with Region A as primary (Priority 1) and Region B as secondary (Priority 2)
- Health probes monitor `/readyz` endpoints to determine origin health
- Geo-replicated Azure SQL ensures data availability across regions
- Azure Monitor provides comprehensive observability across all components
Traffic Flow:
- Normal Operation: User requests → Front Door → Region A (Primary) → Application Gateway → AKS → Azure SQL Primary
- During Failover: Health probe fails on Region A → Front Door redirects traffic → Region B (Secondary) → Application Gateway → AKS → Azure SQL Geo-Replica
- Monitoring: All components send telemetry to Azure Monitor and Application Insights for real-time observability
Best Practices for Azure Front Door Multi-Region DR
- Health checks must reflect real risk: probe something that depends on your critical services (DB, cache, queue) but is cheap to execute.
- Use explicit priorities for active-passive: don't rely on latency routing if your DR strategy is "primary then fail over".
- Align probe configuration with RTO: shorter intervals and smaller sample sizes mean faster failover, at the cost of more sensitivity to transient blips.
- Decouple internal vs external paths: ensure internal clients also route via Front Door (or a consistent DR mechanism), otherwise they'll keep hitting a dead region.
- Keep origin host headers consistent: use a single app host name to simplify config, TLS, and debugging.
- Tag everything: use tags for `env`, `region`, `dr-role`, `owner`, `criticality`. Helps a lot in DR reviews and cost tracking.
- Secure by default: use WAF, private origins (Private Link / internal App Gateway), and managed identities.
- Centralize observability: one place where SRE/DevOps can see Front Door + app + DB health across regions.
- Automate DR verification: after every significant infrastructure or Front Door change, run automated DR checks in lower environments.
Common Pitfalls (and How to Avoid Them)
1. Health probes hitting the wrong thing
Problem: Probes target a static `/health` endpoint that doesn't reflect real dependencies.
Impact: Front Door sees green while the app is actually broken, delaying failover or preventing it entirely.
Fix:
- Implement `/readyz` or `/healthz-deep` that checks key dependencies.
- Make sure it returns non-2xx when critical components are broken.
2. Probes behind caching or CDN rules
Problem: Health probe requests get cached or served by a rule path that hides backend errors.
Impact: Probes never see failures; Front Door won't fail over.
Fix:
- Exclude health probe paths from caching and rewrites.
- Validate with logs that probes hit the actual app.
3. Overly large sample sizes and long intervals
Problem: Probe interval = 60s, sample size = 16, successful samples required = 15.
Impact: It can take many minutes of continuous failures before Front Door marks an origin unhealthy.
Fix:
- Tune probe interval and samples to align with your RTO.
- In many enterprise setups, something like 15–30s intervals and small sample windows (e.g., 3 out of 4) is a better starting point; the sketch below shows the back-of-envelope arithmetic.
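As a rough model, you can estimate the worst-case detection delay from the probe settings. This is deliberately simplified: it assumes a rolling window that starts fully healthy, and it ignores that each Front Door edge environment probes and evaluates independently, so treat the output as a ballpark figure rather than a guarantee:

```python
def worst_case_detection_seconds(
    interval_seconds: int,
    sample_size: int,
    successful_samples_required: int,
) -> int:
    """Estimate how long an origin keeps looking healthy after it breaks:
    it stays 'healthy' until enough consecutive probes fail that the
    successes remaining in the window drop below the required count."""
    failures_needed = sample_size - successful_samples_required + 1
    return interval_seconds * failures_needed

# The pitfall configuration above: two minutes just to accumulate
# enough failed samples, per probing environment.
print(worst_case_detection_seconds(60, 16, 15))  # 120
# The tuned configuration from the Terraform example: ~30 seconds.
print(worst_case_detection_seconds(15, 4, 3))    # 30
```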
4. Internal traffic bypassing Front Door
Problem: Internal services talk directly to App Gateway or App Service in Region A.
Impact: External users may fail over via Front Door, but internal APIs and jobs still rely on the failed region.
Fix:
- Use Front Door (or an equivalent internal traffic manager) as the standard entry point for inter-service communication where DR matters.
- Or implement separate internal traffic management with the same multi-region logic.
5. No DR for the data tier
Problem: App tier is multi-region, but SQL or Redis is single-region.
Impact: Failover appears successful at the HTTP layer, but the secondary region has no usable data.
Fix:
- Plan data DR first: geo-replication, multi-region writes, failover groups.
- Wire app config (connection strings, secrets) to automatically use the correct endpoint after failover.
6. DR tests only in staging
Problem: DR game days happen in lower environments that don't mirror prod topology, traffic patterns, or data sensitivity.
Impact: False confidence. Things that worked in staging break in production.
Fix:
- Run carefully scoped DR drills in production: limited time windows, pre-announced, with a rollback plan.
- Start small (e.g., partial traffic) and grow once you've built muscle.
7. No clear runbook for Front Door changes
Problem: During an incident, engineers manually poke around in the Azure Portal, toggling origins and routing rules.
Impact: Slow response, new mistakes, hard to audit.
Fix:
- Document and automate incident playbooks:
- "Disable primary origin"
- "Force traffic to Region B"
- "Roll back to normal state"
- Implement them as scripts or pipeline tasks, not "click here, then here".
FAQ
1. Azure Front Door vs Traffic Manager vs DNS for DR?
- Front Door: Layer 7 routing, WAF, caching, modern Standard/Premium features; ideal for web/API DR.
- Traffic Manager: DNS-based routing, good for non-HTTP workloads or hybrid scenarios.
- DNS only: Very coarse and slow control. You generally layer Front Door or Traffic Manager on top of DNS, not instead of them.
For most modern web workloads, use Front Door as the primary DR switch and DNS as a coarse backup.
2. How do I test failover safely in production?
- Start by failing a small percentage of traffic (e.g., use weighted routing in a subset environment).
- Use short, well-announced windows.
- Have an automated rollback (re-enable origin, revert routing).
- Observe impact in real time on error budgets and SLO dashboards.
3. How should I choose health probe paths?
- Use a dedicated endpoint like `/readyz` or `/health-deep`.
- It should check critical dependencies in a lightweight way.
- Return non-2xx when the app is not fit to serve traffic.
- Exclude it from caching and WAF rules that could mask problems.
4. What's a reasonable failover time with Front Door?
It depends on your probe configuration, but many teams target:
- Detection: 30–90 seconds
- Failover complete: Under 2–3 minutes
If your RTO is stricter, tune probes more aggressively and mitigate false positives with solid observability and retry logic at the client layer.
5. How do I handle stateful sessions with multi-region Front Door?
Options:
- Go stateless at the app layer (recommended where possible).
- Use distributed caches (e.g., Redis) or centralized session stores that replicate between regions.
- For active-passive, consider shorter session lifetimes + re-auth on failover.
- Be careful with "sticky sessions" and ensure they don't lock users to a dead region.
6. How do I bring this pattern into a legacy environment?
- Start by putting Front Door in front of your existing primary region.
- Add a secondary region with a subset of services.
- Use DR drills in lower environments first to refine runbooks.
- Gradually move more legacy components behind consistent Front Door routing.
You don't have to go all-in on day one; even a partial DR capability is better than none.
7. How do I measure DR success?
Track:
- RTO achieved vs target during drills.
- RPO (data loss or replay needs).
- User impact during failover (error rates, latency).
- Time for engineers to execute runbooks.
- Number of incidents where DR actually saved you.
Turn those into SLOs that leadership can understand.
8. How does this compare to AWS and GCP?
Rough mapping:
- AWS: CloudFront + ALB/NLB + Route 53 health checks and routing policies.
- GCP: External HTTP(S) Load Balancer + Cloud CDN + Cloud Armor.
Concepts are similar: health checks, multi-region backends, DR drills. The main differences are in configuration models, naming, and surrounding ecosystem.
Conclusion
In our DR drill, Azure Front Door didn't "fail over" because:
- Our health probes were lying to it.
- Our expectations didn't match our configuration.
- Our DR practice was theoretical rather than muscle memory.
The good news: once you understand how Front Door evaluates backend health and how to align probes with real-world failure modes, it becomes a powerful tool for multi-region resilience.
If you take one thing from this story, let it be this:
Don't wait for a real outage to find out whether your DR works.
Start with a lower environment, codify Front Door and DR behavior in Terraform/Bicep, set up observability, and schedule regular game days. Every drill you run now is one less panic later.
If this resonated with you, follow along, drop your own DR stories in the comments, and share this with the person in your org who will be on call when Azure Front Door is your first line of defense.
References
- Azure Front Door health probes overview (Microsoft Learn)
- Designing multi-region web applications (Microsoft Azure Architecture Center)
- Azure Front Door Standard/Premium documentation
- Azure SQL Database geo-replication
- Azure Kubernetes Service multi-region best practices
Connect With Me
If you enjoyed this walkthrough, feel free to connect with me.

Top comments (0)