Azure Front Door didn't fail over during a real multi-region DR drill. Here's what went wrong, how we fixed it, and how to design reliable failover.
Table of Contents
- The Story / Background
- Core Concepts: How Azure Front Door Failover Really Works
- Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR
- Architecture Diagram
- Best Practices for Azure Front Door Multi-Region DR
- Common Pitfalls (and How to Avoid Them)
- FAQ
- Conclusion
- References
A few quarters ago we ran what we thought would be a routine multi-region DR game day on Azure. The plan was simple: simulate a primary region failure, watch Azure Front Door detect the issue, fail over to the secondary region, and go for coffee feeling smug.
Instead, Front Door stared at our "dead" region and kept happily sending it traffic. Users got timeouts. Dashboards lit up. Our DR runbooks suddenly looked very theoretical. I'll walk through what actually happened, how we debugged it, and the patterns I use now whenever I put Azure Front Door in front of multi-region workloads.
The Story / Background
The architecture we thought we had
This was a fairly typical enterprise setup:
- Front door / CDN: Azure Front Door Standard/Premium with WAF
- Two Azure regions:
- Region A (primary) – AKS + internal Application Gateway, Azure SQL with geo-replica
- Region B (secondary) – warm standby AKS + App Gateway, Azure SQL geo-replica
- Routing mode: Active-passive (priority routing) in Front Door
- Health probes: Configured at the origin group level to hit `/health` on each region's App Gateway
- Infra-as-Code: Terraform for Front Door, AKS, App Gateway, SQL, and plumbing
- Observability: Azure Monitor, Log Analytics, Application Insights, plus synthetic checks from multiple locations
On paper, this ticked all the boxes: multi-region, DR runbooks, IaC, WAF in front, and tests.
The drill
The DR playbook was:
- Simulate a partial outage in Region A.
- Observe Front Door marking the primary origin unhealthy.
- Confirm automatic failover to Region B.
- Run smoke tests and declare the drill successful.
Simulation method: we applied a deny rule via a network security group (NSG) on the primary App Gateway subnet to effectively blackhole traffic from Front Door, mimicking a critical failure in the app tier.
What actually happened
- Front Door did not immediately fail over.
- Users got intermittent timeouts and 5xx errors, and traffic kept hitting Region A long enough that, had this been real, it would have been a production-level incident.
- Our synthetic checks (which hit the Front Door endpoint) kept reporting "green" for several minutes.
- Logs seemed contradictory: App Gateway showed traffic drops; Front Door metrics looked almost normal.
It took a painful hour-plus of log diving and config reviews to realize:
- Our health probe path `/health` was still responding `200 OK` from a separate "status" service that hadn't been affected by the simulated failure.
- The probe interval and sample size made failover slower than our target RTO.
- Some internal services were bypassing Front Door and talking directly to Region A's private endpoints, so even if Front Door had failed over, we still had partial breakage.
The short version: the app died, but the health probes didn't. And Front Door did exactly what we told it to do, not what we thought we configured.
Core Concepts: How Azure Front Door Failover Really Works
Let's unpack what matters for Azure Front Door in a multi-region DR setup.
Origin groups, priorities, and routing
In Azure Front Door Standard/Premium:
- You define origin groups (backend pools).
- Within a group, each origin (Region A, Region B) can have:
- A priority (for active-passive)
- A weight (for active-active / traffic split)
- Front Door sends traffic to the healthy origin with the lowest priority value (1 = most preferred).
- If that origin becomes unhealthy, traffic fails over to the next-priority origin (a toy sketch of this selection logic follows below).
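To make that concrete, here's a toy Python model of the selection behavior described above. It's purely illustrative (plain Python, not an Azure API), but it captures the rule: healthy origins only, lowest priority value wins, and weights split traffic among ties.

```python
from dataclasses import dataclass

@dataclass
class Origin:
    name: str
    priority: int  # lower value = more preferred
    weight: int
    healthy: bool

def select_origins(origins: list[Origin]) -> list[Origin]:
    """Return the origins that would receive traffic: all healthy
    origins sharing the lowest priority value. Traffic is then split
    among them proportionally to weight."""
    healthy = [o for o in origins if o.healthy]
    if not healthy:
        return []  # total outage: nothing left to route to
    best = min(o.priority for o in healthy)
    return [o for o in healthy if o.priority == best]

# Active-passive: Region A is preferred while healthy; once its probes
# fail, Region B (priority 2) starts receiving all traffic.
pool = [
    Origin("app-region-a", priority=1, weight=1000, healthy=False),
    Origin("app-region-b", priority=2, weight=1000, healthy=True),
]
print([o.name for o in select_origins(pool)])  # ['app-region-b']
```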
The word "healthy" hides a lot of detail.
Health probes and what "healthy" really means
Health probes are where most DR drills go to die:
- Probes are configured per origin group with:
- Protocol & port (HTTP/HTTPS, 80/443, etc.)
- Path (e.g., `/healthz`, `/live`, `/ready`)
- Interval & sample size
- Front Door considers an origin healthy if it gets enough 2xx/3xx responses from the probe within the configured sample window.
- It considers an origin unhealthy after enough failures/timeouts in that window.
Key gotchas:
- If your probe hits a different component than your critical path (e.g., a static health page, a separate sidecar), you'll see green while users are screaming.
- If the probe is too forgiving (long intervals, large sample size), failover is slower than your RTO.
- If the probe path is behind aggressive caching or a CDN rule, Front Door might be probing a cached thing, not your real app.
Active-active vs active-passive in DR context
- Active-passive (priority routing)
- Simpler mental model: Region A is primary, Region B is standby.
- Good when your data tier or regulatory constraints make multi-master tricky.
- Active-active (latency / weighted)
- Better utilization and resilience, but more complex for stateful workloads.
- Requires careful handling for session affinity, data consistency, and rollouts.
Front Door supports both via routing rules and origin group configuration, but DR behavior and testing strategy differ.
Data tier is not Front Door's job
Front Door only handles HTTP(S) routing. Your data layer is your responsibility:
- Azure SQL with active geo-replication or auto-failover groups
- Cosmos DB with multi-region writes
- Redis with geo-replication or region-local caches
- Storage accounts with RA-GRS or dual-write patterns
If your data tier can't fail over fast enough, Front Door can swap regions all day and users will still see errors or stale data.
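One way to keep the app itself indifferent to SQL failovers is to point connection strings at an auto-failover group's listener endpoints, whose DNS names stay stable when the primary moves between regions. A minimal sketch in Python, assuming a hypothetical failover group name; the stdlib TCP check is only a reachability smoke test, not a substitute for a real database login:

```python
import socket

# Hypothetical failover group name. With Azure SQL auto-failover groups,
# the listener DNS names stay stable across failovers, so app config
# shouldn't need to change when the primary region moves.
FAILOVER_GROUP = "fg-contoso-app"
READ_WRITE_LISTENER = f"{FAILOVER_GROUP}.database.windows.net"
READ_ONLY_LISTENER = f"{FAILOVER_GROUP}.secondary.database.windows.net"

def tcp_reachable(host: str, port: int = 1433, timeout: float = 3.0) -> bool:
    """Cheap reachability check against a SQL listener endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for listener in (READ_WRITE_LISTENER, READ_ONLY_LISTENER):
    state = "reachable" if tcp_reachable(listener) else "unreachable"
    print(f"{listener}: {state}")
```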
Observability for failover
For real DR:
- Azure Monitor & Log Analytics for Front Door metrics and logs
- Application Insights for dependency failures, response times, distributed tracing
- Synthetic tests (multi-region) that hit the Front Door endpoint with app-level expectations
- End-to-end dashboards showing:
- Front Door health vs backend health
- Per-region error rates
- Failover events and timings (a metrics-query sketch follows this list)
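If you want to pull origin health programmatically (for dashboards or post-drill reports), a sketch along these lines using the `azure-monitor-query` SDK can work. The resource ID is a placeholder, and the `OriginHealthPercentage` metric name is an assumption to verify against the metrics your profile actually emits:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient

# Placeholder resource ID; substitute your subscription and profile.
FRONT_DOOR_ID = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-network-prod"
    "/providers/Microsoft.Cdn/profiles/fd-prod-profile"
)

client = MetricsQueryClient(DefaultAzureCredential())
response = client.query_resource(
    FRONT_DOOR_ID,
    metric_names=["OriginHealthPercentage"],  # assumed metric name; verify
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=["Average"],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            if point.average is not None:
                print(f"{point.timestamp}  origin health: {point.average:.0f}%")
```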
Step-by-Step Guide: Designing Azure Front Door for Real Multi-Region DR
1. Define RTO/RPO and failure modes
Before YAML and Terraform, write down:
- RTO – how fast must failover complete?
- RPO – how much data loss can you tolerate?
- Failure modes you care about:
- Region outage
- App tier outage
- Partial dependency outage (e.g., DB or cache)
- Front Door misconfig / WAF block
Agree on these targets with product, business, and security. DR that only works for "region disappeared" but not "DB is slow" is half a solution.
2. Design origin groups and health probe strategy
For an active-passive setup:
- Single origin group with two origins: `app-region-a`, `app-region-b`.
- Use priority: Region A = 1, Region B = 2.
- Configure probes to hit a realistic but cheap path, e.g. a `/readyz` endpoint that:
- Checks the app's critical dependencies (DB, cache, queue) at a lightweight level.
- Returns non-2xx when something essential is broken (a minimal endpoint sketch follows this list).
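Here's a minimal `/readyz` sketch using only the Python standard library. The dependency checks are stubs; you'd replace them with real, short-timeout calls against the region's DB and cache:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def check_database() -> bool:
    # Stub: in a real service, run a cheap query (e.g., SELECT 1)
    # against this region's database with a short timeout.
    return True

def check_cache() -> bool:
    # Stub: e.g., a Redis PING with a short timeout.
    return True

CHECKS = {"database": check_database, "cache": check_cache}

class ReadyzHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/readyz":
            self.send_error(404)
            return
        results = {name: check() for name, check in CHECKS.items()}
        healthy = all(results.values())
        body = json.dumps({"healthy": healthy, "checks": results}).encode()
        # Non-2xx when anything essential is broken, so the Front Door
        # probe sees the same failure users would.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.send_header("Cache-Control", "no-store")  # keep probes out of caches
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), ReadyzHandler).serve_forever()
```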
3. Implement with Terraform (example)
Here's a simplified Terraform snippet for Azure Front Door Standard/Premium with two origins and a health probe tuned for DR:
```hcl
# Resource Group
resource "azurerm_resource_group" "network" {
  name     = "rg-network-prod"
  location = "East US"
}

# Azure Front Door Profile
resource "azurerm_cdn_frontdoor_profile" "prod" {
  name                = "fd-prod-profile"
  resource_group_name = azurerm_resource_group.network.name
  sku_name            = "Standard_AzureFrontDoor"

  tags = {
    environment = "production"
    purpose     = "multi-region-dr"
  }
}

# Front Door Endpoint
resource "azurerm_cdn_frontdoor_endpoint" "prod" {
  name                     = "fd-prod-endpoint"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id

  tags = {
    environment = "production"
  }
}

# Origin Group with Health Probes
resource "azurerm_cdn_frontdoor_origin_group" "app" {
  name                     = "og-app-multiregion"
  cdn_frontdoor_profile_id = azurerm_cdn_frontdoor_profile.prod.id
  session_affinity_enabled = false

  health_probe {
    interval_in_seconds = 15
    path                = "/readyz"
    protocol            = "Https"
    request_type        = "GET"
  }

  load_balancing {
    additional_latency_in_milliseconds = 0
    successful_samples_required        = 3
    sample_size                        = 4
  }
}

# Primary Origin (Region A)
resource "azurerm_cdn_frontdoor_origin" "app_region_a" {
  name                           = "app-region-a"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-eastus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 1
  weight                         = 1000
  enabled                        = true
  certificate_name_check_enabled = true
}

# Secondary Origin (Region B)
resource "azurerm_cdn_frontdoor_origin" "app_region_b" {
  name                           = "app-region-b"
  cdn_frontdoor_origin_group_id  = azurerm_cdn_frontdoor_origin_group.app.id
  host_name                      = "app-gw-westus.contoso.internal"
  http_port                      = 80
  https_port                     = 443
  origin_host_header             = "app.contoso.com"
  priority                       = 2
  weight                         = 1000
  enabled                        = true
  certificate_name_check_enabled = true
}

# Route to map requests to origin group
resource "azurerm_cdn_frontdoor_route" "app_route" {
  name                          = "app-route"
  cdn_frontdoor_endpoint_id     = azurerm_cdn_frontdoor_endpoint.prod.id
  cdn_frontdoor_origin_group_id = azurerm_cdn_frontdoor_origin_group.app.id

  # Required by the azurerm provider (also establishes dependency ordering)
  cdn_frontdoor_origin_ids = [
    azurerm_cdn_frontdoor_origin.app_region_a.id,
    azurerm_cdn_frontdoor_origin.app_region_b.id,
  ]

  patterns_to_match      = ["/*"]
  supported_protocols    = ["Http", "Https"]
  https_redirect_enabled = true
  forwarding_protocol    = "HttpsOnly"
  link_to_default_domain = true
}
```
4. Build DR-aware pipelines and configuration management
- Treat Front Door config as code (Terraform/Bicep).
- Protect it with:
- Pull requests and mandatory reviews.
- Policy checks (e.g., a rule that every origin group must have a health probe configured).
- Automated validation in a non-prod "chaos" environment.
- Build pipelines that can:
- Temporarily disable an origin (simulated outage).
- Flip priorities if you need a manual failover.
Example Azure CLI snippet to temporarily disable Region A origin:
```bash
#!/bin/bash
# Disable primary origin for DR testing
az afd origin update \
  --resource-group rg-network-prod \
  --profile-name fd-prod-profile \
  --origin-group-name og-app-multiregion \
  --origin-name app-region-a \
  --enabled-state Disabled

echo "Origin app-region-a has been disabled. Traffic should fail over to app-region-b."

# Monitor failover progress
echo "Monitoring Front Door metrics for 5 minutes..."
sleep 300

# Re-enable origin after test
read -p "Re-enable primary origin? (y/n): " -n 1 -r
echo
if [[ $REPLY =~ ^[Yy]$ ]]; then
  az afd origin update \
    --resource-group rg-network-prod \
    --profile-name fd-prod-profile \
    --origin-group-name og-app-multiregion \
    --origin-name app-region-a \
    --enabled-state Enabled
  echo "Origin app-region-a has been re-enabled."
fi
```
Use this in non-prod to safely observe Front Door's behavior.
5. Implement synthetic tests and dashboards
- Create synthetic tests that:
- Hit `https://app.contoso.com/healthcheck-end-to-end`
- Validate response code, body, and latency
- Run from multiple Azure regions (or external providers)
- Build dashboards that show, per region:
- Front Door origin health state
- App response times
- Error rates and timeouts
Ensure your on-call runbook includes how to read these graphs during a DR event.
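A synthetic check with app-level expectations can start as small as this stdlib-only Python sketch. The URL matches the example above; the body marker and latency budget are illustrative assumptions, so set them to whatever your app actually guarantees:

```python
import time
import urllib.request

URL = "https://app.contoso.com/healthcheck-end-to-end"
EXPECTED_MARKER = b'"status":"ok"'  # assumed response marker
MAX_LATENCY_SECONDS = 2.0           # illustrative latency budget

def run_synthetic_check() -> bool:
    """Validate status code, body content, and latency in one pass."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(URL, timeout=10) as resp:
            body = resp.read()
            latency = time.monotonic() - start
            ok = (
                resp.status == 200
                and EXPECTED_MARKER in body
                and latency <= MAX_LATENCY_SECONDS
            )
            print(f"status={resp.status} latency={latency:.2f}s ok={ok}")
            return ok
    except Exception as exc:  # timeouts, DNS/TLS errors, non-2xx responses
        print(f"check failed: {exc}")
        return False

if __name__ == "__main__":
    run_synthetic_check()
```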
6. Run regular DR drills and chaos tests
Treat DR like CI:
- Schedule recurring game days (quarterly is a good start).
- Test different failure modes: origin disabled, DB unavailable, cache down, WAF rule gone wild.
- Time how long:
- Front Door takes to mark the origin unhealthy.
- Users experience degraded performance.
- The team takes to declare failover complete.
Capture and track those as SLOs for DR.
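A simple way to capture those timings during a drill is to poll the Front Door endpoint and log when the outage starts and ends. A rough sketch, reusing the hypothetical endpoint from earlier:

```python
import time
import urllib.request

URL = "https://app.contoso.com/healthcheck-end-to-end"  # hypothetical endpoint
POLL_SECONDS = 5

def healthy() -> bool:
    try:
        with urllib.request.urlopen(URL, timeout=5) as resp:
            return resp.status == 200
    except Exception:
        return False

# Run this while the drill injects the failure: it records when the
# user-visible outage begins and when failover restores service.
outage_start = None
print("polling... Ctrl+C to stop")
while True:
    now = time.strftime("%H:%M:%S")
    if not healthy():
        if outage_start is None:
            outage_start = time.monotonic()
            print(f"{now} first failure observed")
    elif outage_start is not None:
        downtime = time.monotonic() - outage_start
        print(f"{now} recovered; user-visible downtime ~{downtime:.0f}s")
        outage_start = None
    time.sleep(POLL_SECONDS)
```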
Architecture Diagram
The multi-region Azure Front Door DR architecture discussed in this post breaks down as follows:
Key Components:
- Azure Front Door acts as the global load balancer with WAF protection
- Priority-based routing with Region A as primary (Priority 1) and Region B as secondary (Priority 2)
- Health probes monitor `/readyz` endpoints to determine origin health
- Geo-replicated Azure SQL ensures data availability across regions
- Azure Monitor provides comprehensive observability across all components
Traffic Flow:
- Normal Operation: User requests → Front Door → Region A (Primary) → Application Gateway → AKS → Azure SQL Primary
- During Failover: Health probe fails on Region A → Front Door redirects traffic → Region B (Secondary) → Application Gateway → AKS → Azure SQL Geo-Replica
- Monitoring: All components send telemetry to Azure Monitor and Application Insights for real-time observability
Best Practices for Azure Front Door Multi-Region DR
- Health checks must reflect real risk: probe something that depends on your critical services (DB, cache, queue) but is cheap to execute.
- Use explicit priorities for active-passive: don't rely on latency routing if your DR strategy is "primary then fail over".
- Align probe configuration with RTO: shorter intervals and smaller sample sizes mean faster failover, at the cost of more sensitivity to transient blips.
- Decouple internal vs external paths: ensure internal clients also route via Front Door (or a consistent DR mechanism), otherwise they'll keep hitting a dead region.
- Keep origin host headers consistent: use a single app host name to simplify config, TLS, and debugging.
- Tag everything: use tags for `env`, `region`, `dr-role`, `owner`, `criticality`. Helps a lot in DR reviews and cost tracking.
- Secure by default: use WAF, private origins (Private Link / internal App Gateway), and managed identities.
- Centralize observability: one place where SRE/DevOps can see Front Door + app + DB health across regions.
- Automate DR verification: after every significant infrastructure or Front Door change, run automated DR checks in lower environments.
Common Pitfalls (and How to Avoid Them)
1. Health probes hitting the wrong thing
Problem: Probes target a static `/health` endpoint that doesn't reflect real dependencies.
Impact: Front Door sees green while the app is actually broken, delaying failover or preventing it entirely.
Fix:
- Implement `/readyz` or `/healthz-deep` that checks key dependencies.
- Make sure it returns non-2xx when critical components are broken.
2. Probes behind caching or CDN rules
Problem: Health probe requests get cached or served by a rule path that hides backend errors.
Impact: Probes never see failures; Front Door won't fail over.
Fix:
- Exclude health probe paths from caching and rewrites.
- Validate with logs that probes hit the actual app.
3. Overly large sample sizes and long intervals
Problem: Probe interval = 60s, sample size = 16, successful samples required = 15.
Impact: It can take many minutes of continuous failures before Front Door marks an origin unhealthy.
Fix:
- Tune probe interval and samples to align with your RTO.
- In many enterprise setups, something like 15–30s intervals and small sample windows (e.g., 3 out of 4) is a better starting point; the sketch below shows the back-of-envelope arithmetic.
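As a rough model, you can estimate the worst-case detection delay from the probe settings. This is deliberately simplified: it assumes a rolling window that starts fully healthy, and it ignores that each Front Door edge environment probes and evaluates independently, so treat the output as a ballpark figure rather than a guarantee:

```python
def worst_case_detection_seconds(
    interval_seconds: int,
    sample_size: int,
    successful_samples_required: int,
) -> int:
    """Estimate how long an origin keeps looking healthy after it breaks:
    it stays 'healthy' until enough consecutive probes fail that the
    successes remaining in the window drop below the required count."""
    failures_needed = sample_size - successful_samples_required + 1
    return interval_seconds * failures_needed

# The pitfall configuration above: two minutes just to accumulate
# enough failed samples, per probing environment.
print(worst_case_detection_seconds(60, 16, 15))  # 120
# The tuned configuration from the Terraform example: ~30 seconds.
print(worst_case_detection_seconds(15, 4, 3))    # 30
```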
4. Internal traffic bypassing Front Door
Problem: Internal services talk directly to App Gateway or App Service in Region A.
Impact: External users may fail over via Front Door, but internal APIs and jobs still rely on the failed region.
Fix:
- Use Front Door (or an equivalent internal traffic manager) as the standard entry point for inter-service communication where DR matters.
- Or implement separate internal traffic management with the same multi-region logic.
5. No DR for the data tier
Problem: App tier is multi-region, but SQL or Redis is single-region.
Impact: Failover appears successful at the HTTP layer, but the secondary region has no usable data.
Fix:
- Plan data DR first: geo-replication, multi-region writes, failover groups.
- Wire app config (connection strings, secrets) to automatically use the correct endpoint after failover.
6. DR tests only in staging
Problem: DR game days happen in lower environments that don't mirror prod topology, traffic patterns, or data sensitivity.
Impact: False confidence. Things that worked in staging break in production.
Fix:
- Run carefully scoped DR drills in production: limited time windows, pre-announced, with a rollback plan.
- Start small (e.g., partial traffic) and grow once you've built muscle.
7. No clear runbook for Front Door changes
Problem: During an incident, engineers manually poke around in the Azure Portal, toggling origins and routing rules.
Impact: Slow response, new mistakes, hard to audit.
Fix:
- Document and automate incident playbooks:
- "Disable primary origin"
- "Force traffic to Region B"
- "Roll back to normal state"
- Implement them as scripts or pipeline tasks, not "click here, then here".
FAQ
1. Azure Front Door vs Traffic Manager vs DNS for DR?
- Front Door: Layer 7 routing, WAF, caching, modern Standard/Premium features; ideal for web/API DR.
- Traffic Manager: DNS-based routing, good for non-HTTP workloads or hybrid scenarios.
- DNS only: Very coarse and slow control. You generally layer Front Door or Traffic Manager on top of DNS, not instead of them.
For most modern web workloads, use Front Door as the primary DR switch and DNS as a coarse backup.
2. How do I test failover safely in production?
- Start by failing a small percentage of traffic (e.g., use weighted routing in a subset environment).
- Use short, well-announced windows.
- Have an automated rollback (re-enable origin, revert routing).
- Observe impact in real time on error budgets and SLO dashboards.
3. How should I choose health probe paths?
- Use a dedicated endpoint like `/readyz` or `/health-deep`.
- It should check critical dependencies in a lightweight way.
- Return non-2xx when the app is not fit to serve traffic.
- Exclude it from caching and WAF rules that could mask problems.
4. What's a reasonable failover time with Front Door?
It depends on your probe configuration, but many teams target:
- Detection: 30–90 seconds
- Failover complete: Under 2–3 minutes
If your RTO is stricter, tune probes more aggressively and mitigate false positives with solid observability and retry logic at the client layer.
5. How do I handle stateful sessions with multi-region Front Door?
Options:
- Go stateless at the app layer (recommended where possible).
- Use distributed caches (e.g., Redis) or centralized session stores that replicate between regions.
- For active-passive, consider shorter session lifetimes + re-auth on failover.
- Be careful with "sticky sessions" and ensure they don't lock users to a dead region.
6. How do I bring this pattern into a legacy environment?
- Start by putting Front Door in front of your existing primary region.
- Add a secondary region with a subset of services.
- Use DR drills in lower environments first to refine runbooks.
- Gradually move more legacy components behind consistent Front Door routing.
You don't have to go all-in on day one; even a partial DR capability is better than none.
7. How do I measure DR success?
Track:
- RTO achieved vs target during drills.
- RPO (data loss or replay needs).
- User impact during failover (error rates, latency).
- Time for engineers to execute runbooks.
- Number of incidents where DR actually saved you.
Turn those into SLOs that leadership can understand.
8. How does this compare to AWS and GCP?
Rough mapping:
- AWS: CloudFront + ALB/NLB + Route 53 health checks and routing policies.
- GCP: External HTTP(S) Load Balancer + Cloud CDN + Cloud Armor.
Concepts are similar: health checks, multi-region backends, DR drills. The main differences are in configuration models, naming, and surrounding ecosystem.
Conclusion
In our DR drill, Azure Front Door didn't "fail over" because:
- Our health probes were lying to it.
- Our expectations didn't match our configuration.
- Our DR practice was theoretical rather than muscle memory.
The good news: once you understand how Front Door evaluates backend health and how to align probes with real-world failure modes, it becomes a powerful tool for multi-region resilience.
If you take one thing from this story, let it be this:
Don't wait for a real outage to find out whether your DR works.
Start with a lower environment, codify Front Door and DR behavior in Terraform/Bicep, set up observability, and schedule regular game days. Every drill you run now is one less panic later.
If this resonated with you, follow along, drop your own DR stories in the comments, and share this with the person in your org who will be on call when Azure Front Door is your first line of defense.
References
- Azure Front Door health probes overview (Microsoft Learn)
- Designing multi-region web applications (Microsoft Azure Architecture Center)
- Azure Front Door Standard/Premium documentation
- Azure SQL Database geo-replication
- Azure Kubernetes Service multi-region best practices
Connect With Me
If you enjoyed this walkthrough, feel free to connect with me.

Top comments (0)