vaibhav bedi

Troubleshooting Real-World Network Outages in Microsoft Azure

Network outages in Azure can be stressful. One minute everything's running smoothly, the next you're getting alerts that your application is unreachable. I've been through enough of these incidents to know that having a systematic approach makes all the difference between panic and resolution.

The 3 AM Wake-Up Call

Picture this: your monitoring alerts are going off, your application isn't responding, and you need to figure out what's wrong. Fast. Azure's network stack is powerful but complex, with virtual networks, subnets, network security groups, route tables, and service endpoints all playing together. When something breaks, knowing where to look is half the battle.

Start With the Basics

Before diving into Azure-specific tools, verify the obvious stuff. I know it sounds basic, but I've seen too many incidents where we skipped this and wasted time:

Can you reach the Azure portal? If you can't, the issue might be on your end or a broader Azure service disruption. Check the Azure Status page first.

Is your service actually down? Sometimes monitoring gets it wrong. Try accessing your application from different locations or networks. Your office network might have issues while everything else works fine.

Check the Azure Service Health dashboard. Microsoft might already know about the outage. Navigate to Service Health in the portal and check for any incidents in your region.
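
If you'd rather stay in the terminal, you can also pull recent Service Health events straight from the Activity Log with the Azure CLI. This is a rough sketch - the 3-day window is arbitrary and the exact fields returned may vary:

```bash
# List Service Health events from the last 3 days (subscription-wide)
az monitor activity-log list \
  --offset 3d \
  --query "[?category.value=='ServiceHealth'].{time:eventTimestamp, status:status.value, operation:operationName.localizedValue}" \
  --output table
```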

Network Security Groups: The Silent Killers

NSGs are probably responsible for more outages than people want to admit. They're easy to misconfigure and the results are immediate.

Open your NSG in the portal and check the inbound and outbound rules. Look for recently modified rules - someone might have made a change that broke connectivity. Azure keeps an activity log, so you can see who changed what and when.
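
The Activity Log is easy to query from the CLI as well. Something like this (the resource group name is a placeholder) surfaces recent operations that touched NSGs and who performed them:

```bash
# Show who changed NSGs in this resource group over the last 7 days
az monitor activity-log list \
  --resource-group my-rg \
  --offset 7d \
  --query "[?contains(resourceId, 'networkSecurityGroups')].{time:eventTimestamp, who:caller, operation:operationName.localizedValue}" \
  --output table
```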

Use the IP Flow Verify tool in Network Watcher. This tool tells you whether traffic is allowed or denied between two points, and which NSG rule is responsible. It's saved me hours of manual rule checking.
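
Here's roughly what that looks like from the Azure CLI - the VM name, IPs, and port are placeholders for your own values:

```bash
# Does the platform allow inbound HTTPS from a client to this VM?
# Output includes the access decision and the NSG rule that matched.
az network watcher test-ip-flow \
  --resource-group my-rg \
  --vm my-vm \
  --direction Inbound \
  --protocol TCP \
  --local 10.0.0.4:443 \
  --remote 203.0.113.10:54321
```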

Route Tables and Unexpected Paths

Routes control where traffic goes, and a misconfigured route table can send your traffic into a black hole. This happens more often than you'd think, especially after someone adds a new subnet or makes changes to a firewall appliance.

Check your route tables through the portal or use Azure CLI. Look for routes with a next hop type of "None" - these explicitly drop traffic. Also watch for routes that point to network virtual appliances that might be down.

The Next Hop tool in Network Watcher shows you exactly where traffic will go for a given source and destination. Use it to trace your traffic path and find where things go wrong.
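
Both of these checks are scriptable. A quick sketch with placeholder names - list the user-defined routes, pull the effective routes on the VM's NIC, then ask Network Watcher where a specific flow will actually go:

```bash
# List user-defined routes; watch for nextHopType "None" (black hole)
az network route-table route list \
  --resource-group my-rg \
  --route-table-name my-route-table \
  --output table

# Effective routes actually applied to a VM's NIC
az network nic show-effective-route-table \
  --resource-group my-rg \
  --name my-vm-nic \
  --output table

# Where will traffic from this VM to 10.1.0.10 actually go?
az network watcher show-next-hop \
  --resource-group my-rg \
  --vm my-vm \
  --source-ip 10.0.0.4 \
  --dest-ip 10.1.0.10
```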

DNS Issues Are Network Issues Too

DNS problems look like network outages but they're actually resolution failures. Your application can't reach the database because it can't resolve the hostname to an IP address.

If you're using Azure Private DNS zones, verify that your VNet is actually linked to the zone. A missing link means your VMs can't resolve private DNS names.

Check your DNS servers in the VNet configuration. Custom DNS servers are a common source of problems. If you've configured custom DNS, make sure those servers are reachable and functioning.
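
A few CLI checks cover most of the DNS surprises. The zone, VNet, and hostname below are placeholders - swap in your own:

```bash
# Is the VNet actually linked to the private DNS zone?
az network private-dns link vnet list \
  --resource-group my-rg \
  --zone-name privatelink.database.windows.net \
  --output table

# Which DNS servers is the VNet configured to use? (empty means Azure-provided DNS)
az network vnet show \
  --resource-group my-rg \
  --name my-vnet \
  --query "dhcpOptions.dnsServers"

# From a VM inside the VNet, test resolution directly
nslookup my-db-server.database.windows.net
```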

Service Endpoints and Private Endpoints

These features are great for security but they add complexity. If your application suddenly can't reach Azure Storage or SQL Database, check whether service endpoints or private endpoints are involved.

Service endpoints route traffic to Azure services through the Microsoft backbone network instead of the internet. They require specific configuration on both the VNet subnet and the Azure service. Missing either side breaks connectivity.
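
You can confirm which service endpoints a subnet actually has with a one-liner (names are placeholders):

```bash
# Which service endpoints are enabled on this subnet?
az network vnet subnet show \
  --resource-group my-rg \
  --vnet-name my-vnet \
  --name app-subnet \
  --query "serviceEndpoints[].service"
```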

Private endpoints create a private IP address for an Azure service inside your VNet. If someone deleted or misconfigured a private endpoint, your application loses access. Check the Private Link Center to see all your private endpoints and their status.
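
From the CLI, something along these lines lists your private endpoints and shows the connection state on one of them - the endpoint name is a placeholder, and the exact output shape may differ slightly:

```bash
# List private endpoints in the resource group
az network private-endpoint list \
  --resource-group my-rg \
  --output table

# Check the connection state (Approved / Rejected / Disconnected) on one endpoint
az network private-endpoint show \
  --resource-group my-rg \
  --name sql-private-endpoint \
  --query "privateLinkServiceConnections[].privateLinkServiceConnectionState"
```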

Network Watcher Connection Monitor

This tool continuously monitors connectivity between resources. If you haven't set it up before an outage, you can still use the Connection Troubleshoot feature to test connectivity right now.

Connection Troubleshoot checks whether a VM can reach another VM, an external endpoint, or an Azure service. It shows you the exact path traffic takes and where it fails. It checks NSGs, routes, and even effective routes to give you a complete picture.
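
If you want to kick off a one-off test from the CLI instead of the portal, this is roughly the shape of it - the source VM, destination address, and port are placeholders:

```bash
# Test connectivity from a VM to a database endpoint on port 1433
az network watcher test-connectivity \
  --resource-group my-rg \
  --source-resource my-app-vm \
  --dest-address my-db-server.database.windows.net \
  --dest-port 1433
```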

Application Gateway and Load Balancer Health

Your networking might be fine but your load balancer's backend pool could be unhealthy. Check the backend health in Application Gateway or Load Balancer. If all backends are showing as unhealthy, the issue might be with your health probes, not the actual backends.

Common health probe failures include incorrect probe paths, wrong ports, or overly aggressive timeout settings. Review your probe configuration and test manually using curl or a browser.
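
Two quick checks I reach for here: pull the backend health report for an Application Gateway, then hit the probe path by hand. The gateway name, backend IP, port, and path are all placeholders:

```bash
# Application Gateway backend health at a glance
az network application-gateway show-backend-health \
  --resource-group my-rg \
  --name my-app-gateway

# Reproduce the probe manually from a machine that can reach the backend
curl -v http://10.0.1.10:8080/healthz
```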

Effective Security Rules

Azure applies NSG rules at both the subnet and NIC level. The combination of these rules determines what traffic is actually allowed. Use the Effective Security Rules view in the portal to see the final set of rules that apply to a specific NIC. This resolves the confusion when you have NSGs at multiple levels.
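
The same view is available from the CLI (the NIC name is a placeholder):

```bash
# Combined subnet-level and NIC-level rules actually applied to this NIC
az network nic list-effective-nsg \
  --resource-group my-rg \
  --name my-vm-nic \
  --output table
```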

Real-World Scenario: The Disappearing Database

Here's a recent example. An application suddenly couldn't connect to Azure SQL Database. The portal showed the database was running, there were no Azure service issues, and the application logs showed nothing but connection timeouts.

We started with IP Flow Verify - traffic was allowed by the NSGs. We checked routes - everything looked normal. Then we checked the SQL Database firewall rules. Someone had removed the subnet's virtual network rule during a cleanup task, cutting off its service endpoint access. Adding the subnet back to the firewall rules restored connectivity immediately.

The lesson? Always check service-specific firewall rules, not just network-level security.
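
For Azure SQL specifically, both the IP firewall rules and the virtual network rules are visible from the CLI - the server and resource group names below are placeholders:

```bash
# IP firewall rules on the logical SQL server
az sql server firewall-rule list \
  --resource-group my-rg \
  --server my-sql-server \
  --output table

# Virtual network (service endpoint) rules - the one that went missing in this incident
az sql server vnet-rule list \
  --resource-group my-rg \
  --server my-sql-server \
  --output table
```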

Diagnostic Logs Are Your Friend

Enable diagnostic logging for your network resources. NSG flow logs show you exactly what traffic is being allowed or denied in near real time. This data goes to a storage account or Log Analytics workspace where you can query it.

You can use Log Analytics queries to find patterns. Looking for a spike in denied connections? Query your NSG flow logs. Want to see if traffic is actually reaching your subnet? Flow logs have the answer.
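
Here's the kind of query I mean, run through the CLI. This assumes your flow logs are feeding Traffic Analytics (the AzureNetworkAnalytics_CL table) - if yours land somewhere else, adjust the table and field names; treat it as a starting point rather than a copy-paste answer:

```bash
# Count allowed vs denied flows over the last hour
# (assumes NSG flow logs with Traffic Analytics enabled; <workspace-guid> is your workspace ID)
az monitor log-analytics query \
  --workspace <workspace-guid> \
  --analytics-query "AzureNetworkAnalytics_CL
    | where TimeGenerated > ago(1h) and SubType_s == 'FlowLog'
    | summarize count() by FlowStatus_s" \
  --output table
```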

The Recovery Checklist

When you're in the middle of an outage, having a checklist helps you stay systematic:

  1. Verify the outage is real and affects users
  2. Check Azure Service Health for known issues
  3. Review recent changes in the Activity Log
  4. Verify NSG rules using IP Flow Verify
  5. Check route tables using Next Hop
  6. Test DNS resolution
  7. Verify service endpoint and private endpoint configurations
  8. Check load balancer backend health
  9. Review service-specific firewall rules
  10. Check diagnostic logs for patterns

Prevention Is Better Than 3 AM Fixes

Set up Connection Monitor to continuously test critical paths. Enable NSG flow logs. Use Azure Policy to prevent certain risky configurations. Tag your resources properly so you can track relationships.
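
Enabling flow logs is a one-time command per NSG. A sketch with placeholder names - adjust the region, NSG, and storage account to your environment:

```bash
# Enable NSG flow logs into a storage account
az network watcher flow-log create \
  --resource-group my-rg \
  --location eastus \
  --name my-nsg-flowlog \
  --nsg my-nsg \
  --storage-account myflowlogstorage \
  --enabled true
```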

Document your network architecture. When things break, you need to quickly understand what connects to what. A simple diagram saves time when you're troubleshooting under pressure.

Wrapping Up

Azure network troubleshooting gets easier with experience, but having the right tools and approach makes a huge difference. Network Watcher is your best friend. The Activity Log shows you what changed. And sometimes the issue is as simple as a checkbox someone unchecked.

The next time you face a network outage, take a breath, work through the checklist, and use the tools Azure gives you. You'll find the problem faster than you think.

What's your worst Azure network outage story? Drop it in the comments - we've all been there.
