When building infrastructure in Azure, high availability is non-negotiable. One of the fundamental tools for achieving this within a single datacenter is the Availability Set.
In this post, we’ll break down how Availability Sets work, the math behind Fault and Update domains, and the critical constraints you need to know.
What is an Availability Set?
An Availability Set is a logical grouping capability that ensures the Virtual Machines (VMs) you place within it are distributed across multiple isolated hardware nodes in a datacenter.
This distribution is crucial because it protects your applications from two specific types of disruptions:
Planned Maintenance Events: Handled by Update Domains.
Unplanned Hardware Failures: Handled by Fault Domains.
The Core Components: Fault vs. Update Domains
To understand how Azure protects your VMs, you need to understand the two dimensions of an Availability Set.
- Fault Domains (FD)
What they are: A Fault Domain represents a group of VMs that share common physical hardware, specifically a power source and a network switch (think of it as a physical server rack).
The Goal: To prevent simultaneous downtime caused by physical hardware failures (e.g., a power outage or a rack switch failure).
Configuration: An Availability Set can have up to 3 Fault Domains, although some regions support only 2. Check the limit for your region before you design around 3.
Distribution: Azure spreads your VMs across these domains. If you have 3 VMs and 3 FDs, each VM sits on a different rack.
- Update Domains (UD)
What they are: A logical group of hardware that can be rebooted at the same time.
The Goal: To protect against downtime during planned maintenance. When Azure needs to patch the underlying host OS, it will never restart more than one Update Domain at a time.
Configuration: The default is 5 Update Domains, but you can increase this up to 20.
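Distribution: When maintenance occurs, Azure reboots one Update Domain at a time, and not necessarily in numerical order. Update Domains are numbered from 0 (UD0, UD1, and so on), and the domains that are not being rebooted remain online to handle traffic.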
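The one-domain-at-a-time behavior can be sketched in a few lines of Python. This is only a toy model, not Azure's scheduler: the VM names, the VM-to-domain mapping, and the reboot order are all made-up assumptions for illustration.

```python
from typing import Dict, List

def maintenance_rounds(vm_to_ud: Dict[str, int], reboot_order: List[int]) -> List[List[str]]:
    """For each Update Domain in reboot_order, list the VMs that stay online."""
    rounds = []
    for ud in reboot_order:
        # Only VMs in the domain currently being rebooted go offline.
        online = sorted(vm for vm, domain in vm_to_ud.items() if domain != ud)
        rounds.append(online)
    return rounds

# Hypothetical set: 5 VMs, one per Update Domain (the default of 5 UDs).
vm_to_ud = {"web-0": 0, "web-1": 1, "web-2": 2, "web-3": 3, "web-4": 4}
# Assumed order; the real order is chosen by Azure.
reboot_order = [3, 0, 4, 1, 2]

rounds = maintenance_rounds(vm_to_ud, reboot_order)
for ud, online in zip(reboot_order, rounds):
    print(f"UD{ud} rebooting; still online: {online}")
```

In this model, four of the five VMs are online in every round, which is the property the Update Domain design is meant to give you.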
The Logic: How VMs are Distributed
It helps to visualize how Azure places your VMs into these buckets.
Example Scenario: Imagine you have an Availability Set configured with 2 Update Domains and 3 Fault Domains. You deploy 3 VMs.
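Update Domains: VMs are assigned in sequence and wrap around, so VM 1 lands in UD0, VM 2 in UD1, and VM 3 back in UD0. VM 1 and VM 3 therefore reboot together, and VM 2 reboots in a separate pass.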
Fault Domains: Each of the 3 VMs is placed on separate physical hardware (Rack 1, Rack 2, Rack 3) to maximize survival during a power outage.
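To make the scenario concrete, here is a minimal sketch that assigns each VM a Fault Domain and an Update Domain. It assumes simple sequential (round-robin) assignment in both dimensions; Azure's real placement logic is internal, so treat `place_vms` as a hypothetical helper and not as an Azure API.

```python
from typing import List, Tuple

def place_vms(vm_count: int, fault_domains: int, update_domains: int) -> List[Tuple[int, int, int]]:
    """Return (vm_index, fault_domain, update_domain) for each VM.

    Assumption: VMs are assigned sequentially and wrap around in both dimensions.
    """
    return [(i, i % fault_domains, i % update_domains) for i in range(vm_count)]

placement = place_vms(3, fault_domains=3, update_domains=2)
for vm, fd, ud in placement:
    print(f"VM {vm + 1}: FD{fd}, UD{ud}")
```

Under this assumption every VM gets its own Fault Domain, while VM 1 and VM 3 share UD0 because there are only two Update Domains to go around.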
The "Bucket" Calculation
What happens if you have more VMs than domains? They wrap around.
Let's say you have 14 VMs and you configured 10 Update Domains.
The Math:
The first 10 VMs fill UD0 through UD9.
The remaining 4 VMs wrap around and are placed in UD0 through UD3.
The Result:
4 Domains (UD0–UD3) contain 2 VMs each.
6 Domains (UD4–UD9) contain 1 VM each.
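Risk Assessment: During a patch cycle for UD0, only 2 VMs will go offline simultaneously. Since no domain holds more than 2 VMs, at most 2 of the 14 VMs are ever offline at once during planned maintenance.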
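The same arithmetic can be checked with a short script. It is a sketch that assumes the wrap-around assignment described above; `update_domain_counts` is a hypothetical helper name, not part of any Azure SDK.

```python
from collections import Counter

def update_domain_counts(vm_count: int, update_domains: int) -> Counter:
    """Count VMs per Update Domain, assuming sequential wrap-around assignment."""
    return Counter(i % update_domains for i in range(vm_count))

counts = update_domain_counts(vm_count=14, update_domains=10)
worst_case = max(counts.values())  # most VMs that can reboot together

for ud in sorted(counts):
    print(f"UD{ud}: {counts[ud]} VM(s)")
print(f"Worst case during patching: {worst_case} of 14 VMs offline")
```

Changing the two arguments is a quick way to see how many VMs you would lose at once for other set sizes, for example 14 VMs across the default 5 Update Domains.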
Critical Constraints & "Gotchas"
Availability Sets are powerful, but they come with strict rules.
Creation Only: You cannot add an existing VM to an Availability Set, whether it is running or stopped. You must specify the Availability Set at the time of VM creation. If you need to bring an existing VM into a set later, you will have to delete and recreate it.
Fixed Counts: Once the Availability Set is created, you cannot modify the number of Fault Domains or Update Domains. Plan ahead!
Managed Disks: Always use Managed Disks with your Availability Sets. This ensures that the disks are also placed in different storage clusters aligned with the Fault Domains, preventing a single storage failure from taking down the whole set.
The Resizing Rule
This is a common headache for administrators. If you need to resize a VM within an Availability Set (specifically to a size that requires different physical hardware), you often cannot just resize that one VM.
The Rule: You must stop (deallocate) ALL VMs in the Availability Set first.
The Reason: All running VMs in an Availability Set must reside on the same physical hardware cluster type. To move to a new size that the current cluster doesn't support, the entire set must be moved to a new cluster that supports the new size.
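The decision behind the resizing rule can be expressed as a small helper. This is a sketch under stated assumptions: `vms_to_deallocate` is a hypothetical function, the VM names are made up, and the list of sizes supported on the current cluster is passed in by hand (in practice you would look it up with Azure tooling, for example `az vm list-vm-resize-options`).

```python
from typing import Iterable, List

def vms_to_deallocate(
    target_size: str,
    sizes_on_current_cluster: Iterable[str],
    all_vms_in_set: Iterable[str],
) -> List[str]:
    """Return the VMs that must be stopped (deallocated) before the resize."""
    if target_size in set(sizes_on_current_cluster):
        # The current cluster supports the size: no set-wide deallocation needed.
        return []
    # Otherwise the whole set has to move to a cluster that supports the size.
    return sorted(all_vms_in_set)

# Illustrative inputs only.
cluster_sizes = ["Standard_D2s_v3", "Standard_D4s_v3"]
vms = ["app-vm-2", "app-vm-0", "app-vm-1"]

print(vms_to_deallocate("Standard_D4s_v3", cluster_sizes, vms))
print(vms_to_deallocate("Standard_E4s_v5", cluster_sizes, vms))
```

The first call returns an empty list because the target size is already available on the current cluster. The second returns every VM in the set, which is the "stop them all" case that catches administrators out.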
Summary
Fault Domains = Protection against Hardware/Power failures (Racks).
Update Domains = Protection against Microsoft Patching/Reboots.
Strategy: Combine Availability Sets with a Load Balancer to ensure your application remains accessible even when specific domains are down.