Amazon FSx for NetApp ONTAP - Multi-AZ - Networking explained

#aws #fsx #cloudstorage #networking

Setting up Amazon FSx for NetApp ONTAP (FSxN) feels like magic right up until you have to design the network architecture. Migrating enterprise file workloads to the cloud means dealing with strict high-availability requirements. AWS handles this elegantly, but the underlying networking mechanics can easily trip you up if you aren't paying attention.

Today, we are going completely under the hood on FSx for ONTAP networking. We'll look at how failovers actually work across Availability Zones, how to wrangle IP allocations, and a few edge cases that have definitely caused some gray hairs in production.


The Foundation: Single-AZ vs. Multi-AZ

Before we get into the weeds, let's establish how FSxN sits in your VPC.

Single-AZ Deployments are straightforward. AWS drops both your active and standby nodes into the exact same subnet. Because they share a subnet, failover is a simple IP reassignment. The active node dies, the standby node takes over the IP address, and your clients barely notice.

Multi-AZ Deployments are where things get interesting. You are putting the active node in AZ-A and the standby node in AZ-B to survive a massive data center impairment. Here is the problem: AWS subnets cannot span Availability Zones. A node in AZ-B cannot simply adopt an IP address from a subnet in AZ-A.

So, how does AWS route traffic to a stable endpoint if the underlying node keeps changing subnets? Enter the Floating IP.

The Magic of the Multi-AZ Floating IP

To bridge the gap across subnets, FSx for ONTAP uses "Floating IPs" (Endpoint IPs). When you spin up a Multi-AZ cluster, AWS asks you for an IP range. It takes IPs from this range and assigns them to your NFS, SMB, and management endpoints. iSCSI is the exception: its endpoints use IPs from the ENIs' own subnets and rely on client-side multipathing for failover instead.

Instead of attaching these IPs directly to the Elastic Network Interfaces (ENIs), Amazon FSx injects routes directly into your VPC Route Tables.

Here is what it looks like in your AWS console when you check a route table associated with your file system:

Destination        Target                     Status    Propagated
10.0.0.0/16        local                      active    No
198.19.0.12/32     eni-0abc123def4567890      active    No  <-- Floating IP (NFS/SMB)
198.19.0.13/32     eni-0abc123def4567890      active    No  <-- Floating IP (Management)

Notice how those /32 routes point to a specific ENI? If the active node fails, FSx reaches into your VPC route tables and dynamically updates those targets to point to the standby node's ENI in the other AZ. It is a brilliant, DNS-free failover mechanism.
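To make the mechanism concrete, here is a toy model of that route swap in Python. It is only a sketch of the idea, not how FSx is implemented: the `RouteTable` class, the `fail_over` helper, and the standby ENI ID are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RouteTable:
    # Maps destination CIDR -> target (an ENI ID or "local").
    routes: dict = field(default_factory=dict)


def fail_over(table: RouteTable, floating_ips: list, standby_eni: str) -> None:
    """Repoint every floating-IP /32 route at the standby node's ENI."""
    for ip in floating_ips:
        table.routes[f"{ip}/32"] = standby_eni


# Route table from the example above (ENI IDs are illustrative).
table = RouteTable({
    "10.0.0.0/16": "local",
    "198.19.0.12/32": "eni-0abc123def4567890",
    "198.19.0.13/32": "eni-0abc123def4567890",
})

# Hypothetical standby ENI in the other AZ.
fail_over(table, ["198.19.0.12", "198.19.0.13"], "eni-0fedcba9876543210")
```

Clients keep talking to the same IP the whole time. Only the route target changes, which is why no DNS update is involved.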

Real-World Scenarios & IP Ranging

If you just click through the AWS console, it defaults to grabbing the last 64 IPs of your VPC's primary CIDR for these floating IPs. But enterprise networks are rarely that clean. Let's look at some real-world scenarios.
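If you want to know which addresses that console default would land on, the arithmetic is easy to sketch with Python's standard `ipaddress` module. The function name `console_default_range` is just something I made up, and this only reproduces the "last 64 IPs" math described above; it does not query AWS.

```python
import ipaddress


def console_default_range(vpc_cidr: str) -> ipaddress.IPv4Network:
    """Return the last 64 addresses of a VPC CIDR as a /26 network."""
    vpc = ipaddress.ip_network(vpc_cidr)
    if vpc.num_addresses < 64:
        raise ValueError("VPC CIDR holds fewer than 64 addresses")
    # The last address of the CIDR minus 63 gives the start of the final /26.
    first = vpc.broadcast_address - 63
    return ipaddress.ip_network(f"{first}/26")


print(console_default_range("10.0.0.0/16"))  # 10.0.255.192/26
```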

Scenario 1: The Last 64 IPs are Taken

What if your VPC is heavily utilized and those last 64 IPs are already assigned to EC2 instances, but you have a /28 block completely free sitting in the middle of your CIDR?

You don't need to resize your VPC. You can just override the default behavior using the AWS CLI or the console's "Standard Create" flow.

AWS CLI Example:

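# PreferredSubnetId is required for MULTI_AZ_1; it picks the subnet (and AZ) of the active node.
aws fsx create-file-system \
  --file-system-type ONTAP \
  --storage-capacity 1024 \
  --subnet-ids subnet-0123456789abcdef0 subnet-0fedcba9876543210 \
  --ontap-configuration DeploymentType=MULTI_AZ_1,PreferredSubnetId=subnet-0123456789abcdef0,EndpointIpAddressRange=10.0.5.64/28,ThroughputCapacity=128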

By specifying EndpointIpAddressRange=10.0.5.64/28, AWS happily carves your floating IPs out of that specific middle block.
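Before you hand a range to the CLI, it is worth sanity-checking it against what is already in use. Here is a minimal pre-flight sketch, assuming you have exported a list of in-use IPs from somewhere (the `used_ips` list and the `range_is_usable` name are hypothetical). It is not the validation AWS itself performs.

```python
import ipaddress


def range_is_usable(vpc_cidr: str, candidate: str, used_ips: list) -> bool:
    """True if the candidate block sits inside the VPC CIDR and holds no in-use IPs."""
    vpc = ipaddress.ip_network(vpc_cidr)
    block = ipaddress.ip_network(candidate)
    inside_vpc = block.subnet_of(vpc)
    collisions = [ip for ip in used_ips if ipaddress.ip_address(ip) in block]
    return inside_vpc and not collisions


# Hypothetical inventory: one EC2 instance sits in the last 64 IPs, one elsewhere.
used = ["10.0.255.200", "10.0.5.10"]
print(range_is_usable("10.0.0.0/16", "10.0.5.64/28", used))      # True
print(range_is_usable("10.0.0.0/16", "10.0.255.192/26", used))   # False
```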

Scenario 2: Total VPC CIDR Exhaustion

You've completely run out of IP space in your primary VPC CIDR. Now what?

You can add a Secondary IPv4 CIDR block to your VPC. You can use this secondary CIDR just to host the floating IP address range. Inside the VPC, the route table entries FSx injects handle the routing natively.

The Gotcha: If you are accessing this via a Transit Gateway or Direct Connect from on-premises, you must update your on-premises routers to know about this new secondary CIDR, or your packets will drop at the edge.
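A quick way to catch this before cutover is to compare the prefixes your on-premises routers know about with the ranges your endpoints live in. The sketch below assumes you can export both as plain lists. The function name and the sample prefixes (including the 100.64.0.0/24 secondary CIDR) are hypothetical.

```python
import ipaddress


def uncovered_ranges(known_prefixes: list, endpoint_ranges: list) -> list:
    """Return the endpoint ranges that no known on-prem route covers."""
    known = [ipaddress.ip_network(p) for p in known_prefixes]
    missing = []
    for rng in endpoint_ranges:
        net = ipaddress.ip_network(rng)
        if not any(net.subnet_of(k) for k in known):
            missing.append(rng)
    return missing


# On-prem only knows the primary CIDR; the floating range lives in the secondary.
print(uncovered_ranges(["10.0.0.0/16"], ["100.64.0.0/28"]))  # ['100.64.0.0/28']
```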

Scenario 3: Using the Default 198.19.x.x Range

If you provision via infrastructure-as-code (Terraform, CloudFormation, CLI) and don't specify a range, AWS defaults to the 198.19.* block. This block is part of 198.18.0.0/15, the range reserved for network benchmarking, so it is not routed on the public internet. It's great because it saves your precious VPC IPs.

How many subnet IPs does it consume? Zero. The floating IPs exist purely as route table entries. The only IPs actually consumed from your subnets are for the ENIs themselves: in each subnet, that is 1 IP plus 1 per Storage Virtual Machine (SVM).
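The rule of thumb above (1 IP plus 1 per SVM, in each subnet) is simple enough to put in a helper for capacity planning. This is just the article's arithmetic in Python, and the function names are mine.

```python
def subnet_ips_per_eni(svm_count: int) -> int:
    """Subnet IPs one ENI consumes: 1 for the ENI plus 1 per SVM."""
    return 1 + svm_count


def multi_az_consumption(svm_count: int) -> dict:
    """Subnet IPs consumed in each of the two subnets of a Multi-AZ deployment."""
    per_subnet = subnet_ips_per_eni(svm_count)
    return {"subnet_a": per_subnet, "subnet_b": per_subnet}


print(multi_az_consumption(2))  # {'subnet_a': 3, 'subnet_b': 3}
```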

The Gotcha: Standard VPC peering cannot route 198.19.* traffic natively. If you need to access these endpoints from another VPC or on-premises, you must use an AWS Transit Gateway.

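Furthermore, you have to manually add a static route to your TGW route table to push the floating IP traffic to your VPC attachment. You can route all of 198.19.0.0/16, but if more than one VPC hosts a file system, route the specific range each file system uses instead so the routes don't collide.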
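The reason the static route is needed comes down to one check: does the endpoint range fall inside a CIDR the VPC already owns? Here is a sketch of that check, on the assumption that the TGW only learns a VPC attachment's own CIDRs automatically. The function name is hypothetical.

```python
import ipaddress

# Benchmarking range that contains the default 198.19.* block.
BENCHMARK_BLOCK = ipaddress.ip_network("198.18.0.0/15")


def needs_tgw_static_route(endpoint_range: str, vpc_cidrs: list) -> bool:
    """True if the endpoint range is outside every CIDR attached to the VPC."""
    endpoint = ipaddress.ip_network(endpoint_range)
    return not any(endpoint.subnet_of(ipaddress.ip_network(c)) for c in vpc_cidrs)


print(needs_tgw_static_route("198.19.0.0/24", ["10.0.0.0/16"]))  # True
print(needs_tgw_static_route("10.0.5.64/28", ["10.0.0.0/16"]))   # False
```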

Edge Cases

Let's wrap up with two scenarios that are notorious for causing headaches during deployment and failover testing.

1. The Transit Gateway "Blackhole"
When you route through a Transit Gateway, the TGW uses ENIs sitting in designated "attachment subnets." When you build your FSxN file system, you must select the route tables associated with those TGW attachment subnets. If you don't, FSx won't inject the floating IP routes into them. Your on-prem traffic will hit the attachment subnet, look for the floating IP, find no route, and immediately drop into a blackhole.
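You can catch this one on paper before you deploy. The sketch below assumes you already have two pieces of data in hand: which route table each TGW attachment subnet uses, and which route tables you plan to give FSx. All IDs and the function name are made up for illustration.

```python
def missing_route_tables(attachment_subnet_rtbs: dict, fsx_route_table_ids: set) -> dict:
    """Return attachment subnets whose route table FSx will NOT update."""
    return {
        subnet: rtb
        for subnet, rtb in attachment_subnet_rtbs.items()
        if rtb not in fsx_route_table_ids
    }


# Hypothetical: two TGW attachment subnets, only one route table selected in FSx.
attachment_rtbs = {
    "subnet-0aaa1111bbbb2222c": "rtb-0111aaaa2222bbbb3",
    "subnet-0ccc3333dddd4444e": "rtb-0444cccc5555dddd6",
}
selected = {"rtb-0111aaaa2222bbbb3"}

# Anything returned here is a blackhole waiting to happen.
print(missing_route_tables(attachment_rtbs, selected))
```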

2. Route Table Quotas
AWS sets a default quota of 50 routes per VPC route table. You can request an increase, but you have to do it ahead of time. FSx injects one route per floating IP, which works out to 1 + the number of SVMs. If you have a massive multi-tenant cluster with 20 SVMs, FSx will try to inject 21 routes. If your route table already has 35 routes from various VPC peerings and TGWs, that adds up to 56, and your deployment will fail because it breaches the quota. Keep an eye on those quotas!
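The quota math is worth scripting into your pre-deployment checks. Here is a minimal sketch using the numbers from this example. The function names are mine, and `quota` defaults to 50 only because that is the default discussed here.

```python
def routes_after_deploy(existing_routes: int, svm_count: int) -> int:
    """Routes in the table once FSx adds 1 + svm_count floating-IP routes."""
    return existing_routes + 1 + svm_count


def fits_quota(existing_routes: int, svm_count: int, quota: int = 50) -> bool:
    """True if the route table stays within the quota after deployment."""
    return routes_after_deploy(existing_routes, svm_count) <= quota


print(routes_after_deploy(35, 20))  # 56
print(fits_quota(35, 20))           # False
```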

Networking with FSx for ONTAP is incredibly robust once you understand the routing mechanics. Plan your CIDRs, respect the Transit Gateway attachment subnets, and let the floating IPs do the heavy lifting.

Summary: IP Consumption Cheat Sheet

Example 1: Using the Primary VPC CIDR Range

Floating IPs Needed: By default, the AWS console reserves 64 unallocated IPs from the end of your primary VPC CIDR block to use as the endpoint IP address range.
Subnet IPs Consumed: Each Elastic Network Interface (ENI) requires 1 IP address, plus 1 IP address per Storage Virtual Machine (SVM).
Calculation (2 SVMs): For a Multi-AZ deployment with 2 SVMs, you need 3 IPs per ENI. Because you must select 2 subnets (one per AZ), this setup consumes exactly 3 IPs from Subnet A and 3 IPs from Subnet B.

Example 2: Using a Secondary VPC CIDR Range

Floating IPs Needed: You set aside an unused IP range (e.g., a /28 or /24 block) from your newly attached secondary CIDR exclusively for the endpoints, preserving space in the primary CIDR.
Subnet IPs Consumed: Assuming your core ENIs are still provisioned in primary CIDR subnets, the calculation remains identical to Example 1.
Calculation (2 SVMs): The ENIs consume exactly 3 IPs from Subnet A and 3 IPs from Subnet B in the primary CIDR, while the floating IPs live safely in the secondary CIDR space. (Note: If clients connect from on-premises, make sure your on-premises routers have a route to the secondary CIDR.)

Example 3: Using a Completely Different Floating IP Range (Default 198.19.*)

Floating IPs Needed: The AWS API/CLI automatically selects an unused /24 block from the non-routable 198.19.* IETF benchmarking range. This consumes 0 IPs from your VPC CIDR space.
Subnet IPs Consumed: Because the 198.19.* addresses are virtual and injected directly via VPC Route Tables, they don't deplete your subnet IP pools.
Calculation (2 SVMs): You only pay the baseline ENI requirement. You consume exactly 3 IPs from Subnet A and 3 IPs from Subnet B. (Note: Accessing 198.19.* from outside the VPC requires an AWS Transit Gateway and manual route table updates.)