🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - New York Times: Best practices for migration to Amazon FSx for ONTAP (STG212)
In this video, Sarah Kinansky from The New York Times shares their experience migrating Windows and NFS shares to Amazon FSx for NetApp ONTAP. She explains the critical design considerations, including understanding hot data (modified within 30 days) versus cold data, calculating SSD performance tier capacity, and grouping shares into volumes based on RPO/RTO requirements. A key insight is her algorithm for SSD sizing: largest migration dataset plus sum of all hot data plus a 50% buffer, so auto-tiering triggers at 51% utilization. The migration achieved 70% cost reduction through deduplication, compression, and S3 auto-tiering, while reducing EC2 instances from 9 to 4. She emphasizes proper volume design, SSD capacity planning, stakeholder coordination, and monitoring utilization thresholds (70% warning, 90% alarm) as critical success factors.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction and Data Assessment: Understanding Requirements for FSx Migration
Good afternoon. Thank you for being here. My name is Jim White. I work for Amazon Web Services, and today we have the privilege of hearing from Arkadiusz Chojnacki from The New York Times. I'm sure we're all familiar with The New York Times. He's going to share his implementation experience with the deployment of FSx for NetApp ONTAP. Without further ado, Arkadiusz, I'm going to turn this over to you.
Thank you. My name is Sarah Kinansky, and we successfully migrated a lot of SMB shares and NFS shares to FSx. Let's start from the beginning. We have Windows shares and NFS shares, and we want to migrate these shares from different platforms. We have shares on premises, we have Windows file servers, and we want to migrate this data to FSx. In this session, I'm going to cover Amazon FSx for NetApp ONTAP design and deployment, which is the critical component of the migration and consolidation journey to FSx. I'll discuss the benefits and challenges of the migration, and if time allows, I'll take Q&A at the end.
First of all, before we start, we have to know our data. We have to understand a couple of components. First, we have to know the utilization of each share in order to design the volumes. We also have to distinguish hot data from cold data. Hot data is data that has been actively written to our shares within the last 30 days. This is important because it defines our performance requirements for FSx. We also have to know the cold data, which will later be allocated to the capacity tier. Finally, we need to know the type of data so we can predict the data reduction from deduplication and compression.
From the data protection perspective, we have to know our high availability requirements. Do we need HA? If so, we should implement FSx accordingly. We also have to know our SLA for Recovery Point Objective to define the snapshot policy and backups. If any of the volumes or data require disaster recovery, they are candidates to replicate between regions. For the purpose of this session, I will walk through all the migration steps for an imaginary project. Starting the project, we collect data from the current environment. Let's say we have nine shares, S1 through S9. We collect the data and identify the hot data, which is the data modified within the last 30 days, and we also collect how much cold data we have.
How do we get this information? If you ask anybody, they will say all their data is active. But with third-party software tools, PowerShell, or robocopy, you can identify that of 100 percent of the data, maybe only 10 to 20 percent is hot. We have nine shares, and these shares belong to three departments: department one, department two, and department three. From an HA perspective, we need HA for each share, but we don't need DR for all of them. For example, S1 holds our images, and we don't need DR for it. From the RPO and RTO perspective, we see different requirements from each department and each share, ranging from 24 hours down to 1 hour.
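As a rough sketch of how such an inventory could be collected (not a tool from the talk), the snippet below walks a share and splits bytes into hot and cold using the 30-day modification window described above; the share path and threshold are illustrative.

```python
import os
import time

HOT_WINDOW_DAYS = 30  # "hot" = modified within the last 30 days, per the talk

def classify_share(root):
    """Walk a share and total up hot vs. cold bytes by last-modified time."""
    cutoff = time.time() - HOT_WINDOW_DAYS * 86400
    hot, cold = 0, 0
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                st = os.stat(path)
            except OSError:
                continue  # skip files we cannot stat (permissions, broken links)
            if st.st_mtime >= cutoff:
                hot += st.st_size
            else:
                cold += st.st_size
    return hot, cold

if __name__ == "__main__":
    hot, cold = classify_share(r"\\fileserver\S1")  # hypothetical share path
    total = hot + cold or 1
    print(f"hot: {hot / 2**30:.1f} GiB ({100 * hot / total:.0f}%), "
          f"cold: {cold / 2**30:.1f} GiB")
```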
FSx for NetApp ONTAP Design and Migration Implementation Strategy
Let's group our shares by attributes and define the volume capacity, because this is very important. Remember that snapshot and backup policies are applied per volume, not per share. The volume size, and how we design the volumes, must cover the data size plus reserve capacity for snapshots. In my example, we reserve 20% of the capacity for snapshots. We're expecting growth of around 20% per year, and for auto-tiering, since we can set up auto-tiering per volume, we estimated that we should not expect growth over 150%.
Based on the grouping of those shares, we came to the conclusion that we will create 5 volumes. Based on current utilization, we combine everything: the data size, 20% for snapshots, and 20% for growth, and arrive at the volume size in gigabytes. As you can see, we will have volumes ranging from 140 gigabytes to 5 terabytes.
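A small worked example of the sizing rule as read from the talk (data size plus a 20% snapshot reserve plus 20% growth); the share sizes below are illustrative, not the actual figures from the project.

```python
SNAPSHOT_RESERVE = 0.20   # 20% of capacity reserved for snapshots
ANNUAL_GROWTH = 0.20      # ~20% expected growth per year

def volume_size_gib(share_sizes_gib):
    """Size a volume for a group of shares: data + snapshot reserve + growth."""
    data = sum(share_sizes_gib)
    return data * (1 + SNAPSHOT_RESERVE) * (1 + ANNUAL_GROWTH)

# e.g. one department's shares grouped into a single volume (illustrative numbers)
print(f"{volume_size_gib([60, 40]):.0f} GiB")  # 100 GiB of data -> ~144 GiB volume
```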
What is very important is understanding the data flow during migration. When we write data from the source, in the first wave everything is written to SSD. On the SSD we have deduplication and compression. After that, the data can move to the capacity tier. For example, if we have 1 terabyte of data and we set up tiering so everything automatically goes to S3, everything still has to land on the SSD first and then move to capacity. So the critical component of FSx is the SSD, the performance tier. If we don't design it correctly, we risk not being able to write to FSx at all because the SSD tier is overwhelmed.
I came up with an algorithm for all those migrations which is very accurate, so we don't have to extend the tier for a very long time. First, take the largest dataset we want to migrate to FSx; we anticipate that this data will first land in the SSD tier for deduplication and compression. Then take the sum of all the hot data from all our volumes that will live on SSD. The SSD size should be that total plus a 50% buffer, because with auto-tiering the data goes to SSD first, and we should still have about 50% available after we write all of it. The auto-tiering algorithm triggers movement of data from SSD to capacity, based on our tiering policy, once SSD utilization exceeds 51%.
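One way to express that sizing rule as a calculation; the exact base of the 50% buffer is open to interpretation, and the numbers below are placeholders, not figures from the talk.

```python
def ssd_tier_gib(largest_migration_gib, hot_data_per_volume_gib, buffer=0.50):
    """SSD (performance tier) size = largest single migration dataset
    + sum of hot data across volumes, plus a 50% buffer for auto-tiering headroom."""
    hot_total = sum(hot_data_per_volume_gib)
    return (largest_migration_gib + hot_total) * (1 + buffer)

# Illustrative: largest dataset to migrate is 5 TiB, hot data per volume listed below
print(f"{ssd_tier_gib(5120, [200, 350, 80, 500, 120]):.0f} GiB of SSD")
```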
When we do backups or move data, we also need to consider storage efficiency. We have to have enough space so the SSD can handle the amount of data we will be putting on FSx. With backups, we have exactly the same situation, though a backup is a little different: if we restore from a backup, it happens on the back end, so the data flow is much faster and can overload the SSD. This is critical. If the SSD is sized correctly, we're not going to see spikes over 70 to 80% SSD utilization.
At this point, based on the scenario, we are configuring FSx for NetApp ONTAP in two regions. Our production is in one region and our DR is in the other, with the SSD sized based on our calculation. We have five volumes, and we replicate them to region two. For region two, because it is only our target, we don't have to do multi-AZ; we can do single-AZ and save money.
First, we have to build FSx based on the design. We define the SSD, we create our volumes as FlexVol volumes, and we enable storage efficiency. During the migration, everything should go directly to S3; we want to use the SSD tier only for storage efficiency. Then we create all those volumes. At this point, we are not creating snapshots; we are doing the migration. After we finish the initial copy and initial sync, then we'll start taking snapshots. In the beginning, we don't need them.
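As a sketch of that build step, here is how one of the volumes could be created with boto3, with storage efficiency enabled and the tiering policy set to ALL so data flows on to the capacity pool during migration. The talk does not prescribe a tool, and the SVM ID, names, and size below are placeholders.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")  # region is illustrative

# Create one of the five volumes with storage efficiency on and
# an ALL tiering policy for the initial migration phase.
response = fsx.create_volume(
    VolumeType="ONTAP",
    Name="dept1_vol",  # hypothetical volume name
    OntapConfiguration={
        "StorageVirtualMachineId": "svm-0123456789abcdef0",  # placeholder SVM ID
        "JunctionPath": "/dept1_vol",
        "SizeInMegabytes": 144 * 1024,        # ~144 GiB from the sizing example
        "StorageEfficiencyEnabled": True,     # deduplication + compression
        "TieringPolicy": {"Name": "ALL"},     # push data on to the capacity pool
    },
)
print(response["Volume"]["VolumeId"])
```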
For the migration steps, once storage efficiency is enabled and everything is tiering to S3, with the SSD performance tier set up, we start the initial sync. At this point, we can use any software we want: rsync for NFS, robocopy for Windows, or third-party software. We have to finish the initial copy.
When we finish that initial copy, we have to enable auto-tiering. Why? Because all the data is in S3 right now, and we want the delta, the hot data, to land in SSD. From this point on, hot data will land in the SSD and, after it has been untouched for 30 days, for example, it will be moved to the capacity tier. Also at this point we start taking snapshots, because until now we haven't had any. A very nice thing about FSx snapshots, for example with Windows, is that they're integrated with shadow copies, so you can see previous versions in Windows.
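A sketch of flipping that tiering policy with boto3 once the initial sync is done, assuming the volume ID is known and using a 31-day cooling period to approximate the 30-day hot-data window; again, this is illustrative rather than the speaker's exact procedure.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Switch the volume from ALL to AUTO tiering after the initial copy, so hot
# data stays on SSD and moves to the capacity pool once it cools for ~30 days.
fsx.update_volume(
    VolumeId="fsvol-0123456789abcdef0",  # placeholder volume ID
    OntapConfiguration={
        "TieringPolicy": {"Name": "AUTO", "CoolingPeriod": 31},
    },
)
```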
After the initial copy, which we already did, we want to set up the backup. The backup setup is a little tricky because you configure it at the FSx file system level, not per volume; the backup schedule is a global setting for your FSx file system. If you have different requirements per volume, you should use AWS Backup or set up more than one FSx file system. Then we also have to set up replication between source and target. For this particular exercise, I really recommend using BlueXP because it's very easy to set up.
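Because automatic backups are a file-system-level setting rather than per volume, a sketch of configuring them with boto3 follows; the retention, start time, and file system ID are illustrative assumptions.

```python
import boto3

fsx = boto3.client("fsx", region_name="us-east-1")

# Automatic backups are configured on the file system, not per volume.
fsx.update_file_system(
    FileSystemId="fs-0123456789abcdef0",        # placeholder file system ID
    OntapConfiguration={
        "AutomaticBackupRetentionDays": 7,       # illustrative retention
        "DailyAutomaticBackupStartTime": "03:00",
    },
)
```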
The cutover is the most painful task. If you have a global namespace like DFSN, you have the ability to cut over individual shares, not the whole volume. If you are using a CNAME instead, it's a little bit trickier. If you want to schedule the cutover, you need to know how long it takes to sync the delta from the last 24 hours so you can estimate how much downtime you're going to incur during the cutover.
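A back-of-the-envelope sketch for estimating that cutover window from the last 24-hour delta; the delta size, copy throughput, and fixed overhead are assumptions you would replace with measurements from your own environment.

```python
def cutover_window_minutes(delta_gib, throughput_mib_s, overhead_min=15):
    """Estimate downtime: time to sync the last 24h delta plus fixed overhead
    for re-pointing DFSN/CNAME entries and validating the target."""
    sync_min = (delta_gib * 1024) / throughput_mib_s / 60
    return sync_min + overhead_min

# e.g. 50 GiB of daily change at ~100 MiB/s sustained copy throughput
print(f"~{cutover_window_minutes(50, 100):.0f} minutes of downtime")
```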
Migration Benefits, Cost Optimization, and Key Takeaways from The New York Times Experience
Before syncing the last delta, I recommend setting the source to read-only access. Sync the data, and then you can cut over and re-enable writes. The benefits of the migration are significant. We eliminated the complexity of our previous solution, managing NAS, Windows file servers, and so on; now we have everything under one pane of glass. We also optimized the cost of our SMB and NFS shares in AWS: after migrating from all the solutions we had on premises and in AWS, our cost has been reduced by approximately 70%.
Why? Because we introduced deduplication and compression, and with auto-tiering the cost of S3 is much lower than putting everything on SSD. In this small comparison, I want to compare Windows file servers against FSx for NetApp ONTAP. For the shares in our scenario, we ran 9 EC2 instances, because we needed high availability across two Availability Zones plus another set in the west region; after migrating to FSx, we shrank from 9 EC2 instances to 4. The savings from those 5 fewer EC2 instances is about 55 percent, which, as you can see, is significant.
Then there are the EBS volumes. Remember one thing when comparing with FSx: with EBS you are still provisioning storage on volumes, thick provisioning, so with any thick-provisioned solution you are paying a lot of money, not for the storage you actually use, but for the provisioned storage. The other solutions don't have an S3 tier because they don't support it; FSx does. We add the cost of, for example, 15 terabytes of S3, but it's significantly lower compared to the EBS volumes.
The challenges are significant. The design of the volumes is tricky, and there is no golden rule. What you have to do is know your data; based on your data, you can group the shares and create the volumes according to the RPO, RTO, snapshot, and backup requirements. The capacity of the SSD performance tier is critical. The most important thing is coordinating with the stakeholders. That is often very difficult, and everybody knows why: nobody can afford downtime during production hours, so it's very difficult to find the golden window.
Set up appropriate monitoring, which is critical. Utilization of the performance tier is the key metric: up to 70 percent is a warning, and at 90 percent it's an alarm and you have to react right away. We should also check CPU utilization, especially when large amounts of data are in transit; then we may have to take action.
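A sketch of wiring the 70% warning and 90% alarm into CloudWatch; this assumes the AWS/FSx StorageCapacityUtilization metric with the SSD StorageTier dimension and a hypothetical SNS topic, so verify the metric and dimension names against your own file system before relying on it.

```python
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

def ssd_utilization_alarm(name, threshold, file_system_id, sns_topic_arn):
    """Alarm when SSD (performance tier) utilization crosses a threshold."""
    cloudwatch.put_metric_alarm(
        AlarmName=name,
        Namespace="AWS/FSx",
        MetricName="StorageCapacityUtilization",  # assumed metric name
        Dimensions=[
            {"Name": "FileSystemId", "Value": file_system_id},
            {"Name": "StorageTier", "Value": "SSD"},
            {"Name": "DataType", "Value": "All"},
        ],
        Statistic="Average",
        Period=300,
        EvaluationPeriods=3,
        Threshold=threshold,
        ComparisonOperator="GreaterThanThreshold",
        AlarmActions=[sns_topic_arn],
    )

# 70% = warning, 90% = react right away (placeholder IDs and ARN)
ssd_utilization_alarm("fsx-ssd-warning", 70, "fs-0123456789abcdef0",
                      "arn:aws:sns:us-east-1:123456789012:storage-alerts")
ssd_utilization_alarm("fsx-ssd-critical", 90, "fs-0123456789abcdef0",
                      "arn:aws:sns:us-east-1:123456789012:storage-alerts")
```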
When moving data, we may also need to resize for more CPU because of the volume of data we are pushing into FSx, and latency is very important for us too. What we like about FSx is the high availability within a region across multiple Availability Zones; we don't have to worry about anything with auto-tiering, deduplication, and compression. The integration with Windows and DFSN between regions is very good, and the ransomware protection is excellent. Snapshots integrated with Windows Previous Versions, Windows ACLs, scalable performance, encrypted storage, and multi-protocol support are all valuable features.
To summarize this session: first of all, you must know your data. You have to take all variables into consideration to meet your SLA, especially for RPO and RTO. Use BlueXP to manage your FSx environment, and make sure to check out the ransomware protection features, which are very important; there was a very good session on ransomware protection this morning.
I recommend attending all the FSx sessions to better understand FSx under the hood and to learn more about this great product. Thank you so much, and have a nice time at re:Invent.
; This article is entirely auto-generated using Amazon Bedrock.