Kazuya

AWS re:Invent 2025 - New York Times: Best practices for migration to Amazon FSx for ONTAP (STG212)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - New York Times: Best practices for migration to Amazon FSx for ONTAP (STG212)

In this video, Arek Chojnacki from The New York Times shares his experience migrating SMB and NFS shares to Amazon FSx for NetApp ONTAP. He emphasizes the importance of understanding data characteristics, particularly hot data (active within 30 days) versus cold data, to properly design the SSD performance tier. He presents a specific algorithm: SSD size should equal the largest data set plus all hot data plus 50% to prevent overwhelming the tier. His team consolidated nine shares into five volumes, achieving approximately 70% cost reduction through deduplication, compression, and auto-tiering to S3. Key migration steps include initial sync with all data going to S3, enabling auto-tiering post-migration, configuring snapshots and backups, and careful cut-over planning. He highlights benefits like multi-AZ high availability, thin provisioning versus thick provisioning in EBS, and integration with Windows DFSN and shadow copy.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Introduction and FSx for NetApp ONTAP Design Fundamentals: Understanding Your Data

Good afternoon, and thank you for being here. My name is Jim White, and I work for Amazon Web Services. Today we have the privilege of hearing from Arek from The New York Times. I'm sure we're all familiar with The New York Times. He's going to share a little bit of his implementation experience with his deployment of FSx for NetApp ONTAP. So without further ado, Arek, I'm going to turn it over to you.

Thank you. My name is Arek Chojnacki, and we successfully migrated a lot of SMB shares and NFS shares to FSx. So let's start from the beginning. We have Windows shares, we have NFS shares, and we have those shares on different platforms. We have them on Isilon, we have Windows File Servers, and we want to migrate this data to FSx.

In this session, I'm going to cover Amazon FSx for NetApp ONTAP design and deployment, which is the critical component of the consolidation and migration journey to FSx. I'll also cover the benefits of migration and the challenges. And if we have time, I'll take questions.

Before we start, we have to know our data, and that means knowing a couple of things. First, we have to know the utilization of each share so that we can design the volumes. We also have to know which data is hot and which is cold. What is hot data? Hot data is data that has been active in the last thirty days, meaning it has been written to our shares within the last thirty days. Why is this important? Because it defines our performance tier, the SSD tier on FSx.

We also have to know the cold data, which will be allocated later on to our capacity tier. We have to know the type of the data as well. Why? So we can predict the data reduction from deduplication and compression. From the data protection perspective, we have to know our high availability requirements. Do we need high availability? If so, we should implement Multi-AZ in FSx.

We also have to know our SLA for the recovery point objective (RPO), because it defines the snapshot policy, the backups, and disaster recovery. Any volume or data set that requires disaster recovery is a candidate for replication between regions. For the purpose of this session, I will review all migration steps for an imaginary project. Starting our project, we collect the data from the current environment. Let's say we have nine shares, named S1 through S9.

We collect the data and identify the hot data, which is the data that is less than thirty days old. We also collect how much cold data we have. How do you get this information? If you ask anybody in IT, everybody will say, "Yeah, all my data is active." But with third-party software tools, PowerShell, or Robocopy, you can measure it. You may find that out of all the data, only ten to twenty percent is hot.
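
The talk names PowerShell, Robocopy, and third-party tools for this measurement. As a minimal sketch of the same idea in Python, the function below walks a mounted share and splits file sizes into hot and cold by last-modified time. The function names and the use of modification time as the "activity" signal are assumptions made for illustration, not the speaker's tooling.

```python
import os
import time
from pathlib import Path

HOT_WINDOW_DAYS = 30  # the talk's definition of hot data: written within the last 30 days


def summarize_share(root, now=None, hot_days=HOT_WINDOW_DAYS):
    """Walk a share and return (hot_bytes, cold_bytes) based on last-modified time."""
    now = time.time() if now is None else now
    cutoff = now - hot_days * 86400
    hot = cold = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                info = path.stat()
            except OSError:
                continue  # skip files that vanish or cannot be read during the scan
            if info.st_mtime >= cutoff:
                hot += info.st_size
            else:
                cold += info.st_size
    return hot, cold


def hot_fraction(hot_bytes, cold_bytes):
    """Share of the data that is hot, as a number between 0 and 1."""
    total = hot_bytes + cold_bytes
    return hot_bytes / total if total else 0.0
```

Running `summarize_share` once per share gives the hot and cold figures that the rest of the design depends on.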

So we have nine shares, and these nine shares belong to three departments: Department 1, Department 2, and Department 3. From a high availability perspective, we need high availability for each share, but we don't need disaster recovery for each share. For example, the S1 share contains ISO images and similar files that don't need to be replicated.

When we look at the RPO in hours, we can see the difference in requirements from each department and each share, ranging from 24 hours to one hour.

So let's group all shares by attributes and define the volume capacity, because this is very important. Remember that the snapshot policy, backup policy, and everything else is set per volume, not per share. The volume size has to cover the data size plus the reserve capacity for snapshots. In my example, I like to reserve 20% of the capacity for snapshots. Let's say we're expecting growth of around 20% per year. For the auto-grow feature, which we can set up per volume, we estimated that growth should not exceed 150%.
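
Because policies apply per volume, shares with the same requirements can share a volume. The grouping step can be sketched as below. The `Share` fields, the grouping key, and the inventory are invented for illustration and are not the figures from the talk; high availability is left out of the key because the talk says every share needs it.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Share:
    name: str
    department: str
    needs_dr: bool   # replicate to the second region?
    rpo_hours: int   # recovery point objective for this share
    size_gb: int


def group_into_volumes(shares):
    """Group shares that can live under the same per-volume policies."""
    groups = defaultdict(list)
    for share in shares:
        groups[(share.department, share.needs_dr, share.rpo_hours)].append(share)
    return dict(groups)


# Invented inventory for illustration only.
inventory = [
    Share("A", "Dept1", False, 24, 100),
    Share("B", "Dept1", True, 4, 800),
    Share("C", "Dept1", True, 4, 600),
    Share("D", "Dept2", True, 1, 1200),
    Share("E", "Dept2", True, 1, 300),
]

volumes = group_into_volumes(inventory)
for (dept, needs_dr, rpo), members in volumes.items():
    names = ", ".join(s.name for s in members)
    total = sum(s.size_gb for s in members)
    print(f"{dept} dr={needs_dr} rpo={rpo}h -> shares [{names}], {total} GB of data")
```

Grouping on exact attribute matches is the simplest choice. A real design might instead merge shares with different RPOs and apply the strictest one to the whole volume.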

So based on the grouping of those shares, we came to the conclusion that we will create five volumes based on the current utilization, so we can combine everything. We have the volume size, snapshots, 20% growth, and then we have the volume size in gigabytes. As you can see, we will have volumes ranging from 140 gigabytes to five terabytes.
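
The sizing columns can be reproduced with a few lines of arithmetic. This is a sketch under one reading of the talk: the snapshot reserve and the yearly growth are each added as a percentage of the data size, and the 150% figure is treated as an auto-grow ceiling relative to the volume size. The function names are hypothetical, and the speaker's exact arithmetic may differ.

```python
import math


def volume_size_gb(data_gb, snapshot_reserve_pct=20, annual_growth_pct=20):
    """Data size plus snapshot reserve plus one year of growth, rounded up to whole GB."""
    return math.ceil(data_gb * (100 + snapshot_reserve_pct + annual_growth_pct) / 100)


def autogrow_limit_gb(volume_gb, max_percent=150):
    """Ceiling the volume may auto-grow to, as a percentage of its initial size."""
    return math.ceil(volume_gb * max_percent / 100)


# Example with an invented 1000 GB data set.
size = volume_size_gb(1000)
print(size, autogrow_limit_gb(size))
```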

Migration Strategy and Implementation: SSD Tier Sizing, Data Flow, and Cut-Over Process

What is very important during the migration is the data flow when we write data to FSx. Data from the source always lands first on SSD. On SSD, deduplication and compression are applied, and only then can the data move to the capacity tier. For example, if we have one terabyte of data and we set the tiering policy to All, so that everything automatically goes to S3, everything still has to land on the SSD first and then move to the capacity tier.

So the critical component of FSx is the SSD, which is the performance tier. If we don't design this SSD tier correctly, we risk not being able to write to FSx because the entire SSD tier fills up. So what is the algorithm? Over the course of this migration, I came up with an algorithm that is very accurate, and with it we should not have to extend the SSD tier for a very long time.

First of all, we have the largest set of data which we want to migrate to FSx, and we anticipate this data will land first in the SSD tier for deduplication and compression. Then we have the sum of all the hot data from all our volumes. The SSD size should be the largest data set plus all the hot data, plus 50%. Why the extra 50%? With auto-tiering, data always goes to SSD first, and we should still have 50% available after we write all of this hot data, because the auto-tiering algorithm is what triggers the movement of data from the SSD to the capacity tier based on our auto-tiering policy.

For tiering to start, the SSD has to exceed 51% utilization. When we do a backup or move data with storage efficiency enabled, we need enough space for the SSD to handle the amount of data we are putting on FSx. With backup, we have exactly the same situation. A restore from backup is a little different: it happens on the backend, so data flows in much faster and can overload the SSD tier. This is critical. If the SSD tier is sized and set up correctly, we are not going to see spikes above 70 to 80% SSD utilization.
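
The sizing rule from the talk (largest data set, plus all hot data, plus 50%) fits in a small function. This is a sketch that assumes the 50% headroom applies to the sum of the first two terms; the function name, the input layout, and the example numbers are invented for illustration.

```python
import math


def ssd_tier_size_gb(volumes, headroom_pct=50):
    """Size the SSD tier from a mapping of volume name -> (total_gb, hot_gb).

    Rule: (largest data set + sum of all hot data), plus headroom on top.
    """
    largest = max(total for total, _hot in volumes.values())
    hot_sum = sum(hot for _total, hot in volumes.values())
    return math.ceil((largest + hot_sum) * (100 + headroom_pct) / 100)


# Invented example: three volumes with their total and hot data in GB.
example = {
    "vol1": (140, 20),
    "vol2": (5000, 500),
    "vol3": (1200, 150),
}
print(ssd_tier_size_gb(example))
```

For the example above, the largest set is 5000 GB and the hot data sums to 670 GB, so the function returns 8505 GB.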

At this point, based on all the situations and scenarios, we are configuring our Amazon FSx for NetApp ONTAP in two regions. Our production is in Region 1 with multi-AZ SSD based on our calculation. We have five volumes, and then we have the replication to Region 2. For Region 2, because that will be our target, we don't have to do the multi-AZ. We can do single AZ, so we can save money on the SSD tier.

First, we build based on the design. We build the FSx file system and define the SSD tier. We have our volumes, which are FlexVols, and we enable storage efficiency on them. During the migration, everything should go directly to S3. Why do we want to use the SSD tier at this point? We want to use the SSD tier only for data efficiency. Then we create all those volumes. At this point, we are not creating snapshots because we are doing the migration. After we finish the initial copy, the initial sync, we start taking snapshots. At the beginning, we don't need them.

For the migration steps, storage efficiency is set up with the tiering policy sending everything to S3, and the SSD performance tier is in place. Now we start the initial sync. At this point, we can use any software we want: rsync, Robocopy for Windows, or third-party software. We have to finish the initial copy first. When that initial copy is finished, we enable auto-tiering. Why? All the data is in S3 at this point, and from now on we want the hot data, the delta, to stay on SSD. Once we do this, new hot data lands on the SSD, stays there until it cools, for example after 30 days, and is then moved to the capacity tier. This is also when we start taking snapshots. A very nice thing about FSx snapshots, for example with Windows, is that they are integrated with shadow copy, so you can see them as shadow copies in Windows.
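
The policy switch described here (everything to the capacity tier during the initial sync, then auto-tiering with a cooling period afterward) can be sketched against the FSx `UpdateVolume` API. The helper names and the volume ID are hypothetical; the request shape follows the boto3 `update_volume` call as I understand it, so check it against the current SDK documentation before relying on it.

```python
def tiering_policy(name, cooling_days=None):
    """Build a TieringPolicy structure: name is ALL, AUTO, SNAPSHOT_ONLY, or NONE."""
    policy = {"Name": name}
    if cooling_days is not None:
        policy["CoolingPeriod"] = cooling_days  # only meaningful for AUTO and SNAPSHOT_ONLY
    return policy


def set_tiering(fsx_client, volume_id, name, cooling_days=None):
    """Apply a tiering policy to one ONTAP volume through an FSx client."""
    return fsx_client.update_volume(
        VolumeId=volume_id,
        OntapConfiguration={"TieringPolicy": tiering_policy(name, cooling_days)},
    )


# Intended use (requires boto3 and AWS credentials; the volume ID is a placeholder):
#   import boto3
#   fsx = boto3.client("fsx")
#   set_tiering(fsx, "fsvol-EXAMPLE", "ALL")        # before the initial sync
#   set_tiering(fsx, "fsvol-EXAMPLE", "AUTO", 30)   # after the initial sync completes
```

Passing the client in as a parameter keeps the helper easy to exercise with a stub before pointing it at a real file system.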

After the initial copy, which we already did, we want to set up the backup. Backup setup is a little tricky because the built-in backup is configured per FSx file system, not per volume; it is a global setting for your FSx. If you have different requirements per volume, you should use AWS Backup or set up more than one FSx file system. We also have to set up replication between source and target. For this particular exercise, I really recommend using BlueXP because it makes this very easy to set up.

The cut-over is the most painful task, I would say. First of all, if you have a global namespace like DFSN, you can cut over individual shares rather than the whole volume. Otherwise, if you are using CNAMEs, it's a little bit tricky. To schedule the cut-over, you need enough time to sync the delta from the last 24 hours, so from the size of that delta you can estimate how much downtime you're going to incur. Before starting to sync the last delta, I recommend setting the source to read-only, syncing the data, and then cutting over and enabling read-write on the target.
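
The downtime estimate mentioned here is simple arithmetic on the size of the last delta. The sketch below assumes a sustained copy throughput and a fixed allowance for the manual steps; both are inputs you would measure yourself, and the function names and example figures are invented.

```python
def delta_sync_minutes(delta_gb, throughput_mb_s):
    """Minutes needed to copy the final delta at a sustained throughput in MB/s."""
    return delta_gb * 1024 / throughput_mb_s / 60


def cutover_window_minutes(delta_gb, throughput_mb_s, fixed_steps_minutes=0):
    """Delta sync time plus an allowance for the read-only switch, namespace change, and checks."""
    return delta_sync_minutes(delta_gb, throughput_mb_s) + fixed_steps_minutes


# Order of operations as recommended in the talk.
CUTOVER_STEPS = [
    "set the source share to read-only",
    "sync the last delta to FSx",
    "repoint clients (DFSN target or CNAME)",
    "enable read-write on the new share",
]

# Invented example: 200 GB delta at 100 MB/s with 30 minutes of fixed steps.
print(f"{cutover_window_minutes(200, 100, 30):.0f} minutes")
```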

Migration Benefits, Cost Optimization Results, and Key Challenges Learned

The benefits of migration that we achieved are significant. We eliminated the complexity of our current solution, managing NAS, Windows File Servers, and so on. Right now, we have everything under one pane of glass. We also optimized the cost of SMB and NFS shares in AWS. After migration from all our solutions, which we had on-premises and in AWS, our cost has been reduced by approximately 70%. Why? Because we introduced deduplication and compression, and with the auto-tiering to S3, the cost is much lower compared to putting everything on SSD.

In this comparison, I want to look at Windows File Servers. If you migrate Windows File Servers to NetApp, in our scenario we started with nine servers, because we need high availability between AZ1 and AZ2 and another set in the West region. From the EC2 instance point of view, we shrank from nine to four, a saving of five EC2 instances. As you can see, it's significant.

Then, regarding EBS volumes, remember one thing: FSx is thin provisioned, while EBS volumes are thick provisioned. So with any thick-provisioned solution, you are paying a lot of money not for the storage you use but for the storage you provision. As for S3, there is no S3 cost under the other solution because it doesn't support S3, whereas FSx does. We added the cost of, for example, 15 terabytes of S3, but it's significantly lower compared to the EBS volumes.

Our challenges include the design of the volumes, which is tricky. There is no golden rule; you have to know your data, and based on that you can group the shares and create the volumes according to the RPO, RTO, snapshot, and backup requirements. The capacity of the SSD performance tier is critical. The most important thing is coordinating with the stakeholders. It's often very difficult, and everybody knows why: nobody can afford downtime during production hours, so it's very difficult to find the golden time.

Setting up appropriate monitoring is critical. Utilization of the performance tier is the key metric: reaching 70% should raise a warning, and 90% is an alarm that you have to react to right away. We should also check CPU utilization, especially if some people are moving large amounts of data.

Then perhaps we also have to resize the CPU because of the amount of data we are pushing into FSx. Latency is very important for us too.
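
The 70% warning and 90% alarm thresholds for the SSD tier map directly onto a small classification function. This is a minimal sketch; the function name and status labels are invented, and in practice the utilization figures would come from whatever monitoring system is already in place.

```python
def ssd_tier_status(used_gb, capacity_gb, warn_pct=70, alarm_pct=90):
    """Classify SSD tier utilization using the warning and alarm thresholds from the talk."""
    utilization = used_gb * 100 / capacity_gb
    if utilization >= alarm_pct:
        return "ALARM"    # react right away
    if utilization >= warn_pct:
        return "WARNING"  # plan to free or add SSD capacity
    return "OK"


for used in (500, 700, 900):
    print(used, ssd_tier_status(used, 1000))
```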

And here is what we like about FSx for ONTAP: high availability within the region across AZ1 and AZ2, so we don't have to worry about anything; auto-tiering; deduplication and compression; integration with Windows DFSN; disaster recovery between regions, which is very good; ransomware protection; snapshots exposed as Windows Previous Versions; Windows ACLs; scalability; performance; encryption at rest; and multi-protocol support.

And just a summary of the session. First of all, you must know your data. You have to take all variables into consideration to meet SLA, especially from the point of view of RPO and RTO. Use BlueXP to manage your FSx environment. Also check out ransomware protection, which is very important. Today there was a very good session regarding ransomware protection in the morning. And attend all FSx sessions to better understand FSx under the hood.

My time is up, I'm sorry. Thank you so much and have a nice AWS re:Invent. And I recommend attending all those sessions about FSx because for us it's a great product. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
