Kazuya


AWS re:Invent 2025 - Improve self-managed database performance and agility with Amazon FSx (STG337)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Improve self-managed database performance and agility with Amazon FSx (STG337)

In this video, Aaron Dailey and Jim discuss improving self-managed database performance using Amazon FSx file systems. They explain why customers migrate databases to the cloud, comparing fully managed versus self-managed deployment options. The session covers three FSx services suitable for databases: FSx for Windows File Server, FSx for OpenZFS, and FSx for ONTAP, highlighting their unique features like multi-protocol support, snapshots for backup in seconds, clones for creating thin copies, and cross-region replication. Real customer examples include Ava Arrow saving 60% on licensing costs using SQL Server FCI with FSx for Windows, and S&P Global running hundreds of SQL Server databases on FSx for ONTAP. Two live demonstrations show: first, testing disaster recovery for Microsoft SQL Server using FSx for ONTAP's SnapMirror replication and cloning without impacting RPO requirements; second, creating a development environment for Oracle database on FSx for OpenZFS using snapshots and clones, consuming only 40MB additional storage to duplicate an 18GB database.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Session Introduction: Improving Self-Managed Database Performance with Amazon FSx

Hello everyone. Welcome to our STG337 session. I hope everybody has been enjoying their re:Invent week thus far. In this session, we are going to cover improving self-managed database performance and agility with Amazon FSx file systems. My name is Aaron Dailey. I am a Senior Specialist Solutions Architect here at AWS, and I cover our file services. With that, I will hand it over to Jim.

Thumbnail 50

Thank you, Aaron. Can you hear me? Great. Before we go on, how many of you have familiarity with Amazon FSx? Just raise your hand. Great, good to hear. So what we are going to talk about today is obviously the importance of choosing the right FSx file system for your databases, but we are going to go through a couple of other things first with respect to how exactly customers tend to look at these types of decisions. The things we are going to talk about today are, generically, why are people coming to the cloud and why are they bringing databases to the cloud first and foremost. Secondarily, what sort of important decisions do they need to make in terms of deployment options? We have fully managed options and we have self-managed options, which I will talk about a little bit.

Why storage matters for choosing databases is another key topic. Oftentimes, we see customers overlook storage features with respect to what storage can do and what it can add to the mix. When you think about deploying a database, you have storage, networking, and compute—obviously, that is the important infrastructure stack that we need to think about as we are deploying. So how does storage enter into that? Then I am going to turn over to Aaron for the last couple of segments, which is some practical advice. We will go through a couple of demonstration scenarios with respect to a SQL Server database scenario and an Oracle database scenario. So that is our agenda for today.

Thumbnail 110

Why Customers Migrate Databases to the Cloud

I will not spend a lot of time on this one, but as we think about why customers choose to come to the cloud, and I am sure you have gone through some of these same scenarios in your own mind, oftentimes it is because there is a mandate to close down or reduce the number of data centers that might be under management. There is usually a cost element associated with that. Sometimes customers say, "I build cars, I build chips, I manufacture sneakers," whatever it is that their core business is.

Maybe IT is no longer something that they should focus on and no longer something that they should be as invested in in terms of procuring hardware and software. Obviously, they still need to deploy an infrastructure stack and still need to manage an infrastructure cloud stack in the cloud, but what if they did not have to manage the facilities? What if they did not have to worry about hardware upgrades and changing hardware in and out when it either fails or runs out of capacity?

Digital transformation seems to be on the mind of many executives these days. Really, behind that, I think the message is that customers want to move faster. They want to be more competitive. They want to be able to get a competitive edge by doing things differently with their application set, which will allow them to either make decisions faster or build things faster or, again, whatever their core business is, to allow them to do those things faster and get an edge against their competitors.

In addition to increasing agility, we often see customers who will say that their deployment in the cloud is actually more secure than it was on premises. Now, we could argue that that is not always the case, but I found it really interesting when customers say, "I feel more secure in the cloud than I felt on premises." There are a few reasons behind that, which I do not really have time to get into. The last thing I wanted to talk about today is putting data to work.

Organizations have a lot of data. What we often do not do well is look at the insights trapped inside that data. What does that data contain that will help us to be more competitive or to move faster or to learn about ourselves in terms of how we do business? This notion of putting data to work is far simpler in the cloud. Once your data is in the cloud, there is a variety of services. There are actually tens of services within AWS, for example, that will allow you to drive additional analytics against that data or to put it to work in useful ways for generative AI and machine learning, for example. These are just some of the drivers that we see. Obviously, databases are part and parcel of all of these decisions that are made when coming to the cloud.

Thumbnail 260

Deployment Options: Fully Managed vs. Self-Managed Databases

There are a couple of deployment options as I alluded to earlier. You can choose fully managed databases. There are a whole bunch of them represented on this slide. My intent is not to walk through each and every one of them, but categorically, you have relational databases and more than likely you have some non-relational databases in your infrastructure. When you think about bringing those to AWS, starting in the lower left, these are standard relational databases that we have all known for decades in many cases. Some of them are newer, but what if I want to bring those to AWS and what if I do not want to manage them myself? What if I simply want to take over at the layer where I have already got the database: I create my tables, I manage how I use those databases, and so on. This is what we often see customers do who want to get away from the complexity of managing databases on-premises.

With respect to non-relational databases, you can see we've mentioned a few there. Oftentimes when customers come to AWS, they will look at things like DynamoDB or ElastiCache, for example, to replace those non-relational databases they may have run on premises. And then over on the right-hand side, lower right, I'm not going to spend a lot of time on that, but the idea is I might have a big data farm. Maybe I've got some Hadoop in my environment, and I want to bring that to AWS. I might choose to instead bring that to EMR and host that big data and the applications that use it in EMR and of course a variety of analytic solutions including Elasticsearch.

Thumbnail 360

So again, there are lots of ways you can bring your data and your databases to AWS. Now I want to segue into this notion of self-managed databases and why would I choose to self-manage a database? Why would I bring Oracle to AWS or SQL Server to AWS, for example, and not choose RDS, which would be the fully managed implementation of those databases? There are some reasons we've listed here. It tends to be a combination of business drivers and technical drivers. Cost is almost always one of those drivers, optimization of TCO. However, if we look at some other things which tend to be pretty important in many cases, with RDS, for example, you get set ways to back up and protect your data.

Well, what if I want to do that on a more granular interval or what if I want to have complete control over the ability to drive how I back those databases up and how I recover them? That's not possible with RDS, but with the self-managed implementation of a database, it certainly is. Or, what if I want to scale my database over multiple regions, or multiple Availability Zones I should say, for the highest resilience? That's also possible with a self-managed database scenario. And of course there are some technical drivers. What if I want to control the entire infrastructure stack? I want to choose my Oracle version. I want to choose my Linux version. I want to choose other things running on that database server that maybe complement that database deployment. I can't do that with RDS.

Thumbnail 450

So these are but some of the reasons why our customers are choosing self-managed databases. Now, if I didn't say this before, we almost always see customers choose some combination of self-managed and fully managed databases, so it's not exclusive one or the other. It tends to be a mixture of the two. As we look at the things that are important with respect to deploying self-managed databases, over on the left-hand side we have different database engines. You've seen those on previous slides. Over on the right-hand side we have some choices that you as customers would need to make when deploying those self-managed databases, namely at the top of the stack you've got compute, various varieties of compute, whether it's native EC2, whether it's ECS, EKS from a containerization perspective, or whether it's Elastic VMware Service, which is one of our newer compute services, EVS, that's VMware running in AWS.

Understanding Amazon FSx: Storage Choices for Database Deployments

And then, the heart of our discussion today really is around storage and what storage am I going to choose and why. So as we think about deploying storage under those databases, oftentimes customers choose block connectivity from their database servers to their storage, but not always. That's of course not a requirement. Sometimes they choose file protocols, namely NFS and SMB. So what we're showing here is that you've got a couple of broad options, one being EBS, which I think we're all familiar with. That's block storage and block storage only. It works very well. It's been around for a long time, and we have a tremendous number of databases deployed on EBS.

Thumbnail 560

From an FSx perspective, what's cool about FSx is you get the choice between block or file depending upon which FSx family member you choose. You might choose block or file, and we'll get into that in a second. So lastly here at the bottom, you get to choose your database engine, you get to choose your compute, and you get to choose your storage with the self-managed database deployment. So let's talk now about FSx. FSx has been around since about 2018 when we launched our first two services, that was Windows and Lustre back in 2018. We launched them at re:Invent back in that time frame, and the way to think about FSx is that F and S stand for file system if that's not obvious. X is a variable just like it was in your high school algebra class, and in the case of X, there are four different possibilities that X can be equivalent to, one being Lustre, which is for HPC workloads. We're not going to talk about that today because that's not really for databases, but FSx services that are appropriate for databases are Windows, OpenZFS, and ONTAP.

We'll talk about those three. You get your choice here based on what your requirements are. The design goal of FSx is to build storage services based on the world's most popular file systems. We look at file systems that are used predominantly by many customers and we bring those to FSx. The design goal of FSx is to build a like-for-like experience in AWS. If I love Windows and I'm used to running Windows as my file server on-premises, I can run FSx for Windows and the way I administer that is exactly the way I would administer a Windows server. There are some nuances on top of that that are AWS specific, but the idea is we want to build a like-for-like experience.

Thumbnail 680

The same is true with FSx for OpenZFS or FSx for ONTAP. If you're accustomed to on-premises systems, if you've built automation around that, if you have years of experience with one of those and you want to bring all of that experience with you along with all the automation and scripting to AWS, you can do that. They operate exactly the same way in AWS. These are the three services that are most interesting in a database environment. As you can see, Windows, FSx for OpenZFS, and FSx for ONTAP all have different sets of features and capabilities. There are reasons why you might choose one versus the other.

Thumbnail 700

At a high level, I want to talk through starting on the far right. We believe it's important to think about familiarity. If you're accustomed to Windows or ZFS or ONTAP, more than likely you want to remain accustomed to that in AWS. There's no reason to change specifically if you want to bring that familiarity with you. As we work with customers, we always think about familiarity. What are you using on-premises and why? Would you like to preserve any or all of that when you come to AWS?

Each service has unique features and capabilities, but it's always important to remember that a list of features and capabilities doesn't really matter if they're not what you need. Which of those features and capabilities are important to you? Coming to the left-hand side of the slide, we're going to start at the bottom. From a resiliency perspective, each of these services supports either a single Availability Zone or a multi-AZ deployment, the latter providing the highest resilience. As we go up the stack, we have two of these services that provide a richer data management experience: FSx for OpenZFS and FSx for ONTAP. Lastly, FSx for ONTAP is the only one that offers multiple protocols.

Thumbnail 810

FSx for ONTAP offers block and file protocols. For Amazon FSx for Windows, that's SMB. FSx for OpenZFS is NFS. FSx for ONTAP actually supports four different protocols: both SMB and NFS on the file side, and on the block side, it's iSCSI and NVMe over TCP. If you're looking for a Swiss Army knife of storage in terms of protocol support, that would be FSx for ONTAP. This is a small decision tree, or maybe a decision bush if you will. There are a couple of things that it asks and questions that it poses. I want to start in the middle. The protocols are named here, so if you say I only connect my database servers over a block protocol to their storage, you have one choice here: FSx for ONTAP.

Key Data Management Capabilities: Snapshots, Replication, and Clones

If you say I'm good with a file protocol and I'm happy to run Oracle over NFS, for example, then you could choose FSx for OpenZFS or FSx for ONTAP. If you say something similar about Windows, which is I'm perfectly happy running my SQL Server databases over SMB, we have either Windows or FSx for ONTAP that could satisfy that requirement. Protocol is important. There are three other things here on the slide I want to talk about, and they're the words in orange starting right in the middle at the top: performance. These do have different performance profiles. Six gigabits per second is a relatively important delineator here. Windows and FSx for OpenZFS each have the capability to perform above that. In the case of FSx for ONTAP, that's essentially its ceiling today.
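The protocol branch of that decision tree can be sketched as a small lookup. This is a sketch based only on the protocol-to-service mapping stated in this session; the dictionary keys are my own labels, not AWS identifiers.

```python
# Protocol branch of the FSx decision tree from this session:
# block protocols -> ONTAP only; NFS -> OpenZFS or ONTAP; SMB -> Windows or ONTAP.

ELIGIBLE_SERVICES = {
    "iscsi":    {"FSx for ONTAP"},                                   # block
    "nvme-tcp": {"FSx for ONTAP"},                                   # block
    "nfs":      {"FSx for OpenZFS", "FSx for ONTAP"},                # file (e.g. Oracle)
    "smb":      {"FSx for Windows File Server", "FSx for ONTAP"},    # file (e.g. SQL Server)
}

def services_for(protocol: str) -> set:
    """Return the FSx services that can serve the given access protocol."""
    return ELIGIBLE_SERVICES.get(protocol.lower(), set())
```

For example, `services_for("iscsi")` returns only FSx for ONTAP, matching the talk's point that a block-only requirement leaves one choice.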

We have to think about performance and what kind of performance a single EC2 host can drive, what performance my database on that EC2 host will drive, and so on. Coming around clockwise from a data management perspective, this is where the services start to differentiate themselves a little bit beyond protocol.

Both FSx for ONTAP and FSx for OpenZFS have a rich set of data management capabilities, including onboard snapshots which allow you to run backups in seconds and recoveries of your databases in minutes, regardless of the database size. You might be thinking, if I have a 1 gigabyte database or a 100 terabyte database, you're telling me you can back either of those up using a snapshot in seconds and recover them in minutes? Yes, that's what I'm telling you. It's independent of the database size. Snapshots are a unique way to keep multiple recovery points that are database consistent. When I recover from a snapshot or roll back from that snapshot, when I start the database up, there's no roll forward recovery required. It just starts based on the time the snapshot was taken.
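As a concrete sketch of the snapshot workflow, an FSx for OpenZFS snapshot can be requested through the AWS SDK's `CreateSnapshot` operation. The volume ID and snapshot name below are hypothetical, and the actual API call is left commented out because it requires AWS credentials:

```python
# Build the request for boto3's fsx.create_snapshot (FSx for OpenZFS).
# The snapshot is point-in-time and returns in seconds regardless of
# volume size, which is what enables size-independent backups.

def build_snapshot_request(volume_id: str, label: str) -> dict:
    """kwargs for fsx.create_snapshot; the volume ID here is hypothetical."""
    return {"Name": label, "VolumeId": volume_id}

params = build_snapshot_request("fsvol-0123456789abcdef0", "nightly-2200")

# import boto3
# fsx = boto3.client("fsx")
# snapshot = fsx.create_snapshot(**params)["Snapshot"]
```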

FSx for ONTAP and FSx for OpenZFS also have the ability to replicate from one instance to another. I might have either FSx for ONTAP or FSx for OpenZFS in region one and another one in region two, cross country or across town, between regions. At the storage level you can do that replication. There's always a discussion about whether the database should do the replication or the storage should do the replication. That's another choice that you get to make. When the storage does the replication, you can literally save licensing costs in some cases and you can also save costs around the EC2 instance size that you have to deploy. If I'm asking my EC2 instances to not only run the database engine but also drive replication to a partner somewhere else in another region, there are CPU cycles associated with that, and therefore I likely need a bigger EC2 instance. Those things are important in terms of considering whether I want storage-level replication or database-level replication.

Clones are another capability in both FSx for ONTAP and FSx for OpenZFS. Different names for it based on different services, but the idea is that I want to create a thin copy of my database environment. What a clone allows you to do is in a matter of seconds, using one of those snapshots I talked about earlier, create a clone. I can then on top of that clone, which is thin and doesn't occupy any space, actually start another instance of the database which looks through that clone at the same disks where my production copy of my database resides. This could be for a what-if scenario. This could be for when I want to upgrade my database engine. I'm going to create a clone and run through the upgrade scenario on my clone to prove to myself that I can do an upgrade without any problems. Once I get through that process and I'm satisfied that I can do it on this clone, then I'll be perfectly happy and confident that when I go to upgrade in my production environment, it's going to work because I've already proven it to myself in this cloned environment.

Customers also populate lower environments using clones and use those lower environments, either short term or long term, for development and testing, training, and other enabling purposes. Gone are the days when I might have to start a copy Friday night and then, Saturday at noon after my kid's soccer game, log in and check whether that copy was done. You don't have to do anything of the sort anymore. You can use clones to literally create these database copies in a matter of seconds, and you can create hundreds of them if you so desire. Customers don't actually create hundreds of database copies in practice, but the capability exists to create many copies of your databases using clones. They're all thin and as a result cost very little to create. Those capabilities are unique to FSx for ONTAP and FSx for OpenZFS.
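On FSx for OpenZFS, a clone is created as a new volume whose origin is an existing snapshot. A minimal sketch of the request shape for boto3's `CreateVolume` operation follows; the parent volume ID, snapshot ARN, and account number are hypothetical, and the live API call is commented out:

```python
# Build the request for boto3's fsx.create_volume with an origin snapshot.
# CopyStrategy "CLONE" creates a thin copy in seconds; "FULL_COPY" would
# instead materialize an independent full copy (a "split" clone).

def build_clone_request(parent_volume_id: str, snapshot_arn: str, name: str) -> dict:
    return {
        "VolumeType": "OPENZFS",
        "Name": name,
        "OpenZFSConfiguration": {
            "ParentVolumeId": parent_volume_id,
            "OriginSnapshot": {
                "SnapshotARN": snapshot_arn,
                "CopyStrategy": "CLONE",  # thin clone, near-zero extra storage
            },
        },
    }

req = build_clone_request(
    "fsvol-0123456789abcdef0",                                      # hypothetical
    "arn:aws:fsx:us-east-1:111122223333:snapshot/fsvolsnap-0abc",   # hypothetical
    "dev-clone-1",
)

# import boto3
# boto3.client("fsx").create_volume(**req)
```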

Resilience is another important capability. This is the notion that I can deploy in a multi-AZ configuration, which means I've got synchronous replication at the storage layer between two Availability Zones. If I build a cluster on top of that where my database engine runs, and I have a problem in AZ1, that cluster software or database software fails me over to AZ2. I've got a synchronous copy of my data there, so my Recovery Point Objective is exactly zero because I've got synchronous replication going on between the two. The time it takes me to affect that failover is simply a matter of how long it takes that cluster to accomplish that failover, which usually happens quite quickly.

Thumbnail 1200

FSx for Windows File Server: SQL Server Deployments and Cost Optimization

When I'm deploying SQL Server, I mentioned we've got two options using the SMB protocol with Windows and ONTAP. I can also use iSCSI or NVMe over TCP for deployment on ONTAP. As we look at some of the capabilities of Windows, FSx for Windows File Server has VSS-compatible snapshots. Volume Shadow Copy is completely built in. We're effectively running Windows here.

So you would expect that capability to be there. The SMB protocol is fully supported, and Active Directory integration is available for both authentication and authorization. From a cost reduction perspective, we have onboard compression and deduplication built into Windows File Server, which is helpful from a TCO perspective.

Let's talk about some minimum sizes here. This could come into play when you have a lot of small databases or a lot of large databases, and you want to figure out whether it makes sense to choose one or the other based on your database sizing. You can see the sizes there. The minimum file system size is 32 gigabytes, and the minimum throughput capacity, which is the IO performance you get from that file system, is 32 megabytes per second.

Let's talk about a customer here that's running SQL Server on SMB using FSx for Windows. This is Ava Arrow, as you can see. The thing to call out here is that I mentioned earlier there might be potential cost savings in licensing that are interesting. This is exactly one of those use cases where they're running SQL Server in an FCI deployment, which is a failover cluster deployment. In this case, they're not using enterprise licensing; they're using standard licensing. The difference between SQL Standard licensing and SQL Enterprise licensing can be about 2 to 4X. In this particular case, this customer is reporting that they saved about 60 percent on their Windows licensing because they went from an Always On deployment to an FCI deployment, meaning they don't need to license Windows and the SQL components on the other end.

Thumbnail 1260

These are those two options. On the right-hand side, we have the FCI deployment. On the left-hand side, we have the Always On Availability Group deployment. The difference here is that on the left-hand side, the database is driving the replication, which is why you need a SQL license on both sides. On the right-hand side, the storage is driving the replication. So there's almost always cost savings associated with an FCI deployment relative to an Availability Group deployment, and that's a function of Windows licensing and also typically a function of those EC2 instance sizes that I spoke about earlier.
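The licensing math behind that saving can be sketched with illustrative numbers. The per-core price and core counts below are made up, not quoted from the session; the structural point is that an Availability Group licenses SQL Server on every replica node, while an FCI typically licenses only the active node (the passive node generally qualifies for passive-failover rights under Software Assurance, which you should verify against your own agreement).

```python
# Illustrative SQL Server licensing comparison for a 2-node HA pair.
# Prices and core counts are hypothetical; only the structure is the point:
# Always On AG licenses every replica, FCI effectively licenses one node.

def ag_license_cost(cores_per_node: int, price_per_core: float, nodes: int = 2) -> float:
    return cores_per_node * price_per_core * nodes   # every replica licensed

def fci_license_cost(cores_per_node: int, price_per_core: float) -> float:
    return cores_per_node * price_per_core * 1       # active node only

ag = ag_license_cost(16, 1000.0)     # 32,000 with these made-up figures
fci = fci_license_cost(16, 1000.0)   # 16,000
savings = 1 - fci / ag               # 0.5, i.e. 50% on SQL licensing alone
```

Real-world savings such as the 60 percent figure cited above also fold in the Standard-versus-Enterprise edition gap and smaller EC2 instances, so the number varies by deployment.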

Thumbnail 1330

Thumbnail 1360

FSx for OpenZFS: Oracle Database Performance and Compression Benefits

Let's talk now about an Oracle deployment. I have two options here: ZFS and ONTAP. ZFS would be NFS, and ONTAP would be either NFS or one of the two block protocols I spoke about. With FSx for OpenZFS, I've talked through most of these capabilities, if not all of them. You have a rich set of capabilities. You can see the NFS versions that are supported there, and working up the stack, you have the ability to do the replication I spoke about, snapshots that drive really fast backups and recoveries, and so on.

Thumbnail 1380

From a sizing perspective, you can see slightly larger sizes with ZFS versus Windows. On the Windows side, they were 32 and 32. With FSx for OpenZFS, the smallest file system is 64 gigabytes in size, and the smallest throughput capacity is 64 megabytes per second. So you get a little bit more performance out of the box with even the smallest file system in the case of FSx for OpenZFS.

Here's a customer example: Amdocs. You may be familiar with them. They do a lot of work in the healthcare industry in particular. What they found in their deployment was that through the use of onboard compression within FSx for OpenZFS, they were able to reduce their overall database storage. What's significant too is the performance that they saw. Sometimes people look at me cross-eyed when I tell them that customers bring applications from on-premises to AWS and actually see better performance in AWS. They're like, that can't be possible. I don't know how they did that. This is one of those examples where this customer actually saw better performance, which likely had to do with a suboptimal deployment on-premises and a more thoughtful deployment in AWS in terms of making sure we size everything properly. But we do see this from time to time.

Thumbnail 1430

FSx for ONTAP: Multi-Protocol Support and Advanced Features for Enterprise Databases

For FSx for ONTAP, this is a similar list of features and capabilities. It's actually slightly larger than the one you saw for ZFS, namely multi-protocol support on ONTAP, which ZFS does not offer, as we discussed, along with some of the same capabilities in terms of backups in seconds and replication across regions.

Thumbnail 1490

FSx for ONTAP offers many of the same capabilities as other solutions, including backups in seconds, cross-region replication, and TCO optimization through storage efficiencies. One key difference for customers running ONTAP on-premises who choose to migrate to AWS is that ONTAP has a feature called SnapMirror, which enables replication between two ONTAP instances. From a migration perspective, this makes it straightforward to take a database or any other workload hosted on ONTAP on-premises and bring it to AWS. You simply set up a SnapMirror relationship between your on-premises ONTAP system and the FSx for ONTAP system in AWS. Once that relationship is established, you press the go button and replication occurs automatically.

From a sizing perspective, FSx for ONTAP file systems tend to start slightly larger. The smallest file system size is 1 terabyte, and the smallest throughput capacity comes in at 128 megabytes per second. If you have a bunch of 32 megabyte databases, you might think this doesn't work for you. In that case, you have a couple of options: choose a different service like FSx for Windows, or stack several SQL Server databases onto the same file system, which allows you to easily consume 1 terabyte of capacity.
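The minimum sizes quoted across the three services can be collected in one place. The figures are the ones stated in this session (current AWS minimums may differ, so treat this table as a snapshot of the talk rather than authoritative documentation):

```python
# Minimum file system size (GB) and minimum throughput capacity (MB/s)
# per FSx service, as quoted in this session.

FSX_MINIMUMS = {
    "FSx for Windows File Server": {"min_size_gb": 32,   "min_throughput_mbps": 32},
    "FSx for OpenZFS":             {"min_size_gb": 64,   "min_throughput_mbps": 64},
    "FSx for ONTAP":               {"min_size_gb": 1024, "min_throughput_mbps": 128},
}

def smallest_min_size(services=FSX_MINIMUMS) -> str:
    """Service with the smallest minimum file system size."""
    return min(services, key=lambda s: services[s]["min_size_gb"])
```

This captures the advice above: for many small databases, either pick the service with the smallest floor or consolidate several databases onto one larger file system.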

Let me walk through a couple of examples. The first is S&P Global, which has literally hundreds of SQL Server databases that are core to their business. They brought these from on-premises, where the databases were not running on ONTAP but on a different storage solution, and were doing database-level replication. Now they have a multi-AZ deployment in every case, with storage-level replication in place. They are also using SnapMirror for disaster recovery to protect those databases within AWS. The primary databases are in AWS on FSx, and they are also SnapMirroring to another region for disaster recovery purposes.

What they have done is deploy multi-AZ instances where the primary databases are running, and they are replicating to single-AZ instances to reduce cost. Those single-AZ instances act as their disaster recovery failover option should they need to activate it. The next example is Pearson. This is both Oracle and SQL Server workloads, and this is their core customer system. When you register for a class, take a class, and access all the course content, it is all hosted on these two key systems that they brought to AWS. They have been on FSx for ONTAP now for three or more years and were one of our first customers. They used iSCSI, which is a block protocol.

I want to illustrate how snapshots work. Although this is simplistic, I want to call out a couple of key points on the slide. These are point-in-time snapshots, which is probably obvious, but you can create and maintain literally hundreds of them if you choose to do so. In practice, I do not see customers creating hundreds of snapshot copies and maintaining them, but you might keep five, ten, or fifteen copies of your database depending on how far back you want to go in your recovery cycle. You can replicate these snapshots to another region if you choose to do so.

In fact, one use case we sometimes see is customers creating snapshots of their databases in region one. They replicate those snapshots to region two using either OpenZFS's replication capability or ONTAP's replication capability, and then they clone those snapshots in region two. This way, they can maintain replication for disaster recovery purposes while using those clones for development and testing, including testing disaster recovery. If your boss comes around and says at two o'clock on a Tuesday that they want you to prove you could recover a database or set of databases in a disaster, you can say, "Hang on a second, boss, while I clone those databases and show you that I can bring them up, drop a table, and then look over at my DR copy and see that the table is still there." There are all kinds of scenarios you can go through.

Thumbnail 1830

Aaron is actually going to show you some of those in a minute, but snapshots are really useful in terms of keeping around copies. These are virtual copies of your database that you may want to recover from over the short term. That's not to say that you might not also want to use a Cohesity, Rubrik, or Commvault to create longer-term backups. You can certainly do that as well, but snapshots can be a secret weapon in terms of being able to do really quick disaster recovery, or recovery from something that's not a disaster per se but is not good for the business.

My last slide covers clones, which I talked about a little bit. What we're trying to do on the right hand side of this slide is illustrate what a clone looks like and how it works. I've got a primary copy represented by that blue disc, and I've got in this case a couple of clones, one I'm using for development and one I'm using for testing. These could be based on the same snapshot or they could be based on a different snapshot taken at different points in time. If I decide that I want to refresh my test environment, I simply create a new clone, mount that clone, and start my database.

If I'm 24 hours in arrears in terms of the difference between my production database and my secondary database or my test and development database, I can very quickly refresh those clones and be right up to speed. There are lots of capabilities that these clones can be used for. The last thing I want to point out is that they only use incremental capacity. I mentioned earlier that they're thin. When I create a clone, virtually no additional space is required to create that clone. There's a little bit of metadata, but it's measurable usually in megabytes.

Now, once I start the cloned database or the copy of the database, let's say that I now add a table to that clone of the database. Obviously I need some space to occupy that new table. In that case, my clone will be exactly the size of that metadata associated with that database copy, which again is small, and that new table that I just added to that database. You can bring those databases out of sync and do a what-if scenario, and you can tear that clone down, build a new clone, and start all over again with whatever test and development scenario you're interested in.

The last thing here is that if you have a desire to split that clone, you can do that, and then it becomes a 100% copy, a full copy, in which case you can then clone that split clone. You can do all kinds of things with these copies of data. We have customers doing really creative things with snapshots and clones that completely change the game. If you're using EBS right now for your databases, EBS is a wonderful, very capable service, but it doesn't do any of these things. If you have a desire to do different things with how you copy and protect your databases, then we would ask you to consider the FSx family, because it really is a game changer.

This goes back to the statement I made at the very beginning, which is why does storage matter when I deploy a database. Some might say it doesn't matter at all. Actually, it can matter a lot if these features and capabilities are useful to you. With that, I'm going to turn it over to Aaron, and he's going to walk you through a couple of other reference diagrams and then we're going to get into a demo.

FSx Deployment Types: Single-AZ and Multi-AZ High Availability Configurations

Thanks, Jim. I'd like to quickly go over the different deployment types available in FSx that we typically see customers utilizing when deploying a file system for self-managed databases. We have several deployment types within FSx, but there are two that we typically recommend and see customers using. We'll start here with our single availability zone, highly available deployment type.

Thumbnail 2040

Starting on the left hand side we have our Oracle production server. This could be any database engine. We're just using Oracle in this example. It could be PostgreSQL, SQL Server, MariaDB, and so on. When you deploy this file system deployment type, as part of the service, we will deploy a primary file server for you and a standby file server for you in the availability zone that you indicate you'd like the file server to reside within. All your reads will be serviced from that primary file server, and any writes will go to the primary file server and then be synchronously replicated down to the standby, and once committed on both will be acknowledged back to the clients.

Thumbnail 2070

Now if that primary file server fails for whatever reason, like a hardware failure, or if we're doing patching and maintenance on the file system during the maintenance window that you give us for this file system, the standby file server will be promoted to the primary. As long as your NFS mount options or your iSCSI or NVMe multipathing is set up correctly, your clients will seamlessly fail over to the standby file server. There will be a short period of increased latency while that failover occurs. But once the clients are all connected up to the standby, which has been promoted to primary, they will go on operating as they were previously.
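As a rough illustration, the client-side settings that make that seamless failover possible might look like the following. This is a hedged sketch, not the service's prescribed configuration: the file system DNS name, volume path, and mount point are all hypothetical placeholders.

```shell
# Sketch: example client-side settings for transparent failover.
# The DNS name, export path, and mount point below are hypothetical.

# NFS: a "hard" mount with a generous timeout lets the client retry I/O
# until the standby file server takes over servicing the endpoint.
sudo mount -t nfs \
  -o nfsvers=4.2,hard,timeo=600,retrans=2,rsize=262144,wsize=262144 \
  fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com:/fsx/ora_data /u02

# iSCSI: dm-multipath (Linux) or MPIO (Windows) should be configured so the
# client holds paths to both file servers and fails over between them.
sudo iscsiadm --mode discovery --op update --type sendtargets \
  --portal iscsi.fs-0123456789abcdef0.fsx.us-east-1.amazonaws.com
sudo systemctl enable --now multipathd
```

The key idea is that the client, not the application, absorbs the brief failover window: a hard NFS mount or a multipath iSCSI session simply retries until the promoted standby answers.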

Thumbnail 2110

If a hardware failure occurred, we will replace that primary file server. Alternatively, if this was just during maintenance and we rebooted the primary file server, as soon as it comes back online, your clients will fail back to the primary and the standby will return to standby status as it was when we initially deployed it.

Thumbnail 2130

The second deployment type that we typically recommend or see customers using for self-managed databases is the multi-availability zone highly available file system. Starting on the left side, we have an Oracle production server. In this example it is Oracle, but it could be PostgreSQL, MariaDB, or any other database engine of your choice. When you select this deployment type for the file system, we deploy a primary file server in the availability zone that you indicate as your primary production availability zone. We then deploy a standby file server in a second availability zone that you designate as your standby for your environment.

Thumbnail 2180

Thumbnail 2190

Thumbnail 2200

All reads are serviced from the primary, and all writes go to the primary, are synchronously replicated over to the standby, and once committed on both, are acknowledged back to your clients. In this scenario, if not just the primary file server but the entire availability zone fails, we will promote that standby file server to primary. If you have an optional standby database server running in that availability zone, you could mount the file system, or in this case the LUNs we are showing here, and start running your production from the second availability zone. Just be aware that when AZ1 comes back online, the file system will fail back over to the primary. We always try to run the file system from the availability zone that you initially indicated as your primary AZ when you deployed the file system.

Thumbnail 2230

Demo 1: Disaster Recovery Testing for SQL Server Using FSx for ONTAP Snapshots and Clones

When Jim and I were putting this presentation together, instead of just giving you PowerPoint after PowerPoint, we thought it would be a good idea to actually provide some real world examples of how we see our customers leveraging these advanced capabilities within FSx when they are running their self-managed databases on an FSx file system. One of the examples we come across very often with customers is that they are doing disaster recovery from one region or one AZ to a second region or second availability zone, and they want to test and validate that the disaster recovery environment would actually run their production environment if they needed to fail over in a true disaster. However, they also want to be able to do that testing without impacting the RPO requirements that they have agreed upon with their business stakeholders.

Thumbnail 2280

This is the environment that we will be working within during the demo. Starting on the left side, you can see I have a Microsoft SQL Server FCI cluster spread across two availability zones, AZ1 and AZ2. In this example, we have deployed an FSx for NetApp ONTAP multi-AZ file system with our primary file server in AZ1 and our standby in AZ2. That file system is presenting iSCSI LUNs, which are block-based, to our Microsoft SQL Server FCI cluster. We are taking advantage of the block-based replication within FSx for ONTAP and replicating those LUNs so that production database storage in our VPC on the left is replicated to a second VPC and second file system, which is a single availability zone FSx for ONTAP. We deployed a single AZ in the DR environment simply to save on costs and be more cost effective. We then have a standalone Microsoft SQL database server in the disaster recovery environment that we could bring up, again running as a standalone just to save on cost.

In this example, everything is running in a single region. I am showing this in us-east-1 and we are going from one VPC to another VPC, but with FSx for ONTAP's replication capabilities, this could easily be going from one region to another region, from one account to another account, or from one region and account to another region and account. We have vast flexibility within the block-based replication within the FSx for ONTAP file system to be able to replicate to basically wherever we would like to as long as we have the networking in place.
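On the ONTAP CLI, a replication relationship like the one described above could be set up roughly as follows. This is a hedged sketch run from the destination (DR) file system; the SVM names, volume names, policy, and schedule are hypothetical and would be chosen to match your RPO.

```shell
# Sketch: establish SnapMirror (block-based) replication from the production
# FSx for ONTAP volume to the DR volume. All names are placeholders.

# Create the relationship; the schedule should reflect the agreed-upon RPO.
snapmirror create -source-path prod-svm:sqldata \
  -destination-path dr-svm:sqldata_dr \
  -policy MirrorAllSnapshots -schedule hourly

# Perform the initial baseline transfer of the volume.
snapmirror initialize -destination-path dr-svm:sqldata_dr

# Check status; an Idle state with "Healthy: true" means the mirror is current.
snapmirror show -destination-path dr-svm:sqldata_dr
```

Because SnapMirror only needs IP reachability between the two file systems, the same commands apply whether the destination is another VPC, another region, or another account.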

Thumbnail 2390

The high-level steps that we will go through in this demo are as follows: we will review the production SQL database configuration, create a snapshot on our FSx for ONTAP production file system, and replicate that over to our DR environment so that we have a point in time copy of our production environment.

Then we create a clone from that snapshot and present that clone to our DR SQL database, which allows us to bring our DR database up in a read-write state and perform our DR testing and validation against it.

Thumbnail 2430

Thumbnail 2440

Thumbnail 2460

You can see here this is our production SQL database. We have our STG337 database here. If we look at the properties of that database, we can see that the data files are residing on our S drive and our LDF, or log files, are on the L drive. If we go into Computer Management and look at the S drive and examine the properties of this disk, we can see that it is coming from a NetApp iSCSI-based LUN. This is coming from FSx for NetApp ONTAP as an iSCSI LUN. Our LDFs, or log drives, are sitting on Disk 2. If we look at the properties of that disk, it too is a NetApp iSCSI-based LUN coming from FSx for ONTAP.

Thumbnail 2490

Now if we go into SQL Server Management Studio and run a quick query selecting the top 10 rows from our Customers table, we can see that we have our database here with customer IDs 1 through 10 along with their first names, last names, and email addresses. This query is for us to compare when we bring up the disaster recovery database. It gives us a quick validation that our DR database looks similar to our production database, and we can feel confident that we indeed have a working copy of production in DR when we bring that DR environment online.

Thumbnail 2520

Thumbnail 2530

Thumbnail 2540

Thumbnail 2550

On our production FSx for ONTAP file system, we create a snapshot. This snapshot is what we will work from to create a point-in-time reference of production in our disaster recovery environment. We created that snapshot called DR_test. We go over to our disaster recovery FSx for ONTAP file system and using SnapMirror, the block-based replication technology, we pull that across. We can see here that the SnapMirror relationship is now idle and it is still healthy, showing healthy as true. This indicates that we have successfully pulled across that snapshot, that reference copy of production, to our disaster recovery environment.
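The snapshot-and-pull sequence just described might look like this on the ONTAP CLI. Again a hedged sketch: the SVM and volume names are hypothetical, and only the DR_test snapshot name comes from the demo.

```shell
# Sketch of the DR_test workflow described above; SVM/volume names are
# placeholders.

# On the production file system: create the point-in-time snapshot.
volume snapshot create -vserver prod-svm -volume sqldata -snapshot DR_test

# On the DR file system: pull that specific snapshot across via SnapMirror.
snapmirror update -destination-path dr-svm:sqldata_dr -source-snapshot DR_test

# Confirm the relationship returned to Idle and healthy after the transfer.
snapmirror show -destination-path dr-svm:sqldata_dr -fields state,healthy
```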

I did this all within the CLI to show you step by step what happens. If you feel more comfortable working within a GUI or would like to make this easier to work with and automate, you can use NetApp's SnapCenter tool against FSx for ONTAP. You do not have to do this all from the CLI. I just wanted to show you what steps are actually occurring. If I did this in SnapCenter, you would not see a lot of the behind-the-scenes details of what is actually happening. It would just do these things for you.

Thumbnail 2600

Thumbnail 2610

Thumbnail 2620

Now that we have that snapshot and have pulled it across, we can create a clone from that snapshot. You can see here our parent snapshot name is DR_test, which is the DR_test snapshot I created over in production. Now that we have that snapshot pulled across, we have our clone, our read-write copy created. We can see now that we have these two new LUNs over in our disaster recovery environment called SQL Data and SQL logs. I am going to map those LUNs to our disaster recovery standalone SQL database server. Again, this is all done in the CLI to show you step by step what is happening. You could use SnapCenter to do this from a GUI if you feel more comfortable.
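Creating the clone and presenting its LUNs, as described, could be sketched as follows. The clone volume name, LUN paths, and initiator group are hypothetical; in the demo the data and log LUNs live on separate volumes, so you would repeat the clone step per volume.

```shell
# Sketch: clone the replicated snapshot read-write and map its LUNs to the
# DR SQL Server. Names and paths are placeholders.

# FlexClone: a thin, read-write copy backed by the DR_test snapshot.
volume clone create -vserver dr-svm -flexclone sqldata_clone \
  -parent-volume sqldata_dr -parent-snapshot DR_test

# Map the cloned LUNs to the initiator group of the DR database server.
lun mapping create -vserver dr-svm \
  -path /vol/sqldata_clone/sqldata -igroup dr-sql-igroup
lun mapping create -vserver dr-svm \
  -path /vol/sqldata_clone/sqllogs -igroup dr-sql-igroup
```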

Thumbnail 2640

Thumbnail 2650

Thumbnail 2660

In our disaster recovery database, we currently have no databases. If we go into Disk Management, we have only the boot disk and no other disks. But if we rescan for new disks, those two iSCSI LUNs that we just presented from the cloned volume show up. If we online those disks, we now have our SQL data drive, and if we online the second disk, we have our LDFs, or SQL log files.

Thumbnail 2670

Thumbnail 2680

Thumbnail 2690

Thumbnail 2700

Now if we go back to SQL Server Management Studio, we can attach that database from the snapshot of production that we have pulled across to our DR. We point this to the MDFs, to our data files. Then, since this is a snapshot of production, the system still thinks we are on our production environment, so we have to point it to the right drive letter for our LDFs, or log files. We go ahead and do that, and as soon as we do, our production database is now mounted and available in our disaster recovery environment to work from. If we run that same quick query of our top 10 customers,

Thumbnail 2710

we will see that it is identical to production. We took that reference point in time snapshot of production and pulled it across to the DR, and we were able to open up our DR database. In a true disaster recovery scenario, you wouldn't need to do all those steps. This is just because we want to do testing and validation of DR and not impact our existing RPOs. In a true disaster scenario, you would simply break the SnapMirror relationship, which would bring the DR up in a read-write state, and you could then mount the production database.
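In that true-disaster path, the failover essentially reduces to a single command on the DR file system. A hedged sketch, with a hypothetical destination path:

```shell
# Sketch of a real failover, in contrast to the non-disruptive test above.

# Stop replication and make the DR volume itself read-write at its last
# replicated point in time (i.e., within the agreed RPO).
snapmirror break -destination-path dr-svm:sqldata_dr

# The existing LUNs on the DR volume can then be mapped to the DR server
# and the database attached directly. When production returns, a
# "snapmirror resync" (direction depending on your failback plan) brings
# the two sides back in sync.
```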

Thumbnail 2750

To show that this is indeed a read-write database, a read-write copy that we could work from, I'm going to insert myself as another customer into this table, and you can see this was successfully executed. I can indeed do reads and writes from this DR database. At this point, we could bring up our application servers over in the DR environment and do all the testing and validation we need to make sure that our disaster recovery environment is working as we expect and that we could truly survive a disaster and fail over to this environment and run production from this environment.

If we did have to do that, if we had a true disaster, we could always run from the DR environment for a short period of time. Once production is back online, we can replicate those changes back to production and bring the application servers over to production when the time is best for us, and then bring everything back to the way it was previously. The reason we took that snapshot, the DR_test snapshot, is because our objective was to not impact our existing RPOs.

The reason we did that is because in the background, SnapMirror, which is the block-based replication of FSx for ONTAP, is continuing to run on the schedule it always has been. It is continuing to run in the background on that schedule based upon our RPO requirements that we agreed upon with our business stakeholders. If we happen to have a disaster while we are doing this disaster testing, we truly could just break that SnapMirror and we would still meet our RPO requirements that we had agreed upon previously with our stakeholders.

Thumbnail 2840

Demo 2: Creating Cost-Efficient Dev/Test Environments for Oracle Using FSx for OpenZFS Clones

Switching gears, that was scenario one. Another scenario where we often see our customers taking advantage of FSx with their self-managed databases is creating a like-for-like development and test environment from production without duplicating storage capacity. They want to be cost-sensitive and cost-efficient in the process of creating dev and test environments that look just like production. We can do that through FSx file systems.

Thumbnail 2870

Thumbnail 2890

If we take a look here, in this example we are using an Oracle production server on the left. We have deployed an FSx for OpenZFS single availability zone, highly available file system, so we have our primary and our standby file servers. Following OFA best practices (Oracle's Optimal Flexible Architecture), we have our Oracle data, Oracle logs, and Oracle binary volumes that we are going to present to our production Oracle server. Production Oracle is running via NFS version 4.2 to our FSx for OpenZFS file system.

Thumbnail 2930

Thumbnail 2960

What we are going to do in this example is we are going to stand up a development Oracle server and we are going to clone those production database volumes. I will be able to show you that we are going to consume essentially no additional capacity on the storage on the file system outside of the metadata required to create these clones. From a high-level perspective, the steps that we are going to take are we are going to review that Oracle production database config and the FSx OpenZFS file system. We will create an FSx OpenZFS snapshot of our production database, a reference point that we can take at that point in time of the production database, and then we are going to create clones from those snapshots that we can present to our development database environment. We will mount those clones on the development environments and open up the Oracle database so that we could start spinning up application servers and doing dev and test against that database.

There are two scripts that are going to be run in this demo. I want to call out what those scripts are actually doing behind the scenes. The first script creates the snapshots of production for us to work from, and it makes two primary API calls. First, it issues the AWS FSx describe-volumes API call against the file system to describe the volumes on the file system, which allows us to iterate through each of those volumes. If you recall, those are Oracle data, Oracle binary, and Oracle logs. Then, for each volume, it runs the AWS FSx create-snapshot API call to create a snapshot. That is what the first script is doing.

The second script uses the AWS FSx describe-snapshots API call to list the snapshots that we just created. It iterates through those snapshots, and for each one we run the AWS FSx create-volume API call. I want to call out one of the important parameters of that create-volume call: we pass in the ARN of the snapshot we just created in production, and we use a copy strategy of clone. This tells the file system that we want to duplicate the production environment using clones, not a full copy. A full copy is our other option, but we don't want to duplicate the entire database; we just want a clone of the database to work from so that we minimize our storage capacity requirements.
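Condensed into their core AWS CLI calls, the two scripts described above might look like this. A hedged sketch: the file system ID, volume IDs, snapshot ARN, and names are all placeholders.

```shell
# Sketch of the two demo scripts' core AWS CLI calls; all IDs are placeholders.

FS_ID=fs-0123456789abcdef0

# Script 1: enumerate the file system's volumes, then snapshot each one.
for VOL in $(aws fsx describe-volumes \
    --filters Name=file-system-id,Values="$FS_ID" \
    --query 'Volumes[].VolumeId' --output text); do
  aws fsx create-snapshot --name dev_refresh --volume-id "$VOL"
done

# Script 2: for each snapshot found via describe-snapshots, create a thin
# clone volume. CopyStrategy=CLONE (rather than FULL_COPY) is what keeps
# the new volume down to metadata-only capacity.
aws fsx create-volume --volume-type OPENZFS --name ora_data_clone \
  --open-zfs-configuration 'ParentVolumeId=fsvol-0fedcba9876543210,OriginSnapshot={SnapshotARN=arn:aws:fsx:us-east-1:111122223333:snapshot/fsvolsnap-0123456789abcdef0,CopyStrategy=CLONE}'
```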

Thumbnail 3070

Thumbnail 3080

Thumbnail 3090

Thumbnail 3100

Now let's look at the Oracle production database server. We're going to connect into the database and examine where our data files currently reside. We can see here that the data files are sitting on U02. If we look at that mount point, we see that U02 is an FSx file system ending in C9DB. Now I'm going to use another AWS FSx API call, describe-file-systems, and we'll see that the C9DB file system is indeed an FSx for OpenZFS file system, as I showed earlier in the architecture.


Now we'll use another FSx API call, describe-volumes, just to show you the volumes that reside on this FSx for OpenZFS file system. You can see here we have our ora_data, ora_logs, and ora_binaries. On the far left, you'll notice that ora_data and ora_binaries are using data compression. We're using the LZ4 compression algorithm available in FSx for OpenZFS, which simply shrinks the size of the database on disk so we can be more cost-efficient with our storage capacity.

Thumbnail 3140

Thumbnail 3150

We'll now connect back into the database and take a quick look at the database ID and database name. Then I'm going to run a CloudWatch API call that gets the current used capacity of the file system. We're going to reference this again after we create the snapshots and clones so that we can see how much additional capacity creating those snapshots and clones of the production database actually uses on the file system.
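That capacity check can be approximated with a CloudWatch query along these lines. A hedged sketch: the file system ID is a placeholder, and you should verify the metric name against the documentation for your FSx file system type.

```shell
# Sketch: read the file system's used storage capacity from CloudWatch.
# The file system ID is a placeholder; metric names may vary by FSx type.
aws cloudwatch get-metric-statistics \
  --namespace AWS/FSx \
  --metric-name UsedStorageCapacity \
  --dimensions Name=FileSystemId,Value=fs-0123456789abcdef0 \
  --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time   "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 --statistics Average
# (date -d '...' is GNU date syntax; adjust on BSD/macOS.)
```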

Thumbnail 3170

Thumbnail 3180

Thumbnail 3190

Since we're going to go back to that, I'm opening up a new SSH session to our production database, and we're going to run that create snapshot script I referenced earlier. You can see here we put the database in backup mode. The database goes into backup mode, and then we iterate through each of the volumes, taking a snapshot of each one. For time purposes, this was obviously sped up, but you can see here in real time it took about 54 seconds for us to put the database in backup mode, take a snapshot of the database to give us that point-in-time reference to work from for the production database, and then take the database out of backup mode. This is what Jim was referring to when he said you can back up in seconds and restore in minutes.
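Condensed, the snapshot script's logic amounts to something like the following. A hedged sketch: the volume IDs are placeholders, and the sqlplus invocation assumes OS authentication on the database host.

```shell
#!/bin/sh
# Sketch of the demo's snapshot script: quiesce, snapshot, resume.
# Volume IDs are placeholders; sqlplus assumes OS authentication.

# 1. Quiesce: put the database into hot backup mode.
sqlplus -s / as sysdba <<'SQL'
ALTER DATABASE BEGIN BACKUP;
EXIT
SQL

# 2. Snapshot each FSx for OpenZFS volume (a near-instant, point-in-time copy).
for VOL in fsvol-0data000000000001 fsvol-0logs000000000002 fsvol-0bins000000000003; do
  aws fsx create-snapshot --name "prod_refresh_$(date +%Y%m%d%H%M)" --volume-id "$VOL"
done

# 3. Resume: take the database back out of backup mode.
sqlplus -s / as sysdba <<'SQL'
ALTER DATABASE END BACKUP;
EXIT
SQL
```

The whole window in which the database sits in backup mode is only as long as the snapshot calls take, which is why the demo completes in under a minute.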

Thumbnail 3220

Thumbnail 3230

Thumbnail 3240

Now we'll run the clone script. This script iterates over each of those snapshots and, for each one it finds, runs the create-volume API call with the clone copy strategy to create a clone from that snapshot. The clone is what gives us read-write capability; snapshots are read-only, just point-in-time references. Again, this was sped up for time purposes, but you can see here it took us about 2 minutes and 16 seconds to create clones from those snapshots. Down at the bottom, you can see we have our three clone volumes now: a clone for the binaries, data, and logs of the database.

Thumbnail 3270


Thumbnail 3280

Thumbnail 3290

Thumbnail 3300

Now that we have our read-write clone database volumes, we can go over to the development database server. If you take a look, you'll see we have no U0* mount points currently. If we try to connect with SQL*Plus, it doesn't work because the binaries aren't there; they're not mounted. So we'll go ahead and mount those clone volumes we created from the production snapshot. Following Oracle best practices, we'll now have U01, U02, and U03. You can see we've mounted each of those clone volumes. Now that the binaries are mounted, SQL*Plus will work, so we can run the sqlplus command and connect into the database. You'll see that we connected to an idle instance, and we'll go ahead and start the database up. Upon startup, the database is going to give us an error.

Thumbnail 3310

Thumbnail 3320

This is expected. We'll see here in just a second as the database is coming up that the database is complaining that it needs to either be taken out of backup mode or have media recovered. The reason this is expected is that if you recall, the script that created the snapshots put the database in backup mode, then took a snapshot, and then took the database out of backup mode once the snapshot was created. This clone that we're presenting to our development database hasn't had the database taken out of backup mode. It's at the point in time where the database was put in backup mode and then the snapshot was created. So as I mentioned, this is expected and not a problem.

Thumbnail 3360

Thumbnail 3370

Thumbnail 3380

All we need to do is run the recover database command, and Oracle will recover the database and take it out of backup mode. We then shut the database down; we'll see the database get dismounted and shut down. We can now cleanly start the database up now that it's been taken out of backup mode. We will see the database successfully start: the Oracle instance gets started, the database gets mounted, and the database is opened. So we now have a working Oracle database environment on our development database server, and we could start running development applications against this database, a like-for-like copy of production at the point in time that we took that production snapshot.
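The recovery steps just walked through reduce to a short SQL*Plus session. A hedged sketch, again assuming OS authentication on the development host:

```shell
# Sketch: bring the cloned database out of backup mode and open it.
sqlplus / as sysdba <<'SQL'
STARTUP MOUNT
RECOVER DATABASE;
-- (For a clone frozen in hot backup mode, ALTER DATABASE END BACKUP;
--  is a common alternative to media recovery.)
SHUTDOWN IMMEDIATE
STARTUP
EXIT
SQL
```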

Thumbnail 3410

Thumbnail 3430

Now we'll go ahead and run the select of name and DBID, which gives us a way to compare against production and validate that this is indeed the production database. We're also going to run that CloudWatch API call again, and you can see the used capacity now ends in 475 megabytes. If we go back to the screen I showed you earlier, the last three digits read 434 megabytes. So with about 40 megabytes of additional storage capacity, we have just duplicated our 18 gigabyte database. We can now have two copies of our 18 gigabyte database while consuming only an additional 40 megabytes.
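To put those numbers in perspective, a quick calculation using the demo's approximate figures shows how small the clone's overhead is relative to a full copy:

```shell
#!/bin/sh
# Quick arithmetic on the demo's approximate numbers: an ~18 GB database
# duplicated with only ~40 MB of clone metadata (475 MB - 434 MB observed).
DB_SIZE_MB=18432        # 18 GB production database
CLONE_OVERHEAD_MB=40    # additional capacity consumed by the clone

# A full copy would add the whole database size again; the clone adds only
# its metadata.
awk -v o="$CLONE_OVERHEAD_MB" -v s="$DB_SIZE_MB" 'BEGIN {
  printf "clone overhead: %.2f%% of database size\n", 100 * o / s
  printf "full copy would add %d MB; clone adds %d MB\n", s, o
}'
```

Roughly a quarter of one percent of the database size, which is why the presenters describe clones as thin.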

Thumbnail 3480

Thumbnail 3490

Those 40 megabytes are essentially the metadata required for those clones. This shows how efficiently you can create duplicates of your production database from a storage perspective. We could keep creating more clones from that initial snapshot, or create new snapshots and new clones, but we can begin working against this environment with very little additional storage capacity. If we go back to our development database just to show that this is indeed a read-write environment, we'll connect back into the database here. We'll just create a new table, and you can see the table is created successfully on our development environment, showing that those clones are indeed read-write.

In this development environment, we're building some new portion of the application that requires a new table. After we've created this table, done our development work, and completed all of our testing in this dev environment, we can feel confident it will work in production because we're working from an exact copy of our production database. As Jim mentioned, other scenarios you might use this for include testing database upgrades. I'm showing creating a table here, but there are many different scenarios where we see customers using this snapshot and clone capability against their production environment.

Thumbnail 3540

Alright, so that ends the demos section. For next steps for folks to get started, in the upper left-hand corner we have a recent blog that was written about getting started with self-managed Oracle databases on FSx for OpenZFS. In the lower left, we have best practices around Microsoft SQL deployments on FSx for Windows. In the upper right, we have best practices for FSx for ONTAP when running Microsoft SQL databases. In the lower right, we have a hands-on workshop that you all have access to. You could go through many of the steps that I just showed hands-on yourselves to get some experience doing these operations in your environment. You could also talk to your AWS account representatives and have them potentially run this workshop for you as well in an AWS account.

Thumbnail 3600

So with that, I just want to say thank you everyone for your time and thanks for attending. If you get a chance, if you could please fill out the survey, we'd really appreciate it. Jim and I will be available after the session and happy to answer any questions or go through any scenarios you may have. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
