🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.
Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!
Overview
📖 AWS re:Invent 2025 - Data protection and resilience with AWS Storage (STG338)
In this video, AWS experts discuss data protection and resilience strategies for cyber event recovery. The session covers the Maersk ransomware attack case study, where 49,000 laptops and 3,500 servers were destroyed, emphasizing the importance of offline backups and knowing critical business services like Active Directory. The speakers introduce AWS's 3-2-1-1-0 backup framework, explain the difference between high availability and data protection, and detail AWS infrastructure concepts including regions and availability zones. They demonstrate practical implementations using AWS Backup, Elastic Disaster Recovery, and logical air gap vaults for protection against catastrophic breaches. The presentation highlights regulatory requirements like DORA and Sheltered Harbor certification, with AWS being the first cloud provider to achieve this standard. Key metrics like RTO and RPO are explained alongside multi-zonal and multi-regional recovery strategies.
This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction to Data Protection and Resilience with AWS Storage
Good morning everyone, and welcome to session STG338, Data Protection and Resilience with AWS Storage. Just by way of a quick introduction, my name is Danny Johnston, and for the last five years, I've been helping some of our largest customers within global financial services build out their cyber event recovery platforms and solutions at AWS, so I'm really excited about today's discussion. I'm absolutely delighted to be joined today by Brian and by Steve.
Hey, I'm Brian Benscoter. I'm a Principal Solutions Architect in the healthcare and life sciences space, but I also have a passion for cloud operations and data protection. Hello everyone, I'm Steve Devos. After 28 years of building on-premise backup solutions, I joined AWS eight years ago to basically build the AWS Backup service. I'm a Principal Engineer and I look forward to discussing data protection with you all today. Fantastic, thank you guys.
So we've got a packed agenda for you this morning. We're going to be kicking off talking about proactive defense and why it's so important to have proactive defense to meet today's threat landscape. We're going to talk about what we mean by the term data resilience. We'll look at some methodologies and strategies to bolster resilience. We'll look at AWS's approach to cyber event recovery and then wrap the session up with some firm next steps and actions.
The Growing Threat Landscape: Ransomware Statistics and Internal Risks
So kicking off then, proactive defense. I don't need to remind any of you just why it's so important we protect our most mission-critical data assets from external and internal threats. Now there's a huge amount of material out there pointing to this, but one report I refer back to time and time again is The State of Ransomware by Sophos. In their report titled The State of Ransomware Across Financial Services in 2024, Sophos estimates that 59% of organizations experienced some form of cyberattack. Nearly half of those attacks were successful.
But what really concerns me is that even when an organization paid its ransom, it was only able to retrieve about 70 to 80% of its data. Not only do security agencies like the FBI, GCHQ, and the NCSC recommend not paying a ransom; even if you do pay, that's not necessarily going to guarantee retrieval of all your data assets. And then you start building on top of that: the cost of recovery from each incident now sits at about the $1.5 to $1.6 million mark on average. But we all know that cost of recovery excludes the reputational damage inflicted by these attacks.
And once you start building in that reputational damage, the cost can be absolutely horrendous. In fact, Cybersecurity Ventures estimates that by the end of this year, the cost of cybercrime across all worldwide industry is going to be in the region of $10.5 trillion. That's more profitable, they claim, than the entire illegal drug trade. In fact, they claim that if cybercrime were a country, it would have the third-largest economy in the world, using GDP as a metric. So there really are some quite startling and eye-watering claims and stats out there.
But we also need to be very mindful of internal threats to data. Think about disgruntled employees, other internal malicious actors, or even well-intentioned employees with access to critical material who can jeopardize an organization's stability through human error, such as accidental file deletion or, as we saw in the case of UniSuper in Australia about 18 months ago, accidental account deletion. That human error put $145 billion worth of pension funds at risk, which in turn put at risk the savings and futures of 640,000 senior citizens in Australia.
The Maersk Cyberattack: Lessons from a Digital Extinction Level Event
But what's it like to actually experience what some of the UK regulators call a digital extinction level event? What's the actual business impact on the ground and what steps can you take to recover your business in the event of a catastrophic breach? We work with an industry leader in this space, the former CIO of AP Moller Maersk, who experienced just such an attack in June 2017. The World Economic Forum classed this attack as the most costly and catastrophic cyberattack to date. So much so that they've now published an external case study so that other organizations can learn from Maersk's unfortunate experience, and I'll come on to talk about that experience shortly.
But before I do, just a bit of background on Maersk. They're a huge organization. Maersk is responsible for 20% of all global trade. The ship we see in front of us is one of their 750 ships in their fleet, and each of these ships is capable of carrying a staggering 19,000 containers. In fact, the newer generation of these carriers can carry up to 24,000 containers.
They own and operate 74 ports and terminals worldwide, and every single port can expect to receive about five of these ships on a daily basis, so that's around 100,000 containers per port per day. I mean, it's a colossal operation.
Now, five years before Russia invaded Ukraine, they were attacking Ukraine with cyber weapons, the largest of which was detonated on the 28th of June 2017. Russia targeted the Inland Revenue Platform of Ukraine, a system and platform called M.E.Doc in Kyiv. Any organization that happened to be on the platform at the time of the attack was impacted, about 3,380 organizations in total. But what sets Maersk apart from those other organizations is just how transparent they were in terms of their experience, so other organizations could learn from it, and I'll try and paint a picture of what happened that day.
Within the first nine minutes of the attack happening, most of the damage had already been done. The network was taken down. Their online backups were completely destroyed and taken out. 49,000 laptops were rendered completely useless, 3,500 servers were destroyed. A further 1,200 applications were destroyed and they couldn't access a further 1,000 applications because doing so would lead to reinfection.
In terms of communication, all they had in the early hours and days of the attack was WhatsApp, because their BlackBerries had been completely wiped. In fact, they couldn't even enter their offices, because their office passes were linked to the IT system. A pretty disastrous and brutal situation to find yourself in, particularly if you're responsible for 20% of all global trade.
So what happened next was very much through luck rather than design. It just so happened that at the time of the attack, a major storm in Nigeria had caused a power outage. An x86 box there was offline at the time of the attack because of that outage, and, even more fortunate than that, this x86 box held a copy of Active Directory, that key infrastructure service that supported so much of Maersk's operations.
So as soon as the recovery team identified this system, they put it on the first available flight, in a first-class cabin I might add, from Lagos to Heathrow. They then had it chauffeur-driven at lightning speed up the M4 motorway to their UK headquarters in Maidenhead and gradually started firing their network back up using Active Directory. It took days and days to get to any semblance of normal operational capacity and weeks and weeks to recover fully. But they did eventually recover their business.
So why is this story so important for this morning's discussion? Well, I think it really focuses us on a few core principles of data resilience. First of all, it teaches us the importance of having offline backups, completely segregated from your production environment, ideally through design rather than, as in this case, luck. I think it also teaches us the importance of really knowing your important business services. Again, in this case it was that key infrastructure service, Active Directory, that so many of us use.
And I think finally, it really underlines the fact that even if you're not the target of an attack, you can still be the victim of an attack. So it really is the case of not if, but when you will fall foul of a cyberattack. And given the systemic risk ransomware in particular poses to industry, we've seen a lot of guidance, prescriptive guidance coming out from governments and regulators worldwide.
Regulatory Guidance and Customer Priorities for Data Protection
Two years ago now, we saw the Department of Home Affairs in Australia issue a really strong action plan against ransomware following the well-publicized Medibank and Optus data breaches. In Hong Kong, if you want to operate in that region, the Hong Kong Association of Banks and the Hong Kong Monetary Authority stipulate that you must have what they call a secure tertiary data backup. At the beginning of the year, a lot of eyes were on Europe and DORA, the Digital Operational Resilience Act, which came into force on the 17th of January. A lot of its articles and provisions pertain to network segregation and backup; if you're interested in learning more, take a look at Article 12.
And we worked really closely with the Bank of England and a body called CMORG, the Cross-Market Operational Resilience Group, in terms of what it means to design, deploy, and implement what they call CHDVs or Cloud Hosted Data Vaults. They see CHDVs as a key mechanism to level up resilience across the entire financial services sector. Now the guys are going to go into a lot more technical detail and depth around what a data vault and a CHDV is.
But in very simple terms, a CHDV is an independent recovery environment built in AWS where you maintain a golden copy of your most mission-critical data assets in an air-gapped, immutable vault, so you can recover your business in the event of a catastrophic breach. Exactly this time last year, the New York State Department of Financial Services (NYDFS) stipulated that if you want to continue doing business in New York, you must have an external backup that is, and I quote, free from alteration and destruction.
So I think we see a lot of guidance around data protection. I think it's fair to say that what we see less of are actual standards, and I mean specific surgical standards organizations should be meeting. And where there's perhaps pockets of standards, we see very little in the way of a pathway to achieve those standards and certainly pretty much nothing in the way of official certifications of standards. And that's why Sheltered Harbor was established. Sheltered Harbor set the gold standard for operational resilience across the financial services sector, and I'm absolutely delighted to say that AWS is the first and only cloud service provider to be an alliance partner with Sheltered Harbor and to have a Sheltered Harbor certified platform in the cloud. And when I talk to my customers about that, that really gives them confidence that we've got some fantastic solutions and credibility in helping them overcome the challenges that ransomware poses.
So what else do my customers say is important to them when it comes to data protection? Well, they tell me speed is of the essence. It's really important to minimize that threat window when it comes to cyberattacks, and given the inherent flexibility you get with cloud solutions, you can achieve so much more, so much more quickly, in the cloud than you can on premises. They also tell me that having an agile and modular data protection platform that can bend and flex to meet the ever-changing threat landscape is crucially important, and AWS Marketplace is a key component of supporting that modular platform.
And finally, they tell me they want the ability to start small but scale fast, to take advantage of the utility model that AWS provides, and to avoid heavy upfront CapEx investments in expensive and complicated on-premises infrastructure that may or may not be fit for purpose in the future. Having that ability to optimize their data platform, which you can achieve in AWS, is super important to them. So it's around speed, it's around agility, and it's around scale, and that's where we really hone our messaging and go-to-market when it comes to cyber event recovery. But I think it's important to sometimes take a step back and talk about what we mean by the term data resilience, and with that, I'm really looking forward to Brian now sharing some key insights with us.
Understanding Resilience: High Availability versus Data Protection
Thank you, Danny. So I think it's important to start with some fundamentals. What exactly is resiliency? It's your workload's ability to withstand partial or intermittent component failures across multiple services and to eventually recover efficiently. But resilience isn't monolithic. It consists of multiple facets and components, and we need to understand those when we're architecting our own resilience strategies.
So that first pillar is high availability. So think of it this way. When you have network issues or component failures, high availability is your system's ability to fail over, keeping your applications running. As an example, imagine you're running a very popular e-commerce website and it's Black Friday. One of your web servers goes down because of a hardware failure, but your load balancer detects this and immediately starts directing traffic to the remaining available healthy servers. So that's high availability in action, automatic failover so you can maintain service continuity when individual components fail.
But in that example, your customers continue shopping, everybody gets what they're looking for, and nobody notices anything happening in the background. Even though that failure occurred, the website keeps running. But failover isn't the whole story. On the flip side, we have data protection, and this is for when your application is completely down and you have to be able to fully recover that site. So while high availability helps prevent the outages, it's data protection that recovers from more catastrophic failures, and you can think of this in three layers.
We have first, backup and recovery. So these are regular snapshots that let you restore to a point in time before that data corruption event occurred.
Things like restoring from an accidental deletion using the previous night's backup fall into this category. Next, disaster recovery is your ability to restore operations to an alternative site in the event that your primary location is unavailable. Closing it out, we have business continuity, which encompasses the people, the processes, and the communications needed to work through these events. So the key difference here is that high availability operates in seconds to keep your application running, while data protection typically operates in hours to get you back up and running. In your resilience planning, you're typically going to need both.
Now failure comes in many forms, and understanding their likelihood versus the impact of a failure is going to drive your resilience strategy. So the most common types of failure, anyone can guess, it's human error, right? We have somebody fat fingering, adding an extra zero, misconfiguring a parameter in the application deployment. Moving a bit further along, you have the load-induced type errors. If we're thinking again about that e-commerce website, perhaps there's heavier traffic than expected that's overloading our systems and things go down. Now these types of high likelihood but low impact events call for high availability type solutions. We have automatic failover, load balancing, and redundancies in place to help handle these events.
But as we move across the continuum, we need to shift our approach. Here we have, on the extreme end, natural disasters, regional disruptions, and cyber events like Danny kicked us off with today. These are lower likelihood but much higher impact types of scenarios. For these, we need data protection type solutions. Here's where our disaster recovery, our comprehensive backup solutions, and our business continuity planning all come into play. So it's high availability that manages the frequent but manageable disruptions. Data protection is there for those rare catastrophic ones.
Your investment in each one of these is going to ultimately depend on the business criticality of your applications. If you have, say, financial systems or ticketing systems, you're likely designing for both simultaneously because even a rare disruption can't be tolerated. Ultimately though, you want to make sure that you're aligning your metrics to help you decide where to make these investments. For those common but manageable errors, invest heavily in high availability, but don't neglect data protection for those rare catastrophic events that could ultimately end your business.
AWS Shared Responsibility Model and Infrastructure Foundations
We've all probably heard frequently that at AWS we talk a lot about shared responsibility models in the security context. I'm sure everybody has happened upon that in some other talks, but we also operate in a shared responsibility model for resiliency that directly impacts the strategies that you're going to implement that we're going to discuss. What that means is you're responsible for the resiliency in the cloud, while AWS is responsible for the resiliency of the cloud. Now what exactly does that mean?
AWS handles the core infrastructure and the resiliency of that. Things like regions, availability zones, and edge locations are all there to ensure that our compute, our storage, our network, and our database services are all highly available and redundant. That gives you the building blocks that you need for building your own resilience strategies. But here's a critical distinction, right? If you think of a service like Amazon S3 that we talk about having eleven nines of durability, I write that object to it, it's properly replicated, and we're giving you eleven nines of durability. If I corrupt that object in S3, we're going to make that corruption eleven nines durable as well, right?
The important takeaway there is it doesn't remove your responsibility of architecting your applications for that data protection. As we talk about the shared responsibility model again, we're providing enterprise-grade infrastructure that allows you to design highly available redundant systems. But at the application level, it's still ultimately your responsibility to put in place the resilient patterns that are going to work for your business application.
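The session doesn't prescribe a specific control at this point, but a minimal sketch of what "resiliency in the cloud" can look like on the customer's side is enabling S3 Versioning, so an accidental overwrite or delete doesn't become the only eleven-nines-durable copy. The bucket name and the 90-day noncurrent-version window below are illustrative assumptions, not anything from the talk.

```python
import boto3

s3 = boto3.client("s3")

# Enable versioning so an overwrite or delete creates a new version instead
# of destroying the only copy (the bucket name is a placeholder).
s3.put_bucket_versioning(
    Bucket="example-critical-data-bucket",
    VersioningConfiguration={"Status": "Enabled"},
)

# Optionally expire old noncurrent versions after a retention window so
# versioning does not grow storage costs without bound.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-critical-data-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-noncurrent-versions",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "NoncurrentVersionExpiration": {"NoncurrentDays": 90},
            }
        ]
    },
)
```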
Let's quickly review what we mean by regions and availability zones. These are the infrastructure foundations for all of the resilience strategies that we'll talk about. Regions are geographically distributed locations, and they enable things like disaster recovery and meet data residency requirements.
So within each region we have multiple availability zones, and the availability zones are your isolated failure domains, and they enable different types of high availability designs. The key architectural principle here is that the availability zones are separated enough to avoid correlated failures, so things that are in that high impact area that we were talking about on the matrix, but they're also close enough to enable you to have synchronous replication and automatic failover, so you can incorporate those into your high availability designs. This gives you both building blocks. You have the availability zones for high availability, and you have the regions for data protection across different geographic boundaries.
All of our AWS services are architected to take advantage of these infrastructure boundaries as well, so understanding how they're designed helps you make the right choices when you're designing your own applications. You can categorize AWS services into three types based on the infrastructure patterns they implement. The first is zonal services: Amazon EBS and Amazon EC2 operate within individual availability zones, and they fail independently as well. This gives you granular control. You can design how you want to distribute your application across multiple zones, but you're ultimately also responsible for defining how you're going to fail over or direct traffic between those zones.
Now regional services like Amazon S3 and Amazon DynamoDB abstract away a lot of this complexity for you, so AWS automatically distributes the data across multiple availability zones and handles the failover for you automatically. So if your application is using Amazon S3, for example, and there's a disruption, that's invisible to the application, and that's AWS handling the high availability for you. And finally we have global services. These have a distributed architecture where there's a control plane in a single region, but the data plane spans worldwide. So if you look at a service like Amazon Route 53, that operates in 200 plus points of presence around the globe. So this gives you high availability capabilities as well as data protection or disaster recovery capabilities within a single service.
So the important takeaway here is to leverage regional services wherever possible, so that high availability is provided at the service level for you automatically. You can couple that with zonal services, but make sure you understand how they operate and how you distribute your data so that you can handle the resiliency of the application yourself. Utilize the global services to add additional high availability and, more importantly, the disaster recovery capabilities should you need to move between regions.
One important caveat is that even regional services can have global dependencies. If we look at Amazon S3, for example, the bucket naming service has its control plane in US East 1, but its data plane is global. So if US East 1 becomes unavailable, I'm no longer able to create new buckets, because I need a globally unique name, but my existing buckets continue to be highly available and active. So consider these types of things when you're designing your application and, specifically, your disaster recovery planning.
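As a quick way to see these building blocks in your own account (not something demonstrated in the session), the hedged boto3 calls below enumerate the Availability Zones of a region and the regions available to the account; the region name is a placeholder.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# List the Availability Zones in the current region; each zone is an
# isolated failure domain you can spread zonal resources across.
zones = ec2.describe_availability_zones(
    Filters=[{"Name": "state", "Values": ["available"]}]
)
for az in zones["AvailabilityZones"]:
    print(az["ZoneName"], az["ZoneId"])

# List the regions enabled for the account -- the boundaries used for
# multi-region disaster recovery and data residency decisions.
for region in ec2.describe_regions()["Regions"]:
    print(region["RegionName"])
```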
Recovery Objectives and Multi-Dimensional Protection Strategies
So how you prepare for availability and corruption events is directly related to your objectives when such an event arises, but it also ties to your budget. The first question you need to answer when these events occur is: how soon after the event happens do I need my system to become available again? That's your Recovery Time Objective, or RTO. The other question is: how much of my data can I stand to lose after such an event? That's your Recovery Point Objective, or RPO.
So if you're configuring things like synchronous replication, that's going to provide you with the shortest time to recovery, but it doesn't give you a lot of capability should you need to address data corruption or accidental deletions because I can't necessarily go back to a point in time before then. On the other hand, creating continuous or periodic backups allows me to go back in time to points before a data corruption event occurred, but it's going to take me typically longer than it would in a replication type solution. But I do have more flexibility in terms of where I can recover to. Now traditionally, enterprises maintained multiple data centers, and they replicated their critical data from one data center
to another and maintained backups that were stored off site. Today, AWS provides a sophisticated framework that's organized around two dimensions. One dimension is a zonal versus a regional scope, and the other is our recovery versus our availability approach. The matrix I'm going to talk through shows you four distinct strategies that you can tie to those two dimensions.
The first one is multi-zonal availability, and this means that your applications stay available despite an individual zonal failure. In these designs, you're configuring things like Amazon EBS, Amazon EFS, Amazon FSx, and Amazon RDS to replicate data between multiple zones so that data is active and available in the event of an individual zonal failure. This is a proactive availability approach, so systems are spread across multiple zones so that you can easily handle or fail over from an individual zonal failure. But it doesn't help much in the case of data corruption.
The next approach is multi-zonal recovery. In multi-zonal recovery, we are using backups. We're taking backups of our file, our block, our object, and our database storage systems, and we're copying those to another availability zone. So in the event of a primary zone outage, we restore that data into the alternate zone to get those systems back online, or back to a point in time before a corruption event occurred.
Multi-regional availability is very similar to multi-zonal, but in this case, we're replicating our critical business data from our primary region to a secondary region. Again, this is a proactive approach because now we have multiple systems spread across different geographic boundaries so that we can fail over to the secondary region in the event of a regional disruption and fail back when that region comes back online. Again, we're not in that type of solution protecting against data corruption type scenarios.
Rounding out our quadrant is multi-region recovery. In this case, we're taking backups in our primary region of our critical data, and we're copying those backups to a secondary region, again reactive in approach. But if we are to have that rare scenario where we have a data corruption event and there's some challenge in accessing the primary region, we can restore these systems back to that point in time in the secondary region. The key thing to think about is usually you're not picking just one of these quadrants and implementing your resilience strategies. You're going to use multiple of these to align against the different types of applications you're running and the business criticality of those applications.
Implementing Multi-AZ Availability and the 3-2-1-1-0 Framework
Let's take a look at that first quadrant, multi-zonal availability, and see how it looks in practice with a typical three-tier application. Here we have a very common type of straightforward multi-AZ architecture where we're leveraging a combination of zonal and regional services. We're going to use things like NAT gateways and load balancers to direct traffic between the different zones. As a best practice, you start off by placing your web servers behind a load balancer, but we're not doing this just for performance in this case, but also to enable the ability to direct to healthy servers across multiple zones.
Additionally, we want to set up Amazon EC2 Auto Scaling groups across multiple zones. So in the case of a single zone failure, the load balancer is going to start directing traffic to the healthy web servers, and the Auto Scaling groups are going to spin up additional resources in those zones so that we can maintain our level of service. Amazon Aurora and Amazon RDS can replicate data across multiple zones as well and then support automatic failover. Additionally, we can add in multiple read replicas if we want redundancy at the read level of our databases.
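As a rough sketch of the Auto Scaling piece just described (the speakers don't show code), the following boto3 call spreads a web tier across subnets in three Availability Zones and registers instances with an Application Load Balancer target group. The launch template name, subnet IDs, target group ARN, and sizing are all placeholder assumptions.

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Spread the web tier across subnets in different Availability Zones and
# register instances with an ALB target group; unhealthy instances are
# replaced and traffic shifts to the surviving zones automatically.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-tier-asg",
    LaunchTemplate={
        "LaunchTemplateName": "web-tier-template",  # assumed to already exist
        "Version": "$Latest",
    },
    MinSize=2,
    MaxSize=6,
    DesiredCapacity=2,
    # One subnet per AZ -- placeholder IDs.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    TargetGroupARNs=[
        "arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/web/abc123"
    ],
    HealthCheckType="ELB",
    HealthCheckGracePeriod=120,
)
```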
Rounding this out, using services like Amazon S3 and Amazon DynamoDB within the applications gives us that high availability at a regional level by default. We've just seen how multi-AZ availability can keep you running through application or availability zone failures, but availability alone isn't enough. I hope I'm stressing that significantly for you. You also need a strategy to recover from data corruption events, deletions, and those ransomware or cyber events like Danny kicked us off with today. On premises, the classic 3-2-1 backup model served us well.
The cloud enables even more flexible and cost-effective strategies on top of that. At AWS, we put forth a strategy or framework called the 3-2-1-1-0 framework, and this is our gold standard for data protection. What this entails is that you have three copies of your data that are separate from the primary resource. You have two copies that are in different locations, so that can be cross-account or cross-region. One copy is there for local recovery, so you have fast restore for operational issues. And one copy is immutable, isolated, and stored in a vault so that you're protecting against cyber events or ransomware-type scenarios. Probably often overlooked, but most importantly, you have some sort of process that's regularly testing your backups to ensure that you're able to recover with zero errors in the event that you actually need to use them.
But here's a critical point: not every application needs this level of protection. In fact, carrying multiple copies can be significant in cost, so you ultimately want to tie your protection plans to the business criticality and the requirements that drive the strategy for your applications. The framework is flexible, so one copy can solve multiple purposes within the solution. Your cross-region copy could also be immutable for your cyber recovery. The local operational copy can leverage native service capabilities like Amazon EBS snapshots or Amazon S3 replication. Ultimately, you want to match your protection strategy to the business impact.
Your financial trading system probably wants 3-2-1-1-0 because it's very mission-critical, but that monthly reporting system may just need 1-1-0 to be sufficient. Match the RPO and the RTO of the different applications to the strategies you ultimately select. Use 3-2-1-1-0 as a North Star for your mission-critical applications, the applications your business depends on most heavily. But ultimately, you want to scale the approach based on the business criticality and the tolerance for risk within a given application.
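To make the framework concrete, here is a hedged boto3 sketch of an AWS Backup plan that keeps a daily local copy for fast operational restore and copies each recovery point to a vault in another account and region. The vault names, destination ARN, schedule, and retention values are assumptions chosen for illustration, not recommendations from the session; the destination would typically be a locked or logically air-gapped vault to serve as the "one immutable" copy.

```python
import boto3

backup = boto3.client("backup")

# Daily backups kept locally for 35 days, with every recovery point also
# copied to an isolated vault in another account/region for a year.
plan = backup.create_backup_plan(
    BackupPlan={
        "BackupPlanName": "mission-critical-3-2-1-1-0",
        "Rules": [
            {
                "RuleName": "daily-with-offsite-copy",
                "TargetBackupVaultName": "local-operational-vault",
                "ScheduleExpression": "cron(0 5 ? * * *)",  # 05:00 UTC daily
                "StartWindowMinutes": 60,
                "CompletionWindowMinutes": 480,
                "Lifecycle": {"DeleteAfterDays": 35},
                "CopyActions": [
                    {
                        # Placeholder ARN for an isolated recovery vault.
                        "DestinationBackupVaultArn": (
                            "arn:aws:backup:us-west-2:999999999999:"
                            "backup-vault:isolated-recovery-vault"
                        ),
                        "Lifecycle": {"DeleteAfterDays": 365},
                    }
                ],
            }
        ],
    }
)
print(plan["BackupPlanId"])
```

A second rule or a second copy action could keep the "two locations" copy distinct from the immutable copy if your criticality tier calls for it.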
Now, the data protection strategies that we talked about map ultimately to three different or distinct types of copies, and each one of those serves different scenarios within your recovery processes. The first one is a local copy for fast recovery. This is a copy that's going to be in the same region as the original resources, so that way you have the shortest RTO when you need to recover. This is for those typical operational accidental deletions that occur, and they map more to those higher likelihood but lower impact type failure scenarios from our matrix earlier. Now, the trade-off here is that you get fast recovery, but they're not super useful in the case of regional disasters or ransomware-type events.
Our remote copies for disaster recovery are meant to handle the geographic, disaster-type scenarios. Here we're typically using some sort of replication solution with a failover mechanism in place, and that's meant to handle those lower likelihood but higher impact regional disasters. Finally, we have our cyber recovery copies. If we think back to Danny's talk earlier, Maersk happened upon this accidentally by having a system that was offline. We want to do the same thing, but in a planned way. You want to have an immutable, isolated copy that's air-gapped from your existing operational systems, because this is your break-glass-in-case-of-emergency copy in a ransomware scenario.
You want to make sure that these are regularly tested so you know that they're going to be able to recover, because this is the ultimate protection when all of your other defenses have been taken down by a sophisticated attack. You want to be able to get back to this particular point in time. The key takeaway from this is that ultimately we want each copy to map to a different failure scenario. The local copies are for that fast operational restore, these are the common things we've always dealt with in our IT world. Those cross-regional copies allow us to fail over in the event of regional-type disasters, and then that cyber recovery copy is well protected, isolated, and serves as that last line of defense against ransomware events. I'd like to hand it over next to Steve, who's going to go into a bit more depth on how you implement a lot of these things in practice. Thank you, Brian.
Replication Strategies: AWS Elastic Disaster Recovery and Native Services
So when considering the number, the type, and the location of your recovery copies, you need to consider the criticality of each of your resources as well as the risks you are trying to mitigate. Are you concerned about the loss of an entire geographic region, requiring you to bring your systems up in another region? Or are you concerned about ransomware and needing to be able to recover after losing access to your account, malicious corruption of your data, or maybe even loss of your AWS Organizations management account?
Replication to other AWS regions allows you to maximize your availability in the case of a regional disruption and to quickly fail over in such an event. Many AWS services, including Amazon S3, Amazon EFS, Amazon FSx for ONTAP, Amazon DynamoDB, Amazon RDS, and Amazon Aurora, support native replication functionality. Typically, application owners build replication into their architecture, since they're the ones who understand the relationships in the data they're storing as well as the criticality of each individual resource their applications are built on top of.
Based on your budget and your RTO, application owners may choose to set up their replica target with minimal infrastructure for cost-conscious applications. But for more critical applications, they may set up a hot standby by having fully provisioned instances and storage services. Likewise, to simplify the fail back after a failure and the merging of the transactions, they may create a warm standby by making the target read-only. As with any resiliency solution, it is critical that your application owners periodically verify the recoverability of their application. If, for example, they've configured a warm standby in another region, you may want to have them periodically, perhaps twice a year, perform a fail over to another region and then fail back.
AWS Elastic Disaster Recovery (DRS) is a replication service designed specifically for instance-based workloads. Regardless of where your application is hosted, DRS provides automated replication and recovery into an AWS region of your choice. To configure DRS, you first install AWS replication agents on your source servers. These agents are responsible for capturing all disk activity at the block level and replicating it to the region of your choice.
Now these agents capture each change at the block level and send them after compressing and encrypting them. By only sending the modified change blocks, they're minimizing their impact on your network bandwidth. Now in the AWS region where the target is, DRS provisions lightweight instances that are designed to simply capture the changes and create a synchronized copy of your volumes from your primary site. DRS also configures snapshots at the frequency and with the retention of your choosing.
And in the case of an event,
DRS will provision production instances in the subnet of your choice when you choose to go ahead and initiate a recovery. It will restore volumes from your snapshots using a time of your choosing and attach those to the provisioned instances, allowing you to bring up your application within minutes. This architecture provides near zero RPO through continuous replication and minimal RTO through automated recovery orchestration. This is a good solution for your instance-based lift and shift applications as it provides the ability to fail over in the case of a regional issue, and it allows you to roll back to a previous point in time in case there was a data corruption.
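The session describes this flow rather than showing API calls, but assuming the boto3 drs client's describe_source_servers and start_recovery operations, a recovery drill might be kicked off roughly like this; the empty filter and the "launch everything" selection are simplifications for illustration.

```python
import boto3

drs = boto3.client("drs", region_name="us-west-2")

# Find replicating source servers (an unfiltered call is shown here; in
# practice you would likely filter and paginate).
servers = drs.describe_source_servers(filters={}, maxResults=50)
source_ids = [s["sourceServerID"] for s in servers["items"]]

# Launch a drill: recovery instances are created from the most recent
# point in time without marking the servers as failed over, which is how
# you can exercise recovery on a regular schedule.
if source_ids:
    drs.start_recovery(
        isDrill=True,
        sourceServers=[{"sourceServerID": sid} for sid in source_ids],
    )
```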
Now let's look at some of the AWS services that support native replication. Amazon S3 offers cross-region and cross-account replication with a couple of delivery options: you can choose best-effort delivery for your cost-sensitive applications, or you can choose S3 Replication Time Control, which provides an SLA of delivery within 15 minutes. Amazon FSx for NetApp ONTAP offers one-to-one and one-to-many replication to the AWS region of your choice. DynamoDB Global Tables offers continuous active-active replication across AWS regions; it's a great solution for globally distributed applications that require low-latency access as well as automatic failover.
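As a concrete sketch of the S3 option above, the following hedged boto3 call configures cross-region replication with S3 Replication Time Control on an already-versioned bucket; the bucket names, IAM role, and account ID are placeholders.

```python
import boto3

s3 = boto3.client("s3")

# Cross-region replication with S3 Replication Time Control (RTC), which
# backs replication with a 15-minute SLA. Versioning must already be
# enabled on both the source and destination buckets.
s3.put_bucket_replication(
    Bucket="primary-app-data",
    ReplicationConfiguration={
        "Role": "arn:aws:iam::111122223333:role/s3-replication-role",
        "Rules": [
            {
                "ID": "replicate-all-to-dr-region",
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {
                    "Bucket": "arn:aws:s3:::dr-region-app-data",
                    "ReplicationTime": {
                        "Status": "Enabled",
                        "Time": {"Minutes": 15},
                    },
                    "Metrics": {
                        "Status": "Enabled",
                        "EventThreshold": {"Minutes": 15},
                    },
                },
            }
        ],
    },
)
```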
AWS Backup: Centralized Data Protection with Logical Air Gap Vaults
Where replication allows for quick failover to new infrastructure with little to no data loss, backups allow you to restore your application to a point prior to an unintended data mutation. These mutations could be caused by a bug, a user error, or malware. Many AWS services offer native snapshot and backup functionality. Application owners rely on these features for operational tasks. For example, an application owner may create a backup before deploying a schema update to their database, or they may use a backup to clone a resource for test and analytics tasks. But after that task is complete, they typically need to be able to delete those backups.
But to comply with data protection requirements from the organization and the industry, many organizations create data protection teams. These teams use AWS Backup to define and deploy backup policies. These policies create periodic or continuous backups of their critical applications across all production accounts. And then from an AWS delegated administrator account in the organization, these teams can monitor the job status, allowing them to address any issues that arise.
To protect against an unauthorized user deleting your backups, you can choose to store your backups in a locked vault. The lock will prevent any attempt to delete or otherwise modify the lifecycle of the backups within it. For further isolation, you can store your backups in a logical air gap vault or copy them to a locked vault in an isolated account within your organization.
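A minimal sketch of those two vault options, assuming the create_logically_air_gapped_backup_vault API available in recent SDK versions; the vault names and retention bounds are illustrative assumptions.

```python
import boto3

backup = boto3.client("backup")

# A standard vault with AWS Backup Vault Lock in compliance mode: once the
# ChangeableForDays grace period expires, recovery points cannot be deleted
# early or have their retention shortened, even by account administrators.
backup.create_backup_vault(BackupVaultName="locked-operational-vault")
backup.put_backup_vault_lock_configuration(
    BackupVaultName="locked-operational-vault",
    MinRetentionDays=7,
    MaxRetentionDays=365,
    ChangeableForDays=3,
)

# A logically air-gapped vault for the isolated cyber-recovery copy;
# minimum and maximum retention are required, and the values here are
# assumptions for illustration.
backup.create_logically_air_gapped_backup_vault(
    BackupVaultName="cyber-recovery-lag-vault",
    MinRetentionDays=7,
    MaxRetentionDays=120,
)
```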
A logical air gap vault can be shared with another account, whether inside your organization or outside. This allows your application owners to perform tests without impacting the primary production account, so you can share your vault with a test account.
In that test account, you can go ahead and perform restores, proving that you can rebuild your application on a regular basis. But to automate this process, you can build restore testing plans with AWS Backup. These plans, at the frequency of your choosing, will select the latest backup of your selected resources and automatically restore that for you.
Upon the completion of the restore, you'll be notified with an event. When you get notified, you can have a Lambda function validate the content of the resource that was just restored to determine whether the data is as you expect. The result of that validation can be sent back to AWS Backup.
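A hedged sketch of such a validation function, assuming it is triggered by the AWS Backup "Restore Job State Change" event from EventBridge (the exact event fields used here are assumptions) and that it reports the result back with put_restore_validation_result; the actual check is a placeholder you would replace with logic specific to your data.

```python
import boto3

backup = boto3.client("backup")

def handler(event, context):
    """Invoked from an EventBridge rule when a restore-test job completes."""
    detail = event["detail"]
    restore_job_id = detail["restoreJobId"]
    created_resource_arn = detail.get("createdResourceArn", "")

    # Placeholder validation -- for example, query the restored database and
    # compare row counts or checksums, or mount and inspect the file system.
    data_looks_valid = created_resource_arn != ""

    # Report the outcome back to AWS Backup so it appears alongside the
    # restore job in the centralized restore testing reports.
    backup.put_restore_validation_result(
        RestoreJobId=restore_job_id,
        ValidationStatus="SUCCESSFUL" if data_looks_valid else "FAILED",
        ValidationStatusMessage="Automated post-restore check",
    )
```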
Now that centralized report that I talked to you about a little bit earlier will also include, with your restore job reports, the status of the restore as well as the result of your validation. One other thing I want to point out is the backups created by AWS Backup can only be deleted through AWS Backup APIs, CLIs, or console. This makes it simple for you to create a separation of control whereby the data protection team owns the lifecycle of the backups they create, while the application owners own the lifecycle of the primary application.
Likewise, not just with regards to the lifecycle, you have a separation of control of access. The backup administrators can maintain the access to their backups through policies placed on the vaults, and the application owners have control over the access to the application itself. The people who are managing the data, the data protection team folks themselves, they don't have direct access to the primary resources. They simply have access to pass a role to AWS Backup.
Now let's take a look at a simple example. Here we have an application, simple, built on top of Amazon RDS, Amazon EBS, which are zonal services, and Amazon EFS, which is a regional service. You can configure AWS Backup through backup plans to back up all three of those resources, and if you're concerned about a regional disaster, you may consider copying those backups to another AWS region.
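A small, hedged example of wiring those resources into a backup plan by tag, so the RDS database, EBS volumes, and EFS file system are all picked up by the same plan; the plan ID, IAM role, and tag key/value are placeholders.

```python
import boto3

backup = boto3.client("backup")

# Assign the application's RDS, EBS, and EFS resources to an existing
# backup plan. Selecting by tag keeps newly created resources covered
# automatically as long as they carry the same tag.
backup.create_backup_selection(
    BackupPlanId="11111111-2222-3333-4444-555555555555",
    BackupSelection={
        "SelectionName": "simple-app-resources",
        "IamRoleArn": "arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
        "ListOfTags": [
            {
                "ConditionType": "STRINGEQUALS",
                "ConditionKey": "app",
                "ConditionValue": "simple-app",
            }
        ],
    },
)
```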
Now let's say the availability zone is no longer there. You can use AWS Backup to restore your application into another availability zone by restoring the zonal services, RDS and EBS, then building a new compute instance and connecting the resources you restored as well as the regional non-impacted EFS file system. Similarly, if the whole region has been destroyed by a natural disaster, you can use AWS Backup in the target region to restore all three resources, once again building a new production instance and connecting all three resources.
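A hedged sketch of the restore side for a single recovery point, using get_recovery_point_restore_metadata to seed the restore parameters and start_restore_job to rebuild the resource elsewhere; the ARNs, role, and the availabilityZone metadata override are assumptions for illustration (restore metadata keys vary by resource type).

```python
import boto3

backup = boto3.client("backup")

vault = "local-operational-vault"
rp_arn = "arn:aws:backup:us-east-1:111122223333:recovery-point:EXAMPLE"  # placeholder

# Fetch the metadata AWS Backup recorded for the recovery point; the restore
# call expects largely the same keys back, plus any overrides you need, such
# as a different Availability Zone or subnet in the target location.
meta = backup.get_recovery_point_restore_metadata(
    BackupVaultName=vault, RecoveryPointArn=rp_arn
)["RestoreMetadata"]

# For an EBS recovery point you would typically override the target
# Availability Zone here (key name assumed for illustration).
meta["availabilityZone"] = "us-east-1b"

job = backup.start_restore_job(
    RecoveryPointArn=rp_arn,
    IamRoleArn="arn:aws:iam::111122223333:role/service-role/AWSBackupDefaultServiceRole",
    Metadata=meta,
    ResourceType="EBS",
)
print(job["RestoreJobId"])
```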
The Well-Architected Framework provides six pillars for building robust cloud architectures. I'm going to bring together many of the things that the three of us have talked about to help you define a strategy for your data protection. Since the criticality of your data varies from application to application, as well as within an application, the first step is working with your application owners to classify the data that they manage.
Then for each class, define your strategy with regards to what replication requirements you have in order to mitigate the risks that you're concerned about and build backup policies at the organization level. These policies will provide the backups required for doing recovery after an unintended mutation of your data.
From a security standpoint, please ensure that you use the same rigor for managing access to your backups as you do for your primary resources, because every copy you create to protect yourself from availability issues or malware is another opportunity for data exfiltration. Your application owners have probably configured all sorts of controls to manage access to their primary resources. As a data protection team, you need to make sure you have the same sorts of controls in place so that the data you're holding in your backups doesn't get leaked.
Finally, it is critical for you to test your recovery. Your application owners need to be able to not just show that they can restore a given resource, but they need to be able to practice recovering their entire application in another account. Basically, if you are trying to mitigate a risk where you lose access to your primary account, your application owners need to not only make sure that things are backed up, but they need to be able to prove that they can recover their full application in another account.
Now, from a delegated administrator account, please make sure your data protection teams are monitoring the status of your data protection jobs, so that if failures occur, they can go back and address those failures and any configuration issues that are causing them. You can also create compliance reports from that centralized admin account, allowing you to know that you have the backups for your critical resources within the time that you expect.
Now, let's take a look at another example. This one is still a simple EC2-based application, but this time we're going to talk more holistically about the various things that we've discussed. In this situation, in your workload account, you have your first backup copy created with AWS Backup and stored in a standard backup vault. Your backup plan can also be configured to copy those backups to a secondary account, this one perhaps storing the data in a logical air gap vault, which assures you access to your backups even if you lose access to your accounts.
Now when the backups are copied over, you can choose to capture that event and go ahead and build orchestrated testing or analytics in that account. Here, of course, like I said earlier, you can use automated restore testing to verify the recoverability of the backups you created and be able to report that back to your auditors.
Now for a third copy, consider using AWS Elastic Disaster Recovery (DRS) for this basic instance-based application. DRS will replicate the volumes for your application to a region of your choice, thereby allowing you to fail over to another region if there's a regional disruption or disaster. So with these three copies, you're protected against loss of your data due to a simple user mistake, against a malware attack that takes over your account and destroys it or destroys your organization account, and against a regional outage.
Configuring Cyber Recovery and Final Recommendations
So at the top of this presentation, Danny talked about a massive cyberattack. I'm going to walk through how you can configure a logical air gap vault to protect you against such an attack. The first step is creating a recovery organization. In that recovery organization in the management account, you need to create identities for a handful of trusted individuals in your organization. Now you don't want to use the same identity provider for those identities because you do not want to have any shared dependencies between your recovery organization and your production organization.
Create an approval team with those identities. Specify how many approvals you need to successfully approve a request. Then you can share that approval team with your workload accounts and associate them with your logical air gap vaults. Then if a disaster occurs and you lose access to everything you own, you still have access to your backups. From a newly created account, you can use AWS Backup to request access to that logical air gap vault using the ARN for that vault. The request will be forwarded to the approval team. If they vote and say yes, you'll be given access to your vault where you'll be able to restore your applications.
With this system, you are guaranteed to be able to recover even if you lose access, as was described by Danny, to everything you own in your production environments. With that, let me go ahead and pass you back to Danny for him to close us out. Thank you.
Great, thank you. Fantastic, thank you Steve. So just rounding out the session then, I think it's fair to say we should be preparing for the worst. Assume the breach, assume that digital extinction level event and work back from that. And I think it's important as well to really understand those important business services, those critical services that make your business a business.
Now, as I mentioned at the top of the session, I work in financial services. I speak to a lot of banks. So typically in a bank, for example, you'd have a list of business services and a list of infrastructure services, probably a dozen or so of each, those mission-critical ones. So you'd have payments. Payroll is a pretty important one as well, because without employees you haven't got a business. For a bank, you'd have mobile banking, online banking, ATM access to cash, access to balances, and if they've got a big trading division, then a Murex front office.
And then on the infrastructure side, we've talked about Active Directory already, using the Maersk example, but there'd be others. I'd include DNS in that. For a mainframe environment, you'd have the LPARs. And once you've nailed down those important business services, you need to test and test and test again, validating that end-to-end recovery and making sure it's aligned with your business processes.
Please do take us up as well on the opportunity to run a cyber event maturity assessment workshop. We'll be hanging about after the session here. Now the customers I've run the workshop with have gleaned a significant amount of value and benefit from it, and it's really helped inform their business continuity plans and strategy moving ahead. Please do as well continue your AWS storage journey, your learning journey. There's a lot more information to be found at AWS.training/storage.
And finally, thank you ever so much for your time. Please, we'd really appreciate the session survey feedback. We'd love to come back next year and share some key insights with you. So please do complete the form. Most importantly, thank you so much for your time. We hope you've enjoyed this morning and we hope you have a fantastic rest of your day. Thank you so much.
This article is entirely auto-generated using Amazon Bedrock.