Kazuya

AWS re:Invent 2025 - Building resilience against ransomware using AWS Backup (STG412)

🦄 Making great presentations more accessible.
This project enhances multilingual accessibility and discoverability while preserving the original content. Detailed transcriptions and keyframes capture the nuances and technical insights that convey the full value of each session.

Note: A comprehensive list of re:Invent 2025 transcribed articles is available in this Spreadsheet!

Overview

📖 AWS re:Invent 2025 - Building resilience against ransomware using AWS Backup (STG412)

In this video, AWS Backup team members discuss building ransomware resilience through three key pillars: immutability and isolation, integrity, and availability. They present detailed reference architectures using AWS Backup's logically air-gapped vaults, which provide compliant locking, service-owned encryption, and cross-account sharing. The session emphasizes threat modeling using the STRIDE framework, the 3-2-1 backup strategy, and critical concepts like Mean Time to Detect and Maximum Tolerable Data Loss. Key features covered include Amazon GuardDuty integration for malware scanning, restore testing capabilities, and multi-party approval for vault access during incidents. The speakers stress that backups are primary attack targets and that recovery planning should work backwards from business continuity requirements, distinguishing between operational, disaster, and cyber recovery scenarios. They introduce the minimum viable company concept for prioritizing critical system recovery.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction to Building Ransomware Resilience with AWS Backup

Welcome to session STG412, where we're discussing building resilience against ransomware using AWS Backup. My name is Sabith Venkitachalapathy. I'm a worldwide solutions architect, part of the AWS Backup team, and I've got the immense pleasure of working with customers worldwide, helping them build secure and cost-effective AWS Backup implementations.

Yes, hello, I am Ivan Velickovic, and I'm a senior engineer in AWS Backup. I started the service back in 2018 with a group of people, so whatever you're using with AWS Backup, partially I'm to blame for it.

Hi everybody, I'm Vivek Mishra. I'm a senior product manager with AWS Backup. I look over a lot of the platform areas, focusing specifically on ransomware protection and recovery, and here I am.

Thumbnail 60

Thumbnail 70

Okay, so let's get started. Before we begin, a couple of housekeeping items. This was planned as a level 400 session, which means that we will go deeper into AWS Backup reference architectures and touch upon some of the core concepts of a clean room recovery. I'm pretty sure most of you will want to take pictures of the reference architectures, and because they are built out in phases, we will let you know when each one is complete with this camera icon. Having said that, I'll let Vivek take us through the journey.

Thumbnail 100

So the theme that we're going to cover today is cyber resiliency, which is the ability of an organization to prepare for, withstand, and then recover from cyberattacks, all while keeping the lights on and pretending that everything is perfectly fine. It's essentially the art of keeping your business operational while the digital world is throwing its worst at you. We will talk about cyber resiliency specifically from the perspective of how you withstand the threat of ransomware. I think most folks here understand how the ransomware landscape has evolved over the last couple of decades: from being just a nuisance, it is now a full-blown industry.

Thumbnail 150

So what we will do in this session is we will go through some of the recovery strategies using AWS Backup specifically, talk about certain reference architectures and so forth. So the way we'll do this is we will cover the cyber threat landscape. We'll talk about how the role of backups is so important in this recovery strategy that you build, and then the heart of this discussion will be spent on understanding some of those specific strategies, reference architectures, and the best practices that are out there using AWS Backup. We'll just go from there and give you a summary of all of it. With that, let's begin.

Thumbnail 190

Thumbnail 200

Thumbnail 220

Understanding the Cyber Threat Landscape and Backup Targeting

Exactly, okay, thank you, Vivek. Before we get started on understanding how to build resilience, we have to understand what is driving us to implement a cyber resiliency strategy in the first place. So let's go deeper into the cyber threat landscape. You have most probably been through multiple security sessions that explain how ransomware gets its foothold in your organization, so we're not going to go deeper into those tactics or attack patterns. Instead, we will focus on what happens once the ransomware payload is already in your organization.

Thumbnail 240

The first thing that triggers the entire recovery process is what we call the ransomware event, where most of your production workloads are impacted by encryption. What we have seen is that when attackers want to compromise your organization, the first thing they do is make sure all of your recovery options are depleted, and the most solid recovery option that you have is your backups. In almost all of our conversations with customers, we've heard that deletion of backups is their primary concern.

Thumbnail 260

Once you have a proper backup strategy and enough backups in place, the first thing that you'll have to do is perform a post-event recovery. Most of the discussion today is centered on how we minimize data loss. From the ransomware event, which impacts the availability of your services, to the post-event recovery, which restores business continuity, we want to keep this time period as short as possible.

Thumbnail 300

So now let's look at a hypothetical timeline. After a lot of planning by the attacker, a threat has been introduced into your organization.

Thumbnail 330

Thumbnail 350

Just for clarity on the timelines: anything in green indicates good data, anything in yellow or orange indicates infected data, and anything in red indicates compromised data which you cannot use. Based on conversations with our customers, we have seen that there is almost always a period when the attacker gains access to your environment and then stays dormant for long enough before they start increasing the threat level. As the threat penetration level increases, you can see that the data goes from infected to compromised.

Thumbnail 360

Thumbnail 380

Thumbnail 390

After some time, depending on all of your detection systems, there is a time when you have detected the compromise. We call this the Mean Time to Detect. For every organization's secure recovery resilience, the first and foremost thing is to ensure that you keep this Mean Time to Detect as minimal as possible. Obviously, after the threat has been detected and after so much planning and deciding how to recover and what runbooks to get through, we would have started the mitigation, and this is where we talk about the Mean Time to Respond.

Thumbnail 400

Thumbnail 410

Thumbnail 430

After all the planning and execution, running through your runbooks and accessing your backups, you eventually mitigate the threat, which is the point at which your business can operate normally again. This is what we call the Mean Time to Normal. But the most important aspect here is to keep the business loss as small as possible, and one important metric in all this planning is what we call the Maximum Tolerable Data Loss, which is the amount of data that your organization can afford to lose before you have to think about moving on to a different business domain altogether.
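To make the timeline concrete, here is a minimal Python sketch that derives these intervals from incident timestamps. The `Incident` fields, the choice to measure time to normal from the moment of detection, and the choice to express Maximum Tolerable Data Loss as a time window are all assumptions made for illustration; the session does not define the metrics this precisely.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """Timestamps for one incident (all field names are illustrative)."""
    compromised_at: datetime         # attacker first gains access
    detected_at: datetime            # compromise is detected
    mitigation_started_at: datetime  # runbooks start executing
    normal_at: datetime              # business operates normally again
    last_clean_backup_at: datetime   # newest backup known to be good


def timeline_metrics(incident: Incident) -> dict:
    """Per-incident intervals; averaging over many incidents gives the means."""
    return {
        "time_to_detect": incident.detected_at - incident.compromised_at,
        "time_to_respond": incident.mitigation_started_at - incident.detected_at,
        # Assumption: time to normal is measured from detection.
        "time_to_normal": incident.normal_at - incident.detected_at,
        # Assumption: everything written after the last clean backup is lost.
        "data_loss_window": incident.detected_at - incident.last_clean_backup_at,
    }


def within_tolerance(metrics: dict, max_tolerable_data_loss: timedelta) -> bool:
    """True if the data loss window fits inside the tolerated window."""
    return metrics["data_loss_window"] <= max_tolerable_data_loss


if __name__ == "__main__":
    example = Incident(
        compromised_at=datetime(2025, 3, 1, 6),
        detected_at=datetime(2025, 3, 4, 6),
        mitigation_started_at=datetime(2025, 3, 4, 10),
        normal_at=datetime(2025, 3, 6, 6),
        last_clean_backup_at=datetime(2025, 3, 1, 0),
    )
    metrics = timeline_metrics(example)
    for name, value in metrics.items():
        print(f"{name}: {value}")
    print("within 24h tolerance:", within_tolerance(metrics, timedelta(hours=24)))
```

In this sketch a long dormant period directly inflates the data loss window, which is why a short time to detect matters so much for staying inside the tolerated loss.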

Thumbnail 440

Obviously, after all that, there's normal operations. As I mentioned earlier, we have been talking with a lot of our customers worldwide, and we have been looking into a lot of industry research. There are a few key themes that come out of this research. One is a persistent threat landscape, which means there is always someone intending to compromise your organization. The other important thing that we've heard and seen from our customers is an organizational alignment challenge.

Imagine that in your organization you've got a backup admin team which defines what the policies should look like, an application team which owns the actual application data, and a data protection or recovery team which decides how to combine all of these things to perform a recovery. We have seen that there's often misalignment between these teams, which hinders recoveries. As with most cyber events, there is always a financial impact to your organization, so you have to make recovery as fast as possible to minimize it.

But the thing that we're going to concentrate on today is backups being targeted as part of the attacks. Based on the research and on the conversations that we've had with our customers, we have seen that backup systems are a target in all of these scenarios. There is also an evolving regulatory landscape. Because of the increasing number of attacks and their growing impact on the economy and on the affected enterprises, regulators are stepping in to make sure these enterprises are prepared.

Thumbnail 560

Thumbnail 570

Thumbnail 580

The Evolving Regulatory Landscape for Cyber Resilience

So how does it happen? First and foremost, as customers you have access to a lot of standards and frameworks, and these standards and guidelines are support mechanisms that you can tap into. On the foundations laid by these standards and best practices, regulations are being built. Regulations like DORA and NYDFS focus specifically on the financial services industry (FSI), while GxP and HIPAA focus on healthcare and life sciences. These are just examples of what a vertical-based regulation looks like.

Thumbnail 600

And then we also have regulations which are specific to a country. One example is MAS TRM from the Monetary Authority of Singapore.

Thumbnail 620

Thumbnail 630

Thumbnail 650

The TRM regulation talks about how to enforce business continuity management in Singapore, and it is followed by HKMA and by APRA in Australia. This is what we call the industry- and country-based regulations. On top of this, there are regulations and frameworks that apply to you at a global level. Things like NIST and GDPR have a global focus, which means that depending on where your enterprise operates, they may apply to you globally and you'll have to demonstrate compliance. What we have seen in all of our conversations is that regulators are now mandating cyber resilience and data protection as an essential business function, not just as an afterthought.

Thumbnail 670

Threat Modeling for Recovery Resilience Using the STRIDE Framework

With that, I would like to invite Ivan to take us through the role of backups and risk mitigation.

What we'll start with today is threat modeling for recovery resilience, and the reason I'm starting with this is that I personally think the majority of you are already familiar with the process of threat modeling. You are already using it while you're designing your systems. The only difference is that today we will focus on threat modeling for recovery resilience instead of just threat modeling against compromise.

There are multiple frameworks, and all of them are good. We will be using today what's called the STRIDE four-question model. That model starts with asking a very simple question: What are we working on? This is an opportunity for you to understand the system that you're going to protect. Try to understand what are the building blocks of your system, try to understand the mutual relationship between these components, and try to understand different dependencies. A very important part here is to try to understand consistency requirements of different components of your system because they will dictate what kind of recovery strategy you can actually implement.

You will be proposing a recovery strategy that fulfills your needs. Ask yourself now what can go wrong with the system that you are trying to protect and with the system that you devised as a mitigation strategy. That starts by asking who are your attackers. Who is the attacker who can harm your system? That vastly depends on what kind of permissions that persona can have in your system. Can they be, let's say, an administrator in your business account? Can they be an administrator in your management account of your organization? Can they be an insider that is intimately familiar with your internal operational procedures? All this should be taken into account when you are modeling the attack.

Next, and this is the most creative part of this exercise: what can we do to mitigate the risks that we identified in the previous step? This is very important because depending on the dependencies that you identified previously, you should devise a recovery strategy that fits your needs. Understand what RTO you are targeting, understand whether that is possible, and work with your application owners to understand whether the recovery strategy that you devised makes sense and whether it is achievable. It is very important to know that just having backups is not enough. Recovery assumes that your system can actually be recovered from backups. In order to achieve this, you will probably need to make some modifications to your existing system and build this recoverability into it.

Finally, the final question is, did we do a good job? Every time you're modifying your system, part of that modeling should be an exercise that discusses whether this new change affects the recoverability of your system. Also, perform regular evaluations of your recovery strategy and measure your performance. Exercising the recovery procedure is the key that can give you confidence that you can actually execute the strategy that you devised in case of a cyberattack.

Thumbnail 880

Strategies for Securing Backups: The 3-2-1 Approach and Recovery Planning

Threat factors depend vastly on the system that you will be designing, but on the other side, there are some common factors that we can talk about. What are the common attack vectors? What are the common targets that malicious actors will target in your system? They can target your backup strategy, they can target your backup plans that are currently protecting your resources, and they can target your notification mechanisms that you use to keep track of whether backups are taking place or not.

They can target your recovery infrastructure. They can target backups directly by trying to delete them, and they can target the additional infrastructure that you have put in place in order to facilitate recovery in case of an attack.

Thumbnail 930

Thumbnail 940

When we talk about strategies for securing backups, there are some common questions. How many backups should I create? How should I keep them? Where should they be kept? What we usually recommend, and this is an industry standard, is a 3-2-1 approach. You should have three copies of your data, two of which are backups, and one of which is across your trust boundaries. Depending on whether you are targeting disaster recovery as well as cyber resilience, you can decide whether to put your backups in a separate region or keep them in the same region.
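The 3-2-1 rule as stated here can be written down as a simple check. The following Python sketch is illustrative only: the `DataCopy` type and the use of an account identifier as the trust boundary are assumptions, not part of AWS Backup.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataCopy:
    name: str
    is_backup: bool       # False for the primary (production) copy
    trust_boundary: str   # hypothetical label, e.g. an AWS account or organization


def satisfies_3_2_1(copies: list, primary_boundary: str) -> bool:
    """Three copies in total, two of them backups, one backup outside the
    trust boundary that holds the primary data."""
    backups = [c for c in copies if c.is_backup]
    isolated = [c for c in backups if c.trust_boundary != primary_boundary]
    return len(copies) >= 3 and len(backups) >= 2 and len(isolated) >= 1


if __name__ == "__main__":
    copies = [
        DataCopy("production volume", is_backup=False, trust_boundary="workload-account"),
        DataCopy("primary vault backup", is_backup=True, trust_boundary="workload-account"),
        DataCopy("isolated vault backup", is_backup=True, trust_boundary="isolated-account"),
    ]
    print(satisfies_3_2_1(copies, primary_boundary="workload-account"))
```

A check like this could run as part of a periodic audit, so that a workload that silently loses its isolated copy is flagged before an incident rather than during one.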

Thumbnail 980

Finally, it's important to test your backups regularly to know that you can restore them and to evaluate how long recovery takes. Keep in mind that when you're working in the cloud, you're working in a multi-tenant system. Just asking us to give you new resources may take some time, and that may affect the amount of time it takes to recover from your backups.

Thumbnail 1000

This is the second thing that you should be thinking about: what data should I back up, and how should that data be backed up? That depends on two factors. On one side, it is about understanding what data you need in order to recover. On the other side, it is about asking yourself how much money you are willing to spend on your protection strategy. Identify the criticality of your systems. Try to understand what data is needed for recovery. Is it raw data, or is it data that you can derive from the raw data, so that it doesn't have to be backed up, or at least doesn't have to be recovered as fast as the rest of your system?

Another very important point, which I really cannot emphasize enough, is that backups are copies of your data. As such, they can become a threat on their own if not protected properly: they can become an exfiltration risk. So protect yourself by guarding your backups and following the least privilege principle. Try to understand who needs to have access to your backups and when these backups need to be accessed.

Thumbnail 1120

Finally, when it comes to recovery, what I would actually recommend to you is early on as a consequence of threat modeling, try to develop healthy runbooks and try to keep them up to date so that you know what are the steps that you need to execute in order to react to a cyber event. Now that we are clear with the concepts, let me now start with some specific examples of how you can approach building a reliable recovery strategy.

Thumbnail 1130

Thumbnail 1140

Three Pillars of Data Protection and AWS Backup Service Overview

To go from there, we'll start by laying the foundation of what we in data protection believe is important for you to consider. I'm going to talk about three very specific pillars. You can have more pillars as you like, but I think these should be considered a given. We are talking specifically about backups, about what will help you recover from those backups when you're under attack, and about which pillars are really relevant from that perspective.

The first pillar that we always talk to our customers about is immutability and isolation. What it really means is that your backups cannot be altered and your backups are isolated. I'll go into more detail on this pillar as we go along and then tie it to specific service capabilities that you can leverage to achieve it.

Thumbnail 1180

Thumbnail 1200

The second aspect that we really focus on is integrity. It's not only that you have immutable and isolated backups: those backups are clean, they exist, they are uncorrupted, and you can restore them. And then the last part is availability. You can have your backups locked away very nicely, but if they're not recoverable, then it's no use. Basically, your backup is only as good as your recoverability.

Thumbnail 1220

To ensure that, you have to focus on the availability pillar so that once an event really happens, you can go and restore your backups and bring your business back online. From our perspective, those are the three pillars that you should focus on. I'll talk about specific examples of what the AWS Backup service offers.

Thumbnail 1230

But to get there, let me just give you a quick overview, because some of you may be new to this service. AWS Backup was launched in 2019 with the very specific purpose of offering a centralized data management and data protection capability for our customers. One of the guiding principles was ease of use: you can schedule your backups, control lifecycle policies, perform restores as you wish, and audit the activities that happen on your backups. You can scale across your organization by using backup policies and do cross-account monitoring all in one place.

Our focus was more on ease of use and then giving you a capability where you can expand across many different AWS native services so that you don't have to think about individual services and how to add a lot of those value-added capabilities like search, like scanning for malware, and so forth. You can just come to one place and you can perform all of that. In the last six years, we have expanded from a couple of services to now supporting 22 AWS services, and we expect to continue to expand that scope as we go forward.

Thumbnail 1310

Immutability and Isolation: Logically Air-Gapped Vault Capabilities

Okay, so now going back to those three pillars, starting with immutability and isolation. Immutability is simply about how you ensure that your backups remain unchanged, and it's not just about somebody coming in and changing the contents of a backup. That was never possible, from day one. What it really means is that your backups are further protected from being compromised: they cannot be re-encrypted, nobody can change their lifecycle policy, and nobody can delete them. From our perspective, that's what immutability really means.

And then isolation is slightly different. Yes, you have your backups which cannot be changed, but you also need to ensure your backups are not in the same sort of failure domain as your production systems so that if production systems are getting compromised, your backups are actually still available. They don't have the same way of failing so that you can recover when you need to. So it's very important that your backups are isolated separately from your production systems.

Thumbnail 1380

So what can help with that? I'll talk about one capability. There's a general Vault Lock capability that we also have, but this is one managed solution that we offer to our customers which operates across all three pillars. I'll focus a little bit on the first pillar. Logically air-gapped vault is a solution that we launched in 2024. I don't know how many of you are familiar with vaults, but a vault is basically a logical container where you manage your backups as a single unit. You can apply access policies, encryption, and various other capabilities that the service offers at the vault level, so that you don't have to work on individual backups, which makes your management task easier.

This solution offers some built-in capabilities across these three pillars. The backups inside a logically air-gapped vault are always locked in compliance mode by default. There is no other option. So even if an attacker is able to get inside your system and take over your account, they really cannot do anything with those backups. The backups will stay there for as long as their lifecycle permits. So when you get access back to your account, you can rest assured that those backups are intact: they have not been altered and they have not been deleted, and you can go from there and take further actions on them with confidence.

It provides isolation, which I said was slightly different than immutability. Isolation in the sense that the backups are actually stored in separate service accounts and they're encrypted with service-owned encryption. And where that helps is if an attacker is able to get to your account or get to your KMS key account, they really cannot delete the encryption key. They cannot deny access to those backups because they're not really tied to the core account that has been compromised. So at the time of recovery, this isolation allows the service to perform and allow customers to perform a recovery by accessing those backups even though the core account is not under your control.

It offers cross-account sharing capability. This is where it gets interesting for the availability pillar, because you can now allow the backups in your vault to be shared across different accounts. You can do this for data loss recovery or for forensics testing, or you can do it if your account is completely compromised and start your recovery process in a different account, even in a different organization, not just within the same organization. And then there's a faster restore experience. This is all about RTO, because if your account is compromised, you will eventually get access to it back, but it may take a while. What we really want is for you to be able to get access to your backups during this period, so that you can start your restore operation and bring your business back online. You don't want to wait for days and weeks to start that process. You can start it right away because you can get access to your backups even in an extreme situation.
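As a rough illustration of the vault described above, the following Python sketch builds the request for the boto3 `create_logically_air_gapped_backup_vault` call. The vault name and retention values are placeholders, and the validation rule in the helper is an assumption of this sketch, not a statement of the service's limits. The AWS call itself needs credentials and the boto3 package, so it is kept in a separate function.

```python
def air_gapped_vault_request(name: str, min_retention_days: int,
                             max_retention_days: int) -> dict:
    """Build the keyword arguments for creating a logically air-gapped vault.

    The retention bounds define how short or long a recovery point's
    retention may be inside the vault.
    """
    if not 0 < min_retention_days <= max_retention_days:
        raise ValueError("expected 0 < min_retention_days <= max_retention_days")
    return {
        "BackupVaultName": name,
        "MinRetentionDays": min_retention_days,
        "MaxRetentionDays": max_retention_days,
    }


def create_air_gapped_vault(request: dict) -> str:
    """Create the vault and return its ARN (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    client = boto3.client("backup")
    response = client.create_logically_air_gapped_backup_vault(**request)
    return response["BackupVaultArn"]


if __name__ == "__main__":
    # "cyber-recovery-vault" and the 7 to 35 day window are placeholder values.
    print(air_gapped_vault_request("cyber-recovery-vault", 7, 35))
```

Sharing the vault with a recovery account is a separate step that is not shown here.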

Thumbnail 1560

Integrity and Availability: Restore Testing and Multi-Party Approval

Okay, integrity. So I mentioned it briefly initially, but what this really means is that you not only have to keep your backups secure, but you also need to ensure that these backups are restorable. They're free from all kinds of malware so that when you're trying to restore them, you know that they are good to go. There is no further action you need to do. You get a status of your backup. Okay, this backup looks clean. I can go restore it, and it's going to take me a certain number of hours to restore a specific resource. So integrity is really important from our perspective. I think just having backups is not sufficient without having that assurance that your backups are actually clean.

Thumbnail 1600

And the capability that we offer for this pillar is restore testing, which is basically a managed way of testing your backups on a regular basis. So some of you may be aware of how the service offers backup plans where you frequently take backups of your primary resources. It's very similar to that concept that you can frequently set up restore testing of your backups, and it offers it as a managed solution that you can select the kind of backups you want to pick up at what frequency. And then you can hook on your own test cases into this workflow where the service will restore those backups, allow you to test your restored instances, and then once you're done with your testing, you can report back to the service as to what happened with your testing and did it meet your requirements or not. And the service will record all that for auditing purposes so that you can share with your auditors that you have actually tested your backups and they're all good. And then at the end of that, you can get that report nicely, and then the service will clean up the restored resources so that you don't have to take an action. So it does this as a single solution so that ease of operation is what it is all about.
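The restore testing workflow can be sketched with boto3 as follows. This is a minimal sketch under assumptions: the plan name, schedule, and vault ARN are placeholders, the selection of which resources to test (a separate restore testing selection) is not shown, and the calls that reach AWS need credentials and the boto3 package.

```python
def restore_testing_plan(name: str, vault_arns: list, schedule_expression: str,
                         selection_window_days: int = 7) -> dict:
    """Build the RestoreTestingPlan structure for create_restore_testing_plan."""
    return {
        "RestoreTestingPlanName": name,
        "ScheduleExpression": schedule_expression,
        "RecoveryPointSelection": {
            # Test the newest recovery point found inside the window.
            "Algorithm": "LATEST_WITHIN_WINDOW",
            "IncludeVaults": list(vault_arns),
            "RecoveryPointTypes": ["SNAPSHOT"],
            "SelectionWindowDays": selection_window_days,
        },
    }


def create_plan(plan: dict) -> None:
    """Register the plan with AWS Backup (requires AWS credentials)."""
    import boto3  # imported lazily so the builder works without the AWS SDK

    boto3.client("backup").create_restore_testing_plan(RestoreTestingPlan=plan)


def report_validation(restore_job_id: str, passed: bool, message: str = "") -> None:
    """Report the outcome of your own test cases back to the service."""
    import boto3

    kwargs = {
        "RestoreJobId": restore_job_id,
        "ValidationStatus": "SUCCESSFUL" if passed else "FAILED",
    }
    if message:
        kwargs["ValidationStatusMessage"] = message
    boto3.client("backup").put_restore_validation_result(**kwargs)


if __name__ == "__main__":
    # Placeholder ARN and schedule: weekly, Sundays at 05:00 UTC.
    plan = restore_testing_plan(
        "weekly_restore_test",
        ["arn:aws:backup:us-east-1:111122223333:backup-vault:primary-vault"],
        "cron(0 5 ? * SUN *)",
    )
    print(plan)
```

The validation result is what ends up in the audit trail, so your test cases should report a failure explicitly rather than simply not reporting at all.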

Thumbnail 1680

Okay, availability. Now I said immutability and isolation is important. Integrity of your backups is important, but I think from my perspective, availability is the most important one because your backups can be as secure as possible, as good as possible, but if you're not able to get to them, all of that effort is lost. And this is where availability from our perspective is extremely important, and we have built some very specific capabilities in the service to ensure that you can do that easily. And I'll talk about one of those capabilities.

Thumbnail 1700

Earlier this year, we launched support for multi-party approval. Multi-party approval lets you form teams of highly trusted individuals in your organization who come together as a quorum. At the time of an incident, these multi-party approval teams approve critical actions. In our scenario, the most critical action that we have onboarded is the ability to get access to your logically air-gapped vault. So if you have lost access to your entire organization, including the backup account, and you can no longer recover, a multi-party approval setup can help you access your backups so that you can start your recovery. The way you set this up is that you define your multi-party approval team, ideally in a separate organization. It leverages IAM Identity Center identities, basically human-based identities. You can have, let's say, five to ten or more members on that team, and then you define a minimum threshold, for example that five out of ten have to approve a critical action. You associate that team with the air-gapped vault at the beginning, when the vault is created, and once that association is set, nobody can break it except the approval team. So even if an attacker has full access to your organization and to that backup account, they cannot break this association, because it is managed by the approval team itself.
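The quorum idea itself is simple to model. The Python sketch below illustrates only the M-of-N threshold logic described above; it is not the AWS multi-party approval API, and every class and member name in it is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApprovalTeam:
    members: frozenset
    threshold: int  # minimum number of distinct approvers required

    def __post_init__(self):
        if not 1 <= self.threshold <= len(self.members):
            raise ValueError("threshold must be between 1 and the team size")


@dataclass
class ApprovalRequest:
    action: str
    team: ApprovalTeam
    approvals: set = field(default_factory=set)

    def approve(self, member: str) -> None:
        """Record an approval; only team members may approve."""
        if member not in self.team.members:
            raise PermissionError(f"{member} is not on the approval team")
        self.approvals.add(member)  # a set ignores repeated approvals

    @property
    def approved(self) -> bool:
        return len(self.approvals) >= self.team.threshold


if __name__ == "__main__":
    team = ApprovalTeam(frozenset(f"approver-{i}" for i in range(10)), threshold=5)
    request = ApprovalRequest("share air-gapped vault with recovery account", team)
    for i in range(5):
        request.approve(f"approver-{i}")
    print(request.approved)
```

Counting distinct approvers rather than approval events is the important design choice: one compromised identity approving repeatedly must never be able to reach the threshold.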

So what happens in a situation where your account goes down is that you can spin up a new recovery organization or a recovery account, and you can have the approval team approve access to your vault, even though the account that owns it is still not under your control. Because the access approval is done by the team, you can start restoring your backups or even copy those backups out to that new recovery account or to other accounts within your organization. It just works seamlessly from there, but you have to think about this not at the time of an event, but at the beginning, when you're planning your recovery strategy, creating your vaults, and setting up this infrastructure.

Thumbnail 1860

Thumbnail 1870

Thumbnail 1910

AWS Backup Reference Architecture for Cyber Resilience

Let me tell you about how all these building blocks that we just discussed kind of come into play as one big picture. When we're working in AWS Cloud, of course, AWS Backup recommends usage of AWS Organizations to centralize your data management and your protection strategies. This is where you will be placing your Service Control Policies that will govern access to different components of your backup architecture. One important part about AWS Organizations is that within the organization there are multiple accounts. All accounts are equal, but the management account is more equal than the others. In that case, what we actually recommend instead of logging into your management account every time you need to see what's going on with backups, we recommend creating a delegated admin account and configuring backup personas in that account. This is where you will be using backup policies to specify what kind of backup strategy to implement. Also, this is where you can configure Backup Audit Manager to monitor your backups across your organization.
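As one possible illustration of a Service Control Policy that guards the backup architecture, the Python sketch below generates a policy document denying destructive backup actions to everyone except a designated role. The role name `BackupAdminRole` and the exact choice of actions are assumptions for this example, not a recommendation from the session.

```python
import json


def deny_backup_tampering_scp(exempt_role_name: str) -> dict:
    """Build an SCP that denies destructive AWS Backup actions unless the
    caller is the exempt role (hypothetical name supplied by the caller)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyBackupTampering",
                "Effect": "Deny",
                "Action": [
                    "backup:DeleteBackupVault",
                    "backup:DeleteRecoveryPoint",
                    "backup:UpdateRecoveryPointLifecycle",
                    "backup:PutBackupVaultAccessPolicy",
                    "backup:DeleteBackupVaultAccessPolicy",
                    "backup:DeleteBackupVaultLockConfiguration",
                ],
                "Resource": "*",
                "Condition": {
                    "ArnNotLike": {
                        "aws:PrincipalARN": f"arn:aws:iam::*:role/{exempt_role_name}"
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(deny_backup_tampering_scp("BackupAdminRole"), indent=2))
```

Because an SCP can only deny, it complements rather than replaces the vault access policies and IAM permissions configured in each account.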

Thumbnail 1930

Thumbnail 1980

Another thing, depending on your decision regarding your backup strategy, we often recommend creating an isolated account that will store your keys. The reason for this is that, as I mentioned, attackers who are targeting your recovery infrastructure would consider getting hold of the keys that you use for recovery as their prized possession. If that happens, technically the recovery becomes impossible. Even with all the backups that you copy all over your organization, you will not be able to use them. These are workload accounts. This is where your systems live. This is where the business happens in every account, in every region wherever you are doing your business.

Thumbnail 2000

Thumbnail 2010

You start by creating a regular backup vault and configuring the necessary access policies. This is where the first copy of your data, from the 3-2-1-1 strategy that I mentioned, lives. Your primary resources live here, and the first backup that you create is created in that primary vault. This year we launched Amazon GuardDuty integration. With GuardDuty integration, you can scan your backups as they are being backed up. The reason this is labeled with number zero is that it ensures your backups are safe to recover.

Thumbnail 2030

Let's start talking about the secondary copy that I mentioned in our 3-2-1-1 strategy. Our reference architecture recommends using logically air-gapped vaults for storing your secondary copy, as Vivek mentioned. This is equivalent to copying backups into another account, so you don't have to do that manually anymore. Logically air-gapped vaults are essentially handling this problem for you. One important change that we launched relatively recently is the ability to use logically air-gapped vaults as your primary backup target. In that specific case, you may consider not using the primary vault that I started this discussion with.

Thumbnail 2080

Thumbnail 2100

Of course, for an additional copy, if your recovery strategy needs to take into account disaster recovery situations, our recommendation is to dedicate a separate account in a different region from your business and create copies into that account. If that is not your goal, you don't have to do this.

Finally, we recommend creating a dedicated account for different forensic activities. This account can be used, for instance, for regularly running your restore tests and checking whether the backups that you run your tests with are by any chance affected by malware. You can also use this account in case of a disaster, essentially as a safe environment where you have a pre-installed set of tools that you will be using for recovery.

Thumbnail 2160

Thumbnail 2180

Recovery Pipeline Phases and the Minimum Viable Company Concept

So now that we have seen what it takes to implement an effective backup strategy and how you can implement that using AWS Backup reference architecture, we need to understand how this entire effective backup strategy and your protection strategies for your backups align with your larger recovery pipeline. We're going to be talking about these effective recovery mechanisms.

Thumbnail 2200

Before I get started, how many of you here think that just taking backups and having a secure backup is your silver bullet for recovering from a cyber event? Show of hands, please. I would like to tell you that that's probably not the case all the time. There is a lot of planning that has to happen beforehand to ensure that this backup fits into your larger recovery pipeline.

The most important thing is that after an attack has really happened, there are multiple phases of recovery. Obviously, in the first phase of the recovery, your security operations center or your security team comes in and triages all the signals that are coming from the environment to decide whether it really is an event that you should be concerned about. After triaging all the security signals that are coming from your organization, the security team confirms that this is now an event. This is where they confirm that your organization is compromised and you would have to be thinking of threat containment. This is where your security operations center invokes procedures to ensure the threat is contained and you're not exposing your environment to further threats.

Thumbnail 2260

Obviously, after the first phase has happened, that is where the next and the most important phase kicks in, which is where you've decided that most of your critical applications have been impacted and they're inaccessible. After the recovery green light has been given by the security operations center or your C-level team, there is always a provision of a clean environment. Most of the time when such an event happens, there is never a reuse of an existing environment because we can't really figure out while the forensics are being done whether the back door is still persistent in that particular environment.

So when the actual recovery happens, the red box that you see here is where the backups that you have already secured, kept available, and verified for integrity come into play. Once that backup data has been used to hydrate your environment with all the protected data, with your appropriate cyber RPO in place, the application recovery team or the app team comes back in to determine whether there are any post-event actions they have to take to ensure the hydrated data is properly made available.

Thumbnail 2340

Once that is done, now comes the most important part, which is the shared responsibility of understanding how this compromise happened in the first place. You don't want to get into a situation where you've invested a lot of money and effort, and then the recovery just brings the infection back online, allowing the compromise or the attacker to come back. That is why the post-event recovery review is the most critical aspect of it.

Just to summarize all of these things, one important thing that you have to always consider is that whatever we discussed until now in terms of creating a proper backup strategy is only one small aspect of the larger equation. Obviously, it requires a lot of planning, a lot of understanding about what your critical systems are. That's where the concept of a minimum viable company comes into play.

Thumbnail 2400

Imagine when you start a recovery, you wouldn't start recovering all of your business systems just on the get-go. Obviously, there are a lot of dependent systems that your business-critical systems would be relying on. These might be foundational layers that your organization relies on, which are security services or messaging services that would require you to be operating before you can start setting up anything else.

Thumbnail 2420

Thumbnail 2430

And there will always be dependent services that you recover after that, which depend on the foundational security and messaging services, like a platform as a service that is required for your important business services to be brought online. So as you can see here, whenever the organization is trying to recover, it wouldn't start recovering the IBS, or important business services, from the get-go.

Thumbnail 2440

Thumbnail 2450

Obviously they have to rely on the critical path of recovery, which decides what the recovery steps should look like. As part of your planning as an organization, you should have understood a couple of things, and the most important is your crown jewel applications, or the minimum viable company. Out of all the thousands of applications that you have, the minimum viable company is the part that is required for your organization to get back on its feet after going through a cyber attack.

And even within the minimum viable company, the other important concept is the minimum viable service. So imagine you're running an e-commerce application. You wouldn't want to bring your databases to the original capacity that used to operate and handle millions of customers. Probably you would scale it down to a minimum capacity to ensure that it can operate on bare minimum capabilities like listing the products, placing orders, even if it means that customers have to go through queuing for an additional few minutes. That wouldn't matter.

Thumbnail 2510

Thumbnail 2530

So you have the important business services, the minimum viable services, and most importantly, the critical path of recovery, which decides the order in which you recover your minimum viable company. Some organizations call it the minimum viable company, some call it the minimum viable business, but whatever the name, it is your crown jewel applications that ensure your organization can get back on its feet. After all of that has happened, you can go back and recover any other systems that are not critical for your organization's recovery on day one.

These include your leave management systems or your end-of-day charts, things that are probably not business critical in nature. The other important aspect of this is the path of recovery. This diagram looks pretty simple because we had to fit it into a slide, but as a complex organization you might have complicated critical paths of recovery. For example, your customer data might have to be brought online before any of the product management systems can be brought online, so you could have nested critical paths of recovery for your organization.
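One way to reason about a critical path of recovery is as a dependency graph, where a topological sort yields a valid recovery order. The sketch below uses Python's standard `graphlib` module. The system names and the dependencies between them are hypothetical, loosely based on the layers described above (foundational services, platform, important business services).

```python
from graphlib import TopologicalSorter

# Each key maps to the set of systems that must be running before it.
# All names here are illustrative, not taken from a real environment.
dependencies = {
    "security-services": set(),
    "messaging": set(),
    "platform-as-a-service": {"security-services", "messaging"},
    "customer-data": {"platform-as-a-service"},
    "product-management": {"customer-data"},
    "order-placement": {"customer-data", "product-management"},
}


def recovery_order(deps):
    """Return one valid order in which the systems can be recovered."""
    return list(TopologicalSorter(deps).static_order())


def recovery_waves(deps):
    """Group systems into waves that can be recovered in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


for number, wave in enumerate(recovery_waves(dependencies), start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Systems outside the minimum viable company, such as a leave management system, would simply be left out of this graph and recovered later. A circular dependency surfaces as a `CycleError`, which is a useful signal during planning that the recovery plan itself needs untangling.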

Thumbnail 2600

Comprehensive Reference Architecture and Key Takeaways for Cyber Recovery

So the most important aspect is that it has to be thought through, and backup and recovery is only one aspect of the entire process. Having said that, I'll let Vivek come back to summarize some of these findings. So actually, I'd like to summarize with a more detailed reference architecture, but I'll explain it, and it's not all that complicated. I think Ivan gave a good overview of how you can generically apply some of the concepts in building a reference architecture, and you can build on that or customize it as per your needs and your requirements in terms of protection, or whether you want to save on cost.

There are various ways of setting that model up. What I'm trying to show here is a summary of my view: if you are trying to achieve those three pillars as part of your strategy, how you should think about setting up your air-gapped vault, and how you can leverage it for recovery on one side and for regular testing or integrity tests of your backups on the other. The way this model looks is that you have your workload accounts, you have your resources, and you regularly take your backups in your backup vault.

You scan them for any kind of malware, for the resources we support, and then you isolate and protect them in an air-gapped vault. Once they get inside an air-gapped vault, they're certainly protected. Now you can take two actions. One is that you regularly test them, and the way we recommend our customers do that is by sharing that vault with other accounts. What you see on the right side is an example where we're using Resource Access Manager, so that you don't have to copy those backups into your forensics account. They are available there because it's like a shadow of your logically air-gapped vault, and you can restore your backups using this sharing capability on the right side.

You can either use your own custom DIY solution or use third-party tools. We have an example here where we are recommending an APN partner. You can set up restore testing plans, restore your backups, and then leverage a partner tool like Elastio to scan them and send those reports out for auditing purposes.

On the left side is the recovery part. The two are not the same. Sharing your backups with Resource Access Manager is primarily for data loss recovery or testing. It's not meant for recovering from a compromise. The main reason is that if your workload account or your organization is compromised, the attacker can simply compromise the RAM share that you have created.

To really perform a recovery, you need to associate your air-gapped vault with a multi-party approval team. Once that association is done, the approval team can at any time approve a request to get access to that vault in your recovery account, which is a shadow sort of vault to your original logically air-gapped vault, which we call the restore access backup vault. Once you have that vault showing up in this recovery account, you can restore your backups or you can even copy those backups out. This account need not be in the same organization. This account can be in a different organization in itself, but you have to set that up as part of your initial setup of your entire reference architecture.

With that, I'd like to say that there are a few things you should consider. There are a few blogs that are already online which explain this concept and the various ways you can set this up. There are a couple of demos we have, and you should go over them and see how it can fit into your own model.

Thumbnail 2820

Thank you. One key message that we want to get out there is that recovery is becoming a key area for a lot of the organizations that we work with. First, as a customer, you would not want to take a backup just for the sake of it. What regulators expect you to do and what your organization wants to achieve as part of business continuity are all aligned and connected to recovery.

We always recommend that you work backwards from how you recover and what is required for you to get to that state of recovery and ensure business continuity. We have also seen, based on industry research, conversations with our customers, and feedback from a lot of the attacks that we have seen over the years, that backups are also a target of these attacks. As Vivek mentioned, when we talk about backups, there are always three important pillars of backup security that you have to plan for: the availability, the integrity, and the isolation of your backups, to ensure that they're all readily available as part of your recovery.

How do you do this? The most critical aspect is to perform threat modeling and work backwards from the recovery context of your organization. You wouldn't necessarily want to implement everything we talked about in the reference architecture as a whole. Rather, we advise that customers perform threat modeling, understand what threats would compromise their recovery resilience, and then work backwards to apply mitigations for those threats. The reference architecture that we talked about may apply to you in whole, or only in part. It all depends on the outcome of the threat modeling exercise that you have done.

When we talk with our customers, we always advise that the most critical thing is to consider operational recovery, disaster recovery, and cyber recovery as three independent things. You can never achieve the same RPO for cyber recovery as for operational recovery, because a cyber RPO involves more aspects of planning, and the RTO depends on the validation process that you apply. Disaster recovery, to Ivan's point, might not apply to you if you're not looking at the threat of a region going down or a region's availability being hampered.
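To make the difference between an operational RPO and a cyber RPO concrete, here is a small Python sketch. It assumes that backups taken after the initial compromise cannot be trusted, so the usable restore point is the latest one that both predates the compromise and passed validation. The `RecoveryPoint` class, the `scan_clean` flag, and the dates are hypothetical and only serve the arithmetic.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Iterable, Optional


@dataclass(frozen=True)
class RecoveryPoint:
    created_at: datetime
    scan_clean: bool  # hypothetical flag: passed a malware scan or restore test


def latest_clean_recovery_point(
    points: Iterable[RecoveryPoint], compromise_time: datetime
) -> Optional[RecoveryPoint]:
    """Latest validated recovery point created before the compromise."""
    candidates = [
        p for p in points if p.scan_clean and p.created_at < compromise_time
    ]
    return max(candidates, key=lambda p: p.created_at, default=None)


def effective_cyber_rpo(
    points: Iterable[RecoveryPoint],
    compromise_time: datetime,
    incident_time: datetime,
) -> Optional[timedelta]:
    """Data-loss window if recovery starts from the latest clean point."""
    point = latest_clean_recovery_point(points, compromise_time)
    return None if point is None else incident_time - point.created_at


# Daily backups at midnight; the attacker gets in on day 3 at noon and is
# detected on day 5 at noon, so the day 4 and day 5 backups fail validation.
backups = [
    RecoveryPoint(datetime(2025, 12, day), scan_clean=(day <= 3))
    for day in range(1, 6)
]
rpo = effective_cyber_rpo(
    backups,
    compromise_time=datetime(2025, 12, 3, 12),
    incident_time=datetime(2025, 12, 5, 12),
)
print(rpo)  # 2 days, 12:00:00
```

With the same daily schedule, an operational recovery could lose at most about one day of data, while in this example the cyber recovery loses two and a half days, because the time taken to detect the compromise is added to the window.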

Finally, cyber resilience is actually a business problem. It is not something that the IT can solve for on its own. Obviously, IT can augment the capabilities or IT can help in solving the cyber recovery problem, but it has to be something that is mandated and aligned with the business expectations.

Having said that, it was great having you all here. We tried to pack a complicated concept within the time that we had here. We'll be available for questions offline, but we highly recommend that you provide us your valuable feedback so that we can improve on the sessions and be back here with a more refined approach. Thank you very much. It was great seeing you all here. Thank you.


; This article is entirely auto-generated using Amazon Bedrock.
