🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.
Overview
📖 AWS re:Invent 2025 - Build resilient SaaS: multi-account resilience testing patterns (ISV404)
In this video, Kannah V J, a Senior Solution Architect at AWS, demonstrates how to build resilient multi-tenant SaaS architectures using AWS Fault Injection Service (FIS). The session covers six critical SaaS pillars including tenant isolation, noisy neighbor mitigation, and observability, then explores AWS's resilience lifecycle framework with five stages: set objectives, design and implement, evaluate and test, operate, and respond and learn. Through live demonstrations using a multi-account setup with e-commerce and RAG applications, the speaker showcases specific resilience testing patterns including noisy neighbor scenarios with API Gateway throttling limits, tenant isolation validation using attribute-based access control, serverless Lambda timeout testing, and EKS pod fault injection. The presentation emphasizes defining clear hypotheses, running controlled experiments across multiple AWS accounts, and continuously iterating resilience testing as part of CI/CD pipelines to proactively identify weaknesses before production failures occur.
; This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.
Main Part
Introduction: Building Resilient Multi-Account SaaS Architectures
My co-speaker, Frenil, could not be here for personal reasons, so today I'm privileged to be a solo speaker. Welcome to our session ISV404: Build Resilient SaaS: Multi-account Resilience Testing Patterns. I'm delighted you decided to join and appreciate you investing your valuable time with us. How many of you have experienced that panic-inducing 3 AM call about your multi-tenant SaaS environment failing? Any volunteers? Great.
What if you could proactively validate your multi-account resilient architecture against controlled experiments by injecting faults to ensure that you have your workloads tested within the tenant boundaries as well as mitigating any other failures? Today we'll explore how the leading ISVs use AWS Fault Injection Service to ensure that the SaaS architectural designs are validated and tested without impacting your SaaS architectural critical components. Usually, these failures are injected to ensure that you mitigate any of your end customer availability issues in terms of services or SLAs that you're committed to your end customers.
I'm Kannah V J, a Senior Solution Architect aligned with the UK ISV SaaS team. I've been with AWS for five years and I'm based out of Dublin. Today, first we'll begin with understanding some of the core principles of SaaS architectures and how every successful SaaS solution underpins how a multi-tenant solution has to be operated. Then we will understand how SaaS providers must adopt a comprehensive resilience framework, discussing some of the fundamentals of how AWS shares our experience in truly building resilient applications to our customers.
We will talk about a real-world user journey with an AWS reference architecture of a multi-tenant control plane and application plane architectures where we have set up two of the SaaS solutions to showcase resilience testing patterns. The core of today's talk is to showcase you multiple resilience testing patterns on how to induce various faults following the AWS best practices and some of the SaaS fundamentals. You'll walk away with key learnings on how to adopt multi-account resilience experiments for your own SaaS architectures, irrespective of your current maturity level.
Finally, we'll open up the floor for questions and answers.
Six Critical Pillars of SaaS Resilience and AWS Well-Architected Framework
To build a truly resilient SaaS architecture, SaaS providers must adopt six critical pillars. First, robust tenant isolation to prevent security breaches and data breaches. Second, noisy neighbor mitigation to maintain consistent user experience as well as the overall performance of your multi-tenant solution.
Third, comprehensive identity and access management to ensure that you have secure tenant boundaries.
Fourth, tenant-aware observability, so that you can instrument tenant resource usage, service throttling limits, and other observability metrics and withstand any kind of disruption or failure.
Fifth, strategic tiering to serve your diversified customer segments. And last but not least, cost tracking mechanisms so that you can charge usage back to your end customers. All these fundamentals remain the core building blocks for building a truly resilient solution on AWS.
One of the best practices in the AWS Well-Architected Framework's reliability pillar is failure management, where we talk about testing reliability. The idea of correlating the SaaS fundamentals with the reliability pillar is that you need to treat reliability as a core pillar and validate and test your workloads against disruptions. This ensures that your workload is ready to withstand both partial and severe failures while still meeting your functional and non-functional requirements.
The objective here is to run controlled experiments against your multi-account SaaS architectures so that you can either avoid disruptions entirely or minimize their impact, without affecting your critical workloads.
AWS Resilience Lifecycle Framework: A Five-Stage Approach
If you really want to deep dive, feel free to grab the QR code. Now, how do you approach a structured methodology and a continuous iterative process to build and manage your SaaS architectures on AWS? This is where AWS developed a resilience lifecycle framework sharing resilience learnings and best practices that we have learned based on working with hundreds and thousands of customers to truly build resilient systems.
In terms of the resilience lifecycle, we have five critical stages where you can adopt various best practices, AWS services, and strategies to meet your resiliency objectives. To start with, you set the objectives. At this stage, your key objective is to understand what level of resiliency is needed for your workloads. This goes back to purely a business conversation, not a technology conversation, aligning your objectives in terms of RTO, RPO, availability, or any SLAs that you have committed as a business to your end customers.
With those objectives in mind, you work towards the second phase of design and implement. This is where you would anticipate various failure modes to ensure that your workload is rightly designed with the right technical architecture and accordingly adopt the right tools. Once you design and implement based on the objectives that you have set in the previous stage, you move towards the third stage of evaluate and test, where the primary focus would be to perform resilience testing.
We are not trying to bring in chaos here. It is about inducing faults with controlled experiments to ensure that you know the limits or the boundaries to reduce any blast radius. There are two phases to evaluate and test. The first stage would be to do pre-deployment activities related to your software development lifecycle that you traditionally do, and the post-deployment activities where you focus on resilience assessments, disaster recovery testing, or resilience testing using AWS Fault Injection Service, depending on the nature of your SaaS architectural workloads distributed either in a single account or multiple AWS accounts.
Once you are happy with your test results, you move on to operate. This is where you look at your observability tools, logs, metrics, and other best practices to instrument the disruptions causing availability issues or other system dependency issues. Once you have the right instrumentation in terms of observability, you move on to respond and learn. Based on the incidents your operations team manages, whether infrastructure failures, availability issues, or application-level failures, you build a mitigation strategy with runbooks, playbooks, operating procedures, and other incident management best practices.
Then you continually learn from how you mitigate those failures and funnel those learnings back into set objectives and the other stages of the overall flywheel. The key goal here is that you can apply this flywheel to your existing workloads or even new workloads. Feel free to grab the QR code if you are interested. We have a very detailed white paper on this.
Phases of Resilience Experiments: From Steady State to Improvement
Now, this brings us to our next topic of phases of resilience experiments. For every experiment that you want to induce for your specific workloads, you eventually get started with steady state. Steady state is pretty much what normal looks like for your workload behavior depending on your user traffic patterns or depending on the service that you are offering to your end customers. You need to measure your steady state behavior to know based on metrics or logs how your system is behaving.
Once you know the baseline version of your steady state, you move toward the hypothesis. A hypothesis is a scenario in which you want to validate assumptions about, or suspected weaknesses of, your workload that are still unknown. How do you surface these unknowns? By defining the hypothesis with a clear business outcome that states what the weaknesses of your workload are and what you want to validate, with specific objectives.
Once you define the hypothesis, you proceed to the third phase of running an experiment. Here, based on the hypothesis you have defined, say you want to inject fault A into components A, B, and C, and you expect a particular system behavior, such as the error rate increasing by ten percent, an infrastructure failure, or whatever other failure you plan to induce. For example, if you run workloads on EKS, you might induce EKS pod CPU stress, or if you have an RDS database, you might fail over to another region, depending on the hypothesis that you define.
Once you run the experiments using AWS Fault Injection Service with a multi-account setup, you evaluate how your system behaved. You go back and verify the outcome and the metrics you observed. Then, accordingly, you move toward improving your architecture, runbooks, playbooks, or whatever learnings came out of this hypothesis.
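To make this concrete, a hypothesis and its verification step can be captured as code so the steady-state baseline and the post-experiment check use the same measurement. The following is a minimal sketch using boto3 against CloudWatch; the metric namespace, API name, and threshold are illustrative assumptions rather than values from the demo workload.

```python
# Minimal sketch: a hypothesis expressed as data plus a verification check.
# Metric names, the API name, and the threshold are illustrative assumptions.
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

HYPOTHESIS = {
    "description": "If the token-usage tracker is disabled, tenant error rate stays below 5%",
    "metric_namespace": "AWS/ApiGateway",       # assumed metric source
    "metric_name": "5XXError",
    "api_name": "tenant-rag-api",               # hypothetical API Gateway name
    "threshold_percent": 5.0,
}

def error_rate_last_15_minutes() -> float:
    """Measure the observed behavior (steady state or post-fault) from CloudWatch."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(minutes=15)
    stats = cloudwatch.get_metric_statistics(
        Namespace=HYPOTHESIS["metric_namespace"],
        MetricName=HYPOTHESIS["metric_name"],
        Dimensions=[{"Name": "ApiName", "Value": HYPOTHESIS["api_name"]}],
        StartTime=start,
        EndTime=end,
        Period=900,
        Statistics=["Average"],   # for 5XXError, Average approximates the error rate
    )
    datapoints = stats.get("Datapoints", [])
    return (datapoints[0]["Average"] * 100) if datapoints else 0.0

def verify() -> None:
    observed = error_rate_last_15_minutes()
    passed = observed <= HYPOTHESIS["threshold_percent"]
    print(f"hypothesis {'held' if passed else 'was rejected'}: "
          f"observed error rate {observed:.2f}% vs threshold {HYPOTHESIS['threshold_percent']}%")

if __name__ == "__main__":
    verify()
```

Running the same check before the experiment (steady state) and after it (verify) keeps the comparison honest.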
This flywheel has to be adopted for every resilience experiment that you define. If you don't do this, it eventually becomes confusing for anyone involved in these phases of resilience experiments. Be sure that you create this and apply it as part of the resilience lifecycle, which is an inner flywheel within the resilience framework. This brings us to a situation where all of this sounds great, but how do I get started on AWS? I have tens or hundreds of AWS accounts where my SaaS workloads are deployed. How do I induce faults not in one single account but across multiple accounts? This is where AWS Fault Injection Service helps.
AWS Fault Injection Service: Orchestrating Multi-Account Resilience Testing
AWS Fault Injection Service is a fully managed service that eliminates the heavy lifting of building custom scripts or other tooling to automate and inject these faults. FIS integrates with IAM, where you define an orchestrator role as a single pane of glass to inject faults across other accounts. You can then access FIS through the management console, CLI, APIs, or however else you want to integrate and get started. The very first step is to configure an experiment template, and this experiment template has three critical components.
First is an action. Action is nothing but the activity that you want to perform in terms of the faults, depending on the nature of your AWS service adoption. Once you define the action, it could be a sequential action or it could be a parallel action that you want to take as part of the faults. Then you would go ahead and define the targets. In this case, the target would be the AWS resources based on the nature of your architectural patterns, whether you're using a serverless architectural pattern or maybe traditional EC2-based deployment models. You decide on those specific targeted resources, and accordingly the action will be induced with those specific targeted resources.
The third critical element is FIS safeguards. As I mentioned earlier, resilience testing is about a controlled experiment to reduce the blast radius. This is where you will use safeguards to define a stop condition for these experiments. You have certain thresholds set for your experiment, and when those thresholds are met, the experiment can be stopped by creating an alarm. That way you're not creating chaos with your upstream or downstream systems. Once you define all these three components, that becomes your experiment template.
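As a rough illustration of how those three components fit together, here is a hedged boto3 sketch of creating an experiment template with one action, one target, and a CloudWatch alarm as the stop-condition safeguard. All ARNs, names, and document parameters are hypothetical placeholders.

```python
# Minimal sketch of an FIS experiment template: actions, targets, and a stop condition.
# The role ARN, alarm ARN, instance ARN, document ARN, and parameters are placeholders.
import json
import uuid
import boto3

fis = boto3.client("fis")

response = fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Noisy neighbor: disable the token-usage EventBridge rule (illustrative)",
    roleArn="arn:aws:iam::111111111111:role/fis-orchestrator-role",   # assumed orchestrator role
    # Safeguard: stop the experiment if this CloudWatch alarm fires.
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:111111111111:alarm:tenant-error-rate-high",
    }],
    # Target: the EC2 instance whose SSM agent will run the custom fault script.
    targets={
        "FaultRunner": {
            "resourceType": "aws:ec2:instance",
            "resourceArns": ["arn:aws:ec2:us-east-1:222222222222:instance/i-0123456789abcdef0"],
            "selectionMode": "ALL",
        }
    },
    # Action: run a custom Systems Manager document for one minute.
    actions={
        "DisableTokenUsageRule": {
            "actionId": "aws:ssm:send-command",
            "parameters": {
                "documentArn": "arn:aws:ssm:us-east-1:222222222222:document/DisableEventBridgeRule",
                "documentParameters": json.dumps({"RuleName": "tenant-token-usage-rule"}),
                "duration": "PT1M",
            },
            "targets": {"Instances": "FaultRunner"},
        }
    },
    tags={"Name": "noisy-neighbor-pattern-1"},
)
print(response["experimentTemplate"]["id"])
```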
You can also use scenario libraries, which are pre-built, AWS-opinionated recommendations. These are common sets of experiments that customers are interested in, so you can get started much faster. Once you configure the experiment templates, you can start or stop an experiment on an ad hoc basis as well. Another great thing about FIS is that it can consume events from third-party observability tools outside of AWS via Amazon EventBridge and start an experiment from them. That creates an open platform and a single pane of glass for your resilience experiments against your AWS resources.
Real-World SaaS Architecture: Control Plane and Application Plane Design
There's a great case study from BMW Group, who have leveraged FIS to identify weaknesses in some of their critical components, and they are adopting this as a mental model to evaluate various critical components.

Now we want to talk about a real-world scenario of a SaaS workload architecture. Let's take an example of the personas who interact with the SaaS architecture. For today's session we have two critical personas interacting with the overall SaaS architecture. One is the SaaS provider, a software company or organization that designs, builds, deploys, and manages the SaaS offering for their end customers. As you know, SaaS providers typically have to manage multiple tenants; it's a multi-tenant solution we're talking about. As per AWS best practices, we recommend that as a SaaS provider, you break your overall SaaS workload architecture into a control plane and an application plane.
In terms of control plane operations, these are what you see on the screen: tenant management, admin management, billing, metrics, and authentication are the core common modules as global services serving across multiple tenants. These SaaS providers eventually have to operate their workload at scale to ensure that tenants are happy in terms of the service offering and they are eventually paying for what they are consuming. The second key persona is the tenant, who is actually the consumer of the SaaS application offered by the SaaS provider.
For today's session, I've set up two different SaaS solution offerings. One is a SaaS e-commerce solution providing tenant-specific product management, where tenants can configure their product catalog. Tenants can also configure their order management in terms of the order processing workflow, based on the products they have configured. These tenants can also interact with the second solution offering, a SaaS retrieval-augmented generation (RAG) application, where tenants onboard their organizational data, which is indexed in a vector store. At query time, the tenant-specific context retrieved from the vector store, along with the user's query, is passed to a generative LLM to generate the response, which is then sent back.
In terms of both of these solutions, tenant isolation and noisy neighbor issues in terms of the SaaS fundamentals that we talked about are super critical to ensure that you maintain those boundaries. We will see in a practical sense some of the patterns that show how you verify tenant isolation is working as expected based on your authorization strategy that you would have implemented as part of your SaaS architecture. What we have seen earlier is just a kind of workflow or a very logical construct of how tenants interact. Beyond SaaS provider and tenant, you could have multiple other personas as well depending on your nature of business and the interactions.
What you see is an AWS SaaS reference architecture for today's setup. The topmost layer is the user access layer, where tenants interact with the SaaS application and the RAG application, with Amazon Cognito providing authentication and authorization. The user access layer connects the tenants to the SaaS provider's control plane components. You remember I talked about the control plane components in the previous slides. All of these functionalities are managed by the SaaS provider in the control plane, and the control plane talks to the application plane through Amazon EventBridge as an event-driven pattern.
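For illustration, the control plane to application plane handoff over EventBridge might look like the following minimal sketch; the bus name, event source, and detail schema are assumptions, not taken from the reference implementation.

```python
# Hedged sketch of the event-driven handoff from control plane to application plane
# via Amazon EventBridge. Bus name, source, and detail schema are assumptions.
import json
import boto3

events = boto3.client("events")

def publish_tenant_onboarded(tenant_id: str, tier: str) -> None:
    """Control plane publishes an onboarding event; application planes subscribe via rules."""
    events.put_events(
        Entries=[{
            "EventBusName": "saas-control-plane-bus",    # hypothetical bus name
            "Source": "controlplane.tenant-management",  # hypothetical event source
            "DetailType": "TenantOnboarded",
            "Detail": json.dumps({"tenantId": tenant_id, "tier": tier}),
        }]
    )

publish_tenant_onboarded("tenant-1", "premium")
```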
The first interface from the control plane is a core application plane that handles tenant workflow authorization, managing the tenant onboarding process and the other integration elements. We also have an EKS application plane hosted as a pooled model, offering tenant-specific namespaces for the order and product microservices, with their data stored in Amazon DynamoDB.
We also have a serverless application plane deployed as a silo model, where each tenant gets its own dedicated deployment of the order processing workflow. Then we have a generative AI application providing retrieval-augmented generation as a shared-services model, where tenants share the same infrastructure while remaining isolated in terms of their data, their interactions, authorization, and all the other SaaS fundamentals that we talked about.
Multi-Account Setup for Resilience Testing Patterns
Now let's talk about resilience testing patterns. Resilience testing is about how you induce purposeful faults into your workload to validate your assumptions with a controlled and bounded context to ensure that it doesn't impact your critical workloads or customer experience. This is more around how you define those hypotheses, induce the fault, and improve the overall resiliency posture. Before we get into the patterns, I just want to quickly talk about the multi-account structure and the setup that we have done for today's patterns where you can see we have picked two regions, US East 1 and US West 2, and there are multiple personas interacting with this solution.
We have a FIS administrator who could be an SRE admin, lead, or DevOps SRE lead who's trying to actually perform these faults, then tenants interacting with the solution, and then we have the SaaS provider.
We have an AWS organization as the root, under which we have a test organizational unit. This test organizational unit has four different accounts, each performing a specific role in the overall multi-account setup. Account 1 represents the single pane of glass for AWS FIS experiments with the AWS FIS experiment orchestration. Account 2 represents the control plane and the e-commerce application plane, where we have AWS Systems Manager to induce custom faults based on Systems Manager documents, along with AWS STS to perform cross-account authorization with the FIS experiment target. Account 3 is for the RAG solution, and Account 4 is where we observe various system behaviors using Amazon CloudWatch with cross-account observability. We also have bucket policies configured for one of the accounts to replicate data from the primary region to the secondary region.
This is how you would set up your workload in terms of multi-account architecture for your fault injection service. Account 2 and Account 3 are examples for today's discussion, but you could have 10, 20, or hundreds of accounts depending on the complexity of your business and where your workloads are running. Account 1 and Account 4 could be your single pane of glass from your orchestration perspective and observability perspective, while Account 2 and Account 3 can replicate across multiple accounts. You need to carefully design your architecture and account structure, and determine how you want to create the hypothesis to ensure that you see the real value out of this.
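As a sketch of how the orchestration account (Account 1) could register the workload accounts as experiment targets, the snippet below uses the FIS multi-account target configuration API; the template ID, account IDs, and role names are placeholders, and the template itself must be created with multi-account targeting enabled.

```python
# Hedged sketch: wiring target accounts to a multi-account FIS experiment template
# from the orchestration account. Template ID, account IDs, and role ARNs are placeholders.
import uuid
import boto3

fis = boto3.client("fis")  # credentials for the orchestration account (Account 1)

TEMPLATE_ID = "EXT1a2b3c4d5e6f7"  # hypothetical experiment template ID

# Account 2 (control plane / e-commerce) and Account 3 (RAG) as experiment targets.
target_accounts = {
    "222222222222": "arn:aws:iam::222222222222:role/fis-target-account-role",
    "333333333333": "arn:aws:iam::333333333333:role/fis-target-account-role",
}

for account_id, role_arn in target_accounts.items():
    fis.create_target_account_configuration(
        clientToken=str(uuid.uuid4()),
        experimentTemplateId=TEMPLATE_ID,
        accountId=account_id,
        roleArn=role_arn,  # role in the target account that FIS assumes to inject faults
        description=f"Fault injection target account {account_id}",
    )
```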
Pattern 1: Multi-Tenant Noisy Neighbor Scenario and Throttling Validation
Let's get started with Pattern 1. Pattern 1 is a multi-tenant noisy neighbor scenario. How many of you have experienced a multi-tenant noisy neighbor scenario for your workloads? It's a scenario where one tenant's activity adversely affects the overall service performance of other tenants. To mitigate this, SaaS providers typically adopt a workload management and resource allocation strategy. One of the common methods is to have a throttling mechanism to ensure that tenants access and stay within their quota limits configured per tenant.
What you see is an AWS reference architecture. In terms of the resilience testing phases, we start with the steady state in terms of Amazon CloudWatch to observe how the system is behaving. Then we move on to defining a hypothesis, run the experiment, come back and verify, and then improve based on our learnings. In terms of tenant interaction, we have two tenants for Pattern 1. Tenant 1 is a security ISV that deals with threat data, and Tenant 2 is an HR tech ISV that deals with employee award data. Both tenants perform their respective RAG queries against API Gateway.
The API Gateway is configured with a usage plan to limit per-tenant requests to 10,000 tokens, 500 input tokens, and 500 output tokens with 100 requests per day. The API Gateway is attached with a Lambda authorizer, which extracts the JWT token, understands the tenant-specific token limits, and validates the role. It also dynamically generates the tenant-scoped credentials that are passed to the Lambda RAG service. The Lambda authorizer also checks against DynamoDB to ensure that the tenant stays within the limit. Once the response is received from the Lambda authorizer, the API Gateway forwards the request to the Lambda RAG service, whose job is to invoke Amazon Bedrock knowledge base to specifically retrieve the tenant context from Amazon OpenSearch vector store, which are already vectorized based on the data ingestion workload.
The context is retrieved along with the user query. Both pieces of information are passed to the Amazon Bedrock LLM to generate the response, and the response is sent back. This covers the overall flow of retrieval augmented generation. At the top, you see an Amazon EventBridge rule that invokes an AWS Lambda function every minute to track token usage. Its only job is to analyze the CloudWatch logs and then update the DynamoDB table, which is used by the Lambda authorizer.
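A hedged sketch of the authorizer-side quota check might look like this; the table name, key schema, claim name, and limit are illustrative assumptions rather than the actual demo implementation.

```python
# Hedged sketch of a Lambda TOKEN authorizer enforcing a per-tenant token quota
# recorded in DynamoDB. Table name, key schema, claim name, and limit are assumptions.
import base64
import json
import boto3

dynamodb = boto3.resource("dynamodb")
usage_table = dynamodb.Table("tenant-token-usage")   # hypothetical table name

DAILY_TOKEN_LIMIT = 10_000  # assumed per-tenant quota, mirroring the usage plan

def extract_tenant_id(token: str) -> str:
    """Illustrative only: read the tenant claim from the (unverified) JWT payload."""
    payload = token.split(".")[1]
    payload += "=" * (-len(payload) % 4)              # restore base64 padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims["custom:tenantId"]                  # assumed claim name

def is_within_quota(tenant_id: str) -> bool:
    """The EventBridge-driven Lambda keeps tokens_used current; the authorizer only reads it."""
    item = usage_table.get_item(Key={"tenant_id": tenant_id}).get("Item", {})
    return int(item.get("tokens_used", 0)) < DAILY_TOKEN_LIMIT

def lambda_handler(event, context):
    # Assumes a TOKEN authorizer receiving the raw JWT in authorizationToken.
    tenant_id = extract_tenant_id(event["authorizationToken"])
    effect = "Allow" if is_within_quota(tenant_id) else "Deny"   # Deny surfaces as HTTP 403
    return {
        "principalId": tenant_id,
        "policyDocument": {
            "Version": "2012-10-17",
            "Statement": [{
                "Action": "execute-api:Invoke",
                "Effect": effect,
                "Resource": event["methodArn"],
            }],
        },
    }
```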
Now, what we want to do in terms of the hypothesis is to induce two kinds of faults. One is to actually disable the EventBridge rule, and the second fault is to actually delete the CloudWatch log streams. After inducing these faults, we want to validate what will happen to our system behavior in terms of whether tenants can actually bypass those throttling limits or they stay within the limits to simulate or understand the noisy neighbor scenario. So let's quickly jump on to the AWS console.
I hope you are able to see my screen. What you see on the screen is the four accounts that we have set up, where as an administrator I can federate into these accounts. Now, in terms of the code setup, let me quickly connect. You can see that we have a tenant project for the RAG solution. We have a CDK project under which we have a tenant template. The tenant template has multiple code repositories in terms of Python modules; you can see the Bedrock custom services have individual Python modules. We also have multiple TypeScript CDK constructs, which perform various activities such as creating roles, deploying the AWS resources, and bundling the Python modules into Lambda functions and deploying them.
To see how these Python modules are correlated: for example, for Bedrock custom, we have a Python module associated with a specific TypeScript CDK construct, and you can see that the TypeScript references the Bedrock custom Python module, with bedrock_logs.py providing the handler for the Lambda function. You can also see that the TypeScript CDK generates the log group along with the role ARN, which is passed as an environment variable. From the TypeScript, the CDK creates the CloudFormation template, deploys the necessary resources, packages the Python code into a Lambda function, and uploads it to S3, and CloudFormation deploys it as a Lambda. This is just one example for the Bedrock-related functionality in terms of tenant token usage. Similarly, under the services you can see the RAG service. We have the core RAG service as a Lambda with similar wiring: service.ts is a TypeScript construct specific to the RAG service, and there we also reference the authorizer Lambda, along with the combined attribute-based access control that we use for the authorization strategy. You can see that the authorization service is again referenced with its specific Python module and Lambda handler, and the necessary environment variables are passed, correlating the TypeScript with the Python module packages.
The project is set up using infrastructure as code along with Python to bundle and deploy everything, so that any changes we make to the SaaS workload architecture are deployed in a simplified, easy-to-maintain way and we can mitigate failures. Now, I have two terminals here, Terminal 1 and Terminal 2. Just to save some time, I have already executed a script. To show you the tenant-related information, let me go back and execute it.
As I mentioned earlier, we have two tenants configured, tenant 1 and tenant 2. Tenant 1 has its registration ID and email address, along with the tenant config referencing input tokens, output tokens, and the API Gateway. Similarly, tenant 2 has its registration ID, email address, and tenant-specific configuration. Now, to understand the steady-state behavior for tenant 1, I executed a script that invokes the API Gateway with the tenant name and a question: what are the threats related to gathering host information? I am sending 13 requests, and as you can see, each request receives HTTP status code 200, which sounds great until request number 7; from request 8 onwards I receive HTTP status code 403, which is access denied. This is because tenant 1 has already exceeded its token limits, which is exactly the noisy neighbor scenario we care about. Then we have tenant 2, who performs a similar call, but in this case the query is different because the tenant context is different. The question is: who are the employees who received an award for teamwork? I'm sending 4 requests just to differentiate, and all of them succeed. This validates that we have the throttling strategy in place.
With this, we want to quickly jump to the AWS console, where I've configured a dashboard for Pattern 1 with various observability metrics related to the AWS services, to understand how tenants are using our services. We have DynamoDB latency and throttling, DynamoDB token usage, table operations, Bedrock invocation metrics, and more. We also have the authorizer Lambda attached to the API Gateway to analyze. I can show you one of the errors explaining why you received a 403: the tenant token limit has been exceeded.
Now I want to go ahead and induce the faults that I talked about. What you see is the AWS console, where the FIS service sits under resilience testing, and we also have resilience management related to Resilience Hub. As I mentioned earlier, the first thing you do is configure the experiment templates, which are already configured here. You also have scenario libraries, the experiments you have run with their overall status, and spotlights highlighting related content. Once you understand the steady state, you come to this experiment template, where I've configured the noisy neighbor faults, including disabling the EventBridge rule. Let me click there and start the experiment.
I've started the experiment. What we expect FIS to do is inject both of the faults into our multi-tenant accounts, and you can see that we have this experiment ID with the account targeting set to multi-account. We have an IAM role configured, and currently it's in the running state. You can export the experiment into reports as well. We have two actions, based on the architecture I explained. One is to disrupt the EventBridge rule using the AWS Systems Manager send command, which is a custom fault. Some AWS services have native integration with FIS, where FIS can inject faults into the service directly. But in this case, because it's a SaaS workload, I have the flexibility to configure my own custom faults. That is why I'm using the AWS Systems Manager send command, which in turn uses Systems Manager documents. A document is a custom set of steps or scripts that you want to run against your SaaS workload architecture. Both of these actions are completed.
Similarly, there's another action using the AWS Systems Manager send command, where I want to delete the CloudWatch log streams. Then you have the targets. In this case, since we're using Systems Manager, the target is an EC2 instance, whose SSM agent performs those actions. Then we have the target account where we have deployed the RAG solution. We have the task timeline and log events. Now this experiment is completed. With this, in terms of observability, let me go back and show you the alarm here. Let me refresh. I have an alarm that was created as I induced the fault. You can see that one of the actions deleted a log stream, which confirms that the fault has been induced.
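Although the demo starts the experiment from the console, the same experiment could be started and monitored programmatically, which is also how you would trigger it from a CI/CD stage later on. The template ID below is a placeholder.

```python
# Minimal sketch: start an FIS experiment from a template and poll until it finishes.
import time
import uuid
import boto3

fis = boto3.client("fis")

experiment = fis.start_experiment(
    clientToken=str(uuid.uuid4()),
    experimentTemplateId="EXT1a2b3c4d5e6f7",  # hypothetical template ID
)["experiment"]

# Poll until FIS reports a terminal state (completed, stopped, or failed).
while True:
    state = fis.get_experiment(id=experiment["id"])["experiment"]["state"]
    print(f"experiment {experiment['id']}: {state['status']}")
    if state["status"] in ("completed", "stopped", "failed"):
        break
    time.sleep(15)
```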
Now, can anyone guess whether, if I send the same request again, it would succeed? Remember, tenant 1 has already exceeded its limits. Any idea? Yes, both faults have been injected. Exactly, so it's going to fail, because the Lambda attached to the EventBridge rule had already captured the usage by analyzing the CloudWatch logs and updated DynamoDB. Now, for any incoming request, the Lambda authorizer checks DynamoDB and reports that this tenant has already exceeded its limits, in spite of the faults we induced. So for today's demo, I'm going to reset the DynamoDB table so that we can see how the system behaves under the faults we injected. What you see on the screen is the DynamoDB table for token usage.
I'm going to reset this table. However, in a real-world production system, you won't be able to do this, and you wouldn't need to: in this example the throttling limit had already been recorded by the tracking mechanism in the DynamoDB table before the faults were injected, and that table is not something you would reset in production. Now that I've reset the DynamoDB table, I'll go back and run the same set of experiments to see what happens.
Let me go back to Visual Studio Code and send the same request now. We're validating our hypothesis by executing the experiment, and now we're trying to verify the results. I'm going to send the same request here with 13 requests for tenant one, and then similarly, I'm going to send 13 requests for tenant two just to see how both tenants perform in terms of the throttling limits. This is going to take some time to complete, probably a couple of minutes. Let me check if this request is proceeding to ensure that after we reset the DynamoDB table, we're receiving responses. Yes, we are receiving the response, so let it continue.
Meanwhile, let me quickly jump to the experiment templates to show you how this looks in the real world. What I've shown you is an AWS Resilience Hub console version of visualizing how the experiment has behaved. You can also configure a JSON file and upload it, or you can import it into FIS, and then the experiment can be configured. Let me quickly show you the experiment template and how it looks. It's a similar version of the UI, but here it's pretty clear that first you have to configure the action. In this case, we have two actions configured with the action ID of SSM send command. You can see the document ARN, which is super critical if you want to induce a custom fault. This is a Systems Manager document with the document ARN, and the duration is one minute with the target being our instance because it's an SSM send command. Whereas if you induce a Lambda fault or maybe EKS pod actions, then you would see that the action ID would be the respective AWS services.
We have two actions configured, and you can see that the second action has a different ARN as well. Then we have the description, and you can export it into a report. You can also do log configuration, and the targets are configured with the respective EC2 instances. You can see that both of these resource ARNs are associated with an EC2 instance in the same account. Now if I go back and show you the Systems Manager documents as well, this is how the SSM documents look. You have a schema version with parameters passed as input in terms of the log stream to delete, so you're passing which log stream to be deleted with the run command, the log group, and a set of steps against the AWS services. Similarly, you have configuration for the EventBridge rule as well. This is how you will configure an experiment tied with the specific action commands using Systems Manager.
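For reference, a custom Command document like the disable-EventBridge-rule fault could be created along these lines; the document name, parameter names, and rule name are placeholders, and the target instance needs the AWS CLI plus IAM permissions for the call it runs.

```python
# Hedged sketch of a custom Systems Manager Command document for the EventBridge-rule fault.
# Document name, parameter names, and rule name are placeholders.
import json
import boto3

ssm = boto3.client("ssm")

document = {
    "schemaVersion": "2.2",
    "description": "Custom fault: disable a tenant token-usage EventBridge rule",
    "parameters": {
        "RuleName": {"type": "String", "description": "EventBridge rule to disable"}
    },
    "mainSteps": [{
        "action": "aws:runShellScript",
        "name": "disableRule",
        "inputs": {
            # The instance profile must allow events:DisableRule for this to succeed.
            "runCommand": [
                "aws events disable-rule --name {{ RuleName }}"
            ]
        },
    }],
}

ssm.create_document(
    Content=json.dumps(document),
    Name="DisableEventBridgeRule",   # hypothetical document name
    DocumentType="Command",
    DocumentFormat="JSON",
)
```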
Pattern 2: Tenant Isolation Testing with Attribute-Based Access Control
We're just waiting for this script to complete. Meanwhile, let me quickly jump to Pattern 2. Pattern 2 is about tenant isolation, which is one of the critical principles of any multi-tenant solution. How do you ensure that your tenants are not able to access other tenants' data or resources? What kind of strategy would you use? Does anyone have suggestions? What is a common method of tenant isolation strategy? Different accounts could be one strategy, but still, how do you control your principals so they're not able to bypass accessing other tenants' resources or even the data in terms of data isolation? That's one strategy. That's more about enforcing certain policies that you want to bring in as part of your multi-account strategy. To have a robust tenant isolation mechanism, customers typically adopt three authorization strategies.
Number 1 is role-based access control, where every tenant will have their specified role. That role will define which specific AWS resources they can use. For example, if you have an EC2 with S3 and RDS, you can have a role defined per tenant, and based on that particular authorization, you can restrict access to those specific AWS resources. The second mechanism is to dynamically generate an IAM policy for each tenant based on IAM identities, which could be an IAM user or role. The third common strategy is to use IAM role with attribute-based access control, where you define an attribute-based access control policy and attach that to a tag, and that tag will decide which specific IAM principal in terms of the role can access which resource.
For the RAG solution, we have adopted the third strategy: an IAM role with attribute-based access control. What you see is very similar to Pattern 1. The only difference I want to highlight is that whenever a tenant sends a query, the Lambda authorizer dynamically generates an attribute-based access control policy with tenant-scoped credentials as part of the request and returns it to the API Gateway. The API Gateway passes those credentials, tied to the IAM role principal and its tenant tag, on to the RAG service Lambda. When the RAG service invokes, for example, the knowledge base or any other resource, the tag on the principal is matched against the tag on the resource. If the principal tag and the resource tag match, the request is authorized; if not, it gets an access denied error. This is the common strategy using attribute-based access control.
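A hedged sketch of minting tenant-scoped credentials with ABAC might look like the following: assume a shared role with a tenant session tag and a session policy that only authorizes resources whose tag matches. The role ARN, tag key, and actions are illustrative assumptions, and you should confirm which services and actions support resource-tag condition keys.

```python
# Hedged sketch: an authorizer minting tenant-scoped credentials via STS with a session
# tag plus a session policy. Role ARN, tag key, and actions are illustrative assumptions;
# the role's trust policy must allow sts:AssumeRole and sts:TagSession for the authorizer.
import json
import boto3

sts = boto3.client("sts")

def tenant_scoped_credentials(tenant_id: str) -> dict:
    session_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["bedrock:Retrieve", "bedrock:InvokeModel"],   # assumed actions
            "Resource": "*",
            # Authorize only resources whose TenantID tag matches the caller's session tag.
            "Condition": {"StringEquals": {
                "aws:ResourceTag/TenantID": "${aws:PrincipalTag/TenantID}"
            }},
        }],
    }
    response = sts.assume_role(
        RoleArn="arn:aws:iam::333333333333:role/rag-tenant-access-role",  # assumed shared role
        RoleSessionName=f"tenant-{tenant_id}",
        Tags=[{"Key": "TenantID", "Value": tenant_id}],  # session tag carries the tenant attribute
        Policy=json.dumps(session_policy),
        DurationSeconds=900,
    )
    return response["Credentials"]  # scoped keys handed to the RAG service for this request
```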
Now in terms of the hypothesis, what we want to do is induce a microservice bug. This could be a human error or any other coding-level issue that could happen. For example, someone might mistakenly hard code the knowledge base ID of tenant 2 into tenant 1, which may not happen, but anything could happen in terms of failures. So if that's the case, whether tenant 1 can bypass and access tenant 2's data is something that we will validate.
Looking at what you see on the screen for Pattern 1, compared to the previous iteration, all of the tenant requests have succeeded. After inducing the fault, tenant 1 completes all 13 requests, and tenant 2 also completes all 13 requests. Based on the fault that we've induced and after resetting the DynamoDB table, tenants can actually overshoot their limits. This is a violation from a throttling mechanism standpoint. It means tenants can create a much greater volume of requests, which could jeopardize the overall SaaS solution in terms of account-level or regional-level limits.
One of the key learnings for Pattern 1 is to enforce an IAM least-privilege mechanism so that none of the IAM identities within your organization or accounts can delete the rules, logs, or anything else that enforces tenant limits. The second learning is that, if such a fault does happen, you need to ensure that legitimate tenants who are still within their limits don't receive throttling errors because of the noisy neighbor problem, where other tenants generate more requests and put everyone at risk of hitting the account-level or regional-level quotas you have pre-provisioned. So how do you mitigate that? You have to look at the broader scope of the tenant and the noisy neighbor scenario when designing your throttling mechanism.
Additional Patterns: Serverless Timeouts, EKS High Availability, and Cross-Region Replication
Last but not least is to have a more robust notification and alerting strategy, so that the SaaS provider knows when certain tenants are bypassing those throttling limits. Accordingly, you can have some automation or remediation to go back and increase those quota limits at either the account level or the regional level. Now, in terms of the other patterns, we have Pattern 3 for serverless applications.
How many of you have adopted serverless patterns for your workloads? Serverless is one of the modern architectures that most customers try to adopt because of reliability and high availability, as we manage the infrastructure and you can focus on your specific application-related components. One of the common challenges with serverless applications is timeouts, because you have so many serverless architectural components where you would have circular dependencies between components and maybe upstream or downstream systems as well.
To overcome this challenge, how do you inject a purposeful fault into your serverless architecture to identify those system dependencies or timeout scenarios? One of the common patterns is to inject AWS Lambda faults using FIS, which is a natively supported capability. You can inject Lambda faults that add latency to your functions and observe how user behavior and other system components respond, for example API Gateway error rates or user error rates increasing and users having a bad experience. You can leverage FIS to do this, and you go through the same phases of resilience experiments to implement it.
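As a rough sketch, an FIS experiment template for this pattern might look like the following. The Lambda fault action ID and its parameter names are assumptions based on the Lambda actions described in the session, so verify the exact identifiers in the FIS documentation; the ARNs are placeholders.

```python
# Hedged sketch: FIS experiment template adding latency to Lambda invocations to surface
# timeout and dependency issues. Action ID, parameter names, and target key are assumptions.
import uuid
import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="Serverless pattern: add invocation delay to the order-processing Lambda",
    roleArn="arn:aws:iam::111111111111:role/fis-orchestrator-role",      # placeholder role
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:222222222222:alarm:api-gateway-5xx-high",
    }],
    targets={
        "OrderFunction": {
            "resourceType": "aws:lambda:function",
            "resourceArns": ["arn:aws:lambda:us-east-1:222222222222:function:order-processing"],
            "selectionMode": "ALL",
        }
    },
    actions={
        "AddInvocationDelay": {
            "actionId": "aws:lambda:invocation-add-delay",     # assumed action ID
            "parameters": {                                    # assumed parameter names
                "duration": "PT5M",
                "invocationPercentage": "100",
                "startupDelayMilliseconds": "2000",
            },
            "targets": {"Functions": "OrderFunction"},         # assumed target key
        }
    },
)
```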
The fourth pattern is EKS application high availability. Managing workloads within EKS is a complex task because of the many dependencies involved: configuring pods, nodes, clusters, and various other internal dependencies of your Kubernetes resources. One of the common patterns where customers adopt FIS is to inject EKS pod actions. We support many pod actions natively, such as terminating instances in the node group and injecting CPU stress, I/O stress, memory stress, and latency. All of those faults are natively supported in FIS, where you can induce CPU stress, for example, against multiple tenants whose workloads are isolated by namespaces.
In this case, requests are flowing to tenant 1 and tenant 2, and for both tenants you can target the faults using a label on the application selector as your criteria for injecting them. There are some prerequisites for injecting these faults: you have to configure a Kubernetes service account. The Kubernetes service account is a role-based mechanism that grants FIS the permissions to create an FIS pod, which sits alongside your existing pods in the cluster and injects the faults, so that EKS failures in terms of pod actions can be tested and mitigated. A sketch of such an experiment follows below.
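A hedged sketch of such an experiment template, scoped to one tenant namespace via a label selector, is shown below. The cluster ARN, namespace, labels, service account, and the exact action and parameter names are assumptions for illustration.

```python
# Hedged sketch: FIS experiment template for EKS pod CPU stress scoped to one tenant
# namespace. Cluster ARN, namespace, labels, and parameter names are assumptions.
import uuid
import boto3

fis = boto3.client("fis")

fis.create_experiment_template(
    clientToken=str(uuid.uuid4()),
    description="EKS pattern: CPU stress on tenant-1 product microservice pods",
    roleArn="arn:aws:iam::111111111111:role/fis-orchestrator-role",      # placeholder role
    stopConditions=[{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:222222222222:alarm:tenant1-latency-high",
    }],
    targets={
        "Tenant1Pods": {
            "resourceType": "aws:eks:pod",
            "selectionMode": "ALL",
            "parameters": {                                 # assumed target parameters
                "clusterIdentifier": "arn:aws:eks:us-east-1:222222222222:cluster/saas-cluster",
                "namespace": "tenant-1",
                "selectorType": "labelSelector",
                "selectorValue": "app=product-service",
            },
        }
    },
    actions={
        "PodCpuStress": {
            "actionId": "aws:eks:pod-cpu-stress",           # assumed action ID
            "parameters": {                                  # assumed parameter names
                "duration": "PT5M",
                "percent": "80",
                "kubernetesServiceAccount": "fis-experiment-sa",
            },
            "targets": {"Pods": "Tenant1Pods"},              # assumed target key
        }
    },
)
```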
I have another pattern about cross-region data replication using S3 cross-region replication. If you're dealing with multi-regional architectures where the data ingestion workflow in particular has complex steps, and you rely on S3 cross-region replication to replicate the data, then this is a pattern you can adopt to inject faults across multiple regions as well. If anyone is interested, please let me know and we can discuss it.
Key Takeaways: Making Resilience Testing a Continuous Practice
In terms of the key takeaways, the first and fundamental element is to understand your workload architecture. What are the system dependencies that you want to validate and are you interested in identifying cross-regional failures or maybe are you trying to validate how one of the components would fail and have a circular dependency on other components? Understanding your end-to-end workload architecture is critical for resilience testing.
Once you understand your architecture, proceed toward identifying the key objective of your resilience testing. If you don't know what specific hypothesis you want to create, it will be very difficult to justify the business outcome and align the results with improvements to your workload architecture. That is where you define your hypothesis with real consideration of the weaknesses and the validation you want to perform, and then run the experiments accordingly.
This is where you can use FIS experiments based on the various faults we support. Once you run the experiment and verify it, the real benefit comes from the outcomes. The outcomes are the learnings you have identified; use them to improve your resiliency posture so that you can withstand those failures or disruptions. As you mature in your resilience testing, make sure you integrate resilience testing into your CI/CD as a continuous, iterative process.
Resilience testing is a continuous, iterative process, not a one-time activity, because your system will evolve, your team will change, and you will have various disruptions to handle. Ensure that you perform resilience testing continuously. Feel free to grab the QR code if you're interested in diving deeper into other best practices and guidance from AWS. If you're interested in one-to-one demos or an opportunity to grab some swag, meet us at the kiosk in the AWS village. If you really want to upskill your AWS knowledge, feel free to register on AWS Skill Builder, where you have thousands of on-demand trainings as well as hands-on experiences.
; This article is entirely auto-generated using Amazon Bedrock.