AWS re:Invent 2025 - Resilience testing and AWS Lambda actions under the hood (COP414)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - Resilience testing and AWS Lambda actions under the hood (COP414)

In this video, AWS Solutions Architects Saurabh Kumar and Haresh Nandwani demonstrate chaos engineering for serverless workloads using AWS FIS (Fault Injection Service) native fault actions for Lambda functions. They explain three key FIS actions: add start delay, modify integration response, and invocation errors. The session covers how FIS extensions work through an API proxy pattern, polling S3 buckets for experiment configurations with dual-mode polling (60-second slow poll, 20-second hot poll). Through live CDK coding in Java, they deploy a sample application with API Gateway, Lambda, and DynamoDB, then create FIS experiment templates to inject faults. They demonstrate setting up Lambda layers, environment variables, CloudWatch dashboards for observability, and IAM policies. The presentation includes running actual experiments, analyzing reports showing pre-experiment, experiment, and post-experiment states, and discusses the trade-offs between polling frequency and performance overhead. They also introduce new FIS scenarios for testing gray failures like AZ Application Slowdown and cross-AZ traffic slowdown.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

Introduction: Chaos Engineering and Resilience Testing for Lambda Functions

How's everyone doing? Monday, got through it, right? Probably we are in between you and your drinks, so we'll try to keep it as smooth as possible. So before we start, let me ask the question: how many of you are using Lambda functions in production or playing around with them? Alright, that's the majority of you. And how many of you are performing some sort of resilience testing for it? Alright, you are in the right room.

So as evident from the room, more and more companies like yourselves are adopting serverless technologies like AWS Lambda functions. Lambda functions make it easy for you to deploy and run your applications because it takes away the undifferentiated heavy lifting of managing the infrastructure. But at the same time, it makes it difficult for you to perform chaos engineering or resilience testing, and that's where AWS FIS Native Fault Actions come into the picture. It helps you to get started with chaos engineering for serverless workloads quickly and easily.

In this session, we are going to talk about what those FIS fault actions are, how they work, and demonstrate some practical ways of getting started with chaos engineering and resilience testing for serverless technologies. My name is Saurabh Kumar, and I'm a Senior Solutions Architect at AWS. I'm also a member of a technical community that focuses on guiding customers on resilience and chaos engineering. For this talk today, I'm joined by my colleague Haresh.

Thanks Saurabh. Hello everyone. I'm Haresh Nandwani, a Solutions Architect like Saurabh. I predominantly work with financial services customers, and I'm also part of the same technical field community that focuses on all things resilience. So moving on, the session today will be covering a few things. As Saurabh mentioned, the focus is really to walk through how you can actually get started using these Lambda actions for testing the resilience of your Lambda functions.

Thumbnail 130

We'll first start off with chaos engineering benefits and a high-level overview of the Fault Injection Service. Then we'll give you a quick introduction to Lambda. Hopefully you all know Lambda well based on the number of hands that went up. But we'll just cover some of the basics and then start talking through the FIS Lambda actions that have been recently added. Then we'll move on to how the Lambda actions actually work. The FIS actions we are going to talk about use a concept called extensions. We'll cover how those extensions are really set up, and then we'll move on to the real fun element of the session, which is a live coding session.

Thumbnail 190

So chaos engineering. I think we asked a question already, but can I have a quick show of hands: how many people here know the term or, even better, have used it to implement chaos engineering? Brilliant. Okay, a few hands, that's really positive to hear. A couple of years back, I'm sure I would have had a pretty silent room to that question. Chaos engineering as an approach is getting quite popular, and the approach is about experimenting with your system to build confidence that when things go wrong, your system can survive, address any challenges, and get back to normal as soon as possible.

The chaos engineering process is all about introducing faults into your system and then basically validating how your system reacts to those faults. The key benefit here is that as you test your system by running these failure scenarios, you get a better understanding of the resilience posture of your system. Really, how resilient your system is. Where are some of the key challenges from a resilience perspective, and most importantly, what are the key unknowns in your system? It's far better, as you can imagine, finding those unknowns early rather than finding them on a Sunday at 3:00 AM in your production system. So you can see the value of chaos engineering.

Thumbnail 290

AWS Fault Injection Service (FIS): A Managed Solution for Chaos Engineering

So how do you actually implement chaos engineering? There are a number of options in this space to be honest. Our recommendation is to use the AWS Fault Injection Service, or FIS as we'll refer to it during the session. And there are three key reasons why we recommend FIS.

The first is that it's a completely serverless and fully managed service, so there's no server provisioning or supervision required to run your chaos tooling; that's all pre-provisioned for you. There is also a healthy library of failure scenarios that ships with the service.

One of the key asks from customers when they want to implement chaos engineering is how to quickly get started implementing chaos experiments without spending a lot of time documenting failure scenarios or doing all the heavy lifting involved. This is where FIS really shines. We'll talk about the scenario library shortly. The second key reason is that FIS includes native integration with AWS services like EC2. One common example: if you wanted to pause autoscaling in your live or test system, FIS gives you the ability to perform those actions natively, which is something a lot of other tools don't actually provide.

The third and most important thing is that when we talk about chaos engineering, a lot of it is effectively introducing faults into your system. But the key thing is that you want to introduce faults in a controlled manner. This is where FIS provides you with the controls and guardrails that allow you to control the execution of those faults and the execution of the experiments. For example, there is a feature in FIS called safety levers. When you run an experiment and if the experiment actually does far more than what you thought it would, you can pull that safety lever and all the experiments that are running will get stopped.

There are a few key things you need to know as we go into the live coding session. There are a few key terms that will be helpful to know. When we talk about chaos engineering, we talk about experiments. We will be setting up experiments in FIS, and the way to set up experiments in FIS is to write something called an experiment template. An experiment template is basically a failure scenario. For example, if you run a multi-AZ application, you might want to set up a scenario which tests the impact of a single AZ being impaired. That would be an experiment template that you set up, and then you run it, and then that becomes an experiment.

The other key terms that you will hear us use a lot are actions and targets. Actions are basically faults that you introduce. For example, if you want to degrade the network or introduce network latency between multiple components or shut down EC2 instances or pause autoscaling groups, those are actions that you introduce. Targets, as the name suggests, are effectively where those actions run. If you are running an action or introducing a fault against a Lambda function, then the Lambda function will be a target.

The other key element to know and to be aware of is CloudWatch. One of the key elements of chaos engineering that is really critical to get right is observability. If you can't observe it, it hasn't happened as far as you're concerned. It's really important that CloudWatch and whatever you're using for observability is set up correctly so that you can actually observe the impact of the experiment or the fault being introduced.

Thumbnail 540

Lambda Resilience Challenges and Three New FIS Actions

Next we come to Lambda, which is our serverless compute service. You obviously all know that very well. The intent of Lambda is that you don't have to think about or worry about server provisioning and all the infrastructure management stuff. You can focus on writing your business logic. Lambda is a regional service, and what that means is basically it uses the multiple availability zones that come with the region to provide redundancy and fault isolation.

That is then used for your function. Effectively, if there is an infrastructure failure or an impairment to the infrastructure running your Lambda function, it is taken care of, and the impact to you is minimal because of the multiple AZs running in the region. So Lambda is a resilient service by design. What does that really mean? Does it mean that if you're running a predominantly Lambda-based system, you don't have to worry about resilience?

Well, no. You still need to take care of resilience. Your Lambda functions will typically not operate in isolation; they're not really standalone in most cases. They integrate with upstream and downstream systems or components in your architecture, and you still need to validate how the overall system's resilience is impacted when these Lambda functions encounter failures or issues. What is the impact on upstream and downstream systems?

Thumbnail 670

As you can imagine, that is quite a critical ask, something that our customers have been asking for quite some time. The approach that we have recommended to customers is to use chaos engineering to effectively test your Lambda systems from a resilience perspective. To help you with that, we recently introduced three actions within FIS.

The first one is called add start delay. It allows you to specify an amount of time to delay the start of the Lambda function. For example, if you want to understand the impact of a delay in your Lambda function on the components consuming it, that would be a great experiment to run. The second one is around modifying the integration response. If you have Lambda functions that are returning incorrect responses, or that integrate with downstream systems which return incorrect responses that the Lambda function passes back to the consuming applications, that is where the modify integration response action comes into play.

The third one is the invocation errors action. This is where you want to test the impact of your Lambda functions being marked as failed when they are invoked. What's the impact on the components in your architecture that are calling these Lambda functions? So how do you actually now use FIS and these actions to integrate that chaos engineering approach into your system? I'll hand over to Saurabh to talk about that.
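For orientation, these three capabilities surface in FIS as action identifiers. Here is a small sketch collecting them as Java constants for use in the CDK code later; the identifier strings follow the FIS action reference as best I recall it, so double-check them against the current documentation:

```java
// The three Lambda fault actions as FIS action identifiers
// (verify the exact strings against the FIS action reference).
public final class FisLambdaActions {
    public static final String ADD_START_DELAY      = "aws:lambda:invocation-add-delay";
    public static final String INVOCATION_ERROR     = "aws:lambda:invocation-error";
    public static final String INTEGRATION_RESPONSE = "aws:lambda:invocation-http-integration-response";

    private FisLambdaActions() {}
}
```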

Thumbnail 760

Thumbnail 770

How FIS Extensions Work: Architecture and Polling Mechanisms

Before we get into the fault actions, let's quickly look at the Lambda components. Most of you will probably be aware of this. Central to the Lambda function is the business logic code that you write. This is the code that you care the most about—your function code. You write the business logic along with its entry point called the function handler. Along with that comes the library code, which is the runtime-specific code. For example, if you're writing your code in Java, then the library code would be the Java runtime library necessary to run the function code.

Thumbnail 800

Thumbnail 810

Thumbnail 820

Thumbnail 830

Another component is the Lambda execution environment, which runs the function and library code in a secure and isolated runtime environment. The runtime communicates with the execution environment through an HTTP runtime API; that's where your events are triggered and Lambda functions are invoked. Another component your Lambda function can have is an extension. An extension enables you to run custom code alongside your function code. Lambda extensions are packaged as a separate layer and run as a separate process within the runtime environment.

Thumbnail 860

Thumbnail 880

Now the next question is how does FIS provide native fault actions? The answer is through extensions. AWS provides an FIS extension to inject faults in your Lambda invocations. How does it do that? It implements an API proxy pattern, thereby hooking into the Lambda function execution request and response lifecycle, enabling you to inject the faults that were discussed. You've seen how the FIS extension enables you to implement chaos engineering. Let's see how it works. When you configure the FIS extension, you also configure an S3 bucket, and this is the bucket that the FIS extension keeps polling to see if you're running an experiment.

Thumbnail 910

When you start an experiment from FIS, FIS puts a fault configuration in the S3 bucket. The next time the FIS extension polls, it sees the active fault and starts applying those faults accordingly.

Now I'll deep dive a little bit on the polling aspect of it. When you are not running any fault for Lambda functions, the normal lifecycle is that the runtime would initialize the extension followed by initializing the runtime itself, followed by initialization of the function. When all of the initializations are successful, then your Lambda function would be invoked one or multiple times, and this is where no faults are being injected. When there are no more events, the shutdown would happen in reverse order.

Thumbnail 970

Thumbnail 980

Thumbnail 990

When the FIS extension is initialized, it starts a timer that polls the S3 bucket to see if you're running an experiment. If it sees that no experiment is running, it sets the timer to slow poll mode, which is 60 seconds by default. But when you start the experiment, the next time it polls and sees there is an active experiment, it sets the timer to hot poll mode, which is 20 seconds and not configurable.
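To make the dual-mode behavior concrete, here is a purely illustrative Java sketch of the timer logic. The real extension is AWS-managed and its internals are not public, so treat this as a mental model rather than the actual implementation:

```java
// Illustrative sketch of the extension's dual-mode polling; not the real code.
public class DualModePollSketch {
    // Default slow poll; configurable via AWS_FIS_SLOW_POLL_INTERVAL_SECONDS.
    static final long SLOW_POLL_SECONDS = 60;
    // Fixed hot poll while an experiment is active.
    static final long HOT_POLL_SECONDS = 20;

    // Stub: in reality the extension reads the fault configuration from S3.
    static boolean experimentActiveInS3() { return false; }

    public static void main(String[] args) throws InterruptedException {
        while (true) {
            boolean active = experimentActiveInS3();
            // ...apply or clear faults on the invocation path here...
            long interval = active ? HOT_POLL_SECONDS : SLOW_POLL_SECONDS;
            Thread.sleep(interval * 1_000);
        }
    }
}
```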

Thumbnail 1010

This dual mode of hot poll and slow poll is a trade-off between the performance when you're not running any experiment versus quick recovery to the normal state when the experiment has concluded. Now with that, I'll hand it over to Haresh, and we'll get into the live coding part of it.

Live Demo Part 1: Deploying the Serverless Application with CDK

Thank you. So what we'll do as part of the live coding session is first walk you through the app architecture that we'll use for demonstrating the Lambda FIS actions. Then we'll walk through the CDK code. We have a ton of CDK code today: some of it deploys the application we're using as a demo, and some of it sets up the experiments themselves. We will then run the experiment and also show you, as Saurabh was mentioning, the impact of slow versus fast polling.

Thumbnail 1090

Can people at the end see clearly what's on the screen? Please let me know if it's not visible. So what you have here is the CDK code that we mentioned we'll walk through, but first I'll quickly go through the architecture. The architecture is pretty straightforward, to be honest. All we've done here is set up a DynamoDB database with a few tables, and then there is a set of Lambda functions performing CRUD operations on the DynamoDB tables. Then you have a set of APIs hosted on API Gateway, which is accessed by an external consumer.

The intent here is not to focus too much on the application itself, but on FIS. That's why the application is as simple as it is. What you see at the top is the FIS setup. The top half here is all that is required for the FIS experiment. There is the S3 bucket that Saurabh mentioned, and then we will cover some of how that is set up.

Thumbnail 1070

Getting into the code, there are two stacks today. The API stack deploys the serverless elements: the API Gateway, Lambda, DynamoDB, and even the S3 bucket, to be honest. Then there is the FIS experiment stack, which deploys the FIS experiments. So first, to go through the API code: a lot of the code we'll show you here is pretty standard stuff. We will not walk you through every line of code, as that's probably not a great use of your time. We'll just focus on the key elements necessary for setting up FIS and running the FIS experiments.

Thumbnail 1210

So to start off with, first up is the serverless API. This is where we have written the code for setting up the APIs and the CRUD operations. The key elements we want to point you towards are these parameters that you set up for each of the Lambda functions you want to run an FIS experiment against. Ignore the first two parameters, as those are for the Lambda to access DynamoDB. However, the next four are really the key ones.

Starting off with the AWS_FIS_CONFIGURATION_LOCATION parameter. As Saurabh mentioned, S3 effectively acts as the intermediary between FIS and your Lambda function. S3 stores the experiment configuration, so that is where you need to point Lambda towards so that Lambda can access that FIS bucket. The next parameter is the AWS_LAMBDA_EXEC_WRAPPER. As Saurabh mentioned, we are using the concept of extensions, and what you need to specify there is the location of a wrapper script that we provide that bootstraps the extension.

The FIS extension, as Saurabh mentioned, runs in its own process. You might want to measure elements of how the extension itself is performing, beyond how the Lambda function is performing. There are metrics that can be emitted from the extension, and for enabling that we set AWS_FIS_EXTENSION_METRICS to all. Finally, there is the AWS_FIS_SLOW_POLL_INTERVAL_SECONDS parameter. This tells the extension how quickly or slowly it needs to poll for the experiment configuration from S3. Those are some of the key elements there.
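Putting those four variables together, a minimal CDK Java sketch (inside a stack constructor) might look like the following. The function name, handler, asset path, and the fisBucket variable are illustrative; the wrapper path /opt/aws-fis/bootstrap is the documented location inside the extension layer, but verify it for your runtime:

```java
import software.amazon.awscdk.services.lambda.Code;
import software.amazon.awscdk.services.lambda.Function;
import software.amazon.awscdk.services.lambda.Runtime;
import java.util.Map;

// Sketch: the four FIS-related environment variables from the talk, set on a
// Lambda function. fisBucket is the configuration bucket created earlier.
Function getItemFn = Function.Builder.create(this, "GetItemFunction")
        .runtime(Runtime.JAVA_17)
        .handler("com.example.GetItemHandler::handleRequest")  // illustrative
        .code(Code.fromAsset("lambda/get-item.jar"))           // illustrative
        .environment(Map.of(
                // S3 location the extension polls for the experiment configuration
                "AWS_FIS_CONFIGURATION_LOCATION", fisBucket.getBucketArn() + "/FisConfigs/",
                // Wrapper script that bootstraps the FIS extension (ships in the layer)
                "AWS_LAMBDA_EXEC_WRAPPER", "/opt/aws-fis/bootstrap",
                // Emit all extension-level metrics
                "AWS_FIS_EXTENSION_METRICS", "all",
                // Slow-poll interval used when no experiment is active
                "AWS_FIS_SLOW_POLL_INTERVAL_SECONDS", "60"))
        .build();
```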

Thumbnail 1330

Moving on to the next bit of the stack, this part of the stack effectively deploys the Lambda function using CDK code. I won't go through all the hundreds of lines of code, but this particular bit that you see on the screen is of some importance. There is some code here that is key. One is called resource_tag. When you are running experiments in FIS and you want to clearly define which parts of your stack you want to run the experiments on, you can use a concept of tags. You can tell FIS to only target Lambda functions that have a specific tag, and that is what we are doing here.

Thumbnail 1400

The second important element is the fis_lambda_layer_arn. As Saurabh mentioned, because we are using a managed extension that we provide, you need to point Lambda towards the extension, and the ARN here is the location of that extension code, packaged as a Lambda layer. As you can see, it is an ARN pointing to a specific account and a specific region. You will be happy to know that code running in other regions does not need to point to us-east-1; there is a version of this layer in every region you run in.

Thumbnail 1430

And then you have the tags. As I mentioned before, you have the tag_name that you want the FIS experiments to target and the tag_value that you will want to use. These are some of the key elements of the stack that we wanted to explain. Deploying the stack live takes about four to five minutes, so in the interest of time we have pre-deployed it. Let me go to the AWS console and show you the output of the stack being deployed.
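Before we look at the console output, here is a sketch of the layer and tag wiring in CDK Java, under the same assumptions as above. The layer ARN below is a placeholder, not a real value, and the tag name and value are illustrative:

```java
import software.amazon.awscdk.Tags;
import software.amazon.awscdk.services.lambda.LayerVersion;

// Sketch: attach the managed FIS extension as a layer and tag the function so
// FIS can target it. Look up the real region-specific layer ARN in the docs.
String fisLambdaLayerArn =
        "arn:aws:lambda:<region>:<aws-managed-account>:layer:aws-fis-extension-x86_64:<version>";
getItemFn.addLayers(
        LayerVersion.fromLayerVersionArn(this, "FisExtensionLayer", fisLambdaLayerArn));

// Tag that the experiment template's target selector will match later.
Tags.of(getItemFn).add("FIS-Ready", "True"); // tag name/value are illustrative
```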

Thumbnail 1480

Thumbnail 1490

Thumbnail 1500

This is the AWS console. Hopefully it's something you've all seen before. Going back to the stacks we have here: we have the API stack. Once the CDK code is deployed, it deploys the APIs. Let us quickly look at a couple of things. There are a couple of outputs that we expect from this stack.

Thumbnail 1510

The first one is the CloudWatch dashboard ARN. As part of the FIS experiments, when we run the experiment, we also generate reports. We'll show that to you when we run the experiment, and the report can actually use outputs from a CloudWatch dashboard to effectively demonstrate the impact of running the experiment. That's the ARN that it will use. The second output is the FIS bucket ARN. We've created an S3 bucket and we are going to pass the ARN for that to the FIS stack or the experiment stack.
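A minimal sketch of how those two outputs might be exported in CDK Java (the export names are illustrative, and the dashboard ARN is assembled by hand from the documented arn:aws:cloudwatch::<account>:dashboard/<name> format):

```java
import software.amazon.awscdk.CfnOutput;

// Sketch: export the dashboard ARN and FIS bucket ARN for the experiment stack.
String dashboardArn =
        "arn:aws:cloudwatch::" + getAccount() + ":dashboard/" + dashboard.getDashboardName();

CfnOutput.Builder.create(this, "FisDashboardArn")
        .exportName("FisDashboardArn")   // export names are illustrative
        .value(dashboardArn)
        .build();

CfnOutput.Builder.create(this, "FisBucketArn")
        .exportName("FisBucketArn")
        .value(fisBucket.getBucketArn())
        .build();
```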

Thumbnail 1560

Thumbnail 1580

Thumbnail 1590

Under resources, you can see everything the stack has created, including the bucket and so on. What I'll do is dive deep into one of the Lambda functions just to show you how the deployment actually looks. Let's pick up the get item function. This is a Lambda function that fetches items from a DynamoDB table. If I navigate to the Lambda, you can see a layer has been added. This is the FIS extension that has been introduced into the Lambda function as a layer.

Thumbnail 1600

Let's also look at a few of the parameters that we have set up. If we navigate to the environment variables, you can see these are all the configurations that we were introducing in the code. We have the FIS configuration location, which is the location to the S3 bucket where FIS will copy the experiment configuration. We have the extension metrics set, so it will actually generate a ton of metrics from the extension. We also have the slow polling interval and the Lambda exec wrapper and so on. Effectively, the Lambda is now set up with everything that needs to be configured to be ready to accept a fault introduction from FIS.

Thumbnail 1650

Thumbnail 1660

Thumbnail 1680

The last thing I'll show is the CloudWatch dashboard. This is for knowing your steady state before you run the experiment. So you have the pre-experiment state, the experiment state, and the recovery state. This is the FIS dashboard. As I've mentioned before, observability is absolutely the most critical thing that you do when you do chaos engineering. Having a clearly identified set of metrics that you want to track is absolutely critical so that when you run the experiment, you can actually see the impact.

Thumbnail 1700

Thumbnail 1710

Thumbnail 1720

What this dashboard does is show you the impact when an experiment runs. For example, the FIS latency impact should change when we introduce a latency error into the Lambda function. Similarly, if you introduce an error that causes the Lambda invocations to fail, then you should see some change in the metrics here. The next stage is for us to effectively write the CDK code that will set up the FIS experiment. I'll pass that on to Saurabh.

Thumbnail 1800

Live Demo Part 2: Creating FIS Experiment Templates with CDK Code

Before we move forward, let me quickly review what we have covered so far. We looked at the application architecture, we looked at the application code, and we looked at the hypothesis where I pointed out that the observability metrics are critical. If you're injecting errors, you need to understand what the impact would be latency-wise and error-wise. That's the hypothesis. Now we'll get into the CDK code. We are back in business. Are you still able to see, or would you like me to zoom in a bit?

Thumbnail 1820

Thumbnail 1830

Thumbnail 1840

Here is the CDK stack written in Java that I'm going to show you. When you're thinking about writing code for generating experiments, there are a couple of things that you have to build out. Starting from the bucket that we created, you have to configure the bucket and the CloudWatch dashboard. I will write that code, and in the interest of time, I'll take a shortcut rather than typing it out. What this piece of code is doing is importing the S3 bucket that has been created in the previous step. This is the bucket where FIS would write the configurations to. We are also importing the dashboard that was created, and that dashboard would be used in the report that is generated.
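A sketch of those imports in the experiment stack, assuming the export names from the earlier sketch:

```java
import software.amazon.awscdk.Fn;
import software.amazon.awscdk.services.s3.Bucket;
import software.amazon.awscdk.services.s3.IBucket;

// Sketch: pull in the configuration bucket and dashboard ARN exported by the API stack.
IBucket fisConfigBucket = Bucket.fromBucketArn(
        this, "ImportedFisConfigBucket", Fn.importValue("FisBucketArn"));
String dashboardArn = Fn.importValue("FisDashboardArn");
```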

Thumbnail 1880

Thumbnail 1890

Thumbnail 1910

Thumbnail 1930

Next, we will create a bucket which will be the destination of the FIS report. There are two buckets here. One bucket is used by FIS to share the experiment configuration. The second bucket is used by FIS to store the reports it generates when you run an experiment. With those two things done, we are now going to create some policies that FIS needs to be able to run the experiment on your behalf. I'll walk you through what these are.

Thumbnail 1950

Thumbnail 1960

Thumbnail 1980

We are creating this policy for the bucket so that FIS can write the experiment report to it. Since FIS needs to write the configuration to the S3 bucket, we are giving FIS permissions to read and write from the bucket that we configured, and the same applies for the report bucket policy. Since we are targeting Lambda functions, we are providing a Lambda policy and a tag policy; these enable FIS to fetch the Lambda functions you want to target. Because we use resource tags to identify the resources, we also add the tag policy here.

Thumbnail 1990

Thumbnail 2000

Thumbnail 2020

Now we want to send the FIS experiment logs to CloudWatch. That's why we are configuring the CloudWatch policy as well. The final bit here is the CloudWatch dashboard policy, which would allow FIS to fetch the dashboard and all the metrics inside the dashboard so that it can pull all of those and put them in the report. With all those policies now, we are ready to create the role. There is an error in the previous one. We'll see when we run this. Live coding has its advantages and disadvantages, so bear with us.
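Stepping back, the policies and the role they attach to might look roughly like this in CDK Java. The IAM actions shown are representative, not the exhaustive set FIS needs, and fisConfigBucket and fisReportBucket are the buckets from the sketches above:

```java
import software.amazon.awscdk.services.iam.PolicyStatement;
import software.amazon.awscdk.services.iam.Role;
import software.amazon.awscdk.services.iam.ServicePrincipal;
import java.util.List;

// Sketch: the role FIS assumes, with statements mirroring the policies above.
Role fisRole = Role.Builder.create(this, "FisRole")
        .assumedBy(new ServicePrincipal("fis.amazonaws.com"))
        .build();

// Read/write the experiment configuration in the config bucket.
fisRole.addToPolicy(PolicyStatement.Builder.create()
        .actions(List.of("s3:GetObject", "s3:PutObject", "s3:DeleteObject"))
        .resources(List.of(fisConfigBucket.getBucketArn() + "/*"))
        .build());

// Write the generated experiment report to the report bucket.
fisRole.addToPolicy(PolicyStatement.Builder.create()
        .actions(List.of("s3:PutObject"))
        .resources(List.of(fisReportBucket.getBucketArn() + "/*"))
        .build());

// Discover target Lambda functions by resource tag.
fisRole.addToPolicy(PolicyStatement.Builder.create()
        .actions(List.of("lambda:GetFunction", "tag:GetResources"))
        .resources(List.of("*"))
        .build());

// Deliver experiment logs and read dashboard metrics for the report.
fisRole.addToPolicy(PolicyStatement.Builder.create()
        .actions(List.of("logs:CreateLogDelivery",
                         "cloudwatch:GetDashboard",
                         "cloudwatch:GetMetricData"))
        .resources(List.of("*"))
        .build());
```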

Thumbnail 2050

Thumbnail 2060

Thumbnail 2090

Thumbnail 2100

What you see here is that we are creating a role with the name FIS role and passing in all the policies that we have created in the previous step. Now comes the part where we are creating the experiment template. For this session, we are going to create two experiment templates. One is going to create an experiment template which would inject a startup delay of two seconds for five minutes to all Lambda invocations for all the functions which are tagged with the specific value. I will now go ahead and add the code for it.

Thumbnail 2110

I'll walk you through the key components of it. Here we are giving a description to the experiment template. This is where you can be verbose so that the description is self-explanatory; when you go into the console, you can clearly see what the experiment is. Then I'm giving it a name of inject delay and specifying the action. The action here is invocation add delay, which is one of the actions that Haresh walked us through before. There are a couple of parameters here. The first is startup delay, which specifies the amount of delay you want to inject; in this case, two seconds. For how long? Five minutes: the PT5M value (an ISO 8601 duration) specifies how long you want this fault injected. So any invocations during that time would see this delay.

Thumbnail 2200

Then lastly, there is the invocation percent. Right now it is set to 100, which means that all the invocations of all the functions that are tagged would be impacted. But if you don't want this all or nothing kind of experience, you can set a different value, say 50 percent. It would be more sporadic, and you can see a different behavior of your application. Moving on to the target, the target specifies which Lambda functions you want to impact. The resource type is Lambda function because we want to impact the Lambda functions, and we want to impact the Lambda functions which are specified with this specific tag value. Lastly, the selection is all, so all Lambda functions. Again, you can specify if you want a certain percentage of Lambda functions to be impacted here.
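Putting the action, target, and parameters together, a CfnExperimentTemplate sketch in CDK Java might look like the following. The parameter names and the "Functions" target key follow the FIS Lambda action reference; verify the exact spellings against current documentation, and fisRole is the role from the earlier sketch:

```java
import software.amazon.awscdk.services.fis.CfnExperimentTemplate;
import java.util.List;
import java.util.Map;

// Sketch of the "inject delay" experiment template as a CDK L1 construct.
CfnExperimentTemplate.Builder.create(this, "InjectDelayTemplate")
        .description("Inject a 2s startup delay into all tagged Lambda functions for 5 minutes")
        .roleArn(fisRole.getRoleArn())
        .actions(Map.of("inject-delay",
                CfnExperimentTemplate.ExperimentTemplateActionProperty.builder()
                        .actionId("aws:lambda:invocation-add-delay")
                        .parameters(Map.of(
                                "duration", "PT5M",                 // fault active for 5 minutes
                                "startupDelayMilliseconds", "2000", // 2-second delay per invocation
                                "invocationPercentage", "100"))     // affect every invocation
                        .targets(Map.of("Functions", "taggedFunctions"))
                        .build()))
        .targets(Map.of("taggedFunctions",
                CfnExperimentTemplate.ExperimentTemplateTargetProperty.builder()
                        .resourceType("aws:lambda:function")
                        .selectionMode("ALL")
                        .resourceTags(Map.of("FIS-Ready", "True")) // same tag as on the functions
                        .build()))
        .stopConditions(List.of(
                CfnExperimentTemplate.ExperimentTemplateStopConditionProperty.builder()
                        .source("none") // no alarm-based stop condition in this sketch
                        .build()))
        // The report configuration (report bucket, dashboard, pre/post durations)
        // is wired via the template's report configuration property; schema
        // omitted here for brevity.
        .tags(Map.of("Name", "lambda-inject-delay"))
        .build();
```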

Thumbnail 2220

Thumbnail 2230

Thumbnail 2240

Thumbnail 2250

This last section of the experiment template configures the report. We specify the report bucket, where the report should go, and which CloudWatch dashboard we want to integrate in the report. Here we also specify the duration of the pre-experiment state, which is your steady state, the experiment state, and the duration of the post-experiment state. Think of the last one as a recovery state: once your experiment has concluded, how quickly your system goes back to normal. With that, this particular experiment template is ready.

Thumbnail 2270

Thumbnail 2280

Thumbnail 2290

Thumbnail 2320

Now similarly, I'm going to add code for the Lambda invocation error. You obviously don't have to use code to set up the experiment templates; you can use the console, but because this is a code walkthrough, we are showing you the details of the code. This experiment template is almost identical to the previous one. The difference is the action: instead of add delay, we are saying inject invocation error. Everything else is the same. Now this experiment template code is ready and we will deploy it in our account.

Thumbnail 2330

The question is: if we have prevent execution set to false, what would happen? If prevent execution is set to true, your Lambda is not invoked; the code inside the Lambda function is not executed, but the error is returned. If you set it to false, the Lambda code runs and then the fault is injected. So there could be scenarios, for example where creating a transaction has a side effect on your system, in which you would pick true or false accordingly.

Thumbnail 2400

Two of the actions support this kind of operating mode: the invocation error action and the one which returns integration responses.

Thumbnail 2420

You can decide whether you want the Lambda function to execute or not while the error or modified response is being returned.
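A small sketch of how that flag might appear in the invocation-error action's parameters, with the same caveats as before on exact parameter names:

```java
// Sketch: the invocation-error action with the prevent-execution flag
// discussed above (parameter names per the FIS Lambda action reference).
CfnExperimentTemplate.ExperimentTemplateActionProperty injectError =
        CfnExperimentTemplate.ExperimentTemplateActionProperty.builder()
                .actionId("aws:lambda:invocation-error")
                .parameters(Map.of(
                        "duration", "PT5M",
                        "invocationPercentage", "100",
                        // "true"  -> return the error without running the function body
                        // "false" -> run the function body, then return the error
                        "preventExecution", "true"))
                .targets(Map.of("Functions", "taggedFunctions"))
                .build();
```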

Thumbnail 2450

Thumbnail 2460

Thumbnail 2470

Thumbnail 2480

Thumbnail 2510

Running the Experiment: Observing Fault Injection and Report Generation

I already specified the FIS role; hopefully I covered that. I think maybe when you copied the first element, you might have copied everything. Let me go back and check. Anyone able to spot the error? Yeah, where? Line 51, column 15. So, the report bucket. I think it's been copied twice: fisBucket and fisReportBucket. Alright, so fisReportBucket. Should be good, right? Yes. This is the fun of live coding. Alright, so now the stack is being deployed, and while it deploys I'll go over and show you the steady state.

Thumbnail 2530

Thumbnail 2540

Thumbnail 2550

So you see the steady state here: the Lambda invocations are happening, and in this particular view there are no faults being injected. When the stack completes, we will run the experiment, and in the interest of time we'll run one and then walk through the findings. Any questions in the meantime?

It's a good question. I guess the thing is that you will need to set the permissions accordingly. I'm assuming you should be able to use the same location, but whether it can be the same bucket or not, I'm not sure. It should be fine. I can't see any reason why that wouldn't work. So is the question that you can use the same bucket for FIS configuration and the report? Yes. Well technically you can. Yeah. But they're representing two different things. So if you think from the modularity or separation of concerns perspective, you might want to keep them separate.

Thumbnail 2640

Thumbnail 2650

Thumbnail 2670

Is there a real requirement or is it just more of a best practice? It's not a requirement; it's more of a best practice that we think you should follow. And the permissioning model might be slightly different depending on what you're doing with the bucket, so that might be one reason to keep them separate. Maybe not a question but some feedback about the three buckets: everybody knows what a bucket essentially is, but I'm trying to understand the purpose of the buckets and how information flows through the service itself. It seems like there is a chance that I might push a wrong configuration to the bucket. Why is there not something that tells FIS directly to use this template and validates it when I retrieve the results?

Okay. So just to repeat the question (let me know if I've misunderstood): why do the FIS Lambda actions use S3 at all to share the configuration, and why can't FIS pass the configuration directly?

Thumbnail 2730

One key point is that the Lambda function will exist even without an experiment running on it. When you are setting up the Lambda function and the extension, there is no FIS experiment running at that point in time, so there is no need for it to have access to the FIS experiment configuration. It is only when you are running the experiment through FIS that the configuration needs to be shared with the FIS extension.

That might be the reason, though we can confirm with the service team on what the specific thinking was around that. If I understood the question correctly, the reason you have to configure an S3 bucket endpoint is that the extension code is written by AWS, but the experiment runs in your account. When an experiment is running in your account, there needs to be a place where it can put the configuration that can be used by the extension. So that is why the S3 bucket is necessary. Perhaps that is a feature request that we will take back to our service team.

Thumbnail 2800

Thumbnail 2810

Alright, so the stack is deployed, and this is the stack which I am showing for invocation error. I will walk you through a couple of key pieces here. The action itself, as we specified in the code, has a duration of five minutes with 100% invocation. Let us look at the targets. Here we specified all Lambda functions which are tagged with the specific value.

Thumbnail 2820

Thumbnail 2840

Now in this generate preview, you can actually look at the resources which would be impacted when you run this experiment. Think of it as another pair of eyes or a safety net where you can look at the resources and see something that you do not want to impact. For example, if for whatever reason you did not want to impact the create item function, by generating this preview you can go back and change the tags on those functions accordingly.

Thumbnail 2850

Thumbnail 2880

Then we talked about the report configuration. Here we are specifying the report destination where the report would go and along with the CloudWatch dashboard that it would take the metrics from. Now I will run the experiment in the interest of time. The warning that you see there is that when you are starting an experiment, it is going to actually impact your resources. This is not a simulation; this will actually run an action and introduce a fault into your Lambda function.

Thumbnail 2890

Thumbnail 2900

Thumbnail 2920

While it is running, I will walk you through a sample report. The report would look similar to this, with the experiment ID and template ID and the details about when it ran, which account, and what region. It would also have the timeline if you are running multiple faults in a sequence or in a combination. In this case we just had one, so it does not look any different. You just see one experiment there. And then it also has the dashboard which we were talking about.

Thumbnail 2960

You see the filled-out area here. This region is the duration where the Lambda faults were actually injected; this is the experiment duration. To the left of it is the steady state, the pre-experiment state; this is what a pre-experiment state would look like. Towards the right is the recovery or post-experiment state. Ideally you want your post-experiment state to look similar to your steady state. One thing I would call out here is the timing of the error injection.

Thumbnail 3000

We talked about polling. When the experiment starts, that is when FIS puts the configuration in the bucket. The extension polls the bucket, and based on that your faults are injected. The gap you see here is the time between when the experiment was started and when the configuration was read by the extension and applied to the Lambda invocations. Alright, I think we covered the experiments and the report.

Thumbnail 3020

Thumbnail 3030

Understanding Polling Trade-offs and New Gray Failure Scenarios

Now the experiment is running. What we can show you is the duration from a previous run before coming into this experiment. In the steady state, you can see in the FIS error metrics that there are no errors. Then the faults start happening, and when the experiment completes, the metric goes back to zero. That is what you can see here: this is your Lambda duration, this is the experiment state, and then it goes back to the steady state.

Thumbnail 3060

Thumbnail 3070

Now going back to the polling aspect. In this graph, do you see any difference? These are snippets from two different experiment reports. On the left-hand side, the polling interval was set to 120 seconds; on the right-hand side, 30 seconds. The gap between the time FIS started the experiment and the time the faults actually took effect is longer on the left-hand graph than on the right-hand one, because the left side was polling less frequently.

More frequent polling ensures that your experiments take effect sooner than with a longer polling interval. Basically, this is the trade-off: applying your experiments quickly versus the performance overhead of polling more frequently.

Thumbnail 3150

I think we have covered a lot of ground here. What I would pull up next is the code sample that we walked you through. This is based on a public repository we published earlier, and it has all the fault actions and implementations for multiple Lambda runtimes. If you want to try this out on your own, you can clone the GitHub repository and run these experiments in your account. You can also reach out to your AWS account team; they can host a workshop for you to get started. This fault action is also available in the FIS workshop and the chaos engineering workshop.

Thumbnail 3190

Thumbnail 3210

With that, we go back to the presentation. So we have seen the experiment code walkthrough and we have run the experiment. We looked at the polling mechanism and the difference between slow polling and hot polling. Now, a quick introduction to two new failure scenarios that have recently been introduced in FIS. When you think about failures, you tend to think about failures that are more absolute. The FIS scenario library already has scenarios where you can test the impact of an AZ impairment: what happens if you have a service or application running across multiple AZs and one AZ gets impaired. That is already available in the FIS scenario library. But that's an absolute scenario; either something is available or it is not.

We've heard a lot from customers asking for scenarios that are more in the partial disruption domain. You may have come across the term gray failure. These are scenarios where a service is not completely unavailable but has been partially degraded in some form. To let you run those kinds of scenarios against your application, we have introduced two new scenarios. One is AZ Application Slowdown, where you add latency between resources within a single AZ; you might want to understand the impact of some of your resources not performing at the right level. The second is for when an application deployed across multiple AZs experiences packet loss between components interacting across AZs; for that, we have the cross-AZ traffic slowdown scenario.

Thumbnail 3320

Thumbnail 3330

Thumbnail 3340

Thumbnail 3390

Thumbnail 3410

We are happy to point you towards more detailed documentation on those scenarios if you want. This is the FIS scenario library; we did not go through it in detail today, but you can find the scenarios in the FIS scenario library in the AWS console. I think we are pretty much there. There are a few more resources for you. A lot of what we spoke about, the benefits of chaos engineering and the kind of testing you can do throughout the lifecycle of your application, is covered in the resilience lifecycle framework on the left-hand side. If you are designing a highly available application, fault isolation boundaries are an important concept to understand, and you can read about them alongside the multi-region fundamentals. If you have applications deployed across multiple regions and want to understand how to ensure their resilience, that information is in the multi-region fundamentals. FIS now comes under the Resilience Hub webpage, so if you want to understand more about what features are available within FIS, navigate to the Resilience Hub webpage.

Lastly, drop by the AWS village and visit the CloudOps kiosks. We have a resilience booth there where you can get more one-on-one demos, and don't miss out on the swag. Thank you folks. I guess a walkthrough of a ton of FIS code is perhaps not how you wanted to end the day, but I hope you found the session useful. A lot of the code is already available, as mentioned; if you want to play around, access the QR code that was shared before. If you have any questions, please ask us now, or we are available throughout the week. We will also be at the CloudOps kiosk, so please come and ask us any questions you might have.

Thumbnail 3500

Thumbnail 3510

Q&A and Closing: Extension Versioning and Additional Resources

Yep. Once again, thank you. We have two minutes for questions, and then we can take more questions in the hallway. Question: when specifying the ARN, is it pinned to a specific version of the Lambda extension? I don't believe it points to a specific version, actually. If we go back to the AWS console, there is FIS extension documentation we can pull up on how the layer is specified. I think what you are really asking is how you handle the maintainability challenge once a new version comes along. Yeah, that is a good question. There is public documentation listing the latest extension version per Lambda runtime, but upgrading is something you will have to test to make sure it is not interfering with your Lambda function. It is a good question; we are happy to take that away.

Yeah. It is not going to change unless you update the stack, right? Yeah, that is true. So the latest version of this extension is published and you can look it up, but you are right: once you have deployed the stack, unless you are actively managing or maintaining it, you would not know you are falling behind on the extension version. Exactly. We can come back to you with a response on that; let's take it offline. Okay, I think we are at time, folks. Thank you so much, and thanks for your patience with all the challenges around the code. Thank you.


This article is entirely auto-generated using Amazon Bedrock.
