Is your monitoring testing strategy chaos?

Introduction

Nowadays, many cloud implementations make use of serverless architectures, such as AWS Lambda and API Gateway, to implement microservices or similar functionality, delivering business logic without the need to manage servers.

This is now a mature pattern, and we have a wealth of tools and approaches to help us ensure that our serverless code is performing as expected. We can develop and test locally, and use pipelines to deploy, all of which minimises the risk of deploying non-functioning code.

Whenever I'm working with teams, I have some best practices I recommend, such as deploying lambdas via CI/CD, ensuring that logs have a retention period set, and so on. I also recommend that they have monitoring in place to capture errors, failures or timeouts. But whilst testing code functionality is relatively straightforward, it can be more complex to ensure that monitoring is capturing the events we want, or that alarms are raised when issues are detected.

With code, we'll test that our functionality works (happy path testing) and also test how we handle errors (unhappy path), but how can we test our monitoring? I typically work within regulated industries, where any testing has to be reproducible and we need to evidence our testing approach, which is difficult to do if we make changes manually. Alternatively, we could modify our code to introduce errors or slow it down (how many people have included code similar to if TEST then ...?), but this adds complexity that we should avoid, since our code should contain only business logic.

One technique that can help with this is Chaos Engineering. I've written previously on using Amazon's Fault Injection Service (also known as FIS) to deliver 'Chaos Engineering as a Service', and I've been using FIS to test various types of AWS resources.

With servers, it's relatively easy to see how we might implement Chaos Engineering tools: we could, for example, ssh into the server and run scripts to introduce CPU load or slow network connectivity. But how do we test code running in a serverless environment, where we have no control over the underlying environment?

Bringing Chaos to Serverless environments

Luckily at re:Invent 2024, AWS announced they were introducing new capabilities to FIS, allowing it to interact with Lambda.

The new functionality provides three methods of testing lambdas:

  1. delaying the start of a lambda function,
  2. forcing the function to generate an error, or
  3. modifying the responses returned by a lambda function.

To enable this functionality for a given lambda, we need to perform four actions:

  1. Configure the lambda to use a Lambda Layer that allows FIS to interact with the lambda runtime environment,
  2. Create an S3 bucket which is used to pass configuration and runtime data between FIS and the lambda layer
  3. Add some environment variables to the lambda configuration
    1. AWS_FIS_CONFIGURATION_LOCATION - the S3 bucket (and an optional prefix within the bucket)
    2. AWS_LAMBDA_EXEC_WRAPPER - the executable within the layer for FIS to use; currently this should be /opt/aws-fis/bootstrap
  4. Ensure that the IAM execution role used to run the lambda has permissions to read and list the contents of the bucket

For more information on these prerequisites, see the AWS documentation.
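
As a rough sketch of what steps 1 to 3 might look like from the command line (the function name, bucket name and layer ARN below are placeholders; use the FIS extension layer ARN published for your region, and note that step 4, the IAM permissions, isn't shown):

# Placeholder values - substitute your own function name, bucket name and the
# FIS extension layer ARN published for your region (see the AWS documentation)
FUNCTION_NAME="my-example-lambda"
CONFIG_BUCKET="my-fis-lambda-config-bucket"
FIS_LAYER_ARN="arn:aws:lambda:eu-west-1:123456789012:layer:example-fis-extension:1"

# Create the S3 bucket used to pass configuration and runtime data between FIS
# and the lambda layer
aws s3 mb "s3://${CONFIG_BUCKET}"

# Attach the layer and set the two environment variables. Note that
# --environment replaces the function's existing environment variables, so
# include any it already relies on, and check the AWS documentation for the
# exact value format expected for AWS_FIS_CONFIGURATION_LOCATION.
aws lambda update-function-configuration \
  --function-name "${FUNCTION_NAME}" \
  --layers "${FIS_LAYER_ARN}" \
  --environment "Variables={AWS_FIS_CONFIGURATION_LOCATION=arn:aws:s3:::${CONFIG_BUCKET}/FisConfigs/,AWS_LAMBDA_EXEC_WRAPPER=/opt/aws-fis/bootstrap}"

Remember that the lambda's execution role will also need permission to list the bucket and read the objects that FIS writes to it.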

Defining our testing approach

Once we have our lambdas configured to allow them to use the FIS lambda layer, we need to define how we want to test them. To do this in FIS, we define an experiment template. Templates comprise a number of components as shown below:

FIS Experiment Template Components

In this case, we're interested in two component types:

  • Targets - these define which AWS resources we want to test; in this case, our lambda function
  • Actions - these describe what we want to do to those resources; with lambdas, this would be one of the three actions above: delay execution, introduce errors, or change the response code (the CLI snippet below shows how to list the exact action types available)
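
If you want to confirm exactly which Lambda-related action types are available in your account before building a template, a minimal sketch is to query FIS directly from the CLI:

# List the FIS action types that apply to lambda functions, with descriptions
aws fis list-actions \
  --query "actions[?starts_with(id, 'aws:lambda')].{id:id,description:description}" \
  --output table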

As an example, we might create a template which carries out these actions:

FIS Experiment Template Components

And configure this to run with targets that match all lambdas with a particular tag. In FIS, this would look something like:

FIS Experiment Template Definition

Running our tests

Once we have set up monitoring, we might expect to see results similar to those below. The images are from a CloudWatch dashboard, but whichever monitoring tool you use, you should be able to see something similar.

With lambdas running as expected, an example dashboard might look something like:

Normal Monitoring Dashboard

If we take the experiment described above, and run it during a similar period, our dashboard would now look like this:

Dashboard with errors

We can see a dip in the Invocations widget, with matching peaks in the Duration and Latency widgets at 09:45, which ties into when our experiment introduces a delay in the lambda executions. This is followed by a peak in the Error Count widget at 09:55, when we introduce errors via the experiment. Finally, when we change the response code at 10:10, there is a peak in the 4xx Error Count graph.

Remember, this is done with no code changes and no manual modification of how the infrastructure performs, and, more importantly, it provides a repeatable, auditable experiment that can be run at any time.

So how do I try this?

I've created a GitHub repository which contains a CloudFormation template to deploy:

  • an example lambda, configured to use the FIS lambda layer,
  • an API Gateway to access the lambda,
  • an example CloudWatch dashboard,
  • a FIS experiment template which you can run to test the lambda.

The repository with instructions can be found at https://github.com/headforthecloud/cloudformation-aws-fis-lambda-monitoring.
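
As a rough sketch of deploying the stack and retrieving its outputs (the template file name and stack name here are placeholders; the repository README has the exact instructions):

# Deploy the CloudFormation template from the repository (names are placeholders)
aws cloudformation deploy \
  --template-file template.yaml \
  --stack-name fis-lambda-monitoring-demo \
  --capabilities CAPABILITY_IAM

# Show the stack outputs, including the API Gateway URL used below
aws cloudformation describe-stacks \
  --stack-name fis-lambda-monitoring-demo \
  --query "Stacks[0].Outputs" \
  --output table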

As part of the deployment, it will output the URL to access the API Gateway. For testing, and to produce the above dashboards, I used this simple bash script to call the API Gateway:

#!/bin/bash
# Generate a steady stream of traffic by calling the API Gateway endpoint
# in the background every half second
while :
do
    curl _insert_gateway_url_here &
    sleep 0.5
done

After running this for a while to establish a baseline, we can start the FIS experiment, which should produce results in the dashboard similar to those seen above.
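
If you'd rather start the experiment from the command line than from the console, a minimal sketch looks like this (the ids are placeholders; use the experiment template created by the stack):

# Start an experiment from the template (template id is a placeholder)
aws fis start-experiment \
  --experiment-template-id EXTexampletemplateid

# Check on progress using the experiment id returned by the previous command
aws fis get-experiment --id EXPexampleexperimentid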

Conclusion

When it comes to our monitoring, we should have a formal, defined approach to testing, rather than an 'it'll be ok' mindset. Using AWS FIS in conjunction with the Lambda-specific tests, we can move away from manual tinkering with configuration, or intrusive if TEST then ... code blocks, and move to an approach where chaos engineering is an integral part of our testing process.

Taking this approach means we can:

  • Validate our monitoring: Ensure that our dashboards and alerts show us when real issues occur,
  • Audit our resilience: Provide stakeholders with repeatable, documented evidence that our monitoring approach is robust and fit for purpose,
  • Streamline our code: Ensure that our code is focused on business value, and reduce our unit testing overheads.

Embracing chaos lets us demonstrate that our monitoring approach works and provides our teams with the overview they need when they need it, rather than at 3:00 AM on a Sunday morning.

So go ahead, introduce chaos to your testing - your team will thank you for it!

This post was originally published on my blog at: https://headforthe.cloud/article/is-your-testing-strategy-chaos/
