Introduction
In part 1 of the series, we introduced the Amazon DevOps Guru service, described its value proposition and benefits, and explained how to configure it. You'll also need to go through all the steps described in part 2 to set everything up. In the subsequent parts, we saw DevOps Guru in action, detecting anomalies on DynamoDB and API Gateway. In this part of the series, we'll generate anomalies on the Lambda function.
Anomaly Detection on Lambda
There are several kinds of anomalies that we can observe with Lambda:
- Throttling
- Time out
- Initialization error
- Out of memory error
- Increased latency
We'll explore the detection of many of these anomalies in this article and then continue with other anomalies involving the Lambda function in the next part. Let's start with throttling. The easiest and most cost-effective way to simulate Lambda throttling is to set a very low reserved concurrency (like 1) on the particular Lambda function, as we do here on the GetProductById function.
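If you prefer the command line to the console, the reserved concurrency can also be set with the AWS CLI; a minimal sketch, assuming the function is deployed under the name GetProductById:

# cap the function at a single concurrent execution to force throttling
aws lambda put-function-concurrency --function-name GetProductById --reserved-concurrent-executions 1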
Now we're ready to run our stress test:
hey -q 7 -z 15m -c 5 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
to retrieve the product with id equal to 1, where API Gateway invokes the GetProductById Lambda function, which in turn calls DynamoDB. After several minutes, DevOps Guru will recognize a high-severity anomaly for the throttled Lambda function.
In the "Aggregated metrics" section, we see that two metrics with anomalous behavior are displayed: "Throttles Sum" for the GetProductById Lambda function and "5XXError Average" on the API Gateway (as the Lambda error is propagated to the API Gateway).
The "Graphed anomaly" section gives us more detailed information about both metrics.
The recommendation section usually gives us some hints about how to fix those anomalies.
With AWS CloudTrail active and, optionally, AWS Config recording enabled, we'll also see the API call that set the reserved concurrency limit to 1 in the "Relevant events" section under the "infrastructural" events category, as we described in the article about detecting anomalies on DynamoDB.
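If you'd like to cross-check that event from the command line as well, something like this could work (a sketch; it lists recent Lambda control-plane calls recorded by CloudTrail, among which you'd spot the concurrency change):

# list recent CloudTrail events coming from the Lambda API
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventSource,AttributeValue=lambda.amazonaws.com --max-results 10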
The same kind of throttling with similar metrics can also be reproduced by exceeding the Lambda concurrent executions limit, which usually has a default value of 1000. For that, we have to modify our stress test accordingly to exceed 1000 concurrent Lambda invocations, like this:
hey -q 500 -z 15m -c 50 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
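Note that the reserved concurrency limit set earlier would presumably have to be removed first, otherwise the function stays capped at a single concurrent execution; a hedged sketch, again assuming the GetProductById function name:

# remove the per-function reserved concurrency limit set for the previous test
aws lambda delete-function-concurrency --function-name GetProductById
# verify the account-level concurrent executions quota (defaults to 1000)
aws lambda get-account-settings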
In this case, we'll also see additional metrics displayed, like the invocation sum/count on Lambda and API Gateway, as such a stress test drastically increases the number of requests.
Please take into account that such a test will generate higher AWS costs.
Let's continue with Lambda error anomalies, starting with the timeout error. To provoke it, I simply added some sleep time to the GetProductById Lambda function code and configured a timeout shorter than the sleep time for this Lambda function. It is then sufficient to call this function for some period of time, like this:
hey -q 1 -z 10m -c 1 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
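For reference, lowering the timeout below the added sleep time could look like this with the AWS CLI; a minimal sketch (the 3-second value is an assumption and simply has to be smaller than the sleep in the handler):

# set a timeout shorter than the artificial sleep added to the handler
aws lambda update-function-configuration --function-name GetProductById --timeout 3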
As each call of the Lambda function runs into a timeout, DevOps Guru quickly recognizes this anomaly.
"Aggregated metrics" show the following metrics.
We mainly see "Errors Sum" for the GetProductById Lambda function and "5XXError Average" on the API Gateway (as the Lambda error is propagated to the API Gateway). The latency and duration metrics appear mainly because we artificially added latency by putting sleep time into our GetProductById Lambda function.
The "Graphed anomaly" section gives us more detailed information about these metrics.
To simulate a Lambda initialization error anomaly, we can take advantage of the fact that the Lambda function in the sample application is written in Java, which requires at least 256 MB of memory to run successfully. So by giving the Lambda function only 128 MB of memory and repeating the same test as for the Lambda timeout anomaly, we'll see that DevOps Guru detects this anomaly and presents the same metrics.
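The memory reduction itself could again be done with the AWS CLI; a minimal sketch, assuming the same function name:

# give the Java-based function too little memory to initialize
aws lambda update-function-configuration --function-name GetProductById --memory-size 128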
DevOps Guru will create the same anomaly and metrics for a Lambda function running out of memory. For this, we give the Lambda function a fairly limited amount of memory (256 MB) and start a memory-intensive computation in it. It would be nice if DevOps Guru differentiated between a Lambda function that fails to initialize and one that runs out of memory, as these are completely different errors. I think CloudWatch first needs to report such Lambda errors in a more fine-grained way for DevOps Guru to do a better job here.
Let's finish this article by discussing the Lambda function's increased latency anomalies and whether DevOps Guru detects them. Strictly speaking, increased latency doesn't produce any errors, and it is a matter of definition (and strictly a business decision) whether the latency is acceptable or not. But of course, increased latency impacts user behavior and may have a financial impact. To test this, I executed some scenarios where I added sleep time to the GetProductById Lambda function code (which normally executes in under 1 second when no cold start happens), but configured a timeout value high enough (like 50 seconds) that the Lambda function doesn't run into it. As long as this additional sleep time was between 1 and 20 seconds, DevOps Guru didn't create any insight, so a function duration increase from 1 to 21 seconds wasn't considered an anomaly. Adding a sleep time bigger than 20 seconds and periodically invoking our GetProductById Lambda function like this:
hey -q 1 -z 10m -c 1 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
led DevOps Guru to recognize the following anomaly with medium severity.
"Aggregated metrics" show the following metrics.
Sometimes, the p90 metrics are displayed instead. The "Graphed anomaly" section gives us more detailed information about these metrics.
Such an anomaly has medium severity. It took DevOps Guru quite a long time (between 30 and 40 minutes, depending on the number of invocations with increased latency) to create an insight. I personally think this is acceptable, as short-period anomalies can be a big source of false-positive alarms.
Conclusion
In this article, we described how DevOps Guru was able to detect different kinds of anomalies on Lambda, like throttling, timeouts, initialization errors, out-of-memory errors, and increased latency. In the next part of this series, we'll explore other anomalies on the Lambda function, occurring mainly during its asynchronous invocations and in collaboration with other services like SQS, Step Functions, Kinesis, and RDS.