Introduction
In part 1 of the series, we introduced the Amazon DevOps Guru service, described its value proposition and benefits, and explained how to configure it. You'll also need to go through all the steps described in part 2 to set everything up. In the subsequent parts, we saw DevOps Guru in action, detecting anomalies on DynamoDB and API Gateway. In this part of the series, we'll generate anomalies on the Lambda function.
Anomaly Detection on Lambda
There are several kinds of anomalies that we can observe with Lambda:
- Throttling
- Time out
- Initialization error
- Out of memory error
- Increased latency
We'll explore the detection of many of these anomalies in this article and then continue with other anomalies involving the Lambda function in the next part. Let's start with throttling. The easiest and most cost-effective way to simulate Lambda throttling is to set a very low reserved concurrency (like 1) on the particular Lambda function, as we do here on the GetProductById function.
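If you prefer the command line to the console, the reserved concurrency can also be set with the AWS CLI; a minimal sketch, assuming the function is deployed under the name GetProductById:

# cap the function at a single concurrent execution to force throttling
aws lambda put-function-concurrency --function-name GetProductById --reserved-concurrent-executions 1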
Now we're ready to run our stress test:
hey -q 7 -z 15m -c 5 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
to retrieve the product with id equal to 1, where API Gateway invokes the GetProductById Lambda function, which in turn calls DynamoDB. After several minutes, DevOps Guru will recognize a high-severity anomaly for the throttled Lambda function.
In the "Aggregated metrics" section, we see that two metrics with anomalous behavior are displayed: "Throttles Sum" for the GetProductById Lambda function and "5XXError Average" on the API Gateway (as the Lambda error is propagated to the API Gateway).
The "Graphed anomaly" section gives us more detailed information about both metrics.
The recommendation section usually gives us some hints about how to fix those anomalies.
With AWS CloudTrail active and, optionally, AWS Config recording enabled, we'll also see the API call that set the reserved concurrency limit to 1 in the "Relevant events" section under the "infrastructural" events category, as we described in the article about detecting anomalies on DynamoDB.
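If you'd like to cross-check that event from the command line as well, something like this could work (a sketch; it lists recent Lambda control-plane calls recorded by CloudTrail, among which you'd spot the concurrency change):

# list recent CloudTrail events coming from the Lambda API
aws cloudtrail lookup-events --lookup-attributes AttributeKey=EventSource,AttributeValue=lambda.amazonaws.com --max-results 10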
The same kind of throttling with similar metrics can also be reproduced by exceeding the Lambda concurrent executions limit, which usually has a default value of 1000. For that, we have to modify our stress test accordingly to exceed 1000 concurrent Lambda invocations, like this:
hey -q 500 -z 15m -c 50 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
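Note that the reserved concurrency limit set earlier would presumably have to be removed first, otherwise the function stays capped at a single concurrent execution; a hedged sketch, again assuming the GetProductById function name:

# remove the per-function reserved concurrency limit set for the previous test
aws lambda delete-function-concurrency --function-name GetProductById
# verify the account-level concurrent executions quota (defaults to 1000)
aws lambda get-account-settings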
In this case, we'll also see additional metrics displayed, like the invocation sum/count on Lambda and API Gateway, as such a stress test drastically increases the number of requests.
Please take into account that such a test will generate higher AWS costs.
Let's continue with Lambda error anomalies, starting with the timeout error. To provoke it, I simply added some sleep time to the GetProductById Lambda function code and configured a timeout shorter than the sleep time for this Lambda function. It is then sufficient to call this function for some period of time, like this:
hey -q 1 -z 10m -c 1 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
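For reference, lowering the timeout below the added sleep time could look like this with the AWS CLI; a minimal sketch (the 3-second value is an assumption and simply has to be smaller than the sleep in the handler):

# set a timeout shorter than the artificial sleep added to the handler
aws lambda update-function-configuration --function-name GetProductById --timeout 3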
As each call of the Lambda function runs into a timeout, DevOps Guru quickly recognizes this anomaly.
"Aggregated metrics" show the following metrics.
We mainly see "Errors Sum" for the GetProductById Lambda function and "5XXError Average" on the API Gateway (as the Lambda error is propagated to the API Gateway). The latency and duration metrics appear mainly because we artificially added latency by putting sleep time into our GetProductById Lambda function.
The "Graphed anomaly" section gives us more detailed information about these metrics.
To simulate a Lambda initialization error anomaly, we can take advantage of the fact that the Lambda function in the sample application is written in Java, which requires at least 256 MB of memory to run successfully. So by giving the Lambda function only 128 MB of memory and repeating the same test as for the Lambda timeout anomaly, we'll see that DevOps Guru detects this anomaly and presents the same metrics.
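The memory reduction itself could again be done with the AWS CLI; a minimal sketch, assuming the same function name:

# give the Java-based function too little memory to initialize
aws lambda update-function-configuration --function-name GetProductById --memory-size 128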
DevOps Guru will create the same anomaly and metrics for a Lambda function running out of memory. For this, we give the Lambda function a fairly limited amount of memory (256 MB) and start a memory-intensive computation in it. It would be nice if DevOps Guru differentiated between a Lambda function that fails to initialize and one that runs out of memory, as these are completely different errors. I think CloudWatch first needs to report such Lambda errors in a more fine-grained way for DevOps Guru to do a better job here.
Let's finish this article by discussing the Lambda function's increased latency anomalies and whether DevOps Guru detects them. Strictly speaking, increased latency doesn't produce any errors, and it is a matter of definition (and strictly a business decision) whether the latency is acceptable or not. But of course, increased latency impacts user behavior and may have a financial impact. To test this, I executed some scenarios where I added sleep time to the GetProductById Lambda function code (which normally executes in under 1 second when no cold start happens), but configured a timeout value high enough (like 50 seconds) that the Lambda function doesn't run into it. As long as this additional sleep time was between 1 and 20 seconds, DevOps Guru didn't create any insight, so a function duration increase from 1 to 21 seconds wasn't considered an anomaly. Adding a sleep time bigger than 20 seconds and periodically invoking our GetProductById Lambda function like this:
hey -q 1 -z 10m -c 1 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1
led DevOps Guru to recognize the following anomaly with medium severity.
"Aggregated metrics" show the following metrics.
Sometimes, the p90 metrics are displayed instead. The "Graphed anomaly" section gives us more detailed information about these metrics.
Such an anomaly has medium severity. It took DevOps Guru quite a long time (between 30 and 40 minutes, depending on the number of invocations with increased latency) to create an insight. I personally think this is acceptable, as short-period anomalies can be a big source of false-positive alarms.
Conclusion
In this article, we described how DevOps Guru was able to detect different kinds of anomalies on Lambda, like throttling, timeouts, initialization errors, out-of-memory errors, and increased latency. In the next part of this series, we'll explore other anomalies on the Lambda function, occurring mainly during its asynchronous invocations and in collaboration with other services like SQS, Step Functions, Kinesis, and RDS.