
Vadym Kazulkin for AWS Community Builders


Amazon DevOps Guru for the Serverless applications - Part 6 Continuing with anomaly detection on Lambda invocations

Introduction

In part 1 of this series, we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. To set everything up, we also need to go through all the steps described in part 2. In the subsequent parts, we saw DevOps Guru in action detecting anomalies on DynamoDB, API Gateway and, in the last article, Lambda. In this part of the series, we'll continue with anomalies on Lambda functions, especially those occurring in conjunction with other AWS services.

Anomalies with Lambda polling the SQS queue

Let's enhance our architecture so that, whenever a new product is created, we send a message to an SQS queue and have another Lambda function poll that queue (and then, as an example, inform the financial department so that it can provide the price for the newly created product).

Now let's imagine that the polling Lambda runs into some kind of error (a timeout or a runtime error while processing the SQS payload). In that case, the message remains in the SQS queue, and the polling Lambda retries reading from the queue according to the retry policy, running into the same error every time. The message stays in the queue until it expires or until the maximum number of Lambda retries is reached and the message is left unprocessed (or placed into a dead-letter queue).
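For reference, the polling side is wired up with an event source mapping, and the dead-letter queue is attached to the source queue via a redrive policy. A minimal sketch with the AWS CLI could look as follows (the queue and function names, the account ID, the region and the maxReceiveCount are assumptions for illustration):

# Attach a dead-letter queue to the source queue via a redrive policy (ARNs are placeholders)
aws sqs set-queue-attributes \
  --queue-url https://sqs.eu-central-1.amazonaws.com/123456789012/new-product-created \
  --attributes '{"RedrivePolicy":"{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-central-1:123456789012:new-product-created-dlq\",\"maxReceiveCount\":\"3\"}"}'

# Subscribe the polling Lambda to the queue
aws lambda create-event-source-mapping \
  --function-name CreatedProduct \
  --event-source-arn arn:aws:sqs:eu-central-1:123456789012:new-product-created \
  --batch-size 10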

After setting up such a scenario, we'd like to figure out whether DevOps Guru will detect such an anomaly, and it will. We see the high-severity Lambda Error anomaly being recognized.

Digging deeper into the "Aggregated metrics"

and "Graphed anomalies"

we see that besides "Errors Sum" on the CreatedProduct Lambda function, DevOps Guru recognized other deviating metrics on the "product-created" SQS queue, like "NumberOfMessageRecievedSum" and "ApproximateNumberOfMessageNotVisibleSum", which both indicate that there are unprocessed messages for a long period of time in the
SQS queue with the name "new-product-created".
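To double-check what these deviating metrics suggest, you can also inspect the queue attributes directly. A quick check with the AWS CLI (the queue URL is a placeholder) could be:

# Messages still waiting in the queue and messages currently in flight (being retried)
aws sqs get-queue-attributes \
  --queue-url https://sqs.eu-central-1.amazonaws.com/123456789012/new-product-created \
  --attribute-names ApproximateNumberOfMessages ApproximateNumberOfMessagesNotVisible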

Anomalies with Lambda polling the Kinesis Data Streams

Let's imagine we have a use case where a Lambda function polls a Kinesis Data Stream in order to store the data in an S3 bucket and analyze it with Amazon Athena or Amazon QuickSight.

Now let's imagine the polling Lambda runs into some kind of error (a timeout or a runtime error while processing the Kinesis records), similar to the previous use case with SQS. The record remains in the Kinesis Data Stream, and the polling Lambda retries processing it according to the retry policy, running into the same error every time. The record stays in the stream until it expires or until the maximum number of Lambda retries is reached and the record is left unprocessed (or, if a failure destination is configured for the Kinesis event source, sent there).
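For completeness, both the retry behavior and the failure destination mentioned above are configured on the Kinesis event source mapping itself. A minimal sketch with the AWS CLI could look as follows (the stream name, the destination queue ARN and the retry/age limits are assumptions for illustration):

# Subscribe the polling Lambda to the stream with a bounded retry policy
# and an on-failure destination for records that cannot be processed
aws lambda create-event-source-mapping \
  --function-name OrderedProduct \
  --event-source-arn arn:aws:kinesis:eu-central-1:123456789012:stream/ordered-products \
  --starting-position LATEST \
  --maximum-retry-attempts 3 \
  --maximum-record-age-in-seconds 3600 \
  --bisect-batch-on-function-error \
  --destination-config '{"OnFailure":{"Destination":"arn:aws:sqs:eu-central-1:123456789012:ordered-products-failures"}}'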

After setting up such a scenario, we'd like to figure out whether DevOps Guru will detect such an anomaly, and it will. We see the medium-severity (I'd personally rate it as high) Lambda Error anomaly being recognized.

Digging deeper into the "Aggregated metrics"

and "Graphed anomalies"

we see that besides "IteratorAge Maximum" on the OrderedProduct Lambda function, DevOps Guru recognized other deviating metrics on the Kinesis Data Streams, like "GetRecords.Byte Sum" and "GetRecords.Records Maximum", which both indicate that there are unprocessed Kinesis Data Streams record(s) for a long period of time.

Anomalies with Step Functions invoking Lambda

Now, let's imagine another use case with Step Functions calling a Lambda function as a part of some task.

Now this Lambda runs into some kind of error (a timeout or a runtime error while processing the payload), similar to the previous use cases with SQS and Kinesis Data Streams. This continues until we reach the maximum number of Lambda retries that we have configured in the Task retry policy of our State Machine.
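The retry behavior described above lives in the Task state of the state machine definition. A minimal sketch of such a definition, written to a file from the shell (the state name, the Lambda ARN placeholder and the retry parameters are assumptions for illustration), could look like this:

# Write a minimal state machine definition with a bounded retry policy on the Lambda task
cat > state-machine.json <<'EOF'
{
  "StartAt": "CreateProduct",
  "States": {
    "CreateProduct": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "YOUR_LAMBDA_FUNCTION_ARN",
        "Payload.$": "$"
      },
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 2,
          "BackoffRate": 2.0
        }
      ],
      "End": true
    }
  }
}
EOF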

This high-severity anomaly was also correctly recognized by DevOps Guru.

Digging deeper into the "Aggregated metrics"

and "Graphed anomalies"

we see that besides "Errors Sum" on our Lambda function, DevOps Guru recognized other deviating metrics on the Step Functions, like "Executons.Failed Sum" and "Executons.Failed Sum" that the number of failed and in the end aborted executions significantly increased for a certain period of time.

Anomalies with Lambda communicating directly with the RDS service

Now, let's imagine a scenario where we use RDS with PostgreSQL instead of DynamoDB.

Furthermore, we use neither RDS Proxy nor Aurora Serverless (Data API). With that, there is a risk of exhausting the database connections: if more Lambda functions run in parallel than the maximum number of database connections the RDS instance provides, new connections will fail. For the sake of cost saving, we'll use the smallest possible RDS instance, db.t3.micro, which provides 83 database connections. Please check the maximum number of database connections for your instance size; for a PostgreSQL RDS instance, for example, you can execute "show max_connections".
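For example, you can run the query mentioned above with psql against your instance, and also see how many connections are currently open (the endpoint, user and database name are placeholders):

# Maximum number of connections the PostgreSQL instance allows
psql -h YOUR_RDS_ENDPOINT -U YOUR_DB_USER -d YOUR_DB_NAME -c "show max_connections;"

# Number of connections currently open
psql -h YOUR_RDS_ENDPOINT -U YOUR_DB_USER -d YOUR_DB_NAME -c "select count(*) from pg_stat_activity;"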

Now let's run a stress test with the hey tool:

hey -q 50 -z 15m -c 20 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/productsFromRDS/2


With this (sending 50 requests per second per worker with 20 workers in parallel for 15 minutes), we'll quickly exhaust all database connections.

DevOps Guru correctly recognizes such a high-severity anomaly.

Digging deeper into the "Aggregated metrics"

and "Graphed anomalies"

we see that besides "5XXError Average" on our API Gateway, DevOps Guru correctly recognized the deviating metric "DatabaseConnections Sum" on the RDS. Strangely, although our Lambda function GetProductFromRDSByID also threw an error, there was no "Errors Sum" deviating metric detected.

Conclusion

In this article, we described how DevOps Guru was able to detect different kinds of anomalies involving Lambda: a Lambda function polling an SQS queue or a Kinesis Data Stream, as well as Step Functions invoking a Lambda function that runs into an error. We also set up a test to detect the anomaly of a Lambda function communicating directly with RDS and running out of database connections. In the next part of the series, we'll look into another DevOps Guru functionality called "Proactive Insights".
