
Vadym Kazulkin for AWS Community Builders


Amazon DevOps Guru for the Serverless applications - Part 3 Anomaly detection on DynamoDB

Introduction

In part 1 of the series, we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. You also need to complete all the steps in part 2 so that everything is set up for our experiments. Now it's time to see DevOps Guru anomaly detection on real examples, starting with DynamoDB.

Anomaly Detection on DynamoDB

We'll start by provoking an anomaly on DynamoDB. For the sake of experimentation, we'll artificially create test cases that trigger such anomalies quickly and at the lowest possible cost. For the DynamoDB test case, let's reduce the "ReadCapacityUnits" capacity on the ProductsTable DynamoDB table from 5 to 1, as displayed in the image below.

Our goal is to provoke read throttling on this table more quickly. In a real-world scenario, your configured ReadCapacityUnits (in case you use "provisioned throughput" mode on the table) can be much higher, but throttling may still happen once you exceed it and burn through the burst credits.
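
If you prefer the CLI over the console, a change like this can also be applied with the AWS CLI. This is just a sketch, assuming the table is named ProductsTable and uses provisioned throughput mode; the write capacity value of 5 is an assumption, so use your table's current value:

# WriteCapacityUnits=5 is assumed; keep your table's current write capacity
aws dynamodb update-table \
    --table-name ProductsTable \
    --provisioned-throughput ReadCapacityUnits=1,WriteCapacityUnits=5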

Now we're ready to run our stress test. For this, we execute the following command, which retrieves the product with id equal to 1:

hey -q 10 -z 15m -c 9 -H "X-API-Key: a6ZbcDefQW12BN56WEN7" YOUR_API_GATEWAY_ENDPOINT/prod/products/1

Please note that you will need to pass your own YOUR_API_GATEWAY_ENDPOINT generated when deploying the SAM template (see the explanation above on how to find it). In this example, we run the test for 15 minutes (-z 15m), executing 10 queries per second per worker (-q 10) with 9 concurrent workers (-c 9). A duration of 15 minutes is more than enough to burn through the entire burst credits on the DynamoDB ProductsTable and throttle it, as the table now only has 1 ReadCapacityUnit.
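
If you don't have the endpoint at hand, one way to look it up is to list the outputs of the deployed CloudFormation stack. This is a sketch assuming your SAM stack is named DevOpsGuruDemoProductsAPI; adjust the stack name to your own deployment:

# stack name assumed; the outputs include the API Gateway endpoint URL
aws cloudformation describe-stacks \
    --stack-name DevOpsGuruDemoProductsAPI \
    --query "Stacks[0].Outputs"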

When DevOps Guru recognizes the anomaly (in our case, this happens after 7 to 9 minutes), it generates a so-called (operational) insight. Let's explore how the general dashboard looks when there is an ongoing reactive insight.

We see that we have "Ongoing reactive insights" and "Ongoing proactive insights," and the DevOpsGuruDemoProductsAPI application stack has been marked as unhealthy. We'll look into the "Ongoing proactive insights" in one of the upcoming articles. Let's explore the "Ongoing reactive insights" first. By clicking on it, we can view its details. As we see, it's the "DynamoDB read throttle events" insight as expected.
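
Besides the console, the currently ongoing reactive insights can also be listed with the AWS CLI. A small sketch; the filter JSON follows the ListInsights API, so double-check its shape against the current documentation:

aws devops-guru list-insights \
    --status-filter '{"Ongoing": {"Type": "REACTIVE"}}'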

By removing the filter "status = ongoing", we can also see past insights (by default, insights from the last 6 months are shown).

Now let's dig deeper into the "DynamoDB read throttle events" insight by going to the "Insight Overview" page. Here is one example of such an insight.

Here we see the information about the individual insight, such as severity (in the case of throttling, it is of course "high"), status, and start and end time. An OpsItem ID is provided in case you integrate with the AWS Systems Manager OpsCenter service (also the topic of one of the upcoming articles). Generally speaking, DevOps Guru is an anomaly detection service that should be used in conjunction with professional incident management tools or services, such as AWS's own Systems Manager OpsCenter or third-party tools like PagerDuty (also the topic of one of the upcoming articles) and Opsgenie from Atlassian.
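
If you want to pull the same details (severity, status, time range) programmatically, for example to forward them to such a tool, a describe call along these lines should work; <insight-id> is a placeholder for the ID taken from the list-insights output:

aws devops-guru describe-insight --id <insight-id>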

Further down the page, you can find other categories like "Graphed anomalies", "Aggregated metrics", and "Relevant events list". Let's start with "Aggregated metrics". There, you can see the metrics and the exact time frames where anomalies have been identified.

We can also group these metrics by "service name" in the upper-right corner.

With this, we get a much clearer picture: the incident started with an increased number of requests to DevOpsGuruDemoProductAPI (due to the execution of the stress test), which then led to ReadThrottleEvents on the DynamoDB ProductsTable. Increased latency metrics also appeared as a consequence of these throttle events. We can also take a closer look at the exact values of these metrics in the "Aggregated metrics" category.
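
The anomalous metrics behind this picture can also be retrieved from the CLI; again a sketch, with <insight-id> as a placeholder:

aws devops-guru list-anomalies-for-insight --insight-id <insight-id>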

Now let's explore the "Relevant events list" category.

We see two kinds of such events, "Infrastructure" and "Deployment", that occurred in the past and can potentially be the cause of the incident. With "Deployment", you can check when you last deployed, which might explain the detected anomaly. In our particular case, the "Infrastructure" events are more interesting, as they describe all changes to the configuration and settings in our application stack. If we click on the last grey circle for "Infrastructure", we can see the details of the event.
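
The same list of related events can be queried via the ListEvents API; a sketch that restricts the output to CloudTrail-sourced events for our insight (the filter keys are assumed from the API reference, so verify them for your CLI version):

aws devops-guru list-events \
    --filters '{"InsightId": "<insight-id>", "DataSource": "AWS_CLOUD_TRAIL"}'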

By clicking on the "ops event", we can view the complete details of this event in AWS CloudTrail (which, of course, must be enabled to capture such events).

We see that it is an UpdateTable event on the DynamoDB ProductsTable.
By digging deeper, we can see more details and the complete event record as JSON with the values before and after the change.
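
If you'd rather search for this event directly in CloudTrail from the command line, a lookup by event name is one way to do it (note that this returns all recent UpdateTable calls, not only the one on ProductsTable):

aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventName,AttributeValue=UpdateTable \
    --max-results 5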

But with CloudTrail alone, it takes time to compare the values in the "from" and "to" states. If AWS Config recording is enabled, we can click on "View AWS Config resource timeline" to see the difference directly:

With that, we see the change of "ReadCapacityUnits" from 5 to 1 and can now link it to the DynamoDB "read throttle events" insight. Of course, our example was artificial, since we manually reduced the RCUs. In other cases, a natural increase in the number of requests may lead to the same anomaly.
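
The same before/after comparison is also available from the AWS Config CLI; a sketch that fetches the two most recent configuration items recorded for the table (assuming the table name ProductsTable is used as the resource ID):

aws configservice get-resource-config-history \
    --resource-type AWS::DynamoDB::Table \
    --resource-id ProductsTable \
    --limit 2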

The final piece of information provided by DevOps Guru, together with links to the relevant documentation, is how to address the anomalies in this insight; see the particular recommendations below.
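
These recommendations can also be fetched with the CLI, again using the insight ID from list-insights as a placeholder:

aws devops-guru list-recommendations --insight-id <insight-id>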

Conclusion

In this article, we described how DevOps Guru was able to detect an anomaly on DynamoDB caused by read throttling on the table. Even if our example was a bit artificial, this kind of anomaly can happen in real-world scenarios, even with auto scaling enabled on the "Read capacity", by exceeding the "Maximum capacity units". We also went through all the additional information that DevOps Guru provided us, like "Graphed anomalies", "Aggregated metrics", "Relevant events list", and "Recommendations", to analyze and resolve this anomaly. In the next part of this series, we'll explore anomaly detection on API Gateway.
