Introduction
In part 1, we introduced the Amazon DevOps Guru service, described its value proposition and benefits, and explained how to configure it. Part 2 of the series walked through all the steps needed to set everything up. In the subsequent parts, we saw DevOps Guru in action, detecting anomalies on DynamoDB, Aurora Serverless v2, API Gateway, and Lambda, both on its own and in conjunction with other AWS Serverless services like SQS, Kinesis, Step Functions, SNS, and DynamoDB Streams.
In this part of the series, I'd like to present you with my opinionated Amazon DevOps Guru wish and improvement list.
Wishlist for Amazon DevOps Guru
- AWS Serverless services that Amazon DevOps Guru doesn't support yet:
  - EventBridge
  - EventBridge Pipes
  - AppSync

  None of them is mentioned in the list of supported services, and my tests also confirmed that these AWS services are currently not being analyzed by Amazon DevOps Guru.
- Make pricing more accessible for testing, staging, and individual AWS environments. Currently, DevOps Guru pricing is based on the number of resources analyzed per hour, not on how heavily those resources are used. This means that a rarely used test environment costs the same as a production environment with the same resources, which in practice disqualifies DevOps Guru for use in anything other than production/live environments.
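To make the pricing point concrete, here is a minimal arithmetic sketch. The rate, the resource count, and the function name `monthly_cost` are assumptions chosen for illustration only, not actual DevOps Guru prices. It only shows that under a per-resource-hour model, request volume is not an input to the cost.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def monthly_cost(resource_count, rate_per_resource_hour, hours=HOURS_PER_MONTH):
    """Cost under a per-resource-hour model; traffic is not a parameter at all."""
    return resource_count * rate_per_resource_hour * hours


# Hypothetical rate for illustration only; check the current price list.
EXAMPLE_RATE = 0.003  # USD per resource per hour (assumed)

# Same stack deployed twice: busy production and a mostly idle staging copy.
production = monthly_cost(resource_count=50, rate_per_resource_hour=EXAMPLE_RATE)
staging = monthly_cost(resource_count=50, rate_per_resource_hour=EXAMPLE_RATE)

print(f"production: {production:.2f} USD, staging: {staging:.2f} USD")
```

Both environments come out at the same amount, no matter how many requests each one actually serves.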
My improvement list for Amazon DevOps Guru
- In many cases, I miss additional information in the insights for RDS instances (including Aurora Serverless and RDS, both with PostgreSQL) when connecting to them with standard approaches like JDBC rather than the Data API. If I exhaust all connections, the CPU typically goes to 100%, and in the case of Aurora the ACUs are exhausted as well. But in the created insight, I very often observed only the anomalous metric "DatabaseConnections Sum" and no CPU or ACU CloudWatch metrics. In the article Anomaly detection on Aurora Serverless v2, I could reproduce the anomalous metrics DBLoad Average and DBLoadCPU Average, but they appeared only sporadically, even under a constant high load.
- DevOps Guru can't recognize the anomaly when connections are exhausted on Aurora Serverless v2 over the Data API (instead of JDBC). The reason is that the "DatabaseConnections Sum" CloudWatch metric is not exposed for the Data API (its value is always 0). I described this issue in my article Anomaly detection on Aurora Serverless v2 with Data API (kind of). It could be a CloudWatch issue rather than a DevOps Guru one, as DevOps Guru mainly relies on the available CloudWatch metrics.
- One service for which I couldn't reliably reproduce anomalies was SNS. I created a scenario where the SNS topic didn't have permission to put the message into the SQS queue or to invoke the Lambda function, so the message wasn't delivered. No DevOps Guru insight was created for it. The same was true for delivering SNS notifications via email. I created a separate email account, confirmed the subscription, and successfully delivered several notifications to it. Then I deleted this email account and sent multiple notifications again. SNS monitoring showed me that the notifications were not being delivered, but no DevOps Guru insight was created. In the article Anomaly detection on SNS, I presented some issues DevOps Guru has with detecting the anomaly when the HTTP(S) endpoint becomes unavailable.
- The anomaly detection described in the article Anomaly detection on Lambda consuming from DynamoDB Streams also has some minor issues, as not all relevant metrics were shown in the insight.
- What I currently miss most is a good integration between DevOps Guru and tracing services like X-Ray or CloudWatch ServiceLens, and also third parties like Datadog and Lumigo. Especially for Serverless applications, it's very hard to analyze the root cause without access to tracing. It's not enough to link the CloudWatch Log Groups for the failed invocations, as we also need to be able to navigate the whole invocation chain quickly and see exactly which invocation in the chain was faulty. Here I see a lot of room for improvement, especially in collaboration with the third-party Serverless observability providers.
- DevOps Guru Proactive insights have improved a lot since I started using them more than a year ago. But there is still room for improvement, especially for cost reduction use cases such as having (high) Lambda provisioned concurrency configured but not making use of it (anymore). I described this issue in the article Amazon DevOps Guru Proactive insights.
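For the Data API point in the list above, one way to check whether the connection metric carries any signal is to read it directly from CloudWatch. This is a minimal sketch, assuming boto3 is installed, AWS credentials are configured, and the Aurora cluster reports the `DatabaseConnections` metric in the `AWS/RDS` namespace under the `DBClusterIdentifier` dimension. The function names, the cluster identifier, and the three-hour window are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def connection_metric_is_flat_zero(datapoints):
    """Return True if no datapoint reports a non-zero Sum.

    An empty list also counts as flat: there is nothing anomaly
    detection could pick up in that case.
    """
    return all(dp.get("Sum", 0) == 0 for dp in datapoints)


def fetch_database_connections(cluster_id, hours=3, region_name=None):
    """Fetch 5-minute Sum datapoints of DatabaseConnections for one cluster."""
    import boto3  # imported lazily so the helper above works without AWS access

    cloudwatch = boto3.client("cloudwatch", region_name=region_name)
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/RDS",
        MetricName="DatabaseConnections",
        Dimensions=[{"Name": "DBClusterIdentifier", "Value": cluster_id}],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Sum"],
    )
    # CloudWatch does not guarantee datapoint order, so sort by time.
    return sorted(response["Datapoints"], key=lambda dp: dp["Timestamp"])
```

Calling `connection_metric_is_flat_zero(fetch_database_connections("my-aurora-cluster"))` after a load test over the Data API would then show whether the metric stayed at 0 for the whole window, which is the behavior described above.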
Conclusion
In this part of the series, I presented my opinionated wish and improvement list for the Amazon DevOps Guru service. I'll bring this list to the attention of the AWS service team.