Amazon DevOps Guru for the Serverless applications - Part 14 my wish and improvement list

#aws #serverless #devops #aiops

Introduction

In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. In the 2nd part of the series we went through all the steps needed to set everything up. In the subsequent parts we saw DevOps Guru in action, detecting anomalies on DynamoDB, Aurora Serverless v2, API Gateway and Lambda, alone and also in conjunction with other AWS Serverless services such as SQS, Kinesis, Step Functions, SNS and DynamoDB Streams.
In this part of the series, I'd like to present my opinionated wish and improvement list for Amazon DevOps Guru.

Wishlist for the Amazon DevOps Guru

AWS Serverless services that Amazon DevOps Guru does not yet support:

EventBridge
EventBridge Pipes
AppSync

None of them is mentioned in the list of supported services, and my tests confirmed that these AWS services are currently not analyzed by Amazon DevOps Guru.

Make pricing more accessible for testing, staging and individual AWS environments. Currently DevOps Guru pricing is based on the number of resources analyzed per hour and not on how heavily those resources are used. This means that an environment with the same set of resources costs the same whether it is a busy production environment or a mostly idle test one, which effectively rules out DevOps Guru for all AWS environments other than production/live ones.
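To make the pricing argument concrete, here is a minimal sketch of a per-resource-hour cost model. The price constant, the 730 hours per month and the two example environments are assumptions chosen purely for illustration and are not real DevOps Guru prices; please check the official pricing page for the actual numbers.

```python
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # assumption: average number of hours in a month

# Hypothetical placeholder, NOT the real DevOps Guru price.
HYPOTHETICAL_PRICE_PER_RESOURCE_HOUR = 0.003


@dataclass(frozen=True)
class Environment:
    name: str
    analyzed_resources: int
    # Traffic is kept here only to show that it never enters the formula.
    invocations_per_month: int


def monthly_cost(env, price_per_resource_hour=HYPOTHETICAL_PRICE_PER_RESOURCE_HOUR):
    """Cost depends only on how many resources are analyzed and for how long."""
    return env.analyzed_resources * HOURS_PER_MONTH * price_per_resource_hour


if __name__ == "__main__":
    environments = (
        Environment("production", analyzed_resources=50, invocations_per_month=20_000_000),
        Environment("staging", analyzed_resources=50, invocations_per_month=1_000),
    )
    for env in environments:
        print(f"{env.name}: {monthly_cost(env):.2f} per month")
```

Both environments produce the same monthly cost despite very different traffic, which is exactly why enabling DevOps Guru on rarely used test or staging environments is hard to justify.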

My improvement list for the Amazon DevOps Guru

In many cases I miss additional information in the insights for RDS instances (including Aurora Serverless and RDS, both with PostgreSQL) when connecting to them with standard approaches like JDBC rather than the Data API. If I exhaust all connections, CPU typically goes to 100% and, in the case of Aurora, the ACUs are exhausted as well. But in the created insight I very often observed only the anomalous metric "DatabaseConnections Sum" and no CPU or ACU CloudWatch metrics. In the article Anomaly detection on Aurora Serverless v2 I could reproduce the anomalous metrics DBLoad Average and DBLoadCPU Average, but only very sporadically, even under very constant high load.
DevOps Guru can't recognize the anomaly when connections are exhausted on Aurora Serverless v2 over the Data API (instead of JDBC). The reason is that the "DatabaseConnections Sum" CloudWatch metric is not exposed for the Data API (its value is always 0). I described this issue in my article Anomaly detection on Aurora Serverless v2 with Data API (kind of). This could be a CloudWatch rather than a DevOps Guru issue, as DevOps Guru mainly relies on the available CloudWatch metrics.
One service for which I couldn't reliably reproduce an incident was SNS. I created a scenario in which the SNS topic didn't have permission to put the message into the SQS queue or to invoke the Lambda function, so the message wasn't delivered. No DevOps Guru insight was created for it. The same was true for delivering SNS notifications via email. I created a separate email account to confirm the subscription and successfully delivered several notifications there. Then I deleted this email account and sent multiple notifications again. SNS monitoring showed me that the notifications were not being delivered, but no DevOps Guru insight was created. In the article Anomaly detection on SNS I presented some issues DevOps Guru has with detecting the anomaly when the HTTP(S) endpoint becomes unavailable.
The anomaly detection described in the article Anomaly detection on Lambda consuming from DynamoDB Streams also has some minor issues, as not all relevant metrics were shown in the insight.
What I currently really miss is a good integration between DevOps Guru and tracing services like X-Ray or CloudWatch ServiceLens, and also 3rd party ones like Datadog and Lumigo. Especially for Serverless applications it's very hard to analyze the root cause without access to tracing. It's not enough to link the CloudWatch Log Groups for the failed invocations, as we also need to be able to navigate the whole invocation chain quickly and see exactly which invocation in the chain was faulty. Here I see a lot of room for improvement, especially in collaboration with the 3rd party Serverless observability providers.
DevOps Guru Proactive insights have improved a lot compared to when I started using them more than a year ago. But there is still room for improvement, especially for cost reduction use cases like having configured (high) Lambda provisioned concurrency but not making use of it (anymore). I described this issue in the article Amazon DevOps Guru Proactive insights.
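Several of the points above come down to what CloudWatch itself reports, for example the database connection metric staying at 0 with the Data API, or failed SNS deliveries that are visible in SNS monitoring but produce no insight. The following is a minimal sketch, assuming boto3 and AWS credentials are available, of how such a metric could be checked directly. The cluster identifier and topic name are hypothetical placeholders, and the helper names are my own and not part of any AWS API.

```python
from datetime import datetime, timedelta, timezone


def build_sum_query(namespace, metric_name, dimensions, end, hours=3, period_seconds=300):
    """Build the keyword arguments for CloudWatch get_metric_statistics (Sum statistic)."""
    return {
        "Namespace": namespace,
        "MetricName": metric_name,
        "Dimensions": [{"Name": name, "Value": value} for name, value in dimensions.items()],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Sum"],
    }


def total(response):
    """Add up the Sum of all returned datapoints; 0 if there are none."""
    return sum(point["Sum"] for point in response.get("Datapoints", []))


if __name__ == "__main__":
    import boto3  # needs AWS credentials and a default region

    cloudwatch = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    checks = [
        # Hypothetical resource names, replace them with your own.
        ("AWS/RDS", "DatabaseConnections", {"DBClusterIdentifier": "my-aurora-cluster"}),
        ("AWS/SNS", "NumberOfNotificationsFailed", {"TopicName": "my-topic"}),
    ]
    for namespace, metric_name, dimensions in checks:
        query = build_sum_query(namespace, metric_name, dimensions, end=now)
        response = cloudwatch.get_metric_statistics(**query)
        print(f"{namespace} {metric_name}: {total(response)}")
```

If the connection metric stays at 0 while the database is clearly under load, the gap is on the CloudWatch side. If the failed notifications metric is clearly non-zero but no insight appears, the gap is more likely in the anomaly detection itself.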

Conclusion

In this part of the series, I presented my opinionated wish and improvement list for the Amazon DevOps Guru service. I'll bring this list to the attention of the AWS service team.
