Amazon DevOps Guru for the Serverless applications - Part 13 Anomaly detection on Aurora Serverless v2 with Data API (kind of)

#aws #serverless #devops #aiops

Introduction

In my article Amazon DevOps Guru for the Serverless applications - Part 10 Anomaly detection on Aurora Serverless v2, we learned that DevOps Guru successfully detected anomalies on the Aurora (Serverless v2) PostgreSQL database when a Lambda function with the Java 21 managed runtime was connected to it via JDBC. We let the database scale only from 0.5 to 1 ACU and put a very high load on it by invoking a Lambda function that retrieves a product by id several hundred times concurrently for multiple minutes. DevOps Guru correctly pointed to the increased sum of database connections and the constantly high database (CPU) load. In this article, I'd like to figure out whether DevOps Guru detects the anomaly when we repeat the same experiment but use the Data API for Aurora Serverless v2 with the AWS SDK for Java instead of JDBC.

This article is a copy of my article Aurora Serverless v2 Data API meets DevOps Guru or not?, which I published as part of the Data API for Amazon Aurora Serverless v2 with AWS SDK for Java series.

Anomaly detection on Aurora Serverless v2 with Data API

Let's look at our sample application and use the SAM template to create the infrastructure and deploy the application described in the following picture:

The application creates products stored in the Aurora Serverless v2 PostgreSQL database and retrieves them by id using the Data API. The relevant Lambda function, which we'll use to retrieve a product by its id, is GetProductByIdViaAuroraServerlessV2DataApi, and its handler implementation is GetProductByIdViaAuroraServerlessV2DataApiHandler.

As in the previous article, we use the hey tool to perform the stress test like this:

hey -z 15m -c 300 -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/productsWithDataApi/1

In this example, we invoke the API Gateway endpoint with 300 concurrent workers for 15 minutes. Behind the prod/productsWithDataApi endpoint, the Lambda function GetProductByIdViaAuroraServerlessV2DataApi is invoked, which retrieves the product with id 1 from the Aurora Serverless v2 PostgreSQL database.
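To make the meaning of the -z and -c options more concrete, here is a minimal Java sketch of what such a duration-based load run roughly does: a fixed number of workers send requests in a loop until the time is up, and the responses are counted per HTTP status code. This is only an illustration under my own assumptions and not how hey is implemented. The class LoadTestSketch and its methods run and get are hypothetical names that are not part of the sample application, and the 20-second client timeout is an arbitrary choice.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.IntSupplier;

// Hypothetical helper for illustration; not part of the sample application.
class LoadTestSketch {

    /** Runs `workers` request loops in parallel until `duration` has elapsed
     *  and returns the number of responses per status code (-1 = client-side failure). */
    static Map<Integer, Long> run(int workers, Duration duration, IntSupplier request) {
        Map<Integer, LongAdder> counts = new ConcurrentHashMap<>();
        long deadline = System.nanoTime() + duration.toNanos();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.execute(() -> {
                while (deadline - System.nanoTime() > 0) {
                    int status;
                    try {
                        status = request.getAsInt();
                    } catch (RuntimeException e) {
                        status = -1; // transport error or client-side timeout
                    }
                    counts.computeIfAbsent(status, k -> new LongAdder()).increment();
                }
            });
        }
        pool.shutdown();
        try {
            // Give in-flight requests some extra time to finish after the deadline.
            pool.awaitTermination(duration.toMillis() + 60_000, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        Map<Integer, Long> result = new TreeMap<>();
        counts.forEach((code, n) -> result.put(code, n.sum()));
        return result;
    }

    /** Sends one GET request with the API key header and returns the HTTP status code. */
    static int get(HttpClient client, String url, String apiKey) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .header("X-API-Key", apiKey)
                .timeout(Duration.ofSeconds(20)) // assumption: a bit above the Lambda timeout
                .GET()
                .build();
        try {
            return client.send(request, HttpResponse.BodyHandlers.discarding()).statusCode();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Offline demo with a stub that always answers with status 200.
        System.out.println(run(4, Duration.ofMillis(200), () -> 200));
        // A real run mirroring the hey command (url and apiKey are placeholders):
        // HttpClient client = HttpClient.newHttpClient();
        // run(300, Duration.ofMinutes(15), () -> get(client, url, apiKey));
    }
}
```

Passing the request as an IntSupplier keeps the loop independent of the network, so the counting logic can be tried out locally before pointing it at a real endpoint.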

In our SAM template, we configured the Aurora database cluster to scale from a minimum capacity of 0.5 ACU to a maximum capacity of 1 ACU under increased load. This is a very small database size, which we chose to save costs.

  AuroraServerlessV2Cluster:
    Type: 'AWS::RDS::DBCluster'
...
      ServerlessV2ScalingConfiguration:
        MinCapacity: 0.5
        MaxCapacity: 1

An Aurora (Serverless v2) database manages the maximum number of available database connections proportionally to the database size (in our case, the ACU setting). This also applies to the Data API for Aurora Serverless v2, which is a huge difference from v1, where there was a hard quota of 1000 database connections per second (v1 goes out of support at the end of 2024). For more information, please read the documentation about Maximum connections for Aurora Serverless v2. So, with the increased number of invocations, we expect to soon reach the maximum number of available database connections and a high database (CPU) load, so that the database won't be able to respond to new Lambda function requests to retrieve a product by id (the Lambda function will then also run into a timeout). With that, we provoke the anomaly and would like to figure out whether DevOps Guru is able to detect it. And it was, kind of. The following insight was generated:

The following aggregated anomalous metrics have been identified:

Compared to the aggregated anomalous metrics identified when using JDBC instead of the Data API, described in my article Amazon DevOps Guru for the Serverless applications - Part 10 Anomaly detection on Aurora Serverless v2, we completely miss the anomalous Aurora database metrics: the database connection sum and the database (CPU) load. We do correctly see the error in the Lambda function, which ran into the defined timeout of 15 seconds because the database couldn't respond.

So, what's the difference? Let's explore both incidents that we reproduced on the Aurora Serverless v2 PostgreSQL cluster with JDBC (non-Data API) and with the Data API:

In terms of ACU utilization/scaling, they both look the same:

In terms of other database metrics like CPU Utilization, DatabaseConnection, and DBLoad(CPU), there are huge differences:

CPU Utilization looks the same for the JDBC (non-Data API) and Data API cases. But DevOps Guru doesn't seem to consider this metric, as we didn't see it even in the JDBC experiment.
DBLoad(CPU) is very low for Data API usage. It seems that for the Data API, there is a load balancer in front of the Aurora Serverless v2 database, which monitors the connection usage and protects the database from being overloaded.
The DatabaseConnection metric is not shown (or is shown as 0) for Data API usage. The reason is that we don't manage the database connections ourselves with the Data API; this is done for us on the other side. Of course, connections still play an important role, as we learned in Maximum connections for Aurora Serverless v2, but this metric seems not to be exposed to the outside in CloudWatch Metrics, and even DevOps Guru doesn't have any access to the real numbers.

With that and the very low DBLoad(CPU), DevOps Guru generated no insight for the Aurora Serverless v2 cluster itself when using the Data API, in contrast to the JDBC use case.
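The reasoning that a metric which is always reported as 0 cannot be flagged can be illustrated with a deliberately simple baseline check. DevOps Guru's actual detection is based on machine learning models that are not described in this article, so the sketch below is explicitly not how the service works. It is a plain "more than k standard deviations from the baseline mean" rule, the class FlatMetricSketch is a hypothetical name, and all numbers are made up for the illustration and are not measurements from the experiments.

```java
// Hypothetical illustration only: this is NOT the algorithm DevOps Guru uses.
class FlatMetricSketch {

    /** Returns true if any value in `window` deviates from the baseline mean
     *  by more than `k` baseline standard deviations. */
    static boolean hasAnomaly(double[] baseline, double[] window, double k) {
        double mean = 0;
        for (double v : baseline) {
            mean += v;
        }
        mean /= baseline.length;

        double variance = 0;
        for (double v : baseline) {
            variance += (v - mean) * (v - mean);
        }
        double std = Math.sqrt(variance / baseline.length);

        // Small floor so that a constant baseline does not produce a zero-width band.
        double tolerance = Math.max(k * std, 1e-9);
        for (double v : window) {
            if (Math.abs(v - mean) > tolerance) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Made-up connection counts, not taken from the experiments.
        double[] idleConnections = {4, 5, 6, 5, 4};
        double[] connectionsUnderJdbcLoad = {5, 150, 180};
        double[] alwaysZero = {0, 0, 0, 0, 0};
        double[] zeroUnderDataApiLoad = {0, 0, 0};

        System.out.println(hasAnomaly(idleConnections, connectionsUnderJdbcLoad, 3.0)); // true
        System.out.println(hasAnomaly(alwaysZero, zeroUnderDataApiLoad, 3.0));          // false
    }
}
```

Whatever detection method is used, a series that stays at 0 before and during the load test carries no signal to detect, which matches the observation that no database connection anomaly showed up for the Data API case.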

I ran a second experiment by connecting to the Aurora Serverless v2 cluster directly: I wrote a script that fetches the product by id several hundred times the standard way (non-Data API). This is similar to what we did with the hey tool, but it hits the database directly instead of going through API Gateway. After I put the database under this load, I started the same experiment with the hey tool as described above to see what would happen. The same insight was generated, but this time with the following anomalous metrics:

Now we see at least an additional Aurora Serverless v2 database connection sum anomalous metric, but DBLoad(CPU) metrics are still missing.

Graphed anomalies look like this:

Of course, the experiment wasn't clean, as I ran 2 load tests one after another and partially in parallel: the first connecting to the database directly without API Gateway, and the second using the Data API. This confirmed my initial assumption that the database connection sum metric is a very important criterion for generating a DevOps Guru insight for Aurora Serverless v2 (and for RDS in general), and that it's generally not exposed when using the Data API.
I already contacted the DevOps Guru team and shared my insights with the expectation that they will improve the service. First of all, the missing exposure of the database connections as a CloudWatch metric should be fixed for Aurora Serverless v2 with the Data API.

Conclusion

In this article, we learned that DevOps Guru could detect the anomaly on the Aurora (Serverless v2) PostgreSQL database when a Lambda function with the Java 21 managed runtime was connected to it via the Data API, but it only showed the anomalous metrics related to the Lambda function timing out because the database didn't respond. The main reason seems to be that the database connection count isn't exposed as a CloudWatch metric (or is always displayed as 0) when using Aurora Serverless v2 with the Data API. Aurora Serverless v2 database metrics (database connection sum) were only shown during the second, artificial experiment.

Please also check out my website for more technical content and upcoming public speaking activities.