Monitoring AWS RDS may require some observability strategy changes if you switched from a classic on-prem MySQL/PostgreSQL solution.
AWS RDS is a great solution that helps you focus on the data, and forget about bare metal, patches, backups, etc. However, since you don’t have direct access to the machine, you’ll need to adapt your monitoring platform.
In this article, we are going to describe the differences between an on-prem database solution and AWS RDS, as well as how you can start monitoring AWS RDS. Also, we will identify the top five key metrics for monitoring AWS RDS. Maybe even more!
Since AWS RDS is a managed cloud service, you configure and use it through the AWS Console or the AWS API. You won't have a terminal to access the machine directly, so every operation, like replication, backups, or disk management, has to be done this way.
You won't have to worry about infrastructure tasks such as replication, scaling, or backups, but you won't have direct access to the instance either. As a result, you won't be able to monitor AWS RDS with a classic node-exporter strategy.
Using PromCat to include AWS RDS in this setup will take you a couple of clicks. Just configure the credentials and apply the deployment in your cluster. Every step in the configuration is very well explained in the AWS RDS PromCat setup guide.
Databases constantly use memory to cache queries, tables, and results in order to minimize disk operations. This is directly related to how your database performs: not having enough memory causes a low cache hit rate and an increase in your database's response time. This is not good news!
Also, every time a client connects to your database, it creates a new process that will use some memory. In situations with massive concurrent connections, like Black Friday, running out of memory can result in multiple rejected connections.
You can keep an eye on this with the aws_rds_freeable_memory_average metric (which YACE reads from the CloudWatch FreeableMemory metric). It tells you the memory available, in bytes, for your instance.
Let's create an alert if the available memory is under 128 MB:
aws_rds_freeable_memory_average < 128*1024*1024
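If you run your own Prometheus, queries like this typically live in an alerting rule file. Here is a minimal sketch of such a rule (the group name, alert name, `for` duration, and severity label are illustrative choices, not part of the original setup):

```yaml
groups:
  - name: aws-rds-alerts            # illustrative group name
    rules:
      - alert: RdsLowFreeableMemory
        expr: aws_rds_freeable_memory_average < 128 * 1024 * 1024
        for: 5m                     # require the condition to hold for 5 minutes
        labels:
          severity: warning
        annotations:
          summary: "RDS instance has less than 128 MB of freeable memory"
```

The `for` clause avoids paging on a brief dip; tune it to how quickly your workloads consume memory.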
Even if there's enough available memory, every instance has a maximum number of DB connections. If you reach it, further connections will be rejected, causing database errors in your application.
You can track this with the aws_rds_database_connections_average metric (which uses the DatabaseConnections CloudWatch metric).
Let's create an alert if the number of DB connections is greater than 1,000. Unfortunately, CloudWatch does not provide the maximum number of DB connections, so you'll need to specify it manually in the PromQL query.
aws_rds_database_connections_average > 1000
You can also create an alert in case the number of connections has increased significantly in the last hour. This can help you detect brute-force or DDoS attacks. In this example, you'll be notified if the number of connections has increased tenfold in the last hour.
aws_rds_database_connections_average / aws_rds_database_connections_average offset 1h > 10
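The `offset 1h` modifier compares each current sample with the sample from one hour earlier. The same check, written out in plain Python with made-up connection counts, looks like this (the function name and numbers are illustrative, not from the original article):

```python
# Mirror of the PromQL rule: current / one_hour_ago > 10.
# A ratio above the factor means connections grew suspiciously fast.

def connection_spike(current, one_hour_ago, factor=10):
    """Return True when connections grew more than `factor`x in an hour."""
    return current / one_hour_ago > factor

# Made-up sample values for illustration:
print(connection_spike(1200, 100))  # True: a 12x jump fires the alert
print(connection_spike(150, 100))   # False: normal growth stays quiet
```

Note that a ratio-based rule is noisy at very low baselines (going from 1 to 15 idle connections is a 15x "spike"), so you may want to combine it with an absolute threshold.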
Databases use CPU to run queries. If there are many concurrent, complex, or poorly optimized queries, CPU usage can reach the limit of the running instance. This will result in high response times and possibly some timeouts.
You can track CPU usage with the aws_rds_cpuutilization_average metric (which uses the CloudWatch CPUUtilization metric).
Let's create an alert if the average CPU usage of the instance is higher than 95%. Note that the CloudWatch CPUUtilization metric is reported as a percentage (0–100):

aws_rds_cpuutilization_average > 95
Storage is one of the most important parts of a database, since it's where the data is held. Running out of storage capacity will bring your database down.
Although setting up an auto-scaling strategy in AWS RDS is very easy, it could affect your infrastructure costs. That’s why you should be aware of the instance disk state.
You can monitor free disk space with the aws_rds_free_storage_space_average metric (which uses the FreeStorageSpace CloudWatch metric).
Let's create an alert if the available storage is lower than 512 MB.
aws_rds_free_storage_space_average < 512*1024*1024
Apart from this PromQL query, you can go further by traveling to the future. How? Using the predict_linear PromQL function to predict when you are going to run out of storage. You may remember this from when we used it to cook a ham.
This PromQL query will alert you if you’re going to run out of storage in the next 48 hours.
predict_linear(aws_rds_free_storage_space_average[48h], 48 * 3600) < 0
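Under the hood, `predict_linear()` fits a least-squares line through the samples in the range and extrapolates it forward. A minimal Python sketch of that idea, with invented free-storage samples shrinking by roughly 1 GiB per hour (the function and data are illustrative, not YACE or Prometheus code):

```python
# Minimal sketch of what PromQL's predict_linear() does: fit a least-squares
# line through (timestamp, value) samples and extrapolate t seconds ahead.

def predict_linear(samples, seconds_ahead):
    """samples: list of (unix_ts, value); returns the extrapolated value."""
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    slope = (
        sum((t - mean_t) * (v - mean_v) for t, v in samples)
        / sum((t - mean_t) ** 2 for t, _ in samples)
    )
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept

# Invented data: free storage dropping 1 GiB per hour over the last 4 hours.
gib = 1024 ** 3
now = 1_700_000_000
points = [(now - 3600 * (4 - i), (40 - i) * gib) for i in range(5)]

# Will we run out within the next 48 hours? (mirrors the alert above)
projected = predict_linear(points, 48 * 3600)
print(projected < 0)  # True: a steady 1 GiB/h decline empties 36 GiB in 36 h
```

Linear extrapolation assumes the recent trend continues, so sudden bulk loads or deletions will throw the prediction off; it's best treated as an early warning, not a precise forecast.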
If you want to dig deeper into PromQL functions, you can check our getting started PromQL CheatSheet.
When queries return a massive amount of data, the database needs to perform disk operations. Database disks normally have low read/write latency, but issues can arise that result in high-latency operations. Monitoring this lets you verify that disk latency stays as low as expected.
You can monitor this with the aws_rds_read_latency_average and aws_rds_write_latency_average metrics (which use the ReadLatency and WriteLatency CloudWatch metrics).
Let’s create alerts to notify when the read or write latency is over 250ms.
aws_rds_read_latency_average > 0.250
aws_rds_write_latency_average > 0.250
It doesn't matter whether the database is working correctly if it can't be reached from the outside. A misconfiguration or malicious activity from an attacker can result in losing connectivity to the instance.
You can monitor network traffic with the aws_rds_network_receive_throughput_average and aws_rds_network_transmit_throughput_average metrics (which use the NetworkReceiveThroughput and NetworkTransmitThroughput CloudWatch metrics).
Let’s create an alert if the network traffic is down.
aws_rds_network_receive_throughput_average == 0 and aws_rds_network_transmit_throughput_average == 0
The number of input/output operations per second (IOPS) available in an instance can be configured and is billed separately. Provisioning too few IOPS can hurt your application's performance, while provisioning more than you need drives up your infrastructure costs.
You can monitor this with the aws_rds_read_iops_average and aws_rds_write_iops_average metrics (which use the ReadIOPS and WriteIOPS CloudWatch metrics).
Let’s create alerts if the read or write IOPS are greater than 2,500 operations per second.
aws_rds_read_iops_average > 2500
aws_rds_write_iops_average > 2500
In this article, we've seen how easy it is to monitor AWS RDS and identified its top five key metrics (and a few more), with example alerts for each.
All these metrics are available in the dashboards you can download from PromCat. They can be used in Grafana and in Sysdig Monitor as well!
These key metrics will allow you to see the full picture when troubleshooting and making improvements to your AWS RDS instance.
If you would like to try this integration, we invite you to sign up for a free trial of Sysdig Monitor.