Zippy Wachira

Posted on Feb 9

AWS CSI - Investigating Cloud Conundrums (CloudWatch - Part 1)

#aws #beginners #cloud #monitoring

If you’re anything like me, you absolutely hate going to the doctor. Unfortunately (at least until we can make ourselves indestructible🤞), every so often you will find yourself in a doctor’s office. For the doctor to accurately diagnose your illness and prescribe the right treatment, they first need to collect a range of vitals: your temperature, blood pressure, heart rate, and so on. These vital signs provide crucial insights into your health, and tracking them over time helps the doctor identify patterns, detect issues early, and understand the overall state of your body.

Similarly, CloudWatch acts like your AWS environment’s diagnostic physician. It collects a comprehensive set of data points like system metrics (CPU usage, memory allocation, network latency) and logs (application errors, API calls, resource utilization) that serve as vital signs. By analyzing these metrics and logs, CloudWatch helps you diagnose the health of your application. An unexpected surge in CPU usage might point to inefficient code, while frequent errors in the logs could indicate configuration issues.

In this post, we will delve into CloudWatch metrics and explore how you can use them to understand the performance of your AWS services and detect potential issues. Whether it’s preventing a minor symptom from becoming a major outage or optimizing your resources for peak performance, CloudWatch is your go-to solution for maintaining the well-being of your cloud infrastructure.

Some Basics

Let’s start with a few important details about metrics:

- A metric is a quantitative measure of a system’s characteristic over time.
- The majority of AWS services publish a set of free metrics under basic monitoring. Detailed monitoring, where a service offers it, publishes metrics more frequently (for EC2, at 1-minute instead of 5-minute intervals) and, for some services, adds extra metrics. To monitor a parameter that the service does not publish at all, you can set up custom metrics.
- Metrics are collected as a set of time-ordered data points. The period over which data points are aggregated ranges from one second (for high-resolution metrics) to an hour or more. How long a data point is retained depends on the period at which it was published. See here.
- Metrics exist only in the Region in which they are created.
- Metrics are grouped into namespaces and dimensions. A namespace is a container for one service’s metrics, e.g., you can monitor the CPU Utilization of EC2 instances, RDS databases, ECS clusters, etc. A dimension narrows the view within a namespace, e.g., to view the CPU Utilization aggregated over all your RDS databases you’d look under ‘Across All Databases’, and to view it for a single database you’d use the ‘DBInstanceIdentifier’ dimension.
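To make the link between period and retention concrete, here is a minimal sketch of the retention tiers as I understand them from the CloudWatch documentation. The function name is hypothetical, and placing in-between periods (e.g., 120 seconds) in the tier below them is my own simplifying assumption.

```python
def retention_for_period(period_seconds: int) -> str:
    """Return how long CloudWatch keeps data points published at this period.

    Tiers: under 60 s -> 3 hours, 60 s -> 15 days,
    300 s -> 63 days, 3600 s -> 455 days (about 15 months).
    """
    if period_seconds < 1:
        raise ValueError("CloudWatch periods start at 1 second")
    if period_seconds < 60:
        return "3 hours"
    if period_seconds < 300:
        # Assumption: periods between documented tiers fall in the lower tier.
        return "15 days"
    if period_seconds < 3600:
        return "63 days"
    return "455 days"


for period in (1, 60, 300, 3600):
    print(period, "->", retention_for_period(period))
```

In practice this means a spike you can see at 1-second resolution today will only be visible as part of a coarser aggregate a few hours later.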

Understanding CloudWatch Metrics

The good news is that AWS maintains exhaustive documentation of the supported metrics for each service. Each metric is explained in detail, so it’s clear exactly what the metric measures.

For a list of services that publish their metrics to CloudWatch, see here.
To understand the specific metrics that are supported for a particular service, search for ‘Available Metrics for <service name>’, e.g., ‘Available Metrics for API Gateway’. For most services, this page will also list the namespaces and dimensions available for the service.

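E.g., (the numbers below refer to the annotated parts of a service’s metrics table in the documentation):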

1: This is the name of the metric, i.e., the characteristic that is being measured.

2: The description of the metric, i.e., what it is and what it measures. For some metrics, the description will also include other notable details, e.g., recommendations, when to use the metric, exceptions, etc.

3: The unit of a metric is the scale of measurement of that metric, e.g., among the RDS metrics, BurstBalance has the unit ‘Percent’. This tells you that BurstBalance (the share of General Purpose SSD (gp2) burst-bucket I/O credits still available) is measured as a percentage value. Units provide context and meaning to the raw numerical values you see, e.g., BurstBalance and CPUUtilization are both percentages, whereas CPUCreditBalance is a count of credits, so a value of 50 means something very different for each.

4: CloudWatch provides several statistics for a metric’s data points, e.g., sum, average, minimum, maximum, etc. See all available statistics here. Statistics are crucial to understanding a metric’s behaviour, e.g., the average helps to identify a baseline for the metric’s typical behaviour. Meaningful Statistics for a metric are the statistics that are considered the most useful for that metric.

5: For RDS, some metrics are only available for a specific database engine. The ‘Applies to’ column indicates the database engine for which the metric can be collected.
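To see why the choice of statistic matters, here is a small sketch that computes the common statistics over one period’s data points. The sample values are made up for illustration and `summarize` is a hypothetical helper, not a CloudWatch API.

```python
from statistics import mean


def summarize(values):
    """Compute the basic CloudWatch-style statistics for one period."""
    if not values:
        raise ValueError("need at least one data point")
    return {
        "SampleCount": len(values),
        "Sum": sum(values),
        "Average": mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
    }


# Made-up CPU utilization samples (percent) within a single period.
cpu_samples = [12.0, 15.0, 11.0, 92.0, 14.0]
stats = summarize(cpu_samples)
for name, value in stats.items():
    print(f"{name}: {value}")
```

Here the average (28.8) looks healthy while the maximum (92.0) reveals a short spike, which is why it is worth checking the statistics the documentation lists as meaningful for a metric rather than relying on the average alone.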

Graphing Metrics

Trying to understand what a set of data is telling you purely by looking at raw numbers can leave you feeling foggy. Visuals, on the other hand, are like a lightbulb moment, illuminating complex ideas in a clear and memorable way. On CloudWatch, you can use graphs to view metrics over a period of time.

Say, for example, you want to view the average write I/O operations on your EBS volume for a period. You can access the metric on the console as follows.

1: You can use the time filter to narrow your view to a specific period. The custom option allows you to specify a custom period, e.g., view metrics over 3 weeks.

2: The Actions/Options tabs allow you to customize your widget, i.e., specify how you want your data to be displayed. The Options tab provides more customization for your graph, e.g., labels to add to the axis, units, etc.

3: The Graphed Metrics tab allows you to customize the graph.

You can change the statistic being displayed, e.g., change from average to maximum or view a sum. You can also change the period, which alters the data points on the graph, e.g., to view the maximum value for each hour, you can filter as below:
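The same hourly-maximum view can also be requested outside the console. Below is a minimal sketch using boto3’s `get_metric_statistics`; it assumes boto3 is installed and credentials are configured, that the volume’s write operations are read from the `VolumeWriteOps` metric in the `AWS/EBS` namespace, and that the function names and volume ID are hypothetical placeholders.

```python
from datetime import datetime, timedelta, timezone


def build_hourly_max_request(volume_id, end_time, days=1):
    """Build get_metric_statistics arguments: hourly maximum of write ops."""
    return {
        "Namespace": "AWS/EBS",
        "MetricName": "VolumeWriteOps",
        "Dimensions": [{"Name": "VolumeId", "Value": volume_id}],
        "StartTime": end_time - timedelta(days=days),
        "EndTime": end_time,
        "Period": 3600,             # one data point per hour
        "Statistics": ["Maximum"],  # swap for "Average" or "Sum" as needed
    }


def fetch_hourly_max(volume_id):
    """Call CloudWatch and return the data points in time order."""
    import boto3  # third-party; imported here so the builder works without it

    cloudwatch = boto3.client("cloudwatch")
    request = build_hourly_max_request(volume_id, datetime.now(timezone.utc))
    response = cloudwatch.get_metric_statistics(**request)
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# Hypothetical volume ID, shown only to illustrate the request shape.
example = build_hourly_max_request(
    "vol-0123456789abcdef0", datetime(2024, 2, 9, tzinfo=timezone.utc)
)
print(example["Period"], example["Statistics"])
```

Changing `Period` and `Statistics` here has the same effect as changing the period and statistic columns on the Graphed Metrics tab.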

Examples

Scenario 1: Your users are reporting that your web application is responding slowly. You need to determine the cause of the high latency and resolve it quickly.

Resolution:

There are two main reasons for slow response times in an application.

- Resource limitations: The resources assigned to the compute infrastructure where the application is running are insufficient to sustain the load, e.g., using a small instance for a high-load application may result in CPU overload and memory bottlenecks. This can also occur if the database is overloaded.

- Application code issues: Poorly written code with logic flaws, e.g., code that does not properly release memory after use, can lead to memory depletion and slow performance.

To check if the lag is a result of resource constraints, we can examine the compute service’s CPU utilization and disk I/O. So far, we have looked at how to access and view different metrics for a service. The next big question is: how do you interpret CloudWatch data and derive meaningful insights from it?

i. CPU Utilization
As previously mentioned, there are various statistics available to you for each metric. For this case, to determine if CPU Utilization is the reason for latency, we need to look at the following 3 statistics over the given period:

- Average CPU Utilization: This statistic helps in understanding the general load on your instance over time.

- Maximum CPU Utilization: This statistic shows the peak CPU usage within a specified period. It is useful for identifying spikes that might correlate with periods of high latency.

- CPU Credit Balance (only for burstable instances): If you’re using burstable instances (e.g., T2 or T3 instances) in standard mode, running out of CPU credits causes the instance to be throttled to its baseline, which results in increased latency.

An important thing to remember here is the unit used to measure the metric, which can be found in the service’s official documentation. CPU Utilization is measured as a percentage (CPUCreditBalance, by contrast, is a count of credits); thus, the output would look something like the graphs below:

[Graphs: Average CPU Utilization, Maximum CPU Utilization, CPUCreditBalance]

From the above, we can see that there were three spikes in CPU Utilization, which also correspond to the times when the CPU credit balance was at its lowest.
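The kind of reading described above can be sketched in a few lines. The data points and both thresholds below are made up for illustration (they are not taken from the graphs), and `classify` is a hypothetical helper.

```python
# Made-up 5-minute data points: (time, maximum CPU %, CPUCreditBalance)
samples = [
    ("10:00", 35.0, 30.0),
    ("10:05", 96.0, 12.0),
    ("10:10", 98.0, 2.0),
    ("10:15", 41.0, 3.0),
]

CPU_SPIKE_PERCENT = 90.0   # hypothetical threshold for "spike"
LOW_CREDIT_BALANCE = 5.0   # hypothetical threshold for "credits nearly gone"


def classify(points):
    """Label each timestamp where CPU spikes and/or credits run low."""
    findings = {}
    for time, max_cpu, credits in points:
        spiking = max_cpu > CPU_SPIKE_PERCENT
        starved = credits < LOW_CREDIT_BALANCE
        if spiking and starved:
            findings[time] = "spike with credits nearly exhausted"
        elif spiking:
            findings[time] = "spike"
        elif starved:
            findings[time] = "credits low"
    return findings


findings = classify(samples)
for time, label in findings.items():
    print(time, label)
```

A timestamp labelled "spike with credits nearly exhausted" is the pattern to look for on a burstable instance: the CPU is being asked for more than the remaining credits can sustain.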

ii. Disk I/O

Disk I/O metrics reflect the performance and usage of your disk. Key Disk I/O metrics include:

- DiskReadOps: The number of read operations performed on the disk.
- DiskWriteOps: The number of write operations performed on the disk.
- DiskReadBytes: The amount of data read from the disk, in bytes.
- DiskWriteBytes: The amount of data written to the disk, in bytes.

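Note: the above metrics are only available for instance store volumes. If using EBS, you’d be looking at the EBSReadOps, EBSWriteOps, EBSReadBytes, and EBSWriteBytes instance metrics (published for Nitro-based instances), or at the per-volume VolumeReadOps, VolumeWriteOps, VolumeReadBytes, and VolumeWriteBytes metrics in the AWS/EBS namespace.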

Key statistics to measure include:

- Sum: For DiskReadOps and DiskWriteOps, the sum statistic helps you understand the total number of I/O operations over a period.

- Average: For DiskReadBytes and DiskWriteBytes, the average statistic provides insight into the average data throughput over a period.

Note: DiskReadOps and DiskWriteOps show the number of completed read and write operations, respectively, across all instance store volumes in a specified period. To obtain the average IOPS, divide the total number of operations by the length of the period in seconds, e.g., say you have a DiskReadOps sum of 100,000 over a period of 1 hour, then the read operations per second would be 100,000/3,600 ≈ 28.
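The IOPS arithmetic in the note can be written as a one-line helper. This is just a sketch of the division described above; `average_iops` is a hypothetical name.

```python
def average_iops(total_ops: float, period_seconds: float) -> float:
    """Average I/O operations per second over a period."""
    if period_seconds <= 0:
        raise ValueError("period must be positive")
    return total_ops / period_seconds


# The example from the note: 100,000 read operations summed over 1 hour.
read_iops = average_iops(100_000, 3600)
print(f"{read_iops:.1f} read ops/s")  # prints "27.8 read ops/s"
```

Remember to use the Sum statistic for the operation count and the same period length in the denominator, otherwise the result will be off by the ratio between the two.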

In most cases, it is not possible to troubleshoot an issue simply by examining a single metric. For example, in the scenario above, we can’t conclude that the application lag is due to a resource constraint simply by looking at the CPU Utilization, even if the spikes in utilization align with periods of latency.

To get the full picture, we need to analyse multiple metrics together. Let’s say users report slowdowns. Examining both CPU utilization and disk I/O during those periods can reveal if spikes or abnormal patterns in both metrics coincide with the latency. If you have the CloudWatch agent installed, you can also compare these against memory utilization metrics. This combined view strengthens the case for resource limitations being the root cause.
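One way to line up several metrics is to fetch them in a single `get_metric_data` call and look for timestamps where all of them are elevated. The sketch below assumes boto3 is installed and credentials are configured, uses instance store metrics in the `AWS/EC2` namespace, and ignores pagination; the helper names, instance ID, and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def build_queries(instance_id, period=300):
    """Build get_metric_data queries for CPU and disk writes on one instance."""
    def query(query_id, metric_name, stat):
        return {
            "Id": query_id,  # must start with a lowercase letter
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": period,
                "Stat": stat,
            },
        }

    return [
        query("cpu", "CPUUtilization", "Maximum"),
        query("writes", "DiskWriteOps", "Sum"),
    ]


def coinciding_spikes(results, thresholds):
    """Return sorted timestamps at which every metric exceeds its threshold."""
    spike_sets = []
    for result in results:
        limit = thresholds[result["Id"]]
        pairs = zip(result["Timestamps"], result["Values"])
        spike_sets.append({ts for ts, value in pairs if value > limit})
    return sorted(set.intersection(*spike_sets)) if spike_sets else []


def fetch(instance_id, hours=3):
    """Fetch both metrics for the last few hours (single page only)."""
    import boto3  # third-party; imported lazily so the helpers work without it

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_data(
        MetricDataQueries=build_queries(instance_id),
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
    )
    return response["MetricDataResults"]
```

For example, `coinciding_spikes(fetch("i-0123456789abcdef0"), {"cpu": 90.0, "writes": 5000.0})` would return the 5-minute windows where both CPU and disk writes were above the chosen limits, which you can then compare against the times users reported slowdowns.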

This article has provided a foundational understanding of CloudWatch metrics and logs. However, the vast capabilities of CloudWatch extend far beyond what we’ve covered here. In a future article, we’ll delve deeper into advanced techniques for leveraging CloudWatch logs and metrics to troubleshoot issues and ensure the optimal health of your AWS resources. Stay tuned!
