In Part 1 of this series, we looked at the basics of CloudWatch metrics and one example of how you can leverage CloudWatch metrics to troubleshoot performance issues on AWS. In this second piece, we’ll dive a little deeper and investigate a few more examples.
So, let’s dive in!
Scenario 2: You have a microservices-based application running on Amazon ECS (Elastic Container Service). Users have reported that the application becomes unresponsive after running for a few hours.
Background:
A memory leak is a type of resource leak that occurs when a program allocates memory but fails to release it back to the system once it is no longer needed. Over time, the program consumes more and more memory, leading to resource exhaustion. As memory becomes scarce, the application may slow down due to increased garbage collection activity or the need to swap memory to disk.
Memory leaks typically cause a gradual increase in memory usage. The application may start normally but degrade over time as memory is exhausted. If the application becomes unresponsive after a consistent period, it suggests a pattern where memory consumption reaches a critical threshold, causing the failure.
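The pattern described above — a steady climb rather than a one-off jump — can be checked numerically. As a minimal sketch (the sample values and thresholds below are purely illustrative, not AWS recommendations), fitting a trend line to hourly memory-utilization samples separates the two cases:

```python
# Sketch: distinguish a steady upward memory trend (possible leak) from a
# one-off spike, using hypothetical hourly MemoryUtilization samples (in %).

def slope(samples):
    """Least-squares slope of samples taken at equal intervals."""
    n = len(samples)
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

leaking = [40, 44, 49, 53, 58, 62, 67, 71]   # creeps upward every hour
spiking = [40, 41, 60, 42, 40, 41, 43, 40]   # one-time allocation spike

# A sustained positive slope over many hours points at a leak;
# a flat trend with one outlier points at a transient spike.
print(f"leaking trend: {slope(leaking):.1f} %/hour")
print(f"spiking trend: {slope(spiking):.1f} %/hour")
```

A leak shows a clearly positive slope (here roughly +4.5%/hour), while the spike averages out to a near-flat trend.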
Investigation:
Occasionally, a memory allocation spike can cause a one-time spike in the amount of memory being used by a resource in your AWS environment. For an allocation spike, restarting the service will temporarily resolve the issue. However, if the problem recurs, it could be an indication that the underlying issue is a memory leak rather than a one-time allocation spike.
In either case, you need to look at the MemoryUtilization metric, which shows the percentage of memory used by tasks in the dimension you specify (for example, a cluster or service). For statistics, look at the average and maximum utilization over the period of interest.
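As a sketch of how you might pull these statistics programmatically with boto3 (the cluster and service names are placeholders, and the `fetch` helper needs AWS credentials, so it is not invoked here):

```python
# Build a GetMetricStatistics request for ECS MemoryUtilization,
# asking for the Average and Maximum statistics discussed above.
from datetime import datetime, timedelta, timezone

def memory_utilization_query(cluster, service, hours=6):
    """Parameters for a MemoryUtilization query over the last `hours` hours."""
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ECS",
        "MetricName": "MemoryUtilization",
        "Dimensions": [
            {"Name": "ClusterName", "Value": cluster},
            {"Name": "ServiceName", "Value": service},
        ],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,                         # 5-minute buckets
        "Statistics": ["Average", "Maximum"],  # the statistics of interest
    }

def fetch(cluster, service):
    import boto3  # requires AWS credentials; not called in this sketch
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(**memory_utilization_query(cluster, service))
    return sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])

params = memory_utilization_query("my-cluster", "my-service")
print(params["Namespace"], params["MetricName"])
```

Plotting the returned datapoints over time makes the leak-versus-spike pattern easy to spot.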
Scenario 3: Your e-commerce website, hosted on Amazon EC2 instances behind an Application Load Balancer (ALB), is experiencing a sudden spike in traffic. Customers report slow loading times and intermittent outages. You suspect a Distributed Denial of Service (DDoS) attack.
Background:
A Distributed Denial of Service (DDoS) attack is a malicious attempt to disrupt the normal traffic of a targeted server, service, or network by overwhelming it with a flood of internet traffic. This flood typically originates from a network of compromised computers or devices, making it difficult to pinpoint and block the source. The sheer volume of illegitimate traffic can overload resources, making the website or service inaccessible to legitimate users. End users might encounter slow loading times, error messages, or complete outages.
In the context of AWS, a DDoS attack can target various services such as EC2 instances, load balancers, or even the application running on AWS infrastructure.
Investigation:
While a sudden spike in traffic can occur during legitimate events (e.g., sales or promotions), there are key patterns that can help identify a possible DDoS attack:
- Traffic Patterns: While legitimate spikes may follow a more gradual increase and decrease in traffic, a DDoS attack will typically involve a sudden and sustained surge in traffic, often exceeding normal peak usage patterns.
- Source of Traffic: The source of legitimate traffic can usually be traced back to a diverse set of users and locations. DDoS traffic on the other hand, might originate from a limited number of IP addresses or geographical locations, indicating a coordinated attack.
- Application Impact: DDoS attacks usually target specific web applications or services. Legitimate traffic spikes might affect overall website performance but wouldn’t target specific applications.
- Increased Error Rates: Along with high traffic, you may observe an increase in 4xx (client error) and 5xx (server error) HTTP status codes, indicating that the backend servers are overwhelmed and unable to process the requests.
Key metrics to monitor to investigate a possible DDoS attack include:
1. Number of Requests Received:
If your application is fronted by an Application Load Balancer, then you need to look at the RequestCount metric. The RequestCount metric shows the number of requests processed over IPv4 and IPv6. A sudden and unusual spike in request count is a primary indicator of a potential DDoS attack. For API Gateway, this would be the Count metric.
For the RequestCount metrics, the statistics of interest would be:
- Sum: the total number of requests over a period will help in understanding the overall traffic volume.
- Average: the average number of requests per second helps to identify spikes relative to normal traffic patterns.
- Maximum: the peak number of requests received in the given period is useful for identifying the highest load.
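A simple way to act on the Sum statistic above is to compare the latest period against a baseline built from earlier periods. This is a minimal sketch with hypothetical numbers; the 3x multiplier is an illustrative threshold, not an AWS recommendation:

```python
# Sketch: flag a RequestCount anomaly by comparing the latest per-period Sum
# against the average of the preceding periods.

def looks_like_traffic_surge(request_sums, multiplier=3.0):
    """request_sums: per-period RequestCount 'Sum' values, oldest first."""
    *baseline, latest = request_sums
    typical = sum(baseline) / len(baseline)
    return latest > multiplier * typical

normal_day   = [1200, 1350, 1280, 1400, 1330, 1500]  # gradual variation
under_attack = [1200, 1350, 1280, 1400, 1330, 9800]  # sudden surge

print(looks_like_traffic_surge(normal_day))    # a legitimate bump stays under 3x
print(looks_like_traffic_surge(under_attack))  # the surge blows past the threshold
```

In practice you would tune the multiplier against your own traffic history, and a CloudWatch anomaly detection alarm can automate this comparison for you.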
2. Network Traffic
For the instances hosting the application, you need to check the NetworkIn and NetworkOut metrics. If these also show a sharp increase, it may be indicative of a DDoS attack.
For network traffic metrics, we need to look at:
- Sum: the total amount of data transferred in and out, respectively, which helps quantify the scale of traffic.
- Average: the average data transfer rate, useful for comparing against baseline traffic levels.
- Maximum: the peak data transfer rate, which can indicate periods of intense activity typical of a DDoS attack.
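Since NetworkIn and NetworkOut are usually inspected together, one way to fetch both in a single call is the GetMetricData API. A sketch with boto3 (the instance ID is a placeholder, and `fetch` needs AWS credentials, so it is not invoked here):

```python
# Build paired GetMetricData queries for NetworkIn and NetworkOut on an
# EC2 instance, using the Sum statistic discussed above.
from datetime import datetime, timedelta, timezone

def network_traffic_queries(instance_id, stat="Sum"):
    def q(qid, metric):
        return {
            "Id": qid,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/EC2",
                    "MetricName": metric,
                    "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                },
                "Period": 300,
                "Stat": stat,  # swap in "Average" or "Maximum" as needed
            },
        }
    return [q("net_in", "NetworkIn"), q("net_out", "NetworkOut")]

def fetch(instance_id):
    import boto3  # requires AWS credentials; not called in this sketch
    cw = boto3.client("cloudwatch")
    now = datetime.now(timezone.utc)
    return cw.get_metric_data(
        MetricDataQueries=network_traffic_queries(instance_id),
        StartTime=now - timedelta(hours=3),
        EndTime=now,
    )

queries = network_traffic_queries("i-0123456789abcdef0")
print([q["MetricStat"]["Metric"]["MetricName"] for q in queries])
```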
3. HTTP Error Rates
An increase in HTTP error rates can indicate that your servers are struggling to handle the incoming requests. To monitor error rates, check the HTTPCode_ELB_4XX_Count and HTTPCode_ELB_5XX_Count metrics for your ALB, or 4XXError and 5XXError if using API Gateway.
For HTTP Error metrics, we need to look at:
- Sum: the total number of server and client errors over a period. A significant increase in server errors (5xx) can indicate that the backend is overwhelmed, while an increase in client errors (4xx) can result from a flood of malformed requests.
- Average: the average rate of errors, useful for comparing against normal error rates.
- Maximum: the peak error rate, which can indicate the most stressful/problematic periods.
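Raw error counts are easier to interpret as a rate relative to total requests, since an attack inflates both. A minimal sketch with hypothetical per-period numbers (the 10x threshold is illustrative):

```python
# Sketch: turn error counts into an error rate per period, then flag a period
# whose rate jumps far above the baseline of earlier periods.

def error_rate(errors, requests):
    return errors / requests if requests else 0.0

# Hypothetical (RequestCount Sum, 5xx Sum) pairs per 5-minute period;
# the final period shows an attack overwhelming the backend.
periods = [(1200, 6), (1350, 7), (1280, 5), (9800, 2100)]

rates = [error_rate(e, r) for r, e in periods]
baseline = sum(rates[:-1]) / (len(rates) - 1)
overwhelmed = rates[-1] > 10 * baseline  # illustrative threshold

print([f"{r:.1%}" for r in rates])
print("backend overwhelmed:", overwhelmed)
```

The first three periods sit around a 0.5% error rate; the last period jumps above 20%, which is exactly the "overwhelmed backend" signature described above.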
4. Target Response Time
The ALB’s TargetResponseTime metric shows the time elapsed, in seconds, from when the request leaves the load balancer until the target starts to send the response headers. Increased response times can signal that your application is under strain.
The key statistics to look at for this metric include:
- Average: the average response time, helping to identify trends in performance degradation.
- Maximum: the longest response time recorded, which can indicate extreme cases of backend strain.
- P95 or P99: percentile statistics show response times at the 95th or 99th percentile, useful for surfacing the slowest 5% or 1% of requests, which are often hit hardest during an attack.
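A sketch of a percentile query for this metric with boto3 (the LoadBalancer dimension value is a placeholder for the `app/...` suffix of an ALB ARN; `fetch` needs AWS credentials and is not invoked here). Note that GetMetricStatistics accepts either Statistics or ExtendedStatistics in a single call, not both, so Average and Maximum would need a separate request:

```python
# Build a p95/p99 TargetResponseTime query for an Application Load Balancer.
from datetime import datetime, timedelta, timezone

def response_time_query(lb_dimension, hours=3):
    now = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "TargetResponseTime",
        "Dimensions": [{"Name": "LoadBalancer", "Value": lb_dimension}],
        "StartTime": now - timedelta(hours=hours),
        "EndTime": now,
        "Period": 300,
        "ExtendedStatistics": ["p95", "p99"],  # percentiles go here, not in Statistics
    }

def fetch(lb_dimension):
    import boto3  # requires AWS credentials; not called in this sketch
    cw = boto3.client("cloudwatch")
    resp = cw.get_metric_statistics(**response_time_query(lb_dimension))
    # Percentile values come back under each datapoint's 'ExtendedStatistics' key
    return [(d["Timestamp"], d["ExtendedStatistics"]) for d in resp["Datapoints"]]

params = response_time_query("app/my-alb/50dc6c495c0c9188")
print(params["MetricName"], params["ExtendedStatistics"])
```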
CloudWatch Statistics
When it comes to trying to make sense of CloudWatch metrics, statistics can be a powerful ally. The Sum, Average, Minimum, and Maximum statistics are the most used. But there are other powerful statistics that you can leverage. For example:
- Percentiles
Percentiles help you understand the relative standing of a value in a dataset: how a particular value compares to the rest of the data. For example, imagine you are in a race with 100 participants. If you finish 5th, you are faster than 95 other runners, and only 4 runners are faster than you. Your performance sits at the 95th percentile (p95).
Similarly, in CloudWatch, p95 would mean that 95 percent of the data within the specified period is lower than this value, and 5 percent of the data is higher than this value. Let’s say, for example, that you’re monitoring the latency (response time) of your game servers using CloudWatch. You have checked the average latency for the application, and it is 50ms. Is this good? Is this bad? The average latency would not be able to show you the entire picture as there could be a significant variation in individual player experiences.
Let’s say instead that you filter the metric using the p90 statistic, which reflects the experience of the vast majority of players. If the p90 response time is 100 ms, this means that 90% of the requests were completed in 100 ms or less, and only 10% of the requests took longer than 100 ms. Similarly, if the p50 response time is 50 ms, it means that 50% of requests were completed in 50 ms or less.
Percentiles help you understand the typical performance and identify outliers. For example, while the average (mean) response time might be 50 ms, the p90 being 100 ms indicates that some requests take significantly longer.
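This gap between the mean and the tail is easy to demonstrate. A minimal sketch with hypothetical latencies, using a simple nearest-rank percentile:

```python
# Sketch: why the average hides tail latency. Most requests are fast, a few
# are very slow; the mean looks fine while the p90 exposes the slow tail.

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of data <= it."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # ceil(p/100 * n)
    return ordered[rank - 1]

# 85 fast requests at 30 ms plus 15 slow ones at 300 ms
latencies_ms = [30] * 85 + [300] * 15

mean = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean)                          # 70.5 ms — looks acceptable
print("p50 :", percentile(latencies_ms, 50))  # 30 ms — typical request is fast
print("p90 :", percentile(latencies_ms, 90))  # 300 ms — the slow tail appears
```

The mean of 70.5 ms suggests everything is fine, but the p90 of 300 ms reveals that a meaningful slice of requests is ten times slower than typical.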
To understand more about CloudWatch statistics, see the statistics definitions in the Amazon CloudWatch documentation.
In this article, we’ve explored several real-world scenarios where CloudWatch metrics empower you to investigate and troubleshoot performance issues within your AWS environment. But a crucial question remains: how do you identify the right metrics to look at for a specific issue?
Well, worry not, help is on the way! In our next blog post, we’ll delve into practical strategies and best practices to guide you in selecting the most relevant CloudWatch metrics for troubleshooting various performance concerns in your AWS infrastructure. Stay tuned!