While I was working on a production ERP system, slow database queries were not just a performance issue but a crisis that halted the business workflow. We were monitoring how long each query took, server CPU usage, and disk I/O, but finding the root cause of a problem sometimes took us hours. That was when I started questioning how critical the metrics we used to understand system health really were, and whether they were always sufficient.
There's a popular set of metrics that often comes up, especially when it comes to monitoring application performance and health: RED. So, do these RED metrics truly provide all the information we need in every scenario? Or do we sometimes need more, and sometimes different perspectives? In this post, I will delve into RED metrics and explain with concrete examples from my own experiences when they are a comprehensive solution and when they are not.
What are RED Metrics?
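RED is usually expanded as Rate, Errors, and Duration: how many requests a service receives per second, how many of them fail, and how long they take. In this post I look at the time dimension from two angles, latency and duration, alongside the error rate, and I come back to request rate later under the name throughput. This set of metrics is particularly useful for understanding the health and performance of a service, especially in service-oriented architectures or microservices. By tracking how long incoming requests take to be answered (Latency), what share of them result in errors (Error Rate), and how long the underlying operations take to complete (Duration), we can detect potential issues in the service early on.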
These metrics are a great starting point for understanding bottlenecks and performance degradations in a system. For example, when we see a sudden increase in the latency of requests to a service, we can look for the source of the problem within that service or its dependencies. Similarly, spikes in the error rate can indicate errors introduced by a recent deployment or infrastructure issues.
ℹ️ Core Components of RED Metrics
- Latency: The time it takes for a request to be processed. Usually measured in milliseconds (ms).
- Error Rate: The percentage of requests that result in an error out of the total number of requests.
- Duration: The total time spent to complete an operation. This is important for understanding particularly long-running operations, distinct from Latency.
The basic principle behind these metrics is to consider every incoming request to a service as an "operation." Each operation is either successful or unsuccessful. If successful, how long did it take? If unsuccessful, why did it fail? RED metrics guide us in seeking answers to these questions. However, it's important to remember that these metrics may not always be sufficient on their own.
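The "every request is an operation" idea can be sketched in a few lines. This is a minimal illustration, not a monitoring library: the `Request` record, the `summarize` function, and the sample numbers are all hypothetical. It also reports requests per second, since an error rate is only meaningful relative to the traffic it was measured on.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float  # time from receiving the request to sending the response
    ok: bool            # True if the operation succeeded


def summarize(requests, window_seconds):
    """Return request rate, error rate and mean duration for one time window."""
    total = len(requests)
    if total == 0:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "mean_duration_ms": 0.0}
    failed = sum(1 for r in requests if not r.ok)
    return {
        "rate_per_s": total / window_seconds,
        "error_rate": failed / total,
        "mean_duration_ms": sum(r.duration_ms for r in requests) / total,
    }


# Hypothetical sample: four requests observed in a two-second window.
sample = [Request(40, True), Request(60, True), Request(500, False), Request(80, True)]
summary = summarize(sample, window_seconds=2)
print(summary)
```

Note that the single failed request also happens to be the slowest one, which the mean alone would not reveal.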
Latency: Is Speed All That Matters?
Latency is the time elapsed from the beginning to the end of a request being processed. It indicates how quickly a service responds and is critical for user experience. For instance, delays in accessing a website can cause users to abandon it. Similarly, a background process taking too long can unnecessarily occupy system resources.
However, looking only at latency can be misleading. A service might have low latency, but this doesn't mean the operation is being performed correctly. For example, a database query might return very quickly, but if the data it returns is incomplete or incorrect, this low latency is actually masking a problem. I saw this situation in an incident on a large e-commerce site: the payment process was completing much faster than expected, but the stock update was failing in the background. The low latency was actually covering up an error.
💡 Latency Examples
- An API request returning in 50ms.
- A web page taking 200ms to load.
- A database operation completing in 10ms.
When measuring latency, it's important to examine percentiles like p95 and p99, rather than just the average value. This is because the average value can hide rare but very long-running requests. In my own projects, I've always considered sudden increases in p99 latency as a warning sign. This indicates that the system is experiencing performance issues for some users.
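To see why the average hides slow requests, here is a small sketch using the nearest-rank definition of a percentile. The latency values are made up for illustration, and real monitoring systems typically estimate percentiles from histograms rather than from raw samples.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical window: 98 fast requests and 2 very slow ones.
latencies_ms = [20] * 98 + [3000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p95_ms = percentile(latencies_ms, 95)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} p50={p50_ms} p95={p95_ms} p99={p99_ms}")
```

In this sample the mean stays under 80 ms and even p95 looks healthy, while p99 exposes the three-second requests that a small share of users actually experienced.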
Error Rate: How Should We Understand Errors?
Error Rate indicates the proportion of requests to a service that result in an error. This metric is vital for detecting and fixing errors in the system. A high error rate can indicate bugs introduced by a deployment, infrastructure issues, or the service's inability to withstand unexpected situations.
However, one must be careful when looking at the error rate. The definition of "error" can vary depending on the context. For example, should a malformed request from a user be counted as an error, or only a failure that shows the service itself has broken internally? It's important to make this distinction. In a production ERP system, it was more sensible to catch incorrect data entry by operators through input validation than to count it as an "error." This allowed us to see the system's real errors more clearly.
⚠️ Questions for Error Rate Analysis
- What are the error types? (HTTP 5xx, 4xx, custom application errors, etc.)
- Are the errors related to a specific request, or is it a general problem?
- How long have the errors been occurring?
- Are the errors affecting a specific user group or geography?
For example, a sudden increase in the rate of 500 Internal Server Error in a REST API typically indicates a serious server-side issue. In such cases, it's necessary to immediately examine the logs and find the source of the problem. In my own systems, I've tried to keep the error rate low by monitoring errors coming through journald and preventing brute-force attempts with tools like fail2ban.
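One way to keep the two kinds of failure apart is to report client errors and server errors as separate rates. The sketch below assumes HTTP status codes are the only signal available; the function names and the sample batch are hypothetical, and custom application errors would need their own classification.

```python
def classify(status_code):
    """Map an HTTP status code to 'ok', 'client_error' or 'server_error'."""
    if 500 <= status_code <= 599:
        return "server_error"
    if 400 <= status_code <= 499:
        return "client_error"
    return "ok"


def error_rates(status_codes):
    """Return separate client and server error rates for a batch of responses."""
    total = len(status_codes)
    if total == 0:
        return {"client_error_rate": 0.0, "server_error_rate": 0.0}
    counts = {"ok": 0, "client_error": 0, "server_error": 0}
    for code in status_codes:
        counts[classify(code)] += 1
    return {
        "client_error_rate": counts["client_error"] / total,
        "server_error_rate": counts["server_error"] / total,
    }


# Hypothetical batch: two invalid requests and one internal failure out of ten.
codes = [200, 200, 400, 422, 500, 200, 200, 200, 200, 200]
rates = error_rates(codes)
print(rates)
```

Alerting on the server error rate alone avoids paging someone because a user submitted a malformed form, while the client error rate remains available for spotting validation or integration problems.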
Duration: The True Cost of Operations
Duration refers to the total time spent to complete an operation. Unlike latency, Duration focuses more on the "completion" time of an operation. In some cases, while a request might appear to have "started" very quickly, the operation might take a long time to complete. This is particularly important when dealing with asynchronous operations, long-running database operations, or calls to external systems.
For instance, a user's profile update request might have been received quickly (low latency), but the synchronization of this update with other systems in the background might take a long time (high duration). This situation is frequently encountered in corporate software with complex workflows. In a supply chain integration, while approving a shipment might seem instantaneous, updating all relevant systems could take hours. If we only looked at latency, we might overlook these long-running background operations.
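The difference between the two measurements can be made explicit by recording three timestamps per operation instead of two. This is a sketch under assumed names: the `Job` record and its fields are hypothetical, and the numbers loosely mirror the shipment example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Job:
    received_at: float             # seconds, when the request arrived
    acknowledged_at: float         # when the caller received a response
    completed_at: Optional[float]  # when background work finished, None if still running


def latency_s(job):
    """Time the caller waited for a response."""
    return job.acknowledged_at - job.received_at


def duration_s(job):
    """Time until the whole operation finished, or None if it has not finished yet."""
    if job.completed_at is None:
        return None
    return job.completed_at - job.received_at


# Hypothetical shipment approval: answered in half a second, fully synced after two hours.
shipment = Job(received_at=0.0, acknowledged_at=0.5, completed_at=7200.0)
# Hypothetical job whose background work is still in progress.
pending = Job(received_at=10.0, acknowledged_at=10.25, completed_at=None)

print(latency_s(shipment), duration_s(shipment))
print(latency_s(pending), duration_s(pending))
```

Returning `None` for unfinished jobs is a deliberate choice here: a job that never completes should show up as missing a duration, not silently drop out of the statistics.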
🔥 The Importance of Duration
- Long-running operations occupy resources (CPU, memory, disk I/O) for an extended period.
- This can negatively impact the performance of other operations.
- It is important for understanding whether an operation has completed, especially in asynchronous processes.
Monitoring duration can reveal hidden bottlenecks in the system. An operation might appear to finish quickly, but it could actually be consuming resources in the background. In my own projects, by monitoring the duration of database operations, I've identified slow queries and missing indexes. The pg_stat_activity view in PostgreSQL was very helpful in this regard.
The Comprehensiveness of RED Metrics: When Are They Not Enough?
While RED metrics are a great starting point in many scenarios, they may not always be sufficient. Especially in complex systems or situations with specific business requirements, we need more detail. For example, just looking at the error rate doesn't tell us the cause of the error or which users are affected.
While working on a bank's internal platform, looking only at the error rate of requests was insufficient. We needed to know which module the error originated from, which API call triggered it, and which users encountered this error. Therefore, we integrated trace and log information in addition to RED metrics. This allowed us to find the root cause of the problem much faster.
ℹ️ Beyond RED Metrics
- Throughput: The number of requests processed in a given time period.
- Resource Utilization: How much of system resources like CPU, memory, disk I/O are being used.
- Queue Lengths: The number of pending jobs in asynchronous operations.
- Application-Specific Metrics: Metrics specific to the business logic (e.g., number of orders completed, number of invoices processed).
Furthermore, focusing solely on server-side metrics can also be misleading. To fully understand user experience, client-side metrics must also be considered. Performance issues experienced in mobile applications may not appear in server logs at all. In the Android spam blocking app I developed, the analysis I conducted based on user complaints showed that there was no server-side issue; the problem stemmed from the app itself performing intensive processing.
Alternative and Complementary Metrics
There are other metric sets and approaches we can use in addition to RED metrics. For example, USE (Utilization, Saturation, Errors) metrics are more focused on infrastructure and resource usage. These metrics can be used alongside RED to understand the overall health of the system.
- Utilization: How much of a resource is being used (e.g., CPU usage at 80%).
- Saturation: How strained a resource's capacity is (e.g., number of requests waiting in the disk queue).
- Errors: The number of errors related to a resource (e.g., network packet loss).
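As a rough illustration of the three USE dimensions, the sketch below summarizes periodic samples taken from a worker pool. The sample format, the `use_summary` function, and the pool of four workers are assumptions made for the example, not part of any particular monitoring tool.

```python
def use_summary(samples, capacity):
    """Summarize (busy_workers, queued_jobs, errors) samples for a pool of `capacity` workers."""
    if not samples:
        return {"utilization": 0.0, "saturation": 0.0, "errors": 0}
    n = len(samples)
    busy = sum(s[0] for s in samples)
    queued = sum(s[1] for s in samples)
    errors = sum(s[2] for s in samples)
    return {
        "utilization": busy / (capacity * n),  # average share of workers in use
        "saturation": queued / n,              # average number of jobs waiting
        "errors": errors,                      # total errors seen in the window
    }


# Hypothetical samples: the pool is full and queueing in the first half of the window.
samples = [(4, 3, 0), (4, 5, 1), (2, 0, 0), (2, 0, 0)]
pool = use_summary(samples, capacity=4)
print(pool)
```

An average utilization of 75% looks comfortable on its own; it is the non-zero saturation that shows work was waiting for part of the window.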
Similarly, for more detailed analysis, trace information is indispensable. In distributed systems, when a request passes through multiple services, distributed tracing tools (e.g., Jaeger, Zipkin) are used to understand how much time the request spent in each service. These tools make visible the complex flows that RED metrics cannot capture.
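The kind of question a trace answers can be sketched with plain tuples. This does not use the Jaeger or Zipkin APIs; the span format, the service names, and the timings are hypothetical, and the spans are assumed not to overlap (real traces have nested parent and child spans, which would need extra care to avoid double counting).

```python
from collections import defaultdict


def time_per_service(spans):
    """Sum the time spent in each service from (service, start_ms, end_ms) spans."""
    totals = defaultdict(int)
    for service, start_ms, end_ms in spans:
        totals[service] += end_ms - start_ms
    return dict(totals)


# Hypothetical trace of one request: 300 ms end to end.
spans = [
    ("gateway", 0, 5),
    ("orders", 5, 45),
    ("inventory", 45, 285),
    ("orders", 285, 300),
]

totals = time_per_service(spans)
slowest = max(totals, key=totals.get)
print(totals, slowest)
```

From the outside this request only shows a 300 ms duration at the gateway; the breakdown shows that most of that time was spent in a downstream dependency.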
💡 Complementary Metric Examples
- Throughput: Requests per second (RPS).
- Queue Depth: Number of items waiting in the processing queue.
- Resource Usage: CPU Load, Memory Usage, Disk I/O.
- Application-Specific: Cost per transaction, number of user steps completed.
In my own projects, especially with complex workflows, I used USE metrics in conjunction with RED to address infrastructural bottlenecks and performance issues more holistically. For instance, when choosing eviction policies for Redis OOM (Out Of Memory) scenarios, I considered the system's overall memory usage (USE) and operation queues, not just RED metrics.
When to Perform Comprehensive Monitoring?
Situations where RED metrics alone are insufficient often include:
- Complex Distributed Systems: In microservice architectures or systems with multiple dependencies, looking only at the RED metrics of a single service can be misleading.
- Business-Critical Applications: In places where uninterrupted business workflow is essential, such as ERP systems and financial platforms, technical metrics alone are not enough. The health of business processes must also be monitored.
- User Experience-Focused Applications: In mobile applications or web applications with intensive user interaction, client-side metrics are also important.
- Asynchronous Operations: For long-running or background operations, it's necessary to focus on the completion time of the operation, not just the request start time.
While developing an ERP for a manufacturing company, we didn't just look at the error rates of services; we also monitored whether shipments were completed on time and how accurately production plans were progressing. This demonstrated that metrics related to the business itself were as critical as technical metrics.
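A business-level metric of this kind can be as simple as the share of shipments delivered by the promised date. The `Shipment` record, the day numbering, and the sample data below are hypothetical and only meant to show the shape of such a metric.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Shipment:
    promised_day: int             # day number by which delivery was promised
    delivered_day: Optional[int]  # actual delivery day, None if not delivered yet


def on_time_rate(shipments):
    """Share of delivered shipments that arrived on or before the promised day."""
    delivered = [s for s in shipments if s.delivered_day is not None]
    if not delivered:
        return None
    on_time = sum(1 for s in delivered if s.delivered_day <= s.promised_day)
    return on_time / len(delivered)


# Hypothetical week: three on-time deliveries, one late, one still in transit.
week = [
    Shipment(promised_day=3, delivered_day=2),
    Shipment(promised_day=3, delivered_day=3),
    Shipment(promised_day=4, delivered_day=6),
    Shipment(promised_day=5, delivered_day=5),
    Shipment(promised_day=7, delivered_day=None),
]

rate = on_time_rate(week)
print(rate)
```

Every service involved could report a zero error rate while this number falls, which is why it is worth tracking next to the technical metrics.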
🔥 Situations Requiring Comprehensive Monitoring
- Architectures where multiple services depend on each other.
- Systems where downtime leads to significant financial or operational losses.
- Applications where user experience is directly tied to performance.
- Long-running and resource-intensive operations in the background.
In summary, RED metrics are an excellent starting point. However, developing additional monitoring strategies to complement these metrics based on your system's complexity, business requirements, and criticality level is vital for identifying and resolving real problems. In my own projects, I generally adopt an approach that uses RED metrics as a base and enriches it with USE metrics, distributed tracing, and business-specific metrics when necessary. This has allowed me to maintain the overall health of the system and take swift action in the face of unexpected issues.