
Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

RED and USE Metrics: Which is More Effective for System Monitoring?

When defining monitoring strategies, I always face a dilemma: Should I focus on the application level or the infrastructure level? The answer is usually "both," but understanding which one takes precedence when is the critical part. I've spent a lot of time thinking about this for my own systems and projects, made various mistakes, and eventually developed a specific approach.

Most of the time, when an issue arises, I compare these two metric groups—RED and USE metrics—to determine where to look first. One tells me how the user feels, while the other shows the state of the infrastructure. Although they complement each other, sometimes one can point to the source of the problem much faster than the other.

ℹ️ Why Are RED and USE Metrics Important?

In my experience, these two metric sets are among the most fundamental and practical approaches for understanding system health and performance in the monitoring world. Over the years, I've used many different monitoring tools and methodologies, but these two simple principles have always helped me reduce complexity and provide directly actionable data. Especially in production environments, where every second counts, these metrics save critical time.

RED Metrics: What's Happening at the Application Layer?

RED metrics are a framework designed to understand the state of a service or application from the user experience perspective. As the name suggests, it includes three core measurements: Rate, Errors, and Duration. I typically use these metrics for APIs, microservices, or any interface with which users directly interact.

  • Rate: The number of requests coming into the service per second or minute. For example, seeing how many requests per second an API receives is critical for understanding the application's load and demand. In the backend of a side project I developed, I quickly noticed instantaneous traffic spikes or unexpected bot traffic thanks to this metric.
  • Errors: The rate or number of failed requests. This includes situations like HTTP 5xx errors, application-level exceptions, or database connection errors. In a production ERP, seeing the error rate of a specific shipment API jump from 0.1% to 5% helped me realize that operations were nearing a halt. Such an increase usually requires rapid intervention.
  • Duration: The time it takes for a request to complete. It's typically monitored as average, median (p50), or upper percentiles of latency (p95, p99). This directly reflects the response time users expect. For me, the p99 latency value is very important as it shows the worst-case scenario of user experience. In a financial calculator project, when the p99 latency increased from 300ms to 1200ms, it was a clear indication that users were starting to feel the application slowing down.

RED metrics directly reflect how the application appears to the outside world, meaning its performance from the perspective of customers or other services. This is why I usually set them up as the primary alarm mechanism. When there's a problem, I first look for an answer to the question, "What are the users seeing?"

# Example of capturing RED metrics in Nginx access log format

log_format red_metrics '$remote_addr - $remote_user [$time_local] "$request" '
                       '$status $body_bytes_sent "$http_referer" '
                       '"$http_user_agent" "$http_x_forwarded_for" '
                       '$request_time $upstream_response_time';

# We can parse this log format with a Prometheus exporter and collect it as metrics.
# For example, $request_time gives the "Duration" value, while $status can capture "Errors".
# The number of log lines also forms the "Rate" metric.
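If you don't want to stand up a full exporter just to sanity-check things, the same log format can be mined with plain shell tools. Below is a rough sketch, assuming the standard field layout of this format (status in column 9, $request_time as the second-to-last field) and a log path that will differ on your setup; a proper exporter remains the better long-term answer.

LOG=/var/log/nginx/access.log   # assumed path, adjust for your setup

# Rate: total requests in the current log (divide by the time window for req/s)
wc -l < "$LOG"

# Errors: share of HTTP 5xx responses (status is usually the 9th field)
awk '{ total++; if ($9 ~ /^5/) err++ } END { printf "5xx ratio: %.2f%%\n", (err / total) * 100 }' "$LOG"

# Duration: average and max of $request_time (second-to-last field in this format)
awk '{ t = $(NF-1); sum += t; if (t > max) max = t } END { printf "avg: %.3fs  max: %.3fs\n", sum / NR, max }' "$LOG"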

USE Metrics: How Are Your Resources Performing?

USE metrics, on the other hand, are a framework used to evaluate the health and performance of infrastructure components. These metrics show how effectively the system's core resources (CPU, Memory, Disk, Network) are being utilized. I typically look at USE metrics to understand the internal state of servers, database servers, or network devices.

  • Utilization: The percentage indicating how busy a resource is. For example, CPU utilization at 80%, disk utilization at 95%. On a Linux server, CPU utilization can be easily seen with top or htop. For me, these percentages show how much the resource is being strained. When I saw CPU utilization consistently above 90% on the backend servers of an internal banking platform, I understood that there was a need to spin up a new instance or optimize existing resources.
  • Saturation: The degree to which a resource is unable to meet demand. For example, processes waiting on the CPU (load average), disk I/O queue depth, or packet drops on a network interface. Even if utilization looks low, saturation can be high, which means the resource cannot keep up with some of the demand placed on it. When disk I/O saturation on a service running on my own VPS climbed past 50%, the application slowed down and sometimes even froze: the disk I/O queues were growing and operations were left waiting.
  • Errors: Errors at the resource level. For example, CRC errors on a network interface, disk read/write errors, or OOM (Out Of Memory) kills due to memory pressure. Such errors usually indicate a fundamental problem in the infrastructure. When I saw kernel: Out of memory: Kill process X errors in the journald logs on one of my PostgreSQL servers, I understood that memory was insufficient and the system was terminating a process.

USE metrics are very powerful for identifying the source of an application's performance problems, i.e., hardware or operating system-level bottlenecks. Especially when I see a drop in RED metrics or an increase in errors for an application, my next step is to look at USE metrics to check if there's an infrastructure problem.

# Commands showing basic USE metrics on Linux
# CPU Utilization and Saturation (Load Average)
$ uptime
 09:14:03 up 2 days, 18:23,  1 user,  load average: 1.25, 1.10, 0.98

# Memory Utilization and Saturation (Swap usage, cache etc.)
$ free -h
              total        used        free      shared  buff/cache   available
Mem:          15Gi       5.2Gi       7.6Gi       1.0Gi       2.7Gi       9.0Gi
Swap:          2Gi        20Mi       1.9Gi

# Disk Utilization, Saturation (IOPS, throughput, await)
$ iostat -x 1 5
# Example output: Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  %util
#               sda              0.00     0.00    0.00    1.00     0.00     4.00     8.00     0.00    1.00    0.00    1.00   0.01

# Network Errors (drops, errors)
$ netstat -s | grep -i "drop\|error"
# Example output:    10492 outgoing packets dropped
#                    403 packet receive errors
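The commands above cover Utilization and Saturation; for the Errors part of USE, the kernel log is usually the quickest place to confirm things like the OOM kill on my PostgreSQL server. A small sketch:

# Memory Errors: look for OOM kills in the kernel ring buffer / journal
dmesg -T | grep -i "out of memory"
journalctl -k --since "24 hours ago" | grep -iE "out of memory|oom-killer"

# Disk Errors: recent I/O error messages
journalctl -k --since "24 hours ago" | grep -iE "i/o error|ata.*error"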

Examples from My Experience: When Did Which One Save the Day?

Throughout my career, I've repeatedly seen how RED and USE metrics interact and when one provides a solution faster than the other. Here are a few concrete examples:

Scenario 1: Delayed Shipment Reports in a Production ERP

In a manufacturing company's ERP, the completion time for critical shipment reports suddenly increased from 2 minutes to 10 minutes during the morning window between 08:00 and 09:00. Users were naturally complaining, and operations were disrupted. This was directly a RED Duration metric issue: the p95 duration for the reporting service had soared to unacceptable levels.

My first step was to examine the reporting service's logs and analyze the response times of the relevant API calls. I then checked the queries with EXPLAIN ANALYZE in PostgreSQL and quickly realized that an ORDER BY over a large temporary table, which had accumulated data over the previous days, was forcing an expensive sort. That long-running sort was in turn causing lock contention in the database.

Here, RED metrics (Duration) very clearly showed both the existence of the problem and its impact on users. If I had gone straight to CPU or disk utilization (USE metrics), I would have seen the CPU hovering around 50% and the disk I/O far from saturated. That's because the problem wasn't the CPU working too hard; the query was mostly sitting idle, waiting on a lock in the database. The solution was query optimization and an appropriate index. As a result, the p95 reporting time dropped back to 1.5 minutes.
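For reference, here is a minimal sketch of the kind of check and fix involved, with a hypothetical database and table name (erp / shipment_report_rows); the real schema and query were of course different.

# Inspect the plan of the slow reporting query
psql -d erp -c "EXPLAIN ANALYZE SELECT * FROM shipment_report_rows WHERE report_date = CURRENT_DATE ORDER BY created_at;"

# If the plan shows a sequential scan followed by an expensive sort, a matching
# index is the usual fix (CONCURRENTLY avoids blocking writes while it builds)
psql -d erp -c "CREATE INDEX CONCURRENTLY idx_report_rows_date_created ON shipment_report_rows (report_date, created_at);"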

💡 Query Locks and RED Metrics

When a query in PostgreSQL is locked or takes too long, the server's overall CPU utilization (USE) might not appear very high, but the latency of the affected service (RED) will skyrocket. This situation often occurs in cases like I/O waiting or lock contention. Here, RED metrics are much more effective for catching the problem directly through user experience.
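A quick way to confirm this pattern is to look for sessions that are waiting on locks rather than using CPU. A minimal example against pg_stat_activity (the database name is an assumption):

# Sessions currently blocked on a lock, longest waiters first
psql -d erp -c "SELECT pid, state, wait_event_type, wait_event,
                       now() - query_start AS waiting_for,
                       left(query, 60) AS query
                FROM pg_stat_activity
                WHERE wait_event_type = 'Lock'
                ORDER BY waiting_for DESC;"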

Scenario 2: Disk I/O Crisis on My Side Project's VPS

On a VPS I used for my side project, I occasionally experienced a general slowdown. Especially at certain hours at night, when cron jobs were running, the system would take seconds to respond. While this indicated a general slowdown in RED metrics, it wasn't clear which service was affected. There was only a general complaint like "everything is slow."

At this point, I turned to USE metrics. When I looked at my VPS's disk I/O usage, I saw that disk utilization was approaching 100% at specific hours, and disk I/O saturation was very high. In the iostat output, the avgqu-sz (average queue size) values were reaching 20-30, which meant that disk operations were seriously waiting.

# iostat output (example)
# Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  %util
# sda              0.00     0.00    1.00   200.00  100.00   8000.00   40.00    25.00   125.00   0.00  125.00  99.00

In the iostat output above, a %util value of 99.00 indicates that the disk is almost completely busy, and an avgqu-sz value of 25.00 indicates that the disk I/O queue is very deep. This caused all other operations that wanted to read from or write to the disk to wait.

The source of the problem was a data collection and processing script that ran every night and performed intensive disk read/write operations. By optimizing this script (for example, by improving database queries to do less disk I/O or by processing data in memory and writing it in batches), I significantly reduced disk I/O saturation. As a result, the overall system response time returned to normal. Here, USE metrics (Disk Saturation) clearly revealed the underlying infrastructure bottleneck. While RED metrics only said "there's a slowdown," USE metrics answered the question "why is there a slowdown?" I had examined these types of issues in more detail in my post [related: my disk optimization experiences on VPS].
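If you're wondering how to pin disk saturation on a specific process, the sketch below shows the tools I usually reach for; pidstat comes from the sysstat package, iotop needs root, and the script path is a made-up placeholder.

# Per-process disk read/write rates, sampled every 5 seconds
pidstat -d 5

# Or interactively, showing only processes that are actually doing I/O
sudo iotop -o

# Quick mitigation until the script itself is optimized: run it with the lowest
# I/O and CPU priority (path is hypothetical)
ionice -c 3 nice -n 19 /opt/scripts/nightly_collector.sh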

Combined Use and Integration: Merging the Two Approaches

In reality, RED and USE metrics are not mutually exclusive but complementary approaches. The model I use in my production systems typically involves using RED metrics as the primary alarm trigger, then, once a problem is detected, diving into USE metrics to find the root cause.

For example, when an API's error rate (RED Errors) jumps from 0.5% to 5%, I first look at the RED metrics dashboard. After identifying the type of error (HTTP 500, 401, etc.) and which API endpoint it occurred on, I switch to the USE metrics of the servers or microservices hosting this API.

  1. RED Alarm: The 500 Internal Server Error rate for the API POST /orders endpoint suddenly jumped to 10%. (Alarm: RED Errors)
  2. Initial Investigation (RED): Looking at the error logs, I see that the database connection pool is exhausted.
  3. In-depth Investigation (USE): I look at the USE metrics of the relevant database server.
    • CPU Utilization: Has CPU usage, normally 30-40%, risen to 80%?
    • Memory Utilization/Saturation: Has memory usage increased, is swap space being used, are there OOM kills?
    • Network Saturation/Errors: Are there packet drops or errors on the network interface?
    • Disk Saturation: Are disk I/O queues elongated, especially on database servers?

In this example, the increase in the error rate might be due to CPU saturation on the database server. Or perhaps the database server itself is fine, but the application server is unable to manage database connections correctly due to insufficient memory. In both cases, while RED metrics tell me "there's a problem," USE metrics help me find answers to "where and why is the problem?"
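As a small illustration of step 2, this is roughly how I verify a "connection pool is exhausted" suspicion on the database side (the database name is assumed):

# How many connections are there, and in what state?
psql -d orders -c "SELECT state, count(*) FROM pg_stat_activity GROUP BY state ORDER BY count(*) DESC;"

# Compare against the configured ceiling
psql -d orders -c "SHOW max_connections;"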

⚠️ The Importance of Correlation

When interpreting RED and USE metrics together, correlation is very important. High CPU utilization (USE) is not always a bad thing; perhaps your application is running at full capacity and performing wonderfully (RED). The key is to understand the relationship between these metrics and your application's expectations. Sudden changes or exceeding expected thresholds usually indicate a problem.

My dashboards typically include a summary RED panel and detailed RED and USE panels specific to each service. For example, when using Prometheus and Grafana, I collect these metrics with PromQL queries like this:

# Example PromQL queries
# RED - Rate:
sum(rate(http_requests_total{job="my_service"}[5m])) by (endpoint)

# RED - Errors: (HTTP 5xx errors)
sum(rate(http_requests_total{job="my_service", status=~"5..", endpoint="/api/v1/data"}[5m]))
/
sum(rate(http_requests_total{job="my_service", endpoint="/api/v1/data"}[5m])) * 100

# RED - Duration (p95):
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="my_service"}[5m])) by (le, endpoint))

# USE - CPU Utilization (from node_exporter):
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# USE - Disk Utilization (busy time, roughly iostat %util - from node_exporter):
avg by (instance) (rate(node_disk_io_time_seconds_total[5m])) * 100

# USE - Disk Saturation (weighted I/O time, grows with queue depth - from node_exporter):
avg by (instance) (rate(node_disk_io_time_weighted_seconds_total[5m]))
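When I don't want to open Grafana, the same queries can also be fired at the Prometheus HTTP API from a terminal, which is handy for quick ad-hoc correlation; the Prometheus address below is just an assumption about your setup.

# RED - Duration (p95) for the service
curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{job="my_service"}[5m])) by (le, endpoint))'

# USE - CPU utilization of the underlying hosts
curl -sG 'http://prometheus.internal:9090/api/v1/query' \
  --data-urlencode 'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)'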

In this way, I can monitor both the outward-facing aspect of the application and the internal workings of the infrastructure simultaneously.

Conclusion: My Preference and Future Approach

In conclusion, there isn't a single "superior" side between RED and USE metrics. For me, both are indispensable. However, in the event of a problem, I tend to look at RED metrics first because they directly reflect the user experience. In a production environment, the priority is always to resolve the end-user's problem as quickly as possible. RED metrics allow me to immediately see how serious the problem is and who it affects.

Then, USE metrics come into play to delve into the root cause of the problem. By showing infrastructure-level bottlenecks, resource saturations, or errors, they pinpoint exactly where I need to intervene. These two metric sets, much like a detective following clues, guide me to the source of the problem. I had explained this process in more detail in another post I wrote on [related: my root cause analysis approaches].

In the future, I believe AI-powered anomaly detection and predictive analytics will play a greater role in the collection and interpretation of these metrics. Especially with low-latency inferencing engines like Groq and Cerebras and models like Gemini Flash, it will be possible to detect real-time anomalies in these metric streams and predict potential problems before they even arise. This will make the job of us operators and system administrators a bit easier, but the fundamental RED and USE principles will continue to form the underlying logic of these AI systems. My clear position is that these two metric sets will always go hand in hand, but RED will be one step ahead as a starting point.
