Observability: Metrics or Logs, Which Is Truly Enough?

#life #observability #monitoring #debugging

We rely on two fundamental tools to understand the health of our systems and solve problems: metrics and logs. Which one matters more, and whether either is sufficient on its own, is a long-standing debate. In my field experience, the two complement each other, and misusing them can have serious consequences. In this post, we will look at the strengths, limitations, and best use cases of metrics and logs.

The first thing that comes to mind when I think about this topic is a "sneaky" performance problem we experienced last year. Feedback from our users indicated that some operations were taking longer than expected. However, the metrics on our main dashboard (CPU usage, memory usage, network traffic) all appeared normal. It was at this point that I realized relying solely on metrics was insufficient. We had to dive deep into the logs to get to the root of the problem.

Metrics: A Broad Perspective and Quick Diagnosis

Metrics are a great way to monitor the overall health and performance of our systems. Being numerical data, they are very valuable for tracking trends, detecting anomalies, and identifying performance bottlenecks. Values like CPU usage, memory usage, disk I/O, and network packet loss are like the "vital signs" of our system. Collecting and visualizing this data provides a broad perspective.

For example, while working on a production ERP system, we would monitor the number of requests per server and response times. If the number of requests suddenly spiked and response times increased, it indicated a serious problem. Such metrics allowed us to quickly detect the existence of a problem. We would collect these metrics with tools like Prometheus and visualize them with Grafana. By setting up an alerting system, we would be instantly notified when certain thresholds were crossed. This was critical for proactive intervention.
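The threshold idea described above can be sketched in a few lines. This is only an illustration in plain Python, not the Prometheus or Grafana setup mentioned in the text; the names `summarize_window` and `check_thresholds` and the default limits (1000 requests, 500 ms) are assumptions made for the example.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class WindowStats:
    request_count: int
    p95_latency_ms: float


def summarize_window(latencies_ms):
    """Reduce the request latencies seen in one time window to two metrics."""
    if not latencies_ms:
        return WindowStats(0, 0.0)
    if len(latencies_ms) == 1:
        return WindowStats(1, latencies_ms[0])
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    return WindowStats(len(latencies_ms), quantiles(latencies_ms, n=20)[-1])


def check_thresholds(stats, max_requests=1000, max_p95_ms=500.0):
    """Return a list of alert messages; an empty list means nothing fired."""
    alerts = []
    if stats.request_count > max_requests:
        alerts.append(f"request count {stats.request_count} exceeds {max_requests}")
    if stats.p95_latency_ms > max_p95_ms:
        alerts.append(
            f"p95 latency {stats.p95_latency_ms:.0f} ms exceeds {max_p95_ms:.0f} ms"
        )
    return alerts


# Hypothetical windows: a calm one and one with a traffic spike plus slow responses.
calm = summarize_window([100.0] * 50)
busy = summarize_window([900.0] * 2000)
print(check_thresholds(calm))
print(check_thresholds(busy))
```

In a real deployment the aggregation and the alert rules would normally live in the monitoring system rather than in application code; the sketch only shows the kind of condition such a rule evaluates.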

ℹ️ The Power of Metrics

Metrics are indispensable for seeing the general trends and the big picture in a system. They allow us to quickly say "something is going wrong." They are particularly effective in identifying performance bottlenecks and resource exhaustion.

However, metrics don't tell us why the problem is occurring. They only indicate that something is not right. Seeing CPU usage at 90% on a web server indicates a problem, but it doesn't tell us if the source of this problem is a heavy query, a memory leak, or simply increased traffic. This is where logs come into play.

Logs: A Detailed Excavation to the Root Cause of Problems

Logs are textual records that describe what our systems did, when they did it, and why they did it. They contain the details behind an event, error messages, transaction steps, and their outputs. While metrics answer the question "what happened," logs seek to answer the question "why did it happen." Logs are invaluable, especially in debugging and incident analysis processes.

In an enterprise software project, we received a ticket about the application crashing when a user clicked a specific button. There were no anomalies in the metrics. However, when we examined the detailed logs for that user, we found a timeout error on the database connection. The error was triggered by a specific user profile combined with a particular data set. The "SQL query execution timeout exceeded" message in the logs clearly revealed the source of the problem.

💡 Detailed Analysis of Logs

Logs contain all the details of an event. They provide information about error messages, exceptions, transaction flows, and the interactions between system components. These details are vital for finding the root cause of complex problems.

To use logs effectively, a good logging strategy needs to be established. What information will be logged, how log levels (DEBUG, INFO, WARN, ERROR, FATAL) will be used, and how logs will be centralized (e.g., with ELK Stack or Loki) are important. In my own systems, I always make sure to keep detailed error logs, especially for critical services. For instance, in case of an error in a background worker, I create a log entry that includes the ID of the relevant job, its parameters, and the error details.
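As a rough sketch of the background worker case, the following uses Python's standard `logging` module to record the job ID, its parameters, and the error details in a single entry. The function names `run_job` and `divide`, the logger name, and the message format are hypothetical; the in-memory stream is used only so the example is self-contained.

```python
import io
import json
import logging

logger = logging.getLogger("worker")


def run_job(job_id, params, task):
    """Run a task and log enough context to debug it if it fails."""
    try:
        task(**params)
        logger.info("job finished job_id=%s", job_id)
        return True
    except Exception:
        # logger.exception logs at ERROR level and appends the traceback.
        logger.exception(
            "job failed job_id=%s params=%s",
            job_id,
            json.dumps(params, sort_keys=True),
        )
        return False


def divide(a, b):
    """Hypothetical task that fails when b is zero."""
    return a / b


# Write to an in-memory stream; a real service would use a file or log shipper.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

ok = run_job("job-42", {"a": 1, "b": 0}, divide)
print(stream.getvalue())
```

Serializing the parameters as JSON keeps the entry easy to parse later, but it assumes the parameters contain nothing sensitive; anything like passwords or tokens should be filtered out first.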

The Dance Between Metrics and Logs: The Power of Correlation

The real power emerges when we can establish a correlation between metrics and logs. When you see an unexpected increase or decrease in a metric, you can immediately understand what's happening by looking at the logs from that specific time frame. Or, when an error is logged, you can assess the impact of the problem by examining how the metrics behaved during the same time frame. This forms the basis of "Observability."
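A minimal sketch of that workflow, assuming both metrics and logs carry timestamps from synchronized clocks: find the moments where a metric crosses a threshold, then pull the log entries recorded around each of them. The sample data, the 90% threshold, and the one-minute window are made up for illustration.

```python
from datetime import datetime, timedelta


def find_spikes(samples, threshold):
    """Return the timestamps whose metric value exceeds the threshold."""
    return [ts for ts, value in samples if value > threshold]


def logs_near(log_entries, moment, window=timedelta(minutes=1)):
    """Return log messages recorded within +/- window of the given moment."""
    return [message for ts, message in log_entries if abs(ts - moment) <= window]


# Hypothetical CPU samples (percent) and log entries.
base = datetime(2024, 1, 1, 14, 0, 0)
cpu_samples = [
    (base, 35.0),
    (base + timedelta(minutes=5), 92.0),
    (base + timedelta(minutes=10), 40.0),
]
log_entries = [
    (base + timedelta(seconds=30), "INFO request served in 80 ms"),
    (base + timedelta(minutes=4, seconds=40), "WARN slow query took 4200 ms"),
    (base + timedelta(minutes=5, seconds=20), "ERROR worker pool exhausted"),
    (base + timedelta(minutes=9), "INFO request served in 75 ms"),
]

for spike in find_spikes(cpu_samples, threshold=90.0):
    print(spike.isoformat(), logs_near(log_entries, spike))
```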

A few years ago, I started experiencing performance issues in the backend of a financial calculator application I developed. Metrics indicated that CPU usage was maxing out at certain hours. However, I couldn't clearly see how many requests were coming in or which queries were running. By combining Prometheus metrics, FastAPI logs, and data from tools like PostgreSQL's pg_stat_statements, I found the source of the problem.

The issue was in a function that performed complex calculations for a specific financial instrument. This function contained numerous nested loops and database queries. With increased traffic, this function was causing excessive CPU usage. Logs showed which queries were running slowly, and metrics showed when CPU was being strained. By combining these two data sets, I diagnosed the problem with over 90% certainty and optimized the relevant function.

⚠️ Failure to Correlate Can Be Dangerous

Failing to establish a connection between metrics and logs can significantly slow down the troubleshooting process. You might have to spend hours sifting through log files to understand the cause of an anomaly in a metric. Or conversely, when you see an error log, you may not be able to understand how this error is affecting the overall performance of the system.

One of the most effective ways to achieve this correlation is to add tags to your logs that can be associated with metrics. For example, you can add a unique request ID (trace ID), user ID, or transaction ID to each log entry. This way, you can easily group all logs related to a specific request and the metrics associated with that request.
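One way to attach a request ID to every log entry in Python is a `contextvars.ContextVar` combined with a `logging.Filter`. The sketch below is one possible arrangement, not a prescribed one; `request_id_var`, `RequestIdFilter`, `build_logger`, and `handle_request` are names invented for the example.

```python
import contextvars
import io
import logging
import uuid

# Holds the ID of the request currently being handled; "-" means "no request".
request_id_var = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto every log record."""

    def filter(self, record):
        record.request_id = request_id_var.get()
        return True


def build_logger(stream):
    logger = logging.getLogger("app.requests")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s request_id=%(request_id)s %(message)s")
    )
    handler.addFilter(RequestIdFilter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, request_id=None):
    """Hypothetical request handler: every log line inside shares one ID."""
    rid = request_id or uuid.uuid4().hex
    token = request_id_var.set(rid)
    try:
        logger.info("request started")
        logger.info("request finished")
    finally:
        request_id_var.reset(token)
    return rid


stream = io.StringIO()
app_logger = build_logger(stream)
handle_request(app_logger, "abc123")
print(stream.getvalue())
```

Because the ID lives in a context variable, code deeper in the call stack can log without having the ID passed to it explicitly. Using the same ID as a label or exemplar on the metrics side is what makes the grouping described above possible.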

In Which Situation is Which More Effective?

Both tools have their unique strengths. Understanding when each is more effective increases our efficiency.

When are Metrics More Effective?

General Performance Monitoring: Metrics are the best choice for continuously monitoring the overall health and performance of your system.
Trend Analysis: Used for seeing trends over time, capacity planning, and predicting future problems.
Anomaly Detection: Ideal for quickly detecting unexpected sudden changes (e.g., traffic spikes, increased request latency).
Alerting: Used for receiving notifications when specific thresholds are exceeded.

When are Logs More Effective?

Debugging: Logs are indispensable for understanding why an error occurred and how it happened.
Incident Analysis: Used for understanding exactly when an incident started, what steps it involved, and how it was resolved.
Auditing: Used for recording who did what and when, and for security audits.
Monitoring Business Workflows: Used for tracking each step of complex business workflows and identifying bottlenecks.

For example, I had developed a spam blocker on the Android side of a mobile application. Users reported that the application was occasionally slow. The metrics (CPU, memory) were within normal limits. However, when I examined the application's logs, I saw unnecessary and repetitive network requests made by a background service. These requests, though brief, were consuming system resources and negatively impacting the user experience. The logs showed exactly which service and which function the problem originated from.

Trade-offs and Costs: Things Not to Be Ignored

Both metric collection and logging add overhead to our systems. It's important to manage this overhead and optimize costs.

Costs of Metric Collection:

Data Storage: Storing long-term metric data can require significant disk space.
Network Traffic: Moving metrics from where they are collected to where they are analyzed creates additional network traffic.
Processing Load: Collecting and processing metrics (e.g., aggregations) uses CPU and memory.

Costs of Logging:

Data Storage: Logs are typically larger in volume than metrics, and long-term storage costs are high.
Network Traffic: Sending logs to centralized logging systems can create significant network traffic.
Processing Load: Processing, parsing, and indexing logs requires CPU and memory.
Development Cost: Designing and implementing a good logging strategy requires additional development time.

On my own VPS, I set up both metric and log collection infrastructure for custom financial calculators. Initially, I logged everything in excessive detail and collected as many metrics as possible. However, after a few months, I found that the storage costs and the load on the system were higher than expected. In particular, journald was logging in too much detail and quickly consumed disk space.

🔥 The Risk of Excessive Logging and Metric Collection

While it's tempting to log everything or collect every piece of metric data, this increases costs and makes it harder to find the information you're looking for. The saying "getting lost in the log forest" perfectly describes this situation. To use your resources effectively, you first need to decide what you actually need.

After this experience, I adjusted the logging levels. I started keeping only critical ERROR and FATAL level logs for a long time, while keeping INFO and DEBUG level logs for shorter periods. Similarly, I optimized storage costs by keeping only detailed data from the last week for the metrics I collected, and storing longer-term data at a lower resolution. This allowed me to analyze detailed data from the last week in Grafana while still providing enough information for monthly trends.
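The two policies described here can be sketched as follows. Only the general shape comes from the text (errors kept longer than INFO and DEBUG, one week of detailed metrics, older data at lower resolution); the concrete retention periods and the choice of hourly averages are assumptions for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean

# Assumed retention periods; only the ordering (errors kept longest) is from the text.
LOG_RETENTION = {
    "DEBUG": timedelta(days=3),
    "INFO": timedelta(days=7),
    "ERROR": timedelta(days=365),
    "FATAL": timedelta(days=365),
}


def keep_log(level, logged_at, now):
    """Return True while a log entry is still inside its retention period."""
    return now - logged_at <= LOG_RETENTION[level]


def downsample_hourly(samples):
    """Replace raw (timestamp, value) samples with one average per hour."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts.replace(minute=0, second=0, microsecond=0)].append(value)
    return [(hour, mean(values)) for hour, values in sorted(buckets.items())]


def compact(samples, now, detailed_for=timedelta(days=7)):
    """Keep recent samples at full resolution and downsample everything older."""
    cutoff = now - detailed_for
    recent = [(ts, value) for ts, value in samples if ts >= cutoff]
    old = [(ts, value) for ts, value in samples if ts < cutoff]
    return downsample_hourly(old) + recent


# Hypothetical samples: two old readings in the same hour and one recent reading.
now = datetime(2024, 3, 1)
samples = [
    (datetime(2024, 2, 1, 10, 0), 1.0),
    (datetime(2024, 2, 1, 10, 30), 3.0),
    (datetime(2024, 2, 29, 12, 15), 5.0),
]
print(compact(samples, now))
```

Averaging is only one way to downsample; for latency-style metrics, keeping the maximum or a percentile per bucket may preserve more of what matters for monthly trends.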

Observability: Not Just Metrics and Logs, but Traces Too!

While metrics and logs are the fundamental building blocks, modern Observability approaches typically include a third component: Distributed Tracing. Tracing allows you to track a request end to end as it travels across the different services in your system. For example, a request might start at the API gateway, travel through several microservices, and finally reach the database. Tracing records each step of this journey and shows how much time each service spent.

Especially in microservice architectures, completing a request may require the interaction of multiple services. In such cases, looking only at the metrics or logs of a single service may not be enough. Tracing clearly reveals which service or services are causing the problem and where the latency is occurring.

In one project, we were working on a microservice architecture that processed user orders. The order processing time sometimes increased unexpectedly. Metrics showed that the overall system load was normal. Logs revealed no serious errors in individual services. However, by using a tracing tool like Jaeger to track the request's journey, we found a significant delay in communication between the order service and the inventory service. The problem stemmed from a slow database query in the inventory service. Tracing allowed us to find this issue.
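To show why per-step timing is so revealing, here is a small sketch that computes each span's "self time", meaning its duration minus the time spent in its direct children. The `Span` structure, the operation names, and the durations are hypothetical and far simpler than what a tool like Jaeger records, and the calculation assumes child spans run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    service: str
    operation: str
    duration_ms: float


def self_times(spans):
    """Map span_id to the time spent in that span excluding direct children."""
    child_total = {}
    for span in spans:
        if span.parent_id is not None:
            child_total[span.parent_id] = (
                child_total.get(span.parent_id, 0.0) + span.duration_ms
            )
    return {
        span.span_id: span.duration_ms - child_total.get(span.span_id, 0.0)
        for span in spans
    }


# Hypothetical trace of one slow order request.
spans = [
    Span("a", None, "order-service", "POST /orders", 950.0),
    Span("b", "a", "inventory-service", "reserve_stock", 900.0),
    Span("c", "b", "inventory-service", "SELECT stock", 870.0),
]

own = self_times(spans)
slowest = max(spans, key=lambda span: own[span.span_id])
print(slowest.service, slowest.operation, own[slowest.span_id])
```

The order service looks slow when judged by total duration, but almost all of that time is spent waiting; the self times point at the database query inside the inventory service instead.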

ℹ️ The Role of Tracing

Distributed Tracing is a critical tool for troubleshooting and performance optimization, especially in complex, distributed systems. It makes it easier to identify bottlenecks and errors by visualizing the entire journey of a request through the system.

Implementing tracing usually requires some code changes. Trace context needs to be propagated during inter-service communication (e.g., in HTTP requests), and the relevant tracing library needs to be used in each service. Although this is an additional development overhead, the visibility it provides, especially in microservice environments, is invaluable.
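As an illustration of context propagation over HTTP, the sketch below builds and forwards a `traceparent` header in the format defined by the W3C Trace Context specification (version, trace ID, parent span ID, flags). It is a hand-rolled simplification to show the idea; in practice a tracing library would handle this, and the function names here are made up. It also assumes header names have already been lowercased.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent():
    """Start a new trace with a random trace ID and span ID, marked as sampled."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def outgoing_headers(incoming_headers):
    """Build headers for a downstream call that continue the incoming trace."""
    match = TRACEPARENT_RE.match(incoming_headers.get("traceparent", ""))
    if match is None:
        # No valid context arrived, so this service starts a new trace.
        return {"traceparent": new_traceparent()}
    trace_id, _parent_span_id, flags = match.groups()
    # Same trace ID, fresh span ID for this hop, flags passed through unchanged.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


incoming = {"traceparent": "00-" + "a" * 32 + "-" + "b" * 16 + "-01"}
print(outgoing_headers(incoming))
print(outgoing_headers({}))
```

The important property is that the trace ID stays the same across every hop while each service contributes its own span ID, which is what lets a tracing backend stitch the journey back together.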

Conclusion: Not a Single Tool, but a Whole

In conclusion, the answer to the question "metrics or logs?" is simple: both. And often, tracing should be added too. These tools should be seen not as competitors, but as complements to each other.

Metrics provide a quick overview of your system's overall health and allow you to detect anomalies.
Logs help you deeply understand the causes of these anomalies and speed up the debugging process.
Tracing, especially in distributed systems, reveals bottlenecks and inter-service issues by tracking the entire lifecycle of a request.

In my own experience, systems that effectively combine these three components detect and resolve problems much faster. This makes a big difference not only in operational efficiency but also in user satisfaction and system reliability. Therefore, when building your Observability strategy, consider how these tools complement each other, rather than focusing on just one.

On this journey, I keep finding new things to learn, and I keep working on using these tools better and making our systems more visible.