Logs vs. Metrics: Which is More Effective for Troubleshooting?

#tutorials #systemadmin #observability #troubleshooting

When something goes wrong in our systems, the first thing that usually comes to mind is the question "why?" To find the answer, we turn to two main tools: logs and metrics. One tells us in detail what happened, while the other provides numerical data about the overall health of the system. So, which one is more effective in the journey of troubleshooting? The answer isn't a simple matter of preference; using the right tool at the right time is critical.

Both tools are indispensable for the "observability" of our systems. However, they serve different functions and shine in different scenarios. In this post, we will take a deep dive into what logs and metrics are, how they differ, their strengths and weaknesses, and how they can be used together in our troubleshooting strategies.

Logs: The Storytellers Explaining What Happened in Detail

Logs are text-based records that capture the state or actions of a system at a specific time. Each log line can represent an event, an error, a warning, or simply a normal operation. These records are usually kept in timestamped and structured or semi-structured formats.

A log record can include a wide range of events, from the start time of an application and a user logging in to the execution of a database query or the establishment of a network connection. For example, a web server's access logs show which IP address requested which URL, which HTTP method was used, and how long the server took to respond. Error logs, on the other hand, can detail an exception that caused an application to crash, a configuration error, or a permission issue.

ℹ️ Concrete Example: Error Log

While managing a production ERP system, I noticed that data retrieval times on operator screens were increasing. When I scanned the application logs for a detailed investigation, I encountered an error record like this:
2026-05-17 10:35:12.456 [ERROR] com.example.erp.datafetcher.ProductService: Exception while fetching product details for ID 12345.
java.lang.NullPointerException: Cannot invoke "com.example.erp.model.Product.getSupplierInfo()" because "product" is null
    at com.example.erp.datafetcher.ProductService.processProduct(ProductService.java:87)
    ... (stack trace continues)
This log record shows that ProductService threw a NullPointerException at line 87 of ProductService.java because the product object was null. This information allowed me to quickly narrow down the source of the problem (missing information in the database while fetching product details) and focus on the solution. Metrics alone cannot provide this level of detail.
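
To search or aggregate records like the one above, it helps to split the header line into fields first. The following is a minimal sketch that assumes every record starts with the same "timestamp [LEVEL] logger: message" layout seen in the example; the names LOG_HEADER and parse_log_header are illustrative, not part of any library.

```python
import re
from datetime import datetime

# Pattern for the header line of the record shown above:
# "<date> <time>.<millis> [LEVEL] <logger>: <message>"
LOG_HEADER = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}) "
    r"\[(?P<level>[A-Z]+)\] "
    r"(?P<logger>[\w.]+): "
    r"(?P<message>.*)$"
)


def parse_log_header(line):
    """Return a dict of fields, or None if the line is not a record header
    (for example, a stack-trace continuation line)."""
    match = LOG_HEADER.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


sample = (
    "2026-05-17 10:35:12.456 [ERROR] "
    "com.example.erp.datafetcher.ProductService: "
    "Exception while fetching product details for ID 12345."
)
record = parse_log_header(sample)
print(record["level"], record["logger"])
```

Stack-trace lines return None here, so a caller can attach them to the most recent header record instead of treating them as separate events.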

Logs are generally used to find the answer to the "why" question. When an error occurs, logs tell us exactly when the event happened, which operation failed, what inputs were used, and what error was received. This is invaluable for identifying and fixing the root cause of the error. Especially in complex systems or distributed architectures, log correlation becomes critical for tracing how an event propagated across different services.
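
Log correlation can be sketched in a few lines. This example assumes each service writes a shared request identifier into its logs; the request_id field, the service names, and the records themselves are hypothetical, and the timestamps are ISO-8601 strings in one format so that they sort correctly as text.

```python
from collections import defaultdict


def correlate(records):
    """Group log records by request_id and order each group by timestamp."""
    by_request = defaultdict(list)
    for rec in records:
        by_request[rec["request_id"]].append(rec)
    return {
        rid: sorted(group, key=lambda r: r["ts"])
        for rid, group in by_request.items()
    }


# Hypothetical records from two services, already parsed into dicts.
records = [
    {"ts": "2026-05-17T10:35:12.456", "service": "erp-api",
     "request_id": "req-1", "msg": "NullPointerException in ProductService"},
    {"ts": "2026-05-17T10:35:12.101", "service": "gateway",
     "request_id": "req-1", "msg": "GET /products/12345"},
    {"ts": "2026-05-17T10:35:12.300", "service": "gateway",
     "request_id": "req-2", "msg": "GET /health"},
    {"ts": "2026-05-17T10:35:12.460", "service": "gateway",
     "request_id": "req-1", "msg": "500 returned to client"},
]

trace = correlate(records)["req-1"]
for rec in trace:
    print(rec["ts"], rec["service"], rec["msg"])
```

Reading the req-1 group in order shows the request entering through the gateway, failing in the API, and returning to the client as an error. That propagation path is hard to see when each service's log is read separately.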

Metrics: Numerical Indicators of System Health

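Metrics are numerical values that describe the performance and state of a system, sampled over time. Two common kinds are counters, which are cumulative and only increase (for example, the total number of requests served), and gauges, which are point-in-time measurements that can rise or fall (for example, memory in use). Metrics provide a continuous view of the overall health, performance, and capacity of the system.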
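
The difference between the two kinds is easier to see in code. This is a minimal sketch with illustrative class names, not the API of any particular metrics library: a counter is only meaningful as a rate of change between two readings, while a gauge is read directly.

```python
class Counter:
    """Cumulative value that only goes up (e.g. total requests served)."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("a counter can only increase")
        self.value += amount


class Gauge:
    """Point-in-time measurement that can go up or down (e.g. memory in use)."""

    def __init__(self):
        self.value = 0.0

    def set(self, value):
        self.value = value


def per_second_rate(earlier, later, seconds):
    """Average per-second increase of a counter between two readings."""
    return (later - earlier) / seconds


# 300 requests arrive during a (simulated) 60-second interval.
requests_total = Counter()
reading_at_start = requests_total.value
for _ in range(300):
    requests_total.inc()
rate = per_second_rate(reading_at_start, requests_total.value, seconds=60)

# A gauge simply reports the latest measurement.
memory_in_use_mb = Gauge()
memory_in_use_mb.set(512.0)
memory_in_use_mb.set(384.0)
```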

For example, values such as CPU usage, memory usage, network traffic, disk I/O performance, application request count, and error rate are metrics. By monitoring these metrics over time, anomalies, performance bottlenecks, or potential problems can be detected at an early stage. Metrics are typically stored in Time Series Databases (TSDB) and tracked through visualization tools (dashboards).

ℹ️ Concrete Example: CPU Usage Metric

A customer reported a sudden slowdown on their website. One of the first places I checked was the server's CPU usage metrics. Looking at the time series graph, I saw that CPU usage had been consistently above 95% just before the problem was reported.
Timestamp           | CPU Usage (%)
--------------------|------------
2026-05-17 10:00:00 | 25.5
2026-05-17 10:05:00 | 30.2
2026-05-17 10:10:00 | 96.1  <-- Anomaly starts
2026-05-17 10:15:00 | 97.5
2026-05-17 10:20:00 | 95.8
2026-05-17 10:25:00 | 40.3  <-- Drop after issue resolved
This metric data immediately showed that the slowdown coincided with sustained high CPU usage. The graph could not tell me which process was responsible, so I still needed a deeper investigation (for example, using top or htop commands) to find the source of the load. Metrics are a great starting point for understanding the existence and general impact of a problem.
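
The same reading of the graph can be automated with a simple threshold check. The sketch below uses the samples from the table above; the 95% threshold and the function name anomaly_windows are assumptions chosen for illustration, and a real alerting rule would usually also require the condition to hold for a minimum duration.

```python
samples = [
    ("2026-05-17 10:00:00", 25.5),
    ("2026-05-17 10:05:00", 30.2),
    ("2026-05-17 10:10:00", 96.1),
    ("2026-05-17 10:15:00", 97.5),
    ("2026-05-17 10:20:00", 95.8),
    ("2026-05-17 10:25:00", 40.3),
]


def anomaly_windows(samples, threshold):
    """Return (first_ts, last_ts) for each consecutive run of samples
    whose value is above the threshold."""
    windows = []
    start = last = None
    for ts, value in samples:
        if value > threshold:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            windows.append((start, last))
            start = last = None
    if start is not None:  # the run was still open at the last sample
        windows.append((start, last))
    return windows


print(anomaly_windows(samples, threshold=95.0))
```

For this data the function reports a single window from 10:10:00 to 10:20:00, which is the time range to look at in the logs.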

Metrics are generally used to find the answer to the "what is happening?" question. They provide an idea of the overall performance of the system and help proactively identify potential problems. Situations such as a sudden system slowdown, an increase in a service's response time, or a spike in error rates can be easily detected with metrics. This allows for intervention before users are affected.

Logs vs. Metrics: Key Differences and Working Together

Logs and metrics complement each other. While logs offer detailed event records, metrics provide numerical data about the general state. The main differences between them are:

Format: Logs are text-based, while metrics are numerical values.
Level of Detail: Logs are very detailed and describe specific events. Metrics are more abstract and general.
Purpose: Logs are typically used for debugging and root cause analysis. Metrics are used for performance monitoring, capacity planning, and general system health tracking.
Storage: Logs are usually stored in large text files or log management systems. Metrics are stored more efficiently in time series databases (TSDB).
Volume: The volume of log data generated by a system can be much larger than metric data.

⚠️ Misconception: Metrics Alone Are Not Enough

Once, I was trying to resolve a low success rate during payment transactions on a Turkish e-commerce site. Metrics showed that there was no drop in the number of requests to the payment API, but the response time had increased. This suggested that the problem might not be in the API itself but in a background database process. If I had stuck only to metrics, finding the source of the problem would have been much harder. Detailed log analysis revealed that a specific database query was causing excessive resource consumption.

The collaboration of these two tools forms the basis of an "observability" strategy. When an anomaly is detected in a metric (for example, a sudden spike in CPU usage), the logs for the relevant time period are examined to understand the event causing this anomaly. Conversely, when an error log is seen, metrics can be examined to understand how this error affects the overall performance of the system.

For example, when a sudden spike is noticed in an application's error rate metric, we understand from the metrics what time it started and how long it lasted. Then, by examining the application logs for that time period, we can identify which requests failed, what errors were received, and the causes of these errors. This integrated approach allows us to quickly understand both the existence and the cause of the problem.
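
Moving from a metric spike to the matching logs is mostly a time-window filter. This is a minimal sketch, assuming log lines begin with a "YYYY-MM-DD HH:MM:SS" timestamp; the log lines, the one-minute margin, and the name logs_in_window are hypothetical.

```python
from datetime import datetime, timedelta

TS_FORMAT = "%Y-%m-%d %H:%M:%S"


def logs_in_window(lines, start, end, margin=timedelta(minutes=1)):
    """Yield log lines whose leading timestamp falls within
    [start - margin, end + margin]."""
    window_start, window_end = start - margin, end + margin
    for line in lines:
        try:
            ts = datetime.strptime(line[:19], TS_FORMAT)
        except ValueError:
            # No leading timestamp (e.g. a stack-trace line); skipped in this sketch.
            continue
        if window_start <= ts <= window_end:
            yield line


# Hypothetical application log lines.
log_lines = [
    "2026-05-17 10:02:10 [INFO] request served in 120 ms",
    "2026-05-17 10:11:45 [ERROR] upstream timeout after 30000 ms",
    "    at some.stack.Frame(Frame.java:1)",
    "2026-05-17 10:19:03 [ERROR] upstream timeout after 30000 ms",
    "2026-05-17 10:40:00 [INFO] request served in 95 ms",
]

spike_start = datetime(2026, 5, 17, 10, 10)
spike_end = datetime(2026, 5, 17, 10, 20)
matching = list(logs_in_window(log_lines, spike_start, spike_end))
for line in matching:
    print(line)
```

The margin widens the window slightly on both sides because the event that triggers a spike often appears in the logs shortly before the metric reflects it.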

The Most Effective Approach for Troubleshooting: Integration

When it comes to troubleshooting, there is no winner in the "logs vs. metrics" debate; both are necessary. However, there are specific scenarios where one tool is more effective than the other:

Cases Where Logs Are More Effective:

Root Cause Analysis (RCA): When an error occurs, logs tell us in detail exactly when, where, and how the event happened. Logs are indispensable for understanding specific issues like a NullPointerException, a ConnectionRefusedError, or an authorization error.
Tracking a Specific Event: To understand a specific problem encountered by a user, it is necessary to follow the logs belonging to that user's session or request.
Security Events: In the event of a security breach or suspicious activity, logs provide critical information for understanding the source, method, and impact of the attack.
Step-by-step Debugging: To understand why a specific function of an application is not working, it is necessary to examine the log records during the execution of that function step by step.

Cases Where Metrics Are More Effective:

General Performance Monitoring: Metrics are ideal for understanding the general state of system resources (CPU, RAM, Disk, Network) and identifying potential bottlenecks.
Anomaly Detection: Metrics are used to detect deviations from a system's normal behavior (e.g., abnormally high response time or low transaction volume).
Capacity Planning: Historical metric data is used to predict future needs and plan infrastructure upgrades.
SLO (Service Level Objective) Tracking: Metrics such as error rate and availability are used to monitor whether service level objectives are being met.
Trend Analysis: Metrics are used to understand system performance trends over time and anticipate potential problems.
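
As an illustration of SLO tracking, availability and the remaining error budget can be derived from two counters: total requests and failed requests. The figures and the 99.9% target below are hypothetical, and the function names are illustrative.

```python
def availability(total_requests, failed_requests):
    """Fraction of requests that succeeded."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return (total_requests - failed_requests) / total_requests


def error_budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget still unused.

    A negative result means more requests failed than the SLO allows.
    """
    if not 0.0 < slo_target < 1.0 or total_requests <= 0:
        raise ValueError("need 0 < slo_target < 1 and total_requests > 0")
    allowed_failures = total_requests * (1.0 - slo_target)
    return 1.0 - failed_requests / allowed_failures


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% target.
print(availability(1_000_000, 400))
print(error_budget_remaining(1_000_000, 400, slo_target=0.999))
```

With these numbers the target allows about 1,000 failed requests, so 400 failures consume roughly 40% of the budget and leave about 60% for the rest of the window.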

💡 Pragmatic Approach: Troubleshooting Workflow

Here is the general workflow I follow when an error is detected:

1. Check Metrics: Is there a system-wide anomaly? (e.g., CPU, RAM, Network, Request Rate, Error Rate)

2. If There Is an Anomaly:
    Which service or component is affected?
    When did the anomaly start?
    Is there a known change (deployment, configuration change) that could have caused this anomaly?

3. Dive into Logs:
    Examine the logs of the affected service or component for the relevant time period.
    Search for specific error messages (ERROR, EXCEPTION, FATAL).
    Try to determine the root cause.

4. Correlation: Understand the propagation of the problem in distributed systems by establishing a relationship between the logs of different services.

5. Resolution and Verification: Fix the problem, then verify the return to normal in both metrics and logs.

This integrated approach is very effective, especially when examining journald logs of systemd services and access/error logs of services like Nginx. For example, when a systemd service stops unexpectedly, I pull the relevant logs with the journalctl -u <service-name> command and run systemctl status <service-name> to check the service's current state, a brief resource summary (such as tasks and memory), and its most recent log lines.
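
When the time window is already known from the metrics, the journalctl call can be narrowed to it. The sketch below only builds the command; the unit name nginx.service and the timestamps are examples, and fetch_unit_logs is an illustrative helper that needs a systemd host to run.

```python
import subprocess


def journalctl_command(unit, since, until, priority="err"):
    """Build a journalctl invocation for one unit and one time window."""
    return [
        "journalctl",
        "-u", unit,          # restrict to a single systemd unit
        "--since", since,
        "--until", until,
        "-p", priority,      # this priority and anything more severe
        "-o", "json",        # one JSON object per log entry
        "--no-pager",
    ]


def fetch_unit_logs(unit, since, until):
    """Run journalctl and return its stdout. Requires a systemd host."""
    cmd = journalctl_command(unit, since, until)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = journalctl_command(
    "nginx.service", "2026-05-17 10:10:00", "2026-05-17 10:20:00"
)
```

Passing the arguments as a list rather than a shell string avoids quoting problems with the timestamps, which contain spaces.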

Log Management Systems and Metrics Collection Tools

In modern systems, specialized tools are used to effectively manage logs and metrics.

Log Management Systems:

Elastic Stack (Elasticsearch, Logstash, Kibana): A very popular open-source solution. We collect logs with Logstash, index them in Elasticsearch, and visualize and query them with Kibana.
Loki (Grafana Labs): A lighter alternative designed to work alongside Prometheus and Grafana. Instead of building a full-text index of log content, it indexes only a set of labels for each log stream and filters the content at query time.
Splunk: A commercial solution typically used in large enterprise environments.

These systems allow you to collect large amounts of log data in a central location and search and analyze them effectively. Such systems are essential, especially in distributed systems, for collecting and correlating logs from different servers.

Metrics Collection & Monitoring Tools:

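Prometheus: An open-source monitoring and alerting system. It uses a "pull" model: at regular intervals it scrapes metrics from the HTTP endpoints that services expose, and it stores the samples in its built-in time series database. The data is commonly visualized with Grafana.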
Grafana: One of the most popular tools for visualizing metrics. It supports many data sources such as Prometheus, InfluxDB, and Graphite.
Datadog, New Relic: Commercial, comprehensive monitoring and analysis platforms.

These tools help you continuously monitor system performance, set up alerts, and create visual reports. For example, tools like systemd-exporter can be integrated with Prometheus to collect metrics from systemd services.
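
In a pull-based setup, each service or exporter publishes its current values as plain text that the collector scrapes. The following sketch renders one metric family in the Prometheus text exposition format; the function name render_metric and the sample values are illustrative, and the sketch does not escape special characters in label values, which a real client library would handle.

```python
def render_metric(name, metric_type, help_text, samples):
    """Render one metric family in the Prometheus text exposition format.

    samples is a list of (labels_dict, value) pairs.
    """
    lines = [
        f"# HELP {name} {help_text}",
        f"# TYPE {name} {metric_type}",
    ]
    for labels, value in samples:
        if labels:
            label_str = ",".join(
                f'{key}="{val}"' for key, val in sorted(labels.items())
            )
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical request counter, split by HTTP status.
text = render_metric(
    "http_requests_total",
    "counter",
    "Total HTTP requests.",
    [
        ({"method": "GET", "status": "200"}, 1027),
        ({"method": "GET", "status": "500"}, 3),
    ],
)
print(text, end="")
```

In practice an official client library generates this output and serves it over HTTP; writing it by hand is mainly useful for understanding what a scrape returns.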

Conclusion: Stronger Together

The "logs vs. metrics" question is not really an "either-or" question, but a question of which to use when and how to use them together. Logs are the detailed narrators telling the story of the system; they are indispensable for finding the root cause of errors, tracking specific events, and analyzing security incidents. Metrics, on the other hand, are like X-ray machines that numerically show the overall health and performance of the system; they are critical for detecting anomalies, performing capacity planning, and monitoring service levels.

An effective troubleshooting and system monitoring strategy requires combining the strengths of these two tools. By combining an anomaly in metrics with the details in logs, we can quickly understand both the existence and the cause of problems. In my own systems, by using both detailed log analysis and comprehensive metric monitoring together, I proactively detect potential problems and resolve them before users are affected. Remember, the best observability comes from balancing the depth of logs with the breadth of metrics.