Mustafa ERBAY

Posted on • Originally published at mustafaerbay.com.tr

Traced Logging vs. Metric-Based Monitoring: A Practical Comparison

In nearly 20 years of managing systems, I've tried different approaches to understanding the behavior of an application or service. From critical modules of a production ERP I developed to the backend services of my side projects, the fundamental question was always the same: when a problem occurs or performance degrades, how do I detect it and find the root cause? This question became harder to answer as distributed architectures became more prevalent.

Today, we have two main paths for monitoring our systems: Traced Logging and Metric-Based Monitoring. Each has its own advantages, and deciding when to use which depends on the project's and the team's needs. In this post, I'll compare the two approaches based on my own experience and explain which one makes more sense in which scenario.

Traced Logging: Following the Flow of Events

Traced Logging (more commonly called distributed tracing) means tracking every step of a request or operation through a system and tying together the logs associated with those steps. We use it to see which services run in what order when a user clicks a button, how long each step takes, and what errors, if any, occur. Especially in microservice architectures, understanding how a request flows from service A to service B and then to service C can be very difficult with metrics alone.

When I observed that the order creation process in a manufacturing company's ERP sometimes took up to 10 seconds, the first thing I did was check the traces. Instead of collecting logs separately, we attach a unique trace_id to each request and a span_id to each step, which lets us visualize the entire lifecycle of the request. This showed me that the "stock control" step within the "order approval" service was taking much longer than expected, and sometimes even getting stuck on a Redis query.

# Example tracing integration in FastAPI
import asyncio
import random
import uuid

from fastapi import FastAPI, HTTPException
from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

# Configure TracerProvider (spans are printed to the console for this demo)
resource = Resource.create({"service.name": "erp-order-service"})
provider = TracerProvider(resource=resource)
processor = SimpleSpanProcessor(ConsoleSpanExporter())
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# Instrument the FastAPI application
app = FastAPI()
FastAPIInstrumentor.instrument_app(app)

@app.post("/order")
async def create_order(order_data: dict):
    with tracer.start_as_current_span("create_order_request") as parent_span:
        # Simulate some processing
        order_id = "ORD-" + str(uuid.uuid4())[:8]
        parent_span.set_attribute("order.id", order_id)

        with tracer.start_as_current_span("validate_items"):
            # Item validation logic
            await asyncio.sleep(0.05)  # Simulate IO without blocking the event loop

        with tracer.start_as_current_span("deduct_stock"):
            # Stock deduction logic, potentially calling another service
            await asyncio.sleep(0.1)  # Simulate DB call or external API
            if random.random() < 0.1:  # Simulate a stock error
                parent_span.set_attribute("error", True)
                # The raised exception is also recorded on the active spans
                raise HTTPException(status_code=500, detail="Stock deduction failed")

        # ... more steps

        return {"message": "Order created successfully", "order_id": order_id}

With this type of tracing, we can clearly see which stages each request goes through and where it gets stuck. In the example above, I simulated an error in the deduct_stock step. Thanks to tracing, I can instantly identify for which order, in which service, and within which span this error occurred. This is a very powerful tool, especially for debugging and finding performance bottlenecks. However, Traced Logging also has its costs and complexities; adding so much metadata to every log line increases storage and processing load.
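To make the trace_id and span_id idea concrete without any tracing backend, here is a minimal standard-library sketch. It is not how OpenTelemetry works internally. The names trace_id_var, span_id_var, TraceContextFilter, ListHandler, build_logger, and handle_order are hypothetical and exist only for illustration. It shows that every log line of one request shares a trace_id while each step gets its own span_id.

```python
import contextvars
import logging
import uuid

# Hypothetical context variables holding the IDs of the request being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")
span_id_var = contextvars.ContextVar("span_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the current trace_id/span_id onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True  # never drop the record, only enrich it


class ListHandler(logging.Handler):
    """Keep formatted lines in memory so the demo needs no files or network."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def emit(self, record: logging.LogRecord) -> None:
        self.lines.append(self.format(record))


def build_logger(handler: logging.Handler) -> logging.Logger:
    logger = logging.getLogger("erp-order-service")
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep demo output out of the root logger
    handler.setFormatter(logging.Formatter(
        "%(levelname)s trace_id=%(trace_id)s span_id=%(span_id)s %(message)s"))
    handler.addFilter(TraceContextFilter())
    logger.addHandler(handler)
    return logger


def handle_order(logger: logging.Logger) -> str:
    """Simulate one request: one trace_id, a fresh span_id per step."""
    trace_id = uuid.uuid4().hex  # random 32-character hex ID (illustrative)
    trace_id_var.set(trace_id)
    for step in ("validate_items", "deduct_stock"):
        span_id_var.set(uuid.uuid4().hex[:16])
        logger.info("step=%s finished", step)
    return trace_id
```

Searching a log store for a single trace_id then returns every step of that request in order, which is the property the tracing tools build on.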

Metric-Based Monitoring: Taking the Pulse of System Health

Metric-Based Monitoring, on the other hand, focuses on collecting numerical data about the overall health and performance of the system. Metrics such as CPU usage, memory consumption, disk I/O, network traffic, request count, error rates, and response times allow us to take the pulse of the system. These metrics are typically stored in time-series databases and visualized through graphs and dashboards.

I generally use metrics to understand the overall status of my critical services. For example, seeing that an API's requests per second (RPS) or HTTP 5xx error rate has crossed a certain threshold immediately tells me there's a problem. Unlike Traced Logging, metrics are typically used to detect that a problem exists, not to find out why it exists. In an internal banking platform, an alarm would trigger as soon as the payment service's response time exceeded 200 ms. This alarm marked the start of a problem.

# Example alert rule for Prometheus
groups:
- name: api_service_alerts
  rules:
  - alert: HighErrorRate
    expr: sum(rate(api_requests_total{status_code=~"5.."}[5m])) by (service_name) / sum(rate(api_requests_total[5m])) by (service_name) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate in {{ $labels.service_name }} service"
      description: "{{ $labels.service_name }} service returned over 5% 5xx error codes in the last 5 minutes."

  - alert: HighLatency
    expr: histogram_quantile(0.99, sum by (le, service_name) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "High P99 latency in {{ $labels.service_name }} service"
      description: "P99 response time for {{ $labels.service_name }} service exceeded 500ms in the last 5 minutes."

The Prometheus alert rules above demonstrate how powerful metric-based monitoring can be. An alert fires automatically once its threshold has stayed exceeded for the configured "for" duration, so neither I nor my operations team has to watch dashboards constantly to become aware of a problem. Metrics are also indispensable for long-term trend analysis, capacity planning, and SLA/SLO tracking. For example, by tracking my PostgreSQL servers' WAL and checkpoint activity over time (the behavior governed by settings such as wal_buffers and checkpoint_timeout), I can predict future performance issues. However, metrics typically do not include details such as why a single operation slowed down or with what parameters it was called. They tell us "there is a problem," not "where exactly is the problem and why?"
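The effect of the "for" clause in the rules above can be sketched in a few lines of Python. This is a simplified model, not Prometheus's actual implementation; the class name ThresholdAlert and its fields are hypothetical. The alert moves from inactive to pending when the error ratio crosses the threshold, and to firing only if the condition holds without a gap for the whole duration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ThresholdAlert:
    """Fire only after the condition has held for `for_seconds` without a gap."""
    threshold: float                      # e.g. 0.05 for a 5% error ratio
    for_seconds: float                    # e.g. 120 for "for: 2m"
    pending_since: Optional[float] = None

    def evaluate(self, now: float, errors: int, total: int) -> str:
        """Evaluate one tick; `now` is a timestamp in seconds."""
        ratio = errors / total if total else 0.0
        if ratio <= self.threshold:
            self.pending_since = None     # condition cleared, reset the timer
            return "inactive"
        if self.pending_since is None:
            self.pending_since = now      # condition just became true
        if now - self.pending_since >= self.for_seconds:
            return "firing"
        return "pending"


alert = ThresholdAlert(threshold=0.05, for_seconds=120)
states = [
    alert.evaluate(now=0, errors=1, total=100),     # 1% errors
    alert.evaluate(now=60, errors=10, total=100),   # 10% errors, timer starts
    alert.evaluate(now=120, errors=10, total=100),  # still above, 60s elapsed
    alert.evaluate(now=180, errors=10, total=100),  # above for 120s
]
```

A short spike that clears before the duration elapses resets the timer, which is why the "for" clause is a common way to reduce noisy alerts.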

Different Use Cases and Trade-offs

So, in which situation should I prefer which? In fact, it depends on what you're trying to answer. In my experience, the two are complementary tools.

  • Metric-Based Monitoring:

    • When: I want to quickly understand the overall health and performance of a system or service, define critical thresholds and raise alarms when they are exceeded, do capacity planning, or monitor long-term trends.
    • Advantages: Lower storage and processing costs, quick summary information, instant alerts, ideal for trend analysis. Whether I am watching the CPU/memory limits of my systemd units, Redis starting to evict keys under its maxmemory-policy, or a sudden drop in Nginx's requests per second, metrics are my first go-to.
    • Disadvantages: They do not provide enough detail to find the root cause of a problem and cannot answer the "why" question.
  • Traced Logging:

    • When: I want to examine in depth the lifecycle of a specific request or operation in a distributed system: which services it passed through, how much time was spent at each step, and where an error occurred. Also during development or when debugging a complex error.
    • Advantages: Excellent for root cause analysis, greatly simplifies debugging in distributed systems, clearly shows performance bottlenecks. In a client project, when tracking down an N+1 query problem in PostgreSQL, tracing showed me within seconds which ORM call came from which controller.
    • Disadvantages: High storage and processing costs, because collecting all details of every request consumes significant resources. It generally does not provide a quick summary of the system's overall status.

💡 A Tip from My Experience

In one of my side products, I initially used only metric-based monitoring. One day, a user complained that a payment had failed. Metrics showed that the overall error rate was normal. However, when I enabled tracing, I saw that only transactions made with one specific payment method were timing out in the third-party API. That's when I realized that aggregate metrics can mask specific problems that directly affect the user's experience.
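The masking effect is easy to reproduce with made-up numbers. In the sketch below, the request counts and the payment method names are entirely hypothetical and are not taken from the incident above. The overall error rate looks harmless while one low-traffic payment method is failing most of the time.

```python
from collections import Counter

# Hypothetical request log: (payment_method, succeeded). The numbers are made up.
requests = (
    [("card", True)] * 970 + [("card", False)] * 5
    + [("bank_transfer", True)] * 5 + [("bank_transfer", False)] * 20
)


def error_rates(rows):
    """Return (overall_rate, {payment_method: rate}) for a list of outcomes."""
    totals, failures = Counter(), Counter()
    for method, ok in rows:
        totals[method] += 1
        if not ok:
            failures[method] += 1
    overall = sum(failures.values()) / sum(totals.values())
    per_method = {m: failures[m] / totals[m] for m in totals}
    return overall, per_method


overall, per_method = error_rates(requests)
# overall is 25 failures out of 1000 requests (2.5%),
# while bank_transfer alone fails 20 out of 25 times (80%).
```

Breaking the error metric down by a low-cardinality label such as the payment method is one way to surface this kind of problem from metrics alone; tracing remains the tool for seeing where the failing requests actually time out.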

The biggest trade-off between these two is usually between cost and level of detail. While tracing provides very detailed information, storing all that detail can be expensive. Metrics, on the other hand, offer a broader perspective with less detail and are generally more cost-effective.

Integration and Hybrid Approaches

For me, the most effective way is to use these two approaches together, treating them as complementary rather than as alternatives to each other. This combined practice is usually called "Observability," and the trio of metrics, logs, and tracing is often described as its "three pillars" or as a "three-legged stool."

Typically, when a problem occurs, my workflow is as follows:

  1. Metrics: First, I look at my dashboards or alarm system. Is CPU usage abnormal? Are error rates rising? Have response times increased? For example, I can tell from cgroup metrics if a service I manage with Docker Compose is hitting its memory limit.
  2. Logs: When metrics indicate a problem, I check the logs of the relevant service. Through journald or a centralized log collection system, I search for error messages or warnings within a specific time range. When Redis's OOM eviction policy kicks in, I clearly see it in the logs.
  3. Tracing: If I can't find enough detail in the logs or if the problem affects several components of the distributed system, tracing tools come into play. This allows me to visualize where a request got stuck or delayed between which services.

This hybrid approach allows me to continuously monitor the overall health of the system and, when necessary, delve into the finest details. In a production ERP, if the production planning module sometimes slowed down, I would first check the general status (CPU, I/O) with metrics, then look for anomalies in the logs, and finally use tracing to examine which AI model call a specific planning operation was getting stuck on. Without this trio, solving complex problems would take much longer.

Performance and Cost Implications

Observability tools introduce overhead to systems, requiring important decisions regarding performance and cost. Both approaches have their unique performance and cost implications:

  • Metric-Based Monitoring:

    • Performance: Metric collection is typically lightweight. Most client libraries keep counters, gauges, and histograms in memory and expose them via an endpoint that the monitoring system scrapes periodically, which adds minimal overhead to the application. However, very high-cardinality metrics (e.g., a separate series for each user ID) can strain time-series databases and increase storage costs.
    • Cost: Metrics generally occupy less storage space because they are aggregated and compressed over time. Even if data retention is extended, the cost remains more manageable than with tracing. On my own VPS, deriving metrics from Nginx access logs (such as request count and response time) and storing only those metrics is much more cost-effective than storing the raw logs.
  • Traced Logging:

    • Performance: Tracing can introduce more overhead to the application because it attaches metadata like trace_id and span_id to every step of each request or operation and then collects that data. Especially in systems with very high traffic, this can have a noticeable impact on CPU and network usage. Sending tracing data with synchronous I/O on the request path can degrade performance, which is why asynchronous, batched sending is generally preferred.
    • Cost: Storing detailed steps and logs for every request means a very large amount of data, which increases storage and processing costs far more than metrics do. Therefore, "sampling" strategies are often used in production environments: instead of storing every request's trace, only a certain percentage is stored, or only the traces that meet specific conditions (e.g., those containing errors or exceeding a certain duration). In an internal banking platform, instead of storing all transaction traces, we only collected traces of failed transactions or those exceeding 5 seconds. This reduced costs by up to 90%.
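The sampling rule described above (keep failed transactions and those slower than 5 seconds) could look roughly like the following sketch. The names FinishedTrace and should_keep, and the optional baseline_rate parameter, are hypothetical additions for illustration. Because the decision depends on the outcome and total duration, it can only be made once the trace has finished, which is usually referred to as tail-based sampling.

```python
import random
from dataclasses import dataclass


@dataclass
class FinishedTrace:
    trace_id: str
    duration_seconds: float
    has_error: bool


def should_keep(trace: FinishedTrace, slow_threshold: float = 5.0,
                baseline_rate: float = 0.0, rng=random.random) -> bool:
    """Keep every failed or slow trace; keep the rest with probability baseline_rate."""
    if trace.has_error or trace.duration_seconds > slow_threshold:
        return True
    # rng() returns a float in [0, 1), so baseline_rate=0.0 keeps nothing here.
    return rng() < baseline_rate


kept_failed = should_keep(FinishedTrace("a", 0.2, has_error=True))
kept_slow = should_keep(FinishedTrace("b", 6.0, has_error=False))
kept_normal = should_keep(FinishedTrace("c", 0.2, has_error=False))
```

Setting baseline_rate to a small value such as 0.01 keeps a trickle of healthy traces as well, which helps when you need a "normal" request to compare a slow one against.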

⚠️ Disk Fire Alert

In a Docker container running a service that wrote tracing data to local disk, I observed disk I/O maxing out and disk space filling up rapidly under heavy load. This was one of the situations I came to call a container disk fire. Sending tracing data over the network to a central collector, or implementing sampling, is critical to preventing such issues.
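The idea behind asynchronous sending can be sketched with the standard library alone. The OpenTelemetry SDK ships a BatchSpanProcessor for this purpose; the class below is not that implementation, only a simplified illustration, and BackgroundExporter and send_batch are hypothetical names. Request handlers put spans into a bounded in-memory queue and return immediately, a worker thread ships them in batches, and when the buffer is full spans are dropped instead of piling up on disk or blocking requests.

```python
import queue
import threading


class BackgroundExporter:
    """Buffer spans in a bounded queue and ship them in batches off the request path."""

    def __init__(self, send_batch, max_queue=1000, batch_size=100):
        self._send_batch = send_batch      # hypothetical callable doing the network I/O
        self._queue = queue.Queue(maxsize=max_queue)
        self._batch_size = batch_size
        self.dropped = 0                   # approximate if many threads submit at once
        self._stop = threading.Event()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, span: dict) -> bool:
        """Called from request handlers; never blocks, drops when the buffer is full."""
        try:
            self._queue.put_nowait(span)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def _drain(self) -> list:
        """Take up to batch_size spans that are already waiting."""
        batch = []
        while len(batch) < self._batch_size:
            try:
                batch.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return batch

    def _run(self) -> None:
        # Keep going until shutdown was requested and the queue is empty.
        while not self._stop.is_set() or not self._queue.empty():
            batch = self._drain()
            if batch:
                self._send_batch(batch)
            else:
                self._stop.wait(0.05)      # idle briefly instead of spinning

    def shutdown(self) -> None:
        """Flush what is left and stop the worker thread."""
        self._stop.set()
        self._worker.join()


sent = []
exporter = BackgroundExporter(sent.extend, max_queue=10, batch_size=4)
results = [exporter.submit({"span_id": i}) for i in range(5)]
exporter.shutdown()
```

Dropping on overflow is a deliberate trade-off in this sketch: losing some traces under extreme load is usually preferable to slowing down or crashing the service being observed.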

Considering cost and performance implications, using both approaches in a balanced way is important for both managing the budget and ensuring adequate observability.
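To put a number on the cardinality concern mentioned in this section, the upper bound on time series for a single metric is the product of the distinct values of each label. The label names and counts below are hypothetical and serve only as a back-of-the-envelope illustration.

```python
from math import prod


def series_count(label_values: dict) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


# Hypothetical labels for a request counter.
labels = {
    "service_name": ["orders", "payments", "stock", "auth", "reports"],
    "status_code": ["200", "201", "400", "404", "500", "503"],
    "method": ["GET", "POST", "PUT", "DELETE"],
}
baseline = series_count(labels)            # 5 * 6 * 4 series

# Adding a per-user label multiplies the bound by the number of users.
labels_with_user = dict(labels, user_id=range(10_000))
with_user_id = series_count(labels_with_user)
```

This is why identifiers such as user or order IDs usually belong in traces and logs rather than in metric labels.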

My Approach and Conclusion

My twenty years of field experience have shown me that there is no single "best" solution. Every project, every team, and every budget has its unique needs. My general approach is to base monitoring on metrics and use Traced Logging as a deep-dive investigation tool for specific scenarios.

In a scenario where we used multiple ISPs for company egress, to ensure DSCP marking was done correctly, I monitored both network device metrics (packet drops, latencies) and analyzed the network traffic of critical voice communication applications with tracing. While metrics showed a general degradation, tracing revealed which intermediate device had corrupted the DSCP tagging. This was a concrete example of how valuable both approaches are in their respective areas of expertise.

In summary:

  • Metrics for Overall Health and Alarms: I track whether my systems are alive and if basic performance thresholds are exceeded using metrics. This gives me the ability to proactively detect problems.
  • Traced Logging for Root Cause Analysis and Deep Debugging: When a problem is detected or when I encounter a complex error during development, the detailed flow information provided by Traced Logging is invaluable. However, I don't trace 100% of requests all the time; I use sampling, or enable full tracing only for specific critical operations.

Using these two tools with the right balance makes the operational load manageable and provides full control and understanding of my systems. Remember, the important thing is to know the tools you have and use the right tool at the right time.

In my next post, I will explain how I detect and resolve WAL bloat issues in PostgreSQL and my vacuum monitoring strategies.
