DevHelm

Posted on Jun 8 • Originally published at devhelm.io

Monitoring and Logging: How They Work Together and When You Need Both

#guides #infrastructure #reliability

Monitoring and logging solve two different problems that look identical from a distance. Both produce data about your system. Both live in dashboards. Both show up in incident timelines. The difference only becomes obvious when something breaks and you need to act.

Monitoring answers "is it broken?" Logging answers "why is it broken?" Every production system needs both, but the order you set them up, the tools you pick, and the architecture that connects them depend on your team size and what keeps breaking.

What monitoring actually means

Monitoring is the practice of collecting metrics — numeric measurements sampled at regular intervals — and alerting when those metrics cross a threshold. CPU usage, request latency, error rate, queue depth, disk usage. Each metric is a time series: a stream of (timestamp, value) pairs that you can graph, aggregate, and set rules against.

The defining characteristic of monitoring is that it operates on aggregates. You don't monitor individual requests; you monitor the p99 latency of all requests to /api/v1/orders over the last 5 minutes. You don't monitor individual log lines; you monitor the rate of 5xx responses per second.
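To make the aggregate idea concrete, here is a minimal sketch in Python. It assumes request samples are available as (timestamp, latency, status) tuples; the names `Sample`, `p99_latency_ms`, and `error_rate_per_second` are hypothetical, and the percentile uses the simple nearest-rank method rather than whatever a given time-series database implements.

```python
import math
from typing import NamedTuple


class Sample(NamedTuple):
    ts: float          # Unix timestamp in seconds
    latency_ms: float  # request duration
    status: int        # HTTP status code


def in_window(samples, now, window_s):
    """Keep only samples from the last window_s seconds."""
    return [s for s in samples if now - window_s < s.ts <= now]


def p99_latency_ms(samples, now, window_s=300):
    """Nearest-rank p99 over the window; None if the window is empty."""
    latencies = sorted(s.latency_ms for s in in_window(samples, now, window_s))
    if not latencies:
        return None
    rank = math.ceil(len(latencies) * 99 / 100)  # 1-based nearest rank
    return latencies[rank - 1]


def error_rate_per_second(samples, now, window_s=300):
    """5xx responses per second, averaged over the window."""
    errors = sum(1 for s in in_window(samples, now, window_s) if s.status >= 500)
    return errors / window_s
```

Neither function returns anything about a single request. That is the point: the output is one number per window, which is cheap to store and easy to alert on.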

A monitoring system has three parts:

Collection: scrape or push metrics from your services (Prometheus pull model, StatsD push model, OpenTelemetry SDK)
Storage: a time-series database that handles high write throughput and efficient range queries (Prometheus TSDB, InfluxDB, TimescaleDB, Mimir)
Alerting: rules that evaluate metric expressions, plus routing that delivers the resulting notifications (Prometheus alerting rules with Alertmanager, Grafana Alerting, PagerDuty for paging)

When your API's p99 latency exceeds 500ms for 5 consecutive minutes, the monitoring system fires an alert. You know something is wrong. But you don't know what — the metric tells you the symptom, not the cause.
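The "for 5 consecutive minutes" condition can be sketched as a small state machine. This is an illustration only, assuming one evaluation per minute; `ThresholdRule` is a hypothetical name, and the ok/pending/firing states are similar in spirit to how a `for` clause behaves in Prometheus alerting rules.

```python
from dataclasses import dataclass


@dataclass
class ThresholdRule:
    threshold: float      # e.g. 500 (ms)
    for_evaluations: int  # consecutive breaches required, e.g. 5
    breaches: int = 0     # current run of consecutive breaches

    def evaluate(self, value: float) -> str:
        """Feed one evaluation; returns 'ok', 'pending', or 'firing'."""
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # any healthy sample resets the run
        if self.breaches >= self.for_evaluations:
            return "firing"
        return "pending" if self.breaches else "ok"
```

Requiring consecutive breaches trades detection speed for fewer false pages: a single slow minute never fires, but a real incident is reported a few minutes late.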

What logging actually means

Logging is the practice of recording discrete events — structured or unstructured text entries that describe what happened at a specific moment. "User 4821 requested /api/v1/orders, query took 2.3s, database connection pool exhausted" is a log line. It has context that metrics can't capture: the specific user, the specific endpoint, the specific failure mode.

Where monitoring operates on aggregates, logging operates on individual events. You search logs for a specific request ID, a specific error message, a specific time window. The power of logging is correlation: you can reconstruct the sequence of events that led to a failure.
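As a rough sketch of what a structured version of that log line could look like, the example below uses only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `context` key, and the field names are assumptions made for illustration; libraries such as structlog offer this with less boilerplate.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
        }
        # A dict passed as extra={"context": ...} becomes record.context.
        event.update(getattr(record, "context", {}))
        return json.dumps(event)


def make_logger(stream) -> logging.Logger:
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()  # stands in for stdout or a log shipper
log = make_logger(stream)
log.error(
    "database connection pool exhausted",
    extra={"context": {"user_id": 4821, "path": "/api/v1/orders", "query_s": 2.3}},
)
```

Because every field is a key rather than a fragment of free text, a later search for one user or one endpoint becomes a filter instead of a regular expression.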

A logging system also has three parts:

Collection: capture log events from application code and infrastructure (structured loggers like Pino or Winston for Node.js, Python's structlog, Fluent Bit as a log shipper)
Storage + indexing: a store optimized for log-shaped data that indexes either the full text (Elasticsearch) or only labels (Loki); managed options include CloudWatch Logs and Datadog Log Management
Query + visualization: a search interface for filtering, correlating, and visualizing log events (Kibana, Grafana with Loki, Datadog Log Explorer)

Logs give you the "why." But without monitoring, you don't know to look at them in the first place. Nobody sits in Kibana watching logs scroll by in real time during a normal day.

Where monitoring stops and logging starts

The handoff happens at the alert. Here's the sequence in a well-instrumented system:

1. Monitoring detects the anomaly. Error rate on /api/v1/checkout spikes from 0.1% to 12% over 90 seconds.
2. Alert fires. The on-call engineer's phone buzzes. The alert says: "checkout error rate > 5% for 2 minutes."
3. Engineer opens the dashboard. Monitoring shows which service is affected and when it started. The error rate graph shows a sharp step function at 14:32 UTC.
4. Engineer pivots to logs. Searching for service=checkout AND level=error AND timestamp > 2026-06-07T14:30:00Z reveals 400 instances of "TLS handshake failed: payments-service:443: certificate has expired."
5. Root cause identified. The payments service certificate expired, so the checkout service can't establish TLS connections.

Steps 1–3 are monitoring. Steps 4–5 are logging. The architecture must make this handoff fast — ideally under 60 seconds from alert to relevant log query.
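One way to shorten that handoff is to derive the first log query directly from the alert. The sketch below assumes log records are already parsed into dictionaries with `service`, `level`, `timestamp`, and `message` keys; both function names are hypothetical, and a real setup would send the equivalent query to the log backend instead of filtering in memory.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def logs_for_alert(records, service, fired_at, lookback=timedelta(minutes=5)):
    """Error-level records for the alerting service, starting shortly before the alert fired."""
    start = fired_at - lookback
    return [
        r for r in records
        if r["service"] == service and r["level"] == "error" and r["timestamp"] >= start
    ]


def top_messages(records, n=3):
    """Most common error messages, usually the first thing worth reading."""
    return Counter(r["message"] for r in records).most_common(n)
```

The lookback window matters: the alert fires some minutes after the failure begins, so the query has to start before the alert timestamp to catch the first errors.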

When monitoring alone fails

Monitoring without logging is like a smoke detector without a fire extinguisher. You know there's a problem, but you can't do anything about it without more information.

Scenario 1: intermittent failures. Your API returns 500 errors at a rate of 0.5% — below your alerting threshold of 1%. Users complain. Monitoring says everything is green. Without logs, you have no way to find the specific requests that failed, identify the common pattern (all failures hit the same database shard), and trace the failure to a specific query.

Scenario 2: performance degradation without threshold breach. p99 latency drifts from 200ms to 450ms over two weeks. It never crosses your 500ms alert threshold. Users feel the slowness but nobody investigates because monitoring never fires. When you finally look at logs, you find a query plan regression after a schema migration — the database switched from an index scan to a sequential scan on a table that grew 3x.
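The arithmetic behind this scenario is worth spelling out. The sketch below compares a static threshold with a relative-drift check; the drift check is an assumption added for illustration (one possible complement to a fixed threshold), and the 50% limit is an arbitrary example value.

```python
def relative_drift(baseline: float, current: float) -> float:
    """Fractional change from baseline, e.g. 0.25 means 25% slower."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (current - baseline) / baseline


def drift_alert(baseline_p99_ms, current_p99_ms, static_limit_ms=500, max_drift=0.5):
    """Return the names of the checks that would fire for the current value."""
    fired = []
    if current_p99_ms > static_limit_ms:
        fired.append("static_threshold")
    if relative_drift(baseline_p99_ms, current_p99_ms) > max_drift:
        fired.append("drift")
    return fired
```

With the numbers from the scenario, `relative_drift(200, 450)` is 1.25: latency more than doubled, yet the static 500ms check stays silent.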

Scenario 3: data correctness bugs. Monitoring tracks availability and latency, not business logic. An off-by-one error in your billing calculation charges users 10% less than it should. Latency is fine, error rate is zero, availability is 100%. Only logs (or audit trails) reveal that the calculateTotal() function is returning wrong values.

When logging alone fails

Logging without monitoring is like a security camera with no motion sensor. You're recording everything, but nobody watches the feed until after the break-in.

Scenario 1: silent infrastructure failures. Your Elasticsearch cluster runs out of disk at 3 AM on Saturday. Log ingestion stops. No more logs arrive. Without a monitoring check on Elasticsearch disk usage and ingestion rate, you don't discover the gap until Monday morning, by which point you've lost more than two days of log data.

Scenario 2: gradual resource exhaustion. Memory usage on your API servers climbs 50MB per hour due to a leak. Each individual request looks fine in the logs. There's no single log event that says "memory is leaking." Only a metric tracking RSS over time makes the trend visible.
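A trend like this is easy to quantify once RSS is recorded as a time series. The following is a minimal sketch, assuming samples arrive as (hours, rss_mb) pairs; it fits an ordinary least-squares line, and both function names are hypothetical.

```python
def slope_per_hour(samples):
    """Least-squares slope of (hours, rss_mb) points, in MB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("timestamps must not all be equal")
    return cov / var


def hours_until(limit_mb, current_mb, slope):
    """Hours until RSS reaches the limit at the current slope; None if not growing."""
    if slope <= 0:
        return None
    return (limit_mb - current_mb) / slope
```

A server at 2,200MB growing 50MB per hour reaches a 4,000MB limit in 36 hours, which is a number no individual log line can provide.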

Scenario 3: high-volume events that need aggregation. Your API processes 10,000 requests per second. Searching logs for "how many 5xx errors happened in the last 5 minutes" requires scanning millions of log lines. A pre-aggregated metric answers the same question in milliseconds.
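The difference in cost can be sketched as follows. The `Counter5xx` class is a hypothetical, heavily simplified stand-in for a metrics counter, and the periodic `snapshot` call stands in for a scrape; real systems store these snapshots in a time-series database.

```python
REQUESTS_PER_SECOND = 10_000
WINDOW_SECONDS = 5 * 60

# Answering from logs means touching every line written in the window.
lines_to_scan = REQUESTS_PER_SECOND * WINDOW_SECONDS  # 3,000,000


class Counter5xx:
    """Monotonic counter, incremented once per 5xx response at request time."""

    def __init__(self):
        self.total = 0
        self.snapshots = {}  # timestamp -> counter value at that time

    def inc(self):
        self.total += 1

    def snapshot(self, ts):
        """Record the current total; assumed to be called by a periodic scrape."""
        self.snapshots[ts] = self.total

    def increase(self, start_ts, end_ts):
        """Errors between two snapshots: one subtraction, no scan."""
        return self.snapshots[end_ts] - self.snapshots[start_ts]
```

The work of counting is paid once per request at write time, so the read side no longer depends on traffic volume.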

The architecture that connects them

The modern observability stack has three signal types: metrics, logs, and traces. OpenTelemetry defines a unified collection layer for all three. The architecture looks like this:

Application
  ├── OTel SDK (metrics + logs + traces)
  └── Structured logger (Pino, structlog, slog)
        │
        ▼
  OTel Collector (receives all three signals)
  ├── Metrics → Prometheus / Mimir
  ├── Logs → Loki / Elasticsearch
  └── Traces → Jaeger / Tempo
        │
        ▼
  Grafana (unified query + dashboards + alerting)

The OpenTelemetry Collector acts as the central routing layer. It receives OTLP data from your applications, processes it (batching, sampling, enrichment), and exports to the appropriate backends. This decouples your application code from your backend choices — you can switch from Elasticsearch to Loki without redeploying a single service.
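The decoupling can be illustrated with a toy model. This is not the Collector's actual configuration format or API; `PIPELINES` and `route` are hypothetical names, and the sketch only shows why swapping a backend is a change to the routing table rather than to application code.

```python
from collections import defaultdict

# Hypothetical routing table; in a real Collector this lives in its config file.
PIPELINES = {
    "metrics": ["prometheus"],
    "logs": ["elasticsearch"],
    "traces": ["tempo"],
}


def route(batch, pipelines):
    """Group telemetry items by destination backend based on their signal type."""
    out = defaultdict(list)
    for item in batch:
        for backend in pipelines[item["signal"]]:
            out[backend].append(item)
    return dict(out)
```

Replacing `"elasticsearch"` with `"loki"` in the table changes where logs land, while the batches emitted by the application stay exactly the same.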

The critical integration point is exemplars: sample data points attached to a metric that carry the trace ID of a request that contributed to it. When your p99 latency spikes, you click on the spike in Grafana, and it takes you directly to a slow trace in Jaeger. From the trace, you see which span was slow. From the span, you pivot to the logs for that specific request. The three signals connect into a single investigation flow.
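A simplified picture of how an exemplar sits next to a metric is sketched below. The bucket bounds, the `LatencyHistogram` class, and the policy of keeping the most recent observation per bucket are assumptions for illustration; the buckets here are non-cumulative, unlike Prometheus histograms.

```python
import bisect
from dataclasses import dataclass, field

BUCKET_BOUNDS_MS = [100, 250, 500, 1000]  # upper bounds; the last bucket is +Inf


@dataclass
class LatencyHistogram:
    counts: list = field(default_factory=lambda: [0] * (len(BUCKET_BOUNDS_MS) + 1))
    exemplars: dict = field(default_factory=dict)  # bucket index -> (value, trace_id)

    def observe(self, latency_ms: float, trace_id: str) -> None:
        idx = bisect.bisect_left(BUCKET_BOUNDS_MS, latency_ms)
        self.counts[idx] += 1
        # Keep the most recent observation per bucket as its exemplar.
        self.exemplars[idx] = (latency_ms, trace_id)

    def slowest_exemplar(self):
        """Trace ID attached to the highest non-empty bucket, or None."""
        if not self.exemplars:
            return None
        return self.exemplars[max(self.exemplars)][1]
```

The histogram still stores only counts, so it stays cheap, but each bucket remembers one trace ID that a dashboard can turn into a link.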

The tool landscape in 2026

Here's an honest assessment of the major options, organized by the problem they solve:

Metrics + alerting

Tool	Strengths	Weaknesses
Prometheus + Grafana	Free, battle-tested at scale, massive ecosystem of exporters. PromQL is expressive.	Operational burden of running Prometheus at scale (storage, federation, HA). Not great at long-term retention without Thanos/Mimir.
Datadog	Zero operational burden, unified metrics+logs+traces, good alerting UI.	Expensive at scale ($15–23/host/mo for infra, $0.10/GB for logs). Vendor lock-in — custom query language.
Grafana Cloud	Managed Prometheus + Loki + Tempo. Same open-source query languages.	Costs scale with active series and log volume. Less feature-rich alerting than Datadog.

Log management

Tool	Strengths	Weaknesses
Elasticsearch + Kibana (ELK)	Full-text search, mature ecosystem, handles high cardinality well.	Resource-hungry (RAM, disk). Cluster management is a specialty skill. Expensive at high volume.
Grafana Loki	Cheap storage (only indexes labels, not full text). Pairs naturally with Prometheus. LogQL mirrors PromQL.	Full-text search is slow compared to Elasticsearch — you need good label discipline.
CloudWatch Logs	Zero setup on AWS. Integrates with Lambda, ECS, EKS natively.	Slow query performance at scale. The Logs Insights query language is limited. Egress costs.

Tracing

Tool	Strengths	Weaknesses
Jaeger	CNCF graduated, open source, Elasticsearch or Cassandra storage.	No built-in metrics or logs — tracing only. UI is functional but basic.
Grafana Tempo	Cost-efficient (object storage backend), integrates with Grafana, TraceQL.	Newer, smaller community than Jaeger. Requires Grafana for visualization.

See our Jaeger tracing deep-dive and OTel Collector guide for hands-on setup.

What to set up first

The order depends on your team size and what's currently breaking.

Solo developer or 2–3 person team

Start with monitoring. You don't have the operational capacity to run an ELK cluster. Use a managed monitoring service or a simple Prometheus + Grafana stack. Add structured logging to your application (console.log with JSON format is a valid starting point). Ship logs to CloudWatch or a free Loki instance.

Priority order:

1. Uptime monitoring: know when your service is down before your users tell you
2. Application metrics: request rate, error rate, latency (the RED method)
3. Structured logging: JSON logs with request IDs, user IDs, timestamps
4. Alerting rules: error rate > 1%, latency p99 > 1s, disk > 80%

5–20 person engineering team

Invest in the logging pipeline. At this size, "check the logs" is a daily activity. The cost of grepping through unstructured logs on 10 servers exceeds the cost of running a log management system. Deploy an OTel Collector, standardize on structured logging, and set up a Loki or Elasticsearch cluster.

Priority order:

1. Everything from the solo tier, if missing
2. Centralized log aggregation with search
3. Distributed tracing for cross-service requests
4. Runbooks that link alerts to the relevant log queries and dashboards

20+ person engineering team

Build the correlation layer. At this scale, the problem isn't collecting data — it's connecting the dots. Invest in exemplars (metrics → traces), trace-to-log links, and unified dashboards. Every alert should link to a runbook that includes the first three log queries to run.

Your MTTR at this scale is dominated by "time to find the relevant signal," not "time to fix the bug." The architecture that connects monitoring and logging is the primary lever for reducing incident duration.
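The idea of a runbook that carries its own first queries can be sketched as a small data structure. Everything here is hypothetical: the `Runbook` class, the dashboard path, and the query templates, which reuse the generic `key=value AND ...` search syntax rather than any specific backend's language.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Runbook:
    alert: str
    dashboard: str      # hypothetical dashboard path
    log_queries: tuple  # first queries to run, with {placeholders}

    def render(self, **labels):
        """Fill alert labels into the query templates."""
        return [q.format(**labels) for q in self.log_queries]


CHECKOUT_ERRORS = Runbook(
    alert="checkout error rate > 5% for 2 minutes",
    dashboard="/dashboards/checkout",
    log_queries=(
        "service={service} AND level=error AND timestamp > {since}",
        "service={service} AND level=warn AND timestamp > {since}",
        "service=payments AND level=error AND timestamp > {since}",
    ),
)
```

Calling `CHECKOUT_ERRORS.render(service="checkout", since="2026-06-07T14:30:00Z")` yields three ready-to-paste queries, so the on-call engineer starts from a query instead of a blank search box.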

The monitoring layer that catches everything else failing

Your logging pipeline is infrastructure. Your tracing backend is infrastructure. Your metrics database is infrastructure. All of it can fail — and when it does, the irony is that you lose visibility precisely when you need it most.

External uptime monitoring is the safety net. A check that hits your Elasticsearch health endpoint every 30 seconds, a check that verifies your Prometheus is scraping targets, a check that confirms your OTel Collector is accepting spans — these are the monitors that prevent the "we lost 6 hours of logs and nobody noticed" incident.
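A bare-bones version of such checks might look like the sketch below, which uses only the Python standard library. The hostnames are hypothetical. The paths are the commonly documented ones: `/_cluster/health` for Elasticsearch, `/-/healthy` for Prometheus, and port 13133 for the OTel Collector, assuming its `health_check` extension is enabled with default settings.

```python
import urllib.request
from urllib.error import URLError

# Hypothetical hostnames; adjust to your own environment.
CHECKS = {
    "elasticsearch": "http://es.internal:9200/_cluster/health",
    "prometheus": "http://prometheus.internal:9090/-/healthy",
    "otel-collector": "http://otel.internal:13133/",
}


def http_ok(url, timeout=5.0):
    """True if the endpoint answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (URLError, TimeoutError, OSError):
        return False


def run_checks(checks, probe=http_ok):
    """Return the names of components whose health check failed."""
    return sorted(name for name, url in checks.items() if not probe(url))
```

A 2xx response only proves the process is reachable. Verifying that Prometheus is actually scraping targets, or that Elasticsearch is green and ingesting, needs a check that inspects the response body, and the script itself has to run somewhere outside the infrastructure it watches.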

Set up your first monitor in 60 seconds at app.devhelm.io. Start with your most critical endpoint, then add checks for every piece of your observability stack. The thing that monitors everything else should itself be monitored by something outside your infrastructure.

