Young Gao

The Three Pillars of Observability: Logs, Metrics, and Traces Explained (2026)

Your logs say the request succeeded. Your users say it didn't. Sound familiar?

Logs alone are like having security cameras but no audio, no motion sensors, and no way to correlate footage across rooms. You need all three pillars working together.

Pillar 1: Structured Logs

Stop doing this:

```javascript
console.log(`User ${userId} placed order ${orderId}`);
```

Start doing this:

```javascript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

logger.info({
  event: 'order.placed',
  userId,
  orderId,
  amount: order.total,
  correlationId: req.headers['x-correlation-id'],
}, 'Order placed successfully');
```

The difference? The second version is queryable. When your Loki query looks like `{event="order.placed"} | json | amount > 500` (or the CloudWatch Logs Insights equivalent), you'll understand why structure matters.
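To make that concrete, here's a plain-JavaScript sketch of the same kind of filter applied to raw NDJSON log lines (the sample entries are made up for illustration):

```javascript
// Structured logs are just newline-delimited JSON, so filtering is mechanical.
const lines = [
  '{"event":"order.placed","amount":742,"orderId":"o-1"}',
  '{"event":"order.placed","amount":120,"orderId":"o-2"}',
  '{"event":"user.login","userId":"u-9"}',
];

const bigOrders = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.event === 'order.placed' && entry.amount > 500);

console.log(bigOrders.map((entry) => entry.orderId)); // → ['o-1']
```

Try doing that with `User 42 placed order 17` strings and you're writing regexes instead.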

Pino vs Winston: Pino is faster (it serializes to JSON natively, Winston transforms). For high-throughput services, Pino wins. For complex transport needs (writing to files, Slack, databases simultaneously), Winston's transport system is more flexible. Pick one and standardize across your org.

Aggregation Pipeline

Your logs need to flow somewhere useful:

```
App (Pino) → stdout → Fluentd/Vector → Loki/CloudWatch/Elasticsearch
```

Don't ship logs directly from your app to the aggregator. Write to stdout, let infrastructure handle routing. This keeps your app decoupled and your containers lightweight.

```javascript
// For local dev, make it readable
const logger = pino({
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' }
    : undefined,
});
```

Pillar 2: Metrics with Prometheus

Logs tell you what happened. Metrics tell you how things are going.

Three types you need to internalize:

```javascript
import { Counter, Histogram, Gauge, register } from 'prom-client';

// COUNTER: things that only go up
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// HISTOGRAM: distribution of values (latency, sizes)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// GAUGE: values that go up and down
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Currently active connections',
});
```

Wire it into your middleware:

```javascript
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    // req.route is only populated after routing, so read labels here,
    // not at the start of the middleware
    const route = req.route?.path || 'unknown';
    httpRequestsTotal.inc({
      method: req.method,
      route,
      status: res.statusCode,
    });
    end({ method: req.method, route });
  });

  next();
});

// Expose /metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

Choose your histogram buckets carefully. The defaults are almost never right. Think about what latency matters for your service. An API gateway needs sub-100ms buckets. A report generator might need 1-60s.
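One way to sanity-check a bucket layout is to replay representative latencies against it. This sketch mimics Prometheus's cumulative bucket counting (a simplified model for intuition, not the prom-client internals):

```javascript
// Simplified model of cumulative histogram buckets: each bucket counts
// every observation less than or equal to its upper bound (`le`).
function bucketCounts(observations, buckets) {
  return buckets.map((le) => observations.filter((v) => v <= le).length);
}

// Hypothetical gateway-style buckets with sub-100ms resolution.
const gatewayBuckets = [0.01, 0.025, 0.05, 0.1, 0.25];
const latencies = [0.008, 0.03, 0.04, 0.12]; // seconds

console.log(bucketCounts(latencies, gatewayBuckets));
// → [1, 1, 3, 3, 4]
```

If nearly everything lands in the first or last bucket, your bounds are wrong for this service and your p99 queries will be garbage.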

Pillar 3: Distributed Traces with OpenTelemetry

When a request touches 5 services, logs and metrics can't tell you where the slowdown happened. Traces can.

```javascript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'order-service',
});

sdk.start();
```

Auto-instrumentation catches HTTP calls, database queries, and queue operations automatically. For custom spans around business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.items_count', order.items.length);

      await validateInventory(order);
      await chargePayment(order);
      await sendConfirmation(order);

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

I covered the OpenTelemetry setup in more depth in my earlier article on distributed tracing — this is about how it fits into the bigger picture.

The Glue: Correlation IDs

The three pillars are useless in isolation. Correlation IDs tie them together.

```javascript
import { randomUUID } from 'crypto';
import { context, trace } from '@opentelemetry/api';

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  const traceId = trace.getSpan(context.active())?.spanContext().traceId;

  // Attach to request
  req.correlationId = correlationId;

  // Attach to response
  res.setHeader('x-correlation-id', correlationId);

  // Bind to logger for every subsequent log in this request
  req.log = logger.child({ correlationId, traceId });

  next();
}
```
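Conceptually, `logger.child` just merges the bound fields into every subsequent entry. A stripped-down model (a toy, not the pino implementation):

```javascript
// Toy model of a child logger: bindings are merged into every log entry,
// so the request context rides along without being repeated at call sites.
const makeChild = (bindings) => ({
  info: (fields, msg) => JSON.stringify({ ...bindings, ...fields, msg }),
});

const reqLog = makeChild({ correlationId: 'abc-123', traceId: '4bf92f3577b34da6' });
console.log(reqLog.info({ event: 'order.fetch' }, 'Fetching order'));
// Every entry now carries correlationId and traceId automatically.
```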

Now when a user reports an issue, you take their correlation ID and:

  1. Logs: `{correlationId="abc-123"}` → see every log line for that request
  2. Traces: search by the same trace ID → see the full request waterfall
  3. Metrics: you already know the route and status, check if it's a pattern or a one-off

This is observability. Not just collecting data — connecting it.

Cost-Effective Observability

Observability gets expensive fast. Here's how to keep costs sane.

Sampling traces: You don't need 100% of traces. Head-based ratio sampling in the SDK is the simplest starting point; if you want to keep all error traces and only a percentage of successful ones, that requires tail-based sampling in the OpenTelemetry Collector. The SDK version:

```javascript
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // 10% of traces
});
```
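For intuition: ratio sampling boils down to a deterministic decision derived from the trace ID, so every service handling the same request agrees on keep-or-drop. A conceptual sketch (the real `TraceIdRatioBasedSampler` derives the fraction from a different portion of the ID, but the idea is the same):

```javascript
// Conceptual ratio sampler: map the trace ID to a fraction in [0, 1)
// and keep the trace when it falls below the sampling ratio.
function shouldSample(traceId, ratio) {
  const fraction = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return fraction < ratio;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.5)); // → true (≈0.297 < 0.5)
console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.1)); // → false
```

Because the decision is a pure function of the trace ID, no coordination between services is needed.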

Log levels in production: Run at info level. Use debug only when actively investigating. One service logging at debug can generate more data than the rest of your fleet combined.

Retention policies: Not all data ages the same.

| Data | Hot (queryable) | Cold (archive) |
| --- | --- | --- |
| Logs | 7-14 days | 90 days |
| Metrics | 30 days (full res) | 1 year (downsampled) |
| Traces | 7 days | 30 days |

Cardinality control: Never use user IDs, request IDs, or timestamps as metric labels. A single high-cardinality label can blow up your Prometheus storage. Stick to bounded values: HTTP methods, status code classes, service names.
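The math behind that warning: your time-series count is the product of the cardinalities of every label. A quick sketch with made-up numbers:

```javascript
// Series count = product of each label's distinct-value count.
const seriesCount = (labels) =>
  Object.values(labels).reduce((n, values) => n * values.length, 1);

// Bounded labels stay manageable:
console.log(seriesCount({
  method: ['GET', 'POST', 'PUT', 'DELETE'],              // 4 values
  route: Array.from({ length: 50 }, (_, i) => `r${i}`),  // 50 values
  status: ['2xx', '4xx', '5xx'],                         // 3 values
}));
// → 600 series

// Now multiply that by a userId label with 100k distinct users:
// 600 * 100,000 = 60,000,000 series. That's what kills Prometheus.
```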

Common Mistakes

1. Logging sensitive data. PII in logs is a compliance nightmare. Redact by default:

```javascript
const logger = pino({
  redact: ['req.headers.authorization', 'user.email', 'body.password'],
});
```

2. Alert fatigue. If everything is critical, nothing is. Alert on symptoms (error rate > 1%, p99 latency > 2s), not causes. Let dashboards show the causes.

3. Missing the "O" in observability. Collecting data isn't observability. Can you answer novel questions about your system without deploying new code? If not, you're just monitoring.

4. Instrumenting after the incident. You won't have time to add tracing when production is on fire. Instrument from day one. The OpenTelemetry auto-instrumentation setup above takes 10 minutes.

5. Ignoring the bill. I've seen teams spend more on log storage than on their actual infrastructure. Set retention policies and review costs monthly.


Part of my Production Backend Patterns series. Follow for more practical backend engineering.




---

If this was useful, consider:
- [Sponsoring on GitHub](https://github.com/sponsors/NoPKT) to support more open-source tools
- [Buying me a coffee on Ko-fi](https://ko-fi.com/gps949)
