Young Gao

The Three Pillars of Observability: Logs, Metrics, and Traces Explained (2026)

Your logs say the request succeeded. Your users say it didn't. Sound familiar?

Logs alone are like having security cameras but no audio, no motion sensors, and no way to correlate footage across rooms. You need all three pillars working together.

Pillar 1: Structured Logs

Stop doing this:

```javascript
console.log(`User ${userId} placed order ${orderId}`);
```

Start doing this:

```javascript
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

logger.info({
  event: 'order.placed',
  userId,
  orderId,
  amount: order.total,
  correlationId: req.headers['x-correlation-id'],
}, 'Order placed successfully');
```

The difference? The second version is queryable. When your Loki query looks like `{event="order.placed"} | json | amount > 500` (or the CloudWatch Logs Insights equivalent), you'll understand why structure matters.
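To make that concrete, here's a plain-JavaScript sketch of the same kind of filter applied to raw NDJSON log lines (the sample entries are made up for illustration):

```javascript
// Structured logs are just newline-delimited JSON, so filtering is mechanical.
const lines = [
  '{"event":"order.placed","amount":742,"orderId":"o-1"}',
  '{"event":"order.placed","amount":120,"orderId":"o-2"}',
  '{"event":"user.login","userId":"u-9"}',
];

const bigOrders = lines
  .map((line) => JSON.parse(line))
  .filter((entry) => entry.event === 'order.placed' && entry.amount > 500);

console.log(bigOrders.map((entry) => entry.orderId)); // → ['o-1']
```

Try doing that with `User 42 placed order 17` strings and you're writing regexes instead.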

Pino vs Winston: Pino is faster (it serializes to JSON natively, Winston transforms). For high-throughput services, Pino wins. For complex transport needs (writing to files, Slack, databases simultaneously), Winston's transport system is more flexible. Pick one and standardize across your org.

Aggregation Pipeline

Your logs need to flow somewhere useful:

```
App (Pino) → stdout → Fluentd/Vector → Loki/CloudWatch/Elasticsearch
```

Don't ship logs directly from your app to the aggregator. Write to stdout, let infrastructure handle routing. This keeps your app decoupled and your containers lightweight.

```javascript
// For local dev, make it readable
const logger = pino({
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty' }
    : undefined,
});
```

Pillar 2: Metrics with Prometheus

Logs tell you what happened. Metrics tell you how things are going.

Three types you need to internalize:

```javascript
import { Counter, Histogram, Gauge, register } from 'prom-client';

// COUNTER: things that only go up
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// HISTOGRAM: distribution of values (latency, sizes)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Request duration in seconds',
  labelNames: ['method', 'route'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});

// GAUGE: values that go up and down
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Currently active connections',
});
```

Wire it into your middleware:

```javascript
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    // req.route is only populated after routing, so read labels here,
    // not at the start of the middleware
    const route = req.route?.path || 'unknown';
    httpRequestsTotal.inc({
      method: req.method,
      route,
      status: res.statusCode,
    });
    end({ method: req.method, route });
  });

  next();
});

// Expose /metrics endpoint for Prometheus scraping
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

Choose your histogram buckets carefully. The defaults are almost never right. Think about what latency matters for your service. An API gateway needs sub-100ms buckets. A report generator might need 1-60s.
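One way to sanity-check a bucket layout is to replay representative latencies against it. This sketch mimics Prometheus's cumulative bucket counting (a simplified model for intuition, not the prom-client internals):

```javascript
// Simplified model of cumulative histogram buckets: each bucket counts
// every observation less than or equal to its upper bound (`le`).
function bucketCounts(observations, buckets) {
  return buckets.map((le) => observations.filter((v) => v <= le).length);
}

// Hypothetical gateway-style buckets with sub-100ms resolution.
const gatewayBuckets = [0.01, 0.025, 0.05, 0.1, 0.25];
const latencies = [0.008, 0.03, 0.04, 0.12]; // seconds

console.log(bucketCounts(latencies, gatewayBuckets));
// → [1, 1, 3, 3, 4]
```

If nearly everything lands in the first or last bucket, your bounds are wrong for this service and your p99 queries will be garbage.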

Pillar 3: Distributed Traces with OpenTelemetry

When a request touches 5 services, logs and metrics can't tell you where the slowdown happened. Traces can.

```javascript
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'order-service',
});

sdk.start();
```

Auto-instrumentation catches HTTP calls, database queries, and queue operations automatically. For custom spans around business logic:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('order-service');

async function processOrder(order: Order) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.id);
      span.setAttribute('order.items_count', order.items.length);

      await validateInventory(order);
      await chargePayment(order);
      await sendConfirmation(order);

      span.setStatus({ code: SpanStatusCode.OK });
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (err as Error).message });
      throw err;
    } finally {
      span.end();
    }
  });
}
```

I covered the OpenTelemetry setup in more depth in my earlier article on distributed tracing — this is about how it fits into the bigger picture.

The Glue: Correlation IDs

The three pillars are useless in isolation. Correlation IDs tie them together.

```javascript
import { randomUUID } from 'crypto';
import { context, trace } from '@opentelemetry/api';

function correlationMiddleware(req, res, next) {
  const correlationId = req.headers['x-correlation-id'] || randomUUID();
  const traceId = trace.getSpan(context.active())?.spanContext().traceId;

  // Attach to request
  req.correlationId = correlationId;

  // Attach to response
  res.setHeader('x-correlation-id', correlationId);

  // Bind to logger for every subsequent log in this request
  req.log = logger.child({ correlationId, traceId });

  next();
}
```
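Conceptually, `logger.child` just merges the bound fields into every subsequent entry. A stripped-down model (a toy, not the pino implementation):

```javascript
// Toy model of a child logger: bindings are merged into every log entry,
// so the request context rides along without being repeated at call sites.
const makeChild = (bindings) => ({
  info: (fields, msg) => JSON.stringify({ ...bindings, ...fields, msg }),
});

const reqLog = makeChild({ correlationId: 'abc-123', traceId: '4bf92f3577b34da6' });
console.log(reqLog.info({ event: 'order.fetch' }, 'Fetching order'));
// Every entry now carries correlationId and traceId automatically.
```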

Now when a user reports an issue, you take their correlation ID and:

  1. Logs: `{correlationId="abc-123"}` → see every log line for that request
  2. Traces: search by the same trace ID → see the full request waterfall
  3. Metrics: you already know the route and status, check if it's a pattern or a one-off

This is observability. Not just collecting data — connecting it.

Cost-Effective Observability

Observability gets expensive fast. Here's how to keep costs sane.

Sampling traces: You don't need 100% of traces. Head-based ratio sampling in the SDK is the simplest starting point; if you want to keep all error traces and only a percentage of successful ones, that requires tail-based sampling in the OpenTelemetry Collector. The SDK version:

```javascript
import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base';

const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1), // 10% of traces
});
```
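For intuition: ratio sampling boils down to a deterministic decision derived from the trace ID, so every service handling the same request agrees on keep-or-drop. A conceptual sketch (the real `TraceIdRatioBasedSampler` derives the fraction from a different portion of the ID, but the idea is the same):

```javascript
// Conceptual ratio sampler: map the trace ID to a fraction in [0, 1)
// and keep the trace when it falls below the sampling ratio.
function shouldSample(traceId, ratio) {
  const fraction = parseInt(traceId.slice(0, 8), 16) / 0x100000000;
  return fraction < ratio;
}

console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.5)); // → true (≈0.297 < 0.5)
console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 0.1)); // → false
```

Because the decision is a pure function of the trace ID, no coordination between services is needed.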

Log levels in production: Run at info level. Use debug only when actively investigating. One service logging at debug can generate more data than the rest of your fleet combined.

Retention policies: Not all data ages the same.

| Data | Hot (queryable) | Cold (archive) |
| --- | --- | --- |
| Logs | 7-14 days | 90 days |
| Metrics | 30 days (full res) | 1 year (downsampled) |
| Traces | 7 days | 30 days |

Cardinality control: Never use user IDs, request IDs, or timestamps as metric labels. A single high-cardinality label can blow up your Prometheus storage. Stick to bounded values: HTTP methods, status code classes, service names.
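The math behind that warning: your time-series count is the product of the cardinalities of every label. A quick sketch with made-up numbers:

```javascript
// Series count = product of each label's distinct-value count.
const seriesCount = (labels) =>
  Object.values(labels).reduce((n, values) => n * values.length, 1);

// Bounded labels stay manageable:
console.log(seriesCount({
  method: ['GET', 'POST', 'PUT', 'DELETE'],              // 4 values
  route: Array.from({ length: 50 }, (_, i) => `r${i}`),  // 50 values
  status: ['2xx', '4xx', '5xx'],                         // 3 values
}));
// → 600 series

// Now multiply that by a userId label with 100k distinct users:
// 600 * 100,000 = 60,000,000 series. That's what kills Prometheus.
```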

Common Mistakes

1. Logging sensitive data. PII in logs is a compliance nightmare. Redact by default:

```javascript
const logger = pino({
  redact: ['req.headers.authorization', 'user.email', 'body.password'],
});
```

2. Alert fatigue. If everything is critical, nothing is. Alert on symptoms (error rate > 1%, p99 latency > 2s), not causes. Let dashboards show the causes.

3. Missing the "O" in observability. Collecting data isn't observability. Can you answer novel questions about your system without deploying new code? If not, you're just monitoring.

4. Instrumenting after the incident. You won't have time to add tracing when production is on fire. Instrument from day one. The OpenTelemetry auto-instrumentation setup above takes 10 minutes.

5. Ignoring the bill. I've seen teams spend more on log storage than on their actual infrastructure. Set retention policies and review costs monthly.


Part of my Production Backend Patterns series. Follow for more practical backend engineering.




---

If this was useful, consider:
- [Sponsoring on GitHub](https://github.com/sponsors/NoPKT) to support more open-source tools
- [Buying me a coffee on Ko-fi](https://ko-fi.com/gps949)
