DEV Community

JEFFERSON ROSAS CHAMBILLA

Observability Practices in Action: Instrumenting a Node.js API with Metrics, Logs, and Traces

It's 3 AM. An alert fires. Checkout latency spiked, and orders are failing intermittently. You SSH into a server, grep through log files, and hope you find something before your coffee gets cold — or before customers start noticing.

This scenario is exactly what observability was built to prevent. Not by adding more dashboards for the sake of having dashboards, but by giving engineers the ability to ask new questions about their system without shipping new code every time something unexpected breaks.

In this article, I'll walk through what observability actually means, why it's different from traditional monitoring, and then get hands-on: instrumenting a real Node.js API with metrics (Prometheus), structured logs, and distributed traces (OpenTelemetry), visualized in Grafana. By the end, you'll have a working local stack you can run yourself.

Monitoring vs. Observability: a real distinction, not a buzzword swap

Monitoring answers questions you already knew to ask. You define a threshold ("alert me if error rate > 5%"), and the system tells you when that threshold is crossed. It's reactive and works great for known failure modes.

Observability is the property of a system that lets you answer questions you didn't anticipate when you wrote the code — "why did only requests from mobile clients in the EU region get slow after the last deploy?" — by exploring the data your system emits, not by adding new instrumentation on the fly.

The difference matters because modern systems (microservices, serverless, distributed databases) fail in combinatorial ways nobody can fully predict in advance. You can't write an alert for every possible failure mode. You can build a system that emits rich enough telemetry that, when something weird happens, you can drill down and find the answer.
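To make the distinction concrete, here is a minimal sketch that contrasts a predefined threshold check with an ad hoc slice over request events. The event shape, field names, and both helper functions are hypothetical and exist only for illustration; a real system would run the second kind of query against a telemetry backend, not an in-memory array.

```javascript
// Hypothetical request events; the field names are illustrative only.
const events = [
  { client: 'mobile', region: 'eu', version: 'v42', latencyMs: 1900, ok: true },
  { client: 'mobile', region: 'eu', version: 'v42', latencyMs: 2100, ok: true },
  { client: 'web', region: 'eu', version: 'v42', latencyMs: 120, ok: true },
  { client: 'mobile', region: 'us', version: 'v42', latencyMs: 140, ok: true },
  { client: 'web', region: 'us', version: 'v41', latencyMs: 110, ok: true },
];

// Monitoring: one question decided in advance, with a fixed threshold.
function errorRateAlert(evts, threshold = 0.05) {
  const errors = evts.filter((e) => !e.ok).length;
  return errors / evts.length > threshold;
}

// Observability: group by any combination of fields chosen at query time.
function meanLatencyBy(evts, fields) {
  const groups = new Map();
  for (const e of evts) {
    const key = fields.map((f) => `${f}=${e[f]}`).join(',');
    const g = groups.get(key) || { sum: 0, count: 0 };
    g.sum += e.latencyMs;
    g.count += 1;
    groups.set(key, g);
  }
  return Object.fromEntries([...groups].map(([key, g]) => [key, g.sum / g.count]));
}

const alertFired = errorRateAlert(events);
const slices = meanLatencyBy(events, ['client', 'region']);
console.log({ alertFired, slices });
```

In this toy data the error-rate alert stays silent because nothing failed, yet slicing by client and region shows that mobile traffic in the EU is far slower than everything else. The alert answered the question it was built for; the slice answered one nobody had planned for.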

The Three Pillars of Observability

  1. Metrics — numeric measurements aggregated over time (request latency, error rate, queue depth, CPU usage). Cheap to store, great for trends, dashboards, and alerting thresholds. Bad at explaining why something happened for one specific request.

  2. Logs — discrete, timestamped events with context ("Order 4821 failed: payment gateway timeout after 3000ms"). Expensive to store and search at scale, but rich in detail for a single event.

  3. Traces — the end-to-end journey of a single request as it moves through multiple services, showing exactly where time was spent and where it failed. Essential once you have more than one service talking to another.

None of these pillars alone gives you the full picture. The real value comes from correlating them — going from "latency spiked" (metric) → "here are the specific failing requests" (trace) → "here's the exact error and stack context" (log), all connected by a shared identifier like a trace ID.
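The metric, then trace, then log chain can be sketched with plain in-memory data. Everything below (the record shapes, the `explainSpike` helper, the 1000 ms "slow" cutoff) is an assumption made for illustration, not the API of any particular tool.

```javascript
// Hypothetical telemetry, shaped loosely like what each pillar emits.
const metricSample = { name: 'p95_latency_seconds', value: 1.8, windowStart: 1000, windowEnd: 1060 };

const traces = [
  { traceId: 'a1', startedAt: 1005, durationMs: 120 },
  { traceId: 'b2', startedAt: 1030, durationMs: 2400 },
  { traceId: 'c3', startedAt: 1200, durationMs: 3000 }, // outside the spike window
];

const logs = [
  { traceId: 'a1', level: 'info', msg: 'order confirmed' },
  { traceId: 'b2', level: 'error', msg: 'payment gateway timeout' },
];

// Step 1: the metric gives a time window. Step 2: find slow traces inside it.
// Step 3: join logs to those traces through the shared trace ID.
function explainSpike(metric, allTraces, allLogs, slowMs = 1000) {
  const slow = allTraces.filter(
    (t) => t.startedAt >= metric.windowStart
      && t.startedAt <= metric.windowEnd
      && t.durationMs >= slowMs,
  );
  return slow.map((t) => ({
    traceId: t.traceId,
    durationMs: t.durationMs,
    logs: allLogs.filter((l) => l.traceId === t.traceId).map((l) => l.msg),
  }));
}

const findings = explainSpike(metricSample, traces, logs);
console.log(JSON.stringify(findings));
```

The join in step 3 is only possible because both the trace and the log record carry the same `traceId`; without it, the three data sets would have to be matched by timestamp and guesswork.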

Hands-on: instrumenting a real API

We'll build a small Express-based "orders" API and instrument all three pillars:

  • Metrics with prom-client, scraped by Prometheus, visualized in Grafana.
  • Structured logs with pino, correlated to the request via a trace ID.
  • Distributed traces with the OpenTelemetry SDK, exported to a local collector.

Project setup

mkdir observability-demo && cd observability-demo
npm init -y
npm install express prom-client pino pino-http
npm install @opentelemetry/api @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http

1. Metrics layer (metrics.js)

const client = require('prom-client');

const register = new client.Registry();
client.collectDefaultMetrics({ register }); // process CPU, memory, event loop lag, GC

const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

const httpErrorsTotal = new client.Counter({
  name: 'http_errors_total',
  help: 'Total count of HTTP 5xx errors',
  labelNames: ['route'],
});

const inFlightRequests = new client.Gauge({
  name: 'http_requests_in_flight',
  help: 'Number of HTTP requests currently being processed',
});

register.registerMetric(httpRequestDuration);
register.registerMetric(httpErrorsTotal);
register.registerMetric(inFlightRequests);

module.exports = { register, httpRequestDuration, httpErrorsTotal, inFlightRequests };

Notice the three metric types being used deliberately:

  • Histogram for latency, because we care about the distribution (p50, p95, p99), not just the average.
  • Counter for errors, because it only ever goes up and Prometheus can compute a rate from it.
  • Gauge for in-flight requests, because it can go up and down — useful for spotting saturation.
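To see why a histogram can answer percentile questions, here is a dependency-free sketch of cumulative buckets and a quantile estimate that interpolates linearly inside a bucket. It uses the same bucket bounds as metrics.js, but the sample latencies and both functions are illustrative assumptions; it mirrors the idea behind Prometheus's histogram_quantile rather than reproducing its exact implementation.

```javascript
// Same upper bounds as the histogram in metrics.js, plus the implicit +Inf bucket.
const bounds = [0.05, 0.1, 0.3, 0.5, 1, 2, 5, Infinity];

// Each bucket counts observations <= its upper bound (cumulative, like the `le` label).
function observeAll(values) {
  const counts = new Array(bounds.length).fill(0);
  for (const v of values) {
    bounds.forEach((le, i) => {
      if (v <= le) counts[i] += 1;
    });
  }
  return counts;
}

// Estimate a quantile (0 < q <= 1) by interpolating inside the bucket holding the target rank.
function estimateQuantile(q, counts) {
  const total = counts[counts.length - 1];
  const rank = q * total;
  const i = counts.findIndex((c) => c >= rank);
  if (bounds[i] === Infinity) return bounds[i - 1]; // fall back to the highest finite bound
  const lower = i === 0 ? 0 : bounds[i - 1];
  const below = i === 0 ? 0 : counts[i - 1];
  return lower + (bounds[i] - lower) * ((rank - below) / (counts[i] - below));
}

// Hypothetical request latencies in seconds.
const latencies = [0.02, 0.04, 0.08, 0.2, 0.2, 0.25, 0.4, 0.4, 0.7, 1.5];
const counts = observeAll(latencies);
const p50 = estimateQuantile(0.5, counts);
const p95 = estimateQuantile(0.95, counts);
console.log({ counts, p50, p95 });
```

Note that the estimate can only be as precise as the bucket boundaries: the p95 here lands somewhere between 1 and 2 seconds because that is the bucket the 95th-percentile rank falls into. Choosing buckets close to your latency targets keeps that error small.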

2. Structured logging (logger.js)

const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => ({ level: label }),
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

module.exports = logger;

Structured (JSON) logs matter because they're machine-parseable. A log line like logger.error({ orderId, traceId, err }, 'payment failed') can be indexed and queried in tools like the ELK stack, Datadog, or Grafana Loki — versus a plain string that only humans can meaningfully search.
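As a rough illustration of what "machine-parseable" buys you, the sketch below treats log output as newline-delimited JSON and queries it by field. The log lines are made up (shaped like the logger above would emit them), and `queryLogs` is a hypothetical helper, not part of pino or any log backend.

```javascript
// Hypothetical log output: one JSON object per line.
const rawLogs = [
  '{"level":"info","time":"2024-01-01T03:00:01.000Z","orderId":"4820","traceId":"a1","msg":"order confirmed"}',
  '{"level":"error","time":"2024-01-01T03:00:02.000Z","orderId":"4821","traceId":"b2","msg":"payment gateway timeout"}',
  '{"level":"error","time":"2024-01-01T03:00:03.000Z","orderId":"4822","traceId":"c3","msg":"payment gateway timeout"}',
].join('\n');

// Parse every non-empty line and keep the entries matching the predicate.
function queryLogs(text, predicate) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line))
    .filter(predicate);
}

// "Show me the error for this one trace."
const failedForTrace = queryLogs(rawLogs, (e) => e.level === 'error' && e.traceId === 'b2');

// "Which orders failed?"
const failedOrderIds = queryLogs(rawLogs, (e) => e.level === 'error').map((e) => e.orderId);

console.log(failedForTrace, failedOrderIds);
```

With free-text logs, both questions would need a regular expression tuned to one exact message format. With structured fields they are simple equality filters, which is what log backends index and optimize for.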

3. Distributed tracing (tracing.js)

const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } = require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
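  traceExporter: new OTLPTraceExporter({
    // Defaults to a collector on localhost. Inside Docker Compose, set the env var
    // to the collector's service name, e.g. http://otel-collector:4318/v1/traces
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://localhost:4318/v1/traces',
  }),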
  instrumentations: [getNodeAutoInstrumentations()],
  serviceName: 'orders-api',
});

sdk.start();

module.exports = sdk;

This file must be required before Express (and before any other module you want instrumented), because OpenTelemetry's auto-instrumentation patches modules such as http and express as they are loaded. Only then can it automatically capture spans for every request and downstream call.
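Why load order matters can be shown with a simplified analogy. The sketch below is not how OpenTelemetry is implemented internally; `fakeHttp`, `instrument`, and the span shape are all hypothetical stand-ins. It only demonstrates the general problem with wrapping functions at runtime: code that grabbed a reference before the wrapper was installed never goes through it.

```javascript
const spans = [];

// Stand-in for a library module such as `http`; not a real OpenTelemetry API.
const fakeHttp = {
  request(url) {
    return `response from ${url}`;
  },
};

// A consumer that captured the function BEFORE instrumentation was installed.
const earlyRequest = fakeHttp.request;

// Hypothetical instrumentation: wrap the function so every call records a span.
function instrument(lib) {
  const original = lib.request;
  lib.request = function patchedRequest(url) {
    const started = Date.now();
    try {
      return original.call(this, url);
    } finally {
      spans.push({ name: `GET ${url}`, durationMs: Date.now() - started });
    }
  };
}

instrument(fakeHttp);

// A consumer that looks the function up AFTER instrumentation.
const lateRequest = fakeHttp.request;

earlyRequest('/orders/1'); // bypasses the wrapper, so no span is recorded
lateRequest('/orders/2'); // goes through the wrapper, so one span is recorded

console.log(spans.map((s) => s.name));
```

Putting require('./tracing') on the first line of app.js is one way to guarantee the order. Another option is to preload it from the command line with node --require ./tracing.js app.js, which keeps the tracing setup out of the application code.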

4. Wiring it all together (app.js)

require('./tracing'); // must be first
const express = require('express');
const pinoHttp = require('pino-http');
const { trace } = require('@opentelemetry/api');

const logger = require('./logger');
const { register, httpRequestDuration, httpErrorsTotal, inFlightRequests } = require('./metrics');

const app = express();

// Attach a request-scoped logger, and pull the active trace ID for correlation
app.use(pinoHttp({
  logger,
  customProps: (req) => {
    const span = trace.getActiveSpan();
    return span ? { traceId: span.spanContext().traceId } : {};
  },
}));

// Metrics middleware
app.use((req, res, next) => {
  inFlightRequests.inc();
  const end = httpRequestDuration.startTimer();
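  res.on('finish', () => {
    // Label with the matched route template (e.g. /orders/:id), not the raw path,
    // so that every order ID does not become its own time series.
    const route = req.route ? req.route.path : 'unmatched';
    end({ method: req.method, route, status_code: res.statusCode });
    inFlightRequests.dec();
    if (res.statusCode >= 500) {
      httpErrorsTotal.inc({ route });
    }
  });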
  next();
});

app.get('/orders/:id', async (req, res) => {
  const delay = Math.random() * 800;

  await new Promise((resolve) => setTimeout(resolve, delay));

  if (Math.random() < 0.05) {
    req.log.error({ orderId: req.params.id }, 'payment gateway timeout');
    return res.status(500).json({ error: 'Order service unavailable' });
  }

  req.log.info({ orderId: req.params.id, latencyMs: delay }, 'order confirmed');
  res.json({ id: req.params.id, status: 'confirmed' });
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => logger.info('API listening on port 3000'));

Every request now produces:

  • a span in a trace, showing how long it took and where,
  • a structured log line tagged with that same trace ID,
  • an update to the histogram, counter, and gauge metrics that Prometheus scrapes.

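If p95 latency spikes in Grafana, you can jump to the traces from that time window, find the slow ones, grab their trace IDs, and pull the exact log lines that explain what happened, all without guessing. Note that the local stack in the next section only runs a collector, so browsing traces this way also requires a trace backend (Jaeger or Grafana Tempo, for example) behind it.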

5. Running the stack locally

# docker-compose.yml
version: '3.8'
services:
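  api:
    build: .
    environment:
      # tracing.js has to export to this URL rather than localhost: inside the
      # Compose network the collector is reachable by its service name.
      - OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://otel-collector:4318/v1/traces
    ports:
      - "3000:3000"
    depends_on:
      - otel-collector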

  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  otel-collector:
    image: otel/opentelemetry-collector
    volumes:
      - ./otel-collector-config.yml:/etc/otel-collector-config.yml
    command: ["--config=/etc/otel-collector-config.yml"]
    ports:
      - "4318:4318"

  grafana:
    image: grafana/grafana
    ports:
      - "3001:3000"
    depends_on:
      - prometheus

# prometheus.yml
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'orders-api'
    static_configs:
      - targets: ['api:3000']

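The Compose file also expects a Dockerfile for the API and an otel-collector-config.yml next to it; neither is shown here. With those in place, run docker compose up, send a couple hundred requests across different order IDs (for i in {1..200}; do curl localhost:3000/orders/$i; done), then open Grafana on localhost:3001, add Prometheus as a data source (from inside the Compose network its URL is http://prometheus:9090), and build a dashboard with: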

  • p95 latency: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  • Error rate: sum(rate(http_errors_total[5m])) / sum(rate(http_request_duration_seconds_count[5m]))
  • Throughput: sum(rate(http_request_duration_seconds_count[5m]))
  • Saturation: http_requests_in_flight

These four signals map directly to Google's well-known "Four Golden Signals" (latency, traffic, errors, saturation) — a solid starting point for any service dashboard.
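If rate() over a counter feels opaque, the following sketch shows the core idea: add up the increases between consecutive scrapes, treat any drop as a counter reset, and divide by the elapsed time. The samples are invented and `simpleRate` is a deliberately simplified stand-in; the real Prometheus function also extrapolates to the edges of the query window.

```javascript
// Hypothetical scrapes of http_errors_total as [unixSeconds, counterValue].
const samples = [
  [0, 10],
  [5, 12],
  [10, 15],
  [15, 2], // the process restarted, so the counter began again from zero
  [20, 4],
];

// Per-second increase across the window, treating any drop as a counter reset.
function simpleRate(points) {
  let increase = 0;
  for (let i = 1; i < points.length; i += 1) {
    const prev = points[i - 1][1];
    const curr = points[i][1];
    increase += curr >= prev ? curr - prev : curr;
  }
  const seconds = points[points.length - 1][0] - points[0][0];
  return increase / seconds;
}

const errorsPerSecond = simpleRate(samples);
console.log(errorsPerSecond);
```

This is also why errors are modeled as a counter and not a gauge: because the value only ever increases, a drop can safely be read as a restart, and the rate stays meaningful across deploys.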

Practices that make observability actually useful

  • Instrument early, not after the first production incident. Retrofitting observability into a system under fire is painful and slow.
  • Watch label cardinality. Adding a user_id or order_id label to a Prometheus metric will silently blow up your storage and query performance — high-cardinality data belongs in logs or traces, not metric labels.
  • Correlate signals with a shared ID. A trace ID that flows through your logs, metrics exemplars, and traces is what turns three separate tools into one coherent debugging workflow.
  • Alert on symptoms, not causes. Define SLOs (e.g., "99% of requests under 300ms") and alert when you're burning your error budget — not on every CPU blip.
  • Logs are not free. At scale, verbose logging is one of the most expensive parts of an observability stack. Log what you'll actually need to debug, structured, at the right level.
  • Sampling matters for traces. In high-traffic systems, tracing every single request is often unnecessary and costly — tail-based sampling (keep traces that are slow or error out) gives you the signal without the storage bill.
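Two of these practices are easy to sketch in a few lines. The first half below counts how many distinct label values a raw path produces compared with a route template; the second half is a toy tail-based sampling decision. The `toTemplate` and `keepTrace` helpers, the 500 ms cutoff, and the 1% baseline are assumptions chosen for illustration, not settings from any real collector.

```javascript
// Hypothetical request paths seen by a metrics middleware.
const paths = Array.from({ length: 200 }, (_, i) => `/orders/${i + 1}`);

// Collapse numeric path segments into a placeholder, approximating a route template.
const toTemplate = (path) => path.replace(/\/\d+(?=\/|$)/g, '/:id');

// Every distinct label value becomes its own time series (per bucket, for a histogram).
const rawSeries = new Set(paths).size;
const templatedSeries = new Set(paths.map(toTemplate)).size;

// Tail-based sampling: the decision is made after the trace has finished,
// so it can depend on the outcome. Always keep errors and slow traces,
// and keep a small random baseline of everything else.
function keepTrace(t, { slowMs = 500, baseline = 0.01, random = Math.random } = {}) {
  if (t.hasError) return true;
  if (t.durationMs >= slowMs) return true;
  return random() < baseline;
}

console.log({ rawSeries, templatedSeries });
console.log(keepTrace({ hasError: false, durationMs: 900 }));
```

The `random` parameter is injectable only so the decision can be tested deterministically. In a real deployment this logic usually lives in the collector, not in the application, because the full trace has to be assembled before it can be judged.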

Beyond this stack

The concepts above translate directly to other platforms: managed services such as Datadog, New Relic, Dynatrace, AWS CloudWatch, and Azure Monitor / Log Analytics, as well as the self-hosted ELK stack (Elasticsearch, Logstash, Kibana). They all cover the same three pillars. The managed ones usually need less setup and offer more built-in correlation, at the cost of vendor lock-in and pricing that scales with data volume. Prometheus, Grafana, and OpenTelemetry were chosen here specifically because they're open source, vendor-neutral, and can be run entirely on your laptop for learning purposes.

Wrapping up

Observability isn't a product you install — it's a property you design into your system, one signal at a time. Metrics tell you something is wrong. Traces tell you where. Logs tell you why. Put together, and correlated through a shared identifier, they turn 3 AM incidents from hours of guesswork into minutes of investigation.


What's your team's observability stack of choice — Prometheus/Grafana, ELK, or a managed platform like Datadog? Let me know in the comments.
