Sergio Alberto Colque Ponce

Posted on Jul 2

Observability in Practice: Instrumenting a Node.js API with Prometheus and Grafana

#observability #prometheus #grafana #node

Introduction

Every backend engineer eventually reaches the same painful moment: production breaks, and nobody knows why. The logs are scattered across containers, the only metric available is "the server is up," and reproducing the issue locally is impossible. This is the gap that observability is meant to close.

Observability is often confused with monitoring, but they are not the same thing. Monitoring tells you that something is wrong (a dashboard turns red). Observability lets you ask new questions about your system without having to ship new code, because the system already exposes enough context to explain its own internal state from the outside.

Observability is usually built on three pillars:

Metrics: numeric, aggregatable data over time (request rate, error rate, latency, CPU usage).
Logs: discrete, timestamped events with context (an error trace, an audit event).
Traces: the path of a single request as it moves through multiple services.

In this article I'll focus on the metrics pillar, and walk through a real, working example: instrumenting a Node.js REST API with Prometheus and visualizing it with Grafana, running entirely on Docker.

Why This Matters

In distributed systems (microservices, SaaS platforms, multi-tenant apps), failures rarely look like a simple crash. They look like:

A slow database query that only appears under load.
An error rate that creeps up 5% after a deploy, but nobody notices until customers complain.
A memory leak that takes six hours to become visible.

Without instrumentation, these problems are invisible until they become incidents. With observability in place, you can define SLOs (Service Level Objectives), get alerted before customers do, and debug production issues using data instead of guesswork.

Hands-on Example: An Observable Express API

The stack for this example:

Node.js + Express: the API itself.
prom-client: the de facto standard Prometheus client library for Node.js (community-maintained; the Prometheus docs list it as a third-party client, not an official one).
Prometheus: scrapes and stores the metrics.
Grafana: visualizes the metrics in dashboards.
Docker Compose: runs the whole stack with one command.

1. Instrumenting the application

The key idea is exposing a /metrics endpoint that Prometheus can scrape. We track three kinds of metrics: a counter for total requests, a histogram for request duration (which lets us calculate percentiles like p95 latency), and a business metric — because observability isn't only about infrastructure, it should also reflect what the business cares about.

// server.js
const express = require('express');
const client = require('prom-client');

const app = express();
const PORT = process.env.PORT || 3000;

// Registry that holds all metrics
const register = new client.Registry();
client.collectDefaultMetrics({ register }); // CPU, memory, event loop lag, etc.

// Counter: total HTTP requests, labeled by method, route and status code
const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

// Histogram: request duration in seconds, used to compute percentiles
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 1.5, 2, 5],
});

// Business metric: orders processed by the system
const ordersProcessed = new client.Counter({
  name: 'orders_processed_total',
  help: 'Total number of orders processed',
  labelNames: ['status'],
});

register.registerMetric(httpRequestsTotal);
register.registerMetric(httpRequestDuration);
register.registerMetric(ordersProcessed);

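// Middleware: measure every request automatically
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    // Label with the matched route pattern (e.g. '/orders/:id'), not req.path.
    // Raw paths are unbounded: every ID or scanner probe would create a new series.
    const route = req.route ? req.route.path : 'unmatched';
    const labels = { method: req.method, route, status_code: res.statusCode };
    httpRequestsTotal.inc(labels);
    end(labels);
  });
  next();
});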

app.get('/', (req, res) => {
  res.json({ message: 'Observability demo API is running' });
});

// Simulated business endpoint with a realistic ~15% failure rate
app.post('/orders', (req, res) => {
  const success = Math.random() > 0.15;
  if (success) {
    ordersProcessed.inc({ status: 'success' });
    res.status(201).json({ status: 'created' });
  } else {
    ordersProcessed.inc({ status: 'failed' });
    res.status(500).json({ status: 'error' });
  }
});

// Prometheus scrapes this endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(PORT, () => console.log(`Server listening on port ${PORT}`));

The label design is deliberate: method, route, and status_code give us enough dimensionality to slice the data (e.g. "error rate for POST /orders") while staying bounded, as long as route holds a route pattern rather than a raw URL. A label like user_id would not be bounded: every distinct value creates a new time series, so memory usage grows with the number of users until Prometheus slows down or runs out of memory. This is one of the most common mistakes when people start instrumenting their own code.
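
To make the cardinality concern concrete, here is a rough back-of-the-envelope sketch. The label value counts are made-up numbers for illustration and estimateSeries is a hypothetical helper, not part of prom-client. The only fact it relies on is that each distinct combination of label values becomes its own time series.

```javascript
// Upper bound on the series a metric can create: the product of the number
// of distinct values each label can take. Real usage is often lower, since
// not every combination actually occurs.
function estimateSeries(labelValueCounts, seriesPerCombination = 1) {
  const combinations = Object.values(labelValueCounts)
    .reduce((product, count) => product * count, 1);
  return combinations * seriesPerCombination;
}

// Hypothetical numbers for a small API.
const bounded = { method: 4, route: 10, status_code: 6 };
// Same labels plus a per-user label with 50,000 distinct users.
const unbounded = { ...bounded, user_id: 50000 };

// A histogram with the 8 buckets used in server.js exports 8 bucket series,
// one +Inf bucket, plus _sum and _count: 11 series per label combination.
const HISTOGRAM_SERIES = 8 + 1 + 2;

console.log(estimateSeries(bounded));                     // 240
console.log(estimateSeries(bounded, HISTOGRAM_SERIES));   // 2640
console.log(estimateSeries(unbounded, HISTOGRAM_SERIES)); // 132000000
```

A few thousand series is trivial for Prometheus; the same histogram with a user_id label could in the worst case reach over a hundred million.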

2. Running the stack with Docker Compose

# docker-compose.yml
version: '3.8'

services:
  app:
    build: .
    container_name: observability-app
    ports:
      - "3000:3000"
    networks:
      - observability-net

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./alert_rules.yml:/etc/prometheus/alert_rules.yml
    ports:
      - "9090:9090"
    networks:
      - observability-net

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    networks:
      - observability-net

networks:
  observability-net:
    driver: bridge

# prometheus.yml
global:
  scrape_interval: 5s

rule_files:
  - "alert_rules.yml"

scrape_configs:
  - job_name: 'observability-app'
    static_configs:
      - targets: ['app:3000']

Running docker compose up -d brings up the API on port 3000, Prometheus on 9090, and Grafana on 3001.

3. Querying with PromQL

Once Prometheus is scraping the /metrics endpoint, we can answer real operational questions using PromQL. A few examples:

Request rate per second, per route:

sum by (route) (rate(http_requests_total[1m]))

Error rate (fraction of 5xx responses over the last 5 minutes, from 0 to 1; multiply by 100 for a percentage):

sum(rate(http_requests_total{status_code=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
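
If it helps to see what this ratio computes, here is a simplified sketch in plain JavaScript. It is not how Prometheus implements rate(): the real function also extrapolates to the edges of the window, which this version skips. The sample values and the names simpleRate and errorRatio are hypothetical.

```javascript
// Simplified per-second rate over a window of counter samples.
// Each sample is { t: seconds, v: counter value }.
function simpleRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter was reset (process restart): count from zero.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return seconds > 0 ? increase / seconds : 0;
}

// Ratio of the error rate to the total rate, guarding against division by zero.
function errorRatio(errorSamples, totalSamples) {
  const totalRate = simpleRate(totalSamples);
  return totalRate === 0 ? 0 : simpleRate(errorSamples) / totalRate;
}

// Hypothetical samples taken 5 minutes (300 s) apart.
const totalSamples = [{ t: 0, v: 1000 }, { t: 300, v: 1600 }]; // 600 requests
const errorSamples = [{ t: 0, v: 40 }, { t: 300, v: 130 }];    // 90 of them 5xx

console.log(simpleRate(totalSamples));               // 2 (requests per second)
console.log(errorRatio(errorSamples, totalSamples)); // 0.15
```

Because both sides are rates over the same window, the result is a unitless ratio, which is why the alert threshold later in the article is written as 0.1 rather than 10.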

p95 latency (an estimate of the value that 95% of requests are faster than, aggregated across all routes):

histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
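
The p95 value is an estimate, not an exact measurement: histogram_quantile only sees cumulative bucket counts and interpolates linearly inside the bucket where the quantile falls. The sketch below follows that general approach in plain JavaScript, ignoring edge cases the real function handles. The bucket counts are invented for illustration, and histogramQuantile is a hypothetical name.

```javascript
// buckets: cumulative counts sorted by upper bound `le`, ending with +Inf.
// Expects at least one finite bucket plus the +Inf bucket.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Hypothetical cumulative counts for the bucket bounds used in server.js.
const buckets = [
  { le: 0.05, count: 400 }, { le: 0.1, count: 700 }, { le: 0.3, count: 900 },
  { le: 0.5, count: 960 },  { le: 1, count: 990 },   { le: 1.5, count: 996 },
  { le: 2, count: 999 },    { le: 5, count: 1000 },  { le: Infinity, count: 1000 },
];

const p95 = histogramQuantile(0.95, buckets);
console.log(p95.toFixed(3)); // "0.467": somewhere in the 0.3 to 0.5 bucket
```

The practical consequence is that the estimate can only be as precise as the bucket it lands in, so it pays to place bucket boundaries near the latencies you care about (for example, around your SLO threshold).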

These three queries alone cover the classic "RED method" (Rate, Errors, Duration) used to monitor almost any service.
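
These queries do not have to be typed into the Prometheus UI. As a minimal sketch, the same PromQL can be sent to the Prometheus HTTP API (/api/v1/query) from Node.js. This assumes Prometheus is reachable at http://localhost:9090 as in the Compose file and that you are on Node.js 18+ for the global fetch; instantQuery is a hypothetical helper name.

```javascript
// Runs an instant PromQL query against the Prometheus HTTP API.
// fetchImpl is injectable so the function can be exercised without a server;
// by default it uses the global fetch available in Node.js 18+.
async function instantQuery(promql, baseUrl = 'http://localhost:9090', fetchImpl = fetch) {
  const url = `${baseUrl}/api/v1/query?query=${encodeURIComponent(promql)}`;
  const response = await fetchImpl(url);
  const body = await response.json();
  if (body.status !== 'success') {
    throw new Error(`Prometheus query failed: ${body.error}`);
  }
  // Instant vectors come back as [{ metric: {...}, value: [timestamp, "string"] }].
  return body.data.result.map(({ metric, value }) => ({
    labels: metric,
    value: Number(value[1]),
  }));
}

// Usage (requires the stack to be running):
// instantQuery('sum by (route) (rate(http_requests_total[1m]))').then(console.log);
```

Sample values arrive as strings in the JSON response, which is why the sketch converts them with Number().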

4. Visualizing in Grafana

After opening Grafana at http://localhost:3001 (user admin, password admin):

Add Prometheus as a data source, pointing to http://prometheus:9090.
Create a new dashboard.
Add panels using the PromQL queries above — one time series panel for request rate, one for error rate, one for p95 latency.

This gives a live view of the API's health without touching a single log line.

5. Alerting before customers notice

Metrics are most useful when they trigger alerts automatically. Here are two Prometheus alerting rules: one fires when the error rate stays above 10% for two minutes, the other when p95 latency stays above one second for two minutes:

# alert_rules.yml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "More than 10% of requests are failing."

      - alert: HighLatency
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High p95 latency"
          description: "p95 latency is above 1 second."
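
The for: 2m clause is what keeps a brief spike from paging anyone: the expression has to stay true across consecutive evaluations for the whole duration before the alert moves from pending to firing. Here is a simplified model of that behavior in plain JavaScript. The 30-second evaluation interval and the observed ratios are assumptions for illustration, and createAlert is a hypothetical helper.

```javascript
// Simplified model of an alerting rule with a `for` duration.
// States mirror Prometheus: inactive -> pending -> firing.
function createAlert(forSeconds) {
  let pendingSince = null;
  return function evaluate(nowSeconds, conditionTrue) {
    if (!conditionTrue) {
      pendingSince = null; // any false evaluation resets the timer
      return 'inactive';
    }
    if (pendingSince === null) pendingSince = nowSeconds;
    return nowSeconds - pendingSince >= forSeconds ? 'firing' : 'pending';
  };
}

const highErrorRate = createAlert(120); // for: 2m
const threshold = 0.1;

// Hypothetical error ratios, one per evaluation, 30 seconds apart.
const observations = [0.05, 0.12, 0.15, 0.08, 0.14, 0.16, 0.2, 0.18, 0.13];
const states = observations.map((ratio, i) => highErrorRate(i * 30, ratio > threshold));

console.log(states);
// [ 'inactive', 'pending', 'pending', 'inactive', 'pending',
//   'pending', 'pending', 'pending', 'firing' ]
```

The dip to 0.08 in the fourth evaluation resets the timer, so the alert only fires once the ratio has stayed above the threshold for a full two minutes.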

Lessons Learned

A few practical takeaways from building this:

Instrument early, not after the incident. Retrofitting observability into a system that's already on fire is much harder than adding it from day one.
Watch your label cardinality. Every distinct combination of label values is its own time series, so labels like user_id, email, or request_id on a metric make your Prometheus instance's memory usage grow without limit. Keep labels bounded (status codes, routes, methods, statuses).
Metrics without alerts are just pretty graphs. The real value comes from connecting metrics to alert rules tied to your SLOs.
Business metrics matter as much as infrastructure metrics. Knowing orders_processed_total{status="failed"} is spiking is often more actionable than knowing CPU usage went up.

Try It Yourself

The full working example — Express app, Dockerfile, Prometheus config, alerting rules, and a CI pipeline — is available on GitHub:

👉 https://github.com/srg-cp/observability-demo-prometheus-grafana

Clone it, run docker compose up -d, hit the /orders endpoint a few times with curl, and watch the metrics show up in Prometheus and Grafana in real time.
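
If you would rather generate traffic from Node.js than from curl, a small script along these lines should work. It is a sketch that assumes the API is listening on http://localhost:3000 as in the Compose file and that you are on Node.js 18+ for the global fetch; generateLoad is a hypothetical helper and is not part of the repository.

```javascript
// Sends `count` POST requests to /orders, one at a time, and tallies the
// responses by HTTP status code. fetchImpl is injectable for testing.
async function generateLoad(count, baseUrl = 'http://localhost:3000', fetchImpl = fetch) {
  const tally = {};
  for (let i = 0; i < count; i++) {
    const response = await fetchImpl(`${baseUrl}/orders`, { method: 'POST' });
    tally[response.status] = (tally[response.status] || 0) + 1;
  }
  return tally;
}

// Usage (requires the stack to be running):
// generateLoad(200).then(console.log);
```

Since the endpoint fails roughly 15% of the time, a run of a few hundred requests should show a mix of 201 and 500 responses, which is enough to see the error-rate panel move.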

Conclusion

Observability isn't a tool you install once — it's a practice you build into how you write software. Starting with something as simple as a /metrics endpoint and a few well-chosen counters and histograms already puts you miles ahead of debugging production with console.log and hope.

If you've instrumented a service before, I'd love to hear what platform you used (Datadog, New Relic, ELK, CloudWatch, Grafana Cloud...) and what tripped you up the first time. Drop a comment below 👇

DEV Community