DEV Community

Observability Practices: A Hands-On Guide with Prometheus and Grafana

Introduction

Modern software systems are distributed, complex, and constantly changing. When something breaks in production, you need answers fast. That's where observability comes in.

Observability is the ability to understand the internal state of a system purely from its external outputs — without needing to redeploy, add debug code, or guess. It goes beyond traditional monitoring, which only tells you whether something is wrong. Observability tells you why it's wrong, where it started, and how it's spreading.

In this article, we'll explore the three pillars of observability, set up a real Node.js API instrumented with Prometheus and Grafana, and walk through how to detect and diagnose a real-world issue using the data we collect.


The Three Pillars of Observability

1. Logs

Logs are discrete, timestamped records of events that happened in your system. They're the most familiar form of observability — every developer has done console.log debugging at some point.

Example:

[2026-07-02T10:34:21Z] INFO  User 4821 logged in from IP 192.168.1.10
[2026-07-02T10:34:25Z] ERROR Failed to process payment for order #9932: timeout

Logs are great for capturing specific events, errors, and context. But they can become expensive at scale and hard to query across millions of lines.
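
One common way to keep logs queryable at scale is to emit them as structured JSON instead of free text. The sketch below is illustrative only: `formatLog` and `logEvent` are hypothetical helpers (not part of the demo app), and the field names `ts`, `level`, and `msg` are assumptions rather than a standard.

```javascript
// Hypothetical helper: builds one JSON log line per event.
// Field names (ts, level, msg) are illustrative, not a standard.
function formatLog(level, msg, fields = {}, now = new Date()) {
  return JSON.stringify({ ts: now.toISOString(), level, msg, ...fields });
}

// Writes the structured line to stdout, where a log shipper could pick it up.
function logEvent(level, msg, fields) {
  console.log(formatLog(level, msg, fields));
}

logEvent('error', 'Failed to process payment', { orderId: 9932, reason: 'timeout' });
```

Because every line is valid JSON, a log backend can filter on `orderId` or `level` directly instead of pattern-matching text.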

2. Metrics

Metrics are numeric measurements collected over time. Unlike logs, they're aggregated and efficient to store and query.

Common examples:

  • HTTP request count per minute
  • p95 response latency
  • CPU and memory usage
  • Error rate per endpoint

Metrics are the backbone of dashboards and alerts.

3. Traces

Traces follow a single request as it travels across multiple services. In a microservices architecture, a user request might touch 5–10 services. A trace shows you exactly where time was spent and where failures occurred.

Tools like Jaeger and Zipkin collect and visualize distributed traces, and OpenTelemetry provides a vendor-neutral standard for instrumenting your code to produce them.
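
To make "where time was spent" concrete, here is a small sketch of how span data can be analyzed. The span records and service names are hypothetical, and real tracers produce much richer data; the only idea shown is that each span points at its parent, so you can subtract child time to find where the request actually waited.

```javascript
// Hypothetical span records for one request. Real tracing SDKs record more
// fields, but the parent/child shape is the same idea.
const spans = [
  { id: 'a', parentId: null, name: 'GET /checkout', durationMs: 480 },
  { id: 'b', parentId: 'a', name: 'auth-service', durationMs: 30 },
  { id: 'c', parentId: 'a', name: 'payment-service', durationMs: 400 },
  { id: 'd', parentId: 'c', name: 'bank-api call', durationMs: 370 },
];

// Self time = a span's duration minus the time spent in its direct children.
// Assumes children run one after another, which is a simplification.
function selfTimes(allSpans) {
  return allSpans.map((span) => {
    const childTotal = allSpans
      .filter((s) => s.parentId === span.id)
      .reduce((sum, s) => sum + s.durationMs, 0);
    return { name: span.name, selfMs: span.durationMs - childTotal };
  });
}

// The span with the largest self time is where the request actually waited.
function bottleneck(allSpans) {
  return selfTimes(allSpans).reduce((worst, s) => (s.selfMs > worst.selfMs ? s : worst));
}

console.log(bottleneck(spans)); // { name: 'bank-api call', selfMs: 370 }
```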


Why Prometheus and Grafana?

There are many observability platforms out there: Datadog, New Relic, Dynatrace, Azure Monitor, AWS CloudWatch. Most are excellent but come with a cost.

Prometheus + Grafana is the open-source industry standard:

  • Prometheus scrapes metrics from your app at regular intervals and stores them in a time-series database. It uses a powerful query language called PromQL.
  • Grafana connects to Prometheus (and many other sources) and turns the data into rich, interactive dashboards.
  • Both are free, battle-tested at massive scale (they're used by companies like GitLab, DigitalOcean, and Cloudflare), and have huge communities.

Project Setup

We'll build a simple Node.js API with three endpoints, then instrument it to expose metrics. After that, we'll wire up Prometheus and Grafana with Docker.

Prerequisites

  • Node.js 18+
  • Docker and Docker Compose

Install dependencies

mkdir observability-demo && cd observability-demo
npm init -y
npm install express prom-client

Step 1: Building the Instrumented API

// server.js
const express = require('express');
const client = require('prom-client');

const app = express();
const register = new client.Registry();

// ─── Default Metrics ───────────────────────────────────────────────────────
// Automatically collects Node.js process metrics:
// CPU usage, memory heap, event loop lag, active handles, GC duration, etc.
client.collectDefaultMetrics({ register });

// ─── Custom Metrics ─────────────────────────────────────────────────────────

// Counter: monotonically increasing count of HTTP requests
const httpRequestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests received',
  labelNames: ['method', 'route', 'status_code'],
});
register.registerMetric(httpRequestCounter);

// Histogram: tracks request duration distribution
// Buckets let us calculate percentiles (p50, p90, p95, p99)
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
register.registerMetric(httpRequestDuration);

// Gauge: can go up and down (current value)
const activeRequests = new client.Gauge({
  name: 'http_active_requests',
  help: 'Number of HTTP requests currently being processed',
});
register.registerMetric(activeRequests);

// ─── Middleware ─────────────────────────────────────────────────────────────
app.use((req, res, next) => {
  activeRequests.inc();
  const end = httpRequestDuration.startTimer();

  res.on('finish', () => {
    activeRequests.dec();
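    const labels = {
      method: req.method,
      // Use the matched route pattern (e.g. '/slow'), not the raw URL path,
      // so unmatched or parameterized URLs don't create unbounded label values
      route: req.route ? req.route.path : 'unmatched',
      status_code: res.statusCode,
    };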
    httpRequestCounter.inc(labels);
    end(labels);
  });

  next();
});

// ─── Endpoints ──────────────────────────────────────────────────────────────
app.get('/', (req, res) => {
  res.json({ message: 'Observability Demo API', status: 'ok' });
});

// Simulates a slow database query or external API call
app.get('/slow', async (req, res) => {
  const delay = Math.random() * 3000; // up to 3 seconds
  await new Promise((resolve) => setTimeout(resolve, delay));
  res.json({ message: 'Slow response', delay_ms: Math.round(delay) });
});

// Simulates a failing endpoint (30% error rate)
app.get('/unstable', (req, res) => {
  if (Math.random() < 0.3) {
    return res.status(500).json({ error: 'Internal Server Error' });
  }
  res.json({ message: 'Success' });
});

// Health check
app.get('/health', (req, res) => {
  res.json({ status: 'healthy', uptime: process.uptime() });
});

// ─── Metrics Endpoint ───────────────────────────────────────────────────────
// Prometheus will scrape this endpoint every few seconds
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000, () => {
  console.log('Server running at http://localhost:3000');
  console.log('Metrics available at http://localhost:3000/metrics');
});

The three metric types we used each serve a different purpose:

  • Counter: only goes up. Use for requests, errors, tasks completed.
  • Histogram: records distributions. Use for latency, request size.
  • Gauge: goes up and down. Use for active connections, queue depth, memory.
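
Histograms are the least intuitive of the three, so a toy model may help. The sketch below mirrors the bucket layout from server.js but is not prom-client's actual implementation; `createHistogram` and `observe` are hypothetical names. It shows that bucket counts are cumulative: each observation is counted in every bucket whose upper bound is at least as large as the value.

```javascript
// Toy histogram mirroring the bucket layout used in server.js.
// This illustrates the idea only; it is not prom-client's internals.
const bounds = [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5, Infinity];

function createHistogram() {
  return { counts: new Array(bounds.length).fill(0), sum: 0, count: 0 };
}

// Each observation increments every bucket whose upper bound (le) is >= the
// value, so bucket counts are cumulative.
function observe(hist, seconds) {
  bounds.forEach((le, i) => {
    if (seconds <= le) hist.counts[i] += 1;
  });
  hist.sum += seconds;
  hist.count += 1;
}

const hist = createHistogram();
[0.02, 0.2, 0.2, 1.5].forEach((v) => observe(hist, v));
console.log(hist.counts); // [0, 1, 1, 3, 3, 3, 4, 4, 4]
```

The cumulative layout is what allows a query engine to estimate percentiles later without storing every individual request duration.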

Step 2: Prometheus Configuration

Create a prometheus.yml file in your project root:

global:
  scrape_interval: 5s      # How often to scrape targets
  evaluation_interval: 5s  # How often to evaluate alerting rules

scrape_configs:
  - job_name: 'node-app'
    static_configs:
      - targets: ['host.docker.internal:3000']  # Your Node.js app
    metrics_path: '/metrics'
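
Each scrape is an HTTP GET against `metrics_path`, and the response is plain text with one sample per line. To show roughly what Prometheus reads on every scrape, here is a simplified parser sketch. `parseSample` and `parseExposition` are hypothetical helpers, and the sketch deliberately ignores timestamps, escaped characters in label values, and special values such as `+Inf`.

```javascript
// Parses simple lines of the Prometheus text exposition format.
// Simplified sketch: ignores timestamps and escaped characters in label values.
function parseSample(line) {
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/.exec(line.trim());
  if (!match) return null;
  const [, name, rawLabels, rawValue] = match;
  const labels = {};
  if (rawLabels) {
    for (const pair of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
      labels[pair[1]] = pair[2];
    }
  }
  return { name, labels, value: Number(rawValue) };
}

function parseExposition(text) {
  return text
    .split('\n')
    .filter((line) => line.trim() !== '' && !line.startsWith('#')) // skip HELP/TYPE lines
    .map(parseSample)
    .filter(Boolean);
}

// Example payload in the shape the /metrics endpoint returns (values made up).
const sample = [
  '# HELP http_requests_total Total number of HTTP requests received',
  '# TYPE http_requests_total counter',
  'http_requests_total{method="GET",route="/",status_code="200"} 42',
  'http_active_requests 3',
].join('\n');

console.log(parseExposition(sample));
```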

Step 3: Docker Compose Setup

Create a docker-compose.yml to run Prometheus and Grafana together:

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
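    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
    # On Linux, host.docker.internal is not defined by default;
    # this maps it to the host so Prometheus can reach the Node.js app
    extra_hosts:
      - "host.docker.internal:host-gateway"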

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    depends_on:
      - prometheus

Start everything:

# In one terminal: start your Node.js app
node server.js

# In another terminal: start Prometheus and Grafana
docker compose up

Step 4: Generating Traffic

To see meaningful data in your dashboard, generate some traffic:

# Install a simple load testing tool
npm install -g autocannon

# Hit the normal endpoint
autocannon -d 60 -c 10 http://localhost:3000/

# Hit the slow endpoint
autocannon -d 60 -c 5 http://localhost:3000/slow

# Hit the unstable endpoint
autocannon -d 60 -c 10 http://localhost:3000/unstable

Step 5: Building the Grafana Dashboard

  1. Open Grafana at http://localhost:3001 (login: admin / admin).
  2. Go to Connections > Data Sources > Add data source > Prometheus.
  3. Set the URL to http://prometheus:9090 and click Save & Test.
  4. Go to Dashboards > New Dashboard > Add visualization.

Here are the most useful PromQL queries to add as panels:

Request rate (requests per second):

rate(http_requests_total[1m])
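
If `rate()` feels opaque, the following sketch models the core idea: the per-second increase of a counter across the samples in a window, with counter resets handled. It is a simplification under stated assumptions. The real PromQL function also extrapolates to the window boundaries, and `simpleRate` is a hypothetical name.

```javascript
// Simplified model of PromQL's rate(): per-second increase of a counter over
// a window. The real function also extrapolates to the window edges.
function simpleRate(samples) {
  // samples: [{ t: seconds, v: counterValue }, ...] sorted by time
  if (samples.length < 2) return null;
  let increase = 0;
  for (let i = 1; i < samples.length; i += 1) {
    const delta = samples[i].v - samples[i - 1].v;
    // A drop means the counter reset (e.g. the process restarted):
    // count the new value as the increase since the reset.
    increase += delta >= 0 ? delta : samples[i].v;
  }
  const elapsed = samples[samples.length - 1].t - samples[0].t;
  return elapsed > 0 ? increase / elapsed : null;
}

// Counter scraped every 5 s, with a restart between t=10 and t=15.
const scraped = [
  { t: 0, v: 100 },
  { t: 5, v: 150 },
  { t: 10, v: 200 },
  { t: 15, v: 20 },
  { t: 20, v: 70 },
];
console.log(simpleRate(scraped)); // 8.5 requests per second
```

This is why counters can safely restart from zero when the app redeploys: the drop is treated as a reset, not as negative traffic.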

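Error rate (fraction of requests per route that returned a 5xx; set the panel unit to percent to read it as 0–100%):

sum by (route) (rate(http_requests_total{status_code=~"5.."}[1m])) / sum by (route) (rate(http_requests_total[1m]))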

p95 latency (95th percentile response time):

histogram_quantile(0.95, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

p99 latency:

histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
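
The percentile queries above return estimates, not exact values, because a histogram only knows how many requests fell into each bucket. The sketch below is a simplified version of the linear interpolation that `histogram_quantile()` performs. The bucket counts are made up, the function name is hypothetical, and it assumes 0 < q ≤ 1 and a final `+Inf` bucket.

```javascript
// Simplified version of the interpolation behind histogram_quantile().
// buckets: cumulative counts sorted by upper bound; the last bucket is +Inf.
// Assumes 0 < q <= 1.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Quantile falls in the +Inf bucket: report the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[buckets.length - 2].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  // Assume observations are spread evenly inside the bucket.
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up cumulative counts: 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1 s.
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // about 0.75 (seconds)
```

The estimate can only be as precise as the bucket boundaries, so it pays to choose buckets around the latencies you care about.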

Active requests right now:

http_active_requests

Node.js heap memory usage:

nodejs_heap_size_used_bytes

Step 6: Diagnosing a Real Issue

Now let's see observability in action. Suppose your team deploys a new feature and suddenly the error rate on /unstable spikes.

On your dashboard you'd see:

  1. Error rate panel jumps from ~0% to ~30% for the /unstable route.
  2. p95 latency panel stays normal — so it's not a slowdown, it's actual failures.
  3. Active requests gauge stays stable — so it's not a connection leak.

This tells you immediately: the problem is a code-level error on a specific route, not infrastructure. You can then check your logs for that route and find the root cause — without having to search blindly across your entire system.

Without observability, this investigation might take hours. With proper instrumentation, it takes minutes.
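
The same reasoning can be expressed in code. The sketch below takes hypothetical per-route request rates (the numbers are invented, not measured) and flags routes whose share of 5xx responses exceeds a threshold. The 5% threshold and the helper names are assumptions for illustration.

```javascript
// Hypothetical per-route request rates, as a dashboard query might return them.
const rates = [
  { route: '/', status: '200', perSecond: 120 },
  { route: '/unstable', status: '200', perSecond: 70 },
  { route: '/unstable', status: '500', perSecond: 30 },
  { route: '/slow', status: '200', perSecond: 2 },
];

// Error ratio per route = 5xx rate / total rate.
function errorRatios(rows) {
  const totals = new Map();
  for (const { route, status, perSecond } of rows) {
    const entry = totals.get(route) ?? { total: 0, errors: 0 };
    entry.total += perSecond;
    if (status.startsWith('5')) entry.errors += perSecond;
    totals.set(route, entry);
  }
  return [...totals].map(([route, { total, errors }]) => ({ route, ratio: errors / total }));
}

// Routes whose error ratio exceeds the threshold (5% here, an arbitrary choice).
function failingRoutes(rows, threshold = 0.05) {
  return errorRatios(rows).filter((r) => r.ratio > threshold).map((r) => r.route);
}

console.log(failingRoutes(rates)); // [ '/unstable' ]
```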


Best Practices

  • Use meaningful label names. Labels like route, method, and status_code make filtering easy. Avoid high-cardinality labels (like user IDs) — they'll blow up your metric storage.
  • Track the RED method: Rate, Errors, Duration. These three signals cover most service health scenarios.
  • Set up alerts in Grafana. Don't just look at dashboards reactively — configure alerts to notify your team via Slack or email when error rate or latency exceeds a threshold.
  • Don't instrument everything. Focus on what matters: the critical paths and external dependencies. Too many metrics create noise.
  • Correlate your pillars. When an alert fires, jump from the metric to the logs for that time window. The setup in this article collects metrics only; adding a log data source such as Grafana Loki lets you make that jump inside Grafana.
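
To illustrate the cardinality point from the first bullet: every distinct label value creates a separate time series. The sketch below compares raw URL paths with normalized ones. `normalisePath` is a hypothetical helper that replaces numeric segments with a placeholder; in an Express app, the matched route pattern (`req.route.path`) serves the same purpose.

```javascript
// Counts distinct label values, i.e. how many time series one label would create.
function seriesCount(values) {
  return new Set(values).size;
}

// Hypothetical normaliser: replaces numeric path segments with a placeholder.
function normalisePath(path) {
  return path.replace(/\/\d+(?=\/|$)/g, '/:id');
}

const paths = ['/users/1', '/users/2', '/users/3/orders/77', '/users/4/orders/78'];
console.log(seriesCount(paths));                    // 4 distinct series
console.log(seriesCount(paths.map(normalisePath))); // 2 distinct series
```

With real traffic the first number grows with every new user or order, while the second stays bounded by the number of routes in the app.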

Key Takeaways

  • Observability is built on three pillars: logs, metrics, and traces. Each has a role.
  • Prometheus + Grafana gives you a free, widely used metrics and dashboarding stack that you can have running locally in minutes.
  • Use Counters for totals, Histograms for latency distributions, and Gauges for current state.
  • The RED method (Rate, Errors, Duration) gives you the core signals for any service.
  • Good observability doesn't just tell you something is broken — it helps you understand exactly what, where, and why.
