Devanand

Microservices Monitoring: A Practical Guide to Observability

When your application has one service, you can debug by reading logs. When you have twenty services talking across a mesh, you need observability — metrics, traces, and structured logs working together.

This guide covers production monitoring patterns that catch problems before customers do.


1. The Three Pillars of Observability

┌────────────────────────────────────────────────────┐
│                   Observability                    │
├──────────────┬───────────────────┬─────────────────┤
│ Metrics      │ Logs              │ Traces          │
│ (Prometheus) │ (Structured JSON) │ (OpenTelemetry) │
├──────────────┼───────────────────┼─────────────────┤
│ CPU, memory  │ Error messages    │ Request path    │
│ Request rate │ Access logs       │ Service-to-     │
│ Latency      │ Audit trails      │ service calls   │
│ SLOs         │                   │                 │
└──────────────┴───────────────────┴─────────────────┘

Why You Need All Three

| Pillar | Answers | Example |
| --- | --- | --- |
| Metrics | "Is the system healthy?" | p99 latency spiked to 5s |
| Logs | "What exactly happened?" | "error": "connection refused" |
| Traces | "Which service caused it?" | Auth service → DB timeout |

2. Health Check Endpoints

Every service needs a health check endpoint:

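// health.ts
import { Router } from "express";

// Assumptions: `db` appears to be a Prisma client instance created elsewhere, and
// checkRedis() and checkDiskSpace() are defined elsewhere and return the same
// { status, message? } shape as checkDatabase().
const router = Router();

router.get("/health", async (req, res) => {
  // Run the async dependency checks in parallel so one slow dependency does not delay the other
  const [database, redis] = await Promise.all([checkDatabase(), checkRedis()]);
  const checks = { database, redis, disk: checkDiskSpace() };

  // Healthy only if every dependency check passed
  const healthy = Object.values(checks).every((c) => c.status === "ok");

  // 503 signals load balancers and orchestrators that this instance should not receive traffic
  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "degraded",
    checks,
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase() {
  try {
    // Cheapest possible round trip to confirm the connection works
    await db.$queryRaw`SELECT 1`;
    return { status: "ok" };
  } catch (error) {
    return { status: "error", message: String(error) };
  }
}

export default router;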
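A health endpoint that hangs on a slow dependency is not much better than one that is down. The sketch below shows one way to bound each check with a timeout and to derive the overall status from the individual results. `CheckResult`, `withTimeout`, and `summarize` are hypothetical names introduced for illustration, and the timeout value is an assumption to tune per dependency.

```typescript
// Hypothetical types and helpers for illustration; not part of the original health.ts.
type CheckResult = { status: "ok" } | { status: "error"; message: string };

// Resolve to an error result if the check does not settle within timeoutMs.
async function withTimeout(
  name: string,
  check: () => Promise<CheckResult>,
  timeoutMs: number,
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<CheckResult>((resolve) => {
    timer = setTimeout(
      () => resolve({ status: "error", message: `${name} timed out after ${timeoutMs}ms` }),
      timeoutMs,
    );
  });
  try {
    return await Promise.race([check(), timeout]);
  } catch (error) {
    // A check that throws is reported as failed instead of crashing the handler.
    return { status: "error", message: String(error) };
  } finally {
    clearTimeout(timer);
  }
}

// Pure function: derive the overall status and HTTP code from the individual results.
function summarize(
  checks: Record<string, CheckResult>,
): { status: "ok" | "degraded"; httpStatus: 200 | 503 } {
  const healthy = Object.values(checks).every((c) => c.status === "ok");
  return healthy
    ? { status: "ok", httpStatus: 200 }
    : { status: "degraded", httpStatus: 503 };
}
```

Inside the handler, a check would then be wrapped as `await withTimeout("database", checkDatabase, 2000)`, so a stalled connection shows up as a failed check rather than a request that never returns.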

Docker Health Check

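healthcheck:
  # Assumes the health router is mounted under /api, the service listens on port 3100,
  # and curl is installed in the image (many slim base images do not ship it)
  test: ["CMD", "curl", "-f", "http://localhost:3100/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s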

3. Structured Logging

JSON Log Format

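import pino from "pino";

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    // Emit the level as a label ("info") instead of pino's default numeric value (30)
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage (err and req come from the surrounding request handler)
logger.info({ service: "auth", userId: "usr_123" }, "User logged in");
logger.error({ err, requestId: req.id }, "Authentication failed");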

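Output (pretty-printed here; pino writes one JSON object per line and, by default, also includes pid and hostname fields):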

{
  "level": "info",
  "time": "2025-01-15T10:30:00.000Z",
  "msg": "User logged in",
  "service": "auth",
  "userId": "usr_123"
}

Log Levels

| Level | Use Case |
| --- | --- |
| debug | Development troubleshooting |
| info | Normal operations (requests, signups) |
| warn | Unexpected but handled (retries, rate limits) |
| error | Failures requiring investigation |
| fatal | Service is shutting down |
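To make the level table concrete, here is a minimal sketch of how a level threshold gates output. The numeric severities match pino's defaults for these five levels; `isEnabled` and `formatLine` are hypothetical helpers written for this example, not pino APIs.

```typescript
// Numeric severities matching pino's default values for these levels.
const LEVELS = { debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof LEVELS;

// A message is emitted only if its severity is at or above the configured threshold.
function isEnabled(threshold: Level, level: Level): boolean {
  return LEVELS[level] >= LEVELS[threshold];
}

// Format one log entry as a single-line JSON object.
function formatLine(
  level: Level,
  fields: Record<string, unknown>,
  msg: string,
  now: Date = new Date(),
): string {
  return JSON.stringify({ level, time: now.toISOString(), ...fields, msg });
}

// With LOG_LEVEL=info, debug messages are dropped and warnings are kept.
if (isEnabled("info", "warn")) {
  console.log(formatLine("warn", { service: "auth", attempt: 2 }, "Retrying token refresh"));
}
```

Running production at `info` and switching a single service to `debug` during an incident is a common way to use this threshold without redeploying code.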

4. Prometheus Metrics

import prometheus from "prom-client";

// Counter: things that increase
const httpRequestsTotal = new prometheus.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});

// Histogram: latency distribution
const httpRequestDuration = new prometheus.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

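// Gauge: current values
const memoryUsageBytes = new prometheus.Gauge({
  name: "process_memory_bytes",
  help: "Current memory usage in bytes",
  // Refresh the value on every scrape; without this the gauge is never updated.
  // Reporting resident set size (rss) is an assumption about which number is wanted.
  collect() {
    this.set(process.memoryUsage().rss);
  },
});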
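Bucket boundaries determine how precise latency percentiles can be. The sketch below is a simplified model, not prom-client or Prometheus code: it counts observations into cumulative buckets using the boundaries from the histogram above, then estimates a quantile by linear interpolation, which is roughly how a percentile is derived from bucket counts. The function names are hypothetical, and the estimate assumes 0 < q <= 1.

```typescript
// Upper bounds from the histogram above; Infinity is the implicit catch-all bucket.
const BOUNDS = [0.01, 0.05, 0.1, 0.5, 1, 5, Infinity];

// Count how many observations fall at or below each upper bound (cumulative buckets).
function cumulativeCounts(observations: number[], bounds: number[] = BOUNDS): number[] {
  return bounds.map((le) => observations.filter((v) => v <= le).length);
}

// Estimate a quantile by linear interpolation inside the bucket that contains the target rank.
function estimateQuantile(q: number, counts: number[], bounds: number[] = BOUNDS): number {
  const total = counts[counts.length - 1];
  if (total === 0) return NaN;
  const rank = q * total;
  const i = counts.findIndex((c) => c >= rank);
  // If the rank lands in the catch-all bucket, the highest finite bound is the best available answer.
  if (bounds[i] === Infinity) return bounds[i - 1];
  const lower = i === 0 ? 0 : bounds[i - 1];
  const below = i === 0 ? 0 : counts[i - 1];
  return lower + (bounds[i] - lower) * ((rank - below) / (counts[i] - below));
}

// Five requests, in seconds.
const sampleCounts = cumulativeCounts([0.02, 0.07, 0.3, 0.3, 2]);
console.log(sampleCounts, estimateQuantile(0.99, sampleCounts));
```

For these five requests the p99 estimate comes out at about 4.8 seconds even though the slowest request took 2 seconds, because the only information available is that it fell somewhere between 1 and 5. If the latency target sits around 2 seconds, adding a bucket boundary near that value gives a much tighter estimate.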

Middleware Integration

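router.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    // Use the route template (e.g. "/users/:id"), not the raw URL, to keep label cardinality low.
    // req.route is undefined when no route matched (e.g. 404s), so fall back to a fixed label.
    const path = req.route?.path ?? "unmatched";
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode });
    end({ method: req.method, path });
  });
  next();
});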
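The `path` label deserves care because every distinct label value creates a new time series. When a route template is not available, one option is to collapse ID-like segments before using the path as a label. The helper below is a hypothetical sketch; the two patterns it recognizes (numeric IDs and UUIDs) are assumptions, and real services may need others.

```typescript
// Hypothetical helper: collapse ID-like path segments so the set of label values stays small.
function normalizePath(rawPath: string): string {
  // Drop the query string; it should never end up in a metric label.
  const pathOnly = rawPath.split("?")[0];
  return pathOnly
    .split("/")
    .map((segment) => {
      if (/^\d+$/.test(segment)) return ":id";
      if (/^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i.test(segment)) {
        return ":uuid";
      }
      return segment;
    })
    .join("/");
}

console.log(normalizePath("/users/123/orders?page=2"));
```

With this in place, `/users/123/orders` and `/users/456/orders` are counted under the same series instead of one series per user.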

5. OpenTelemetry Distributed Tracing

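// Load this file before the rest of the application so auto-instrumentation can patch modules as they are imported.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  // With no `url` option, the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself and appends
  // /v1/traces. A `url` passed explicitly is used as-is, so it would need the full path.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();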

Manual Tracing

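import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("my-service");

// validatePayment, updateInventory, and sendConfirmation are application functions defined elsewhere.
async function processOrder(orderId: string) {
  // startActiveSpan makes this span the parent of any spans created inside the callback
  return tracer.startActiveSpan("processOrder", { attributes: { orderId } }, async (span) => {
    try {
      await validatePayment(orderId);
      await updateInventory(orderId);
      await sendConfirmation(orderId);
    } catch (error) {
      span.recordException(error as Error);
      // Mark the span as failed so it shows up as an error in the tracing backend
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}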
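Traces span services because each outgoing request carries the trace context in a header. OpenTelemetry's auto-instrumentation handles this propagation, so the sketch below is only meant to show what travels on the wire: it parses a W3C `traceparent` header in its version 00 layout. `TraceContext` and `parseTraceparent` are hypothetical names for this example, not OpenTelemetry APIs.

```typescript
interface TraceContext {
  traceId: string;
  parentSpanId: string;
  sampled: boolean;
}

// Parse a W3C traceparent header: version-traceid-parentid-flags, all lowercase hex.
function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version ff and all-zero IDs are invalid.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  // The lowest bit of the flags byte marks the trace as sampled.
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(ctx?.traceId);
```

Adding the same `traceId` to each structured log entry lets you jump from a slow trace to the exact log lines written while handling that request.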

6. Grafana Dashboard Template

{
  "title": "Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{ "expr": "sum(rate(http_requests_total[5m]))" }]
    },
    {
      "title": "p99 Latency",
      "type": "timeseries",
      "targets": [{ "expr": "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))" }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [{ "expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m]))" }]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [{ "expr": "process_memory_bytes" }]
    }
  ]
}

Key Takeaways

  1. All three pillars — Metrics alone won't tell you which service failed
  2. Health checks everywhere — Every service exposes /health for orchestration
  3. Structured JSON logs — Machine-parseable, painless to search
  4. Standard metrics — RED method (Rate, Errors, Duration) for every service
  5. Trace everything — OpenTelemetry auto-instrumentation covers most cases
  6. Alert on SLOs — Not on every anomaly, only on user-impacting degradation
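
One common way to alert on SLOs rather than on individual anomalies is to track how fast the error budget is being consumed. The sketch below is an illustration under stated assumptions: `burnRate` and `shouldPage` are hypothetical helpers, and the default threshold of 14.4 with a short and a long window is a frequently cited starting point, not a value taken from this guide.

```typescript
// Burn rate: the observed error ratio divided by the error ratio the SLO allows.
// A burn rate of 1 consumes the error budget exactly over the SLO period.
function burnRate(errors: number, total: number, sloTarget: number): number {
  if (total === 0) return 0;
  const allowedErrorRatio = 1 - sloTarget;
  return errors / total / allowedErrorRatio;
}

// Page only when both a short and a long window burn fast, which filters out brief blips.
function shouldPage(shortWindowBurn: number, longWindowBurn: number, threshold = 14.4): boolean {
  return shortWindowBurn >= threshold && longWindowBurn >= threshold;
}

// 50 failed requests out of 10,000 against a 99.9% availability target.
console.log(burnRate(50, 10_000, 0.999));
```

In this example the service fails 0.5% of requests against an allowance of 0.1%, so it burns budget about five times faster than the SLO permits. That is worth a ticket, but it stays below the paging threshold.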

This post is part of the Production DevOps Patterns series. Follow for more DevOps, monitoring, and infrastructure best practices.
