Devanand

Posted on May 25

Microservices Monitoring: A Practical Guide to Observability

#monitoring #devops #microservices #observability

When your application has one service, you can debug by reading logs. When you have twenty services talking across a mesh, you need observability — metrics, traces, and structured logs working together.

This guide covers production monitoring patterns that catch problems before customers do.

1. The Three Pillars of Observability

┌────────────────────────────────────────────────────┐
│                   Observability                    │
├──────────────┬───────────────────┬─────────────────┤
│ Metrics      │ Logs              │ Traces          │
│ (Prometheus) │ (Structured JSON) │ (OpenTelemetry) │
├──────────────┼───────────────────┼─────────────────┤
│ CPU, memory  │ Error messages    │ Request path    │
│ Request rate │ Access logs       │ Service-to-     │
│ Latency      │ Audit trails      │ service calls   │
│ SLOs         │                   │                 │
└──────────────┴───────────────────┴─────────────────┘

Why You Need All Three

Pillar	Answers	Example
Metrics	"Is the system healthy?"	p99 latency spiked to 5s
Logs	"What exactly happened?"	`"error": "connection refused"`
Traces	"Which service caused it?"	Auth service → DB timeout

2. Health Check Endpoints

Every service needs a health check endpoint:

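// health.ts
import { Router } from "express";

// Assumed to exist elsewhere in the service: `db` (a client exposing
// `$queryRaw`, as Prisma's does), `checkRedis`, and `checkDiskSpace`.
// Every check returns { status: "ok" | "error", message?: string }.
const router = Router();

router.get("/health", async (req, res) => {
  // Run the async checks concurrently so one slow dependency
  // does not delay the others.
  const [database, redis] = await Promise.all([checkDatabase(), checkRedis()]);
  const checks = { database, redis, disk: checkDiskSpace() };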

  const healthy = Object.values(checks).every((c) => c.status === "ok");

  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "degraded",
    checks,
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase() {
  try {
    await db.$queryRaw`SELECT 1`;
    return { status: "ok" };
  } catch (error) {
    return { status: "error", message: String(error) };
  }
}
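
The handler above treats every failed check the same way. One possible refinement, sketched here under the assumption that some dependencies are critical and others are not, is to report `degraded` with a 200 when only a non-critical check fails and to reserve 503 for critical failures. `aggregateHealth` and the `critical` list are hypothetical names, not part of the original handler.

```typescript
// Hypothetical helper: the names and the critical/non-critical split are
// assumptions for illustration, not part of the handler above.
type CheckResult = { status: "ok" | "error"; message?: string };
type Overall = { status: "ok" | "degraded" | "down"; httpStatus: 200 | 503 };

function aggregateHealth(
  checks: Record<string, CheckResult>,
  critical: string[],
): Overall {
  // Names of every check that did not report "ok".
  const failed = Object.entries(checks)
    .filter(([, result]) => result.status !== "ok")
    .map(([name]) => name);

  if (failed.length === 0) return { status: "ok", httpStatus: 200 };

  // A failed critical dependency means the service cannot do useful work.
  if (failed.some((name) => critical.includes(name))) {
    return { status: "down", httpStatus: 503 };
  }

  // Only non-critical checks failed: keep serving, but surface the problem.
  return { status: "degraded", httpStatus: 200 };
}

// Example: Redis is down, but only the database is marked critical.
const overall = aggregateHealth(
  {
    database: { status: "ok" },
    redis: { status: "error", message: "ECONNREFUSED" },
  },
  ["database"],
);
console.log(overall.status, overall.httpStatus); // degraded 200
```

Whether a cache outage should pull an instance out of rotation depends on the service, so treat the split as a per-service decision rather than a rule.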

Docker Health Check

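healthcheck:
  # Assumes the health router is mounted under /api, the service listens on
  # port 3100, and curl is installed in the image.
  test: ["CMD", "curl", "-f", "http://localhost:3100/api/health"]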
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
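
To see what these settings imply, here is a rough worked estimate of how long a broken service can go unnoticed. It assumes Docker's documented behavior: each check runs `interval` after the previous one completes, a hung check is cut off at `timeout`, and the container is marked unhealthy after `retries` consecutive failures. The function name is illustrative.

```typescript
// Rough upper bound on the time between a service breaking (after startup)
// and Docker marking the container unhealthy. Illustrative helper only.
interface HealthcheckConfig {
  intervalSec: number;
  timeoutSec: number;
  retries: number;
}

function worstCaseDetectionSec(cfg: HealthcheckConfig): number {
  // Each failing probe costs up to one full interval of waiting
  // plus up to one full timeout before it is counted as a failure.
  return cfg.retries * (cfg.intervalSec + cfg.timeoutSec);
}

const detection = worstCaseDetectionSec({ intervalSec: 30, timeoutSec: 10, retries: 3 });
console.log(`unhealthy after at most ~${detection}s`); // 3 * (30 + 10) = 120
```

With the values above, a hung service can keep receiving traffic for about two minutes. Lowering `interval` or `retries` shortens that window at the cost of more probes and more false alarms.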

3. Structured Logging

JSON Log Format

import pino from "pino";

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage (the second call assumes a request handler where `req` and a caught `err` are in scope)
logger.info({ service: "auth", userId: "usr_123" }, "User logged in");
logger.error({ err, requestId: req.id }, "Authentication failed");

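Output of the first call, pretty-printed (pino writes one JSON object per line and also adds `pid` and `hostname` by default; both are omitted here):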

{
  "level": "info",
  "time": "2025-01-15T10:30:00.000Z",
  "msg": "User logged in",
  "service": "auth",
  "userId": "usr_123"
}

Log Levels

Level	Use Case
`debug`	Development troubleshooting
`info`	Normal operations (requests, signups)
`warn`	Unexpected but handled (retries, rate limits)
`error`	Failures requiring investigation
`fatal`	Service is shutting down
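
The `LOG_LEVEL` setting in the logger config acts as a threshold: a message is written only if its level is at or above it. The sketch below illustrates that rule using the numeric values pino assigns to its default levels. `shouldLog` is a hypothetical helper, since pino performs this check internally.

```typescript
// Numeric values pino assigns to its default levels
// (trace = 10 is left out to match the table above).
const LEVELS = { debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof LEVELS;

// Hypothetical helper that mirrors the threshold rule.
function shouldLog(messageLevel: Level, threshold: Level): boolean {
  return LEVELS[messageLevel] >= LEVELS[threshold];
}

// With LOG_LEVEL=warn, info messages are dropped and errors are kept.
console.log(shouldLog("info", "warn")); // false
console.log(shouldLog("error", "warn")); // true
```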

4. Prometheus Metrics

import prometheus from "prom-client";

// Counter: things that increase
const httpRequestsTotal = new prometheus.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});

// Histogram: latency distribution
const httpRequestDuration = new prometheus.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

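// Gauge: current values
const memoryUsageBytes = new prometheus.Gauge({
  name: "process_memory_bytes",
  help: "Current memory usage in bytes",
  // Refresh the value on every scrape; without this the gauge stays at 0.
  // Resident set size is one reasonable choice of "memory usage".
  collect() {
    this.set(process.memoryUsage().rss);
  },
});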

Middleware Integration

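router.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    // Label with the route pattern (e.g. "/users/:id"), not the raw URL, to keep
    // label cardinality bounded; requests that matched no route get a fixed label.
    const path = req.route?.path ?? "unmatched";
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode });
    end({ method: req.method, path });
  });
  next();
});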
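
Prometheus collects these metrics by scraping an HTTP endpoint (conventionally `/metrics`) that returns them in a plain-text format. With prom-client that text comes from `register.metrics()`. The stdlib-only sketch below hand-renders a counter to show roughly what a scrape returns. `renderCounter` and the sample values are made up for illustration, and label values are not escaped as a real implementation would.

```typescript
type Sample = { labels: Record<string, string>; value: number };

// Illustrative only: prom-client generates this text for you.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const { labels, value } of samples) {
    const labelText = Object.entries(labels)
      .map(([key, val]) => `${key}="${val}"`)
      .join(",");
    lines.push(`${name}{${labelText}} ${value}`);
  }
  return lines.join("\n") + "\n";
}

console.log(
  renderCounter("http_requests_total", "Total HTTP requests", [
    { labels: { method: "GET", path: "/users/:id", status: "200" }, value: 42 },
    { labels: { method: "GET", path: "/users/:id", status: "500" }, value: 1 },
  ]),
);
```

Each distinct combination of label values is a separate time series, which is why the middleware labels by route pattern rather than by raw URL.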

5. OpenTelemetry Distributed Tracing

import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

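const sdk = new NodeSDK({
  // With no `url` option the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself
  // and appends /v1/traces. A `url` passed here is used as-is, so it would
  // need to include the full path.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

// Start before the rest of the app is imported so auto-instrumentation can patch modules.
sdk.start();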
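
Auto-instrumentation links spans across services by passing trace context in request headers; by default OpenTelemetry uses the W3C Trace Context `traceparent` header. The sketch below parses one to show what actually travels between services. `parseTraceparent` is a hypothetical helper for illustration, covering only the version `00` layout; the SDK handles this itself.

```typescript
interface TraceContext {
  version: string;
  traceId: string; // 16 bytes as hex, shared by every span in the request
  parentId: string; // 8 bytes as hex, the caller's span
  sampled: boolean;
}

// Hypothetical helper: parses "version-traceId-parentId-flags".
function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(
    header.trim(),
  );
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // All-zero IDs are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  // The lowest flag bit marks the trace as sampled.
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Example header taken from the W3C Trace Context specification.
const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(ctx?.traceId, ctx?.sampled);
```

Every service that handles the request logs and reports spans under the same `traceId`, which is what lets a tracing backend reassemble the full request path.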

Manual Tracing

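import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("my-service");

async function processOrder(orderId: string) {
  // startActiveSpan makes this span the parent of any span created inside the
  // callback, including auto-instrumented database and HTTP calls.
  return tracer.startActiveSpan("processOrder", { attributes: { orderId } }, async (span) => {
    try {
      await validatePayment(orderId);
      await updateInventory(orderId);
      await sendConfirmation(orderId);
    } catch (error) {
      span.recordException(error as Error);
      // recordException alone does not mark the span as failed.
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}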

6. Grafana Dashboard Template

{
  "title": "Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{ "expr": "rate(http_requests_total[5m])" }]
    },
    {
      "title": "p99 Latency",
      "type": "timeseries",
      "targets": [{ "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))" }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [{ "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [{ "expr": "process_memory_bytes" }]
    }
  ]
}
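
The p99 panel depends on how `histogram_quantile` works: Prometheus does not store individual request durations, only cumulative counts per bucket, and it estimates the quantile by linear interpolation inside the bucket where the target rank falls. The sketch below mirrors that estimate for the buckets defined earlier. The counts are invented, and the function is a simplified illustration rather than Prometheus's implementation.

```typescript
// Cumulative bucket: number of observations with duration <= le seconds.
type Bucket = { le: number; count: number };

// Assumes 0 < q <= 1 and that `buckets` includes the +Inf bucket
// plus at least one finite bucket.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count; // +Inf holds every observation
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  const bucket = sorted[i];

  // Rank falls in +Inf: the best available answer is the largest finite bound.
  if (bucket.le === Infinity) return sorted[sorted.length - 2].le;

  const lowerBound = i === 0 ? 0 : sorted[i - 1].le;
  const lowerCount = i === 0 ? 0 : sorted[i - 1].count;
  // Linear interpolation between the bucket's lower and upper bounds.
  const fraction = (rank - lowerCount) / (bucket.count - lowerCount);
  return lowerBound + (bucket.le - lowerBound) * fraction;
}

// Invented counts for the buckets [0.01, 0.05, 0.1, 0.5, 1, 5] used earlier.
const sampleBuckets: Bucket[] = [
  { le: 0.01, count: 10 },
  { le: 0.05, count: 50 },
  { le: 0.1, count: 80 },
  { le: 0.5, count: 95 },
  { le: 1, count: 98 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.99, sampleBuckets).toFixed(2)); // "3.00"
```

Here the estimate is 3s even if every slow request actually took 1.1s, because the bucket from 1s to 5s is wide. Adding bucket boundaries around your latency target gives a sharper p99.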

Key Takeaways

All three pillars — Metrics alone won't tell you which service failed
Health checks everywhere — Every service exposes /health for orchestration
Structured JSON logs — Machine-parseable, painless to search
Standard metrics — RED method (Rate, Errors, Duration) for every service
Trace everything — OpenTelemetry auto-instrumentation covers most cases
Alert on SLOs — Not on every anomaly, only on user-impacting degradation
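
Alerting on SLOs usually means alerting on how fast the error budget is being spent. As a worked example under assumed numbers (a 99.9% availability SLO, which is not stated in this post), the burn rate is the observed error ratio divided by the error ratio the SLO allows. The function names and the paging threshold are illustrative.

```typescript
// Burn rate 1 means the error budget lasts exactly the SLO window;
// higher values mean it runs out proportionally sooner.
function burnRate(errorRatio: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget; // about 0.001 for a 99.9% SLO
  return errorRatio / errorBudget;
}

// Hypothetical paging rule. A burn rate of 14.4 sustained for one hour
// spends 2% of a 30-day budget (14.4 / 720 hours = 0.02).
function shouldPage(errorRatio: number, sloTarget: number, threshold = 14.4): boolean {
  return burnRate(errorRatio, sloTarget) >= threshold;
}

console.log(burnRate(0.005, 0.999).toFixed(1)); // "5.0": spending budget 5x too fast
console.log(shouldPage(0.005, 0.999)); // false
console.log(shouldPage(0.02, 0.999)); // true
```

The error ratio can come from the counter defined earlier: the rate of `http_requests_total` with a 5xx status divided by the rate of all requests.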

This post is part of the Production DevOps Patterns series. Follow for more DevOps, monitoring, and infrastructure best practices.
