Microservices Monitoring: A Practical Guide to Observability
When your application has one service, you can debug by reading logs. When you have twenty services talking across a mesh, you need observability — metrics, traces, and structured logs working together.
This guide covers production monitoring patterns that catch problems before customers do.
1. The Three Pillars of Observability
┌───────────────────────────────────────────────────┐
│                   Observability                   │
├──────────────┬──────────────────┬─────────────────┤
│ Metrics      │ Logs             │ Traces          │
│ (Prometheus) │ (Structured JSON)│ (OpenTelemetry) │
├──────────────┼──────────────────┼─────────────────┤
│ CPU, memory  │ Error messages   │ Request path    │
│ Request rate │ Access logs      │ Service-to-     │
│ Latency      │ Audit trails     │ service calls   │
│ SLOs         │                  │                 │
└──────────────┴──────────────────┴─────────────────┘
Why You Need All Three
| Pillar | Answers | Example |
|---|---|---|
| Metrics | "Is the system healthy?" | p99 latency spiked to 5s |
| Logs | "What exactly happened?" | "error": "connection refused" |
| Traces | "Which service caused it?" | Auth service → DB timeout |
2. Health Check Endpoints
Every service needs a health check endpoint:
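// health.ts
// Assumes `db` is a Prisma client, and that `checkRedis` and `checkDiskSpace`
// are defined elsewhere and return the same { status, message? } shape.
// The Docker check targets /api/health, so this router is presumably mounted under /api.
import { Router } from "express";

const router = Router();

router.get("/health", async (req, res) => {
  // Run the async dependency checks in parallel so one slow check does not delay the other.
  const [database, redis] = await Promise.all([checkDatabase(), checkRedis()]);
  const checks = { database, redis, disk: checkDiskSpace() };

  const healthy = Object.values(checks).every((c) => c.status === "ok");
  // 503 tells orchestrators and load balancers to stop routing traffic here.
  res.status(healthy ? 200 : 503).json({
    status: healthy ? "ok" : "degraded",
    checks,
    uptime: process.uptime(),
    timestamp: new Date().toISOString(),
  });
});

async function checkDatabase() {
  try {
    // Cheapest possible round trip: proves the connection works without touching data.
    await db.$queryRaw`SELECT 1`;
    return { status: "ok" };
  } catch (error) {
    return { status: "error", message: String(error) };
  }
}

export default router;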
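A dependency that hangs can stall the whole health endpoint past the orchestrator's own timeout. One way to bound each check is a small wrapper like the sketch below. `runCheck`, `CheckResult`, and the 2-second default are illustrative assumptions, not part of the handler above.

```typescript
// Hypothetical helper: bounds how long a single health check may take.
type CheckResult = { status: "ok" } | { status: "error"; message: string };

async function runCheck(
  probe: () => Promise<unknown>,
  timeoutMs = 2000, // assumed default; tune per dependency
): Promise<CheckResult> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${timeoutMs}ms`)), timeoutMs);
  });
  try {
    // Whichever settles first wins: the probe itself or the timeout.
    await Promise.race([probe(), timeout]);
    return { status: "ok" };
  } catch (error) {
    return { status: "error", message: String(error) };
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}
```

Wrapping the `SELECT 1` query in `runCheck` would replace the hand-written try/catch in `checkDatabase` and guarantee the endpoint answers within a known time.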
Docker Health Check
healthcheck:
  test: ["CMD", "curl", "-f", "http://localhost:3100/api/health"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 40s
3. Structured Logging
JSON Log Format
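import pino from "pino";

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    // Emit the level as a label ("info") instead of pino's numeric default (30).
    level(label) {
      return { level: label };
    },
  },
  timestamp: pino.stdTimeFunctions.isoTime,
});

// Usage (`err` and `req` come from the surrounding request handler)
logger.info({ service: "auth", userId: "usr_123" }, "User logged in");
logger.error({ err, requestId: req.id }, "Authentication failed");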
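Output (pretty-printed here; pino writes one JSON object per line and also adds `pid` and `hostname` by default, omitted for brevity):
{
  "level": "info",
  "time": "2025-01-15T10:30:00.000Z",
  "msg": "User logged in",
  "service": "auth",
  "userId": "usr_123"
}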
Log Levels
| Level | Use Case |
|---|---|
| debug | Development troubleshooting |
| info | Normal operations (requests, signups) |
| warn | Unexpected but handled (retries, rate limits) |
| error | Failures requiring investigation |
| fatal | Service is shutting down |
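The `level` option works as a threshold: a message is written only when its level is at or above the configured one. The sketch below illustrates that rule using pino's default numeric values (pino also defines `trace` at 10). `LEVELS` and `isEnabled` are hypothetical names for illustration, not pino APIs.

```typescript
// pino's default numeric severity for each label in the table above.
const LEVELS = { debug: 20, info: 30, warn: 40, error: 50, fatal: 60 } as const;
type Level = keyof typeof LEVELS;

// Hypothetical helper: true when a message at `level` passes the configured threshold.
function isEnabled(level: Level, threshold: Level): boolean {
  return LEVELS[level] >= LEVELS[threshold];
}

// With LOG_LEVEL=warn, debug and info lines are dropped; warn and above are kept.
const kept = (Object.keys(LEVELS) as Level[]).filter((l) => isEnabled(l, "warn"));
console.log(kept.join(", ")); // warn, error, fatal
```

Running production at `info` and switching a single service to `debug` through `LOG_LEVEL` during an incident keeps log volume manageable.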
4. Prometheus Metrics
import prometheus from "prom-client";

// Counter: a value that only ever increases
const httpRequestsTotal = new prometheus.Counter({
  name: "http_requests_total",
  help: "Total HTTP requests",
  labelNames: ["method", "path", "status"],
});

// Histogram: latency distribution
const httpRequestDuration = new prometheus.Histogram({
  name: "http_request_duration_seconds",
  help: "HTTP request latency in seconds",
  labelNames: ["method", "path"],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5],
});

// Gauge: a value that can go up or down
const memoryUsageBytes = new prometheus.Gauge({
  name: "process_memory_bytes",
  help: "Current memory usage in bytes",
  // Refresh the value on every scrape; without this the gauge is never set.
  // Resident set size is one reasonable reading of "memory usage".
  collect() {
    this.set(process.memoryUsage().rss);
  },
});
Middleware Integration
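router.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on("finish", () => {
    // Label with the matched route pattern (e.g. "/users/:id"), not the raw URL.
    // Requests that matched no route share a single "unmatched" value.
    const path = req.route?.path ?? "unmatched";
    httpRequestsTotal.inc({ method: req.method, path, status: res.statusCode });
    end({ method: req.method, path });
  });
  next();
});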
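Each distinct label value creates a new time series, so using a raw URL such as `/users/8472` as the `path` label would produce one series per user. `req.route?.path` avoids that for matched routes. Where no route pattern is available, one option is to collapse ID-like segments, as in this sketch. `normalizePath` and its notion of an ID (digits or UUIDs) are assumptions for illustration.

```typescript
// Hypothetical helper: collapse ID-like path segments so label values stay bounded.
const UUID = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const NUMERIC = /^\d+$/;

function normalizePath(rawPath: string): string {
  const pathname = rawPath.split("?")[0] ?? ""; // drop the query string
  return pathname
    .split("/")
    .map((segment) => (NUMERIC.test(segment) || UUID.test(segment) ? ":id" : segment))
    .join("/");
}

console.log(normalizePath("/users/8472/orders?page=2")); // /users/:id/orders
```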
5. OpenTelemetry Distributed Tracing
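// Start the SDK before the rest of the application is imported,
// so auto-instrumentation can patch libraries as they load.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";

const sdk = new NodeSDK({
  // With no `url` option the exporter reads OTEL_EXPORTER_OTLP_ENDPOINT itself and
  // appends /v1/traces. An explicit `url` is used as-is, so it must include that path.
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();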
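Auto-instrumentation links spans across services by passing trace context in request headers. By default OpenTelemetry uses the W3C Trace Context `traceparent` header. The sketch below parses that header to show what actually travels between services. `parseTraceparent` is a hypothetical helper for illustration only; the SDK does this for you.

```typescript
// Shape of a W3C traceparent header: version-traceId-parentSpanId-flags
interface TraceContext {
  traceId: string; // 32 hex chars, shared by every span in the request
  parentSpanId: string; // 16 hex chars, the caller's span
  sampled: boolean; // lowest bit of the flags byte
}

function parseTraceparent(header: string): TraceContext | null {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // Version "ff" and all-zero IDs are invalid per the spec.
  if (version === "ff" || /^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

const ctx = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
console.log(ctx?.traceId); // 4bf92f3577b34da6a3ce929d0e0e4736
```

Because every service forwards the same `traceId`, the tracing backend can stitch spans from the auth service and the database call into one request timeline.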
Manual Tracing
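import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("my-service");

async function processOrder(orderId: string) {
  // startActiveSpan makes this span the parent of any spans created inside the callback,
  // including those from auto-instrumented calls.
  return tracer.startActiveSpan("processOrder", { attributes: { orderId } }, async (span) => {
    try {
      await validatePayment(orderId);
      await updateInventory(orderId);
      await sendConfirmation(orderId);
    } catch (error) {
      span.recordException(error as Error);
      // Mark the span as failed so it shows up in error filters.
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw error;
    } finally {
      span.end();
    }
  });
}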
6. Grafana Dashboard Template
{
  "title": "Service Overview",
  "panels": [
    {
      "title": "Request Rate",
      "type": "timeseries",
      "targets": [{ "expr": "rate(http_requests_total[5m])" }]
    },
    {
      "title": "p99 Latency",
      "type": "timeseries",
      "targets": [{ "expr": "histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))" }]
    },
    {
      "title": "Error Rate",
      "type": "timeseries",
      "targets": [{ "expr": "rate(http_requests_total{status=~\"5..\"}[5m])" }]
    },
    {
      "title": "Memory Usage",
      "type": "gauge",
      "targets": [{ "expr": "process_memory_bytes" }]
    }
  ]
}
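The p99 panel relies on `histogram_quantile`, which never sees individual request timings. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls, so precision depends on the bucket boundaries. The sketch below reproduces that estimate in simplified form for classic histograms. `estimateQuantile` and the sample counts are made up for illustration.

```typescript
// A cumulative bucket: `count` observations were <= `le` seconds.
interface Bucket {
  le: number; // upper bound; Infinity for the +Inf bucket
  count: number;
}

// Simplified take on histogram_quantile for classic histograms.
// Assumes 0 < q <= 1 and buckets sorted by `le`, cumulative, ending with le = Infinity.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2) return NaN;
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best available answer is the highest finite bound.
  if (buckets[i].le === Infinity) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  const inBucket = buckets[i].count - below;
  return lower + (buckets[i].le - lower) * ((rank - below) / inBucket);
}

// Made-up counts using the bucket boundaries defined for http_request_duration_seconds.
const sample: Bucket[] = [
  { le: 0.01, count: 10 },
  { le: 0.05, count: 40 },
  { le: 0.1, count: 70 },
  { le: 0.5, count: 95 },
  { le: 1, count: 98 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
console.log(estimateQuantile(0.99, sample)); // about 3 (halfway through the 1s-5s bucket)
```

The wide 1s to 5s bucket is why the estimate lands at roughly 3s regardless of where the slow requests actually fall. Adding boundaries near the latency you care about tightens the estimate.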
Key Takeaways
- All three pillars: metrics alone won't tell you which service failed
- Health checks everywhere: every service exposes `/health` for orchestration
- Structured JSON logs: machine-parseable and painless to search
- Standard metrics: the RED method (Rate, Errors, Duration) for every service
- Trace everything: OpenTelemetry auto-instrumentation covers most cases
- Alert on SLOs: not on every anomaly, only on user-impacting degradation
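Alerting on SLOs is usually done through the error budget: a 99.9% availability target leaves a 0.1% budget, and the burn rate says how fast current errors are consuming it. The sketch below works through that arithmetic. `burnRate` is a hypothetical helper, and the 14.4 threshold is the commonly cited fast-burn value (2% of a 30-day budget spent in one hour), used here as an assumption rather than a recommendation from this guide.

```typescript
// Burn rate = observed error ratio / error budget ratio.
// 1.0 means the budget lasts exactly the SLO window; higher means it runs out sooner.
function burnRate(errorRatio: number, sloTarget: number): number {
  const budget = 1 - sloTarget; // e.g. 0.001 for a 99.9% SLO
  return errorRatio / budget;
}

// Example: 1.5% of requests failing against a 99.9% SLO.
const rate = burnRate(0.015, 0.999);
const FAST_BURN_THRESHOLD = 14.4; // assumed threshold, see the note above
console.log(rate.toFixed(1), rate > FAST_BURN_THRESHOLD ? "page" : "ok"); // 15.0 page
```

A brief spike that barely dents the budget stays below the threshold, while a sustained failure that users feel crosses it quickly.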
This post is part of the Production DevOps Patterns series. Follow for more DevOps, monitoring, and infrastructure best practices.