You don’t “add monitoring later.” If a microservice ships without observability, your on-call pays the tax.
Below is a pre-launch checklist we run on Node.js services. It’s short, opinionated, and battle-tested.
1) RED metrics (Prometheus with prom-client)
Measure Rate (RPS), Errors (non-2xx by class), and Duration (p95/p99) per route/operation. Export labels: method, route, status. Add version/commit as a tag so dashboards split cleanly.
Dashboard: one panel each for RPS, Error %, and p95/p99 Duration per route.
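Wiring this up is one Express middleware. With prom-client you'd use a Histogram with `labelNames: ['method', 'route', 'status']`; the dependency-free recorder below is just a sketch of the same shape so the RED math is visible (`recordRequest` and `redSnapshot` are illustrative names, not prom-client APIs):

```javascript
// Stand-in for a prom-client Histogram: one observation per request,
// then Rate / Error% / p95 / p99 computed per route over a window.
const observations = []; // { method, route, status, durationMs, ts }

function recordRequest(method, route, status, durationMs) {
  observations.push({ method, route, status, durationMs, ts: Date.now() });
}

function percentile(sorted, p) {
  if (sorted.length === 0) return 0;
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}

// RED snapshot for one route over the last windowSec seconds.
function redSnapshot(route, windowSec) {
  const cutoff = Date.now() - windowSec * 1000;
  const hits = observations.filter(o => o.route === route && o.ts >= cutoff);
  const errors = hits.filter(o => o.status >= 500).length;
  const durations = hits.map(o => o.durationMs).sort((a, b) => a - b);
  return {
    rate: hits.length / windowSec,                         // RPS
    errorPct: hits.length ? (100 * errors) / hits.length : 0,
    p95: percentile(durations, 0.95),
    p99: percentile(durations, 0.99),
  };
}
```

In a real service you'd call `recordRequest` from a `res.on('finish')` hook, and prom-client's `histogram.labels(method, route, status).observe(seconds)` replaces all of the aggregation above.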
2) SLOs + error budgets
Pick SLIs that users feel. Example API SLI: availability = 1 − (5xx + timeouts) / total.
service: checkout-api
sli:
  type: events
  good: http_requests_total{status=~"2..|3.."}
  total: http_requests_total
slo: 99.9  # monthly objective
alerting:
  burn_rates:
    - window: 5m
      rate: 14   # page (fast burn)
    - window: 1h
      rate: 6    # page
    - window: 6h
      rate: 3    # ticket
You page on budget burn, not on every 500.
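The burn-rate numbers fall out of simple arithmetic: a 99.9% objective leaves a 0.1% error budget, and burn rate is just the observed error ratio divided by that budget. A small sketch (function names are mine):

```javascript
// Error budget implied by the SLO: 99.9 → 0.001 (0.1% of requests may fail).
function errorBudget(sloPct) {
  return 1 - sloPct / 100;
}

// Burn rate: how many times faster than "exactly on budget" we are failing.
// burnRate === 1 spends the monthly budget in exactly one month;
// burnRate === 14 would spend it in roughly two days.
function burnRate(badEvents, totalEvents, sloPct) {
  if (totalEvents === 0) return 0;
  return badEvents / totalEvents / errorBudget(sloPct);
}

// Decision matching the config above: page on fast burn, ticket on slow burn.
function action(rate) {
  if (rate >= 14) return 'page';   // fast burn (5m window)
  if (rate >= 6) return 'page';    // 1h window
  if (rate >= 3) return 'ticket';  // 6h window
  return 'none';
}
```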
3) Distributed tracing (OpenTelemetry)
Instrument HTTP, DB, and queue operations; propagate trace id + tenant id across services.
// tracing.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PrismaInstrumentation } from '@prisma/instrumentation';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation(), new PrismaInstrumentation()]
});
sdk.start();
Minimum: parent/child spans, HTTP attributes (route, status, target), DB statement summaries, and message queue spans (publish/consume).
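Propagation itself rides on the W3C `traceparent` header, which HttpInstrumentation injects and extracts for you; a tenant id usually travels alongside it as W3C baggage. A minimal sketch of the wire format, to make debugging headers easier (the helper names are mine, not OTel APIs):

```javascript
// W3C trace context: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}

// Tenant id as a baggage entry, so every downstream span can attach it.
function buildBaggage(tenantId) {
  return `tenant.id=${encodeURIComponent(tenantId)}`;
}
```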
4) Queue depth & consumer lag
For RabbitMQ/Kafka/SQS, track:
- Queue depth (messages ready).
- Lag (Kafka consumer group lag).
- Age of oldest message (or time-in-queue).
- DLQ rate.
// Example: RabbitMQ depth (management API)
const q = await fetch(`${RMQ}/api/queues/%2F/orders`).then(r => r.json());
metrics.queueDepth.set({ queue: 'orders' }, q.messages_ready);
Alert when depth/lag grows while consumer CPU is idle → likely stuck handler or poison message.
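That heuristic is easy to encode: depth trending up while consumer CPU stays idle means handlers are not making progress. A sketch, with the idle-CPU threshold and return labels as illustrative assumptions:

```javascript
// depthSamples: queue depth over time (oldest → newest);
// consumerCpu: average consumer CPU utilization, 0–1.
function diagnoseQueue(depthSamples, consumerCpu, { idleCpu = 0.1 } = {}) {
  if (depthSamples.length < 2) return 'ok';
  const last = depthSamples[depthSamples.length - 1];
  const growing =
    last > depthSamples[0] &&
    depthSamples.every((d, i, a) => i === 0 || d >= a[i - 1]);
  if (growing && consumerCpu < idleCpu) return 'stuck-consumer'; // stuck handler / poison message
  if (growing) return 'under-provisioned';                       // busy but falling behind
  return 'ok';
}
```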
5) Synthetic checks (outside-in)
Hit public routes from multiple regions every minute; alert when error rate or latency breaks SLO.
Run smoke on deploy; run full flows (login → create → pay) on schedule.
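Evaluating probe results against the SLO is a pure function; the runner (cron plus a fetch per region) just feeds it. A sketch with made-up thresholds:

```javascript
// One probe result per region per run: { region, ok, latencyMs }.
// Returns the regions whose error rate or latency breaks the thresholds.
function evaluateSynthetics(results, { maxLatencyMs = 800, maxErrorRate = 0.01 } = {}) {
  const byRegion = new Map();
  for (const r of results) {
    const agg = byRegion.get(r.region) ?? { total: 0, errors: 0, worst: 0 };
    agg.total += 1;
    if (!r.ok) agg.errors += 1;
    agg.worst = Math.max(agg.worst, r.latencyMs);
    byRegion.set(r.region, agg);
  }
  const breaching = [];
  for (const [region, a] of byRegion) {
    if (a.errors / a.total > maxErrorRate || a.worst > maxLatencyMs) breaching.push(region);
  }
  return breaching;
}
```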
6) Liveness / Readiness
/healthz (liveness): process is alive; quick checks only.
/readyz (readiness): dependencies OK (DB ping, queue connect, config loaded). Fail readiness when backpressure kicks in.
app.get('/healthz', (_req, res) => res.send('ok'));
app.get('/readyz', async (_req, res) => {
  let ok = false;
  try { ok = await db.ping() && await queue.ping(); } catch { ok = false; }
  res.status(ok ? 200 : 503).send(ok ? 'ready' : 'not-ready');
});
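The backpressure part of readiness can key off event-loop lag (in Node, `perf_hooks.monitorEventLoopDelay()`); the decision itself stays pure. A sketch, with the lag threshold as an assumption:

```javascript
// Fail readiness when a dependency is down OR the event loop is lagging
// badly enough that accepting more traffic would only make things worse.
// eventLoopLagMs would come from perf_hooks.monitorEventLoopDelay()
// (note: that histogram reports nanoseconds, so convert before passing in).
function isReady({ dbOk, queueOk, eventLoopLagMs }, { maxLagMs = 200 } = {}) {
  return Boolean(dbOk && queueOk && eventLoopLagMs <= maxLagMs);
}
```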
7) Release/rollback sanity
- Log version/commit on every request (trace attr + metric label).
- Dashboards pinned for latest version.
- Alert routes: paging only for fast budget burn, tickets for slow burn.
- Rollback plan documented (traffic switch, canary %, who approves).
What we keep on one dashboard
- RED per route (RPS, Error%, p95/p99).
- SLO objective vs. actual & budget left.
- Trace waterfall for 3 slowest endpoints.
- Queue depth/lag + DLQ rate.
- Synthetic latency (per region).
- Deploy marker overlays.
If you want a lean Node.js microservice checklist we share with teams, ping me.