You don’t “add monitoring later.” If a microservice ships without observability, your on-call pays the tax.
Below is a pre-launch checklist we run on Node.js services. It’s short, opinionated, and battle-tested.
1) RED metrics (Prometheus with prom-client)
Measure Rate (RPS), Errors (non-2xx by class), and Duration (p95/p99) per route/operation. Export labels: method, route, status. Add version/commit as a tag so dashboards split cleanly.
Dashboard: one panel each for RPS, Error %, and p95/p99 Duration per route.
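Wiring this up is one Express middleware. With prom-client you'd use a Histogram with `labelNames: ['method', 'route', 'status']`; the dependency-free recorder below is just a sketch of the same shape so the RED math is visible (`recordRequest` and `redSnapshot` are illustrative names, not prom-client APIs):

```javascript
// Stand-in for a prom-client Histogram: one observation per request,
// then Rate / Error% / p95 / p99 computed per route over a window.
const observations = []; // { method, route, status, durationMs, ts }

function recordRequest(method, route, status, durationMs) {
  observations.push({ method, route, status, durationMs, ts: Date.now() });
}

function percentile(sorted, p) {
  if (sorted.length === 0) return 0;
  const idx = Math.min(sorted.length - 1, Math.ceil(p * sorted.length) - 1);
  return sorted[idx];
}

// RED snapshot for one route over the last windowSec seconds.
function redSnapshot(route, windowSec) {
  const cutoff = Date.now() - windowSec * 1000;
  const hits = observations.filter(o => o.route === route && o.ts >= cutoff);
  const errors = hits.filter(o => o.status >= 500).length;
  const durations = hits.map(o => o.durationMs).sort((a, b) => a - b);
  return {
    rate: hits.length / windowSec,                         // RPS
    errorPct: hits.length ? (100 * errors) / hits.length : 0,
    p95: percentile(durations, 0.95),
    p99: percentile(durations, 0.99),
  };
}
```

In a real service you'd call `recordRequest` from a `res.on('finish')` hook, and prom-client's `histogram.labels(method, route, status).observe(seconds)` replaces all of the aggregation above.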
2) SLOs + error budgets
Pick SLIs that users feel. Example API SLI: availability = 1 − (5xx + timeouts) / total.
service: checkout-api
sli:
  type: events
  good: http_requests_total{status=~"2..|3.."}
  total: http_requests_total
slo: 99.9  # monthly objective
alerting:
  burn_rates:
    - window: 5m
      rate: 14   # page (fast burn)
    - window: 1h
      rate: 6    # page
    - window: 6h
      rate: 3    # ticket
You page on budget burn, not on every 500.
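The burn-rate numbers fall out of simple arithmetic: a 99.9% objective leaves a 0.1% error budget, and burn rate is just the observed error ratio divided by that budget. A small sketch (function names are mine):

```javascript
// Error budget implied by the SLO: 99.9 → 0.001 (0.1% of requests may fail).
function errorBudget(sloPct) {
  return 1 - sloPct / 100;
}

// Burn rate: how many times faster than "exactly on budget" we are failing.
// burnRate === 1 spends the monthly budget in exactly one month;
// burnRate === 14 would spend it in roughly two days.
function burnRate(badEvents, totalEvents, sloPct) {
  if (totalEvents === 0) return 0;
  return badEvents / totalEvents / errorBudget(sloPct);
}

// Decision matching the config above: page on fast burn, ticket on slow burn.
function action(rate) {
  if (rate >= 14) return 'page';   // fast burn (5m window)
  if (rate >= 6) return 'page';    // 1h window
  if (rate >= 3) return 'ticket';  // 6h window
  return 'none';
}
```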
3) Distributed tracing (OpenTelemetry)
Instrument HTTP, DB, and queue operations; propagate trace id + tenant id across services.
// tracing.js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PrismaInstrumentation } from '@prisma/instrumentation';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
const sdk = new NodeSDK({
traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation(), new PrismaInstrumentation()]
});
sdk.start();
Minimum: parent/child spans, HTTP attributes (route, status, target), DB statement summaries, and message queue spans (publish/consume).
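Propagation itself rides on the W3C `traceparent` header, which HttpInstrumentation injects and extracts for you; a tenant id usually travels alongside it as W3C baggage. A minimal sketch of the wire format, to make debugging headers easier (the helper names are mine, not OTel APIs):

```javascript
// W3C trace context: "00-<32 hex trace id>-<16 hex span id>-<2 hex flags>".
function buildTraceparent(traceId, spanId, sampled) {
  return `00-${traceId}-${spanId}-${sampled ? '01' : '00'}`;
}

function parseTraceparent(header) {
  const m = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!m) return null;
  return { traceId: m[1], spanId: m[2], sampled: m[3] === '01' };
}

// Tenant id as a baggage entry, so every downstream span can attach it.
function buildBaggage(tenantId) {
  return `tenant.id=${encodeURIComponent(tenantId)}`;
}
```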
4) Queue depth & consumer lag
For RabbitMQ/Kafka/SQS, track:
- Queue depth (messages ready).
- Lag (Kafka consumer group lag).
- Age of oldest message (or time-in-queue).
- DLQ rate.
// Example: RabbitMQ depth (management API)
const q = await fetch(`${RMQ}/api/queues/%2F/orders`).then(r => r.json());
metrics.queueDepth.set({ queue: 'orders' }, q.messages_ready);
Alert when depth/lag grows while consumer CPU is idle → likely stuck handler or poison message.
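That heuristic is easy to encode: depth trending up while consumer CPU stays idle means handlers are not making progress. A sketch, with the idle-CPU threshold and return labels as illustrative assumptions:

```javascript
// depthSamples: queue depth over time (oldest → newest);
// consumerCpu: average consumer CPU utilization, 0–1.
function diagnoseQueue(depthSamples, consumerCpu, { idleCpu = 0.1 } = {}) {
  if (depthSamples.length < 2) return 'ok';
  const last = depthSamples[depthSamples.length - 1];
  const growing =
    last > depthSamples[0] &&
    depthSamples.every((d, i, a) => i === 0 || d >= a[i - 1]);
  if (growing && consumerCpu < idleCpu) return 'stuck-consumer'; // stuck handler / poison message
  if (growing) return 'under-provisioned';                       // busy but falling behind
  return 'ok';
}
```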
5) Synthetic checks (outside-in)
Hit public routes from multiple regions every minute; alert when error rate or latency breaks SLO.
Run smoke on deploy; run full flows (login → create → pay) on schedule.
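Evaluating probe results against the SLO is a pure function; the runner (cron plus a fetch per region) just feeds it. A sketch with made-up thresholds:

```javascript
// One probe result per region per run: { region, ok, latencyMs }.
// Returns the regions whose error rate or latency breaks the thresholds.
function evaluateSynthetics(results, { maxLatencyMs = 800, maxErrorRate = 0.01 } = {}) {
  const byRegion = new Map();
  for (const r of results) {
    const agg = byRegion.get(r.region) ?? { total: 0, errors: 0, worst: 0 };
    agg.total += 1;
    if (!r.ok) agg.errors += 1;
    agg.worst = Math.max(agg.worst, r.latencyMs);
    byRegion.set(r.region, agg);
  }
  const breaching = [];
  for (const [region, a] of byRegion) {
    if (a.errors / a.total > maxErrorRate || a.worst > maxLatencyMs) breaching.push(region);
  }
  return breaching;
}
```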
6) Liveness / Readiness
/healthz (liveness): process is alive; quick checks only.
/readyz (readiness): dependencies OK (DB ping, queue connect, config loaded). Fail readiness when backpressure kicks in.
app.get('/healthz', (_req, res) => res.send('ok'));
app.get('/readyz', async (_req, res) => {
  let ok = false;
  try { ok = await db.ping() && await queue.ping(); } catch { ok = false; }
  res.status(ok ? 200 : 503).send(ok ? 'ready' : 'not-ready');
});
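The backpressure part of readiness can key off event-loop lag (in Node, `perf_hooks.monitorEventLoopDelay()`); the decision itself stays pure. A sketch, with the lag threshold as an assumption:

```javascript
// Fail readiness when a dependency is down OR the event loop is lagging
// badly enough that accepting more traffic would only make things worse.
// eventLoopLagMs would come from perf_hooks.monitorEventLoopDelay()
// (note: that histogram reports nanoseconds, so convert before passing in).
function isReady({ dbOk, queueOk, eventLoopLagMs }, { maxLagMs = 200 } = {}) {
  return Boolean(dbOk && queueOk && eventLoopLagMs <= maxLagMs);
}
```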
7) Release/rollback sanity
- Log version/commit on every request (trace attr + metric label).
- Dashboards pinned for latest version.
- Alert routes: paging only for fast budget burn, tickets for slow burn.
- Rollback plan documented (traffic switch, canary %, who approves).
What we keep on one dashboard
- RED per route (RPS, Error%, p95/p99).
- SLO objective vs. actual & budget left.
- Trace waterfall for 3 slowest endpoints.
- Queue depth/lag + DLQ rate.
- Synthetic latency (per region).
- Deploy marker overlays.
If you want a lean Node.js microservice checklist we share with teams, ping me.