DEV Community

Discussion on: Your API is down - now what? Capturing failure context in Node.js

Henry A • Edited

The capture side is solid, but the part that usually burns teams is what happens after you have the context — specifically, making it actionable at 2 AM when your on-call engineer is half awake.

A few things I've added to production Node.js services that made incident response dramatically faster:

1. Correlation IDs that survive async boundaries

Generate an x-correlation-id at the edge (API Gateway, load balancer, or first middleware) and propagate it through every downstream call: HTTP, queue messages, DB queries. When something fails, you grep one ID and get the full request lifecycle across services. Without this, you're correlating by timestamp, which falls apart under load.
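
A minimal sketch of the in-process half of this, assuming Node's built-in AsyncLocalStorage is an acceptable way to carry the ID. The helper names (withCorrelationId, currentCorrelationId, outboundHeaders) are hypothetical and not taken from any library:

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');
const { randomUUID } = require('node:crypto');

// Holds per-request context that stays available across awaits and callbacks.
const requestContext = new AsyncLocalStorage();

// Hypothetical helper: run `fn` with the correlation ID from the incoming
// headers, or a freshly generated one if the header is absent.
function withCorrelationId(headers, fn) {
  const correlationId = headers['x-correlation-id'] || randomUUID();
  return requestContext.run({ correlationId }, fn);
}

// Read the current ID anywhere downstream without threading it through
// every function signature.
function currentCorrelationId() {
  const store = requestContext.getStore();
  return store ? store.correlationId : undefined;
}

// Headers to attach to outbound HTTP calls or queue messages.
function outboundHeaders(extra = {}) {
  const id = currentCorrelationId();
  return id ? { ...extra, 'x-correlation-id': id } : { ...extra };
}
```

In an Express-style app, the first middleware would wrap next() in withCorrelationId(req.headers, ...), and every HTTP client or queue publisher would merge outboundHeaders() into what it sends.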

2. Structured error context, not just stack traces

Instead of logging the error object, log a structured envelope:

{
  "correlationId": "abc-123",
  "operation": "createOrder",
  "input": { "userId": "u_456", "items": 3 },
  "dependency": "stripe-api",
  "latencyMs": 12400,
  "error": "ETIMEDOUT",
  "retryCount": 2
}

This tells you what was happening, what it was talking to, and how long it waited — not just that it failed. CloudWatch Insights or Datadog can query these fields directly.
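
One way to produce that envelope consistently is a small helper called from every catch block. This is only a sketch: buildErrorEnvelope and logErrorEnvelope are hypothetical names, and the assumption is that the caller recorded a startedAt timestamp before calling the dependency:

```javascript
// Hypothetical helper that builds the envelope shape shown in the JSON
// example. `now` is injectable so latency can be computed deterministically.
function buildErrorEnvelope({
  correlationId, operation, input, dependency,
  startedAt, error, retryCount = 0, now = Date.now(),
}) {
  return {
    correlationId,
    operation,
    input, // assumed to be a small summary, not the full payload
    dependency,
    latencyMs: now - startedAt,
    // Prefer the machine-readable code (e.g. ETIMEDOUT) over the message.
    error: error.code || error.name || 'Error',
    retryCount,
  };
}

// Emit the envelope as a single JSON line so log tooling can index fields.
function logErrorEnvelope(fields, write = (line) => process.stderr.write(line + '\n')) {
  const envelope = buildErrorEnvelope(fields);
  write(JSON.stringify(envelope));
  return envelope;
}
```

Keeping input to a summary (IDs and counts rather than raw request bodies) keeps the log line small and avoids leaking user data into logs.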

3. The "last known good" snapshot

On every successful health check, write a lightweight state snapshot (active connections, queue depth, memory, last successful DB roundtrip latency) to a known location. When the service goes down, the previous snapshot tells you what was degrading before the crash — not after.
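
A rough sketch of the snapshot write, assuming a local file is the "known location" (a mounted volume or object store would work the same way). SNAPSHOT_PATH, writeSnapshot, readSnapshot, and the probe field names are all made up for illustration:

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical location for the snapshot file.
const SNAPSHOT_PATH = path.join(os.tmpdir(), 'last-known-good.json');

// `probes` is an assumed object of values the health check already gathered,
// for example { activeConnections, queueDepth, dbRoundtripMs }.
function writeSnapshot(probes, file = SNAPSHOT_PATH) {
  const snapshot = {
    takenAt: new Date().toISOString(),
    rssBytes: process.memoryUsage().rss,
    ...probes,
  };
  // Write to a temp file, then rename, so a crash mid-write is unlikely to
  // leave a truncated snapshot behind.
  const tmp = `${file}.${process.pid}.tmp`;
  fs.writeFileSync(tmp, JSON.stringify(snapshot));
  fs.renameSync(tmp, file);
  return snapshot;
}

// Read the previous snapshot during an incident or on restart.
function readSnapshot(file = SNAPSHOT_PATH) {
  try {
    return JSON.parse(fs.readFileSync(file, 'utf8'));
  } catch {
    return null; // no snapshot yet, or the file is unreadable
  }
}
```

Call writeSnapshot only from the success path of the health check, so the file always describes the last moment the service considered itself healthy.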

4. Circuit breakers with context emission

If you're using a circuit breaker (opossum, cockatiel, etc.), make it emit structured events when it opens. "Circuit to payments-service opened after 5/10 failures in 30s, median latency 8400ms" is an alert that tells you exactly what to look at. Most teams have circuit breakers but never wire the state transitions to their alerting pipeline.
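
To show the shape of such an event without guessing at any library's API, here is a tiny hand-rolled breaker. It is not opossum or cockatiel. It also uses a count-based window rather than a 30s time window, and it leaves out half-open and reset handling. All names are hypothetical:

```javascript
const { EventEmitter } = require('node:events');

// Illustration only: tracks the last `windowSize` call results and emits a
// structured 'open' event once failures reach `failureThreshold`.
class ObservableBreaker extends EventEmitter {
  constructor(name, { windowSize = 10, failureThreshold = 5 } = {}) {
    super();
    this.name = name;
    this.windowSize = windowSize;
    this.failureThreshold = failureThreshold;
    this.results = []; // entries look like { ok: boolean, latencyMs: number }
    this.state = 'closed';
  }

  // Record the outcome of one call to the dependency.
  record(ok, latencyMs) {
    this.results.push({ ok, latencyMs });
    if (this.results.length > this.windowSize) this.results.shift();
    const failures = this.results.filter((r) => !r.ok).length;
    if (this.state === 'closed' && failures >= this.failureThreshold) {
      this.state = 'open';
      // The payload carries the context an alert needs, not just "opened".
      this.emit('open', {
        event: 'circuit_opened',
        dependency: this.name,
        failures,
        windowSize: this.results.length,
        medianLatencyMs: median(this.results.map((r) => r.latencyMs)),
      });
    }
  }
}

// Median of a non-empty list of numbers.
function median(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}
```

With a real library the idea is the same: subscribe to its state-change events and forward a payload like this to whatever feeds your alerting.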

The gap I see most often isn't capture — it's that teams capture plenty of data but can't find the right data during an incident because it's unstructured or scattered across services. Correlation IDs + structured envelopes solve 80% of that.

Riyon Sebastian • Edited

This is a fantastic breakdown, really appreciate you taking the time to write this.

You’re absolutely right that the hard part isn’t just capturing data, but being able to use it effectively during an incident, especially when everything is happening under pressure.

One bit of context on where the monitoring tool I’m developing sits: it’s intentionally an external observer. It captures what the world outside your service sees at the moment of failure - DNS, TLS, TTFB, response body. So I see it as complementary to things like correlation IDs and structured logs, not a replacement - those are doing the hard work inside, while the monitoring layer gives you the matching evidence from outside.

A couple of your points are genuinely useful for where it goes next:

  • Surfacing correlation IDs in snapshots (when available in headers), so you can pivot straight into internal logs
  • A “last known good” baseline, so you can see degradation over time, not just the failure moment
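
For the first item, a rough sketch of what the probe side might do, assuming the service echoes the ID back in a response header. The header names and the extractCorrelationId helper are assumptions, not part of the tool as it exists:

```javascript
// Hypothetical list of header names to check, in priority order.
const CANDIDATE_HEADERS = ['x-correlation-id', 'x-request-id'];

// Pull a correlation ID out of a plain object of response headers so a
// failure snapshot can link to the service's internal logs.
function extractCorrelationId(headers) {
  // Normalize keys, since HTTP header names are case-insensitive.
  const lower = Object.fromEntries(
    Object.entries(headers).map(([k, v]) => [k.toLowerCase(), v])
  );
  for (const name of CANDIDATE_HEADERS) {
    const value = lower[name];
    if (typeof value === 'string' && value.trim() !== '') return value.trim();
  }
  return null; // nothing usable; the snapshot just omits the field
}
```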

Both feel like natural next steps.

Really appreciate the perspective, this is exactly the kind of discussion I was hoping for.