Debugging Microservices Like a Pro: How Trace IDs Saved My Production Incident

#microservices #observability

Last week, our checkout service started timing out randomly. 2% of requests, no pattern, no obvious cause. Here's how I tracked it down using trace IDs.

## The Problem

Our architecture:

API Gateway → Auth Service → Cart Service → Payment Service → Notification Service

Users reported "payment failed" but logs showed success everywhere. Classic distributed systems nightmare.

## The Fix: Following the Trace ID

Every request gets a unique trace ID at the gateway. I grep'd for a failing request's trace ID across all services:

    grep "trace_id=abc123" /var/log/*.log
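One caveat with a plain substring grep: `trace_id=abc123` also matches `trace_id=abc1234`. If you'd rather filter in Node, here's a minimal sketch of an exact-match filter. It assumes log lines carry a whitespace-delimited `trace_id=<id>` field, as in the grep above; `filterByTraceId` is a hypothetical helper name, not part of any library.

```javascript
// Hypothetical helper: keeps only lines whose trace_id field equals traceId exactly.
// Assumes the field is written as `trace_id=<id>` and delimited by whitespace.
function filterByTraceId(lines, traceId) {
  // Escape regex metacharacters so the ID is matched literally.
  const escaped = traceId.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const pattern = new RegExp(`(^|\\s)trace_id=${escaped}(\\s|$)`);
  return lines.filter((line) => pattern.test(line));
}

// Made-up sample lines, for illustration only.
const sample = [
  'cart trace_id=abc123 status=200',
  'payment trace_id=abc1234 status=200',
  'gateway trace_id=abc123',
];
const matches = filterByTraceId(sample, 'abc123');
```

Here `matches` holds the first and third lines; the `abc1234` line is excluded.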

Found it: Cart Service was returning 200, but with an empty response body. Payment Service treated the empty body as "no items" and silently skipped the payment.

Without trace IDs, I'd still be searching.
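One way to guard against this class of bug is to make the consumer fail loudly on an empty or malformed body instead of reading it as "no items". The sketch below is illustrative only: the `{ items: [...] }` response shape, `parseCartResponse`, and `UpstreamContractError` are assumptions for the example, not the real services' code.

```javascript
// Hypothetical error type that carries the trace ID so the failure is easy to grep for.
class UpstreamContractError extends Error {
  constructor(message, traceId) {
    super(message);
    this.name = 'UpstreamContractError';
    this.traceId = traceId;
  }
}

// Hypothetical parser for the Cart Service response body (assumed shape: { items: [...] }).
// Throws instead of treating an empty or malformed body as an empty cart.
function parseCartResponse(rawBody, traceId) {
  if (rawBody === undefined || rawBody === null || String(rawBody).trim() === '') {
    throw new UpstreamContractError('Cart Service returned an empty body', traceId);
  }
  let cart;
  try {
    cart = JSON.parse(rawBody);
  } catch (err) {
    throw new UpstreamContractError('Cart Service returned invalid JSON', traceId);
  }
  if (!cart || !Array.isArray(cart.items)) {
    throw new UpstreamContractError('Cart response is missing an items array', traceId);
  }
  return cart;
}
```

With a check like this, a 200 with an empty body surfaces as an error tagged with the trace ID rather than a silent skip.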

## Trace ID vs Correlation ID

One thing that confused me early on: what's the difference between a trace ID and correlation ID?

Trace IDs follow a request across service boundaries with parent-child span relationships (see https://last9.io/blog/correlation-id-vs-trace-id/). Correlation IDs are simpler: just a shared identifier without the hierarchy.
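To make the hierarchy concrete, here's a rough sketch based on the W3C Trace Context `traceparent` header, which has the form `version-traceid-parentid-flags`. The post doesn't say which propagation format our services use, so treat the header choice as an assumption; `parseTraceparent` and `childTraceparent` are hypothetical helper names, and the validation is deliberately minimal.

```javascript
const crypto = require('node:crypto');

// Parse a `traceparent` header into its parts; returns null if it is malformed.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, parentSpanId, flags] = match;
  // All-zero trace and span IDs are invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header a service would send downstream: same trace ID, fresh span ID.
function childTraceparent(parent) {
  const spanId = crypto.randomBytes(8).toString('hex');
  return `00-${parent.traceId}-${spanId}-${parent.sampled ? '01' : '00'}`;
}

// Illustrative header value only.
const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoing = parseTraceparent(childTraceparent(incoming));
```

`incoming` and `outgoing` share one `traceId` but carry different span IDs, and that per-hop span ID is what lets a tracing backend rebuild the parent-child tree. A correlation ID would be just the shared value, with no per-hop part.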

For debugging microservices, trace IDs give you the full picture.

## Quick Implementation (Node.js)

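```javascript
const { trace } = require('@opentelemetry/api');

// Assumes `app` is an existing Express app and an OpenTelemetry SDK is already registered.
app.use((req, res, next) => {
  // getActiveSpan() returns undefined when no span is active, hence the optional chaining.
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;
  req.traceId = traceId;
  console.log(`[${traceId}] ${req.method} ${req.path}`);
  next();
});
```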

## Key Takeaways