DEV Community

Nishant Modak

Posted on • Originally published at last9.io

Debugging Microservices Like a Pro: How Trace IDs Saved My Production Incident

Last week, our checkout service started timing out randomly. 2% of requests, no pattern, no obvious cause. Here's how I tracked it down using trace IDs.

## The Problem

Our architecture:

  • API Gateway → Auth Service → Cart Service → Payment Service → Notification Service

Users reported "payment failed" but logs showed success everywhere. Classic distributed systems nightmare.

## The Fix: Following the Trace ID

Every request gets a unique trace ID at the gateway. I grep'd for a failing request's trace ID across all services:

  grep "trace_id=abc123" /var/log/*.log
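If the logs are already collected in one place, the same idea can be sketched in Node.js: filter to one trace ID, then sort by timestamp so the request reads as a timeline. The log format, the sample lines, and the `followTrace` name are all assumptions for illustration, not the actual logs from this incident.

```javascript
// Hypothetical log format: "<ISO timestamp> <service> trace_id=<id> <message>"
function followTrace(lines, traceId) {
  const needle = `trace_id=${traceId}`;
  // The first space-separated token of each line is the timestamp.
  const timestampOf = (line) => Date.parse(line.split(' ', 1)[0]);
  return lines
    // Match the whole token so "abc123" does not also match "abc1234".
    .filter((line) => line.split(' ').includes(needle))
    .sort((a, b) => timestampOf(a) - timestampOf(b));
}

// Made-up sample lines from several services, deliberately out of order.
const sampleLines = [
  '2024-05-01T10:00:00.300Z payment trace_id=abc123 cart has no items, skipping',
  '2024-05-01T10:00:00.100Z gateway trace_id=abc123 POST /checkout',
  '2024-05-01T10:00:00.150Z cart trace_id=zzz999 GET /cart 200 bytes=512',
  '2024-05-01T10:00:00.200Z cart trace_id=abc123 GET /cart 200 bytes=0',
];

const timeline = followTrace(sampleLines, 'abc123');
console.log(timeline.join('\n'));
```

Sorting matters once several services are involved: the suspicious hop is usually easier to spot when the lines appear in the order the request actually travelled.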

Found it: Cart Service was returning 200 with an empty response body. Payment Service treated the empty body as "no items" and silently skipped the payment.

Without trace IDs, I'd still be searching.
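One way to guard against this class of bug is to make the consumer treat an empty body as an error instead of an empty cart. The sketch below is an assumption about what such a check could look like; the post does not describe the actual fix, and `parseCartResponse` and the `{ items: [...] }` response shape are hypothetical.

```javascript
// Hypothetical helper for Payment Service: validate the Cart Service response
// before acting on it. The response shape { items: [...] } is an assumption.
function parseCartResponse(status, body, traceId) {
  if (status !== 200) {
    throw new Error(`[${traceId}] cart service returned ${status}`);
  }
  if (typeof body !== 'string' || body.trim() === '') {
    // An empty body means the response is broken, not that the cart is empty.
    throw new Error(`[${traceId}] cart service returned 200 with an empty body`);
  }
  const cart = JSON.parse(body);
  if (!Array.isArray(cart.items)) {
    throw new Error(`[${traceId}] cart response has no items array`);
  }
  return cart.items;
}

// A genuinely empty cart is still valid and returns an empty array.
const emptyCartItems = parseCartResponse(200, '{"items":[]}', 'abc123');
console.log(emptyCartItems.length);
```

Including the trace ID in the error message means the failure shows up in the same search that found the original problem.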

## Trace ID vs Correlation ID

One thing that confused me early on: what's the difference between a trace ID and correlation ID?

[Trace IDs](https://last9.io/blog/correlation-id-vs-trace-id/) are part of distributed tracing: they follow a request across service boundaries with parent-child span relationships. Correlation IDs are simpler: just a shared identifier without the hierarchy.

For debugging microservices, trace IDs give you the full picture.
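To make the parent-child part concrete, here is a minimal sketch based on the W3C Trace Context `traceparent` header, which OpenTelemetry uses by default for propagation. Whether your services use this header is an assumption, and the helper names (`parseTraceparent`, `startChildSpan`, `toTraceparent`) are hypothetical. In practice the OpenTelemetry SDK does this for you.

```javascript
const { randomBytes } = require('node:crypto');

// Parse a traceparent header: version-traceId-parentId-flags, all lowercase hex.
function parseTraceparent(header) {
  const match = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header);
  if (!match) return null;
  const [, version, traceId, spanId, flags] = match;
  return { version, traceId, spanId, flags };
}

// A child span keeps the trace ID, gets a fresh span ID, and points at its parent.
function startChildSpan(parent) {
  return {
    traceId: parent.traceId,
    spanId: randomBytes(8).toString('hex'), // 8 bytes -> 16 hex characters
    parentSpanId: parent.spanId,
    flags: parent.flags,
  };
}

// Header to send on the outgoing call to the next service.
function toTraceparent(span) {
  return `00-${span.traceId}-${span.spanId}-${span.flags}`;
}

// Example header value taken from the W3C Trace Context specification.
const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const child = startChildSpan(incoming);
console.log(toTraceparent(child));
```

A correlation ID would be only the shared `traceId` part. The `spanId` and `parentSpanId` pair is what lets a tracing tool rebuild the call tree.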

## Quick Implementation (Node.js)

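```javascript
// Assumes an Express app with OpenTelemetry instrumentation already set up.
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express();

// Attach the active span's trace ID to the request and log it.
app.use((req, res, next) => {
  const span = trace.getActiveSpan();
  // traceId is undefined when no span is active.
  const traceId = span?.spanContext().traceId;
  req.traceId = traceId;
  console.log(`[${traceId}] ${req.method} ${req.path}`);
  next();
});
```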

## Key Takeaways

  1. Always propagate trace IDs across services
  2. Log the trace ID with every log line
  3. Use distributed tracing tools (Jaeger, Tempo, Last9) to visualize
  4. When debugging, start with the trace ID and follow it
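Takeaway 2 is easier to stick to when the logger adds the trace ID for you. Here's a minimal sketch using Node's built-in `AsyncLocalStorage`. The `withTrace`, `formatLog`, and `log` helpers and the `no-trace` placeholder are hypothetical names, and with OpenTelemetry set up you could read the ID from the active span instead.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

const traceStore = new AsyncLocalStorage();

// Run fn with traceId visible to everything it calls, sync or async.
function withTrace(traceId, fn) {
  return traceStore.run({ traceId }, fn);
}

// Prefix the message with the current trace ID, or a placeholder outside a request.
function formatLog(message) {
  const traceId = traceStore.getStore()?.traceId ?? 'no-trace';
  return `[${traceId}] ${message}`;
}

function log(message) {
  console.log(formatLog(message));
}

// Inside a request scope the ID is attached without being passed around.
const insideLine = withTrace('abc123', () => formatLog('charging card'));
const outsideLine = formatLog('server started');

withTrace('abc123', () => log('charging card'));
log('server started');
```

In a web server, the `withTrace` call would wrap each request handler, so no function further down the stack has to accept a trace ID argument just to log it.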

What's your go-to debugging strategy for microservices? Drop a comment!
