Last week, our checkout service started timing out randomly. 2% of requests, no pattern, no obvious cause. Here's how I tracked it down using trace IDs.
## The Problem
Our architecture:
- API Gateway → Auth Service → Cart Service → Payment Service → Notification Service
Users reported "payment failed" but logs showed success everywhere. Classic distributed systems nightmare.
## The Fix: Following the Trace ID
Every request gets a unique trace ID at the gateway. I grep'd for a failing request's trace ID across all services:
    grep "trace_id=abc123" /var/log/*.log
Found it: Cart Service was returning 200, but with an empty response body. Payment Service treated empty as "no items" and silently skipped.
Without trace IDs, I'd still be searching.
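The root cause was a 200 with an empty body being read as "no items". Here's a minimal sketch of a stricter check on the consuming side, assuming the cart comes back as JSON with an `items` array. The real payload isn't shown in this post, so `parseCartResponse` and the field names are hypothetical:

```javascript
// Hypothetical guard: tell "the cart is really empty" apart from
// "the upstream response is malformed". Field names are assumptions.
function parseCartResponse(status, bodyText, traceId) {
  if (status !== 200) {
    throw new Error(`[${traceId}] cart service returned HTTP ${status}`);
  }
  if (!bodyText || bodyText.trim() === '') {
    // A 200 with no body is an upstream bug, not an empty cart.
    throw new Error(`[${traceId}] cart service returned 200 with an empty body`);
  }
  let cart;
  try {
    cart = JSON.parse(bodyText);
  } catch (err) {
    throw new Error(`[${traceId}] cart response is not valid JSON: ${err.message}`);
  }
  if (!cart || !Array.isArray(cart.items)) {
    throw new Error(`[${traceId}] cart response has no "items" array`);
  }
  // An empty array here means a genuinely empty cart.
  return cart.items;
}
```

Failing loudly with the trace ID in the message means the error shows up in the same grep that found the problem in the first place.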
## Trace ID vs Correlation ID
One thing that confused me early on: what's the difference between a trace ID and a correlation ID?
A trace ID identifies a single request as it crosses service boundaries, and the spans recorded under it carry parent-child relationships (more detail: https://last9.io/blog/correlation-id-vs-trace-id/). Correlation IDs are simpler: just a shared identifier without the hierarchy.
For debugging microservices, trace IDs give you the full picture.
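To make the difference concrete, here's a rough sketch. The record shape (`traceId`, `spanId`, `parentSpanId`, `name`) loosely follows common tracing conventions but is illustrative, not any specific tool's schema. `buildTree` and `render` are hypothetical helpers, and the span data is made up:

```javascript
// With only a correlation ID, records are a flat bag sharing one identifier.
// With trace + span IDs, each record also points at its parent, so the
// call hierarchy can be reconstructed.
const spans = [
  { traceId: 'abc123', spanId: 's1', parentSpanId: null, name: 'API Gateway' },
  { traceId: 'abc123', spanId: 's2', parentSpanId: 's1', name: 'Auth Service' },
  { traceId: 'abc123', spanId: 's3', parentSpanId: 's2', name: 'Cart Service' },
  { traceId: 'abc123', spanId: 's4', parentSpanId: 's3', name: 'Payment Service' },
];

// Recursively nest each span under the span that called it.
function buildTree(allSpans, parentSpanId = null) {
  return allSpans
    .filter((s) => s.parentSpanId === parentSpanId)
    .map((s) => ({ name: s.name, children: buildTree(allSpans, s.spanId) }));
}

// Render the tree as indented text, one span per line.
function render(nodes, depth = 0) {
  return nodes
    .map((n) => `${'  '.repeat(depth)}${n.name}\n${render(n.children, depth + 1)}`)
    .join('');
}

console.log(render(buildTree(spans)));
```

Drop the `spanId` and `parentSpanId` fields and you're left with what a correlation ID gives you: you can still filter the records by ID, but you can't nest them.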
## Quick Implementation (Node.js)
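```javascript
const express = require('express');
const { trace } = require('@opentelemetry/api');

const app = express(); // assumes an Express app

// Attach the active OpenTelemetry trace ID to each request and log it.
app.use((req, res, next) => {
  // getActiveSpan() returns undefined when no span is active.
  const span = trace.getActiveSpan();
  const traceId = span?.spanContext().traceId;
  req.traceId = traceId;
  console.log(`[${traceId}] ${req.method} ${req.path}`);
  next();
});
```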
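The middleware above logs one line per request. To get the trace ID onto every line, one option is a tiny logger bound to the request's trace ID. This is a sketch: `createLogger` is a hypothetical helper, and the line format is an assumption chosen so that the `trace_id=...` grep from earlier would match it:

```javascript
// Hypothetical helper: returns a logger that stamps every line with the trace ID.
// `write` is injectable so the sketch can be exercised without touching stdout.
function createLogger(traceId, write = console.log) {
  const line = (level, message) =>
    `${new Date().toISOString()} ${level} trace_id=${traceId ?? 'unknown'} ${message}`;
  return {
    info: (message) => write(line('INFO', message)),
    error: (message) => write(line('ERROR', message)),
  };
}

// Example: inside the middleware you might set req.log = createLogger(req.traceId)
const log = createLogger('abc123');
log.info('cart fetched');
log.error('cart response body was empty');
```

Handlers would then call `req.log.info(...)` instead of `console.log(...)`, so nobody has to remember to add the ID by hand.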
## Key Takeaways
- Always propagate trace IDs across services
- Log the trace ID with every log line
- Use distributed tracing tools (Jaeger, Tempo, Last9) to visualize
- When debugging, start with the trace ID and follow it
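As a rough illustration of the first takeaway, here's what propagation might look like by hand, assuming the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). This post doesn't say which propagation format the services use, and in practice an OpenTelemetry SDK would normally handle this for you. `ensureTraceId` and `outgoingHeaders` are hypothetical helpers:

```javascript
const crypto = require('crypto');

// W3C Trace Context: 00-<32 hex trace-id>-<16 hex parent-id>-<2 hex flags>
const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZERO = /^0+$/; // all-zero IDs are invalid per the spec

// Reuse the caller's trace ID if a valid header came in; otherwise start a new trace.
function ensureTraceId(incomingHeaders) {
  const match = TRACEPARENT.exec(incomingHeaders['traceparent'] || '');
  if (match && !ALL_ZERO.test(match[1]) && !ALL_ZERO.test(match[2])) {
    return match[1];
  }
  return crypto.randomBytes(16).toString('hex');
}

// Headers to attach to every downstream call so the next service joins the same trace.
function outgoingHeaders(traceId) {
  const spanId = crypto.randomBytes(8).toString('hex'); // new span ID for this hop
  return { traceparent: `00-${traceId}-${spanId}-01` }; // 01 = sampled flag
}
```

Each service would call `ensureTraceId(req.headers)` on the way in and pass `outgoingHeaders(traceId)` along with any request it makes to the next service.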
What's your go-to debugging strategy for microservices? Drop a comment!