## Unraveling the Chaos: How to Trace Requests in Distributed Architectures
In a world where microservices reign supreme and complexity grows exponentially, understanding the flow of a request across multiple services has become a monumental challenge. Imagine a bank transaction, an e-commerce search, or order processing – each could trigger a cascade of calls between dozens of services. When something goes wrong, or even to optimize performance, pinpointing the source of the problem can feel like searching for a needle in a haystack. This is where distributed tracing comes in.
### Why Trace Requests?
Distributed architectures, despite their benefits in scalability and resilience, make observability harder: a single request may cross many processes, each with its own logs. Without an effective tracing mechanism, debugging production issues becomes a nightmare. Distributed tracing allows us to:
- Diagnose Failures: Quickly identify which service failed and at what point in the request chain.
- Optimize Performance: Map bottlenecks and latencies at each processing step.
- Understand Flow: Visualize the complete journey of a request across different services.
- Support Auditing and Compliance: Record the history of operations for security and compliance purposes.
### The Magic of Distributed Tracing: Fundamental Concepts
The essence of distributed tracing lies in propagating a unique identifier, the Trace ID, through all network calls between services. Each operation within a service is represented by a Span, which carries its own unique ID, the Trace ID it belongs to, the Parent Span ID (if applicable), the operation name, start and end timestamps, and metadata (attributes and events in OpenTelemetry; tags and logs in older tracers such as OpenTracing).
- Trace ID: A unique global identifier for a complete request, from the entry point to the end.
- Span ID: A unique identifier for a unit of work within a service (e.g., a database call, an internal HTTP request).
- Parent Span ID: The ID of the Span that originated the current Span. Essential for building the dependency tree.
- Context Propagation: The key to connecting Spans. The Trace ID, Span ID, and other metadata are injected into the headers of requests (HTTP, gRPC, message queues) and extracted by consuming services.
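To make context propagation concrete, here is a minimal sketch of the W3C Trace Context `traceparent` header, the format OpenTelemetry's default propagator uses over HTTP. The helper names `formatTraceparent` and `parseTraceparent` are hypothetical and exist only for illustration; in real code, `propagation.inject` and `propagation.extract` do this work. The sketch only accepts version `00` of the header.

```typescript
// Hypothetical helpers illustrating the W3C `traceparent` header layout:
//   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
interface TraceContext {
  traceId: string;  // 16 bytes, hex-encoded
  spanId: string;   // 8 bytes, hex-encoded (the caller's span)
  sampled: boolean; // lowest bit of trace-flags
}

// Serializes a trace context into a version-00 traceparent value.
function formatTraceparent(ctx: TraceContext): string {
  const flags = ctx.sampled ? "01" : "00";
  return `00-${ctx.traceId}-${ctx.spanId}-${flags}`;
}

// Parses a version-00 traceparent value; returns undefined if it is malformed.
function parseTraceparent(header: string): TraceContext | undefined {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return undefined;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Caller side: write the context into the outgoing headers.
const outgoing: Record<string, string> = {};
outgoing["traceparent"] = formatTraceparent({
  traceId: "4bf92f3577b34da6a3ce929d0e0e4736",
  spanId: "00f067aa0ba902b7",
  sampled: true,
});

// Callee side: read the context back from the incoming headers.
const incoming = parseTraceparent(outgoing["traceparent"]);
console.log(incoming);
```

The callee creates its own Span ID and records the extracted `spanId` as its Parent Span ID, which is how the two services end up in the same trace.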
### Getting Hands-On: Implementing with TypeScript/Node.js
Let's illustrate with a practical example using Node.js and OpenTelemetry, a vendor-neutral observability framework that has become the de facto standard for instrumentation.
Imagine two services: service-a and service-b (named servico-a and servico-b in the code). service-a calls service-b.
Prerequisites:
- Node.js installed
- npm or yarn
Installing Dependencies:
npm install express axios @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
# or
yarn add express axios @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
Configuring OpenTelemetry (in both services):
// otel.config.ts (in both services)
// NodeSDK is exported by @opentelemetry/sdk-node (not by @opentelemetry/sdk-trace-node)
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
// Set the OTEL_SERVICE_NAME environment variable per service so each one is labeled in the traces
const sdk = new NodeSDK({
// Configures the exporter to send traces to a backend (e.g., Jaeger, Tempo)
// Ensure your backend is running and accessible at 'http://localhost:4318/v1/traces'
// or adjust the URL as needed.
traceExporter: new OTLPTraceExporter({
url: process.env.OTLP_ENDPOINT || 'http://localhost:4318/v1/traces', // Your OTLP collector endpoint
}),
instrumentations: [
// Automatic instrumentations to capture spans from common libraries.
// This bundle already includes the HTTP and Express instrumentations,
// so they do not need to be registered a second time.
getNodeAutoInstrumentations(),
],
});
// Initializes the OpenTelemetry SDK
sdk.start();
// Adds a handler to ensure spans are exported upon application shutdown
process.on('SIGTERM', () => {
sdk.shutdown()
.then(() => console.log('Tracing terminated'))
.catch((error) => console.error('Error terminating tracing', error))
.finally(() => process.exit(0));
});
export default sdk;
Service A (servico-a.ts):
// servico-a.ts
import './otel.config'; // Must be the first import so instrumentation is registered before express and http are loaded
import express, { Request, Response } from 'express';
import axios from 'axios';
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';
const app = express();
const port = 3000;
const servicoBUrl = process.env.SERVICO_B_URL || 'http://localhost:3001';
// Gets the OpenTelemetry tracer
const tracer = trace.getTracer('servico-a-tracer');
app.get('/processar', async (req: Request, res: Response) => {
// Creates a span for the incoming request in service-a
const currentSpan = tracer.startSpan('processar-request-servico-a', {
kind: SpanKind.SERVER, // Indicates this span represents work initiated by an external request
attributes: { // Adds useful attributes for the span
'http.method': req.method,
'http.url': req.url,
'http.target': req.originalUrl,
'http.host': req.headers.host,
'net.peer.ip': req.socket.remoteAddress,
}
});
// Activates the current context so that child spans are associated with this trace
context.with(trace.setSpan(context.active(), currentSpan), async () => {
try {
console.log('Starting processing in Service A...');
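// Injects the current trace context into a carrier object that becomes the request headers.
// With HTTP auto-instrumentation enabled this also happens automatically; it is shown here to make the mechanism explicit.
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);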
// Calls Service B, injecting the tracing context into the headers
const responseServicoB = await axios.get(`${servicoBUrl}/dados`, {
headers: carrier // Injects the tracing context here
});
const dadosDoServicoB = responseServicoB.data;
console.log('Data received from Service B:', dadosDoServicoB);
// Processes the data...
const resultado = `Service A processed successfully. Data from B: ${dadosDoServicoB.message}`;
// Sets the span status to success
currentSpan.setStatus({ code: SpanStatusCode.OK });
currentSpan.addEvent('Service B called successfully.'); // Adds an event to the span
res.json({ message: resultado });
} catch (error) {
console.error('Error processing in Service A:', error);
// Sets the span status to error
currentSpan.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
currentSpan.recordException(error as Error); // Records the exception in the span
res.status(500).json({ message: 'Internal Server Error in Service A' });
} finally {
// Ends the current span. IMPORTANT: Always end the span!
currentSpan.end();
}
});
});
app.listen(port, () => {
console.log(`Service A listening on port ${port}`);
});
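The `context.with(...)` call above depends on a context manager that keeps the active span available across `await` boundaries. In Node.js this is typically built on `AsyncLocalStorage`. The sketch below shows only that underlying mechanism; `activeSpan`, `withActiveSpan`, `currentSpanId`, and `callDownstream` are hypothetical names, not OpenTelemetry APIs.

```typescript
import { AsyncLocalStorage } from "node:async_hooks";

// Hypothetical stand-in for "the active span": just an ID here.
interface ActiveSpan {
  spanId: string;
}

const activeSpan = new AsyncLocalStorage<ActiveSpan>();

// Runs fn with the given span active, similar in spirit to context.with(...).
function withActiveSpan<T>(span: ActiveSpan, fn: () => T): T {
  return activeSpan.run(span, fn);
}

// Reads the span that is active for the current async call chain, if any.
function currentSpanId(): string | undefined {
  return activeSpan.getStore()?.spanId;
}

// Simulates an outgoing call; the store is still visible after the await.
async function callDownstream(): Promise<string | undefined> {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return currentSpanId();
}

const seen = withActiveSpan({ spanId: "span-a" }, () => callDownstream());
seen.then((id) => console.log("inside the callback chain:", id));
console.log("outside any active span:", currentSpanId());
```

Because the store follows the async call chain rather than a global variable, two concurrent requests each see their own span, which is why code running inside the `context.with` callback can call `context.active()` without passing the span around explicitly.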
Service B (servico-b.ts):
// servico-b.ts
import './otel.config'; // Must be the first import so instrumentation is registered before express and http are loaded
import express, { Request, Response } from 'express';
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';
const app = express();
const port = 3001;
// Gets the OpenTelemetry tracer
const tracer = trace.getTracer('servico-b-tracer');
app.get('/dados', (req: Request, res: Response) => {
// Extracts the tracing context from the incoming request headers
const extractedContext = propagation.extract(context.active(), req.headers);
// Creates a span for the incoming request in service-b, associating it with the existing trace
// If the context was extracted, it will be used as the parent context. Otherwise, a new trace will be initiated.
const currentSpan = tracer.startSpan('processar-request-servico-b', {
kind: SpanKind.SERVER,
attributes: {
'http.method': req.method,
'http.url': req.url,
'http.target': req.originalUrl,
'http.host': req.headers.host,
'net.peer.ip': req.socket.remoteAddress,
},
// Passes the extracted context as the third argument so this span joins the original trace.
// (SpanOptions has no `parent` field in the current OpenTelemetry JS API.)
}, extractedContext);
// Activates the new span on top of the extracted context
context.with(trace.setSpan(extractedContext, currentSpan), () => {
try {
console.log('Starting processing in Service B...');
// Simulates some processing
const dadosProcessados = { message: 'Mocked data from Service B!' };
// Sets the span status to success
currentSpan.setStatus({ code: SpanStatusCode.OK });
currentSpan.addEvent('Processing in Service B completed.');
res.json(dadosProcessados);
} catch (error) {
console.error('Error processing in Service B:', error);
// Sets the span status to error
currentSpan.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
currentSpan.recordException(error as Error);
res.status(500).json({ message: 'Internal Server Error in Service B' });
} finally {
// Ends the current span
currentSpan.end();
}
});
});
app.listen(port, () => {
console.log(`Service B listening on port ${port}`);
});
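Once both services export their spans, the backend reassembles them into a tree using the Parent Span ID described earlier. Here is a minimal sketch of that step, assuming a simplified `SpanRecord` shape; the field names, span names, IDs, and durations are made up for illustration, and real backends store many more fields.

```typescript
// Simplified, hypothetical span record; field names are illustrative.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  spanName: string;
  durationMs: number;
}

interface SpanNode extends SpanRecord {
  children: SpanNode[];
}

// Groups the spans of one trace into a tree by following parentSpanId links.
function buildTree(records: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const record of records) {
    nodes.set(record.spanId, { ...record, children: [] });
  }
  const roots: SpanNode[] = [];
  nodes.forEach((node) => {
    const parent = node.parentSpanId ? nodes.get(node.parentSpanId) : undefined;
    // Spans whose parent is missing (for example, not exported yet) are treated as roots.
    if (parent) parent.children.push(node);
    else roots.push(node);
  });
  return roots;
}

// Renders the tree as indented lines, one span per line.
function render(nodes: SpanNode[], depth = 0, lines: string[] = []): string[] {
  for (const node of nodes) {
    lines.push(`${"  ".repeat(depth)}${node.spanName} (${node.durationMs} ms)`);
    render(node.children, depth + 1, lines);
  }
  return lines;
}

const spans: SpanRecord[] = [
  { traceId: "t1", spanId: "a1", spanName: "servico-a GET /processar", durationMs: 120 },
  { traceId: "t1", spanId: "a2", parentSpanId: "a1", spanName: "servico-a HTTP GET /dados", durationMs: 95 },
  { traceId: "t1", spanId: "b1", parentSpanId: "a2", spanName: "servico-b GET /dados", durationMs: 80 },
];

console.log(render(buildTree(spans)).join("\n"));
```

If context propagation fails between the services, the span from servico-b arrives with a different Trace ID and no parent, so it shows up as a separate single-span trace. That symptom is usually the first thing to check when traces look fragmented.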
Important Notes:
- Context Propagation: The magic happens in `propagation.inject` and `propagation.extract`. `axios` in `service-a` sends the tracing headers, and `service-b` reads them to continue the same trace.
- Span Kind: `SpanKind.SERVER` indicates that the span was initiated due to an external request. `SpanKind.CLIENT` would be used for outgoing calls (like the `axios` call in `service-a`, if we weren't using automatic instrumentation).
- Span Lifecycle Management: It is crucial to start (`startSpan`) and end (`end`) each span. We use `try...catch...finally` to ensure `currentSpan.end()` is always called.
- Status and Events: Setting the `SpanStatusCode` and adding events (like `recordException`) provides valuable information about what happened during the span's execution.
- Aggregation Backend: The exporter configured above sends traces to an OTLP endpoint (such as the OpenTelemetry Collector). You will need a visualization backend (like Jaeger, Zipkin, or Grafana Tempo) to inspect the traces.
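The span lifecycle rule from the notes can be wrapped in a small helper so that no call site forgets `end()`. This is a sketch under assumptions: `MinimalSpan` is a hypothetical interface exposing only the three methods used here (modeled on the `setStatus`, `recordException`, and `end` calls in the services above, with string status codes instead of `SpanStatusCode`), and `withSpan` and `recordingSpan` are not part of OpenTelemetry.

```typescript
// Hypothetical minimal span shape; a real OpenTelemetry Span has more methods.
interface MinimalSpan {
  setStatus(status: { code: "OK" | "ERROR"; message?: string }): void;
  recordException(error: Error): void;
  end(): void;
}

// Runs `work`, records the outcome on the span, and always ends the span.
async function withSpan<T>(span: MinimalSpan, work: () => Promise<T>): Promise<T> {
  try {
    const result = await work();
    span.setStatus({ code: "OK" });
    return result;
  } catch (error) {
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({ code: "ERROR", message: err.message });
    span.recordException(err);
    throw err; // the caller decides how to respond (e.g., send a 500)
  } finally {
    span.end(); // reached on both the success and the failure path
  }
}

// In-memory span used only to observe which calls the helper makes.
function recordingSpan(log: string[]): MinimalSpan {
  return {
    setStatus: (status) => { log.push(`status:${status.code}`); },
    recordException: (error) => { log.push(`exception:${error.message}`); },
    end: () => { log.push("end"); },
  };
}

const okLog: string[] = [];
withSpan(recordingSpan(okLog), async () => 42).then(() => console.log(okLog));

const errLog: string[] = [];
withSpan(recordingSpan(errLog), async () => {
  throw new Error("boom");
}).catch(() => console.log(errLog));
```

The OpenTelemetry API offers `tracer.startActiveSpan` for a related purpose (it creates the span and makes it active for the callback), but ending the span is still the caller's job, so a wrapper along these lines can still be useful.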
### Conclusion
Distributed tracing is no longer a luxury but a necessity in modern architectures. By implementing practices like context propagation and proper instrumentation, we gain essential visibility to debug, optimize, and understand the complex behavior of our distributed systems. Tools like OpenTelemetry provide a solid foundation for building this observability, empowering teams to navigate the labyrinth of microservices with confidence. Remember: what cannot be measured cannot be improved. And in the distributed world, tracing is our primary measurement tool.