Lucas Pereira de Souza

Posted on Jan 21

Observability with OpenTelemetry

#architecture #microservices #monitoring #performance

## Unraveling the Chaos: How to Trace Requests in Distributed Architectures

In a world where microservices reign supreme and complexity grows exponentially, understanding the flow of a request across multiple services has become a monumental challenge. Imagine a bank transaction, an e-commerce search, or order processing – each could trigger a cascade of calls between dozens of services. When something goes wrong, or even to optimize performance, pinpointing the source of the problem can feel like searching for a needle in a haystack. This is where distributed tracing comes in.

Why Trace Requests?

Distributed architectures, despite their benefits in scalability and resilience, are harder to observe: a single request may pass through many processes, and no single service sees the whole picture. Without an effective tracing mechanism, debugging production issues becomes a nightmare. Distributed tracing allows us to:

Diagnose Failures: Quickly identify which service failed and at what point in the request chain.
Optimize Performance: Map bottlenecks and latencies at each processing step.
Understand Flow: Visualize the complete journey of a request across different services.
Support Auditing and Compliance: Record the history of operations for security and compliance purposes.

The Magic of Distributed Tracing: Fundamental Concepts

The essence of distributed tracing lies in propagating a unique identifier, the Trace ID, through all network calls between services. Each operation within a service is represented by a Span, which carries its own unique ID, the Trace ID it belongs to, the Parent Span ID (absent on the root span), the operation name, and metadata (called attributes and events in OpenTelemetry; older tracers call them tags and logs).
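To make the relationship between these fields concrete, here is a minimal sketch of how a tracing backend could reassemble a trace from the spans it receives. The `SpanRecord` and `SpanNode` types and both functions are hypothetical and simplified for illustration; real spans also carry timestamps, attributes, events, and a status.

```typescript
// Hypothetical, simplified span record (not the OpenTelemetry data model).
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // absent on the root span
  name: string;
}

interface SpanNode {
  span: SpanRecord;
  children: SpanNode[];
}

// Rebuilds the call tree of one trace by linking each span to its parent.
function buildSpanTree(spans: SpanRecord[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  for (const span of spans) {
    nodes.set(span.spanId, { span, children: [] });
  }

  const roots: SpanNode[] = [];
  for (const span of spans) {
    const node = nodes.get(span.spanId)!;
    const parent = span.parentSpanId ? nodes.get(span.parentSpanId) : undefined;
    if (parent) {
      parent.children.push(node);
    } else {
      roots.push(node); // root span, or a span whose parent was never collected
    }
  }
  return roots;
}

// Renders the tree as indented lines, one line per span.
function renderTree(nodes: SpanNode[], depth = 0, lines: string[] = []): string[] {
  for (const node of nodes) {
    lines.push('  '.repeat(depth) + node.span.name);
    renderTree(node.children, depth + 1, lines);
  }
  return lines;
}
```

Spans can arrive in any order and from different services; the Parent Span ID alone is enough to put them back into a tree.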

Trace ID: A unique global identifier for a complete request, from the entry point to the end.
Span ID: A unique identifier for a unit of work within a service (e.g., a database call, an internal HTTP request).
Parent Span ID: The ID of the Span that originated the current Span. Essential for building the dependency tree.
Context Propagation: The key to connecting Spans. The Trace ID, Span ID, and other metadata are injected into the headers of requests (HTTP, gRPC, message queues) and extracted by consuming services.
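On HTTP, OpenTelemetry's default propagator carries this context in the W3C Trace Context `traceparent` header, which has the shape `00-<trace id>-<span id>-<flags>`. The sketch below formats and parses that header by hand purely to show what travels on the wire; the `TraceParent` type and both function names are hypothetical, only version `00` is handled, and in a real service the OpenTelemetry propagation API does this work.

```typescript
// Fields carried by a W3C `traceparent` header (version 00).
interface TraceParent {
  traceId: string; // 32 lowercase hex characters
  spanId: string; // 16 lowercase hex characters (the caller's span)
  sampled: boolean; // lowest bit of the trace flags
}

const TRACEPARENT_PATTERN = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Serializes the context into the header value sent with an outgoing request.
function formatTraceParent(tp: TraceParent): string {
  const flags = tp.sampled ? '01' : '00';
  return `00-${tp.traceId}-${tp.spanId}-${flags}`;
}

// Parses an incoming header value; returns undefined when it is malformed.
function parseTraceParent(header: string): TraceParent | undefined {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return undefined;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return undefined;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}
```

The receiving service reuses the Trace ID unchanged and records the incoming Span ID as the Parent Span ID of the span it creates.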

Getting Hands-On: Implementing with TypeScript/Node.js

Let's illustrate with a practical example using Node.js and a popular library like OpenTelemetry, which has become the de facto standard for observability.

Imagine two services, service-a and service-b (named servico-a and servico-b in the code below), where service-a calls service-b.

Prerequisites:

Node.js installed
npm or yarn

Installing Dependencies:

npm install express axios @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-trace-base @opentelemetry/sdk-trace-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http
# or
yarn add express axios @opentelemetry/api @opentelemetry/sdk-node @opentelemetry/sdk-trace-base @opentelemetry/sdk-trace-node @opentelemetry/auto-instrumentations-node @opentelemetry/exporter-trace-otlp-http

Configuring OpenTelemetry (in both services):

// otel.config.ts (in both services)
import { NodeSDK } from '@opentelemetry/sdk-node'; // NodeSDK is exported by sdk-node, not sdk-trace-node
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // Sends traces over OTLP/HTTP to a collector or backend (e.g., Jaeger, Tempo).
  // Make sure something is listening at 'http://localhost:4318/v1/traces',
  // or set OTLP_ENDPOINT to the correct URL.
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTLP_ENDPOINT || 'http://localhost:4318/v1/traces', // Your OTLP collector endpoint
  }),
  instrumentations: [
    // Automatic instrumentations to capture spans from common libraries.
    // This bundle already includes the HTTP and Express instrumentations,
    // so registering them again separately would be redundant.
    getNodeAutoInstrumentations(),
  ],
});

// Initializes the OpenTelemetry SDK
sdk.start();

// Adds a handler to ensure spans are exported upon application shutdown
process.on('SIGTERM', () => {
  sdk.shutdown()
    .then(() => console.log('Tracing terminated'))
    .catch((error) => console.error('Error terminating tracing', error))
    .finally(() => process.exit(0));
});

export default sdk;
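For a first local run without a collector, one option is to swap the OTLP exporter for the SDK's ConsoleSpanExporter, which prints each finished span to stdout. This is a minimal configuration sketch, assuming `@opentelemetry/sdk-node` is installed alongside the packages above; the file name `otel.debug.ts` is hypothetical.

```typescript
// otel.debug.ts (hypothetical file name): local debugging variant of the configuration
import { NodeSDK } from '@opentelemetry/sdk-node';
import { ConsoleSpanExporter } from '@opentelemetry/sdk-trace-base';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

const sdk = new NodeSDK({
  // Prints finished spans to stdout instead of sending them over the network
  traceExporter: new ConsoleSpanExporter(),
  instrumentations: [getNodeAutoInstrumentations()],
});

sdk.start();

export default sdk;
```

Setting the standard `OTEL_SERVICE_NAME` environment variable (for example `OTEL_SERVICE_NAME=service-a`) labels each service's spans, which makes the output of the two services easier to tell apart.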

Service A (servico-a.ts):

// servico-a.ts
import './otel.config'; // Must come first so instrumentation is registered before express and http are loaded
import express, { Request, Response } from 'express';
import axios from 'axios';
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const app = express();
const port = 3000;
const servicoBUrl = process.env.SERVICO_B_URL || 'http://localhost:3001';

// Gets the OpenTelemetry tracer
const tracer = trace.getTracer('servico-a-tracer');

app.get('/processar', async (req: Request, res: Response) => {
  // Creates a span for the incoming request in service-a
  const currentSpan = tracer.startSpan('processar-request-servico-a', {
    kind: SpanKind.SERVER, // Indicates this span represents work initiated by an external request
    attributes: { // Adds useful attributes for the span
      'http.method': req.method,
      'http.url': req.url,
      'http.target': req.originalUrl,
      'http.host': req.headers.host,
      'net.peer.ip': req.socket.remoteAddress,
    }
  });

  // Activates the current context so that child spans are associated with this trace
  context.with(trace.setSpan(context.active(), currentSpan), async () => {
    try {
      console.log('Starting processing in Service A...');

      // Injects the current trace context into a plain header carrier for the outgoing request
      const carrier: Record<string, string> = {};
      propagation.inject(context.active(), carrier);

      // Calls Service B, injecting the tracing context into the headers
      const responseServicoB = await axios.get(`${servicoBUrl}/dados`, {
        headers: carrier // Injects the tracing context here
      });

      const dadosDoServicoB = responseServicoB.data;
      console.log('Data received from Service B:', dadosDoServicoB);

      // Processes the data...
      const resultado = `Service A processed successfully. Data from B: ${dadosDoServicoB.message}`;

      // Sets the span status to success
      currentSpan.setStatus({ code: SpanStatusCode.OK });
      currentSpan.addEvent('Service B called successfully.'); // Adds an event to the span

      res.json({ message: resultado });

    } catch (error) {
      console.error('Error processing in Service A:', error);
      // Sets the span status to error
      currentSpan.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      currentSpan.recordException(error as Error); // Records the exception in the span
      res.status(500).json({ message: 'Internal Server Error in Service A' });
    } finally {
      // Ends the current span. IMPORTANT: Always end the span!
      currentSpan.end();
    }
  });
});

app.listen(port, () => {
  console.log(`Service A listening on port ${port}`);
});

Service B (servico-b.ts):

// servico-b.ts
import './otel.config'; // Must come first so instrumentation is registered before express and http are loaded
import express, { Request, Response } from 'express';
import { trace, context, propagation, SpanKind, SpanStatusCode } from '@opentelemetry/api';

const app = express();
const port = 3001;

// Gets the OpenTelemetry tracer
const tracer = trace.getTracer('servico-b-tracer');

app.get('/dados', (req: Request, res: Response) => {
  // Extracts the tracing context from the incoming request headers
  const extractedContext = propagation.extract(context.active(), req.headers);

  // Creates a span for the incoming request in service-b, associating it with the existing trace.
  // The extracted context is passed as the third argument to startSpan (SpanOptions has no
  // `parent` field). If no tracing headers were present, a new trace is started.
  const currentSpan = tracer.startSpan('processar-request-servico-b', {
    kind: SpanKind.SERVER,
    attributes: {
      'http.method': req.method,
      'http.url': req.url,
      'http.target': req.originalUrl,
      'http.host': req.headers.host,
      'net.peer.ip': req.socket.remoteAddress,
    },
  }, extractedContext);

  // Activates the new span on top of the extracted context
  context.with(trace.setSpan(extractedContext, currentSpan), () => {
    try {
      console.log('Starting processing in Service B...');

      // Simulates some processing
      const dadosProcessados = { message: 'Mocked data from Service B!' };

      // Sets the span status to success
      currentSpan.setStatus({ code: SpanStatusCode.OK });
      currentSpan.addEvent('Processing in Service B completed.');

      res.json(dadosProcessados);

    } catch (error) {
      console.error('Error processing in Service B:', error);
      // Sets the span status to error
      currentSpan.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      currentSpan.recordException(error as Error);
      res.status(500).json({ message: 'Internal Server Error in Service B' });
    } finally {
      // Ends the current span
      currentSpan.end();
    }
  });
});

app.listen(port, () => {
  console.log(`Service B listening on port ${port}`);
});

Important Notes:

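Context Propagation: The magic happens in propagation.inject and propagation.extract. service-a writes the tracing headers into the outgoing axios request, and service-b reads them and hands the extracted context to startSpan to continue the same trace. With the HTTP auto-instrumentation enabled, the same headers are also injected and extracted automatically; the manual calls here make the mechanism visible.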
Span Kind: SpanKind.SERVER indicates that the span was initiated due to an external request. SpanKind.CLIENT would be used for outgoing calls (like the axios call in service-a, if we weren't using automatic instrumentation).
Span Lifecycle Management: It is crucial to start (startSpan) and end (end) each span. We use try...catch...finally to ensure currentSpan.end() is always called.
Status and Events: Setting the SpanStatusCode and adding events (like recordException) provides valuable information about what happened during the span's execution.
Aggregation Backend: The exporter configured above sends traces to an OTLP endpoint (such as the OpenTelemetry Collector). You will need a visualization backend (like Jaeger, Zipkin, or Grafana Tempo) to inspect the traces.
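The try...catch...finally pattern from the lifecycle note can be factored into a small helper so that no handler forgets to end its span. This is a sketch under stated assumptions: `SpanLike` and `withSpan` are hypothetical names, `SpanLike` is a minimal structural subset that OpenTelemetry's Span is assumed to satisfy, and the numeric constants mirror the SpanStatusCode enum (OK = 1, ERROR = 2).

```typescript
// Minimal structural subset of a span (hypothetical; defined locally so the sketch has no dependencies).
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): void;
  end(): void;
}

// Mirror OpenTelemetry's SpanStatusCode enum values.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Runs `work`, records success or failure on the span, and always ends the span.
async function withSpan<T>(span: SpanLike, work: () => Promise<T> | T): Promise<T> {
  try {
    const result = await work();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (err) {
    const error = err instanceof Error ? err : new Error(String(err));
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    span.recordException(error);
    throw err; // the caller still decides how to respond
  } finally {
    span.end(); // runs on both the success and the failure path
  }
}
```

A handler would then wrap its body as `await withSpan(currentSpan, async () => { ... })`. The OpenTelemetry API also offers `tracer.startActiveSpan`, which creates a span and makes it active for the duration of a callback; ending the span remains the caller's job there as well.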

Conclusion

Distributed tracing is no longer a luxury but a necessity in modern architectures. By implementing practices like context propagation and proper instrumentation, we gain essential visibility to debug, optimize, and understand the complex behavior of our distributed systems. Tools like OpenTelemetry provide a solid foundation for building this observability, empowering teams to navigate the labyrinth of microservices with confidence. Remember: what cannot be measured cannot be improved. And in the distributed world, tracing is our primary measurement tool.
