In a monolithic application, debugging a slow or failing request is straightforward: you have one codebase, one log stream, and one execution context to reason about. In a microservices architecture, a single user request can touch a dozen services, three databases, and two external APIs before a response is returned. When something goes wrong, where do you look?
This is the problem distributed tracing solves. By attaching a unique trace identifier to every request and propagating it across every service boundary, distributed tracing gives you a complete, chronological map of exactly what happened: which services were called, in what order, how long each took, and where failures occurred.
OpenTelemetry is the open-source observability standard that makes this possible across any language and infrastructure. And NestJS, with its modular architecture and middleware system, is exceptionally well-suited for clean OpenTelemetry integration.
This guide walks through setting up distributed tracing in NestJS from scratch: auto-instrumenting HTTP and database calls, creating custom spans for business logic, propagating context across service boundaries, and visualizing traces in Grafana Tempo.
How Distributed Tracing Works
Before writing code, it helps to understand the core concepts:
- Trace: the complete journey of a single request across all services. Identified by a traceId.
- Span: a single unit of work within a trace (e.g., an HTTP handler, a database query, an external API call). Each span has a start time, duration, and status.
- Context Propagation: the mechanism by which trace and span identifiers are passed between services, typically via HTTP headers (traceparent).
- Exporter: the component that sends collected spans to a backend (Tempo, Jaeger, Zipkin, Datadog).
A fully traced request looks like this:
Trace: usr_checkout_8f3a2c
├── [0ms] API Gateway → POST /checkout (12ms)
├── [12ms] orders-service → createOrder() (45ms)
│   ├── [14ms] PostgreSQL → INSERT orders (18ms)
│   └── [33ms] payments-service → chargeCard() (24ms)
│       └── [35ms] Stripe API → POST /charges (21ms)
└── [57ms] notifications-service → sendConfirmation() (8ms)
Every line in the tree is a span. Every span shares the same traceId. The entire tree is the trace.
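To make the traceparent header less abstract, here is a minimal sketch of how its four fields can be read, assuming the version-00 layout from the W3C Trace Context specification (version, trace ID, parent span ID, flags). The parseTraceParent function and TraceParent type are illustrative names, not part of OpenTelemetry; in a real service the SDK's propagator does this for you.

```typescript
// Sketch: parse a W3C Trace Context `traceparent` header (version 00).
// Layout: version "-" trace-id "-" parent-id "-" trace-flags, all lowercase hex.

interface TraceParent {
  version: string;  // 2 hex chars, "00" in the current spec
  traceId: string;  // 32 hex chars, shared by every span in the trace
  spanId: string;   // 16 hex chars, the caller's span (parent of the next span)
  sampled: boolean; // lowest bit of the trace-flags byte
}

const TRACEPARENT_RE =
  /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;

  const [, version, traceId, spanId, flags] = match;

  // The spec forbids version "ff" and all-zero trace or span IDs.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(spanId)) {
    return null;
  }

  return {
    version,
    traceId,
    spanId,
    sampled: (parseInt(flags, 16) & 0x01) === 1,
  };
}

// Example header taken from the W3C specification.
const example = parseTraceParent(
  '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01',
);
console.log(example?.traceId, example?.sampled);
```

Every service that receives this header keeps the same traceId and records the incoming spanId as the parent of its own span, which is how the tree above gets stitched together.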
Step 1 - Install OpenTelemetry Packages
The setup requires the core SDK, an OTLP trace exporter, and the auto-instrumentations bundle, which covers Node.js HTTP, Express (which NestJS runs on by default), and the common databases and HTTP clients your services use:
npm install \
@opentelemetry/sdk-node \
@opentelemetry/auto-instrumentations-node \
@opentelemetry/exporter-trace-otlp-http \
@opentelemetry/resources \
@opentelemetry/semantic-conventions \
@opentelemetry/api
Step 2 - Create the Tracing Bootstrap File
OpenTelemetry must be initialized before any other application code: before NestJS bootstraps, before TypeORM connects, before any modules load. Create a dedicated file for this:
// src/tracing.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
// Note: the Resource class is exported by @opentelemetry/resources 1.x;
// 2.x replaces it with resourceFromAttributes().
import { Resource } from '@opentelemetry/resources';
import {
  SEMRESATTRS_SERVICE_NAME,
  SEMRESATTRS_SERVICE_VERSION,
  SEMRESATTRS_DEPLOYMENT_ENVIRONMENT,
} from '@opentelemetry/semantic-conventions';

// Send spans over OTLP/HTTP. The url option is used as given,
// so it must include the /v1/traces path.
const exporter = new OTLPTraceExporter({
  url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318/v1/traces',
});

const sdk = new NodeSDK({
  // Resource attributes identify this service on every exported span
  resource: new Resource({
    [SEMRESATTRS_SERVICE_NAME]: process.env.SERVICE_NAME ?? 'unknown-service',
    [SEMRESATTRS_SERVICE_VERSION]: process.env.APP_VERSION ?? '1.0.0',
    [SEMRESATTRS_DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV ?? 'development',
  }),
  traceExporter: exporter,
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': { enabled: true },
      '@opentelemetry/instrumentation-express': { enabled: true },
      '@opentelemetry/instrumentation-pg': { enabled: true }, // PostgreSQL
      '@opentelemetry/instrumentation-redis': { enabled: true }, // Redis
      '@opentelemetry/instrumentation-dns': { enabled: false }, // too noisy
      '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
    }),
  ],
});

sdk.start();

// Flush pending spans before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
Then import it as the very first line of your entry point:
// src/main.ts
import './tracing'; // ← must be first, before all other imports
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';
async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  await app.listen(3000);
}
bootstrap();
With this in place, every inbound HTTP request, outbound HTTP call, and PostgreSQL query is automatically traced, with no additional code required in your controllers or services.
Step 3 - Adding Custom Spans for Business Logic
Auto-instrumentation covers infrastructure-level operations. Custom spans capture the business logic that matters to your domain: the operations auto-instrumentation doesn't know about.
// src/orders/orders.service.ts
import { Injectable } from '@nestjs/common';
import { trace, SpanStatusCode } from '@opentelemetry/api';
// CreateOrderDto, Order, this.orderRepo, and this.checkInventory() are assumed
// to be defined elsewhere in the application.

const tracer = trace.getTracer('orders-service');

@Injectable()
export class OrdersService {
  async createOrder(dto: CreateOrderDto): Promise<Order> {
    return tracer.startActiveSpan('orders.createOrder', async (span) => {
      try {
        // Add semantic attributes to the span
        span.setAttributes({
          'order.customerId': dto.customerId,
          'order.itemCount': dto.items.length,
          'order.currency': dto.currency,
        });
        const order = await this.processOrder(dto);
        span.setAttributes({ 'order.id': order.id });
        span.setStatus({ code: SpanStatusCode.OK });
        return order;
      } catch (error) {
        span.setStatus({
          code: SpanStatusCode.ERROR,
          message: error instanceof Error ? error.message : 'Unknown error',
        });
        span.recordException(error as Error);
        throw error;
      } finally {
        span.end(); // always end the span
      }
    });
  }

  private async processOrder(dto: CreateOrderDto): Promise<Order> {
    // Started while orders.createOrder is active, so this span
    // appears as its child in the trace tree
    return tracer.startActiveSpan('orders.processOrder', async (span) => {
      try {
        const inventory = await this.checkInventory(dto.items);
        span.addEvent('inventory.checked', { available: inventory.allAvailable });
        const order = await this.orderRepo.create(dto);
        span.addEvent('order.persisted', { orderId: order.id });
        return order;
      } finally {
        span.end();
      }
    });
  }
}
Key practices for custom spans:
- Always call span.end(). An unclosed span leaks memory and never exports. Use try/finally to guarantee it.
- Use semantic attribute names. Prefix with your domain (order., user., payment.) for consistent querying.
- Record exceptions with span.recordException(). This attaches the full stack trace to the span in Tempo.
- Add events for significant moments within a span (span.addEvent()). They appear as timestamped annotations on the span timeline.
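The try/catch/finally pattern from these practices repeats in every traced method, so it can be pulled into one helper. The sketch below is one possible shape, not an OpenTelemetry API: runInSpan and SpanLike are hypothetical names, and SpanLike lists only the three span methods the helper needs so the example runs without any packages installed. The numeric status codes mirror SpanStatusCode.OK and SpanStatusCode.ERROR from @opentelemetry/api.

```typescript
// Minimal structural view of a span. The Span type from @opentelemetry/api
// provides these methods (and more), so a real span can be passed in.
interface SpanLike {
  setStatus(status: { code: number; message?: string }): unknown;
  recordException(error: Error): void;
  end(): void;
}

// Numeric values of SpanStatusCode.OK and SpanStatusCode.ERROR.
const STATUS_OK = 1;
const STATUS_ERROR = 2;

// Runs `work`, records the outcome on the span, and always ends the span.
async function runInSpan<T>(
  span: SpanLike,
  work: () => Promise<T>,
): Promise<T> {
  try {
    const result = await work();
    span.setStatus({ code: STATUS_OK });
    return result;
  } catch (err) {
    // Normalize non-Error throwables so the span still gets a message.
    const error = err instanceof Error ? err : new Error(String(err));
    span.setStatus({ code: STATUS_ERROR, message: error.message });
    span.recordException(error);
    throw err; // rethrow so callers still see the failure
  } finally {
    span.end(); // runs on both the success and the failure path
  }
}
```

With a helper like this, a traced method can shrink to something along the lines of tracer.startActiveSpan('orders.createOrder', (span) => runInSpan(span, () => this.processOrder(dto))), with attributes and events still set inside the callback where they are needed.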
Step 4 - Context Propagation Across Services
A trace is only useful if it spans service boundaries. When orders-service calls users-service, the trace context must travel with the request via HTTP headers.
Good news: if both services use OpenTelemetry with HTTP auto-instrumentation enabled, context propagation happens automatically. The outbound HTTP call from orders-service injects traceparent and tracestate headers, and users-service extracts them, linking the spans into the same trace tree.
For manual HTTP clients (e.g., Axios without auto-instrumentation), inject headers explicitly:
import { propagation, context } from '@opentelemetry/api';
import axios from 'axios';
// USERS_SERVICE_URL is assumed to be defined elsewhere (e.g., loaded from configuration)
async function callUsersService(userId: string) {
  const headers: Record<string, string> = {};
  // Inject current trace context into headers
  propagation.inject(context.active(), headers);
  const response = await axios.get(`${USERS_SERVICE_URL}/users/${userId}`, {
    headers,
  });
  return response.data;
}
For NestJS microservices using TCP or message brokers (NATS, RabbitMQ, Kafka), inject trace context into the message payload or metadata:
// Injecting context into a NestJS microservice message
const carrier: Record<string, string> = {};
propagation.inject(context.active(), carrier);
this.client.emit('order_placed', {
  ...orderPayload,
  _traceContext: carrier, // carry trace headers in the message
});
On the consumer side, extract and activate the context before processing:
@EventPattern('order_placed')
async handleOrderPlaced(data: OrderPlacedEvent) {
  const parentContext = propagation.extract(context.active(), data._traceContext ?? {});
  await context.with(parentContext, async () => {
    // All spans created here are children of the originating trace
    await this.processOrder(data);
  });
}
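Messages may arrive from producers that are not instrumented yet, or with a damaged _traceContext field, so it can help to sanitize the carrier before extracting from it. The helper below is a hypothetical sketch (splitTraceContext is not a NestJS or OpenTelemetry function): it separates the _traceContext field used above from the business payload and keeps only string-valued entries.

```typescript
type Carrier = Record<string, string>;

// Splits an incoming message into a safe propagation carrier and the
// business payload, so handlers never see the tracing field.
function splitTraceContext(message: Record<string, unknown>): {
  carrier: Carrier;
  payload: Record<string, unknown>;
} {
  const { _traceContext, ...payload } = message;
  const carrier: Carrier = {};

  // Accept only plain objects; ignore missing, null, or array values.
  if (
    typeof _traceContext === 'object' &&
    _traceContext !== null &&
    !Array.isArray(_traceContext)
  ) {
    for (const [key, value] of Object.entries(_traceContext)) {
      // Propagation headers are strings; drop anything else.
      if (typeof value === 'string') {
        carrier[key] = value;
      }
    }
  }

  return { carrier, payload };
}

// A message without trace context yields an empty carrier.
const { carrier, payload } = splitTraceContext({ orderId: 'ord_1' });
console.log(Object.keys(carrier).length, payload.orderId);
```

The resulting carrier can be handed to propagation.extract() exactly as in the handler above. An empty carrier leaves the context unchanged, so spans created in the handler begin a new trace.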
Step 5 - Trace-Log Correlation
Distributed tracing and structured logging are most powerful when correlated. By injecting the current traceId and spanId into every log entry, you can jump from a log line directly to the corresponding trace in Tempo, and vice versa.
// src/common/middleware/trace-context.middleware.ts
import { Injectable, NestMiddleware } from '@nestjs/common';
import { Request, Response, NextFunction } from 'express';
import { trace } from '@opentelemetry/api';
import { PinoLogger } from 'nestjs-pino';

@Injectable()
export class TraceContextMiddleware implements NestMiddleware {
  constructor(private readonly logger: PinoLogger) {}

  use(req: Request, res: Response, next: NextFunction) {
    // The HTTP auto-instrumentation has already started a span for this request
    const span = trace.getActiveSpan();
    const spanContext = span?.spanContext();
    if (spanContext) {
      // assign() adds these fields to the request-scoped logger,
      // so every later log line in this request includes them
      this.logger.assign({
        traceId: spanContext.traceId,
        spanId: spanContext.spanId,
        traceFlags: spanContext.traceFlags,
      });
    }
    next();
  }
}
In Grafana, this enables Loki-to-Tempo linking: click a traceId in a Loki log query and jump directly to the full distributed trace in Tempo. This is the observability trifecta: logs, traces, and metrics unified in a single investigation workflow.
Step 6 - Visualizing Traces in Grafana Tempo
Grafana Tempo is the recommended trace backend for teams already on the Grafana stack (and pairs naturally with Grafana Faro for frontend-to-backend tracing, as covered in the previous article).
Send spans to Tempo via the OpenTelemetry Collector:
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [otlp]
Once traces are flowing into Tempo, configure Grafana dashboards to:
- Search traces by service name, operation name, duration, and status
- View flame graphs showing the time breakdown of every span in a trace
- Set alerts on p95/p99 latency for critical operations
- Link traces to logs via traceId in the Loki datasource configuration
Step 7 - CI/CD and Environment Configuration
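Manage OpenTelemetry configuration via environment variables. The OTEL_* variables follow the naming defined by the OpenTelemetry specification; SERVICE_NAME and APP_VERSION are custom variables read by src/tracing.ts: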
# .env.production
SERVICE_NAME=orders-service
APP_VERSION=2.4.1
NODE_ENV=production
OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector:4318/v1/traces
# Sample 10% of traces in production
OTEL_TRACES_SAMPLER=parentbased_traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
Sampling is critical in production. Tracing 100% of requests at scale generates enormous volume and cost. Use parent-based ratio sampling: if the upstream service sampled the request, all downstream services honor that decision, keeping traces complete while reducing overall volume.
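The decision logic behind parent-based ratio sampling can be sketched in a few lines. This is a simplified illustration, not the SDK's implementation: shouldSample and ParentDecision are hypothetical names, and the real sampler derives its value from the trace ID in its own way. The point is the shape of the rule: a propagated decision always wins, and root spans get a deterministic decision based on the trace ID.

```typescript
// The sampling decision carried by an incoming traceparent header, if any.
interface ParentDecision {
  sampled: boolean;
}

function shouldSample(
  traceId: string,
  ratio: number,
  parent?: ParentDecision,
): boolean {
  // Parent-based: honor the upstream decision so the trace stays complete
  // (either every service records it, or none does).
  if (parent) {
    return parent.sampled;
  }

  // Root span: map the first 8 hex chars of the trace ID to 0 .. 2^32 - 1
  // and keep the trace if it falls below the ratio threshold. Because the
  // input is the trace ID, the result is stable for a given trace.
  const bucket = parseInt(traceId.slice(0, 8), 16);
  return bucket < ratio * 2 ** 32;
}

// An unsampled parent wins even at ratio 1.0.
console.log(shouldSample('4bf92f3577b34da6a3ce929d0e0e4736', 1.0, { sampled: false }));
```

Only the service that starts the trace applies the ratio. Everything downstream follows the propagated flag, which is why a 10% setting yields complete traces for 10% of requests instead of random fragments.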
Common Pitfalls to Avoid
Initializing OpenTelemetry after NestJS bootstraps: If tracing.ts is not the first import in main.ts, auto-instrumentation patches won't apply to already-loaded modules. The SDK must load before any instrumented libraries.
Forgetting span.end(): Every span that isn't explicitly ended leaks memory and never exports. Always use try/finally blocks around custom spans.
Over-instrumenting: Not every function needs a custom span. Focus on operations with meaningful duration variance: external calls, database queries, and complex business logic. Instrumenting trivial utility functions adds noise without insight.
Ignoring sampling in production: Tracing 100% of requests is expensive. Configure ratio-based sampling early; retrofitting it later requires changes across every service.
Not propagating context through message brokers: Context propagation via HTTP headers is automatic, but message broker propagation requires manual injection and extraction. Skipping it breaks the trace tree at every async boundary.
Conclusion
Distributed tracing with OpenTelemetry and NestJS transforms debugging in microservices from guesswork into a precise, evidence-based workflow. A single traceId gives you the complete story of any request: every service it touched, every database query it triggered, every external call it made, and exactly where time was spent or errors occurred.
Set up auto-instrumentation first for immediate value with zero code changes, layer in custom spans for your critical business operations, enforce context propagation across all service boundaries, and correlate traces with your structured logs. Combined with Grafana Faro for frontend traces, you achieve the observability holy grail: end-to-end request visibility from the user's click to the database and back.
Running NestJS microservices on Kubernetes? OpenTelemetry's Kubernetes operator can inject the collector as a sidecar automatically, triggered by a pod annotation, with no hand-written collector containers in each pod spec. More on that in a future post.