The worst kind of production incident is the one where you're staring at a broken system and you have absolutely nothing to go on. No traces, no structured logs, no metrics. Just a user saying "it's not working" and a handful of console.log statements that, somehow, didn't fire.
That was v1. And when things broke — and they did break — we were flying blind.
## What We Actually Got Wrong in v1
It wasn't that we didn't care about observability. We just didn't treat it as a first-class concern. It was the kind of thing we told ourselves we'd add "later, once the core features are stable." Later never came. The system grew, the edge cases multiplied, and every debugging session became a painful exercise in guesswork.
The logs were unstructured strings. Some requests had context, most didn't. There was no way to correlate a frontend error with a specific backend request, let alone trace it through multiple services. When an error did bubble up, the message was something like "Something went wrong" — a message written in a hurry during a deadline and never revisited, or a huge pile of stack trace with no real meaning.
We had no concept of log levels applied consistently. No error codes. No request IDs threaded through the system. The word "metrics" meant checking the AWS console manually and hoping something looked obviously wrong.
## The v2 Decision: Observability Isn't a Feature, It's Infrastructure
When we started v2, we made one rule early: observability goes in before the first route handler. Not as a sprint backlog item, not as a nice-to-have after launch. Day zero.
That decision shaped everything. We picked OpenTelemetry as the foundation because it's vendor-neutral, now GA for traces and metrics (logs became stable in late 2024), and has a strong ecosystem around Node.js. We'd export everything via OTLP to a collector, and the collector would forward to CloudWatch. The pipeline looked like this:
```
App (OTEL SDK) → OTLP (gRPC) → ADOT Collector (sidecar) → CloudWatch
```
The ADOT (AWS Distro for OpenTelemetry) collector runs as a sidecar and handles the translation to CloudWatch Logs, Metrics, and X-Ray traces. It means the app doesn't need to know about AWS — it just speaks OTLP.
## The Bootstrap: Why Order of Operations Matters More Than You Think
Here's the thing nobody warns you about until you've wasted a day debugging it: OpenTelemetry auto-instrumentation only works if the SDK is initialized before your application code loads. If NestJS bootstraps first, the auto-instrumentation patches never take hold. You end up with an SDK running but capturing nothing.
The fix is to use a separate instrumentation entry point and ensure it runs before the app boots. Here's what that looks like:
```typescript
// instrumentation.ts — loaded via --require flag BEFORE app bootstrap
import 'dotenv/config';

import { createOtelSDK } from '@your-org/observability/sdk';

const collectorEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT;
const environment = process.env.ENVIRONMENT;

if (!environment) {
  // Fail loudly — missing env at startup is better than silent no-op observability
  throw new Error('Environment variable ENVIRONMENT is required for OTEL');
}

export const otelSDK: ReturnType<typeof createOtelSDK> = createOtelSDK({
  collectorEndpoint,
  environment,
  serviceName: 'my-service',
});

otelSDK.start();
```
This file gets loaded via `--require ./dist/instrumentation.js` in the Node startup command, before anything else runs. The SDK is live before a single NestJS module is instantiated. One thing to be explicit about: `--require` needs the compiled `.js` output, not the TypeScript source. Your `package.json` start script should look something like this:
```json
"scripts": {
  "start": "node --require ./dist/instrumentation.js ./dist/main.js",
  "start:dev": "tsx --require ./src/instrumentation.ts ./src/main.ts"
}
```
In production you're running the compiled output. Locally, `tsx` (or `ts-node`) handles the TypeScript without a build step.
## The SDK Factory: Configuring What You Actually Need
The createOtelSDK factory is where all three pillars — traces, metrics, logs — come together. It's also where you make important decisions about noise reduction.
```typescript
// sdk.ts — creates and configures the NodeSDK instance
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { PinoInstrumentation } from '@opentelemetry/instrumentation-pino';
import { Resource } from '@opentelemetry/resources';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { NodeSDK } from '@opentelemetry/sdk-node';
import { BatchSpanProcessor } from '@opentelemetry/sdk-trace-base';

export function createOtelSDK(config: OtelConfig): NodeSDK {
  // gRPC OTLP endpoint — no scheme prefix, just host:port (4317 is the standard gRPC port)
  // If you switch to HTTP exporters, use 'http://localhost:4318/v1/traces' instead
  const collectorEndpoint = config.collectorEndpoint || 'localhost:4317';
  const attributes = makeOtelAttributes(config);

  const sdk = new NodeSDK({
    resource: new Resource(attributes),
    // Batch spans before exporting — reduces overhead vs. SimpleSpanProcessor
    spanProcessor: new BatchSpanProcessor(
      new OTLPTraceExporter({ url: collectorEndpoint }),
    ),
    metricReader: new PeriodicExportingMetricReader({
      exporter: new OTLPMetricExporter({ url: collectorEndpoint }),
    }),
    logRecordProcessors: [
      // BatchLogRecordProcessor is a direct named export from @opentelemetry/sdk-logs
      new BatchLogRecordProcessor(
        new OTLPLogExporter({ url: collectorEndpoint }),
      ),
    ],
    instrumentations: [
      getNodeAutoInstrumentations({
        // fs instrumentation generates a span for EVERY file read — absolute noise
        '@opentelemetry/instrumentation-fs': { enabled: false },
        '@opentelemetry/instrumentation-pg': { enabled: true },
        '@opentelemetry/instrumentation-http': {
          // Health checks hit every few seconds — keep them out of your traces
          ignoreIncomingRequestHook: (request) => {
            return request.url?.includes('/health') ?? false;
          },
        },
      }),
      new PinoInstrumentation({
        // logHook is required — without it, trace context is NOT injected automatically
        logHook: (span, record) => {
          record['trace_id'] = span?.spanContext().traceId;
          record['span_id'] = span?.spanContext().spanId;
        },
      }),
      new ORPCInstrumentation(), // traces our RPC layer
    ],
  });

  // Handle both SIGTERM (containers/k8s) and SIGINT (Ctrl+C in dev) — don't lose buffered spans
  const shutdown = () =>
    sdk
      .shutdown()
      .then(
        () => console.log('OTEL SDK shut down successfully'),
        (err: unknown) => console.log('Error shutting down OTEL SDK', err),
      )
      .finally(() => process.exit(0));

  process.on('SIGTERM', shutdown);
  process.on('SIGINT', shutdown);

  return sdk;
}
```
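The `makeOtelAttributes` helper referenced above isn't shown. Here's a minimal sketch of the shape it might return, using semantic-convention keys as string literals — the field choices are an assumption, not the real implementation:

```typescript
// Hypothetical sketch of makeOtelAttributes — names and fields are assumptions.
interface OtelConfig {
  collectorEndpoint?: string;
  environment: string;
  serviceName: string;
}

// Resource attributes identify the emitting service on every span, metric,
// and log record. Keys follow the OTEL semantic conventions.
function makeOtelAttributes(config: OtelConfig): Record<string, string> {
  return {
    'service.name': config.serviceName,           // what X-Ray shows as the service
    'deployment.environment': config.environment, // lets you filter prod vs. staging
  };
}
```

Whatever shape you pick, keep it in one place: every signal the SDK emits carries these attributes, so inconsistency here fragments your queries downstream.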
Two things I want to call out here. First, disabling `instrumentation-fs` is intentional. Enabling it will flood your traces with file system operations — every module load, every config file read. It's useless for debugging and expensive in volume. Turn it off.

Second, the `PinoInstrumentation` is doing something important: via the `logHook`, it injects `trace_id` and `span_id` into every Pino log record. Worth noting — this is not automatic out of the box. Without the explicit `logHook`, the instrumentation wraps Pino's methods but doesn't attach trace context to log records. The hook is what makes it useful. Once it's wired in, that's the bridge between your logs and your traces. When you see an error in CloudWatch Logs, you take the `trace_id` and jump directly to the corresponding trace in X-Ray. In v1, this kind of correlation didn't exist. In v2, it's a few lines of config.
## Wiring Observability into NestJS Without It Becoming a Mess
We wrapped all of this into an ObservabilityModule that every service imports once. It handles Pino logging configuration and OpenTelemetry metrics in one place:
```typescript
// observability.module.ts
@Module({})
export class ObservabilityModule {
  static forRootAsync(): DynamicModule {
    return {
      module: ObservabilityModule,
      imports: [
        ConfigModule,
        LoggerModule.forRootAsync({
          inject: [ConfigService],
          useFactory: (configService: ConfigService) => ({
            pinoHttp: {
              level: configService.get<string>('LOG_LEVEL', 'info'),
              useLevel: 'debug',
              // Non-negotiable: never log auth headers or cookies
              redact: ['req.headers.authorization', 'req.headers.cookie'],
              // Pretty-print in a TTY (local dev), raw JSON in production
              ...transport,
            },
          }),
        }),
        OpenTelemetryModule.forRootAsync({
          inject: [ConfigService],
          useFactory: (configService: ConfigService) => {
            const endpoint = configService.get<string>(
              'OTEL_EXPORTER_OTLP_ENDPOINT',
            );
            if (!endpoint) {
              // Warn, don't crash — useful in local dev without a collector running
              console.warn('OTEL collector endpoint not configured...');
              return { metrics: { enabled: false } };
            }
            return { metrics: { hostMetrics: true } };
          },
        }),
      ],
      exports: [LoggerModule],
    };
  }
}
```
The `transport` variable at the top deserves mention. In a TTY environment (local dev), we use `pino-pretty` to get human-readable output. In production, `process.stdout.isTTY` is false, so we get raw JSON — which is what CloudWatch Logs Insights expects for structured querying. One module, two behaviors, no manual switching.
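For reference, a minimal sketch of what that `transport` value could look like — the shape is an assumption, since the real definition lives elsewhere in the module file:

```typescript
// Hypothetical sketch of the `transport` value — the real definition may differ.
// The result is spread into the pinoHttp options.
function makeTransport(isTTY: boolean): Record<string, unknown> {
  return isTTY
    ? { transport: { target: 'pino-pretty' } } // human-readable local dev output
    : {}; // no transport key → pino writes raw JSON to stdout
}

const transport = makeTransport(Boolean(process.stdout.isTTY));
```

Returning an empty object in the non-TTY branch matters: spreading `{}` adds no keys, so production gets pino's default JSON output with zero extra configuration.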
## What "Normalized Logs" Actually Means in Practice
Structured logging sounds obvious until you've inherited a codebase where half the logs are strings and the other half are JSON objects with inconsistent field names. Normalization means you make decisions upfront and enforce them everywhere.
In v2, every log record has a consistent shape: a `level` field, a message, a `trace_id` and `span_id` (injected via the `PinoInstrumentation` `logHook` we configure in the SDK factory), and any additional context as typed fields — not concatenated into the message string. Errors include an `error_code` field as a first-class property, not buried in a message.
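As a concrete illustration, here's a hedged sketch of that convention in code — the helper name and exact fields are illustrative, not from our actual codebase:

```typescript
// Illustrative only — field names follow the conventions described above.
interface NormalizedErrorFields {
  level: 'error';
  message: string;    // human-readable, no context concatenated in
  error_code: string; // first-class, queryable field, never buried in the message
  stack?: string;
}

function toErrorFields(code: string, err: Error): NormalizedErrorFields {
  return {
    level: 'error',
    message: err.message,
    error_code: code,
    stack: err.stack,
  };
}
```

Logged through Pino as a single object, a record like this lands in CloudWatch as one queryable JSON document, with `trace_id` and `span_id` appended by the `logHook`.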
Log levels are treated seriously. `info` is for events that are expected and worth tracking. `warn` is for degraded states that aren't failures yet. `error` is for things that need attention — and they include stack traces. `debug` is off in production by default, toggled via the `LOG_LEVEL` env var when you need to investigate something specific.

Redacting `req.headers.authorization` and `req.headers.cookie` isn't optional. These are the fields most likely to appear in HTTP request logs, and logging them — even accidentally — is a security incident waiting to happen. Pino's `redact` config handles it before the log record ever leaves the process.
## The Real Takeaway
The observability debt we carried in v1 cost us far more than if we'd built it in from the start. Not just in debugging time — in confidence. When you don't have traces, you don't fully understand your own system. You have theories, not facts.
OpenTelemetry, Pino, and a collector sidecar aren't a heavy stack. The setup I've described here is maybe a day of work for a new project. The cost of not doing it is paid every time something breaks in production and you're staring at CloudWatch with nothing useful to show for it.
Build it in on day zero. Not day thirty. Not after launch. Day zero.