How Observability Engineering Cut Incident Response Time by 85% in Production
Part 1 of 3: Structured Logs and Correlation IDs
Part of a three-part series — Part 2: Centralized Logging and Async Pipelines · Part 3: Security, Metrics, and Full-Stack Observability
In late 2023, Helpa's payment infrastructure went quiet in the worst possible way.
Helpa is a crowdfunding platform — people use it to raise money for surgeries and emergency expenses. A donor completed a payment on Paystack. The gateway returned a success response. The webhook fired. And then — nothing. No settlement. No ledger entry. No wallet credit. The campaign stayed at its previous total as if the donation had never happened. The donor had been charged. The campaign owner was waiting.
What followed was two hours of reconstruction: cross-referencing Paystack's dashboard against our database records, reading through raw log files on two different servers, trying to piece together a request timeline with no shared identifier across the output. We found it eventually — a regression in the webhook queue processor that caused jobs to exit silently without processing — but nothing in our logging surface had made the failure visible while it was happening. By the time we knew there was a problem, it had been running for over two hours.
I want to be precise about what the actual failure was, because the easy reading is "your logging was bad." It wasn't, exactly. Helpa started as a single-service deployment with modest transaction volume. The logging we had was adequate for that: structured enough to be readable, centralised enough that one person could scan the output and understand what was happening. That worked for a long time.
What changed was the system. By 2023 Helpa was processing hundreds of webhook events on a normal day, with traffic spiking 5–8x during successful campaign shares. We were running async queue processing across three payment providers, a reconciliation cron, and multiple replicas behind a load balancer. The observability model hadn't kept pace with that complexity. We still had the logging practices of a simpler system applied to infrastructure that had quietly become much harder to observe. The gap wasn't negligence — it was that we hadn't felt the operational friction of the old model badly enough until we were well into an active incident.
That incident was the forcing function for a complete rethink of how I approach observability — not as a debugging utility bolted on after the fact, but as an architectural discipline that evolves alongside the system. This series documents what I built, the reasoning behind each decision, and what it changed across Helpa and Festease.
One conclusion I want to flag upfront, because it took time to articulate clearly: logs, metrics, and traces are not three ways of doing the same thing. They answer structurally different questions — and understanding when to reach for each one, rather than defaulting to whichever is familiar, is the operational discipline this article is really about. Getting that wrong doesn't just make your system harder to operate — it shapes which failures are visible and which ones aren't.
1. The Illusion of Logging
Most backend engineers have logging. Few have useful logging.
The first form of logging is the bare console.log. It's fast to write, and it's enough when you're running code on your own machine and can watch the output scroll past in real time. In a production system it communicates almost nothing. console.log('payment processed') tells you something happened. It doesn't tell you which user triggered it, what the payment reference was, what the amount was, whether it was the first time the webhook fired or the third, or which instance of your API server handled it when you're running three replicas behind a load balancer.
The second, more dangerous form of logging is the one that looks like real logging but isn't. This is where teams use a proper logging library — Winston, Pino, Bunyan — but configure it to write plain text messages to a file that nobody reads until something breaks. The messages have timestamps and log levels, which creates the impression of a proper system. But the content is still unstructured narrative: Payment webhook received for reference PAY-12345 written as a free-text string rather than as a queryable record. When an incident happens and you need to find all events related to a specific user's payment across multiple services, you're running grep and hoping the reference number appears consistently in every relevant log line.
Here's the thing: a log you cannot query isn't a log — it's an audit trail that only works for the events you already know to look for. You need to answer questions you haven't thought of yet, correlate events across services you didn't know were involved, and do all of it under pressure at 1am when a campaign owner's surgery fundraiser is stalled.
That requires a fundamentally different approach to what a log entry is.
2. Structured Logging: The Foundation
A structured log entry is a record, not a sentence. Every field is named, typed, and independently queryable. The difference is immediate once you see it:
Unstructured:
[2025-11-14 02:41:33] INFO Payment webhook received for reference PAY-abc123, amount 50000, result: success
Structured:
{
"level": "info",
"timestamp": "2025-11-14T02:41:33.421Z",
"service": "payment-service",
"event": "webhook.received",
"provider": "paystack",
"reference": "PAY-abc123",
"amount": 50000,
"currency": "NGN",
"campaignId": "camp_8f2a1c",
"userId": "usr_3b9d2e",
"requestId": "7f4e8a1b-2c3d-4e5f-a6b7-c8d9e0f1a2b3",
"environment": "production",
"host": "api-server-3"
}
The structured version gives you: the ability to query all webhooks from a specific provider, all events related to a specific campaign, all events for a specific user, all events across a specific host, all events carrying a specific requestId — without writing custom parsers, without grep, and with sub-second query times in a properly configured log aggregation system.
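To make "queryable" concrete outside any particular aggregation tool, here is a minimal sketch of filtering newline-delimited JSON logs by field. The filterLogs helper and the sample log lines are illustrative, not part of Helpa's codebase; a real aggregator does this at scale, but the principle is identical.

```typescript
// filter-logs.ts — query NDJSON structured logs by field, no grep required.
// `filterLogs` and the sample lines below are hypothetical, for illustration only.

type LogRecord = Record<string, unknown>;

// Parse newline-delimited JSON and keep only records matching every given field.
export function filterLogs(ndjson: string, match: LogRecord): LogRecord[] {
  return ndjson
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as LogRecord)
    .filter((rec) =>
      Object.entries(match).every(([key, value]) => rec[key] === value)
    );
}

// Example: every event a single request produced, across services.
const logs = [
  '{"event":"webhook.received","requestId":"req-1","service":"payment-service"}',
  '{"event":"webhook.job.enqueued","requestId":"req-1","service":"payment-service"}',
  '{"event":"donation.settled","requestId":"req-2","service":"ledger-service"}',
].join("\n");

const chain = filterLogs(logs, { requestId: "req-1" });
// `chain` now holds the two "req-1" records, in emission order.
```

The same one-predicate query against unstructured text would mean writing a regex per log format and hoping every service printed the identifier consistently.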
If I were starting a system today, this is the first thing I'd lock in: every log entry is a structured JSON record. Everything downstream depends on it.
Setting This Up in NestJS with Pino
I switched from Winston to Pino on Helpa's backend after a load test at 400 concurrent requests revealed that Winston's synchronous transport was adding around 3–4ms to p95 request latency — not fatal, but visible and entirely attributable to logging overhead. The same test with Pino showed sub-millisecond logging cost. Pino keeps serialisation deliberately minimal and buffers and flushes writes asynchronously, so logging doesn't block the event loop. At low traffic the difference is irrelevant. At the concurrency levels a viral campaign produces, a 3–4ms logging tax compounds across every request in the burst.
// logger.module.ts
import { Module } from "@nestjs/common";
import { ConfigService } from "@nestjs/config";
import { LoggerModule as PinoLoggerModule } from "nestjs-pino";
@Module({
imports: [
PinoLoggerModule.forRootAsync({
inject: [ConfigService],
useFactory: (config: ConfigService) => ({
pinoHttp: {
level:
config.get("NODE_ENV") === "production"
? "info"
: "debug",
transport:
config.get("NODE_ENV") !== "production"
? {
target: "pino-pretty",
options: { colorize: true }
}
: undefined,
serializers: {
req(req) {
return {
method: req.method,
url: req.url
};
}
},
customProps: (req) => ({
service: config.get("SERVICE_NAME") ?? "api",
environment: config.get("NODE_ENV"),
requestId: req.headers["x-request-id"]
}),
redact: {
paths: [
"req.headers.authorization",
'req.headers["x-api-key"]',
"body.cardNumber",
"body.cvv",
"body.bvn",
"body.password"
],
censor: "[REDACTED]"
}
}
})
})
]
})
export class LoggerModule {}
The redact configuration is not optional. Every field in the path list is automatically replaced with [REDACTED] before the log is serialised. This means Authorization headers, API keys, BVNs, card numbers, and passwords can never appear in log output regardless of which developer writes what. It's enforced at the library level, not left to discipline. One caveat: the body.* paths only match when log calls nest the DTO under a body key. A call like this.logger.log({ cardNumber: dto.cardNumber }) passes fields at the root level and bypasses every body.* path in the list. The safe convention is to always pass domain data as { body: dto }, or add explicit root-level keys — cardNumber, bvn, etc. — to the redact list as well. I cover the broader security implications in Section 8.
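One way to enforce the "always nest domain data under body" convention is a tiny helper at the call site, so redact paths like body.cardNumber always match. The toRedactableLog function below is a hypothetical sketch of that convention, not part of nestjs-pino:

```typescript
// to-redactable-log.ts — make body.* redact paths reliable by construction:
// domain data is always nested under `body` before it reaches the logger.
// `toRedactableLog` is a hypothetical helper, not a nestjs-pino API.

export function toRedactableLog(
  event: string,
  dto: Record<string, unknown>,
  context: Record<string, unknown> = {}
): Record<string, unknown> {
  // Domain fields go under `body`, where paths like "body.cardNumber" can see them.
  return { event, ...context, body: dto };
}

// Usage: this.logger.log(toRedactableLog("payment.initiated", dto, { requestId }));
const entry = toRedactableLog(
  "payment.initiated",
  { cardNumber: "4111111111111111", amount: 50000 },
  { requestId: "req-1" }
);
// entry.body.cardNumber now sits at a path the redact list covers.
```

The point is structural: a convention enforced by a shared function survives new developers and late-night hotfixes in a way that a code-review checklist does not.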
The difference between pino-pretty in development and raw JSON in production is deliberate. In development, pretty-printed logs are dramatically easier to read. In production, you want raw JSON because your log aggregation pipeline — Logstash, Promtail, Fluentd — can parse it without brittle pattern matching. Pretty-printing in production just adds overhead and makes parsing harder.
The Standard Log Shape
Across both systems, we standardised on a log shape that every event must carry:
interface LogEntry {
level: "debug" | "info" | "warn" | "error" | "fatal";
timestamp: string; // ISO 8601
service: string; // Which service emitted this
event: string; // Namespaced event identifier: 'payment.webhook.received'
requestId?: string; // HTTP and queue-processor events; absent on cron-initiated events which use runId instead
userId?: string; // When the request is user-scoped
tenantId?: string; // For multi-tenant systems like Festease
environment?: string; // Injected by customProps on HTTP requests; absent on async/cron events not handled by nestjs-pino middleware
host?: string; // Injected by customProps on HTTP requests; absent on async/cron events not handled by nestjs-pino middleware
duration?: number; // Milliseconds, for operation timing
message?: string; // Human-readable summary — passed as the second argument to logger calls: logger.log(mergeObject, 'message')
[key: string]: unknown; // Domain-specific context (reference, amount, etc.)
}
The event field uses dot-notation namespacing intentionally. payment.webhook.received, payment.webhook.processed, payment.webhook.failed, donation.settled, donation.reconciliation.triggered. This creates a taxonomy of system events that makes query construction intuitive: show me everything under payment.webhook.* across the last hour.
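One payoff of the dot-notation taxonomy is that prefix queries are trivial to express even outside a log aggregator. A sketch, where matchesEventPrefix is a hypothetical helper (tools like Loki and Kibana provide this natively):

```typescript
// event-taxonomy.ts — prefix matching over dot-namespaced event identifiers.
// `matchesEventPrefix` is illustrative; your aggregator's query language does this for you.

export function matchesEventPrefix(pattern: string, event: string): boolean {
  // "payment.webhook.*" matches any event under the payment.webhook namespace.
  if (pattern.endsWith(".*")) {
    return event.startsWith(pattern.slice(0, -1)); // keep the trailing dot
  }
  return event === pattern;
}

// "Show me everything under payment.webhook.* across the last hour":
const events = [
  "payment.webhook.received",
  "payment.webhook.failed",
  "donation.settled",
];
const webhookEvents = events.filter((e) =>
  matchesEventPrefix("payment.webhook.*", e)
);
```

Flat event names like webhookReceived or processWebhookOk give you none of this; the namespace hierarchy is what makes the taxonomy composable.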
3. Correlation IDs: Tracking a Request Across Its Entire Lifecycle
The single most impactful change I made to incident response capability was introducing correlation IDs — a unique identifier assigned to each request at the boundary of the system and propagated through every downstream operation that request generates.
Without a correlation ID, a request that touches multiple services, two database calls, a Redis cache lookup, and a webhook queue job produces log entries distributed across all of those components with no shared field that ties them together. You have timestamps, and you can try to reconstruct the sequence by time, but if two requests are in flight simultaneously — which is always the case in production — you are manually separating interleaved log streams from memory.
With a correlation ID, every log entry for everything a single request caused carries the same requestId. In Kibana or Grafana Loki, one query returns the complete story.
The Implementation
The correlation ID is generated (or forwarded if one arrives in the request header) at the middleware layer:
// correlation-id.middleware.ts
import { Injectable, NestMiddleware } from "@nestjs/common";
import { Request, Response, NextFunction } from "express";
import { v4 as uuidv4 } from "uuid";
@Injectable()
export class CorrelationIdMiddleware implements NestMiddleware {
use(req: Request, res: Response, next: NextFunction): void {
const incoming = req.headers["x-request-id"];
const requestId = (Array.isArray(incoming) ? incoming[0] : incoming) ?? uuidv4();
// Attach to request so downstream services can read it
req["requestId"] = requestId;
req.headers["x-request-id"] = requestId;
// Expose on response so the caller can correlate their own logs
res.setHeader("x-request-id", requestId);
next();
}
}
Why accept an incoming x-request-id rather than always generating a new one? Because your clients and upstream services have their own logs. When a Paystack webhook fires and you want to trace that flow from Paystack's retry logs through your ingestion endpoint through your queue processor, the same requestId needs to appear in all of them. If your gateway or proxy generates the ID before it reaches your service, you preserve that chain. If you generate a fresh one on arrival, you break it.
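The forward-or-generate decision is small enough to isolate as a pure function, which also makes it easy to unit-test. This resolveRequestId helper is a hypothetical restatement of the middleware's logic, using Node's built-in crypto.randomUUID:

```typescript
// resolve-request-id.ts — forward an upstream ID when present, mint one otherwise.
// `resolveRequestId` is a hypothetical pure-function sketch of the middleware logic.
import { randomUUID } from "node:crypto";

export function resolveRequestId(
  header: string | string[] | undefined
): string {
  // HTTP allows repeated headers; take the first value if an array arrives.
  const incoming = Array.isArray(header) ? header[0] : header;
  // Preserve the upstream chain when a gateway or client already set the ID;
  // only mint a fresh UUID when there is genuinely no incoming identifier.
  return incoming && incoming.length > 0 ? incoming : randomUUID();
}

const forwarded = resolveRequestId("7f4e8a1b-2c3d-4e5f-a6b7-c8d9e0f1a2b3");
const minted = resolveRequestId(undefined);
```

Guarding against the empty string matters in practice: some proxies forward the header with no value, and ?? alone would happily propagate "" through your entire trace.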
Propagating Through Async Queues
The part that most implementations get wrong is propagation through asynchronous boundaries. HTTP middleware propagates the ID through the synchronous request lifecycle. But once you enqueue a job — a webhook to process, a settlement to confirm, a reconciliation to run — that job executes in a different process, often much later. If the requestId isn't embedded in the job payload, it's gone.
On Helpa, every queue job includes a meta object:
interface JobMeta {
requestId: string;
correlatedAt: string; // When the job was enqueued
source: string; // What enqueued it: 'webhook-handler', 'cron', 'api'
}
// When enqueuing
const job = await this.donationQueue.add('process-webhook', {
payload: webhookData,
meta: {
requestId: req['requestId'],
correlatedAt: new Date().toISOString(),
source: 'webhook-handler',
},
});
this.logger.log({
event: 'webhook.job.enqueued',
requestId: req['requestId'],
reference: webhookData.reference,
jobId: job.id,
});
// In the queue processor
@Process('process-webhook')
async handleWebhook(job: Job<WebhookJob>): Promise<void> {
const { payload, meta } = job.data;
this.logger.log({
event: 'webhook.processing.started',
requestId: meta.requestId,
provider: payload.provider,
reference: payload.reference,
attempt: job.attemptsMade,
});
// ... processing logic
}
When that webhook processor later throws an exception — as the one in the opening incident did — the error log carries the same requestId as the HTTP log entry from when the webhook was first received. One query in your log aggregation system retrieves the complete chain.
Note that Paystack's reference is not the same as your requestId. The reference identifies a specific transaction on Paystack's side — it lets you correlate a webhook back to the original charge. Your requestId is an internal identifier that tracks a request across your own services. Where the two meet is in the webhook handler: logging both together means you can move in either direction — from your internal request trace outward to Paystack's records, or from a Paystack reference inward to your processing logs.
Propagating to External Service Calls
Any outbound HTTP call to an external service should forward the requestId as x-request-id. This creates traceability across system boundaries:
// http-client.service.ts
async post<T>(url: string, body: unknown, requestId: string): Promise<T> {
const response = await this.httpService.axiosRef.post<T>(url, body, {
headers: {
'x-request-id': requestId,
'Content-Type': 'application/json',
},
});
return response.data;
}
This applies to any downstream service that honours the x-request-id header — internal microservices, third-party APIs, or your own async workers invoked over HTTP.
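Rather than remembering to set the header at every call site, the convention can be captured in one small header-building helper. withCorrelation is a hypothetical sketch of that idea and works with any HTTP client:

```typescript
// correlation-headers.ts — build outbound headers that preserve the trace chain.
// `withCorrelation` is a hypothetical helper encoding the convention described above.

export function withCorrelation(
  requestId: string,
  headers: Record<string, string> = {}
): Record<string, string> {
  return {
    "Content-Type": "application/json",
    ...headers,
    // Placed last so the caller's requestId always wins: every outbound call
    // carries the ID, and downstream logs join the same chain.
    "x-request-id": requestId,
  };
}

// Usage with any client, e.g.:
//   await axios.post(url, body, { headers: withCorrelation(requestId) });
const headers = withCorrelation("req-1", { Accept: "application/json" });
```

Centralising this in one function also gives you a single place to add future propagation headers (a tenant ID, a W3C traceparent) without touching every call site.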
4. Log Levels: Strategy, Not Convention
Every logging framework ships with log levels. Most teams treat the choice of level as a rough approximation — a stylistic preference. It isn't. Log levels are the primary mechanism for controlling signal-to-noise ratio in production, and miscalibrating them is a failure mode I've seen quietly neutralise an entire on-call rotation's ability to respond to real incidents.
The levels and what they actually mean in production:
| Level | When to use it | Who reads it |
|---|---|---|
| DEBUG | Detailed execution context useful during development or targeted investigation | Engineers; disabled by default in production |
| INFO | Significant business events that should always be recorded: request received, donation settled, webhook processed | Dashboards, alerting, post-incident review |
| WARN | Something unexpected happened but the system recovered: a retry succeeded, a cache miss forced a database fallback, a third-party API returned a non-standard response | On-call alerts with low-priority routing |
| ERROR | An operation failed and the system could not recover it automatically: a webhook could not be processed after exhausting retries, a settlement failed | PagerDuty / immediate on-call routing |
| FATAL | The service is unable to continue: database connection lost at startup, configuration missing | Immediate escalation, service restart required |
The failure I see most often is treating ERROR as a synonym for "something went wrong that I want to notice." That produces a system where every handled exception logs at ERROR, and after a few months of growth, your ERROR log stream is dominated by recoverable edge cases — validation failures, 404s, expired tokens — and the genuine ERROR-level events (failed settlements, lost webhook jobs) are invisible in the noise.
The mental model that actually works: an ERROR log means a human needs to investigate and take action. If the system can handle it automatically — retry, fallback, return a validation response to the client — it isn't an error from the system's perspective, even if something went wrong from the user's.
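That policy can be encoded once rather than re-decided in every catch block. The exception classes and the levelFor function below are illustrative sketches, not Helpa's real types; the point is mapping log level to recovery semantics in a single place:

```typescript
// log-level-policy.ts — map exceptions to log levels by recovery semantics.
// The exception classes and `levelFor` are hypothetical, for illustration only.

class TransferValidationException extends Error {} // user error; system recovered
class SettlementFailedException extends Error {}   // system failure; human needed

type Level = "warn" | "error";

export function levelFor(error: unknown): Level {
  // Recoverable, user-triggered failures log at WARN: the system handled them,
  // the client got a clean response, and nobody needs to be paged.
  if (error instanceof TransferValidationException) return "warn";
  // Everything else is an operation the system could not recover on its own.
  return "error";
}

const a = levelFor(new TransferValidationException("amount too low"));
const b = levelFor(new SettlementFailedException("provider timeout"));
```

Defaulting to "error" for anything unclassified is deliberate: an unknown failure should page a human until someone proves it is recoverable and demotes it explicitly.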
The Noise Problem and How We Fixed It
At a fintech I worked at, our error dashboards were triggering alerts dozens of times a day. The on-call rotation was becoming desensitised — engineers were starting to dismiss alerts reflexively because most of them turned out to be nothing. This is one of the most dangerous failure modes in production ops: alert fatigue causes real incidents to be dismissed alongside the noise.
When I audited the error log volume, roughly 60% of ERROR entries were user-triggered validation failures: invalid transfer amounts, expired payment sessions, duplicate transaction attempts. These are completely expected events in a payments system. They should be WARN at most. The remaining 40% were genuine system errors.
After reclassifying log levels based on recovery semantics, the error alert volume dropped by 65%. The on-call rotation started treating error alerts as real again.
@Post('transfer')
async initiateTransfer(
@Body() dto: InitiateTransferDto,
@Req() req: Request,
): Promise<TransferResponse> {
try {
const result = await this.transferService.initiate(dto, req['requestId']);
this.logger.log({
event: 'transfer.initiated',
requestId: req['requestId'],
userId: dto.userId,
accountId: dto.accountId,
amount: dto.amount,
provider: dto.provider,
});
return result;
} catch (error) {
if (error instanceof TransferValidationException) {
// User error — not a system failure
this.logger.warn({
event: 'transfer.validation.failed',
requestId: req['requestId'],
reason: error.message,
userId: dto.userId,
});
throw error;
}
// Genuine system failure — requires investigation
this.logger.error({
event: 'transfer.initiation.failed',
requestId: req['requestId'],
userId: dto.userId,
accountId: dto.accountId,
error: error.message,
stack: error.stack,
});
throw error;
}
}
The four pillars covered here — structured log records, a standard log shape, correlation IDs propagated through every async boundary, and log levels enforced as contracts rather than convention — are the foundation of a queryable, incident-ready observability system.
The operational impact of building this properly is measurable. Before structured logging and correlation IDs, the incident described in this article's opening took over two hours to diagnose — manually cross-referencing dashboards and raw log files with no shared identifier across the output. After introducing the practices in this article, a structured log query returned the complete request chain in under a minute. Adding centralised log aggregation with alerting brought mean time to detection down to around 18 minutes for a comparable incident. That gap, from a two-hour reconstruction to an 18-minute detection, is the 85% reduction in the title; it's a measured difference, not a projected estimate. With metrics-based alerting layered on top, the figure dropped further still, to 3 minutes.
Continue reading: Part 2 — Centralized Logging and Async Pipelines