Temitope Bamidele

Observability Engineering in Production Systems: Structured Logging, Metrics, and Distributed Tracing at Scale (Part 2)

How Observability Engineering Cut Incident Response Time by 85% in Production (Part 2)

Part 2 of 3: Centralized Logging and Async Pipelines

This article covers where structured logs go once they leave your application: centralized log aggregation that makes them queryable across services and time, with the async pipeline patterns that keep distributed systems observable. It assumes two foundational practices are in place: logs are structured JSON records with consistent event naming, and each request carries a correlation ID that propagates through every downstream operation. If you're reading this as part of the series, those are covered in Part 1. The full story of the Helpa production incident — a webhook regression that ran silently for over two hours — is reconstructed here through the logs that eventually exposed it.


5. Centralized Logging: Aggregating the Signal

Individual log files per service, written to disk, are not a logging system. They're a precondition for one. A log that lives only on the server that generated it is useless the moment you have more than one server, useless when that server restarts, and useless in practice even on a single server because reading raw files under incident pressure is slow and error-prone.

Centralized logging means all logs from all services flow to a single queryable store. Two architectures cover most production use cases:

ELK Stack (Elasticsearch, Logstash, Kibana)

Elasticsearch is a distributed search and analytics engine optimized for full-text and structured queries. Logstash (or the lighter Filebeat) collects and ships logs from your services. Kibana provides the query interface and dashboard layer.

The pipeline: your service writes JSON logs to stdout → a Filebeat sidecar ships to Logstash → Logstash parses, enriches, and routes to Elasticsearch → Kibana queries Elasticsearch. Two non-obvious decisions live in the Logstash filter:

# logstash.conf — filter and output blocks
filter {
  if [fields][service] {
    mutate { add_field => { "[@metadata][index]" => "%{[fields][service]}" } }
  } else {
    mutate { add_field => { "[@metadata][index]" => "unknown" } }
  }
  # Re-parse the app timestamp into @timestamp so Kibana time-range queries
  # reflect event time, not ingest time.
  date {
    match => ["[app][timestamp]", "ISO8601"]
    target => "@timestamp"
  }
}

output {
  # One daily index per service, e.g. logs-payment-service-2025.11.14
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "logs-%{[@metadata][index]}-%{+YYYY.MM.dd}"
  }
}

With this configuration, logs-payment-service-2025.11.14 is a distinct Elasticsearch index. You can query across all indexes with logs-*, or scope to a specific service for faster queries.
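
For completeness, here is one plausible shape of the Filebeat side that sets the `fields.service` value the Logstash filter keys on. The paths, service name, and Logstash host are illustrative assumptions, not taken from any particular deployment:

```yaml
# filebeat.yml (sketch only: paths and hosts are examples)
filebeat.inputs:
  - type: log
    paths:
      - /var/log/app/*.json        # wherever the service's JSON logs land
    fields:
      service: payment-service     # surfaces as [fields][service] in Logstash
output.logstash:
  hosts: ["logstash:5044"]         # default Beats port on the Logstash host
```

Each service instance ships with its own `fields.service` value, which is what makes per-service index routing possible downstream.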

Loki + Grafana

Loki is the lighter-weight alternative, designed for label-based log querying rather than full-text indexing. It's significantly cheaper to operate at scale because it doesn't index the content of log messages — only the labels (metadata like service, environment, host). Content queries work by selecting streams via labels and then scanning their contents, rather than by full-text search.

The trade-off: Loki's LogQL query language is less expressive than Elasticsearch's query DSL for complex analytical queries. For operational use cases — "show me all ERROR events from the payment service in the last hour" — Loki is fast and cheap. For complex analytics — "what was the median time between webhook receipt and settlement for donations over ₦100,000 last month" — Elasticsearch is more capable.

On lower-volume, cost-sensitive workloads Loki is the pragmatic choice: because only labels are indexed, storage costs stay manageable as the system grows. On systems with heavier query requirements (payment audit trails, multi-dimensional analytics), Elasticsearch is the more capable tool.

Promtail is the agent that collects and ships logs to Loki. The non-obvious decision in its pipeline config is what to promote to a Loki label and what not to:

# promtail-config.yml — pipeline_stages
pipeline_stages:
  - json:
      expressions:
        level: level
        service: service
  - labels:
      level:   # low cardinality — safe as a Loki stream label
      service: # low cardinality — safe as a Loki stream label
      # requestId is intentionally NOT a label: it is a UUID unique per request
      # and would create millions of distinct Loki streams, exhausting memory.
      # Query it as a parsed field instead: | json | requestId="<value>"

With labels extracted and indexed, this Grafana Loki query retrieves all error-level log entries for the payment service and returns them in chronological order (the time window, e.g. the last two hours, is set by Grafana's time picker rather than in the query itself):

{service="payment-service", level="error"} | json | line_format "{{.timestamp}} {{.event}} {{.requestId}} {{.message}}"

6. Logging in Distributed Systems and Async Pipelines

The hardest logging problems are not in your HTTP request handlers. They're in the places that operate outside the standard request/response cycle: webhook processors, job queues, cron jobs, event bus consumers.

These components have no caller waiting for a response. If they fail silently, nobody knows. If they fail noisily but without context, you know something is wrong but not what or why or for whom.

The Webhook Black Hole

Webhooks are one of the highest-risk surfaces in any payment system precisely because they're the most consequential and the least observed. A webhook represents money in motion. A webhook that fails silently means a settlement that never happened, a campaign that didn't receive funds it's owed, a donor record that was never created.

The pattern I use consistently across both systems:

// webhook.processor.ts
@Process('process-webhook')
async handleWebhook(job: Job<WebhookJobData>): Promise<void> {
  const { payload, meta } = job.data;
  const startTime = Date.now();

  this.logger.log({
    event: 'webhook.processing.started',
    requestId: meta.requestId,
    provider: payload.provider,
    reference: payload.reference,
    jobId: job.id,
    attempt: job.attemptsMade,
  });

  try {
    await this.webhookService.process(payload, meta.requestId);

    this.logger.log({
      event: 'webhook.processing.completed',
      requestId: meta.requestId,
      reference: payload.reference,
      duration: Date.now() - startTime,
      jobId: job.id,
    });

  } catch (error) {
    const isLastAttempt = job.attemptsMade >= (job.opts.attempts ?? 1) - 1;
    const logContext = {
      requestId: meta.requestId,
      reference: payload.reference,
      error: error.message,
      stack: error.stack,
      jobId: job.id,
      attempt: job.attemptsMade,
    };

    if (isLastAttempt) {
      this.logger.error({ event: 'webhook.processing.exhausted', ...logContext });
    } else {
      this.logger.warn({ event: 'webhook.processing.failed', ...logContext });
    }

    throw error; // Re-throw so Bull marks the job for retry
  }
}

Notice webhook.processing.exhausted as a distinct event from webhook.processing.failed. A failed job that will be retried is a recoverable situation — WARN level, note and move on. A job that has exhausted all retries is a permanently lost webhook — ERROR level, alert immediately, requires manual intervention.
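
The retry-exhaustion check deserves a closer look, because `attemptsMade` semantics trip people up. Here is the decision isolated as a pure function — a sketch assuming, as the handler above does, that `attemptsMade` counts the attempts made before the current one (0 on the first try):

```typescript
// Sketch of the exhaustion check from the handler above, as a pure function.
// Assumes attemptsMade counts prior attempts: 0 on the first try.
function isLastAttempt(attemptsMade: number, maxAttempts: number = 1): boolean {
  // The current attempt is attemptsMade + 1; it is the last one when that
  // reaches the configured maximum.
  return attemptsMade >= maxAttempts - 1;
}

// The event level follows from the answer: WARN for a retryable failure,
// ERROR for a permanently lost webhook.
function failureEvent(attemptsMade: number, maxAttempts: number): string {
  return isLastAttempt(attemptsMade, maxAttempts)
    ? 'webhook.processing.exhausted'
    : 'webhook.processing.failed';
}
```

With `attempts: 3` configured on the queue, the first two failures log `webhook.processing.failed` and the third logs `webhook.processing.exhausted`.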

Logging Cron Job Behavior at the Right Granularity

Cron jobs that check every pending transaction or trigger reconciliation runs are critical to payment system integrity. They also generate enormous log volume if you log at the wrong granularity:

// reconciliation.cron.ts
@Cron('0 */6 * * *') // Every 6 hours
async runReconciliation(): Promise<void> {
  const runId = `run_${uuidv4().split('-')[0]}`;
  const startTime = Date.now();

  this.logger.log({
    event: 'reconciliation.run.started',
    runId,
    triggeredBy: 'cron',
  });

  const staleTransactions = await this.transactionService
    .findStalePending();

  if (staleTransactions.length === 0) {
    this.logger.log({
      event: 'reconciliation.run.completed',
      runId,
      resolved: 0,
      duration: Date.now() - startTime,
    });
    return;
  }

  let resolved = 0;
  let failed = 0;

  for (const tx of staleTransactions) {
    try {
      await this.reconciliationService.process(tx, runId);
      resolved++;
    } catch (error) {
      failed++;
      this.logger.error({
        event: 'reconciliation.transaction.failed',
        runId,
        transactionId: tx.id,
        reference: tx.reference,
        staleSince: tx.updatedAt,
        error: error.message,
      });
    }
  }

  this.logger.log({
    event: 'reconciliation.run.completed',
    runId,
    total: staleTransactions.length,
    resolved,
    failed,
    duration: Date.now() - startTime,
  });
}

The runId stitches all log entries from a single reconciliation run together, exactly as requestId does for HTTP requests. This means when reconciliation partially fails — some transactions recover, some don't — you can filter by runId in Kibana and see the complete picture of that specific run.
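
In Kibana that filter is a one-line KQL query (the runId value here is a hypothetical example):

```
service : "payment-service" and runId : "run_4f2a9b1c"
```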


7. The Incident That Changed Everything

The Helpa incident is worth unpacking in full, because the root cause is the kind of thing that catches experienced teams — not junior ones.

Earlier in the project, the webhook receiver passed the raw Paystack payload directly onto the queue. During a broader normalization effort, this was refactored: the receiver now mapped the inbound payload to a typed internal event object before enqueuing, so the queue processor would always handle a consistent internal schema rather than Paystack's raw shape. As part of that normalization, the internal field previously named donorId was renamed to customerId to align with the rest of the platform's identity model. The receiver was updated. The processor — in a separate module, decoupled across the queue boundary — was not.

The processor extracted the identity field using optional chaining: const donorId = job.data?.donorId. After the deployment, donorId was no longer present in the enqueued payload; the field was now customerId. Optional chaining resolved silently to undefined — valid TypeScript, no error, nothing flagged.

Months earlier, a guard had been added near the top of the processing block. It had been added as a precaution against a specific edge case during early development, had never fired in any meaningful volume, and carried no log alongside it because the path was considered unreachable in normal operation:

const donorId = job.data?.donorId; // undefined after rename — no compile error
if (!donorId) {
  return; // no log — fires for every job post-deployment
}

When the field rename made donorId absent in every enqueued job, that guard became the first thing the processor hit, and it fired silently for every one of them.

Paystack fires a webhook. The receiver maps it to the internal schema, enqueues the job with customerId set correctly. The queue processor runs, reads job.data?.donorId, gets undefined, hits the unlogged guard, and returns. No exception, no error log, no dead-letter entry. No donation ledger entry. No wallet credit. Bull marks the job completed.

The donor sees a completed payment on their bank statement. The campaign owner sees nothing.

When I looked at what we had in the logs:

{ "level": "info", "timestamp": "2023-11-09T14:22:17.341Z", "service": "payment-service", "event": "webhook.received", "reference": "PAY-abc123", "requestId": "f3a2b1c0-9e8d-4b7a-a6c2-d3e4f5a6b7c8" }
{ "level": "info", "timestamp": "2023-11-09T14:22:17.389Z", "service": "payment-service", "event": "webhook.job.enqueued", "reference": "PAY-abc123", "customerId": "usr_7f3a1b2c", "jobId": "4821", "requestId": "f3a2b1c0-9e8d-4b7a-a6c2-d3e4f5a6b7c8" }

And then nothing. No processing entry of any kind. The processor had run, exited cleanly from Bull's viewpoint, and left no trace — because the guard that fired had no accompanying log call. It was a return statement with no record that it had ever executed. A job that Bull considered completed was indistinguishable in the logs from a job that had successfully processed a donation.

The normalization refactor wasn't wrong — unifying the internal schema was the right call. The guard wasn't wrong either; guarding against missing identity on a payment processor is sensible. What was wrong was the combination: a cross-module rename across a queue boundary, with no integration test covering the end-to-end path from enqueue to processDonation() being called, and an unlogged early return that had been written off as unreachable.

TypeScript can enforce payload contracts across a queue boundary — but only if you explicitly share a typed Job<WebhookJobData> generic between the producer and consumer. The processor was a legacy module that predated the team's move toward shared payload generics; it had been working correctly with job.data.donorId typed loosely for months. The normalization refactor introduced the shared WebhookJobData type for new code and updated the receiver, but didn't backfill the processor. Optional chaining on a loosely typed field is structurally valid code regardless of whether the field is present at runtime, so the rename produced no compile error and the gap in the refactor's scope went unnoticed.

This is worth naming as a general principle: silent early returns in critical processing code — even well-intentioned defensive ones — are an observability hazard. They are not exception paths, so they bypass error handlers. They are not intentional exits, so they're easy to forget to log. They tend to be written during early development when their execution seems implausible. And when a schema change, a field rename, or an unexpected edge case makes them suddenly reachable, they become invisible. Every exit from a payment processor — clean or otherwise — should leave a record in the logs.
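
Applied to the guard from the incident, the principle looks like this sketch: the early exit survives, but it can no longer fire invisibly. The function shape, event names, and injected logger are illustrative, not the actual Helpa processor:

```typescript
// Sketch: a defensive guard that exits early but always leaves a trace.
// The logger is injected so the function stays self-contained.
type LogFn = (record: { event: string; [key: string]: unknown }) => void;

function extractCustomerId(
  data: { customerId?: string; reference?: string },
  log: LogFn,
): string | null {
  const customerId = data.customerId;
  if (!customerId) {
    // Same defensive exit as before, but now it is observable: if a schema
    // change ever makes this path hot, the logs say so immediately.
    log({
      event: 'webhook.processing.skipped',
      reason: 'missing_customer_id',
      reference: data.reference,
    });
    return null;
  }
  return customerId;
}
```

A skipped-event alert on top of this is cheap: any sustained rate of `webhook.processing.skipped` is a signal that a supposedly unreachable path has become reachable.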

In hindsight the observability gaps were clear: no entry log at the top of the processor before any branching, no log on the early return that had become unexpectedly reachable, no reconciliation check linking webhook receipt to an expected settlement, no paired-event alerting to catch a started without a corresponding completed. None of these were hard to build. The system had simply grown to a complexity where their absence became expensive.

After the incident, every async processor was updated to emit a started log unconditionally at the point of entry — before any branching, before any validation — so that code paths exiting early still leave a visible trace. Alerting runs a query every 15 minutes: if there are more webhook.processing.started events than webhook.processing.completed plus webhook.processing.exhausted events in the last 30 minutes — accounting for the processing time window — an alert fires. This means a webhook that starts processing but never reaches a terminal state is caught within 15 minutes.
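
As a LogQL metric query, the pairing check described above could be sketched like this. The label names follow the Promtail config from section 5; the exact event matching and the 30-minute window are assumptions:

```
sum(count_over_time({service="payment-service"} | json | event="webhook.processing.started" [30m]))
>
sum(count_over_time({service="payment-service"} | json | event=~"webhook.processing.(completed|exhausted)" [30m]))
```

When the comparison holds, the expression returns a series and the Grafana alert rule attached to it fires.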

Additionally, the reconciliation cron was updated to log not just stale pending transactions but the delta against expected settlements: how many webhooks fired today, how many settlements recorded, how many are unaccounted for. The log entry looks like:

{
    "event": "reconciliation.integrity.summary",
    "runId": "run_9c3d2f1a",
    "period": "last-24h",
    "webhooksFired": 847,
    "settlementsRecorded": 847,
    "delta": 0,
    "status": "clean"
}

When delta is non-zero, the alert fires and the individual unreconciled references are listed in the same log entry.
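
The summary itself is simple to compute once the two counts exist. A minimal sketch, assuming the counts come from the webhook log index and the settlements table respectively; the counting sources and the non-clean `unreconciled` status value are assumptions (the source entry above only shows `clean`):

```typescript
// Sketch of the integrity summary behind the log entry above.
interface IntegrityCounts {
  webhooksFired: number;       // e.g. webhook.received events in the period
  settlementsRecorded: number; // e.g. settlement rows written in the period
}

function integritySummary(runId: string, counts: IntegrityCounts) {
  const delta = counts.webhooksFired - counts.settlementsRecorded;
  return {
    event: 'reconciliation.integrity.summary',
    runId,
    period: 'last-24h',
    webhooksFired: counts.webhooksFired,
    settlementsRecorded: counts.settlementsRecorded,
    delta,
    status: delta === 0 ? 'clean' : 'unreconciled',
  };
}
```

The alert rule then only has to watch a single numeric field, `delta`, rather than re-deriving the comparison at query time.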

The outcome: In the six months following the incident, three separate payment processing anomalies were caught by this alert — two caused by one of our payment gateways returning HTTP 200 responses with a status: false body during a transient degradation period (our processor was checking only for non-2xx HTTP errors and treated those responses as successful, leaving settlements silently unqueued), and one caused by a deployment that briefly broke queue worker connectivity. All three were caught and resolved without donor impact. Average detection time: eight minutes. Average resolution time: twelve minutes.

Compare that to the original incident, which ran silently for over two hours and was only found through manual investigation — cross-referencing Paystack's dashboard against database records with no shared identifier to work from. The automated reconciliation cron was no help — it looked for stale pending transactions, but no transaction records existed because the processor had returned before processDonation() was ever reached. Had that manual investigation not located the gap, the accounting team would have caught it eventually by reconciling Paystack's settlement reports against the donation ledger — but that is an audit cycle measured in days, not a detection mechanism. The structural pairing alert and delta reconciliation introduced after the incident close that gap: the same failure is now caught within minutes, automatically, without waiting for a campaign owner's vigilance or a finance team's next audit run.


Centralized log aggregation, async pipeline instrumentation, and structural alerting change the nature of incident response. The Helpa incident described in this article — a webhook regression that ran silently for over two hours — would have been detected in under eight minutes with the structural pairing alert introduced afterwards. The logs existed; what was missing was the infrastructure to make them queryable and the alerting to make the signal automatic. Across the three subsequent incidents in the six months that followed, average detection time was eight minutes and average resolution time was twelve.

The two remaining pieces — what you are allowed to log in regulated environments, and how logs connect to metrics and traces for proactive detection — are covered in Part 3 if you're continuing the series.
