DEV Community

Cover image for Structured Logging That Actually Helps Debugging at 3 AM
HelperX
HelperX

Posted on

Structured Logging That Actually Helps Debugging at 3 AM

Most logging is written for the person who wrote the code. The author knows the system, knows what the messages mean, and can fill in context from memory. Then at 3 AM, an alert fires, and the person debugging is someone else — or it's the author at month six, who has forgotten everything.

Logs that work at 3 AM are different from logs that look good in a unit test. The difference is structure, context, and the discipline to leave breadcrumbs that the future-you will actually be able to follow.

We've had two real incidents on HelperX where this mattered. Both were resolved in under an hour because the logs were good. Both could have been multi-hour debugging sessions if the logs had been the typical console.log("error") slop.

This is the logging setup that actually helped, and the incident that shaped it.

The incident that taught us

3:47 AM, a Saturday. Pager: "high error rate on slot worker pool." I opened the dashboard. Half the slot workers were failing on reply generation with a non-specific timeout. The errors had been happening for 12 minutes.

The instinct was to look at the most recent error logs. They said:

[2026-04-12 03:47:23] ERROR Reply generation failed for slot abc123
[2026-04-12 03:47:23] ERROR Reply generation failed for slot def456
[2026-04-12 03:47:24] ERROR Reply generation failed for slot ghi789
Enter fullscreen mode Exit fullscreen mode

Nothing. Just "failed." No error message, no stack, no context. The errors were correlated in time but the messages were useless.

The actual root cause turned out to be a regional outage at our LLM provider that was returning 503s with empty bodies. Our timeout handler was swallowing the body, logging the generic message, and moving on. The fix took 4 minutes once we found the cause. Finding the cause took 38 minutes.

Most of those 38 minutes were spent scrolling through logs trying to find any hint of what was happening. After the incident, I rewrote the logging system in a weekend. The next incident — a proxy provider hiccup three months later — was resolved in 11 minutes flat.

The pillars: what to log

The setup has four design rules. Every log line satisfies all four.

1. Structure, not strings

Every log line is JSON, not text. This is non-negotiable.

// Bad
logger.error(`Reply generation failed for slot ${slotId}: ${err.message}`);

// Good
logger.error({
  event: 'reply_generation_failed',
  slot_id: slotId,
  error: err.message,
  error_code: err.code,
  duration_ms: duration,
  tweet_id: tweet.id,
  attempt: attemptNumber,
}, 'Reply generation failed');
Enter fullscreen mode Exit fullscreen mode

The text version is fine if you're tailing logs by eye. The structured version is queryable, filterable, and aggregatable. At 3 AM, you don't want to grep — you want to WHERE event = 'reply_generation_failed' AND error_code = 'ECONNRESET'.

We use Pino because it's fast and outputs JSON natively. Bunyan, Winston with JSON format, or any equivalent works fine. The library matters less than the discipline.

2. Every log line has an event name

The event field is the most important field in any log. It's what you'll group by, filter on, and alert against.

Event names follow a convention:

  • noun_verb_statusreply_generation_failed, proxy_request_succeeded, auth_token_refreshed
  • Lowercase, underscore-separated
  • Always include the status (success/failed/skipped/started/completed)

This convention is what makes log analytics queryable. The first dashboard I built after the incident was a per-event-name frequency chart. Twelve minutes of looking at it tells me everything about system health.

3. Correlation IDs everywhere

Every log line carries a correlation_id that follows a logical operation across all the modules it touches.

async function processSlot(slot) {
  const correlationId = crypto.randomUUID();
  const log = baseLogger.child({ correlation_id: correlationId, slot_id: slot.id });

  log.info({ event: 'slot_processing_started' });

  const tweets = await fetchTweets(slot, log);
  log.info({ event: 'tweets_fetched', count: tweets.length });

  for (const tweet of tweets) {
    const tweetLog = log.child({ tweet_id: tweet.id });
    const reply = await generateReply(tweet, slot, tweetLog);
    if (reply) {
      await postReply(reply, slot, tweetLog);
    }
  }

  log.info({ event: 'slot_processing_completed' });
}
Enter fullscreen mode Exit fullscreen mode

When something fails, you grab the correlation_id from the error log and pull every log line with that ID. You get the entire trace of the operation in one query. No more scrolling through logs trying to figure out what request the error belonged to.

We use Pino's child mechanism for this — child loggers automatically include the parent's fields:

const log = baseLogger.child({ correlation_id, slot_id });
// All future log calls automatically include those fields
Enter fullscreen mode Exit fullscreen mode

4. Latency on every operation

Every meaningful operation logs its duration. This is the second-most-important field after event:

const start = Date.now();
try {
  const result = await externalApiCall();
  log.info({ event: 'external_api_succeeded', duration_ms: Date.now() - start });
  return result;
} catch (err) {
  log.error({
    event: 'external_api_failed',
    duration_ms: Date.now() - start,
    error: err.message,
    error_code: err.code,
  });
  throw err;
}
Enter fullscreen mode Exit fullscreen mode

At 3 AM, the question "is this fast or slow?" is what tells you whether the system is hung or just struggling. Without a duration field, you have to infer from timestamps — which works but is annoying. With it, your dashboard can show "p99 latency by event" and you immediately spot the slow operations.

The implementation

The full Pino setup, with our customizations:

// lib/logger.js
import pino from 'pino';
import { hostname } from 'os';

const isProduction = process.env.NODE_ENV === 'production';

const baseLogger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'helperx-worker',
    hostname: hostname(),
    pid: process.pid,
  },
  formatters: {
    level: (label) => ({ level: label }), // string level instead of number
  },
  timestamp: pino.stdTimeFunctions.isoTime,
  redact: {
    paths: [
      'req.headers.authorization',
      'req.headers.cookie',
      'auth_token',
      'password',
      'api_key',
      '*.password',
      '*.token',
      '*.secret',
    ],
    censor: '[REDACTED]',
  },
});

// Per-slot logger factory
export function getSlotLogger(slotId) {
  return baseLogger.child({ slot_id: slotId });
}

// Operation logger factory — includes correlation_id
export function getOperationLogger(parentLog, operation) {
  return parentLog.child({
    correlation_id: crypto.randomUUID(),
    operation,
  });
}

export default baseLogger;
Enter fullscreen mode Exit fullscreen mode

The redaction is important — auth_token, password, and similar fields get scrubbed before they hit the log. Easy to forget, hard to recover from if they leak.

Wrapping operations

Most operations follow a standard pattern. We wrap it in a helper:

// lib/log-operation.js
export async function logOperation(log, eventBase, fn, context = {}) {
  const start = Date.now();
  log.info({ event: `${eventBase}_started`, ...context });

  try {
    const result = await fn();
    log.info({
      event: `${eventBase}_succeeded`,
      duration_ms: Date.now() - start,
      ...context,
    });
    return result;
  } catch (err) {
    log.error({
      event: `${eventBase}_failed`,
      duration_ms: Date.now() - start,
      error: err.message,
      error_code: err.code,
      stack: err.stack,
      ...context,
    });
    throw err;
  }
}

// Usage
const reply = await logOperation(
  log,
  'reply_generation',
  () => llmGenerate(tweet, persona),
  { tweet_id: tweet.id, persona_id: persona.id, model: 'haiku' },
);
Enter fullscreen mode Exit fullscreen mode

The wrapper guarantees:

  • Start and end log lines have matching event prefixes
  • Duration is captured automatically
  • Errors include stack traces in dev, not in production (to keep logs slim)
  • Context is propagated to both success and failure logs

90% of our operations go through this wrapper. The 10% that don't are ones with custom log shapes for very specific cases.

Sampling: log less, see more

Logging too much is its own problem. We log:

  • Every error — 100%
  • Every operation start/end at info level — 100%
  • Every debug-level event — 1% in production, 100% in development
  • Every routine event (heartbeats, polls) — 10% sampled

The sampling logic:

function sampledInfo(log, sampleRate, payload, msg) {
  if (Math.random() < sampleRate) {
    log.info({ ...payload, sampled: true, sample_rate: sampleRate }, msg);
  }
}

// Usage in a tight loop
for (const item of largeBatch) {
  sampledInfo(log, 0.01, { event: 'batch_item_processed', item_id: item.id });
}
Enter fullscreen mode Exit fullscreen mode

The sampled: true flag tells log analytics that this log represents many more events than there are log lines. Aggregations can multiply counts by 1 / sample_rate for accurate volume tracking.

The math: at 200 slots × 1,000 operations/day, full info logging would be 200K lines/day = ~60MB/day. With sampling, it's ~6MB/day. Storage and query cost drops dramatically.

Log levels that mean something

Default Node.js logging has 5+ levels (trace, debug, info, warn, error, fatal). Most teams use them inconsistently. We use them strictly:

Level Use case Frequency
error Operation failed, action needed rare
warn Operation succeeded with caveat uncommon
info Normal operation milestones common
debug Detailed tracing for active debugging high (sampled in prod)
trace Reserved for very deep instrumentation almost never

The strict rule: if it's at error level, it must be alertable. If it's not actionable at 3 AM, it's not an error — it's an info with outcome: 'failed' or a warn.

This rule sounds pedantic but it changes alerting completely. The error log channel becomes the alert channel. If errors are appearing, something needs human attention. If something doesn't need human attention, it's not in the error log.

Log shipping and storage

We use a simple setup:

  • Pino writes to stdout in production (no file rotation in the process)
  • A sidecar process tails stdout and ships to our log aggregator
  • The aggregator is a small self-hosted ELK stack (Elasticsearch + Kibana)

This is overkill for our scale but the queryability is worth it. At 6MB/day per slot worker, a single instance with 100 slots produces ~600MB/day = ~220GB/year. A 1TB Elasticsearch instance handles 4+ years of logs.

For a smaller setup, just shipping to a managed service like Logtail, Better Stack, or Axiom works fine. The cost is typically $10-50/month, which buys you indexing and search without operating the infrastructure.

The crucial thing is something queryable. Plain log files are useless when you have 100 worker processes writing to them in parallel. You need an index and a search UI, or you're back to grep at 3 AM.

The dashboards that matter

After incidents, I built three dashboards. These are the ones I still use.

1. Error rate by event_name

Top-level query: COUNT(*) WHERE level = 'error' GROUP BY event, 1m

Shows which event names are erroring, when, and at what rate. The first thing I look at when a pager fires.

2. p50/p95/p99 duration by event_name

Top-level query: PERCENTILES(duration_ms, [50, 95, 99]) WHERE event LIKE '%_succeeded' GROUP BY event, 5m

Shows which operations are getting slow. Latency regressions usually precede outright failures by hours or days.

3. Per-slot failure rate

Top-level query: COUNT(*) WHERE event = 'reply_generation_failed' GROUP BY slot_id, 5m

Highlights individual slots that are failing while the rest are healthy. Catches per-slot issues (bad credentials, proxy problem) before they look like global incidents.

These three dashboards cover 80% of the operational debugging we ever need to do.

Common mistakes worth calling out

Logging the entire object

log.info({ event: 'thing_happened', context: req }); // BAD
Enter fullscreen mode Exit fullscreen mode

req contains headers, body, params, hops, and references to the response. Serializing it produces 10KB+ of useless detail. Pick the fields you actually need.

Logging in tight loops without sampling

for (let i = 0; i < 10000; i++) {
  log.debug({ event: 'item_processed', index: i });
}
Enter fullscreen mode Exit fullscreen mode

This generates 10K log lines per batch. Multiply by your batch frequency and you've blown out your log volume budget. Sample.

Logging only on success

const reply = await llmGenerate(tweet);
log.info({ event: 'reply_generated' });
Enter fullscreen mode Exit fullscreen mode

If llmGenerate throws, you never see a log for that attempt. You also have no record of attempts that didn't succeed. Log start, log success, log failure.

Including PII without redaction

log.info({ event: 'user_registered', email: user.email });
Enter fullscreen mode Exit fullscreen mode

The email is PII. Even if your log retention is short, log databases get backed up and shared. Use a hashed version, or redact.

Forgetting timezones

log.info(`Event at ${new Date()}`);
Enter fullscreen mode Exit fullscreen mode

Local time is locale-dependent. Use ISO 8601 with timezone — Pino does this by default with pino.stdTimeFunctions.isoTime.

Key takeaways

  1. Structured JSON logs are non-negotiable. Strings are for development.
  2. Every log line has an event name following noun_verb_status convention.
  3. Correlation IDs propagate through all related logs. Pino's child logger pattern does this elegantly.
  4. Latency on every operation. Without it, you can't tell hung from slow.
  5. Wrap operations in a helper so the start/end/duration pattern is consistent and unmissable.
  6. Sample routine events — 1-10% in production keeps storage costs manageable.
  7. error level means alertable. If it's not action-required, it's not an error.
  8. Build the three dashboards — error rate by event, p99 by event, per-tenant failure rate. They cover most incidents.

Good logging is a workout you do for your future self. The investment feels like overhead while you're building. It pays back massively the first time a pager fires.


HelperX ships with structured logging, per-slot correlation IDs, and reference dashboards out of the box. Self-hosted, free 30-day trial.

Top comments (0)