Logging is the backbone of production observability. Without it, debugging a live incident is like navigating in the dark: you know something is wrong but have no way to trace why or where. Yet logging done carelessly introduces its own risks: sensitive user data written to disk, credentials captured in request logs, and compliance violations hiding in plain text files.
The challenge every engineering team faces is not whether to log, but what to log, how to structure it, and who gets access to it.
This guide covers production logging best practices that give your team the observability they need, without turning your log aggregator into a liability.
## Why Logging Strategy Matters in Production
Poorly designed logging creates two equally dangerous failure modes:
Too little logging means you're flying blind during incidents. No request context, no error trail, no performance baseline. Mean time to resolution (MTTR) skyrockets because engineers spend hours reconstructing what happened.
Too much logging, or logging the wrong things, creates serious security and compliance risks:
- Passwords captured in login request bodies
- Credit card numbers logged from payment payloads
- Session tokens recorded in access logs
- Personally Identifiable Information (PII) written to third-party log aggregators
- HIPAA or GDPR violations from retaining sensitive health or user data
A mature logging strategy threads this needle deliberately, maximizing signal while minimizing exposure.
## Log Levels: Using Them Correctly
Log levels are the first layer of control. Using them correctly keeps your logs actionable and your signal-to-noise ratio high.
| Level | When to Use |
|---|---|
| ERROR | Unrecoverable failures that require immediate attention: exceptions, service crashes, failed critical operations |
| WARN | Recoverable issues that indicate something is wrong but hasn't broken yet: retry attempts, deprecated API usage, slow queries |
| INFO | Normal application lifecycle events: service startup, job completion, significant state transitions |
| DEBUG | Detailed diagnostic information useful during development and troubleshooting; never enabled in production by default |
| VERBOSE / TRACE | Highly granular execution flow; only for deep debugging in isolated environments |
A common mistake is logging everything at INFO or ERROR. This floods your aggregator with noise (making alerts meaningless) or misses important context entirely. Be deliberate: if a message doesn't require human attention, it doesn't belong at ERROR.
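Under the hood, most loggers (Pino included) implement levels as numeric thresholds, and a message is emitted only if its level meets the logger's configured minimum. A minimal sketch of that filtering rule, using Pino's level numbers:

```typescript
// Pino-style numeric levels: higher number = more severe.
const LEVELS = { trace: 10, debug: 20, info: 30, warn: 40, error: 50 } as const;
type Level = keyof typeof LEVELS;

// A message is emitted only if its level meets the configured threshold.
function shouldLog(configured: Level, message: Level): boolean {
  return LEVELS[message] >= LEVELS[configured];
}

// With LOG_LEVEL=info in production, DEBUG noise is dropped automatically:
shouldLog('info', 'debug'); // false
shouldLog('info', 'error'); // true
```

This is why setting `LOG_LEVEL` per environment is enough to silence diagnostic chatter in production without touching application code.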
## Structured Logging: JSON Over Plain Text
Plain text logs are human-readable but machine-hostile. Parsing "[2025-04-06 10:23:11] ERROR: User not found for ID 42" requires regex and breaks the moment the format changes.
Structured logging emits logs as JSON objects, making them directly queryable in any log aggregator (Datadog, ELK, Loki, CloudWatch):
```json
{
  "level": "error",
  "message": "User not found",
  "context": "UsersService",
  "userId": "usr_8f3a2c",
  "requestId": "req_9d1b4e",
  "statusCode": 404,
  "timestamp": "2025-04-06T10:23:11.412Z",
  "environment": "production",
  "service": "users-service"
}
```
Every field is indexable, filterable, and alertable. You can instantly query "all 404 errors in the users-service in the last hour" without parsing a single string.
## Setting Up Structured Logging in NestJS with Pino
Pino is among the fastest structured loggers for Node.js, with native JSON output and minimal overhead:
```bash
npm install nestjs-pino pino-http pino-pretty
```
Configure it in your AppModule:
```typescript
// app.module.ts
import { Module } from '@nestjs/common';
import { LoggerModule } from 'nestjs-pino';

@Module({
  imports: [
    LoggerModule.forRoot({
      pinoHttp: {
        level: process.env.LOG_LEVEL ?? 'info',
        transport:
          process.env.NODE_ENV !== 'production'
            ? { target: 'pino-pretty' } // human-readable in development
            : undefined, // raw JSON in production
        redact: {
          paths: [
            'req.headers.authorization',
            'req.headers.cookie',
            'req.body.password',
            'req.body.creditCard',
          ],
          censor: '[REDACTED]',
        },
      },
    }),
  ],
})
export class AppModule {}
```
The redact configuration is critical: it automatically censors sensitive fields before they ever reach your log output. More on this in the next section.
## Redacting Sensitive Data
Sensitive data leaking into logs is one of the most common and costly compliance failures. It often happens accidentally: a developer logs req.body for debugging and forgets to remove it before merging.
### What to Always Redact
- Authentication: passwords, tokens, API keys, session IDs, JWTs
- Payment data: credit card numbers, CVVs, bank account numbers
- PII: full names combined with identifiers, email addresses in certain contexts, phone numbers, dates of birth
- Health data: any field that could be classified as PHI under HIPAA
- Infrastructure secrets: database connection strings, internal IPs, service credentials
### Redaction Strategies
Field-level redaction (as shown with Pino's redact config) is the most reliable approach: it operates at the logger level before output, regardless of what gets passed in.
For custom redaction logic, implement a sanitization utility:
```typescript
// src/common/utils/sanitize-log.util.ts
// Entries are lowercase because lookups use key.toLowerCase()
const SENSITIVE_KEYS = new Set([
  'password', 'token', 'secret', 'authorization',
  'creditcard', 'ssn', 'apikey', 'refreshtoken',
]);

export function sanitizeForLog(obj: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(obj).map(([key, value]) => [
      key,
      SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : value,
    ]),
  );
}
```
Use this utility whenever logging request payloads or user-supplied data:
```typescript
this.logger.log({
  message: 'User registration attempt',
  payload: sanitizeForLog(dto),
});
```
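Note that the utility above only inspects top-level keys. For nested payloads, a recursive variant is safer; here is a sketch (same key list, lowercased, walking objects and arrays):

```typescript
// Lowercased key list; lookups use key.toLowerCase()
const SENSITIVE_KEYS = new Set([
  'password', 'token', 'secret', 'authorization',
  'creditcard', 'ssn', 'apikey', 'refreshtoken',
]);

// Recursively walks objects and arrays, redacting any sensitive key at any depth.
export function deepSanitize(value: unknown): unknown {
  if (Array.isArray(value)) return value.map(deepSanitize);
  if (value !== null && typeof value === 'object') {
    return Object.fromEntries(
      Object.entries(value as Record<string, unknown>).map(([key, v]) => [
        key,
        SENSITIVE_KEYS.has(key.toLowerCase()) ? '[REDACTED]' : deepSanitize(v),
      ]),
    );
  }
  return value;
}
```

This catches cases like `{ user: { credentials: { password: "..." } } }` that a shallow scan would miss.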
### Never Log Full Request Bodies Indiscriminately
Logging req.body wholesale in a middleware is a common antipattern. Instead, log only specific, known-safe fields:
```typescript
// ❌ Dangerous — logs everything including passwords
this.logger.log({ body: request.body });

// ✅ Safe — log only what you explicitly need
this.logger.log({
  message: 'Login attempt',
  email: request.body.email, // safe — not a secret
  ip: request.ip,
});
```
## Request Correlation with Trace IDs
In a distributed system, a single user action triggers calls across multiple services. Without a shared identifier, correlating logs across services is nearly impossible.
Request correlation IDs (also called trace IDs) solve this by attaching a unique identifier to every request at the entry point (API Gateway or first service), then propagating it through every downstream call via headers.
```typescript
// src/common/middleware/correlation-id.middleware.ts
import { Injectable, NestMiddleware } from '@nestjs/common';
import { Request, Response, NextFunction } from 'express';
import { v4 as uuidv4 } from 'uuid';

@Injectable()
export class CorrelationIdMiddleware implements NestMiddleware {
  use(req: Request, res: Response, next: NextFunction) {
    const correlationId = (req.headers['x-correlation-id'] as string) ?? uuidv4();
    req.headers['x-correlation-id'] = correlationId;
    res.setHeader('x-correlation-id', correlationId);
    next();
  }
}
```
Include the correlationId in every log entry. When an incident occurs, you can filter your entire log aggregator by a single ID and see the complete request journey across every service, with timestamps, durations, and error context.
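With nestjs-pino, one way to stamp the ID onto every request-scoped entry is pino-http's customProps hook. A sketch, assuming a pino-http version that supports customProps:

```typescript
import type { IncomingMessage } from 'node:http';

// Pulls the ID set by the correlation middleware out of the request so the
// logger can attach it to every log line for that request.
export function correlationProps(req: Pick<IncomingMessage, 'headers'>) {
  return { correlationId: req.headers['x-correlation-id'] ?? 'unknown' };
}

// Wiring it up in LoggerModule.forRoot (sketch):
// pinoHttp: { customProps: correlationProps }
```

Services making downstream calls should also forward the `x-correlation-id` header on outgoing requests so the chain stays unbroken.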
## Log Retention and Access Control
Storing logs indefinitely is a compliance and cost problem. Define a retention policy that balances operational need with regulatory requirements:
| Log Type | Recommended Retention |
|---|---|
| Application errors | 90 days |
| Access / audit logs | 1-7 years (varies by regulation) |
| Debug logs | 7-14 days |
| Security events | 1-2 years |
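If logs are archived to object storage, retention like the table above can be enforced mechanically with lifecycle rules. A sketch for S3, assuming logs are partitioned by prefix (adjust prefixes and day counts to your own layout and regulations):

```json
{
  "Rules": [
    {
      "ID": "expire-debug-logs",
      "Status": "Enabled",
      "Filter": { "Prefix": "debug/" },
      "Expiration": { "Days": 14 }
    },
    {
      "ID": "expire-app-errors",
      "Status": "Enabled",
      "Filter": { "Prefix": "errors/" },
      "Expiration": { "Days": 90 }
    }
  ]
}
```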
Beyond retention, control who can access logs:
- Logs containing any PII should be access-controlled - not every engineer needs to read production user data.
- Use role-based access in your log aggregator (Datadog, Splunk, ELK) to restrict sensitive log streams.
- Enable audit logging on your log aggregator itself - knowing who queried which logs is part of your compliance story.
- Consider log encryption at rest for any aggregator storing sensitive application data.
## Alerting: Turning Logs Into Action
Logs without alerts are archives, not observability. Define alert rules on your log aggregator for conditions that require immediate human attention:
- Error rate spike: more than X ERROR-level logs per minute
- Authentication failures: repeated 401s from the same IP (brute force indicator)
- Downstream service failures: sustained connection errors to a dependency
- Zero logs: a sudden absence of logs from a service may indicate it has crashed
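These rules live in the aggregator, but the underlying threshold logic is simple to state. A sketch of the error-rate rule as a sliding-window counter (window size and threshold are illustrative assumptions):

```typescript
// Counts ERROR events in a sliding time window and flags when the
// count exceeds a threshold: the logic an aggregator alert rule encodes.
class ErrorRateAlert {
  private timestamps: number[] = [];

  constructor(
    private readonly windowMs = 60_000, // 1-minute window
    private readonly threshold = 50,    // "more than X per minute"
  ) {}

  // Record one ERROR event; returns true when the alert should fire.
  record(now: number = Date.now()): boolean {
    this.timestamps.push(now);
    // Drop events that have aged out of the window
    this.timestamps = this.timestamps.filter((t) => now - t <= this.windowMs);
    return this.timestamps.length > this.threshold;
  }
}
```

In practice you would express this as an aggregator query rather than application code; the sketch just makes the semantics of "more than X per minute" concrete.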
Keep alert thresholds tuned: too sensitive and engineers become desensitized to noise; too lenient and real incidents go undetected.
## Logging Antipatterns to Avoid
Logging in a tight loop: logging inside high-frequency loops or hot paths creates I/O pressure and can degrade application performance. Sample logs or aggregate counters instead.
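A minimal counter-based sampler (a sketch) that gates log statements so only every Nth occurrence in a hot path is emitted:

```typescript
// Returns a function that yields true once per `rate` calls.
// Use it to gate log statements inside hot loops instead of
// logging every single iteration.
function makeSampler(rate: number): () => boolean {
  let count = 0;
  return () => {
    count += 1;
    if (count >= rate) {
      count = 0;
      return true;
    }
    return false;
  };
}

const sampleEvery100 = makeSampler(100);
// Inside a hot loop (sketch):
// if (sampleEvery100()) logger.debug({ processed: i }, 'progress');
```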
Using console.log in production: console.log bypasses your structured logger, produces unstructured output, and cannot be controlled by log level configuration. Replace all instances with your logger before deploying.
Logging and rethrowing without context: catching an exception, logging it, and rethrowing it without adding context creates duplicate log entries with no additional signal. Add context or don't log: pick one.
Storing logs locally on application servers: local log files are lost when containers restart or servers are terminated. Always ship logs to an external aggregator in real time.
## Recommended Tooling
| Category | Tool |
|---|---|
| Logger (Node.js) | Pino, Winston |
| Log aggregation | Datadog, ELK Stack, Grafana Loki |
| Distributed tracing | OpenTelemetry, Jaeger, Tempo |
| Alerting | PagerDuty, Grafana Alerting, Datadog Monitors |
| Compliance scanning | Nightfall, Presidio (PII detection in logs) |
## Conclusion
Production logging is not a feature you bolt on after launch; it's a foundational engineering practice that determines how quickly your team can detect, diagnose, and resolve incidents. Done well, it's invisible to users and invaluable to engineers. Done poorly, it's either useless noise or a ticking compliance time bomb.
The formula is straightforward: use structured JSON logs, redact sensitive fields at the logger level, correlate requests with trace IDs, enforce retention policies, and alert on what matters. Everything else is tuning.
Observability and security are not in conflict: with the right logging architecture, they reinforce each other.
What log aggregator does your team use in production? Share your stack in the comments - we'd love to hear how different teams approach this.