Polliog
PII in Your Logs Is a GDPR Time Bomb - Here's How to Defuse It

Your application is probably logging PII right now.

Not maliciously - it happens naturally. A user submits a form with their email. Your framework logs the full request body for debugging. The email lands in CloudWatch, Datadog, or your ELK cluster. It sits there for 90 days, or 365, or however long your retention policy says.

Under GDPR, that's a breach waiting to be reported. Under HIPAA, it's a violation. Under any audit, it's a finding.

The fix isn't "tell developers to be careful." Developers are already careful - until they're debugging a production incident at 2am and add a quick console.log(request.body). The fix is a masking layer that runs automatically, before any log hits storage.

This article is about building that layer in Node.js.


What PII Actually Looks Like in Logs

Before masking, you need to know what you're masking. PII in logs shows up in three forms:

Structured fields - JSON payloads where the key makes the value obvious:

{ "email": "alice@example.com", "password": "hunter2", "ssn": "123-45-6789" }

Embedded in strings - PII inside log messages:

User alice@example.com failed login from 192.168.1.1
Authorization: Bearer eyJhbGciOiJIUzI1NiJ9...

Nested or transformed - Base64-encoded, URL-encoded, or buried in stack traces:

Error processing request body: %7B%22email%22%3A%22alice%40example.com%22%7D

A good masking pipeline handles all three. Most tutorials only handle the first one.
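A quick way to see why the third form matters: a plain email regex (the same shape used later in this article) catches PII embedded in strings but lets URL-encoded payloads sail through. A standalone sketch:

```typescript
// The same email in form 2 (embedded in a string) and form 3 (URL-encoded)
const EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/

const embedded = 'User alice@example.com failed login from 192.168.1.1'
const encoded = 'Error processing request body: %7B%22email%22%3A%22alice%40example.com%22%7D'

console.log(EMAIL.test(embedded))                    // true  -- a regex pass catches it
console.log(EMAIL.test(encoded))                     // false -- %40 hides the @, so it leaks
console.log(EMAIL.test(decodeURIComponent(encoded))) // true  -- visible only after decoding
```

This is why the edge-case section later in the article decodes before re-masking.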


The Architecture: Mask at Ingestion, Not at Display

There are two schools of thought on when to mask:

  1. Mask at display - store everything, redact when showing logs in the UI
  2. Mask at ingestion - strip PII before it ever reaches storage

Mask at ingestion is the only defensible choice for compliance. If PII reaches your database, it's already a GDPR problem - even if you never display it. The data is there, it can be breached, and you own the liability.

The pipeline looks like this:

Application → Log event → [Masking layer] → Storage
                                ↑
                         This is where we operate

The masking layer runs synchronously, in-process, before any network call to your log storage. No PII leaves the machine.
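The ordering can be sketched in a few lines. Everything here is a stand-in for illustration (a toy masker and an in-memory transport), but the structural point is real: the mask runs synchronously before the send, so unmasked data never reaches the transport:

```typescript
type Transport = (event: string) => void

// Wire masking in front of the transport so the unmasked event
// never reaches any network call
function createLogger(mask: (s: string) => string, send: Transport) {
  return (message: string) => {
    const safe = mask(message) // synchronous, in-process
    send(safe)                 // only masked data leaves the machine
  }
}

// Stand-in masker and in-memory "storage" for the demo
const stored: string[] = []
const log = createLogger(
  (s) => s.replace(/\S+@\S+/g, '[EMAIL]'),
  (e) => stored.push(e)
)

log('User alice@example.com logged in')
console.log(stored[0]) // "User [EMAIL] logged in"
```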


Building the Masking Layer

Step 1: Define your masking strategies

Before writing regex, decide what "masked" means for your data. Three strategies cover most situations:

type MaskingStrategy = 'mask' | 'redact' | 'hash'

// mask: show partial value - useful for debugging (still recognizable, not storable)
// "alice@example.com" → "al***@***.com"

// redact: replace entirely - use when value has no debugging value
// "hunter2" → "[REDACTED]"

// hash: deterministic SHA-256 - use when you need to correlate without exposing
// "alice@example.com" → "sha256:2f3a4b..." (same input always produces same hash)
// ⚠️ Always set PII_HASH_SALT in your environment. Emails and SSNs have low entropy
// and are trivially reversible from unsalted hashes via rainbow tables.

Hashing is underused. It lets you answer "did this user appear in these logs?" without storing the actual email. Useful for audit trails and correlation.
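To make the correlation concrete, here is a sketch of the lookup side, assuming a hypothetical correlationKey helper that mirrors the salted, truncated SHA-256 strategy above:

```typescript
import { createHash } from 'crypto'

// Hypothetical helper: recompute the stored hash for a known identifier
// (same formula as the 'hash' strategy: salted SHA-256, first 16 hex chars)
function correlationKey(value: string, salt: string): string {
  return `sha256:${createHash('sha256').update(value + salt).digest('hex').slice(0, 16)}`
}

// "Did this user appear in these logs?" -- hash the known email and match
// against stored log lines; the raw address never needs to be in storage
const storedLogs = [
  `login ok user=${correlationKey('alice@example.com', 's3cret')}`,
  `login ok user=${correlationKey('bob@example.com', 's3cret')}`,
]
const needle = correlationKey('alice@example.com', 's3cret')
console.log(storedLogs.filter((line) => line.includes(needle)).length) // 1
```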

Step 2: Pattern-based detection

import { createHash } from 'crypto'

const PII_PATTERNS: Array<{
  name: string
  pattern: RegExp
  strategy: MaskingStrategy
}> = [
  // Email addresses
  {
    name: 'email',
    pattern: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g,
    strategy: 'mask',
  },
  // Credit card numbers (Format-valid patterns — prefix and length, not Luhn checksum)
  {
    name: 'credit_card',
    pattern: /\b(?:4[0-9]{12}(?:[0-9]{3})?|5[1-5][0-9]{14}|3[47][0-9]{13}|3(?:0[0-5]|[68][0-9])[0-9]{11})\b/g,
    strategy: 'redact',
  },
  // US Social Security Numbers
  {
    name: 'ssn',
    pattern: /\b\d{3}-\d{2}-\d{4}\b/g,
    strategy: 'redact',
  },
  // Bearer tokens / JWT
  {
    name: 'bearer_token',
    pattern: /Bearer\s+[A-Za-z0-9\-_=]+\.[A-Za-z0-9\-_=]+\.?[A-Za-z0-9\-_.+/=]*/g,
    strategy: 'redact',
  },
  // AWS access keys
  {
    name: 'aws_access_key',
    pattern: /\b(AKIA|AIPA|ASIA)[A-Z0-9]{16}\b/g,
    strategy: 'redact',
  },
  // IPv4 addresses (optional — some teams want these, some don't)
  {
    name: 'ipv4',
    pattern: /\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/g,
    strategy: 'mask',
  },
  // Phone numbers (loose — adjust for your region)
  {
    name: 'phone',
    pattern: /(\+?[\d\s\-().]{10,15})/g,
    strategy: 'mask',
  },
]

function applyStrategy(value: string, strategy: MaskingStrategy): string {
  switch (strategy) {
    case 'redact':
      return '[REDACTED]'

    case 'hash':
      return `sha256:${createHash('sha256').update(value + (process.env.PII_HASH_SALT ?? '')).digest('hex').slice(0, 16)}`

    case 'mask': {
      if (value.includes('@')) {
        // Email masking: show first 2 chars of local part and the domain TLD
        const [local, domain] = value.split('@')
        const tld = domain.split('.').slice(1).join('.')
        return `${local.slice(0, 2)}***@***${tld ? `.${tld}` : ''}`
      }
      // Generic masking: show first and last char, mask middle
      if (value.length <= 4) return '****'
      return `${value[0]}${'*'.repeat(value.length - 2)}${value[value.length - 1]}`
    }
  }
}

function maskString(input: string): string {
  let result = input
  for (const { pattern, strategy } of PII_PATTERNS) {
    result = result.replace(pattern, (match) => applyStrategy(match, strategy))
  }
  return result
}

Step 3: Field-name detection

Pattern matching catches PII embedded in strings. But for structured JSON, matching on field names is faster and more reliable:

const SENSITIVE_FIELD_NAMES = new Set([
  'password', 'passwd', 'secret', 'token', 'api_key', 'apikey', 'api-key',
  'authorization', 'auth', 'credential', 'credentials',
  'email', 'e_mail', 'e-mail',
  'ssn', 'social_security', 'national_id',
  'credit_card', 'card_number', 'cvv', 'cvc',
  'phone', 'phone_number', 'mobile',
  'dob', 'date_of_birth', 'birthday',
  'address', 'street_address', 'postal_code', 'zip_code',
  'ip_address', 'ip', 'x_forwarded_for',
])

function isFieldSensitive(key: string): boolean {
  const normalized = key.toLowerCase().replace(/[-_\s]/g, '_')
  return SENSITIVE_FIELD_NAMES.has(normalized)
}
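The separator normalization is what makes the set practical: header-style, kebab-case, and space-separated spellings all collapse to one canonical entry. A condensed, self-contained copy of the check to illustrate:

```typescript
// Condensed copy of the field-name check, with a trimmed-down set
const SENSITIVE = new Set(['api_key', 'x_forwarded_for', 'password'])

function isFieldSensitive(key: string): boolean {
  // lowercase, then fold '-', '_' and whitespace into '_' before the lookup
  const normalized = key.toLowerCase().replace(/[-_\s]/g, '_')
  return SENSITIVE.has(normalized)
}

console.log(isFieldSensitive('X-Forwarded-For')) // true
console.log(isFieldSensitive('API-Key'))         // true
console.log(isFieldSensitive('Api Key'))         // true
console.log(isFieldSensitive('theme'))           // false
```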

Step 4: Recursive object traversal

The masking function needs to traverse nested objects - request bodies aren't always flat:

export type LogValue = string | number | boolean | null | LogObject | LogValue[]
export type LogObject = { [key: string]: LogValue }

export function maskObject(obj: LogObject, depth = 0): LogObject {
  // Prevent infinite recursion on circular references
  if (depth > 10) return { '[max_depth_exceeded]': true }

  const result: LogObject = {}

  for (const [key, value] of Object.entries(obj)) {
    if (isFieldSensitive(key)) {
      // Field name match: redact or hash based on field type
      // Note: this hardcodes the strategy per field type for brevity. In a production
      // system, map field names to your central PII_PATTERNS configuration to keep
      // strategies consistent across both field-name and pattern-based detection.
      const strategy = key.toLowerCase().includes('email') ? 'hash' : 'redact'
      result[key] = typeof value === 'string'
        ? applyStrategy(value, strategy)
        : '[REDACTED]'
      continue
    }

    if (typeof value === 'string') {
      result[key] = maskString(value)
    } else if (Array.isArray(value)) {
      result[key] = maskArray(value, depth + 1)
    } else if (typeof value === 'object' && value !== null) {
      result[key] = maskObject(value as LogObject, depth + 1)
    } else {
      result[key] = value
    }
  }

  return result
}

// Arrays need their own recursion: passing an array to maskObject would
// collapse it into an index-keyed object ({ "0": ..., "1": ... })
function maskArray(arr: LogValue[], depth: number): LogValue[] {
  if (depth > 10) return ['[max_depth_exceeded]']
  return arr.map((item) => {
    if (typeof item === 'string') return maskString(item)
    if (Array.isArray(item)) return maskArray(item, depth + 1)
    if (typeof item === 'object' && item !== null) return maskObject(item, depth + 1)
    return item
  })
}

Step 5: The masking pipeline entry point

Wrap everything in a single function that handles both structured objects and raw strings:

export function maskPII(input: unknown): unknown {
  if (typeof input === 'string') {
    return maskString(input)
  }

  if (typeof input === 'object' && input !== null && !Array.isArray(input)) {
    return maskObject(input as LogObject)
  }

  if (Array.isArray(input)) {
    return input.map(maskPII)
  }

  return input
}

Integrating With Your Logger

With Pino (recommended for Node.js)

Pino supports redact paths natively, but it only handles known field paths. For dynamic detection, use a serializers hook:

import pino from 'pino'
import { maskPII } from './masking'

const logger = pino({
  serializers: {
    // Mask the entire request object
    req: (req) => maskPII({
      method: req.method,
      url: req.url,
      headers: req.headers,
      body: req.body,
    }),
    // Mask arbitrary metadata
    meta: (meta) => maskPII(meta),
  },
})

// Usage (inside a request handler, where `req` and `user` are in scope)
logger.info({ req, meta: { userId: user.email } }, 'Request received')

With Winston

import winston from 'winston'
import { maskPII } from './masking'

const maskingTransform = winston.format((info) => {
  // maskPII builds a new object from string keys only, which would drop the
  // Symbol properties (level/message) winston attaches to info -- so copy the
  // masked values back onto the original info object instead
  const masked = maskPII(info) as Record<string, unknown>
  for (const key of Object.keys(masked)) {
    info[key] = masked[key]
  }
  return info
})

const logger = winston.createLogger({
  format: winston.format.combine(
    maskingTransform(),
    winston.format.json()
  ),
  transports: [new winston.transports.Console()],
})

With a raw HTTP ingest endpoint

If you're building an ingest endpoint that receives logs from external sources (SDKs, collectors), apply masking server-side before writing to storage:

import Fastify from 'fastify'
import { maskObject, type LogObject } from './masking'

// `db` is assumed to be an already-configured database client; the
// insertInto(...).values(...).execute() chain below is Kysely-style
const app = Fastify()

app.post('/api/v1/ingest', async (request, reply) => {
  const { logs } = request.body as { logs: LogObject[] }

  const maskedLogs = logs.map((log) => ({
    ...maskObject(log),
    ingested_at: new Date().toISOString(),
  }))

  await db.insertInto('logs').values(maskedLogs).execute()

  return reply.send({ accepted: maskedLogs.length })
})

The Edge Cases Nobody Talks About

URL-encoded and Base64-encoded PII

Attackers (and frameworks) encode data. Your masking needs to handle it:

function maskStringWithDecoding(input: string): string {
  let result = input

  // Try URL decode and re-mask
  try {
    const decoded = decodeURIComponent(result)
    if (decoded !== result) {
      result = encodeURIComponent(maskString(decoded))
    }
  } catch {}

  // Try Base64 decode and re-mask
  const base64Pattern = /\b[A-Za-z0-9+/]{20,}={0,2}\b/g
  result = result.replace(base64Pattern, (match) => {
    try {
      const decoded = Buffer.from(match, 'base64').toString('utf8')
      // Only re-encode if it looks like it decoded to something meaningful
      if (/^[\x20-\x7E]+$/.test(decoded)) {
        const masked = maskString(decoded)
        if (masked !== decoded) {
          return Buffer.from(masked).toString('base64')
        }
      }
    } catch {}
    return match
  })

  return maskString(result)
}

Stack traces

Stack traces can contain PII in exception messages:

Error: User not found for email alice@example.com
    at UserService.findByEmail (user.service.ts:42)
function maskStackTrace(stack: string): string {
  return stack
    .split('\n')
    .map((line, index) => {
      // Mask the error message line (first line), leave stack frames alone
      if (index === 0) return maskString(line)
      return line
    })
    .join('\n')
}

Performance considerations

The masking pipeline runs on every log event. Profile it:

// Simple benchmark
const iterations = 10_000
const sampleLog = {
  message: 'User alice@example.com logged in from 192.168.1.1',
  email: 'alice@example.com',
  headers: { authorization: 'Bearer eyJhbGciOiJIUzI1NiJ9.test.test' },
}

const start = performance.now()
for (let i = 0; i < iterations; i++) {
  maskObject(sampleLog)
}
const elapsed = performance.now() - start
console.log(`${iterations} iterations in ${elapsed.toFixed(2)}ms (${(elapsed / iterations).toFixed(3)}ms each)`)

On a modern machine, a well-implemented masking pipeline takes 0.05-0.2ms per log event. At 1,000 logs/second, that's 50-200ms of CPU per second — acceptable for most applications, but worth measuring for high-throughput services.

If performance is a concern, compile your regex patterns once outside the function — the compilation cost is paid only once, not on every log event:

// Bad: regex compiled on every call
function maskEmail(str: string) {
  return str.replace(/\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g, '***')
}

// Good: compiled once, reused on every call
// Note: String.prototype.replace() manages lastIndex internally — no manual reset needed
const EMAIL_PATTERN = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g
function maskEmail(str: string) {
  return str.replace(EMAIL_PATTERN, '***')
}

Testing Your Masking Pipeline

A masking layer without tests is worse than no masking layer — it gives you false confidence.

import { describe, it, expect } from 'vitest'
import { maskPII, maskObject } from './masking'

describe('PII masking', () => {
  it('masks email addresses in strings', () => {
    const result = maskPII('User alice@example.com logged in') as string
    expect(result).not.toContain('alice@example.com')
    expect(result).toContain('@')  // partial masking, not full redaction
  })

  it('redacts password fields', () => {
    const result = maskObject({ password: 'hunter2', username: 'alice' })
    expect(result.password).toBe('[REDACTED]')
    expect(result.username).toBe('alice')  // non-sensitive fields unchanged
  })

  it('handles nested objects', () => {
    const result = maskObject({
      user: { email: 'alice@example.com', preferences: { theme: 'dark' } }
    })
    expect((result.user as any).email).not.toBe('alice@example.com')
    expect((result.user as any).preferences.theme).toBe('dark')
  })

  it('redacts bearer tokens', () => {
    const result = maskPII('Authorization: Bearer eyJhbGciOiJIUzI1NiJ9.test.sig') as string
    expect(result).toContain('[REDACTED]')
    expect(result).not.toContain('eyJhbGciOiJIUzI1NiJ9')
  })

  it('does not modify non-PII strings', () => {
    const input = 'Server started on port 3000'
    expect(maskPII(input)).toBe(input)
  })

  it('handles null and undefined gracefully', () => {
    expect(() => maskPII(null)).not.toThrow()
    expect(() => maskPII(undefined)).not.toThrow()
  })
})

The Masking Preview Problem

One practical challenge: developers need to test whether their masking rules are working without shipping to production. Build a simple preview endpoint (dev/staging only) that runs the masking pipeline and returns the diff:

if (process.env.NODE_ENV !== 'production') {
  app.post('/debug/mask-preview', async (request, reply) => {
    const input = request.body
    const masked = maskPII(input)
    return reply.send({
      original: input,
      masked,
      changed: JSON.stringify(input) !== JSON.stringify(masked),
    })
  })
}

Call it with a sample log payload and immediately see what gets masked. Faster than print-debugging your way through regex patterns.


Summary

PII masking in logs is not a nice-to-have. It's a compliance requirement, and more importantly, it's the right thing to do with your users' data.

The pattern is straightforward:

  1. Mask at ingestion, not at display
  2. Combine field-name detection (fast, reliable for structured data) with pattern matching (catches PII in strings)
  3. Choose the right strategy per field type: mask for emails, redact for passwords/tokens, hash for correlation keys
  4. Handle edge cases: URL encoding, Base64, stack traces
  5. Test it like production code, because it is production code

The implementation above is about 150 lines of TypeScript. There's no reason every Node.js application logging to CloudWatch, Datadog, or anywhere else shouldn't have something equivalent running before the first log event leaves the process.
