gn000q

Posted on • Originally published at gn000q.hashnode.dev

Zero-Allocation PII Redaction in Go: Processing 780MB of Logs in Under 3 Minutes

Every team feeding logs to LLMs has the same dirty secret: those logs are full of emails, IP addresses, credit card numbers, and government IDs. I know because I built a tool to find them.

After scanning 10GB of production logs at work, I found 47,000+ PII instances — emails, IPs, phone numbers — all sitting in plain text, waiting to be piped into ChatGPT or fine-tuning datasets.

So I built a local-first PII redaction engine in pure Go. No cloud. No API keys. No telemetry. This post breaks down the engineering decisions that made it fast.


The Problem: PII Leaks in AI Pipelines

The AI workflow looks like this:

```
Production Logs → Pre-processing → LLM API / Fine-tuning
```

The gap is between step 1 and step 2. Most teams skip sanitization because:

  1. Cloud DLP services (Google, AWS Macie) require uploading your data — defeating the purpose
  2. Python-based tools (Presidio, scrubadub) are slow on large log files and need heavy dependencies
  3. Manual regex is fragile and doesn't handle context (is 1.2.3.4 an IP or a version number?)

I needed something that could:

  • Process 780MB in < 3 minutes on a single machine
  • Run 100% offline — no network calls, ever
  • Handle 11+ PII types across 7 jurisdictions (GDPR, HIPAA, CCPA, PIPL, APPI, and PDPA in both Singapore and Thailand)
  • Produce consistent tokenization for AI training (user@test.com → [EMAIL_0001] everywhere)

Architecture: Why Go, and Why Zero-Allocation

Go was chosen for one reason: predictable memory behavior at high throughput. Keep allocations off the hot path and the GC has almost nothing to do; there's no JIT warmup and no pip dependency hell either.

```
CLI / GUI Entry
 ├─ Fyne GUI (drag & drop)  |  CLI Mode (batch processing)
 ├─ Compliance Profiles (PIPL / GDPR / CCPA / HIPAA / APPI / PDPA)
 └─ Core Engine — pure []byte pipeline:
      PreFilter → Regex → Validate → Tokenize → Write
      powered by sync.Pool · lock-free stats · streaming I/O
```

The engine never converts []byte to string in the hot path. Here's why that matters:

Trick 1: PreFilter Byte Probes

Before running regex (expensive), every line passes through a cheap byte probe:

```go
type Pattern struct {
    ID        string
    Name      string
    Regex     *regexp.Regexp
    PreFilter func(line []byte) bool  // ← fast reject
    Validate  func(match []byte) bool // ← context-aware
}
```

For example, the email pattern's PreFilter just checks if the line contains @:

```go
PreFilter: func(line []byte) bool {
    return bytes.ContainsRune(line, '@')
}
```

Result: ~80% of lines are skipped before regex runs. On a 780MB server log, this saves ~45 seconds.

Trick 2: sync.Pool Buffer Reuse

Every output line needs a buffer. Allocating and GC'ing millions of buffers kills throughput:

```go
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 0, 4096)
        return &b
    },
}

// In hot loop:
bp := bufPool.Get().(*[]byte)
buf := (*bp)[:0] // reset length, keep capacity
// ... write to buf ...
bufPool.Put(bp) // return to pool
```

Result: heap allocations drop from millions to ~50. GC pressure essentially zero.

Trick 3: Context-Aware Validation

The regex for IPv4 (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) matches version numbers like 1.2.3.4 and file paths like data.2024.01.15. The Validate callback handles this:

```go
Validate: func(match []byte) bool {
    // Reject if preceded by "version", "v", "=", etc.
    // Reject if any octet is > 255
    // Reject if it looks like a date pattern
    return isLikelyIP(match)
}
```

This eliminated 94% of false positives in our production logs without sacrificing recall.

Trick 4: RWMutex Tokenization

For AI training data, you need consistent tokens: the same email should always map to [EMAIL_0001]. The tokenizer uses a read-write split:

```go
type Tokenizer struct {
    mu     sync.RWMutex
    tokens map[string]string
    counts map[string]int
}

func (t *Tokenizer) GetToken(typ, value string) string {
    key := typ + ":" + value

    t.mu.RLock()
    if tok, ok := t.tokens[key]; ok {
        t.mu.RUnlock()
        return tok // fast path: read-only
    }
    t.mu.RUnlock()

    t.mu.Lock()
    defer t.mu.Unlock()
    if tok, ok := t.tokens[key]; ok {
        return tok // another goroutine created it between the locks
    }
    t.counts[typ]++
    newToken := fmt.Sprintf("[%s_%04d]", typ, t.counts[typ])
    t.tokens[key] = newToken
    return newToken // slow path: only for first occurrence
}
```

In real logs, PII values repeat heavily. The RLock fast path handles ~95% of lookups with zero contention.


Benchmark: 780MB Production Log

| Metric | Value |
| --- | --- |
| Input size | 780 MB (4.2M lines) |
| PII instances found | 47,283 |
| Processing time | 2 min 48 sec |
| Peak memory | 12 MB |
| Throughput | ~4.6 MB/s |
| False positive rate | < 0.3% (validated on 1,000 random samples) |

For comparison, a Python regex-based approach on the same file took 23 minutes with 1.8GB peak memory.


Multi-Jurisdiction Compliance

The tool ships with 7 compliance profiles, each enabling only the PII patterns required by that jurisdiction:

| Profile | Jurisdiction | What It Catches |
| --- | --- | --- |
| default | Full scan | All 11 pattern types |
| pipl | China (PIPL) | ID Card, CN Mobile, Email, IPv4 |
| gdpr | EU (GDPR) | Email, IPv4/v6, Credit Card |
| ccpa | California (CCPA) | Email, IP, Phone, Credit Card, SSN |
| hipaa | US Medical (HIPAA) | Email, Phone, SSN, IPv4 |
| appi | Japan (APPI) | Email, Phone, My Number, IPv4 |
| pdpa | Singapore/Thailand | Email, Phone, IPv4, ID Card |

Switch profiles with a single flag:

```bash
./pii_redactor --input server.log --profile gdpr --output clean.log
```

Audit Report

Every run generates an audit report — essential for compliance documentation:

```
═══════════════════════════════════════════
  PII Redaction Audit Report
═══════════════════════════════════════════
  File: server_2024.log
  Encoding: UTF-8
  Lines: 4,218,903
  Duration: 2m48s
  ─────────────────────────────────────
  PII Type          Hits    Examples
  ─────────────────────────────────────
  Email             12,847  user@corp.com → [EMAIL_0001]
  IPv4              28,102  10.0.0.1 → [IP_0001]
  Credit Card          891  4111...1111 → [CC_0001]
  Phone (Intl)       2,443  +1-202-... → [PHONE_0001]
  JWT                3,000  eyJhbG... → [JWT_0001]
═══════════════════════════════════════════
```

The tokenization map ([EMAIL_0001] ↔ original value) is kept in memory only during processing and never written to disk — zero data leakage by design.


Try It

The tool runs on Windows, macOS (Apple Silicon), and Linux. No dependencies, no Docker, no cloud account.

GitHub: github.com/gn000q/pii_redactor

Download pre-built binaries: PII Redactor V2 on Gumroad — includes cross-platform binaries, sample test data, config templates, and a quick-start guide.


What's Next

I'm considering adding:

  • YAML/JSON structured log parsing (currently handles flat text)
  • Custom pattern loading from external config files
  • Streaming mode for piped input (tail -f | pii_redactor)

What does your PII cleanup workflow look like? I'd love to hear if you're dealing with similar issues — especially if you're feeding logs to AI APIs.
