Zero-Allocation PII Redaction in Go: Processing 780MB of Logs in Under 3 Minutes
Every team feeding logs to LLMs has the same dirty secret: those logs are full of emails, IP addresses, credit card numbers, and government IDs. I know because I built a tool to find them.
After scanning 10GB of production logs at work, I found 47,000+ PII instances — emails, IPs, phone numbers — all sitting in plain text, waiting to be piped into ChatGPT or fine-tuning datasets.
So I built a local-first PII redaction engine in pure Go. No cloud. No API keys. No telemetry. This post breaks down the engineering decisions that made it fast.
The Problem: PII Leaks in AI Pipelines
The AI workflow looks like this:
Production Logs → Pre-processing → LLM API / Fine-tuning
The gap is between step 1 and step 2. Most teams skip sanitization because:
- Cloud DLP services (Google, AWS Macie) require uploading your data — defeating the purpose
- Python-based tools (Presidio, scrubadub) are slow on large log files and need heavy dependencies
- Manual regex is fragile and doesn't handle context (is `1.2.3.4` an IP address or a version number?)
I needed something that could:
- Process 780MB in < 3 minutes on a single machine
- Run 100% offline — no network calls, ever
- Handle 11+ PII types across 7 jurisdictions (GDPR, HIPAA, CCPA, PIPL, APPI, PDPA)
- Produce consistent tokenization for AI training (`user@test.com` → `[EMAIL_0001]` everywhere)
Architecture: Why Go, and Why Zero-Allocation
Go was chosen for one reason: predictable memory behavior at high throughput. No GC pauses, no JIT warmup, no pip dependency hell.
CLI / GUI Entry
→ Fyne GUI (drag & drop) | CLI Mode (batch processing)
→ Compliance Profiles (PIPL / GDPR / CCPA / HIPAA / APPI / PDPA)
→ Core Engine — pure []byte pipeline:
PreFilter → Regex → Validate → Tokenize → Write
powered by sync.Pool · lock-free stats · streaming I/O
The engine never converts []byte to string in the hot path. Here's why that matters:
Trick 1: PreFilter Byte Probes
Before running regex (expensive), every line passes through a cheap byte probe:
type Pattern struct {
    ID        string
    Name      string
    Regex     *regexp.Regexp
    PreFilter func(line []byte) bool  // ← fast reject
    Validate  func(match []byte) bool // ← context-aware
}
For example, the email pattern's PreFilter just checks if the line contains @:
PreFilter: func(line []byte) bool {
    return bytes.ContainsRune(line, '@')
}
Result: ~80% of lines are skipped before regex runs. On a 780MB server log, this saves ~45 seconds.
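Wired together, the probe gates the regex like this — a minimal runnable sketch, where `emailPattern` and `scanLine` are illustrative names and the email regex is deliberately simplified:

```go
package main

import (
	"bytes"
	"fmt"
	"regexp"
)

type Pattern struct {
	Regex     *regexp.Regexp
	PreFilter func(line []byte) bool
}

// emailPattern: illustrative only, with a simplified email regex.
var emailPattern = Pattern{
	Regex:     regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`),
	PreFilter: func(line []byte) bool { return bytes.ContainsRune(line, '@') },
}

// scanLine runs the cheap byte probe first; the regex only executes
// when the probe passes.
func scanLine(p Pattern, line []byte) [][]byte {
	if p.PreFilter != nil && !p.PreFilter(line) {
		return nil // fast reject: no regex work at all
	}
	return p.Regex.FindAll(line, -1)
}

func main() {
	fmt.Printf("%s\n", scanLine(emailPattern, []byte("login ok for user@test.com"))) // [user@test.com]
	fmt.Println(scanLine(emailPattern, []byte("GET /health 200")) == nil)            // true
}
```

A line like `GET /health 200` never reaches the regex engine at all — the probe is a single `bytes.ContainsRune` pass.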
Trick 2: sync.Pool Buffer Reuse
Every output line needs a buffer. Allocating and GC'ing millions of buffers kills throughput:
var bufPool = sync.Pool{
    New: func() interface{} {
        b := make([]byte, 0, 4096)
        return &b
    },
}

// In hot loop:
bp := bufPool.Get().(*[]byte)
buf := (*bp)[:0] // reset length, keep capacity
// ... write to buf ...
*bp = buf        // store back in case append grew the slice
bufPool.Put(bp)  // return to pool
Result: heap allocations drop from millions to ~50. GC pressure essentially zero.
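You can verify the steady state with `testing.AllocsPerRun`. A self-contained sketch — the body of `processLine` is illustrative (the real engine writes the redacted line into the buffer here):

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

var bufPool = sync.Pool{
	New: func() interface{} {
		b := make([]byte, 0, 4096)
		return &b
	},
}

// processLine borrows a buffer from the pool, writes into it, and
// returns it to the pool when done.
func processLine(line []byte) {
	bp := bufPool.Get().(*[]byte)
	buf := (*bp)[:0] // reset length, keep 4 KiB capacity
	buf = append(buf, line...)
	*bp = buf // store back in case append grew the slice
	bufPool.Put(bp)
}

func main() {
	line := []byte("10.0.0.1 GET /index")
	allocs := testing.AllocsPerRun(10000, func() { processLine(line) })
	fmt.Println("allocs per line:", allocs)
}
```

Once the pool is warm, the per-line allocation count settles at zero — the only allocations are the pool's initial `New` calls.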
Trick 3: Context-Aware Validation
The regex for IPv4 (\b\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\b) matches version numbers like 1.2.3.4 and file paths like data.2024.01.15. The Validate callback handles this:
Validate: func(match []byte) bool {
    // Reject if preceded by "version", "v", "=", etc.
    // Reject if any octet is > 255
    // Reject if it looks like a date pattern
    return isLikelyIP(match)
}
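A minimal sketch of what a validator like `isLikelyIP` can check from the match bytes alone (octet ranges); the context rules above — a preceding "version", date-like shapes — need the surrounding line and are omitted here:

```go
package main

import (
	"bytes"
	"fmt"
	"strconv"
)

// isLikelyIP reports whether match parses as exactly four dot-separated
// octets, each in 0..255. Context-based rejection is out of scope for
// this sketch.
func isLikelyIP(match []byte) bool {
	parts := bytes.Split(match, []byte("."))
	if len(parts) != 4 {
		return false
	}
	for _, p := range parts {
		if len(p) == 0 || len(p) > 3 {
			return false
		}
		n, err := strconv.Atoi(string(p))
		if err != nil || n > 255 {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(isLikelyIP([]byte("10.0.0.1")))  // true
	fmt.Println(isLikelyIP([]byte("300.1.2.3"))) // false: octet out of range
}
```

Note the `strconv.Atoi(string(p))` conversion allocates; a production validator would parse the octets directly from the byte slice to stay on the zero-allocation path.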
This eliminated 94% of false positives in our production logs without sacrificing recall.
Trick 4: RWMutex Tokenization
For AI training data, you need consistent tokens: the same email should always map to [EMAIL_0001]. The tokenizer uses a read-write split:
type Tokenizer struct {
    mu     sync.RWMutex
    tokens map[string]string
    counts map[string]int
}
func (t *Tokenizer) GetToken(typ, value string) string {
    key := typ + ":" + value
    t.mu.RLock()
    if tok, ok := t.tokens[key]; ok {
        t.mu.RUnlock()
        return tok // fast path: read-only
    }
    t.mu.RUnlock()

    t.mu.Lock()
    defer t.mu.Unlock()
    if tok, ok := t.tokens[key]; ok {
        return tok // another goroutine won the race
    }
    t.counts[typ]++
    newToken := fmt.Sprintf("[%s_%04d]", typ, t.counts[typ])
    t.tokens[key] = newToken
    return newToken // slow path: only for first occurrence
}
In real logs, PII values repeat heavily. The RLock fast path handles ~95% of lookups with zero contention.
Benchmark: 780MB Production Log
| Metric | Value |
|---|---|
| Input size | 780 MB (4.2M lines) |
| PII instances found | 47,283 |
| Processing time | 2 min 48 sec |
| Peak memory | 12 MB |
| Throughput | ~4.6 MB/s |
| False positive rate | < 0.3% (validated on 1,000 random samples) |
For comparison, a Python regex-based approach on the same file took 23 minutes with 1.8GB peak memory.
Multi-Jurisdiction Compliance
The tool ships with 7 compliance profiles, each enabling only the PII patterns required by that jurisdiction:
| Profile | Jurisdiction | What It Catches |
|---|---|---|
| `default` | Full scan | All 11 pattern types |
| `pipl` | China (PIPL) | ID Card, CN Mobile, Email, IPv4 |
| `gdpr` | EU (GDPR) | Email, IPv4/v6, Credit Card |
| `ccpa` | California (CCPA) | Email, IP, Phone, Credit Card, SSN |
| `hipaa` | US Medical (HIPAA) | Email, Phone, SSN, IPv4 |
| `appi` | Japan (APPI) | Email, Phone, My Number, IPv4 |
| `pdpa` | Singapore/Thailand (PDPA) | Email, Phone, IPv4, ID Card |
Switch profiles with a single flag:
./pii_redactor --input server.log --profile gdpr --output clean.log
Audit Report
Every run generates an audit report — essential for compliance documentation:
═══════════════════════════════════════════
PII Redaction Audit Report
═══════════════════════════════════════════
File: server_2024.log
Encoding: UTF-8
Lines: 4,218,903
Duration: 2m48s
─────────────────────────────────────
PII Type Hits Examples
─────────────────────────────────────
Email 12,847 user@corp.com → [EMAIL_0001]
IPv4 28,102 10.0.0.1 → [IP_0001]
Credit Card 891 4111...1111 → [CC_0001]
Phone (Intl) 2,443 +1-202-... → [PHONE_0001]
JWT 3,000 eyJhbG... → [JWT_0001]
═══════════════════════════════════════════
The tokenization map ([EMAIL_0001] ↔ original value) is kept in memory only during processing and never written to disk — zero data leakage by design.
Try It
The tool runs on Windows, macOS (Apple Silicon), and Linux. No dependencies, no Docker, no cloud account.
GitHub: github.com/gn000q/pii_redactor
Download pre-built binaries: PII Redactor V2 on Gumroad — includes cross-platform binaries, sample test data, config templates, and a quick-start guide.
What's Next
I'm considering adding:
- YAML/JSON structured log parsing (currently handles flat text)
- Custom pattern loading from external config files
- Streaming mode for piped input (`tail -f | pii_redactor`)
What does your PII cleanup workflow look like? I'd love to hear if you're dealing with similar issues — especially if you're feeding logs to AI APIs.