Originally published at theculprit.ai/blog/detecting-pii-in-event-payloads.
This is a working document, not a survey. The patterns below are the ones we actually run against inbound alert payloads in Culprit's tokenizer. They are tuned for one job: catch as much PII in unstructured event text as a regex layer can plausibly catch, while erring toward false positives over false negatives. Where they fail, we say so and describe the fallback.
If you're building a similar pipeline — observability tool, log sanitizer, ingestion middleware in front of an LLM — you can copy the set as-is, but the more useful thing is to read the failure modes and decide whether they apply to your traffic.
The set
Six patterns. All are stateless, all use the global flag so a single String.prototype.matchAll(regex) walks the entire payload, all are scoped to word boundaries to avoid eating the surrounding text. The full source is in packages/shared/src/pii-detect.ts; this is the load-bearing part:
export const PII_PATTERNS = [
  { type: 'email',
    regex: /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g },
  { type: 'ipv4',
    regex: /\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b/g },
  { type: 'ipv6',
    regex: /\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b/g },
  { type: 'phone',
    regex: /(?:\+?1[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b/g },
  { type: 'ssn',
    regex: /\b\d{3}-\d{2}-\d{4}\b/g },
  { type: 'high_entropy',
    regex: /\b[A-Za-z0-9+/=_-]{40,}\b/g },
];
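Wiring the set into a detector is a dozen lines. This is a sketch of the shape, not the exact exports of pii-detect.ts — the names detectPii and PiiMatch are illustrative:

```typescript
type PiiMatch = { type: string; value: string; index: number };

// Run every pattern over the payload and collect (type, value, index)
// triples, sorted by position. matchAll requires the /g flag and copies
// the regex internally, so a shared pattern object is safe to reuse.
function detectPii(
  payload: string,
  patterns: { type: string; regex: RegExp }[],
): PiiMatch[] {
  const matches: PiiMatch[] = [];
  for (const { type, regex } of patterns) {
    for (const m of payload.matchAll(regex)) {
      matches.push({ type, value: m[0], index: m.index ?? 0 });
    }
  }
  return matches.sort((a, b) => a.index - b.index);
}
```

The position sort matters later: replacement has to walk matches in a known order.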
What each one catches, what it misses, and what we do about it:
01 — email
\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b
Catches: the overwhelming majority of email addresses you'll see in practice — paula.holman@acme.com, user+tag@subdomain.example.co.uk, a@b.io.
False negatives: RFC 5322 is much wider than the regex. Quoted local parts ("weird name"@example.com), addresses with comments, IDN domains in their unicode form (user@münchen.de). In several years of looking at production alert payloads we have seen exactly zero of these. They are theoretical.
False positives: anything formatted like an email but used as something else — service-account names that happen to look like addresses, fixture data, JIRA mention syntax in some custom apps. These tokenize harmlessly. The downstream consumer sees an opaque token; the engineer can reveal it if they need to.
Tradeoff: the bracket-class [._%+-] does not include all RFC-permitted characters. We've never regretted that.
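The hit/miss boundary described above, made concrete (sample addresses are invented):

```typescript
const EMAIL = /\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b/g;

// Everyday forms match.
const hits = 'ping paula.holman@acme.com or user+tag@sub.example.co.uk'
  .match(EMAIL); // two matches

// RFC-legal oddities like a quoted local part do not: the quoted
// string never sits directly against the @, so nothing matches.
const miss = '"weird name"@example.com'.match(EMAIL); // null
```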
02 — ipv4
\b(?:(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d\d?)\b
Catches: every well-formed IPv4. The octet alternation 25[0-5]|2[0-4]\d|[01]?\d\d? rejects out-of-range numbers like 999.1.1.1, which keeps the false-positive rate low.
False negatives: zero. Any string that parses as an IPv4 matches this regex.
False positives: version strings and build numbers. (Dates rendered as 2026.05.05 are safe — there are only three dotted groups, and 2026 is out of octet range.) The real false-positive class is software version strings like 10.4.2.1, which the regex cannot distinguish from a private IP. We accept this. A version string tokenized as <TOKEN_…> in an alert is annoying; an exfiltrated customer IP is a breach.
Tradeoff: consider whether you actually want to tokenize private-range IPs (10.0.0.0/8, 192.168.0.0/16, 172.16.0.0/12). They are usually internal infrastructure, not customer data. We tokenize them anyway because the line between "internal" and "customer" gets blurry when you're hosting webhooks for customer-on-premise systems, and the asymmetry from §01 still applies.
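If you do decide to exempt private ranges, the check is simple enough to run post-match, on the matched value, rather than complicating the regex. A sketch — isPrivateIPv4 is not part of the production set:

```typescript
// Classify an already-matched IPv4 as RFC 1918 private space.
// Runs on match values, never on the raw payload.
function isPrivateIPv4(ip: string): boolean {
  const [a, b] = ip.split('.').map(Number);
  return (
    a === 10 ||                        // 10.0.0.0/8
    (a === 192 && b === 168) ||        // 192.168.0.0/16
    (a === 172 && b >= 16 && b <= 31)  // 172.16.0.0/12
  );
}
```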
03 — ipv6
\b(?:[0-9a-fA-F]{1,4}:){7}[0-9a-fA-F]{1,4}\b
Catches: fully-expanded IPv6 addresses. 2001:0db8:85a3:0000:0000:8a2e:0370:7334.
False negatives: every common form of IPv6 you'll actually see. ::1, 2001:db8::1, fe80::1%eth0, IPv4-mapped IPv6 (::ffff:192.168.1.1). The :: zero-compression syntax is not handled; neither is the scope identifier; neither is mixed IPv4/IPv6 notation.
False positives: rare. The : separator and the strict colon count make accidental matches uncommon.
Tradeoff: this is the worst pattern in the set, and we have not yet replaced it. The reason is that leaked IPv6 addresses are themselves a small class of leaks compared to email and bearer tokens. When we do replace it, the right move is two patterns — one for the full form, one for compressed — with the high-entropy fallback as a backstop.
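Until that replacement lands, one plausible stopgap (not what ships today) is a deliberately loose candidate scan validated by a real parser — in Node, net.isIPv6 — which covers ::1, compressed forms, and zone suffixes without hand-writing the full grammar:

```typescript
import { isIPv6 } from 'node:net';

// Loose candidate scan: any run of hex groups with at least two
// colons, optionally with a %zone suffix. Over-matches on purpose;
// the validator does the real work.
const IPV6_CANDIDATE = /(?:[0-9A-Fa-f]{0,4}:){2,}[0-9A-Fa-f]{0,4}(?:%\w+)?/g;

function findIPv6(payload: string): string[] {
  const out: string[] = [];
  for (const m of payload.matchAll(IPV6_CANDIDATE)) {
    // Strip the scope id before validating; isIPv6 rejects it.
    const bare = m[0].split('%')[0];
    if (isIPv6(bare)) out.push(m[0]);
  }
  return out;
}
```

The cost is a parser call per candidate, which is negligible at alert-payload sizes; the benefit is that timestamps like 12:30:45 fall out at validation rather than needing regex gymnastics.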
04 — phone
(?:\+?1[-.\s]?)?(?:\(?\d{3}\)?[-.\s]?)?\d{3}[-.\s]?\d{4}\b
Catches: North American number formats: 5551234567, 555-123-4567, (555) 123-4567, +1 555 123 4567, +1.555.123.4567.
False negatives: non-NANP numbers (UK, EU, Asia). International formatting beyond +1. Longer international digit runs are out of scope by design — a pattern wide enough for them catches too many order numbers — though note the regex has no leading boundary, so a run like 442012345678 still yields a match on its trailing ten digits; prepend (?<!\d) if you want long runs excluded outright.
False positives: seven-to-ten-digit numbers that are not phone numbers. Order IDs. Tracking codes. Long invoice numbers. The regex tokenizes all of them. This is the pattern with the highest false-positive rate in the set, and we accept it for the same reason as §02.
Tradeoff: if your traffic is global, swap this for libphonenumber or a per-country regex set. The performance cost is real (tens of milliseconds per payload) but tractable for a worker that runs after the ingest 200 has already returned.
05 — ssn
\b\d{3}-\d{2}-\d{4}\b
Catches: US Social Security Numbers in their canonical hyphenated form.
False negatives: SSNs without dashes (123456789). SSNs with spaces (123 45 6789). All non-US national-ID formats.
False positives: anything formatted like XXX-XX-XXXX. Some product SKUs, some legacy account numbers, some date-range strings if your team uses an unusual format. Low rate in practice.
Tradeoff: the regex is conservative on purpose. A payload containing 123456789 is more likely to be an order number, an internal ID, or a build artifact than an SSN, and tokenizing every nine-digit run produces a meaningful fraction of false positives in observability traffic. If you specifically need to catch un-hyphenated SSNs (healthcare, payments, government), add a structural rule instead — a ssn field in your event schema that you tokenize regardless of contents.
06 — high_entropy
\b[A-Za-z0-9+/=_-]{40,}\b
Catches: the long, opaque strings you don't have a more specific pattern for. Bearer tokens. JWTs. Most provider API keys (Stripe sk_live_…, AWS AKIA… keys when emitted with the secret, GCP service-account JSON values). Session IDs. Most cryptographic hashes used as identifiers.
False negatives: anything under 40 characters. Some short access keys (a few cloud providers issue 32-char keys; those slip through). UUIDs without dashes (32 chars) — usually not credentials, but worth knowing.
False positives: long base64-encoded blobs that are not credentials — encoded protobufs, serialized state strings, image data URIs, signed-but-not-secret payloads. These tokenize as opaque. Engineers occasionally complain that "the tokenizer redacted my decoded protobuf"; we ask them whether the protobuf contained a customer's name and they reread their own alert and stop complaining.
Tradeoff: the threshold is the entire pattern. We picked 40 by walking back from "what's the shortest credential we want to catch" (a 40-char AWS secret key; the ≈43-char signature segment of an HS256 JWT) and "what's the longest non-credential we don't want to catch" (a 32-char dashless UUID, a 32-char hex MD5). There is no universally correct number. Instrument the false-positive rate against your traffic and iterate. If you change the number, change it once and write down why; do not let it drift.
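The threshold arithmetic, made concrete (both sample strings are invented, not real credentials):

```typescript
const HIGH_ENTROPY = /\b[A-Za-z0-9+/=_-]{40,}\b/g;

// A 40-char credential-shaped string clears the threshold…
const secretish = 'A'.repeat(40);
// …a 32-char dashless-UUID / hex-MD5 shape does not.
const uuid32 = 'd41d8cd98f00b204e9800998ecf8427e';

console.log(secretish.match(HIGH_ENTROPY) !== null); // true
console.log(uuid32.match(HIGH_ENTROPY) !== null);    // false
```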
What the regex set does not catch
Four categories of PII do not yield to regex, and pretending they do is how detectors gain a reputation for being theater.
Personal names. "John Smith" is two common English nouns concatenated. There is no regex that distinguishes "John Smith reported the issue" from "John Smith Auto Parts." The right answer is a structural rule: any field whose schema name is name, customer_name, display_name, full_name, first_name, last_name, account_holder — tokenize unconditionally, regardless of contents.
Addresses. Free-text street addresses are unbounded. Same answer: structural rule on field name (address, street, mailing_address, billing_address).
Free-text disclosures. "The customer mentioned their phone is 5551234567 and their kid's birthday is on the 5th" — the regex set catches the phone number, but the surrounding context is itself revealing. There is no good regex defense. The defense is a structural rule that says any field named notes, comment, description, support_message, customer_message, summary is tokenized as a whole field rather than scanned for patterns.
Account numbers, license plates, and other domain-specific identifiers. These vary too much across industries. If you have them, you know their format; write a domain-specific regex and add it to the set. If you don't know your domain's identifier formats, you have a discovery problem before you have a detection problem.
The pattern: regex catches the universal cases; structural rules on field names catch the contextual cases. A serious tokenizer does both. A toy tokenizer does only the first and lets the second class slip through. If you only have time to build one layer, build the structural rules — most of the high-value leaks are in notes-shaped fields, not in stringified payloads.
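The structural layer can be as small as a field-name set and a recursive walk. A sketch — the field names are the ones listed above, and the tokenize callback is a placeholder for whatever your pipeline uses:

```typescript
// Field names tokenized unconditionally, regardless of contents.
const SENSITIVE_FIELDS = new Set([
  'name', 'customer_name', 'display_name', 'full_name',
  'first_name', 'last_name', 'account_holder',
  'address', 'street', 'mailing_address', 'billing_address',
  'notes', 'comment', 'description', 'support_message',
  'customer_message', 'summary',
]);

// Walk the event object in place, replacing whole values of
// sensitive fields and recursing into nested objects.
function applyStructuralRules(
  obj: Record<string, unknown>,
  tokenize: (value: string) => string,
): void {
  for (const [key, value] of Object.entries(obj)) {
    if (SENSITIVE_FIELDS.has(key) && typeof value === 'string') {
      obj[key] = tokenize(value);
    } else if (value && typeof value === 'object' && !Array.isArray(value)) {
      applyStructuralRules(value as Record<string, unknown>, tokenize);
    }
  }
}
```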
A note on order and replacement
The detection regexes return (type, value, index) triples. To turn them into a sanitized payload, you have to replace each match with a placeholder without invalidating the indices of subsequent matches.
The naive approach replaces left-to-right and adjusts every later index by the length delta. This works but is fiddly. The cleaner shape is to sort matches by index and replace right-to-left, which leaves earlier indices untouched. The cleanest shape is to split the original string on the match boundaries and join with placeholders, which sidesteps index arithmetic entirely.
Whichever you pick, the relevant invariant is: deduplicate matches that overlap. The high-entropy pattern can match a substring that also matches the email pattern (a long enough local-part). Pick one — the more specific pattern wins, every time — and discard the other before replacement.
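A sketch of the right-to-left shape with the overlap rule applied first. The match shape is the (type, value, index) triple from above; the names are illustrative:

```typescript
type Match = { type: string; value: string; index: number };

// Overlap rule: sort by position (the generic high_entropy pattern
// loses index ties), then keep only matches that don't overlap an
// already-kept match.
function dedupe(matches: Match[]): Match[] {
  const sorted = [...matches].sort(
    (a, b) =>
      a.index - b.index ||
      (a.type === 'high_entropy' ? 1 : b.type === 'high_entropy' ? -1 : 0),
  );
  const kept: Match[] = [];
  let coveredTo = -1;
  for (const m of sorted) {
    if (m.index < coveredTo) continue; // overlaps a kept match: drop
    kept.push(m);
    coveredTo = m.index + m.value.length;
  }
  return kept;
}

// Replace right-to-left so earlier indices stay valid as the
// string shrinks and grows under replacement.
function sanitize(
  payload: string,
  matches: Match[],
  token: (m: Match) => string,
): string {
  let out = payload;
  for (const m of dedupe(matches).sort((a, b) => b.index - a.index)) {
    out = out.slice(0, m.index) + token(m) + out.slice(m.index + m.value.length);
  }
  return out;
}
```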
The shipping bar
If you're building this for production, the bar to clear before you trust the detector with real traffic:
- A hand-labeled sample of 200+ alerts from your actual pipeline, not synthetic data. Run the detector against it, count false negatives by category. If any category exceeds 5%, fix the regex or add a structural rule before shipping.
- A way to measure false-positive rate continuously in production. A weekly report of "tokens issued per category, normalized to traffic volume" — sudden spikes mean the regex started matching something it didn't before.
- A reveal flow for the engineer who needs to see the original. Without a fast reveal flow, the false-positive rate stops being free — every false positive becomes a page for an on-call engineer who needs to know what <TOKEN_a1b2c3> actually was.
The first regex you ship will not be the last one. Plan the iteration loop in.
The pattern set above is the production set as of the date on this post. If you want to read the full module — including the detector function, the sort-by-index ordering, and the rationale comments — it's at packages/shared/src/pii-detect.ts in the Culprit repo. The companion piece, on the rest of the pipeline (encrypt-at-ingest, per-tenant token dictionary, audited reveal), is How to keep PII out of your alert pipeline.