A VP at an SAP shop told me recently: "Every time we copy production to our lower environments, PII leaks. And no, we're not throwing an LLM at it. That's a thousand times the compute of what we already run."
He's right.
Most of the PII redaction problem in enterprise data isn't a neural network problem. It's a lookup table problem. And the incumbents already solve it. SAP TDMS, Delphix, Informatica, IBM InfoSphere Optim. All schema-aware. All row-level. All deterministic.
The 95% Where Deterministic Wins
In a SAP production database, the schema tells you almost everything. KNA1-NAME1 is a customer name. BSEG-IBAN is a bank account. USR02-BNAME is a user ID. A YAML rule says: "for this column type, replace with this pattern." Done.
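As a minimal sketch of what such a rule layer could look like, the Python below keeps the rules in a plain dict standing in for a parsed YAML file. The rule format, the replacement patterns, and the `mask_row` helper are assumptions made for illustration; they are not the configuration syntax of TDMS or any other product.

```python
# Hypothetical rule table: in practice this would be parsed from a YAML file.
# Keys are TABLE-FIELD names from the schema; patterns are illustrative only.
RULES = {
    "KNA1-NAME1": {"kind": "name", "pattern": "Person_{n}"},
    "BSEG-IBAN": {"kind": "iban", "pattern": "IBAN_{n}"},
    "USR02-BNAME": {"kind": "user_id", "pattern": "USER{n:05d}"},
}


def mask_row(row: dict, rules: dict, lookup: dict) -> dict:
    """Replace values in columns that have a rule; pass everything else through.

    `lookup` is the shared lookup table: the same (kind, value) pair always
    maps to the same replacement, in every row and every table.
    """
    masked = {}
    for column, value in row.items():
        rule = rules.get(column)
        if rule is None:
            masked[column] = value  # no rule for this column: leave untouched
            continue
        key = (rule["kind"], value)
        if key not in lookup:
            # First time this value is seen: assign the next running number.
            lookup[key] = rule["pattern"].format(n=len(lookup) + 1)
        masked[column] = lookup[key]
    return masked
```

Calling `mask_row` on every row with one shared `lookup` dict is the whole job: a dict lookup and a string format per cell, with no model in the loop.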
The math is brutal. A regex plus a lookup table costs microseconds per row. A 1.5B-parameter model costs 10 to 50 milliseconds per row, even on a GPU. That's three to five orders of magnitude. A nightly batch copy that finishes by morning with TDMS would take weeks with an LLM in the loop.
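To make the gap concrete, here is a rough back-of-the-envelope calculation. The row count and the 5 µs figure for the deterministic path are assumptions picked for illustration; the 10 and 50 ms figures are the range quoted above. It assumes a single processing stream unless `workers` is raised.

```python
def batch_hours(rows: int, seconds_per_row: float, workers: int = 1) -> float:
    """Wall-clock hours to process `rows` at a fixed per-row cost."""
    return rows * seconds_per_row / workers / 3600


ROWS = 100_000_000  # hypothetical size of one nightly copy

scenarios = [
    ("regex + lookup, 5 us/row", 5e-6),  # assumed value inside "microseconds"
    ("model, 10 ms/row", 10e-3),
    ("model, 50 ms/row", 50e-3),
]

for label, cost in scenarios:
    hours = batch_hours(ROWS, cost)
    print(f"{label}: {hours:,.1f} h ({hours / 24:,.1f} days)")
```

With these assumed inputs the deterministic pass needs under ten minutes of compute, while the model needs roughly 12 to 58 days on a single stream. Parallel GPUs shrink the wall-clock time, but not the total compute.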
Compute isn't even the main argument.
Referential integrity is. "Anna Müller" has to become "Person_47" consistently across 200 tables. KNA1, VBAK, VBKD, BSEG, wherever the customer ID travels. Deterministic pseudonymization with an HMAC and a scoped salt gives you that for free. Neural outputs drift: the same name can be caught in one row and missed or rewritten differently in the next.
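A minimal sketch of that idea, using only Python's standard `hmac` and `hashlib` modules. The `pseudonym` function name, the normalization steps, the `scope` parameter, and the 12-character truncation are assumptions for illustration, not a description of how any particular masking product does it.

```python
import hashlib
import hmac
import unicodedata


def pseudonym(value: str, tenant_salt: bytes, scope: str = "person") -> str:
    """Map a value to a stable token: same input + salt + scope, same output."""
    # Normalize so spelling variants of the same value hash identically.
    normalized = unicodedata.normalize("NFC", value)
    normalized = " ".join(normalized.split()).casefold()
    # The scope keeps, say, a person name and a vendor name from colliding.
    message = f"{scope}:{normalized}".encode("utf-8")
    digest = hmac.new(tenant_salt, message, hashlib.sha256).hexdigest()
    # Truncated for readability; keep more hex digits if collisions matter.
    return f"Person_{digest[:12]}"
```

Because the token depends only on the value, the salt, and the scope, every table that carries "Anna Müller" gets the same replacement without any coordination between jobs. A short running number like "Person_47" would additionally need a table that maps each digest to a counter.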
Auditability is. A regulator asks: "show me the rule that masked this column." A YAML rule is defensible. A model output is not.
So for any SAP field with a known schema type, deterministic masking wins. Full stop. Don't let anyone sell you a neural-network-powered "modernization" of that layer.
Where a Fine-Tuned Model Earns Its Compute
Here's what TDMS, Delphix, and their peers silently miss.
Free-text columns. BSEG-SGTXT, the item-text field where someone typed "Ansprechpartner Anna Müller, Tel +49-170-...". Ticket descriptions from ServiceNow mirrored into dev. Email bodies stored as CLOBs. ADRC annotations. The column type is "text." The content is a gold mine of PII.
Unstructured attachments. PDFs, scanned invoices, OCR'd contracts pulled into dev via ArchiveLink. Names and IBANs mid-prose, not in a column.
Schema drift. Consultants add Z-tables. The data steward hasn't classified them yet. Deterministic tools don't know the column holds PII. They pass the data through untouched.
On these, rule-based tools do one of two things. They wipe the whole column, destroying test fidelity, so the dev team can't debug against realistic data. Or they miss the PII entirely, and you get a compliance incident.
A German-specialized redactor earns its keep here because the alternative isn't "faster regex." It's "no coverage at all."
The Hybrid Architecture
This is the part that actually ships.
- A classifier pass on the SAP copy. Cheap heuristics (column-name keywords, column type, sample-value regex) flag each column as structured_pii, free_text, or safe.
- Deterministic masker handles structured_pii. TDMS or whatever you already run.
- Fine-tuned LLM redactor runs only on free_text, attachments, and unclassified Z-columns.
- A consistency bridge. Both paths share a pseudonym table keyed by HMAC(value, tenant_salt). "Anna Müller" becomes "Person_47" whether she was caught by regex or by the model.
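The four pieces above can be sketched roughly as follows. Everything here is an assumption for illustration: the keyword lists, the IBAN regex, the `route` labels, the rule that any otherwise unflagged Z-table column counts as unclassified, and the `PseudonymTable` class. A real classifier would be tuned against the actual schema.

```python
import hashlib
import hmac
import re

# Hypothetical keyword hints; tune these against your own schema.
PII_HINTS = ("NAME", "IBAN", "TEL", "MAIL")
FREE_TEXT_HINTS = ("TXT", "TEXT", "NOTE", "DESC")
FREE_TEXT_TYPES = {"CLOB", "TEXT", "STRING"}
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")


def classify_column(name: str, sql_type: str, samples: list) -> str:
    """Cheap heuristics: column-name keywords, column type, sample-value regex."""
    upper = name.upper()
    if sql_type.upper() in FREE_TEXT_TYPES or any(h in upper for h in FREE_TEXT_HINTS):
        return "free_text"
    if any(h in upper for h in PII_HINTS):
        return "structured_pii"
    if any(IBAN_RE.search(str(s)) for s in samples):
        return "structured_pii"
    return "safe"


def route(table: str, column: str, sql_type: str, samples: list) -> str:
    """Decide which path handles a column."""
    label = classify_column(column, sql_type, samples)
    if label == "structured_pii":
        return "deterministic"
    if label == "free_text":
        return "model"
    # Assumption: columns in custom Z-tables that nothing flagged are treated
    # as unclassified and sent to the model instead of passing through.
    if table.upper().startswith("Z"):
        return "model"
    return "passthrough"


class PseudonymTable:
    """Consistency bridge: both paths resolve names through the same table."""

    def __init__(self, tenant_salt: bytes):
        self._salt = tenant_salt
        self._ids = {}

    def lookup(self, value: str) -> str:
        normalized = " ".join(value.split()).casefold().encode("utf-8")
        key = hmac.new(self._salt, normalized, hashlib.sha256).hexdigest()
        if key not in self._ids:
            self._ids[key] = f"Person_{len(self._ids) + 1}"
        return self._ids[key]
```

The deterministic masker would call `lookup` with the raw column value, and the model path would call it with each name span the redactor extracts from free text. Because the table is keyed by the HMAC of the normalized value, both calls land on the same pseudonym.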
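Compute budget: the LLM runs on maybe 1 to 5 percent of the cells, so the model bill is 20 to 100 times smaller than running it over everything. Given the per-row gap, that slice can still cost more compute than the whole deterministic pass, so budget for it explicitly. You're not replacing TDMS. You're covering its blind spots.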
What I Won't Claim
Three things I won't sell you:
- The LLM is cheaper than a regex. It isn't. Ever.
- It replaces your incumbent masking vendor. It doesn't.
- A benchmark against TDMS on structured columns is meaningful. You lose that benchmark. Benchmark on free-text and attachments, where deterministic tools score near zero.
The honest pitch to the VP was this. "You're right. For the 95% structured case, keep TDMS. The model is the long-tail layer. It runs only over the free-text fields and attachments your current tools silently leak. Small job. Different problem."
That's the conversation that lands. Not "replace your stack." Not "AI-powered everything."
Regex for the schema. LLM for the shadows.
I reserve my audits for teams ready to take action on the results.