We Built a Privacy Layer for LLMs. Here’s What We Learned About Redaction vs Anonymization.

Rom C

Redaction removes PII values. Anonymization replaces them with realistic synthetic data. For LLM pipelines, redaction breaks context and doesn't scale. Anonymization preserves semantic meaning, keeps output quality high, and gives you a real compliance posture. Here's how we built the anonymization layer at Questa-AI and what it took.

Why I’m Writing This

Most engineering posts about LLM privacy stop at “don’t send PII to external models.” That’s fine advice. It’s also useless in practice, because real-world data pipelines don’t get to choose whether PII shows up.

Customer support tickets contain names. Financial reports contain account numbers. Medical summaries contain diagnoses. Legal documents contain literally everything. If you’re building LLM-powered workflows for any real business use case, you are going to encounter PII.
The question isn’t whether — it’s what you do with it.
At Questa-AI, we spent months building a production anonymization layer that sits between raw enterprise data and LLM calls. This post is the honest write-up of what we built, what we got wrong the first time, and what actually works.

First, the conceptual difference that actually matters

I want to get precise here because these two words get used interchangeably and the difference is not minor.

Redaction

Redaction removes or masks a sensitive value and leaves a gap or a placeholder token in its place.
Before: "Patient: John Smith, DOB: 12/03/1985, Diagnosis: Type 2 Diabetes"
After:  "Patient: [NAME], DOB: [DATE], Diagnosis: [CONDITION]"
The value is gone. The slot it occupied is still there. And that slot carries information — its position in the sentence, its relationship to surrounding fields, the structure of the record it came from. A sufficiently capable model, or a sufficiently motivated person reviewing logs, can still infer a lot from what’s left.

Anonymization

Anonymization replaces a sensitive value with a realistic synthetic substitute that preserves semantic meaning without preserving identity.
Before: "Patient: John Smith, DOB: 12/03/1985"
After:  "Patient: Alex T., DOB: 04/17/1976"
No gap. No placeholder token. The structure is intact, the context is intact, and the LLM gets everything it needs to process the input correctly. The person whose data you’re handling is no longer identifiable from the output.
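The two transformations can be sketched side by side. This is a minimal illustration, not our production replacement logic — the entity dict shape and the synthetic name pool are made up for the example:

```python
# Minimal sketch: the same detected span, handled two ways.
SYNTHETIC_NAMES = ["Alex T.", "Jordan M.", "Sam K."]  # hypothetical pool

def redact(text, entity):
    """Replace the span with a placeholder token -- leaves a visible gap."""
    return text[:entity["start"]] + f"[{entity['label']}]" + text[entity["end"]:]

def anonymize(text, entity, pool=SYNTHETIC_NAMES):
    """Replace the span with a realistic synthetic value -- structure stays intact."""
    original = text[entity["start"]:entity["end"]]
    substitute = pool[len(original) % len(pool)]  # toy deterministic pick
    return text[:entity["start"]] + substitute + text[entity["end"]:]

record = "Patient: John Smith, DOB: 12/03/1985"
name_span = {"start": 9, "end": 19, "label": "NAME"}

print(redact(record, name_span))     # Patient: [NAME], DOB: 12/03/1985
print(anonymize(record, name_span))  # Patient: Jordan M., DOB: 12/03/1985
```

The redacted output carries an obvious hole; the anonymized output reads as a complete, coherent record.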

This distinction matters for three separate reasons: compliance, model output quality, and operational scalability. We’ll go through each.

Why redaction specifically fails LLM pipelines

Problem 1: context leakage
LLMs are good at reasoning from context. That’s the whole point. But it means they can also infer missing values from context — which means a redacted record is not a neutral record from the model’s perspective. The surrounding text still carries signal.
More practically: if your prompts are logged anywhere (and they usually are, somewhere in the stack), a redacted log doesn’t give you the protection you think it does. The structure of a redacted record is often enough to re-identify the individual once additional data sources are joined against it. This is the re-identification attack regulators have warned about ever since the Netflix Prize dataset was famously de-anonymized.
Problem 2: manual review at scale is a fantasy
I’ve talked to engineering teams processing 3,000–8,000 records per day through LLM pipelines who are relying on a human reviewer to catch PII before the data goes to the model.
That is not a system. That is a lottery. One fatigued reviewer on one busy afternoon is all it takes. And when it fails, there’s no audit trail, no incident log, no way to quantify the exposure.
Problem 3: blank tokens actively hurt model performance
This one doesn’t get enough attention. When you pass a prompt with [REDACTED] tokens to an LLM, the model treats those tokens as meaningful input. They create gaps in the reasoning chain. Depending on the task, this can meaningfully degrade output quality.
We measured this internally. Prompts with anonymized synthetic values consistently outperformed redacted prompts on summarization and classification tasks. The model performs better when it has coherent context, even if that context is synthetic.

How we built the anonymization pipeline

Here’s the actual architecture. I’ll go section by section.
The dual-model approach
Our first attempt used a single NER model. It was fine for catching person names and locations. It missed email addresses, financial identifiers, and anything domain-specific. A single model is not sufficient for production PII detection.
We ended up running two Hugging Face models in parallel:
DistilBERT NER — elastic/distilbert-base-uncased-finetuned-conll03-english. Optimized for standard entities: persons, organizations, locations.
Piiranha — iiiorg/piiranha-v1-detect-personal-information. Specialized for sensitive personal data: emails, IDs, phone numbers, financial data.
Running two models introduces an overlap problem. If Model A says characters 10–15 are a [PER] and Model B says characters 12–20 are an [EMAIL], naïve replacement corrupts the string.
The fix is a merge algorithm that works like this:

```python
def merge_entities(entities_a, entities_b):
    combined = sorted(entities_a + entities_b, key=lambda e: e['start'])
    merged = []
    for entity in combined:
        if merged and entity['start'] < merged[-1]['end']:
            # Conflict: keep the longest span
            if entity['end'] - entity['start'] > merged[-1]['end'] - merged[-1]['start']:
                merged[-1] = entity
        else:
            merged.append(entity)
    return merged
```

Sort by start position. Resolve conflicts by keeping the longest span. No dead space in the output. Clean, non-overlapping tokens.
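To make the conflict resolution concrete, here is the merge logic restated with a worked example. The entity dicts use the start/end shape described above; the specific spans and labels are invented for illustration:

```python
def merge_entities(entities_a, entities_b):
    # Same merge as described above: sort by start, keep the longest span on overlap.
    combined = sorted(entities_a + entities_b, key=lambda e: e['start'])
    merged = []
    for entity in combined:
        if merged and entity['start'] < merged[-1]['end']:
            if entity['end'] - entity['start'] > merged[-1]['end'] - merged[-1]['start']:
                merged[-1] = entity  # longer span wins the conflict
        else:
            merged.append(entity)
    return merged

# Model A found a person name; Model B found a longer, overlapping email span.
ner_hits = [{'start': 10, 'end': 15, 'label': 'PER'}]
pii_hits = [{'start': 12, 'end': 20, 'label': 'EMAIL'}]

print(merge_entities(ner_hits, pii_hits))
# [{'start': 12, 'end': 20, 'label': 'EMAIL'}]
```

The 8-character EMAIL span beats the 5-character PER span, so the replacement step sees exactly one non-overlapping entity.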

Structured data: the CSV/Excel problem

You cannot pass a raw CSV to an NLP model. Seriously. The absence of sentence structure confuses context-aware models that expect prose. A row like:
"Jane Doe","jane@example.com","42 Oak Street","07/14/1988","$84,000"
...has no subject-verb-object structure. Standard NER models will misfire on it constantly.
Our solution combines three things:
1. Multithreaded row processing
NLP inference is CPU-bound. A 10,000-row CSV processed sequentially is too slow for production. We use ThreadPoolExecutor:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

MAX_THREADS = 8
results = [None] * len(df_sample)  # pre-sized so rows land back in order

with ThreadPoolExecutor(max_workers=MAX_THREADS) as executor:
    futures = {
        executor.submit(redact_row, row): idx
        for idx, row in enumerate(df_sample)
    }
    for future in as_completed(futures):
        results[futures[future]] = future.result()
```

2. Smart heuristics per field type
• Numeric fields — preserved automatically. Wiping numbers breaks financial context.
• Email fields — detected via regex on the @ symbol, always anonymized.
• Column headers — if the header is "Full Name" or "Email Address", we force anonymization regardless of model confidence.
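Those heuristics can be composed into a single per-field decision. This is a condensed sketch — the function name, header list, and return labels are illustrative, not our production code:

```python
import re

EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")
PII_HEADERS = {"full name", "email address", "phone", "ssn"}  # illustrative list

def field_action(value, header):
    """Decide per-field handling: 'keep', 'anonymize', or 'model' (defer to NER)."""
    if header.strip().lower() in PII_HEADERS:
        return "anonymize"   # header forces anonymization, ignore model confidence
    if value.replace(",", "").replace("$", "").replace(".", "").isdigit():
        return "keep"        # numeric fields preserved to keep financial context
    if EMAIL_RE.search(value):
        return "anonymize"   # regex-detected email, always anonymized
    return "model"           # everything else goes through the NER models

print(field_action("$84,000", "Salary"))            # keep
print(field_action("jane@example.com", "Contact"))  # anonymize
print(field_action("Jane Doe", "Full Name"))        # anonymize
print(field_action("42 Oak Street", "Address"))     # model
```

Cheap checks run first, so most cells never hit the models at all.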
3. Column-level thresholding
We sample a batch of rows per column and calculate a PII density ratio. If more than 80% of sampled values in a column are PII (a column of home addresses, for instance), we wipe the entire column and replace with the most common detected entity type. This is significantly faster than processing every cell individually and produces cleaner output.
```python
def should_wipe_column(column_values, threshold=0.80):
    sample = column_values[:50]  # sample first 50 rows
    pii_count = sum(1 for v in sample if contains_pii(v))
    return (pii_count / len(sample)) >= threshold
```
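Applied to a column, it looks like this. The real contains_pii check runs the detection models; here a stand-in that just pattern-matches emails keeps the example self-contained:

```python
import re

def contains_pii(value):
    # Stand-in for the model-backed check: flags anything email-shaped.
    return bool(re.search(r"[^@\s]+@[^@\s]+\.[^@\s]+", value))

def should_wipe_column(column_values, threshold=0.80):
    sample = column_values[:50]          # sample first 50 rows
    pii_count = sum(1 for v in sample if contains_pii(v))
    return (pii_count / len(sample)) >= threshold

emails = [f"user{i}@example.com" for i in range(40)] + ["n/a"] * 5
notes  = ["shipped", "pending", "refund issued"] * 15

print(should_wipe_column(emails))  # True  -> wipe the whole column
print(should_wipe_column(notes))   # False -> process cell by cell
```

40 of 45 sampled values (about 89%) clear the 80% threshold, so the email column is wiped wholesale; the free-text column falls through to per-cell processing.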

File reconstruction

The pipeline supports .pdf, .docx, and .xlsx. The challenge isn’t detecting PII — it’s putting the file back together without breaking the format.
• PDF: pdfplumber for extraction → anonymization pipeline → reportlab to regenerate the PDF stream. New file, clean output, intact structure.
• Excel: openpyxl deconstructs to DataFrames → CSV anonymization logic → openpyxl reconstructs with sheet structure preserved.
• DOCX: python-docx handles extraction and reconstruction while preserving styles, formatting, and document layout.
File reconstruction is genuinely the hardest part. The edge cases are endless — merged cells in Excel, mixed-language PDFs, Word documents with embedded tables. We’re still finding and fixing edge cases.

The stack

  • Language / Framework: Python, FastAPI
  • Models: DistilBERT (NER) + Piiranha (PII) via Hugging Face Transformers
  • Inference: PyTorch
  • Concurrency: ThreadPoolExecutor
  • Parsers: pdfplumber, openpyxl, python-docx
  • Output: reportlab (PDF reconstruction)

Full write-up with more implementation detail on the Questa-AI blog:
Under the Hood: Building a Privacy-First Anonymizer for LLMs.

The compliance angle: anonymization vs pseudonymization

Quick but important, because this bites teams.
Pseudonymization = reversible substitution. You replace “John Smith” with a token, keep a mapping table, can reverse it with the key. Under GDPR, this is still personal data. Still regulated. Still subject to all the same obligations.
Anonymization = irreversible transformation. No mapping table. Cannot be reversed. GDPR Recital 26 explicitly excludes properly anonymized data from the regulation’s scope. You gain real flexibility, not just the appearance of it.
If your team is storing a mapping table somewhere that can reverse your “anonymization,” you’re doing pseudonymization. The compliance implications are different. Know which one you’re doing.
HIPAA’s Safe Harbor method works similarly — properly de-identified data (18 specific identifiers removed, no residual re-identification risk) is no longer PHI and falls outside HIPAA’s scope.
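The difference is easy to see in code. A minimal sketch (not our production implementation): pseudonymization keeps a key that reverses the substitution, while anonymization retains nothing to reverse with.

```python
import secrets

# Pseudonymization: reversible -- the mapping table IS the re-identification key.
mapping = {}

def pseudonymize(name):
    token = mapping.setdefault(name, f"TOKEN_{len(mapping):04d}")
    return token

def reverse(token):
    return next(n for n, t in mapping.items() if t == token)

# Anonymization: irreversible -- no table, nothing retained to reverse with.
def anonymize(name, pool=("Alex T.", "Jordan M.", "Sam K.")):
    return secrets.choice(pool)

token = pseudonymize("John Smith")
print(token, "->", reverse(token))  # the original is one lookup away: still personal data
print(anonymize("John Smith"))      # a synthetic name; the original is unrecoverable
```

If that mapping dict exists anywhere in your system — a database, a log, a cache — GDPR treats the output as personal data, full stop.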

Wrapping up

Redaction is a useful tool for certain use cases. For production LLM pipelines processing real enterprise data, it’s not sufficient — it leaks context, doesn’t scale, and degrades model performance.
Anonymization — done with a proper detection and replacement layer — solves all three problems simultaneously. It’s more engineering work upfront, but it’s the only approach that holds up in production at scale.
If you’re building something like this or running into specific edge cases, drop a comment. Happy to talk through implementation details.

Further reading & related discussions
If you want to go deeper on any of this:
📝 I’ve Been Thinking About How We’re Getting AI Privacy Wrong
📄 Stop Redacting. Start Anonymizing. — Medium
💬 Redaction vs Anonymization for AI Prompts — LinkedIn
