Benchmarking LLM Context Awareness Without Sending Raw PII

TL;DR: I measured whether an LLM can still understand relationships and context when raw identifiers never enter the prompt. It turns out simple redaction performs poorly, but with a small tweak it nearly matches full context!

I compared three approaches:

  1. Full Context (Baseline)
  2. Standard Redaction (everything becomes <PERSON>)
  3. Semantic Masking (my own attempt to improve standard redaction: a simple package built on top of spaCy that generates context-aware placeholders with stable IDs, like {Person_A}, to preserve relationships)

The results were surprising: in a stress test for relationship reasoning, standard redaction collapsed to 27% accuracy, while semantic masking achieved 91%, matching the unmasked baseline while keeping direct identifiers local.

Scope note: This is not anonymization. The goal is narrower but practical: keep direct identifiers (names, emails, IDs) local, while giving the model enough structure to reason intelligently.

All source code is linked at the end.


Why this matters (beyond just RAG)

People love using AI interfaces, but we often forget that an LLM is a general-purpose engine, not a secure vault. Whether you are building a chatbot, an agent, or a RAG pipeline, passing raw data carries risks:

  • Prompt logging & tracing
  • Vector DB storage (embedding raw PII)
  • Debugging screenshots
  • "Fallback" calls to external providers

As a developer in the EU, I wanted to explore a mask-first approach: transform data locally, prompt on masked text, and (optionally) rehydrate the response locally.


The Problem: Context Collapse

The issue with standard redaction isn't that the tools are bad—it's that they destroy information the model needs to understand who is doing what.

The "Anna & Emma" Scenario:

Imagine a text: "Anna calls Emma."

  • Standard Redaction: Both names become generic tags.

    • Result: "<PERSON> calls <PERSON>."
    • The Issue: Who called whom? The model has literally zero way to distinguish them. The reasoning collapses.
  • Semantic Masking: We assign placeholders that are consistent within a document/session (and can be ephemeral across sessions for privacy).

    • Result: "{Person_A} calls {Person_B}."
    • The Win: The model knows A and B are different people. It understands the relationship. When the answer comes back ("{Person_A} initiated the call"), we can swap the real name back in locally (a minimal sketch of this flow follows below).
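
To make that concrete, here is a minimal toy sketch of the mask, prompt, rehydrate loop. The function names and placeholder scheme are my own illustration for this post, not the privalyse-mask API:

```python
import re

def mask(text: str, names: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each known name with a stable placeholder; the mapping stays local."""
    mapping = {}
    for i, name in enumerate(names):
        placeholder = f"{{Person_{chr(65 + i)}}}"          # {Person_A}, {Person_B}, ...
        mapping[placeholder] = name
        text = re.sub(rf"\b{re.escape(name)}\b", placeholder, text)
    return text, mapping

def rehydrate(text: str, mapping: dict[str, str]) -> str:
    """Swap placeholders back to real names once the LLM response is back."""
    for placeholder, name in mapping.items():
        text = text.replace(placeholder, name)
    return text

masked, mapping = mask("Anna calls Emma.", ["Anna", "Emma"])
print(masked)  # {Person_A} calls {Person_B}.
# ...send `masked` to the LLM; it answers using the placeholders...
print(rehydrate("{Person_A} initiated the call.", mapping))  # Anna initiated the call.
```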

So I wanted to measure: Exactly how much reasoning do we lose with redaction, and can we fix it by adding some semantics?


Benchmarks

I ran two experiments to test this hypothesis:

1) The "Who is Who" Stress Test (N=11)

A small, synthetic dataset designed to test context-awareness of LLMs using different PII-removal tools. It features:

  • Multiple people interacting in one story.
  • Relational reasoning ("Who is the manager?"), as in the example item shown below.
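
For illustration, an item in this dataset could look like the following (my own example, not copied from the actual benchmark data):

```python
# Hypothetical stress-test item (illustrative only).
example_case = {
    "story": "Anna manages the sales team. Emma reports to Anna and calls her every Monday.",
    "question": "Who is the manager?",
    "expected_answer": "Anna",
}
```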

2) RAG QA Benchmark

A simulation of a retrieval pipeline:

  1. Take a private document.
  2. Mask it.
  3. Ask the LLM questions based only on the masked text (sketched in code below).
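
In rough pseudocode, one benchmark case then looks like this (hypothetical helper names, reusing the toy mask()/rehydrate() from the earlier sketch; the real logic lives in context_research/02_rag_qa_benchmark.py):

```python
def ask_llm(prompt: str) -> str:
    """Stand-in for a GPT-4o-mini call with temperature=0."""
    raise NotImplementedError

def rag_qa_case(document: str, question: str, names: list[str]) -> str:
    masked_doc, mapping = mask(document, names)   # steps 1+2: mask the private document locally
    prompt = (
        "Answer the question using ONLY the context below.\n\n"
        f"Context:\n{masked_doc}\n\n"
        f"Question: {question}"
    )
    answer = ask_llm(prompt)                      # step 3: the model only ever sees masked text
    return rehydrate(answer, mapping)             # optional: restore real names locally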

Setup

  • Model: GPT-4o-mini (temperature=0)
  • Evaluator: GPT-4o-mini used as an LLM judge in a separate evaluation prompt (temperature=0)
  • Metric: Accuracy on relationship extraction questions.

Note on evaluation: Small-N benchmarks are meant to expose failure modes, not claim statistical perfection. They are a "vibe check" for logic.
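
For reference, the judge step can be wired up roughly like this with the OpenAI Python SDK; the exact judging prompt below is my assumption, not copied from the benchmark scripts:

```python
# Rough shape of the LLM-as-judge call (prompt wording is an assumption).
from openai import OpenAI

client = OpenAI()

def judge(question: str, expected: str, model_answer: str) -> bool:
    """Ask GPT-4o-mini (temperature=0) whether the answer matches the expected one."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                f"Question: {question}\n"
                f"Expected answer: {expected}\n"
                f"Model answer: {model_answer}\n"
                "Does the model answer convey the same fact as the expected answer? "
                "Reply with exactly YES or NO."
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```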


Comparing the Approaches

1. Full Context (Baseline)

Sending raw text. (High privacy risk, perfect context).

2. Standard Redaction

Replacing entities with generic tags: <PERSON>, <DATE>, <LOCATION>.

3. Semantic Masking

The approach I'm testing. It does three things differently:

  • Consistency: "Anna" becomes {Person_hxg3}. If "Anna" appears again, she is still {Person_hxg3}.
  • Entity Linking: "Anna Smith" and "Anna" are detected as the same entity and get the same placeholder.
  • Semantic Hints: Dates, for example, aren't just <DATE> but {Date_October_2000}, preserving the timeline without revealing the exact day, which could otherwise help re-identify someone when combined with other details. (A simplified sketch of the ID and linking logic follows below.)
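
A stripped-down sketch of the consistency and entity-linking idea might look like this. Note that the linking rule here (matching on the first name) is a naive stand-in for illustration, not how the real package resolves entities:

```python
import secrets

class EntityMap:
    """One stable placeholder per real-world entity, consistent within a session."""

    def __init__(self):
        self._ids = {}  # canonical key -> placeholder

    def placeholder(self, surface_form: str) -> str:
        # Naive linking rule: "Anna" and "Anna Smith" collapse onto the same key.
        canonical = surface_form.split()[0].lower()
        if canonical not in self._ids:
            self._ids[canonical] = f"{{Person_{secrets.token_hex(2)}}}"  # e.g. {Person_3fa9}
        return self._ids[canonical]

entities = EntityMap()
print(entities.placeholder("Anna Smith"))  # e.g. {Person_3fa9}
print(entities.placeholder("Anna"))        # same placeholder -> same person
```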

The Results

Benchmark 1: Coreference Stress Test (N=11)

| Strategy | Accuracy | Why? |
| --- | --- | --- |
| Full Context | 90.9% (10/11) | Baseline. (One error due to model hallucination.) |
| Standard Redaction | 27.3% (3/11) | Total collapse. The model guessed blindly because everyone was <PERSON>. |
| Semantic Masking | 90.9% (10/11) | Context restored. The model performed exactly as well as with raw data. |

Benchmark 2: RAG QA

| Strategy | Context Retention |
| --- | --- |
| Original (Baseline) | 100% |
| Standard Redaction | ~10% |
| Semantic Masking | 92–100% |

The Takeaway: You don't need real names to reason. You just need structure.


What I Learned

  1. Structure > Content: For most AI tasks, the model doesn't care who someone is. It cares about the graph of relationships. Person A -> Boss of -> Person B.
  2. Entity Linking is Critical: Naive find-and-replace fails on "Anna" vs "Anna Smith". You need logic that links these to the same ID, or the model thinks they are two different people.
  3. Privacy Enablement: This opens up use cases (HR, detailed customer support, legal) where we previously thought "we can't use LLMs because we can't send the data."

Reproducibility vs. Privacy

A quick technical note:

  • In Production: You want ephemeral IDs (random per session). "Anna" is {Person_X} today and {Person_Y} tomorrow, so you can't build a profile across sessions.
  • For Benchmarking: I used a fixed seed to make the runs comparable (see the sketch below).
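
Conceptually, the difference is just whether the ID generator is seeded (an illustrative sketch, not the package internals):

```python
import random

def make_id_generator(seed=None):
    """seed=None -> ephemeral IDs (production); fixed seed -> reproducible IDs (benchmarks)."""
    rng = random.Random(seed)
    def new_id() -> str:
        return f"{{Person_{rng.randrange(16**4):04x}}}"  # e.g. {Person_3fa9}
    return new_id

prod_ids = make_id_generator()          # different placeholders every session
bench_ids = make_id_generator(seed=42)  # identical placeholders across benchmark runs
```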

Resources & Code

If you want to reproduce this or stress test my semantic masking attempt yourself, here are the benchmark scripts and the package:

```bash
python context_research/01_coreference_benchmark.py   # Coref
python context_research/02_rag_qa_benchmark.py        # RAG QA
```

```bash
pip install privalyse-mask
```

Limitations / Threat Model

To be fully transparent:

  • Direct Identifiers are gone: Names, emails, phone numbers are masked locally.
  • Re-identification is possible: If the remaining context is unique enough (e.g., "The CEO of Apple in 2010"), the model might still infer a real person.
  • No Differential Privacy: This is a utility-first approach, not a mathematical guarantee.

This approach is about minimizing data exposure while maximizing model intelligence, not about achieving perfect anonymity.


Discussion

I’d love to hear from others working on privacy-preserving AI:

  • Are there other tools that handle entity linking during masking?
  • Do you know of standard datasets for "privacy-preserving reasoning"?
  • Are there common benchmarks for this kind of context awareness? (I only found ones for long contexts.)

Let's chat in the comments! 👇
