<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Privalyse</title>
    <description>The latest articles on DEV Community by Privalyse (@privalyse).</description>
    <link>https://dev.to/privalyse</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3705407%2F2556c03b-8f99-469d-b231-6e980b159e64.png</url>
      <title>DEV Community: Privalyse</title>
      <link>https://dev.to/privalyse</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/privalyse"/>
    <language>en</language>
    <item>
      <title>Benchmarking LLM Context Awareness Without Sending Raw PII</title>
      <dc:creator>Privalyse</dc:creator>
      <pubDate>Wed, 14 Jan 2026 16:40:26 +0000</pubDate>
      <link>https://dev.to/privalyse/llm-context-awareness-without-sending-raw-pii-f5a</link>
      <guid>https://dev.to/privalyse/llm-context-awareness-without-sending-raw-pii-f5a</guid>
      <description>&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; I measured whether an LLM can still understand relationships and context when &lt;strong&gt;raw identifiers never enter the prompt&lt;/strong&gt;. Turns out - simple redaction is not working well but with a little tweak it nearly matches full context!&lt;/p&gt;

&lt;p&gt;I compared three approaches:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Full Context&lt;/strong&gt; (Baseline)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Standard Redaction&lt;/strong&gt; (everything becomes &lt;code&gt;&amp;lt;PERSON&amp;gt;&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Masking&lt;/strong&gt; (my own attempt to improve standard redaction: a small package built on top of spaCy that generates context-aware placeholders with IDs, such as &lt;code&gt;{Person_A}&lt;/code&gt;, to preserve relationships)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The results were surprising: In a stress test for relationship reasoning, standard redaction collapsed to &lt;strong&gt;27% accuracy&lt;/strong&gt;. Semantic masking achieved &lt;strong&gt;91% accuracy&lt;/strong&gt;—matching the unmasked baseline almost perfectly while keeping direct identifiers local.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scope note:&lt;/strong&gt; This is &lt;strong&gt;not anonymization&lt;/strong&gt;. The goal is narrower but practical: keep &lt;strong&gt;direct identifiers&lt;/strong&gt; (names, emails, IDs) local, while giving the model enough &lt;em&gt;structure&lt;/em&gt; to reason intelligently.&lt;/p&gt;

&lt;p&gt;All source code is linked at the end.&lt;/p&gt;




&lt;h2&gt;Why this matters (beyond just RAG)&lt;/h2&gt;

&lt;p&gt;People love using AI interfaces, but we often forget that an LLM is a general-purpose engine, not a secure vault. Whether you are building a &lt;strong&gt;chatbot, an agent, or a RAG pipeline&lt;/strong&gt;, passing raw data carries risks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt logging &amp;amp; tracing&lt;/li&gt;
&lt;li&gt;Vector DB storage (embedding raw PII)&lt;/li&gt;
&lt;li&gt;Debugging screenshots&lt;/li&gt;
&lt;li&gt;"Fallback" calls to external providers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As a developer in the EU, I wanted to explore a &lt;strong&gt;mask-first approach&lt;/strong&gt;: transform data locally, prompt on masked text, and (optionally) rehydrate the response locally.&lt;/p&gt;
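&lt;p&gt;As a minimal sketch of that flow (the &lt;code&gt;mask&lt;/code&gt; and &lt;code&gt;rehydrate&lt;/code&gt; functions here are illustrative toys, not the actual privalyse-mask API):&lt;/p&gt;

```python
def mask(text, mapping):
    """Replace each known identifier with its placeholder (toy version)."""
    for real, placeholder in mapping.items():
        text = text.replace(real, placeholder)
    return text

def rehydrate(text, mapping):
    """Swap placeholders back to the real values after the LLM responds."""
    for real, placeholder in mapping.items():
        text = text.replace(placeholder, real)
    return text

mapping = {"Anna": "{Person_A}", "Emma": "{Person_B}"}
prompt = mask("Anna calls Emma. Who initiated the call?", mapping)
# prompt: "{Person_A} calls {Person_B}. Who initiated the call?"
# ...the masked prompt goes to the LLM; suppose it answers:
reply = "{Person_A} initiated the call."
print(rehydrate(reply, mapping))  # Anna initiated the call.
```

&lt;p&gt;In practice the mapping comes from an NER pass, and replacement has to handle overlapping names; a plain string replace is just the idea.&lt;/p&gt;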




&lt;h2&gt;The Problem: Context Collapse&lt;/h2&gt;

&lt;p&gt;The issue with standard redaction isn't that the tools are bad—it's that they destroy information the model needs to understand &lt;em&gt;who is doing what&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The "Anna &amp;amp; Emma" Scenario:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Imagine a text: &lt;em&gt;"Anna calls Emma."&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Standard Redaction:&lt;/strong&gt; Both names become generic tags.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Result: &lt;code&gt;"&amp;lt;PERSON&amp;gt; calls &amp;lt;PERSON&amp;gt;."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Issue:&lt;/strong&gt; Who called whom? The model has literally zero way to distinguish them. The reasoning collapses.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Semantic Masking:&lt;/strong&gt; We assign placeholders that are consistent within a document/session (and can be ephemeral across sessions for privacy).&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Result: &lt;code&gt;"{Person_A} calls {Person_B}."&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;The Win:&lt;/strong&gt; The model knows A and B are different people. It understands the relationship. When the answer comes back (&lt;code&gt;"{Person_A} initiated the call"&lt;/code&gt;), we can swap the real name back in locally.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;So I wanted to measure: &lt;em&gt;Exactly how much reasoning do we lose with redaction, and can we fix it by adding some semantics?&lt;/em&gt;&lt;/p&gt;
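&lt;p&gt;A toy sketch of the difference (the functions are illustrative, not any real tool's API; square brackets stand in for the angle-bracket tags):&lt;/p&gt;

```python
def standard_redaction(text, names):
    # Every name collapses to one generic tag, so roles become
    # indistinguishable (square brackets stand in for angle brackets).
    for name in names:
        text = text.replace(name, "[PERSON]")
    return text

def semantic_masking(text, names):
    # Each distinct person gets a distinct, consistent placeholder.
    for i, name in enumerate(names):
        text = text.replace(name, "{Person_" + chr(ord("A") + i) + "}")
    return text

names = ["Anna", "Emma"]
print(standard_redaction("Anna calls Emma.", names))  # [PERSON] calls [PERSON].
print(semantic_masking("Anna calls Emma.", names))    # {Person_A} calls {Person_B}.
```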




&lt;h2&gt;Benchmarks&lt;/h2&gt;

&lt;p&gt;I ran two experiments to test this hypothesis:&lt;/p&gt;

&lt;h3&gt;1) The "Who is Who" Stress Test (N=11)&lt;/h3&gt;

&lt;p&gt;A small, synthetic dataset designed to test how context-aware an LLM stays under different PII-removal strategies. It features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Multiple people&lt;/strong&gt; interacting in one story.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Relational reasoning&lt;/strong&gt; ("Who is the manager?").&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;2) RAG QA Benchmark&lt;/h3&gt;

&lt;p&gt;A simulation of a retrieval pipeline:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Take a private document.&lt;/li&gt;
&lt;li&gt; Mask it.&lt;/li&gt;
&lt;li&gt; Ask the LLM questions based &lt;em&gt;only&lt;/em&gt; on the masked text.&lt;/li&gt;
&lt;/ol&gt;
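&lt;p&gt;The three steps above can be sketched like this (&lt;code&gt;mask_fn&lt;/code&gt; and &lt;code&gt;ask_llm&lt;/code&gt; are stand-ins, not the real benchmark code):&lt;/p&gt;

```python
def run_masked_qa(document, questions, mask_fn, ask_llm):
    """RAG-style loop: the model only ever sees the masked document."""
    masked_doc = mask_fn(document)
    answers = []
    for question in questions:
        prompt = "Context:\n" + masked_doc + "\n\nQuestion: " + question
        answers.append(ask_llm(prompt))
    return answers

# Demo with a stand-in mask function and a dummy "LLM" that echoes its prompt:
mask_fn = lambda t: t.replace("Anna", "{Person_A}").replace("Emma", "{Person_B}")
result = run_masked_qa("Anna manages Emma.", ["Who is the manager?"], mask_fn, lambda p: p)
assert "Anna" not in result[0]  # the raw name never reached the "model"
```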




&lt;h2&gt;Setup&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Model:&lt;/strong&gt; GPT-4o-mini (temperature=0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluator:&lt;/strong&gt; GPT-4o-mini used as an LLM judge in a separate evaluation prompt (temperature=0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Metric:&lt;/strong&gt; Accuracy on relationship extraction questions.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on evaluation:&lt;/strong&gt; Small-N benchmarks are meant to expose failure modes, not claim statistical perfection. They are a "vibe check" for logic.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;Comparing the Approaches&lt;/h2&gt;

&lt;h3&gt;1. Full Context (Baseline)&lt;/h3&gt;

&lt;p&gt;Sending raw text. (High privacy risk, perfect context).&lt;/p&gt;

&lt;h3&gt;2. Standard Redaction&lt;/h3&gt;

&lt;p&gt;Replacing entities with generic tags: &lt;code&gt;&amp;lt;PERSON&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;DATE&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;LOCATION&amp;gt;&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;3. Semantic Masking&lt;/h3&gt;

&lt;p&gt;The approach I'm testing. It does three things differently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Consistency:&lt;/strong&gt; "Anna" becomes &lt;code&gt;{Person_hxg3}&lt;/code&gt;. If "Anna" appears again, she is still &lt;code&gt;{Person_hxg3}&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Entity Linking:&lt;/strong&gt; "Anna Smith" and "Anna" are detected as the same entity and get the &lt;em&gt;same&lt;/em&gt; placeholder.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Hints:&lt;/strong&gt; Dates, for example, aren't just &lt;code&gt;&amp;lt;DATE&amp;gt;&lt;/code&gt; but &lt;code&gt;{Date_October_2000}&lt;/code&gt;, preserving the timeline while withholding the exact day, which could otherwise help re-identify a real person when combined with other details.&lt;/li&gt;
&lt;/ul&gt;
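&lt;p&gt;A rough sketch of how consistency, entity linking, and semantic hints can fit together (a simplified illustration, not the library's actual implementation):&lt;/p&gt;

```python
def link_entities(names):
    """Toy entity linking: a bare first name ('Anna') links to a longer
    name that contains it as a word ('Anna Smith')."""
    canonical = {}
    for name in names:
        canonical[name] = name
        for other in names:
            if other != name and name in other.split():
                canonical[name] = other
    return canonical

def semantic_placeholders(names):
    """Assign one consistent placeholder per linked entity."""
    canonical = link_entities(names)
    ids, placeholders = {}, {}
    for name in names:
        c = canonical[name]
        if c not in ids:
            ids[c] = chr(ord("A") + len(ids))
        placeholders[name] = "{Person_" + ids[c] + "}"
    return placeholders

def date_hint(month_name, year):
    # Semantic hint: keep month and year for timeline reasoning, drop the day.
    return "{Date_" + month_name + "_" + str(year) + "}"

print(semantic_placeholders(["Anna Smith", "Anna", "Emma"]))
# {'Anna Smith': '{Person_A}', 'Anna': '{Person_A}', 'Emma': '{Person_B}'}
```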




&lt;h2&gt;The Results&lt;/h2&gt;

&lt;h3&gt;Benchmark 1: Coreference Stress Test (N=11)&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Why?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Full Context&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;90.9%&lt;/strong&gt; (10/11)&lt;/td&gt;
&lt;td&gt;Baseline. (One error due to model hallucination).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Standard Redaction&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;27.3%&lt;/strong&gt; (3/11)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Total collapse.&lt;/strong&gt; The model guessed blindly because everyone was &lt;code&gt;&amp;lt;PERSON&amp;gt;&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Semantic Masking&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;90.9%&lt;/strong&gt; (10/11)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Context restored.&lt;/strong&gt; The model performed exactly as well as with raw data.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;Benchmark 2: RAG QA&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Strategy&lt;/th&gt;
&lt;th&gt;Context Retention&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Original (Baseline)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Standard Redaction&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~10%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic Masking&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;92–100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Takeaway:&lt;/strong&gt; You don't need real names to reason. You just need &lt;em&gt;structure&lt;/em&gt;.&lt;/p&gt;




&lt;h2&gt;What I Learned&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;Structure &amp;gt; Content:&lt;/strong&gt; For most AI tasks, the model doesn't care &lt;em&gt;who&lt;/em&gt; someone is. It cares about the &lt;em&gt;graph of relationships&lt;/em&gt;. Person A -&amp;gt; Boss of -&amp;gt; Person B.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Entity Linking is Critical:&lt;/strong&gt; Naive find-and-replace fails on "Anna" vs "Anna Smith". You need logic that links these to the same ID, or the model thinks they are two different people.&lt;/li&gt;
&lt;li&gt; &lt;strong&gt;Privacy Enablement:&lt;/strong&gt; This opens up use cases (HR, detailed customer support, legal) where we previously thought "we can't use LLMs because we can't send the data."&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;Reproducibility vs. Privacy&lt;/h2&gt;

&lt;p&gt;A quick technical note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;In Production:&lt;/strong&gt; You want &lt;strong&gt;ephemeral IDs&lt;/strong&gt; (random per session). "Anna" is &lt;code&gt;{Person_X}&lt;/code&gt; today and &lt;code&gt;{Person_Y}&lt;/code&gt; tomorrow, so you can't build a profile across sessions.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;For Benchmarking:&lt;/strong&gt; I used a fixed seed to make the runs comparable.&lt;/li&gt;
&lt;/ul&gt;
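&lt;p&gt;A minimal sketch of the difference (illustrative only; the real library may generate its IDs differently):&lt;/p&gt;

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def person_placeholder(rng):
    # 4-char suffix in the style of the {Person_hxg3} example above
    suffix = "".join(rng.choice(ALPHABET) for _ in range(4))
    return "{Person_" + suffix + "}"

# Production: ephemeral IDs, fresh randomness per session
session_id = person_placeholder(random.Random())

# Benchmarking: a fixed seed makes runs comparable
run1 = person_placeholder(random.Random(42))
run2 = person_placeholder(random.Random(42))
assert run1 == run2  # identical placeholders on every run
```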




&lt;h2&gt;Resources &amp;amp; Code&lt;/h2&gt;

&lt;p&gt;If you want to reproduce this or stress-test my semantic masking approach yourself, check out the code:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Benchmark Code:&lt;/strong&gt; &lt;a href="https://github.com/Privalyse/privalyse-research" rel="noopener noreferrer"&gt;Privalyse/privalyse-research&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python context_research/01_coreference_benchmark.py   &lt;span class="c"&gt;# Coref&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python context_research/02_rag_qa_benchmark.py        &lt;span class="c"&gt;# RAG QA&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Semantic Masking Library:&lt;/strong&gt; &lt;a href="https://github.com/Privalyse/privalyse-mask" rel="noopener noreferrer"&gt;Privalyse/privalyse-mask&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;privalyse-mask
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;Limitations / Threat Model&lt;/h2&gt;

&lt;p&gt;To be fully transparent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  ✅ &lt;strong&gt;Direct Identifiers are gone:&lt;/strong&gt; Names, emails, phone numbers are masked locally.&lt;/li&gt;
&lt;li&gt;  ❌ &lt;strong&gt;Re-identification is possible:&lt;/strong&gt; If the remaining context is unique enough (e.g., "The CEO of Apple in 2010"), the model can still infer the real person.&lt;/li&gt;
&lt;li&gt;  ❌ &lt;strong&gt;No Differential Privacy:&lt;/strong&gt; This is a utility-first approach, not a mathematical guarantee.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach is about &lt;strong&gt;minimizing data exposure&lt;/strong&gt; while &lt;strong&gt;maximizing model intelligence&lt;/strong&gt;, not about achieving perfect anonymity.&lt;/p&gt;




&lt;h2&gt;Discussion&lt;/h2&gt;

&lt;p&gt;I’d love to hear from others working on privacy-preserving AI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Are there other tools that handle &lt;strong&gt;entity linking&lt;/strong&gt; during masking?&lt;/li&gt;
&lt;li&gt;  Do you know of standard datasets for "privacy-preserving reasoning"?&lt;/li&gt;
&lt;li&gt;  Are there common benchmarks for this kind of context awareness? (I only found ones for long contexts.)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's chat in the comments! 👇&lt;/p&gt;

</description>
      <category>privacy</category>
      <category>rag</category>
      <category>llm</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
