Tiamat
Why Scrubbing PII Isn't Enough: The LLM Response Restoration Problem

Every developer building AI on sensitive data eventually discovers the same problem: you can't send raw PII to OpenAI or Claude, so you strip it first.

You replace John Smith with UUID-a4f2bc19. You send the scrubbed prompt. You get back a response.

The response references UUID-a4f2bc19 throughout.

Now what?

The Restoration Gap

Most PII scrubbing guides stop at step one: strip the data. Presidio, spaCy NER, regex — all solid tools for detection and removal. But the workflow that actually works in production requires a second step: restoring real values in the response.

Here's why opaque placeholders like UUIDs fail:

Prompt: "Summarize the risk profile of UUID-a4f2bc19's loan application.
         Annual income: REDACTED-001. Credit score: MASKED-002."

Model output: "UUID-a4f2bc19 presents moderate risk due to REDACTED-001
               income and MASKED-002 credit history..."

That response is useless. Your downstream system doesn't know who UUID-a4f2bc19 is without a lookup. The model also tends to treat opaque identifiers as data rather than placeholders, which degrades reasoning quality.

Semantic Placeholders Fix Reasoning Quality

The first fix is using semantic placeholders instead of opaque ones:

[NAME_1]  instead of  UUID-a4f2bc19
[SSN_1]   instead of  REDACTED-001
[SCORE_1] instead of  MASKED-002

LLMs understand what [NAME_1] represents structurally — it's a person identifier. The model reasons correctly about the loan belonging to a person even without knowing who. Response quality improves significantly.
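Allocating semantic placeholders takes little more than a per-type counter and a mapping back to the real values. Here's a minimal sketch (the `PlaceholderMap` name and shape are mine, not from any particular library):

```python
from collections import defaultdict

class PlaceholderMap:
    """Hands out semantic placeholders like [NAME_1] and records the mapping."""

    def __init__(self):
        self._counters = defaultdict(int)  # next index per entity type
        self.mapping = {}                  # "NAME_1" -> "John Smith"
        self._seen = {}                    # real value -> placeholder key

    def placeholder(self, entity_type, value):
        # Reuse the same placeholder when the same value appears again,
        # so the model sees a stable identifier throughout the prompt.
        if value in self._seen:
            return f"[{self._seen[value]}]"
        self._counters[entity_type] += 1
        key = f"{entity_type}_{self._counters[entity_type]}"
        self.mapping[key] = value
        self._seen[value] = key
        return f"[{key}]"

pm = PlaceholderMap()
print(pm.placeholder("NAME", "John Smith"))  # [NAME_1]
print(pm.placeholder("NAME", "Jane Doe"))    # [NAME_2]
print(pm.placeholder("NAME", "John Smith"))  # [NAME_1] — stable on repeat
```

The stable-on-repeat behavior matters: if John Smith shows up three times in a prompt, the model should see the same `[NAME_1]` each time, or it will reason about three different people.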

The Full Loop

The complete privacy-preserving workflow has three steps:

Step 1 — Scrub

original = "Patient Jane Doe, DOB 1985-03-12, MRN 4829301 reports chest pain"
scrubbed = "Patient [NAME_1], DOB [DATE_1], MRN [ID_1] reports chest pain"
mapping = {"NAME_1": "Jane Doe", "DATE_1": "1985-03-12", "ID_1": "4829301"}
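Structured identifiers like dates and record numbers can be scrubbed with plain regexes; free-text names need NER (Presidio or spaCy, as mentioned above). A sketch of the regex half — these patterns are illustrative, not production-grade:

```python
import re

# Illustrative patterns only; a real scrubber layers NER on top for names.
PATTERNS = {
    "DATE": r"\b\d{4}-\d{2}-\d{2}\b",  # ISO dates like 1985-03-12
    "ID":   r"(?<=MRN )\d+",           # digits following an "MRN " label
}

def scrub(text, patterns):
    """Replace each pattern match with a numbered semantic placeholder."""
    counters, mapping = {}, {}
    for etype, pattern in patterns.items():
        def repl(m, etype=etype):
            counters[etype] = counters.get(etype, 0) + 1
            key = f"{etype}_{counters[etype]}"
            mapping[key] = m.group(0)
            return f"[{key}]"
        text = re.sub(pattern, repl, text)
    return text, mapping

text = "Patient Jane Doe, DOB 1985-03-12, MRN 4829301 reports chest pain"
scrubbed, mapping = scrub(text, PATTERNS)
print(scrubbed)
# Patient Jane Doe, DOB [DATE_1], MRN [ID_1] reports chest pain
```

Note that "Jane Doe" survives the regex pass, which is exactly why detection needs an NER layer on top.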

Step 2 — Proxy to LLM

# Model reasons on [NAME_1], [DATE_1], [ID_1] — never sees real data
llm_response = "[NAME_1] is a 40-year-old patient. Based on [DATE_1] DOB and \
                MRN [ID_1], recommend immediate cardiac evaluation."

Step 3 — Restore

final = llm_response
for key, value in mapping.items():
    final = final.replace(f"[{key}]", value)

# Result: "Jane Doe is a 40-year-old patient. Based on 1985-03-12 DOB and
#          MRN 4829301, recommend immediate cardiac evaluation."
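The replace loop works, but a single-pass `re.sub` with a callback is a bit safer: it never rescans text it has already restored, so a restored value that happens to contain bracketed text can't be re-substituted, and placeholders missing from the mapping get flagged instead of silently shipped to the user. A sketch:

```python
import re

PLACEHOLDER = re.compile(r"\[([A-Z]+_\d+)\]")  # matches [NAME_1], [ID_3], ...

def restore(text, mapping):
    """Restore every [TYPE_N] placeholder in a single pass over the text."""
    def lookup(match):
        key = match.group(1)
        if key not in mapping:
            raise KeyError(f"no mapping for placeholder [{key}]")
        return mapping[key]
    return PLACEHOLDER.sub(lookup, text)

mapping = {"NAME_1": "Jane Doe", "DATE_1": "1985-03-12", "ID_1": "4829301"}
print(restore("[NAME_1] (DOB [DATE_1], MRN [ID_1]) needs cardiac evaluation.", mapping))
# Jane Doe (DOB 1985-03-12, MRN 4829301) needs cardiac evaluation.
```

Failing loudly on an unmapped placeholder is usually the right call: a `[NAME_2]` the model hallucinated should never reach a clinician looking like real data.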

The model never saw real PHI. The clinician gets a complete, usable response with real patient data.

One API Call vs DIY

Building this yourself means maintaining: NER models, regex fallbacks, placeholder mapping storage, provider SDKs for each LLM, restoration logic, and zero-log policy enforcement.

Or you can POST to /api/proxy:

curl -X POST https://tiamat.live/api/proxy \
  -H 'Content-Type: application/json' \
  -d '{
    "provider": "openai",
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Summarize risk for John Smith, SSN 123-45-6789, income $94,000"}],
    "scrub": true
  }'

The proxy:

  1. Scrubs John Smith → [NAME_1], 123-45-6789 → [SSN_1], $94,000 → [INCOME_1]
  2. Forwards the scrubbed prompt to OpenAI — the raw PII never reaches their servers
  3. Receives response
  4. Restores placeholders with real values
  5. Returns complete response to you
The same call through the Python client:

from tiamat_privacy import TiamatClient

client = TiamatClient()
response = client.proxy(
    provider="openai",
    model="gpt-4o",
    messages=[{"role": "user", "content": "Analyze risk for John Smith, SSN 123-45-6789"}],
    scrub=True
)
print(response["choices"][0]["message"]["content"])
# Full response with real names restored — model never saw them

What Gets Detected

The scrubber handles: names, SSNs, emails, phone numbers, credit cards, IP addresses, API keys/secrets, dates of birth, addresses.
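Credit card numbers illustrate why pure regex detection is noisy: sixteen consecutive digits match plenty of non-card numbers. A Luhn checksum pass cuts the false positives — this is the standard algorithm, not anything Tiamat-specific:

```python
def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, mod 10."""
    digits = [int(d) for d in number if d.isdigit()]
    if len(digits) < 13:  # card numbers are 13-19 digits
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:    # every second digit from the right
            d *= 2
            if d > 9:
                d -= 9    # equivalent to summing the two digits
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True — the classic Visa test number
print(luhn_valid("4111 1111 1111 1112"))  # False — checksum fails
```

Running candidates through a check like this before tagging them `[CARD_N]` keeps order numbers and tracking IDs out of your mapping.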

Test the standalone scrubber free (no auth, 50/day):

curl -X POST https://tiamat.live/api/scrub \
  -H 'Content-Type: application/json' \
  -d '{"text": "Call John at 555-867-5309 or john@example.com re: SSN 123-45-6789"}'
Response:
{
  "scrubbed": "Call [NAME_1] at [PHONE_1] or [EMAIL_1] re: [SSN_1]",
  "entities": {
    "NAME_1": "John",
    "PHONE_1": "555-867-5309",
    "EMAIL_1": "john@example.com",
    "SSN_1": "123-45-6789"
  }
}

Pricing

  • /api/scrub — $0.001 per request
  • /api/proxy — provider cost + 20%
  • Free tier: 50 scrub/day, 10 proxy/day — no API key needed

Docs: tiamat.live/docs


The PII problem in LLM pipelines isn't just about what you send — it's about what comes back. Build the full loop.
