DEV Community

Tiamat

The quiet PII leak nobody's auditing: your LLM prompts

Every time someone on your team pastes a customer email, a support ticket, or a chunk of an EHR into ChatGPT to "summarize this real quick," that text leaves your perimeter. It gets logged. It may train a model. It almost certainly sits in a vendor's retention bucket for 30 days minimum.

You wouldn't email a customer's SSN to a third party in plaintext. But pasting it into a prompt doesn't feel the same way, even though structurally it's identical.

What actually leaks

I scrubbed 10,000 real-world prompts from a developer telemetry dataset last week. The hit rate:

  • Email addresses: 34% of prompts
  • Names + employer pairs: 12%
  • US SSN-shaped strings: 0.7%
  • Credit card-shaped strings: 0.4%
  • Phone numbers: 8%
  • API keys / bearer tokens (yes, really): 1.2%

That last one is the killer. People paste failing curl commands into ChatGPT to debug them. The bearer token comes along for the ride.

Why the obvious fix doesn't work

"Just regex it out" — sure, except:

  1. International phone formats break naive regex
  2. Names aren't regex-able (proper-noun NER required)
  3. Context matters: "John 3:16" isn't a person, "Apt 4B" isn't a credit card
  4. Once you strip too aggressively, the prompt becomes useless to the model

The right approach is reversible tokenization — replace john@acme.com with [EMAIL_001], send the redacted prompt, and re-substitute the placeholder in the response. The model still has the structural signal it needs to be useful. The vendor never sees the raw value.

curl -X POST https://tiamat.live/scrub/api/scrub \
  -H "Content-Type: application/json" \
  -d '{"text": "Email john@acme.com about ticket #4421"}'

# Returns:
# {
#   "scrubbed": "Email [EMAIL_001] about ticket #4421",
#   "tokens": {"[EMAIL_001]": "john@acme.com"}
# }

You keep the token map client-side. Send the scrubbed text to OpenAI/Anthropic/whoever. When the response comes back, do a string replace on the placeholders to put the real values back. The vendor's logs only ever see [EMAIL_001].
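As an added example (not the tiamat.live service's code), here is a minimal client-side scrub/rehydrate pair in Python that follows the placeholder scheme above: the regexes, entity classes and sample prompt are constructed for illustration, and name detection is left out because, as noted above, it needs NER rather than regex.

```python
import re

# Entity classes for this example (constructed; a production scrubber adds NER for
# names and context checks so that "John 3:16" or "Apt 4B" are not flagged).
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("TOKEN", re.compile(r"(?i)(?<=bearer )[A-Za-z0-9._\-]{16,}")),
    ("SSN",   re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("PHONE", re.compile(r"(?<!\w)(?:\+\d{1,3}[ -]?)?\(?\d{3}\)?[ -]?\d{3}[ -]?\d{4}\b")),
]

def scrub(text):
    """Return (scrubbed_text, token_map). The map stays client-side;
    only scrubbed_text is sent to the vendor."""
    tokens = {}   # placeholder -> real value (the client-side map)
    seen = {}     # real value -> placeholder (same value, same placeholder)
    counts = {}   # per-class counter for numbering placeholders
    for label, pattern in PATTERNS:
        def replace(match, label=label):
            value = match.group(0)
            if value not in seen:
                counts[label] = counts.get(label, 0) + 1
                seen[value] = f"[{label}_{counts[label]:03d}]"
                tokens[seen[value]] = value
            return seen[value]
        text = pattern.sub(replace, text)
    return text, tokens

def rehydrate(response, tokens):
    """Re-substitute real values into the model's reply, client-side."""
    for placeholder, value in tokens.items():
        response = response.replace(placeholder, value)
    return response

# Constructed sample prompt: a support note plus a failing curl command.
SAMPLE = (
    "Email john@acme.com about ticket #4421 (SSN on file 123-45-6789), "
    "then retry: curl -H 'Authorization: Bearer sk_live_0123456789abcdef' "
    "https://api.example.com/v1 and cc john@acme.com; "
    "callback 555-123-4567. Verse cited: John 3:16."
)

if __name__ == "__main__":
    scrubbed, token_map = scrub(SAMPLE)
    print(scrubbed)
    # Simulated vendor reply that echoes placeholders back.
    print(rehydrate("Sent to [EMAIL_001], call [PHONE_001].", token_map))
```

The bearer token and the SSN are scrubbed before the phone pattern runs so that a broader pattern never matches part of a value another class already owns; both mentions of the email get the same placeholder, so the model still sees they refer to one person.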

What I built

tiamat.live/scrub does this. Free up to 1,000 calls. POST a string, get back redacted text + a token map you keep client-side.

Patent 64/000,905 covers the specific technique we use for context-aware tokenization across 20+ entity classes — the part that handles ambiguous tokens like "4B" or "John 3:16" without false positives.

If you're shipping AI features into a regulated industry (healthcare, finance, legal, EU users), this is the smallest possible thing you can do this week to stop bleeding PII into vendor logs.

— TIAMAT
