Why this matters
LLMs like ChatGPT are now used across industries like healthcare, media, and transportation. But with great power comes great risk: these models can accidentally leak sensitive information (PII), or be tricked by “divergence attacks” into exposing private details such as email addresses and credit card numbers.
What happens with off-the-shelf models
If you’re using a public LLM (e.g., ChatGPT), you don’t have much control, so the safest bet is to never send sensitive info to the model in the first place.
If you host your own LLM: two main scenarios
1. Fine-tuning with org data
You train a generic LLM on internal data. If that data isn't anonymized, the model might regurgitate PII.
Fix: Mask or anonymize all PII before training. Example: run a pipeline that strips names, emails, SSNs, etc., before dumping data into your data lake (a rough sketch follows below).
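To make that concrete, here’s a minimal sketch of such a pipeline using Microsoft Presidio (one of the tools mentioned later for response filtering). The entity list and placeholder format are just illustrative choices, not the only way to do it:

```python
# Minimal sketch: scrub PII from a record before it lands in the training
# corpus / data lake. Assumes presidio-analyzer and presidio-anonymizer
# are installed (pip install presidio-analyzer presidio-anonymizer).
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

analyzer = AnalyzerEngine()
anonymizer = AnonymizerEngine()

def scrub_record(text: str) -> str:
    """Detect PII entities and replace them with placeholders."""
    results = analyzer.analyze(
        text=text,
        entities=["PERSON", "EMAIL_ADDRESS", "PHONE_NUMBER", "US_SSN", "CREDIT_CARD"],
        language="en",
    )
    # Default anonymization replaces each span with its entity type, e.g. <PERSON>.
    return anonymizer.anonymize(text=text, analyzer_results=results).text

print(scrub_record("Contact Jane Doe at jane.doe@example.com, SSN 078-05-1120."))
```

You’d run something like `scrub_record` over every document as part of your ingestion job, so raw PII never reaches the fine-tuning dataset.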
2. RAG-based models
You build a chatbot that retrieves documents (like HR files) using Retrieval-Augmented Generation.
Two risks:
- Sensitive info hiding in your docs.
- The LLM might spit out PII when answering a query.
Fixes:
- Anonymize before ingestion (same as above).
- Apply filters on responses, e.g. using AWS Comprehend, Microsoft Presidio, or Dapr’s conversation API, to strip emails, phone numbers, SSNs, etc. (a rough sketch follows this list).
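For the response-filtering side, here’s a rough sketch using AWS Comprehend’s PII detection. The confidence threshold and redaction format are assumptions you’d tune for your own data; it assumes boto3 is installed and AWS credentials are configured:

```python
# Sketch: redact PII from an LLM answer before returning it to the user.
import boto3

comprehend = boto3.client("comprehend")

def redact_pii(answer: str, threshold: float = 0.8) -> str:
    """Replace PII spans detected by Comprehend with [TYPE] placeholders."""
    entities = comprehend.detect_pii_entities(Text=answer, LanguageCode="en")["Entities"]
    # Redact from the end of the string so earlier offsets stay valid.
    for e in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        if e["Score"] >= threshold:
            answer = answer[:e["BeginOffset"]] + f"[{e['Type']}]" + answer[e["EndOffset"]:]
    return answer
```

You’d call `redact_pii` on the generated answer as the last step of your RAG pipeline, right before it goes back to the user.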
Example: Using Dapr to filter PII
Here’s a rough how-to using Dapr (on .NET/Windows):
- Install the Dapr CLI
- Run `dapr init`
- Add a `conversation.yml` component config that points to your OpenAI key and desired model (a sketch is shown after this list)
- Run your app via `dapr run` on ports 3500/4500
- In your code, use the `Dapr.AI` package to hook into the conversation API

Dapr will then filter out PII such as emails, IP addresses, credit card numbers, and SSNs before sending or returning data.
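To make the `conversation.yml` step concrete, the component could look roughly like this (the component name, file name, and model are illustrative; check the Dapr docs for the exact metadata your version supports):

```yaml
# conversation.yml (sketch): a Dapr conversation component backed by OpenAI
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: openai
spec:
  type: conversation.openai
  version: v1
  metadata:
  - name: key
    value: "<YOUR_OPENAI_API_KEY>"
  - name: model
    value: gpt-4o-mini
```

Whatever the `Dapr.AI` package does under the hood, the sidecar also exposes a plain HTTP endpoint you can call from any language. The conversation API is still alpha at the time of writing, so the route and field names below are assumptions that may change; treat this as a sketch, not a reference:

```python
# Sketch: calling the Dapr sidecar's conversation API with PII scrubbing.
import requests

DAPR_HTTP_PORT = 3500  # the sidecar port used with `dapr run` above
url = f"http://localhost:{DAPR_HTTP_PORT}/v1.0-alpha1/conversation/openai/converse"

payload = {
    # scrubPII asks Dapr to mask PII in the prompt and in the response
    "inputs": [{"content": "Summarize the HR file for jane.doe@example.com", "scrubPII": True}],
    "scrubPII": True,
}

resp = requests.post(url, json=payload)
print(resp.json())  # e.g. {"outputs": [{"result": "..."}]}
```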
✨ Bottom line
- With public LLMs: simply don’t feed them sensitive data
- With self-hosted/fine-tuned setups: mask PII before training or ingestion
- With RAG: combine ingestion filtering with response filtering (via tools like Dapr)