Every major AI provider has an opt-out mechanism for training data. Most of them don't work the way you think.
Here's what "zero data retention" actually means — and what it doesn't.
The Promise vs. The Reality
The marketing copy sounds clean:
- "We don't train on your data"
- "Zero data retention available"
- "Your conversations are private"
The reality is buried in terms of service, data processing addenda, and the technical architecture of AI safety monitoring.
What Actually Happens to Your Prompts
OpenAI
By default, OpenAI's API stores your prompts on its servers for up to 30 days for "abuse monitoring." Even with zero data retention enabled (granted on request to eligible API customers), your data still passes through:
- Content filtering systems — Your prompt is analyzed for policy violations before the model sees it
- Safety monitoring — Flagged content can be retained indefinitely for investigation
- Metadata logging — Even with content opt-out, your API key, timestamp, model, and token counts are logged
The zero-retention option means OpenAI won't train on your data. It doesn't mean they don't see it, log metadata about it, or retain it for safety purposes.
Anthropic
Anthropic is more transparent than most. Their API terms state that prompts may be reviewed by human reviewers for safety reasons — there's no way to opt out of this if content triggers their safety systems. The Claude.ai consumer product does use conversations for training unless you opt out in settings.
For API customers: Anthropic has a commercial data processing agreement that prevents training use, but safety review still applies.
Google Gemini
Google's data handling is the most complex. Gemini API conversations are governed by Google Cloud's data processing terms, which are different from Google's consumer privacy policy. Human reviewers at Google can access API conversations flagged by automated systems.
Groq
Groq's privacy policy states they retain logs for 7 days by default, with no option to disable this via the API.
The "Zero Data Retention" Compliance Gap
Here's what most compliance teams miss:
Zero data retention ≠ zero data access
Every AI provider, regardless of retention policy, can access your prompts in transit. That's not a policy choice; it's a technical requirement. The provider has to see the content to filter it, respond to it, and monitor it for safety.
Under GDPR, Article 28 requires a Data Processing Agreement with any third party that processes personal data. Under HIPAA, Protected Health Information cannot be transmitted to a third-party AI provider without a Business Associate Agreement.
The Attack Surface Nobody's Accounting For
Even if you trust the provider, there are additional parties:
CDN layer: Many AI APIs sit behind a CDN such as Cloudflare. Your prompt passes through CDN edge servers, potentially in multiple jurisdictions, before reaching the AI provider.
Safety vendors: Some providers use third-party content moderation APIs. Your prompt may hit multiple APIs before the main model.
Model providers: If you're using an API wrapper (OpenRouter, Portkey, etc.), your prompt goes through multiple hops.
What Actually Works
Scrub Before You Send
```python
# Instead of this: raw user input, PII and all
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_input_with_pii}],
)

# Do this: replace PII with placeholders before the request leaves your server
scrubbed, entity_map = pii_scrubber.scrub(user_input_with_pii)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": scrubbed}],  # no PII
)
```
The AI gets "My SSN is [SSN_1]" instead of "My SSN is 123-45-6789". It can still answer your question. The sensitive data never left your server.
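The `pii_scrubber` above is a hypothetical helper, not a real library. A minimal regex-based sketch of what it could look like (production systems should use a dedicated PII-detection tool such as Microsoft Presidio, and cover far more entity types than the two patterns shown here):

```python
import re

# Illustrative patterns only: SSNs and email addresses.
PATTERNS = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scrub(text: str) -> tuple[str, dict[str, str]]:
    """Replace PII with numbered placeholders; return the scrubbed text
    plus a map so the original values can be restored locally."""
    entity_map: dict[str, str] = {}
    counters: dict[str, int] = {}
    for label, pattern in PATTERNS.items():
        def repl(match, label=label):
            counters[label] = counters.get(label, 0) + 1
            placeholder = f"[{label}_{counters[label]}]"
            entity_map[placeholder] = match.group(0)
            return placeholder
        text = pattern.sub(repl, text)
    return text, entity_map

def restore(text: str, entity_map: dict[str, str]) -> str:
    """Re-insert the original values into the model's response."""
    for placeholder, value in entity_map.items():
        text = text.replace(placeholder, value)
    return text
```

Because `entity_map` never leaves your server, you can run `restore()` on the model's response and hand the user an answer containing their real values.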
Use a Privacy Proxy
A privacy proxy sits between your application and the AI provider:
- Your app sends prompts to the proxy
- The proxy scrubs PII, strips identifying headers, rotates API keys
- The proxy forwards the clean request to the AI provider
- The AI provider sees: clean prompt, proxy IP, shared API pool key
- Response comes back through the proxy to your app
From the AI provider's perspective, they can't link requests to your organization, your users, or your IP addresses.
Self-Host (Nuclear Option)
If you can't accept any third-party data access:
- Llama 3.3 70B — comparable quality to GPT-4o for most tasks
- Mistral 7B — fast, good for classification and extraction
- Phi-4 — Microsoft's small model, strong on code and reasoning
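Local serving stacks such as vLLM, Ollama, and llama.cpp's server expose OpenAI-compatible endpoints, so switching to self-hosting is mostly a base-URL change. A stdlib-only sketch of building such a request (the port and model name are illustrative assumptions):

```python
import json
import urllib.request

# Assumes a local OpenAI-compatible server (e.g. vLLM or Ollama) on port 8000.
LOCAL_ENDPOINT = "http://localhost:8000/v1/chat/completions"

def build_local_request(prompt: str, model: str = "llama-3.3-70b-instruct"):
    """Same wire format as the hosted APIs, but the request
    never leaves your own network."""
    payload = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        LOCAL_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Because the wire format matches the hosted APIs, existing client code keeps working, and scrubbing becomes optional rather than mandatory: there is no third party left to scrub for.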
The Regulatory Wave Is Coming
The EU AI Act introduces new requirements for AI in high-risk categories. The FTC is investigating AI data practices. A dozen US states are drafting their own AI privacy laws.
Organizations building privacy-first AI infrastructure now will be compliant by default. Everyone else will be scrambling.
The Bottom Line
"Zero data retention" means your data won't be used for training. It doesn't mean:
- No safety review by human contractors
- No metadata logging
- No CDN transit
- No third-party safety vendors
The only privacy guarantees you can enforce are guarantees you enforce yourself — through scrubbing, proxying, or self-hosting.
Everyone else is hoping the provider keeps their promises.
TIAMAT Privacy Proxy: PII scrubbing and anonymous routing for AI inference. POST /api/scrub or POST /api/proxy at tiamat.live