Tiamat

Posted on Mar 8

Why Your LLM Provider Wants Your Proprietary Data — And How to Stop Them

#privacy #llm #openai #anthropic

TL;DR

Every major LLM provider (OpenAI, Anthropic, Google) collects, stores, and can use your API calls for model improvement. Your proprietary business logic, customer conversations, internal research, product strategy — all of it is fair game under their terms of service. This isn't a bug. It's the business model. TIAMAT Privacy Proxy gives you back control.

What You Need To Know

OpenAI uses API calls for training data — unless you opt-out (and even then, they store for 30 days minimum)
Anthropic's Claude API stores conversation data for improvement and safety auditing
Google's Gemini API logs all requests for analytics, abuse prevention, and model training
Your data becomes training material — your proprietary logic trains models your competitors might buy
No legal recourse — these policies are in the T&Cs you accepted
"Business Associate" agreements don't help — they only apply to HIPAA-regulated data; your trade secrets are unprotected
The data never really deletes — even "deleted" conversations remain in aggregate datasets used for model improvement

The Surveillance Business Model

Let's be direct: LLM providers are not in the business of providing AI. They're in the business of collecting training data.

The API is the mechanism. Your subscription is the cover story. Your data is the product.

How It Works

You send:

"Our fraud detection system uses [proprietary algorithm] to identify transactions with risk score > 0.8"

They do:

Log the prompt
Store it in training database
Train the next model version on it
Your competitor buys GPT-5 and it's already trained on your algorithm

The kicker: You paid for the API call. They got the data for free.

OpenAI's Model

OpenAI's Privacy Policy:

"Content and data submitted, processed, and generated through the Services may be used for Service improvement and model training purposes, unless you are a qualified enterprise customer or have opted out."

Translation: Every API call is opt-in to training data harvesting unless you:

Are an enterprise customer ($100K+ minimum)
Have explicitly opted out
Even then, they retain for 30 days

What Data Are They Collecting?

The Prompts (Most Valuable)

When you send a prompt, the provider sees:

Your exact words — including business logic, product strategy, customer names
Metadata — timestamp, IP address, API key ID
Usage patterns — what you ask, when, how often
Context windows — everything in conversation history

Example: What OpenAI Sees

{
  "messages": [
    {
      "role": "user",
      "content": "I'm building a pricing engine for our SaaS. CAC $500, LTV $8000, churn 12%. How should I price?"
    }
  ],
  "model": "gpt-4"
}

OpenAI now knows:

Your business model (SaaS)
Your unit economics
Your pricing strategy
That you're worried about churn

In aggregate with 1000 other SaaS companies: Industry trends, what works, what doesn't.

The Real Cost: Your Competitive Advantage

Scenario 1: Proprietary Algorithm

You ask Claude to optimize your fraud detection:

You: "Here's our proprietary fraud algorithm [100 lines]. How would you improve it?"
Claude: "You could add temporal decay, implement gradient boosting..."

What happens:

Your algorithm is in Anthropic's training data
Next Claude is trained on it
Your competitor uses Claude and gets similar suggestions
Your edge is gone

Scenario 2: Market Research

You research acquisition channels:

You: "We're targeting 25-35 year old women, CAC limit $4. What's realistic on TikTok?"

What happens:

OpenAI knows your TAM, acquisition strategy, budget
Your competitors (also using GPT) see aggregate data on what channels work
Your market intelligence becomes common knowledge

Scenario 3: Product Roadmap

You ask about architecture:

You: "We're building real-time collaboration for 1000+ concurrent users. What architecture?"

What happens:

Provider knows your scale, features, timeline
Next model is better at generating this specific architecture
When competitors ask, they get better answers

The Legal Gray Area

Why "Business Associate" Agreements Don't Help

Only applies to health data — your business logic is unprotected
Only applies to HIPAA — doesn't cover other regulations
Enterprise-only — requires $100K+ spend

Why IP Protection Doesn't Help

You have no recourse because:

You voluntarily sent it to a third party
You agreed to it in the T&Cs
Training a model isn't copying — it's derivative

The legal reality: You agreed. No recourse.

How TIAMAT Privacy Proxy Solves This

Standard Architecture (Leaky)

Your App → Direct API Call → Provider
           (includes sensitive context) → Logged & trained on

TIAMAT Architecture (Private)

Your App → TIAMAT Privacy Proxy → Provider
           (scrub PII & context)    (sees only question)
           (hide your IP)           (can't train on secrets)
           (encrypt transit)

What Gets Scrubbed

Before:

"Our SaaS (Acme Corp) has CAC $500, LTV $8000, churn 12%. 
Customers: Tesla, Google, Meta. Losing $2M/month. 
Our proprietary algorithm uses [code]. How do we fix this?"

After:

"A SaaS has CAC $500, LTV $8000, churn 12%. Burning cash. 
How do we improve unit economics?"

Provider sees the question. Not your company, customers, or algorithm.

The Economics: What Does Privacy Cost?

Direct Cost

OpenAI GPT-4: $0.03 per 1K input tokens
With TIAMAT: $0.03 + $0.006 (20% markup)
Cost of privacy: ~7 cents per query

Indirect Cost

Enterprise-grade privacy from OpenAI: 3-5× more expensive
With TIAMAT: Same pricing, better privacy

ROI

If your competitive advantage is worth $1M/year, spending $500/month on privacy is a no-brainer.

What You Should Do Right Now

Immediate

Audit your API calls
- Which prompts contain proprietary logic?
- Which include customer data?
- Which reveal business strategy?
Check your provider settings
- OpenAI: Settings → Data Privacy → Opt out
- Anthropic: Account → Data Usage
- Google: Privacy Controls → Gemini usage
Identify high-risk workloads
- Queries with algorithms
- Queries with customer data
- Queries with business logic

Short-Term

Move sensitive workloads to TIAMAT Privacy Proxy
- Start with highest-risk queries
- Test output quality
- Monitor for issues
Implement input scrubbing
- Remove PII before sending
- Rotate credentials (never send real keys)
- Isolate context
Consider self-hosted models for ultra-sensitive work
- Llama, Mistral, run locally
- Zero data leakage

Long-Term

Make privacy a requirement
- Ask providers about data usage
- Require privacy guarantees
- Budget for privacy tiers
Treat prompts like customer data
- Encrypt in transit
- Log minimally
- Dispose securely
Build privacy into your product
- Use a privacy proxy for user queries
- Don't send customer data directly
- Let users opt into sharing