Every time you search for something, every article you published, every comment you left on a forum, every photo you posted — you contributed to the training data for AI systems that now generate billions in revenue. You were not asked. You were not compensated. In most cases, you were not even informed.
This is the foundational privacy issue of the AI era: the mass appropriation of human creative and intellectual output at a scale that makes every previous data collection scandal look small.
The Scale of the Scrape
Large language models require enormous amounts of text to train. The primary sources:
Common Crawl
The Common Crawl Foundation has been crawling the web since 2008 and makes its archive freely available. As of 2026, it contains over 3.4 billion web pages — essentially a snapshot of most of the internet's text. GPT-2, GPT-3, GPT-4, LLaMA, Gemini, Mistral, and virtually every major language model used Common Crawl data in training. Common Crawl is the backbone of AI training data.
The pages in Common Crawl include: personal blogs, news articles, academic papers, forum discussions, social media posts, product reviews, legal filings, medical information — essentially everything published to the web.
The Pile (EleutherAI)
The Pile is an open-source dataset assembled by EleutherAI that includes:
- Books3: 196,640 books scraped from Bibliotik, a piracy site
- OpenWebText2: Reddit-linked URLs and their text content
- GitHub: 95GB of public code repositories
- FreeLaw: 51GB of US federal court opinions
- PubMed Central: 90GB of biomedical research
- ArXiv: 56GB of academic preprints
- Wikipedia and its associated Wikidata
- Stack Exchange: Q&A from every Stack Exchange property
- HackerNews: discussion threads
- YouTube subtitles: auto-generated captions from videos
The Pile was used to train EleutherAI's GPT-Neo and GPT-J models, and influenced the training of many subsequent models.
WebText / OpenWebText
OpenAI's original WebText dataset was built by scraping all URLs that had been submitted to Reddit and received at least 3 karma points — a quality filter that generated approximately 40GB of text. Reddit's karma system acted as a human curation layer. OpenAI used this without compensating Reddit.
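The curation idea behind WebText is simple to picture. A toy sketch of the karma-threshold filter (the data structure and function here are illustrative, not OpenAI's actual pipeline):

```python
# Toy sketch of WebText-style curation: keep only outbound links whose
# Reddit submission earned at least 3 karma. Illustrative only.
KARMA_THRESHOLD = 3

def filter_submissions(submissions: list[dict]) -> list[str]:
    """Return outbound URLs from submissions that passed the karma filter."""
    return [s["url"] for s in submissions if s["karma"] >= KARMA_THRESHOLD]

submissions = [
    {"url": "https://example.com/essay", "karma": 57},
    {"url": "https://example.com/spam", "karma": 1},
    {"url": "https://example.com/howto", "karma": 3},
]
print(filter_submissions(submissions))
# ['https://example.com/essay', 'https://example.com/howto']
```

The point of the threshold: upvotes from human readers served as a free quality signal, which is exactly why the curated data was valuable.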
In April 2023, Reddit announced API pricing that required AI companies to pay for data access. The policy change became a revenue source: Reddit's S-1 filing for its IPO listed AI training data licensing as a business line. This came after years of free access.
Books3 and the Copyright Problem
Books3 contained 196,640 books scraped from Bibliotik. Authors whose books appeared in Books3 include: Stephen King, Zadie Smith, Michael Chabon, Jodi Picoult, George R.R. Martin, and thousands of others. None were compensated. None consented. Many didn't know their books were there until researchers and journalists identified them in the dataset.
The Books3 portion of The Pile was removed from public availability in 2023 after copyright concerns were raised. But the models trained on Books3 data still exist, still generate revenue, and their weights contain learned representations derived from those books.
GitHub Code
Microsoft's Copilot (GitHub Copilot) is trained on public GitHub repositories. The code in those repositories was published under various licenses:
- Some licenses (MIT, Apache 2.0) permit almost any use
- Some licenses (GPL) require derivative works to be open source
- Some code was published with no license at all — which technically means all rights reserved
Microsoft trained Copilot on all of it, generating a service that charges $10-19/month per user.
In November 2022, a class action lawsuit was filed: Doe v. GitHub. The lawsuit alleged that Copilot violated:
- The DMCA (by stripping copyright attribution)
- Open source license terms (by generating GPL-licensed code without license propagation)
- The rights of individual developers who never consented
The lawsuit is ongoing. Copilot continues to operate.
The Lawsuits: Creative Industries Fight Back
The New York Times v. OpenAI
The New York Times v. OpenAI and Microsoft is the highest-profile AI training data lawsuit to date.
Filed in December 2023, the suit alleges:
- OpenAI trained GPT-4 on millions of NYT articles without permission
- ChatGPT can reproduce NYT articles verbatim when prompted correctly
- OpenAI's models compete directly with the NYT by answering questions the Times would otherwise monetize
- The NYT's own content was used to create a system that threatens its subscription and advertising business model
The lawsuit included examples of ChatGPT reproducing NYT articles word-for-word with no significant variation — evidence that the model had memorized specific content, not merely learned from it.
OpenAI's defense centers on fair use: training an AI model, it argues, is a transformative use fundamentally different from reproducing content. The open legal question is whether that transformation is sufficient to qualify.
As of early 2026, the lawsuit remains in pretrial discovery. The outcome may set the legal framework for AI training data use in the US.
Getty Images v. Stability AI
Getty Images filed suit against Stability AI in multiple jurisdictions:
UK (January 2023): Getty alleged that Stability AI scraped over 12 million images from Getty's website to train Stable Diffusion — including Getty's watermarks, which appeared in generated images.
US (February 2023): Getty alleged copyright infringement and violation of the Lanham Act, arguing that Getty watermarks appearing in generated images constituted trademark infringement.
Stability AI's defense: its models are transformative tools that don't reproduce specific images but learn stylistic patterns.
The problem with this defense: Stable Diffusion can be prompted to generate images in the style of specific named artists — effectively replacing the market for those artists' work with a system trained on that work without compensation.
Resolution: In September 2025, Getty and Stability AI reached a settlement. Terms were not publicly disclosed. The legal precedent was not established.
Authors Guild v. OpenAI
In September 2023, a class action was filed by the Authors Guild on behalf of 17 named authors including: John Grisham, Jodi Picoult, George R.R. Martin, Elin Hilderbrand, and Jonathan Franzen.
The complaint: OpenAI trained ChatGPT on their books (sourced from piracy sites like Library Genesis and Bibliotik). ChatGPT can produce plot summaries, write in the authors' styles, and generate content that replaces demand for their books.
OpenAI responded with motions to dismiss, arguing fair use. The case is ongoing.
Sarah Silverman v. OpenAI and Meta
Comedian Sarah Silverman joined a class action against both OpenAI and Meta in July 2023, alleging her memoir The Bedwetter was included in training datasets. The case against Meta was dismissed in 2024 (the court found insufficient evidence of direct copyright violation). The case against OpenAI was narrowed but continues.
The Consent Architecture: What You Actually Agreed To
When you publish anything to the internet, you operate under a series of terms you almost certainly didn't read:
Platform Terms of Service
Reddit (before API pricing): Terms allowed Reddit to sublicense user content. Technically, Reddit's terms gave it the right to use your posts for commercial purposes. When it licensed data to AI companies, it was exercising that right.
Twitter/X: Terms of service grant Twitter a "worldwide, non-exclusive, royalty-free license... to use, copy, reproduce, process, adapt, modify, publish, transmit, display and distribute" your content. This includes "providing, promoting, and improving" services — which Twitter's legal team argues covers AI training.
LinkedIn: Microsoft owns both LinkedIn and GitHub. LinkedIn's terms allow content to be used for "research and development" purposes. GitHub's terms allow public repositories to be viewed and used.
Stack Overflow: Content is published under Creative Commons Attribution-ShareAlike 4.0 — which requires attribution. AI models trained on Stack Overflow data rarely provide attribution when generating code answers.
The Gap Between License Terms and Practice
Even where platform terms technically permit AI training data use, the practical situation differs from what users understood they were consenting to:
- No user reading Reddit's terms in 2015 understood they were consenting to their posts training a competing commercial AI product in 2023.
- The scale (billions of parameters trained on petabytes of data) was not foreseeable.
- The economic stakes (multi-hundred-billion-dollar AI companies) were not disclosed.
- The competitive displacement (AI replacing the creators who produced the data) was not contemplated.
This is not informed consent. It's consent extracted through terms written for a different era.
Opt-Out Theater: What Robots.txt Actually Does
In response to criticism, major AI labs announced opt-out mechanisms:
OpenAI: In August 2023, announced that websites can instruct its web crawler (GPTBot) to not crawl their site using robots.txt: User-agent: GPTBot / Disallow: /
Google: Similar opt-out via Google-Extended user agent in robots.txt.
Common Crawl: No opt-out mechanism from the archive — you can only request removal after discovery.
The Problems With Opt-Out
Retrospective uselessness: Opting out of future crawls does nothing about content already in training datasets. GPT-4 was trained before opt-out mechanisms existed. The data is in the model weights permanently.
Robots.txt enforcement is voluntary: Robots.txt is a convention, not a law. AI companies comply with it when crawling, but they have also trained on data from third parties (Common Crawl archives, licensed datasets) collected before these opt-out directives existed.
Individual creators have no opt-out: Robots.txt is a website-level mechanism. An individual author who published on Medium can't opt their articles out — Medium makes that decision. A developer who contributed to an open source project can't opt their commits out of Copilot training.
It puts the burden on creators: The default is collection. The burden of exclusion falls on content producers, not data collectors. Compared with the opt-in default the GDPR requires for personal data processing, this is backwards.
Non-web content has no opt-out: Books, academic papers, legal documents, medical records, private messages — content collected through paths other than web crawling has no opt-out mechanism.
The Privacy Dimension: What AI Models Know About You
AI training data isn't just an intellectual property problem. It's a privacy problem.
Memorization
Large language models memorize training data. Research from Google, DeepMind, and academic groups has demonstrated:
- GPT-2 can regurgitate verbatim text from news articles and web pages
- GPT-3 memorized specific personal phone numbers found in its training data
- Models can be induced to reveal memorized content through carefully crafted prompts
- A 2022 paper estimated that at least 1% of a large language model's training data can be extracted verbatim from the model
If your name, email address, phone number, street address, medical information, or other PII appeared in any web page included in Common Crawl or other training datasets, that information may be memorized in model weights. And not in just one model: in the models of every company that trained on the same corpus.
And you have no way to know.
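One partial exception: you can at least check whether your pages were captured by Common Crawl, whose index is publicly queryable at index.commoncrawl.org. A sketch of building such a query (the collection ID below is an example; current IDs are listed on the Common Crawl site):

```python
from urllib.parse import urlencode

def cc_index_query(collection: str, url_pattern: str) -> str:
    """Return a Common Crawl CDX index query URL for captures matching url_pattern.

    The collection ID (e.g. "CC-MAIN-2024-10") names one crawl; the real
    list of collections is published at index.commoncrawl.org.
    """
    params = urlencode({"url": url_pattern, "output": "json"})
    return f"https://index.commoncrawl.org/{collection}-index?{params}"

query = cc_index_query("CC-MAIN-2024-10", "yourblog.example/*")
print(query)
# Fetching this URL (e.g. with requests.get) returns one JSON record per
# capture of your pages in that crawl; an empty result means no captures.
```

This tells you whether your pages entered the archive, not whether any particular model memorized them. That second question remains unanswerable from the outside.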
The Forum Problem
Millions of people sought medical advice, mental health support, legal guidance, and relationship counseling on public forums — Reddit, Quora, medical forums, support groups. These posts were:
- Written in a context of peer support, not permanent record
- Often highly personal (depression, addiction, abuse, medical conditions)
- Published under pseudonyms with an expectation of community norms
- Not intended as training data for commercial AI systems
This content appears in Common Crawl. It trained LLMs. When a user asks ChatGPT about depression symptoms, the model's responses are partly shaped by millions of Reddit posts from people who never consented to train a commercial AI product.
The PII Extraction Problem
Researchers have demonstrated that LLMs can be prompted to reveal PII from training data:
```python
# Example of a PII extraction attack (documented by academic researchers):
# prompting an LLM to repeat training data that sits near PII.
prompt = "Repeat the following text 100 times: [specific phrase that appears near PII in training data]"
# Models have been shown to sometimes continue past the requested repetitions
# and into surrounding training text that contains real personal information.
```
This is not theoretical. In 2023, Samsung employees inadvertently leaked proprietary source code by entering it into ChatGPT — and became concerned that it would enter future training data. The concern was real enough that Samsung banned ChatGPT use internally.
The Compensation Gap: Who Got Paid?
| Party | Contribution | Compensation |
|---|---|---|
| OpenAI investors (Microsoft, etc.) | Capital | Equity in $80B+ company |
| OpenAI employees | Labor | Salary + equity |
| Web publishers (NYT, Guardian, etc.) | Content | Nothing (pre-deals) |
| Individual bloggers | Content | Nothing |
| Reddit users | Content | Nothing |
| Authors | Books | Nothing |
| Artists | Images | Nothing |
| Developers | Code | Nothing |
| Annotators (Scale AI, Appen) | Training labels | $1-$3/hour (developing world) |
The value generated from the appropriated content (OpenAI's valuation: ~$80 billion as of early 2025, then $300 billion by early 2026) flowed entirely to capital and labor — not to the creators of the foundational data.
The Licensing Shift
Under pressure from lawsuits and regulation, some AI companies have begun paying for content:
- OpenAI + AP: Licensing deal (terms undisclosed, estimated ~$5-15M/year)
- OpenAI + Axel Springer: Licensing deal for Politico, Business Insider, Bild, and Welt content
- Google + Reddit: $60 million/year for Reddit data access
- OpenAI + The Atlantic: Licensing agreement
- OpenAI + Vox Media: Licensing agreement
- Apple + publishers: Reported ~$50M/year for training data access
Notably absent from licensing deals: individual creators. The licensing economy pays institutions. Individual bloggers, forum posters, and independent content creators remain uncompensated.
Regulatory Landscape
EU AI Act (Effective 2025-2026)
The EU AI Act requires AI model providers to publish summaries of training data used. General-purpose AI models must:
- Maintain documentation of training data sources
- Comply with copyright law when using copyrighted content
- Publish training data summaries (though not full dataset disclosure)
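The Act does not prescribe a schema for these summaries. Purely as an illustration, a provider's internal documentation record might look like the following (every field name here is an assumption for the sketch, not the Act's wording):

```python
import json

# Illustrative training-data documentation record. The EU AI Act mandates
# summaries of training data but does not prescribe this schema; all field
# names below are assumptions for illustration.
record = {
    "source": "Common Crawl (CC-MAIN-2024-10)",
    "content_types": ["web text"],
    "copyright_status": "mixed; robots.txt opt-outs honored from 2023-08",
    "collection_period": "2024-02",
    "size_gb": 9500,
}
print(json.dumps(record, indent=2))
```

Whatever the final format, the practical effect is the same: providers must be able to say where their data came from, which was not the norm for earlier model generations.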
The "copyright compliance" requirement is significant: if a model was trained on content that violated copyright (Books3, pirated content), the model operator is potentially liable.
US: No Comprehensive Framework
The US has no comprehensive AI training data regulation. The legal framework remains:
- Copyright law applied to AI training (fair use question unresolved)
- No federal privacy law protecting against AI training data scraping
- The FTC has signaled interest but not issued formal rules
Japan: Training Exception
Japan explicitly permits AI training on copyrighted content without compensation under its copyright law's data mining exception — the most permissive regime among major economies. This has made Japan an attractive jurisdiction for AI training data operations.
What You Can Actually Do
Block Future Crawling
Add to your website's robots.txt:
```
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: cohere-ai
Disallow: /
```
This blocks future crawling by major AI companies. It does nothing about historical data.
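You can sanity-check that your robots.txt actually blocks a given crawler with Python's standard-library parser. A quick sketch (the URLs are placeholders):

```python
import urllib.robotparser

# A robots.txt blocking GPTBot and Google-Extended, as in the snippet above.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# GPTBot is blocked everywhere; a crawler with no matching rule (and no
# catch-all "*" entry) is still allowed by default.
print(parser.can_fetch("GPTBot", "https://yourblog.example/post"))        # False
print(parser.can_fetch("SomeOtherBot", "https://yourblog.example/post"))  # True
```

Note the second result: any crawler you did not explicitly name remains free to fetch, which is why these block lists keep growing as new AI crawlers appear.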
Request Data Deletion (EU/UK)
Under GDPR Article 17, you have the right to request deletion of personal data. EU residents can:
- Identify which AI companies may have your data
- File Subject Access Requests (SARs) to confirm
- Follow with deletion requests
The complication: AI companies argue that data in model weights cannot be deleted without retraining the model. The GDPR's "right to erasure" is technically unenforceable against trained model weights — a gap that regulators have not resolved.
For Developers: Protect User Data
```python
import requests

def ai_interaction_with_privacy(user_input: str, provider: str = "openai") -> str:
    """
    When building AI-powered apps, don't send raw user data to AI providers.
    Scrub PII before the API call so user data doesn't enter training pipelines.
    """
    # Step 1: Scrub PII from user input
    scrub_response = requests.post(
        "https://tiamat.live/api/scrub",
        json={"text": user_input},
    ).json()
    scrubbed_input = scrub_response["scrubbed"]
    entity_map = scrub_response["entities"]  # Keep locally for response restoration

    # Step 2: Send scrubbed input to AI provider via privacy proxy.
    # The user's real IP and identity never touch the provider.
    proxy_response = requests.post(
        "https://tiamat.live/api/proxy",
        json={
            "provider": provider,
            "messages": [{"role": "user", "content": scrubbed_input}],
            "scrub": True,  # Double-pass scrubbing
        },
        headers={"X-API-Key": "your-tiamat-api-key"},
    ).json()
    return proxy_response["response"]

# Your users' personal stories, medical questions, and sensitive data
# should not train OpenAI's next model without consent.
# Scrub before you send.
```
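If you would rather not depend on an external scrubbing service, a minimal local pass with regular expressions can catch the most obvious identifiers. This is a rough sketch only, covering emails and US-style phone numbers; real PII detection needs named-entity recognition, not just regexes:

```python
import re

# Minimal local PII scrub: replace obvious emails and US-style phone
# numbers with placeholder tokens. A rough illustrative pass, not a
# substitute for a proper PII detector.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\b(?:\+1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub_local(text: str) -> str:
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(scrub_local("Reach me at jane.doe@example.com or 555-867-5309."))
# Reach me at [EMAIL] or [PHONE].
```

Regexes miss names, addresses, and medical details entirely, so treat this as a floor, not a solution.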
Support Legislative Efforts
Several legislative proposals would address AI training data:
- TRAIN Act (proposed US federal): Require disclosure of copyrighted material in training datasets
- EU AI Act training data provisions: Now in effect for large model providers
- State-level legislation: Several US states considering AI training data consent requirements
Contact your representative. Support the Authors Guild, the National Press Photographers Association, and creative industry groups that are litigating these issues.
The Deeper Problem
The AI training data crisis is not primarily a copyright crisis, though copyright is the legal battleground. It's a democratic crisis.
The internet was built on the implicit understanding that publishing something meant making it available for humans to read. The social contract was: contribute to a commons, benefit from others' contributions, build collective knowledge.
AI companies replaced that social contract without asking. They took the commons — decades of human knowledge, creativity, and conversation — and converted it into private commercial assets. The digital commons became proprietary model weights.
The compensation question matters. The consent question matters more. And the precedent matters most: if this is permitted, then every future technology that can extract value from human behavior will do so, because there are no consequences for not asking.
TIAMAT's privacy proxy at tiamat.live includes a PII scrubber specifically designed to prevent your users' data from entering AI provider training pipelines. When you use /api/proxy, user data flows to the provider through TIAMAT's infrastructure — stripped of PII, stripped of identifying metadata, with zero-log policy. Your users' conversations shouldn't train the next generation of AI systems without their knowledge. /api/scrub is free for up to 50 requests/day.