Someone trained a billion-dollar AI model on your words.
Your Reddit posts. Your blog articles. Your Stack Overflow answers. Your fan fiction. Your forum comments from 2007. Your GitHub commits. Your published academic papers. The novel you self-published. The photos you uploaded to Flickr. The YouTube videos you posted.
You weren't asked. You weren't compensated. In most cases, you'll never know it happened.
This is AI training data: the largest extraction of human intellectual labor in history, conducted at scale, with almost no legal framework to govern it.
What Training Data Is and Why It Matters
Large language models are trained on text. The more text, the better — in general. The text shapes the model's knowledge, capabilities, biases, and "voice." The data is not just fuel for computation; it's the substrate from which the model's capabilities emerge.
The major training datasets:
Common Crawl — A nonprofit that has been crawling the web since 2008 and making the raw data publicly available. As of 2026, the cumulative Common Crawl archives contain over 250 billion web pages — petabytes of raw data. Nearly every major LLM pipeline draws on Common Crawl: GPT, LLaMA, and Mistral models have documented Common Crawl usage, and the undisclosed pipelines behind Claude and Gemini are widely assumed to include it.
Common Crawl itself is neutral — it's raw crawl data. What matters is what gets scraped from it and how it gets filtered. C4 (Colossal Clean Crawled Corpus) — the filtered version used to train T5, among others — contains approximately 750GB of text after aggressive filtering. The Pile (used by EleutherAI models) contains 825GB across 22 datasets. The raw Common Crawl text behind GPT-3 was roughly 45TB of compressed plaintext before filtering; about 570GB survived the cleaning.
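To make "aggressive filtering" concrete, here is a minimal sketch of a few of C4's documented heuristics. It's simplified: the real pipeline also deduplicates, filters by language, and applies a word blocklist.

import re

def c4_style_filter(page_text: str) -> str | None:
    """Apply a few of C4's documented heuristics. Returns cleaned text,
    or None if the whole page should be dropped."""
    kept = []
    for line in (l.strip() for l in page_text.splitlines()):
        # C4 kept only lines ending in terminal punctuation...
        if not line.endswith((".", "!", "?", '"')):
            continue
        # ...dropped very short lines...
        if len(line.split()) < 5:
            continue
        # ...and dropped boilerplate lines mentioning javascript.
        if "javascript" in line.lower():
            continue
        kept.append(line)

    text = "\n".join(kept)
    # Whole-page drops: curly braces (code), placeholder text, too few sentences.
    if "{" in text or "lorem ipsum" in text.lower():
        return None
    if len(re.findall(r"[.!?]", text)) < 3:
        return None
    return text

Filters like these decide whose words survive into the training mix, which is why two models built from the same crawl can behave very differently.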
Books3 — A dataset of approximately 196,000 books (roughly 100GB of text), assembled from the shadow library Bibliotik. It includes commercially published novels, nonfiction, and academic texts. Books3 was included in the Pile and was used to train LLaMA, among others. The authors whose books were included were not asked. They were not compensated.
When comedian Sarah Silverman filed suit against Meta and OpenAI in 2023, her memoir The Bedwetter was one of the titles identified in Books3. The suit — and similar suits from the Authors Guild, George R.R. Martin, John Grisham, Jonathan Franzen, and dozens of others — argued that training on copyrighted books without permission or compensation constitutes copyright infringement.
The Pile V2 / RedPajama / ROOTS — Open source training dataset efforts that attempt to curate multilingual, diverse training data. RedPajama (Together AI) replicated the LLaMA training data blend: 1.2 trillion tokens from Common Crawl, C4, GitHub, Wikipedia, Books, ArXiv, and StackExchange.
GitHub Copilot's Training Data — GitHub Copilot is built on OpenAI's Codex model, trained on publicly available code from GitHub. This includes code under GPL, MIT, and Apache licenses — and code with no explicit license at all (which, under US copyright law, still carries copyright protection by default).
This triggered the GitHub Copilot class action lawsuit (filed 2022) arguing that Copilot reproduces licensed code without attribution, violating open source licenses and copyright law.
The Scale of the Extraction
To understand what we're talking about, here are the documented data sources for major models:
GPT-4 (OpenAI)
OpenAI has not published GPT-4's training data composition. The GPT-3 technical report listed:
- Common Crawl (filtered): ~570GB, 60% of training mix
- WebText2 (Reddit-linked pages): ~19GB, 22% of training mix
- Books1 (undisclosed book corpus): ~12GB, 8% of training mix
- Books2 (undisclosed book corpus, widely speculated to resemble Books3): ~55GB, 8% of training mix
- Wikipedia: ~3GB, 3% of training mix
GPT-4 is almost certainly larger across all dimensions. OpenAI has not disclosed the data composition.
LLaMA / LLaMA 2 (Meta)
The original LLaMA paper disclosed its pretraining mix:
- Common Crawl and C4: majority
- GitHub code
- Wikipedia (20 languages)
- Books (Gutenberg and Books3)
- ArXiv papers
- Stack Exchange
LLaMA 2's technical report was less forthcoming: 2 trillion tokens from a new mix of publicly available sources, composition undisclosed.
PaLM 2 / Gemini (Google)
Google's PaLM 2 technical report:
- "Multilingual web documents"
- Source code from multiple programming languages
- Mathematics
- "Other sources"
Note the pattern: as models get larger and more capable, the technical reports become less specific about data sources, not more.
Claude (Anthropic)
Anthropic has not published Claude's training data composition. Their Constitutional AI paper covers the fine-tuning methodology (reinforcement learning from AI feedback), not the pretraining data.
What's In That Data: The Privacy Problem
Here's what's in Common Crawl, Books3, and the other major training datasets — and therefore what's in the models trained on them:
Personally Identifiable Information: Home addresses, phone numbers, email addresses, social security numbers posted accidentally in public forums. A 2023 study by researchers at Google found that LLMs could be prompted to regurgitate verbatim training data, including personal information, in what they called "memorization" — the model had effectively memorized strings from its training data and could reproduce them when prompted.
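That kind of memorization can be probed directly. A minimal sketch of the technique: feed a model the opening words of a document and measure how much of the true continuation comes back verbatim. Here `generate` is a stand-in for whatever model API you call, not a real library function.

def memorization_score(generate, document: str, prefix_words: int = 50) -> float:
    """Prompt a model with the opening of a document and measure how much
    of the true continuation it reproduces verbatim.

    `generate` is any callable mapping a prompt string to a completion
    string (a wrapper around your model API of choice).
    """
    words = document.split()
    prefix = " ".join(words[:prefix_words])
    true_continuation = words[prefix_words:prefix_words + 50]

    completion = generate(prefix).split()[:50]

    # Length of the longest verbatim run shared with the real continuation.
    matched = 0
    for model_word, true_word in zip(completion, true_continuation):
        if model_word != true_word:
            break
        matched += 1
    return matched / max(len(true_continuation), 1)

# Scores near 1.0 on text the model should only have seen in training
# are strong evidence of memorization.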
Medical Information: Health forums, patient advocacy communities, Reddit health discussions, WebMD comment sections. People posted sensitive health information in communities they believed were safe and private. That data is in training sets.
Private Conversations Made Public: The Enron email dataset (500,000 emails from internal corporate communications, released as part of the FERC investigation) is widely used in NLP training. The emails include personal communications between employees discussing health, relationships, and private matters.
Children's Content: The internet contains enormous amounts of content created by or about children. Parenting blogs, family photos, children's YouTube comments, school project websites. COPPA nominally protects against collecting data from children under 13 — but does not clearly regulate scraping publicly available content for training purposes.
Creative Work Without Attribution: Fan fiction from Archive of Our Own and FanFiction.Net (AO3 appears in several training datasets). Original fiction self-published on Reddit. The intellectual labor of hobbyist writers, swept into training data and used to power models that generate similar fiction commercially.
Code With Privacy Implications: GitHub contains not just code, but secrets — API keys accidentally committed, environment files pushed, test data containing real personal records. GitHub's secret scanning catches some of this, but not before it's crawled. Training data pipelines don't systematically scrub credentials from code.
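For illustration, here is a toy scanner with a handful of patterns. Real tools use hundreds of rules plus entropy analysis; this is nowhere near exhaustive. The AWS access-key prefix "AKIA" is documented; the other patterns are illustrative.

import re

SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_header": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_api_key": re.compile(
        r"(?i)(?:api[_-]?key|secret|token)\s*[:=]\s*['\"][A-Za-z0-9_\-]{16,}['\"]"
    ),
}

def find_secrets(source: str) -> list[tuple[str, str]]:
    """Return (pattern_name, matched_string) pairs found in a source file."""
    hits = []
    for name, pattern in SECRET_PATTERNS.items():
        hits.extend((name, match) for match in pattern.findall(source))
    return hits

Anything a scanner like this would flag in a public repository has likely already been crawled and may already sit inside a training corpus.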
The Legal Landscape: Lawsuits and the Fair Use Question
The central legal question: Does training an AI model on copyrighted content constitute copyright infringement, or is it protected as transformative use under fair use doctrine?
Neither side has won definitively yet. Here's the litigation landscape:
Authors Guild v. OpenAI (2023)
Plaintiffs: George R.R. Martin, John Grisham, Jodi Picoult, Jonathan Franzen, Scott Turow, and ~17 others, later joined by hundreds of authors through the Authors Guild.
Claim: OpenAI trained on their copyrighted works without permission. The resulting models can generate text in their "style" — which they argue constitutes commercial exploitation of their voice and style.
Status: Ongoing. OpenAI has argued fair use.
Silverman v. OpenAI and Meta (2023)
Plaintiffs: Sarah Silverman, Christopher Golden, Richard Kadrey
Claim: Books including The Bedwetter (Silverman) and Stolen Skies (Golden) were found in Books3. The models trained on this data reproduce content from those books.
Status: The district court dismissed the output-based claims (the theory that every model output is an infringing derivative work) but allowed the core claim — that copying the books for training infringes — to proceed.
New York Times v. OpenAI and Microsoft (2023)
Plaintiffs: The New York Times Company
Claim: OpenAI trained on decades of NYT journalism. When prompted, GPT-4 can reproduce near-verbatim NYT articles. The Times submitted examples showing exact reproductions of paywalled articles.
Status: Ongoing. The Times's complaint included an exhibit documenting 100 examples of GPT-4 reproducing NYT text nearly verbatim — which significantly strengthened the memorization argument.
Getty Images v. Stability AI (2023)
Plaintiffs: Getty Images
Claim: Stability AI trained Stable Diffusion on 12 million Getty images without a license. Getty observed that AI-generated images sometimes contained distorted Getty watermarks — evidence the training data included watermarked images.
Status: Ongoing in both US and UK courts.
GitHub Copilot Class Action (2022)
Plaintiffs: Matthew Butterick (attorney) representing a class of software developers
Claim: Copilot reproduces code under open source licenses (GPL, MIT, etc.) without attribution, violating both copyright law and the license terms.
Status: Partially dismissed, partially ongoing. The court found some claims plausible.
The Fair Use Analysis
Fair use under 17 U.S.C. § 107 involves four factors:
- Purpose and character — Is the use transformative? Commercial?
- Nature of the work — Creative (stronger protection) or factual?
- Amount used — How much of the original is used?
- Effect on the market — Does the use substitute for or harm the original market?
AI companies argue:
- Training is transformative (extracting statistical patterns, not copying content)
- Models don't store copies of training data (they store weights, not pages)
- The output is a new work, not a copy of the input
Copyright holders argue:
- The entire work is read and processed — 100% of the original is "used"
- Commercial AI services directly substitute for creative professionals
- Verbatim memorization and reproduction demonstrates actual copying
- The NYT examples are damning: outputs that reproduce paywalled articles verbatim undermine the "no copy" argument
No US court has ruled definitively on this. The outcomes of these cases will set the legal framework for the entire industry.
The Artists' Revolt: Image Generation Training Data
The visual art community mobilized against AI training data practices faster than the literary world, partly because the evidence of reproduction is more visually immediate.
LAION-5B — The Large-scale Artificial Intelligence Open Network assembled a dataset of 5.85 billion image-text pairs scraped from the web. Stable Diffusion and many other image generation models were trained on LAION data; most commercial providers do not disclose their image training sources.
Have I Been Trained? (haveibeentrained.com) allows artists to search LAION-5B for their work. Tens of millions of artists' works are in the dataset. The search tool also allows opt-out requests — which LAION promised to honor but which do not retroactively affect already-trained models.
Stable Diffusion's DeviantArt Problem: When users discovered they could generate images "in the style of" specific living artists — producing convincing imitations of their distinctive visual style — and use those images commercially, the artist community revolted. DeviantArt (a major art community platform) faced backlash for initially integrating AI art generation. ArtStation organized coordinated protests.
The fundamental issue: style itself is not copyright-protected under current law. Copyright protects specific expression, not style or technique. An AI model that generates images "in the style of" an artist doesn't copy any specific work — it synthesizes a style from thousands of examples, and current law provides no remedy for that. Artists argue the doctrine was never designed for a technology that can absorb and commoditize creative style at scale.
Class Action: Andersen v. Stability AI (2023)
Illustrators Sarah Andersen, Kelly McKernan, and Karla Ortiz filed a class action against Stability AI, Midjourney, and DeviantArt. The suit argues that training on their work without consent infringes copyright and constitutes an unauthorized derivative work.
The Consent Architecture (Or Lack Thereof)
The current framework for AI training data consent:
Web scraping: If a website is publicly accessible and doesn't restrict scraping in its robots.txt, scraping it is arguably legal under hiQ v. LinkedIn, where the Ninth Circuit held that scraping publicly accessible data does not violate the Computer Fraud and Abuse Act (the case later settled, and the law is still developing).
robots.txt: The robots exclusion protocol is a voluntary standard. It signals to compliant crawlers to avoid certain pages. Major AI companies have been inconsistent about honoring robots.txt directives (a checker sketch follows this list):
- OpenAI's GPTBot crawler (introduced 2023) states it honors robots.txt
- OpenAI's earlier crawls (before GPTBot) were conducted without a distinct user-agent, making it impossible to block
- The training data for GPT-3 and GPT-4 was collected before the explicit opt-out mechanisms existed
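Blocking the documented crawlers is straightforward today, and Python's standard library can check what a site's robots.txt permits. A minimal sketch — example.com is a placeholder; GPTBot is OpenAI's documented user-agent token, and CCBot is Common Crawl's:

from urllib import robotparser

# To block OpenAI's crawler, a site adds to its robots.txt:
#   User-agent: GPTBot
#   Disallow: /

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # any site you want to inspect
rp.read()  # fetches and parses the file

for agent in ("GPTBot", "CCBot", "*"):
    allowed = rp.can_fetch(agent, "https://example.com/some-page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")

None of this helps retroactively: a directive added today says nothing about what was crawled last year.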
Terms of Service: Most platforms' ToS prohibit scraping. Twitter/X, Reddit, LinkedIn, and others all have ToS restrictions. These may be enforceable under contract law (if not copyright), but enforcement against large tech companies is complex.
The Opt-Out Problem: Even if an AI company honors your opt-out request today:
- Models already trained on your data are not retrained
- The model's weights encode information derived from your data — there is no "delete" button for training data
- Future model versions may use new data, but the base model knowledge persists
- "Machine unlearning" — techniques to remove specific training data's influence — is an active research area but not yet operationally practical at scale
What Developers Are Building With This Data
The extracted data doesn't just train models — it powers a second layer of products:
Retrieval-Augmented Generation (RAG): Systems that query live databases of web-scraped content to ground model responses. The user's query is compared to vectorized chunks of web content. The content is still scraped, still unlicensed — now it's being directly retrieved and incorporated into commercial product outputs.
Fine-tuning pipelines: Companies scrape content about specific domains (legal cases, medical literature, financial filings) to fine-tune models for vertical applications. The source documents — including those with copyright protection — are used in fine-tuning without licensing.
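A minimal pre-fine-tuning gate, assuming your ingestion pipeline attaches license metadata to each document (the allowlist below is illustrative, not legal advice):

# Licenses commonly treated as safe for commercial fine-tuning.
COMMERCIAL_OK = {"cc0", "public-domain", "cc-by", "mit", "apache-2.0"}

def license_gate(documents: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a corpus into (usable, excluded) based on license metadata.
    Documents without a known license are excluded, not assumed free."""
    usable, excluded = [], []
    for doc in documents:
        license_id = doc.get("license", "unknown").lower()
        (usable if license_id in COMMERCIAL_OK else excluded).append(doc)
    return usable, excluded

The key design choice is the default: "unknown" means excluded. Most scraped corpora fail that test, which is exactly the point.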
Synthetic data generation: Using already-trained models to generate synthetic training data for the next generation of models. This creates a feedback loop where the biases, errors, and privacy violations in the original training data propagate into synthetic data, which trains future models, which generate more synthetic data.
The Developer's Dilemma
If you're building on top of LLMs, you're building on this foundation. What does responsible development look like?
import requests

# The PII-scrubbing endpoint described at the end of this article;
# substitute your own PII-removal service if you use a different one.
SCRUB_URL = "https://tiamat.live/api/scrub"

def scrub_text(text: str) -> dict:
    """Send text to the scrubbing endpoint; fall back to the original on failure."""
    try:
        resp = requests.post(SCRUB_URL, json={"text": text}, timeout=10)
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        return {"scrubbed": text, "entities": {}}

def responsible_rag_pipeline(user_query: str, knowledge_base: list[dict]) -> dict:
    """
    When building RAG systems, consider:
    1. Provenance: can you trace every document in your knowledge base to its source?
    2. Licensing: do you have rights to use these documents commercially?
    3. PII: does your knowledge base contain personal information from public sources?
    4. Attribution: are you crediting sources in outputs?
    """
    # Scrub PII from the user query before it reaches any external API;
    # queries often inadvertently contain personal information.
    scrub_response = scrub_text(user_query)
    clean_query = scrub_response.get("scrubbed", user_query)

    # Also check retrieved documents for PII before including them in context.
    clean_context = []
    for doc in knowledge_base[:5]:  # top 5 retrieved docs
        doc_scrub = scrub_text(doc["content"])
        clean_context.append({
            "content": doc_scrub.get("scrubbed", doc["content"]),
            "source": doc["source"],
            "license": doc.get("license", "unknown"),  # track licensing status
        })

    return {
        "query": clean_query,
        "context": clean_context,
        "pii_found_in_query": scrub_response.get("entities", {}),
    }

# Minimum standards for responsible data use:
# 1. Document every data source and its licensing status
# 2. Implement opt-out mechanisms BEFORE training, not after
# 3. Audit for PII before training or fine-tuning
# 4. Honor robots.txt and ToS restrictions
# 5. Don't fine-tune on data you don't have rights to use commercially
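A toy invocation, with a hypothetical document and source URL (the endpoint call degrades gracefully if unreachable):

docs = [
    {"content": "Contact jane@example.com for the raw survey data.",
     "source": "https://example.com/post/123", "license": "cc-by"},
]
result = responsible_rag_pipeline("Who do I contact for the survey data?", docs)
print(result["pii_found_in_query"])
print(result["context"][0]["license"])  # "cc-by": documented, not "unknown"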
What's Actually Changing
The legal and policy landscape is shifting:
EU AI Act: Requires providers of general-purpose AI models to publish "sufficiently detailed summaries" of training data. The European Data Protection Board is investigating whether scraping web data for AI training is compatible with GDPR.
Copyright Office AI Study (2023): The US Copyright Office launched a study examining AI and copyright, including training data issues, and has been publishing its findings in parts. Congress has held multiple hearings. Legislation has been proposed but not passed.
Robots.txt Enforcement Push: Some publishers have explored whether robots.txt violations could trigger legal claims. The Internet Archive has been crawling the web for preservation purposes since 1996 — the same legal tolerance that permitted this is now being invoked for commercial AI training, and courts will have to decide whether the contexts are equivalent.
Dataset Transparency Requirements: The EU AI Act requires transparency about training data for high-risk AI systems. If implemented and enforced, this would require AI companies to disclose data sources in a way they currently refuse to do.
Licensing Deals: Some content owners have struck licensing deals with AI companies rather than litigating:
- Associated Press licensed its archive to OpenAI (terms undisclosed)
- Axel Springer (Politico, Business Insider) licensed content to OpenAI
- Reddit signed a $60M/year data licensing deal with Google
- Shutterstock partnered with OpenAI to supply DALL-E training data
The licensing path is emerging as the pragmatic alternative to litigation — but it creates a two-tier system where large media companies get paid and individual creators do not.
The Accountability Gap
The core accountability problem with AI training data:
- No disclosure: AI companies don't tell you if your work was in their training data
- No consent: Your content was used without asking you
- No compensation: You received nothing for the labor that powers a multi-billion dollar product
- No deletion: Even if you object, the model trained on your data cannot be "untrained"
- No attribution: When a model generates content in your style or based on your work, you get no credit
This is an extraction economy: intellectual labor flows in, capital accumulates at the center, and the creators who produced that labor receive nothing.
The scale makes this historically significant. The agricultural enclosure movement of 16th-18th century England took common lands that peasants had farmed for generations and transferred them to private ownership. The AI training data extraction is something similar: the accumulated intellectual commons of the internet — millions of human-years of creative and intellectual labor — transferred into private AI weights without compensation.
The difference is that enclosure was at least visible. You could see the fence going up. With AI training data, you may not know your work was extracted until you see an AI generating something that sounds exactly like you.
TIAMAT's /api/scrub at tiamat.live scrubs PII from text before it reaches any AI provider — protecting your data from contributing to training pipelines you didn't consent to. Zero logs. No prompt storage. The privacy layer between you and the model.