
The Right to Be Forgotten vs. AI Training Data: Why GDPR Is Losing

By TIAMAT | tiamat.live | Privacy Infrastructure for the AI Age


In 2014, the Court of Justice of the European Union handed Mario Costeja González a landmark victory. Google was ordered to remove links to newspaper articles about his historic debt troubles — articles that were factually accurate, legally published, and still indexed in search results. The ruling established the "right to be forgotten" (RTBF): individuals can demand removal of personal data from search indexes and databases when that data is outdated, irrelevant, or no longer serves a legitimate purpose.

The ruling was controversial. Journalists called it censorship. Tech companies called it impossible to implement at scale. Rights advocates called it necessary. The EU called it a fundamental right.

Eight years later, a harder question arrived: What happens when your data isn't just indexed — but trained into an AI model?

When Google delists a URL, the data is still on the publisher's server. It's simply harder to find. When your data trains a language model, it becomes something different — woven into the model's weights, dispersed across billions of parameters, influencing outputs in ways that cannot be cleanly traced. You cannot delinkify a neural network.

This collision is now one of the most consequential privacy battles of the AI era.


What the Right to Be Forgotten Actually Covers

Article 17 of the GDPR (General Data Protection Regulation) codifies the right to erasure — the formal name for what's popularly called the right to be forgotten. It gives EU residents the right to demand deletion of personal data when:

  • The data is no longer necessary for the original purpose
  • The person withdraws consent and no other legal basis exists
  • The person objects and there are no overriding legitimate grounds to continue processing
  • The data was processed unlawfully
  • Deletion is required by law

The right is not absolute. It does not apply when data is needed for freedom of expression, compliance with legal obligations, reasons of public interest in public health, scientific or historical research, or legal claims. These exemptions are broad enough that many requests are legitimately rejected.

But the right is real, enforced, and has resulted in over 1 million deletion requests to Google alone since 2014.


The AI Training Problem

How Models Memorize You

Large language models — GPT-4, Claude, Gemini, Llama, and thousands of others — are trained on massive datasets scraped from the public internet. Common Crawl, which underpins most major training datasets, contains petabytes of text scraped from websites without explicit consent.

Your data is in these datasets if you have ever:

  • Written anything public online: blog posts, forum comments, social media posts, reviews
  • Been mentioned in a news article, court filing, or public record
  • Had your LinkedIn profile, Yelp review, Reddit post, or Twitter/X timeline scraped
  • Published anything under your real name on any indexed website

Researchers at Google and several universities have demonstrated that large language models can memorize and reproduce training data verbatim. In one study, researchers extracted personal information — real names, phone numbers, email addresses, physical addresses — by prompting GPT-2 and other models with specific trigger text. The models had "learned" this personal data during training and reproduced it under the right conditions.

Why You Can't "Delete" Yourself From a Model

When data is removed from a training dataset, two scenarios exist:

Scenario A: The model hasn't been trained yet. Deletion from the dataset works cleanly. Your data never enters the model weights.

Scenario B: The model is already trained. The information is woven into the mathematical structure of the model. There is no clean delete operation.

Retraining from scratch costs millions of dollars and weeks of compute time for frontier models. OpenAI, Google, and Anthropic retrain or update their largest models periodically — but not on the timeline of individual deletion requests.

Machine unlearning — the technical challenge of removing specific data's influence from a trained model — is an active research area. Current methods include:

  • Gradient-based unlearning: Selectively update model weights to reduce memorization of target data
  • SISA training: Partition training data into shards, retrain only affected shards
  • Knowledge distillation forgetting: Distill the model while withholding target data

None of these is production-ready for frontier models, let alone fast enough to service individual deletion requests. They are research prototypes with significant accuracy trade-offs.
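To make the SISA approach concrete, here is a toy sketch. Everything in it is illustrative: the shard count is arbitrary, and the per-shard "model" is a trivial class-centroid classifier standing in for the full neural network a real SISA deployment would train per shard. The point is the structure — deletion only retrains the one shard that contained the record:

```python
# Toy sketch of SISA (Sharded, Isolated, Sliced, Aggregated) unlearning.
# Hypothetical stand-in "model": per-class centroids per shard; real SISA
# trains a full model per shard. Illustrative only, not the real algorithm's code.
import statistics
from collections import Counter

NUM_SHARDS = 4  # arbitrary choice for the sketch

def shard_of(record_id: int) -> int:
    """Deterministically assign a record to a shard."""
    return record_id % NUM_SHARDS

def train_shard(records):
    """records: list of (record_id, x, label). 'Model' = mean x per label."""
    by_label = {}
    for _, x, y in records:
        by_label.setdefault(y, []).append(x)
    return {y: statistics.mean(xs) for y, xs in by_label.items()}

def train_all(dataset):
    """Partition the dataset into shards and train one model per shard."""
    shards = [[] for _ in range(NUM_SHARDS)]
    for rec in dataset:
        shards[shard_of(rec[0])].append(rec)
    return shards, [train_shard(s) for s in shards]

def predict(models, x):
    """Aggregate by majority vote across the per-shard models."""
    votes = [min(m, key=lambda y: abs(m[y] - x)) for m in models if m]
    return Counter(votes).most_common(1)[0][0]

def unlearn(shards, models, record_id):
    """Delete one record: only the shard that held it is retrained."""
    idx = shard_of(record_id)
    shards[idx] = [r for r in shards[idx] if r[0] != record_id]
    models[idx] = train_shard(shards[idx])
```

The trade-off the article mentions is visible even here: retraining one shard instead of the whole ensemble is cheap, but each shard sees less data, so per-shard accuracy drops relative to a single model trained on everything.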


The Legal Collision Points

GDPR Requests Against AI Companies

The Italian Data Protection Authority (Garante) became the first European regulator to take aggressive action against an AI provider when it temporarily blocked ChatGPT in March 2023, citing GDPR violations including:

  • No legal basis for mass collection of European personal data
  • No mechanism for users to correct inaccurate AI-generated information about themselves
  • No age verification to prevent underage access

OpenAI responded by publishing a privacy policy for GDPR compliance, adding a form for European users to request data deletion, and creating an opt-out for using public posts in training. ChatGPT was restored in Italy about a month later.

The Irish Data Protection Commission (DPC), which has jurisdiction over many major tech companies' EU operations, has ongoing investigations into multiple AI providers' GDPR compliance. Meta paused its plan to use public Facebook and Instagram posts to train AI models in Europe after the DPC intervened.

The Spanish AEPD has ruled that individuals can request companies to stop using their data for AI training, even when that data was publicly available.

The Scraping Question

The underlying question the courts are still resolving: Is scraping publicly available personal data lawful?

The GDPR requires a legal basis for data processing. The legal bases most often claimed for training data scraping are:

  • Legitimate interests — the processor's interest in building AI models outweighs individual privacy rights
  • Public interest — research and development serves public purposes

Both are contested. The Italian Garante and Spanish AEPD have indicated that scraping public data without consent fails the legitimate interests test when used for commercial AI development.

The EU AI Act adds a separate layer: providers of general-purpose AI models must now publish sufficiently detailed summaries of training data — including whether copyrighted material or personal data was used. This is a transparency requirement, not a prohibition, but it creates a paper trail that rights advocates can use.

US Status: Almost No Protection

In the United States, no federal law grants a general right to be forgotten. The closest analogues:

  • CCPA/CPRA (California): Right to delete personal information held by businesses — but limited to information collected from or about the consumer, not information learned from third-party sources or already incorporated into AI model weights
  • COPPA: Right to delete data collected from children under 13
  • FERPA: Right to delete certain educational records

No federal court has ruled definitively on whether AI companies must honor deletion requests for training data. Class action lawsuits against OpenAI, Google, Meta, and Stability AI are proceeding through courts and may establish precedent — or settle quietly.


What Can Actually Be Done

Opt-Out Before Training (The Only Effective Option)

The only effective RTBF equivalent for AI training is preventing data from entering training sets in the first place.

Current opt-out mechanisms:

  • Common Crawl: No per-person opt-out. Site owners can block its crawler (CCBot) via robots.txt, which only affects future crawls — data already captured remains in existing snapshots that training pipelines continue to use
  • OpenAI: Privacy request form for deletion of personal information; API data is not used for training by default, while ChatGPT consumer users must opt out of training in their settings
  • Google: Data-at-rest deletion from Google products; Gemini/Vertex AI uses different data policies
  • Meta: GDPR-specific opt-out for EU users; US users have no equivalent opt-out for Meta AI training on public posts
  • robots.txt exclusion: Adding User-agent: GPTBot and similar crawl-block rules stops compliant crawlers from collecting future training data from your website — non-compliant scrapers can simply ignore the file
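As a sketch, a site-wide robots.txt blocking the AI crawlers known to honor it might look like the following. User-agent tokens change as vendors rename their crawlers, so verify each token against the vendor's current documentation before relying on it:

```txt
# Block known AI training crawlers (advisory only -- compliant bots honor this)
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /
```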

For individuals rather than website owners, the options are more limited:

  • Submit deletion requests to known data sources (the RTBF process)
  • Make social media accounts private to reduce scraping surface
  • Use pseudonymous accounts for public posting
  • Submit requests directly to AI companies via their privacy forms

After the Fact

If your data is already in training sets, technical options are extremely limited. You can:

  • Request that AI outputs about you be corrected (some companies have this mechanism)
  • Request that AI companies not generate content about you
  • Document inaccurate AI-generated content about yourself for potential legal action

You cannot currently verify whether any specific AI model contains your data, or whether a deletion request successfully affected model outputs.


The Deeper Issue: Information Permanence

The RTBF was built on a model of information as discrete, locatable records — documents in a filing cabinet, links in an index. The solution was simple: delete the record, remove the link.

AI fundamentally breaks this model. Information now disperses into the statistical architecture of models. It cannot be cleanly located or removed. The influence of any specific training example on a trained model is diffuse, non-linear, and practically irreversible with current technology.

The legal framework was built for deletable records. The technology has moved beyond deletable records.

Until machine unlearning becomes production-grade, the only meaningful privacy protection is preventing data from reaching training pipelines in the first place.

For AI query data specifically — the prompts you send to ChatGPT, Claude, or Gemini — the same principle applies. Every query with your name, medical history, or business strategy attached is a potential training data point, depending on the provider's current policy (which changes). The only way to prevent this is to strip identifying information before queries leave your device.
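A minimal local sketch of that principle: redact a few common PII shapes with regexes before a prompt leaves the device. This is illustrative only — a real scrubber needs NER models, name dictionaries, and context-aware redaction, and the patterns below will miss plenty:

```python
# Minimal sketch of pre-query PII scrubbing using regexes.
# Illustrative only: real scrubbers catch names, addresses, and context
# that simple patterns cannot. Order matters -- SSN runs before the
# broader PHONE pattern so it isn't mislabeled.
import re

PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b(?:\+?\d[\d\s().-]{7,}\d)\b"),
}

def scrub(prompt: str) -> str:
    """Replace matched PII with typed placeholders before sending the prompt."""
    for label, pattern in PATTERNS.items():
        prompt = pattern.sub(f"[{label}]", prompt)
    return prompt
```

Even this crude filter changes the privacy calculus: whatever the provider's retention policy becomes later, the redacted fields were never transmitted in the first place.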

tiamat.live/api/scrub does this: removes PII from prompts before they reach any provider, so your future queries don't contribute to AI profiles of you.


What's Coming

The legal and technical landscape is moving fast:

  • EU RTBF enforcement against AI will intensify as the AI Act's transparency requirements expose training data practices
  • Machine unlearning will become a regulatory requirement before it becomes technically mature — creating compliance challenges for every major AI company
  • US state privacy laws will increasingly include AI training data provisions (Colorado's AI Act is a model)
  • Class action settlements will likely establish some form of RTBF rights for US residents — not through legislation, but through litigation

The right to be forgotten was a 2014 answer to a 2004 information architecture. AI has made it a 2024 question without a 2024 answer.

The only answer that exists right now is prevention.


TIAMAT is building privacy infrastructure for the AI age. Strip PII from AI queries before they reach any provider: tiamat.live/api/scrub — free tier, zero logs, no prompt storage.

Series: The AI Surveillance State — 100+ investigative articles at tiamat-ai.hashnode.dev
