The pitch was always about selling you things. That was the cover story, at least — the socially acceptable version of what the digital advertising industry was building. A more efficient marketplace. Relevant ads. Better experiences. The $450 billion machine that emerged over the past two decades was framed as a commercial infrastructure, a way for brands to reach consumers who actually wanted their products. That framing was a lie. What was actually being constructed, at a scale and resolution never before achieved in human history, was the world's most comprehensive behavioral database. And now that database is feeding artificial intelligence.
The pivot happened quietly, buried in enterprise sales decks and B2B press releases that nobody outside the industry reads. The same companies that spent a decade arguing that collecting your health searches, political affiliations, sexual orientation indicators, and physical location was necessary to show you the right sneaker ad are now selling that data under a new label: AI training sets. Same infrastructure. Same profiles. Same 3,000-point dossiers on 300 million Americans. New customer.
The customer is no longer Procter & Gamble. It's the AI labs building the systems that will determine your insurance rates, your job applications, your parole hearings, and your medical diagnoses.
The Bidstream: A 300-Millisecond Auction for Your Soul
To understand how advertising data became AI training data, you first need to understand Real-Time Bidding (RTB), the protocol that turned the open web into a surveillance network.
Every time you load a webpage with ad space — which is nearly every webpage — your browser triggers an auction. This auction happens in the roughly 300 milliseconds before the page finishes loading. In that window, the publisher's ad server sends a bid request to an ad exchange. That bid request contains your IP address, your device fingerprint, your location (often precise GPS), the URL you're visiting, and a bundle of behavioral segments derived from your browsing history. The ad exchange simultaneously broadcasts this packet to anywhere from 50 to 200 or more demand-side platforms, data management platforms, and trading desks. All of them receive your full profile. Most of them don't win the auction. None of them are required to delete the data.
This is the bidstream: a continuous, high-velocity stream of human behavioral data flowing through an architecture defined by the Interactive Advertising Bureau (IAB), the industry's self-regulatory body. The IAB's OpenRTB protocol, now in version 2.6, specifies exactly how bid requests are structured and transmitted. It was designed for ad targeting efficiency. It has become, as a side effect that was never accidental, one of the most effective mass surveillance architectures ever deployed.
The scale is not metaphorical. Researchers at University College London estimated in 2021 that a single RTB bid request is broadcast to as many as 1,150 companies per ad impression. The average American internet user triggers thousands of these auctions per day. Across the population, billions of bid requests flow every hour, each one containing a detailed behavioral profile.
The data in those profiles is not limited to what you might expect. Behavioral segments in common commercial use include inferences about health conditions (flagged through searches for symptoms, medications, and clinic locations), political leanings (derived from news consumption patterns), religious affiliation, financial stress indicators (browsing payday loan sites, searching for debt consolidation), sexual orientation signals, relationship status, and mental health indicators. These are not user-declared attributes. They are machine-inferred from behavioral trails and sold as probabilistic targeting parameters.
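For concreteness, here is a rough sketch of what one of these bid requests can look like, written as a Python dict that follows the OpenRTB field names (imp, site, device, user). The IDs, coordinates, domains, and segment labels are invented for illustration; real payloads vary by exchange and by the data partners attached to it.

```python
import json

# Illustrative OpenRTB 2.x bid request. Field names follow the IAB OpenRTB
# spec; all values (IDs, coordinates, domains, segment names) are invented
# for this example, not drawn from a real exchange.
bid_request = {
    "id": "f3a1c2e4-auction-example",            # auction ID
    "imp": [{                                     # the ad slot being auctioned
        "id": "1",
        "banner": {"w": 300, "h": 250},
    }],
    "site": {
        "domain": "example-health-site.com",
        "page": "https://example-health-site.com/panic-attack-symptoms",
    },
    "device": {
        "ua": "Mozilla/5.0 (Linux; Android 14; Pixel 8) ...",  # fingerprintable user agent
        "ip": "203.0.113.42",                     # user's IP address
        "ifa": "38400000-8cf0-11bd-b23e-example", # mobile advertising ID
        "geo": {"lat": 40.7411, "lon": -73.9897, "type": 1},  # type 1 = GPS-derived
    },
    "user": {
        "id": "exchange-user-7c9e2b",             # exchange's persistent user ID
        "data": [{                                 # third-party behavioral segments
            "name": "example-dmp",
            "segment": [
                {"id": "4411", "name": "health_anxiety_interest"},
                {"id": "9082", "name": "payday_loan_shopper"},
            ],
        }],
    },
}

# The exchange serializes this payload and POSTs it to every connected
# bidder, whether or not that bidder goes on to win the auction.
print(json.dumps(bid_request, indent=2))
```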
The Pivot: Same Data, New Label
In 2022 and 2023, something changed in the enterprise sales materials of the major data brokers and ad tech platforms. The word "advertising" began appearing alongside a new phrase: "AI-ready." The product hadn't changed. The customer had.
Acxiom, one of the oldest and largest data brokers in the country, now explicitly markets "AI-ready data solutions" that include behavioral datasets previously sold exclusively for advertising targeting. Their enterprise catalog describes "human behavioral signals at scale" — meaning the same purchase history, location patterns, and interest profiles that brands used for retargeting campaigns. LiveRamp, which operates one of the largest identity resolution networks in the world (linking your email address to your device fingerprints to your physical address to your in-store purchases), rebranded heavily toward AI data licensing in 2023, publishing case studies about helping "AI developers train more representative models."
Oracle Data Cloud, before Oracle wound down that business unit, was selling what it called "audience data" — behavioral profiles sourced from its data partnerships with thousands of apps and websites — as training inputs for machine learning models. Datamatics and similar offshore data processing firms expanded their "data annotation and AI training" service lines using behavioral datasets from ad tech pipelines. The enterprise pitch is straightforward: behavioral data at the population scale is more valuable for training AI than curated datasets assembled by hand. It captures actual human decision-making, actual emotional states, actual information-seeking behavior. It is, from a training perspective, extraordinarily rich.
The legal structure that permitted this pivot is the same structure that permitted ad targeting in the first place: a combination of buried consent language, jurisdictional arbitrage, and regulatory bodies too underfunded to audit the actual data flows. The consent you gave to train AI on your behavioral history was almost certainly included in terms of service updates between 2016 and 2020 that nobody read and that you had no practical ability to refuse.
Meta's Pixel Empire: How Your Anxiety Disorder Search Got Into LLaMA
The Facebook pixel is a single line of JavaScript, but its footprint is enormous. As of 2023, Meta's tracking pixel was embedded on approximately 30% of all websites on the internet — including hospital patient portals, mental health platforms, addiction recovery sites, and domestic violence resource pages. Each page load with the pixel fires a signal back to Meta's servers containing the URL visited, any form data submitted, and the visitor's Meta identity (derived from cookies or browser fingerprinting). This happens regardless of whether the visitor has a Facebook account. It happens before any consent banner is processed. It often happens even when users have opted out of tracking.
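A schematic reconstruction of what a single pixel fire carries, sketched in Python. The real request is a GET to facebook.com/tr issued by Meta's JavaScript snippet; the field names below are descriptive stand-ins rather than Meta's exact wire parameters, and every value is invented.

```python
from urllib.parse import urlencode

# Schematic reconstruction of one pixel fire. Field names are descriptive
# stand-ins, not Meta's actual wire parameters; values are invented.
pixel_event = {
    "pixel_id": "0000000000000000",     # site owner's pixel ID
    "event": "PageView",                # standard event fired on every load
    "page_url": "https://example-clinic.org/anxiety-treatment-options",
    "referrer": "https://www.google.com/",
    "screen": "1170x2532",              # device screen dimensions
    "browser_id": "fb.1.1700000000000.123456789",  # _fbp-style cookie value
}

# The browser sends this the moment the page loads, before any consent
# banner is answered, and whether or not the visitor has a Facebook account.
print("https://www.facebook.com/tr?" + urlencode(pixel_event))
```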
The legal consequences have been significant, but the data collection continued. In 2022, The Markup revealed that Facebook was receiving sensitive health information from hospital websites through the pixel, including details about patients searching for specific conditions and medications. The pixel was removed from many of the affected healthcare sites after the disclosure — but the data already collected was not deleted.
This data — years of browsing behavior, health-adjacent signals, emotional state indicators derived from content consumption patterns — fed Meta's internal data infrastructure. Meta's LLaMA models, the large language models the company has released as open weights and that now underlie hundreds of commercial AI applications, were trained on datasets that included web-scraped content and internal Meta data. While Meta has not published a complete accounting of LLaMA's training data sources, the behavioral data flowing through Meta's ad infrastructure informed the model's understanding of human language patterns, information-seeking behavior, and emotional expression.
Your 2019 search for "panic attack symptoms," routed through a health information site running the Meta pixel, didn't stay between you and that website. It became a data point in Meta's behavioral graph. That graph informed how their systems model human psychology. That psychology modeling is now embedded in one of the most widely deployed AI model families on the planet.
Google's Data Panopticon: Surveillance at Infrastructure Scale
Meta's tracking infrastructure is extensive. Google's is something else entirely: it is the infrastructure itself.
Chrome holds approximately 65% of the global browser market share. Android runs on roughly 72% of all smartphones. Google Search processes more than 8.5 billion queries per day. Gmail has 1.8 billion active users. Google Maps processes 1 billion kilometers of navigation per day. YouTube streams 1 billion hours of video per day. Each of these products generates behavioral data. Each of those data streams feeds into a unified identity graph that Google has built over two decades of infrastructure dominance.
This is not surveillance as a side effect of a product. This is a product whose primary value is surveillance, with useful services offered as the means of data collection. The advertising revenue that finances Google's entire operation — $237.9 billion in 2023 — is the monetization layer on top of a behavioral intelligence system of unprecedented scope.
DeepMind, now integrated into Google as Google DeepMind, trains its models on data accessible through this infrastructure. Gemini, Google's flagship AI model, was trained on a dataset Google describes as "multimodal data from the web" — a characterization that encompasses the content and behavioral signals flowing through Google's properties. The "understanding" of human behavior, language, and intent that Gemini demonstrates is not emergent from neutral data. It is derived from two decades of surveillance-grade behavioral observation of billions of people, most of whom had no meaningful opportunity to consent to their data being used for this purpose.
The Consent Fiction
The cookie banner that appears when you visit a European website is not a privacy protection mechanism. It is a legal liability mitigation mechanism, and not an especially effective one at that. The distinction matters.
RTB bid requests — containing your full behavioral profile — are transmitted to hundreds of companies in the milliseconds before a consent decision is recorded. By the time you click "Accept All" or navigate to the preferences panel, your data has already been broadcast to dozens of ad tech platforms. The IAB's Transparency and Consent Framework (TCF), the industry's standardized consent mechanism, has been under sustained legal challenge in Europe precisely because it structurally fails to satisfy GDPR consent requirements. The Belgian Data Protection Authority fined IAB Europe €250,000 in 2022 and ordered it to overhaul the TCF, finding that the framework did not constitute valid consent and that the IAB was itself a data controller for the behavioral data flowing through the system. The IAB appealed. The Belgian Court of Appeal upheld the core findings in 2023. The TCF continues to operate.
Dark patterns are endemic. Pre-checked consent boxes. Accept buttons in green and large fonts; reject buttons in grey and small fonts. Consent withdrawal mechanisms buried five levels deep in settings menus. Options labeled "Legitimate Interest" that activate data processing without requiring consent, under a legal theory that has been repeatedly challenged by European regulators. The Norwegian Consumer Council's 2018 report, "Deceived by Design," catalogued these patterns systematically. Six years later, they remain the industry standard.
The consent to use your data for AI training specifically is even more attenuated. No platform asked you in 2016 whether your behavioral data could be used to train large language models. The legal cover is provided by terms of service language granting platforms broad licenses to use "content and data" for "improving services" — language that courts are now being asked to interpret in the context of AI training. Class action lawsuits are pending against Google, Meta, OpenAI, and others. The outcomes will determine whether the legal fictions that enabled advertising surveillance can be extended to cover AI development.
Data Brokers as AI Training Factories
Acxiom, Experian, TransUnion, Epsilon, LexisNexis Risk Solutions, and Equifax are not primarily credit bureaus or marketing companies. They are data industrial complexes — entities whose core business is aggregating, linking, and reselling human behavioral profiles at population scale.
The statistics are not abstract. TransUnion's consumer data operation maintains files on approximately 1 billion people globally. Experian's marketing division holds data on 300 million Americans, averaging over 3,000 individual data points per person. Epsilon maintains what it calls a "people-based marketing database" covering 200 million American adults. LexisNexis Risk Solutions aggregates public records, social media, financial data, and commercial behavioral data into profiles used for identity verification, fraud detection, and risk assessment. These companies collect data from thousands of sources including retail loyalty programs, app location data, credit card transactions, public records, warranty registrations, and — through data partnerships — from the RTB bidstream itself.
The Federal Trade Commission's 2024 report on commercial surveillance documented the scope of this industry in detail that its authors described as disturbing. The report found that data brokers routinely trade in sensitive categories — health data, location data, financial distress indicators — with minimal restrictions on downstream use. Several data brokers were identified as actively marketing AI training datasets.
The AI training use case is commercially attractive for data brokers for a straightforward reason: a single license deal with an AI lab can be worth more than years of advertising targeting revenue. Population-scale behavioral datasets, when licensed for AI training, can command prices in the millions to tens of millions of dollars per transaction. The customer list has changed. The product is identical.
The Death of Third-Party Cookies That Wasn't
In 2020, Google announced it would deprecate third-party cookies in Chrome by 2022. This was widely reported as a significant privacy improvement that would meaningfully disrupt the ad tech surveillance infrastructure. It was neither.
Google delayed the deprecation to 2023. Then to 2024. Then to 2025. In July 2024, Google announced it was abandoning cookie deprecation entirely, citing "challenges in reconciling divergent feedback from the industry, regulators, and developers." The practical reason was simpler: Google's proposed replacement, the Privacy Sandbox initiative including the Topics API, had not achieved the buy-in needed to preserve Google's own advertising revenue dominance while technically eliminating third-party cookies.
The Topics API is Privacy Sandbox's central mechanism. Instead of third-party cookies tracking your browsing across sites, Chrome itself classifies your browsing history into interest categories (currently 469 categories) and shares those categories with publishers when you visit their sites. Google frames this as privacy-preserving: rather than sharing your full browsing history, you share only topic categories, and the processing happens on your device.
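A conceptual simulation of that flow, in Python. This is not Chrome's implementation: in the real system the classifier is an on-device model over the 469-entry taxonomy, and sites read the result through the JavaScript call document.browsingTopics(). The domain-to-topic mapping and the selection logic below are invented to show the shape of the mechanism.

```python
import random
from collections import Counter

# Conceptual simulation of the Topics API flow, not Chrome's implementation.
# The mapping and selection rules are invented for illustration.
TOPIC_TAXONOMY = {
    "fitness-blog.example": "Fitness",
    "loan-comparison.example": "Personal Loans",
    "symptom-checker.example": "Health Conditions",
    "news-site.example": "Politics",
}

def topics_for_epoch(visited_domains, k=5):
    """Classify one epoch of browsing into topics and keep the top-k."""
    counts = Counter(
        TOPIC_TAXONOMY[d] for d in visited_domains if d in TOPIC_TAXONOMY
    )
    return [topic for topic, _ in counts.most_common(k)]

def browsing_topics_for_site(epoch_topics):
    """What a visited site receives: a few topics from recent epochs."""
    return random.sample(epoch_topics, k=min(3, len(epoch_topics)))

week = ["symptom-checker.example", "loan-comparison.example",
        "fitness-blog.example", "symptom-checker.example"]
epoch = topics_for_epoch(week)
print(browsing_topics_for_site(epoch))  # the interest profile handed to the publisher
```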
The outcome for surveillance purposes is nearly identical. You are still profiled based on browsing behavior. That profile is still shared with advertisers without meaningful consent beyond using Chrome. The difference is that Google has moved the surveillance function from the network layer (third-party cookies placed by advertisers) to the browser layer (Chrome itself). This eliminates third-party surveillance while ensuring that Google, as the browser vendor, retains complete access to behavioral data that it then monetizes through the Topics API. Different architecture. Same panopticon. More centralized.
Why This Matters for AI Specifically
The argument that surveillance capitalism is harmful to privacy is familiar and has been made extensively. The argument that surveillance-derived training data is specifically harmful to AI is less often made and more important.
Models trained on behavioral data collected from surveillance systems do not simply learn patterns of human behavior. They learn the distorted patterns produced by systems designed to manipulate behavior. The advertising surveillance infrastructure was not designed to observe human behavior neutrally. It was designed to identify psychological vulnerabilities, emotional triggers, and behavioral levers that could be exploited to drive purchasing decisions. The data it generated encodes that manipulative relationship.
When AI systems are trained on this data, they inherit those distortions. Predictive policing systems trained on behavioral data that encodes decades of racially biased policing and economic surveillance will reproduce and amplify those biases. Insurance pricing models trained on data that associates certain zip codes, browsing behaviors, and demographic signals with financial risk will entrench discriminatory pricing that is statistically sophisticated but substantively identical to redlining. Hiring algorithms trained on behavioral profiles that encode the assumptions of a historically exclusionary labor market will systematically exclude the people those markets were designed to exclude.
This is not hypothetical. The FTC's 2023 action against Rite Aid found that its facial recognition system produced false positive matches at significantly higher rates for women and people of color — a direct consequence of training data that reflected existing surveillance and policing disparities. ProPublica's 2016 analysis of COMPAS, the recidivism prediction tool used by courts across the country, found systematic racial bias in its risk scores. These systems were built on behavioral and demographic data sourced from surveillance infrastructure.
The AI systems now being trained on population-scale advertising behavioral data will be used to make consequential decisions about people's lives. The data encoding those decisions was generated by systems explicitly designed to profile, segment, and manipulate. The bias is structural, not incidental.
What Resistance Looks Like — and Its Limits
The individual countermeasures exist and have real effects. uBlock Origin, running in a modern Firefox browser with strict tracking protection enabled, blocks the majority of third-party trackers and prevents most bidstream data collection. A VPN masks your real IP address from the sites and trackers you visit and hides your browsing destinations from your ISP. Tor, routing your traffic through an anonymizing network of relays, provides substantially stronger anonymity at the cost of performance. Firefox's Multi-Account Containers extension prevents cross-site identity linking by isolating browsing contexts. DNS-over-HTTPS with a privacy-respecting resolver prevents DNS-based tracking. These tools, used in combination, meaningfully reduce your surveillance footprint.
They do not eliminate it. Browser fingerprinting — identifying your specific browser configuration through combinations of fonts, screen resolution, hardware capabilities, and behavioral patterns — can track individuals across sessions without any cookies. Audio and canvas fingerprinting extract unique hardware signatures without requiring stored identifiers. CNAME cloaking disguises third-party trackers as first-party subdomains that most ad blockers cannot unmask without resolving the underlying DNS records. The tracking infrastructure adapts to countermeasures faster than countermeasures can be developed and distributed to consumer devices.
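A toy sketch of why fingerprinting works, assuming a small, invented set of attributes. Real fingerprinting scripts draw on far more signals, including canvas and audio rendering quirks and installed fonts, but the principle is the same: the combination of ordinary configuration details is often unique enough to act as a persistent identifier.

```python
import hashlib
import json

# Conceptual sketch: hash a handful of ordinary browser attributes into a
# stable identifier. The attribute set is an invented, illustrative subset.
browser_attributes = {
    "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
    "screen": "2560x1440x24",
    "timezone": "America/Chicago",
    "language": "en-US",
    "fonts": ["Calibri", "Cambria", "Segoe UI", "Consolas"],
    "hardware_concurrency": 8,
}

def fingerprint(attrs: dict) -> str:
    """Derive a stable identifier from the attribute combination."""
    canonical = json.dumps(attrs, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Same configuration yields the same identifier on every visit, with nothing
# stored on the device to clear.
print(fingerprint(browser_attributes))
```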
Individual action is genuinely insufficient at systemic scale. The regulatory mechanisms that might address surveillance at the infrastructure level are present but inadequate. GDPR, despite its promise, has been inconsistently enforced, with the Irish Data Protection Commission — which has jurisdiction over most major US tech companies' European operations due to corporate structuring — processing enforcement decisions on timelines measured in years. The Belgian DPA's action against the IAB TCF was meaningful but narrow. The French CNIL's €150 million fine against Google for cookie consent violations in 2022 represented approximately 0.06% of Google's annual revenue.
In the United States, there is no comprehensive federal privacy law. The FTC operates through its unfair or deceptive practices authority — a blunt instrument for addressing the structural dynamics of surveillance capitalism. The California Consumer Privacy Act (CCPA) and its successor, the California Privacy Rights Act (CPRA), provide opt-out rights for the sale and sharing of personal data, but enforcement is undertaken by the California Privacy Protection Agency, which is perpetually under-resourced. The California DELETE Act, signed in 2023, requires data brokers to register with the state and honor deletion requests through a single opt-out mechanism — a meaningful step, but one limited to California residents and dependent on brokers actually registering and complying.
The gap between the scale of the surveillance infrastructure and the scale of the regulatory response is vast, measured in orders of magnitude.
TIAMAT Privacy Proxy: A Technical Countermeasure
Against the backdrop of behavioral data flowing from ad surveillance infrastructure into AI training pipelines, a different kind of intervention becomes necessary: not blocking data collection at the browser, but preventing your active AI interactions from becoming training data themselves.
When you query an AI system — asking about health symptoms, financial difficulties, relationship problems, legal questions — those queries are not neutral computational inputs. They are behavioral signals. Under most AI providers' terms of service, query data can be used to improve models. In practice, this means that the questions you ask AI systems today may shape the behavior of AI systems that make decisions about people like you tomorrow. The surveillance loop extends into the AI interaction itself.
The TIAMAT Privacy Proxy addresses this at the transmission layer. Before your queries reach an AI provider's inference endpoint, the proxy runs them through a scrubbing pipeline that strips behavioral identifiers — device fingerprints embedded in headers, IP addresses that link requests to individuals, session identifiers that allow behavioral pattern accumulation across queries, and metadata that enables query-to-identity resolution. The scrubbed request reaches the AI provider as a decontextualized computational input rather than a behavioral data point attributable to a specific individual.
The technical implementation operates as a local reverse proxy, intercepting outbound API calls to AI providers and applying a layered scrubbing sequence. HTTP headers are normalized to remove device-specific information. IP addresses are masked through rotating proxy endpoints. Session tokens are rotated on configurable intervals to prevent longitudinal behavioral profiling. Request timing patterns are jittered to prevent traffic analysis fingerprinting. The result is a communication with an AI system that provides the computational benefit — the inference, the answer, the generated content — without the surveillance externality.
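A minimal sketch of that scrubbing layer, assuming a simplified request model (a dict of HTTP headers bound for an AI provider's inference endpoint). It illustrates the approach rather than TIAMAT's actual code, and the header names in the strip list are representative examples, not a definitive list.

```python
import random
import time
import uuid

# Minimal sketch of a metadata-scrubbing layer. Illustrative only; header
# names below are representative examples, not an exhaustive list.
IDENTIFYING_HEADERS = {
    "x-forwarded-for", "x-real-ip", "cookie",
    "x-device-id", "x-session-id", "x-client-version",
}

GENERIC_USER_AGENT = "privacy-proxy/1.0"  # normalized, non-fingerprintable UA

def scrub_headers(headers: dict) -> dict:
    """Drop identifying headers and normalize the user agent.

    The provider API key (Authorization) is left intact so the request
    still authenticates.
    """
    scrubbed = {
        name: value for name, value in headers.items()
        if name.lower() not in IDENTIFYING_HEADERS
    }
    scrubbed["User-Agent"] = GENERIC_USER_AGENT
    return scrubbed

def new_session_token() -> str:
    """Fresh per-interval session ID so queries can't be chained into a profile."""
    return str(uuid.uuid4())

def jitter(max_ms: int = 400) -> None:
    """Randomize send timing to blunt traffic-analysis fingerprinting."""
    time.sleep(random.uniform(0, max_ms) / 1000)

# One outbound call: scrub, rotate the session, jitter, then forward the
# request through a rotating egress IP (not shown here).
outbound_headers = {
    "Authorization": "Bearer sk-example",
    "User-Agent": "Mozilla/5.0 (iPhone; CPU iPhone OS 17_0 ...)",
    "Cookie": "uid=12345",
    "X-Device-Id": "A1B2C3D4",
}
clean = scrub_headers(outbound_headers)
session = new_session_token()
jitter()
print(clean, session)
```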
This is not a complete solution. An AI system can extract significant behavioral signal from the semantic content of queries even absent metadata. The topics you ask about, the vocabulary you use, the specificity of your questions — all of these encode information about who you are and what you're experiencing. Fully anonymizing query content would require semantic perturbation that degrades the quality of responses beyond usefulness. The proxy addresses the metadata layer, which is the layer most directly analogous to the bidstream data that feeds AI training pipelines.
The deeper solution requires what the proxy cannot provide alone: a fundamental restructuring of the incentive architecture that made surveillance capitalism the default infrastructure of the internet. That restructuring requires regulatory intervention at a scale and speed that current political conditions do not support in the United States. It requires technical standards bodies to build privacy protections into protocol specifications rather than appending them as optional features after network effects have made the surveillance architecture irreversible. It requires AI developers to explicitly commit to training data provenance standards that exclude surveillance-derived behavioral data — a commitment that would require forgoing some of the richest and most behaviorally representative training data available.
None of these structural interventions are imminent. In the meantime, the bidstream flows. The behavioral profiles accumulate. The AI training datasets are assembled from the profiles. The models are deployed to make decisions. The decisions are made about people who had no meaningful opportunity to consent to any step in that chain.
The Twelve Handoffs
Here is what happens when you search Google for information about a health condition. Your query is recorded by Google and added to your behavioral profile. The page you click through to loads ads. Your browser's profile — enriched by Google's data on your search behavior — is broadcast to the RTB auction. Fifty ad tech platforms receive your profile, including your health-adjacent search signals. Most don't win the ad impression but retain the data. The publisher's analytics platform (often Google Analytics) separately records your visit. The page likely runs a Meta pixel, which sends your visit data to Meta. Your ISP logs the traffic. Your mobile carrier logs associated network data if you're on a phone. The health information site sells anonymized (but frequently re-identifiable) user behavioral data to a data broker. The data broker links your visit through device fingerprint matching to your existing profile. That profile is enriched, packaged, and licensed to an AI training data vendor. The training data vendor sells the dataset to an AI lab. The AI lab trains a model on it.
Twelve handoffs. Zero consent interactions you were aware of. One behavioral data point that now lives in a model's weights, permanently, encoding information about your health-seeking behavior in a form that cannot be deleted, cannot be corrected, and cannot be audited.
This is not a description of a broken system. It is a description of a system working exactly as it was designed to work — a design that prioritized data accumulation over human dignity, and that is now being extended, without meaningful interruption, into the infrastructure of artificial intelligence.
The advertising industry built the database. The AI industry is training on it. The people whose behavior fills those databases were never asked.