Mwai Victor Brian

Posted on Jun 9

Anonymized Data Isn't. Or It Isn't Data: A Technical Overview

#data #privacyengineering #datascience #differentialprivacy

Why Privacy Is the Most Misunderstood Concept in Data Science

A technical follow-up to “Kenya Accidentally Discovered a Gold Mine and Immediately Started Asking Who Wants to Buy the Dirt.” This article builds on one of that piece's core arguments: anonymity. If you haven’t read the original yet, start here: https://dev.to/code_with_mwai/kenya-accidentally-discovered-a-gold-mine-and-immediately-started-asking-who-wants-to-buy-the-dirt-594l

Executive Summary

In the first article, we argued that Kenya is sitting on one of the most valuable data assets on the continent (the exhaust of eCitizen and the government registries behind it) and that the instinct to sell it is the weakest possible use of it. That argument leaned on a single load-bearing assumption made by everyone defending the plan: "don't worry, it's only anonymized data."

This article takes that assumption apart.

The claim rests on a folk theory of privacy that goes roughly: identity lives in your name and ID number; strip those out, and the data is safe. This is wrong, and it has been demonstrably wrong for over twenty-five years. The uncomfortable truth, known to every working privacy engineer, is captured in Cynthia Dwork's aphorism: anonymized data isn't; or it isn't data. Either a dataset is detailed enough to be useful, in which case it is almost certainly re-identifiable, or it has been crushed flat enough to be safe, in which case much of the value people wanted from it is gone.

This piece makes five claims and defends each with code, math, and case law-adjacent disasters:

1. Removing names does not produce anonymity. Identity is distributed across quasi-identifiers (age, location, sex, dates, occupation) whose combinations fingerprint people.
2. Humans are astonishingly unique. Four time-location points identify ~95% of us. The identifier is often the behavior itself.
3. Useful datasets stay re-identifiable. Sparsity and high dimensionality, exactly what makes data valuable for AI and research, are exactly what make it linkable.
4. Perfect anonymity destroys utility. Privacy and usefulness sit on opposite ends of a measurable tradeoff curve.
5. Privacy is not a binary state. It is a budget. Modern privacy engineering (k-anonymity, l-diversity, differential privacy, federated learning, synthetic data, data minimization) is the science of spending that budget wisely, not the magic of making risk vanish.

We finish back where the first article ended: with Kenya. If a government is going to monetize "anonymized" data, the single most important question is not the price. It is: anonymized how, against which adversary, with what budget, and who is liable when it fails?

Anonymization is not a state you reach. It is a war you fight against an adversary you cannot see, with auxiliary data you do not control.

Introduction: The Sentence That Ends Every Privacy Debate

There is a sentence that appears, like clockwork, the moment any government or company is challenged about a dataset:

"We'll only sell anonymized data."

It is a remarkable sentence. It ends arguments. It calms boards. It satisfies journalists. It is the data-governance equivalent of "the cheque is in the mail": technically a statement, emotionally a sedative.

And in Kenya's case, it is doing enormous work. The Draft Final National Data Governance Policy proposes a marketplace of "anonymized and aggregated" datasets (traffic flows, land transactions, business registrations, immigration volumes), and the entire legal and ethical justification rests on that one word. Personal data is excluded. Anonymized data is fair game. End of debate.

Except it isn't the end of the debate. It's barely the beginning. Because before we can argue about whether anonymized data should be sold, we have to confront a more awkward question that almost nobody asks:

Does anonymized data, in the form most people imagine, actually exist?

The working consensus among people who do this for a living is no, not for any dataset rich enough to be worth selling. This is not cynicism. It is the accumulated result of three decades of researchers being handed "anonymous" datasets and re-identifying the people in them, often within days, often for fun, occasionally to mail a governor his own medical records.

So let's do the thing the policy debate skipped. Let's define anonymization precisely, attack it the way a real adversary would, and see what survives. Bring a terminal.

Section 1: What People Think Anonymization Means

Here is the mental model almost everyone carries. Start with a dataset that obviously identifies people:

name,phone,email,age,county
John Doe,0712345678,john@email.com,32,Nairobi
Jane Doe,0723456789,jane@email.com,29,Kiambu
Peter Otieno,0734567890,peter@email.com,41,Kisumu

Now "anonymize" it by deleting the columns that obviously point at a person (name, phone, email):

age,county
32,Nairobi
29,Kiambu
41,Kisumu

Problem solved?

It feels solved. There is no name. There is no number to call. You could publish this on the front page of a newspaper and nobody could be harmed. Right?

The trouble is that this intuition confuses two completely different things:

Direct identifiers: fields that point at exactly one person on their own, such as name, national ID, phone, email, account number, biometric template.
Quasi-identifiers: fields that are individually harmless but, in combination, narrow the world down to one person, such as age, sex, county, date of birth, occupation, employer, the date you visited a clinic.

Deleting direct identifiers is necessary. It is nowhere near sufficient. Because identity does not live in the name column. Identity is distributed across the quasi-identifiers, and it reassembles itself the moment you combine the dataset with something else.

The toy example above looks safe only because it has three rows and two columns. Real eCitizen-scale data has millions of rows and dozens of columns, and that changes everything. The more attributes you keep (and you keep them precisely because they're useful), the more each person's row becomes a fingerprint.

Latanya Sweeney proved this in the 1990s with three fields you'd swear were harmless. We'll get there. First, vocabulary, because half of all privacy disasters are really vocabulary disasters.

Identity does not live in the name column. It is smeared across every "harmless" attribute you decided to keep because it was useful.

Section 2: Privacy vs. Security vs. Confidentiality vs. Anonymization vs. Pseudonymization vs. De-identification

These words get used interchangeably by people who should know better, including in policy documents that will become law. They are not synonyms. They live at different layers of the stack.

Term	What it actually means (engineering)	Failure mode	Reversible?
Security	Keeping unauthorized parties out of the data (encryption, access control, network controls).	Breach, leaked credentials, misconfigured bucket.	N/A — it's a perimeter, not a transformation
Confidentiality	A promise/obligation not to disclose data you legitimately hold.	Insider misuse, careless sharing.	N/A — it's a policy, not a technique
Privacy	The individual's control over information about themselves and the inferences drawn from it.	Data used in ways the person never agreed to.	N/A — it's a right/property
Pseudonymization	Replacing direct identifiers with tokens (hash, random ID), keeping a mapping somewhere.	Linkage; the mapping leaks; the token is guessable.	Yes — with the key, or by attack
De-identification	Removing/obscuring identifiers to reduce identifiability to some standard.	Re-identification via quasi-identifiers + auxiliary data.	Sometimes
Anonymization	Transforming data so individuals can no longer be identified by any means reasonably likely to be used.	The "reasonably likely" clause quietly expands every year.	No — if it's truly achieved

Three engineering points that the table can't shout loudly enough:

1. Security is orthogonal to anonymization. You can have a perfectly secured database (encrypted at rest, locked behind IAM, audited to death) full of perfectly identifiable records. Security protects data from outsiders. Anonymization protects people from the data itself, including from the insiders and buyers you handed it to on purpose. Kenya's marketplace is, by design, a plan to give data to outsiders. Security buys you nothing there.

2. Pseudonymization is constantly mistaken for anonymization, and the mistake is expensive. Hashing a national ID number with SHA-256 feels irreversible. It is not, in the way that matters. The space of Kenyan national ID numbers is small and structured; you can hash every possible ID in an afternoon and build a reverse lookup table. This is exactly how the 2014 NYC taxi dataset fell: medallion numbers were "anonymized" with MD5, but the medallion space is tiny, so researchers rebuilt the mapping and re-identified individual drivers (and, using paparazzi photos with visible medallions, specific celebrities' trips and tips).

Hashing an identifier from a small, structured space isn't anonymization. It's a padlock whose key you also published.

3. Under data-protection law, pseudonymized data is still personal data. This is the legal landmine in Kenya's plan. If a "non-personal" dataset turns out to be merely pseudonymized, or re-identifiable via quasi-identifiers, then it was personal data all along, the Data Protection Act applied the whole time, and selling it was unlawful. The label on the box does not change what's inside it.
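
The reverse-lookup attack from point 2 fits in a few lines. The sketch below assumes a deliberately tiny, hypothetical ID format (zero-padded 5-digit strings) so it runs instantly; the function names are illustrative, and a larger real ID space only means the same loop runs longer.

```python
import hashlib

# Assumption for this demo only: IDs are zero-padded 5-digit strings, so the
# whole space is 100,000 values. A larger real ID space is the same loop, run longer.
ID_DIGITS = 5


def pseudonymize(national_id: str) -> str:
    """Unsalted SHA-256 of the ID string, i.e. the 'anonymization' under test."""
    return hashlib.sha256(national_id.encode("utf-8")).hexdigest()


def build_reverse_table(digits: int) -> dict:
    """Hash every possible ID once and remember which ID produced each digest."""
    all_ids = (f"{n:0{digits}d}" for n in range(10 ** digits))
    return {pseudonymize(i): i for i in all_ids}


reverse_table = build_reverse_table(ID_DIGITS)

token_in_release = pseudonymize("04231")        # what the published file contains
recovered_id = reverse_table[token_in_release]  # what the attacker reads back
print(recovered_id)
```

A keyed hash (an HMAC with a secret key) blocks this enumeration only for as long as the key stays secret, and the result is still pseudonymized data, not anonymized data.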

Section 3: The Re-Identification Problem: Linkage Attacks

Here is the mechanism behind almost every famous privacy failure. It is embarrassingly simple. It is a JOIN.

A linkage attack works when two datasets share quasi-identifiers. One dataset has the sensitive thing you want to hide (a diagnosis, a salary, a search history). The other dataset, often public, connects those same quasi-identifiers back to a name.

Consider an "anonymized" hospital extract:

-- Dataset A: "anonymized" hospital records (names removed!)
age,gender,county,diagnosis
42,Female,Nairobi,HIV+
29,Male,Kiambu,Diabetes
55,Female,Kisumu,Depression

And a perfectly ordinary public or semi-public registry (a professional directory, a voter roll, a leaked dataset, a LinkedIn scrape):

-- Dataset B: a public registry that happens to have names
full_name,age,gender,county
Mary Atieno,42,Female,Nairobi
James Mwangi,29,Male,Kiambu
Grace Wanjiru,55,Female,Kisumu

Neither dataset has both the diagnosis and the name. So neither is "identifying," right? Watch:

SELECT  b.full_name,
        a.age,
        a.gender,
        a.county,
        a.diagnosis          -- the sensitive attribute, now wearing a name
FROM    hospital_records a
JOIN    public_registry  b
  ON    a.age    = b.age
 AND    a.gender = b.gender
 AND    a.county = b.county;

full_name      age  gender  county    diagnosis
Mary Atieno    42   Female  Nairobi   HIV+
James Mwangi   29   Male    Kiambu    Diabetes
Grace Wanjiru  55   Female  Kisumu    Depression

The diagnosis just acquired a name. No hack. No breach. No password cracked. Just a join on three columns nobody thought were identifying.

Why does this work? Because (age, gender, county) is a quasi-identifier with enough resolution to be nearly unique once you go fine-grained. In a small county, "42-year-old woman" might be one of a handful of people. Add one more attribute (occupation, sub-county, a clinic visit date) and the equivalence class collapses to one.

This is the entire game. Anonymization fails not because of what's in your dataset, but because of what your dataset can be joined to. And you do not control what it can be joined to. Every new public dataset, every breach, every social-media scrape is a new potential Dataset B. An anonymization that is safe today can be broken tomorrow by a dataset that doesn't exist yet. Privacy engineers call this the auxiliary information problem, and it is unwinnable in the general case, because you are defending against the union of all data that will ever be published.
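
For readers who work in pandas, the same join can be sketched as a merge, with one addition: counting how many registry names each hospital record matches. The tables repeat the example above, except for one extra registry row ("Ann Njeri") that is invented here to show what ambiguity looks like; the column name `candidates` is also an assumption of this sketch.

```python
import pandas as pd

QI = ["age", "gender", "county"]  # the shared quasi-identifiers

hospital = pd.DataFrame({
    "age": [42, 29, 55],
    "gender": ["Female", "Male", "Female"],
    "county": ["Nairobi", "Kiambu", "Kisumu"],
    "diagnosis": ["HIV+", "Diabetes", "Depression"],
})
registry = pd.DataFrame({
    # The last row is invented for this sketch: a second 42-year-old woman in Nairobi.
    "full_name": ["Mary Atieno", "James Mwangi", "Grace Wanjiru", "Ann Njeri"],
    "age": [42, 29, 55, 42],
    "gender": ["Female", "Male", "Female", "Female"],
    "county": ["Nairobi", "Kiambu", "Kisumu", "Nairobi"],
})

# How many named people share each quasi-identifier combination?
candidate_counts = registry.groupby(QI).size().rename("candidates").reset_index()

# The linkage attack itself: an inner join on the quasi-identifiers.
linked = hospital.merge(registry, on=QI).merge(candidate_counts, on=QI)

# A record is confidently re-identified when exactly one name matches it.
confident = linked[linked["candidates"] == 1][["full_name", "diagnosis"]]
print(linked)
print(confident)
```

Two of the three diagnoses still resolve to a single name. The third resolves to two candidates, which is weaker protection than it sounds: one more attribute would split them.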

You are not anonymizing against today's internet. You are anonymizing against every dataset that will ever exist. You will lose that race.

Section 4: Humans Are Surprisingly Unique

The reason linkage attacks work so reliably is a fact that surprises almost everyone the first time they meet it: people are far more statistically unique than their intuition allows. You feel like one of millions. In the data, you are one of one.

Location. In a landmark 2013 study, Unique in the Crowd, de Montjoye and colleagues analyzed fifteen months of mobility data for 1.5 million people, using just the antenna and timestamp for each call. They found that four approximate time-and-location points were enough to uniquely identify 95% of individuals. Not four hundred. Four. Coarsening the data (bigger time windows, bigger areas) barely helped: uniqueness decays slowly, so you have to destroy almost all the utility to get safety.

Transactions. The same group's 2015 follow-up, Unique in the Shopping Mall, did it with credit-card metadata: just the shop and the day for four purchases re-identified 90% of people in a dataset of 1.1 million. Knowing the rough price of a couple of those purchases pushed it higher.

Search. Your search history is a confession. The sequence of things a person asks (their town, their employer, their illnesses, their children's names, the embarrassing thing at 2 a.m.) is a fingerprint made of curiosity. (AOL learned this in public; Section 6.)

Demographics. Sweeney's famous estimate: roughly 87% of the U.S. population is uniquely identifiable by just {ZIP code, date of birth, sex}. Three fields. In Kenya, swap ZIP for sub-county or ward and the logic is identical, sometimes worse, because rural wards are small.
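
The "how many points does it take" question can be explored on your own data with a small unicity test. The sketch below is not the method or data of the studies above: it uses invented, uniformly random traces (an assumption; real movement clusters around home and work), and every parameter name is hypothetical. It only shows the shape of the measurement.

```python
import random

random.seed(7)
N_USERS, N_ANTENNAS, N_HOURS, POINTS_PER_USER = 1000, 50, 24 * 7, 40
SAMPLE = 200  # number of target users to test

# Each trace is a set of (antenna, hour) observations. Uniform random draws are
# an assumption; real traces cluster around home and work.
traces = [
    {(random.randrange(N_ANTENNAS), random.randrange(N_HOURS))
     for _ in range(POINTS_PER_USER)}
    for _ in range(N_USERS)
]

# For each target, the adversary learns up to four of that person's points.
known_points = [random.sample(sorted(traces[i]), 4) for i in range(SAMPLE)]


def unicity(p: int) -> float:
    """Share of targets whose first p known points match exactly one trace."""
    unique = 0
    for pts in known_points:
        matches = sum(1 for t in traces if all(pt in t for pt in pts[:p]))
        unique += matches == 1
    return unique / SAMPLE


results = {p: unicity(p) for p in (1, 2, 3, 4)}
for p, share in results.items():
    print(f"{p} known point(s): {share:.0%} of targets are unique")
```

Because each larger set of known points contains the smaller one, unicity can only rise as p grows. How fast it rises on real data depends on resolution and density, which is exactly what the published studies measured.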

The deep lesson is this: as you add dimensions, the space of possible people explodes far faster than the population fills it. With 47 counties, 2 sexes, and 100 age values you already have 9,400 cells for ~50 million people, which is fine. But add occupation (say 500 categories), marital status (5), and education level (8), and you have 188 million cells for 50 million people. Most cells now contain zero or one person. The dataset has become a list of individuals wearing a thin disguise.
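
The arithmetic above is easy to reproduce and extend. The sketch below recomputes the cell counts and then, under the simplifying assumption that people are spread uniformly across cells (so occupancy is roughly Poisson), estimates how many cells hold zero or one person. The variable names are illustrative, and real populations are not uniform.

```python
import math

population = 50_000_000
coarse_cells = 47 * 2 * 100              # county x sex x age
fine_cells = coarse_cells * 500 * 5 * 8  # x occupation x marital status x education

# Assumption: people are spread uniformly over cells, so occupancy is ~Poisson.
mean_per_cell = population / fine_cells
share_empty = math.exp(-mean_per_cell)
share_single = mean_per_cell * math.exp(-mean_per_cell)
# Probability that a given person has a cell to themselves (nobody else lands there).
share_people_alone = math.exp(-mean_per_cell)

print(f"{coarse_cells:,} coarse cells, {fine_cells:,} fine cells")
print(f"mean people per fine cell: {mean_per_cell:.2f}")
print(f"cells with 0 or 1 person:  {share_empty + share_single:.0%}")
print(f"people alone in a cell:    {share_people_alone:.0%}")
```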

This is why the identifier is so often the behavior itself. Your commute, your spending rhythm, your search pattern, your pattern of government-service usage on eCitizen: these are not attributes attached to your identity. At sufficient resolution, they are your identity. There is no separate "name" to remove.

You think you're one in a million. In a rich dataset, you're one of one. The behavior is the identifier.

Section 5: Rebuilding Identity From Fragments (with Python)

Talk is cheap. Let's measure uniqueness on a synthetic eCitizen-style dataset so you can run the logic against your own data tomorrow.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
N = 1_000_000

counties   = [f"County_{i}" for i in range(47)]
occupations = [f"Occ_{i}" for i in range(300)]

df = pd.DataFrame({
    "age":        rng.integers(18, 80, N),
    "gender":     rng.choice(["M", "F"], N),
    "county":     rng.choice(counties, N),
    "occupation": rng.choice(occupations, N),
})

def uniqueness_report(df, quasi_identifiers):
    """For a set of quasi-identifiers, how identifying is the combination?"""
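    # Select one key column so transform("size") returns a row-aligned Series,
    # even when every column of df is used as a quasi-identifier.
    sizes = df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform("size")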
    pct_unique = (sizes == 1).mean() * 100          # rows that are 1-of-1
    pct_le_5   = (sizes <= 5).mean() * 100          # rows in a class of <= 5
    k_min      = sizes.min()                        # the dataset's k-anonymity
    print(f"{quasi_identifiers}")
    print(f"  records that are UNIQUE:        {pct_unique:5.1f}%")
    print(f"  records in a group of <= 5:     {pct_le_5:5.1f}%")
    print(f"  dataset k-anonymity (min group): {k_min}\n")

uniqueness_report(df, ["gender", "county"])
uniqueness_report(df, ["age", "gender", "county"])
uniqueness_report(df, ["age", "gender", "county", "occupation"])

Indicative output (exact figures shift slightly with the seed and library versions):

['gender', 'county']
  records that are UNIQUE:          0.0%
  records in a group of <= 5:       0.0%
  dataset k-anonymity (min group): 10408

['age', 'gender', 'county']
  records that are UNIQUE:          0.0%
  records in a group of <= 5:       0.0%
  dataset k-anonymity (min group): 121

['age', 'gender', 'county', 'occupation']
  records that are UNIQUE:         56.4%
  records in a group of <= 5:     100.0%
  dataset k-anonymity (min group): 1

Read that table slowly, because it is the entire argument in three rows.

With two coarse attributes, every person hides in a crowd of thousands. Safe. Also nearly useless: you can't tell anyone apart, which is the point of safety and the death of utility.
Add age, and the crowds shrink roughly sixty-fold, but the dataset's worst-case group is still 121 people. Mostly safe.
Add occupation (one more "harmless" column, the kind a researcher insists they need) and the million records are spread over about 1.75 million possible cells. More than half of the records are now unique, and essentially all of them sit in a group of five or fewer. The dataset's k-anonymity just fell to 1: at least one person is alone in their cell, fully exposed.

Note that this is uniformly random synthetic data, which is a generous case for privacy. Real data is correlated and skewed (surgeons cluster in cities, certain age-occupation combos are rare), so anyone in an uncommon combination is more exposed than this simulation suggests. The toy above is the optimistic version.

This is the mechanism behind the whole field: each additional attribute multiplies the number of cells, and uniqueness rises non-linearly. Anonymity isn't lost gradually as you add columns. It collapses.

Anonymity doesn't erode column by column. It holds, holds, holds, then collapses the moment you add the attribute your researcher swore they couldn't live without.

Section 6: Famous Privacy Failures (Technical Post-Mortems)

History is the best teacher here, because the failures rhyme. Same mechanism, different decade.

6.1 The Netflix Prize (2006–2010)

What happened. Netflix released ~100 million movie ratings from ~480,000 subscribers to crowdsource a better recommender, offering $1M. They replaced names with random IDs and perturbed some data, and declared it anonymous.

The technical failure. In Robust De-anonymization of Large Sparse Datasets (2008), Narayanan and Shmatikov showed that ratings data is sparse and high-dimensional: almost everyone's set of rated movies-with-dates is nearly unique. They cross-referenced the "anonymous" data with public IMDb reviews (the auxiliary dataset) and matched real people. Knowing as few as 8 ratings (2 possibly wrong) and rough dates re-identified 99% of records they tested.

Why anonymization failed. Sparsity. When each person's vector is almost unique, you don't need their name; you need any second source that shares a few data points. The release defended against the wrong threat model (someone with no outside information) instead of the real one (someone with a little).

Lesson. High-dimensional behavioral data, the most valuable kind for AI, is the hardest to anonymize and the easiest to link. Netflix cancelled the planned sequel competition after an FTC complaint and a lawsuit.
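
The matching idea behind this kind of attack can be sketched with a toy scoring function. This is a simplification, not the published algorithm (which, among other things, gives more weight to rarely rated titles and also uses dates); the records, titles, and the tolerance of one star are all invented for illustration.

```python
# Released "anonymous" ratings: pseudonym -> {movie: stars}. All data invented.
released = {
    "u001": {"Heat": 5, "Alien": 4, "Amelie": 2, "Brazil": 5, "Ran": 4},
    "u002": {"Heat": 3, "Titanic": 5, "Shrek": 4, "Alien": 2},
    "u003": {"Amelie": 5, "Titanic": 4, "Shrek": 3, "Ran": 1, "Brazil": 2},
}

# Auxiliary knowledge about one target, e.g. gleaned from public reviews.
# It is imperfect on purpose: "Brazil" is off by one star and "Titanic" is wrong.
aux = {"Heat": 5, "Brazil": 4, "Ran": 4, "Titanic": 1}


def score(record: dict, aux: dict, tolerance: int = 1) -> int:
    """Count auxiliary ratings that the record matches within the tolerance."""
    return sum(
        1
        for movie, stars in aux.items()
        if movie in record and abs(record[movie] - stars) <= tolerance
    )


ranked = sorted(released, key=lambda uid: score(released[uid], aux), reverse=True)
best, runner_up = ranked[0], ranked[1]
margin = score(released[best], aux) - score(released[runner_up], aux)
print(best, margin)
```

The attacker accepts a match only when the best candidate stands well clear of the runner-up. In sparse data that gap appears after very few known ratings, even noisy ones.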

6.2 AOL Search Logs (2006)

What happened. AOL Research published ~20 million queries from ~650,000 users "for research," replacing usernames with numbers.

The technical failure. They anonymized the user ID but published the queries verbatim. The content was the identifier. A user's stream of searches (their town, neighbors' names, ailments, the businesses near them) read like a diary. Reporters identified user #4417749 as a specific 62-year-old woman in Georgia within days, just by reading her searches and knocking on a door.

Why anonymization failed. They removed the label and kept the confession. Pseudonymizing the key while releasing rich free-text content is theater.

Lesson. If the payload is identifying, scrubbing the key does nothing. The data was withdrawn; researchers resigned; the dataset still circulates today, which is the other lesson: you cannot un-publish data.

6.3 The Strava Heatmap (2017–2018)

What happened. Strava published a global "heatmap" of aggregated, anonymized fitness activity: a billion activities, no individual tracks, just glowing lines of where people exercise.

The technical failure. Aggregation hides the individual but reveals the pattern. In empty deserts, the only glowing lines were soldiers jogging the perimeter of forward operating bases in Afghanistan and Syria, tracing patrol routes and base layouts. An analyst spotted it on a map. Aggregate data leaked operational secrets.

Why anonymization failed. Anonymizing who doesn't anonymize where and when. In sparse regions, the aggregate is sensitive. This is the precise risk in Kenya's proposed traffic-flow and mobility datasets: aggregate mobility can still reveal a specific person's commute in a thinly populated ward, or a sensitive facility's access pattern.

Lesson. "It's only aggregated" is the cousin of "it's only anonymized." Both are conditional, and the condition is density.
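
The density condition can be made concrete with a minimum-contributor check before an aggregate is released. The sketch below uses invented pings and a hypothetical threshold of three distinct users per grid cell; the cell names and the threshold are assumptions, not any platform's actual rule.

```python
from collections import defaultdict

# (user, grid_cell) activity pings; invented data. "desert_7" has one contributor.
pings = [
    ("a", "city_1"), ("b", "city_1"), ("c", "city_1"), ("d", "city_1"),
    ("a", "city_2"), ("e", "city_2"), ("f", "city_2"),
    ("g", "desert_7"), ("g", "desert_7"), ("g", "desert_7"),
]
MIN_CONTRIBUTORS = 3  # hypothetical release threshold

users_per_cell = defaultdict(set)
activity = defaultdict(int)
for user, cell in pings:
    users_per_cell[cell].add(user)
    activity[cell] += 1

# Publish a cell only if enough distinct people contributed to it.
heatmap = {
    cell: count
    for cell, count in activity.items()
    if len(users_per_cell[cell]) >= MIN_CONTRIBUTORS
}
suppressed = sorted(set(activity) - set(heatmap))
print(heatmap, suppressed)
```

Note that the check counts distinct contributors, not activity volume: three runs by one person still light up a cell. Thresholding like this is a mitigation rather than a guarantee, and it deletes exactly the sparse-region detail that some buyers want.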

6.4 Cambridge Analytica (2018)

What happened. A personality-quiz app harvested data from ~87 million Facebook profiles, mostly friends of the few hundred thousand who took the quiz, and fed psychographic targeting.

The technical failure (and the nuance). This wasn't classic re-identification; it was inference plus over-broad collection. Academic work (Kosinski & Stillwell) had already shown that mundane "likes" predict sensitive traits (sexuality, politics, personality) with startling accuracy. CA's lesson for our topic is the inference attack: even data you'd never call sensitive becomes sensitive once a model maps it to the things you actually wanted to hide.

Lesson. Anonymization assumes the sensitive attribute is a column you can remove. Inference makes the sensitive attribute derivable from the columns you kept. You cannot delete a prediction.

6.5 Location Data Brokers (ongoing)

What happened. A shadow industry buys "anonymous" location pings from apps and SDKs and resells them. The New York Times' One Nation, Tracked (2019) took one such "anonymized" file and trivially re-identified people because a phone that sleeps at one address every night and commutes to one office every day has announced its owner. In 2021, a U.S. priest was outed and forced to resign after a group bought "anonymized" app location data and traced his device to his home and to Grindr usage.

Why anonymization failed. Two points, home and work, usually identify a person. (Recall de Montjoye: four points → 95%.) Location data is intrinsically identifying because human movement is routine and routines are unique.

Lesson. There is no such thing as anonymous location data at useful resolution. There is only location data whose re-identification you haven't bothered to do yet.

Case	Data type	Auxiliary source	Root cause	One-line lesson
Netflix	Movie ratings	Public IMDb reviews	Sparsity / high dimensionality	Behavioral vectors are near-unique
AOL	Search queries	Common sense + a phone book	Identifying payload	Don't scrub the key, keep the confession
Strava	Aggregated GPS	A world map	Density-dependent aggregation	Aggregates leak in sparse regions
Cambridge Analytica	Profiles + likes	Predictive models	Inference, over-collection	You can't delete a prediction
Location brokers	GPS pings	Address/identity records	Routine = identity	"Anonymous location" is an oxymoron

Every one of these teams believed they had shipped anonymous data. Every one was wrong, some within days. The pattern isn't carelessness. It's the nature of the thing.

Section 7: The Privacy–Utility Tradeoff

By now the shape of the problem should be visible. Safety and usefulness are not independent dials. They are the two ends of one curve.

  PRIVACY
   ^
   |  * (suppress everything: perfect privacy, zero utility — a blank file)
   |   \
   |     \
   |       \
   |         \        <-- the frontier: every point is a real tradeoff
   |           \
   |             \
   |               \
   |                 *  (raw microdata: perfect utility, zero privacy)
   +---------------------------------------------> UTILITY

Everything in privacy engineering is a fight over where on this curve you sit, and how to push the curve outward (more privacy and more utility) with cleverer math.

Suppress and generalize aggressively → you slide up-left. Safe, useless. A table reporting "some adults live in Kenya" leaks nothing and teaches nothing.
Release rich microdata → you slide down-right. A goldmine for researchers, a goldmine for attackers, identical file.
Differential privacy, synthetic data, query interfaces → these bend the frontier, buying more utility per unit of privacy risk. They don't abolish the tradeoff. Nothing does.
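
One way to see the curve rather than take it on faith is to sweep a generalization parameter and measure both ends. The sketch below uses synthetic data and two stand-in metrics that are assumptions of this example: privacy is the smallest (age band, county) group, and utility loss is the mean error when an exact age is replaced by its band midpoint. Band widths are chosen so each divides the next, which makes the groups nest.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
N = 20_000
df = pd.DataFrame({
    "age": rng.integers(18, 80, N),
    "county": rng.choice([f"County_{i}" for i in range(47)], N),
})


def tradeoff(df: pd.DataFrame, width: int):
    """Return (k, error) when age is generalized into bands of the given width."""
    band_start = (df["age"] // width) * width
    # Privacy proxy: the smallest equivalence class on (age band, county).
    k = df.groupby([band_start, df["county"]]).size().min()
    # Utility proxy: how far the band midpoint sits from the true age, on average.
    midpoint = band_start + (width - 1) / 2
    error = (df["age"] - midpoint).abs().mean()
    return int(k), float(error)


WIDTHS = (1, 5, 10, 20, 100)
results = {w: tradeoff(df, w) for w in WIDTHS}
for w, (k, err) in results.items():
    print(f"band width {w:>3}: k = {k:>4}, mean age error = {err:.1f} years")
```

Wider bands can only merge groups, so k never falls as the width grows, while the age error climbs. Any real release has to pick a row in that printout and defend it.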

Why does value cling so stubbornly to the dangerous end? Because the questions people pay for are specific:

AI training wants the long tail: the rare, the unusual, the individual. That's where models learn the hard cases. The rare row is the valuable row and the identifiable row.
Fraud detection is literally the search for the anomalous individual. Aggregate it away and you've deleted the fraud.
Recommendation systems model you, not the average user.
Government planning done well needs sub-county, age-banded, sector-specific detail, which is exactly the granularity that re-identifies.

This is why "we'll only sell useful, anonymized data" is close to a contradiction in terms. The adjective and the participle are pulling in opposite directions.

Privacy and utility aren't in tension by accident. They're in tension by construction. The valuable row and the identifiable row are the same row.

Section 8: Why AI Makes Everything Worse

If linkage attacks are the classical threat, machine learning is the modern accelerant. AI changes the anonymization problem in four ways, all bad for the "it's only anonymized data" defense.

1. Inference replaces extraction. You no longer need the sensitive column in the data; a model infers it from the columns you kept. Gender, ethnicity, health status, pregnancy, sexual orientation, and political leaning have all been predicted from "neutral" features. Anonymization removes attributes. AI reconstructs them. Removing a field is now a speed bump, not a wall.

2. Foundation models memorize their training data. Large models trained on a corpus can be prompted to regurgitate verbatim training examples (names, phone numbers, snippets of private text), a failure mode documented in Extracting Training Data from Large Language Models (Carlini et al., 2021) and its successors. If a Kenyan dataset, however "anonymized," ends up in a training corpus and contains any re-identifiable structure, the model can become a leaky cache of it. You can't delete a record from a model the way you delete a row from a table.

3. Embeddings are reversible enough to worry. We comfort ourselves that turning text or images into vectors "anonymizes" them. But embedding-inversion research reconstructs substantial portions of the original input from its embedding, and membership-inference attacks determine whether a specific person's record was in the training set — itself a privacy breach when the dataset is, say, "patients with condition X."

4. Linkage at machine scale. The auxiliary-data problem from Section 3 was bad when a human did the join. ML does fuzzy, probabilistic linkage across messy datasets at population scale, tolerating typos and missing fields that would defeat a SQL JOIN. The adversary got a force multiplier.

The net effect: every assumption behind classical de-identification (the sensitive attribute is a removable column; vectors are safe; you need an exact match to link) is weakened by modern AI. Which is darkly ironic, because building African AI is one of the main reasons Kenya wants this data in the first place. The very capability that makes the data valuable makes the anonymization fragile.
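
Point 1 (inference replaces extraction) can be demonstrated with a very small model. The sketch below is a toy: the data is synthetic, the 70%/30% correlation between the hidden attribute and each "like" is an assumption chosen for illustration, and the classifier is a plain Bernoulli naive Bayes written in NumPy rather than anything an actual attacker is known to have used.

```python
import numpy as np

rng = np.random.default_rng(1)
N, F = 5000, 8

secret = rng.integers(0, 2, N)  # the sensitive column that gets "removed"
# Assumption: each innocuous feature is on 70% of the time when secret == 1
# and 30% of the time otherwise. Real correlations are messier.
p_on = np.where(secret[:, None] == 1, 0.7, 0.3)
likes = (rng.random((N, F)) < p_on).astype(int)

# The attacker knows the secret for a small auxiliary sample only.
aux, target = slice(0, 500), slice(500, N)
pos = likes[aux][secret[aux] == 1]
neg = likes[aux][secret[aux] == 0]

# Bernoulli naive Bayes with add-one smoothing, fitted on the auxiliary sample.
p1 = (pos.sum(axis=0) + 1) / (len(pos) + 2)
p0 = (neg.sum(axis=0) + 1) / (len(neg) + 2)
log_prior = np.log(len(pos) / len(neg))
x = likes[target]
log_odds = log_prior + (
    x * np.log(p1 / p0) + (1 - x) * np.log((1 - p1) / (1 - p0))
).sum(axis=1)

# Predict the "removed" column for everyone else in the release.
predicted = (log_odds > 0).astype(int)
accuracy = (predicted == secret[target]).mean()
baseline = max(secret[target].mean(), 1 - secret[target].mean())
print(f"guessing the majority class: {baseline:.0%}, inferring from 'likes': {accuracy:.0%}")
```

The released table never contains the sensitive column, yet the attacker recovers it for most rows from eight individually weak signals. Dropping the column changed the attacker's workflow, not the outcome.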

Classical anonymization removes attributes. AI reconstructs them, memorizes them, and links them at scale. We are defending a sandcastle against a rising tide we built ourselves.

Section 9: The Kenya Question

Now bring it home, concretely, to the systems from the first article: eCitizen, the civil and business registries behind it, the land and vehicle databases, KNBS microdata, and the Maisha Namba identity layer.

If Kenya is going to monetize "anonymized" datasets, four questions must be answered before any pricing tier is published.

1. Anonymized to what standard, certified by whom? "Anonymized" is not a technical specification. k-anonymity at k=5? Differential privacy at ε=1? Today the draft policy proposes ethics and quality standards but no binding, published de-identification threshold, and leaves unresolved whether the new Data Governance Council or the Office of the Data Protection Commissioner has the final say on what counts as adequately anonymized. Without a number, "anonymized" is a vibe, not a control.

2. Against which adversary, and which auxiliary datasets? Kenya has leaked datasets, voter rolls, scraped social media, telco data, and a fast-growing data-broker market. The relevant question is never "is this dataset safe in a vacuum?" It is "is this dataset safe against everything else that exists about Kenyans?" The traffic/mobility datasets in particular (Section 6.3, plus de Montjoye) should be treated as near-unanonymizable at useful resolution and handled, if at all, only through query interfaces, never bulk release.

3. What is the residual risk, and who is liable when it materializes? Re-identification risk is never zero; it is a probability you choose. So someone must own three numbers: the acceptable re-identification probability, the assessed actual probability per dataset, and the liability when a buyer (or a buyer's buyer) breaks it. The legal twist from the first article bites here: a successful re-identification retroactively converts "non-personal data" into a personal-data breach under the Data Protection Act and Article 31 of the Constitution. The marketplace would be selling latent liability priced as if it were inert.

4. Why release microdata at all when safer architectures exist? This is the architecture question, and it's where Kenya can actually win.

Model	What buyers get	Re-ID risk	Utility	Fit for Kenya
Bulk "anonymized" download	The raw-ish file	High (this whole article)	High	Avoid for anything granular
Aggregate open data (DP-protected)	Free statistics with a noise budget	Low	Medium	Yes — low-risk public-good tier
Query API / data clean room	Answers to vetted queries; data never copied	Low–Med (controllable)	High	Best for sensitive, high-value data
Synthetic data	Artificial records preserving structure	Low–Med (if generator is DP)	Med–High	Good for prototyping/ML, with care
Federated analytics	Models/answers, not data	Low	Med–High	Strong for cross-agency analytics

The recurring finding from the first article reappears in technical form: the safest and the most valuable strategies both point away from selling bulk microdata. Let approved Kenyan researchers, universities, and startups compute on the data inside controlled environments (query interfaces, clean rooms, federated analytics), capturing the insight while the raw asset (and its re-identification risk) never leaves national control. That is not just better privacy. It is better economics, because it keeps the value-add and the IP in Kenya instead of exporting a one-time file.

"Anonymized" without a threshold, an adversary model, and an owner of residual risk isn't a safeguard. It's a disclaimer the citizen never got to read.

Section 10: Modern Privacy Engineering (the actual toolbox)

So what do the techniques do, and what are their limits? This is the part to send your policy team.

10.1 k-anonymity

Idea. A release is k-anonymous if every record is indistinguishable from at least k−1 others on the quasi-identifiers. You get there by generalization (exact age → age band; ward → county) and suppression (dropping outlier rows).

RAW                              3-ANONYMOUS (k=3)
age  gender  county   dx         age    gender  county   dx
42   F       Nairobi  HIV+       40-49  F       Nairobi  HIV+
44   F       Nairobi  Flu        40-49  F       Nairobi  Flu
47   F       Nairobi  Diabetes   40-49  F       Nairobi  Diabetes

Now "a 40-something woman in Nairobi" maps to ≥3 records; you can't single one out on quasi-identifiers.

Limits. k-anonymity protects identity but not attributes. If all k records in a group share the same sensitive value, you've learned it without knowing which row is whom: the homogeneity attack. And background knowledge ("I know my neighbour isn't diabetic") shrinks the group.
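The definition can be checked mechanically. The sketch below is an illustration only: `k_of` and `age_band` are hypothetical helper names, and the rows are the three from the table above. It measures the smallest equivalence class before and after generalizing age into ten-year bands.

```python
from collections import Counter


def k_of(records, quasi_identifiers):
    """Size of the smallest equivalence class, i.e. the k actually achieved."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values(), default=0)


def age_band(age, width=10):
    """Generalize an exact age into a band such as '40-49'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


raw = [
    {"age": 42, "gender": "F", "county": "Nairobi", "dx": "HIV+"},
    {"age": 44, "gender": "F", "county": "Nairobi", "dx": "Flu"},
    {"age": 47, "gender": "F", "county": "Nairobi", "dx": "Diabetes"},
]
generalized = [{**r, "age": age_band(r["age"])} for r in raw]

QI = ("age", "gender", "county")
print(k_of(raw, QI))          # 1: every exact age is unique
print(k_of(generalized, QI))  # 3: all rows share 40-49 / F / Nairobi
```

Note that `k_of` looks only at the quasi-identifier columns. It says nothing about the `dx` column, which is exactly the gap the homogeneity attack exploits.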

10.2 l-diversity (and t-closeness)

Idea. Patch the homogeneity hole: require each equivalence class to contain at least l well-represented values of the sensitive attribute. t-closeness goes further: the distribution of the sensitive attribute within each group must stay within t of the global distribution.

BAD (k=3 but l=1: homogeneity leak)    GOOD (l=3: diverse sensitive values)
age    county   dx                     age    county   dx
40-49  Nairobi  HIV+                    40-49  Nairobi  HIV+
40-49  Nairobi  HIV+                    40-49  Nairobi  Diabetes
40-49  Nairobi  HIV+   <-- leaked       40-49  Nairobi  Flu

Limits. Hard to achieve without heavy distortion; still vulnerable to skew and similarity attacks; still a syntactic guarantee about a specific table, not a mathematical guarantee about an adversary.
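A minimal sketch of the check, assuming the simplest variant ("distinct" l-diversity, which only counts different sensitive values per class; the stricter "well-represented" variants are not modelled here). The helper name `distinct_l` is hypothetical, and the rows are the BAD and GOOD tables above.

```python
from collections import defaultdict


def distinct_l(records, quasi_identifiers, sensitive):
    """Smallest number of distinct sensitive values found in any equivalence class."""
    groups = defaultdict(set)
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups[key].add(r[sensitive])
    return min((len(values) for values in groups.values()), default=0)


QI = ("age", "county")
bad = [{"age": "40-49", "county": "Nairobi", "dx": "HIV+"}] * 3
good = [
    {"age": "40-49", "county": "Nairobi", "dx": "HIV+"},
    {"age": "40-49", "county": "Nairobi", "dx": "Diabetes"},
    {"age": "40-49", "county": "Nairobi", "dx": "Flu"},
]

print(distinct_l(bad, QI, "dx"))   # 1: knowing the group reveals the diagnosis
print(distinct_l(good, QI, "dx"))  # 3
```

Both tables are 3-anonymous on the quasi-identifiers; only the second one keeps an attacker guessing about the diagnosis.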

10.3 Differential privacy (DP), the only guarantee with a number

Idea. Instead of de-identifying rows, DP constrains outputs. An algorithm M is ε-differentially private if, for any two datasets D and D' differing in one person's record, and any set of possible outputs S:

Pr[M(D)   ∈ S]  ≤  e^ε · Pr[M(D') ∈ S]

In words: adding or removing any single person barely changes the probability of any output. So no released statistic can reveal much about any individual, regardless of what the attacker already knows. That last clause is the magic — DP is robust to all present and future auxiliary data. It defeats the auxiliary-information problem that kills every other method.

You achieve it by adding calibrated noise. For a counting query (sensitivity Δf = 1), the Laplace mechanism adds noise scaled to Δf/ε:

import numpy as np

def dp_count(true_count: int, epsilon: float) -> float:
    """ε-DP answer to a counting query (sensitivity = 1)."""
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# "How many people in Ward X have condition Y?"
print(dp_count(213, epsilon=0.5))   # ~213 ± a few; the individual is hidden in the noise
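Why noise at scale Δf/ε is enough follows from the definition in a few lines. The derivation below is the standard argument for the Laplace mechanism, written with f for the true query, b = Δf/ε for the noise scale, and p_D for the density of M(D); those symbols are notation introduced here, not taken from the article.

```latex
% Laplace mechanism: M(D) = f(D) + Y, with Y ~ Lap(b) and b = \Delta f / \varepsilon
\[
p_D(y) = \frac{1}{2b}\exp\!\left(-\frac{|y - f(D)|}{b}\right)
\]
% Ratio of output densities on neighbouring datasets D and D'
\[
\frac{p_D(y)}{p_{D'}(y)}
  = \exp\!\left(\frac{|y - f(D')| - |y - f(D)|}{b}\right)
  \le \exp\!\left(\frac{|f(D) - f(D')|}{b}\right)
  \le \exp\!\left(\frac{\Delta f}{b}\right)
  = e^{\varepsilon}
\]
% Typical error: the standard deviation of Lap(b) is \sqrt{2}\,b
\[
\sigma = \sqrt{2}\,\frac{\Delta f}{\varepsilon}
\qquad\text{so } \varepsilon = 0.5,\ \Delta f = 1 \text{ gives } \sigma \approx 2.83
\]
```

The first inequality is the triangle inequality; the second uses the definition of sensitivity. Integrating the density bound over any set S gives the ε-DP inequality stated earlier.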

The catches, stated honestly:

ε is a privacy budget, and it composes. Answer many queries and the ε's add up; spend the whole budget and privacy is gone. You must ration questions.
Smaller ε = more privacy = more noise = less utility. It is the Section 7 tradeoff, finally given a dial you can audit.
It's a guarantee about the mechanism, not a promise that any single output is "safe." And choosing ε is a policy decision masquerading as a technical one. The U.S. Census Bureau adopted DP for the 2020 census and the fight over ε was ferocious precisely because it is, in the end, a values question.
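The first catch, composition, is easy to sketch. The example below assumes basic sequential composition (the ε values of successive queries simply add) and uses the standard-library `random` module rather than NumPy so it runs anywhere. `PrivacyBudget` and `budgeted_count` are hypothetical names for illustration, not an existing library API.

```python
import random


class PrivacyBudget:
    """Tracks cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        self.total = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if self.spent + epsilon > self.total + 1e-12:  # tolerance for float error
            raise RuntimeError("privacy budget exhausted")
        self.spent += epsilon


def laplace(scale: float, rng: random.Random) -> float:
    """Laplace(0, scale) noise: the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def budgeted_count(true_count: int, epsilon: float,
                   budget: PrivacyBudget, rng: random.Random) -> float:
    """Charge the budget first, then answer a sensitivity-1 counting query."""
    budget.charge(epsilon)
    return true_count + laplace(1.0 / epsilon, rng)


budget = PrivacyBudget(total_epsilon=1.0)
rng = random.Random()
budgeted_count(213, 0.5, budget, rng)  # first query: allowed
budgeted_count(213, 0.5, budget, rng)  # second query: allowed, budget now spent
try:
    budgeted_count(213, 0.5, budget, rng)
except RuntimeError as err:
    print(err)  # privacy budget exhausted
```

The design choice worth noticing is that the budget is charged before the answer is computed, so a refused query leaks nothing. Tighter composition accounting exists, but simple addition is the conservative baseline.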

Pull quote: Differential privacy is the first privacy technology honest enough to print its own price tag. The price is called epsilon, and someone has to decide how much to spend.

10.4 Federated learning

Idea. Don't move the data to the model; move the model to the data. Each device/agency computes updates on its local data; only the updates (not raw records) are aggregated.

                +------------------------+
                |   Global model (w)     |
                +-----------+------------+
                            |  send w
        +-------------------+-------------------+
        v                   v                   v
  +-----------+       +-----------+       +-----------+
  | Hospital A|       | Hospital B|       |  County C |
  | local data|       | local data|       | local data|
  | train ->  |       | train ->  |       | train ->  |
  |  Δw_A     |       |  Δw_B     |       |  Δw_C     |
  +-----+-----+       +-----+-----+       +-----+-----+
        |  send Δw (gradients), NOT data    |
        +-------------------+-------------------+
                            v
                +------------------------+
                |  Secure aggregation +  |
                |  DP noise -> new w     |
                +------------------------+

Limits. Raw data stays put, but gradients leak. Gradient-inversion attacks can reconstruct training inputs from updates, so federated learning is only safe when combined with secure aggregation and DP noise on the updates. It's a powerful architecture, not a standalone shield.
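One aggregation round from the diagram can be sketched in plain Python. This is an illustration under loud assumptions: `federated_round` and `clip` are hypothetical names, `noise_std` is an arbitrary parameter that is not calibrated to any particular ε, and secure aggregation is not modelled (this toy server still sees each individual update, which a real deployment would prevent).

```python
import math
import random


def clip(update, max_norm):
    """Scale an update down so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(x * x for x in update))
    if norm <= max_norm:
        return list(update)
    return [x * max_norm / norm for x in update]


def federated_round(global_w, client_updates, max_norm, noise_std, rng):
    """Average clipped client updates, add Gaussian noise to the sum, apply to w."""
    if not client_updates:
        raise ValueError("need at least one client update")
    clipped = [clip(u, max_norm) for u in client_updates]
    summed = [sum(column) for column in zip(*clipped)]
    noisy = [s + rng.gauss(0.0, noise_std) for s in summed]
    n = len(clipped)
    return [w + s / n for w, s in zip(global_w, noisy)]


# Hypothetical updates from Hospital A, Hospital B, and County C.
updates = [[3.0, 4.0], [0.0, 1.0], [0.5, -0.5]]
new_w = federated_round([0.0, 0.0], updates, max_norm=1.0,
                        noise_std=0.1, rng=random.Random())
print(new_w)
```

Clipping bounds how much any one participant can move the model, which is what makes the added noise meaningful; without the clip, a single outlier update could dominate the average regardless of the noise.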

10.5 Synthetic data

Idea. Train a generative model on the real data and release fake records that preserve the statistical structure (correlations, distributions) without being any real person.

Limits. If the generator overfits, it memorizes and reproduces real individuals: re-identification with extra steps. Quality and privacy trade off (Section 7 again). The only synthetic data with a guarantee is DP-synthetic data, where the generator itself is trained under differential privacy. Synthetic ≠ safe by default.
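A crude first test for memorization is to ask how many synthetic rows are exact copies of real ones. The sketch below assumes flat records with hashable values; `exact_match_rate` is a hypothetical helper and the rows are invented for illustration.

```python
def exact_match_rate(synthetic, real):
    """Fraction of synthetic records that exactly duplicate some real record."""
    real_rows = {tuple(sorted(r.items())) for r in real}
    if not synthetic:
        return 0.0
    hits = sum(1 for s in synthetic if tuple(sorted(s.items())) in real_rows)
    return hits / len(synthetic)


real = [
    {"age": "40-49", "county": "Nairobi", "dx": "HIV+"},
    {"age": "30-39", "county": "Kisumu", "dx": "Flu"},
]
synthetic = [
    {"age": "40-49", "county": "Nairobi", "dx": "HIV+"},  # copied from real data
    {"age": "50-59", "county": "Mombasa", "dx": "Flu"},   # not in real data
]
print(exact_match_rate(synthetic, real))  # 0.5
```

A non-zero rate is a red flag, but a zero rate proves nothing: near-duplicates and rare combinations can still leak. That asymmetry is why only DP-trained generators come with an actual guarantee.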

10.6 Data minimization, the most underrated technique in the toolbox

Every method above is damage control applied after you've collected the data. Minimization is the only one that reduces risk at the source: don't collect what you don't need; don't keep it longer than you must; don't link what doesn't need linking.

It is unglamorous and it is the most effective privacy technology in existence, for a simple reason: the safest record is the one that was never created. There is no breach of a field you didn't store, no re-identification of a row that doesn't exist, no subpoena for data you discarded on schedule.
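Minimization is policy first, but the mechanics are simple enough to sketch. In the example below, `REQUIRED_FIELDS` and `RETENTION` are hypothetical values standing in for whatever a stated purpose and retention schedule would actually specify.

```python
from datetime import date, timedelta

# Hypothetical: the only fields the stated purpose needs.
REQUIRED_FIELDS = {"service", "county", "collected_on"}
# Hypothetical retention schedule.
RETENTION = timedelta(days=365)


def minimize(record):
    """Keep only the fields the purpose requires; everything else is never stored."""
    return {k: v for k, v in record.items() if k in REQUIRED_FIELDS}


def purge_expired(records, today):
    """Drop records older than the retention period."""
    return [r for r in records if today - r["collected_on"] <= RETENTION]


submitted = {
    "service": "passport renewal",
    "county": "Kisumu",
    "id_number": "12345678",        # invented value, dropped before storage
    "phone": "+254700000000",       # invented value, dropped before storage
    "collected_on": date(2025, 1, 1),
}
stored = minimize(submitted)
print(sorted(stored))  # ['collected_on', 'county', 'service']
```

Running `purge_expired` on a schedule is the second half: a record that ages out cannot be breached, linked, or sold later.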

And here is the structural tension this whole series keeps returning to: monetization is the natural enemy of minimization. The moment data is an asset on a balance sheet, every incentive flips toward collecting more, keeping it longer, and linking it wider — because inventory is revenue. India's reviewers named this before they killed their version of Kenya's policy. A government cannot be both the steward who minimizes and the vendor who maximizes inventory. Those are different organisms.

The safest record is the one that was never created. Every other privacy technique is just managing the risk you chose to take on.

Section 11: A Thought Experiment

Let's make the whole article concrete with the kind of "obviously harmless" release a marketplace might actually publish. Kenya releases a dataset with no names and only four fields:

age_range,county,occupation,travel_frequency
30-39,Nairobi,Cardiologist,Daily
50-59,Turkana,Member of County Assembly,Weekly
40-49,Kisumu,University Professor,Monthly

No name. No ID. No phone. "Anonymized." Could you still identify individuals? Walk through it as an attacker would.

Step 1. Count the population in the cell. How many cardiologists aged 30–39 work in Nairobi? Possibly dozens, but possibly not. The rarer the occupation, the smaller the cell. For a Member of the County Assembly in Turkana aged 50–59, the cell might contain one person. The occupation field is doing the work a name used to do. This is a uniqueness collapse: the Section 5 effect, live.

Step 2. Bring auxiliary data. Professional registries (medical board, bar association, IEBC records of elected officials), LinkedIn, university staff pages, news articles. Join on (occupation, county) the way we joined in Section 3. For public roles like elected officials, the auxiliary data is literally published by the state itself.

Step 3. Use the sensitive field as a discriminator. travel_frequency now reads as a behavioral attribute attached to a named individual: this specific professor travels monthly; this specific MCA travels weekly. If a later release adds destination or dates, you're in de Montjoye territory: four points, 95%.

Step 4. Iterate across releases. The marketplace won't sell one file; it'll sell many, over five years. Each is "anonymized" alone. But an attacker intersects them: the same rare cells recur, and overlapping releases let you triangulate. This is the differencing attack. Anonymization that holds per-release fails across the catalogue. (This is exactly why differential privacy budgets are tracked across all queries, not per query.)
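The differencing attack in Step 4 needs nothing more than subtraction. In this sketch the population and both "releases" are invented for illustration: two aggregate counts that each look harmless, published separately.

```python
def count_where(records, predicate):
    """An aggregate count of the kind a marketplace might publish."""
    return sum(1 for r in records if predicate(r))


MCA = "Member of County Assembly"

# Hypothetical population of one county and age band.
people = (
    [{"occupation": MCA, "travel": "Weekly"}]
    + [{"occupation": "Teacher", "travel": "Weekly"}] * 40
    + [{"occupation": "Teacher", "travel": "Monthly"}] * 25
)

# Release A: everyone who travels weekly.
release_a = count_where(people, lambda r: r["travel"] == "Weekly")
# Release B, sold later: the same count, excluding elected officials.
release_b = count_where(
    people, lambda r: r["travel"] == "Weekly" and r["occupation"] != MCA
)

print(release_a)              # 41
print(release_b)              # 40
print(release_a - release_b)  # 1: the lone MCA travels weekly
```

Neither count contains a small cell, so per-release suppression rules would pass both. The leak exists only in the pair.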

The punchline: a four-column, name-free dataset that any reasonable official would wave through as "obviously anonymous" can re-identify the rarest, often most powerful or vulnerable people in it: the specialist doctor, the elected official, the only professor of her kind in a county. Anonymization fails first for exactly the people most worth protecting.
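Steps 1 through 3 can be run as code. The sketch below uses the three rows from the release above plus one invented row, and a public registry whose names are placeholders; `unique_cells` and `link` are hypothetical helpers. It links a release row to a name only when the cell is unique on both sides.

```python
from collections import Counter

KEYS = ("county", "occupation")

release = [
    {"age_range": "30-39", "county": "Nairobi", "occupation": "Cardiologist", "travel_frequency": "Daily"},
    {"age_range": "50-59", "county": "Turkana", "occupation": "Member of County Assembly", "travel_frequency": "Weekly"},
    {"age_range": "40-49", "county": "Kisumu", "occupation": "University Professor", "travel_frequency": "Monthly"},
    # Invented extra row so that one cell is not unique.
    {"age_range": "30-39", "county": "Nairobi", "occupation": "Cardiologist", "travel_frequency": "Weekly"},
]

# Placeholder names standing in for a public registry or staff page.
registry = [
    {"name": "Example MCA", "county": "Turkana", "occupation": "Member of County Assembly"},
    {"name": "Example Professor", "county": "Kisumu", "occupation": "University Professor"},
    {"name": "Example Doctor 1", "county": "Nairobi", "occupation": "Cardiologist"},
    {"name": "Example Doctor 2", "county": "Nairobi", "occupation": "Cardiologist"},
]


def unique_cells(rows, keys):
    """Step 1: cells that contain exactly one record."""
    cells = Counter(tuple(r[k] for k in keys) for r in rows)
    return {cell for cell, n in cells.items() if n == 1}


def link(release_rows, registry_rows, keys):
    """Steps 2 and 3: join unique cells to names and read off the sensitive field."""
    names_by_cell = {}
    for person in registry_rows:
        names_by_cell.setdefault(tuple(person[k] for k in keys), []).append(person["name"])
    singles = unique_cells(release_rows, keys)
    matches = []
    for row in release_rows:
        cell = tuple(row[k] for k in keys)
        names = names_by_cell.get(cell, [])
        if cell in singles and len(names) == 1:
            matches.append((names[0], row["travel_frequency"]))
    return matches


print(link(release, registry, KEYS))
# [('Example MCA', 'Weekly'), ('Example Professor', 'Monthly')]
```

The two cardiologists stay hidden in each other's company. The MCA and the professor, alone in their cells, do not.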

Strip every name from the file and the rarest people in it are still wearing their occupation like a badge. Anonymization fails first for the people most worth protecting.

Conclusion: Anonymized Data Isn't. Or It Isn't Data.

We can now say precisely what Dwork's aphorism means, and why it is the truest sentence in privacy engineering.

"Anonymized data isn't" because any dataset rich enough to answer the questions people pay for retains the quasi-identifiers, the sparsity, and the behavioral fingerprints that make re-identification a JOIN away. Names are not where identity lives. Identity is the pattern, and you cannot sell the pattern while deleting the person: they are the same thing.

"Or it isn't data" because the only way to truly sever identity is to destroy so much structure (suppress, generalize, add noise until ε → 0) that the file no longer tells you anything worth knowing. Perfect anonymity is a blank page. It is perfectly safe and perfectly useless.

Between those poles is not a safe harbour but a frontier of tradeoffs, and every real release is a choice of where to stand on it — a choice about acceptable risk, made on behalf of people who never voted on their ε. That reframes the entire Kenyan debate. The question was never "personal or anonymized?" as if those were two boxes. The real questions are engineering and governance questions:

Anonymized to what measurable standard (k? ε?), certified by whom?
Safe against which adversary and which auxiliary datasets?
At what residual re-identification probability, owned by whom when it fails?
And — the question this series keeps arriving at — why release the microdata at all, when query interfaces, clean rooms, federated analytics, and DP-protected aggregates let Kenyans extract the value while the raw asset, and its risk, stay home?

Privacy, in the end, is not a property of a dataset. It is a property of the system — the techniques, the budget, the threat model, the institutions, and the trust — that surrounds it. You cannot buy it in a single transformation called "anonymize," and you cannot restore it after a breach with an apology.

Which is why the deepest lesson of this entire series is not technical at all.

The future of data governance will not be decided by how much data governments can collect. It will be decided by how much trust institutions can maintain while using it.

A government that understands this will stop asking "how do we anonymize it so we can sell it?" and start asking "how do we use it, under guarantees citizens can verify, so they never have to take our word for it?"

Because in the end, "we'll only sell anonymized data" was never a technical claim.

It was a request to be trusted.

And trust, unlike data, cannot be re-identified once it's gone.

Visual Suggestions

The uniqueness collapse curve: a line chart of % of records that are unique (y) vs. number of quasi-identifiers included (x), from Section 5. The line stays near zero, then shoots up. The single most persuasive image in the piece.
The linkage-attack diagram: two tables (anonymized hospital data; public registry) with arrows joining on age, gender, county, meeting at a third table where the diagnosis now has a name.
The privacy–utility frontier: the Section 7 curve, with real techniques plotted as points (raw microdata bottom-right; suppressed table top-left; DP-aggregate and clean-room bending the frontier outward).
"Four points, 95%" mobility graphic: a city map with four pins (home, office, mall, church) resolving to one highlighted person.
The epsilon dial: a single slider from "ε→0: useless & private" to "ε→∞: useful & exposed," annotating where Census-style (ε≈1–10) choices sit. Makes the budget tangible.
Architecture comparison: four side-by-side mini-diagrams (bulk download vs. query API vs. federated analytics vs. clean room), color-coded by re-ID risk.

References

Sweeney, L. (2002). k-Anonymity: A Model for Protecting Privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. (Also: Sweeney's ZIP/DOB/sex ~87% uniqueness result.)
Narayanan, A., & Shmatikov, V. (2008). Robust De-anonymization of Large Sparse Datasets (How to Break Anonymity of the Netflix Prize Dataset). IEEE S&P.
de Montjoye, Y.-A., Hidalgo, C. A., Verleysen, M., & Blondel, V. D. (2013). Unique in the Crowd: The Privacy Bounds of Human Mobility. Scientific Reports.
de Montjoye, Y.-A., et al. (2015). Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata. Science.
Machanavajjhala, A., et al. (2007). l-Diversity: Privacy Beyond k-Anonymity. ACM TKDD.
Li, N., Li, T., & Venkatasubramanian, S. (2007). t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. IEEE ICDE.
Dwork, C. (2006). Differential Privacy. ICALP. And Dwork & Roth (2014), The Algorithmic Foundations of Differential Privacy.
Ohm, P. (2010). Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization. UCLA Law Review. (Source of the "database of ruin" framing.)
Barbaro, M., & Zeller, T. (2006). A Face Is Exposed for AOL Searcher No. 4417749. The New York Times.
Hern, A. (2018). Fitness tracking app Strava gives away location of secret US army bases. The Guardian.
Carlini, N., et al. (2021). Extracting Training Data from Large Language Models. USENIX Security.
Shokri, R., et al. (2017). Membership Inference Attacks Against Machine Learning Models. IEEE S&P.
Thompson, S. A., & Warzel, C. (2019). One Nation, Tracked. The New York Times (Privacy Project).
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. PNAS.
U.S. Census Bureau. Disclosure Avoidance for the 2020 Census: differential privacy.
Ministry of Information, Communications and the Digital Economy (Kenya). Draft Final National Data Governance Policy (May 2026); Data Protection Act, 2019; Constitution of Kenya, Article 31.

(Citations are provided for verification and further reading; figures from the Kenyan policy reflect a draft under public consultation and should be checked against the final gazetted document.)
