Mwai Victor Brian

Posted on Jun 9

Anonymized Data Isn't. Or It Isn't Data

#data #dataprivacy #datagovernance

Why "don't worry, it's anonymized" might be the most comforting lie in tech

A technical follow-up to “Kenya Accidentally Discovered a Gold Mine and Immediately Started Asking Who Wants to Buy the Dirt.” If you haven’t read the original piece yet, start here: https://dev.to/code_with_mwai/kenya-accidentally-discovered-a-gold-mine-and-immediately-started-asking-who-wants-to-buy-the-dirt-594l

This article builds on one of its core arguments: anonymity.

Introduction

In the last article, I argued that Kenya is sitting on a gold mine of data and is about to sell the dirt.

The whole plan rests on five magic words.

"We'll only sell anonymized data."

It's a wonderful sentence.

It ends arguments.

It calms boards.

It reassures the public.

There's just one problem.

It's mostly not true.

Not because anyone is lying on purpose.

But because "anonymized" doesn't mean what almost everyone thinks it means.

There's an old saying among privacy researchers, usually credited to the cryptographer Cynthia Dwork:

> Anonymized data isn't. Or it isn't data.

Translation: a dataset is either useful, in which case it can probably be traced back to real people, or it's been scrubbed so hard that it's safe and useless.

You rarely get both.

This article is about why.

No heavy math. Just the idea, the evidence, and what it means for Kenya.

If you are a data professional, you can get the more technical article on data privacy here.

What People Think Anonymizing Means

Picture a simple list.

Name        Phone        Age   County
John Doe    0712345678   32    Nairobi
Jane Doe    0723456789   29    Kiambu

To "anonymize" it, you cross out the obvious stuff. Name. Phone.

Age   County
32    Nairobi
29    Kiambu

Done?

It feels done. No name. No number. Nobody can be hurt by "32, Nairobi."

But here's the trap.

Your identity was never only in your name.

Your identity is scattered across all the boring little details: your age, your sex, your county, your job, the day you visited a clinic.

On their own, each detail is harmless.

Together?

They point at exactly one person.

Crossing out the name is like hiding someone's face but leaving their fingerprints (their address, their job title, and their daily routine) on the table.

You didn't hide them.

You just made it slightly more work to find them.

I know you’ve heard the term digital footprint thrown around. And yes, it is exactly what it sounds like: your digital DNA.

Every click, search, location ping, and interaction becomes a data point. And in the world of data, no point is ever truly “small”: each one is a nucleotide in the larger strand that reconstructs who you are.

Anonymizing by deleting names is like hiding a face while leaving the fingerprints.

The Magic Trick Behind Every Privacy Disaster

Here's how people actually get re-identified. It's almost insultingly simple.

You take the "anonymous" dataset.

You find a second dataset that happens to share a few of the same details.

You match them up.

That's it. That's the whole trick.

Imagine an "anonymous" hospital list:

Age   Gender   County    Condition
42    Female   Nairobi   (something private)

No name. Safe, right?

Now imagine any ordinary public list with names on it: a staff directory, a professional registry, a voter roll, a LinkedIn page.

Name           Age   Gender   County
Mary Atieno    42    Female   Nairobi

Neither list has both the name and the private condition.

But line them up by age, gender and county…

…and suddenly Mary Atieno's private medical condition has her name on it.

No hacking. No password stolen. No breach.

Just two harmless lists and a bit of matching.
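An optional aside for readers who want to see the matching step in code. This is a minimal sketch, assuming the two toy tables above plus one extra made-up row in each (the second patient and "Peter Kamau" are hypothetical, added only so the match is not trivial). The function name `link` and the column names are illustrative choices, not anyone's real tooling.

```python
# Hypothetical toy data mirroring the two lists above.
hospital = [
    {"age": 42, "gender": "Female", "county": "Nairobi", "condition": "(something private)"},
    {"age": 35, "gender": "Male", "county": "Kiambu", "condition": "(something else private)"},
]
public = [
    {"name": "Mary Atieno", "age": 42, "gender": "Female", "county": "Nairobi"},
    {"name": "Peter Kamau", "age": 51, "gender": "Male", "county": "Nakuru"},
]

# The "boring" columns that both lists happen to share.
SHARED_COLUMNS = ("age", "gender", "county")


def key(row):
    """The combination of shared details used to line the two lists up."""
    return tuple(row[column] for column in SHARED_COLUMNS)


def link(anonymous_rows, named_rows):
    """Attach a name to an 'anonymous' row when exactly one named row shares its key."""
    names_by_key = {}
    for row in named_rows:
        names_by_key.setdefault(key(row), []).append(row["name"])

    matches = []
    for row in anonymous_rows:
        names = names_by_key.get(key(row), [])
        if len(names) == 1:  # one candidate only: the row is re-identified
            matches.append((names[0], row["condition"]))
    return matches


reidentified = link(hospital, public)
print(reidentified)  # [('Mary Atieno', '(something private)')]
```

Nothing here is cleverer than a dictionary lookup, which is the point: the attack needs no special access, only a second list that shares a few columns.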

And here's the scary part: you don't control the second list.

Every new public dataset, every leaked database, every social-media scrape becomes a new tool for unmasking your "anonymous" data.

So a dataset that's safe today can be cracked open tomorrow by a dataset that doesn't even exist yet.

You're not hiding people from today's world.

You're trying to hide them from every list that will ever be published.

That's a race you lose.

You can't un-publish data. Once it's out, it's out, and the tools to crack it only get better.

You Are Not as Average as You Think

The reason this keeps working is a fact that shocks almost everyone.

People feel like one of millions.

In data, you are usually one of one.

A famous study looked at people's movement: just the rough place and time of their phone activity.

How many of those little dots do you need to pick one specific person out of one and a half million?

Four.

Not four hundred.

Not forty.

Four.

In that study, four points were enough to single out about 95% of the people in the dataset.

Think about your own day:

Home in the morning.
Work by nine.
That one café you always go to.
Church on Sunday.

Congratulations. There is almost certainly nobody else on Earth with your exact pattern.
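An optional aside for readers who want to see the narrowing-down in code. This is a minimal sketch with four made-up users and made-up zones and times (none of it comes from the study). It only shows the mechanic: each extra known point shrinks the set of people who could have produced it.

```python
# Hypothetical toy traces: each user is a set of (place, time) points.
traces = {
    "user_a": {("Zone 12", "Mon 07h"), ("Zone 3", "Mon 09h"), ("Zone 5", "Mon 13h"), ("Zone 8", "Sun 10h")},
    "user_b": {("Zone 12", "Mon 07h"), ("Zone 3", "Mon 09h"), ("Zone 7", "Mon 13h"), ("Zone 8", "Sun 10h")},
    "user_c": {("Zone 12", "Mon 07h"), ("Zone 4", "Mon 09h"), ("Zone 5", "Mon 13h"), ("Zone 9", "Sun 10h")},
    "user_d": {("Zone 2", "Mon 07h"), ("Zone 3", "Mon 09h"), ("Zone 5", "Mon 13h"), ("Zone 8", "Sun 10h")},
}


def candidates(traces, known_points):
    """Users whose trace contains every point the observer already knows."""
    known = set(known_points)
    return sorted(user for user, points in traces.items() if known <= points)


# What an observer might learn about one person, one point at a time.
known = [("Zone 12", "Mon 07h"), ("Zone 3", "Mon 09h"), ("Zone 5", "Mon 13h"), ("Zone 8", "Sun 10h")]

# Number of users still consistent with the first 1, 2, 3 and 4 known points.
counts = [len(candidates(traces, known[:n])) for n in range(1, len(known) + 1)]
print(counts)  # [3, 2, 1, 1]
```

In this toy case three points already leave a single candidate. Real datasets are far larger, but the crowd shrinks in the same way.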

The same thing is true of:

The way you spend money.
The things you search for.
The mix of government services you use.

This is the deepest idea in the whole article, so let me say it plainly:

Your behaviour is your name.

You don't need an ID number when your daily routine already belongs to you and you alone.

And that's the cruel twist for Kenya's plan, because one of the datasets reportedly up for sale is traffic and mobility data.

In the privacy world, that's not the easy stuff.

That's the most dangerous data there is.

How One Extra Column Blows It All Up

Here's the part policymakers should tape to their wall.

Anonymity doesn't fade away slowly as you add details.

It holds, and holds, and holds, and then collapses all at once.

Picture a dataset of a million Kenyans.

With just gender and county, everyone hides in a crowd of thousands. Totally safe. Also totally useless — you can't tell anyone apart.
Add age, and a few unusual people start to stand out, but most are still safe.
Add one more detail, occupation, and suddenly a quarter of everyone is unique, and most of the rest sit in tiny groups of five or fewer.

One extra column. The exact kind of "but I really need this field" column a researcher always asks for.

And the whole thing falls over.

The lesson: every useful detail you keep is also a detail that helps unmask someone.

Usefulness and safety are pulling on the same rope, in opposite directions.

Anonymity doesn't erode. It holds, then collapses the instant you add the one column someone insisted they needed.
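An optional aside for readers who want to measure this themselves. The sketch below counts how big each "crowd" is as columns are added. The data is synthetic and uniformly random (20,000 made-up rows, 47 counties, ages 18 to 80, 300 made-up occupation codes), so its percentages are an illustration of the shape of the collapse, not a reproduction of the figures above or of any real Kenyan dataset.

```python
import random
from collections import Counter


def group_sizes(rows, columns):
    """Count how many rows share each combination of values in `columns`."""
    return Counter(tuple(row[column] for column in columns) for row in rows)


def fraction_unique(rows, columns):
    """Share of rows whose combination of `columns` appears exactly once."""
    sizes = group_sizes(rows, columns)
    return sum(1 for size in sizes.values() if size == 1) / len(rows)


def smallest_group(rows, columns):
    """Size of the smallest crowd anyone hides in (the k in k-anonymity)."""
    return min(group_sizes(rows, columns).values())


# Synthetic, uniformly random people. Real populations are more skewed.
rng = random.Random(0)
rows = [
    {
        "gender": rng.choice(["F", "M"]),
        "county": rng.randrange(47),
        "age": rng.randint(18, 80),
        "occupation": rng.randrange(300),
    }
    for _ in range(20_000)
]

steps = [
    ("gender", "county"),
    ("gender", "county", "age"),
    ("gender", "county", "age", "occupation"),
]
report = [(cols, fraction_unique(rows, cols), smallest_group(rows, cols)) for cols in steps]
for cols, unique, k in report:
    print(f"{len(cols)} columns: {unique:.1%} unique, smallest group = {k}")
```

One property holds for any dataset, not just this synthetic one: adding a column can only split groups, never merge them, so the share of unique people never goes down as detail is added.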

The Times the World Found Out the Hard Way

This isn't theory. It keeps happening. Same mistake, new decade.

Netflix. Years ago, Netflix released "anonymous" movie ratings for a competition. Researchers matched them against public film reviews online and unmasked real people — revealing things as private as their politics and sexuality. From a list of movie ratings.

AOL. A search company once published millions of "anonymous" searches, swapping names for numbers. But they left the searches themselves intact. Reporters read one person's stream of searches (her town, her ailments, her neighbours' names) and knocked on her door within days. The searches were the identity.

Strava. A fitness app published a glowing global map of where people exercise: fully aggregated, no individuals. Except in empty deserts, the only glowing lines were soldiers jogging around secret military bases. The map revealed the bases. "Aggregated" leaked national secrets.

Location brokers. A whole industry sells "anonymous" phone-location data. But a phone that sleeps at one house every night and goes to one office every day has basically announced its owner. Journalists and snoops have re-identified people from supposedly anonymous location trails, including a priest who was forced to resign.

Notice the pattern.

Every one of these teams genuinely believed they had shipped anonymous data.

Every one was wrong.

Not because they were careless.

Because that's the nature of the thing.

Every team thought their data was anonymous. Every team was wrong, some within days.

And Then AI Showed Up and Made It Worse

Just as we were losing this fight, artificial intelligence arrived to make it harder.

Old-school anonymizing assumed the private fact was a column you could delete.

AI doesn't need the column.

It can guess the private fact from the boring ones, predicting health, ethnicity, sexuality, or politics from data that looks completely innocent.

You can delete a field.

You can't delete a prediction.

And big AI models have a nasty habit: feed them data, and they sometimes memorize it, coughing real names and numbers back out later when prompted.

So the very thing Kenya wants this data for (building African AI) is also the thing that makes "anonymized" hardest to guarantee.

We're building the tide that's washing away our own sandcastle.

So What Should Kenya Actually Do?

Here's the good news. There's a smarter path, and it's not complicated.

Stop asking: "How do we anonymize it enough to sell it?"

Start asking: "How do we let people use it without handing over the raw data at all?"

Three ideas do most of the work.

1. Don't sell the file. Sell the answer.
Instead of shipping a dataset out the door, let approved researchers ask questions and get answers back while the actual data never leaves the government's vault. Capture the insight, keep the risk at home. (Engineers call these "data clean rooms" and "query interfaces." You don't need to remember the names. Just the idea: visitors compute on the data; they don't take it.)

2. Add a little honest noise.
There's a technique, used by the US Census Bureau and by Apple and Google, that adds tiny, carefully measured "static" to published statistics. Enough to hide any single person, not enough to ruin the big picture. It's the first privacy tool honest enough to come with a dial you can actually set and audit. (Engineers call it "differential privacy.")

3. Collect less in the first place.
The single best privacy technology ever invented is not collecting the data you don't need.

You can't leak a record that doesn't exist.

You can't unmask a person you never logged.

Boring? Yes. Unglamorous? Completely.

Also the most effective thing on the list.

And this is exactly where selling data becomes dangerous. The moment data is money, every office has a reason to collect more of it, keep it longer, and link it wider, because more data means more to sell.

A government can't be both the careful guardian who collects less and the eager vendor who hoards more.

Those are two different animals.

The safest record is the one you never collected. Everything else is just managing a risk you chose to take.
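An optional aside for readers who want to see ideas 1 and 2 working together. This is a minimal sketch, not a production design: the class name `Vault`, its methods, and the sample rows are all hypothetical. It assumes the simplest case, a counting question, where adding or removing one person changes the true answer by at most 1, and it adds Laplace noise scaled to the privacy "dial" (usually called epsilon). Real systems should rely on a vetted differential-privacy library, since naive floating-point noise like this has known weaknesses.

```python
import random


class Vault:
    """Hypothetical query interface: callers get noisy counts, never rows."""

    def __init__(self, rows, total_epsilon, seed=None):
        self._rows = list(rows)
        self._remaining = total_epsilon  # total privacy budget for all queries
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Answer 'how many rows match?' with Laplace noise of scale 1/epsilon."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise PermissionError("privacy budget exhausted")
        self._remaining -= epsilon

        true_count = sum(1 for row in self._rows if predicate(row))
        # The difference of two exponentials with rate epsilon is
        # Laplace-distributed with mean 0 and scale 1/epsilon.
        noise = self._rng.expovariate(epsilon) - self._rng.expovariate(epsilon)
        return true_count + noise


# Made-up records; the visitor never sees them, only the answers.
records = [{"county": "Nairobi"}] * 120 + [{"county": "Kiambu"}] * 80
vault = Vault(records, total_epsilon=1.0, seed=7)

first = vault.noisy_count(lambda row: row["county"] == "Nairobi", epsilon=0.5)
second = vault.noisy_count(lambda row: row["county"] == "Kiambu", epsilon=0.5)
print(round(first), round(second))  # near 120 and 80, but not exactly

try:
    vault.noisy_count(lambda row: True, epsilon=0.5)
    refused = False
except PermissionError:
    refused = True  # the budget is spent, so the vault stops answering
```

The budget is the auditable part. A smaller epsilon means more noise per answer, and once the total is spent the vault refuses further questions, so repeated querying cannot average the noise away.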

A Quick Thought Experiment

Say Kenya releases a "safe" dataset with no names — just four columns:

Age range   County    Job                         Travels
30-39       Nairobi   Cardiologist                Daily
50-59       Turkana   Member of County Assembly   Weekly
40-49       Kisumu    University Professor        Monthly

No name. No ID. Surely anonymous?

Ask one question: how many 50-something Members of the County Assembly are there in Turkana?

Probably… one.

That person is now fully exposed — their travel habits, attached to their name, by anyone with a newspaper and an internet connection.

The job title did the work the name used to do.

And notice who gets exposed first: the rarest people. The specialist doctor. The elected official. The only professor of her kind in the county.

Anonymization fails first for exactly the people who are most powerful or most vulnerable.

The Bottom Line

So why is "anonymized data isn't, or it isn't data" the truest line in this whole debate?

Because if the data is useful, it can usually be traced back to real people.

And if you scrub it until it truly can't, it stops telling you anything worth knowing.

There's no magic word called "anonymize" that gives you both safety and value at once. There's only a choice about how much risk to accept: a choice usually made for citizens, by people they'll never meet, about data the citizens themselves created.

Which means the real Kenyan question was never "personal data or anonymized data?"

It was always:

Anonymized how, and proven by whom?
Safe against which snoop, with which other datasets?
And who takes the blame when someone gets unmasked?

Privacy isn't a setting you switch on once.

It's something a country either earns and protects or loses and can't get back.

And that's the thought I want to leave you with.

The future of data in Kenya won't be decided by how much data the government can collect.

It'll be decided by how much trust our institutions can keep while using it.

Because "we'll only sell anonymized data" was never really a technical promise.

It was a request to be trusted.

And trust, unlike data, can't be re-identified once it's gone.

Author: Mwai Victor

This is Part Two of a series. Part One, “Kenya Accidentally Discovered a Gold Mine and Immediately Started Asking Who Wants to Buy the Dirt,” focused on the economics and policy implications.

For readers who want to go deeper, there is also a separate technical edition of this discussion, covering the code, mathematics, and engineering behind the arguments made here.

If you’ve made it this far, whether you’re a data professional or just curious, I recommend continuing to the technical overview:
Technical Overview of Data Privacy

Top comments (2)

leslie angu • Jun 10

I hadn't realized that anonymous data that was collected, such as cookies, meant either they could track my search patterns or they couldn't. This article has demystified how we understand anonymity. It has broken down the constructs and jargon used by tech leads to mask their true intent with the data they collect. This truly shows how data is the new gold and everyone is racing towards tapping it. @code_with_mwai - you are the true definition of a tech journalist who takes time to review and learn about hidden information in the tech space. This information can be used to improve or take advantage of the people who don't understand all the buzzwords and jargon. I hope to see more articles like this.

Mwai Victor Brian • Jun 10

Thank you for the review, @leslie_angu.

I'm a firm believer in looking beyond the headlines and digging into the substance of an issue. Whenever I explore a topic, I try not only to ask the relevant questions but also to investigate and answer them. That's where the most interesting insights are often found.