J.S_Falcon

How I Built a Masking Tool Without Showing AI Any Real Data: Column-wise Shuffling as the Scaffold

TL;DR

  • I never write code or send real data to LLMs — but I built a complete data-masking tool through AI collaboration.
  • The technique: column-wise independent shuffling (Japan PPC's official anonymization method) plus Faker replacement.
  • Four phases: send column names → run shuffling batch → manually craft sample CSV → send sample for Faker batch + structural review.
  • Key discipline: survey naive ideas in industry terminology before having AI implement — that alone compresses code 10x.
  • The output is a tool I trigger by double-click. I never read the Python.

1. The "Can't Send to LLM" Wall

Across my field notes, I've kept saying the same things:
"Don't send business data to LLMs."
"Only sanitized samples go to AI."

But how exactly do I sanitize the data?
That methodology has never been spelled out. So here it is —
a self-asked, self-answered post.


I wanted to build a new masking tool. I wanted to discuss it
with Claude or Gemini, showing real data and asking
"how would you mask this column?"

But the rule is firm: no business data goes to LLMs.

Just describing the logic verbally doesn't land —
LLMs need to see the data shape.
Hand-crafting fake data is torture (you have to reproduce
empty-cell patterns, spelling variants, full-width/half-width
character mixes, and so on).

What I needed: data that looks real but can't identify anyone.

2. The Naive Idea: Column-by-Column Shuffle

My first idea was simple:

"What if I shuffle each column independently?"

If you shuffle each column on its own:

  • Each value remains real (format perfectly preserved)
  • Row-level combinations are destroyed (records can't be reconstructed)
  • Per-column statistical properties are preserved (distributions intact)

For 100 customer rows, shuffle the name column, address column,
and amount column separately. The combination
"John Smith / 123 Main St / $12,345" disappears,
but each value still exists somewhere.

That should make individual identification impossible.
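
For the record, the naive idea fits in one line of pandas. A minimal sketch of the technique itself (toy data, not the tool that gets built later):

```python
import pandas as pd

df = pd.DataFrame({
    "name":    ["John Smith", "Ann Lee", "Bob Kane"],
    "address": ["123 Main St", "9 Oak Ave", "42 Elm Rd"],
    "amount":  [12345, 980, 301],
})

# Shuffle every column independently: each value and each column's
# distribution survive, but no original row can be reassembled.
masked = df.apply(lambda col: col.sample(frac=1).reset_index(drop=True))
print(masked)
```

Each `col.sample(frac=1)` call draws fresh randomness, so every column gets its own permutation.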

But before implementing, I surveyed first.

3. The Survey Reveals: Industry Standard

"Naive idea → immediate implementation" is forbidden discipline
(see my earlier field guide on
ops discipline in AI-assisted coding).
Translate the naive idea into industry terminology, then search.

Searching "column-wise shuffle + anonymization + technical term":

Column-wise Independent Shuffling
A de-identification technique offered by Oracle Data Safe,
Talend, Tonic.ai, and others.

And surprisingly, Japan codifies it too:

Japan's Personal Information Protection Commission (PPC) lists
"shuffling" explicitly in its anonymization guidelines:
"Probabilistically swapping records constituting the personal
information database among themselves."

So my naive idea was literally PPC's official method.
Survey complete. Time to implement.

4. AI Collaboration in Four Phases

A premise I should make explicit — I don't write a single line of code.
As a vibe coder, I have AI write it for me.

But the rule "no business data to LLMs" applies, so I can't just send
the real data and say "shuffle this please."
So I do it in four phases.


Phase 1: Send Only Column Names → Get a Tool Built

I can't send the real data, but I can send the column names
(structure, not PII).

Prompt to LLM:

Schema: customerID / name / address / building name / company / amount
Requirement: Shuffle each column independently, destroy row combinations
Build it as a batch file (.bat) that runs on double-click

The LLM produced a batch file + internal script + input/output folders
as a complete bundle. What lands on my desk: a tool that runs on double-click.
I don't read the Python inside.
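
I don't read it, but for the curious, the internal script plausibly reduces to something like this (a hypothetical reconstruction; folder and file names are illustrative, not the generated code):

```python
# mask_tool.py -- hypothetical script the .bat wrapper would invoke
from pathlib import Path

import pandas as pd

INPUT_DIR = Path("input")
OUTPUT_DIR = Path("output")
OUTPUT_DIR.mkdir(exist_ok=True)

for csv_path in INPUT_DIR.glob("*.csv"):
    df = pd.read_csv(csv_path)
    # Column-wise independent shuffle, as in the one-liner above
    shuffled = df.apply(lambda col: col.sample(frac=1).reset_index(drop=True))
    shuffled.to_csv(OUTPUT_DIR / csv_path.name, index=False)
```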

Phase 2: Verify Operation → One Bug Surfaces

I drop real data into the input folder, double-click the batch file,
open the output CSV in Excel.

Something's off. The shuffle supposedly ran, but row-level
combinations look intact: each output row still matches an original record.

I report to the LLM:

The double-click ran fine, but the output CSV doesn't look shuffled.
Each row resembles the original order.

The LLM's instant reply:

The internal seed is shared across all columns. We need a different
seed per column. Fixing.

I receive the fixed batch file, double-click → combinations are now
destroyed. OK.

What looked correct on paper failed in practice.
The AI confidently said "looks right on paper" too,
so practical verification is the human's role.
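
I never saw the code, but the reported cause translates into pandas terms roughly like this (a hedged reconstruction of the bug and the fix, not the actual generated script):

```python
import pandas as pd

df = pd.read_csv("input/customers.csv")  # hypothetical path

# Buggy: one shared seed gives every column the SAME permutation,
# so whole rows are merely reordered and combinations survive intact.
buggy = df.apply(
    lambda col: col.sample(frac=1, random_state=42).reset_index(drop=True)
)

# Fixed: a distinct seed per column yields independent permutations,
# which is what actually destroys the row-level combinations.
fixed = pd.DataFrame({
    name: df[name].sample(frac=1, random_state=i).reset_index(drop=True)
    for i, name in enumerate(df.columns)
})
```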

Phase 3: Build the Sample CSV

From the shuffled output, I pull just 10 rows and manually replace the
surnames and building names with arbitrary characters in Excel.
This erases the last traces of real data.

The sample CSV now has only the column structure and shape of data —
no real-data trace remains. Only at this point does it become material
I can send to the LLM.

Phase 4: Send the Sample CSV → Get the Faker Batch Built

I send the sample CSV to the LLM with a follow-up request:

Based on this sample, add a Faker-based replacement step for
name / address / building / company. Same batch file should handle it.

The LLM integrated Faker (ja_JP locale, but the same applies in any
locale) and, for fields Faker doesn't support (e.g., apartment building
names like "Alpha Omega Place"), wrote a custom generator using
katakana + suffixes (producing names like "Nikikenawatower").
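
A minimal sketch of what that replacement step might look like (the katakana generator and column name are illustrative, not the actual generated code):

```python
import random

from faker import Faker

fake = Faker("ja_JP")  # swap the locale to match your data

# Fields Faker covers directly
print(fake.name(), fake.address(), fake.company())

# Fields Faker doesn't cover: a custom building-name generator
# built from random katakana plus a common suffix.
KATAKANA = list("アイウエオカキクケコサシスセソタチツテトナニヌネノ")
SUFFIXES = ["タワー", "ハイツ", "レジデンス"]

def fake_building() -> str:
    return "".join(random.choices(KATAKANA, k=4)) + random.choice(SUFFIXES)

# Replace a whole column, e.g.:
# df["building name"] = [fake_building() for _ in range(len(df))]
```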

While reading the sample, the LLM also noticed:

Your "product" keyword rule for Faker-replacement is over-matching:
"productID", "productStock", "productCategory" are getting hit too.
Switch to a two-stage detection (include + exclude keywords).

This wasn't a perspective I would have spotted alone.
I use AI twice — once for the shuffling batch (built from column
names alone), and once for the Faker batch + structural review (built
from the sample CSV).

The LLM rewrote the matching logic from "keyword-match → apply" into
"keyword-match → exclusion-check → apply" before producing the Faker
batch. Double-click the new batch file → Faker processing completes
without any over-match. Done.
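
In code, the two-stage detection might reduce to something like this (the keyword lists are assumptions for illustration; the real generated logic may differ):

```python
# Stage 1: include-keyword match; Stage 2: exclusion check.
INCLUDE_KEYWORDS = ("product",)
EXCLUDE_KEYWORDS = ("productid", "productstock", "productcategory")

def should_faker_replace(column: str) -> bool:
    lowered = column.lower()
    if not any(k in lowered for k in INCLUDE_KEYWORDS):
        return False                      # stage 1: no keyword hit
    return not any(k in lowered for k in EXCLUDE_KEYWORDS)  # stage 2

# e.g. [c for c in df.columns if should_faker_replace(c)]
# keeps "productName" but skips "productID", "productStock", "productCategory".
```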

The Four-Phase Role Split

| Phase | What I do | What AI does |
| --- | --- | --- |
| Phase 1: Build shuffling batch | Send column names as prompt | Build the complete batch tool |
| Phase 2: Verify operation → Fix bug | Click / verify in Excel / report | Identify bug cause and fix |
| Phase 3: Build sample CSV | Pull 10 rows / manually edit surnames and building names | (not involved) |
| Phase 4: Build Faker batch | Send sample CSV to LLM / click / verify | Build Faker batch + structural review (resolve over-match) |

I never read the Python. I never send real data to LLMs.
Double-click → open in Excel → report to AI. Four phases through
this loop and the tool is finished.

This is what AI collaboration looks like.

5. Legal Positioning (Internal Use OK, Outsourcing Gets Tricky)

A brief note on the legal positioning.

Internal use (LLM discussion / internal analysis) is generally fine.
The scrambled output is unidentifiable enough to substantially reduce
privacy risk in most jurisdictions.

Handling client data as a contractor is where it gets tricky.
The framing differs by jurisdiction:

  • Japan classifies this as "entrusted processing" under Article 27(5)(i) of the Personal Information Protection Act, an exception to third-party transfer rules.
  • EU/UK treats it as a Data Processor / Data Controller relationship under GDPR, with a Data Processing Agreement (DPA) under Article 28 specifying the processing scope.
  • US uses HIPAA's Business Associate Agreement (BAA) for healthcare data, or contractual data-handling clauses for general PII.

The common pattern: contract language determines compliance.
Whichever jurisdiction you operate in, have your legal team review
the "sanitization purpose and scope of use" clauses explicitly.

In short: contracts get complicated, so legal review is recommended
for contract work. I won't go deeper than that here.

6. Closing

What the AI collaboration era needs is a scaffold tool that converts
"data you can't share" into "samples you can share."
That tool fits in one line of pandas plus Faker.

And the post's thesis:

Don't have AI implement your naive idea immediately —
survey it in industry terminology first, then implement.

That discipline compresses code volume by 10x.

If I hadn't known that "column-wise shuffle" is PPC's official
"shuffling" method, I would have asked the LLM to "generate random
names with Faker, build a consistency dictionary, maintain referential
integrity, ..." — full from-scratch implementation.
In reality, one line of pandas was enough.

Survey-driven discipline. In the AI collaboration era, what matters
on the human side is the ability to cross-reference industry knowledge.
