I dropped this for three months. Here's what I added when I came back.
I started MaskOps in March. It masks PII in Polars DataFrames using Rust — no Python per row, no NLP models, just regex running on Arrow buffers.
Then I got hired. Cencosud S.A. The project sat untouched until last week.
Coming back to it, I had a backlog. I shipped the item I kept thinking about at work: mask_pii_audit.
The problem with masking alone
Masking answers "is this field safe to store?" It doesn't answer "what kind of PII just came through, and how much of it?"
Compliance teams need both. They need the masked value and a count of what was found — by family — without running the column twice.
What mask_pii_audit does
It returns a nested Struct: the masked text, plus a count for each of the 33 PII families.
import polars as pl
import maskops
df = pl.DataFrame({"notes": [
    "Call me at 555-123-4567. SSN: 123-45-6789.",
    "IBAN: DE89370400440532013000",
    "Nothing here.",
]})
result = (
    df
    .with_columns(maskops.mask_pii_audit("notes").alias("audit"))
    .unnest("audit")
)
print(result.select("masked", "counts"))
┌───────────────────────────────┬────────────────────────────┐
│ masked                        ┆ counts                     │
╞═══════════════════════════════╪════════════════════════════╡
│ Call me at ***-***-****. SSN… ┆ {"phone": 1, "ssn": 1, …}  │
│ IBAN: DE89******************  ┆ {"iban": 1, …}             │
│ Nothing here.                 ┆ {"phone": 0, "ssn": 0, …}  │
└───────────────────────────────┴────────────────────────────┘
The masked output is the same as mask_pii. A count of zero means no match for that family.
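Since zeros mean "nothing found", pulling out the families that actually matched is a one-line filter. Here's a small plain-Python sketch of that. The `rows` list is hand-written to mirror the three rows above, cut down to three families, and the zeros for families not shown in the printed output are my assumption. `families_found` is a hypothetical helper, not part of MaskOps.

```python
# Per-row counts modeled on the example output, abbreviated to three families.
# In practice these dicts would come from the "counts" struct of each row.
rows = [
    {"phone": 1, "ssn": 1, "iban": 0},
    {"phone": 0, "ssn": 0, "iban": 1},
    {"phone": 0, "ssn": 0, "iban": 0},
]


def families_found(counts):
    """Return the names of the families with at least one match, sorted."""
    return sorted(name for name, n in counts.items() if n > 0)


# Indices of the rows that contained any PII at all.
flagged = [i for i, counts in enumerate(rows) if families_found(counts)]

for i, counts in enumerate(rows):
    print(i, families_found(counts))
```

The third row comes back empty, which is the "Nothing here." case.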
One pass
The counting happens inside the existing replace_all call. A Cell<u32> in the closure increments on each validated match. No second scan, no cloned strings.
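use std::cell::Cell;
use regex::{Captures, Regex};

/// Replaces each match that `render` accepts and returns the count of replacements.
pub fn replace_counted<F>(re: &Regex, s: &str, render: F) -> (String, u32)
where F: Fn(&Captures) -> Option<String> {
    // Cell lets the closure bump the counter through a shared borrow.
    let count = Cell::new(0u32);
    let out = re.replace_all(s, |caps: &Captures| match render(caps) {
        Some(masked) => { count.set(count.get() + 1); masked }
        // Rejected by validation: keep the original text and do not count it.
        None => caps[0].to_string(),
    }).into_owned();
    (out, count.get())
}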
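If the Rust is hard to read, the same idea fits in a few lines of plain Python: the replacement callback decides whether to mask, and the counter only moves when it does. This is a sketch, not how MaskOps is implemented. The SSN pattern, the simplified area-number check, and the `***-**-****` mask are all assumptions for illustration.

```python
import re

# Hypothetical SSN pattern, for illustration only.
SSN_RE = re.compile(r"\b(\d{3})-(\d{2})-(\d{4})\b")


def render_ssn(match):
    """Return the masked text, or None to reject the match.

    Simplified validation: area numbers 000, 666 and 900-999 are rejected.
    """
    area = match.group(1)
    if area in ("000", "666") or area.startswith("9"):
        return None
    return "***-**-****"


def replace_counted(pattern, text, render):
    """Replace validated matches and count them in the same pass."""
    count = 0

    def on_match(match):
        nonlocal count
        masked = render(match)
        if masked is None:
            # Rejected: leave the original text in place, do not count it.
            return match.group(0)
        count += 1
        return masked

    return pattern.sub(on_match, text), count


out, n = replace_counted(SSN_RE, "SSN: 123-45-6789, bogus: 000-12-3456", render_ssn)
print(out, n)
```

The second candidate looks like an SSN but fails validation, so it stays in the text and the count stays at 1.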
A daily audit pattern
(
    df
    .with_columns(maskops.mask_pii_audit("free_text").alias("audit"))
    .unnest("audit")
    .select(
        pl.col("counts").struct.field("ssn").sum().alias("ssn_total"),
        pl.col("counts").struct.field("credit_card").sum().alias("cc_total"),
        pl.col("counts").struct.field("iban").sum().alias("iban_total"),
    )
)
Run this at ingest. Log the totals. Alert if a family appears that shouldn't.
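The log-and-alert step could look something like this. It's a sketch under assumptions: `check_totals`, the logger name, the allow-list, and the numbers are all hypothetical. `totals` stands in for the single row the query above produces, which you could probably get as a dict with something like `.row(0, named=True)`.

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("pii_audit")  # hypothetical logger name


def check_totals(totals, allowed):
    """Log every total and return the families that matched but are not allowed."""
    unexpected = {}
    for family, total in sorted(totals.items()):
        logger.info("family=%s total=%d", family, total)
        if total > 0 and family not in allowed:
            unexpected[family] = total
    if unexpected:
        logger.warning("unexpected PII families at ingest: %s", unexpected)
    return unexpected


# Illustrative numbers only, keyed like the aliases in the query above.
totals = {"ssn_total": 0, "cc_total": 3, "iban_total": 2}
# Assumption: this feed is expected to carry card numbers and nothing else.
unexpected = check_totals(totals, allowed={"cc_total"})
print(unexpected)
```

Returning the unexpected families as a dict leaves the alerting channel up to the caller.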
Where it stands
v1.6.0. 33 PII families: EU IDs, US healthcare, LATAM nationals, APAC. Asterisk masking and FF3-1 format-preserving encryption. Polars lazy and streaming supported.
pip install maskops
Source: github.com/fcarvajalbrown/MaskOps
Happy to answer questions.