DEV Community

Felipe Carvajal Brown

I dropped this for three months. Here's what I added when I came back.


I started MaskOps in March. It masks PII in Polars DataFrames using Rust — no Python per row, no NLP models, just regex running on Arrow buffers.

Then I got hired. Cencosud S.A. The project sat untouched until last week.

Coming back to it, I had a backlog of features. I shipped the one I kept thinking about at work: mask_pii_audit.


The problem with masking alone

Masking answers "is this field safe to store?" It doesn't answer "what kind of PII just came through, and how much of it?"

Compliance teams need both. They need the masked value and a count of what was found — by family — without running the column twice.

What mask_pii_audit does

It returns a nested Struct: the masked text, plus a count for each of the 33 PII families.

import polars as pl
import maskops

df = pl.DataFrame({"notes": [
    "Call me at 555-123-4567. SSN: 123-45-6789.",
    "IBAN: DE89370400440532013000",
    "Nothing here.",
]})

result = (
    df
    .with_columns(maskops.mask_pii_audit("notes").alias("audit"))
    .unnest("audit")
)

print(result.select("masked", "counts"))
┌───────────────────────────────┬────────────────────────────┐
│ masked                        ┆ counts                     │
╞═══════════════════════════════╪════════════════════════════╡
│ Call me at ***-***-****. SSN… ┆ {"phone": 1, "ssn": 1, …} │
│ IBAN: DE89******************  ┆ {"iban": 1, …}             │
│ Nothing here.                 ┆ {"phone": 0, "ssn": 0, …} │
└───────────────────────────────┴────────────────────────────┘

Same masked output as mask_pii. A count of zero means no match for that family.
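
For intuition about the shape of that result, here is a minimal pure-Python sketch. It is not how MaskOps is implemented (the real work happens in Rust on Arrow buffers). The three patterns, the `audit_row` name, and the masking rule are simplified assumptions of mine; in particular, this sketch masks a whole IBAN, while the output above keeps its first four characters.

```python
import re

# Hypothetical, simplified patterns for three families. The real library
# covers 33 families and validates matches, which this sketch does not.
PATTERNS = {
    "phone": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "iban": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def mask_keep_separators(match):
    # Replace letters and digits with "*", keep separators such as "-".
    return re.sub(r"[A-Za-z0-9]", "*", match.group(0))


def audit_row(text):
    """Return the masked text plus a count for every family, zeros included."""
    counts = {}
    masked = text
    for family, pattern in PATTERNS.items():
        # subn returns (new_string, number_of_substitutions), so the
        # count comes out of the same substitution that does the masking.
        masked, n = pattern.subn(mask_keep_separators, masked)
        counts[family] = n
    return {"masked": masked, "counts": counts}


row = audit_row("Call me at 555-123-4567. SSN: 123-45-6789.")
print(row["masked"])
print(row["counts"])
```

Every family gets a key even when nothing matched, which is what lets you sum a family across rows without handling missing fields.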

One pass

The counting happens inside the existing replace_all call. A Cell<u32> captured by the closure is incremented on each validated match. Matches that fail validation are left as they are and not counted. No second scan, no cloned strings.

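use std::cell::Cell;
use regex::{Captures, Regex};

// Replace every match that `render` accepts and count those replacements.
pub fn replace_counted<F>(re: &Regex, s: &str, render: F) -> (String, u32)
where F: Fn(&Captures) -> Option<String> {
    let count = Cell::new(0u32);
    let out = re.replace_all(s, |caps: &Captures| match render(caps) {
        // Validated match: count it and emit the masked text.
        Some(masked) => { count.set(count.get() + 1); masked }
        // Rejected match: keep the original text, do not count it.
        None => caps[0].to_string(),
    }).into_owned();
    (out, count.get())
}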
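
If Rust isn't your thing, the same idea can be sketched in Python with `re.sub` and a callback. This is an analogue for illustration, not code from MaskOps. The Luhn check and the `render_card` helper are my own stand-ins for "validation"; which checks the library runs per family is not covered here.

```python
import re


def replace_counted(pattern, text, render):
    """Substitute matches and count only those that `render` accepts."""
    count = 0

    def on_match(match):
        nonlocal count
        masked = render(match)
        if masked is None:
            # Validation failed: leave the original text untouched.
            return match.group(0)
        count += 1
        return masked

    return pattern.sub(on_match, text), count


def luhn_ok(digits):
    """Luhn checksum, a common validity check for card numbers."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def render_card(match):
    # Hypothetical renderer: mask only if the digits pass the Luhn check.
    digits = match.group(0)
    return "*" * len(digits) if luhn_ok(digits) else None


CARD = re.compile(r"\b\d{16}\b")

out, n = replace_counted(CARD, "ok 4111111111111111, bad 1234567812345678", render_card)
print(out, n)
```

The second number looks like a card but fails the checksum, so it stays in the text and does not inflate the count.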

A daily audit pattern

totals = (
    df
    .with_columns(maskops.mask_pii_audit("free_text").alias("audit"))
    .unnest("audit")
    .select(
        pl.col("counts").struct.field("ssn").sum().alias("ssn_total"),
        pl.col("counts").struct.field("credit_card").sum().alias("cc_total"),
        pl.col("counts").struct.field("iban").sum().alias("iban_total"),
    )
)

Run this at ingest. Log the totals. Alert if a family appears that shouldn't.
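
The log-and-alert step could look something like this. It's a sketch under assumptions: the totals arrive as a plain dict of family name to count (however you collected them), and `EXPECTED_FAMILIES`, `check_totals`, and the logger name are hypothetical names of mine, not part of MaskOps.

```python
import logging

logger = logging.getLogger("pii_audit")

# Hypothetical policy: families this feed is allowed to contain.
EXPECTED_FAMILIES = {"phone", "email"}


def check_totals(totals, expected=EXPECTED_FAMILIES):
    """Log per-family totals and return the families that should not appear."""
    logger.info("pii totals: %s", totals)
    unexpected = sorted(
        family
        for family, n in totals.items()
        if n > 0 and family not in expected
    )
    for family in unexpected:
        logger.warning(
            "unexpected PII family at ingest: %s (%d)", family, totals[family]
        )
    return unexpected


alerts = check_totals({"phone": 12, "ssn": 3, "iban": 0})
print(alerts)
```

Returning the list instead of raising keeps the decision (page someone, fail the job, just log) with the caller.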


Where it stands

v1.6.0. 33 PII families: EU IDs, US healthcare, LATAM nationals, APAC. Asterisk masking and FF3-1 format-preserving encryption. Polars lazy and streaming supported.

pip install maskops

Source: github.com/fcarvajalbrown/MaskOps

Happy to answer questions.
