Felipe Carvajal Brown

MaskOps 0.1.0: A Native Polars Plugin for High-Speed PII Masking in Python

TL;DR: I built a Rust-powered Polars plugin for GDPR-sensitive data (IBAN, EU VAT) that detects PII at up to ~17 million rows per second and masks it at up to ~2.5 million rows per second, with no NLP models, no spaCy, and no Presidio overhead. Install it with pip install maskops.


The Problem

If you work with financial data, healthcare records, or any GDPR-regulated dataset in Python, you've likely hit the same wall: de-identifying structured data at scale is painfully slow.

The go-to solution is Microsoft Presidio. It's powerful, but it's built for unstructured text — it spins up a full spaCy NLP pipeline to find a phone number in a CSV column. For structured DataFrames where you already know which columns contain PII, that's enormous overhead:

  • Presidio with spaCy NER: ~1,000–5,000 rows/s (estimated)
  • Presidio with regex-only recognizers: ~10,000–50,000 rows/s (estimated)
  • Pure Python re module: ~600,000–1,100,000 rows/s (measured, see Benchmarks)

None of these integrate natively with Polars, one of the fastest DataFrame libraries in Python.


The Solution: maskops

maskops is a native Polars expression plugin written in Rust. It extends Polars with two new expressions — mask_pii() and contains_pii() — that run directly on Arrow memory buffers with zero Python overhead per row.

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in a column
df.with_columns(maskops.mask_pii("notes"))
# "Transfer to DE89370400440532013000" → "Transfer to DE89******************"

# Boolean detection — filter rows containing PII
df.filter(maskops.contains_pii("free_text"))

That's it. No model downloads, no engine initialization, no spaCy.


Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

maskops throughput

Profile         | Expression   | Time   | Rows/s     | MB/s
clean (no PII)  | mask_pii     | 0.404s | 2,477,599  | 54.5
clean (no PII)  | contains_pii | 0.169s | 5,915,970  | 130.2
dense (all PII) | mask_pii     | 1.385s | 722,104    | 15.9
dense (all PII) | contains_pii | 0.059s | 16,987,879 | 373.7
mixed (50/50)   | mask_pii     | 0.760s | 1,315,407  | 28.9
mixed (50/50)   | contains_pii | 0.133s | 7,498,315  | 165.0
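
The Rows/s and MB/s columns follow from the row count and the wall-clock time. Here is a small sketch of that arithmetic using figures from the table. Two things in it are assumptions rather than statements from the benchmark: that "MB" means 10^6 bytes, and the resulting row size of about 22 bytes, which is only inferred from the ratio of the two columns (it happens to match the length of a German IBAN).

```python
ROWS = 1_000_000  # benchmark size stated above


def rows_per_second(seconds: float, rows: int = ROWS) -> float:
    """Throughput in rows/s for a run that took `seconds`."""
    return rows / seconds


def implied_bytes_per_row(mb_per_s: float, rows_per_s: float) -> float:
    """Average row size implied by the MB/s and Rows/s columns.

    Assumes 1 MB = 1_000_000 bytes (not confirmed by the post).
    """
    return mb_per_s * 1_000_000 / rows_per_s


# dense / contains_pii row: 0.059 s, 16,987,879 rows/s, 373.7 MB/s
recomputed = rows_per_second(0.059)
# Close to, but not exactly, the reported figure: the table's time is rounded.
print(f"{recomputed:,.0f} rows/s")
print(f"{implied_bytes_per_row(373.7, 16_987_879):.1f} bytes/row")
```

Because the published times are rounded to three decimals, recomputed throughput differs from the table by a fraction of a percent.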

vs pure Python regex (same machine)

Profile | maskops mask_pii | Python re | Speedup
clean   | 0.404s           | 0.925s    | 2.3×
dense   | 1.385s           | 1.653s    | 1.2×
mixed   | 0.760s           | 1.337s    | 1.8×

On clean and mixed data maskops is consistently faster. On dense data (every row is a full IBAN) both are regex-bound — the bottleneck is the pattern itself, not Python overhead.
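
The baseline script is not shown here, so the following is only a sketch of what a pure-Python re baseline of this kind might look like: a precompiled pattern applied row by row, plus a simple timing loop. The pattern is a simplified IBAN-shaped regex and mask_iban_rows is a hypothetical name; neither is taken from the actual benchmark code.

```python
import re
import time

# Simplified IBAN-shaped pattern (assumption): 2 letters, 2 check digits,
# then 11-30 alphanumerics. Group 1 is kept, group 2 is masked.
IBAN_RE = re.compile(r"\b([A-Z]{2}\d{2})([A-Z0-9]{11,30})\b")


def _mask(match: re.Match) -> str:
    # Keep country code + check digits, star out the rest (length-preserving).
    return match.group(1) + "*" * len(match.group(2))


def mask_iban_rows(rows: list[str]) -> list[str]:
    """Row-by-row masking in pure Python; one re.sub call per row."""
    return [IBAN_RE.sub(_mask, row) for row in rows]


if __name__ == "__main__":
    rows = ["Transfer to DE89370400440532013000"] * 100_000
    start = time.perf_counter()
    masked = mask_iban_rows(rows)
    elapsed = time.perf_counter() - start
    print(masked[0])
    print(f"{len(rows) / elapsed:,.0f} rows/s")
```

Absolute numbers from a loop like this depend heavily on the machine and the pattern, so treat them as a rough reference point rather than a reproduction of the table.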

vs Microsoft Presidio (estimated)

Presidio processes structured DataFrames via presidio-structured, which runs a spaCy NLP pipeline per row. Based on community reports and the architecture:

Tool                              | Throughput (structured data) | Requires NLP model
maskops                           | ~700K–17M rows/s             | No
Presidio (regex-only recognizers) | ~10–50K rows/s*              | No
Presidio (spaCy NER)              | ~1–5K rows/s*                | Yes (250MB+)

* Estimated from community benchmarks and Presidio's own documentation noting it is "not optimized for bulk structured data." Microsoft confirmed no official throughput benchmarks exist.

maskops is purpose-built for structured data pipelines where Presidio's NLP overhead is unnecessary.


How It Works

The key is the Polars expression plugin system, introduced in Polars 0.20. It allows you to register custom Rust functions that Polars calls directly on Arrow ChunkedArray buffers — bypassing Python entirely for the hot loop.

The architecture is three layers:

Python (user code)
    ↓  register_plugin_function()
Polars expression engine
    ↓  Arrow ChunkedArray
Rust (maskops core)
    ↓  regex::Regex on &str slices

Each PII type lives in its own Rust module (iban.rs, vat.rs) with a once_cell::sync::Lazy<Regex>, so each regex is compiled once, on first use, rather than per row.

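use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

// Rust side: Polars calls this with the input column; the closure runs on each string value.
// `mask_all` (defined elsewhere in the crate) applies every PII pattern to one string.
#[polars_expr(output_type=String)]
fn mask_pii(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].str()?; // view the first input as a string column
    let out: StringChunked = ca.apply(|opt_val: Option<&str>| {
        // Nulls pass through unchanged; other values are replaced by their masked form.
        opt_val.map(|s| std::borrow::Cow::Owned(mask_all(s)))
    });
    Ok(out.into_series())
}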
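
For context, the Python half of the diagram might look like the sketch below. It is not maskops's actual source: register_plugin_function is the real Polars API (in polars.plugins), but the wrapper body, the lazy import, and the assumption that the compiled library sits next to the Python module are all illustrative.

```python
from pathlib import Path


def mask_pii(column: str):
    """Build a Polars expression that calls the Rust `mask_pii` function.

    Hypothetical wrapper. Polars is imported lazily here only so that this
    sketch can be loaded without the dependency installed.
    """
    import polars as pl
    from polars.plugins import register_plugin_function

    return register_plugin_function(
        # Assumption: the compiled shared library lives in this package directory.
        plugin_path=Path(__file__).parent,
        function_name="mask_pii",  # must match the #[polars_expr] function name
        args=[pl.col(column)],
        is_elementwise=True,  # each output row depends only on its own input row
    )
```

Marking the function as elementwise tells Polars that rows are independent, which lets the engine apply it chunk by chunk without extra coordination.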

Supported PII Patterns (v0.1.0)

Pattern | Coverage                | Example
IBAN    | All 36 SEPA countries   | DE89370400440532013000 → DE89******************
EU VAT  | All 27 EU member states | DE123456789 → DE*********

Tested against Faker-generated data in 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE.


Why Not Just Use Polars .str.replace()?

You could write pl.col("x").str.replace_all(pattern, "****") directly in Polars. The problem:

  1. You need one expression per PII type — maskops applies all patterns in a single pass.
  2. No detection — Polars has no contains_pii() equivalent without writing the regex yourself.
  3. No masking logic: mask_pii preserves the IBAN country code and check digits, which is standard practice for audit trails. A plain str.replace_all with a fixed replacement would wipe the whole match.
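
To make the single-pass and prefix-preserving points concrete, here is a pure-Python sketch of the idea. It is not the Rust implementation: the patterns are simplified stand-ins, and mask_all and contains_pii below only mirror the names used in this post.

```python
import re

# Simplified stand-in patterns (assumptions, not maskops's real regexes).
# Each alternative captures the prefix to keep and the body to mask.
PII_RE = re.compile(
    r"\b(?:"
    r"(?P<iban_keep>[A-Z]{2}\d{2})(?P<iban_body>[A-Z0-9]{11,30})"  # IBAN: keep country + check digits
    r"|"
    r"(?P<vat_keep>[A-Z]{2})(?P<vat_body>\d{8,12})"  # VAT: keep country code only
    r")\b"
)


def _mask(match: re.Match) -> str:
    # Find which alternative matched and star out its body, keeping the prefix.
    for kind in ("iban", "vat"):
        keep = match.group(f"{kind}_keep")
        if keep is not None:
            return keep + "*" * len(match.group(f"{kind}_body"))
    return match.group(0)  # not reachable with the pattern above


def mask_all(text: str) -> str:
    """Mask every PII match in a single pass over the string."""
    return PII_RE.sub(_mask, text)


def contains_pii(text: str) -> bool:
    """True if any of the combined patterns matches."""
    return PII_RE.search(text) is not None


print(mask_all("Payment from DE89370400440532013000"))
print(mask_all("Invoice VAT: DE123456789"))
print(contains_pii("No PII here"))
```

Combining the patterns into one alternation means each string is scanned once regardless of how many PII types are supported, and the replacement callback is where the "keep the prefix" rule lives.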

Roadmap

  • v0.1.1: Email, phone number, IP address patterns
  • v0.1.2: Format-Preserving Encryption (FPE/FF3-1) for reversible masking + PyPI publish
  • v0.2.0: Latin American IDs (Chilean RUT, Brazilian CPF, Mexican CURP)

Install & Getting Started

pip install maskops

Then try it on a small DataFrame:

import polars as pl
import maskops

df = pl.DataFrame({
    "transaction": [
        "Payment from DE89370400440532013000",
        "Invoice VAT: DE123456789",
        "No PII here"
    ]
})

result = df.with_columns([
    maskops.mask_pii("transaction").alias("masked"),
    maskops.contains_pii("transaction").alias("has_pii")
])

print(result)

Output:

┌─────────────────────────────────────┬─────────────────────────────────────┬─────────┐
│ transaction                         ┆ masked                              ┆ has_pii │
╞═════════════════════════════════════╪═════════════════════════════════════╪═════════╡
│ Payment from DE89370400440532013000 ┆ Payment from DE89****************** ┆ true    │
│ Invoice VAT: DE123456789            ┆ Invoice VAT: DE*********            ┆ true    │
│ No PII here                         ┆ No PII here                         ┆ false   │
└─────────────────────────────────────┴─────────────────────────────────────┴─────────┘

Source code: github.com/fcarvajalbrown/MaskOps


Built with Rust, pyo3-polars, and maturin. Contributions welcome.

Tags: #rust #python #polars #gdpr #dataengineering #privacy #pii #opensource
