TL;DR: I built a Rust-powered Polars plugin that detects GDPR-sensitive data (IBAN, EU VAT) at up to ~17 million rows per second and masks it at up to ~2.5 million rows per second, with no NLP models, no spaCy, and no Presidio overhead. pip install maskops.
The Problem
If you work with financial data, healthcare records, or any GDPR-regulated dataset in Python, you've likely hit the same wall: de-identifying structured data at scale is painfully slow.
The go-to solution is Microsoft Presidio. It's powerful, but it's built for unstructured text — it spins up a full spaCy NLP pipeline to find a phone number in a CSV column. For structured DataFrames where you already know which columns contain PII, that's enormous overhead:
- Presidio with spaCy NER: ~1,000–5,000 rows/s
- Presidio with regex-only recognizers: ~10,000–50,000 rows/s
- Pure Python re module: ~1,100,000 rows/s
None of these integrate natively with Polars, the fastest DataFrame library in Python.
The Solution: maskops
maskops is a native Polars expression plugin written in Rust. It extends Polars with two new expressions — mask_pii() and contains_pii() — that run directly on Arrow memory buffers with zero Python overhead per row.
import polars as pl
import maskops
df = pl.read_csv("payments.csv")
# Mask all PII in a column
df.with_columns(maskops.mask_pii("notes"))
# "Transfer to DE89370400440532013000" → "Transfer to DE89******************"
# Boolean detection — filter rows containing PII
df.filter(maskops.contains_pii("free_text"))
That's it. No model downloads, no engine initialization, no spaCy.
Benchmarks
Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.
maskops throughput
| Profile | Expression | Time | Rows/s | MB/s |
|---|---|---|---|---|
| clean (no PII) | mask_pii | 0.404s | 2,477,599 | 54.5 |
| clean (no PII) | contains_pii | 0.169s | 5,915,970 | 130.2 |
| dense (all PII) | mask_pii | 1.385s | 722,104 | 15.9 |
| dense (all PII) | contains_pii | 0.059s | 16,987,879 | 373.7 |
| mixed (50/50) | mask_pii | 0.760s | 1,315,407 | 28.9 |
| mixed (50/50) | contains_pii | 0.133s | 7,498,315 | 165.0 |
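To make the Rows/s and MB/s columns easier to check, here is a minimal sketch of the arithmetic. The 22 bytes per row figure is an assumption inferred from the table itself (MB/s divided by rows/s), not something the benchmark script is known to use. Small differences from the table come from the displayed times being rounded to three decimals.

```python
# Assumption: ~22 bytes per row, inferred from the table (MB/s divided by rows/s).
ROWS = 1_000_000
BYTES_PER_ROW = 22


def throughput(seconds: float, rows: int = ROWS,
               bytes_per_row: int = BYTES_PER_ROW) -> tuple[float, float]:
    """Return (rows per second, megabytes per second) for one timed run."""
    rows_per_s = rows / seconds
    mb_per_s = rows_per_s * bytes_per_row / 1_000_000
    return rows_per_s, mb_per_s


# Clean profile, mask_pii: 0.404 s for 1,000,000 rows
rows_per_s, mb_per_s = throughput(0.404)
print(f"{rows_per_s:,.0f} rows/s, {mb_per_s:.1f} MB/s")
```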
vs pure Python regex (same machine)
| Profile | maskops mask_pii | Python re | Speedup |
|---|---|---|---|
| clean | 0.404s | 0.925s | 2.3× |
| dense | 1.385s | 1.653s | 1.2× |
| mixed | 0.760s | 1.337s | 1.8× |
On clean and mixed data maskops is 1.8× to 2.3× faster. On dense data (every row is a full IBAN) the gap narrows to 1.2× because both implementations are regex-bound: the bottleneck is the pattern matching and replacement itself, not Python overhead.
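For context, a pure-Python baseline of this kind can be sketched as below. This is not the actual benchmark script: the IBAN pattern is a simplified assumption, and the helper names (mask_ibans, make_rows, time_baseline) and the sample row text are hypothetical.

```python
import re
import time

# Assumption: a simplified IBAN shape (2 letters, 2 check digits, 11-30
# alphanumerics). The patterns used in the real benchmark may be stricter.
IBAN_RE = re.compile(r"\b([A-Z]{2}\d{2})([A-Z0-9]{11,30})\b")


def mask_ibans(text: str) -> str:
    """Keep the country code and check digits, replace the rest with '*'."""
    return IBAN_RE.sub(lambda m: m.group(1) + "*" * len(m.group(2)), text)


def make_rows(profile: str, n: int) -> list[str]:
    """Build a synthetic column: 'clean', 'dense', or 'mixed' (alternating)."""
    clean, dense = "No PII in this row at", "DE89370400440532013000"
    if profile == "clean":
        return [clean] * n
    if profile == "dense":
        return [dense] * n
    return [dense if i % 2 == 0 else clean for i in range(n)]


def time_baseline(profile: str, n: int = 100_000) -> float:
    """Time one pass of mask_ibans over n synthetic rows, in seconds."""
    rows = make_rows(profile, n)
    start = time.perf_counter()
    masked = [mask_ibans(row) for row in rows]
    elapsed = time.perf_counter() - start
    assert len(masked) == n
    return elapsed


if __name__ == "__main__":
    for profile in ("clean", "dense", "mixed"):
        secs = time_baseline(profile)
        print(f"{profile}: {secs:.3f}s, {100_000 / secs:,.0f} rows/s")
```

Absolute numbers from a harness like this depend heavily on the machine and Python version, so only the relative ordering of the profiles is worth comparing.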
vs Microsoft Presidio (estimated)
Presidio processes structured DataFrames via presidio-structured, which runs a spaCy NLP pipeline per row. Based on community reports and the architecture:
| Tool | Throughput (structured data) | Requires NLP model |
|---|---|---|
| maskops | ~700K–17M rows/s | No |
| Presidio (regex-only recognizers) | ~10–50K rows/s* | No |
| Presidio (spaCy NER) | ~1–5K rows/s* | Yes (250MB+) |
* Estimated from community benchmarks and Presidio's own documentation noting it is "not optimized for bulk structured data." Microsoft confirmed no official throughput benchmarks exist.
maskops is purpose-built for structured data pipelines where Presidio's NLP overhead is unnecessary.
How It Works
The key is the Polars expression plugin system, introduced in Polars 0.20. It allows you to register custom Rust functions that Polars calls directly on Arrow ChunkedArray buffers — bypassing Python entirely for the hot loop.
The architecture is three layers:
Python (user code)
↓ register_plugin_function()
Polars expression engine
↓ Arrow ChunkedArray
Rust (maskops core)
↓ regex::Regex on &str slices
Each PII type lives in its own Rust module (iban.rs, vat.rs) with a once_cell::Lazy<Regex>, so the regex is compiled once on first use and then reused for every row instead of being recompiled per row.
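// Rust side: Polars calls this with a whole Series; the closure runs once per string value
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

#[polars_expr(output_type=String)]
fn mask_pii(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].str()?;
    // Nulls pass through as None; non-null strings are masked by mask_all (defined elsewhere)
    let out: StringChunked = ca.apply(|opt_val: Option<&str>| {
        opt_val.map(|s| std::borrow::Cow::Owned(mask_all(s)))
    });
    Ok(out.into_series())
}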
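For readers who don't write Rust, the null handling in the snippet above (opt_val.map) can be modelled in plain Python. This is only a reference model of the semantics, not how maskops runs: the helper name apply_nullable and the stand-in redact function are hypothetical.

```python
from typing import Callable, Optional


def apply_nullable(values: list[Optional[str]],
                   fn: Callable[[str], str]) -> list[Optional[str]]:
    """Apply fn to every non-null string; nulls pass through untouched."""
    return [None if v is None else fn(v) for v in values]


def redact(text: str) -> str:
    """Stand-in for the real masking function, for illustration only."""
    return text.replace("secret", "******")


print(apply_nullable(["a secret note", None, "plain note"], redact))
```

The practical consequence is that a null in the input column stays null in the output, so masking never turns missing values into empty strings.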
Supported PII Patterns (v0.1.0)
| Pattern | Coverage | Example |
|---|---|---|
| IBAN | All 36 SEPA countries | DE89370400440532013000 → DE89****************** |
| EU VAT | All 27 EU member states | DE123456789 → DE********* |
Tested against Faker-generated data in 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE.
Why Not Just Use Polars .str.replace()?
You could write pl.col("x").str.replace_all(pattern, "****") directly in Polars. The problem:
- You need one expression per PII type — maskops applies all patterns in a single pass.
- No detection: Polars has no contains_pii() equivalent without writing the regex yourself.
- No masking logic: mask_pii preserves the IBAN country code and check digits, which is standard practice for audit trails. A raw str.replace_all with a fixed replacement would wipe everything.
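The three points above (single pass, detection, prefix-preserving masking) can be illustrated with a small pure-Python sketch. It is not the maskops implementation: the two patterns are simplified assumptions that ignore per-country rules, and the KEEP table and _mask helper are hypothetical. The loose VAT shape will also match any 10 to 14 character uppercase token, which a production pattern would need to rule out.

```python
import re

# Assumptions: simplified shapes for illustration only. Real IBAN and VAT
# formats vary per country, and maskops' own patterns are not shown here.
PII_RE = re.compile(
    r"\b(?:"
    r"(?P<iban>[A-Z]{2}\d{2}[A-Z0-9]{11,30})"  # country code, check digits, account part
    r"|(?P<vat>[A-Z]{2}[A-Z0-9]{8,12})"        # country code, 8-12 further characters
    r")\b"
)

# How many leading characters stay visible for each pattern.
KEEP = {"iban": 4, "vat": 2}


def _mask(match: re.Match) -> str:
    """Keep the visible prefix and replace the remainder with '*'."""
    keep = KEEP[match.lastgroup]
    value = match.group()
    return value[:keep] + "*" * (len(value) - keep)


def mask_pii(text: str) -> str:
    """Mask every supported pattern in one pass over the text."""
    return PII_RE.sub(_mask, text)


def contains_pii(text: str) -> bool:
    """Return True if any supported pattern occurs in the text."""
    return PII_RE.search(text) is not None


for sample in ("Payment from DE89370400440532013000",
               "Invoice VAT: DE123456789",
               "No PII here"):
    print(contains_pii(sample), mask_pii(sample))
```

Combining the patterns into one alternation with named groups is what makes the single pass possible: the regex engine scans the string once, and match.lastgroup tells the callback which masking rule to apply.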
Roadmap
- v0.1.1: Email, phone number, IP address patterns
- v0.1.2: Format-Preserving Encryption (FPE/FF3-1) for reversible masking + PyPI publish
- v0.2.0: Latin American IDs (Chilean RUT, Brazilian CPF, Mexican CURP)
Install & Getting Started
pip install maskops
import polars as pl
import maskops
df = pl.DataFrame({
    "transaction": [
        "Payment from DE89370400440532013000",
        "Invoice VAT: DE123456789",
        "No PII here"
    ]
})
result = df.with_columns([
    maskops.mask_pii("transaction").alias("masked"),
    maskops.contains_pii("transaction").alias("has_pii")
])
print(result)
Output:
┌─────────────────────────────────────┬──────────────────────────────────┬─────────┐
│ transaction ┆ masked ┆ has_pii │
╞═════════════════════════════════════╪══════════════════════════════════╪═════════╡
│ Payment from DE89370400440532013000 ┆ Payment from DE89*************** ┆ true │
│ Invoice VAT: DE123456789 ┆ Invoice VAT: DE********* ┆ true │
│ No PII here ┆ No PII here ┆ false │
└─────────────────────────────────────┴──────────────────────────────────┴─────────┘
Source code: github.com/fcarvajalbrown/MaskOps
Built with Rust, pyo3-polars, and maturin. Contributions welcome.
Tags: #rust #python #polars #gdpr #dataengineering #privacy #pii #opensource