Felipe Carvajal Brown

Posted on Mar 9

MaskOps 0.1.0: A Native Polars Plugin for High-Speed PII Masking in Python

#performance #privacy #python #rust

TL;DR: I built a Rust-powered Polars plugin that masks GDPR-sensitive data (IBAN, EU VAT) at up to ~2.5 million rows per second and detects it at up to ~17 million rows per second, with no NLP models, no spaCy, and no Presidio overhead. pip install maskops.

The Problem

If you work with financial data, healthcare records, or any GDPR-regulated dataset in Python, you've likely hit the same wall: de-identifying structured data at scale is painfully slow.

The go-to solution is Microsoft Presidio. It's powerful, but it's built for unstructured text — it spins up a full spaCy NLP pipeline to find a phone number in a CSV column. For structured DataFrames where you already know which columns contain PII, that's enormous overhead:

Presidio with spaCy NER: ~1,000–5,000 rows/s
Presidio with regex-only recognizers: ~10,000–50,000 rows/s
Pure Python re module: ~600,000–1,100,000 rows/s (measured, see Benchmarks)

None of these integrate natively with Polars, one of the fastest DataFrame libraries in Python.

The Solution: maskops

maskops is a native Polars expression plugin written in Rust. It extends Polars with two new expressions — mask_pii() and contains_pii() — that run directly on Arrow memory buffers with zero Python overhead per row.

import polars as pl
import maskops

df = pl.read_csv("payments.csv")

# Mask all PII in a column
df.with_columns(maskops.mask_pii("notes"))
# "Transfer to DE89370400440532013000" → "Transfer to DE89******************"

# Boolean detection — filter rows containing PII
df.filter(maskops.contains_pii("free_text"))

That's it. No model downloads, no engine initialization, no spaCy.

Benchmarks

Tested on 1,000,000 rows, Intel i-series CPU, Python 3.14, Windows.

maskops throughput

Profile	Expression	Time	Rows/s	MB/s
clean (no PII)	`mask_pii`	0.404s	2,477,599	54.5
clean (no PII)	`contains_pii`	0.169s	5,915,970	130.2
dense (all PII)	`mask_pii`	1.385s	722,104	15.9
dense (all PII)	`contains_pii`	0.059s	16,987,879	373.7
mixed (50/50)	`mask_pii`	0.760s	1,315,407	28.9
mixed (50/50)	`contains_pii`	0.133s	7,498,315	165.0
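
As a sanity check on how to read this table, the sketch below recomputes the derived columns from the reported times. It assumes rows/s is simply rows divided by wall-clock seconds and that 1 MB means 1,000,000 bytes; neither is stated explicitly above, and the function names are illustrative.

```python
# Sketch: recompute the table's derived columns from rows and elapsed time.
# Assumptions (not confirmed by the benchmark script):
#   rows/s = rows / seconds, and 1 MB = 1_000_000 bytes.

ROWS = 1_000_000

def rows_per_second(rows: int, seconds: float) -> float:
    return rows / seconds

def implied_bytes_per_row(mb_per_s: float, rows_per_s: float) -> float:
    return mb_per_s * 1_000_000 / rows_per_s

# clean profile, mask_pii: 0.404 s reported
clean_mask = rows_per_second(ROWS, 0.404)
# dense profile, contains_pii: 0.059 s reported
dense_contains = rows_per_second(ROWS, 0.059)

print(f"{clean_mask:,.0f} rows/s")
print(f"{dense_contains:,.0f} rows/s")
print(f"{implied_bytes_per_row(54.5, 2_477_599):.1f} bytes/row")
```

The recomputed values land within about 0.3% of the table, which is what you would expect when the printed times are rounded to three decimals. The MB/s column implies roughly 22 bytes per row.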

vs pure Python regex (same machine)

Profile	maskops `mask_pii`	Python `re`	Speedup
clean	0.404s	0.925s	2.3×
dense	1.385s	1.653s	1.2×
mixed	0.760s	1.337s	1.8×

maskops is faster on all three profiles, with the largest gains on clean and mixed data. On dense data (every row is a full IBAN) both are regex-bound: the bottleneck is the pattern itself, not Python overhead.
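
For context, a pure-Python baseline of the kind compared here might look like the sketch below. The actual benchmark script is not shown in this post, so the simplified IBAN pattern and the function names are assumptions for illustration only.

```python
import re
from typing import Optional

# Assumption: simplified IBAN shape (2 letters, 2 check digits, 11-30 alphanumerics).
# Real IBAN lengths vary by country; this is not the pattern maskops uses.
IBAN_RE = re.compile(r"\b([A-Z]{2}\d{2})([A-Z0-9]{11,30})\b")

def _mask_match(m: re.Match) -> str:
    # Keep country code + check digits, star out the rest at the same length.
    return m.group(1) + "*" * len(m.group(2))

def mask_column_py(values: list[Optional[str]]) -> list[Optional[str]]:
    # One Python-level regex call per row: this per-row interpreter
    # overhead is what a native plugin avoids.
    return [None if v is None else IBAN_RE.sub(_mask_match, v) for v in values]

masked = mask_column_py(["Transfer to DE89370400440532013000", "No PII here", None])
print(masked)
```

On rows without a match the regex fails fast and the Python loop dominates, which is consistent with the larger speedup on the clean profile.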

vs Microsoft Presidio (estimated)

Presidio processes structured DataFrames via presidio-structured, which runs a spaCy NLP pipeline per row. Based on community reports and the architecture:

Tool	Throughput (structured data)	Requires NLP model
maskops	~700K–17M rows/s	No
Presidio (regex-only recognizers)	~10–50K rows/s*	No
Presidio (spaCy NER)	~1–5K rows/s*	Yes (250MB+)

* Estimated from community benchmarks and Presidio's own documentation noting it is "not optimized for bulk structured data." Microsoft confirmed no official throughput benchmarks exist.

maskops is purpose-built for structured data pipelines where Presidio's NLP overhead is unnecessary.

How It Works

The key is the Polars expression plugin system, introduced in Polars 0.20. It allows you to register custom Rust functions that Polars calls directly on Arrow ChunkedArray buffers — bypassing Python entirely for the hot loop.

The architecture is three layers:

Python (user code)
    ↓  register_plugin_function()
Polars expression engine
    ↓  Arrow ChunkedArray
Rust (maskops core)
    ↓  regex::Regex on &str slices

Each PII type lives in its own Rust module (iban.rs, vat.rs) with a once_cell::Lazy<Regex>, so each regex is compiled once on first use rather than once per row.

// Rust side: Polars calls this once per batch (Series), not once per row
use polars::prelude::*;
use pyo3_polars::derive::polars_expr;

#[polars_expr(output_type=String)]
fn mask_pii(inputs: &[Series]) -> PolarsResult<Series> {
    let ca = inputs[0].str()?;
    // Nulls pass through; non-null strings are rewritten by the crate's mask_all helper
    let out: StringChunked = ca.apply(|opt_val: Option<&str>| {
        opt_val.map(|s| std::borrow::Cow::Owned(mask_all(s)))
    });
    Ok(out.into_series())
}

Supported PII Patterns (v0.1.0)

Pattern	Coverage	Example
IBAN	All 36 SEPA countries	`DE89370400440532013000` → `DE89******************`
EU VAT	All 27 EU member states	`DE123456789` → `DE*********`

Tested against Faker-generated data in 8 EU locales: DE, FR, ES, IT, NL, PL, PT, SE.

Why Not Just Use Polars `.str.replace()`?

You could write pl.col("x").str.replace_all(pattern, "****") directly in Polars. The problems:

You need one expression per PII type, whereas maskops applies all patterns in a single pass.
No detection: Polars has no contains_pii() equivalent unless you write the regex yourself.
No masking logic: mask_pii preserves the IBAN country code and check digits, which is standard practice for audit trails. A str.replace_all with a fixed replacement string wipes the whole match.
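
To make the single-pass and prefix-preserving points concrete, here is a pure-Python sketch of the idea using one combined regex with named groups. It is an illustration only: the patterns are simplified assumptions, the function names are hypothetical, and the real logic lives in the Rust core.

```python
import re

# Assumed, simplified patterns; not the patterns maskops actually ships.
PII_RE = re.compile(
    r"\b(?P<iban>[A-Z]{2}\d{2}[A-Z0-9]{11,30})\b"
    r"|\b(?P<vat>[A-Z]{2}[A-Z0-9]{8,12})\b"
)

# How many leading characters each PII type keeps in clear text.
KEEP = {"iban": 4, "vat": 2}

def _mask(m: re.Match) -> str:
    text = m.group(0)
    keep = KEEP[m.lastgroup]  # name of the alternative that matched
    return text[:keep] + "*" * (len(text) - keep)

def mask_pii_py(s: str) -> str:
    # One scan over the string handles every PII type.
    return PII_RE.sub(_mask, s)

def contains_pii_py(s: str) -> bool:
    return PII_RE.search(s) is not None

print(mask_pii_py("Payment from DE89370400440532013000"))
print(mask_pii_py("Invoice VAT: DE123456789"))
print(contains_pii_py("No PII here"))
```

The replacement callback is what a fixed replacement string cannot express: it decides per match how much of the prefix to keep and pads the rest to the original length.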

Roadmap

v0.1.1: Email, phone number, IP address patterns
v0.1.2: Format-Preserving Encryption (FPE/FF3-1) for reversible masking + PyPI publish
v0.2.0: Latin American IDs (Chilean RUT, Brazilian CPF, Mexican CURP)

Install & Getting Started

pip install maskops

import polars as pl
import maskops

df = pl.DataFrame({
    "transaction": [
        "Payment from DE89370400440532013000",
        "Invoice VAT: DE123456789",
        "No PII here"
    ]
})

result = df.with_columns([
    maskops.mask_pii("transaction").alias("masked"),
    maskops.contains_pii("transaction").alias("has_pii")
])

print(result)

Output:

┌─────────────────────────────────────┬─────────────────────────────────────┬─────────┐
│ transaction                         ┆ masked                              ┆ has_pii │
╞═════════════════════════════════════╪═════════════════════════════════════╪═════════╡
│ Payment from DE89370400440532013000 ┆ Payment from DE89****************** ┆ true    │
│ Invoice VAT: DE123456789            ┆ Invoice VAT: DE*********            ┆ true    │
│ No PII here                         ┆ No PII here                         ┆ false   │
└─────────────────────────────────────┴─────────────────────────────────────┴─────────┘

Source code: github.com/fcarvajalbrown/MaskOps

Built with Rust, pyo3-polars, and maturin. Contributions welcome.

Tags: #rust #python #polars #gdpr #dataengineering #privacy #pii #opensource
