Most PDF Redaction Is Broken. Here's What "Real" Redaction Actually Requires.

#tauri #rust #security #programming

All tests run on an 8-year-old MacBook Air.

Drawing a black rectangle over text is not redaction.

The text is still there. Select all, copy, paste into Notepad — it appears. This has leaked classified documents from actual government agencies. Multiple times.

Real redaction destroys the underlying data. Here's how I implemented it.

What fake redaction looks like

PDF structure (fake redaction):
  Page content stream: "Salary: $120,000"  ← still here
  Annotation layer:    [black rectangle]   ← just covering it

The content stream is untouched. Any PDF parser can read it.
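To make that concrete, here is a minimal sketch of how little a "parser" needs in order to recover the covered text. It assumes an uncompressed content stream and only handles simple `(…) Tj` literals; the function name `extract_shown_text` and the sample annotation are hypothetical, made up for illustration.

```rust
/// Collect the literal strings shown by `Tj` operators in an
/// uncompressed content stream. Deliberately naive: it ignores escapes,
/// hex strings, `TJ` arrays, and font encodings.
fn extract_shown_text(content: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut rest = content;
    while let Some(open) = rest.find('(') {
        let after_open = &rest[open + 1..];
        if let Some(close) = after_open.find(')') {
            let literal = &after_open[..close];
            let tail = &after_open[close + 1..];
            // Keep the literal only if a text-showing operator follows it.
            if tail.trim_start().starts_with("Tj") {
                found.push(literal.to_string());
            }
            rest = tail;
        } else {
            break;
        }
    }
    found
}

fn main() {
    // Hypothetical page: the text is drawn in the content stream...
    let content_stream = "BT /F1 12 Tf 72 700 Td (Salary: $120,000) Tj ET";
    // ...while the "redaction" is a separate annotation object that a
    // text extractor never needs to look at.
    let _annotation = "<< /Type /Annot /Subtype /Square /Rect [70 695 200 715] /IC [0 0 0] >>";

    println!("{:?}", extract_shown_text(content_stream));
}
```

The annotation never enters the picture: the extractor reads the content stream and the salary comes straight back out.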

What real redaction requires

1. Find the target text in the content stream
2. Remove it from the stream entirely
3. Replace it with a filled black rectangle drawn directly into the content
4. Re-serialize the page so that no original data survives

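use lopdf::content::{Content, Operation};
use lopdf::{Document, ObjectId};

pub fn redact_text(
    doc: &mut Document,
    page_id: ObjectId,
    target: &str,
    // Bounding box of the target in page space: (x, y, width, height).
    // Computing it is the hard part, covered in the next section.
    rect: (f32, f32, f32, f32),
) -> Result<(), lopdf::Error> {
    // A page object is a dictionary, not a stream. Its drawing operators
    // live in the stream(s) referenced by /Contents, so decode via the document.
    let content = doc.get_and_decode_page_content(page_id)?;

    // Remove text operators containing target (helper not shown here)
    let mut cleaned: Content = remove_text_from_content(content, target);

    // Replace with black filled rectangle at same position.
    // Equivalent to: q 0 0 0 rg x y width height re f Q
    let (x, y, width, height) = rect;
    cleaned.operations.extend([
        Operation::new("q", vec![]),
        Operation::new("rg", vec![0.into(), 0.into(), 0.into()]),
        Operation::new("re", vec![x.into(), y.into(), width.into(), height.into()]),
        Operation::new("f", vec![]),
        Operation::new("Q", vec![]),
    ]);

    // Re-serialize and overwrite the page content
    doc.change_page_content(page_id, cleaned.encode()?)?;
    Ok(())
}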
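The `remove_text_from_content` helper is not shown above. As a rough idea of what such a filter could look like, here is a sketch over a simplified operation model. The `Op` type is a hypothetical stand-in for a decoded content stream operation, and `strip_text_ops` is an illustrative name, not the function from this project or part of lopdf.

```rust
/// Simplified stand-in for one decoded content stream operation.
#[derive(Debug, Clone, PartialEq)]
struct Op {
    operator: String,
    /// String operands only, assumed already decoded to text. Real
    /// operands also include numbers, names, and arrays.
    strings: Vec<String>,
}

fn op(operator: &str, strings: &[&str]) -> Op {
    Op {
        operator: operator.to_string(),
        strings: strings.iter().map(|s| s.to_string()).collect(),
    }
}

/// Drop every text-showing operation whose text contains `target`.
fn strip_text_ops(ops: Vec<Op>, target: &str) -> Vec<Op> {
    ops.into_iter()
        .filter(|o| {
            let shows_text = matches!(o.operator.as_str(), "Tj" | "TJ" | "'" | "\"");
            // TJ splits one run into fragments for kerning, so join the
            // fragments before matching.
            !(shows_text && o.strings.concat().contains(target))
        })
        .collect()
}

fn main() {
    let ops = vec![
        op("BT", &[]),
        op("Tj", &["Name: Alice"]),
        op("TJ", &["Sal", "ary: $120,000"]),
        op("ET", &[]),
    ];
    let cleaned = strip_text_ops(ops, "Salary");
    for o in &cleaned {
        println!("{} {:?}", o.operator, o.strings);
    }
}
```

This removes the whole operator, so any neighbouring text drawn by the same `Tj` goes with it. It also misses a target that is split across several operators or stored in a font-specific encoding, which is one more reason a real implementation needs a proper interpreter.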

The hard part: finding text position

PDF content streams don't store text with coordinates in a simple format. Text position is determined by the current transformation matrix, text matrix, and font metrics — all stateful.

Parsing this correctly requires a proper content stream interpreter, not a regex over the raw bytes.

lopdf gives you the raw stream. Interpreting it is your job.
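To show what "stateful" means in practice, here is a minimal sketch of the state such an interpreter has to carry. It assumes the operations are already parsed into a small hypothetical `Op` enum, covers only `q`, `Q`, `cm`, `BT`, `Td`, `Tm`, and `Tj`, and reports where each string starts. It does not compute widths, since that needs font metrics.

```rust
/// PDF affine matrix [a b c d e f], row-vector convention.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Matrix { a: f64, b: f64, c: f64, d: f64, e: f64, f: f64 }

impl Matrix {
    const IDENTITY: Matrix = Matrix { a: 1.0, b: 0.0, c: 0.0, d: 1.0, e: 0.0, f: 0.0 };

    fn translate(tx: f64, ty: f64) -> Matrix {
        Matrix { e: tx, f: ty, ..Matrix::IDENTITY }
    }

    /// Returns `self × next`: apply `self` first, then `next`.
    fn then(self, next: Matrix) -> Matrix {
        Matrix {
            a: self.a * next.a + self.b * next.c,
            b: self.a * next.b + self.b * next.d,
            c: self.c * next.a + self.d * next.c,
            d: self.c * next.b + self.d * next.d,
            e: self.e * next.a + self.f * next.c + next.e,
            f: self.e * next.b + self.f * next.d + next.f,
        }
    }
}

/// Hypothetical, already-parsed subset of content stream operators.
enum Op {
    Save,                  // q
    Restore,               // Q
    Concat(Matrix),        // cm
    BeginText,             // BT
    MoveText(f64, f64),    // Td
    SetTextMatrix(Matrix), // Tm
    ShowText(String),      // Tj
}

/// Start position, in page space, of every shown string.
fn text_origins(ops: &[Op]) -> Vec<(String, f64, f64)> {
    let mut ctm = Matrix::IDENTITY; // current transformation matrix
    let mut stack = Vec::new();     // graphics state stack for q / Q
    let mut tm = Matrix::IDENTITY;  // text matrix
    let mut tlm = Matrix::IDENTITY; // text line matrix
    let mut out = Vec::new();
    for op in ops {
        match op {
            Op::Save => stack.push(ctm),
            Op::Restore => {
                if let Some(saved) = stack.pop() {
                    ctm = saved;
                }
            }
            Op::Concat(m) => ctm = m.then(ctm),
            Op::BeginText => {
                tm = Matrix::IDENTITY;
                tlm = Matrix::IDENTITY;
            }
            Op::MoveText(tx, ty) => {
                // Td moves relative to the start of the current line.
                tlm = Matrix::translate(*tx, *ty).then(tlm);
                tm = tlm;
            }
            Op::SetTextMatrix(m) => {
                tm = *m;
                tlm = *m;
            }
            Op::ShowText(s) => {
                let placed = tm.then(ctm);
                out.push((s.clone(), placed.e, placed.f));
            }
        }
    }
    out
}

fn main() {
    let scale2 = Matrix { a: 2.0, d: 2.0, ..Matrix::IDENTITY };
    let ops = vec![
        Op::Save,
        Op::Concat(scale2),
        Op::BeginText,
        Op::MoveText(72.0, 700.0),
        Op::ShowText("Salary: $120,000".to_string()),
        Op::Restore,
    ];
    println!("{:?}", text_origins(&ops));
}
```

The same `72 700 Td` lands somewhere else as soon as a `cm` precedes it, which is why matching on the raw bytes cannot tell you where to draw the rectangle. A complete interpreter would also handle `T*`, `TD`, `TL`, `'`, `"`, the font size, and the advance after each shown string.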

AI-assisted detection

For auto-detection of PII (names, phone numbers, ID numbers), I run a pattern-matching pass before redaction:

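use regex::Regex;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PiiType {
    Phone,
    MyNumber,
}

/// Returns (start, end, kind) for each finding, as byte offsets into `text`.
pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
    let mut findings = Vec::new();

    // Phone numbers (hyphenated). The \b anchors stop a match from
    // starting or ending inside a longer run of digits.
    // unwrap() is fine here: the pattern is a constant known to compile.
    let phone_re = Regex::new(r"\b\d{2,4}-\d{2,4}-\d{4}\b").unwrap();
    for m in phone_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::Phone));
    }

    // Japanese My Number (12 digits)
    let mynumber_re = Regex::new(r"\b\d{12}\b").unwrap();
    for m in mynumber_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::MyNumber));
    }

    findings
}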
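A bare 12-digit pattern will also flag order numbers, tracking codes, and similar strings. One way to cut those false positives is to verify the My Number check digit before reporting a match. The sketch below follows my understanding of the published rule (the first 11 digits are weighted 6 5 4 3 2 7 6 5 4 3 2 from the left and the last digit is derived from the sum modulo 11). The function name is hypothetical and this step is not part of the detection pass above.

```rust
/// Returns true if `candidate` is exactly 12 ASCII digits and its last
/// digit matches the check digit derived from the first 11.
fn has_valid_mynumber_check_digit(candidate: &str) -> bool {
    let digits: Vec<u32> = candidate.chars().filter_map(|c| c.to_digit(10)).collect();
    // Reject anything that is not exactly 12 characters, all of them digits.
    if digits.len() != 12 || candidate.chars().count() != 12 {
        return false;
    }

    // Weights for the first 11 digits, left to right.
    const WEIGHTS: [u32; 11] = [6, 5, 4, 3, 2, 7, 6, 5, 4, 3, 2];
    let sum: u32 = digits[..11].iter().zip(WEIGHTS).map(|(d, w)| d * w).sum();

    let remainder = sum % 11;
    let expected = if remainder <= 1 { 0 } else { 11 - remainder };
    digits[11] == expected
}

fn main() {
    for candidate in ["123456789018", "123456789012"] {
        println!("{candidate}: {}", has_valid_mynumber_check_digit(candidate));
    }
}
```

A finding that fails this check could be dropped or downgraded to "needs review" rather than redacted automatically. Which of the two is appropriate depends on how costly a missed redaction is for the document at hand.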