hiyoyo

Most PDF Redaction Is Broken. Here's What "Real" Redaction Actually Requires.

All tests run on an 8-year-old MacBook Air.

Drawing a black rectangle over text is not redaction.

The text is still there. Select all, copy, paste into Notepad — it appears. This has leaked classified documents from actual government agencies. Multiple times.

Real redaction destroys the underlying data. Here's how I implemented it.


What fake redaction looks like

PDF structure (fake redaction):
  Page content stream: "Salary: $120,000"  ← still here
  Annotation layer:    [black rectangle]   ← just covering it

The content stream is untouched. Any PDF parser can read it.
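
To make that concrete, here is a minimal sketch of how little code it takes to read the text back. It assumes the content stream has already been decompressed and handles only the simplest case, a literal string followed by `Tj`. The function name `shown_strings` is made up for illustration, and a real parser would also handle hex strings, `TJ` arrays, and font encodings.

```rust
/// Collects the literal strings passed to the `Tj` (show text) operator in a
/// decoded content stream. Toy scanner: an escaped character is kept as-is
/// (so `\n` yields `n`), which is enough to show the text is recoverable.
pub fn shown_strings(content: &str) -> Vec<String> {
    let chars: Vec<char> = content.chars().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        if chars[i] != '(' {
            i += 1;
            continue;
        }
        // Read one literal string, tracking nested parentheses.
        let mut depth = 1;
        let mut text = String::new();
        i += 1;
        while i < chars.len() && depth > 0 {
            match chars[i] {
                '\\' if i + 1 < chars.len() => {
                    i += 1;
                    text.push(chars[i]);
                }
                '(' => {
                    depth += 1;
                    text.push('(');
                }
                ')' => {
                    depth -= 1;
                    if depth > 0 {
                        text.push(')');
                    }
                }
                c => text.push(c),
            }
            i += 1;
        }
        // Keep the string only if the next token is the Tj operator.
        let mut j = i;
        while j < chars.len() && chars[j].is_whitespace() {
            j += 1;
        }
        if chars[j..].starts_with(&['T', 'j']) {
            out.push(text);
        }
    }
    out
}
```

Given a stream that shows `(Salary: $120,000) Tj` and then fills a black rectangle over the same spot, this still returns `Salary: $120,000`. The rectangle changes what is painted, not what is stored.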


What real redaction requires

  1. Find the target text in the content stream
  2. Remove it from the stream entirely
  3. Replace with a filled black rectangle drawn directly into the content
  4. Re-serialize the page — no original data survives
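use lopdf::{Document, ObjectId};

/// Bounding box of the target text in page coordinates (PDF points).
/// Computing it is the hard part, covered in the next section.
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub width: f64,
    pub height: f64,
}

pub fn redact_text(
    doc: &mut Document,
    page_id: ObjectId,
    target: &str,
    rect: &Rect,
) -> Result<(), lopdf::Error> {
    // A page object is a dictionary, not a stream. Its drawing operators live
    // in the stream(s) referenced by /Contents; this returns their bytes,
    // decompressed where possible.
    let content = doc.get_page_content(page_id)?;

    // Remove text operators containing target
    // (remove_text_from_content is my own helper, not shown; it returns Vec<u8>)
    let mut cleaned = remove_text_from_content(&content, target);

    // Replace with black filled rectangle at same position
    let redact_op = format!(
        "\nq 0 0 0 rg {} {} {} {} re f Q\n",
        rect.x, rect.y, rect.width, rect.height
    );
    cleaned.extend_from_slice(redact_op.as_bytes());

    // Write the rewritten bytes back as the page's content
    doc.change_page_content(page_id, cleaned)?;

    Ok(())
}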
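
The snippet above calls `remove_text_from_content` without showing it. As a rough idea of what steps 2 and 3 involve, here is a simplified stand-in that works on a decoded stream held as a `&str`. The names `strip_shown_text`, `Rect`, and `redaction_box` are hypothetical, and the sketch assumes literal strings with no nested or escaped parentheses, which real PDFs do not guarantee.

```rust
/// Axis-aligned box in page coordinates (PDF points). Hypothetical type.
pub struct Rect {
    pub x: f64,
    pub y: f64,
    pub w: f64,
    pub h: f64,
}

/// Deletes every `(...) Tj` whose literal string contains `target`.
/// The whole string is dropped, so text sharing that string goes with it.
pub fn strip_shown_text(content: &str, target: &str) -> String {
    let mut out = String::with_capacity(content.len());
    let mut rest = content;
    while let Some(open) = rest.find('(') {
        let Some(len) = rest[open..].find(')') else { break };
        let close = open + len;
        let after = &rest[close + 1..];
        let trimmed = after.trim_start();
        let literal = &rest[open + 1..close];
        if trimmed.starts_with("Tj") && literal.contains(target) {
            // Keep what precedes the string; drop the string and its operator.
            out.push_str(&rest[..open]);
            rest = &trimmed[2..];
        } else {
            // Not a match: copy through the closing parenthesis unchanged.
            out.push_str(&rest[..=close]);
            rest = after;
        }
    }
    out.push_str(rest);
    out
}

/// Content-stream operators that fill `r` with black, wrapped in q/Q so the
/// fill colour does not leak into whatever is drawn afterwards.
pub fn redaction_box(r: &Rect) -> String {
    format!("q 0 0 0 rg {} {} {} {} re f Q\n", r.x, r.y, r.w, r.h)
}
```

Dropping the whole operator is the conservative choice. Trimming only the matching substring would preserve more of the page, but a partial edit is easier to get subtly wrong, and here a mistake means leaked data.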

The hard part: finding text position

PDF content streams don't store text with coordinates in a simple format. Text position is determined by the current transformation matrix, text matrix, and font metrics — all stateful.

Parsing this correctly requires a proper content stream interpreter, not a regex over the raw bytes.

lopdf gives you the raw stream. Interpreting it is your job.
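
To show what "stateful" means here, this is a minimal sketch of the bookkeeping an interpreter has to do. It assumes the operators have already been parsed into a hypothetical `Op` enum and tracks only `cm`, `BT`, `Td`, `Tm`, and `Tj`. It ignores the `q`/`Q` graphics-state stack, font size, and the way each shown string advances the text matrix by its glyph widths, all of which a real interpreter needs.

```rust
/// PDF matrix [a b c d e f], i.e. the 3x3 matrix [[a b 0] [c d 0] [e f 1]].
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Matrix(pub [f64; 6]);

impl Matrix {
    pub const IDENTITY: Matrix = Matrix([1.0, 0.0, 0.0, 1.0, 0.0, 0.0]);

    /// Row-vector product `self x rhs`: apply `self` first, then `rhs`.
    pub fn then(self, rhs: Matrix) -> Matrix {
        let [a1, b1, c1, d1, e1, f1] = self.0;
        let [a2, b2, c2, d2, e2, f2] = rhs.0;
        Matrix([
            a1 * a2 + b1 * c2,
            a1 * b2 + b1 * d2,
            c1 * a2 + d1 * c2,
            c1 * b2 + d1 * d2,
            e1 * a2 + f1 * c2 + e2,
            e1 * b2 + f1 * d2 + f2,
        ])
    }
}

/// A pre-parsed subset of content-stream operators (hypothetical type).
pub enum Op {
    Cm(Matrix),       // a b c d e f cm
    BeginText,        // BT
    Td(f64, f64),     // tx ty Td
    Tm(Matrix),       // a b c d e f Tm
    ShowText(String), // (string) Tj
}

/// Returns each shown string with the page-space origin it starts at.
pub fn text_origins(ops: &[Op]) -> Vec<(String, f64, f64)> {
    let mut ctm = Matrix::IDENTITY; // current transformation matrix
    let mut line = Matrix::IDENTITY; // text line matrix
    let mut out = Vec::new();
    for op in ops {
        match op {
            // cm premultiplies the new matrix onto the CTM.
            Op::Cm(m) => ctm = m.then(ctm),
            // BT resets the text matrices to identity.
            Op::BeginText => line = Matrix::IDENTITY,
            // Td moves relative to the start of the current line.
            Op::Td(tx, ty) => line = Matrix([1.0, 0.0, 0.0, 1.0, *tx, *ty]).then(line),
            // Tm replaces the text matrices outright.
            Op::Tm(m) => line = *m,
            Op::ShowText(s) => {
                // Text space -> user space -> page space.
                let m = line.then(ctm);
                out.push((s.clone(), m.0[4], m.0[5]));
            }
        }
    }
    out
}
```

Even in this reduced form, the same `Td` operands land at different page positions depending on every `cm` and `Td` that came before, which is why a regex over the raw bytes cannot recover coordinates.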


AI-assisted detection

For auto-detection of PII (names, phone numbers, ID numbers), I run a pattern-matching pass before redaction:

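use regex::Regex;

/// Kinds of PII this pass can flag.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PiiType {
    Phone,
    MyNumber,
}

/// Returns `(start, end, kind)` for each match, as byte offsets into `text`.
pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
    let mut findings = Vec::new();

    // Phone numbers written with hyphens; \b keeps the match from starting
    // or ending in the middle of a longer run of digits
    let phone_re = Regex::new(r"\b\d{2,4}-\d{2,4}-\d{4}\b").expect("valid phone regex");
    for m in phone_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::Phone));
    }

    // Japanese My Number (12 digits)
    let mynumber_re = Regex::new(r"\b\d{12}\b").expect("valid My Number regex");
    for m in mynumber_re.find_iter(text) {
        findings.push((m.start(), m.end(), PiiType::MyNumber));
    }

    findings
}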

The user reviews detections before committing — auto-redaction without review is its own kind of risk.
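
A bare 12-digit pattern also matches order numbers and similar IDs, which pads the review list. One possible filter, not something the detection code above does, is to check the last digit against the My Number check-digit rule. The sketch below encodes that rule as I understand it (weights 2 to 7 on the six digits nearest the check digit, then 2 to 6 on the remaining five, modulo 11); treat it as an assumption to verify against the official specification. `is_valid_my_number` is a hypothetical helper.

```rust
/// Returns true if `s` is exactly 12 ASCII digits and the last digit matches
/// the check digit computed from the first 11.
pub fn is_valid_my_number(s: &str) -> bool {
    if s.len() != 12 || !s.bytes().all(|b| b.is_ascii_digit()) {
        return false;
    }
    let digits: Vec<u32> = s.bytes().map(|b| u32::from(b - b'0')).collect();

    // P_n is the n-th digit of the 11-digit body counted from the right;
    // its weight Q_n is n + 1 for n = 1..=6 and n - 5 for n = 7..=11.
    let sum: u32 = (1..=11u32)
        .map(|n| {
            let p = digits[(11 - n) as usize];
            let q = if n <= 6 { n + 1 } else { n - 5 };
            p * q
        })
        .sum();

    // A remainder of 0 or 1 maps to check digit 0; otherwise 11 - remainder.
    let rem = sum % 11;
    let check = if rem <= 1 { 0 } else { 11 - rem };
    digits[11] == check
}
```

A failed check is a reason to rank a finding lower, not to hide it. A typo in a real number fails the check too, so the user should still get to see it.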


Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok
