All tests run on an 8-year-old MacBook Air.
Drawing a black rectangle over text is not redaction.
The text is still there. Select all, copy, paste into a plain text editor, and it appears. This has leaked classified documents from actual government agencies. Multiple times.
Real redaction destroys the underlying data. Here's how I implemented it.
What fake redaction looks like
PDF structure (fake redaction):
Page content stream: "Salary: $120,000" ← still here
Annotation layer: [black rectangle] ← just covering it
The content stream is untouched. Any PDF parser can read it.
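To make that concrete, here is a minimal sketch of why the cover-up fails. It is a toy model, not a PDF parser: `Page`, `extract_shown_text`, and `fake_redacted_page` are hypothetical names made up for illustration, and the scanner only understands unescaped `(…) Tj` literals.

```rust
/// Toy model of a page: the content stream plus a separate annotation layer.
/// (Hypothetical type for illustration; a real PDF needs a real parser.)
pub struct Page {
    pub content_stream: String,
    pub annotations: Vec<String>,
}

/// Collect every literal string shown by a `Tj` operator, e.g. `(text) Tj`.
/// Annotations are never consulted, which is why a rectangle on top hides nothing.
/// Assumption: no escaped or nested parentheses inside the literal.
pub fn extract_shown_text(page: &Page) -> Vec<String> {
    let mut out = Vec::new();
    let mut rest = page.content_stream.as_str();
    while let Some(open) = rest.find('(') {
        let after_open = &rest[open + 1..];
        let Some(close) = after_open.find(')') else { break };
        let literal = &after_open[..close];
        let tail = &after_open[close + 1..];
        if tail.trim_start().starts_with("Tj") {
            out.push(literal.to_string());
        }
        rest = tail;
    }
    out
}

/// A page "redacted" the wrong way: text intact, black square annotation on top.
pub fn fake_redacted_page() -> Page {
    Page {
        content_stream: "BT /F1 12 Tf 72 700 Td (Salary: $120,000) Tj ET".to_string(),
        annotations: vec!["/Subtype /Square /Rect [70 695 200 715] /IC [0 0 0]".to_string()],
    }
}
```

Calling `extract_shown_text(&fake_redacted_page())` still yields the salary line, because the extractor reads only the content stream and the annotation lives somewhere else entirely.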
What real redaction requires
- Find the target text in the content stream
- Remove it from the stream entirely
- Replace with a filled black rectangle drawn directly into the content
- Re-serialize the page — no original data survives
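    use lopdf::content::{Content, Operation};
    use lopdf::{Document, Object, ObjectId};

    pub fn redact_text(
        doc: &mut Document,
        page_id: ObjectId,
        target: &str,
        rect: [f32; 4], // x, y, width, height of the target text (see the next section)
    ) -> Result<(), lopdf::Error> {
        // The page object is a dictionary; the operators live in its content stream(s)
        let raw = doc.get_page_content(page_id)?;
        let content = Content::decode(&raw)?;
        // Remove text operators containing target
        // (remove_text_from_content is a helper outside lopdf, not shown; it returns the cleaned Content)
        let mut cleaned = remove_text_from_content(content, target);
        // Replace with black filled rectangle at same position: q 0 0 0 rg x y w h re f Q
        let [x, y, width, height] = rect;
        let black = vec![Object::Integer(0); 3];
        cleaned.operations.extend([
            Operation::new("q", vec![]),
            Operation::new("rg", black),
            Operation::new("re", vec![x.into(), y.into(), width.into(), height.into()]),
            Operation::new("f", vec![]),
            Operation::new("Q", vec![]),
        ]);
        // Re-serialize and swap in the new content stream
        doc.change_page_content(page_id, cleaned.encode()?)
    }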
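"No original data survives" is worth checking rather than trusting. The sketch below is one possible sanity check, not part of lopdf and not necessarily what the tool does: `target_survives` is a hypothetical helper that scans the decoded content bytes for the target as raw bytes, as a PDF hex string, and as UTF-16BE. It will not catch text split across `TJ` arrays or fonts with custom encodings, so a negative result is necessary but not sufficient.

```rust
/// Hex-encode bytes the way a PDF hex string body would appear (uppercase, no brackets).
fn hex_upper(bytes: &[u8]) -> String {
    bytes.iter().map(|b| format!("{:02X}", b)).collect()
}

/// Naive subsequence search over bytes.
fn contains_bytes(haystack: &[u8], needle: &[u8]) -> bool {
    !needle.is_empty() && haystack.windows(needle.len()).any(|w| w == needle)
}

/// Returns true if `target` can still be found in the decoded content bytes
/// in any of these forms: raw bytes, UTF-16BE, or a hex string (either case).
pub fn target_survives(decoded_content: &[u8], target: &str) -> bool {
    let raw = target.as_bytes().to_vec();
    let utf16be: Vec<u8> = target
        .encode_utf16()
        .flat_map(|unit| unit.to_be_bytes())
        .collect();
    let hex = hex_upper(&raw).into_bytes();
    let hex_lower = hex.to_ascii_lowercase();
    [raw, utf16be, hex, hex_lower]
        .iter()
        .any(|needle| contains_bytes(decoded_content, needle))
}
```

Run it against the page content before and after redaction: it should report `true` for the original stream and `false` for the rewritten one. Running it over the whole saved file as well is a cheap extra check, since metadata or other objects can hold copies of the same string.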
The hard part: finding text position
PDF content streams don't store text with coordinates in a simple format. Text position is determined by the current transformation matrix, text matrix, and font metrics — all stateful.
Parsing this correctly requires a proper content stream interpreter, not a regex over the raw bytes.
lopdf gives you the raw stream. Interpreting it is your job.
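A minimal sketch of what "stateful" means here, under heavy simplifying assumptions: it handles only `BT`, `Td`, `Tm`, `Tj`, and `ET`, ignores the current transformation matrix and font metrics, and does not advance the position after a string is shown. `Op` and `locate_text` are hypothetical names for illustration; a real interpreter has to cover far more operators.

```rust
/// A tiny subset of PDF text operators (hypothetical enum for illustration).
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    BeginText,               // BT
    MoveText(f32, f32),      // tx ty Td
    SetTextMatrix([f32; 6]), // a b c d e f Tm
    ShowText(String),        // (string) Tj
    EndText,                 // ET
}

const IDENTITY: [f32; 6] = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0];

/// Walk the operators in order, tracking the text line matrix, and report
/// the origin (x, y) in text space of every string that gets shown.
/// Simplification: glyph advance after Tj is ignored, so two Tj operators
/// on the same line report the same origin.
pub fn locate_text(ops: &[Op]) -> Vec<(String, f32, f32)> {
    let mut line = IDENTITY;
    let mut found = Vec::new();
    for op in ops {
        match op {
            // BT resets the text matrices; ET ends the text object
            Op::BeginText | Op::EndText => line = IDENTITY,
            // Td translates relative to the start of the current line
            Op::MoveText(tx, ty) => {
                let [a, b, c, d, e, f] = line;
                line = [a, b, c, d, tx * a + ty * c + e, tx * b + ty * d + f];
            }
            // Tm replaces the matrix outright
            Op::SetTextMatrix(m) => line = *m,
            Op::ShowText(s) => found.push((s.clone(), line[4], line[5])),
        }
    }
    found
}
```

Feeding it `BT`, `72 700 Td`, a first `Tj`, `0 -14 Td`, a second `Tj` places the second string at (72, 686): its position depends on every operator that came before it, which is why a regex over the raw bytes cannot recover coordinates.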
AI-assisted detection
For auto-detection of PII (names, phone numbers, ID numbers), I run a pattern-matching pass before redaction:
    use regex::Regex;

    // Variants used in this excerpt
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum PiiType {
        Phone,
        MyNumber,
    }

    /// Returns (start, end, kind) for each match; offsets are byte offsets into `text`.
    pub fn detect_pii(text: &str) -> Vec<(usize, usize, PiiType)> {
        let mut findings = Vec::new();
        // Phone numbers
        let phone_re = Regex::new(r"\d{2,4}-\d{2,4}-\d{4}").unwrap();
        for m in phone_re.find_iter(text) {
            findings.push((m.start(), m.end(), PiiType::Phone));
        }
        // Japanese My Number (12 digits)
        let mynumber_re = Regex::new(r"\b\d{12}\b").unwrap();
        for m in mynumber_re.find_iter(text) {
            findings.push((m.start(), m.end(), PiiType::MyNumber));
        }
        findings
    }
The user reviews detections before committing — auto-redaction without review is its own kind of risk.
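One way to shrink the review pile: the 12-digit pattern matches any 12-digit run (order numbers, timestamps), but a My Number carries a check digit as its last digit. The sketch below filters candidates with that check. It is not taken from the tool, `mynumber_checksum_ok` is a hypothetical helper, and the weighting rule is written from my understanding of the published check-digit formula, so verify it against the official specification before relying on it.

```rust
/// Validate the check digit of a 12-digit My Number candidate.
/// Assumed rule: weight the first 11 digits, counting n = 1 from the rightmost
/// of them, with weight n + 1 for n <= 6 and n - 5 for n >= 7; the check digit
/// is 11 - (sum mod 11), or 0 when the remainder is 0 or 1.
pub fn mynumber_checksum_ok(candidate: &str) -> bool {
    let digits: Vec<u32> = candidate.chars().filter_map(|c| c.to_digit(10)).collect();
    // Reject anything that is not exactly 12 ASCII digits
    if candidate.chars().count() != 12 || digits.len() != 12 {
        return false;
    }
    let sum: u32 = digits[..11]
        .iter()
        .rev()
        .enumerate()
        .map(|(i, d)| {
            let n = i as u32 + 1;
            let q = if n <= 6 { n + 1 } else { n - 5 };
            d * q
        })
        .sum();
    let expected = match sum % 11 {
        0 | 1 => 0,
        r => 11 - r,
    };
    digits[11] == expected
}
```

A failed check is a reason to rank a finding lower, not to hide it from the reviewer: a mistyped ID number in a document is still sensitive.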
Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok