All tests run on an 8-year-old MacBook Air.
Some PDFs won't open. Not because the content is gone — because the index that tells readers where to find the content is corrupt.
That index is the XREF table. And it can be rebuilt.
What the XREF table is
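A classic PDF has a cross-reference (XREF) table near the end of the file. (PDF 1.5 and later can store the same information in a cross-reference stream instead.) It's a lookup map: object number → byte offset in the file.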
xref
0 6
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000266 00000 n
0000000496 00000 n
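When a reader opens the PDF, it starts at the end of the file, follows the startxref pointer to this table, and uses the table to locate every object. If the table is missing or corrupt, the PDF "won't open."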
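To make the layout concrete, here is a minimal sketch of parsing one row of a classic table like the one above. The `XrefEntry` type and `parse_xref_entry` function are hypothetical names for illustration, not part of lopdf. Each row carries a 10-digit number, a 5-digit generation, and a flag: `n` for an object in use, `f` for a free slot.

```rust
/// One row of a classic xref table (illustrative type, not from lopdf).
#[derive(Debug, PartialEq)]
enum XrefEntry {
    /// The object starts `offset` bytes from the beginning of the file.
    InUse { offset: u64, generation: u16 },
    /// The slot is free; `next_free` is the next free object number.
    Free { next_free: u64, generation: u16 },
}

/// Parse a row such as "0000000058 00000 n".
/// Returns None if the row does not have exactly three valid fields.
fn parse_xref_entry(line: &str) -> Option<XrefEntry> {
    let mut parts = line.split_whitespace();
    let first: u64 = parts.next()?.parse().ok()?;
    let generation: u16 = parts.next()?.parse().ok()?;
    let kind = parts.next()?;
    if parts.next().is_some() {
        return None; // trailing junk after the flag
    }
    match kind {
        "n" => Some(XrefEntry::InUse { offset: first, generation }),
        "f" => Some(XrefEntry::Free { next_free: first, generation }),
        _ => None,
    }
}
```

For the table above, `parse_xref_entry("0000000058 00000 n")` gives an in-use entry at byte offset 58. The object number is not stored in the row; it comes from the row's position in the subsection.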
Rebuilding it
The content objects are still in the file. We just need to find them and rebuild the index.
use lopdf::{Document, Error};

// Note: the error type is assumed to be lopdf::Error here.
pub fn rebuild_xref(data: &[u8]) -> Result<Document, Error> {
    // Try lopdf's normal loader first; fall back to a raw scan if it fails
    let doc = Document::load_mem(data)
        .or_else(|_| recover_document(data))?;
    Ok(doc)
}

pub fn recover_document(data: &[u8]) -> Result<Document, Error> {
    // Scan the raw bytes for object markers
    // Pattern: "N 0 obj" where N is the object number (generation 0 only)
    // Entries are (object number, generation, byte offset)
    let mut offsets: Vec<(u32, u32, usize)> = Vec::new();
    let obj_pattern = b" 0 obj";
    for (i, window) in data.windows(obj_pattern.len()).enumerate() {
        if window == obj_pattern {
            // Walk back to find the object number
            if let Some(num) = extract_obj_num(data, i) {
                // Offset of the first digit; assumes no leading zeros
                offsets.push((num, 0, i - num.to_string().len()));
            }
        }
    }
    // Reconstruct document from found objects
    rebuild_from_offsets(data, offsets)
}
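The snippet above calls two helpers it doesn't show. What follows is a sketch of how they might look, not the actual implementation: `extract_obj_num` is written to match the call above, and `format_xref` is a hypothetical stand-in for one part of what `rebuild_from_offsets` would need to do, namely turning the found offsets back into table text. It assumes a classic table (not a cross-reference stream) and writes one subsection per contiguous run of object numbers.

```rust
use std::collections::BTreeMap;

/// Walk backwards from `space_idx` (the index of the space in " 0 obj")
/// and parse the run of ASCII digits that ends just before it.
fn extract_obj_num(data: &[u8], space_idx: usize) -> Option<u32> {
    let mut start = space_idx;
    while start > 0 && data[start - 1].is_ascii_digit() {
        start -= 1;
    }
    if start == space_idx {
        return None; // no digits before the marker
    }
    // Require the number to begin at the file start or after whitespace,
    // which filters out some false matches.
    if start > 0 && !data[start - 1].is_ascii_whitespace() {
        return None;
    }
    std::str::from_utf8(&data[start..space_idx]).ok()?.parse().ok()
}

/// Render classic xref text from (object number, generation, offset) triples.
fn format_xref(found: &[(u32, u32, usize)]) -> String {
    // If an object number appears more than once, the later one wins,
    // mirroring how incremental updates override earlier objects.
    let mut latest: BTreeMap<u32, (u32, usize)> = BTreeMap::new();
    for &(num, generation, offset) in found {
        if num != 0 {
            latest.insert(num, (generation, offset));
        }
    }
    // Object 0 is always the head of the free list.
    let mut rows: Vec<(u32, String)> = vec![(0, "0000000000 65535 f \n".to_string())];
    for (&num, &(generation, offset)) in &latest {
        // Each row is exactly 20 bytes, including the two-byte line ending.
        rows.push((num, format!("{offset:010} {generation:05} n \n")));
    }
    let mut out = String::from("xref\n");
    let mut i = 0;
    while i < rows.len() {
        // Extend the run while object numbers stay contiguous.
        let mut j = i + 1;
        while j < rows.len() && rows[j].0 == rows[j - 1].0 + 1 {
            j += 1;
        }
        out.push_str(&format!("{} {}\n", rows[i].0, j - i));
        for row in &rows[i..j] {
            out.push_str(&row.1);
        }
        i = j;
    }
    out
}
```

A complete repair would also have to write a trailer dictionary and a new startxref value after this table. A raw byte scan can also match " 0 obj" inside stream data, so a real tool would want to validate each hit.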
What this fixes
- PDFs truncated mid-write (power loss during save)
- PDFs with incremental updates that broke the XREF chain
- Old files where the XREF was hand-edited incorrectly
- Scanner output with malformed structure
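A quick triage step can tell you whether a file is likely to fall into one of these categories before you attempt a rebuild. This sketch (the `XrefStatus` and `check_xref` names are hypothetical) looks for the startxref keyword in the last 1024 bytes and checks whether its offset lands on an `xref` keyword. It only understands classic tables, so a healthy PDF 1.5+ file that uses a cross-reference stream is reported as `NotATable`.

```rust
/// Result of a quick look at the end of a PDF (illustrative type).
#[derive(Debug, PartialEq)]
enum XrefStatus {
    /// No `startxref` keyword in the file tail, often a truncated file.
    MissingStartxref,
    /// `startxref` is present but its offset is unparsable or past the end.
    BadOffset,
    /// The offset is in range but does not point at the `xref` keyword.
    NotATable(usize),
    /// The offset points at an `xref` keyword.
    Table(usize),
}

fn check_xref(data: &[u8]) -> XrefStatus {
    // Only inspect the tail; the pointer sits just before %%EOF.
    let tail_start = data.len().saturating_sub(1024);
    let tail = &data[tail_start..];
    let key = b"startxref";

    // Use the last occurrence, since incremental updates append new ones.
    let Some(pos) = tail.windows(key.len()).rposition(|w| w == key) else {
        return XrefStatus::MissingStartxref;
    };

    // Collect the decimal digits that follow the keyword.
    let digits: String = tail[pos + key.len()..]
        .iter()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .map(|&b| b as char)
        .collect();

    let Ok(offset) = digits.parse::<usize>() else {
        return XrefStatus::BadOffset;
    };
    if offset >= data.len() {
        return XrefStatus::BadOffset;
    }
    if data[offset..].starts_with(b"xref") {
        XrefStatus::Table(offset)
    } else {
        XrefStatus::NotATable(offset)
    }
}
```

Anything other than `Table` is a hint that scanning for objects and rebuilding the index is worth trying.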
What it can't fix
If the content streams themselves are corrupt — the actual page data is gone — no amount of XREF rebuilding helps. Structural resurrection only works when the objects are present but the index is broken.
In practice
About 80% of "won't open" PDFs I've tested are XREF problems. The content is fine. They just need a new index.
Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok