DEV Community

hiyoyo

Repairing a Broken PDF in Rust — Rebuilding the XREF Table From Scratch

All tests run on an 8-year-old MacBook Air.

Some PDFs won't open. Not because the content is gone — because the index that tells readers where to find the content is corrupt.

That index is the XREF table. And it can be rebuilt.


What the XREF table is

A classic PDF has a cross-reference table near the end of the file (PDF 1.5 and later may store the same data in a compressed cross-reference stream instead). It's a lookup map: object number → byte offset in the file.

xref
0 6
0000000000 65535 f
0000000009 00000 n
0000000058 00000 n
0000000115 00000 n
0000000266 00000 n
0000000496 00000 n

When a reader opens the PDF, it follows the startxref pointer at the end of the file to this table, then uses the table to locate every object. If the table is missing or corrupt, the PDF "won't open."
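
To make the table format concrete, here is a minimal sketch of a parser for a section like the one above. It assumes a classic, uncompressed table with a single subsection; `XrefEntry` and `parse_xref` are names I made up for illustration, not part of lopdf or any other library.

```rust
use std::collections::BTreeMap;

/// One parsed cross-reference entry (illustrative type, not from any library).
#[derive(Debug, PartialEq)]
enum XrefEntry {
    /// An `f` line: the first field is the next free object number.
    Free { next_free: u32, generation: u16 },
    /// An `n` line: the first field is the byte offset of the object.
    InUse { offset: usize, generation: u16 },
}

/// Parse a classic xref section that has exactly one subsection.
/// Returns `None` on anything unexpected instead of guessing.
fn parse_xref(text: &str) -> Option<BTreeMap<u32, XrefEntry>> {
    let mut lines = text.lines().map(str::trim).filter(|l| !l.is_empty());
    if lines.next()? != "xref" {
        return None;
    }
    // Subsection header: "<first object number> <entry count>"
    let mut header = lines.next()?.split_whitespace();
    let first: u32 = header.next()?.parse().ok()?;
    let count: u32 = header.next()?.parse().ok()?;

    let mut table = BTreeMap::new();
    for id in first..first.checked_add(count)? {
        // Entry: "<10-digit field> <5-digit generation> <n|f>"
        let mut parts = lines.next()?.split_whitespace();
        let field: usize = parts.next()?.parse().ok()?;
        let generation: u16 = parts.next()?.parse().ok()?;
        let entry = match parts.next()? {
            "n" => XrefEntry::InUse { offset: field, generation },
            "f" => XrefEntry::Free {
                next_free: u32::try_from(field).ok()?,
                generation,
            },
            _ => return None,
        };
        table.insert(id, entry);
    }
    Some(table)
}
```

Feeding it the six-entry table above gives object 0 as the free-list head and objects 1 through 5 as in-use entries at offsets 9, 58, 115, 266 and 496. Real files can contain several subsections and several chained tables, which this sketch deliberately ignores.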


Rebuilding it

The content objects are still in the file. We just need to find them and rebuild the index.

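use lopdf::{Document, Result};

pub fn rebuild_xref(data: &[u8]) -> Result<Document> {
    // Try lopdf's normal loader first; fall back to a raw byte scan if it fails
    let doc = Document::load_mem(data)
        .or_else(|_| recover_document(data))?;
    Ok(doc)
}

// `extract_obj_num` and `rebuild_from_offsets` are helpers not shown here.
pub fn recover_document(data: &[u8]) -> Result<Document> {
    // Scan the raw bytes for object markers
    // Pattern: "N 0 obj" where N is the object number
    // (only generation 0 is matched, which covers most files)
    // Each entry is (object number, generation, byte offset)
    let mut offsets: Vec<(u32, u32, usize)> = Vec::new();
    let obj_pattern = b" 0 obj";

    for (i, window) in data.windows(obj_pattern.len()).enumerate() {
        if window == obj_pattern {
            // Walk back from the space to find the object number
            if let Some(num) = extract_obj_num(data, i) {
                // The object starts at the first digit of its number
                // (assumes the number has no leading zeros)
                offsets.push((num, 0, i - num.to_string().len()));
            }
        }
    }

    // Reconstruct document from found objects
    rebuild_from_offsets(data, offsets)
}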
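
The scan above leans on two helpers. As a rough sketch of what they might involve, here is one possible `extract_obj_num` plus a `format_xref` function that turns the collected offsets back into a classic xref section. Both bodies are my assumptions for illustration: `format_xref` is a hypothetical name, it only covers the table-writing part of a rebuild, and it does not write the trailer or the startxref line.

```rust
use std::collections::BTreeMap;
use std::fmt::Write;

/// Hypothetical helper: `space_idx` is the index of the space that starts " 0 obj".
/// Walks backwards over ASCII digits and parses them as the object number.
fn extract_obj_num(data: &[u8], space_idx: usize) -> Option<u32> {
    let mut start = space_idx;
    while start > 0 && data[start - 1].is_ascii_digit() {
        start -= 1;
    }
    if start == space_idx {
        return None; // no digits in front of " 0 obj"
    }
    // Require whitespace (or start of file) before the number,
    // so that something like "x12 0 obj" is rejected.
    if start > 0 && !data[start - 1].is_ascii_whitespace() {
        return None;
    }
    std::str::from_utf8(&data[start..space_idx]).ok()?.parse().ok()
}

/// Serialise (object number, generation, offset) triples as a classic xref section.
fn format_xref(offsets: &[(u32, u32, usize)]) -> String {
    // If an object number shows up more than once, keep the copy at the
    // highest offset: incremental updates append newer versions later.
    let mut latest: BTreeMap<u32, (u32, usize)> = BTreeMap::new();
    for &(num, generation, offset) in offsets {
        if num == 0 {
            continue; // object 0 is always the head of the free list
        }
        let slot = latest.entry(num).or_insert((generation, offset));
        if offset >= slot.1 {
            *slot = (generation, offset);
        }
    }
    let entries: Vec<(u32, u32, usize)> =
        latest.into_iter().map(|(n, (g, o))| (n, g, o)).collect();

    // Every entry line is exactly 20 bytes: the trailing space before the
    // newline is part of the format.
    let mut out = String::from("xref\n0 1\n0000000000 65535 f \n");
    let mut i = 0;
    while i < entries.len() {
        // Extend the run while object numbers stay consecutive;
        // each run becomes its own subsection.
        let mut j = i + 1;
        while j < entries.len() && entries[j].0 == entries[j - 1].0 + 1 {
            j += 1;
        }
        let _ = writeln!(out, "{} {}", entries[i].0, j - i);
        for &(_, generation, offset) in &entries[i..j] {
            let _ = writeln!(out, "{:010} {:05} n ", offset, generation);
        }
        i = j;
    }
    out
}
```

Writing one subsection per contiguous run avoids having to fake free entries for object numbers the scan never found. One caveat: a byte scan can also hit " 0 obj" inside stream data, so a real tool would want to validate each hit before trusting it.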

What this fixes

  • PDFs truncated mid-write (power loss during save)
  • PDFs with incremental updates that broke the XREF chain
  • Old files where the XREF was hand-edited incorrectly
  • Scanner output with malformed structure
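
Before attempting a rebuild, it helps to know whether the index is actually the problem. The sketch below is one way to triage a file: find the last startxref keyword, read the offset after it, and check that the offset lands on the xref keyword. `XrefHealth` and `check_xref` are hypothetical names, and the check only understands classic tables, so a healthy file that uses a cross-reference stream will be reported as `PointsElsewhere`.

```rust
/// Result of a quick structural check (illustrative type, not from any library).
#[derive(Debug, PartialEq)]
enum XrefHealth {
    /// No "startxref" keyword at all, typical of a file truncated mid-write.
    MissingStartxref,
    /// The value after "startxref" is not a number or lies past the end of the file.
    BadOffset,
    /// The offset is inside the file but does not land on the "xref" keyword.
    PointsElsewhere,
    /// The offset lands on "xref"; the table itself may still be wrong.
    LooksOk,
}

/// Index of the last occurrence of `needle` in `haystack`.
fn rfind(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    haystack.windows(needle.len()).rposition(|w| w == needle)
}

fn check_xref(data: &[u8]) -> XrefHealth {
    let keyword = b"startxref";
    let Some(pos) = rfind(data, keyword) else {
        return XrefHealth::MissingStartxref;
    };
    // The decimal byte offset follows the keyword, after some whitespace.
    let digits: String = data[pos + keyword.len()..]
        .iter()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .map(|&b| b as char)
        .collect();
    let Ok(offset) = digits.parse::<usize>() else {
        return XrefHealth::BadOffset;
    };
    match data.get(offset..) {
        Some(rest) if rest.starts_with(b"xref") => XrefHealth::LooksOk,
        Some(_) => XrefHealth::PointsElsewhere,
        None => XrefHealth::BadOffset,
    }
}
```

Anything other than `LooksOk` is a hint that scanning for objects and rebuilding the index is worth trying. `LooksOk` only means the pointer is sane, not that every entry in the table is correct.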

What it can't fix

If the content streams themselves are corrupt — the actual page data is gone — no amount of XREF rebuilding helps. Structural resurrection only works when the objects are present but the index is broken.


In practice

About 80% of "won't open" PDFs I've tested are XREF problems. The content is fine. They just need a new index.


Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok