hiyoyo

Renaming 200 PDFs by Their Content — Without an LLM

All tests run on an 8-year-old MacBook Air.

"scan_20260101_0042.pdf" tells you nothing.

Smart Rename reads the first page of each PDF and generates a meaningful filename. No API call, no LLM, no internet. Just heuristics and pattern matching in Rust.


What the first page usually contains

Most documents put identifying information near the top: document type, date, recipient, reference number. The challenge is extracting it reliably across wildly different layouts.

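pub fn extract_rename_candidates(doc: &Document) -> RenameCandidates {
    // Page 1 only; if text extraction fails, fall back to an empty string.
    let first_page_text = doc.extract_text(&[1]).unwrap_or_default();
    // Trim every line and drop the blank ones so the heuristics see clean input.
    let lines: Vec<&str> = first_page_text.lines()
        .map(str::trim)
        .filter(|l| !l.is_empty())
        .collect();

    // Each field comes from its own independent heuristic.
    RenameCandidates {
        title: find_title(&lines),
        date: find_date(&lines),
        doc_type: classify_document(&lines),
        reference: find_reference_number(&lines),
    }
}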
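
The struct and the `find_title` helper are not shown above, so here is a minimal sketch of what they might look like. The field types and the title heuristic (first short, mostly alphabetic line near the top) are assumptions for illustration, not the app's actual code.

```rust
/// Assumed shape of the struct returned above; the field types are guesses.
#[derive(Debug, PartialEq)]
pub struct RenameCandidates {
    pub title: Option<String>,
    pub date: Option<String>,
    pub doc_type: &'static str,
    pub reference: Option<String>,
}

/// Hypothetical title heuristic: take the first line near the top that is
/// short enough to be a heading and is at least half letters.
pub fn find_title(lines: &[&str]) -> Option<String> {
    lines
        .iter()
        .take(10) // only look near the top of the page
        .find(|line| {
            let total = line.chars().count();
            let letters = line.chars().filter(|c| c.is_alphabetic()).count();
            (3..=60).contains(&total) && letters * 2 >= total
        })
        .map(|line| line.to_string())
}
```

With input lines `["2026/01/15", "ABC Corp Invoice", "Total: 100"]`, the date line is skipped because it contains no letters, and the second line becomes the title candidate.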

Date extraction

Dates appear in many formats. Regex covers the common ones:

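use regex::Regex;

// `normalize_date` (not shown) is assumed to return a String such as "20260115".
pub fn find_date(lines: &[&str]) -> Option<String> {
    // Compile each pattern once, not once per line.
    // Order: year-first (including 年月日), year-last numeric, English month names.
    let patterns: Vec<Regex> = [
        r"(\d{4})[年/\-\.](\d{1,2})[月/\-\.](\d{1,2})[日]?",
        r"(\d{1,2})[/\-\.](\d{1,2})[/\-\.](\d{4})",
        r"(January|February|March|April|May|June|July|August|September|October|November|December)\s+\d{1,2},?\s+\d{4}",
    ]
    .iter()
    .map(|p| Regex::new(p).expect("hard-coded pattern is valid"))
    .collect();

    // Scan from the top of the page and return the first match.
    for line in lines {
        for re in &patterns {
            if let Some(cap) = re.captures(line) {
                return Some(normalize_date(&cap));
            }
        }
    }
    None
}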
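
`normalize_date` is not shown in the post. Judging by the filenames in the Output section, it ends up with a `YYYYMMDD` string, so the last step could look roughly like this. Both helper names are hypothetical, and the sketch uses only the standard library.

```rust
/// Map an English month name to its number (1-12). Hypothetical helper.
fn month_from_name(name: &str) -> Option<u32> {
    const MONTHS: [&str; 12] = [
        "january", "february", "march", "april", "may", "june",
        "july", "august", "september", "october", "november", "december",
    ];
    let lower = name.to_lowercase();
    MONTHS.iter().position(|m| *m == lower).map(|i| i as u32 + 1)
}

/// Format date parts as `YYYYMMDD`. Hypothetical helper.
/// Only a coarse range check: it does not know how long each month is.
fn format_ymd(year: u32, month: u32, day: u32) -> Option<String> {
    if !(1..=12).contains(&month) || !(1..=31).contains(&day) {
        return None;
    }
    Some(format!("{:04}{:02}{:02}", year, month, day))
}
```

One thing a real implementation has to decide: the second pattern (`d/d/yyyy`) cannot tell day-first from month-first on its own. A value above 12 in one slot settles it; otherwise a locale default is needed.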

Document type classification

A small keyword list covers most business documents:

pub fn classify_document(lines: &[&str]) -> &'static str {
    let text = lines.join(" ").to_lowercase();

    if text.contains("invoice") || text.contains("請求書") { return "invoice"; }
    if text.contains("contract") || text.contains("契約書") { return "contract"; }
    if text.contains("receipt") || text.contains("領収書") { return "receipt"; }
    if text.contains("report") || text.contains("報告書") { return "report"; }
    if text.contains("minutes") || text.contains("議事録") { return "minutes"; }

    "document"
}
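
The same first-match-wins logic can also be written as a table, which makes the priority order explicit and easier to extend. This is an alternative sketch, not the code the app uses; `DOC_TYPES` and `classify_text` are made-up names.

```rust
/// (label, keywords) pairs, checked in order; the first hit wins.
const DOC_TYPES: &[(&str, &[&str])] = &[
    ("invoice", &["invoice", "請求書"]),
    ("contract", &["contract", "契約書"]),
    ("receipt", &["receipt", "領収書"]),
    ("report", &["report", "報告書"]),
    ("minutes", &["minutes", "議事録"]),
];

/// Classify by plain substring search over the lowercased text.
pub fn classify_text(text: &str) -> &'static str {
    let lower = text.to_lowercase();
    DOC_TYPES
        .iter()
        .find(|(_, keywords)| keywords.iter().any(|k| lower.contains(*k)))
        .map(|(label, _)| *label)
        .unwrap_or("document")
}
```

Because this is substring matching, a report that merely mentions "invoices" is classified as an invoice, and "reported" counts as "report". Ordering the table from most to least specific is the cheap way to keep that under control.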

Output

scan_20260101_0042.pdf  →  20260115_invoice_ABC-Corp.pdf
doc_final_v3.pdf        →  20260203_contract_NDA.pdf
untitled.pdf            →  20260310_report.pdf
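
The post does not show how the pieces are joined, so here is one way to get names in the shape above, assuming parts are separated by underscores and spaces inside a part become hyphens. `sanitize_component` and `build_filename` are hypothetical names.

```rust
/// Replace characters that are unsafe in filenames with spaces,
/// then collapse every run of whitespace into a single hyphen.
fn sanitize_component(raw: &str) -> String {
    let cleaned: String = raw
        .chars()
        .map(|c| match c {
            '/' | '\\' | ':' | '*' | '?' | '"' | '<' | '>' | '|' => ' ',
            c if c.is_control() => ' ',
            c => c,
        })
        .collect();
    cleaned.split_whitespace().collect::<Vec<_>>().join("-")
}

/// Join the parts that are present with underscores and append `.pdf`.
pub fn build_filename(date: Option<&str>, doc_type: &str, label: Option<&str>) -> String {
    let parts: Vec<String> = [date, Some(doc_type), label]
        .iter()
        .flatten() // skip the missing parts
        .map(|s| sanitize_component(s))
        .filter(|p| !p.is_empty())
        .collect();
    format!("{}.pdf", parts.join("_"))
}
```

`build_filename(Some("20260115"), "invoice", Some("ABC Corp"))` gives `20260115_invoice_ABC-Corp.pdf`. The sketch does not handle two files mapping to the same name; a real tool would need a suffix or a check before renaming.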

Not perfect. Works well enough that users stop caring about the originals.


Hiyoko PDF Vault → https://hiyokoko.gumroad.com/l/HiyokoPDFVault
X → @hiyoyok
