Tighnari

Posted on Feb 23

From IMG_4382.jpg to Invoice_Acme_2024-03.pdf: Building a Content-Aware Renaming Pipeline

#ai #machinelearning #devops #python

Plug in a flatbed scanner and watch what happens to your filenames. Every document gets named Scan0047.pdf. Photos leave the camera as IMG_4382.jpg. Screenshots pile up as Screenshot 2024-03-14 at 09.42.17.png. Within a week, a Downloads folder turns into a graveyard of meaningless names attached to files that might be anything.

The naive fix is a renaming rule. "Anything prefixed with Scan goes into /documents/scans/." That works until your scanner firmware updates and starts outputting IMG prefixes. Or until you add a second scanner. Rule-based approaches collapse because they operate on filenames, and filenames carry exactly zero semantic information about what's inside the file.

This post walks through the engineering approach we use to solve this: a content-aware renaming pipeline that reads the document, understands what it is, and generates a meaningful name from the content itself.

Why filename metadata is a dead end

Before getting into the solution, it helps to be precise about why this problem is stubborn.

Modern file systems give you: filename, creation date, modification date, file size, MIME type. None of those fields tells you whether a PDF is a tax return, an NDA, or a pizza receipt. MIME type gets you application/pdf. That's the same for all three.

For images, EXIF data adds GPS coordinates, camera model, and focal length. Useful for photographers. Not useful when you're trying to identify a photo of a whiteboard with meeting notes on it.

The only reliable signal about what a file contains is the file contents. Which means you have to read it.

The OCR pipeline

If your input is a native PDF with embedded text (something exported from Word or Google Docs), text extraction is straightforward. pdfplumber, PyMuPDF, and pdfminer all get you there in a few lines. The hard cases are scanned documents and images.

For those, you need OCR. And before you run OCR, you need to preprocess the image. Raw scanner output is noisy in ways that destroy recognition accuracy.

Preprocessing steps that actually matter: convert to grayscale, deskew to correct page rotation (scanners are never perfectly aligned), binarize to black and white to remove noise and uneven lighting, then strip borders and scanner artifacts.

The deskew step gets underestimated. A 2-degree tilt can visibly cut Tesseract word-level accuracy on dense text. Here is the preprocessing function we use:

import cv2
import numpy as np

def preprocess_for_ocr(image_path: str) -> np.ndarray:
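    img = cv2.imread(image_path)
    if img is None:
        # cv2.imread returns None (it does not raise) on a missing or unreadable file
        raise FileNotFoundError(f"Could not read image: {image_path}")
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)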

    # Deskew via Hough line detection
    edges = cv2.Canny(gray, 50, 150, apertureSize=3)
    lines = cv2.HoughLines(edges, 1, np.pi / 180, 200)

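    # HoughLines reports theta, the angle of each line's normal, so horizontal
    # text baselines sit near pi/2. Keep only near-horizontal lines so table
    # borders and page edges (theta near 0 or pi) do not skew the median.
    # The 15-degree window is an assumption; widen it for badly rotated scans.
    angles = []
    if lines is not None:
        angles = [t for t in lines[:, 0, 1] if abs(t - np.pi / 2) < np.radians(15)]
    if angles:
        median_angle = np.median(angles) - np.pi / 2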
        (h, w) = gray.shape
        center = (w // 2, h // 2)
        M = cv2.getRotationMatrix2D(center, np.degrees(median_angle), 1.0)
        gray = cv2.warpAffine(
            gray, M, (w, h),
            flags=cv2.INTER_CUBIC,
            borderMode=cv2.BORDER_REPLICATE
        )

    # Adaptive threshold handles uneven lighting across the scan
    binary = cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
        cv2.THRESH_BINARY, 11, 2
    )
    return binary

Once the image is clean, you run OCR and capture confidence scores alongside the text:

import pytesseract
from dataclasses import dataclass, field

@dataclass
class OCRResult:
    text: str
    confidence: float
    word_boxes: list[dict] = field(default_factory=list)

def extract_text(preprocessed_img: np.ndarray, lang: str = "eng") -> OCRResult:
    data = pytesseract.image_to_data(
        preprocessed_img,
        lang=lang,
        output_type=pytesseract.Output.DICT,
        config="--psm 3"  # Auto page segmentation
    )

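    lines: dict[tuple[int, int, int], list[str]] = {}
    confidences = []

    for i, word in enumerate(data["text"]):
        # float() rather than int(): some Tesseract versions report fractional confidences
        conf = float(data["conf"][i])
        if word.strip() and conf > 0:
            # Group words by (block, paragraph, line) so line breaks survive;
            # the extraction patterns later rely on "\n" and "^" anchors.
            key = (data["block_num"][i], data["par_num"][i], data["line_num"][i])
            lines.setdefault(key, []).append(word)
            confidences.append(conf)

    avg_confidence = sum(confidences) / len(confidences) if confidences else 0

    return OCRResult(
        text="\n".join(" ".join(line_words) for line_words in lines.values()),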
        confidence=avg_confidence / 100,
        word_boxes=[
            {
                "text": data["text"][i],
                "conf": data["conf"][i],
                "x": data["left"][i],
                "y": data["top"][i],
            }
            for i in range(len(data["text"]))
            if data["text"][i].strip()
        ],
    )

The confidence score is your branching point. We trust Tesseract at 0.85 and above. Below 0.60, we route to a cloud OCR API such as Google Document AI or AWS Textract. In between, we apply additional preprocessing passes and re-run OCR before making the call.
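The branching logic can be sketched as a small pure function. This is a minimal illustration, not the production code: the names `OCRRoute` and `route_by_confidence` are hypothetical, and the boundaries are treated as inclusive at 0.85 and exclusive at 0.60, which is an assumption about how the edge cases are handled.

```python
from enum import Enum


class OCRRoute(Enum):
    ACCEPT = "accept"          # use the Tesseract result as-is
    REPROCESS = "reprocess"    # extra preprocessing passes, then re-run OCR
    CLOUD_OCR = "cloud_ocr"    # hand off to a cloud OCR API


def route_by_confidence(
    confidence: float, trust: float = 0.85, floor: float = 0.60
) -> OCRRoute:
    """Map an average OCR confidence in [0, 1] to the next pipeline step."""
    if confidence >= trust:
        return OCRRoute.ACCEPT
    if confidence < floor:
        return OCRRoute.CLOUD_OCR
    return OCRRoute.REPROCESS
```

Keeping the thresholds as parameters makes it easy to tune them per document source without touching the routing code.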

Where vision models change things

OCR gives you text. Vision models give you understanding. That distinction is worth sitting with for a moment.

If you feed a scanned receipt to Tesseract, you get a blob of text: prices, a merchant name, some line items, a date. What Tesseract does not tell you is "this is a receipt." You have to figure that out from the extracted text itself, which means writing heuristics that are fragile and specific.

Vision models look at the whole document image and build a representation that fuses visual layout with content semantics. They can classify a document as a receipt, invoice, contract, or driver's license before reading a single word, because the visual structure of these document types is distinct. A receipt looks like a receipt. An invoice has a specific spatial layout with a header, line-item table, and totals block.

That classification matters a lot for renaming. If you know the document type upfront, you know which entities to prioritize. Receipts: merchant name, date, total. Invoices: vendor, invoice number, due date. Contracts: parties, effective date, agreement type. Different types need different templates.

We run classification on the raw image in parallel with OCR, then merge the results:

import anthropic
import base64
import json
from pathlib import Path

DOCUMENT_TYPES = [
    "invoice", "receipt", "contract", "tax_form", "bank_statement",
    "id_document", "medical_record", "letter", "form", "report", "other",
]

def classify_document(image_path: str) -> dict:
    client = anthropic.Anthropic()

    image_data = base64.standard_b64encode(
        Path(image_path).read_bytes()
    ).decode("utf-8")

    suffix = Path(image_path).suffix.lower()
    media_type_map = {
        ".jpg": "image/jpeg",
        ".jpeg": "image/jpeg",
        ".png": "image/png",
        ".webp": "image/webp",
        ".pdf": "image/jpeg",  # PDFs rendered to JPEG before this step
    }
    media_type = media_type_map.get(suffix, "image/jpeg")

    response = client.messages.create(
        model="claude-opus-4-5",
        max_tokens=256,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        },
                    },
                    {
                        "type": "text",
                        "text": (
                            "Classify this document. Reply with JSON only:\n"
                            # Interpolate the allowed labels; the model cannot see the Python list
                            '{"type": "<one of: ' + ", ".join(DOCUMENT_TYPES) + '>",'
                            '"confidence": <0.0-1.0>,'
                            '"language": "<ISO 639-1 code>"}'
                        ),
                    },
                ],
            }
        ],
    )

    return json.loads(response.content[0].text)

The classification result feeds directly into entity extraction. You get the document type, you select the right template, and you extract the right fields.
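Before the classification result is used downstream, it is worth validating it, since a model reply is not guaranteed to be clean JSON or to stay inside the allowed label set. A minimal sketch, assuming a hypothetical helper `parse_classification` and a fallback of `"other"` with zero confidence (both are illustrative choices, not part of the pipeline described above):

```python
import json
import re

# Mirrors the DOCUMENT_TYPES list from the post; repeated here so the
# sketch is self-contained.
ALLOWED_TYPES = (
    "invoice", "receipt", "contract", "tax_form", "bank_statement",
    "id_document", "medical_record", "letter", "form", "report", "other",
)


def parse_classification(raw: str) -> dict:
    """Extract and validate the classification JSON from a model reply."""
    fallback = {"type": "other", "confidence": 0.0, "language": None}

    # Grab the outermost {...} span so surrounding prose or fences are ignored.
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return fallback
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return fallback
    if not isinstance(parsed, dict):
        return fallback

    doc_type = parsed.get("type")
    if doc_type not in ALLOWED_TYPES:
        doc_type = "other"

    try:
        confidence = float(parsed.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]

    return {
        "type": doc_type,
        "confidence": confidence,
        "language": parsed.get("language"),
    }
```

Unknown labels degrade to `"other"`, which sends the file down the generic naming path instead of raising mid-batch.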

From extracted text to a meaningful filename

This is the part that looks simple and isn't. Once you have OCR text and a document type, you run type-specific regex patterns against the text and compose a filename from whatever entities resolve successfully. The key engineering decision is building a fallback chain so you always get something useful, even when entity extraction partially fails.

import re
from datetime import datetime

EXTRACTION_TEMPLATES = {
    "invoice": {
        "patterns": {
            "vendor": r"(?:from|vendor|bill from|billed by)[:\s]+([A-Z][A-Za-z\s&,\.]+?)(?:\n|,|\d)",
            "invoice_number": r"(?:invoice\s*#?|inv\.?\s*#?)[:\s]*([A-Z0-9\-]+)",
            "date": r"(?:date|issued)[:\s]*(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4}|\w+\s+\d{1,2},?\s*\d{4})",
        },
        "template": "{vendor}_{date}_Invoice_{invoice_number}",
        "fallback": "Invoice_{date}",
    },
    "receipt": {
        "patterns": {
            "merchant": r"^([A-Z][A-Za-z\s&]+?)(?:\n|LLC|Inc|Corp|Ltd)",
            "date": r"(\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})",
        },
        "template": "Receipt_{merchant}_{date}",
        "fallback": "Receipt_{date}",
    },
    "contract": {
        "patterns": {
            "parties": r"(?:between|by and between)\s+(.+?)\s+and\s+(.+?)(?:\n|,|\()",
            "date": r"(?:dated|effective|as of)[:\s]*(\w+\s+\d{1,2},?\s*\d{4}|\d{1,2}[\/\-]\d{1,2}[\/\-]\d{2,4})",
            "type": r"(non-disclosure|employment|service|licensing|partnership)\s+agreement",
        },
        "template": "{type}_Agreement_{parties}_{date}",
        "fallback": "Contract_{date}",
    },
}

def normalize_date(raw_date: str) -> str:
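    # Month-first (US) ordering is assumed for numeric dates. Both comma and
    # no-comma variants are listed because the date patterns accept either.
    formats = [
        "%m/%d/%Y", "%m-%d-%Y", "%m/%d/%y", "%m-%d-%y",
        "%B %d, %Y", "%B %d %Y", "%b %d, %Y", "%b %d %Y",
    ]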
    for fmt in formats:
        try:
            return datetime.strptime(raw_date.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return raw_date.replace("/", "-").replace(" ", "-")

def sanitize_filename(name: str) -> str:
    name = re.sub(r'[<>:"/\\|?*\x00-\x1f]', "", name)
    name = re.sub(r"\s+", "_", name.strip())
    name = re.sub(r"_+", "_", name)
    return name[:200]

def generate_filename(ocr_text: str, doc_type: str, original_ext: str) -> str:
    config = EXTRACTION_TEMPLATES.get(doc_type, {})

    if not config:
        first_line = next(
            (line.strip() for line in ocr_text.split("\n") if len(line.strip()) > 5),
            "document",
        )
        return sanitize_filename(first_line[:50]) + original_ext

    extracted = {}
    for field_name, pattern in config["patterns"].items():
        match = re.search(pattern, ocr_text, re.IGNORECASE | re.MULTILINE)
        if match:
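            # Join every capture group: the contract "parties" pattern has two,
            # and group(1) alone would silently drop the second party.
            value = "_".join(g.strip() for g in match.groups() if g)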
            if field_name == "date":
                value = normalize_date(value)
            extracted[field_name] = sanitize_filename(value)

    try:
        name = config["template"].format(**extracted)
    except KeyError:
        try:
            name = config["fallback"].format(**extracted)
        except KeyError:
            name = f"{doc_type}_{datetime.now().strftime('%Y-%m-%d')}"

    return sanitize_filename(name) + original_ext

Always have a fallback. If entity extraction fails on a low-quality scan, you want something better than Scan0047.pdf even if you cannot get Acme_Inc_2024-03-14_Invoice_INV-4421.pdf. The fallback chain gives you graceful degradation at every level.

Pipeline architecture

Here is the full flow as it runs in production at renamer.ai, where this exact pipeline handles batch renaming across web, Windows, and Mac:

File Input
    |
    v
Format Detection
    +-- Native PDF/DOCX ---------> Text Extraction (skip OCR)
    |                                       |
    +-- Image / Scanned PDF --> Preprocessing Pipeline
                                            |
                               OCR Engine (Tesseract)
                                            |
                         +-----------------+------------------+
                  conf >= 0.85        0.60-0.85          conf < 0.60
                         |               |                   |
                   Use result    Extra preprocessing    Cloud OCR API
                         |               |                   |
                         +---------------+-------------------+
                                         |
                              Vision Classification
                              (document type + language)
                                         |
                              Entity Extraction
                              (type-specific patterns)
                                         |
                              Name Generation
                              (template -> fallback chain)
                                         |
                              Conflict Resolution
                              (append _v2, _v3 if name exists)
                                         |
                              Rename + Audit Log

The Conflict Resolution step is easy to skip in early development and painful to bolt on later. When you process 500 files in a batch, you will hit cases where two invoices from the same vendor on the same date generate identical filenames. You need deterministic tie-breaking before any writes touch disk.
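A minimal sketch of that tie-breaking, assuming a hypothetical `resolve_conflict` helper and an in-memory set of names already claimed in the target folder. Names are compared in lowercase on the assumption that the target filesystem is case-insensitive, as the Windows and macOS defaults are.

```python
from pathlib import PurePath


def resolve_conflict(name: str, taken: set[str]) -> str:
    """Return `name`, or `name` with _v2, _v3, ... before the extension.

    `taken` holds lowercase names already claimed; it is updated in place
    so the next call sees this result.
    """
    path = PurePath(name)
    stem, ext = path.stem, path.suffix

    candidate = name
    version = 2
    while candidate.lower() in taken:
        candidate = f"{stem}_v{version}{ext}"
        version += 1

    taken.add(candidate.lower())
    return candidate
```

The result only depends on the order in which files are processed, so sorting the batch (for example by original filename) before resolving names keeps reruns reproducible.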

Handling the hard cases

Blurry or low-resolution scans are the most common problem. If Tesseract confidence drops below 0.40, no preprocessing pass will recover it. Our fix: upscale with a super-resolution model (Real-ESRGAN works well for document scans) before the preprocessing step when initial confidence is low. It adds latency but meaningfully improves accuracy on fax-quality documents.

Handwritten notes are a different problem. Standard Tesseract handles handwriting poorly. We route these to a dedicated model such as Google Document AI's handwriting mode or a fine-tuned TrOCR. These paths are slower and more expensive, so you want to detect handwriting early and only invoke them when you need to. Visual classifiers can distinguish handwritten from printed documents reliably before any OCR runs.

Mixed-language documents come up more often than you'd expect. A contract with English headers and Spanish body text is common in cross-border work. Tesseract handles multi-language extraction with -l eng+spa (lang="eng+spa" in pytesseract), but your entity extraction patterns also need to be language-aware. Detect the primary language at the classification stage and select the matching pattern set.
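The classifier reports ISO 639-1 codes, while Tesseract expects its own three-letter traineddata names joined with `+`. A small sketch of the translation, assuming a hypothetical `tesseract_lang_arg` helper and a deliberately short mapping table that you would extend for the languages you actually install:

```python
# ISO 639-1 code -> Tesseract traineddata name. Only a few entries are
# listed here for illustration.
ISO_TO_TESSERACT = {"en": "eng", "es": "spa", "fr": "fra", "de": "deu"}


def tesseract_lang_arg(iso_codes: list[str], default: str = "eng") -> str:
    """Build a Tesseract language string such as "eng+spa"."""
    selected: list[str] = []
    for code in iso_codes:
        tess = ISO_TO_TESSERACT.get(code.lower())
        # Skip unknown codes and duplicates, preserving first-seen order.
        if tess and tess not in selected:
            selected.append(tess)
    return "+".join(selected) if selected else default
```

The returned string can be passed as the `lang` argument of the OCR call; the same detected code can key a per-language set of extraction patterns.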

Photos that aren't documents will always make it into your pipeline. Vision models handle this cleanly: they recognize that a photograph of a coastline is not a document and return a flag rather than a confused filename.

Performance at scale

If you're processing thousands of files, throughput becomes the binding constraint. A few things that actually move the needle:

Run OCR and vision classification in parallel using asyncio with thread pool executors. OCR is CPU-bound; classification API calls are I/O-bound. If both share one pool, slow API calls can occupy workers that OCR needs (or the reverse), so give each its own executor and size the two pools independently.
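A rough sketch of that layout, with stub functions standing in for the real OCR and classification steps. The function names and pool sizes are illustrative assumptions. Threads are enough for the OCR pool here because pytesseract runs the Tesseract binary as a subprocess, so the heavy work happens outside the Python interpreter.

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor


def run_ocr(path: str) -> str:
    # Stub: the real version would preprocess the image and call Tesseract.
    return f"ocr-text-for-{path}"


def classify(path: str) -> str:
    # Stub: the real version would call the vision model API.
    return "invoice"


async def process_file(
    path: str, ocr_pool: ThreadPoolExecutor, api_pool: ThreadPoolExecutor
) -> dict:
    loop = asyncio.get_running_loop()
    # Submit both jobs immediately so they overlap, then wait for both.
    ocr_job = loop.run_in_executor(ocr_pool, run_ocr, path)
    cls_job = loop.run_in_executor(api_pool, classify, path)
    text, doc_type = await asyncio.gather(ocr_job, cls_job)
    return {"path": path, "text": text, "type": doc_type}


async def process_batch(
    paths: list[str], ocr_workers: int = 4, api_workers: int = 16
) -> list[dict]:
    # Separate pools: a small one sized to CPU cores for OCR, a larger one
    # for API calls that mostly wait on the network.
    with ThreadPoolExecutor(max_workers=ocr_workers) as ocr_pool:
        with ThreadPoolExecutor(max_workers=api_workers) as api_pool:
            jobs = (process_file(p, ocr_pool, api_pool) for p in paths)
            return await asyncio.gather(*jobs)
```

Calling `asyncio.run(process_batch(paths))` returns one result per input, in input order.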

Cache document type classifications. If you're processing 200 files from the same client folder, a local cache keyed on image embeddings avoids redundant API calls for visually similar document layouts.

For large PDFs, only process the first two or three pages for naming. The information you need is almost always in the header section of page one. Processing a 200-page contract end to end is wasteful.

One non-obvious bottleneck: filesystem stat() calls during conflict resolution. If you're checking filenames against disk for every file in a batch, that adds up. Build an in-memory name registry at the start of the job and check against that instead.

What the failure modes actually look like

The hardest case is a document that gives you text but the wrong text. A photocopy of a photocopy of a fax from 1994 might OCR to mostly garbage with a confidence score of 0.72 -- high enough that you skip the cloud fallback, low enough that extracted entities are wrong. Your pipeline produces a confidently wrong filename: Smith_Associates_1994-03-15_Invoice_INV-872X.pdf when the vendor is Smithfield & Associates and the invoice number ends in 4, not X.

This is why audit logs are non-negotiable. Every rename operation should write the original filename, the generated name, confidence scores, and extracted entities to a log. When the pipeline gets something wrong, you need to trace exactly which step failed.
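One simple shape for such a log is JSON Lines: one self-contained record per rename, appended to a file. The sketch below is an assumption about format rather than a description of the production log, and the field names and the `write_audit_entry` helper are hypothetical.

```python
import io
import json
from datetime import datetime, timezone
from typing import TextIO


def write_audit_entry(
    log: TextIO,
    original_name: str,
    generated_name: str,
    ocr_confidence: float,
    doc_type: str,
    entities: dict,
) -> dict:
    """Append one rename record to `log` as a single JSON line."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "original_name": original_name,
        "generated_name": generated_name,
        "ocr_confidence": round(ocr_confidence, 3),
        "doc_type": doc_type,
        "entities": entities,
    }
    log.write(json.dumps(entry, ensure_ascii=False) + "\n")
    return entry


# Usage with an in-memory buffer; in practice `log` would be a file
# opened in append mode.
buffer = io.StringIO()
write_audit_entry(
    buffer,
    original_name="Scan0047.pdf",
    generated_name="Smith_Associates_1994-03-15_Invoice_INV-872X.pdf",
    ocr_confidence=0.72,
    doc_type="invoice",
    entities={"vendor": "Smith_Associates", "invoice_number": "INV-872X"},
)
```

Because each line parses on its own, a bad rename can be traced with a plain text search for the original filename, and the record doubles as the data needed to undo the operation.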

If you want to see this pipeline running on real documents, renamer.ai handles batch renaming across web, Windows, and Mac and lets you review generated names before anything commits to disk.

Conclusion

The rough priority order if you're building this from scratch:

1. Native PDF and DOCX text extraction first -- immediate wins, no OCR complexity
2. Tesseract for scanned documents, with confidence thresholds
3. Document type classification (even a rule-based classifier improves entity extraction significantly)
4. Entity extraction templates for your top three to five document types
5. Cloud OCR fallback for low-confidence results
6. Vision model classification for ambiguous cases
7. Batch throughput optimization once accuracy is solid

Do not start at step 7. Optimizing a pipeline that generates wrong filenames just produces wrong filenames faster. Get accuracy right first, then make it fast.

The core flow -- preprocess, OCR, classify, extract, template, fallback -- is stable enough to build on incrementally. Each stage can be improved in isolation without requiring rewrites downstream. That is the version worth building.

DEV Community