Archit Mittal

Posted on May 1 • Originally published at architmittal.com

Build a Bulk PDF Invoice Renamer in 55 Lines of Python

#python #automation #productivity #tutorial

Last month, my CA messaged me at 11pm: "Bhai, send all April invoices in a single zip, named properly." I had 47 PDF invoices in my Downloads folder, all with names like Invoice (3).pdf, download.pdf, and INV_FINAL_FINAL_v2.pdf.

Two hours of manual renaming wasn't on my agenda. So I wrote a Python script. By the time my chai was done, the script was done. 47 invoices renamed, audit-ready.

Here's the entire thing — 55 lines, no fluff.

What it does

The script scans a folder of PDF invoices and renames each file to a clean, sortable format:

2026-04-15_AcmeServices_INV-1042_15000.pdf

It pulls four things from each invoice: the invoice date, vendor name (from the GSTIN line), invoice number, and total amount in ₹. If a field is missing, it falls back to a placeholder so nothing gets lost.

The Code

import re
import sys
from pathlib import Path
from datetime import datetime
import pdfplumber

DATE_RE = re.compile(r"(\d{1,2})[\-/\s]([A-Za-z]{3,9}|\d{1,2})[\-/\s](\d{2,4})")
GSTIN_RE = re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][A-Z\d]Z[A-Z\d]\b")
INV_RE = re.compile(r"(?:Invoice\s*(?:No\.?|#|Number)\s*[:\-]?\s*)([A-Z0-9\-/]+)", re.I)
AMT_RE = re.compile(r"(?:Total|Grand\s*Total|Amount\s*Payable)[^\d₹]{0,15}₹?\s*([\d,]+\.?\d{0,2})", re.I)

def parse_date(text):
    for m in DATE_RE.finditer(text):  # try every date-like match, not just the first
        raw = m.group(0)
        for fmt in ("%d-%m-%Y", "%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%d-%b-%Y"):
            try:
                return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
    return None

def vendor_from_gstin(text):
    m = GSTIN_RE.search(text)
    if not m: return None
    line = next((l for l in text.splitlines() if m.group(0) in l), "")
    parts = [p.strip() for p in line.split(m.group(0)) if p.strip()]
    return re.sub(r"[^A-Za-z0-9]+", "", parts[0])[:20] if parts else None

def extract(pdf_path):
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join((p.extract_text() or "") for p in pdf.pages[:2])
    inv = INV_RE.search(text)
    amt = AMT_RE.search(text)
    return {
        "date": parse_date(text) or "NoDate",
        "vendor": vendor_from_gstin(text) or "UnknownVendor",
        "invoice": (inv.group(1) if inv else "NoInv").replace("/", "-"),
        "amount": (amt.group(1).replace(",", "").split(".")[0] if amt else "0"),
    }

def safe_rename(src, new_name):
    dest = src.with_name(f"{new_name}.pdf")
    n = 1
    while dest.exists() and dest != src:
        dest = src.with_name(f"{new_name}_{n}.pdf")
        n += 1
    src.rename(dest)
    return dest.name

def main(folder):
    folder = Path(folder)
    for pdf in sorted(folder.glob("*.pdf")):  # snapshot the listing before renaming anything
        try:
            f = extract(pdf)
            new = f"{f['date']}_{f['vendor']}_{f['invoice']}_{f['amount']}"
            renamed = safe_rename(pdf, new)
            print(f"OK {pdf.name} -> {renamed}")
        except Exception as e:
            print(f"FAIL {pdf.name}: {e}")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")

How it works

Four regexes do most of the heavy lifting.

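The date regex matches common Indian invoice formats: 15/04/2026, 15-Apr-2026, 15 April 2026. The parser tries five strptime formats and keeps the first that sticks. Two-digit years such as 15/04/26 match the regex but none of the five formats (they all use %Y), so a date written that way is not picked up.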
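To see which inputs get through, here is a small sketch that runs the same regex and format list over a few made-up sample lines. The helper name to_iso and the samples are mine, purely for illustration.

```python
import re
from datetime import datetime

DATE_RE = re.compile(r"(\d{1,2})[\-/\s]([A-Za-z]{3,9}|\d{1,2})[\-/\s](\d{2,4})")
FORMATS = ("%d-%m-%Y", "%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%d-%b-%Y")

def to_iso(raw):
    """Return YYYY-MM-DD for the first format that parses raw, else None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None

# Made-up sample lines, not taken from real invoices.
samples = ["Date: 15/04/2026", "Dated 15-Apr-2026", "15 April 2026", "Date: 15/04/26"]

results = {}
for s in samples:
    m = DATE_RE.search(s)
    results[s] = to_iso(m.group(0)) if m else None
    print(f"{s!r:22} -> {results[s]}")
```

The first three samples come out as 2026-04-15. The last one matches the regex but returns None, because %Y expects a four-digit year. If your vendors use two-digit years, adding "%d/%m/%y" and "%d-%m-%y" to the format list is probably the smallest fix.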

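The GSTIN regex matches the 15-character GST identifier (state code + PAN + entity code + Z + checksum). Once found, the script takes the text that shares a line with it (whatever comes before the GSTIN, or after it if nothing comes before), strips everything that isn't a letter or digit, and keeps the first 20 characters as the vendor name. A label such as "GSTIN:" on that line ends up in the name too. Crude, but it works in about 90% of the real invoices I've seen.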
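If the leftover label bothers you, one possible refinement is to drop a trailing "GSTIN" / "GST No." label before cleaning the name. This is a sketch, not what the script above does: vendor_from_line and LABEL_RE are hypothetical names, it only looks at the text before the GSTIN, and the GSTIN in the samples is an illustrative value.

```python
import re

GSTIN_RE = re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][A-Z\d]Z[A-Z\d]\b")
# Hypothetical pattern: a "GST"/"GSTIN" label, optionally followed by
# "No"/"Number" and a separator, sitting right before the GSTIN.
LABEL_RE = re.compile(r"\bGST(?:IN)?\s*(?:No\.?|Number)?\s*[:\-#]?\s*$", re.I)

def vendor_from_line(line):
    """Return a cleaned vendor name from the text before the GSTIN, or None."""
    m = GSTIN_RE.search(line)
    if not m:
        return None
    before = LABEL_RE.sub("", line[:m.start()])        # drop the trailing label
    name = re.sub(r"[^A-Za-z0-9]+", "", before)[:20]   # same cleanup as the script
    return name or None

print(vendor_from_line("Acme Services GSTIN: 27AAPFU0939F1ZV"))
print(vendor_from_line("GSTIN: 27AAPFU0939F1ZV"))
```

The first call prints AcmeServices, the second prints None because nothing is left once the label is gone. In the second case you would still want the "UnknownVendor" fallback or a look at neighbouring lines.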

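The invoice number regex looks for "Invoice No", "Invoice #", or "Invoice Number" followed by an alphanumeric token. The amount regex scans for "Total", "Grand Total", or "Amount Payable" followed by a ₹ value, then strips commas and drops the decimals so the filename carries a plain whole-rupee figure. One catch: the match is case-insensitive with no word boundary, so "Total" also hits the tail of "Subtotal", and the first match in the text wins.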
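Here is a quick sketch of both regexes on a made-up invoice snippet. The helper whole_rupees and the sample text are mine. It also compares the first amount match with the last one, which is one heuristic you could try if your invoices print a subtotal above the grand total.

```python
import re

INV_RE = re.compile(r"(?:Invoice\s*(?:No\.?|#|Number)\s*[:\-]?\s*)([A-Z0-9\-/]+)", re.I)
AMT_RE = re.compile(r"(?:Total|Grand\s*Total|Amount\s*Payable)[^\d₹]{0,15}₹?\s*([\d,]+\.?\d{0,2})", re.I)

def whole_rupees(raw):
    """'15,000.00' -> '15000' (commas removed, decimals dropped)."""
    return raw.replace(",", "").split(".")[0]

# Made-up invoice text for illustration.
text = """Invoice No: INV/1042
Subtotal: ₹12,711.86
IGST 18%: ₹2,288.14
Grand Total: ₹15,000.00"""

invoice = INV_RE.search(text).group(1).replace("/", "-")
first_amount = whole_rupees(AMT_RE.search(text).group(1))   # what search() picks
last_amount = whole_rupees(AMT_RE.findall(text)[-1])        # last match in the text

print(invoice, first_amount, last_amount)
```

On this sample the invoice number comes out as INV-1042, the first match is the subtotal (12711) and the last match is the grand total (15000). Taking the last match is only a guess about layout, so check it against a few of your own invoices before relying on it.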

The script only reads the first two pages of each PDF (pdf.pages[:2]). Invoice headers and totals almost always appear there, and skipping deeper pages keeps it fast. On my 47-file folder it finished in 8 seconds.

Running it

pip install pdfplumber
python rename_invoices.py ~/Downloads/april_invoices

Output:

OK Invoice (3).pdf -> 2026-04-15_AcmeServices_INV-1042_15000.pdf
OK download.pdf -> 2026-04-18_TataCloud_TC-2026-099_47200.pdf
OK INV_FINAL_FINAL_v2.pdf -> 2026-04-22_FreshworksIndia_FW8821_8400.pdf

Edge cases I hit (and you will too)

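Scanned PDFs. pdfplumber returns empty text for image-only PDFs. Nothing raises, so the try/except in main never fires: every field falls back to its placeholder and the file ends up as NoDate_UnknownVendor_NoInv_0.pdf (with _1, _2 suffixes for the next ones). To actually read them you'll need OCR (Tesseract via pytesseract). I keep a separate folder for scanned ones.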
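One way to keep image-only files out of the rename step is to check how much text came back before building a name. A minimal sketch, assuming a hypothetical helper looks_scanned and an arbitrary threshold of 20 non-whitespace characters:

```python
def looks_scanned(text, min_chars=20):
    """Heuristic: treat a PDF as image-only if almost no text was extracted.

    min_chars is an arbitrary threshold, not something pdfplumber defines.
    """
    return len("".join(text.split())) < min_chars

print(looks_scanned(""))                                    # empty extraction
print(looks_scanned("Invoice No: INV-1042 Grand Total 15000"))
```

The first call returns True and the second False. Inside extract you could raise ValueError("no text layer") when this returns True. The try/except in main would then print a FAIL line and leave the file under its original name.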

Multi-page invoices with the total on page 3+. Rare in B2B India, but if you hit it, change pdf.pages[:2] to pdf.pages — slower but complete.

Vendors without GSTIN. International vendors won't have one. The fallback "UnknownVendor" gets used; you can extend vendor_from_gstin to also scan the first line of the document.
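A first-line fallback could look roughly like this. The function name vendor_from_first_line is hypothetical, the sample text is made up, and it assumes the vendor name really is the first non-empty line, which won't hold for invoices that open with a "TAX INVOICE" banner.

```python
import re

def vendor_from_first_line(text, max_len=20):
    """Return the first non-empty line, cleaned like the GSTIN-based name."""
    for line in text.splitlines():
        cleaned = re.sub(r"[^A-Za-z0-9]+", "", line)
        if cleaned:
            return cleaned[:max_len]
    return None

# Made-up text from an invoice without a GSTIN.
sample = "\n   \nGlobex Cloud Inc\nInvoice # A-17\n"
print(vendor_from_first_line(sample))
```

This prints GlobexCloudInc. In extract you would chain it after the GSTIN lookup, for example vendor_from_gstin(text) or vendor_from_first_line(text) or "UnknownVendor".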

Duplicate filenames. safe_rename adds a _1, _2 suffix instead of overwriting. I learned this the hard way after losing a Tata Cloud invoice to a clash.
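To watch the suffix logic in action without touching real invoices, this sketch copies safe_rename from the script and runs it on two dummy files in a temporary directory. The file contents are placeholders.

```python
import tempfile
from pathlib import Path

def safe_rename(src, new_name):
    dest = src.with_name(f"{new_name}.pdf")
    n = 1
    # Keep bumping the suffix until the name is free (or is the file itself).
    while dest.exists() and dest != src:
        dest = src.with_name(f"{new_name}_{n}.pdf")
        n += 1
    src.rename(dest)
    return dest.name

with tempfile.TemporaryDirectory() as tmp:
    a = Path(tmp) / "download.pdf"
    b = Path(tmp) / "Invoice (3).pdf"
    a.write_bytes(b"first")    # dummy content, not real PDFs
    b.write_bytes(b"second")

    target = "2026-04-18_TataCloud_TC-2026-099_47200"
    first = safe_rename(a, target)
    second = safe_rename(b, target)
    names = sorted(p.name for p in Path(tmp).iterdir())
    print(first)
    print(second)
```

The first file gets the plain name and the second gets the same name with _1 appended, so both survive.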

Why this format

Sorting by filename gives you a chronological audit trail. Your CA can scan a folder and see April's invoices in date order without opening anything. The amount in the filename is a sanity check — quick visual confirmation when reconciling with bank statements.
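A tiny sketch of both points, using the three filenames from the sample output. Plain string sorting gives date order because the date is zero-padded and comes first, and the amount can be read back from the last underscore-separated field. That last part assumes no _1 collision suffix was added.

```python
from pathlib import Path

names = [
    "2026-04-22_FreshworksIndia_FW8821_8400.pdf",
    "2026-04-15_AcmeServices_INV-1042_15000.pdf",
    "2026-04-18_TataCloud_TC-2026-099_47200.pdf",
]

ordered = sorted(names)  # lexicographic order == chronological order here

# Amount is the last "_" field of the stem (breaks if a _1 suffix was appended).
amounts = [int(Path(n).stem.rsplit("_", 1)[1]) for n in ordered]
total = sum(amounts)

for n in ordered:
    print(n)
print("Total:", total)
```

The total printed here is 70600, which is the number you would compare against the month's outgoing payments.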

Where to go from here

A few extensions I've layered on top of this base script:

Pipe the extracted data straight into a Google Sheet for monthly P&L
Move processed PDFs into year/month subfolders
Flag any invoice where the GSTIN checksum doesn't validate (there's a published algorithm)
Send a Telegram alert if any single invoice crosses a threshold
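For the GSTIN check, this is a sketch of the mod-36, Luhn-style check character scheme as it is commonly described. Treat the details as my assumption and verify them against official GST documentation before using it for anything that matters. The function names are hypothetical and the GSTIN in the example is an illustrative value.

```python
CHARS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def gstin_check_char(first14):
    """Compute the 15th character from the first 14 (assumed mod-36 scheme)."""
    mod = len(CHARS)  # 36
    factor = 2        # rightmost character is weighted 2, then 1, 2, 1, ...
    total = 0
    for ch in reversed(first14):
        product = factor * CHARS.index(ch)
        total += product // mod + product % mod   # fold back into base 36
        factor = 1 if factor == 2 else 2
    return CHARS[(mod - total % mod) % mod]

def gstin_checksum_ok(gstin):
    """True if the string is 15 valid characters and the check character matches."""
    gstin = gstin.strip().upper()
    if len(gstin) != 15 or any(c not in CHARS for c in gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]

print(gstin_checksum_ok("27AAPFU0939F1ZV"))
print(gstin_checksum_ok("27AAPFU0939F1ZA"))
```

Under this scheme the first value passes and the second, with its last character changed, fails. A failed check is worth a manual look rather than an automatic rejection, since the text layer of a PDF can garble a single character.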

55 lines was the goal. Fancier tooling can wait until the basics work.

Follow me on Twitter @automate_archit for daily AI automation tips.

DEV Community