Last month, my CA messaged me at 11pm: "Bhai, send all April invoices in a single zip, named properly." I had 47 PDF invoices in my Downloads folder, all with names like Invoice (3).pdf, download.pdf, and INV_FINAL_FINAL_v2.pdf.
Two hours of manual renaming wasn't on my agenda. So I wrote a Python script. By the time my chai was done, the script was done. 47 invoices renamed, audit-ready.
Here's the entire thing — 55 lines, no fluff.
What it does
The script scans a folder of PDF invoices and renames each file to a clean, sortable format:
2026-04-15_AcmeServices_INV-1042_15000.pdf
It pulls four things from each invoice: the invoice date, vendor name (from the GSTIN line), invoice number, and total amount in ₹. If a field is missing, it falls back to a placeholder so nothing gets lost.
The Code
import re
import sys
from pathlib import Path
from datetime import datetime
import pdfplumber
DATE_RE = re.compile(r"(\d{1,2})[\-/\s]([A-Za-z]{3,9}|\d{1,2})[\-/\s](\d{2,4})")
GSTIN_RE = re.compile(r"\b\d{2}[A-Z]{5}\d{4}[A-Z][A-Z\d]Z[A-Z\d]\b")
INV_RE = re.compile(r"(?:Invoice\s*(?:No\.?|#|Number)\s*[:\-]?\s*)([A-Z0-9\-/]+)", re.I)
AMT_RE = re.compile(r"(?:Total|Grand\s*Total|Amount\s*Payable)[^\d₹]{0,15}₹?\s*([\d,]+\.?\d{0,2})", re.I)
def parse_date(text):
m = DATE_RE.search(text)
if not m: return None
raw = m.group(0)
for fmt in ("%d-%m-%Y", "%d/%m/%Y", "%d %b %Y", "%d %B %Y", "%d-%b-%Y"):
try:
return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
except ValueError:
continue
return None
def vendor_from_gstin(text):
m = GSTIN_RE.search(text)
if not m: return None
line = next((l for l in text.splitlines() if m.group(0) in l), "")
parts = [p.strip() for p in line.split(m.group(0)) if p.strip()]
return re.sub(r"[^A-Za-z0-9]+", "", parts[0])[:20] if parts else None
def extract(pdf_path):
with pdfplumber.open(pdf_path) as pdf:
text = "\n".join((p.extract_text() or "") for p in pdf.pages[:2])
inv = INV_RE.search(text)
amt = AMT_RE.search(text)
return {
"date": parse_date(text) or "NoDate",
"vendor": vendor_from_gstin(text) or "UnknownVendor",
"invoice": (inv.group(1) if inv else "NoInv").replace("/", "-"),
"amount": (amt.group(1).replace(",", "").split(".")[0] if amt else "0"),
}
def safe_rename(src, new_name):
dest = src.with_name(f"{new_name}.pdf")
n = 1
while dest.exists() and dest != src:
dest = src.with_name(f"{new_name}_{n}.pdf")
n += 1
src.rename(dest)
return dest.name
def main(folder):
folder = Path(folder)
for pdf in folder.glob("*.pdf"):
try:
f = extract(pdf)
new = f"{f['date']}_{f['vendor']}_{f['invoice']}_{f['amount']}"
renamed = safe_rename(pdf, new)
print(f"OK {pdf.name} -> {renamed}")
except Exception as e:
print(f"FAIL {pdf.name}: {e}")
if __name__ == "__main__":
main(sys.argv[1] if len(sys.argv) > 1 else ".")
How it works
Four regexes do most of the heavy lifting.
The date regex matches common Indian invoice formats: 15/04/2026, 15-Apr-2026, 15 April 2026. The parser tries five strptime formats and keeps the first that sticks.
The GSTIN regex matches the 15-character GST identifier (state code + PAN + entity code + Z + checksum). Once found, the script grabs the text on the same line and assumes that's the vendor name. Crude but works in 90% of real invoices I've seen.
The invoice number regex looks for "Invoice No", "Invoice #", or "Invoice Number" followed by an alphanumeric token. The amount regex scans for "Total", "Grand Total", or "Amount Payable" followed by a ₹ value, then strips commas and decimals so file sorting works.
pdfplumber only reads the first two pages of each PDF — invoice headers and totals almost always appear there, and skipping deeper pages keeps it fast. On my 47-file folder it finished in 8 seconds.
Running it
pip install pdfplumber
python rename_invoices.py ~/Downloads/april_invoices
Output:
OK Invoice (3).pdf -> 2026-04-15_AcmeServices_INV-1042_15000.pdf
OK download.pdf -> 2026-04-18_TataCloud_TC-2026-099_47200.pdf
OK INV_FINAL_FINAL_v2.pdf -> 2026-04-22_FreshworksIndia_FW8821_8400.pdf
Edge cases I hit (and you will too)
Scanned PDFs. pdfplumber returns empty text for image-only PDFs. The try/except in main skips them without crashing — but you'll need OCR (Tesseract via pytesseract) to actually read them. I keep a separate folder for scanned ones.
Multi-page invoices with the total on page 3+. Rare in B2B India, but if you hit it, change pdf.pages[:2] to pdf.pages — slower but complete.
Vendors without GSTIN. International vendors won't have one. The fallback "UnknownVendor" gets used; you can extend vendor_from_gstin to also scan the first line of the document.
Duplicate filenames. safe_rename adds a _1, _2 suffix instead of overwriting. I learned this the hard way after losing a Tata Cloud invoice to a clash.
Why this format
Sorting by filename gives you a chronological audit trail. Your CA can scan a folder and see April's invoices in date order without opening anything. The amount in the filename is a sanity check — quick visual confirmation when reconciling with bank statements.
Where to go from here
A few extensions I've layered on top of this base script:
- Pipe the extracted data straight into a Google Sheet for monthly P&L
- Move processed PDFs into
year/monthsubfolders - Flag any invoice where the GSTIN checksum doesn't validate (there's a published algorithm)
- Send a Telegram alert if any single invoice crosses a threshold
55 lines was the goal. Fancier tooling can wait until the basics work.
Follow me on Twitter @automate_archit for daily AI automation tips.
Top comments (0)