Archit Mittal

Originally published at architmittal.com

Build a GST Invoice PDF Extractor in 53 Lines of Python

If you run a small business in India, you've felt this pain: a folder full of PDF invoices from different vendors, every one formatted slightly differently, and somebody has to type GSTIN, invoice number, date, and total amount into a spreadsheet. I built a single-file Python script that processes the whole folder in seconds and validates GSTINs along the way.

Total: 53 lines including imports and CLI plumbing. Here's the whole thing.

What it does

  • Walks a folder of PDF invoices
  • Pulls out GSTINs (vendor + buyer) and validates the 15-character format
  • Grabs invoice number, date, and total amount with regex
  • Writes a clean CSV you can paste into Tally, Zoho Books, or any ledger

The code

# invoice_extractor.py
import re, sys, csv
from pathlib import Path
import pdfplumber

GSTIN_RE   = re.compile(r'\b(\d{2}[A-Z]{5}\d{4}[A-Z][A-Z\d]Z[A-Z\d])\b')
INV_NUM_RE = re.compile(r'Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-/]+)', re.I)
DATE_RE    = re.compile(r'(\d{2}[\-/]\d{2}[\-/]\d{4})')
AMOUNT_RE  = re.compile(r'(?:Grand\s*Total|Total|Amount)[^\d]*(?:Rs\.?|INR)?\s*([\d,]+\.\d{0,2})', re.I)

def is_valid_gstin(g: str) -> bool:
    if len(g) != 15:
        return False
    state = int(g[:2])
    return 1 <= state <= 38

def extract_from_pdf(path: Path) -> dict:
    text = ""
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text += (page.extract_text() or "") + "\n"
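    # De-duplicate in order: a vendor GSTIN printed twice must not land in buyer_gstin.
    gstins  = list(dict.fromkeys(g for g in GSTIN_RE.findall(text) if is_valid_gstin(g)))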
    inv     = INV_NUM_RE.search(text)
    date    = DATE_RE.search(text)
    amounts = [float(a.replace(",", "")) for a in AMOUNT_RE.findall(text)]
    return {
        "file":         path.name,
        "vendor_gstin": gstins[0] if gstins else "",
        "buyer_gstin":  gstins[1] if len(gstins) > 1 else "",
        "invoice_no":   inv.group(1) if inv else "",
        "date":         date.group(1) if date else "",
        "total_inr":    max(amounts, default=0.0),
    }

def main(folder: str) -> None:
    rows = [extract_from_pdf(p) for p in sorted(Path(folder).glob("*.pdf"))]
    if not rows:
        print("No PDFs found."); return
    out = Path(folder) / "invoices_extracted.csv"
    with out.open("w", newline="") as f:
        w = csv.DictWriter(f, fieldnames=rows[0].keys())
        w.writeheader(); w.writerows(rows)
    print(f"Processed {len(rows)} invoices and wrote {out}")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else ".")

Save it as invoice_extractor.py, drop a folder of PDFs next to it, then:

pip install pdfplumber
python invoice_extractor.py ./invoices

You'll get an invoices_extracted.csv with one row per PDF.

How the GSTIN regex works

A valid GSTIN is exactly 15 characters laid out like this:

  • 2-digit state code
  • 5-letter PAN prefix
  • 4-digit PAN middle
  • 1-letter PAN suffix
  • 1 entity number (alphanumeric)
  • Literal Z
  • 1 checksum character (alphanumeric)

The regex captures that shape, and is_valid_gstin adds a sanity check on the state code (01 through 38 covers every state and union territory). The 15th character is a check character computed mod 36 from the first 14, so you can layer that in if you also want to catch mistyped or mis-extracted GSTINs. For most accounting cleanup, the format check is enough.
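If you do want the checksum, here is a minimal sketch. It assumes the commonly described scheme (a Luhn-style mod-36 sum with weights alternating 1, 2 from the left), and the helper names are mine, not part of the script above. Check it against a few GSTINs you know to be valid before relying on it.

```python
# Sketch of the GSTIN check-character test (assumed Luhn-style mod-36 scheme).
# ALPHABET, gstin_check_char and has_valid_checksum are illustrative names.
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def gstin_check_char(first14: str) -> str:
    """Compute the expected 15th character from the first 14."""
    total = 0
    for i, ch in enumerate(first14):
        value = ALPHABET.index(ch)               # 0-9 -> 0..9, A-Z -> 10..35
        product = value * (2 if i % 2 else 1)    # weights alternate 1, 2, 1, 2, ...
        total += product // 36 + product % 36    # add quotient and remainder
    return ALPHABET[(36 - total % 36) % 36]

def has_valid_checksum(gstin: str) -> bool:
    """True when gstin is 15 uppercase alphanumerics and its last character checks out."""
    if len(gstin) != 15 or any(c not in ALPHABET for c in gstin):
        return False
    return gstin_check_char(gstin[:14]) == gstin[14]
```

To use it, extend the return line of is_valid_gstin to `1 <= state <= 38 and has_valid_checksum(g)`.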

Why the amount regex uses max(amounts)

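Indian invoices repeat amounts everywhere: subtotal, CGST, SGST, IGST, taxable value, grand total. Taking the largest match from anything that follows "Total", "Grand Total", or "Amount" almost always lands on the grand total, because every component is smaller than the sum. Note that AMOUNT_RE only matches amounts written with a decimal point (14,237.50 but not 14,237), so whole-rupee totals are skipped. If your vendors have weirder layouts, tighten the regex to require "Grand Total" specifically and skip "Total" alone.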
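One way to tighten this without losing the fallback is to try a "Grand Total" pattern first and only then fall back to the looser one. This is a sketch, not a drop-in: GRAND_TOTAL_RE and pick_total are hypothetical names, and AMOUNT_RE is repeated only so the snippet runs on its own.

```python
import re

# Strict pattern: only amounts labelled "Grand Total".
GRAND_TOTAL_RE = re.compile(
    r'Grand\s*Total[^\d]*(?:Rs\.?|INR)?\s*([\d,]+\.\d{0,2})', re.I)
# Loose pattern: same as AMOUNT_RE in the script.
AMOUNT_RE = re.compile(
    r'(?:Grand\s*Total|Total|Amount)[^\d]*(?:Rs\.?|INR)?\s*([\d,]+\.\d{0,2})', re.I)

def pick_total(text: str) -> float:
    """Prefer an explicit 'Grand Total'; otherwise take the largest labelled amount."""
    for pattern in (GRAND_TOTAL_RE, AMOUNT_RE):
        amounts = [float(a.replace(",", "")) for a in pattern.findall(text)]
        if amounts:
            return max(amounts)
    return 0.0  # nothing matched, same default as the script
```

In extract_from_pdf you would then set "total_inr" to pick_total(text) instead of max(amounts, default=0.0).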

A real example

I tested this on 47 PDF invoices from one quarter of vendor bills (electricity, internet, three SaaS subscriptions, freight, the works). It processed them in 6.2 seconds on an M1 Air. Two had no extractable text because they were scans, which is the OCR case I describe below. The other 45 came out clean and tied to the Tally ledger on the first try.

Sample output row:

file=ApolloPharmacy_Mar2026.pdf
vendor_gstin=29AABCA1234B1Z7
buyer_gstin=29AAACS5678E1ZR
invoice_no=AP/2526/00412
date=18-03-2026
total_inr=14237.50

Common edge cases and how to handle them

Scanned PDFs. pdfplumber only extracts embedded text. If a vendor sends you a scan, extract_text() returns no text and none of the regexes match. Add a pytesseract fallback that runs OCR when the extracted character count falls below a threshold.
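A rough sketch of that fallback is below. It assumes pytesseract and the Tesseract binary are installed, and that a page with fewer than 50 characters of embedded text is a scan. MIN_CHARS, needs_ocr, and extract_text_with_ocr are names I made up for illustration, and the threshold is a guess you should tune.

```python
from pathlib import Path

MIN_CHARS = 50  # assumed threshold; tune it for your invoices

def needs_ocr(text: str, min_chars: int = MIN_CHARS) -> bool:
    """True when the embedded text layer is too thin to be a real invoice page."""
    return len(text.strip()) < min_chars

def extract_text_with_ocr(path: Path) -> str:
    """Extract text page by page, running OCR only on pages that look scanned."""
    # Imported lazily so the rest of the script still runs without the OCR stack.
    import pdfplumber
    import pytesseract

    parts = []
    with pdfplumber.open(path) as pdf:
        for page in pdf.pages:
            text = page.extract_text() or ""
            if needs_ocr(text):
                # Render the page to a PIL image and hand it to Tesseract.
                image = page.to_image(resolution=300).original
                text = pytesseract.image_to_string(image)
            parts.append(text)
    return "\n".join(parts)
```

OCR output is noisier than embedded text (O versus 0, I versus 1), so expect more GSTINs to fail validation on scanned pages.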

Multi-page invoices. Already handled. The loop concatenates every page before any regex runs.

Wrong amount picked up. Happens when the invoice has a line such as "Late fee total" above the grand total with a larger figure. Fix by requiring "Grand Total" or by choosing the match by its position in the text (typically the last one) rather than by value.
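The position-based variant could look like the sketch below. It assumes the grand total is the last labelled amount in the text, which will not hold if a footer repeats some other total. last_labelled_amount is a hypothetical helper, and AMOUNT_RE is repeated so the snippet is self-contained.

```python
import re

AMOUNT_RE = re.compile(
    r'(?:Grand\s*Total|Total|Amount)[^\d]*(?:Rs\.?|INR)?\s*([\d,]+\.\d{0,2})', re.I)

def last_labelled_amount(text: str) -> float:
    """Return the labelled amount that appears last in the text, or 0.0 if none."""
    matches = list(AMOUNT_RE.finditer(text))  # finditer yields matches in text order
    if not matches:
        return 0.0
    return float(matches[-1].group(1).replace(",", ""))
```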

Date in DD-MMM-YYYY format. The regex matches numeric dates only. Add an alternation for \d{2}[\-/\s][A-Za-z]{3}[\-/\s]\d{4} if your vendors use month names.
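Here is a sketch of that alternation, with a small helper that also normalizes the match to ISO format. to_iso is an illustrative name of mine, and it assumes English month abbreviations, because %b follows the process locale.

```python
import re
from datetime import datetime

DATE_RE = re.compile(
    r'(\d{2}[\-/]\d{2}[\-/]\d{4}'              # 18-03-2026 or 18/03/2026
    r'|\d{2}[\-/\s][A-Za-z]{3}[\-/\s]\d{4})'   # 18-Mar-2026 or 18 Mar 2026
)

def to_iso(raw: str) -> str:
    """Convert a matched DD-MM-YYYY or DD-MMM-YYYY string to YYYY-MM-DD."""
    cleaned = re.sub(r'[/\s]', '-', raw)       # unify separators to '-'
    for fmt in ("%d-%m-%Y", "%d-%b-%Y"):       # %b assumes English month names
        try:
            return datetime.strptime(cleaned, fmt).date().isoformat()
        except ValueError:
            continue
    return raw  # leave anything unparseable untouched for manual review
```

In the script, the "date" field would become to_iso(date.group(1)) if date else "".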

What I would add next

  • Tally XML export instead of CSV (one extra function, maps the dict to Tally's voucher schema)
  • Dedup detection by (vendor_gstin, invoice_no) to catch double entries before they hit the books
  • Date normalization to ISO 8601 (YYYY-MM-DD) so it sorts correctly in Excel
  • OCR fallback with pytesseract for scanned PDFs
  • Batch upload to a Google Sheet using the Sheets API instead of writing a local CSV
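Of these, dedup detection is the quickest to sketch. The version below assumes the row dicts produced by extract_from_pdf and treats invoice numbers as case-insensitive. find_duplicates is a hypothetical helper, not something in the script.

```python
def find_duplicates(rows: list[dict]) -> list[dict]:
    """Return rows whose (vendor_gstin, invoice_no) pair was already seen earlier."""
    seen = set()
    duplicates = []
    for row in rows:
        key = (row["vendor_gstin"], row["invoice_no"].upper())
        if not all(key):
            continue  # a missing GSTIN or invoice number can't be judged
        if key in seen:
            duplicates.append(row)
        else:
            seen.add(key)
    return duplicates
```

Calling it on rows in main() before writing the CSV lets you print the offending file names instead of silently writing both entries.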

The bigger point

The whole script is one file. No frameworks, no databases, no scheduler dependency. Drop it in a cron job or a Windows Task Scheduler entry and the CSV regenerates itself at midnight, ready to import before you get to the office.

If you have ever paid a junior accountant for two hours a day to do data entry on PDFs, this is the kind of automation that recovers that cost in a week. Even at a conservative 200 invoices a month and 3 minutes each, that is 600 minutes, or 10 hours per month, every month, that goes back into work that actually moves the business.

Steal the code. Tweak the regex for whatever your vendors do. Ship it.


Follow me on Twitter @automate_archit for daily AI automation tips.
