If you handle Indian invoices for any business, you know the month-end ritual: thirty vendor PDFs arrive in your inbox, and someone has to type GSTIN, invoice number, taxable value, and the CGST/SGST/IGST split into a spreadsheet so the accountant can reconcile with GSTR-2A. One typo means a mismatched input tax credit, and that means a tense email at 9pm.
I built a parser that reads text-based GST invoice PDFs and writes the numbers straight into a CSV. Sixty lines, one external dependency, no paid OCR service, no API calls.
Here's the full code, then I'll walk through why each piece exists.
The Code
import re
import csv
import sys
from pathlib import Path

import pdfplumber

# Every pattern captures the value we want in group 1.
GSTIN_RE = re.compile(r"\b(\d{2}[A-Z]{5}\d{4}[A-Z]\d[A-Z][A-Z\d])\b")
INVOICE_RE = re.compile(r"Invoice\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-/]+)", re.I)
DATE_RE = re.compile(r"(\d{1,2}[/\-.]\d{1,2}[/\-.]\d{2,4})")
AMOUNT_RE = re.compile(r"(?:Total|Grand\s*Total)[^\d]{0,20}([\d,]+\.\d{2})", re.I)
TAXABLE_RE = re.compile(r"Taxable\s*(?:Value|Amount)[^\d]{0,20}([\d,]+\.\d{2})", re.I)
CGST_RE = re.compile(r"\bCGST[^\d]{0,30}([\d,]+\.\d{2})", re.I)
SGST_RE = re.compile(r"\bSGST[^\d]{0,30}([\d,]+\.\d{2})", re.I)
IGST_RE = re.compile(r"\bIGST[^\d]{0,30}([\d,]+\.\d{2})", re.I)


def clean(amount):
    """Turn '1,23,456.00' into 123456.0; a missing match becomes 0.0."""
    if not amount:
        return 0.0
    return float(amount.replace(",", ""))


def first(pattern, text, default=""):
    """Return group 1 of the first match, or the default if nothing matches."""
    m = pattern.search(text)
    return m.group(1).strip() if m else default


def parse_invoice(pdf_path):
    # extract_text() returns None for pages without embedded text, hence the `or ""`.
    with pdfplumber.open(pdf_path) as pdf:
        text = "\n".join(p.extract_text() or "" for p in pdf.pages)
    gstins = GSTIN_RE.findall(text)
    return {
        "file": pdf_path.name,
        # Positional guess: the vendor's GSTIN comes first, the buyer's second.
        "vendor_gstin": gstins[0] if gstins else "",
        "buyer_gstin": gstins[1] if len(gstins) > 1 else "",
        "invoice_no": first(INVOICE_RE, text),
        "date": first(DATE_RE, text),
        "taxable": clean(first(TAXABLE_RE, text)),
        "cgst": clean(first(CGST_RE, text)),
        "sgst": clean(first(SGST_RE, text)),
        "igst": clean(first(IGST_RE, text)),
        "total": clean(first(AMOUNT_RE, text)),
    }


def main(folder, out_csv="invoices.csv"):
    # sorted() keeps the row order stable from one run to the next.
    rows = [parse_invoice(p) for p in sorted(Path(folder).glob("*.pdf"))]
    if not rows:
        print("No PDFs found.")
        return
    with open(out_csv, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)
    print(f"Parsed {len(rows)} invoices -> {out_csv}")


if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else "invoices.csv")
Install and run:
pip install pdfplumber
python parse_gst.py ./invoices/ output.csv
How It Works
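The GSTIN regex. A GSTIN is exactly 15 characters in a strict format: a 2-digit state code, the entity's 10-character PAN (5 letters, 4 digits, 1 letter), 1 entity code (the registration number for that PAN within the state), 1 letter (Z by default), and 1 alphanumeric check character. The pattern matches that shape and rejects everything else, so you won't accidentally pick up phone numbers, PIN codes, or random alphanumeric strings. One gap: the entity code is usually a digit, which is what the \d in the thirteenth position expects, but it can be a letter when a PAN holds more than nine registrations in one state. If you hit one of those, widen that \d to [1-9A-Z].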
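The shape check says nothing about whether the last character is actually right. If you want to catch a GSTIN that was mangled during text extraction, here is a minimal sketch of a check-character test. It assumes the scheme as it is commonly described (a mod-36 weighted sum over the first 14 characters with weights 1, 2, 1, 2, ...); the function names are mine, and it's worth confirming the scheme against a few GSTINs you know are valid before relying on it.

```python
import re

GSTIN_RE = re.compile(r"\b(\d{2}[A-Z]{5}\d{4}[A-Z]\d[A-Z][A-Z\d])\b")
CHARSET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"


def gstin_check_char(first14):
    """Compute the 15th character from the first 14 (assumed mod-36 scheme)."""
    total = 0
    for i, ch in enumerate(first14):
        # Weights alternate 1, 2, 1, 2, ... starting from the first character.
        product = CHARSET.index(ch) * (2 if i % 2 else 1)
        # Add the base-36 "digits" of the product: quotient plus remainder.
        total += product // 36 + product % 36
    return CHARSET[(36 - total % 36) % 36]


def is_valid_gstin(gstin):
    """True if the string has the GSTIN shape and its check character agrees."""
    return bool(GSTIN_RE.fullmatch(gstin)) and gstin_check_char(gstin[:14]) == gstin[14]


# 27AAPFU0939F1ZV is a commonly cited sample GSTIN.
print(is_valid_gstin("27AAPFU0939F1ZV"))
print(is_valid_gstin("27AAPFU0939F1ZX"))
```

In the parser you could filter GSTIN_RE.findall(text) through is_valid_gstin() before picking the vendor and buyer.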
A typical invoice carries two GSTINs: the vendor's at the top, usually in the header block, and the buyer's in the bill-to section. The reading order pdfplumber returns is consistent for nearly every template I've tested, so taking the first two matches works in practice. If you want to be defensive, you can label each GSTIN by checking which one appears before the words "Bill To" in the extracted text.
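The defensive version can be sketched in a few lines. This is one possible approach, not what the script above does: label_gstins and BILL_TO_RE are hypothetical names, and the marker wording ("Bill To" or "Billed To") is an assumption you'd adjust to your vendors' templates.

```python
import re

GSTIN_RE = re.compile(r"\b(\d{2}[A-Z]{5}\d{4}[A-Z]\d[A-Z][A-Z\d])\b")
BILL_TO_RE = re.compile(r"Bill(?:ed)?\s*To", re.I)  # assumed marker wording


def label_gstins(text):
    """Assign GSTINs by their position relative to the bill-to marker."""
    marker = BILL_TO_RE.search(text)
    if marker is None:
        # No marker found: fall back to the positional guess.
        gstins = GSTIN_RE.findall(text)
        return {
            "vendor_gstin": gstins[0] if gstins else "",
            "buyer_gstin": gstins[1] if len(gstins) > 1 else "",
        }
    before = GSTIN_RE.findall(text[:marker.start()])
    after = GSTIN_RE.findall(text[marker.end():])
    return {
        "vendor_gstin": before[0] if before else "",
        "buyer_gstin": after[0] if after else "",
    }


sample = "ACME Traders\nGSTIN: 27ABCDE1234F1Z5\nBill To: Foo Ltd\nGSTIN: 29XYZAB5678G2Z9\n"
print(label_gstins(sample))
```

A blank vendor_gstin in the output is a useful signal here: it means the template puts the bill-to block first, and that file deserves a manual look.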
The tax field regexes. Each tax field has the same structure across templates: a label like CGST, then some whitespace or punctuation (Rs., INR, a colon, or a table cell separator), then the amount. The [^\d]{0,30} window lets the regex skip up to 30 non-digit characters before grabbing the amount. Because the window stops at the first digit, a rate printed between the label and the amount (CGST @9% 900.00) is not skipped: the match fails at that label, or, if the rate is written with two decimals (9.00%), the rate itself gets captured. The 30-character cap keeps the match close to the label, though [^\d] also matches newlines, so a label with an empty cell can still pick up a number from the next line.
The \b word boundary before CGST, SGST, and IGST matters more than it looks. It anchors each label to the start of a word, so the pattern can't fire in the middle of a longer token. Without it, a label embedded in a longer string could match first and put the wrong number in the column.
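One thing to check against your own templates: the skip window is made of non-digit characters, so a rate printed between the label and the amount stops it. Below is a sketch of a rate-tolerant variant. tax_pattern and amount_after are hypothetical helpers, not part of the script above, and the same-line restriction ([^\d\n]) is a trade-off that will miss templates where the amount sits on the line below the label.

```python
import re

# The original CGST pattern, kept here for comparison.
ORIGINAL_CGST_RE = re.compile(r"\bCGST[^\d]{0,30}([\d,]+\.\d{2})", re.I)


def tax_pattern(label):
    """Build a pattern for one tax label that tolerates an optional rate."""
    return re.compile(
        rf"\b{label}\b"                          # the label as a whole word
        r"[^\d\n]{0,30}"                         # separators, staying on the same line
        r"(?:\d+(?:\.\d+)?\s*%[^\d\n]{0,30})?"   # optional rate such as 9% or 9.00%
        r"([\d,]+\.\d{2})(?!\s*%)",              # the amount, never a percentage
        re.I,
    )


def amount_after(label, text):
    """Return the amount string that follows the label, or '' if none is found."""
    m = tax_pattern(label).search(text)
    return m.group(1) if m else ""


print(ORIGINAL_CGST_RE.search("CGST @9% 900.00"))      # None: the 9 blocks the window
print(amount_after("CGST", "CGST @9% 900.00"))         # 900.00
print(amount_after("CGST", "CGST @ 9.00% 900.00"))     # 900.00
```

The negative lookahead at the end is what stops 9.00% from being read as an amount; the optional group in the middle is what lets the match continue past the rate.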
Why pdfplumber over PyPDF2 or pdfminer. pdfplumber is built on pdfminer.six and groups characters into lines by their position on the page, which means amounts in a totals box at the bottom don't get scrambled with the line items above them. PyPDF2 returns text in the order it is stored in the file, which sounds fine until you hit a multi-column invoice and CGST from row 3 ends up next to SGST from row 1. pdfplumber pays attention to coordinates, so structured fields stay structured.
Comma handling. Indian invoices write amounts as 1,23,456.00. Python's float() refuses to parse commas. The clean() helper strips them before casting, and returns 0.0 on a missing match rather than crashing, which matters when one invoice with a missing field shouldn't kill a batch of thirty. The trade-off is that a 0.0 in the CSV can mean either a genuine zero or a field the regex never found.
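clean() covers a missing field, but a PDF that fails to open at all will still raise. Here is a small sketch of a batch wrapper that collects failures instead of stopping. parse_all is a hypothetical helper; it takes the parsing function as an argument, so in the script you would pass parse_invoice.

```python
def clean(amount):
    """Same helper as in the script: strip commas, default to 0.0."""
    if not amount:
        return 0.0
    return float(amount.replace(",", ""))


def parse_all(paths, parse_one):
    """Parse each path; record failures instead of aborting the batch."""
    rows, failures = [], []
    for path in paths:
        try:
            rows.append(parse_one(path))
        except Exception as exc:  # e.g. a corrupt or password-protected PDF
            failures.append((str(path), repr(exc)))
    return rows, failures


print(clean("1,23,456.00"))  # 123456.0

try:
    float("1,23,456.00")
except ValueError as exc:
    print("float() alone fails:", exc)
```

Printing the failures list at the end of the run tells you which files need a manual look.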
What It Catches
Run it against a folder of 30 invoices and you get a CSV like:
file,vendor_gstin,buyer_gstin,invoice_no,date,taxable,cgst,sgst,igst,total
inv_001.pdf,27ABCDE1234F1Z5,29XYZAB5678G2Z9,INV-2025-001,15/05/2025,10000.0,900.0,900.0,0.0,11800.0
inv_002.pdf,27ABCDE1234F1Z5,06PQRST9876H3Z1,INV-2025-002,16/05/2025,25000.0,0.0,0.0,4500.0,29500.0
Pipe that into your GSTR-2A reconciliation sheet, sort by vendor_gstin, and you're done in five minutes instead of two hours.
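Before the CSV goes to the accountant, a quick arithmetic pass catches most regex misfires. This is a sketch under a few assumptions: check_row is a hypothetical helper, the ₹1 tolerance is a guess at typical round-off, and it assumes there is no cess or other charge outside the four parsed components.

```python
import csv
import io


def check_row(row, tolerance=1.0):
    """Return a list of problems found in one parsed row (empty if it looks fine)."""
    taxable, cgst, sgst, igst, total = (
        float(row[key]) for key in ("taxable", "cgst", "sgst", "igst", "total")
    )
    problems = []
    if abs(taxable + cgst + sgst + igst - total) > tolerance:
        problems.append("components do not add up to total")
    if abs(cgst - sgst) > tolerance:
        problems.append("CGST and SGST differ")
    if igst > 0 and (cgst > 0 or sgst > 0):
        problems.append("both IGST and CGST/SGST present")
    return problems


# Illustrative rows; inv_003.pdf simulates a rate captured instead of an amount.
sample = io.StringIO(
    "file,taxable,cgst,sgst,igst,total\n"
    "inv_001.pdf,10000.0,900.0,900.0,0.0,11800.0\n"
    "inv_003.pdf,10000.0,9.0,900.0,0.0,11800.0\n"
)
for row in csv.DictReader(sample):
    print(row["file"], check_row(row))
```

Rows that come back with a non-empty list are the ones worth opening by hand.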
What It Doesn't Catch (Yet)
Scanned PDFs are invisible to pdfplumber, which reads embedded text only. For scanned invoices, add a pytesseract OCR step before the text extraction; that's another ten lines and a brew install tesseract (or your platform's Tesseract package).
Non-standard layouts (vendors who label GST as "Tax Amount" or split CGST across two rows) will defeat the regex. The fix is to print the extracted text for that file once, adjust the pattern, and never touch it again.
Reverse-charge invoices parse to the right numbers, but nothing in the CSV marks that the tax direction is flipped, with the liability sitting with the buyer instead of the vendor. If you process reverse-charge invoices regularly, add a flag check for "Reverse Charge: Yes" in the text.
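The reverse-charge flag check can be sketched like this. The label wording varies between templates, so treat REVERSE_CHARGE_RE as a starting point; both names are hypothetical. The function returns None when the invoice doesn't mention reverse charge at all, so you can tell "no" apart from "not stated".

```python
import re

# Matches e.g. "Reverse Charge: Yes" or "Reverse Charge (Y/N): N".
REVERSE_CHARGE_RE = re.compile(
    r"Reverse\s*Charge[^\n:\-]{0,40}[:\-]?\s*(Yes|No|Y|N)\b", re.I
)


def reverse_charge(text):
    """Return True/False if the invoice states a reverse-charge flag, else None."""
    m = REVERSE_CHARGE_RE.search(text)
    if m is None:
        return None
    return m.group(1).upper().startswith("Y")


print(reverse_charge("Reverse Charge: Yes"))
print(reverse_charge("Whether tax is payable under reverse charge: No"))
print(reverse_charge("Invoice No: INV-2025-001"))
```

Adding a "reverse_charge" key to the dict in parse_invoice() puts the flag in its own CSV column.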
Running It On Schedule
Drop the script in a cron job that runs nightly:
0 22 * * * cd /home/me/invoices && python parse_gst.py ./incoming output.csv
The job runs at 10pm, so your accountant opens a fresh CSV every morning and the manual typing step disappears from the workflow.
Why I Like This Pattern
Sixty lines of regex over PDF text beats a ₹2000/month invoice OCR SaaS for 95% of real-world Indian invoices, and you can fix the other 5% by printing the extracted text and tweaking one pattern. The entire script fits in your head, so when something breaks, you know exactly where to look. No vendor lock-in, no rate limits, no quarterly price hikes. Just Python and a regex book.
Build the boring thing yourself. Charge yourself ₹0 a month. Use the saved money for better chai.
Follow me on Twitter @automate_archit for daily AI automation tips.