Last month, a CA firm in Pune asked me to look at their invoice processing workflow. Their team of 3 was spending 15+ hours every week manually copying data from PDF invoices into Tally and Excel sheets. Vendor name, GSTIN, line items, totals, tax breakdowns — all done by hand.
I built them an automated parser in 75 lines of Python. It now processes 200+ invoices per day with 95% accuracy. Here is exactly how I did it.
The Problem: Death by Data Entry
Here is what the workflow looked like before automation:
- Download PDF invoices from email (50-80 per day)
- Open each PDF manually
- Squint at the layout, find the vendor name, GSTIN, invoice number
- Copy line items one by one into Excel
- Calculate tax totals and verify against the PDF
- Paste into Tally for accounting
Average time per invoice: 4-5 minutes. Across 50-80 invoices, that is 4+ hours of combined work every day, split between three people.
The firm was paying roughly Rs 45,000/month in salaries just for data entry. The error rate hovered around 8% — because humans get tired after the 30th invoice.
The Tech Stack
I kept it dead simple. No fancy ML models, no cloud APIs, no expensive OCR services.
- pdfplumber — extracts text and tables from PDFs (free, open source)
- pandas — structures the extracted data into clean DataFrames
- re (regex) — pattern matching for GSTIN, invoice numbers, dates
- openpyxl — exports results to Excel files
Total cost: Rs 0. All open-source libraries.
```shell
pip install pdfplumber pandas openpyxl
```
The Code: 75 Lines That Replace 3 People
Here is the complete script. I will walk through each section below.
```python
import os
import re

import pandas as pd
import pdfplumber

# Regex patterns for Indian invoices
GSTIN_PATTERN = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}'
INVOICE_PATTERN = r'(?:Invoice\s*(?:No|Number|#)?[\s:]*)([\w\-/]+)'
DATE_PATTERN = r'\d{2}[/-]\d{2}[/-]\d{4}'
AMOUNT_PATTERN = r'(?:Total|Grand\s*Total|Net\s*Amount)[\s:]*(?:Rs\.?|INR)?\s*([\d,]+\.?\d*)'


def extract_invoice_data(pdf_path):
    """Extract structured data from a single PDF invoice."""
    data = {
        'file': os.path.basename(pdf_path),
        'vendor_gstin': None,
        'invoice_no': None,
        'date': None,
        'total_amount': None,
        'line_items': [],
    }

    with pdfplumber.open(pdf_path) as pdf:
        full_text = ''
        for page in pdf.pages:
            full_text += page.extract_text() or ''

            # Extract tables for line items
            tables = page.extract_tables()
            for table in tables:
                if len(table) > 1:  # Has header + data
                    for row in table[1:]:
                        cleaned = [str(cell).strip() if cell else '' for cell in row]
                        if any(cleaned):
                            data['line_items'].append(cleaned)

    # Extract fields using regex
    gstin_matches = re.findall(GSTIN_PATTERN, full_text)
    if gstin_matches:
        data['vendor_gstin'] = gstin_matches[0]

    inv_match = re.search(INVOICE_PATTERN, full_text, re.IGNORECASE)
    if inv_match:
        data['invoice_no'] = inv_match.group(1)

    date_matches = re.findall(DATE_PATTERN, full_text)
    if date_matches:
        data['date'] = date_matches[0]

    amount_match = re.search(AMOUNT_PATTERN, full_text, re.IGNORECASE)
    if amount_match:
        data['total_amount'] = float(amount_match.group(1).replace(',', ''))

    return data


def process_invoice_folder(folder_path, output_file='parsed_invoices.xlsx'):
    """Process all PDFs in a folder and export to Excel."""
    results = []
    pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    print(f'Found {len(pdf_files)} PDF invoices to process...')

    for i, filename in enumerate(pdf_files, 1):
        filepath = os.path.join(folder_path, filename)
        try:
            invoice = extract_invoice_data(filepath)
            results.append({
                'File': invoice['file'],
                'Vendor GSTIN': invoice['vendor_gstin'],
                'Invoice No': invoice['invoice_no'],
                'Date': invoice['date'],
                'Total Amount': invoice['total_amount'],
                'Line Items Count': len(invoice['line_items']),
            })
            print(f'  [{i}/{len(pdf_files)}] Parsed: {filename}')
        except Exception as e:
            print(f'  [{i}/{len(pdf_files)}] FAILED: {filename} - {e}')
            results.append({'File': filename, 'Error': str(e)})

    df = pd.DataFrame(results)
    df.to_excel(output_file, index=False)
    print(f'\nDone! Results saved to {output_file}')
    print(f'Successfully parsed: {len([r for r in results if "Error" not in r])}/{len(results)}')
    return df


# Usage
if __name__ == '__main__':
    df = process_invoice_folder('./invoices')
```
Breaking Down the Key Parts
GSTIN Extraction
The GSTIN (Goods and Services Tax Identification Number) follows a fixed 15-character format across India. The regex `\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}` matches this pattern reliably: the first two digits are the state code, the next 10 characters are the PAN, the 13th is an entity code, the 14th is always 'Z', and the 15th is a check character.
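A regex match alone does not prove a GSTIN is genuine: the 15th character is a Luhn mod-36 check digit, so you can validate it too. Here is a minimal sketch of that check (the `gstin_checksum_ok` helper and the sample GSTIN are my own illustration, not part of the script above):

```python
import re

GSTIN_PATTERN = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}'
CHARS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # base-36 alphabet

def gstin_checksum_ok(gstin):
    """Validate the 15th character using the Luhn mod-36 algorithm."""
    if not re.fullmatch(GSTIN_PATTERN, gstin):
        return False
    total = 0
    factor = 2
    # Walk the first 14 characters right to left, alternating factors 2 and 1
    for ch in reversed(gstin[:14]):
        addend = factor * CHARS.index(ch)
        total += addend // 36 + addend % 36
        factor = 1 if factor == 2 else 2
    check_char = CHARS[(36 - total % 36) % 36]
    return check_char == gstin[14]
```

Fields that match the regex but fail this checksum are exactly the ones worth flagging for human review.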
Table Extraction with pdfplumber
This is where pdfplumber really shines. Instead of trying to parse free-form text line by line, page.extract_tables() detects table boundaries in the PDF and returns structured 2D arrays. For most GST invoices, line items sit in a table with columns like Description, HSN Code, Quantity, Rate, and Amount.
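To make those raw rows usable, I feed them into pandas. A minimal sketch of that step, with the caveat that the rows and column names below are illustrative assumptions (real invoices vary, so in practice you map headers per vendor):

```python
import pandas as pd

# Rows as pdfplumber's extract_tables() returns them: lists of cell strings.
raw_rows = [
    ['Office Chair', '9401', '2', '4500.00', '9000.00'],
    ['Desk Lamp', '9405', '5', '800.00', '4000.00'],
]
columns = ['Description', 'HSN Code', 'Quantity', 'Rate', 'Amount']

items = pd.DataFrame(raw_rows, columns=columns)
# Convert numeric columns so totals can be cross-checked against the PDF
for col in ['Quantity', 'Rate', 'Amount']:
    items[col] = pd.to_numeric(items[col].str.replace(',', ''))
```

Once the numeric columns are converted, verifying that `Quantity * Rate` matches `Amount` is a one-liner, which replaces the manual tax-total verification step.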
The Fallback Strategy
Not every invoice is perfectly formatted. Some are scanned images (which need OCR — a story for another day). Some have weird layouts. The try/except block ensures one bad PDF does not crash the entire batch. Failed files get logged so you can handle them manually.
Results After 30 Days
Here is what changed after deploying this script:
- Processing time: 15+ hours/week dropped to about 20 minutes (the script runs in under 2 minutes, plus 15-18 minutes for human review of flagged items)
- Error rate: 8% dropped to under 2% (most remaining errors are from badly scanned PDFs)
- Monthly savings: Roughly Rs 35,000/month in reduced data entry labour
- Staff redeployment: 2 of the 3 data entry staff now handle client communication and follow-ups instead
The script paid for itself on day one.
Making It Production-Ready
For the actual deployment, I added a few extras that pushed it beyond 75 lines:
1. Email integration — A simple IMAP script that auto-downloads PDF attachments from specific senders.
2. Confidence scoring — Each extracted field gets a confidence score. If the GSTIN regex matches but the checksum fails, it gets flagged for human review.
3. Tally XML export — Instead of Excel, the final version exports directly to Tally-compatible XML, cutting out another manual step.
4. Watchdog monitoring — Uses Python's watchdog library to monitor a folder. Drop a PDF in, get structured data out in seconds.
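For a flavour of the Tally export step, here is a sketch using the stdlib `xml.etree`. The envelope tags follow Tally's import format as I understand it, but treat the exact schema (voucher type, ledger fields) as an assumption and verify against your Tally version before importing real data:

```python
import xml.etree.ElementTree as ET

def invoice_to_tally_xml(invoice):
    """Wrap one parsed invoice dict in a simplified Tally import envelope."""
    envelope = ET.Element('ENVELOPE')
    header = ET.SubElement(envelope, 'HEADER')
    ET.SubElement(header, 'TALLYREQUEST').text = 'Import Data'
    body = ET.SubElement(envelope, 'BODY')
    importdata = ET.SubElement(body, 'IMPORTDATA')
    reqdata = ET.SubElement(importdata, 'REQUESTDATA')
    msg = ET.SubElement(reqdata, 'TALLYMESSAGE')
    voucher = ET.SubElement(msg, 'VOUCHER', VCHTYPE='Purchase')
    ET.SubElement(voucher, 'DATE').text = invoice['date'] or ''
    ET.SubElement(voucher, 'REFERENCE').text = invoice['invoice_no'] or ''
    ET.SubElement(voucher, 'PARTYGSTIN').text = invoice['vendor_gstin'] or ''
    ET.SubElement(voucher, 'AMOUNT').text = str(invoice['total_amount'] or 0)
    return ET.tostring(envelope, encoding='unicode')

# Hypothetical parsed invoice, shaped like extract_invoice_data's output
sample = {'date': '01-04-2025', 'invoice_no': 'INV-101',
          'vendor_gstin': '27AAPFU0939F1ZV', 'total_amount': 13000.0}
xml_out = invoice_to_tally_xml(sample)
```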
Try It Yourself
Here is a quick way to test this on your own invoices:
```shell
mkdir invoice_parser && cd invoice_parser
pip install pdfplumber pandas openpyxl
```
Save the script above as `parser.py`, drop some PDF invoices into an `invoices/` folder, and run:
```shell
python parser.py
```
You will get a clean parsed_invoices.xlsx with all extracted data.
What I Learned
Start with regex, not ML. For structured documents like Indian GST invoices, regex handles 90% of cases. ML is overkill until you hit that last 10%.
pdfplumber beats PyPDF2 for tables. I wasted a day with PyPDF2 before switching. pdfplumber's table detection is significantly better.
Always build in human review. Automation does not mean zero humans. The sweet spot is automating the boring 95% and letting humans handle the interesting edge cases.
Measure before and after. The Rs 35,000/month savings number sold the client on expanding automation to other workflows. Without that number, it is just a cool script.
What Is Next
I am currently building v2 that adds OCR support for scanned invoices using pytesseract, and a simple web dashboard using Streamlit where the CA firm's team can upload invoices and download results. That will be next week's article.
If you are dealing with similar document processing problems — whether it is invoices, purchase orders, bank statements, or any structured PDF — the approach is the same: extract text, match patterns, structure data, export.
I am Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.
Have a workflow that is eating your team's time? Drop a comment below — I might build it and write about it next.