Last month, a CA firm in Pune asked me to look at their invoice processing workflow. Their team of 3 was spending 15+ hours every week manually copying data from PDF invoices into Tally and Excel sheets. Vendor name, GSTIN, line items, totals, tax breakdowns — all done by hand.
I built them an automated parser in 75 lines of Python. It now processes 200+ invoices per day with 95% accuracy. Here is exactly how I did it.
The Problem: Death by Data Entry
Here is what the workflow looked like before automation:
- Download PDF invoices from email (50-80 per day)
- Open each PDF manually
- Squint at the layout, find the vendor name, GSTIN, invoice number
- Copy line items one by one into Excel
- Calculate tax totals and verify against the PDF
- Paste into Tally for accounting
Average time per invoice: 4-5 minutes. Across 50-80 invoices, that is 4+ hours of combined work every day, split between three people.
The firm was paying roughly Rs 45,000/month in salaries just for data entry. The error rate hovered around 8% — because humans get tired after the 30th invoice.
The Tech Stack
I kept it dead simple. No fancy ML models, no cloud APIs, no expensive OCR services.
- pdfplumber — extracts text and tables from PDFs (free, open source)
- pandas — structures the extracted data into clean DataFrames
- re (regex) — pattern matching for GSTIN, invoice numbers, dates
- openpyxl — exports results to Excel files
Total cost: Rs 0. All open-source libraries.
```shell
pip install pdfplumber pandas openpyxl
```
The Code: 75 Lines That Replace 3 People
Here is the complete script. I will walk through each section below.
```python
import os
import re

import pandas as pd
import pdfplumber

# Regex patterns for Indian invoices
GSTIN_PATTERN = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}'
INVOICE_PATTERN = r'(?:Invoice\s*(?:No|Number|#)?[\s:]*)([\w\-/]+)'
DATE_PATTERN = r'\d{2}[/-]\d{2}[/-]\d{4}'
AMOUNT_PATTERN = r'(?:Total|Grand\s*Total|Net\s*Amount)[\s:]*(?:Rs\.?|INR)?\s*([\d,]+\.?\d*)'


def extract_invoice_data(pdf_path):
    """Extract structured data from a single PDF invoice."""
    data = {
        'file': os.path.basename(pdf_path),
        'vendor_gstin': None,
        'invoice_no': None,
        'date': None,
        'total_amount': None,
        'line_items': [],
    }

    with pdfplumber.open(pdf_path) as pdf:
        full_text = ''
        for page in pdf.pages:
            full_text += page.extract_text() or ''

            # Extract tables for line items
            tables = page.extract_tables()
            for table in tables:
                if len(table) > 1:  # Has header + data
                    for row in table[1:]:
                        cleaned = [str(cell).strip() if cell else '' for cell in row]
                        if any(cleaned):
                            data['line_items'].append(cleaned)

    # Extract fields using regex
    gstin_matches = re.findall(GSTIN_PATTERN, full_text)
    if gstin_matches:
        data['vendor_gstin'] = gstin_matches[0]

    inv_match = re.search(INVOICE_PATTERN, full_text, re.IGNORECASE)
    if inv_match:
        data['invoice_no'] = inv_match.group(1)

    date_matches = re.findall(DATE_PATTERN, full_text)
    if date_matches:
        data['date'] = date_matches[0]

    amount_match = re.search(AMOUNT_PATTERN, full_text, re.IGNORECASE)
    if amount_match:
        data['total_amount'] = float(amount_match.group(1).replace(',', ''))

    return data


def process_invoice_folder(folder_path, output_file='parsed_invoices.xlsx'):
    """Process all PDFs in a folder and export to Excel."""
    results = []
    pdf_files = [f for f in os.listdir(folder_path) if f.lower().endswith('.pdf')]
    print(f'Found {len(pdf_files)} PDF invoices to process...')

    for i, filename in enumerate(pdf_files, 1):
        filepath = os.path.join(folder_path, filename)
        try:
            invoice = extract_invoice_data(filepath)
            results.append({
                'File': invoice['file'],
                'Vendor GSTIN': invoice['vendor_gstin'],
                'Invoice No': invoice['invoice_no'],
                'Date': invoice['date'],
                'Total Amount': invoice['total_amount'],
                'Line Items Count': len(invoice['line_items']),
            })
            print(f'  [{i}/{len(pdf_files)}] Parsed: {filename}')
        except Exception as e:
            print(f'  [{i}/{len(pdf_files)}] FAILED: {filename} - {e}')
            results.append({'File': filename, 'Error': str(e)})

    df = pd.DataFrame(results)
    df.to_excel(output_file, index=False)
    print(f'\nDone! Results saved to {output_file}')
    print(f'Successfully parsed: {len([r for r in results if "Error" not in r])}/{len(results)}')
    return df


# Usage
if __name__ == '__main__':
    df = process_invoice_folder('./invoices')
```
Breaking Down the Key Parts
GSTIN Extraction
The GSTIN (Goods and Services Tax Identification Number) follows a fixed 15-character format across India. The regex `\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}` matches this pattern reliably: the first two digits are the state code, the next 10 characters are the PAN, the 13th is an entity code, the 14th is always 'Z', and the 15th is a check character.
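A regex match alone does not prove a GSTIN is genuine: the 15th character is a Luhn mod-36 check digit, so you can validate it too. Here is a minimal sketch of that check (the `gstin_checksum_ok` helper and the sample GSTIN are my own illustration, not part of the script above):

```python
import re

GSTIN_PATTERN = r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}'
CHARS = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # base-36 alphabet

def gstin_checksum_ok(gstin):
    """Validate the 15th character using the Luhn mod-36 algorithm."""
    if not re.fullmatch(GSTIN_PATTERN, gstin):
        return False
    total = 0
    factor = 2
    # Walk the first 14 characters right to left, alternating factors 2 and 1
    for ch in reversed(gstin[:14]):
        addend = factor * CHARS.index(ch)
        total += addend // 36 + addend % 36
        factor = 1 if factor == 2 else 2
    check_char = CHARS[(36 - total % 36) % 36]
    return check_char == gstin[14]
```

Fields that match the regex but fail this checksum are exactly the ones worth flagging for human review.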
Table Extraction with pdfplumber
This is where pdfplumber really shines. Instead of trying to parse free-form text line by line, page.extract_tables() detects table boundaries in the PDF and returns structured 2D arrays. For most GST invoices, line items sit in a table with columns like Description, HSN Code, Quantity, Rate, and Amount.
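To make those raw rows usable, I feed them into pandas. A minimal sketch of that step, with the caveat that the rows and column names below are illustrative assumptions (real invoices vary, so in practice you map headers per vendor):

```python
import pandas as pd

# Rows as pdfplumber's extract_tables() returns them: lists of cell strings.
raw_rows = [
    ['Office Chair', '9401', '2', '4500.00', '9000.00'],
    ['Desk Lamp', '9405', '5', '800.00', '4000.00'],
]
columns = ['Description', 'HSN Code', 'Quantity', 'Rate', 'Amount']

items = pd.DataFrame(raw_rows, columns=columns)
# Convert numeric columns so totals can be cross-checked against the PDF
for col in ['Quantity', 'Rate', 'Amount']:
    items[col] = pd.to_numeric(items[col].str.replace(',', ''))
```

Once the numeric columns are converted, verifying that `Quantity * Rate` matches `Amount` is a one-liner, which replaces the manual tax-total verification step.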
The Fallback Strategy
Not every invoice is perfectly formatted. Some are scanned images (which need OCR — a story for another day). Some have weird layouts. The try/except block ensures one bad PDF does not crash the entire batch. Failed files get logged so you can handle them manually.
Results After 30 Days
Here is what changed after deploying this script:
- Processing time: 15+ hours/week dropped to about 20 minutes (the script runs in under 2 minutes, plus 15-18 minutes for human review of flagged items)
- Error rate: 8% dropped to under 2% (most remaining errors are from badly scanned PDFs)
- Monthly savings: Roughly Rs 35,000/month in reduced data entry labour
- Staff redeployment: 2 of the 3 data entry staff now handle client communication and follow-ups instead
The script paid for itself on day one.
Making It Production-Ready
For the actual deployment, I added a few extras that pushed it beyond 75 lines:
1. Email integration — A simple IMAP script that auto-downloads PDF attachments from specific senders.
2. Confidence scoring — Each extracted field gets a confidence score. If the GSTIN regex matches but the checksum fails, it gets flagged for human review.
3. Tally XML export — Instead of Excel, the final version exports directly to Tally-compatible XML, cutting out another manual step.
4. Watchdog monitoring — Uses Python's watchdog library to monitor a folder. Drop a PDF in, get structured data out in seconds.
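For a flavour of the Tally export step, here is a sketch using the stdlib `xml.etree`. The envelope tags follow Tally's import format as I understand it, but treat the exact schema (voucher type, ledger fields) as an assumption and verify against your Tally version before importing real data:

```python
import xml.etree.ElementTree as ET

def invoice_to_tally_xml(invoice):
    """Wrap one parsed invoice dict in a simplified Tally import envelope."""
    envelope = ET.Element('ENVELOPE')
    header = ET.SubElement(envelope, 'HEADER')
    ET.SubElement(header, 'TALLYREQUEST').text = 'Import Data'
    body = ET.SubElement(envelope, 'BODY')
    importdata = ET.SubElement(body, 'IMPORTDATA')
    reqdata = ET.SubElement(importdata, 'REQUESTDATA')
    msg = ET.SubElement(reqdata, 'TALLYMESSAGE')
    voucher = ET.SubElement(msg, 'VOUCHER', VCHTYPE='Purchase')
    ET.SubElement(voucher, 'DATE').text = invoice['date'] or ''
    ET.SubElement(voucher, 'REFERENCE').text = invoice['invoice_no'] or ''
    ET.SubElement(voucher, 'PARTYGSTIN').text = invoice['vendor_gstin'] or ''
    ET.SubElement(voucher, 'AMOUNT').text = str(invoice['total_amount'] or 0)
    return ET.tostring(envelope, encoding='unicode')

# Hypothetical parsed invoice, shaped like extract_invoice_data's output
sample = {'date': '01-04-2025', 'invoice_no': 'INV-101',
          'vendor_gstin': '27AAPFU0939F1ZV', 'total_amount': 13000.0}
xml_out = invoice_to_tally_xml(sample)
```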
Try It Yourself
Here is a quick way to test this on your own invoices:
```shell
mkdir invoice_parser && cd invoice_parser
pip install pdfplumber pandas openpyxl
```
Save the script above as `parser.py`, drop some PDF invoices into an `invoices/` folder, and run:
```shell
python parser.py
```
You will get a clean parsed_invoices.xlsx with all extracted data.
What I Learned
Start with regex, not ML. For structured documents like Indian GST invoices, regex handles 90% of cases. ML is overkill until you hit that last 10%.
pdfplumber beats PyPDF2 for tables. I wasted a day with PyPDF2 before switching. pdfplumber's table detection is significantly better.
Always build in human review. Automation does not mean zero humans. The sweet spot is automating the boring 95% and letting humans handle the interesting edge cases.
Measure before and after. The Rs 35,000/month savings number sold the client on expanding automation to other workflows. Without that number, it is just a cool script.
What Is Next
I am currently building v2 that adds OCR support for scanned invoices using pytesseract, and a simple web dashboard using Streamlit where the CA firm's team can upload invoices and download results. That will be next week's article.
If you are dealing with similar document processing problems — whether it is invoices, purchase orders, bank statements, or any structured PDF — the approach is the same: extract text, match patterns, structure data, export.
I am Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.
Have a workflow that is eating your team's time? Drop a comment below — I might build it and write about it next.