Last month, a textile exporter in Surat reached out to me with a familiar problem: their accounts team of 4 people was spending 6+ hours every day manually matching invoices to purchase orders and bank statements. For a business doing ₹2 crore in annual revenue with 200+ invoices per month, this was eating into their margins badly.
Here's how I automated their entire invoice processing pipeline in just 3 days using Python — and brought that 6-hour daily task down to 15 minutes of human review.
## The Problem: Death by Manual Data Entry
Their workflow looked like this:
- Receive invoices via email (PDF attachments) and WhatsApp
- Manually type invoice details into a Google Sheet
- Cross-check each invoice against purchase orders in another sheet
- Match payments in their bank statement CSV to invoices
- Flag mismatches and follow up with vendors
The error rate was roughly 8-12% — wrong amounts, missed invoices, duplicate entries. Each error meant hours of back-and-forth with vendors and delayed payments.
Cost of the old process:
- 4 staff spending 6 hours/day on this, at ₹15,000/month each ≈ ₹60,000/month just on invoice processing
- Late payment penalties averaging ₹25,000/month
- Total: ~₹85,000/month or ₹10.2 lakh/year
## The Solution: A 3-Layer Python Pipeline
I built a pipeline with three components: Extract, Match, and Report.
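At a high level, the three stages just chain into each other. Here's a sketch of that wiring with stub stages (the stage bodies below are placeholders, not the real implementations, which come in the Day 1–3 sections):

```python
# Sketch of the overall flow with placeholder stages; the real
# extract/match/report logic is fleshed out in the sections below.
def extract(pdf_paths):
    # parse each PDF into a dict of invoice fields (stubbed here)
    return [{"invoice_number": f"INV-{i}", "total_amount": 100.0 * (i + 1)}
            for i, _ in enumerate(pdf_paths)]

def match(invoices):
    # attach a match status to every extracted invoice (stubbed here)
    return [{**inv, "status": "matched"} for inv in invoices]

def report(matched):
    # roll results up into summary counts for the review dashboard
    summary = {}
    for inv in matched:
        summary[inv["status"]] = summary.get(inv["status"], 0) + 1
    return summary

def run_pipeline(pdf_paths):
    return report(match(extract(pdf_paths)))
```

Keeping each stage a plain function that takes and returns simple data made each one testable on its own.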
### Day 1: Invoice Data Extraction
First, I needed to pull structured data from PDF invoices. I used pdfplumber for text-based PDFs and pytesseract for scanned ones.
```python
import pdfplumber
import re
from dataclasses import dataclass

@dataclass
class InvoiceData:
    invoice_number: str
    vendor_name: str
    date: str
    total_amount: float
    gst_amount: float
    line_items: list

def extract_invoice_data(pdf_path: str) -> InvoiceData:
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""

    # Extract key fields using regex patterns
    inv_number = re.search(
        r'Invoice\s*#?\s*:?\s*([A-Z0-9-]+)',
        full_text, re.IGNORECASE
    )
    amount = re.search(
        r'Total\s*:?\s*₹?\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )
    gst = re.search(
        r'GST\s*:?\s*₹?\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )

    # extract_vendor_name, extract_date and extract_line_items are
    # separate regex-based helpers defined elsewhere in the pipeline
    return InvoiceData(
        invoice_number=inv_number.group(1) if inv_number else "UNKNOWN",
        vendor_name=extract_vendor_name(full_text),
        date=extract_date(full_text),
        total_amount=float(amount.group(1).replace(',', '')) if amount else 0.0,
        gst_amount=float(gst.group(1).replace(',', '')) if gst else 0.0,
        line_items=extract_line_items(full_text)
    )
```
For the 30% of invoices that were scanned images, I added an OCR fallback:
```python
import pytesseract
from pdf2image import convert_from_path

def ocr_extract(pdf_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for img in images:
        text += pytesseract.image_to_string(img, lang='eng+hin')
    return text
```
The Hindi language support was critical since many local vendors send bilingual invoices.
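The two extractors are wired together with a simple heuristic: if the text layer comes back nearly empty, the file is probably a scan and goes through OCR instead. A sketch of that routing (the 50-character threshold is an assumption, not a tuned value, and the extractors are passed in as callables to keep the sketch self-contained):

```python
def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Treat a PDF as scanned if the text layer recovered almost nothing."""
    return len(extracted_text.strip()) < min_chars

def get_invoice_text(pdf_path, text_extractor, ocr_extractor, min_chars=50):
    # try the cheap text layer first, fall back to OCR for scans
    text = text_extractor(pdf_path)
    if needs_ocr(text, min_chars):
        text = ocr_extractor(pdf_path)
    return text
```

In the real pipeline, `text_extractor` is the pdfplumber path and `ocr_extractor` is `ocr_extract` from above; OCR at 300 DPI is slow, so only running it when needed kept batch times down.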
### Day 2: Smart Matching Engine
The matching logic compares extracted invoice data against purchase orders and bank transactions. I used fuzzy matching because vendor names are never consistent (think "Raj Textiles" vs "Raj Textile Pvt Ltd").
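To see why token-sort matching handles this, here's the same idea rebuilt on the standard library's `difflib` (fuzzywuzzy's `token_sort_ratio` does essentially this, scaled to 0–100):

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> int:
    """Sort the words first so word order doesn't hurt the score."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, norm_a, norm_b).ratio() * 100)

# "Raj Textiles" vs "Raj Textile Pvt Ltd": word order and suffixes
# differ, yet the score stays well above what exact matching would give
score = token_sort_ratio("Raj Textiles", "Raj Textile Pvt Ltd")
```

An exact string comparison scores this pair at zero; the token-sort score lands comfortably above the review threshold used below.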
```python
from fuzzywuzzy import fuzz
import pandas as pd

def match_invoice_to_po(invoice: InvoiceData, po_df: pd.DataFrame) -> dict:
    best_match = None
    best_score = 0
    for _, po in po_df.iterrows():
        # Fuzzy match on vendor name
        name_score = fuzz.token_sort_ratio(
            invoice.vendor_name.lower(),
            po['vendor_name'].lower()
        )
        # Exact or close match on amount (within 1% for rounding)
        amount_diff = abs(invoice.total_amount - po['amount']) / po['amount']
        amount_score = 100 if amount_diff < 0.01 else max(0, 100 - amount_diff * 100)
        # Combined weighted score
        combined = (name_score * 0.4) + (amount_score * 0.6)
        if combined > best_score:
            best_score = combined
            best_match = po
    return {
        'po_number': best_match['po_number'] if best_score > 75 else None,
        'confidence': best_score,
        'status': 'matched' if best_score > 85 else 'review' if best_score > 75 else 'unmatched'
    }

def match_to_bank_statement(invoice: InvoiceData, bank_df: pd.DataFrame) -> dict:
    """Match invoice to bank transactions within a 7-day window."""
    # Assumes bank_df['date'] was parsed as datetime when the CSV was loaded
    for _, txn in bank_df.iterrows():
        amount_match = abs(txn['amount'] - invoice.total_amount) < 1.0
        date_match = abs((txn['date'] - pd.to_datetime(invoice.date)).days) <= 7
        if amount_match and date_match:
            return {'txn_id': txn['reference'], 'status': 'paid'}
    return {'txn_id': None, 'status': 'unpaid'}
```
### Day 3: Email Monitoring + Dashboard
I set up an email listener to auto-download invoice attachments and a simple Streamlit dashboard for the accounts team:
```python
import imaplib
import email
import os
from datetime import datetime, timedelta
from pathlib import Path

def fetch_invoice_emails(
    imap_server: str,
    username: str,
    password: str,
    download_dir: str = "./invoices"
):
    mail = imaplib.IMAP4_SSL(imap_server)
    mail.login(username, password)
    mail.select('inbox')

    # Search for invoice emails from the last 24 hours
    # (IMAP SINCE expects a dd-Mon-yyyy date, so compute it each run)
    since = (datetime.now() - timedelta(days=1)).strftime('%d-%b-%Y')
    _, messages = mail.search(None, f'(SINCE "{since}" SUBJECT "invoice")')

    Path(download_dir).mkdir(exist_ok=True)
    downloaded = []
    for msg_id in messages[0].split():
        _, msg_data = mail.fetch(msg_id, '(RFC822)')
        msg = email.message_from_bytes(msg_data[0][1])
        for part in msg.walk():
            if part.get_content_type() == 'application/pdf':
                filename = part.get_filename()
                if not filename:
                    continue  # skip unnamed attachments
                filepath = os.path.join(download_dir, filename)
                with open(filepath, 'wb') as f:
                    f.write(part.get_payload(decode=True))
                downloaded.append(filepath)
    mail.logout()
    return downloaded
```
## The Results: ₹8.5 Lakh Annual Savings
After one month of running the system:
- Processing time: 6 hours/day → 15 minutes of human review
- Error rate: 12% → 0.5% (only edge cases need manual intervention)
- Staff reallocation: 3 of 4 accounts staff moved to higher-value work
- Late payment penalties: ₹25,000/month → ₹2,000/month
- Net savings: ~₹8.5 lakh/year
The total cost of building this? About ₹45,000 for my 3 days of work plus ₹500/month for a basic VPS to run the pipeline. ROI hit positive in the first month itself.
## Key Takeaways for Indian Businesses
- **Start with the most painful manual process** — invoice matching was their biggest time sink, not the fanciest problem to solve
- **Build for bilingual** — if you're automating for Indian businesses, Hindi/regional language support isn't optional. `pytesseract` with language packs handles this well
- **Keep humans in the loop** — the system flags low-confidence matches for review instead of auto-approving everything. This built trust with the accounts team
- **Use fuzzy matching liberally** — Indian business names have many variations. Hard string matching will miss 40%+ of valid matches
- **Measure in ₹, not in hours** — when I told the owner "you'll save ₹8.5 lakh/year," the approval was instant. Saying "you'll save 1,400 hours" wouldn't have had the same impact
## Want to Build Something Similar?
The full pipeline is about 400 lines of Python. The key libraries you need:
- `pdfplumber` — PDF text extraction
- `pytesseract` + `pdf2image` — OCR for scanned documents
- `fuzzywuzzy` — fuzzy string matching
- `pandas` — data manipulation and CSV handling
- `streamlit` — quick dashboard for the review interface
- `imaplib` — email monitoring (built into Python)
The entire stack runs on a ₹500/month DigitalOcean droplet. No expensive SaaS subscriptions needed.
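On the server, two cron entries are enough to keep the whole thing running (paths, ports and script names here are illustrative, not the actual deployment):

```shell
# crontab sketch — fetch mail and run extract+match every 30 minutes
*/30 * * * * cd /opt/invoice-pipeline && /usr/bin/python3 run_pipeline.py >> pipeline.log 2>&1

# bring the Streamlit review dashboard back up after a reboot
@reboot cd /opt/invoice-pipeline && streamlit run dashboard.py --server.port 8501
```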
*I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.*