Last year, a textile trading firm in Surat approached me with a problem that's painfully common across Indian SMBs: their accounts team of 4 people was spending 6 hours every single day manually matching purchase invoices against delivery challans and bank statements.
That's 120 hours a month — burnt on copy-pasting numbers between Excel sheets and Tally.
Here's how I automated the entire pipeline with Python, saving them ₹2.4 lakh/month in operational costs and reducing errors to near-zero.
The Problem: Death by Manual Reconciliation
This ₹3 crore/year business processes roughly 800 invoices per month from 60+ suppliers. Before automation, every invoice had to be:
- Downloaded from email or WhatsApp
- Manually entered into an Excel tracker
- Cross-checked against the delivery challan
- Matched with the corresponding bank payment
- Flagged if any amount didn't match
The team was drowning. Invoices got missed, two duplicate payments slipped through in Q3, and month-end closing took 5 extra days.
The Solution: A 3-Stage Python Pipeline
I built a lightweight automation system using Python — no expensive ERP, no SaaS subscription. Just clean code running on a ₹30,000 mini-PC sitting in their office.
Stage 1: Invoice Extraction with PDF Parsing
```python
import pdfplumber
import re
from pathlib import Path

def extract_invoice_data(pdf_path: str) -> dict:
    """Extract key fields from a supplier invoice PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""

    # Extract invoice number
    inv_match = re.search(
        r'Invoice\s*#?\s*:?\s*([A-Z0-9\-/]+)',
        full_text, re.IGNORECASE
    )
    # Extract total amount (handles Rs. and INR formats)
    amt_match = re.search(
        r'(?:Total|Grand\s*Total|Net\s*Amount)\s*:?\s*(?:Rs\.?|INR)\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )
    # Extract date
    date_match = re.search(
        r'Date\s*:?\s*(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})',
        full_text, re.IGNORECASE
    )
    # Extract GSTIN (15 chars: state code + PAN + entity code + 'Z' + checksum)
    gst_match = re.search(
        r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}',
        full_text
    )

    return {
        "invoice_number": inv_match.group(1) if inv_match else None,
        "amount": float(amt_match.group(1).replace(",", "")) if amt_match else None,
        "date": date_match.group(1) if date_match else None,
        "gstin": gst_match.group(0) if gst_match else None,
        "source_file": Path(pdf_path).name
    }

# Process all invoices in a folder
invoice_folder = Path("./invoices/april_2025")
results = []
for pdf_file in invoice_folder.glob("*.pdf"):
    data = extract_invoice_data(str(pdf_file))
    results.append(data)
    print(f"Extracted: {data['invoice_number']} | Amount: {data['amount']}")
```
This script processes 800 PDFs in under 3 minutes. The old manual process? 4 people x 2 hours = 8 person-hours.
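From there, the extracted records drop straight into a pandas DataFrame for the matching stage. A minimal sketch of that handoff (the sample records and file names here are made up; in practice they come from `extract_invoice_data` above):

```python
import pandas as pd

# Hypothetical records in the shape extract_invoice_data() returns
results = [
    {"invoice_number": "INV-101", "amount": 45200.0, "date": "04/04/2025",
     "gstin": "24ABCDE1234F1Z5", "source_file": "inv_101.pdf"},
    {"invoice_number": None, "amount": None, "date": None,
     "gstin": None, "source_file": "scan_garbled.pdf"},
]

df = pd.DataFrame(results)
# Rows where extraction failed go straight to the manual-review pile
failed = df[df["invoice_number"].isna() | df["amount"].isna()]
df.to_csv("extracted_invoices.csv", index=False)
print(f"{len(df) - len(failed)} clean, {len(failed)} flagged for review")
```

Keeping the failed rows visible (rather than silently dropping them) is what lets the accounts team trust the pipeline.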
Stage 2: Smart Matching Engine
```python
import pandas as pd
from difflib import SequenceMatcher

def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity between two strings."""
    if not str1 or not str2:
        return 0.0
    return SequenceMatcher(None, str1.upper(), str2.upper()).ratio()

def reconcile_invoices(
    invoices: pd.DataFrame,
    challans: pd.DataFrame,
    tolerance: float = 100.0
) -> pd.DataFrame:
    """Match invoices against delivery challans."""
    matched = []
    unmatched_invoices = []

    for _, inv in invoices.iterrows():
        best_match = None
        best_score = 0
        for _, challan in challans.iterrows():
            amount_diff = abs(inv["amount"] - challan["amount"])
            if amount_diff > tolerance:
                continue
            name_score = fuzzy_match_score(
                inv.get("supplier", ""),
                challan.get("supplier", "")
            )
            # Weight amount closeness over supplier-name similarity
            score = (1 - amount_diff / max(inv["amount"], 1)) * 0.6 + name_score * 0.4
            if score > best_score:
                best_score = score
                best_match = challan

        if best_match is not None and best_score > 0.75:
            matched.append({
                "invoice_no": inv["invoice_number"],
                "challan_no": best_match["challan_number"],
                "invoice_amt": inv["amount"],
                "challan_amt": best_match["amount"],
                "difference": inv["amount"] - best_match["amount"],
                "confidence": round(best_score * 100, 1),
                "status": "MATCHED"
            })
        else:
            unmatched_invoices.append({
                "invoice_no": inv["invoice_number"],
                "invoice_amt": inv["amount"],
                "status": "UNMATCHED - NEEDS REVIEW"
            })

    return pd.DataFrame(matched + unmatched_invoices)
```
The key insight: Indian invoices frequently have small rounding differences (a few rupees on GST calculations). The tolerance parameter handles this gracefully instead of flagging false mismatches.
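To see the scoring in action, here is the same formula applied to one invoice/challan pair with a ₹2 GST rounding gap (the supplier names and amounts are illustrative):

```python
from difflib import SequenceMatcher

def fuzzy(a: str, b: str) -> float:
    # Case-insensitive similarity, same as fuzzy_match_score above
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

inv = {"supplier": "Shree Textiles Pvt Ltd", "amount": 45202.0}
chl = {"supplier": "SHREE TEXTILES PVT LTD", "amount": 45200.0}

amount_diff = abs(inv["amount"] - chl["amount"])      # Rs 2 rounding gap
name_score = fuzzy(inv["supplier"], chl["supplier"])  # 1.0 after casefolding
score = (1 - amount_diff / max(inv["amount"], 1)) * 0.6 + name_score * 0.4
print(round(score, 4))  # comfortably above the 0.75 match threshold
```

A strict equality check would have flagged this pair as a mismatch; the weighted score matches it with near-perfect confidence.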
Stage 3: Daily Report Generation
```python
from datetime import datetime
from pathlib import Path

import pandas as pd

def generate_daily_report(reconciled: pd.DataFrame, output_dir: str = "./reports"):
    """Generate a summary report for the accounts team."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)  # ensure the folder exists
    today = datetime.now().strftime("%Y-%m-%d")

    total_invoices = len(reconciled)
    matched = reconciled[reconciled["status"] == "MATCHED"]
    unmatched = reconciled[reconciled["status"] != "MATCHED"]
    total_matched_value = matched["invoice_amt"].sum()
    total_discrepancy = matched["difference"].abs().sum()

    report_lines = [
        f"DAILY RECONCILIATION REPORT - {today}",
        "=" * 45,
        f"Total Invoices Processed : {total_invoices}",
        f"Successfully Matched     : {len(matched)} ({len(matched)/total_invoices*100:.1f}%)",
        f"Needs Manual Review      : {len(unmatched)}",
        f"Total Matched Value      : INR {total_matched_value:,.2f}",
        f"Total Discrepancies      : INR {total_discrepancy:,.2f}",
        "",
        "TOP UNMATCHED INVOICES:"
    ]
    for _, row in unmatched.head(10).iterrows():
        report_lines.append(f"  - {row['invoice_no']}: INR {row['invoice_amt']:,.2f}")

    report = "\n".join(report_lines)
    report_path = f"{output_dir}/reconciliation_{today}.txt"
    with open(report_path, "w") as f:
        f.write(report)

    print(f"Report saved: {report_path}")
    return report_path
```
The Results: 4 Months Later
Here's what changed after deploying this system:
- Time saved: 120 hours/month reduced to 8 hours/month (only manual review of flagged items)
- Cost saved: Approximately ₹2.4 lakh per month (3 team members redeployed to higher-value work)
- Error rate: Dropped from around 5% to 0.3%
- Month-end closing: From 5 extra days down to same-day
- Duplicate payments: Zero since deployment
The ROI? The entire project cost ₹1.8 lakh (my fee + hardware). It paid for itself in 23 days.
Why This Matters for Indian Businesses
India has 63 million+ MSMEs. Most run their back-office on Excel + Tally + WhatsApp. The opportunity to automate isn't just about saving money — it's about freeing up humans to do work that actually needs a brain.
You don't need a ₹50 lakh ERP implementation. A focused Python script running on a cheap mini-PC can transform operations for businesses doing ₹1-10 crore annually.
What You Can Automate Today
If you're a developer looking to build automation solutions for Indian businesses, start here:
- Invoice processing (PDF to structured data)
- Bank statement reconciliation (CSV parsing + matching)
- GST return preparation (aggregation + formatting)
- Inventory alerts (threshold monitoring + WhatsApp notifications)
- Payment follow-ups (aging analysis + automated reminders)
Each of these is a ₹50K-₹2L project for a freelance automation developer.
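Bank statement reconciliation, for instance, follows the same pattern as the invoice matcher above. A minimal sketch, assuming a netbanking CSV export with `Txn Date`, `Description`, and `Debit` columns (the sample row and invoice are invented):

```python
import pandas as pd
from io import StringIO

# Hypothetical bank-statement CSV, in the shape netbanking portals export
csv_data = StringIO(
    "Txn Date,Description,Debit,Ref No\n"
    "05/04/2025,NEFT-SHREE TEXTILES-INV101,45200.00,N1234\n"
)
stmt = pd.read_csv(csv_data)

invoices = pd.DataFrame([{"invoice_number": "INV101", "amount": 45200.0}])

# Naive match: same debit amount AND invoice number in the narration
statuses = {}
for _, inv in invoices.iterrows():
    hit = stmt[
        (stmt["Debit"] == inv["amount"])
        & stmt["Description"].str.contains(inv["invoice_number"], regex=False)
    ]
    statuses[inv["invoice_number"]] = "paid" if not hit.empty else "unpaid"
print(statuses)
```

Real statements need more care (partial payments, bulk NEFT batches), but this two-condition filter already catches the bulk of routine payments.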
Try It Yourself
Clone the extraction script above, point it at a folder of invoice PDFs, and see what it pulls out. You'll be surprised how much structure exists in "unstructured" documents.
The full pipeline code (with CSV export, email integration, and WhatsApp alerts) is on my GitHub. Drop a comment if you want the link.
I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.