DEV Community

Archit Mittal
Archit Mittal

Posted on

How I Automated Invoice Processing for a ₹3 Crore Business — Saved 120 Hours/Month

Last year, a textile trading firm in Surat approached me with a problem that's painfully common across Indian SMBs: their accounts team of 4 people was spending 6 hours every single day manually matching purchase invoices against delivery challans and bank statements.

That's 120 hours a month — burnt on copy-pasting numbers between Excel sheets and Tally.

Here's how I automated the entire pipeline with Python, saving them ₹2.4 lakh/month in operational costs and reducing errors to near-zero.


The Problem: Death by Manual Reconciliation

This ₹3 crore/year business processes roughly 800 invoices per month from 60+ suppliers. Every invoice had to be:

  1. Downloaded from email or WhatsApp
  2. Manually entered into an Excel tracker
  3. Cross-checked against the delivery challan
  4. Matched with the corresponding bank payment
  5. Flagged if any amount didn't match

The team was drowning. Invoices got missed, duplicate payments happened twice in Q3, and month-end closing took 5 extra days.


The Solution: A 3-Stage Python Pipeline

I built a lightweight automation system using Python — no expensive ERP, no SaaS subscription. Just clean code running on a ₹30,000 mini-PC sitting in their office.

Stage 1: Invoice Extraction with PDF Parsing

import pdfplumber
import re
from pathlib import Path

def extract_invoice_data(pdf_path: str) -> dict:
    """Extract key fields from a supplier invoice PDF."""
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""

    # Extract invoice number
    inv_match = re.search(
        r'Invoice\s*#?\s*:?\s*([A-Z0-9\-/]+)',
        full_text, re.IGNORECASE
    )

    # Extract total amount (handles Rs. and INR formats)
    amt_match = re.search(
        r'(?:Total|Grand\s*Total|Net\s*Amount)\s*:?\s*(?:Rs\.?|INR)\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )

    # Extract date
    date_match = re.search(
        r'Date\s*:?\s*(\d{1,2}[/\-]\d{1,2}[/\-]\d{2,4})',
        full_text, re.IGNORECASE
    )

    # Extract GSTIN
    gst_match = re.search(
        r'\d{2}[A-Z]{5}\d{4}[A-Z]{1}[A-Z\d]{1}[Z]{1}[A-Z\d]{1}',
        full_text
    )

    return {
        "invoice_number": inv_match.group(1) if inv_match else None,
        "amount": float(amt_match.group(1).replace(",", "")) if amt_match else None,
        "date": date_match.group(1) if date_match else None,
        "gstin": gst_match.group(0) if gst_match else None,
        "source_file": Path(pdf_path).name
    }

# Process all invoices in a folder
invoice_folder = Path("./invoices/april_2025")
results = []
for pdf_file in invoice_folder.glob("*.pdf"):
    data = extract_invoice_data(str(pdf_file))
    results.append(data)
    print(f"Extracted: {data['invoice_number']} | Amount: {data['amount']}")
Enter fullscreen mode Exit fullscreen mode

This script processes 800 PDFs in under 3 minutes. The old manual process? 4 people x 2 hours = 8 person-hours.

Stage 2: Smart Matching Engine

import pandas as pd
from difflib import SequenceMatcher

def fuzzy_match_score(str1: str, str2: str) -> float:
    """Calculate similarity between two strings."""
    if not str1 or not str2:
        return 0.0
    return SequenceMatcher(None, str1.upper(), str2.upper()).ratio()

def reconcile_invoices(
    invoices: pd.DataFrame,
    challans: pd.DataFrame,
    tolerance: float = 100.0
) -> pd.DataFrame:
    """Match invoices against delivery challans."""
    matched = []
    unmatched_invoices = []

    for _, inv in invoices.iterrows():
        best_match = None
        best_score = 0

        for _, challan in challans.iterrows():
            amount_diff = abs(inv["amount"] - challan["amount"])
            if amount_diff > tolerance:
                continue

            name_score = fuzzy_match_score(
                inv.get("supplier", ""),
                challan.get("supplier", "")
            )

            score = (1 - amount_diff / max(inv["amount"], 1)) * 0.6 + name_score * 0.4

            if score > best_score:
                best_score = score
                best_match = challan

        if best_match is not None and best_score > 0.75:
            matched.append({
                "invoice_no": inv["invoice_number"],
                "challan_no": best_match["challan_number"],
                "invoice_amt": inv["amount"],
                "challan_amt": best_match["amount"],
                "difference": inv["amount"] - best_match["amount"],
                "confidence": round(best_score * 100, 1),
                "status": "MATCHED"
            })
        else:
            unmatched_invoices.append({
                "invoice_no": inv["invoice_number"],
                "invoice_amt": inv["amount"],
                "status": "UNMATCHED - NEEDS REVIEW"
            })

    return pd.DataFrame(matched + unmatched_invoices)
Enter fullscreen mode Exit fullscreen mode

The key insight: Indian invoices frequently have small rounding differences (a few rupees on GST calculations). The tolerance parameter handles this gracefully instead of flagging false mismatches.

Stage 3: Daily Report Generation

from datetime import datetime

def generate_daily_report(reconciled: pd.DataFrame, output_dir: str = "./reports"):
    """Generate a summary report for the accounts team."""
    today = datetime.now().strftime("%Y-%m-%d")

    total_invoices = len(reconciled)
    matched = reconciled[reconciled["status"] == "MATCHED"]
    unmatched = reconciled[reconciled["status"] != "MATCHED"]

    total_matched_value = matched["invoice_amt"].sum()
    total_discrepancy = matched["difference"].abs().sum()

    report_lines = [
        f"DAILY RECONCILIATION REPORT - {today}",
        "=" * 45,
        f"Total Invoices Processed  : {total_invoices}",
        f"Successfully Matched      : {len(matched)} ({len(matched)/total_invoices*100:.1f}%)",
        f"Needs Manual Review       : {len(unmatched)}",
        f"Total Matched Value       : INR {total_matched_value:,.2f}",
        f"Total Discrepancies       : INR {total_discrepancy:,.2f}",
        "",
        "TOP UNMATCHED INVOICES:"
    ]

    for _, row in unmatched.head(10).iterrows():
        report_lines.append(f"  - {row['invoice_no']}: INR {row['invoice_amt']:,.2f}")

    report = "\n".join(report_lines)
    report_path = f"{output_dir}/reconciliation_{today}.txt"
    with open(report_path, "w") as f:
        f.write(report)

    print(f"Report saved: {report_path}")
    return report_path
Enter fullscreen mode Exit fullscreen mode

The Results: 4 Months Later

Here's what changed after deploying this system:

Time saved: 120 hours/month reduced to 8 hours/month (only manual review of flagged items)

Cost saved: Approximately 2.4 lakh per month (3 team members redeployed to higher-value work)

Error rate: Dropped from around 5% to 0.3%

Month-end closing: From 5 extra days down to same-day

Duplicate payments: Zero since deployment

The ROI? The entire project cost 1.8 lakh (my fee + hardware). It paid for itself in 23 days.


Why This Matters for Indian Businesses

India has 63 million+ MSMEs. Most run their back-office on Excel + Tally + WhatsApp. The opportunity to automate isn't just about saving money — it's about freeing up humans to do work that actually needs a brain.

You don't need a 50 lakh ERP implementation. A focused Python script running on a cheap mini-PC can transform operations for businesses doing 1-10 crore annually.

What You Can Automate Today

If you're a developer looking to build automation solutions for Indian businesses, start here:

  • Invoice processing (PDF to structured data)
  • Bank statement reconciliation (CSV parsing + matching)
  • GST return preparation (aggregation + formatting)
  • Inventory alerts (threshold monitoring + WhatsApp notifications)
  • Payment follow-ups (aging analysis + automated reminders)

Each of these is a 50K-2L project for a freelance automation developer.


Try It Yourself

Clone the extraction script above, point it at a folder of invoice PDFs, and see what it pulls out. You'll be surprised how much structure exists in "unstructured" documents.

The full pipeline code (with CSV export, email integration, and WhatsApp alerts) is on my GitHub. Drop a comment if you want the link.


I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.

Top comments (0)