Last month, a textile exporter in Surat reached out to me with a familiar problem: their accounts team of 4 people was spending 6+ hours every day manually matching invoices to purchase orders and bank statements. For a business doing ₹2 crore in annual revenue with 200+ invoices per month, this was eating into their margins badly.
Here's how I automated their entire invoice processing pipeline in just 3 days using Python — and brought that 6-hour daily task down to 15 minutes of human review.
## The Problem: Death by Manual Data Entry
Their workflow looked like this:
- Receive invoices via email (PDF attachments) and WhatsApp
- Manually type invoice details into a Google Sheet
- Cross-check each invoice against purchase orders in another sheet
- Match payments in their bank statement CSV to invoices
- Flag mismatches and follow up with vendors
The error rate was roughly 8-12% — wrong amounts, missed invoices, duplicate entries. Each error meant hours of back-and-forth with vendors and delayed payments.
Cost of the old process:
- 4 staff spending 6 hours/day on this, at ₹15,000/month each ≈ ₹60,000/month just on invoice processing
- Late payment penalties averaging ₹25,000/month
- Total: ~₹85,000/month or ₹10.2 lakh/year
## The Solution: A 3-Layer Python Pipeline
I built a pipeline with three components: Extract, Match, and Report.
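At a high level, the three stages just chain into each other. Here's a sketch of that wiring with stub stages (the stage bodies below are placeholders, not the real implementations, which come in the Day 1–3 sections):

```python
# Sketch of the overall flow with placeholder stages; the real
# extract/match/report logic is fleshed out in the sections below.
def extract(pdf_paths):
    # parse each PDF into a dict of invoice fields (stubbed here)
    return [{"invoice_number": f"INV-{i}", "total_amount": 100.0 * (i + 1)}
            for i, _ in enumerate(pdf_paths)]

def match(invoices):
    # attach a match status to every extracted invoice (stubbed here)
    return [{**inv, "status": "matched"} for inv in invoices]

def report(matched):
    # roll results up into summary counts for the review dashboard
    summary = {}
    for inv in matched:
        summary[inv["status"]] = summary.get(inv["status"], 0) + 1
    return summary

def run_pipeline(pdf_paths):
    return report(match(extract(pdf_paths)))
```

Keeping each stage a plain function that takes and returns simple data made each one testable on its own.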
### Day 1: Invoice Data Extraction
First, I needed to pull structured data from PDF invoices. I used pdfplumber for text-based PDFs and pytesseract for scanned ones.
```python
import pdfplumber
import re
from dataclasses import dataclass

@dataclass
class InvoiceData:
    invoice_number: str
    vendor_name: str
    date: str
    total_amount: float
    gst_amount: float
    line_items: list

def extract_invoice_data(pdf_path: str) -> InvoiceData:
    with pdfplumber.open(pdf_path) as pdf:
        full_text = ""
        for page in pdf.pages:
            full_text += page.extract_text() or ""

    # Extract key fields using regex patterns
    inv_number = re.search(
        r'Invoice\s*#?\s*:?\s*([A-Z0-9-]+)',
        full_text, re.IGNORECASE
    )
    amount = re.search(
        r'Total\s*:?\s*₹?\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )
    gst = re.search(
        r'GST\s*:?\s*₹?\s*([\d,]+\.?\d*)',
        full_text, re.IGNORECASE
    )

    # extract_vendor_name, extract_date and extract_line_items are
    # separate regex-based helpers defined elsewhere in the pipeline
    return InvoiceData(
        invoice_number=inv_number.group(1) if inv_number else "UNKNOWN",
        vendor_name=extract_vendor_name(full_text),
        date=extract_date(full_text),
        total_amount=float(amount.group(1).replace(',', '')) if amount else 0.0,
        gst_amount=float(gst.group(1).replace(',', '')) if gst else 0.0,
        line_items=extract_line_items(full_text)
    )
```
For the 30% of invoices that were scanned images, I added an OCR fallback:
```python
import pytesseract
from pdf2image import convert_from_path

def ocr_extract(pdf_path: str) -> str:
    images = convert_from_path(pdf_path, dpi=300)
    text = ""
    for img in images:
        text += pytesseract.image_to_string(img, lang='eng+hin')
    return text
```
The Hindi language support was critical since many local vendors send bilingual invoices.
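The two extractors are wired together with a simple heuristic: if the text layer comes back nearly empty, the file is probably a scan and goes through OCR instead. A sketch of that routing (the 50-character threshold is an assumption, not a tuned value, and the extractors are passed in as callables to keep the sketch self-contained):

```python
def needs_ocr(extracted_text: str, min_chars: int = 50) -> bool:
    """Treat a PDF as scanned if the text layer recovered almost nothing."""
    return len(extracted_text.strip()) < min_chars

def get_invoice_text(pdf_path, text_extractor, ocr_extractor, min_chars=50):
    # try the cheap text layer first, fall back to OCR for scans
    text = text_extractor(pdf_path)
    if needs_ocr(text, min_chars):
        text = ocr_extractor(pdf_path)
    return text
```

In the real pipeline, `text_extractor` is the pdfplumber path and `ocr_extractor` is `ocr_extract` from above; OCR at 300 DPI is slow, so only running it when needed kept batch times down.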
### Day 2: Smart Matching Engine
The matching logic compares extracted invoice data against purchase orders and bank transactions. I used fuzzy matching because vendor names are never consistent (think "Raj Textiles" vs "Raj Textile Pvt Ltd").
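To see why token-sort matching handles this, here's the same idea rebuilt on the standard library's `difflib` (fuzzywuzzy's `token_sort_ratio` does essentially this, scaled to 0–100):

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> int:
    """Sort the words first so word order doesn't hurt the score."""
    norm_a = " ".join(sorted(a.lower().split()))
    norm_b = " ".join(sorted(b.lower().split()))
    return round(SequenceMatcher(None, norm_a, norm_b).ratio() * 100)

# "Raj Textiles" vs "Raj Textile Pvt Ltd": word order and suffixes
# differ, yet the score stays well above what exact matching would give
score = token_sort_ratio("Raj Textiles", "Raj Textile Pvt Ltd")
```

An exact string comparison scores this pair at zero; the token-sort score lands comfortably above the review threshold used below.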
```python
from fuzzywuzzy import fuzz
import pandas as pd

def match_invoice_to_po(invoice: InvoiceData, po_df: pd.DataFrame) -> dict:
    best_match = None
    best_score = 0
    for _, po in po_df.iterrows():
        # Fuzzy match on vendor name
        name_score = fuzz.token_sort_ratio(
            invoice.vendor_name.lower(),
            po['vendor_name'].lower()
        )
        # Exact or close match on amount (within 1% for rounding)
        amount_diff = abs(invoice.total_amount - po['amount']) / po['amount']
        amount_score = 100 if amount_diff < 0.01 else max(0, 100 - amount_diff * 100)
        # Combined weighted score
        combined = (name_score * 0.4) + (amount_score * 0.6)
        if combined > best_score:
            best_score = combined
            best_match = po
    return {
        'po_number': best_match['po_number'] if best_score > 75 else None,
        'confidence': best_score,
        'status': 'matched' if best_score > 85 else 'review' if best_score > 75 else 'unmatched'
    }

def match_to_bank_statement(invoice: InvoiceData, bank_df: pd.DataFrame) -> dict:
    """Match invoice to bank transactions within a 7-day window."""
    # Assumes bank_df['date'] was parsed as datetime when the CSV was loaded
    for _, txn in bank_df.iterrows():
        amount_match = abs(txn['amount'] - invoice.total_amount) < 1.0
        date_match = abs((txn['date'] - pd.to_datetime(invoice.date)).days) <= 7
        if amount_match and date_match:
            return {'txn_id': txn['reference'], 'status': 'paid'}
    return {'txn_id': None, 'status': 'unpaid'}
```
### Day 3: Email Monitoring + Dashboard
I set up an email listener to auto-download invoice attachments and a simple Streamlit dashboard for the accounts team:
```python
import imaplib
import email
import os
from datetime import datetime, timedelta
from pathlib import Path

def fetch_invoice_emails(
    imap_server: str,
    username: str,
    password: str,
    download_dir: str = "./invoices"
):
    mail = imaplib.IMAP4_SSL(imap_server)
    mail.login(username, password)
    mail.select('inbox')

    # Search for invoice emails from the last 24 hours
    # (IMAP SINCE expects a dd-Mon-yyyy date, so compute it each run)
    since = (datetime.now() - timedelta(days=1)).strftime('%d-%b-%Y')
    _, messages = mail.search(None, f'(SINCE "{since}" SUBJECT "invoice")')

    Path(download_dir).mkdir(exist_ok=True)
    downloaded = []
    for msg_id in messages[0].split():
        _, msg_data = mail.fetch(msg_id, '(RFC822)')
        msg = email.message_from_bytes(msg_data[0][1])
        for part in msg.walk():
            if part.get_content_type() == 'application/pdf':
                filename = part.get_filename()
                if not filename:
                    continue  # skip unnamed attachments
                filepath = os.path.join(download_dir, filename)
                with open(filepath, 'wb') as f:
                    f.write(part.get_payload(decode=True))
                downloaded.append(filepath)
    mail.logout()
    return downloaded
```
## The Results: ₹8.5 Lakh Annual Savings
After one month of running the system:
- Processing time: 6 hours/day → 15 minutes of human review
- Error rate: 12% → 0.5% (only edge cases need manual intervention)
- Staff reallocation: 3 of 4 accounts staff moved to higher-value work
- Late payment penalties: ₹25,000/month → ₹2,000/month
- Net savings: ~₹8.5 lakh/year
The total cost of building this? About ₹45,000 for my 3 days of work plus ₹500/month for a basic VPS to run the pipeline. ROI hit positive in the first month itself.
## Key Takeaways for Indian Businesses
- **Start with the most painful manual process** — invoice matching was their biggest time sink, not the fanciest problem to solve
- **Build for bilingual** — if you're automating for Indian businesses, Hindi/regional language support isn't optional. `pytesseract` with language packs handles this well
- **Keep humans in the loop** — the system flags low-confidence matches for review instead of auto-approving everything. This built trust with the accounts team
- **Use fuzzy matching liberally** — Indian business names have many variations. Hard string matching will miss 40%+ of valid matches
- **Measure in ₹, not in hours** — when I told the owner "you'll save ₹8.5 lakh/year," the approval was instant. Saying "you'll save 1,400 hours" wouldn't have had the same impact
## Want to Build Something Similar?
The full pipeline is about 400 lines of Python. The key libraries you need:
- `pdfplumber` — PDF text extraction
- `pytesseract` + `pdf2image` — OCR for scanned documents
- `fuzzywuzzy` — fuzzy string matching
- `pandas` — data manipulation and CSV handling
- `streamlit` — quick dashboard for the review interface
- `imaplib` — email monitoring (built into Python)
The entire stack runs on a ₹500/month DigitalOcean droplet. No expensive SaaS subscriptions needed.
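On the server, two cron entries are enough to keep the whole thing running (paths, ports and script names here are illustrative, not the actual deployment):

```shell
# crontab sketch — fetch mail and run extract+match every 30 minutes
*/30 * * * * cd /opt/invoice-pipeline && /usr/bin/python3 run_pipeline.py >> pipeline.log 2>&1

# bring the Streamlit review dashboard back up after a reboot
@reboot cd /opt/invoice-pipeline && streamlit run dashboard.py --server.port 8501
```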
*I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.*