Last month, I worked with a mid-sized CA firm in Andheri, Mumbai. They had 4 junior accountants whose full-time job was reconciling GSTR-2B data with purchase invoices for 80+ clients. Each reconciliation took 6-8 hours per client per month.
The math was brutal: 4 accountants × ₹35,000/month × 12 months = ₹16.8 lakh/year in salaries, plus ₹1.2 lakh in infrastructure and overhead. Total: ₹18 lakh/year of manual work that a 200-line Python script could replace.
Here is exactly how I did it.
## The Problem
Every month, Indian businesses running on GST deal with two data sources:
- GSTR-2B JSON from the GST portal — purchases reported by your suppliers.
- Purchase register exported from Tally, Zoho Books, Busy, or plain Excel.
An accountant then matches invoice numbers, GSTIN, taxable values, and tax amounts row by row. Mismatches — wrong GSTIN, missing invoices, ITC eligibility flags — get flagged manually. It is tedious, error-prone, and scales linearly with client count.
The firm's owner told me plainly: "We quote ₹5,000/client for monthly reconciliation. We spend ₹8,000 worth of accountant hours on it. Every new client makes us less profitable."
That was the unit economics problem I needed to fix.
## The Solution: A 3-Part Python Pipeline
I built a three-step pipeline that runs in about 4 minutes per client instead of 8 hours.
### Step 1: Parse GSTR-2B JSON
The JSON from the GST portal is nested. Flatten it into a reconciliation-ready DataFrame:
```python
import json
import pandas as pd
from pathlib import Path


def parse_gstr2b(json_path: Path) -> pd.DataFrame:
    """Flatten GSTR-2B JSON into a reconciliation-ready DataFrame."""
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    rows = []
    b2b_section = data.get('data', {}).get('docdata', {}).get('b2b', [])
    for supplier in b2b_section:
        gstin = supplier.get('ctin', '')
        supplier_name = supplier.get('trdnm', '')
        for invoice in supplier.get('inv', []):
            rows.append({
                'supplier_gstin': gstin,
                'supplier_name': supplier_name,
                'invoice_no': invoice.get('inum', '').strip().upper(),
                'invoice_date': invoice.get('dt', ''),
                'taxable_value': float(invoice.get('val', 0)),
                'igst': float(invoice.get('iamt', 0)),
                'cgst': float(invoice.get('camt', 0)),
                'sgst': float(invoice.get('samt', 0)),
                'itc_eligible': invoice.get('itcavl', 'Y') == 'Y',
            })
    return pd.DataFrame(rows)
```
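For reference, here is a minimal sketch of the nesting this parser walks. The field names (`ctin`, `trdnm`, `inv`, `inum`, and so on) are the ones the function reads; the values are made up, and a real portal download carries many more sections and attributes:

```python
# Hypothetical fixture mirroring the fields parse_gstr2b() reads.
sample = {
    "data": {
        "docdata": {
            "b2b": [
                {
                    "ctin": "27AAACB1234C1Z5",   # supplier GSTIN (made up)
                    "trdnm": "Bharat Traders",   # supplier trade name
                    "inv": [
                        {
                            "inum": " inv-001 ",  # raw invoice number, pre-normalization
                            "dt": "03-01-2026",
                            "val": 11800.0,
                            "iamt": 0.0,
                            "camt": 900.0,
                            "samt": 900.0,
                            "itcavl": "Y",
                        }
                    ],
                }
            ]
        }
    }
}

# The parser emits one row per invoice, with the invoice number
# normalized the same way the purchase register will be:
first_inv = sample["data"]["docdata"]["b2b"][0]["inv"][0]
print(first_inv["inum"].strip().upper())  # INV-001
```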
### Step 2: Normalize the Purchase Register
Client data comes in 15 different formats. Column names are inconsistent across Tally, Zoho, Busy, and handcrafted Excel sheets. I normalize everything to the same schema before reconciling:
```python
def load_purchase_register(file_path: Path) -> pd.DataFrame:
    """Handle Tally exports, Zoho CSVs, and client Excel files."""
    if file_path.suffix == '.csv':
        df = pd.read_csv(file_path)
    else:
        df = pd.read_excel(file_path, sheet_name=0)

    # Column aliases seen across 80+ clients
    alias_map = {
        'GSTIN': 'supplier_gstin',
        'Party GSTIN': 'supplier_gstin',
        'Vendor GSTIN': 'supplier_gstin',
        'Invoice Number': 'invoice_no',
        'Bill No': 'invoice_no',
        'Bill Number': 'invoice_no',
        'Taxable Amount': 'taxable_value',
        'Basic Amount': 'taxable_value',
        'IGST Amount': 'igst',
        'CGST Amount': 'cgst',
        'SGST Amount': 'sgst',
    }
    df = df.rename(columns={k: v for k, v in alias_map.items() if k in df.columns})
    df['invoice_no'] = df['invoice_no'].astype(str).str.strip().str.upper()

    numeric_cols = ['taxable_value', 'igst', 'cgst', 'sgst']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
    return df
```
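To make the normalization concrete, here is the alias-rename and coercion logic applied to a toy Tally-style export (column names come from the alias map above; the values are illustrative):

```python
import pandas as pd

# A toy frame using Tally-style headers from the alias map.
raw = pd.DataFrame({
    'Party GSTIN': ['27AAACB1234C1Z5'],
    'Bill No': [' inv-7 '],
    'Basic Amount': ['1000.50'],  # amounts sometimes arrive as strings
})

alias_map = {
    'Party GSTIN': 'supplier_gstin',
    'Bill No': 'invoice_no',
    'Basic Amount': 'taxable_value',
}

df = raw.rename(columns={k: v for k, v in alias_map.items() if k in raw.columns})
df['invoice_no'] = df['invoice_no'].astype(str).str.strip().str.upper()
# errors='coerce' turns anything unparseable into NaN, then fillna(0) zeroes it
df['taxable_value'] = pd.to_numeric(df['taxable_value'], errors='coerce').fillna(0)

print(df.to_dict('records'))
# [{'supplier_gstin': '27AAACB1234C1Z5', 'invoice_no': 'INV-7', 'taxable_value': 1000.5}]
```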
### Step 3: Reconcile and Flag Mismatches
This is the heart of the script. Outer-join both datasets on GSTIN and invoice number, then tag every row with a reconciliation status:
```python
def reconcile(gstr2b: pd.DataFrame, purchase: pd.DataFrame,
              tolerance: float = 1.0) -> pd.DataFrame:
    """Match on GSTIN + invoice_no. Flag tax differences above tolerance."""
    merged = gstr2b.merge(
        purchase,
        on=['supplier_gstin', 'invoice_no'],
        how='outer',
        suffixes=('_2b', '_book'),
        indicator=True,
    )

    def classify(row):
        if row['_merge'] == 'left_only':
            return 'MISSING_IN_BOOKS'
        if row['_merge'] == 'right_only':
            return 'MISSING_IN_2B'
        diff_igst = abs(row.get('igst_2b', 0) - row.get('igst_book', 0))
        diff_cgst = abs(row.get('cgst_2b', 0) - row.get('cgst_book', 0))
        diff_sgst = abs(row.get('sgst_2b', 0) - row.get('sgst_book', 0))
        if max(diff_igst, diff_cgst, diff_sgst) > tolerance:
            return 'TAX_MISMATCH'
        diff_taxable = abs(
            row.get('taxable_value_2b', 0) - row.get('taxable_value_book', 0)
        )
        if diff_taxable > tolerance:
            return 'VALUE_MISMATCH'
        return 'MATCHED'

    merged['status'] = merged.apply(classify, axis=1)
    return merged
```
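The outer join with `indicator=True` does the heavy lifting: the `_merge` column records where each invoice was found, and the classifier just reads it. A toy illustration with made-up GSTINs:

```python
import pandas as pd

gstr2b = pd.DataFrame({
    'supplier_gstin': ['27AAA', '27AAA', '29BBB'],
    'invoice_no':     ['INV-1', 'INV-2', 'INV-9'],
    'cgst':           [900.0, 450.0, 180.0],
})
books = pd.DataFrame({
    'supplier_gstin': ['27AAA', '27AAA'],
    'invoice_no':     ['INV-1', 'INV-3'],
    'cgst':           [900.0, 300.0],
})

merged = gstr2b.merge(
    books,
    on=['supplier_gstin', 'invoice_no'],
    how='outer',
    suffixes=('_2b', '_book'),
    indicator=True,
)

# _merge tells us where each invoice was found:
#   both       -> candidate for tax/value comparison
#   left_only  -> in 2B only, i.e. MISSING_IN_BOOKS
#   right_only -> in books only, i.e. MISSING_IN_2B
print(merged[['invoice_no', '_merge']])
```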
### The Orchestrator
A small driver wraps everything into an Excel report the accountant can email to the client:
```python
def run_reconciliation(client_code: str, month: str):
    base = Path(f'/clients/{client_code}/{month}')
    gstr2b = parse_gstr2b(base / 'gstr2b.json')
    purchase = load_purchase_register(base / 'purchase_register.xlsx')
    result = reconcile(gstr2b, purchase)
    summary = result['status'].value_counts().to_dict()

    output_path = base / f'reco_{client_code}_{month}.xlsx'
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        result.to_excel(writer, sheet_name='Full_Reconciliation', index=False)
        for status in ['MISSING_IN_BOOKS', 'MISSING_IN_2B', 'TAX_MISMATCH']:
            subset = result[result['status'] == status]
            subset.to_excel(writer, sheet_name=status, index=False)

    print(f'{client_code} {month}: {summary}')
    return output_path
```
Drop this into a scheduled job (cron, or Windows Task Scheduler on the firm's machine), point it at a folder structure like `/clients/ABC123/2026-03/`, and you have a nightly reconciliation run for every client.
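A sketch of how a scheduled run might discover its work, assuming the `/clients/<code>/<month>/` layout; `discover_jobs` is my name for the helper, not part of the script above:

```python
from pathlib import Path

def discover_jobs(root: Path) -> list[tuple[str, str]]:
    """Return (client_code, month) pairs that have a GSTR-2B
    download waiting under root/<client_code>/<month>/."""
    jobs = []
    for client_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for month_dir in sorted(p for p in client_dir.iterdir() if p.is_dir()):
            if (month_dir / 'gstr2b.json').exists():
                jobs.append((client_dir.name, month_dir.name))
    return jobs

# Nightly entry point (cron or Task Scheduler):
# for code, month in discover_jobs(Path('/clients')):
#     run_reconciliation(code, month)
```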
## The Results — 60 Days Later
After running this pipeline across all 80 clients for two months:
| Metric | Before | After |
|---|---|---|
| Time per client / month | 7 hrs | 15 min |
| Accountants needed | 4 | 1 (review only) |
| Monthly labour cost | ₹1.4 lakh | ₹45,000 |
| Error rate | 4-6% | under 0.5% |
| Client turnaround | 3-5 days | Same day |
None of the four junior accountants was laid off: one stayed on as reviewer, and the other three were redeployed to higher-value advisory work (ITR-3 filings, tax planning, GST notice responses) at ₹60,000/month. A real upgrade for them, and higher revenue per head for the firm.
Annual savings: approximately ₹18 lakh. Script cost: one weekend of my time, plus ₹0/month in infra (it runs on their existing laptop via a Windows Task Scheduler job).
## Gotchas I Hit
A few things that tripped me up the first time and are worth calling out:
- **Invoice number normalization is the hardest part.** Suppliers write "INV-001", "INV/001", "Inv 001", and "invoice 1" for the same bill. On top of the `.strip().upper()` normalization, I added a `difflib.SequenceMatcher` fuzzy-match layer with a 0.85 cutoff for leftover unmatched rows. It caught another 3% of matches that would otherwise have been false negatives.
- **Rounding differences in CGST/SGST.** Tally and the GST portal occasionally round in opposite directions. A ₹0.50 tolerance on tax heads removes about 90% of false mismatches.
- **Credit notes live in a different JSON section.** They are under `cdnr` (credit/debit notes from registered suppliers) inside GSTR-2B, not `b2b`. Parse them separately and flip the sign.
- **ITC ineligibility flags under Rule 42/43 are not always set correctly by suppliers.** The script flags these for human review rather than auto-accepting them; human judgment still matters here.
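The fuzzy-match layer from the first gotcha can be sketched like this. The 0.85 cutoff is the one I used; `best_match` is an illustrative helper name, and you would run it only over rows left unmatched by the exact GSTIN + invoice number join:

```python
from difflib import SequenceMatcher
from typing import Optional

def best_match(target: str, candidates: list, cutoff: float = 0.85) -> Optional[str]:
    """Return the candidate invoice number most similar to target,
    or None if nothing clears the cutoff."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = SequenceMatcher(None, target, cand).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best

# 'INV/001' clears the cutoff against 'INV-001'; the others do not.
print(best_match('INV-001', ['INV/001', 'INV-999', 'BILL-77']))
```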
## Who Can Use This Today
If you run or work at a CA firm, a tax consultancy, or any business with 20+ vendor invoices per month, this pipeline pays for itself in week one. The full production script is around 200 lines. Swap pandas for polars if you are processing 100,000+ rows per month.
The pattern generalizes. Any reconciliation that compares two structured data sources is a 1-2 day Python project that can eliminate a full-time role:
- Bank statements vs. books of accounts
- Payroll register vs. PF/ESI challans
- Sales register vs. e-way bills
- Shopify/Amazon settlement reports vs. GSTR-1
India has roughly 1.4 million CA firms and tax consultants. Most of them are still doing this by hand. The tooling gap is staggering — and the ROI on closing it is immediate.
If you try this and run into trouble, the three places things usually break are: invoice number cleaning, JSON schema drift when the GSTN portal updates, and the `pd.read_excel` engine choice for large files (use `openpyxl` for `.xlsx`, `xlrd==1.2.0` for legacy `.xls`).
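One way to make that engine choice explicit (a small sketch; `pick_engine` is my name, not a pandas API, and the `xlrd==1.2.0` pin is per the note above):

```python
from pathlib import Path

def pick_engine(path: Path) -> str:
    """Choose the pandas Excel engine from the file extension:
    openpyxl for .xlsx, xlrd (pinned to 1.2.0) for legacy .xls."""
    suffix = path.suffix.lower()
    if suffix == '.xlsx':
        return 'openpyxl'
    if suffix == '.xls':
        return 'xlrd'
    raise ValueError(f'Unsupported spreadsheet format: {suffix!r}')

# df = pd.read_excel(path, engine=pick_engine(path))
```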
Ship it on Monday. Save ₹1 lakh by Friday.
I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.