Last month, I worked with a mid-sized CA firm in Andheri, Mumbai. They had 4 junior accountants whose full-time job was reconciling GSTR-2B data with purchase invoices for 80+ clients. Each reconciliation took 6-8 hours per client per month.
The math was brutal: 4 accountants × ₹35,000/month × 12 months = ₹16.8 lakh/year in salaries, plus ₹1.2 lakh in infrastructure and overhead. Total: ₹18 lakh/year of manual work that a 200-line Python script could replace.
Here is exactly how I did it.
## The Problem
Every month, Indian businesses running on GST deal with two data sources:
- GSTR-2B JSON from the GST portal — purchases reported by your suppliers.
- Purchase register exported from Tally, Zoho Books, Busy, or plain Excel.
An accountant then matches invoice numbers, GSTIN, taxable values, and tax amounts row by row. Mismatches — wrong GSTIN, missing invoices, ITC eligibility flags — get flagged manually. It is tedious, error-prone, and scales linearly with client count.
The firm's owner told me plainly: "We quote ₹5,000/client for monthly reconciliation. We spend ₹8,000 worth of accountant hours on it. Every new client makes us less profitable."
That was the unit economics problem I needed to fix.
## The Solution: A 3-Part Python Pipeline
I built a three-step pipeline that runs in about 4 minutes per client instead of 8 hours.
### Step 1: Parse GSTR-2B JSON
The JSON from the GST portal is nested. Flatten it into a reconciliation-ready DataFrame:
```python
import json
import pandas as pd
from pathlib import Path


def parse_gstr2b(json_path: Path) -> pd.DataFrame:
    """Flatten GSTR-2B JSON into a reconciliation-ready DataFrame."""
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)

    rows = []
    b2b_section = data.get('data', {}).get('docdata', {}).get('b2b', [])
    for supplier in b2b_section:
        gstin = supplier.get('ctin', '')
        supplier_name = supplier.get('trdnm', '')
        for invoice in supplier.get('inv', []):
            rows.append({
                'supplier_gstin': gstin,
                'supplier_name': supplier_name,
                'invoice_no': invoice.get('inum', '').strip().upper(),
                'invoice_date': invoice.get('dt', ''),
                'taxable_value': float(invoice.get('val', 0)),
                'igst': float(invoice.get('iamt', 0)),
                'cgst': float(invoice.get('camt', 0)),
                'sgst': float(invoice.get('samt', 0)),
                'itc_eligible': invoice.get('itcavl', 'Y') == 'Y',
            })
    return pd.DataFrame(rows)
```
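For reference, here is a minimal sketch of the nesting this parser walks. The field names (`ctin`, `trdnm`, `inv`, `inum`, and so on) are the ones the function reads; the values are made up, and a real portal download carries many more sections and attributes:

```python
# Hypothetical fixture mirroring the fields parse_gstr2b() reads.
sample = {
    "data": {
        "docdata": {
            "b2b": [
                {
                    "ctin": "27AAACB1234C1Z5",   # supplier GSTIN (made up)
                    "trdnm": "Bharat Traders",   # supplier trade name
                    "inv": [
                        {
                            "inum": " inv-001 ",  # raw invoice number, pre-normalization
                            "dt": "03-01-2026",
                            "val": 11800.0,
                            "iamt": 0.0,
                            "camt": 900.0,
                            "samt": 900.0,
                            "itcavl": "Y",
                        }
                    ],
                }
            ]
        }
    }
}

# The parser emits one row per invoice, with the invoice number
# normalized the same way the purchase register will be:
first_inv = sample["data"]["docdata"]["b2b"][0]["inv"][0]
print(first_inv["inum"].strip().upper())  # INV-001
```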
### Step 2: Normalize the Purchase Register
Client data comes in 15 different formats. Column names are inconsistent across Tally, Zoho, Busy, and handcrafted Excel sheets. I normalize everything to the same schema before reconciling:
```python
def load_purchase_register(file_path: Path) -> pd.DataFrame:
    """Handle Tally exports, Zoho CSVs, and client Excel files."""
    if file_path.suffix == '.csv':
        df = pd.read_csv(file_path)
    else:
        df = pd.read_excel(file_path, sheet_name=0)

    # Column aliases seen across 80+ clients
    alias_map = {
        'GSTIN': 'supplier_gstin',
        'Party GSTIN': 'supplier_gstin',
        'Vendor GSTIN': 'supplier_gstin',
        'Invoice Number': 'invoice_no',
        'Bill No': 'invoice_no',
        'Bill Number': 'invoice_no',
        'Taxable Amount': 'taxable_value',
        'Basic Amount': 'taxable_value',
        'IGST Amount': 'igst',
        'CGST Amount': 'cgst',
        'SGST Amount': 'sgst',
    }
    df = df.rename(columns={k: v for k, v in alias_map.items() if k in df.columns})
    df['invoice_no'] = df['invoice_no'].astype(str).str.strip().str.upper()

    numeric_cols = ['taxable_value', 'igst', 'cgst', 'sgst']
    for col in numeric_cols:
        if col in df.columns:
            df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
    return df
```
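To make the normalization concrete, here is the alias-rename and coercion logic applied to a toy Tally-style export (column names come from the alias map above; the values are illustrative):

```python
import pandas as pd

# A toy frame using Tally-style headers from the alias map.
raw = pd.DataFrame({
    'Party GSTIN': ['27AAACB1234C1Z5'],
    'Bill No': [' inv-7 '],
    'Basic Amount': ['1000.50'],  # amounts sometimes arrive as strings
})

alias_map = {
    'Party GSTIN': 'supplier_gstin',
    'Bill No': 'invoice_no',
    'Basic Amount': 'taxable_value',
}

df = raw.rename(columns={k: v for k, v in alias_map.items() if k in raw.columns})
df['invoice_no'] = df['invoice_no'].astype(str).str.strip().str.upper()
# errors='coerce' turns anything unparseable into NaN, then fillna(0) zeroes it
df['taxable_value'] = pd.to_numeric(df['taxable_value'], errors='coerce').fillna(0)

print(df.to_dict('records'))
# [{'supplier_gstin': '27AAACB1234C1Z5', 'invoice_no': 'INV-7', 'taxable_value': 1000.5}]
```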
### Step 3: Reconcile and Flag Mismatches
This is the heart of the script. Outer-join both datasets on GSTIN and invoice number, then tag every row with a reconciliation status:
```python
def reconcile(gstr2b: pd.DataFrame, purchase: pd.DataFrame,
              tolerance: float = 1.0) -> pd.DataFrame:
    """Match on GSTIN + invoice_no. Flag tax differences above tolerance."""
    merged = gstr2b.merge(
        purchase,
        on=['supplier_gstin', 'invoice_no'],
        how='outer',
        suffixes=('_2b', '_book'),
        indicator=True,
    )

    def classify(row):
        if row['_merge'] == 'left_only':
            return 'MISSING_IN_BOOKS'
        if row['_merge'] == 'right_only':
            return 'MISSING_IN_2B'
        diff_igst = abs(row.get('igst_2b', 0) - row.get('igst_book', 0))
        diff_cgst = abs(row.get('cgst_2b', 0) - row.get('cgst_book', 0))
        diff_sgst = abs(row.get('sgst_2b', 0) - row.get('sgst_book', 0))
        if max(diff_igst, diff_cgst, diff_sgst) > tolerance:
            return 'TAX_MISMATCH'
        diff_taxable = abs(
            row.get('taxable_value_2b', 0) - row.get('taxable_value_book', 0)
        )
        if diff_taxable > tolerance:
            return 'VALUE_MISMATCH'
        return 'MATCHED'

    merged['status'] = merged.apply(classify, axis=1)
    return merged
```
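The outer join with `indicator=True` does the heavy lifting: the `_merge` column records where each invoice was found, and the classifier just reads it. A toy illustration with made-up GSTINs:

```python
import pandas as pd

gstr2b = pd.DataFrame({
    'supplier_gstin': ['27AAA', '27AAA', '29BBB'],
    'invoice_no':     ['INV-1', 'INV-2', 'INV-9'],
    'cgst':           [900.0, 450.0, 180.0],
})
books = pd.DataFrame({
    'supplier_gstin': ['27AAA', '27AAA'],
    'invoice_no':     ['INV-1', 'INV-3'],
    'cgst':           [900.0, 300.0],
})

merged = gstr2b.merge(
    books,
    on=['supplier_gstin', 'invoice_no'],
    how='outer',
    suffixes=('_2b', '_book'),
    indicator=True,
)

# _merge tells us where each invoice was found:
#   both       -> candidate for tax/value comparison
#   left_only  -> in 2B only, i.e. MISSING_IN_BOOKS
#   right_only -> in books only, i.e. MISSING_IN_2B
print(merged[['invoice_no', '_merge']])
```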
### The Orchestrator
A small driver wraps everything into an Excel report the accountant can email to the client:
```python
def run_reconciliation(client_code: str, month: str):
    base = Path(f'/clients/{client_code}/{month}')
    gstr2b = parse_gstr2b(base / 'gstr2b.json')
    purchase = load_purchase_register(base / 'purchase_register.xlsx')
    result = reconcile(gstr2b, purchase)
    summary = result['status'].value_counts().to_dict()

    output_path = base / f'reco_{client_code}_{month}.xlsx'
    with pd.ExcelWriter(output_path, engine='openpyxl') as writer:
        result.to_excel(writer, sheet_name='Full_Reconciliation', index=False)
        for status in ['MISSING_IN_BOOKS', 'MISSING_IN_2B', 'TAX_MISMATCH']:
            subset = result[result['status'] == status]
            subset.to_excel(writer, sheet_name=status, index=False)

    print(f'{client_code} {month}: {summary}')
    return output_path
```
Drop this into a scheduled job (cron, or Windows Task Scheduler on the firm's machine), point it at a folder structure like `/clients/ABC123/2026-03/`, and you have a nightly reconciliation run for every client.
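A sketch of how a scheduled run might discover its work, assuming the `/clients/<code>/<month>/` layout; `discover_jobs` is my name for the helper, not part of the script above:

```python
from pathlib import Path

def discover_jobs(root: Path) -> list[tuple[str, str]]:
    """Return (client_code, month) pairs that have a GSTR-2B
    download waiting under root/<client_code>/<month>/."""
    jobs = []
    for client_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for month_dir in sorted(p for p in client_dir.iterdir() if p.is_dir()):
            if (month_dir / 'gstr2b.json').exists():
                jobs.append((client_dir.name, month_dir.name))
    return jobs

# Nightly entry point (cron or Task Scheduler):
# for code, month in discover_jobs(Path('/clients')):
#     run_reconciliation(code, month)
```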
## The Results — 60 Days Later
After running this pipeline across all 80 clients for two months:
| Metric | Before | After |
|---|---|---|
| Time per client / month | 7 hrs | 15 min |
| Accountants needed | 4 | 1 (review only) |
| Monthly labour cost | ₹1.4 lakh | ₹45,000 |
| Error rate | 4-6% | under 0.5% |
| Client turnaround | 3-5 days | Same day |
None of the four junior accountants was laid off: one stayed on as reviewer, and the other three were redeployed to higher-value advisory work (ITR-3 filings, tax planning, GST notice responses) at ₹60,000/month. A real upgrade for them, and higher revenue per head for the firm.
Annual savings: approximately ₹18 lakh. Script cost: one weekend of my time, plus ₹0/month in infra (it runs on their existing laptop via a Windows Task Scheduler job).
## Gotchas I Hit
A few things that tripped me up the first time and are worth calling out:
- **Invoice number normalization is the hardest part.** Suppliers write "INV-001", "INV/001", "Inv 001", and "invoice 1" for the same bill. On top of the `.strip().upper()` normalization, I added a `difflib.SequenceMatcher` fuzzy-match layer with a 0.85 cutoff for leftover unmatched rows. It caught another 3% of matches that would otherwise have been false negatives.
- **Rounding differences in CGST/SGST.** Tally and the GST portal occasionally round in opposite directions. A ₹0.50 tolerance on tax heads removes about 90% of false mismatches.
- **Credit notes live in a different JSON section.** They are under `cdnr` (credit/debit notes from registered suppliers) inside GSTR-2B, not `b2b`. Parse them separately and flip the sign.
- **ITC ineligibility flags under Rule 42/43 are not always set correctly by suppliers.** The script flags these for human review rather than auto-accepting them; human judgment still matters here.
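The fuzzy-match layer from the first gotcha can be sketched like this. The 0.85 cutoff is the one I used; `best_match` is an illustrative helper name, and you would run it only over rows left unmatched by the exact GSTIN + invoice number join:

```python
from difflib import SequenceMatcher
from typing import Optional

def best_match(target: str, candidates: list, cutoff: float = 0.85) -> Optional[str]:
    """Return the candidate invoice number most similar to target,
    or None if nothing clears the cutoff."""
    best, best_score = None, cutoff
    for cand in candidates:
        score = SequenceMatcher(None, target, cand).ratio()
        if score >= best_score:
            best, best_score = cand, score
    return best

# 'INV/001' clears the cutoff against 'INV-001'; the others do not.
print(best_match('INV-001', ['INV/001', 'INV-999', 'BILL-77']))
```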
## Who Can Use This Today
If you run or work at a CA firm, a tax consultancy, or any business with 20+ vendor invoices per month, this pipeline pays for itself in week one. The full production script is around 200 lines. Swap pandas for polars if you are processing 100,000+ rows per month.
The pattern generalizes. Any reconciliation that compares two structured data sources is a 1-2 day Python project that can eliminate a full-time role:
- Bank statements vs. books of accounts
- Payroll register vs. PF/ESI challans
- Sales register vs. e-way bills
- Shopify/Amazon settlement reports vs. GSTR-1
India has roughly 1.4 million CA firms and tax consultants. Most of them are still doing this by hand. The tooling gap is staggering — and the ROI on closing it is immediate.
If you try this and run into trouble, the three places things usually break are: invoice number cleaning, JSON schema drift when the GSTN portal updates, and the `pd.read_excel` engine choice for large files (use `openpyxl` for `.xlsx`, `xlrd==1.2.0` for legacy `.xls`).
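One way to make that engine choice explicit (a small sketch; `pick_engine` is my name, not a pandas API, and the `xlrd==1.2.0` pin is per the note above):

```python
from pathlib import Path

def pick_engine(path: Path) -> str:
    """Choose the pandas Excel engine from the file extension:
    openpyxl for .xlsx, xlrd (pinned to 1.2.0) for legacy .xls."""
    suffix = path.suffix.lower()
    if suffix == '.xlsx':
        return 'openpyxl'
    if suffix == '.xls':
        return 'xlrd'
    raise ValueError(f'Unsupported spreadsheet format: {suffix!r}')

# df = pd.read_excel(path, engine=pick_engine(path))
```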
Ship it on Monday. Save ₹1 lakh by Friday.
I'm Archit Mittal — I automate chaos for businesses. Follow me for daily automation content.