Robin

Posted on Feb 17

How to Automate GSTR-2A Reconciliation with Python: A Practical Guide for Indian Accountants

#python #gst #india #accounting

If you work in accounts payable at an Indian company, you already know the dread: every month, someone opens a spreadsheet, pulls up the GSTR-2A from the GST portal, and spends the next two or three days manually ticking off purchase invoices one by one. For mid-sized businesses with hundreds of vendors, that process can easily eat 15–20 hours a month — and still produce errors.

This guide shows you how to automate GSTR-2A reconciliation using Python. We'll load real-world-style data, normalize it, do exact and fuzzy matching, and generate a color-coded Excel report your accountant can actually use.

What Is GSTR-2A and Why Does It Need Reconciliation?

For readers who aren't accountants: GSTR-2A is an auto-populated return on India's GST portal. Every time one of your suppliers files their outward supply return (GSTR-1), their invoices addressed to your GSTIN automatically appear in your GSTR-2A.

Why it matters: You can only claim Input Tax Credit (ITC) on purchases that appear in your GSTR-2A and match your purchase register (the book of invoices your accounting software records). If your books say you paid ₹50,000 in GST to a supplier but the portal doesn't show it, you can't claim that ITC. Under Rule 36(4), excess ITC claims can attract demand notices and interest.

The reconciliation task is: compare every row in your purchase register against every row in your GSTR-2A, flag the matches, and investigate the gaps.

Data Sources

1. Purchase Register (from Tally/ERP)

Exported from TallyPrime or any ERP as CSV or Excel, it typically has these columns:

Column	Example
`supplier_gstin`	27AABCU9603R1ZX
`invoice_no`	INV/2025-26/0042
`invoice_date`	15-Apr-2025
`taxable_amount`	42372.88
`igst`	7627.12
`cgst`	0
`sgst`	0
`total_amount`	50000.00

2. GSTR-2A (from GST Portal)

Downloaded as Excel (or JSON via the GST API). The Excel download has sheets like B2B, B2BA, CDN, etc. We'll focus on B2B (regular B2B invoices). Columns from the portal:

Column	Example
`GSTIN of Supplier`	27AABCU9603R1ZX
`Invoice Number`	INV-2025-26-0042
`Invoice date`	15-04-2025
`Invoice Value`	50000
`Taxable Value`	42372.88
`Integrated Tax`	7627.12

Notice already: the invoice number format differs between the two sources (INV/2025-26/0042 vs INV-2025-26-0042). This is the single biggest cause of failed matches, and it's why we need fuzzy matching.

Python Setup

pip install pandas openpyxl rapidfuzz

pandas — data loading, merging, filtering
openpyxl — write formatted Excel output
rapidfuzz — fast fuzzy string matching (drop-in replacement for fuzzywuzzy, no C compiler needed)

Python 3.9+ recommended.

Step 1: Load and Normalize the Data

The first step is the most important one: get both datasets into a clean, comparable form. GSTIN formatting varies (spaces, lowercase, leading zeros), and dates come in half a dozen formats across different ERPs and the portal.

import pandas as pd
import re

# ── Load data ──────────────────────────────────────────────────────────────
purchase_df = pd.read_excel("purchase_register.xlsx")

# GSTR-2A B2B sheet (portal download has a few header rows to skip)
gstr2a_df = pd.read_excel(
    "GSTR2A_Apr2025.xlsx",
    sheet_name="B2B",
    skiprows=4,          # portal puts metadata in first 4 rows
    header=0
)

# ── Rename columns to consistent internal names ────────────────────────────
purchase_df = purchase_df.rename(columns={
    "supplier_gstin": "gstin",
    "invoice_no":     "inv_no",
    "invoice_date":   "inv_date",
    "total_amount":   "amount",
    "igst":           "igst",
})

gstr2a_df = gstr2a_df.rename(columns={
    "GSTIN of Supplier": "gstin",
    "Invoice Number":    "inv_no",
    "Invoice date":      "inv_date",
    "Invoice Value":     "amount",
    "Integrated Tax":    "igst",
})

# Keep only needed columns
cols = ["gstin", "inv_no", "inv_date", "amount", "igst"]
purchase_df = purchase_df[cols].copy()
gstr2a_df   = gstr2a_df[cols].copy()


def normalize_gstin(s):
    if pd.isna(s):
        return ""
    return str(s).strip().upper().replace(" ", "")


def normalize_inv_no(s):
    # Fold separators: 'INV/2025/001', 'INV-2025-001', 'INV 2025 001' -> 'INV2025001'
    if pd.isna(s):
        return ""
    return re.sub(r"[\\s/\\-_.,]", "", str(s)).upper().strip()


def normalize_amount(s):
    try:
        return round(float(s), 2)
    except (ValueError, TypeError):
        return 0.0


# ── Apply normalizations ───────────────────────────────────────────────────
for df in [purchase_df, gstr2a_df]:
    df["gstin"]      = df["gstin"].apply(normalize_gstin)
    df["inv_no_raw"] = df["inv_no"].copy()          # keep original for report
    df["inv_no"]     = df["inv_no"].apply(normalize_inv_no)
    df["amount"]     = df["amount"].apply(normalize_amount)
    df["inv_date"]   = pd.to_datetime(
        df["inv_date"], dayfirst=True, errors="coerce"
    )

print(f"Purchase register rows : {len(purchase_df)}")
print(f"GSTR-2A rows           : {len(gstr2a_df)}")

After normalization, INV/2025-26/0042 and INV-2025-26-0042 both become INV2025260042 — ready for matching.

Step 2: Exact Matching

We merge on three fields simultaneously: GSTIN + normalized invoice number + amount. This catches the majority of clean, well-entered records.

# ── Exact merge ────────────────────────────────────────────────────────────
merged = pd.merge(
    purchase_df,
    gstr2a_df,
    on=["gstin", "inv_no", "amount"],
    how="outer",
    suffixes=("_books", "_portal"),
    indicator=True
)

# Categorize each row
matched       = merged[merged["_merge"] == "both"].copy()
only_books    = merged[merged["_merge"] == "left_only"].copy()
only_portal   = merged[merged["_merge"] == "right_only"].copy()

print(f"\\n── Exact match results ──────────────────────")
print(f"Matched (exact)        : {len(matched)}")
print(f"Only in books          : {len(only_books)}")
print(f"Only in GSTR-2A        : {len(only_portal)}")
print(f"Match rate             : {len(matched)/(len(purchase_df) or 1)*100:.1f}%")

Sample output:

── Exact match results ──────────────────────
Matched (exact)        : 312
Only in books          : 47
Only in GSTR-2A        : 19
Match rate             : 86.9%

The 47 "only in books" entries are your ITC-at-risk items — the supplier hasn't filed, or there's a data mismatch. The 19 "only in portal" entries are invoices suppliers uploaded that you haven't booked yet (possible missed purchases or duplicate supplier filings).

Step 3: Fuzzy Matching for Near-Misses

The ~13% unmatched often contains legitimate invoices that failed due to minor formatting differences that survive normalization — extra zeros, financial year notation differences, or plain typos. We'll use rapidfuzz to score every unmatched book entry against every unmatched portal entry for the same GSTIN, then accept matches above a confidence threshold.

from rapidfuzz import fuzz, process

FUZZY_THRESHOLD = 88   # lower = more matches but more false positives

fuzzy_matches = []

unmatched_books  = only_books.copy()
unmatched_portal = only_portal.copy()

for _, book_row in unmatched_books.iterrows():
    # Candidate pool: same GSTIN, amount within ±1% (absorbs rounding differences)
    candidates = unmatched_portal[
        (unmatched_portal["gstin"] == book_row["gstin"]) &
        (unmatched_portal["amount"].between(
            book_row["amount"] * 0.99,
            book_row["amount"] * 1.01
        ))
    ]

    if candidates.empty:
        continue

    best = process.extractOne(
        book_row["inv_no"],
        candidates["inv_no"].tolist(),
        scorer=fuzz.token_sort_ratio,   # handles word-order differences too
        score_cutoff=FUZZY_THRESHOLD
    )

    if best:
        matched_inv_no, score, idx = best
        portal_row = candidates[candidates["inv_no"] == matched_inv_no].iloc[0]
        fuzzy_matches.append({
            "gstin":         book_row["gstin"],
            "inv_no_books":  book_row.get("inv_no_raw", book_row["inv_no"]),
            "inv_no_portal": portal_row.get("inv_no_raw", portal_row["inv_no"]),
            "amount":        book_row["amount"],
            "fuzzy_score":   score,
            "match_type":    "fuzzy",
        })

fuzzy_df = pd.DataFrame(fuzzy_matches)
print(f"\\n── Fuzzy match results ──────────────────────")
print(f"Additional fuzzy matches : {len(fuzzy_df)}")
if not fuzzy_df.empty:
    print(fuzzy_df[["inv_no_books", "inv_no_portal", "fuzzy_score"]].head(10).to_string())

Threshold guidance:

≥ 95: Very strict — only catches separator/case differences
88–94: Recommended — catches typos, extra zeros, format differences
80–87: Liberal — may produce false positives; review manually

Step 4: Generate the Reconciliation Report

Export everything to a single Excel workbook with three sheets and color-coded rows so your CA or finance team can work through it without touching Python.

from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font

# Remove fuzzy-matched rows from unmatched lists
if not fuzzy_df.empty:
    fuzzy_book_invs   = set(fuzzy_df["inv_no_books"])
    fuzzy_portal_invs = set(fuzzy_df["inv_no_portal"])
    inv_no_raw_books  = only_books.get("inv_no_raw", only_books["inv_no"])
    inv_no_raw_portal = only_portal.get("inv_no_raw", only_portal["inv_no"])
    only_books_final  = only_books[~inv_no_raw_books.isin(fuzzy_book_invs)]
    only_portal_final = only_portal[~inv_no_raw_portal.isin(fuzzy_portal_invs)]
else:
    only_books_final  = only_books
    only_portal_final = only_portal

output_path = "GSTR2A_Reconciliation_Apr2025.xlsx"

with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
    matched.to_excel(writer,          sheet_name="Matched",        index=False)
    fuzzy_df.to_excel(writer,         sheet_name="Fuzzy_Matched",  index=False)
    only_books_final.to_excel(writer,  sheet_name="Only_In_Books",  index=False)
    only_portal_final.to_excel(writer, sheet_name="Only_In_Portal", index=False)

GREEN  = PatternFill("solid", fgColor="C6EFCE")
YELLOW = PatternFill("solid", fgColor="FFEB9C")
RED    = PatternFill("solid", fgColor="FFC7CE")
ORANGE = PatternFill("solid", fgColor="FCE4D6")

wb = load_workbook(output_path)

def color_sheet(ws, fill):
    for row in ws.iter_rows(min_row=2):
        for cell in row:
            cell.fill = fill
    for cell in ws[1]:
        cell.font = Font(bold=True)

color_sheet(wb["Matched"],        GREEN)
color_sheet(wb["Fuzzy_Matched"],  ORANGE)
color_sheet(wb["Only_In_Books"],  RED)
color_sheet(wb["Only_In_Portal"], YELLOW)

wb.save(output_path)
print(f"Report saved → {output_path}")
print(f"  ✅ Exact matched  : {len(matched)}")
print(f"  🟠 Fuzzy matched  : {len(fuzzy_df)}")
print(f"  🔴 Only in books  : {len(only_books_final)}  ← ITC at risk")
print(f"  🟡 Only in portal : {len(only_portal_final)} ← not yet booked")

What to Do With the Mismatches

Red rows — only in your books (ITC at risk):

Contact the supplier and ask them to file or amend their GSTR-1
If the supplier is under the composition scheme or is unregistered, ITC isn't available regardless — pass the cost to P&L
Document your follow-up trail. If the portal still doesn't show the invoice by the time you file your annual return, you'll need to reverse the ITC in GSTR-3B with interest under Rule 42/43

Yellow rows — only in GSTR-2A (not in your books):

Cross-check against pending GRNs — goods may have been received but not entered in Tally yet
Watch for duplicate entries: some suppliers accidentally file the same invoice twice with a slightly different amount

Tax head mismatches:

Sometimes a supplier files an intra-state supply as interstate (or vice versa). The total amount matches, but the IGST vs CGST+SGST split is wrong. Add a column-level check on the tax fields for matched rows to catch these before they cause problems in GSTR-3B.

Taking It Further

This Python pipeline gives you full control and works with any accounting system that can export to Excel or CSV. For businesses using TallyPrime, the process is even simpler — Mark IT Solutions has documented how TallyPrime's built-in reconciliation handles most of this automatically, including direct GST portal connectivity. But for teams on other systems — SAP, Zoho Books, custom ERPs — or who want full programmatic control over the matching logic, this approach is hard to beat.

Some ways to extend the script:

Schedule it monthly with a cron job or Windows Task Scheduler and pipe the output to email
Supplier-wise ITC risk summary: group only_books_final by GSTIN, aggregate unmatched tax amounts, and flag suppliers with repeat mismatches
GST API integration: instead of manual downloads, pull GSTR-2A data programmatically via a GST Suvidha Provider (GSP) — this makes the workflow fully automated
Streamlit front-end: wrap the script in a simple web UI so non-technical staff can upload files and run reconciliation with a button click
Push results back to Tally: custom Tally integrations can write reconciliation outcomes back into TallyPrime ledgers automatically, eliminating the manual journal entry step entirely

The 15–20 hours your accountant spends on this every month can realistically drop to 30 minutes of file uploads and a focused review of the flagged rows. The Python part runs in seconds even for thousands of invoices.

Have questions or want to share how you've adapted this for GSTR-2B? Drop a comment below — the matching logic is nearly identical, just with different column names and stricter ITC cut-off rules.

DEV Community