DEV Community

Robin
Robin

Posted on

How to Automate GSTR-2A Reconciliation with Python: A Practical Guide for Indian Accountants

If you work in accounts payable at an Indian company, you already know the dread: every month, someone opens a spreadsheet, pulls up the GSTR-2A from the GST portal, and spends the next two or three days manually ticking off purchase invoices one by one. For mid-sized businesses with hundreds of vendors, that process can easily eat 15–20 hours a month — and still produce errors.

This guide shows you how to automate GSTR-2A reconciliation using Python. We'll load real-world-style data, normalize it, do exact and fuzzy matching, and generate a color-coded Excel report your accountant can actually use.


What Is GSTR-2A and Why Does It Need Reconciliation?

For readers who aren't accountants: GSTR-2A is an auto-populated return on India's GST portal. Every time one of your suppliers files their outward supply return (GSTR-1), their invoices addressed to your GSTIN automatically appear in your GSTR-2A.

Why it matters: You can only claim Input Tax Credit (ITC) on purchases that appear in your GSTR-2A and match your purchase register (the book of invoices your accounting software records). If your books say you paid ₹50,000 in GST to a supplier but the portal doesn't show it, you can't claim that ITC. Under Rule 36(4), excess ITC claims can attract demand notices and interest.

The reconciliation task is: compare every row in your purchase register against every row in your GSTR-2A, flag the matches, and investigate the gaps.


Data Sources

1. Purchase Register (from Tally/ERP)

Exported from TallyPrime or any ERP as CSV or Excel, it typically has these columns:

Column Example
supplier_gstin 27AABCU9603R1ZX
invoice_no INV/2025-26/0042
invoice_date 15-Apr-2025
taxable_amount 42372.88
igst 7627.12
cgst 0
sgst 0
total_amount 50000.00

2. GSTR-2A (from GST Portal)

Downloaded as Excel (or JSON via the GST API). The Excel download has sheets like B2B, B2BA, CDN, etc. We'll focus on B2B (regular B2B invoices). Columns from the portal:

Column Example
GSTIN of Supplier 27AABCU9603R1ZX
Invoice Number INV-2025-26-0042
Invoice date 15-04-2025
Invoice Value 50000
Taxable Value 42372.88
Integrated Tax 7627.12

Notice already: the invoice number format differs between the two sources (INV/2025-26/0042 vs INV-2025-26-0042). This is the single biggest cause of failed matches, and it's why we need fuzzy matching.


Python Setup

pip install pandas openpyxl rapidfuzz
Enter fullscreen mode Exit fullscreen mode
  • pandas — data loading, merging, filtering
  • openpyxl — write formatted Excel output
  • rapidfuzz — fast fuzzy string matching (drop-in replacement for fuzzywuzzy, no C compiler needed)

Python 3.9+ recommended.


Step 1: Load and Normalize the Data

The first step is the most important one: get both datasets into a clean, comparable form. GSTIN formatting varies (spaces, lowercase, leading zeros), and dates come in half a dozen formats across different ERPs and the portal.

import pandas as pd
import re

# ── Load data ──────────────────────────────────────────────────────────────
purchase_df = pd.read_excel("purchase_register.xlsx")

# GSTR-2A B2B sheet (portal download has a few header rows to skip)
gstr2a_df = pd.read_excel(
    "GSTR2A_Apr2025.xlsx",
    sheet_name="B2B",
    skiprows=4,          # portal puts metadata in first 4 rows
    header=0
)

# ── Rename columns to consistent internal names ────────────────────────────
purchase_df = purchase_df.rename(columns={
    "supplier_gstin": "gstin",
    "invoice_no":     "inv_no",
    "invoice_date":   "inv_date",
    "total_amount":   "amount",
    "igst":           "igst",
})

gstr2a_df = gstr2a_df.rename(columns={
    "GSTIN of Supplier": "gstin",
    "Invoice Number":    "inv_no",
    "Invoice date":      "inv_date",
    "Invoice Value":     "amount",
    "Integrated Tax":    "igst",
})

# Keep only needed columns
cols = ["gstin", "inv_no", "inv_date", "amount", "igst"]
purchase_df = purchase_df[cols].copy()
gstr2a_df   = gstr2a_df[cols].copy()


def normalize_gstin(s):
    if pd.isna(s):
        return ""
    return str(s).strip().upper().replace(" ", "")


def normalize_inv_no(s):
    # Fold separators: 'INV/2025/001', 'INV-2025-001', 'INV 2025 001' -> 'INV2025001'
    if pd.isna(s):
        return ""
    return re.sub(r"[\\s/\\-_.,]", "", str(s)).upper().strip()


def normalize_amount(s):
    try:
        return round(float(s), 2)
    except (ValueError, TypeError):
        return 0.0


# ── Apply normalizations ───────────────────────────────────────────────────
for df in [purchase_df, gstr2a_df]:
    df["gstin"]      = df["gstin"].apply(normalize_gstin)
    df["inv_no_raw"] = df["inv_no"].copy()          # keep original for report
    df["inv_no"]     = df["inv_no"].apply(normalize_inv_no)
    df["amount"]     = df["amount"].apply(normalize_amount)
    df["inv_date"]   = pd.to_datetime(
        df["inv_date"], dayfirst=True, errors="coerce"
    )

print(f"Purchase register rows : {len(purchase_df)}")
print(f"GSTR-2A rows           : {len(gstr2a_df)}")
Enter fullscreen mode Exit fullscreen mode

After normalization, INV/2025-26/0042 and INV-2025-26-0042 both become INV202526004​2 — ready for matching.


Step 2: Exact Matching

We merge on three fields simultaneously: GSTIN + normalized invoice number + amount. This catches the majority of clean, well-entered records.

# ── Exact merge ────────────────────────────────────────────────────────────
merged = pd.merge(
    purchase_df,
    gstr2a_df,
    on=["gstin", "inv_no", "amount"],
    how="outer",
    suffixes=("_books", "_portal"),
    indicator=True
)

# Categorize each row
matched       = merged[merged["_merge"] == "both"].copy()
only_books    = merged[merged["_merge"] == "left_only"].copy()
only_portal   = merged[merged["_merge"] == "right_only"].copy()

print(f"\\n── Exact match results ──────────────────────")
print(f"Matched (exact)        : {len(matched)}")
print(f"Only in books          : {len(only_books)}")
print(f"Only in GSTR-2A        : {len(only_portal)}")
print(f"Match rate             : {len(matched)/(len(purchase_df) or 1)*100:.1f}%")
Enter fullscreen mode Exit fullscreen mode

Sample output:

── Exact match results ──────────────────────
Matched (exact)        : 312
Only in books          : 47
Only in GSTR-2A        : 19
Match rate             : 86.9%
Enter fullscreen mode Exit fullscreen mode

The 47 "only in books" entries are your ITC-at-risk items — the supplier hasn't filed, or there's a data mismatch. The 19 "only in portal" entries are invoices suppliers uploaded that you haven't booked yet (possible missed purchases or duplicate supplier filings).


Step 3: Fuzzy Matching for Near-Misses

The ~13% unmatched often contains legitimate invoices that failed due to minor formatting differences that survive normalization — extra zeros, financial year notation differences, or plain typos. We'll use rapidfuzz to score every unmatched book entry against every unmatched portal entry for the same GSTIN, then accept matches above a confidence threshold.

from rapidfuzz import fuzz, process

FUZZY_THRESHOLD = 88   # lower = more matches but more false positives

fuzzy_matches = []

unmatched_books  = only_books.copy()
unmatched_portal = only_portal.copy()

for _, book_row in unmatched_books.iterrows():
    # Candidate pool: same GSTIN, amount within ±1% (absorbs rounding differences)
    candidates = unmatched_portal[
        (unmatched_portal["gstin"] == book_row["gstin"]) &
        (unmatched_portal["amount"].between(
            book_row["amount"] * 0.99,
            book_row["amount"] * 1.01
        ))
    ]

    if candidates.empty:
        continue

    best = process.extractOne(
        book_row["inv_no"],
        candidates["inv_no"].tolist(),
        scorer=fuzz.token_sort_ratio,   # handles word-order differences too
        score_cutoff=FUZZY_THRESHOLD
    )

    if best:
        matched_inv_no, score, idx = best
        portal_row = candidates[candidates["inv_no"] == matched_inv_no].iloc[0]
        fuzzy_matches.append({
            "gstin":         book_row["gstin"],
            "inv_no_books":  book_row.get("inv_no_raw", book_row["inv_no"]),
            "inv_no_portal": portal_row.get("inv_no_raw", portal_row["inv_no"]),
            "amount":        book_row["amount"],
            "fuzzy_score":   score,
            "match_type":    "fuzzy",
        })

fuzzy_df = pd.DataFrame(fuzzy_matches)
print(f"\\n── Fuzzy match results ──────────────────────")
print(f"Additional fuzzy matches : {len(fuzzy_df)}")
if not fuzzy_df.empty:
    print(fuzzy_df[["inv_no_books", "inv_no_portal", "fuzzy_score"]].head(10).to_string())
Enter fullscreen mode Exit fullscreen mode

Threshold guidance:

  • ≥ 95: Very strict — only catches separator/case differences
  • 88–94: Recommended — catches typos, extra zeros, format differences
  • 80–87: Liberal — may produce false positives; review manually

Step 4: Generate the Reconciliation Report

Export everything to a single Excel workbook with three sheets and color-coded rows so your CA or finance team can work through it without touching Python.

from openpyxl import load_workbook
from openpyxl.styles import PatternFill, Font

# Remove fuzzy-matched rows from unmatched lists
if not fuzzy_df.empty:
    fuzzy_book_invs   = set(fuzzy_df["inv_no_books"])
    fuzzy_portal_invs = set(fuzzy_df["inv_no_portal"])
    inv_no_raw_books  = only_books.get("inv_no_raw", only_books["inv_no"])
    inv_no_raw_portal = only_portal.get("inv_no_raw", only_portal["inv_no"])
    only_books_final  = only_books[~inv_no_raw_books.isin(fuzzy_book_invs)]
    only_portal_final = only_portal[~inv_no_raw_portal.isin(fuzzy_portal_invs)]
else:
    only_books_final  = only_books
    only_portal_final = only_portal

output_path = "GSTR2A_Reconciliation_Apr2025.xlsx"

with pd.ExcelWriter(output_path, engine="openpyxl") as writer:
    matched.to_excel(writer,          sheet_name="Matched",        index=False)
    fuzzy_df.to_excel(writer,         sheet_name="Fuzzy_Matched",  index=False)
    only_books_final.to_excel(writer,  sheet_name="Only_In_Books",  index=False)
    only_portal_final.to_excel(writer, sheet_name="Only_In_Portal", index=False)

GREEN  = PatternFill("solid", fgColor="C6EFCE")
YELLOW = PatternFill("solid", fgColor="FFEB9C")
RED    = PatternFill("solid", fgColor="FFC7CE")
ORANGE = PatternFill("solid", fgColor="FCE4D6")

wb = load_workbook(output_path)

def color_sheet(ws, fill):
    for row in ws.iter_rows(min_row=2):
        for cell in row:
            cell.fill = fill
    for cell in ws[1]:
        cell.font = Font(bold=True)

color_sheet(wb["Matched"],        GREEN)
color_sheet(wb["Fuzzy_Matched"],  ORANGE)
color_sheet(wb["Only_In_Books"],  RED)
color_sheet(wb["Only_In_Portal"], YELLOW)

wb.save(output_path)
print(f"Report saved → {output_path}")
print(f"  ✅ Exact matched  : {len(matched)}")
print(f"  🟠 Fuzzy matched  : {len(fuzzy_df)}")
print(f"  🔴 Only in books  : {len(only_books_final)}  ← ITC at risk")
print(f"  🟡 Only in portal : {len(only_portal_final)} ← not yet booked")
Enter fullscreen mode Exit fullscreen mode

What to Do With the Mismatches

Red rows — only in your books (ITC at risk):

  • Contact the supplier and ask them to file or amend their GSTR-1
  • If the supplier is under the composition scheme or is unregistered, ITC isn't available regardless — pass the cost to P&L
  • Document your follow-up trail. If the portal still doesn't show the invoice by the time you file your annual return, you'll need to reverse the ITC in GSTR-3B with interest under Rule 42/43

Yellow rows — only in GSTR-2A (not in your books):

  • Cross-check against pending GRNs — goods may have been received but not entered in Tally yet
  • Watch for duplicate entries: some suppliers accidentally file the same invoice twice with a slightly different amount

Tax head mismatches:

Sometimes a supplier files an intra-state supply as interstate (or vice versa). The total amount matches, but the IGST vs CGST+SGST split is wrong. Add a column-level check on the tax fields for matched rows to catch these before they cause problems in GSTR-3B.


Taking It Further

This Python pipeline gives you full control and works with any accounting system that can export to Excel or CSV. For businesses using TallyPrime, the process is even simpler — Mark IT Solutions has documented how TallyPrime's built-in reconciliation handles most of this automatically, including direct GST portal connectivity. But for teams on other systems — SAP, Zoho Books, custom ERPs — or who want full programmatic control over the matching logic, this approach is hard to beat.

Some ways to extend the script:

  • Schedule it monthly with a cron job or Windows Task Scheduler and pipe the output to email
  • Supplier-wise ITC risk summary: group only_books_final by GSTIN, aggregate unmatched tax amounts, and flag suppliers with repeat mismatches
  • GST API integration: instead of manual downloads, pull GSTR-2A data programmatically via a GST Suvidha Provider (GSP) — this makes the workflow fully automated
  • Streamlit front-end: wrap the script in a simple web UI so non-technical staff can upload files and run reconciliation with a button click
  • Push results back to Tally: custom Tally integrations can write reconciliation outcomes back into TallyPrime ledgers automatically, eliminating the manual journal entry step entirely

The 15–20 hours your accountant spends on this every month can realistically drop to 30 minutes of file uploads and a focused review of the flagged rows. The Python part runs in seconds even for thousands of invoices.


Have questions or want to share how you've adapted this for GSTR-2B? Drop a comment below — the matching logic is nearly identical, just with different column names and stricter ITC cut-off rules.

Top comments (0)