I track every UPI transaction across PhonePe, Google Pay, and Paytm. Three apps, three different CSV exports, and a tax season that looks like a hostage negotiation.
Last weekend I got tired of opening three spreadsheets side by side and built one Python script to merge, deduplicate, and categorize every UPI transaction into a single reconciled Excel file. The whole thing is 52 lines.
Here's what it does:
- Reads CSV exports from any of the three UPI apps
- Normalizes the wildly different column names each app uses
- Detects duplicates (the same payment showing up in two apps)
- Categorizes transactions by merchant keywords (food, fuel, rent, bills, shopping)
- Exports a clean Excel file with a transactions sheet and a monthly category pivot in ₹
The Code
import pandas as pd
from pathlib import Path
from datetime import datetime
SCHEMA = {
"phonepe": {"date": "Transaction Date", "amount": "Amount (INR)", "party": "Merchant", "ref": "UTR"},
"gpay": {"date": "Date", "amount": "Amount", "party": "To", "ref": "Transaction ID"},
"paytm": {"date": "Txn Date", "amount": "Amount(₹)", "party": "Payee", "ref": "Order ID"},
}
CATEGORIES = {
"food": ["zomato", "swiggy", "dominos", "kfc", "mcdonald"],
"fuel": ["hpcl", "iocl", "bpcl", "petrol", "indian oil"],
"rent": ["rent", "landlord", "housing"],
"bills": ["airtel", "jio", "vi ", "tata power", "electricity"],
"shop": ["amazon", "flipkart", "myntra", "blinkit", "zepto"],
}
def categorize(party: str) -> str:
p = (party or "").lower()
for cat, keywords in CATEGORIES.items():
if any(k in p for k in keywords):
return cat
return "other"
def load(path: Path, app: str) -> pd.DataFrame:
df = pd.read_csv(path)
cols = SCHEMA[app]
out = pd.DataFrame({
"date": pd.to_datetime(df[cols["date"]], errors="coerce"),
"amount": pd.to_numeric(df[cols["amount"]].astype(str).str.replace("[₹,]", "", regex=True), errors="coerce"),
"party": df[cols["party"]].astype(str),
"ref": df[cols["ref"]].astype(str),
"app": app,
})
out["category"] = out["party"].apply(categorize)
return out.dropna(subset=["date", "amount"])
def reconcile(folder: Path) -> pd.DataFrame:
frames = [load(p, app) for app in SCHEMA for p in folder.glob(f"{app}*.csv")]
if not frames:
raise SystemExit("No UPI CSVs found.")
combined = pd.concat(frames, ignore_index=True).sort_values("date")
combined["minute"] = combined["date"].dt.floor("min")
combined = combined.drop_duplicates(subset=["minute", "amount", "party"], keep="first").drop(columns="minute")
combined["month"] = combined["date"].dt.to_period("M").astype(str)
return combined
def export(df: pd.DataFrame, out_path: Path) -> None:
monthly = df.pivot_table(index="month", columns="category", values="amount", aggfunc="sum", fill_value=0)
with pd.ExcelWriter(out_path, engine="openpyxl") as xl:
df.to_excel(xl, sheet_name="transactions", index=False)
monthly.to_excel(xl, sheet_name="monthly_by_category")
print(f"Saved {len(df)} transactions to {out_path}")
if __name__ == "__main__":
df = reconcile(Path("./statements"))
export(df, Path(f"upi_reconciled_{datetime.now():%Y_%m_%d}.xlsx"))
How it works
The schema dict. PhonePe calls the date column "Transaction Date." GPay calls it "Date." Paytm calls it "Txn Date." Same data, three names. The SCHEMA dict maps each app's column names to a standard set so the rest of the code never has to care which app a row came from. Adding a new app means one more entry in SCHEMA and zero changes anywhere else.
Amount cleanup. Paytm sneaks the ₹ symbol into the amount column. PhonePe uses comma-formatted numbers. The str.replace("[₹,]", "", regex=True) strips both before pd.to_numeric, and errors="coerce" turns any unparseable cell into NaN so the final dropna can sweep it out.
The deduplication trick. When the same UPI transaction shows up in two apps (which happens when you screenshot a payment and Paytm catalogs it from your gallery, or when GPay picks up an SMS confirmation Paytm has already logged), the timestamps usually differ by a few seconds. Floor the timestamp to the minute, drop_duplicates on (minute, amount, party), and you keep one copy of each real transaction. After three months of data this cut my row count by about 18 percent.
Categorization. A simple substring match on the merchant name. Add your own keywords to CATEGORIES. After three months of running this, my "food" category alone has 47 distinct merchant strings (different Zomato franchise codes, mis-spellings, you name it) that all map to the same bucket.
The export. Two sheets in one Excel file. transactions has every reconciled row. monthly_by_category pivots them into a grid you can paste straight into your CA's template at year-end.
How I actually use it
I export CSVs from each app on the first weekend of every month and drop them in a statements/ folder named like phonepe_april.csv, gpay_april.csv, paytm_april.csv. Run the script, get a single Excel file. The whole reconciliation that used to eat my Sunday morning now takes twelve seconds.
The categorization saved me during my GST filing for Q4 — I could see immediately that I'd spent ₹38,000 on shopping vs the ₹12,000 I'd loosely budgeted. Painful but useful. It also caught a ₹1,499 recurring charge from a streaming service I'd cancelled in November but was still being billed for. That one find paid for the entire weekend project.
Where to take it next
Three things I added after the first draft:
- A
--monthCLI flag that filters to a single month — useful when slicing a quarter for tax filing. - A
manual_overrides.csvso I can tag a specific UTR with the right category without polluting the keyword list. The override file gets joined on therefcolumn and wins over the keyword match. - A simple matplotlib bar chart of category totals, saved as a PNG, dropped into my monthly review note.
If you want the version with all three additions, ping me — happy to share. The base script lives in my python-automation-starter repo on GitHub.
A note on edge cases: refunds show up as positive amounts in PhonePe but negative in GPay. If you care about net spend, multiply PhonePe rows where the merchant string contains "refund" by -1 before reconciling. I do this in a tiny preprocessor function I left out of the 52-line version for brevity — it's three lines and you'll know where to put it.
Twelve seconds. Three apps. One Excel file. Worth a Sunday afternoon.
Follow me on Twitter @automate_archit for daily AI automation tips.
Top comments (0)