DEV Community

Archit Mittal
Archit Mittal

Posted on • Originally published at architmittal.com

Build a UPI Transaction Categorizer in 95 Lines of Python

If you use UPI for everything (and at this point, who in India doesn't?), your bank statement is a wall of cryptic merchant strings: UPI/zomatoonline@paytm/..., UPI/SWIGGY-BANGALORE/..., UPI/9876543210@oksbi/.... Trying to figure out where your money actually goes by reading that mess is a lost cause.

I built a 95-line Python script that ingests a UPI statement CSV from any major Indian bank, classifies each transaction into a sensible category, and spits out a monthly summary. No paid services, no APIs, no LLMs in the loop — just regex, a keyword map, and pandas.

Here's exactly how it works.

The problem

Every bank exports UPI transactions a little differently, but the merchant string is roughly always there. The trick is that real merchants leak through:

  • Zomato, Swiggy, Dunzo → food delivery
  • BigBasket, Blinkit, Zepto, DMart → groceries
  • Uber, Ola, Rapido, IRCTC → transport
  • Airtel, Jio, Vi, Tata Power, BESCOM → bills
  • Amazon, Flipkart, Myntra → shopping

If we maintain a keyword → category dictionary and walk every row, we can categorize 80–90% of statements automatically. The remaining slice falls into "Other" and you can label them by hand once a month.

The full code

Save this as upi_categorizer.py:

import re
import sys
from pathlib import Path
import pandas as pd

CATEGORIES = {
    "Food": [
        "zomato", "swiggy", "dunzo", "eatfit", "faasos", "behrouz",
        "dominos", "pizzahut", "kfc", "mcdonalds", "starbucks", "cafe",
    ],
    "Groceries": [
        "bigbasket", "blinkit", "grofers", "zepto", "dmart", "instamart",
        "reliancefresh", "natures", "spencer", "more-retail",
    ],
    "Transport": [
        "uber", "olacabs", "rapido", "irctc", "redbus", "abhibus",
        "namma-yatri", "blusmart", "yulu", "vogo",
    ],
    "Bills": [
        "airtel", "jio", "vodafone", "vi-mobile", "bsnl", "tatapower",
        "bescom", "adani-electricity", "torrent-power", "mahanagargas",
        "act-fibernet", "hathway",
    ],
    "Shopping": [
        "amazon", "flipkart", "myntra", "ajio", "nykaa", "meesho",
        "tatacliq", "firstcry", "lenskart", "boat-lifestyle",
    ],
    "Investments": [
        "zerodha", "groww", "upstox", "kuvera", "paytmmoney",
        "indmoney", "angelone", "smallcase",
    ],
    "Health": [
        "pharmeasy", "1mg", "netmeds", "apollo", "practo",
        "cult-fit", "healthifyme",
    ],
    "Entertainment": [
        "netflix", "hotstar", "primevideo", "sonyliv", "zee5",
        "spotify", "gaana", "bookmyshow", "pvr", "inox",
    ],
}

UPI_PATTERN = re.compile(r"upi[/-]([a-z0-9.\-_@]+)", re.IGNORECASE)


def extract_merchant(narration: str) -> str:
    """Pull the merchant slug out of a UPI narration string."""
    if not isinstance(narration, str):
        return ""
    match = UPI_PATTERN.search(narration.lower())
    return match.group(1) if match else narration.lower()


def categorize(narration: str) -> str:
    merchant = extract_merchant(narration)
    for category, keywords in CATEGORIES.items():
        for keyword in keywords:
            if keyword in merchant:
                return category
    return "Other"


def load_statement(path: Path) -> pd.DataFrame:
    df = pd.read_csv(path)
    # Normalize the columns most banks use; adjust as needed.
    df.columns = [c.strip().lower() for c in df.columns]
    rename_map = {
        "transaction date": "date", "txn date": "date", "value date": "date",
        "narration": "narration", "description": "narration", "details": "narration",
        "withdrawal amt.": "debit", "withdrawal": "debit", "debit": "debit",
        "deposit amt.": "credit", "deposit": "credit", "credit": "credit",
    }
    df = df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns})
    df["debit"] = pd.to_numeric(df.get("debit", 0), errors="coerce").fillna(0)
    df["date"] = pd.to_datetime(df["date"], errors="coerce", dayfirst=True)
    return df.dropna(subset=["date"])


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    df = df[df["debit"] > 0].copy()
    df["category"] = df["narration"].apply(categorize)
    df["month"] = df["date"].dt.to_period("M")
    pivot = df.pivot_table(
        index="month", columns="category", values="debit",
        aggfunc="sum", fill_value=0,
    )
    pivot["Total"] = pivot.sum(axis=1)
    return pivot.round(0)


if __name__ == "__main__":
    path = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("statement.csv")
    summary = summarize(load_statement(path))
    print(summary.to_string())
    summary.to_csv("upi_summary.csv")
    print("\nSaved upi_summary.csv")
Enter fullscreen mode Exit fullscreen mode

How to use it

  1. Log in to net banking and download your last 3-6 months of statement as CSV.
  2. Save it as statement.csv in the same folder as the script.
  3. Run python upi_categorizer.py statement.csv.

You'll get a monthly pivot like:

month     Bills  Food  Groceries  Shopping  Transport  Other  Total
2026-02    3450  6280       9120      4500       2150   1820  27320
2026-03    3450  5840      10100      6700       1980   2310  30380
2026-04    3450  7120       9550      3200       2620   1950  27890
Enter fullscreen mode Exit fullscreen mode

Three months in, you have a ground-truth picture of where roughly ₹85,000 went.

Tweaks worth making

The keyword list is the soul of this script. Spend ten minutes adding the merchants YOUR statement actually shows — the long tail is regional (Big Bazaar in some cities, Star Bazaar in others, local kirana with their own UPI handle). Every keyword you add today saves you from "Other" forever.

If you want to go further:

  • Auto-tag recurring UPI handles: track per-handle medians and flag transactions that deviate by 3x.
  • Detect EMIs: same amount, same merchant, monthly cadence — pull these into their own bucket so they don't pollute "Shopping".
  • Streak the savings rate: subtract Total from monthly credits and chart the gap.

I run a version of this on the 1st of every month via cron. Takes 2 seconds, replaces a 30-minute manual sort, and the only thing I update is the keyword list.

Why I stopped using budget apps for this

Two reasons. One, I don't love handing over read-only bank credentials to a third party I haven't audited. Two, every budget app eventually starts pushing premium tiers, and the ones that survive long enough to be useful tend to be acquired and shut down. A 95-line script that runs on my laptop has none of those problems.

The keyword map is the only thing that needs maintenance, and that's a feature — your categories should match how YOU spend, not how a startup's data scientist thought you'd spend.


If you want more scripts like this (PDF invoice parsers, GST reconciliation, WhatsApp/Telegram automations), follow me on Twitter @automate_archit for daily AI automation tips.

Top comments (0)