Adwaith A Kumar
How to Generate Synthetic UPI Transaction Data for Fraud Detection

India's DPDP Act just made your training data illegal. Here's how to build fraud detection models anyway.

Every AI team building UPI fraud detection in India faces the same impossible problem: you need millions of transaction records to train a model, but you can't legally use real user data.

The Digital Personal Data Protection Act, 2023 (DPDP Act) — specifically Section 4(1) — requires explicit consent for processing personal data. Section 8(7) mandates data minimization. Real UPI transaction logs contain sender IDs, receiver IDs, amounts, locations, and timestamps — all classified as personal data under the Act.

So you're stuck. You can't train without data. You can't use real data without consent. And getting consent from millions of UPI users? Good luck.

The answer is synthetic data — algorithmically generated datasets that preserve the statistical properties of real data without containing any actual user information.

What is Synthetic Data?
Synthetic data is artificially generated information that mimics the patterns, distributions, and correlations found in real-world data. It's not just random numbers — a good synthetic dataset preserves the statistical fingerprint of the original: the same distribution of transaction amounts, the same fraud-to-legitimate ratio, the same temporal patterns.

The key insight: a fraud detection model doesn't need real transactions to learn patterns. It needs realistic transactions. If your synthetic data has the same statistical properties as real UPI data, your model will generalize to real-world fraud just as effectively.
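To make the "statistical fingerprint" idea concrete, here is a minimal, self-contained sketch. It uses numpy and a lognormal stand-in for transaction amounts (not real UPI data, and not the SDV pipeline below): the "synthetic" sample shares no individual values with the "real" one, yet the fitted distribution reproduces its shape.

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for real transaction amounts: lognormal, like most payment data
real = rng.lognormal(mean=6.5, sigma=1.2, size=10_000)

# "Synthesize" by fitting the log-space parameters and resampling
mu, sigma = np.log(real).mean(), np.log(real).std()
synthetic = rng.lognormal(mean=mu, sigma=sigma, size=10_000)

# No individual value is shared, but the distributions are nearly identical
for name, sample in [("real", real), ("synthetic", synthetic)]:
    q50, q95 = np.percentile(sample, [50, 95])
    print(f"{name:>9}: median ₹{q50:,.0f}, 95th pct ₹{q95:,.0f}")
```

A model trained on the resampled data sees the same amount distribution a model trained on the original would; SDV generalizes this idea to full multi-column tables with correlations.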

Step 1: Install the Tools
```bash
pip install indic-faker sdv scikit-learn pandas
```

- `indic-faker` — generates realistic Indian financial data (UPI IDs, bank accounts, IFSC codes, INR amounts in lakhs/crores)
- `sdv` — Synthetic Data Vault, for learning and sampling statistical distributions
- `scikit-learn` — for the fraud detection model
- `pandas` — for working with the datasets
Step 2: Generate a Realistic UPI Transaction Seed Dataset
First, we create 500 realistic-looking UPI transactions using indic-faker:

```python
import pandas as pd
import random
from datetime import datetime, timedelta
from indic_faker import IndicFaker

fake = IndicFaker(language="hi")

# Merchant categories common in UPI
categories = [
    "Groceries", "Fuel", "Electronics", "Restaurants",
    "Online Shopping", "Utilities", "Education", "Healthcare",
    "Entertainment", "Travel", "Rent", "Insurance"
]

states = ["MH", "KA", "DL", "TN", "KL", "UP", "GJ", "RJ", "WB", "AP"]

def generate_upi_transaction(is_fraud=False):
    """Generate a single realistic UPI transaction."""
    if is_fraud:
        # Fraud patterns: unusual amounts, late-night activity
        amount = random.choice([
            random.uniform(45000, 99999),  # Large amounts
            random.uniform(9900, 9999),    # Just under the 10K limit
            random.uniform(1, 50),         # Micro-transactions (probing the account)
        ])
        hour = random.choice(range(1, 5))  # Late night
    else:
        amount = random.lognormvariate(6.5, 1.2)  # Typical: avg ₹500-2000
        amount = min(amount, 50000)
        hour = random.choice(range(7, 23))  # Normal hours

    base_date = datetime(2024, 1, 1)
    txn_date = base_date + timedelta(
        days=random.randint(0, 365),
        hours=hour,
        minutes=random.randint(0, 59)
    )

    return {
        "sender_upi": fake.upi_id(),
        "receiver_upi": fake.upi_id(),
        "amount_inr": round(amount, 2),
        "timestamp": txn_date.isoformat(),
        "merchant_category": random.choice(categories),
        "location_state": random.choice(states),
        "sender_bank": fake.bank_name(),
        "ifsc": fake.ifsc(),
        "is_fraud": int(is_fraud)
    }
```

```python
# Generate 500 seed transactions (5% fraud rate — realistic for India)
transactions = []
for _ in range(475):
    transactions.append(generate_upi_transaction(is_fraud=False))
for _ in range(25):
    transactions.append(generate_upi_transaction(is_fraud=True))

random.shuffle(transactions)

seed_df = pd.DataFrame(transactions)
seed_df.to_csv("upi_seed_data.csv", index=False)

print(f"Generated {len(seed_df)} seed transactions")
print(f"Fraud rate: {seed_df['is_fraud'].mean():.1%}")
print("\nSample:")
print(seed_df.head())
```
Output:

```
Generated 500 seed transactions
Fraud rate: 5.0%

          sender_upi     receiver_upi  amount_inr  ...  is_fraud
0   rajesh.k@okicici    priya.n@paytm     1247.50  ...         0
1    amit.sharma@ybl  deepak.m@okaxis    48923.11  ...         1
2   sunita.d@okicici      grocery@ybl      342.00  ...         0
```
Step 3: Train SDV on the Seed Data and Generate 50,000 Rows
```python
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# Define metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(seed_df)

# Override detected types for accuracy
metadata.update_column("amount_inr", sdtype="numerical")
metadata.update_column("is_fraud", sdtype="categorical")
metadata.update_column("timestamp", sdtype="datetime", datetime_format="%Y-%m-%dT%H:%M:%S")

# Train the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed_df)

# Generate 50,000 synthetic transactions
synthetic_df = synthesizer.sample(num_rows=50000)
synthetic_df.to_csv("upi_synthetic_50k.csv", index=False)

print(f"Generated {len(synthetic_df)} synthetic transactions")
print(f"Fraud rate (synthetic): {synthetic_df['is_fraud'].mean():.1%}")
print(f"Fraud rate (original): {seed_df['is_fraud'].mean():.1%}")
```
Step 4: Evaluate Quality
```python
from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(seed_df, synthetic_df, metadata)
print(f"\nOverall Quality Score: {quality_report.get_score():.2%}")
```
A score above 85% means your synthetic data closely mirrors the statistical properties of the seed data. Anything above 90% is excellent.
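Beyond the single overall score, it is worth sanity-checking individual columns yourself. A lightweight sketch using a two-sample Kolmogorov–Smirnov test from scipy, shown here on stand-in lognormal arrays; with the real pipeline you would pass `seed_df["amount_inr"]` and `synthetic_df["amount_inr"]` instead:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Stand-ins for seed_df["amount_inr"] and synthetic_df["amount_inr"]
real_amounts = rng.lognormal(mean=6.5, sigma=1.2, size=500)
synth_amounts = rng.lognormal(mean=6.5, sigma=1.2, size=5_000)

# KS statistic near 0 means the two samples follow very similar distributions
stat, p_value = ks_2samp(real_amounts, synth_amounts)
print(f"KS statistic for amount_inr: {stat:.3f} (closer to 0 is better)")
```

A column whose KS statistic is large (say, above 0.1–0.2) is a good candidate for revisiting the metadata overrides in Step 3.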

Step 5: Train a Fraud Detection Model
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Prepare features: encode the categorical columns as integers
le_cat = LabelEncoder()
le_state = LabelEncoder()
synthetic_df["cat_encoded"] = le_cat.fit_transform(synthetic_df["merchant_category"])
synthetic_df["state_encoded"] = le_state.fit_transform(synthetic_df["location_state"])

X = synthetic_df[["amount_inr", "cat_encoded", "state_encoded"]]
y = synthetic_df["is_fraud"]

# stratify=y keeps the ~5% fraud rate identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))
```
The Key Takeaway
You just:

✅ Generated realistic Indian financial data with indic-faker
✅ Created 50,000 synthetic UPI transactions with SDV
✅ Trained a fraud detection model
✅ Never touched a single real user's data
Because the pipeline never processes real personal data, the DPDP Act's consent requirement (Section 4(1)) and data-minimization mandate (Section 8(7)) don't come into play. No consent forms. No data sharing agreements.

Want Pre-Built Indian Financial Data?
indic-faker is an open-source Python library that generates realistic Indian fake data — Aadhaar (with valid Verhoeff checksums), PAN, GSTIN, UPI IDs, bank accounts, IFSC codes, addresses with real pincodes, and more. All in 8 Indian languages.

```bash
pip install indic-faker
```

```python
from indic_faker import IndicFaker

fake = IndicFaker(language="ta")  # Tamil

fake.aadhaar()      # "3847 2918 4721" — Verhoeff valid ✓
fake.upi_id()       # "murugan.n@okicici"
fake.amount_inr()   # "₹4,29,150.00" — Indian comma system
fake.salary_lpa()   # "₹18.5 LPA"
fake.profile()      # Complete Indian identity in one call
```
⭐ Star it on GitHub if this saved you time.
