<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Adwaith A Kumar</title>
    <description>The latest articles on DEV Community by Adwaith A Kumar (@adwaith0).</description>
    <link>https://dev.to/adwaith0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3849277%2F797ee5c3-0ea8-4bf4-8ad6-d0107b25c509.png</url>
      <title>DEV Community: Adwaith A Kumar</title>
      <link>https://dev.to/adwaith0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/adwaith0"/>
    <language>en</language>
    <item>
      <title>How to Generate Synthetic UPI Transaction Data for Fraud Detection</title>
      <dc:creator>Adwaith A Kumar</dc:creator>
      <pubDate>Sun, 29 Mar 2026 12:32:41 +0000</pubDate>
      <link>https://dev.to/adwaith0/how-to-generate-synthetic-upi-transaction-data-for-fraud-detection-1k94</link>
      <guid>https://dev.to/adwaith0/how-to-generate-synthetic-upi-transaction-data-for-fraud-detection-1k94</guid>
      <description>&lt;p&gt;India's DPDP Act just made your training data illegal. Here's how to build fraud detection models anyway.&lt;/p&gt;

&lt;p&gt;Every AI team building UPI fraud detection in India faces the same impossible problem: you need millions of transaction records to train a model, but you can't legally use real user data.&lt;/p&gt;

&lt;p&gt;The Digital Personal Data Protection Act, 2023 (DPDP Act) permits processing personal data only with the individual's consent or for a short list of legitimate uses (Section 4(1)), and Section 8(7) requires erasing that data once its purpose has been served. Real UPI transaction logs contain sender IDs, receiver IDs, amounts, locations, and timestamps, all of which count as personal data under the Act when they can be tied to an individual.&lt;/p&gt;

&lt;p&gt;So you're stuck. You can't train without data. You can't use real data without consent. And getting consent from millions of UPI users? Good luck.&lt;/p&gt;

&lt;p&gt;The answer is synthetic data — algorithmically generated datasets that preserve the statistical properties of real data without containing any actual user information.&lt;/p&gt;

&lt;p&gt;What is Synthetic Data?&lt;br&gt;
Synthetic data is artificially generated information that mimics the patterns, distributions, and correlations found in real-world data. It's not just random numbers — a good synthetic dataset preserves the statistical fingerprint of the original: the same distribution of transaction amounts, the same fraud-to-legitimate ratio, the same temporal patterns.&lt;/p&gt;

&lt;p&gt;The key insight: a fraud detection model doesn't need real transactions to learn patterns. It needs realistic transactions. The closer your synthetic data comes to the statistical properties of real UPI data, the better the patterns your model learns will transfer to real-world fraud.&lt;/p&gt;
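
&lt;p&gt;To make that concrete, here is a minimal sketch of the idea for a single column, using only the Python standard library. It assumes transaction amounts are roughly log-normal (the same assumption the seed generator later in this post makes). The variable names and the simulated "real" sample are illustrative and do not come from any actual UPI dataset.&lt;/p&gt;

```python
import math
import random
import statistics

rng = random.Random(42)

# Stand-in for "real" amounts (simulated here; an assumption for illustration).
real_amounts = [rng.lognormvariate(6.5, 1.2) for _ in range(5000)]

# Learn the statistical fingerprint: mean and spread of the log-amounts.
logs = [math.log(a) for a in real_amounts]
mu = statistics.fmean(logs)
sigma = statistics.stdev(logs)

# Sample brand-new amounts from the fitted distribution.
synthetic_amounts = [rng.lognormvariate(mu, sigma) for _ in range(5000)]

print(f"fitted mu={mu:.2f}, sigma={sigma:.2f}")
print(f"real median:      {statistics.median(real_amounts):.0f}")
print(f"synthetic median: {statistics.median(synthetic_amounts):.0f}")
```

&lt;p&gt;The two medians should land close together even though no synthetic value is copied from the original sample. Tools like SDV apply the same idea to many columns at once and also model the relationships between them.&lt;/p&gt;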

&lt;p&gt;Step 1: Install the Tools&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install indic-faker sdv scikit-learn pandas
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;indic-faker: generates realistic Indian financial data (UPI IDs, bank accounts, IFSC codes, INR amounts in lakhs/crores)&lt;br&gt;
sdv: Synthetic Data Vault, for learning statistical distributions&lt;br&gt;
scikit-learn: for the fraud detection model&lt;br&gt;
pandas: for holding the transactions as DataFrames&lt;/p&gt;

&lt;p&gt;Step 2: Generate a Realistic UPI Transaction Seed Dataset&lt;br&gt;
First, we create 500 realistic-looking UPI transactions using indic-faker:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd
import random
from datetime import datetime, timedelta
from indic_faker import IndicFaker

fake = IndicFaker(language="hi")

# Merchant categories common in UPI
categories = [
    "Groceries", "Fuel", "Electronics", "Restaurants",
    "Online Shopping", "Utilities", "Education", "Healthcare",
    "Entertainment", "Travel", "Rent", "Insurance"
]
states = ["MH", "KA", "DL", "TN", "KL", "UP", "GJ", "RJ", "WB", "AP"]

def generate_upi_transaction(is_fraud=False):
    """Generate a single realistic UPI transaction."""
    if is_fraud:
        # Fraud patterns: unusual amounts at late-night hours
        amount = random.choice([
            random.uniform(45000, 99999),   # Large amounts
            random.uniform(9900, 9999),     # Just under 10K limit
            random.uniform(1, 50),          # Micro-transactions (probing the account)
        ])
        hour = random.choice(range(1, 5))  # Late night (01:00 to 04:59)
    else:
        # Log-normal amounts: median around ₹665 with a long right tail
        amount = random.lognormvariate(6.5, 1.2)
        amount = min(amount, 50000)
        hour = random.choice(range(7, 23))  # Normal hours (07:00 to 22:59)

    # Random day within 2024, at the hour chosen above
    base_date = datetime(2024, 1, 1)
    txn_date = base_date + timedelta(
        days=random.randint(0, 365),
        hours=hour,
        minutes=random.randint(0, 59)
    )

    return {
        "sender_upi": fake.upi_id(),
        "receiver_upi": fake.upi_id(),
        "amount_inr": round(amount, 2),
        "timestamp": txn_date.isoformat(),
        "merchant_category": random.choice(categories),
        "location_state": random.choice(states),
        "sender_bank": fake.bank_name(),
        "ifsc": fake.ifsc(),
        "is_fraud": int(is_fraud)
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Generate 500 seed transactions (5% fraud rate, an illustrative figure)
transactions = []
for _ in range(475):
    transactions.append(generate_upi_transaction(is_fraud=False))
for _ in range(25):
    transactions.append(generate_upi_transaction(is_fraud=True))
random.shuffle(transactions)

seed_df = pd.DataFrame(transactions)
seed_df.to_csv("upi_seed_data.csv", index=False)
print(f"Generated {len(seed_df)} seed transactions")
print(f"Fraud rate: {seed_df['is_fraud'].mean():.1%}")
print("\nSample:")
print(seed_df.head())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Generated 500 seed transactions
Fraud rate: 5.0%

Sample:
         sender_upi     receiver_upi  amount_inr  ...  is_fraud
0  rajesh.k@okicici    priya.n@paytm     1247.50  ...         0
1   amit.sharma@ybl  deepak.m@okaxis    48923.11  ...         1
2  sunita.d@okicici      grocery@ybl      342.00  ...         0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Step 3: Train SDV on the Seed Data and Generate 50,000 Rows&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

# Define metadata
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(seed_df)

# Override detected types for accuracy
metadata.update_column("amount_inr", sdtype="numerical")
metadata.update_column("is_fraud", sdtype="categorical")
metadata.update_column("timestamp", sdtype="datetime", datetime_format="%Y-%m-%dT%H:%M:%S")

# Train the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed_df)

# Generate 50,000 synthetic transactions
synthetic_df = synthesizer.sample(num_rows=50000)
synthetic_df.to_csv("upi_synthetic_50k.csv", index=False)
print(f"Generated {len(synthetic_df)} synthetic transactions")
print(f"Fraud rate (synthetic): {synthetic_df['is_fraud'].mean():.1%}")
print(f"Fraud rate (original):  {seed_df['is_fraud'].mean():.1%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Step 4: Evaluate Quality&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sdv.evaluation.single_table import evaluate_quality

quality_report = evaluate_quality(seed_df, synthetic_df, metadata)
print(f"\nOverall Quality Score: {quality_report.get_score():.2%}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;As a rule of thumb, a score above 85% means your synthetic data closely mirrors the statistical properties of the seed data, and anything above 90% is excellent. Keep in mind that the score measures similarity to the seed, so it says nothing about how closely the seed itself resembles real UPI traffic.&lt;/p&gt;
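
&lt;p&gt;If you want a feel for what a per-column similarity score measures, the sketch below computes a two-sample Kolmogorov-Smirnov statistic by hand and reports its complement (1 minus the statistic) for one numeric column. My understanding is that SDV scores numerical columns with a comparison of this kind, but treat the function as an illustration rather than SDV's own implementation. The name ks_statistic and the simulated samples are made up for this example.&lt;/p&gt;

```python
import bisect
import random


def ks_statistic(sample_a, sample_b):
    """Largest gap between the two empirical CDFs (0 = identical, 1 = disjoint)."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    gap = 0.0
    # The largest gap always occurs at one of the observed values.
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


rng = random.Random(0)
seed_amounts = [rng.lognormvariate(6.5, 1.2) for _ in range(500)]
good_synth = [rng.lognormvariate(6.5, 1.2) for _ in range(5000)]  # same shape as the seed
bad_synth = [rng.uniform(1, 50000) for _ in range(5000)]          # wrong shape entirely

print(f"similarity, matching shape:   {1 - ks_statistic(seed_amounts, good_synth):.2f}")
print(f"similarity, mismatched shape: {1 - ks_statistic(seed_amounts, bad_synth):.2f}")
```

&lt;p&gt;The first score should come out close to 1 and the second far lower, which is the kind of gap to look for when a single column drags the overall report down.&lt;/p&gt;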

&lt;p&gt;Step 5: Train a Fraud Detection Model&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.preprocessing import LabelEncoder

# Prepare features: encode the categorical columns as integers
le_cat = LabelEncoder()
le_state = LabelEncoder()
synthetic_df["cat_encoded"] = le_cat.fit_transform(synthetic_df["merchant_category"])
synthetic_df["state_encoded"] = le_state.fit_transform(synthetic_df["location_state"])
feature_cols = ["amount_inr", "cat_encoded", "state_encoded"]
X = synthetic_df[feature_cols]
y = synthetic_df["is_fraud"]

# Stratify so the rare fraud class appears at the same rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Train
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)

# Evaluate on the held-out synthetic rows
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred, target_names=["Legitimate", "Fraud"]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The Key Takeaway&lt;br&gt;
You just:&lt;/p&gt;

&lt;p&gt;✅ Generated realistic Indian financial data with indic-faker&lt;br&gt;
✅ Created 50,000 synthetic UPI transactions with SDV&lt;br&gt;
✅ Trained a fraud detection model&lt;br&gt;
✅ Never touched a single real user's data&lt;/p&gt;

&lt;p&gt;Because no real personal data enters the pipeline at any point, there is nothing here that needs consent under the DPDP Act: no consent forms and no data sharing agreements.&lt;/p&gt;

&lt;p&gt;Want Pre-Built Indian Financial Data?&lt;br&gt;
indic-faker is an open-source Python library that generates realistic Indian fake data — Aadhaar (with valid Verhoeff checksums), PAN, GSTIN, UPI IDs, bank accounts, IFSC codes, addresses with real pincodes, and more. All in 8 Indian languages.&lt;/p&gt;
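
&lt;p&gt;If you are wondering what a "valid Verhoeff checksum" involves, here is a small standalone sketch of the algorithm using its standard lookup tables. It is independent of indic-faker, and the function names are my own. It treats the last digit of a number as the check digit.&lt;/p&gt;

```python
# Standard Verhoeff tables: D is the multiplication table of the dihedral
# group D5, P holds the position-dependent permutations, INV the inverses.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def _checksum(digits):
    """Fold the digits, rightmost first, through the Verhoeff tables."""
    c = 0
    for i, digit in enumerate(reversed(digits)):
        c = D[c][P[i % 8][digit]]
    return c


def verhoeff_check_digit(payload):
    """Return the check digit to append to a string of digits."""
    digits = [int(ch) for ch in payload if ch.isdigit()]
    # A placeholder 0 occupies the check-digit position during the fold.
    return INV[_checksum(digits + [0])]


def verhoeff_is_valid(number):
    """True if the number (spaces allowed) ends in a correct check digit."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    return bool(digits) and _checksum(digits) == 0


payload = "12345678901"  # arbitrary 11 digits, not a real ID
full = payload + str(verhoeff_check_digit(payload))
print(full, verhoeff_is_valid(full))
```

&lt;p&gt;The scheme catches every single-digit typo and every swap of two adjacent digits, which is why it suits hand-typed ID numbers.&lt;/p&gt;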

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install indic-faker
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from indic_faker import IndicFaker

fake = IndicFaker(language="ta")  # Tamil
fake.aadhaar()        # "3847 2918 4721" (Verhoeff valid ✓)
fake.upi_id()         # "murugan.n@okicici"
fake.amount_inr()     # "₹4,29,150.00" (Indian comma system)
fake.salary_lpa()     # "₹18.5 LPA"
fake.profile()        # Complete Indian identity in one call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;⭐ Star it on GitHub if this saved you time.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
