ML fraud detection platform using AI agents

Souptik Chakraborty — Tue, 12 May 2026 15:01:48 +0000

I built a production ML fraud detection platform using AI agents. Here's everything

Few months ago I had an idea for an open-source fraud detection platform.

I had no engineering team. I had no budget. And I cannot write production Python.

Today, RiskOS is live. Four ML services, real APIs, open source, MIT licensed. You can call the fraud detection endpoint right now with no signup.

This post is the honest account of how I built it, what broke badly, and the exact prompting patterns that finally worked.

Who I Am (Context Matters)

I'm Souptik Chakraborty — an AI Product Manager based in Kolkata, India. My background is product strategy, not engineering. I can read code well enough to understand what it does. I cannot write it reliably enough to ship to production.

I used three AI tools as my engineering team:

Claude (Anthropic) — architecture, reasoning, debugging strategy
Codex (OpenAI) — implementation execution
Gemini via Antigravity — deployment and infrastructure

My job was to write prompts. Their job was to write code.

What I Built

RiskOS is four composable services, each independently deployable, all with FastAPI backends running alongside Gradio on HuggingFace Spaces.

🔍 Fraud Intelligence Suite

Five agents in one space: transaction fraud scoring (XGBoost), credit risk assessment (LightGBM), KYC identity anomaly detection (IsolationForest), sanctions and PEP screening (RapidFuzz), and an LLM-backed risk consultant.

Metrics on synthetic test set:

Recall: ≥ 88%
Precision: ≥ 70%
Inference latency: ~55ms on CPU
SHAP interpretability on every prediction
Drift detection on out-of-distribution inputs

Live: https://huggingface.co/spaces/soupstick/fraud-detector-app

⚡ Risk Pipeline

LightGBM scorer combined with a 15-rule decision engine. Ingests batches of up to 500 transactions, scores each one, applies rules, and triages into ESCALATE / MONITOR / AUTO_CLOSE.

Metrics:

AUC-ROC: ≥ 0.88
Workload reduction on test set: ~70%
Processing time: <5 seconds for 100 transactions

Live: https://huggingface.co/spaces/soupstick/risk-pipeline

🛡️ LLM Guard

RAG-augmented guardrail layer using LangChain and Opik. Evaluates LLM outputs against policy documents. Blocks jailbreaks, prompt injections, PII leakage, and social engineering scripts. Every call logged to Opik for audit trails.

Metrics on adversarial test set:

Block rate on unsafe inputs: ~94%
Safe pass-through rate: >95% (no over-blocking)

Live: https://huggingface.co/spaces/soupstick/opik_guard_v1

📊 Marketplace Intelligence

Natural language to SQL to Plotly chart. Ask a question in plain English, get structured results and a visualization. SELECT-only enforcement via sqlglot — blocks DROP, DELETE, INSERT, UPDATE, PRAGMA, and ATTACH before execution. 15,000-row SQLite database seeded with realistic e-commerce patterns.

Live: https://huggingface.co/spaces/soupstick/marketplace-intelligence

Why Four Services Instead of One

This was the first major architectural decision I had to make, and it shaped everything else.

The initial instinct was to build one monolithic service. One API, one model, one space. Simpler to explain, simpler to deploy.

I pushed back on that for three reasons, which I had Claude reason through with me explicitly:

Cold start isolation. HuggingFace free tier spaces sleep after inactivity. If one service is sleeping, it should not block the others. Four spaces means four independent cold starts.

Independent deployability. A bug in the SQL safety layer of the marketplace service should not require redeploying the fraud model. Separate repos, separate deployments, separate failure domains.

The FastAPI/Gradio port constraint. HuggingFace Spaces exposes exactly one port (7860). Mounting FastAPI alongside Gradio on a single port requires a specific pattern — gr.mount_gradio_app(fastapi_app, gradio_app, path="/"). One monolith would mean one complex routing layer. Four services means four clean implementations.

The Prompting Workflow

I did not write a single line of Python. Here is what I actually did.

For each component, I wrote a prompt that included:

What the system does — in plain English, no jargon
The exact output schema — the precise JSON structure the API must return
Success metrics with numeric thresholds — "recall >= 0.88, AUC >= 0.82, latency < 200ms"
Failure modes to explicitly guard against — "do not fabricate model artifacts, do not write tests that only validate synthetic data against synthetic models"
A test gate — "the agent must not push until all tests pass against the live HF Space URL, not the local mock"

That last constraint — tests against the live URL — was the single most important thing I learned. Without it, the AI will validate its own work against its own synthetic data and report success. It is circular and it is invisible until something breaks in production.

A Prompt That Worked

Here is the actual prompt structure I used for the XGBoost fraud model:

Train a LightGBM classifier on data/train.csv.

Minimum performance thresholds — raise an exception and do not save
the model if any of these are not met:
- Recall on test set >= 0.88
- Precision >= 0.70  
- AUC-ROC >= 0.82

The model must be saved via model.booster_.save_model() to
model_artifacts/risk_lgbm.txt — not serialized via pickle.

Write the training script to model_artifacts/metadata.json with:
version, training date, actual recall, actual precision, actual AUC.

If the thresholds are not met, do not adjust the thresholds.
Fix the model. Tune class_weight, learning_rate, or n_estimators.

The key phrase: "do not adjust the thresholds. Fix the model." Without that, the agent lowers the bar.

What Broke (The Honest Part)

Fabricated Model Artifacts

The first time I ran Codex on the fraud detector, it produced a file called fraud_xgb.json and reported that the model had 1.0 recall on the test set.

It had not trained a model. It had written a JSON file by hand that looked like an XGBoost model. The "test set" it validated against was 50 rows of synthetic data it had generated itself. The recall was 1.0 because the model was perfectly overfit to data it had invented.

I caught this because I required tests to hit the live HF Space API. When the live endpoint returned wrong predictions on known-fraud transactions, I investigated and found the artifact was fabricated.

Fix: I added this line to every training prompt: "The model must be trained by calling model.fit() on real data from the CSV. Do not write model artifacts by hand. If a .json or .pkl file already exists, delete it and retrain."

The SQL Security Silent Failure

I built a 10-test SQL security suite. Six tests failed in production — DROP TABLE, DELETE, INSERT, UPDATE, PRAGMA, ATTACH DATABASE were all passing through without being blocked.

The sql_validator.py file existed. The logic looked correct. The problem: the module was imported at the top level, failing on a version conflict with sqlglot, and the except ImportError block was catching the failure silently and proceeding with an unprotected execution path.

Every test I ran locally passed because locally, sqlglot was installed. On HF Spaces, a dependency conflict on the first import caused the validator to silently not load, and queries went directly to SQLite execution.

Fix: I replaced the validator entirely with a first-token whitelist approach that has zero external dependencies — pure Python string operations, no sqlglot at module level. If the first token of the SQL is not SELECT, the query is blocked. No imports that can fail.

Test Suites That Validated Themselves

Three times I received "all tests pass" reports where the agent had written the test data, the training data, and the model in the same session. The tests passed because everything was internally consistent — not because the system worked.

Fix: Test data must come from a different source than training data. I started requiring agents to generate test fixtures first, commit them, and then train models — so the test data was frozen before the model saw any data.

The Architecture Decision I Got Right

Running FastAPI alongside Gradio on HuggingFace Spaces.

HF exposes one port. Gradio wants that port. FastAPI wants that port. The solution is to mount Gradio as a sub-application inside FastAPI:

import gradio as gr
from fastapi import FastAPI

fastapi_app = FastAPI()

# ... define all FastAPI routes ...

gradio_interface = gr.Blocks()
# ... build Gradio UI ...

app = gr.mount_gradio_app(fastapi_app, gradio_interface, path="/")

This makes Gradio serve on / and FastAPI serve on /api/v1/*, all on port 7860. One Dockerfile, one exposed port, both interfaces live.

I did not know this pattern existed. I described the constraint to Claude — "HF only exposes port 7860, I need both a UI and an API" — and it found this solution in the Gradio docs.

What I Learned About Being an AI Product Manager

The job is not prompting. The job is knowing what good looks like.

Anyone can ask Claude to "build a fraud detection system." Getting production-quality output requires:

Specifying exact success criteria — not "good accuracy" but "recall >= 0.88 on a 15% fraud-rate test set"
Specifying failure modes explicitly — if you do not tell the AI what not to do, it will do it
Adversarial validation — test suites must be designed to catch the AI's blind spots, not confirm its assumptions
Reading output critically — confident, well-structured code can be completely broken. You have to know enough to notice

That last point is where being a PM actually helps. PMs are trained to ask "what could go wrong" and "how do we know this is working." Those are the same questions that produce good prompts.

The Live System

Website: https://souptik-aipm.vercel.app

Try the fraud API right now — no signup, no key:

curl -X POST https://soupstick-fraud-detector-app.hf.space/api/v1/fraud/predict \
  -H "Content-Type: application/json" \
  -d '{
    "transaction_id": "devto-test",
    "amount": 9500,
    "hour_of_day": 3,
    "is_international": true,
    "merchant_category": "electronics",
    "transaction_velocity_1h": 8,
    "amount_vs_avg_ratio": 4.5,
    "is_new_device": true,
    "distance_from_home_km": 650,
    "failed_attempts_before": 2,
    "account_age_days": 15
  }'

Expected response: "verdict": "FRAUD" in under 60 seconds (first call may hit a cold start).

GitHub: https://github.com/Souptik96/riskos
HuggingFace: https://huggingface.co/soupstick

One Honest Limitation

All models are trained on synthetically generated data with engineered fraud signals. They are not production-ready without retraining on real labeled data from a live system. The metrics in this post reflect performance on held-out synthetic test sets. I have stated this clearly in every README and I am stating it here.

The value of RiskOS is the architecture, the API contracts, the test harnesses, and the build process — not the model weights. Those need to be replaced with real data before anyone puts this in front of real transactions.

If You're a PM Trying to Build AI Products

The biggest unlock for me was treating Claude like a senior engineer, not a code generator. I did not say "write me a fraud model." I said "I need a fraud model that meets these specific criteria, these are the constraints, here are the failure modes I'm worried about, here is how I will know it worked."

That is a product spec, not a prompt.

If you're building something similar and want to compare notes, my DMs are open on LinkedIn: https://www.linkedin.com/in/souptikchakraborty

And if this was useful — a GitHub star means a lot: https://github.com/Souptik96/riskos

DEV Community: Souptik Chakraborty