DEV Community

Ross Peili

I built a modular Fraud Detection System to solve 0.17% class imbalance (RF + XGBoost)

Hi everyone! I wanted to share a project I've been polishing to demonstrate how to structure a machine learning pipeline beyond just a Jupyter Notebook.

It’s a complete Credit Card Fraud Detection System built on the PaySim dataset. The main challenge was the extreme class imbalance (only ~0.17% of transactions are fraud), which makes standard accuracy metrics misleading.

Project Highlights:

  • Imbalance Handling: class_weight='balanced' in Random Forest and scale_pos_weight in XGBoost, so that missed fraud cases are penalized far more heavily than misclassified legitimate transactions.
  • Modular Architecture: the code is split into distinct modules:
    • data_loader.py — ingestion & cleaning.
    • features.py — feature engineering (time-based features, behavioral flags).
    • model.py — model wrapper with persistence (joblib).
  • Full Evaluation: automated generation of ROC-AUC (~0.999), confusion matrix, and precision-recall reports.
  • Testing: end-to-end integration tests with pytest to ensure the pipeline doesn't break during refactoring.
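The two weighting schemes above can be sketched as follows; the helper name and hyperparameters here are illustrative, not the repo's actual API:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pos_weight(y):
    """Ratio of negatives to positives, as fed to XGBoost's scale_pos_weight.

    With this weight, a missed fraud case costs roughly n_neg/n_pos times
    more than a misclassified legitimate transaction.
    """
    n_neg = int(np.sum(y == 0))
    n_pos = int(np.sum(y == 1))
    return n_neg / max(n_pos, 1)

# Random Forest achieves a similar effect via class_weight="balanced",
# which weights each class inversely to its frequency.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", n_jobs=-1)

y = np.array([0] * 997 + [1] * 3)  # ~0.3% positives, fraud-like imbalance
spw = pos_weight(y)                # ≈ 332.3
```

On the XGBoost side this ratio would be passed as `XGBClassifier(scale_pos_weight=spw)`.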
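The evaluation report can be sketched with scikit-learn's metrics on a toy imbalanced split; the dataset and model below are stand-ins, not the repo's code:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset (~2% positives) standing in for the real transactions
X, y = make_classification(n_samples=2000, weights=[0.98, 0.02], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
preds = clf.predict(X_te)

auc = roc_auc_score(y_te, proba)        # threshold-free ranking quality
cm = confusion_matrix(y_te, preds)      # rows: true class, cols: predicted
report = classification_report(y_te, preds, digits=3)  # per-class precision/recall
```

With this level of imbalance, the precision-recall report carries most of the signal; accuracy alone would look excellent even for a model that never flags fraud.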
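An end-to-end pytest check like the one described might look like this; run_pipeline is a hypothetical stand-in for the repo's load → features → train flow, not its real entry point:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def run_pipeline(X, y):
    """Hypothetical stand-in for load -> feature-engineer -> train."""
    model = RandomForestClassifier(n_estimators=10, class_weight="balanced",
                                   random_state=0)
    model.fit(X, y)
    return model

def test_pipeline_end_to_end():
    # Synthetic transactions with a rare positive class
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = (rng.random(200) < 0.05).astype(int)

    model = run_pipeline(X, y)
    preds = model.predict(X)

    # The pipeline should produce one binary prediction per input row
    assert preds.shape == (200,)
    assert set(np.unique(preds)) <= {0, 1}
```

A test like this says nothing about model quality, but it catches shape and interface regressions whenever the modules are refactored.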

I've included detailed docs on the system architecture and testing strategy, for anyone interested in how to organize ML projects for production.

Repo: github.com/arpahls/cfd

Feedback on the code structure or model choice is welcome!

Top comments (1)

Matthew Hou

0.17% minority class is brutal. I've worked with similarly imbalanced datasets and SMOTE alone usually isn't enough — it tends to generate synthetic samples that sit in the overlap region between classes.

Did you try any anomaly detection approaches (isolation forest, autoencoders) as a first pass before the supervised models? Sometimes a two-stage pipeline works better than trying to make a single model handle the imbalance directly.
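For reference, the two-stage idea could be sketched like this, assuming scikit-learn's IsolationForest as the unsupervised first pass (the data, contamination threshold, and names are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.zeros(1000, dtype=int)
y[:10] = 1        # ~1% synthetic "fraud"
X[:10] += 4.0     # fraud rows sit far from the normal cluster

# Stage 1: unsupervised outlier filter (no labels used)
stage1 = IsolationForest(contamination=0.05, random_state=0).fit(X)
flagged = stage1.predict(X) == -1   # -1 marks anomalies

# Stage 2: supervised model trained and scored only on the flagged subset,
# where the class balance is far less extreme
stage2 = RandomForestClassifier(class_weight="balanced", random_state=0)
stage2.fit(X[flagged], y[flagged])
```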