Hi everyone! I wanted to share a project I've been polishing to demonstrate how to structure a machine learning pipeline beyond just a Jupyter Notebook.
It’s a complete Credit Card Fraud Detection System built on the PaySim dataset. The main challenge was the extreme class imbalance (only ~0.17% of transactions are fraud), which makes standard accuracy metrics misleading.
Project Highlights:
- Imbalance Handling: Implementation of class_weight='balanced' in Random Forest and scale_pos_weight in XGBoost to penalize missed fraud cases.
- Modular Architecture: The code is split into distinct modules:
  - data_loader.py: ingestion & cleaning.
  - features.py: feature engineering (time-based features, behavioral flags).
  - model.py: model wrapper with persistence (joblib).
- Full Evaluation: Automated generation of ROC-AUC (~0.999), Confusion Matrix, and Precision-Recall reports.
- Testing: End-to-end integration tests using pytest to ensure the pipeline doesn't break when refactoring.
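For anyone curious what the imbalance handling looks like in practice, here's a minimal sketch (not code from the repo; the synthetic data and ~1.7% fraud rate are just for illustration). The key idea is that `class_weight='balanced'` reweights classes inversely to their frequency, while XGBoost's `scale_pos_weight` is typically set to the negative/positive count ratio:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Toy data mimicking the extreme imbalance: ~1.7% fraud.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = np.zeros(1000, dtype=int)
y[:17] = 1          # 17 fraud cases out of 1000
X[:17] += 2.0       # shift fraud rows so the model has some signal

# Random Forest: weight each class inversely to its frequency,
# so errors on the rare fraud class cost much more.
rf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
rf.fit(X, y)

# XGBoost convention: scale_pos_weight ≈ (# negatives) / (# positives).
neg, pos = (y == 0).sum(), (y == 1).sum()
spw = neg / max(pos, 1)
# This ratio would then be passed as
# XGBClassifier(scale_pos_weight=spw, ...).
```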
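And since accuracy is meaningless at this imbalance, the evaluation step leans on ROC-AUC, the confusion matrix, and precision/recall. A rough sketch of that reporting (again synthetic data and a stand-in classifier, not the repo's model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, confusion_matrix, classification_report
)

# Toy imbalanced data: 25 positives out of 500.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
y = np.zeros(500, dtype=int)
y[:25] = 1
X[:25] += 2.0  # give the positives a detectable signal

clf = LogisticRegression().fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # AUC needs scores, not labels
preds = clf.predict(X)

auc = roc_auc_score(y, scores)
cm = confusion_matrix(y, preds)       # rows = true, cols = predicted
report = classification_report(y, preds, digits=3)
print(f"ROC-AUC: {auc:.3f}")
print(cm)
print(report)
```

The classification report surfaces per-class precision and recall, which is where a fraud model's real weaknesses (false negatives on the minority class) actually show up.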
I included detailed docs on the system architecture and testing strategy if anyone is interested in how to organize ML projects for production.
Repo: github.com/arpahls/cfd
Feedback on the code structure or model choice is welcome!
Top comments (1)
0.17% minority class is brutal. I've worked with similarly imbalanced datasets and SMOTE alone usually isn't enough — it tends to generate synthetic samples that sit in the overlap region between classes.
Did you try any anomaly detection approaches (isolation forest, autoencoders) as a first pass before the supervised models? Sometimes a two-stage pipeline works better than trying to make a single model handle the imbalance directly.
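The two-stage idea the comment describes could look something like this: an unsupervised Isolation Forest flags the most anomalous transactions, and the supervised model only decides among those candidates. This is a hedged sketch of the suggestion, not something from the repo:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

# Toy data: 20 fraud cases out of 1000, shifted to be outliers.
rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 6))
y = np.zeros(1000, dtype=int)
y[:20] = 1
X[:20] += 3.0

# Stage 1: unsupervised anomaly filter. contamination sets the
# expected outlier fraction; predict() returns -1 for anomalies.
iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
candidates = iso.predict(X) == -1

# Stage 2: supervised classifier makes the final call, but only
# transactions that passed the anomaly filter can be flagged.
clf = RandomForestClassifier(
    n_estimators=50, class_weight="balanced", random_state=0
)
clf.fit(X, y)
final_flags = candidates & (clf.predict(X) == 1)
```

The appeal is that stage 1 cheaply discards the bulk of obviously normal traffic, so the supervised stage faces a far less lopsided decision.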