Hi everyone! I wanted to share a project I've been polishing to demonstrate how to structure a machine learning pipeline beyond just a Jupyter Notebook.
It’s a complete Credit Card Fraud Detection System built on the PaySim dataset. The main challenge was the extreme class imbalance (only ~0.17% of transactions are fraud), which makes standard accuracy metrics misleading.
Project Highlights:
- Imbalance Handling: Implementation of class_weight='balanced' in Random Forest and scale_pos_weight in XGBoost to penalize missed fraud cases more heavily.
- Modular Architecture: The code is split into distinct modules:
  - data_loader.py: ingestion & cleaning.
  - features.py: feature engineering (time-based features, behavioral flags).
  - model.py: model wrapper with persistence (joblib).
- Full Evaluation: Automated generation of ROC-AUC (~0.999), Confusion Matrix, and Precision-Recall reports.
- Testing: End-to-end integration tests using pytest to ensure the pipeline doesn't break when refactoring.
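The imbalance-handling settings from the highlights can be sketched roughly like this (the dataset here is a synthetic stand-in for PaySim, and the hyperparameters are illustrative, not the repo's actual values):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic, heavily imbalanced dataset (~1% positive class)
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=42)

# Random Forest: class_weight='balanced' re-weights classes inversely
# to their frequency, so missed fraud cases cost more during training.
rf = RandomForestClassifier(n_estimators=100, class_weight="balanced", random_state=42)
rf.fit(X, y)

# For XGBoost, the equivalent knob is scale_pos_weight, commonly set
# to the ratio of negative to positive samples:
scale_pos_weight = (y == 0).sum() / (y == 1).sum()
print("scale_pos_weight ≈", round(scale_pos_weight, 1))
```

Both approaches change the loss weighting rather than resampling the data, which avoids the synthetic-sample artifacts that oversampling can introduce.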
I included detailed docs on the system architecture and testing strategy if anyone is interested in how to organize ML projects for production.
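For context, the evaluation and persistence steps described above could look something like this minimal sketch (model, filenames, and data are placeholders, not the repo's actual code):

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the engineered PaySim features
X, y = make_classification(n_samples=5000, weights=[0.98, 0.02], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)

model = LogisticRegression(max_iter=1000, class_weight="balanced")
model.fit(X_train, y_train)

# Persist and reload the fitted model, as a joblib-backed wrapper would
joblib.dump(model, "fraud_model.joblib")
model = joblib.load("fraud_model.joblib")

# ROC-AUC, confusion matrix, and precision-recall report in one pass
preds = model.predict(X_test)
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
print(confusion_matrix(y_test, preds))
print(classification_report(y_test, preds, digits=3))
```

The classification report is worth printing alongside ROC-AUC, since precision and recall on the fraud class tell you far more than accuracy at 0.17% prevalence.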
Repo: github.com/arpahls/cfd
Feedback on the code structure or model choice is welcome!
Top comments (2)
0.17% minority class is brutal. I've worked with similarly imbalanced datasets and SMOTE alone usually isn't enough — it tends to generate synthetic samples that sit in the overlap region between classes.
Did you try any anomaly detection approaches (isolation forest, autoencoders) as a first pass before the supervised models? Sometimes a two-stage pipeline works better than trying to make a single model handle the imbalance directly.
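One way to sketch the two-stage idea suggested here (all names and thresholds are hypothetical, assuming an IsolationForest first pass followed by a supervised scorer):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)

# Stage 1: unsupervised anomaly filter. contamination is set near the
# expected fraud rate; fit_predict returns -1 for suspected anomalies.
iso = IsolationForest(contamination=0.02, random_state=0)
suspects = iso.fit_predict(X) == -1

# Stage 2: the supervised model is trained on all labeled data,
# but at inference only the flagged transactions are scored, so it
# handles a much smaller and less imbalanced candidate set.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
clf.fit(X, y)
probs = clf.predict_proba(X[suspects])[:, 1]
print(suspects.sum(), "transactions routed to stage 2")
```

Whether this beats a single weighted model depends on how cleanly the anomaly scores separate fraud from legitimate outliers, so it is worth validating the stage-1 recall before committing to the cascade.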
This was made as a project for the AI/ML Engineer path on Codecademy, then enhanced with best practices, additional training options, and tests. How would you do it better? Please feel free to open an issue and I'll sort it. Is it worth setting up issues and contribution guidelines to make it open source?