A Minimal Case Study of Bias in Clinical Machine Learning
Repository: https://github.com/lemind/clinical-ml-bias-demo
Background
This project was done as part of a university course on machine learning in clinical settings.
The course focused on bias and fairness in ML, clinical risk, interpretability, and the limits of technical mitigation in real-world systems.
One of the key inspirations was Ntoutsi et al. (2020), “Bias in Data-Driven AI Systems”[1], which emphasizes that bias can appear at all stages of an ML pipeline and that mitigation is usually partial and involves trade-offs.
The goal of this project was not to build a strong predictive model, but to:
- demonstrate bias,
- measure it explicitly,
- and apply a simple mitigation strategy.
Problem Setup
We simulate a basic clinical prediction task using synthetic data:
- Features: age, sex, severity score
- Target: binary clinical outcome
Sex is treated as a protected attribute.
Bias is intentionally introduced so that female patients are more likely to be missed by the model.
This setup reflects real clinical scenarios where historical or systemic biases are encoded in data.
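The exact data-generating process lives in the repository; the snippet below is only a rough, self-contained sketch of how such a bias could be simulated. The coefficients, group coding, and under-reporting rate are illustrative assumptions, not the repository's actual values, and other mechanisms (e.g., noisier features for one group) would produce a similar effect.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 2000

# Basic covariates: age, sex (the protected attribute), and a severity score.
age = rng.normal(60, 12, n)
sex = rng.integers(0, 2, n)            # 0 = male, 1 = female (illustrative coding)
severity = rng.normal(0, 1, n)

# Latent clinical risk driven by age and severity.
logit = 0.04 * (age - 60) + 1.5 * severity
y_true = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# Inject bias: positive outcomes for female patients are under-recorded,
# so the labels the model learns from systematically miss cases in that group.
under_recorded = (sex == 1) & (y_true == 1) & (rng.random(n) < 0.5)
outcome = np.where(under_recorded, 0, y_true)

df = pd.DataFrame({"age": age, "sex": sex, "severity": severity, "outcome": outcome})
```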
Method
Tools
- Python
- NumPy, Pandas
- scikit-learn
- Logistic Regression (interpretable baseline)
Steps
- Generate biased synthetic clinical data
- Train a logistic regression model
- Evaluate performance by subgroup
- Measure False Negative Rate (FNR) per group
- Apply post-processing mitigation
False negatives were chosen as the primary metric because of their clinical relevance: a missed positive case typically means a patient who needed care is not flagged.
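A minimal sketch of the training and subgroup evaluation steps, assuming the DataFrame `df` from the sketch above; the column names and the helper function are illustrative, not taken from the repository.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X = df[["age", "sex", "severity"]]
y = df["outcome"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
pred = (proba >= 0.5).astype(int)      # default decision threshold

def false_negative_rate(y_true, y_pred):
    # Share of actual positives predicted as negative: FN / (FN + TP).
    positives = y_true == 1
    return ((y_pred == 0) & positives).sum() / positives.sum()

for code, label in [(0, "male"), (1, "female")]:
    mask = (X_test["sex"] == code).to_numpy()
    print(label, false_negative_rate(y_test.to_numpy()[mask], pred[mask]))
```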
Results
Baseline (default threshold = 0.5)
- Overall accuracy: 0.593
- False Negative Rate (male): 0.468
- False Negative Rate (female): 0.925
The model misses the vast majority of positive cases for female patients, a disparity that the single overall accuracy figure does not reveal.
After mitigation (group-specific threshold)
- False Negative Rate (male): 0.468
- False Negative Rate (female): 0.094
The disparity is substantially reduced without retraining the model.
Interpretation
Mitigation was applied at decision time, not during training.
A more sensitive threshold was used for the disadvantaged group.
This improves recall for female patients at the cost of additional false positives in that group, illustrating a real clinical trade-off.
The outcome aligns with the arguments of Ntoutsi et al. (2020): bias mitigation redistributes errors rather than eliminating them.
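As a rough illustration of the decision-time adjustment described above, continuing the earlier sketch: the female-group threshold of 0.25 is an assumed value, and the repository's manually chosen thresholds may differ.

```python
import numpy as np

# Group-specific decision thresholds: a more sensitive cutoff for the group
# with the higher false negative rate, the default cutoff for the other group.
thresholds = {0: 0.50, 1: 0.25}        # manually chosen, illustrative values

sex_test = X_test["sex"].to_numpy()
cutoffs = np.array([thresholds[s] for s in sex_test])
pred_mitigated = (proba >= cutoffs).astype(int)

for code, label in [(0, "male"), (1, "female")]:
    mask = sex_test == code
    print(label, false_negative_rate(y_test.to_numpy()[mask], pred_mitigated[mask]))
```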
Limitations
- Synthetic data lacks real clinical complexity
- Only one protected attribute was considered
- Thresholds were manually chosen
- Not all fairness metrics were evaluated
Despite these limitations, the example effectively demonstrates the key concepts from the course.
Conclusion
This project shows that:
- bias can exist even in simple, interpretable models
- overall accuracy can hide clinically important disparities
- mitigation strategies reduce harm but introduce trade-offs
Bias in clinical ML is not purely a modeling issue, but a system-level problem, consistent with current research.
[1] - Ntoutsi, E., Fafalios, P., Gadiraju, U., Iosifidis, V., Nejdl, W., Vidal, M.-E., Ruggieri, S., Turini, F., Papadopoulos, S., Krasanakis, E., Kompatsiaris, I., Kinder-Kurlanda, K., Wagner, C., Karimi, F., Fernandez, M., Alani, H., Berendt, B., Kruegel, T., Heinze, C., Broelemann, K., Kasneci, G., Tiropanis, T., & Staab, S. (2020). Bias in data-driven artificial intelligence systems—An introductory survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3), e1356. https://doi.org/10.1002/widm.1356