<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Carlos Peñalver Pérez</title>
    <description>The latest articles on DEV Community by Carlos Peñalver Pérez (@forzau).</description>
    <link>https://dev.to/forzau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928905%2F28e1e708-24b7-40af-8c9c-4bb636afa932.jpg</url>
      <title>DEV Community: Carlos Peñalver Pérez</title>
      <link>https://dev.to/forzau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/forzau"/>
    <language>en</language>
    <item>
      <title>How Data Preprocessing Impacts Machine Learning Models in Clinical Prediction</title>
      <dc:creator>Carlos Peñalver Pérez</dc:creator>
      <pubDate>Wed, 13 May 2026 15:08:00 +0000</pubDate>
      <link>https://dev.to/evolve-space/how-data-preprocessing-impacts-machine-learning-models-in-clinical-prediction-32dp</link>
      <guid>https://dev.to/evolve-space/how-data-preprocessing-impacts-machine-learning-models-in-clinical-prediction-32dp</guid>
      <description>&lt;p&gt;One of the ideas I wanted to explore in this project was simple: &lt;strong&gt;how much does data preprocessing really affect the performance of Machine Learning models?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In clinical prediction problems, this question becomes especially relevant. A model may achieve good overall accuracy, but still fail to detect the most important cases: patients at risk. For that reason, I wanted to focus not only on accuracy, but also on metrics such as recall, F1-score and the behaviour of the model on minority classes.&lt;/p&gt;

&lt;h2&gt;The datasets&lt;/h2&gt;

&lt;p&gt;For this project, I worked with three public clinical datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diabetes Dataset&lt;/strong&gt;: used to predict diabetes from variables such as glucose, blood pressure, insulin, BMI and age.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare Stroke Dataset&lt;/strong&gt;: focused on predicting stroke risk using demographic, clinical and lifestyle-related variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thyroid Disease Dataset&lt;/strong&gt;: related to thyroid disease detection using clinical, hormonal and categorical features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each dataset presented different challenges: some had clinically invalid values, while others contained missing data, categorical variables, or strong class imbalance. This made them useful for testing how different preprocessing strategies affect different types of models.&lt;/p&gt;
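&lt;p&gt;As a sketch of the invalid-value step: in the diabetes data, a zero glucose or blood pressure reading is physiologically impossible and really means "missing". The sample values and column names below are illustrative, not taken from the project:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sample resembling the diabetes dataset, where a zero in
# Glucose or BloodPressure is physiologically impossible.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "Outcome": [1, 0, 1, 0],
})

# Treat clinically invalid zeros as missing values.
invalid_as_nan = ["Glucose", "BloodPressure"]
df[invalid_as_nan] = df[invalid_as_nan].replace(0, np.nan)

# Median imputation is robust to the skew typical of clinical variables.
df[invalid_as_nan] = df[invalid_as_nan].fillna(df[invalid_as_nan].median())
```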

&lt;h2&gt;The process&lt;/h2&gt;

&lt;p&gt;The main goal was not to find the best possible model, but to compare how models behave before and after preprocessing.&lt;/p&gt;

&lt;p&gt;I tested several algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;Support Vector Machine&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;li&gt;Naive Bayes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The preprocessing techniques included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing value imputation&lt;/li&gt;
&lt;li&gt;Treatment of clinically invalid values&lt;/li&gt;
&lt;li&gt;One-Hot Encoding for categorical variables&lt;/li&gt;
&lt;li&gt;Feature scaling&lt;/li&gt;
&lt;li&gt;Class balancing with SMOTE&lt;/li&gt;
&lt;li&gt;Class weighting with &lt;code&gt;class_weight="balanced"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scale_pos_weight&lt;/code&gt; for XGBoost&lt;/li&gt;
&lt;li&gt;Dimensionality reduction with PCA&lt;/li&gt;
&lt;li&gt;Feature selection with SelectKBest&lt;/li&gt;
&lt;/ul&gt;
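&lt;p&gt;A minimal sketch of how several of these steps can be chained with scikit-learn. The column names and sample data are hypothetical, and &lt;code&gt;class_weight="balanced"&lt;/code&gt; stands in for SMOTE (which lives in the separate imbalanced-learn package) so the example needs only scikit-learn:&lt;/p&gt;

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stroke-like data: one numeric and one categorical feature.
X = pd.DataFrame({
    "age": [67, 54, None, 49, 80, 35],
    "smoking_status": ["smokes", "never", "never", "formerly", "smokes", "never"],
})
y = [1, 0, 0, 0, 1, 0]

preprocess = ColumnTransformer([
    # Impute missing values, then scale numeric features.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age"]),
    # One-hot encode categorical variables.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoking_status"]),
])

# class_weight="balanced" stands in for SMOTE here so the example
# stays self-contained.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X, y)
```

&lt;p&gt;Wrapping the preprocessing inside the pipeline also means the imputation and scaling statistics are learned from the training split only, which avoids leaking information into the evaluation.&lt;/p&gt;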

&lt;p&gt;For each model, I compared a baseline version against one or more preprocessed versions. The results were saved as CSV files and later analysed in a summary notebook with comparative tables and visualizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uq9ujhity9e7z4a9nj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uq9ujhity9e7z4a9nj4.png" alt="Chart comparing accuracy and recall on imbalanced clinical datasets" width="800" height="468"&gt;&lt;/a&gt;&lt;br&gt;
This chart shows why accuracy alone can be misleading in imbalanced clinical datasets.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;One of the clearest findings was that &lt;strong&gt;accuracy alone can be misleading&lt;/strong&gt;, especially in imbalanced clinical datasets.&lt;/p&gt;
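&lt;p&gt;A tiny illustration with made-up numbers: on a dataset with 5% positives, a model that always predicts "no stroke" scores 95% accuracy while detecting no one at risk:&lt;/p&gt;

```python
# Illustrative only: 100 patients, 5 of them positive stroke cases.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100          # a model that always predicts "no stroke"

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)        # 0.95, which looks excellent

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)         # 0.0, misses every patient at risk
```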

&lt;p&gt;For example, in the diabetes dataset, preprocessing helped improve the detection of positive cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best strategy&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Recall key class&lt;/th&gt;
&lt;th&gt;F1-score key class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;class_weight="balanced"&lt;/td&gt;
&lt;td&gt;0.734&lt;/td&gt;
&lt;td&gt;0.704&lt;/td&gt;
&lt;td&gt;0.650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;scale_pos_weight&lt;/td&gt;
&lt;td&gt;0.760&lt;/td&gt;
&lt;td&gt;0.741&lt;/td&gt;
&lt;td&gt;0.684&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stroke&lt;/td&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;SMOTE&lt;/td&gt;
&lt;td&gt;0.751&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.240&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stroke&lt;/td&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;class_weight="balanced"&lt;/td&gt;
&lt;td&gt;0.762&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thyroid&lt;/td&gt;
&lt;td&gt;Naive Bayes&lt;/td&gt;
&lt;td&gt;SMOTE + PCA + SelectKBest&lt;/td&gt;
&lt;td&gt;0.899&lt;/td&gt;
&lt;td&gt;0.552&lt;/td&gt;
&lt;td&gt;0.457&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For Diabetes and Stroke, the key class is class 1. For Thyroid, the key class is the minority class, which is class 0 in this dataset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The stroke dataset was particularly interesting. Some baseline models achieved high accuracy, but they almost failed to detect positive stroke cases. After applying balancing strategies, recall improved significantly, although precision decreased.&lt;/p&gt;

&lt;p&gt;This trade-off is important in early detection scenarios. In some clinical contexts, detecting more possible risk cases may be preferable, even if it means accepting more false positives.&lt;/p&gt;
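&lt;p&gt;The trade-off can be reproduced on synthetic data: rebalancing with &lt;code&gt;class_weight="balanced"&lt;/code&gt; typically raises recall on the minority class while lowering precision. The data below is entirely synthetic and only stands in for the stroke dataset:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

results = {}
for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight, max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # Recall and precision on the minority (positive) class.
    results[str(weight)] = (recall_score(y_te, pred), precision_score(y_te, pred))
```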

&lt;p&gt;The thyroid dataset showed a different behaviour. Random Forest achieved almost perfect metrics from the baseline version, suggesting that the dataset contained a very strong predictive signal. However, Naive Bayes still struggled, even after preprocessing. This was a useful reminder that preprocessing helps, but it does not make every model suitable for every dataset.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;The main lesson from this project is that preprocessing should not be treated as a fixed recipe. Its impact depends on the dataset, the model and the metric we want to prioritize.&lt;/p&gt;

&lt;p&gt;I also learned that in clinical prediction problems, improving recall can be more meaningful than simply improving accuracy. A model with high accuracy but poor detection of positive cases may not be useful in practice.&lt;/p&gt;

&lt;p&gt;If I continued this project, the next steps would be to include cross-validation, perform deeper hyperparameter tuning and test the models on external clinical datasets.&lt;/p&gt;

&lt;p&gt;You can find the full project here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/forzau/Proyecto-Master-DataScience-Evolve-CarlosPenalver" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This academic project was developed during the Master in Data Science at &lt;a href="https://evolve.es" rel="noopener noreferrer"&gt;Evolve&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>pandas</category>
    </item>
  </channel>
</rss>
