<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Carlos Peñalver Pérez</title>
    <description>The latest articles on DEV Community by Carlos Peñalver Pérez (@forzau).</description>
    <link>https://dev.to/forzau</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3928905%2F28e1e708-24b7-40af-8c9c-4bb636afa932.jpg</url>
      <title>DEV Community: Carlos Peñalver Pérez</title>
      <link>https://dev.to/forzau</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/forzau"/>
    <language>en</language>
    <item>
      <title>How Data Preprocessing Impacts Machine Learning Models in Clinical Prediction</title>
      <dc:creator>Carlos Peñalver Pérez</dc:creator>
      <pubDate>Wed, 13 May 2026 15:08:00 +0000</pubDate>
      <link>https://dev.to/evolve-space/how-data-preprocessing-impacts-machine-learning-models-in-clinical-prediction-32dp</link>
      <guid>https://dev.to/evolve-space/how-data-preprocessing-impacts-machine-learning-models-in-clinical-prediction-32dp</guid>
      <description>&lt;p&gt;One of the ideas I wanted to explore in this project was simple: &lt;strong&gt;how much does data preprocessing really affect the performance of Machine Learning models?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In clinical prediction problems, this question becomes especially relevant. A model may achieve good overall accuracy, but still fail to detect the most important cases: patients at risk. For that reason, I wanted to focus not only on accuracy, but also on metrics such as recall, F1-score and the behaviour of the model on minority classes.&lt;/p&gt;

&lt;h2&gt;The datasets&lt;/h2&gt;

&lt;p&gt;For this project, I worked with three public clinical datasets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Diabetes Dataset&lt;/strong&gt;: used to predict diabetes from variables such as glucose, blood pressure, insulin, BMI and age.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Healthcare Stroke Dataset&lt;/strong&gt;: focused on predicting stroke risk using demographic, clinical and lifestyle-related variables.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thyroid Disease Dataset&lt;/strong&gt;: related to thyroid disease detection using clinical, hormonal and categorical features.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each dataset presented different challenges: some had clinically invalid values, while others contained missing data, categorical variables, or strong class imbalance. This made them useful for testing how different preprocessing strategies affect different types of models.&lt;/p&gt;
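&lt;p&gt;As a sketch of the invalid-value step: in the diabetes data, a zero glucose or blood pressure reading is physiologically impossible and really means "missing". The sample values and column names below are illustrative, not taken from the project:&lt;/p&gt;

```python
import numpy as np
import pandas as pd

# Hypothetical sample resembling the diabetes dataset, where a zero in
# Glucose or BloodPressure is physiologically impossible.
df = pd.DataFrame({
    "Glucose": [148, 0, 183, 89],
    "BloodPressure": [72, 66, 0, 66],
    "Outcome": [1, 0, 1, 0],
})

# Treat clinically invalid zeros as missing values.
invalid_as_nan = ["Glucose", "BloodPressure"]
df[invalid_as_nan] = df[invalid_as_nan].replace(0, np.nan)

# Median imputation is robust to the skew typical of clinical variables.
df[invalid_as_nan] = df[invalid_as_nan].fillna(df[invalid_as_nan].median())
```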

&lt;h2&gt;The process&lt;/h2&gt;

&lt;p&gt;The main goal was not to find the best possible model, but to compare how models behave before and after preprocessing.&lt;/p&gt;

&lt;p&gt;I tested several algorithms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Logistic Regression&lt;/li&gt;
&lt;li&gt;XGBoost&lt;/li&gt;
&lt;li&gt;Support Vector Machine&lt;/li&gt;
&lt;li&gt;Random Forest&lt;/li&gt;
&lt;li&gt;Naive Bayes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The preprocessing techniques included:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Missing value imputation&lt;/li&gt;
&lt;li&gt;Treatment of clinically invalid values&lt;/li&gt;
&lt;li&gt;One-Hot Encoding for categorical variables&lt;/li&gt;
&lt;li&gt;Feature scaling&lt;/li&gt;
&lt;li&gt;Class balancing with SMOTE&lt;/li&gt;
&lt;li&gt;Class weighting with &lt;code&gt;class_weight="balanced"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;scale_pos_weight&lt;/code&gt; for XGBoost&lt;/li&gt;
&lt;li&gt;Dimensionality reduction with PCA&lt;/li&gt;
&lt;li&gt;Feature selection with SelectKBest&lt;/li&gt;
&lt;/ul&gt;
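&lt;p&gt;A minimal sketch of how several of these steps can be chained with scikit-learn. The column names and sample data are hypothetical, and &lt;code&gt;class_weight="balanced"&lt;/code&gt; stands in for SMOTE (which lives in the separate imbalanced-learn package) so the example needs only scikit-learn:&lt;/p&gt;

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical stroke-like data: one numeric and one categorical feature.
X = pd.DataFrame({
    "age": [67, 54, None, 49, 80, 35],
    "smoking_status": ["smokes", "never", "never", "formerly", "smokes", "never"],
})
y = [1, 0, 0, 0, 1, 0]

preprocess = ColumnTransformer([
    # Impute missing values, then scale numeric features.
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), ["age"]),
    # One-hot encode categorical variables.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["smoking_status"]),
])

# class_weight="balanced" stands in for SMOTE here so the example
# stays self-contained.
model = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X, y)
```

&lt;p&gt;Wrapping the preprocessing inside the pipeline also means the imputation and scaling statistics are learned from the training split only, which avoids leaking information into the evaluation.&lt;/p&gt;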

&lt;p&gt;For each model, I compared a baseline version against one or more preprocessed versions. The results were saved as CSV files and later analysed in a summary notebook with comparative tables and visualizations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uq9ujhity9e7z4a9nj4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5uq9ujhity9e7z4a9nj4.png" alt="Chart comparing accuracy and recall on imbalanced clinical datasets" width="800" height="468"&gt;&lt;/a&gt;&lt;br&gt;
This chart shows why accuracy alone can be misleading in imbalanced clinical datasets.&lt;/p&gt;

&lt;h2&gt;Results&lt;/h2&gt;

&lt;p&gt;One of the clearest findings was that &lt;strong&gt;accuracy alone can be misleading&lt;/strong&gt;, especially in imbalanced clinical datasets.&lt;/p&gt;
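&lt;p&gt;A tiny illustration with made-up numbers: on a dataset with 5% positives, a model that always predicts "no stroke" scores 95% accuracy while detecting no one at risk:&lt;/p&gt;

```python
# Illustrative only: 100 patients, 5 of them positive stroke cases.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100          # a model that always predicts "no stroke"

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)        # 0.95, which looks excellent

true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
recall = true_pos / sum(y_true)         # 0.0, misses every patient at risk
```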

&lt;p&gt;For example, in the diabetes dataset, preprocessing helped improve the detection of positive cases:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dataset&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Best strategy&lt;/th&gt;
&lt;th&gt;Accuracy&lt;/th&gt;
&lt;th&gt;Recall key class&lt;/th&gt;
&lt;th&gt;F1-score key class&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;class_weight="balanced"&lt;/td&gt;
&lt;td&gt;0.734&lt;/td&gt;
&lt;td&gt;0.704&lt;/td&gt;
&lt;td&gt;0.650&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Diabetes&lt;/td&gt;
&lt;td&gt;XGBoost&lt;/td&gt;
&lt;td&gt;scale_pos_weight&lt;/td&gt;
&lt;td&gt;0.760&lt;/td&gt;
&lt;td&gt;0.741&lt;/td&gt;
&lt;td&gt;0.684&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stroke&lt;/td&gt;
&lt;td&gt;Logistic Regression&lt;/td&gt;
&lt;td&gt;SMOTE&lt;/td&gt;
&lt;td&gt;0.751&lt;/td&gt;
&lt;td&gt;0.800&lt;/td&gt;
&lt;td&gt;0.240&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stroke&lt;/td&gt;
&lt;td&gt;SVM&lt;/td&gt;
&lt;td&gt;class_weight="balanced"&lt;/td&gt;
&lt;td&gt;0.762&lt;/td&gt;
&lt;td&gt;0.700&lt;/td&gt;
&lt;td&gt;0.224&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Thyroid&lt;/td&gt;
&lt;td&gt;Naive Bayes&lt;/td&gt;
&lt;td&gt;SMOTE + PCA + SelectKBest&lt;/td&gt;
&lt;td&gt;0.899&lt;/td&gt;
&lt;td&gt;0.552&lt;/td&gt;
&lt;td&gt;0.457&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;For Diabetes and Stroke, the key class is class 1. For Thyroid, the key class is the minority class, which is class 0 in this dataset.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The stroke dataset was particularly interesting. Some baseline models achieved high accuracy, but they almost failed to detect positive stroke cases. After applying balancing strategies, recall improved significantly, although precision decreased.&lt;/p&gt;

&lt;p&gt;This trade-off is important in early detection scenarios. In some clinical contexts, detecting more possible risk cases may be preferable, even if it means accepting more false positives.&lt;/p&gt;
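&lt;p&gt;The trade-off can be reproduced on synthetic data: rebalancing with &lt;code&gt;class_weight="balanced"&lt;/code&gt; typically raises recall on the minority class while lowering precision. The data below is entirely synthetic and only stands in for the stroke dataset:&lt;/p&gt;

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (roughly 10% positives).
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

results = {}
for weight in (None, "balanced"):
    clf = LogisticRegression(class_weight=weight, max_iter=1000).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    # Recall and precision on the minority (positive) class.
    results[str(weight)] = (recall_score(y_te, pred), precision_score(y_te, pred))
```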

&lt;p&gt;The thyroid dataset showed a different behaviour. Random Forest achieved almost perfect metrics from the baseline version, suggesting that the dataset contained a very strong predictive signal. However, Naive Bayes still struggled, even after preprocessing. This was a useful reminder that preprocessing helps, but it does not make every model suitable for every dataset.&lt;/p&gt;

&lt;h2&gt;What I learned&lt;/h2&gt;

&lt;p&gt;The main lesson from this project is that preprocessing should not be treated as a fixed recipe. Its impact depends on the dataset, the model and the metric we want to prioritize.&lt;/p&gt;

&lt;p&gt;I also learned that in clinical prediction problems, improving recall can be more meaningful than simply improving accuracy. A model with high accuracy but poor detection of positive cases may not be useful in practice.&lt;/p&gt;

&lt;p&gt;If I continued this project, the next steps would be to include cross-validation, perform deeper hyperparameter tuning and test the models on external clinical datasets.&lt;/p&gt;

&lt;p&gt;You can find the full project here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/forzau/Proyecto-Master-DataScience-Evolve-CarlosPenalver" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This academic project was developed during the Master in Data Science at &lt;a href="https://evolve.es" rel="noopener noreferrer"&gt;Evolve&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>pandas</category>
    </item>
  </channel>
</rss>
