Cristopher Delgado

Pneumonia Detection

Contacts: GitHub, LinkedIn

GitHub repository containing the project

Overview

As part of my data science journey, I finally decided to dive into its biomedical side. Machine learning is playing a growing role in medical device analytics and is opening up many new opportunities in that field.

In this hypothetical scenario, I developed a deep learning model that classifies X-ray images as Pneumonia or Normal with 88% accuracy on the test set. In this post, I walk through the work that went into developing it.

Data Understanding

The X-ray images used in this project's repository are of pediatric patients. The task is binary classification: whether or not a patient has Pneumonia. The dataset comes from Kermany et al. on Mendeley; the copy on Kaggle is taken from that original source, Version 2, published on January 01, 2018. The provided validation folder contained only 16 images, which in my opinion was too few to tell whether the deep learning model was overfitting. Instead, I built a custom validation set by moving those 16 images from the val folder back into the train folder and then randomly selecting 20% of the train images from each category.
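The per-category 20% split described above can be sketched roughly as follows. This is a minimal illustration, not the code from the repository: the function name `split_train_val`, the seed, and the file names are hypothetical, and it works on lists of file names rather than moving files on disk.

```python
import random


def split_train_val(files_by_class, val_fraction=0.2, seed=42):
    """Randomly hold out a fraction of each class for validation.

    files_by_class maps a label (e.g. "NORMAL") to a list of file names.
    The original 16 val images are assumed to be merged into these lists
    before calling this function.
    """
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, val = {}, {}
    for label, files in files_by_class.items():
        shuffled = sorted(files)  # sort first so the shuffle is deterministic
        rng.shuffle(shuffled)
        n_val = round(len(shuffled) * val_fraction)
        val[label] = shuffled[:n_val]
        train[label] = shuffled[n_val:]
    return train, val


# Hypothetical file names, only to show the shape of the input.
files = {
    "NORMAL": [f"normal_{i}.jpeg" for i in range(10)],
    "PNEUMONIA": [f"pneumonia_{i}.jpeg" for i in range(20)],
}
train_files, val_files = split_train_val(files)
```

Splitting each category separately keeps the Normal-to-Pneumonia ratio in the validation set the same as in the train set.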

data_distribution

Best Model

Base Convolutional Model

The model architecture was simple: one convolutional layer with 16 filters, one hidden dense layer with 256 neurons, and an output layer with 1 neuron for binary classification. The model did well in training and appeared to generalize, as seen in the Epoch vs Loss graph, but on unseen data it classifies Pneumonia images very well and Normal images poorly. The metrics show this: Precision and Recall were both high on the training and validation sets, but on the test set Recall was high at 99.84% while Precision was low at 71.85%. I attribute this to the small number of Normal images in the training set compared with Pneumonia images.
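For readers who want to picture the architecture, here is a rough Keras sketch of a network with that shape. Only the 16 filters, the 256-neuron dense layer, and the single output neuron come from the description above; the input size, kernel size, pooling layer, activations, and optimizer are my assumptions, and the actual code in the repository may differ.

```python
import tensorflow as tf
from tensorflow.keras import layers, models


def build_base_cnn(input_shape=(64, 64, 1)):
    """Small CNN: one Conv2D(16), one Dense(256), one sigmoid output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),               # assumed image size
        layers.Conv2D(16, (3, 3), activation="relu"),  # 16 filters; 3x3 kernel assumed
        layers.MaxPooling2D((2, 2)),                   # assumed; shrinks the feature map
        layers.Flatten(),
        layers.Dense(256, activation="relu"),          # hidden dense layer
        layers.Dense(1, activation="sigmoid"),         # probability of Pneumonia
    ])
    model.compile(
        optimizer="adam",                              # assumed optimizer
        loss="binary_crossentropy",
        metrics=[
            "accuracy",
            tf.keras.metrics.Precision(name="precision"),
            tf.keras.metrics.Recall(name="recall"),
        ],
    )
    return model


model = build_base_cnn()
```

Tracking Precision and Recall alongside Accuracy matters here because the classes are imbalanced, so Accuracy alone can hide poor performance on Normal images.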

Set         Loss   Precision  Recall   Accuracy
Train       0.020  100.00%    100.00%  100.00%
Test        1.336  71.85%     99.84%   75.32%
Validation  0.097  98.56%     97.68%   97.22%
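To see how Recall can be very high while Precision is low, it helps to work through the formulas with a small example. The counts below are hypothetical and chosen only for illustration; they are not the confusion matrix from this project. Pneumonia is treated as the positive class.

```python
def precision_recall_accuracy(tp, fp, fn, tn):
    """Compute the three metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)  # of all Pneumonia predictions, how many were right
    recall = tp / (tp + fn)     # of all Pneumonia cases, how many were found
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, accuracy


# Hypothetical counts: 100 Pneumonia and 100 Normal images.
# The model catches almost every Pneumonia case but also
# labels 40 Normal images as Pneumonia.
tp, fn = 99, 1
tn, fp = 60, 40

precision, recall, accuracy = precision_recall_accuracy(tp, fp, fn, tn)
print(f"precision={precision:.2%} recall={recall:.2%} accuracy={accuracy:.2%}")
```

With these made-up counts, Recall is 99% while Precision is only about 71%, because every Normal image misclassified as Pneumonia counts as a false positive. That is the same kind of pattern the test metrics show.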

cnn_metrics

Augmentation Model

To combat the scarcity of Normal images, I used data augmentation to create synthetic examples of Normal images for the model to train on. Starting from the pre-trained model, I introduced data augmentation generators whose outputs act as brand-new images to learn from. Data augmentation resulted in better performance on unseen data. The metrics on the augmented data show that the model began to do better on Normal images, as seen in the high Precision scores on the train and validation sets. The test set used a data generator that supplied unaltered images, so the reported numbers reflect performance on the actual test set.
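The idea of augmenting training batches while leaving test images untouched can be sketched with plain NumPy. This is an illustrative stand-in, not the generators used in the project: the helper names `augment` and `batches`, the specific transforms (small shifts and brightness changes), and their ranges are all assumptions.

```python
import numpy as np


def augment(image, rng, max_shift=4, brightness=0.1):
    """Return a randomly perturbed copy of a 2-D grayscale image in [0, 1]."""
    # Random translation of a few pixels (wraps around the edges for simplicity).
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    out = np.roll(image, shift=(int(dy), int(dx)), axis=(0, 1))
    # Random brightness scaling, clipped back to the valid range.
    scale = 1.0 + rng.uniform(-brightness, brightness)
    return np.clip(out * scale, 0.0, 1.0)


def batches(images, labels, batch_size, rng=None):
    """Yield (x, y) batches; images are augmented only when rng is given."""
    for start in range(0, len(images), batch_size):
        x = images[start:start + batch_size]
        y = labels[start:start + batch_size]
        if rng is not None:
            x = np.stack([augment(img, rng) for img in x])
        yield x, y


# Random arrays standing in for X-ray images, only to show usage.
data_rng = np.random.default_rng(0)
images = data_rng.random((5, 8, 8))
labels = np.arange(5) % 2

train_batches = list(batches(images, labels, batch_size=2, rng=np.random.default_rng(1)))
test_batches = list(batches(images, labels, batch_size=2))  # no rng: unaltered images
```

Passing no random generator for the test set mirrors the choice described above: evaluation should see the real images, not perturbed ones.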

Set         Loss   Precision  Recall   Accuracy
Train       0.228  98.64%     88.79%   90.78%
Test        0.360  88.43%     94.10%   88.62%
Validation  0.217  98.85%     89.04%   91.10%

cnn_aug_metrics

Conclusion

The best model was the augmented model. On the test set it beat the base model in Loss (0.360 vs 1.336), Precision (88.43% vs 71.85%), and Accuracy (88.62% vs 75.32%), at the cost of somewhat lower Recall (94.10% vs 99.84%). It is slightly more complex than its base version, but because it learned from augmented data it performs better on Normal images. Please check out the entire project in the GitHub repository, where I try many more iterations, including Transfer Learning!