While interviewing, I have often been asked to describe some of my projects in detail. To help myself walk through my process and the decisions I made, I decided to write a blog post about one of my favorites: using a neural network to detect pneumonia in X-ray images. It has been a while since I visited this project, so I decided to try adding some upgrades to my previous model. The dataset I used for my actual project is 7.9 GB, so here I will be referring to the smaller dataset from Kaggle found here.
Since I am still fairly new to working with neural networks, I usually don't come up with the most efficient models, and my data takes a while to run. So for today I will be showing you how I organize my data and set it up for the model.
Getting a first impression of the Data
Before even looking at the data, it was important for me to get a business understanding of it so that I could plan which metrics I wanted to use to evaluate my model. A neural network for pneumonia detection could be used in rural areas that lack expertise, or simply to lessen a doctor's workload in a busy hospital when someone comes in for a recovery checkup. With this in mind, I determined that if the machine did get a diagnosis wrong, I would much rather it diagnose a healthy person as having pneumonia, since the doctor would most likely double-check any image flagged as pneumonia, than have it tell the doctor that someone who has pneumonia is healthy. In other words, false negatives are far more costly than false positives, which means I needed to really focus on recall as a metric, while keeping the others in mind as well.
Now that I had an understanding of the problem, it was time to take a look at the downloaded data. The author of the Kaggle post was nice enough to have already separated the data into training, test, and validation folders. So let's import it and take a look at what we have.
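Since recall is the metric I care about most here, a quick sketch of how it falls out of a confusion matrix may help (the counts below are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts for a pneumonia classifier
true_positives = 90   # pneumonia cases correctly flagged
false_negatives = 10  # pneumonia cases called healthy (the costly error)
false_positives = 25  # healthy patients flagged for a doctor's double-check

# Recall: of all the patients who actually have pneumonia,
# what fraction did the model catch?
recall = true_positives / (true_positives + false_negatives)
print(recall)  # 0.9

# Precision suffers from the extra false positives, a trade-off we
# accept since those images just get reviewed by a doctor anyway
precision = true_positives / (true_positives + false_positives)
print(round(precision, 3))  # 0.783
```

Pushing recall up usually pulls precision down, which is exactly the trade this use case wants.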
```python
import os

base_dir = "../../Downloads/chest_xray"
train_dir = os.path.join(base_dir, 'train')
validation_dir = os.path.join(base_dir, 'val')
test_dir = os.path.join(base_dir, 'test')

train_normal = os.path.join(train_dir, 'NORMAL')
print('Total training normal images:', len(os.listdir(train_normal)))
train_pneu = os.path.join(train_dir, 'PNEUMONIA')
print('Total training pneu images:', len(os.listdir(train_pneu)))

val_normal = os.path.join(validation_dir, 'NORMAL')
print('Total validation normal images:', len(os.listdir(val_normal)))
val_pneu = os.path.join(validation_dir, 'PNEUMONIA')
print('Total validation pneu images:', len(os.listdir(val_pneu)))

test_normal = os.path.join(test_dir, 'NORMAL')
print('Total test normal images:', len(os.listdir(test_normal)))
test_pneu = os.path.join(test_dir, 'PNEUMONIA')
print('Total test pneu images:', len(os.listdir(test_pneu)))
```
Interesting: the training data has quite a big imbalance, with far more "PNEUMONIA" X-rays than "NORMAL" ones, while the test and validation folders are pretty balanced. We can correct for this bias by giving our neural network class weights before we run it, so we might as well calculate those weights now.
```python
# Weight each class by total / (2 * class_count), so the
# minority class contributes more per image to the loss
n_normal = len(os.listdir(train_normal))
n_pneu = len(os.listdir(train_pneu))
total = n_normal + n_pneu

weight_norm = (1 / n_normal) * total / 2.0
weight_pneu = (1 / n_pneu) * total / 2.0

class_weight = {0: weight_norm, 1: weight_pneu}
class_weight
```

```
{0: 1.9448173005219984, 1: 0.6730322580645162}
```
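As a sanity check, these weights make each class contribute the same weighted total to the loss. A minimal sketch, using the training counts implied by the output above (1,341 normal and 3,875 pneumonia images):

```python
n_normal, n_pneu = 1341, 3875  # training counts implied by the weights above
total = n_normal + n_pneu      # 5216

weight_norm = total / (2 * n_normal)
weight_pneu = total / (2 * n_pneu)

# Both classes now carry the same weighted count, total / 2 (~2608)
print(weight_norm * n_normal)
print(weight_pneu * n_pneu)
print({0: weight_norm, 1: weight_pneu})
```

Later, this dictionary gets passed straight to Keras via `model.fit(..., class_weight=class_weight)`, which scales each sample's contribution to the loss by its class weight.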
Prepping the images
For this part we have to get a general understanding of how a computer "sees" pictures. Images are made up of pixels arranged in rows and columns. To you these are just very tiny dots of color; to give you a sense of how small, a 1920 x 1080 HD TV is 1920 pixels wide and 1080 pixels tall. But the computer only understands numbers, so to convert colors to numbers we use various color models, the most well known being RGB, the Red Green Blue model. This gives the computer a value from 0-255 for how much of each color is in each pixel.
Now we need to make it so our model can easily understand our images and find the similarities and differences between them. I am going to use Keras's ImageDataGenerator, a data augmenter, to resize every image to 224x224 pixels so the model always sees inputs of the same shape. That is not enough on its own though: the pixel values still range from 0-255, which is hard for our model to process, so to make things easier for the computer we are going to rescale them by a factor of 1/255 so they land between 0 and 1.
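To make the pixel representation concrete, here is a minimal sketch (using NumPy and a made-up 2x2 image rather than a real X-ray) of what the 0-255 RGB values look like and what the 1/255 rescale does:

```python
import numpy as np

# A hypothetical 2x2 RGB image: each pixel holds three 0-255 values
image = np.array([
    [[255, 0, 0], [0, 255, 0]],     # red pixel, green pixel
    [[0, 0, 255], [255, 255, 255]]  # blue pixel, white pixel
], dtype=np.uint8)

print(image.shape)  # (2, 2, 3): rows, columns, color channels

# The same rescaling the generators apply: map 0-255 into 0-1
scaled = image.astype(np.float32) / 255.0
print(scaled.min(), scaled.max())  # 0.0 1.0
```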
```python
from tensorflow import keras

image_size = 224  # All images will be resized to 224x224
batch_size = 8

# Rescale all images by 1./255; image augmentation can be added here later
train_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)
validation_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)
test_datagen = keras.preprocessing.image.ImageDataGenerator(rescale=1./255)

# Flow training images in batches of 8 using the train_datagen generator
train_generator = train_datagen.flow_from_directory(
    train_dir,  # Source directory for the training images
    target_size=(image_size, image_size),
    batch_size=batch_size,
    # Since we use binary_crossentropy loss, we need binary labels
    class_mode='binary')

# Flow validation images in batches of 8 using the validation_datagen generator
validation_generator = validation_datagen.flow_from_directory(
    validation_dir,  # Source directory for the validation images
    target_size=(image_size, image_size),
    batch_size=batch_size,
    class_mode='binary')

# Flow test images in batches of 8 using the test_datagen generator;
# shuffle=False keeps predictions aligned with the file order for evaluation
test_generator = test_datagen.flow_from_directory(
    test_dir,  # Source directory for the test images
    target_size=(image_size, image_size),
    batch_size=batch_size,
    class_mode='binary',
    shuffle=False)
```
Conclusion
Now our data is set up in a state that will be easily understood by the computer. Next we can create our model to process the data, but that is a whole different mess, so I will continue with it next week. Stay tuned for Using a Neural Network Pt.2.