From the project: Skin Cancer Detection with Convolutional Neural Networks
Skin cancer is by far the most common type of cancer.
According to one estimate, about 5.4 million cases of skin cancer are diagnosed among 3.3 million people each year (many people are diagnosed with more than one spot of skin cancer at the same time).
1 out of 5 Americans will develop skin cancer by the time they are 70.
Causes: Skin cancer occurs when errors (mutations) occur in the DNA of skin cells. The mutations cause the cells to grow out of control and form a mass of cancer cells.
Risk factors: Fair skin, light hair, freckling, moles, history of sunburns, excessive sun exposure, sunny or high-altitude climates, precancerous skin lesions, weakened immune system, etc.
If you have skin cancer, it is important to know which type you have because it affects your treatment options and your outlook (prognosis).
If you aren't sure which type of skin cancer you have, it is recommended that you ask your doctor so you are properly informed.
A doctor will usually examine all skin moles, growths, and abnormalities to determine which ones are at risk of being cancerous.
But what if the doctor is not sure?
What if we could develop a tool that could help the doctor decide with more confidence and ensure more safety for every patient?
What if, to make a decision about a patient, the doctor could have the support of advanced technology and a model that makes its determination based on a direct comparison with thousands of other cases?
This is what we were trying to achieve in this project.
When the model is ready to be used in the field, an app can be developed from it.
This app would use the phone's camera and return the type of skin anomaly along with the probability that it is cancerous. The threshold above which a case is considered cancerous or at risk could be adjusted by the user to reduce the chance of false negatives.
The app would also allow the user to upload the images to a database so that the model can continue to improve in accuracy over time.
The app would also show which parts of the image the model focused on to make its determination, and show the filters the model applied to the image.
In this way, since a computer can examine an image at a finer granularity than the human eye, the model might catch details that the doctor did not, supporting a more informed assessment.
The app cannot substitute the critical judgment of a human being, but given the power this technology offers, we feel it could be a very useful tool to support a doctor's decision.
The black box model and LIME
One of the main issues we have with Convolutional Neural Networks (and Neural Networks in general) is that even though they are very powerful and efficient, they are hard to understand from the outside.
They are what is usually called a "black-box model": we provide the model structure and the input, and it produces a result that is often very accurate, but we have no way, from the outside, to see exactly how the model arrived at that result.
The calculations are not explicit and are often not very intelligible, so it is hard for us to trust the model or, when it makes a mistake, to understand why and what went wrong.
This is why tools like LIME, that help us understand more about the model, are becoming more and more popular.
With LIME we can see which parts of the image most heavily influenced the model to assign the picture to the cancerous or benign class.
This can be extremely useful to doctors using our app: they do not need to trust the model blindly. For each image they can see which part of the picture led the model to its conclusion, judge whether the model focused on the right region or not, and make a more informed decision.
Summary:
Our data consisted of 2357 pictures of skin anomalies belonging to 9 different classes.
The goal of the project was to build a model that could classify the images, first into their 9 native classes; a second model was then built to classify the images as cancerous or benign.
Our models are all Convolutional Neural Networks. We used TensorFlow with the Keras API to build the models.
We built sequential models with convolutional and densely connected layers, and took advantage of regularizers and constraints to tune the models.
For validation, a 20% validation set was set aside at every fit of the model.
During the grid searches, a 3-fold cross validation was used.
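As a concrete sketch of this validation setup, the snippet below uses a toy architecture and placeholder data rather than the project's exact code:

```python
# Minimal sketch of the validation setup (toy model, placeholder data).
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(64, 64, 3)),
    keras.layers.Dense(5, activation="relu"),
    keras.layers.Dense(9, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

X_train = np.random.rand(100, 64, 64, 3)      # placeholder images
y_train = np.random.randint(0, 9, size=100)   # placeholder labels

# 20% of the training data is set aside as a validation set at every fit
history = model.fit(X_train, y_train, epochs=2, batch_size=10,
                    validation_split=0.2, verbose=0)

# During the grid searches, 3-fold cross validation is used instead, e.g.
# GridSearchCV(estimator=wrapped_keras_model, param_grid=params, cv=3)
```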
Finally, LIME and visualization of activation layers were used for model explainability.
The best 9-class model reached a mean accuracy (calculated over 10 samples) between 70% and 80% on the training set.
When evaluated on the holdout test set, the average accuracy was between 15% and 20%.
The binary classification model had a mean recall of 80% and an f1 of 85% on the training set.
On the test set we obtained a recall of around 65% and an f1 of around 70%, both with a classification threshold of 0.5.
When we lowered the threshold to 0.3, the model reached a recall of about 85% on the test set.
Data Understanding:
Let us dig deeper into what each of these classes is; we will preview one image for each class to get a visual sense of what our model is going to be studying.
In particular, we will divide the classes into two macro classes, benign and malignant, since we will also build a model to determine whether the image is ultimately of a benign or cancerous nature.
The first five classes, dermatofibroma, pigmented benign keratosis, seborrheic keratosis, nevus, and vascular lesion, are benign, while the other four classes, actinic keratosis, basal cell carcinoma, melanoma, and squamous cell carcinoma, are malignant.
The distribution of the 9 classes is as follows:
To divide the data into the two classes, benign and cancerous, we grouped all the benign classes into one class and all the malignant classes into another, and obtained the distribution shown below.
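As a concrete illustration of this grouping, here is a small sketch; the class names follow the lists above, while the helper function itself is our own illustration:

```python
# Sketch: collapsing the 9 diagnosis classes into 'benign' vs 'malignant'.
BENIGN = {"dermatofibroma", "pigmented benign keratosis", "seborrheic keratosis",
          "nevus", "vascular lesion"}
MALIGNANT = {"actinic keratosis", "basal cell carcinoma", "melanoma",
             "squamous cell carcinoma"}

def to_binary_label(class_name: str) -> str:
    """Map one of the 9 original class names to the 2-class scheme."""
    name = class_name.lower()
    if name in MALIGNANT:
        return "malignant"
    if name in BENIGN:
        return "benign"
    raise ValueError(f"unknown class: {class_name}")

print([to_binary_label(c) for c in ["Melanoma", "Nevus", "Basal cell carcinoma"]])
# -> ['malignant', 'benign', 'malignant']
```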
Models:
First we created a model to identify which of the 9 classes the image belongs to.
All the models used are convolutional neural networks (see the notebooks for details about their structure).
We started with a naive model, using images of size 8x8 pixels, only one convolutional layer, one pooling layer, and one dense layer.
We then increased the image size to 32x32 and then to 64x64, and continued with this size for the rest of the models.
After that, we normalized the pixel values and started to tune the models, running grid searches on the following parameters: number of epochs, batch size, optimization algorithm, learning rate of the optimization algorithm, neuron activation function, and number of neurons. We also did some tuning to avoid overfitting, using L2 regularization and dropout regularization.
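To make this tuning stage concrete, here is a sketch of one candidate architecture with pixel normalization, L2 regularization, and dropout; the filter counts, regularization strength, and dropout rate are assumptions, and the commented-out grid search shows one possible scikit-learn wrapper rather than the project's exact code:

```python
# Sketch of one tuned candidate: normalized pixels, L2 regularization, dropout.
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_cnn(neurons=5, activation="relu", learning_rate=0.001, dropout_rate=0.2):
    model = keras.Sequential([
        layers.Rescaling(1.0 / 255, input_shape=(64, 64, 3)),      # pixel normalization
        layers.Conv2D(neurons, (3, 3), activation=activation,
                      kernel_regularizer=regularizers.l2(0.01)),   # L2 regularization
        layers.MaxPooling2D((2, 2)),
        layers.Dropout(dropout_rate),                              # dropout regularization
        layers.Flatten(),
        layers.Dense(neurons, activation=activation),
        layers.Dense(9, activation="softmax"),
    ])
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# A grid search over epochs, batch size, learning rate, activation and neuron
# count can then be run with a scikit-learn wrapper, e.g. scikeras:
# from scikeras.wrappers import KerasClassifier
# from sklearn.model_selection import GridSearchCV
# grid = GridSearchCV(KerasClassifier(model=build_cnn, epochs=10, batch_size=10),
#                     param_grid={"model__neurons": [5, 10],
#                                 "model__learning_rate": [0.001, 0.01]},
#                     cv=3)
```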
We used accuracy and loss as metrics to keep track of the progress of our model.
The model that we selected as the best one turned out to be the one we created right after tuning the number of neurons.
This model has the following parameters:
activation function = relu for each layer except the last one, which uses softmax
optimizer = Adam with learning rate 0.001
neurons = 5 for each layer except the last one, which has 9
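For reference, a minimal sketch of what this configuration could look like in Keras; the exact layer layout lives in the notebooks, so the kernel size and pooling used here are assumptions:

```python
# Sketch of the selected configuration (listed hyperparameters only).
from tensorflow import keras
from tensorflow.keras import layers

best_model = keras.Sequential([
    layers.Conv2D(5, (3, 3), activation="relu", input_shape=(64, 64, 3)),  # 5 neurons, relu
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(5, activation="relu"),                                    # 5 neurons, relu
    layers.Dense(9, activation="softmax"),                                 # 9 output classes
])
best_model.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
                   loss="sparse_categorical_crossentropy",
                   metrics=["accuracy"])
# best_model.fit(X_train, y_train, epochs=10, batch_size=10, validation_split=0.2)
```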
This model reached a mean accuracy between 70% and 80% and a mean loss between 0 and 2 on the training set.
On the holdout test set, the average accuracy was between 15% and 20% and the average loss between 6 and 14.
With this model we obtained the following confusion matrix:
Next we reorganized the images into 2 classes instead of 9: 'benign' and 'malignant'.
We tuned the model using grid searches over the same parameters as the first model.
In this case we chose two different metrics to evaluate the model: while still keeping an eye on accuracy and loss, we defined functions to extract the recall and f1 score of our model.
We chose recall as our main metric because we wanted to minimize false negatives, and we kept monitoring f1 to make sure the overall performance of the model remained good.
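Recall and f1 can be computed from batch statistics with the Keras backend; this is a sketch of that common pattern, not necessarily the exact implementation in the notebooks:

```python
# Custom recall / f1 metrics built from batch statistics (threshold at 0.5).
from tensorflow.keras import backend as K

def recall_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    possible_positives = K.sum(K.round(K.clip(y_true, 0, 1)))
    return true_positives / (possible_positives + K.epsilon())

def precision_m(y_true, y_pred):
    true_positives = K.sum(K.round(K.clip(y_true * y_pred, 0, 1)))
    predicted_positives = K.sum(K.round(K.clip(y_pred, 0, 1)))
    return true_positives / (predicted_positives + K.epsilon())

def f1_m(y_true, y_pred):
    p, r = precision_m(y_true, y_pred), recall_m(y_true, y_pred)
    return 2 * (p * r) / (p + r + K.epsilon())

# model.compile(optimizer="adam", loss="binary_crossentropy",
#               metrics=["accuracy", recall_m, f1_m])
```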
The performance of the binary classification model was a little unstable, but we found a way to select the best-performing one: on the training set it had a recall of around 80% and an f1 of roughly 85%.
On the test set we obtained a recall of around 65% and an f1 of around 70%.
When we lowered the classification threshold to 0.3, we obtained a recall of about 70% on the test set.
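A sketch of how such a threshold change can be applied to the predicted probabilities; the arrays and numbers below are placeholders:

```python
# Applying a custom decision threshold instead of the default 0.5.
import numpy as np
from sklearn.metrics import recall_score, f1_score

probs = np.array([0.10, 0.35, 0.55, 0.80])   # placeholder P(cancerous) per image
y_true = np.array([0, 1, 1, 1])              # placeholder ground truth

threshold = 0.3                               # lowered from 0.5 to favour recall
y_pred = (probs >= threshold).astype(int)

print("recall:", recall_score(y_true, y_pred))
print("f1:", f1_score(y_true, y_pred))
```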
LIME
LIME stands for Local Interpretable Model-Agnostic Explanations. Local means that it focuses on some of the results locally rather than trying to understand the whole model.
Interpretable because, as we said, it makes the model more interpretable. Model-agnostic because it works with any machine learning classifier. Explanations because it returns an explanation of why the model made the classification it did and returned that specific result.
In general, what LIME does is break the image into interpretable components called superpixels (clusters of contiguous pixels). It then generates a data set of perturbed instances by turning some of the interpretable components "off" (in this case, by making some of the superpixels in the picture gray).
For each perturbed instance, we get the probability that the skin growth is cancerous according to the model. We then fit a simple (linear) model on this data set, which is locally weighted: we care more about mistakes on perturbed instances that are more similar to the original image. In the end, we present the superpixels with the highest positive weights as the explanation, graying out everything else.
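A minimal sketch of this workflow with the lime package; the classifier function and image below are placeholders standing in for our model and data:

```python
# LIME on a single image: segment into superpixels, perturb, fit a local model.
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def predict_fn(images):
    """Placeholder for model.predict(images): returns class probabilities."""
    return np.tile([0.3, 0.7], (len(images), 1))

image = np.random.rand(64, 64, 3)             # placeholder skin-lesion image

explainer = lime_image.LimeImageExplainer()
explanation = explainer.explain_instance(
    image, predict_fn, top_labels=2, hide_color=0,
    num_samples=100)                          # use more samples in practice

# Keep only the most influential superpixels for the top label
temp, mask = explanation.get_image_and_mask(
    explanation.top_labels[0], positive_only=False,
    num_features=5, hide_rest=False)
highlighted = mark_boundaries(temp, mask)
```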
Example of an image categorized correctly by the model:
In this image, the parts that are blocked out in black and red are the parts the model ignored when making its determination. The parts shown in natural color and green are the parts the model relied on most heavily.
We can notice how the model correctly identified which part of the image to focus on: the skin anomaly itself.
Example of an image categorized incorrectly by the model:
We can see here that the model did not correctly identify the parts of the image to focus on, and made its determination based on the skin around the lesion rather than the lesion itself.
By seeing this, the doctor will be warned that the model most likely made a mistake in classifying this image.
Visualizing Activation Layers
One more thing we can offer to the AAD, to make clearer to doctors what led the model to its decision, is visualizing activation layers.
This is part of how a Convolutional Neural Network classifies an image: we can visualize the intermediate hidden layers within our CNN to uncover what sorts of features the network is extracting through its various filters.
As we mentioned before, to learn about an image a CNN applies different filters, and each resulting representation of the image is called a feature map. When we visualize activation layers, we look at the feature maps and the number of channels they contain; each channel can be visualized by slicing the tensor along that axis.
We can also visualize all of the channels from one activation layer with a loop.
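A sketch of this idea using the classic Keras pattern of building a second model whose outputs are the intermediate activations; the architecture, layer indices, and input image are illustrative:

```python
# Visualizing every channel of the first convolutional layer's feature map.
import numpy as np
import matplotlib.pyplot as plt
from tensorflow import keras
from tensorflow.keras import layers, models

cnn = keras.Sequential([
    layers.Conv2D(5, (3, 3), activation="relu", input_shape=(64, 64, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(9, activation="softmax"),
])

# One output per layer so we can inspect each intermediate activation
activation_model = models.Model(inputs=cnn.input,
                                outputs=[layer.output for layer in cnn.layers])

image = np.random.rand(1, 64, 64, 3)          # placeholder image batch of size 1
activations = activation_model.predict(image)

first_map = activations[0]                    # shape (1, 62, 62, 5): 5 channels
for channel in range(first_map.shape[-1]):    # loop over every channel
    plt.subplot(1, first_map.shape[-1], channel + 1)
    plt.imshow(first_map[0, :, :, channel], cmap="viridis")
    plt.axis("off")
plt.show()
```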
Results
We built two different models. The first identified which of 9 different classes of skin anomalies an image belongs to.
The best model we chose had the following characteristics:
image size 64x64
epochs 10, batch size 10
optimization algorithm Adam, with learning rate 0.001
activation function 'relu', number of neurons 5
This model reached an average accuracy between 70% and 80% and a loss between 0 and 2 on the training set.
When evaluated on the holdout test set, the average accuracy was between 15% and 20% and the loss between 6 and 14.
The second model we built was a binary classification model that tried to identify whether an image belonged to the 'benign' or 'cancerous' class.
This model was tuned just like the previous one in terms of image size, number of epochs, batch size, optimization algorithm, activation function, number of neurons, regularization and dropout.
The performance of this model was a little unstable, but we found a way to select the best-performing one (which changes given the stochastic nature of NNs); in general, we obtained a training recall of around 80% and an f1 of roughly 85%.
On the test set we obtained a recall of around 65% and an f1 of around 70%.
When we lowered the classification threshold to 0.3, we obtained a recall of about 70% on the test set.
We also used tools like LIME and visualization of activation layers to make the model more explainable, so that a physician who is uncertain about the model's result, or who wants to dig deeper for other reasons, can see in more depth how the model processed the image and made its determination.
Limitations
- Given the stochastic nature of Neural Networks, we were not able to obtain fully reproducible results. We hope that by building a broader database of images and retraining the model whenever the database is updated, we will be able to obtain more stable models with higher accuracy.
- There might be limitations to uploading the images to a database for patient privacy reasons, so a HIPAA form would have to be provided and signed by the patient in order to use images of their skin.
- The set of images for the 9 classes was not balanced, which might have led the first model to recognize the more populated classes better than the less populated ones. With a more extensive training database, this issue could be solved and the model could improve.
- We had technical limitations in terms of code running time, so we could not run more grid searches, expand the parameter ranges further, or increase the number of layers or neurons of the model. With more computational power, or by taking advantage of a cloud service, higher accuracy could be achieved.
Recommendations
We know that black-box models are scary and it can be hard to trust a computer with a patient's health. We are not trying to substitute the physician, with their skill and critical thinking; rather, we believe our app can be a very useful tool to support the doctor in situations of uncertainty, using the power of ever-evolving technology.
- Use the app to its full potential, looking at the LIME explanations and activation layers, and setting the threshold above which images are considered cancerous.
- Whenever a patient agrees to it, upload the images of their skin anomalies so that our database keeps growing and helps us constantly improve the model.
- Report any doubt or error so that the model can be trained better.
Next Steps
To improve our model and for a more in-depth study we could also:
- Balance the classes in the 9-class model through image augmentation, to obtain better results, and utilize more powerful tools such as transfer learning models.
- Create a function that selects only the incorrectly classified images and runs them through the model again, or through another, more powerful model (transfer learning).
- Flag images with uncertain probability. The images most likely to be misclassified are the ones where the prediction is close to 0.5. We can select a range, say 0.4 to 0.6, where the image, instead of being classified, gets flagged as uncertain and sent through the model again or through a more powerful model (see the sketch after this list).
- Create the app that physicians can use, with the possibility to add images to the dataset and periodically retrain and improve the model.
- Take a whole other set of skin lesion images and train the same model on them to increase its accuracy and flexibility.
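A tiny sketch of the flagging idea above; the 0.4 to 0.6 band and the variable names are illustrative assumptions:

```python
# Route predictions in an "uncertain" probability band to a second look.
import numpy as np

probs = np.array([0.05, 0.45, 0.58, 0.92])    # placeholder P(cancerous) per image
lower, upper = 0.4, 0.6

uncertain = (probs >= lower) & (probs <= upper)
flagged_for_review = np.where(uncertain)[0]    # indices to re-run or escalate
confident_pred = (probs[~uncertain] >= 0.5).astype(int)

print("flagged:", flagged_for_review)          # -> [1 2]
print("confident predictions:", confident_pred)
```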
For More Information
You can find my whole project at the following link on GitHub.
Sources:
American Cancer Society
Skin Cancer Foundation
Mayo Clinic