Working with an Imbalanced Dataset Using a Neural Network and the Imblearn Library's SMOTE Function
The goal of this analysis was to help the bank of Portugal predict which people are more likely to purchase a bank term deposit. In marketing datasets it is common for only a small share of people, perhaps 8%, to actually buy. That is a serious problem here, because a model trained on such data would predict accurately only for the class we are not interested in: the people who do not buy.
Today I will show you exactly how to work with an imbalanced dataset, and the results of this analysis show how important it is to do so. Let's get right into it.
We start by importing the libraries we will use for the analysis, along with the dataset. The dataset used here was obtained from this link and is available for anyone to download.
After importing the libraries and the dataset, we quickly fit our model and test it to see how it performs without using Imblearn to resample the data.
It is now clear that we will not get any real value from our analysis unless we do something about the imbalance.
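To make the problem concrete, here is a minimal sketch of why plain accuracy is misleading on data like this. It is not my original model or the bank dataset: the data is synthetic (generated with scikit-learn's `make_classification`, with roughly 8% positives as an assumption), and the "model" is a `DummyClassifier` that always predicts the majority class.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bank data (assumption): roughly 8% positives.
X, y = make_classification(
    n_samples=5000, n_features=10, weights=[0.92, 0.08], random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# A "model" that ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_pred)
# Recall for the minority class (label 1): how many buyers were found.
baseline_recall = recall_score(y_test, y_pred)

print(f"accuracy: {baseline_accuracy:.3f}")
print(f"minority-class recall: {baseline_recall:.3f}")
```

The accuracy looks high even though not a single buyer is identified, which is why recall, the classification report and the ROC curve are more useful here than accuracy alone.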
Dealing with an Imbalanced Dataset
There are two basic ways of dealing with an imbalanced dataset, which in this case is a mixture of numbers and strings:
The up-sampling (oversampling) method
The under-sampling method
The up-sampling method is used when one class has very few samples and the dataset as a whole is not very large. In this case the best thing to do is to generate more samples of the rare class, and these samples should be as similar as possible to the original data. Several methods are used for this, such as K-nearest neighbors, random sampling, and SMOTE.
An important thing to note is that resampling is done so that the model is trained on balanced data. It is therefore important to split the data into train and test sets before performing any sampling, and to resample only the training set. The test set then keeps the original class distribution, so the predictions on it give an honest evaluation. That is what I did here.
Here are the results of my analysis: the classification report and the ROC curve.
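For readers who want to produce the same kind of output, here is a hedged sketch of how a classification report and the points of a ROC curve can be computed with scikit-learn. The synthetic data and the `LogisticRegression` classifier are placeholders I am assuming for the example; they are not the model or the numbers from my analysis.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with roughly 8% positives (assumption).
X, y = make_classification(
    n_samples=3000, n_features=10, weights=[0.92, 0.08], random_state=3
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=3
)

# Placeholder classifier; any model exposing predict_proba works the same way.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Per-class precision, recall and F1 as a text table.
report = classification_report(y_test, clf.predict(X_test), zero_division=0)

# The ROC curve needs scores or probabilities, not hard class labels.
scores = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, scores)
auc = roc_auc_score(y_test, scores)

print(report)
print(f"ROC AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` (for example with matplotlib) gives the ROC curve itself.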
Under-sampling is quite the opposite. It is used when you have a very large dataset and removing part of it would not discard important information. Where possible, it is advisable to use the oversampling method so that you keep all the insights the data can give.
More about the Imblearn library and its different methods can be found here.
Thank you so much for reading to the end, you did really well. Here is a link to my GitHub repository where the code is hosted. I am working on something like a library that would make it really easy to work with imbalanced data.