Self Study: Data Science - Machine Learning journey : Day 3 (R Programming | ROC and AUC Curves)

#beginners #datascience #machinelearning #tutorial

Before getting into this topic, you shall optionally refer Day 1 post for the basics of ROC and AUC, Day 2 post to install R package and R Studio Desktop.

Subscribe to my Youtube channel: MyDigitalWorld - Cloud, DevOps, Data Science

Day 1 video : https://youtu.be/DPjFVNuMHaE
Day 2 video : https://youtu.be/WyFoSSvbNRA

Assume you have a set of 100 candidates preparing for a competitive examination. We are going to draw a ROC and AUC chart in R plotting the number of hours spent learning or preparing for the exam vs the result obtained, either pass or fail. We will ROC to compare the different thresholds and find the best threshold and AUC to determine which ML algorithm performed best.

1. Import libraries for ROC and Random forest algorithm

library(pROC) # install with install.packages("pROC")
library(randomForest) # install with install.packages("randomForest")

2. Generate random numbers for 100 samples for number of hours spent in learning

set.seed(420) # random number generation ensures that you get the same result if you start with that same seed each time you run the same process

num.samples <- 100

## generate 100 values from a normal distribution with
## mean 50 and standard deviation 12, then sort them
elearn <- sort(rnorm(n=num.samples, mean=50, sd=12))

3. Now we will decide if a sample is pass or not

## rank(elearn) returns 1 for the least time spent, 2 for the second least time, ...
##              ... and it returns 100 for the highest time spent for exam preparation
## So what we do is generate a random number between 0 and 1. Then we see if
## that number is less than rank/100. So, for the least sample, rank = 1.
## This sample will be classified "pass" if we get a random number less than
## 1/100. For the second least sample, rank = 2, we get another random
## number between 0 and 1 and classify this sample "pass" if that random
## number is < 2/100. We repeat that process for all 100 samples
pass <- ifelse(test=(runif(n=num.samples) < (rank(elearn)/num.samples)), 
  yes=1, no=0)
pass ## print out the contents of "pass" to show us which samples were
      ## classified "pass" with 1, and which samples were classified
      ## "not pass" with 0.

4. Plot the data

plot(x=elearn, y=pass)

5. Fit a logistic regression to the data

glm.fit=glm(pass ~ elearn, family=binomial)
lines(elearn, glm.fit$fitted.values)

6. Draw ROC and AUC using pROC

## ROC graphs should be square, since the x and y axes
## both go from 0 to 1. However, the window in which I draw them isn't square
## so extra whitespace is added to pad the sides.
roc(pass, glm.fit$fitted.values, plot=TRUE)

## Now let's configure R so that it prints the graph as a square.
##
par(pty = "s") ## pty sets the aspect ratio of the plot region. Two options:
##                "s" - creates a square plotting region
##                "m" - (the default) creates a maximal plotting region
roc(pass, glm.fit$fitted.values, plot=TRUE)

## NOTE: By default, roc() uses specificity on the x-axis and the values range
## from 1 to 0.  To use 1-specificity (i.e. the 
## False Positive Rate) on the x-axis, set "legacy.axes" to TRUE.
roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE)

## If you want to rename the x and y axes...
roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage")

## We can also change the color of the ROC line, and make it wider...
roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4)

## If we want to find out the optimal threshold we can store the 
## data used to make the ROC graph in a variable...
roc.info <- roc(pass, glm.fit$fitted.values, legacy.axes=TRUE)
str(roc.info)

## and then extract just the information that we want from that variable.
roc.df <- data.frame(
  tpp=roc.info$sensitivities*100, ## tpp = true positive percentage
  fpp=(1 - roc.info$specificities)*100, ## fpp = false positive precentage
  thresholds=roc.info$thresholds)

head(roc.df) ## head() will show us the values for the upper right-hand corner
             ## of the ROC graph, when the threshold is so low 
             ## (negative infinity) that every single sample is called "pass".
             ## Thus TPP = 100% and FPP = 100%

tail(roc.df) ## tail() will show us the values for the lower left-hand corner
             ## of the ROC graph, when the threshold is so high (infinity) 
             ## that every single sample is called "not pass". 
             ## Thus, TPP = 0% and FPP = 0%

## now let's look at the thresholds between TPP 60% and 80%...
roc.df[roc.df$tpp > 60 & roc.df$tpp < 80,]

## We can calculate the area under the curve...
roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)

## ...and the partial area under the curve.
roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE, print.auc.x=45, partial.auc=c(100, 90), auc.polygon = TRUE, auc.polygon.col = "#377eb822")

7. Fit the data with a random forest

rf.model <- randomForest(factor(pass) ~ elearn)

## ROC for random forest
roc(pass, rf.model$votes[,1], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#4daf4a", lwd=4, print.auc=TRUE)

8. Layer logistic regression and random forest ROC graphs

roc(pass, glm.fit$fitted.values, plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Postive Percentage", col="#377eb8", lwd=4, print.auc=TRUE)

plot.roc(pass, rf.model$votes[,1], percent=TRUE, col="#4daf4a", lwd=4, print.auc=TRUE, add=TRUE, print.auc.y=40)
legend("bottomright", legend=c("Logisitic Regression", "Random Forest"), col=c("#377eb8", "#4daf4a"), lwd=4)