Clébio Júnior

Going Beyond Accuracy: Understanding the Balanced Accuracy, Precision, Recall and F1-score.

A tutorial about the metrics used to validate machine learning models: balanced accuracy, precision, recall and F1-score. A Portuguese version of this tutorial is available here.

In a data science project, one of the most anticipated steps is the development of a machine learning model. This step involves training and validating the model, and one of the most commonly used validation metrics is accuracy. But how much does accuracy really tell us about how effectively the model classifies two or more classes?

This post describes other metrics that give you additional perspectives on your model's performance, especially with imbalanced datasets, that is, datasets in which some classes have many more examples than others. The metrics covered here are balanced accuracy, precision, recall and F1-score.

All the metrics described in this post produce a score between 0 and 1, where 0 is the worst possible result and 1 is the best. However, each metric has its own interpretation.


Confusion Matrix

Before looking at how the metrics work, you need to know what a confusion matrix is, because it will be the basis for calculating each metric. This matrix shows how each class, "yes" or "no", was predicted: the rows are the true classes and the columns are the predictions, so it is possible to see exactly how each class was classified. Table 1 shows this kind of matrix.

|           | Predicted: No | Predicted: Yes |
| --------- | ------------- | -------------- |
| True: No  | TN            | FP             |
| True: Yes | FN            | TP             |

Table 1: Confusion matrix where the "no" and "yes" classes are related to the predictions made by a machine learning model. TN, FP, FN, and TP are acronyms that stand for "true negative", "false positive", "false negative" and "true positive" respectively.

That said, the correct classifications of the “no” class are defined as true negatives (TN), while the correct classifications of the “yes” class are called true positives (TP). Misclassifications of “no” as “yes” are referred to as false positives (FP), and incorrect classifications of “yes” as “no” are known as false negatives (FN).

Table 2 shows the same matrix as Table 1, now filled with example values from a machine learning model built in a data science project to predict fraudulent banking transactions. The values 101,668, 3, 36, and 95 represent, respectively, TN, FP, FN, and TP. For more information about the referenced data science project, visit the provided link.

|           | Predicted: No | Predicted: Yes |
| --------- | ------------- | -------------- |
| True: No  | 101,668       | 3              |
| True: Yes | 36            | 95             |

Table 2: Confusion matrix with the results of a machine learning model for predicting fraudulent banking transactions. The values 101,668, 3, 36, and 95 represent, respectively, the number of TN, FP, FN, and TP.
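If you want to build a matrix like this yourself, the snippet below is a minimal sketch using scikit-learn's `confusion_matrix`. The arrays `y_true` and `y_pred` are hypothetical placeholders (0 for "no", 1 for "yes"), not the actual data from the referenced project.

```python
# Minimal sketch: building a confusion matrix with scikit-learn.
# y_true and y_pred are hypothetical arrays (0 = "no", 1 = "yes");
# in a real project they would come from your test set and your trained model.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 1, 1, 1])  # true classes
y_pred = np.array([0, 0, 1, 0, 1, 1])  # model predictions

# Rows are the true classes and columns are the predictions,
# so the layout matches Table 1: [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
tn, fp, fn, tp = cm.ravel()
print(cm)              # [[2 1]
                       #  [1 2]]
print(tn, fp, fn, tp)  # 2 1 1 2
```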

Balanced Accuracy

Accuracy basically calculates all correct predictions (TP and TN) divided by the total number of predictions, that is, all correct and incorrect ones (TP + TN + FP + FN), as shown in Equation 1. However, when there is a highly imbalanced class, accuracy is not a good metric to use. As can be seen in the equation, the high number of TN classifications can mask the low number of TP classifications — giving a misleading impression that the model is performing well in classifying the data.

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

Equation 1: Accuracy.

An alternative to accuracy is balanced accuracy, which is not affected by class imbalance because it is calculated based on the true positive rate and the true negative rate, as shown in Equation 2. This approach provides a more reliable measure of the model’s performance across both classes.

$$\text{Balanced Accuracy} = \frac{1}{2}\left(\frac{TP}{TP + FN} + \frac{TN}{TN + FP}\right)$$

Equation 2: Balanced Accuracy.

To illustrate, the values of accuracy and balanced accuracy will be calculated using the data from Table 2. The resulting accuracy is 0.9996, which might initially suggest that the model correctly classified almost all instances and is performing exceptionally well. However, most of the correct predictions come from the majority class, which skews the result.

When we use balanced accuracy, which gives equal weight to the performance on each class, the value is 0.8626. This provides a more realistic measure of how well the model performs across both classes.
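As a quick sanity check, the snippet below reproduces both numbers directly from the Table 2 counts. It is plain Python arithmetic applying Equations 1 and 2, not code from the original project.

```python
# Accuracy and balanced accuracy computed from the Table 2 counts.
tn, fp, fn, tp = 101_668, 3, 36, 95

accuracy = (tp + tn) / (tp + tn + fp + fn)  # Equation 1
tpr = tp / (tp + fn)  # true positive rate (recall of the "yes" class)
tnr = tn / (tn + fp)  # true negative rate (recall of the "no" class)
balanced_accuracy = (tpr + tnr) / 2         # Equation 2

print(f"accuracy          = {accuracy:.4f}")           # 0.9996
print(f"balanced accuracy = {balanced_accuracy:.4f}")  # 0.8626
```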

Even with balanced accuracy, we still only get a global view of overall correctness, so we cannot see how well the model performed on a specific class of interest. In our example, how well did the model identify fraudulent transactions? What percentage of the “yes” class was classified correctly?

Precision

We now understand balanced accuracy and how it provides a global view of the model’s performance across all classes. However, it is also important to examine the model’s classification ability in more detail. In our fraud detection example, how well can the model correctly identify a transaction as truly fraudulent?

The metric used to answer this question is precision, which measures the percentage of correct positive predictions made by the model. This metric relates the number of true positives (TP) to the sum of TP and false positives (FP), as shown in Equation 3.

$$\text{Precision} = \frac{TP}{TP + FP}$$

Equation 3: Precision.

To better interpret this metric, imagine you have a distant target to hit. If you take 100 shots and hit the target 70 times, your precision is 70%. The same logic applies to interpreting the precision of a machine learning model. In our example from Table 2, the precision is 0.9694 or 96.94%. This means that for every 100 positive predictions, the model correctly identifies approximately 97.
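The same number can be reproduced with a couple of lines of Python, applying Equation 3 to the Table 2 counts:

```python
# Precision from the Table 2 counts (Equation 3).
tp, fp = 95, 3
precision = tp / (tp + fp)
print(f"precision = {precision:.4f}")  # 0.9694
```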

Recall

In addition to precision, which shows how well the model can differentiate between classes, it is also important to know how many fraudulent transactions were correctly identified in our example from Table 2. For this reason, we consider the recall metric, also known as sensitivity. This metric measures how well a model can recognize instances of a specific class. Recall is calculated by dividing the number of true positives (TP) by the sum of TP and false negatives (FN) — in other words, the “yes” instances that were incorrectly classified.

$$\text{Recall} = \frac{TP}{TP + FN}$$

Equation 4: Recall.

In our example from Table 2, the recall value is 0.7252 or 72.52%. This result shows that the model correctly classified approximately 73% of the “yes” instances. Therefore, this metric can be used to report the percentage of fraudulent transactions that the model is able to correctly identify.
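Again, applying Equation 4 to the Table 2 counts reproduces this value:

```python
# Recall from the Table 2 counts (Equation 4).
tp, fn = 95, 36
recall = tp / (tp + fn)
print(f"recall = {recall:.4f}")  # 0.7252
```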

F1 Score

After reviewing precision and recall, you might be thinking that these metrics are important for evaluating model performance. After all, the better a model can differentiate between classes and correctly predict the “yes” class of interest, the better its overall performance.

So, how can we take both metrics into account when assessing a model’s performance?

In this context, one metric we can use is the F1 score. The F1 score is basically the harmonic mean of precision and recall, as shown in Equation 5. In the example from Table 2, the F1 score is 0.8297. This metric can be particularly useful when developing new models to determine which one achieves the best performance.

$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

Equation 5: F1 Score.
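Putting the two previous metrics together, the snippet below reproduces the 0.8297 value from the Table 2 counts using Equation 5:

```python
# F1 score from the Table 2 counts (Equation 5).
tp, fp, fn = 95, 3, 36
precision = tp / (tp + fp)  # 0.9694
recall = tp / (tp + fn)     # 0.7252
f1 = 2 * precision * recall / (precision + recall)
print(f"F1 score = {f1:.4f}")  # 0.8297
```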

Conclusion

In this post, we explored several metrics commonly used to evaluate the performance of a machine learning model. We learned that accuracy is not always the best validation metric and can sometimes give a misleading impression of the model’s effectiveness.

For imbalanced classes, a more appropriate metric is balanced accuracy, which provides a global view of the model’s performance across all classes.

To get a more detailed view of individual classes, we rely on precision, which shows how many of the model's positive predictions were actually correct, and recall, which indicates how many instances of a particular class were correctly identified by the model. To evaluate both metrics simultaneously, we use the F1 score.

Finally, I hope you enjoyed this post and gained a better understanding of alternative metrics to assess your models. See you next time!
