Classification Metrics

Introduction

In machine learning, classification is the task of predicting the class to which input data belongs. One example would be to classify whether the text from an email (input data) is spam (one class) or not spam (another class).

When building a classification system, we need a way to evaluate the performance of the classifier. And we want to have evaluation metrics that are reflective of the classifier’s true performance.

Common Classification Metrics

  1. Accuracy:
    This is the simplest and most intuitive metric, measuring the proportion of correct predictions. It is used when the dataset is balanced and we just want a simple measure of overall correctness.
    It's calculated as (True Positives + True Negatives) / Total Predictions; a code sketch covering the first four metrics follows this list.
    Example: A class has 80% male and 20% female students. A model that simply predicts "male" for every student reaches 80% accuracy without learning anything useful, which is why accuracy can be misleading on imbalanced data.

  2. Precision:
    Precision, also known as positive predictive value, measures the proportion of true positive predictions among all positive predictions. It helps answer the question: “Of all the positive predictions made by the model, how many were correct?” The formula for precision is True Positives / (True Positives + False Positives).
    It is used where false positives are costly.
    Example: Fraud detection, where wrongly flagging a legitimate transaction (a false positive) is costly.

  3. Recall:
    Recall, or sensitivity, gauges the proportion of true positive predictions among all actual positive instances. It answers the question: “Of all the actual positive instances, how many did the model correctly predict?” The formula for recall is True Positives / (True Positives + False Negatives).
    It is used where false negatives are costly.
    Example: Sensitive medical cases, such as a patient who has cancer being told their test is negative (a false negative).

  4. F1 Score:
    The F1 score is the harmonic mean of precision and recall. It provides a balance between these two metrics, giving you a single value that considers both false positives and false negatives. The formula for the F1 score is 2 * (Precision * Recall) / (Precision + Recall).
    Used when we have an imbalanced dataset.
    Example: Sentiment Analysis.
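
As a quick illustration, here is a minimal sketch of these four metrics using scikit-learn (assumed to be installed) on a small set of made-up labels:

```python
# A minimal sketch of accuracy, precision, recall and F1 using scikit-learn.
# The labels below are made up purely for illustration.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # actual classes (1 = positive)
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]  # model predictions

print("Accuracy :", accuracy_score(y_true, y_pred))   # (TP + TN) / total predictions
print("Precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("Recall   :", recall_score(y_true, y_pred))     # TP / (TP + FN)
print("F1 score :", f1_score(y_true, y_pred))         # 2 * (P * R) / (P + R)
```

With these particular labels all four metrics happen to come out at 0.8, but on an imbalanced dataset they can diverge sharply.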

5. Specificity:
Specificity, also known as the true negative rate, measures the proportion of true negative predictions among all actual negative instances. It’s calculated as True Negatives / (True Negatives + False Positives).
The focus is on correctly identifying negatives and minimizing false alarms.
Example: Screening tests, where we want healthy patients to be correctly identified as negative rather than given a false alarm.
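
Specificity has no dedicated one-line scorer among scikit-learn's common metrics, so here is a small sketch (with made-up labels) computing it directly from the counts:

```python
# Specificity = True Negatives / (True Negatives + False Positives).
# The labels below are made up purely for illustration.
y_true = [0, 0, 0, 1, 1, 0, 0, 1, 0, 0]  # actual classes
y_pred = [0, 1, 0, 1, 0, 0, 0, 1, 0, 0]  # model predictions

tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)  # correctly predicted negatives
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false alarms

print("Specificity:", tn / (tn + fp))  # 6 / 7, roughly 0.857
```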

6. Confusion Matrix:
While not a single metric, the confusion matrix is a table that summarizes the model’s performance. It includes values for true positives, true negatives, false positives, and false negatives, providing a detailed view of classification results.

Terms to remember:

  • True Positives: the cases where we predicted Yes and the actual class was also Yes.
  • True Negatives: the cases where we predicted No and the actual class was also No.
  • False Positives: the cases where we predicted Yes but the actual class was No.
  • False Negatives: the cases where we predicted No but the actual class was Yes.
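
Here is a minimal sketch, using scikit-learn's confusion_matrix on made-up labels, that shows where each of these four terms sits in the matrix:

```python
# A minimal sketch of a confusion matrix using scikit-learn; labels are made up.
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_true, y_pred)
print(cm)  # for binary 0/1 labels the layout is [[TN, FP],
           #                                      [FN, TP]]

tn, fp, fn, tp = cm.ravel()
print(f"TP={tp}  TN={tn}  FP={fp}  FN={fn}")
```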

7. AUC-ROC:
The Receiver Operating Characteristic (ROC) curve plots the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The Area Under the Curve (AUC) of the ROC curve quantifies the model’s ability to distinguish between positive and negative classes.
It focuses on the trade-off between the true positive rate and the false positive rate across thresholds, which is particularly important when dealing with imbalanced datasets. A code sketch follows the three rates defined below.

  1. True Positive Rate:

Also termed sensitivity. The true positive rate is the proportion of positive data points that are correctly classified as positive, out of all data points that are actually positive.

  2. True Negative Rate:

Also termed specificity. The true negative rate is the proportion of negative data points that are correctly classified as negative, out of all data points that are actually negative.

  3. False Positive Rate:

The false positive rate is the proportion of actual negatives that are incorrectly classified as positive. It is equal to 1 − specificity.
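
Here is a minimal sketch using scikit-learn, with hypothetical predicted probabilities, that computes the AUC and walks the ROC curve threshold by threshold:

```python
# A minimal sketch of ROC / AUC using scikit-learn; the scores are hypothetical
# predicted probabilities for the positive class.
from sklearn.metrics import roc_auc_score, roc_curve

y_true   = [0, 0, 1, 1, 0, 1, 0, 1]
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.6, 0.7]

print("AUC:", roc_auc_score(y_true, y_scores))

# The ROC curve is just (FPR, TPR) pairs at different decision thresholds.
fpr, tpr, thresholds = roc_curve(y_true, y_scores)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th:.2f}  FPR={f:.2f}  TPR={t:.2f}")
```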

8. Matthews Correlation Coefficient (MCC):

The Matthews Correlation Coefficient is a metric that takes into account true positives, true negatives, false positives, and false negatives to provide a balanced measure of classification performance. It ranges from -1 (total disagreement) to 1 (perfect agreement), with 0 indicating no better than random chance.
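
A minimal sketch using scikit-learn's matthews_corrcoef on made-up labels:

```python
# A minimal sketch of MCC using scikit-learn; the labels are made up.
from sklearn.metrics import matthews_corrcoef

y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

print("MCC:", matthews_corrcoef(y_true, y_pred))  # between -1 and 1; 0 means no better than chance
```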

9. Log Loss (Logarithmic Loss):

Log Loss, or Cross-Entropy Loss, is a metric used in the evaluation of probabilistic classifiers. It quantifies how well the predicted probabilities match the actual class labels. Lower log loss values indicate better model performance.
It usually works well with multi-class classification.
To compute log loss, the classifier must assign a probability to each class for every sample. If there are N samples and M classes, the log loss is calculated as:

$$\text{Logarithmic Loss} = \frac{-1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log\left(p_{ij}\right)$$

  • y_ij indicates whether sample i belongs to class j (1 if it does, 0 otherwise).
  • p_ij is the predicted probability that sample i belongs to class j.
  • The range of log loss is [0, ∞). A log loss near 0 indicates good predictions; the further it is from 0, the worse the predictions.
  • Minimizing log loss generally gives you a better classifier.
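
Here is a minimal sketch using scikit-learn's log_loss on a small, made-up 3-class problem:

```python
# A minimal sketch of log loss using scikit-learn; classes and probabilities are made up.
from sklearn.metrics import log_loss

y_true = [0, 2, 1, 2]                # true class of each of the N = 4 samples
y_prob = [[0.8, 0.1, 0.1],           # predicted probability of each of the M = 3 classes,
          [0.2, 0.2, 0.6],           # one row per sample (each row sums to 1)
          [0.3, 0.6, 0.1],
          [0.1, 0.2, 0.7]]

print("Log loss:", log_loss(y_true, y_prob))  # lower is better; confident correct predictions drive it toward 0
```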

Regression Evaluation Metrics
A regression model predicts a target variable that takes continuous values. To evaluate the performance of such a model, the evaluation metrics listed below are used:

1. Mean Absolute Error (MAE)
It is the average absolute difference between predicted and actual values, so it tells us, on average, how far the predictions are from the actual output. Its limitation is that it gives no idea of the direction of the error, i.e. whether we are under-predicting or over-predicting.
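
A minimal sketch with made-up values, computing MAE both by hand and with scikit-learn:

```python
# A minimal sketch of MAE; the target values are made up.
import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

print("MAE (manual) :", np.mean(np.abs(y_true - y_pred)))   # average absolute distance
print("MAE (sklearn):", mean_absolute_error(y_true, y_pred))
```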

2. Mean Squared Error (MSE)
It is similar to mean absolute error, but instead it takes the average of the squared differences between predicted and actual values. Its main advantage is that its gradient is easier to calculate, whereas the gradient of the mean absolute error requires more complicated handling. Because the errors are squared, larger errors are penalized more heavily than smaller ones, so we can focus more on large errors.
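
A minimal sketch with made-up values showing how a single large error dominates MSE much more than MAE:

```python
# A minimal sketch comparing MAE and MSE; the values are made up and the last
# prediction is deliberately off by 10 to show how squaring magnifies it.
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 17.0])

print("MAE:", mean_absolute_error(y_true, y_pred))  # the large error is averaged in linearly
print("MSE:", mean_squared_error(y_true, y_pred))   # the large error is squared, so it dominates
```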

3. Root Mean Square Error (RMSE)
RMSE is obtained by simply taking the square root of the MSE value, which puts the error back in the same units as the target. Like MSE, it is not robust to outliers and gives higher weight to large errors in the predictions.
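
A minimal sketch with made-up values; taking the square root of the MSE gives the RMSE in the same units as the target:

```python
# A minimal sketch of RMSE as the square root of MSE; the values are made up.
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 8.0])

mse = mean_squared_error(y_true, y_pred)
print("RMSE:", np.sqrt(mse))
```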

4. Root Mean Squared Logarithmic Error (RMSLE)
Sometimes the target variable varies over a wide range of values, and we do not want to penalize overestimation of the target as heavily as underestimation. For such cases, RMSLE is used as the evaluation metric: it works on the logarithm of the values, so it penalizes under-prediction more than over-prediction of the same size.
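
A minimal sketch with made-up values showing the asymmetry: under-predicting a target by 50 is penalized more than over-predicting it by 50:

```python
# A minimal sketch of RMSLE; the values are made up. Note the asymmetric penalty.
import numpy as np
from sklearn.metrics import mean_squared_log_error

y_true = np.array([100.0, 100.0])
under  = np.array([50.0, 100.0])    # first target under-predicted by 50
over   = np.array([150.0, 100.0])   # first target over-predicted by 50

print("RMSLE (under-prediction):", np.sqrt(mean_squared_log_error(y_true, under)))  # larger
print("RMSLE (over-prediction) :", np.sqrt(mean_squared_log_error(y_true, over)))   # smaller
```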

5. R² Score
The coefficient of determination, also called the R² score, is used to evaluate the performance of a regression model. It is the proportion of the variation in the dependent (output) variable that is predictable from the independent (input) variable(s). It is used to check how well the observed results are reproduced by the model, based on the proportion of total variation explained by the model.
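
A minimal sketch with made-up values using scikit-learn's r2_score:

```python
# A minimal sketch of the R² score; the values are made up.
from sklearn.metrics import r2_score

y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.8, 4.9, 2.9, 6.8]

print("R² score:", r2_score(y_true, y_pred))  # 1.0 = perfect fit, 0.0 = no better than predicting the mean
```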

Conclusion

Choosing the right classification metric depends on the problem you're solving. Accuracy is useful when classes are balanced, but in cases where false positives or false negatives have serious consequences, metrics like precision, recall, and F1 score are more reliable. The confusion matrix provides a detailed breakdown, while AUC-ROC helps evaluate model performance across different thresholds. Other metrics like MCC and Log Loss offer deeper insights, especially for imbalanced datasets or probabilistic models.

For regression models, MAE, MSE, RMSE, and R²-score help evaluate predictions of continuous values. Each metric has its strengths, and selecting the right one ensures your model is not just working, but working well.

Summary

  • Classification helps categorize data into classes.
  • Common evaluation metrics include Accuracy, Precision, Recall, F1 Score, Specificity, AUC-ROC, and MCC.
  • The choice of metric depends on the nature of the dataset and the impact of errors.
  • Regression models predict continuous values and use metrics like MAE, MSE, RMSE, and R²-score for evaluation.
  • Understanding these metrics ensures better model performance and more reliable results.
