Classification Metrics: When and Why to Use Them
When building a classification model, evaluating its performance is crucial. Different metrics provide insights based on the problem type, class distribution, and business objectives.
1. Accuracy
When to Use:
- When classes are balanced (equal distribution of classes).
- When false positives (FP) and false negatives (FN) have equal importance.
Formula:
$$
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$
- TP = True Positives (correctly predicted positive instances)
- TN = True Negatives (correctly predicted negative instances)
- FP = False Positives (incorrectly predicted as positive)
- FN = False Negatives (incorrectly predicted as negative)
Example:
If a model correctly classifies 90 out of 100 samples, the accuracy is 90%.
Why Use It?
- Good for balanced datasets.
- Not reliable for imbalanced datasets (e.g., detecting fraud when 99% of transactions are normal).
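As a quick sanity check, accuracy can be computed directly from the four confusion-matrix counts. A minimal pure-Python sketch (the helper name `accuracy` is ours; in practice scikit-learn's `accuracy_score` does this from label arrays):

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# 90 of 100 samples classified correctly,
# e.g. 45 TP + 45 TN with 5 FP + 5 FN:
print(accuracy(tp=45, tn=45, fp=5, fn=5))  # 0.9
```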
2. Precision
When to Use:
- When false positives (FP) are costly (e.g., spam detection, where misclassifying an important email as spam is bad).
Formula:
$$
\text{Precision} = \frac{TP}{TP + FP}
$$
Example:
If a cancer detection model predicts 50 positive cases, but only 40 are actually positive, the precision is:
$$
\text{Precision} = \frac{40}{40 + 10} = 0.8 \;(80\%)
$$
Why Use It?
- Useful when false positives need to be minimized (e.g., medical diagnosis, where predicting cancer falsely can cause panic).
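The worked example above maps directly to code. A minimal sketch (helper name is ours):

```python
def precision(tp: int, fp: int) -> float:
    """Of everything the model flagged positive, what fraction really is?"""
    return tp / (tp + fp)

# 50 predicted positives, of which 40 are actually positive:
print(precision(tp=40, fp=10))  # 0.8
```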
3. Recall (Sensitivity, True Positive Rate)
When to Use:
- When false negatives (FN) are costly (e.g., detecting cancer, where missing a case could be fatal).
Formula:
$$
\text{Recall} = \frac{TP}{TP + FN}
$$
Example:
If a model detects 40 cancer cases but misses 10, recall is:
$$
\text{Recall} = \frac{40}{40 + 10} = 0.8 \;(80\%)
$$
Why Use It?
- Helps when missing positive cases is critical (e.g., fraud detection, medical diagnosis).
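Recall follows the same pattern, but divides by the actual positives rather than the predicted ones. A minimal sketch (helper name is ours):

```python
def recall(tp: int, fn: int) -> float:
    """Of all actual positives, what fraction did the model find?"""
    return tp / (tp + fn)

# 40 cancer cases detected, 10 missed:
print(recall(tp=40, fn=10))  # 0.8
```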
4. F1-Score
When to Use:
- When both precision and recall matter (e.g., fraud detection, medical tests).
Formula:
[
F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
]
Example:
If precision = 80% and recall = 70%,
$$
F1 = 2 \times \frac{0.8 \times 0.7}{0.8 + 0.7} \approx 0.747
$$
Why Use It?
- Balances precision and recall.
- Ideal when you need a single number that penalizes both false positives and false negatives, especially on imbalanced data where accuracy is misleading.
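Because F1 is the harmonic mean, it is dragged down by whichever of precision or recall is worse. A minimal sketch reproducing the example above (helper name is ours):

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.8, 0.7), 3))  # 0.747
# The harmonic mean punishes imbalance: the arithmetic mean of
# 0.99 and 0.01 is 0.5, but the F1 is close to 0:
print(round(f1_score(0.99, 0.01), 3))
```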
5. ROC-AUC (Receiver Operating Characteristic - Area Under Curve)
When to Use:
- For imbalanced datasets, to measure how well the model distinguishes between classes.
How It Works:
- The ROC curve plots the true positive rate (Recall) against the false positive rate (FPR = FP / (FP + TN)) across all classification thresholds.
- AUC (Area Under Curve) measures the model's ability to distinguish between classes.
Example:
- AUC = 1.0 → Perfect classifier.
- AUC = 0.5 → Random guessing.
- AUC < 0.5 → Worse than random.
Why Use It?
- Works well with imbalanced data (e.g., rare event detection like fraud).
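AUC has a useful probabilistic reading: it is the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one (ties counting half). A minimal O(P·N) sketch of that pairwise definition, for illustration only (scikit-learn's `roc_auc_score` is the practical choice):

```python
def roc_auc(y_true, scores):
    """P(random positive scored above random negative); ties count 1/2."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# One positive (0.35) is out-ranked by one negative (0.4),
# so 3 of the 4 positive/negative pairs are ordered correctly:
print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```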
6. Log Loss (Logarithmic Loss)
When to Use:
- For probabilistic models that output probabilities instead of hard classifications.
Formula:
$$
\text{Log Loss} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log(p_i) + (1 - y_i) \log(1 - p_i) \right]
$$
where:
- y_i = true label (1 or 0)
- p_i = predicted probability of class 1
Why Use It?
- Measures the confidence of probability predictions (e.g., used in logistic regression).
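The formula can be implemented in a few lines; the clipping step below is a common safeguard so a probability of exactly 0 or 1 never hits log(0). A minimal sketch (helper name is ours; scikit-learn's `log_loss` is the standard implementation):

```python
import math

def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true binary labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident, correct predictions give a low loss...
print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.8]), 3))  # 0.145
# ...while a confident mistake on the last sample is punished hard:
print(round(log_loss([1, 0, 1], [0.9, 0.1, 0.1]), 3))  # 0.838
```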
Choosing the Right Metric
| Scenario | Best Metric |
|---|---|
| Balanced dataset | Accuracy |
| Imbalanced dataset | Precision, Recall, F1-Score, ROC-AUC |
| False positives costly (spam filter, medical tests) | Precision |
| False negatives costly (fraud detection, cancer diagnosis) | Recall |
| Probabilistic classification (logistic regression, deep learning) | Log Loss |
Difference Between CDF and ECDF
1. CDF (Cumulative Distribution Function)
Definition:
- Mathematical function that gives the probability of a variable being less than or equal to a given value.
- Defined for any theoretical distribution, continuous (e.g., normal) or discrete (e.g., Poisson).
Formula:
$$
F(x) = P(X \leq x)
$$
Example:
For a standard normal distribution, P(X ≤ 1) ≈ 0.84, meaning about 84% of values fall at or below 1.
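The standard-normal CDF can be evaluated with the error function from the standard library. A minimal sketch (the helper name is ours; `scipy.stats.norm.cdf` is the usual tool):

```python
import math

def normal_cdf(x, mu=0.0, sigma=1.0):
    """P(X <= x) for a normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

print(round(normal_cdf(1.0), 4))  # 0.8413
```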
2. ECDF (Empirical Cumulative Distribution Function)
Definition:
- Data-driven version of the CDF, built from a finite dataset.
- Instead of a formula, it uses observed data points.
Formula:
$$
F_n(x) = \frac{\text{number of samples} \leq x}{\text{total samples}}
$$
Example:
For a dataset [2, 3, 5, 7], the ECDF at x = 5 is:
$$
F_n(5) = \frac{3}{4} = 0.75
$$
This means 75% of values are ≤ 5.
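The ECDF is just a counting exercise, which makes it easy to compute directly from data. A minimal sketch reproducing the example above (helper name is ours):

```python
def ecdf(data, x):
    """Fraction of observed samples that are <= x."""
    return sum(1 for v in data if v <= x) / len(data)

print(ecdf([2, 3, 5, 7], 5))  # 0.75
```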
Key Differences
| Feature | CDF | ECDF |
|---|---|---|
| Definition | Theoretical function | Data-driven function |
| Data Type | Any theoretical distribution (continuous or discrete) | Finite sample of observations |
| Exact or Approximate? | Exact probability | Approximate (converges to the CDF as the sample grows) |
| Use Case | Probability distributions (normal, Poisson, etc.) | Empirical analysis of sample data |