Accuracy is often the first metric people learn in machine learning.
Train a model.
Evaluate it.
See a number like:
Accuracy: 95%
At first glance, that looks excellent. A model that is correct 95% of the time must be good.
But in many real-world problems, accuracy can be the most misleading number in the entire pipeline.
Sometimes, a model with 95% accuracy is completely useless.
What Accuracy Actually Measures
Accuracy is defined as:
Accuracy = Correct Predictions / Total Predictions
In code, it often looks like this:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_true, y_pred)
It simply measures the fraction of predictions that match the true labels.
The problem is that this number does not tell you what kinds of mistakes the model makes.
And in many applications, those mistakes matter far more than the total percentage.
The Classic Example: Imbalanced Data
Imagine you are building a model to detect fraud in financial transactions.
Out of 10,000 transactions:
Fraudulent: 100
Legitimate: 9,900
Fraud represents only 1% of the data.
Now consider a model that predicts:
"Legitimate" for every transaction
This model never detects fraud.
But its accuracy would be:
9,900 / 10,000 = 99% accuracy
A model that misses every fraud case looks nearly perfect by accuracy alone.
In practice, it is useless.
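This failure mode is easy to reproduce. A minimal sketch with synthetic labels (not real transaction data) shows a do-nothing model scoring 99% accuracy while catching zero fraud:

```python
# A model that predicts "legitimate" (0) for every transaction still
# scores 99% accuracy when fraud (1) is only 1% of the data.
from sklearn.metrics import accuracy_score, recall_score

y_true = [1] * 100 + [0] * 9_900   # 100 fraudulent, 9,900 legitimate
y_pred = [0] * 10_000              # predict "legitimate" every time

print(accuracy_score(y_true, y_pred))  # 0.99
print(recall_score(y_true, y_pred))    # 0.0 -- every fraud case missed
```

The recall of 0.0 exposes what the 99% accuracy hides.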
Accuracy Hides the Type of Errors
In many applications, different mistakes have very different costs.
Consider medical diagnosis.
Two types of errors exist:
- False positives: predicting disease when none exists
- False negatives: missing a real disease

A false negative might delay treatment for a serious condition. But accuracy treats all mistakes the same. It does not distinguish which mistakes are dangerous.
The Confusion Matrix Tells the Real Story
Instead of relying on accuracy alone, we need to look at the confusion matrix.
Example:
from sklearn.metrics import confusion_matrix

# For binary labels, scikit-learn lays the matrix out as:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
This shows four counts:
- True negatives
- False positives
- False negatives
- True positives
These numbers reveal what accuracy hides.
You can see exactly how the model fails.
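Applied to the fraud example, a hypothetical model that catches 60 of the 100 fraud cases while raising 40 false alarms (the numbers are illustrative, not from a real model) produces a matrix like this:

```python
# Confusion matrix for the fraud example, using made-up predictions:
# the model catches 60 of 100 fraud cases and raises 40 false alarms.
from sklearn.metrics import confusion_matrix

y_true = [1] * 100 + [0] * 9_900
y_pred = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 9_860

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 9860 40 40 60
```

The 40 missed fraud cases (false negatives) are invisible in the overall accuracy, which is still above 99%.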
Better Metrics for Real Problems
Many tasks require metrics that capture different aspects of performance.
Common alternatives include:
Precision
Measures how many predicted positives are correct.
Precision = True Positives / (True Positives + False Positives)
Important when false alarms are costly.
Recall
Measures how many real positives are detected.
Recall = True Positives / (True Positives + False Negatives)
Important when missing cases is dangerous.
F1 Score
Balances precision and recall.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Useful when classes are imbalanced.
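The three metrics can be computed directly with scikit-learn. This sketch uses a hypothetical fraud model with 60 true positives, 40 false positives, and 40 false negatives (illustrative numbers, not real results):

```python
# Precision, recall, and F1 for a hypothetical fraud model that produces
# 60 true positives, 40 false positives, and 40 false negatives.
from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1] * 100 + [0] * 9_900   # 100 fraud cases, 9,900 legitimate
y_pred = [1] * 60 + [0] * 40 + [1] * 40 + [0] * 9_860

print(precision_score(y_true, y_pred))  # 60 / (60 + 40) = 0.6
print(recall_score(y_true, y_pred))     # 60 / (60 + 40) = 0.6
print(f1_score(y_true, y_pred))         # harmonic mean of the two
```

Note that all three sit far below the model's accuracy, which exceeds 99% on the same predictions.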
ROC-AUC
Evaluates how well the model separates classes across thresholds.
Often more informative than accuracy in classification tasks.
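Unlike the metrics above, ROC-AUC operates on predicted scores rather than hard labels. A toy sketch with hand-picked probabilities (illustrative values, not real model output):

```python
# ROC-AUC measures ranking quality: how often a randomly chosen positive
# example gets a higher score than a randomly chosen negative one.
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probability of the positive class

print(roc_auc_score(y_true, y_score))  # 0.75
```

Here 3 of the 4 positive/negative pairs are ranked correctly, giving an AUC of 0.75.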
Accuracy Still Has Its Place
Accuracy is not useless.
It works well when:
- classes are balanced
- the cost of errors is similar
- the problem is symmetric

But those conditions are surprisingly rare in real-world ML.
Why This Matters
Accuracy is dangerous not because it is wrong, but because it looks authoritative.
It gives a single clean number.
But machine learning performance is rarely a single-number problem.
If we optimize the wrong metric, we may build models that look good in evaluation and fail in practice.
Final Thought
Accuracy answers one question:
How often is the model correct?
But in many real systems, the better question is:
What kinds of mistakes can we afford?
Until that question is answered, accuracy alone can be the most dangerous number in machine learning.