Neetika Mittal

Posted on May 30 • Originally published at en.wikipedia.org

Why Accuracy Is Not Enough: Evaluation Metrics Every AI Engineer Should Understand

#ai #llm #machinelearning #softwareengineering

Your evaluation dashboard says your model is 95% accurate. Leadership is happy. The deployment goes live.

Two weeks later, users complain that critical failures are still slipping through.

The problem is not always the model. Sometimes the problem is the metric.

As AI systems move from research prototypes into production infrastructure, evaluation becomes one of the most important engineering problems. This is especially true for modern GenAI systems, where outputs are probabilistic, subjective, and highly context dependent.

In this article, we will break down the most important evaluation metrics used in machine learning and GenAI systems, understand where they fail, and discuss how to think about evaluation from a production engineering perspective.

The Core Problem With Accuracy

Accuracy is usually the first metric people encounter in machine learning. It is simple:

Accuracy = \frac{Correct\ Predictions}{Total\ Predictions}

At first glance, it seems reasonable: a model that is right 95% of the time sounds good.

But accuracy becomes dangerous when datasets are imbalanced.

Imagine a fraud detection system:

99% of transactions are legitimate
1% are fraudulent

Now suppose your model predicts:

"Every transaction is legitimate."

The result?

99% accuracy
Completely useless fraud detection

To make the failure more obvious, imagine 10,000 transactions:

Metric	Count
Fraudulent transactions	100
Legitimate transactions	9,900
Fraud cases detected	0
Fraud cases missed	100

The model gets 9,900 predictions right, so accuracy looks excellent. But recall for fraud is 0%.

This is one of the most common evaluation mistakes in production systems: the metric looks healthy while the system fails at its actual job.
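The arithmetic above can be reproduced in a few lines. This is a minimal sketch using synthetic labels that mirror the 10,000-transaction example; the variable names are illustrative, not taken from any particular library.

```python
# Synthetic data mirroring the 10,000-transaction example (assumption: 1 = fraud).
y_true = [1] * 100 + [0] * 9_900   # 100 fraudulent, 9,900 legitimate
y_pred = [0] * 10_000              # the model says "legitimate" every time

# Accuracy: share of predictions that match the label.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the fraud class: share of actual fraud cases that were flagged.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"accuracy:     {accuracy:.2%}")
print(f"fraud recall: {fraud_recall:.2%}")
```

Accuracy comes out at 99% while fraud recall is 0%, which is the gap the dashboard would never show.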

Understanding the Confusion Matrix

Most classification metrics are derived from the confusion matrix, a table that counts predictions by their actual and predicted class.

	Predicted Positive	Predicted Negative
Actual Positive	True Positive (TP)	False Negative (FN)
Actual Negative	False Positive (FP)	True Negative (TN)

This matrix gives us a much richer understanding of model behavior. From it, we derive several important metrics.
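As a rough illustration, the four cells can be counted directly from paired labels and predictions. The helper name `confusion_counts` and the sample data below are hypothetical.

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, and TN for a binary problem."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            # Predicted positive: correct if the actual label is positive too.
            counts["TP" if actual == positive else "FP"] += 1
        else:
            # Predicted negative: a miss if the actual label was positive.
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical labels: 3 actual positives, 5 actual negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

counts = confusion_counts(y_true, y_pred)
print(dict(counts))
```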

Precision

Precision answers:

"When the model predicts positive, how often is it correct?"

Precision = \frac{TP}{TP + FP}

High precision means the model produces few false positives, so its positive predictions are more trustworthy.

Precision matters when false alarms are expensive. Common examples include spam filters, content moderation, automated bans, and financial transaction blocking.

If your spam detector incorrectly flags legitimate emails, users lose trust quickly.

Recall

Recall answers:

"How many actual positives did the model successfully detect?"

Recall = \frac{TP}{TP + FN}

High recall means the model misses fewer positive cases and catches most of the important events.

Recall matters when missing something is costly. Common examples include fraud detection, medical diagnosis, security systems, and safety monitoring.

A cancer detection model with low recall can miss life-threatening cases.
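Both formulas translate directly into code once the confusion-matrix counts are known. The sketch below uses made-up counts; returning 0.0 when the denominator is zero is one common convention, not the only one.

```python
def precision(tp, fp):
    """Of everything predicted positive, the fraction that was truly positive."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

def recall(tp, fn):
    """Of everything truly positive, the fraction the model caught."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

# Hypothetical counts: 80 true positives, 20 false alarms, 40 misses.
tp, fp, fn = 80, 20, 40

print(f"precision: {precision(tp, fp):.3f}")
print(f"recall:    {recall(tp, fn):.3f}")
```

With these counts the same model is trustworthy when it fires (precision 0.8) but still misses a third of the real positives.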

The Precision vs Recall Tradeoff

In most real-world systems, improving precision hurts recall, and improving recall hurts precision. This creates one of the central optimization problems in machine learning.

For example, lowering a classification threshold usually increases recall, but it also increases false positives, which reduces precision.

This tradeoff appears everywhere in production AI systems. Modern LLM moderation systems constantly balance aggressive filtering, user experience, safety requirements, and operational costs.

There is rarely a perfect threshold. Only tradeoffs.
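To see the tradeoff concretely, here is a small sketch that sweeps a threshold over hypothetical scores. The data and the helper `precision_recall_at` are invented for illustration; a score at or above the threshold counts as a positive prediction.

```python
def precision_recall_at(threshold, labels, scores):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical model scores with their true labels (1 = positive).
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.80, 0.50, 0.25):
    p, r = precision_recall_at(threshold, labels, scores)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data, lowering the threshold from 0.80 to 0.25 raises recall from 0.50 to 1.00 while precision falls from 1.00 to about 0.67.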

F1 Score

F1 score combines precision and recall into a single metric: their harmonic mean, which is pulled toward whichever of the two is lower.

F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}

F1 becomes useful when class imbalance exists, both precision and recall matter, and you want a single aggregate metric.

This is why F1 is heavily used in information retrieval, NLP classification, GenAI evaluations, entity extraction, and multi-label classification.

However, F1 also hides information. Two models can have identical F1 scores while behaving very differently operationally.

One model may produce many false positives. Another may miss many true positives. The same metric can hide very different failure modes.
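A quick sketch makes this concrete. The two precision/recall pairs below are hypothetical, chosen so that the models are mirror images of each other.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical Model A: few false alarms, but misses half of the positives.
model_a = {"precision": 0.9, "recall": 0.5}
# Hypothetical Model B: catches most positives, but half of its alerts are wrong.
model_b = {"precision": 0.5, "recall": 0.9}

f1_a = f1(**model_a)
f1_b = f1(**model_b)

print(f"Model A F1: {f1_a:.3f}")
print(f"Model B F1: {f1_b:.3f}")
```

Both models score about 0.643, yet one generates a review-queue problem and the other a missed-detection problem.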

When F1 Is Not Enough

F1 assumes precision and recall are equally important. That is not always true.

In fraud detection, recall may matter more because missing fraud is expensive. In automated account bans, precision may matter more because false accusations damage user trust.

In these cases, optimizing F1 can still produce the wrong system behavior.

A related metric, F-beta, lets you control this tradeoff:

F2 emphasizes recall
F0.5 emphasizes precision
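For reference, F-beta is defined as (1 + beta^2) * P * R / (beta^2 * P + R), and F1 is the special case beta = 1. A minimal sketch, using a hypothetical recall-heavy model:

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta > 1 weights recall more, beta < 1 weights precision more."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical model with modest precision and high recall.
p, r = 0.5, 0.9

for beta in (0.5, 1.0, 2.0):
    print(f"F{beta:g}: {f_beta(p, r, beta):.3f}")
```

For this model the score rises as beta grows, because larger beta rewards its strong recall and forgives its weaker precision.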

The important question is not "Which metric is popular?" The important question is "Which mistake is more expensive?"

A Production Lesson From GenAI Evaluations

One of the most interesting problems in GenAI systems is that evaluation itself becomes probabilistic.

Traditional systems often evaluate deterministic outputs:

Correct
Incorrect

But LLM systems are rarely binary. Suppose you build a ticket classification system using an LLM. The model may partially understand the issue: it might identify the correct root cause, assign the wrong severity, produce an incomplete explanation, or hallucinate remediation steps.

Now evaluation becomes much harder.

In one evaluation pipeline I worked on, aggregate metrics initially looked strong despite obvious quality problems observed by engineers. The root cause was class imbalance.

Some labels appeared thousands of times while others appeared only a handful of times. Weighted metrics looked excellent because common labels dominated the scores.

Macro F1 revealed the actual issue immediately: the system was effectively ignoring rare but operationally important classes.

This is one reason why evaluation engineering is becoming a major discipline in modern AI infrastructure.

Macro vs Micro vs Weighted F1

This distinction becomes extremely important in multi-class systems.

Micro F1

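Micro F1 aggregates true positives, false positives, and false negatives across all classes before computing the score, so common classes dominate it. In single-label multi-class problems it equals accuracy. It is useful when overall system performance matters most and the dataset distribution reflects production reality.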

Macro F1

Macro F1 computes F1 independently per class and averages them equally. This treats rare classes as equally important, which makes it useful when rare classes, fairness, or tail performance matter.

Weighted F1

Weighted F1 sits between the two. It computes F1 per class, like macro F1, but averages the scores weighted by each class's frequency (its support) rather than equally.

This is often used in production dashboards, but it can sometimes hide minority-class failures.
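The three averages can diverge sharply on imbalanced data. The sketch below uses invented per-class counts for a ticket classifier with one dominant class and one rare class; the class names and numbers are assumptions for illustration only.

```python
def f1_from_counts(tp, fp, fn):
    """F1 expressed directly in confusion-matrix counts."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Hypothetical per-class counts: "billing" dominates, "outage" is rare.
per_class = {
    "billing": {"tp": 950, "fp": 40, "fn": 10},
    "login":   {"tp": 40,  "fp": 8,  "fn": 20},
    "outage":  {"tp": 1,   "fp": 2,  "fn": 20},
}

class_f1 = {name: f1_from_counts(**c) for name, c in per_class.items()}
support = {name: c["tp"] + c["fn"] for name, c in per_class.items()}  # actual examples per class

# Macro: every class counts equally.
macro_f1 = sum(class_f1.values()) / len(class_f1)

# Weighted: per-class F1 averaged by class frequency.
weighted_f1 = sum(class_f1[n] * support[n] for n in class_f1) / sum(support.values())

# Micro: pool the counts across classes, then compute one F1.
micro_f1 = f1_from_counts(
    tp=sum(c["tp"] for c in per_class.values()),
    fp=sum(c["fp"] for c in per_class.values()),
    fn=sum(c["fn"] for c in per_class.values()),
)

print(f"macro={macro_f1:.3f}  weighted={weighted_f1:.3f}  micro={micro_f1:.3f}")
```

On these counts micro and weighted F1 both land above 0.94, while macro F1 is near 0.60 because the "outage" class is almost never detected.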

ROC-AUC

ROC-AUC stands for Receiver Operating Characteristic - Area Under the Curve.

It measures how well a model separates positive cases from negative cases across different classification thresholds.

Many classifiers do not directly output positive or negative. They output a score or probability.

For example:

Transaction	Actual Class	Model Score
A	Fraud	0.92
B	Fraud	0.81
C	Legitimate	0.40
D	Legitimate	0.12

To turn these scores into predictions, we choose a threshold.

If the threshold is 0.8:

A and B are predicted as fraud
C and D are predicted as legitimate

If the threshold is 0.3:

A, B, and C are predicted as fraud
D is predicted as legitimate

Changing the threshold changes false positives and false negatives.

The ROC curve shows this tradeoff by plotting the true positive rate, which tells you how many actual positives the model catches, against the false positive rate, which tells you how many actual negatives the model incorrectly flags.

AUC, the area under this curve, condenses the whole curve into one number. A score of 1.0 means perfect separation, 0.5 means random guessing, and anything below 0.5 means the ranking is worse than random guessing.

A high ROC-AUC means the model usually gives higher scores to positive examples than to negative examples.

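ROC-AUC is useful when comparing models because it does not depend on one fixed threshold. But on highly imbalanced datasets it can look better than the system actually feels in production: the false positive rate is divided by the large number of negatives, so even thousands of false alarms may barely move the curve.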
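ROC-AUC can also be read as the fraction of (positive, negative) pairs in which the positive example gets the higher score, with ties counted as half. The sketch below applies that pairwise view to the four transactions from the table; the helper names are illustrative, and the quadratic loop is only suitable for small examples.

```python
def roc_auc(labels, scores):
    """Pairwise ROC-AUC: share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))

def predicted_fraud(threshold, names, scores):
    """Transactions whose score is at or above the threshold."""
    return [name for name, s in zip(names, scores) if s >= threshold]

# The four transactions from the table above (1 = fraud).
names = ["A", "B", "C", "D"]
labels = [1, 1, 0, 0]
scores = [0.92, 0.81, 0.40, 0.12]

print(predicted_fraud(0.8, names, scores))
print(predicted_fraud(0.3, names, scores))
print(roc_auc(labels, scores))
```

Both fraud cases outscore both legitimate ones, so the AUC is 1.0 even though the threshold of 0.3 still produces a false positive on transaction C.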

PR-AUC

Precision-Recall AUC often becomes more informative for imbalanced problems.

Unlike ROC-AUC, PR-AUC focuses directly on precision and recall, neither of which uses true negatives, so a large negative class cannot inflate the score. This makes it especially valuable for fraud detection, security systems, rare event detection, and GenAI issue detection.

In practice, PR-AUC often tells a more honest story for production AI systems.
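The gap between the two metrics shows up even on a tiny synthetic example. The sketch below uses average precision, one common way to summarize the precision-recall curve, and compares it with pairwise ROC-AUC on invented data with 3 positives among 100 items. Ties in scores are not handled specially here.

```python
def roc_auc(labels, scores):
    """Share of positive/negative pairs where the positive scores higher."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive")
    return precision_sum / hits

# Synthetic ranking: 100 items with strictly decreasing scores,
# and the 3 positives sitting at ranks 3, 6, and 10.
scores = [1.0 - i / 100 for i in range(100)]
labels = [0] * 100
for index in (2, 5, 9):
    labels[index] = 1

auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(f"ROC-AUC:           {auc:.3f}")
print(f"average precision: {ap:.3f}")
```

ROC-AUC is about 0.96 here, while average precision is about 0.32: the positives rank above most negatives, but the handful of negatives ahead of them is enough to make most top-ranked alerts wrong.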

Calibration: The Metric Most Teams Ignore

Suppose two models both predict:

"90% confidence"

But:

Model A is actually correct 90% of the time
Model B is correct only 60% of the time

Model A is calibrated. Model B is overconfident.

Calibration measures whether model confidence matches reality. This becomes critically important in autonomous systems, medical AI, LLM judges, recommendation systems, and human-AI collaboration.

Common ways to inspect calibration include reliability diagrams, expected calibration error, and Brier score.
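Two of these are easy to sketch. The code below computes a Brier score and an expected calibration error (ECE) with equal-width confidence bins for the Model A and Model B scenario above. The sample sizes, the bin count of 10, and the treatment of each outcome as "was the prediction correct" are simplifying assumptions.

```python
def brier_score(confidences, outcomes):
    """Mean squared gap between stated confidence and the 0/1 outcome."""
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(outcomes)

def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Weighted average of |mean confidence - accuracy| over confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for c, o in zip(confidences, outcomes):
        index = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[index].append((c, o))
    total = len(outcomes)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        mean_confidence = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(o for _, o in bucket) / len(bucket)
        ece += len(bucket) / total * abs(mean_confidence - accuracy)
    return ece

# Both hypothetical models report 90% confidence on 10 predictions.
confidences = [0.9] * 10
model_a_correct = [1] * 9 + [0] * 1   # right 90% of the time
model_b_correct = [1] * 6 + [0] * 4   # right 60% of the time

for name, outcomes in (("A", model_a_correct), ("B", model_b_correct)):
    print(
        f"Model {name}: "
        f"ECE={expected_calibration_error(confidences, outcomes):.2f}  "
        f"Brier={brier_score(confidences, outcomes):.2f}"
    )
```

Model A's ECE is essentially zero, while Model B's is 0.30, the gap between its stated 90% confidence and its 60% hit rate.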

Modern LLMs are often poorly calibrated when asked to state their own confidence. This creates major challenges for autonomous agent systems, where the model must decide when to act, ask for help, or stop.

Evaluation in LLM Systems Is Different

Traditional ML evaluation usually assumes clear labels, deterministic outputs, and stable datasets.

LLM systems violate all three assumptions. Their outputs may be subjective, creative, multi-step, context dependent, and non-deterministic.

For LLM products, evaluation often needs to measure multiple dimensions at once: factual correctness, instruction following, relevance, completeness, groundedness, safety, formatting compliance, tool-use correctness, latency, and cost.

This creates new evaluation approaches.

LLM-as-a-Judge

One increasingly popular technique is using LLMs themselves as evaluators.

The idea is simple:

Generate model output
Ask another LLM to evaluate quality
Compare against expected behavior

This enables scalable evaluation pipelines for summarization, reasoning, agent workflows, coding systems, and customer support systems.

But LLM judges introduce new problems, including judge bias, prompt sensitivity, position bias, preference leakage, and self-preference bias.

Teams reduce these risks by using clear rubrics, randomizing answer order, hiding model identity, comparing judge scores against human labels, and tracking agreement between judges.
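One way to compare judge scores against human labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below uses invented pass/fail verdicts; the judge here is a hypothetical one that passes everything.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical verdicts on 10 outputs.
human = ["pass"] * 8 + ["fail"] * 2
judge = ["pass"] * 10   # a judge that never fails anything

raw_agreement = sum(h == j for h, j in zip(human, judge)) / len(human)
kappa = cohens_kappa(human, judge)

print(f"raw agreement: {raw_agreement:.0%}")
print(f"Cohen's kappa: {kappa:.2f}")
```

Raw agreement is 80%, but kappa is 0: the judge agrees with humans no more often than its "always pass" base rate would by chance.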

Evaluation systems now require evaluation themselves. This recursive problem is becoming a major research area.

Human Evaluations Still Matter

Despite advances in automated metrics, humans remain essential, especially for alignment, safety, UX quality, tone, reasoning correctness, and policy compliance.

The most reliable production evaluation systems usually combine automated metrics, human review, statistical monitoring, regression detection, and real user feedback.

No single metric captures reality completely.

Offline vs Online Evaluation

Offline evaluation happens before deployment. It includes test sets, golden datasets, regression suites, and benchmark runs.

Online evaluation happens after deployment. It includes A/B tests, shadow deployments, user feedback, production monitoring, and human review queues.

Both matter.

Offline evaluation catches regressions before users see them. Online evaluation tells you whether the system is actually working in the messy reality of production traffic.
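An offline regression check can be as simple as comparing a candidate's metrics against a stored baseline. This is a sketch under stated assumptions: the metric names, values, and tolerance are hypothetical, and every metric is assumed to be "higher is better".

```python
def find_regressions(baseline, candidate, tolerance=0.01):
    """Return metrics where the candidate is missing or worse than baseline minus tolerance."""
    regressions = {}
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions[name] = (base_value, new_value)
    return regressions

# Hypothetical metric snapshots from two evaluation runs on the same test set.
baseline = {"macro_f1": 0.72, "fraud_recall": 0.91}
candidate = {"macro_f1": 0.74, "fraud_recall": 0.85}

regressions = find_regressions(baseline, candidate)
if regressions:
    print("Blocking release, regressions found:", regressions)
else:
    print("No regressions beyond tolerance.")
```

Here the candidate improves macro F1 but drops fraud recall, so the check flags it even though the headline number went up.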

Which Metric Should You Use?

Use Case	Recommended Metric
Fraud Detection	Recall + PR-AUC
Spam Detection	Precision
Search Ranking	NDCG
Recommendation Systems	MAP / CTR
Multi-label NLP	Macro F1
GenAI Classification	F1 + Human Review
Safety Systems	Recall
LLM Judges	Agreement Metrics
Ranking Models	ROC-AUC + NDCG

Some ranking metrics deserve a quick note:

NDCG is useful when the order of results matters and top-ranked items are more important
MAP is useful for retrieval systems where multiple relevant results may exist
CTR is a behavioral business metric, but it can be noisy and biased by position, UI, and user intent
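As an illustration of the first of these, here is a minimal NDCG sketch. It uses the linear-gain form of DCG (relevance divided by log2 of rank + 1); an exponential-gain variant is also common. The relevance grades are hypothetical.

```python
import math

def dcg(relevances, k):
    """Discounted cumulative gain over the top k results (linear gain)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances, k):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical graded relevance of four results, in the order each system returned them.
perfect_order = [3, 2, 1, 0]
minor_swap = [3, 2, 0, 1]
reversed_order = [0, 1, 2, 3]

for name, ranking in (
    ("perfect", perfect_order),
    ("minor swap", minor_swap),
    ("reversed", reversed_order),
):
    print(f"{name:>10}: NDCG@4 = {ndcg(ranking, 4):.3f}")
```

Swapping two low-ranked results costs little, while burying the most relevant result at the bottom costs a lot, which is the "top-ranked items matter more" behavior described above.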

The key lesson is:

Metrics must align with operational goals.

Optimizing the wrong metric can destroy system quality while dashboards continue looking healthy.

A Practical Evaluation Checklist

Before trusting a model metric, ask:

Is the dataset imbalanced?
Which error is more expensive: false positives or false negatives?
Are rare classes hidden by averages?
Is the model calibrated?
Does offline performance match production behavior?
Are humans reviewing ambiguous cases?
Are evaluation datasets versioned?
Are regressions caught before deployment?
Are latency and cost part of the evaluation?

This checklist is often more useful than adding another metric to a dashboard.

Evaluation Is an Engineering Discipline

Many teams treat evaluation as an afterthought. In reality, evaluation systems are production infrastructure.

Good evaluation systems require more than a few metrics on a dashboard. They need dataset versioning, label quality pipelines, drift detection, continuous benchmarking, human review loops, statistical monitoring, cost-aware execution, and experiment reproducibility.

As AI systems become core infrastructure, evaluation engineering is becoming as important as model engineering itself.

Final Thoughts

Metrics are compression functions for reality. Every metric hides information.

Accuracy hides class imbalance. F1 hides confidence. ROC-AUC hides calibration. Calibration hides ranking quality.

No single number can fully describe model behavior.

The best evaluation systems combine multiple perspectives: correctness, reliability, uncertainty, safety, and operational impact.

If you are building production AI systems, choosing the right evaluation metric is often more important than choosing the right model.

Because in the end:

What you measure is what your system learns to optimize.

And poorly chosen metrics can quietly push systems in the wrong direction for months before anyone notices.

Top comments (3)

Harjot Singh • May 31

The 95%-accurate-but-critical-failures-slip-through opener is the exact trap, and the reason accuracy lies is that it averages over cases that don't matter equally. If 95% of queries are easy and the 5% it gets wrong are the high-stakes ones (the fraud, the safety case, the edge that actually hurts a user), then accuracy is a comforting number measuring the wrong thing. The metrics that matter are the ones tied to the cost of the specific failure: recall when a miss is catastrophic (you'd rather flag false positives than let one real case through), precision when false alarms are expensive, and per-segment breakdowns so the easy bulk can't hide the hard tail. The deeper point for AI engineers is that picking the metric IS the modeling decision, the metric encodes what you actually care about, and accuracy encodes you care about everything equally, which is almost never true. Measure the failure that hurts, not the average that flatters. That choose-the-metric-that-matches-the-real-cost discipline is core to how I think about evals in Moonshift. What's the metric you reach for first when the failures are rare but high-stakes, recall at a fixed precision, or a cost-weighted score?

Neetika Mittal • May 31

You are right: picking the metric is the modeling decision, and an average will always flatter the model while hiding the tail.

When failures are rare but high-stakes, I use Recall at a fixed Precision constraint first.

While cost-weighted scores look good on paper, actual business costs fluctuate constantly.
Tying evals to a strict precision boundary gives downstream teams a predictable volume of false alarms to staff for, while engineering optimizes for the rare misses.

Harjot Singh • May 31

Recall-at-fixed-Precision is a clean choice precisely because it hands downstream teams a staffing-predictable false-alarm volume, which is the operational reality cost-weighted scores lose once costs drift. The one thing I'd add: the precision boundary itself becomes something you have to monitor, because the data distribution moves under it and a fixed threshold silently changes meaning over time, so the predictable volume you promised quietly stops being predictable. That measure-on-your-real-distribution-and-watch-it-drift discipline is core to how I think about verification in Moonshift. Do you re-fit the precision constraint on a schedule, or hold it fixed and watch recall move as the signal that it's time to recalibrate?