<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Neetika Mittal</title>
    <description>The latest articles on DEV Community by Neetika Mittal (@mneetika).</description>
    <link>https://dev.to/mneetika</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3960536%2F3060766a-9bd6-444d-b649-2b8cf9e00026.png</url>
      <title>DEV Community: Neetika Mittal</title>
      <link>https://dev.to/mneetika</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mneetika"/>
    <language>en</language>
    <item>
      <title>Why Accuracy Is Not Enough: Evaluation Metrics Every AI Engineer Should Understand</title>
      <dc:creator>Neetika Mittal</dc:creator>
      <pubDate>Sat, 30 May 2026 22:58:12 +0000</pubDate>
      <link>https://dev.to/mneetika/why-accuracy-is-not-enough-evaluation-metrics-every-ai-engineer-should-understand-cah</link>
      <guid>https://dev.to/mneetika/why-accuracy-is-not-enough-evaluation-metrics-every-ai-engineer-should-understand-cah</guid>
      <description>&lt;h1&gt;
  
  
  Why Accuracy Is Not Enough: Evaluation Metrics Every AI Engineer Should Understand
&lt;/h1&gt;

&lt;p&gt;Your evaluation dashboard says your model is &lt;strong&gt;95% accurate&lt;/strong&gt;. Leadership is happy. The deployment goes live.&lt;/p&gt;

&lt;p&gt;Two weeks later, users complain that critical failures are still slipping through.&lt;/p&gt;

&lt;p&gt;The problem is not always the model. Sometimes the problem is the metric.&lt;/p&gt;

&lt;p&gt;As AI systems move from research prototypes into production infrastructure, evaluation becomes one of the most important engineering problems. This is especially true for modern GenAI systems, where outputs are probabilistic, subjective, and highly context-dependent.&lt;/p&gt;

&lt;p&gt;In this article, we will break down the most important evaluation metrics used in machine learning and GenAI systems, understand where they fail, and discuss how to think about evaluation from a production engineering perspective.&lt;/p&gt;




&lt;h1&gt;
  
  
  The Core Problem With Accuracy
&lt;/h1&gt;

&lt;p&gt;Accuracy is usually the first metric people encounter in machine learning. It is simple:&lt;/p&gt;

&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;$$\text{Accuracy} = \frac{\text{Correct Predictions}}{\text{Total Predictions}}$$&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;At first glance, it seems reasonable. A model that predicts correctly 95% of the time sounds good.&lt;/p&gt;

&lt;p&gt;But accuracy becomes dangerous when datasets are imbalanced.&lt;/p&gt;

&lt;p&gt;Imagine a fraud detection system:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% of transactions are legitimate&lt;/li&gt;
&lt;li&gt;1% are fraudulent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now suppose your model predicts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Every transaction is legitimate."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The result?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;99% accuracy&lt;/li&gt;
&lt;li&gt;Completely useless fraud detection&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make the failure more obvious, imagine 10,000 transactions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Category&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fraudulent transactions&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Legitimate transactions&lt;/td&gt;
&lt;td&gt;9,900&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fraud cases detected&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fraud cases missed&lt;/td&gt;
&lt;td&gt;100&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The model gets 9,900 of 10,000 predictions right, so accuracy is 99%. But it catches none of the 100 fraud cases, so recall for fraud is 0%.&lt;/p&gt;
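&lt;p&gt;The arithmetic is easy to reproduce. The sketch below rebuilds the 10,000-transaction example with made-up label lists (&lt;code&gt;1&lt;/code&gt; for fraud, &lt;code&gt;0&lt;/code&gt; for legitimate) and a stand-in "model" that always predicts legitimate. The variable names are illustrative only.&lt;/p&gt;

```python
# Hypothetical labels matching the example: 1 = fraud, 0 = legitimate.
y_true = [1] * 100 + [0] * 9_900
# A stand-in "model" that labels every transaction as legitimate.
y_pred = [0] * len(y_true)

# Accuracy: share of predictions that match the true label.
correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true)

# Recall for fraud: share of actual fraud cases that were flagged.
fraud_caught = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fraud_total = sum(y_true)
fraud_recall = fraud_caught / fraud_total

print(f"accuracy:     {accuracy:.2%}")      # 99.00%
print(f"fraud recall: {fraud_recall:.2%}")  # 0.00%
```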

&lt;p&gt;This is one of the most common evaluation mistakes in production systems: the metric looks healthy while the system fails at its actual job.&lt;/p&gt;

&lt;p&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xqhp7h3b5pp9gom8z8z.png" width="799" height="436"&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Understanding the Confusion Matrix
&lt;/h1&gt;

&lt;p&gt;Most classification metrics are derived from the confusion matrix, which counts predictions by actual class and predicted class.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Predicted Positive&lt;/th&gt;
&lt;th&gt;Predicted Negative&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Actual Positive&lt;/td&gt;
&lt;td&gt;True Positive (TP)&lt;/td&gt;
&lt;td&gt;False Negative (FN)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Actual Negative&lt;/td&gt;
&lt;td&gt;False Positive (FP)&lt;/td&gt;
&lt;td&gt;True Negative (TN)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This matrix gives us a much richer understanding of model behavior. From it, we derive several important metrics.&lt;/p&gt;
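&lt;p&gt;As a rough sketch, the four cells can be counted directly from paired labels. The helper name &lt;code&gt;confusion_counts&lt;/code&gt; and the sample labels are assumptions made for illustration, not part of any library.&lt;/p&gt;

```python
from collections import Counter

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, and TN for a binary problem (hypothetical helper)."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == positive:
            # Actual positive: either caught (TP) or missed (FN).
            counts["TP" if predicted == positive else "FN"] += 1
        else:
            # Actual negative: either wrongly flagged (FP) or correctly ignored (TN).
            counts["FP" if predicted == positive else "TN"] += 1
    return {name: counts[name] for name in ("TP", "FN", "FP", "TN")}

# Made-up labels purely for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
cm = confusion_counts(y_true, y_pred)
print(cm)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 4}
```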

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkljugarfuv2xh69x96bh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkljugarfuv2xh69x96bh.png" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Precision
&lt;/h1&gt;

&lt;p&gt;Precision answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"When the model predicts positive, how often is it correct?"&lt;/p&gt;
&lt;/blockquote&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;$$\text{Precision} = \frac{TP}{TP + FP}$$&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;High precision means the model produces few false positives, so its positive predictions are more trustworthy.&lt;/p&gt;

&lt;p&gt;Precision matters when false alarms are expensive. Common examples include spam filters, content moderation, automated bans, and financial transaction blocking.&lt;/p&gt;

&lt;p&gt;If your spam detector incorrectly flags legitimate emails, users lose trust quickly.&lt;/p&gt;




&lt;h1&gt;
  
  
  Recall
&lt;/h1&gt;

&lt;p&gt;Recall answers:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"How many actual positives did the model successfully detect?"&lt;/p&gt;
&lt;/blockquote&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;$$\text{Recall} = \frac{TP}{TP + FN}$$&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;High recall means the model misses fewer positive cases and catches most of the important events.&lt;/p&gt;

&lt;p&gt;Recall matters when missing something is costly. Common examples include fraud detection, medical diagnosis, security systems, and safety monitoring.&lt;/p&gt;

&lt;p&gt;A cancer detection model with low recall can miss life-threatening cases.&lt;/p&gt;
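&lt;p&gt;Both precision and recall reduce to a few lines once the confusion-matrix counts are known. The sketch below uses hypothetical counts. Returning &lt;code&gt;0.0&lt;/code&gt; when a denominator is zero is one possible convention, chosen here as an assumption.&lt;/p&gt;

```python
def precision(tp, fp):
    """TP / (TP + FP). Returns 0.0 if the model made no positive predictions."""
    predicted_positive = tp + fp
    return tp / predicted_positive if predicted_positive else 0.0

def recall(tp, fn):
    """TP / (TP + FN). Returns 0.0 if there are no actual positives."""
    actual_positive = tp + fn
    return tp / actual_positive if actual_positive else 0.0

# Hypothetical counts: 80 true positives, 20 false positives, 120 false negatives.
tp, fp, fn = 80, 20, 120
print(f"precision = {precision(tp, fp):.2f}")  # 0.80
print(f"recall    = {recall(tp, fn):.2f}")     # 0.40
```

&lt;p&gt;With these counts the model's alerts are mostly right, yet it still misses more than half of the actual positives, which is the gap the two metrics are designed to expose.&lt;/p&gt;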




&lt;h1&gt;
  
  
  The Precision vs Recall Tradeoff
&lt;/h1&gt;

&lt;p&gt;In most real-world systems, precision and recall pull against each other: for a given model, tuning it to raise one usually lowers the other. This creates one of the central optimization problems in machine learning.&lt;/p&gt;

&lt;p&gt;For example, lowering a classification threshold usually increases recall, but it also increases false positives, which reduces precision.&lt;/p&gt;
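&lt;p&gt;A small threshold sweep shows the effect. The scores and labels below are invented for illustration, and treating a score equal to the threshold as a positive prediction is an assumption.&lt;/p&gt;

```python
# Hypothetical (score, label) pairs; label 1 = positive, 0 = negative.
scored = [
    (0.95, 1), (0.85, 1), (0.70, 0), (0.60, 1),
    (0.40, 0), (0.30, 1), (0.20, 0), (0.10, 0),
]

def precision_recall_at(threshold, scored):
    """Precision and recall when scores at or above the threshold are flagged."""
    tp = sum(1 for s, y in scored if s >= threshold and y == 1)
    fp = sum(1 for s, y in scored if s >= threshold and y == 0)
    positives = sum(y for _, y in scored)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / positives if positives else 0.0
    return precision, recall

for threshold in (0.9, 0.5, 0.25):
    p, r = precision_recall_at(threshold, scored)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

&lt;p&gt;On this toy data, recall climbs from 0.25 to 0.75 to 1.00 as the threshold drops, while precision falls from 1.00 to 0.75 to 0.67.&lt;/p&gt;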

&lt;p&gt;This tradeoff appears everywhere in production AI systems. Modern LLM moderation systems constantly balance aggressive filtering, user experience, safety requirements, and operational costs.&lt;/p&gt;

&lt;p&gt;There is rarely a perfect threshold. Only tradeoffs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9ll38cxq5z06txx0vd6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo9ll38cxq5z06txx0vd6.png" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  F1 Score
&lt;/h1&gt;

&lt;p&gt;F1 score combines precision and recall into a single metric. It is their harmonic mean, so a low value in either one pulls the score down.&lt;/p&gt;


&lt;div class="katex-element"&gt;
  &lt;span class="katex-display"&gt;$$F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$&lt;/span&gt;
&lt;/div&gt;


&lt;p&gt;F1 becomes useful when class imbalance exists, both precision and recall matter, and you want a single aggregate metric.&lt;/p&gt;

&lt;p&gt;This is why F1 is heavily used in information retrieval, NLP classification, GenAI evaluations, entity extraction, and multi-label classification.&lt;/p&gt;

&lt;p&gt;However, F1 also hides information. Two models can have identical F1 scores while behaving very differently operationally.&lt;/p&gt;

&lt;p&gt;One model may produce many false positives. Another may miss many true positives. The same metric can hide very different failure modes.&lt;/p&gt;
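&lt;p&gt;To make this concrete, here is a minimal sketch with two hypothetical models whose precision and recall are swapped. The numbers are invented, but the formula is the F1 definition given earlier.&lt;/p&gt;

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

# Hypothetical models: A rarely raises false alarms, B rarely misses positives.
model_a = {"precision": 0.90, "recall": 0.60}
model_b = {"precision": 0.60, "recall": 0.90}

f1_a = f1(model_a["precision"], model_a["recall"])
f1_b = f1(model_b["precision"], model_b["recall"])
print(f"model A F1 = {f1_a:.2f}")  # 0.72
print(f"model B F1 = {f1_b:.2f}")  # 0.72
```

&lt;p&gt;A dashboard showing only F1 could not tell these two apart, even though one misses 40% of positives and the other is wrong on 40% of its alerts.&lt;/p&gt;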




&lt;h1&gt;
  
  
  When F1 Is Not Enough
&lt;/h1&gt;

&lt;p&gt;F1 assumes precision and recall are equally important. That is not always true.&lt;/p&gt;

&lt;p&gt;In fraud detection, recall may matter more because missing fraud is expensive. In automated account bans, precision may matter more because false accusations damage user trust.&lt;/p&gt;

&lt;p&gt;In these cases, optimizing F1 can still produce the wrong system behavior.&lt;/p&gt;

&lt;p&gt;A related metric, F-beta, lets you control this tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;F2 emphasizes recall&lt;/li&gt;
&lt;li&gt;F0.5 emphasizes precision&lt;/li&gt;
&lt;/ul&gt;
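&lt;p&gt;F-beta is commonly defined as (1 + beta^2) * P * R / (beta^2 * P + R), where beta greater than 1 weights recall more heavily and beta less than 1 weights precision more heavily. The sketch below applies it to one hypothetical model. The function name and the example values are assumptions.&lt;/p&gt;

```python
def f_beta(precision, recall, beta):
    """F-beta score: beta above 1 favors recall, beta below 1 favors precision."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0

# Hypothetical model with high precision but weak recall.
p, r = 0.90, 0.50
for beta in (0.5, 1, 2):
    print(f"F{beta} = {f_beta(p, r, beta):.3f}")
```

&lt;p&gt;For this model the scores come out near 0.776 (F0.5), 0.643 (F1), and 0.549 (F2): the more weight recall gets, the more the weak recall is penalized.&lt;/p&gt;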

&lt;p&gt;The important question is not "Which metric is popular?" The important question is "Which mistake is more expensive?"&lt;/p&gt;




&lt;h1&gt;
  
  
  A Production Lesson From GenAI Evaluations
&lt;/h1&gt;

&lt;p&gt;One of the most interesting problems in GenAI systems is that evaluation itself becomes probabilistic.&lt;/p&gt;

&lt;p&gt;Traditional systems often evaluate deterministic outputs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Correct&lt;/li&gt;
&lt;li&gt;Incorrect&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But LLM systems are rarely binary. Suppose you build a ticket classification system using an LLM. The model may partially understand the issue: it might identify the correct root cause, assign the wrong severity, produce an incomplete explanation, or hallucinate remediation steps.&lt;/p&gt;

&lt;p&gt;Now evaluation becomes much harder.&lt;/p&gt;

&lt;p&gt;In one evaluation pipeline I worked on, aggregate metrics initially looked strong despite obvious quality problems observed by engineers. The root cause was class imbalance.&lt;/p&gt;

&lt;p&gt;Some labels appeared thousands of times while others appeared only a handful of times. Weighted metrics looked excellent because common labels dominated the scores.&lt;/p&gt;

&lt;p&gt;Macro F1 revealed the actual issue immediately: the system was effectively ignoring rare but operationally important classes.&lt;/p&gt;

&lt;p&gt;This is one reason why evaluation engineering is becoming a major discipline in modern AI infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6sa1xb6dwvarvryc57q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa6sa1xb6dwvarvryc57q.png" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Macro vs Micro vs Weighted F1
&lt;/h1&gt;

&lt;p&gt;This distinction becomes extremely important in multi-class systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Micro F1
&lt;/h2&gt;

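&lt;p&gt;Micro F1 aggregates true positives, false positives, and false negatives across all classes before computing the score. In single-label multi-class problems it equals accuracy. It favors common classes, which makes it useful when overall system performance matters most and the dataset distribution reflects production reality.&lt;/p&gt;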




&lt;h2&gt;
  
  
  Macro F1
&lt;/h2&gt;

&lt;p&gt;Macro F1 computes F1 independently per class and averages them equally. This treats rare classes as equally important, which makes it useful when rare classes, fairness, or tail performance matter.&lt;/p&gt;




&lt;h2&gt;
  
  
  Weighted F1
&lt;/h2&gt;

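&lt;p&gt;Weighted F1 computes F1 per class, like macro F1, but weights each class by its support (the number of true examples of that class). Frequent classes therefore contribute more to the final score.&lt;/p&gt;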

&lt;p&gt;This is often used in production dashboards, but it can sometimes hide minority-class failures.&lt;/p&gt;
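&lt;p&gt;The three averages can be compared on a toy example. The sketch below is a plain-Python approximation of the idea for single-label data. The helper names and the ticket labels are hypothetical, and giving a class with no correct predictions an F1 of 0 is an assumed convention.&lt;/p&gt;

```python
from collections import Counter

def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def f1_averages(y_true, y_pred):
    """Return (micro, macro, weighted) F1 for single-label predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for actual, predicted in zip(y_true, y_pred):
        if actual == predicted:
            tp[actual] += 1
        else:
            fp[predicted] += 1  # the predicted class gains a false positive
            fn[actual] += 1     # the actual class gains a false negative
    per_class = {c: f1_from_counts(tp[c], fp[c], fn[c]) for c in labels}
    support = Counter(y_true)
    micro = f1_from_counts(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(per_class.values()) / len(labels)
    weighted = sum(per_class[c] * support[c] for c in labels) / len(y_true)
    return micro, macro, weighted

# Hypothetical ticket labels: the rare "outage" class is never predicted.
y_true = ["billing"] * 95 + ["outage"] * 5
y_pred = ["billing"] * 100
micro, macro, weighted = f1_averages(y_true, y_pred)
print(f"micro={micro:.3f} macro={macro:.3f} weighted={weighted:.3f}")
# micro=0.950 macro=0.487 weighted=0.926
```

&lt;p&gt;Only the macro average drops sharply here, because it is the only one that gives the ignored rare class the same weight as the common one.&lt;/p&gt;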




&lt;h1&gt;
  
  
  ROC-AUC
&lt;/h1&gt;

&lt;p&gt;ROC-AUC stands for &lt;strong&gt;Receiver Operating Characteristic - Area Under the Curve&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It measures how well a model separates positive cases from negative cases across different classification thresholds.&lt;/p&gt;

&lt;p&gt;Many classifiers do not directly output &lt;code&gt;positive&lt;/code&gt; or &lt;code&gt;negative&lt;/code&gt;. They output a score or probability.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Transaction&lt;/th&gt;
&lt;th&gt;Actual Class&lt;/th&gt;
&lt;th&gt;Model Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;A&lt;/td&gt;
&lt;td&gt;Fraud&lt;/td&gt;
&lt;td&gt;0.92&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;B&lt;/td&gt;
&lt;td&gt;Fraud&lt;/td&gt;
&lt;td&gt;0.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;C&lt;/td&gt;
&lt;td&gt;Legitimate&lt;/td&gt;
&lt;td&gt;0.40&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;D&lt;/td&gt;
&lt;td&gt;Legitimate&lt;/td&gt;
&lt;td&gt;0.12&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;To turn these scores into predictions, we choose a threshold.&lt;/p&gt;

&lt;p&gt;If the threshold is 0.8:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A and B are predicted as fraud&lt;/li&gt;
&lt;li&gt;C and D are predicted as legitimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the threshold is 0.3:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A, B, and C are predicted as fraud&lt;/li&gt;
&lt;li&gt;D is predicted as legitimate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Changing the threshold changes false positives and false negatives.&lt;/p&gt;

&lt;p&gt;The ROC curve shows this tradeoff by plotting the true positive rate, which tells you how many actual positives the model catches, against the false positive rate, which tells you how many actual negatives the model incorrectly flags.&lt;/p&gt;
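&lt;p&gt;Using the four transactions from the table above, each threshold yields one point on the ROC curve. This is a minimal sketch. Counting a score equal to the threshold as a fraud prediction is an assumption.&lt;/p&gt;

```python
# Labels and scores from the table: 1 = fraud, 0 = legitimate.
transactions = {"A": (1, 0.92), "B": (1, 0.81), "C": (0, 0.40), "D": (0, 0.12)}

def tpr_fpr(threshold):
    """True positive rate and false positive rate at a given threshold."""
    tp = sum(1 for y, s in transactions.values() if y == 1 and s >= threshold)
    fp = sum(1 for y, s in transactions.values() if y == 0 and s >= threshold)
    positives = sum(1 for y, _ in transactions.values() if y == 1)
    negatives = len(transactions) - positives
    return tp / positives, fp / negatives

for threshold in (0.8, 0.3):
    tpr, fpr = tpr_fpr(threshold)
    print(f"threshold={threshold}: TPR={tpr:.2f}, FPR={fpr:.2f}")
# threshold=0.8: TPR=1.00, FPR=0.00
# threshold=0.3: TPR=1.00, FPR=0.50
```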

&lt;p&gt;The area under this curve summarizes it as a single number: 1.0 means perfect separation, 0.5 means random guessing, and anything below 0.5 means worse than random guessing.&lt;/p&gt;

&lt;p&gt;A high ROC-AUC means the model usually gives higher scores to positive examples than to negative examples.&lt;/p&gt;
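&lt;p&gt;That ranking interpretation can be computed directly: ROC-AUC equals the fraction of positive-negative pairs in which the positive example gets the higher score, with ties counted as half. The sketch below is a quadratic-time illustration rather than an efficient implementation, and the function name is hypothetical.&lt;/p&gt;

```python
def roc_auc_pairwise(labels, scores):
    """ROC-AUC as the share of positive-negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
# Scores from the fraud table: every fraud case outranks every legitimate one.
print(roc_auc_pairwise(labels, [0.92, 0.81, 0.40, 0.12]))  # 1.0
# Hypothetical variation: one fraud case now scores below a legitimate one.
print(roc_auc_pairwise(labels, [0.92, 0.30, 0.40, 0.12]))  # 0.75
```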

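&lt;p&gt;ROC-AUC is useful when comparing models because it does not depend on one fixed threshold. But in highly imbalanced datasets, it can look better than the system actually behaves in production: the false positive rate is computed over the large negative class, so even a small rate can mean far more false alarms than true detections.&lt;/p&gt;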

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhspsl25grjd8mkhsssym.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhspsl25grjd8mkhsssym.png" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  PR-AUC
&lt;/h1&gt;

&lt;p&gt;Precision-Recall AUC often becomes more informative for imbalanced problems.&lt;/p&gt;

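&lt;p&gt;Unlike ROC-AUC, PR-AUC focuses directly on precision and recall, neither of which uses true negatives, so a large negative class cannot inflate the score. Its baseline for a random classifier is roughly the positive class rate rather than 0.5. This makes it especially valuable for fraud detection, security systems, rare event detection, and GenAI issue detection.&lt;/p&gt;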

&lt;p&gt;In practice, PR-AUC often tells a more honest story for production AI systems.&lt;/p&gt;
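&lt;p&gt;One common way to estimate the area under the precision-recall curve is average precision: walk down the ranked list and average the precision observed at each true positive. The sketch below assumes untied scores and uses invented data. The function name is hypothetical.&lt;/p&gt;

```python
def average_precision(labels, scores):
    """Average of precision values taken at the rank of each positive example."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank  # precision among the top `rank` items
    return precision_sum / total_pos

# Hypothetical data: 2 positives among 10 examples (20% positive rate).
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
ap = average_precision(labels, scores)
prevalence = sum(labels) / len(labels)
print(f"average precision = {ap:.3f}, positive rate = {prevalence:.2f}")
```

&lt;p&gt;Here the positives sit at ranks 1 and 3, giving (1/1 + 2/3) / 2, or about 0.833, which should be read against the 0.20 positive rate rather than against 0.5.&lt;/p&gt;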




&lt;h1&gt;
  
  
  Calibration: The Metric Most Teams Ignore
&lt;/h1&gt;

&lt;p&gt;Suppose two models both predict:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"90% confidence"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model A is actually correct 90% of the time&lt;/li&gt;
&lt;li&gt;Model B is correct only 60% of the time&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Model A is calibrated. Model B is overconfident.&lt;/p&gt;

&lt;p&gt;Calibration measures whether model confidence matches reality. This becomes critically important in autonomous systems, medical AI, LLM judges, recommendation systems, and human-AI collaboration.&lt;/p&gt;

&lt;p&gt;Common ways to inspect calibration include reliability diagrams, expected calibration error, and Brier score.&lt;/p&gt;
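&lt;p&gt;Two of these are short enough to sketch by hand. The code below recreates Model A and Model B from the example above with ten made-up predictions each, treating the stated confidence as the predicted probability of being correct. The equal-width binning used for expected calibration error is one common choice and is an assumption here.&lt;/p&gt;

```python
def brier_score(probs, outcomes):
    """Mean squared gap between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def expected_calibration_error(probs, outcomes, n_bins=10):
    """Weighted average gap between mean confidence and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        index = min(int(p * n_bins), n_bins - 1)  # keep p == 1.0 in the last bin
        bins[index].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(p for p, _ in bucket) / len(bucket)
        observed = sum(y for _, y in bucket) / len(bucket)
        ece += len(bucket) / len(probs) * abs(avg_conf - observed)
    return ece

confidence = [0.9] * 10
model_a = [1] * 9 + [0] * 1   # correct 90% of the time
model_b = [1] * 6 + [0] * 4   # correct 60% of the time

for name, outcomes in (("A", model_a), ("B", model_b)):
    ece = expected_calibration_error(confidence, outcomes)
    brier = brier_score(confidence, outcomes)
    print(f"model {name}: ECE={ece:.3f}  Brier={brier:.3f}")
# model A: ECE=0.000  Brier=0.090
# model B: ECE=0.300  Brier=0.330
```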

&lt;p&gt;The confidence that modern LLMs report is often poorly calibrated. This creates major challenges for autonomous agent systems, where the model must decide when to act, ask for help, or stop.&lt;/p&gt;




&lt;h1&gt;
  
  
  Evaluation in LLM Systems Is Different
&lt;/h1&gt;

&lt;p&gt;Traditional ML evaluation usually assumes clear labels, deterministic outputs, and stable datasets.&lt;/p&gt;

&lt;p&gt;LLM systems violate all three assumptions. Their outputs may be subjective, creative, multi-step, context-dependent, and non-deterministic.&lt;/p&gt;

&lt;p&gt;For LLM products, evaluation often needs to measure multiple dimensions at once: factual correctness, instruction following, relevance, completeness, groundedness, safety, formatting compliance, tool-use correctness, latency, and cost.&lt;/p&gt;

&lt;p&gt;This calls for new evaluation approaches.&lt;/p&gt;




&lt;h1&gt;
  
  
  LLM-as-a-Judge
&lt;/h1&gt;

&lt;p&gt;One increasingly popular technique is using LLMs themselves as evaluators.&lt;/p&gt;

&lt;p&gt;The idea is simple:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generate model output&lt;/li&gt;
&lt;li&gt;Ask another LLM to evaluate quality&lt;/li&gt;
&lt;li&gt;Compare against expected behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This enables scalable evaluation pipelines for summarization, reasoning, agent workflows, coding systems, and customer support systems.&lt;/p&gt;

&lt;p&gt;But LLM judges introduce new problems, including judge bias, prompt sensitivity, position bias, preference leakage, and self-preference bias.&lt;/p&gt;

&lt;p&gt;Teams reduce these risks by using clear rubrics, randomizing answer order, hiding model identity, comparing judge scores against human labels, and tracking agreement between judges.&lt;/p&gt;
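&lt;p&gt;One way to compare judge scores against human labels is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. Kappa is a choice made for this sketch, not something the pipeline above prescribes, and the labels are invented. Returning 1.0 when both raters use a single identical label is an assumed convention for an otherwise undefined case.&lt;/p&gt;

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both raters labeled independently at their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # assumed convention: both raters always gave the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical review: humans fail 2 of 10 outputs, the LLM judge passes everything.
human = ["pass"] * 8 + ["fail"] * 2
judge = ["pass"] * 10

raw_agreement = sum(1 for h, j in zip(human, judge) if h == j) / len(human)
kappa = cohens_kappa(human, judge)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
# raw agreement = 0.80, kappa = 0.00
```

&lt;p&gt;The judge agrees with humans on 80% of items yet adds no information beyond chance, the same imbalance trap that affects accuracy.&lt;/p&gt;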

&lt;p&gt;Evaluation systems now require evaluation themselves. This recursive problem is becoming a major research area.&lt;/p&gt;




&lt;h1&gt;
  
  
  Human Evaluations Still Matter
&lt;/h1&gt;

&lt;p&gt;Despite advances in automated metrics, humans remain essential, especially for alignment, safety, UX quality, tone, reasoning correctness, and policy compliance.&lt;/p&gt;

&lt;p&gt;The most reliable production evaluation systems usually combine automated metrics, human review, statistical monitoring, regression detection, and real user feedback.&lt;/p&gt;

&lt;p&gt;No single metric captures reality completely.&lt;/p&gt;




&lt;h1&gt;
  
  
  Offline vs Online Evaluation
&lt;/h1&gt;

&lt;p&gt;Offline evaluation happens before deployment. It includes test sets, golden datasets, regression suites, and benchmark runs.&lt;/p&gt;

&lt;p&gt;Online evaluation happens after deployment. It includes A/B tests, shadow deployments, user feedback, production monitoring, and human review queues.&lt;/p&gt;

&lt;p&gt;Both matter.&lt;/p&gt;

&lt;p&gt;Offline evaluation catches regressions before users see them. Online evaluation tells you whether the system is actually working in the messy reality of production traffic.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl6b6lou4ka6z8cmdse0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkl6b6lou4ka6z8cmdse0.png" width="799" height="436"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h1&gt;
  
  
  Which Metric Should You Use?
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Recommended Metric&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Fraud Detection&lt;/td&gt;
&lt;td&gt;Recall + PR-AUC&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spam Detection&lt;/td&gt;
&lt;td&gt;Precision&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Search Ranking&lt;/td&gt;
&lt;td&gt;NDCG&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Recommendation Systems&lt;/td&gt;
&lt;td&gt;MAP / CTR&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-label NLP&lt;/td&gt;
&lt;td&gt;Macro F1&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GenAI Classification&lt;/td&gt;
&lt;td&gt;F1 + Human Review&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Safety Systems&lt;/td&gt;
&lt;td&gt;Recall&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LLM Judges&lt;/td&gt;
&lt;td&gt;Agreement Metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ranking Models&lt;/td&gt;
&lt;td&gt;ROC-AUC + NDCG&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Some ranking metrics deserve a quick note:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NDCG is useful when the order of results matters and top-ranked items are more important&lt;/li&gt;
&lt;li&gt;MAP is useful for retrieval systems where multiple relevant results may exist&lt;/li&gt;
&lt;li&gt;CTR is a behavioral business metric, but it can be noisy and biased by position, UI, and user intent&lt;/li&gt;
&lt;/ul&gt;
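&lt;p&gt;As an illustration of how NDCG rewards putting the most relevant items first, here is a minimal sketch using the linear-gain variant (relevance divided by log2 of rank plus one). Other gain functions exist, and the relevance grades below are made up.&lt;/p&gt;

```python
import math

def dcg(relevances):
    """Discounted cumulative gain: later ranks contribute less."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg_at_k(relevances, k):
    """DCG of the top k results divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0

# Hypothetical relevance grades (3 = best), listed in the order the system ranked them.
good_ranking = [3, 2, 0, 1]
bad_ranking = [0, 1, 2, 3]

print(f"good ranking NDCG@4 = {ndcg_at_k(good_ranking, 4):.3f}")
print(f"bad ranking  NDCG@4 = {ndcg_at_k(bad_ranking, 4):.3f}")
print(f"ideal ranking NDCG@4 = {ndcg_at_k([3, 2, 1, 0], 4):.3f}")  # 1.000
```

&lt;p&gt;Both rankings contain the same items, but the one that buries the most relevant result at the bottom scores noticeably lower.&lt;/p&gt;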

&lt;p&gt;The key lesson is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Metrics must align with operational goals.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Optimizing the wrong metric can destroy system quality while dashboards continue looking healthy.&lt;/p&gt;




&lt;h1&gt;
  
  
  A Practical Evaluation Checklist
&lt;/h1&gt;

&lt;p&gt;Before trusting a model metric, ask:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is the dataset imbalanced?&lt;/li&gt;
&lt;li&gt;Which error is more expensive: false positives or false negatives?&lt;/li&gt;
&lt;li&gt;Are rare classes hidden by averages?&lt;/li&gt;
&lt;li&gt;Is the model calibrated?&lt;/li&gt;
&lt;li&gt;Does offline performance match production behavior?&lt;/li&gt;
&lt;li&gt;Are humans reviewing ambiguous cases?&lt;/li&gt;
&lt;li&gt;Are evaluation datasets versioned?&lt;/li&gt;
&lt;li&gt;Are regressions caught before deployment?&lt;/li&gt;
&lt;li&gt;Are latency and cost part of the evaluation?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This checklist is often more useful than adding another metric to a dashboard.&lt;/p&gt;




&lt;h1&gt;
  
  
  Evaluation Is an Engineering Discipline
&lt;/h1&gt;

&lt;p&gt;Many teams treat evaluation as an afterthought. In reality, evaluation systems are production infrastructure.&lt;/p&gt;

&lt;p&gt;Good evaluation systems require more than a few metrics on a dashboard. They need dataset versioning, label quality pipelines, drift detection, continuous benchmarking, human review loops, statistical monitoring, cost-aware execution, and experiment reproducibility.&lt;/p&gt;

&lt;p&gt;As AI systems become core infrastructure, evaluation engineering is becoming as important as model engineering itself.&lt;/p&gt;




&lt;h1&gt;
  
  
  Final Thoughts
&lt;/h1&gt;

&lt;p&gt;Metrics are compression functions for reality. Every metric hides information.&lt;/p&gt;

&lt;p&gt;Accuracy hides class imbalance. F1 hides confidence. ROC-AUC hides calibration. Calibration hides ranking quality.&lt;/p&gt;

&lt;p&gt;No single number can fully describe model behavior.&lt;/p&gt;

&lt;p&gt;The best evaluation systems combine multiple perspectives: correctness, reliability, uncertainty, safety, and operational impact.&lt;/p&gt;

&lt;p&gt;If you are building production AI systems, choosing the right evaluation metric is often more important than choosing the right model.&lt;/p&gt;

&lt;p&gt;Because in the end:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What you measure is what your system learns to optimize.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And poorly chosen metrics can quietly push systems in the wrong direction for months before anyone notices.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>llm</category>
      <category>softwareengineering</category>
    </item>
  </channel>
</rss>
