likhitha manikonda

How to Evaluate ML Models Step by Step

When you're starting out in machine learning, the math and metrics can feel scary — but don’t worry!

This guide explains each metric with simple analogies, intuitive examples, and short code snippets you can run yourself.


🚀 Why Do We Evaluate Models?

When you train a machine learning model, it’s like teaching a kid how to identify something — for example, ripe vs unripe fruits.

But how do you know if the kid (or model) actually learned well?

Evaluation metrics answer:

  • ✅ Is the model making correct predictions overall?
  • 🎯 Is it mistakenly marking wrong things as right?
  • 🔍 Is it missing important cases?
  • 🔄 Does it perform consistently on new unseen data?

Let’s simplify every metric with beginner‑friendly analogies 👇


🔍 1. Accuracy: “How often am I right overall?”

📘 Definition

Accuracy is the percentage of predictions your model got correct.

Formula

Accuracy = Number of Correct Predictions / Total Predictions

🍉 Analogy: Exam Score

You answer 100 questions → Get 90 right → Accuracy = 90%

🥭 Mango Analogy

You show 100 mangoes to your robot:

  • It correctly identifies 90 👉 Accuracy = 90%

⚠️ Watch out!

Accuracy can mislead when classes are imbalanced.

If 95 mangoes are unripe, the robot can simply guess "unripe" and still get 95% accuracy… but it totally fails at finding ripe ones.
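
To see this pitfall in numbers, here is a minimal sketch (the labels are made up for illustration): a lazy "model" that always predicts unripe still scores 95% accuracy, yet it never finds a single ripe mango.

from sklearn.metrics import accuracy_score, recall_score

# 95 unripe (0) mangoes and 5 ripe (1) mangoes: an imbalanced dataset
y_true = [0] * 95 + [1] * 5

# A lazy "model" that always guesses unripe
y_pred = [0] * 100

print("Accuracy:", accuracy_score(y_true, y_pred))  # 0.95, looks impressive
print("Recall:", recall_score(y_true, y_pred))      # 0.0, it finds zero ripe mangoes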

💻 Code Example

from sklearn.metrics import accuracy_score

# 1 = ripe, 0 = unripe
y_true = [1, 0, 1, 1, 0]  # actual labels
y_pred = [1, 0, 1, 0, 0]  # model predictions

accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

🎯 2. Precision: “When I say YES, how often am I correct?”

📘 Definition

Out of all items predicted as positive, how many were actually positive?

Formula

Precision = TP / (TP + FP)

🍪 Analogy: Cookie Thief Accusation

You accuse 10 people of stealing cookies → Only 8 actually did it.

👉 Precision = 8/10 = 0.8

🥭 Mango Analogy

Robot says 10 mangoes are ripe → 8 truly are.

It made 2 false alarms.

👉 High precision = rarely raises false alarms

💻 Code Example

from sklearn.metrics import precision_score

# Same labels as in the accuracy example
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

precision = precision_score(y_true, y_pred)
print("Precision:", precision)

🧲 3. Recall: “How many actual YES cases did I find?”

📘 Definition

Out of all actual positives, how many did the model correctly identify?

Formula

Recall = TP / (TP + FN)

🍪 Analogy: Cookie Thief Hunt

There were 12 actual cookie thieves → you caught 8.

👉 Recall = 8/12 = 0.67

🥭 Mango Analogy

There are 12 ripe mangoes → robot finds 8.

It missed 4 real ripe ones.

👉 High recall = rarely misses positives

💻 Code Example

from sklearn.metrics import recall_score

# Same labels as in the accuracy example
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

recall = recall_score(y_true, y_pred)
print("Recall:", recall)

⚖️ Precision vs Recall (Super Simple)

  • Precision = “Of the ones I flagged, how many were correct?”
  • Recall = “Of the ones that exist, how many did I find?”

If you’re catching thieves:

  • Precision: Did I wrongly accuse people?
  • Recall: Did I fail to catch the real thieves?
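
To make the contrast concrete, here is a small sketch (the labels below are invented for illustration) that computes both metrics on the same predictions, mirroring the thief numbers above:

from sklearn.metrics import precision_score, recall_score

# 1 = thief (or ripe mango), 0 = innocent (or unripe)
y_true = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]

print("Precision:", precision_score(y_true, y_pred))  # 0.8: of 5 flagged, 4 were real
print("Recall:", recall_score(y_true, y_pred))        # ~0.67: of 6 real positives, 4 were found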

💡 4. F1‑Score: “Balanced performance between Precision and Recall”

📘 Definition

F1 combines Precision and Recall into a single score — useful when classes are imbalanced.

Formula

F1 = 2 × (Precision × Recall) / (Precision + Recall)

🎓 Analogy

A student who gets every attempted question right (high precision) but answers only a few questions (low recall) isn't ideal.

F1 rewards someone who is balanced.

💻 Code Example

from sklearn.metrics import f1_score

# Same labels as in the accuracy example
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)
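
If you want to verify the formula by hand, F1 is just the harmonic mean of precision and recall. A small self-contained sketch, reusing the same toy labels:

from sklearn.metrics import precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)

# Harmonic mean of precision and recall
print("Manual F1:", 2 * p * r / (p + r))
print("sklearn F1:", f1_score(y_true, y_pred))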

🔁 5. Cross‑Validation: “Test your recipe in multiple kitchens”

📘 Definition

Instead of testing once, cross-validation tests your model on multiple splits of the data.

Why?

To ensure the model isn’t just performing well by luck — it should perform well across many subsets.

🍽️ Analogy

You make a dish:

  • Tastes good at home
  • Tastes good in a friend’s kitchen
  • Tastes good in a hotel kitchen

👉 Then it’s truly a solid recipe.

💻 Code Example

from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import numpy as np

# Example dataset so the snippet runs end to end
X, y = load_iris(return_X_y=True)

model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)  # 5 different train/test splits

print("Cross-Validation Scores:", scores)
print("Average Score:", np.mean(scores))

🧮 Confusion Matrix — The Scoreboard Behind All Metrics

🧩 Understanding TP, FP, TN, FN (The Simplest Explanation Ever)
These four numbers come from the confusion matrix and form the foundation of all metrics.
Let’s continue with our ripe mango detection analogy 🍋🥭:

✅ TP — True Positive (“Correct YES”)
You predicted ripe, and it was actually ripe.
👉 Robot says “ripe” → Mango is ripe
✔️ Correct positive prediction

❌ FP — False Positive (“Wrong YES”)
You predicted ripe, but it was unripe.
👉 Robot says “ripe” → Mango is unripe
⚠️ False alarm
(Also called Type‑1 error)

❌ FN — False Negative (“Wrong NO”)
You predicted unripe, but it was actually ripe.
👉 Robot says “unripe” → Mango is ripe
⚠️ Missed case
(Also called Type‑2 error)

✅ TN — True Negative (“Correct NO”)
You predicted unripe, and it was unripe.
👉 Robot says “unripe” → Mango is unripe
✔️ Correct negative prediction

Visual

                 Predicted: Ripe    Predicted: Unripe
Actual: Ripe           TP                  FN
Actual: Unripe         FP                  TN

Formulas

Accuracy = (TP + TN) / Total

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
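
Here is how to pull these four numbers out of scikit-learn's confusion_matrix, using the same toy labels as the earlier examples. For binary labels, .ravel() returns them in the order TN, FP, FN, TP:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Rows are actual classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)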


🎉 Final Summary

Metric           | Meaning (Simple)                          | Best For
-----------------|-------------------------------------------|-------------------------
Accuracy         | Overall correctness                       | Balanced datasets
Precision        | When I say “yes”, am I right?             | Avoid false alarms
Recall           | Did I find all actual positives?          | Avoid misses
F1 Score         | Balance of precision & recall             | Imbalanced classes
Cross‑Validation | Reliable performance on many data splits  | Ensuring generalization
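
If you want all of these metrics in one shot, scikit-learn's classification_report prints precision, recall, and F1 for each class plus overall accuracy. A minimal sketch with the same toy labels used throughout:

from sklearn.metrics import classification_report

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

# Per-class precision, recall, F1, plus overall accuracy
print(classification_report(y_true, y_pred))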

🎉 One-Line Mnemonics

Precision protects from false positives
Recall rescues missed positives
F1 fixes imbalance
Accuracy averages everything
TP/TN = correct; FP/FN = mistakes


🧭 Which Metric Matters More — And When?
Choosing the right evaluation metric depends on one simple question:

“Which type of mistake is more costly for my problem — false positives or false negatives?”

Let’s simplify this.

🎯 1. When Accuracy Matters the Most

Use accuracy when:

  • Your classes are balanced
  • Both mistake types (FP & FN) matter equally
  • You want an overall “how correct am I?” score

Good for:
Digit recognition, fruit classification, general tasks with equal class distribution.

Not good for:
Imbalanced datasets (e.g., fraud detection, medical tests)

🔍 2. When Precision Matters More

Precision cares about how trustworthy your positive predictions are.

Use precision when:

  • False positives (FP) are more harmful
  • You want to avoid “raising false alarms”

Examples:

  • Spam filter → Don’t put important emails into spam
  • Fraud alert → Don’t accuse innocent customers
  • Search results → Don’t show irrelevant items

Think:
👉 “If I say YES, I must be correct.”

🧲 3. When Recall Matters More

Recall focuses on catching all actual positives.

Use recall when:

  • False negatives (FN) are dangerous
  • Missing a positive case is worse than raising false alarms

Examples:

  • Disease detection → Don’t miss sick people
  • Fraud detection → Better to catch more suspicious cases
  • Safety inspections → Better to over‑report than to miss hazards

Think:
👉 “I don’t want to miss anything important.”

⚖️ 4. When F1‑Score Matters Most

Use F1‑score when:

  • Data is imbalanced
  • You care about both precision & recall
  • You want a single metric to compare models

Examples:

  • Classification with rare positive cases
  • NLP intent detection
  • Relevance ranking

📈 5. When AUC‑ROC Matters More

Use AUC‑ROC when:

  • You want to compare model quality across thresholds
  • You care about how well the model separates classes
  • Data is extremely imbalanced

Good for:
Credit scoring, fraud detection, anomaly detection.
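
A minimal sketch of computing AUC-ROC with scikit-learn. Note that roc_auc_score expects predicted probabilities or scores for the positive class (for example from model.predict_proba), not hard 0/1 labels; the numbers below are made up for illustration:

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]

# Predicted probability of the positive class
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

print("AUC-ROC:", roc_auc_score(y_true, y_scores))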
