DEV Community

Suman Nath

Breaking down the accuracy number: Building an LLM Eval Harness From Scratch

In my last series I fine-tuned models and kept quoting one proud number: ~96% accuracy. This series is about the thing I didn't do carefully enough back then — actually checking what that number meant.

Here's the trap. Accuracy is a single number trying to summarize a task with many possible answers. It blends the cases the model nails together with the ones it quietly fails, and hands you back one confident percentage. So I built a small eval harness from scratch — no evaluate, no lm-eval-harness — and ran it on the base Qwen2.5-1.5B-Instruct (no fine-tuning, so anyone can run it cold on a Kaggle T4).

The point isn't "frameworks are bad." It's that once you write the loop yourself, you understand exactly what those frameworks are doing for you — and why a single metric can hide a broken model.

The task and the loop

I reused the Banking77 intent-classification dataset from my fine-tuning series (77 customer-support intents) and the same parse_prediction() helper, so the parsing here is identical to what I trained/served with. Evaluating with a different parser than you served with is a classic way to produce numbers that don't mean anything.

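import torch

N_EVAL = 400
LABELS_BLOCK = ', '.join(label_names)   # give the base model the label space

# tokenizer, model, DEVICE, label_names and build_chat_prompt are defined
# earlier in the notebook; parse_prediction is the helper reused from the
# fine-tuning series.
def predict_one(query: str) -> str:
    prompt = build_chat_prompt(tokenizer, query)
    # Append the valid label list to the existing instruction sentence.
    # Note: str.replace is a silent no-op if that sentence is not in the prompt.
    prompt = prompt.replace('and nothing else.',
                            f'and nothing else. Valid intents: {LABELS_BLOCK}.')
    inputs = tokenizer(prompt, return_tensors='pt').to(DEVICE)
    with torch.no_grad():  # inference only, no gradients needed
        out = model.generate(**inputs, max_new_tokens=16,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, skipping the prompt tokens.
    gen = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return parse_prediction(gen, label_names)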

Greedy decoding (do_sample=False) keeps the eval deterministic. Now the fun part: score the same predictions five different ways.
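The loop that turns predict_one into a pair of label lists isn't shown in this post, so here is a minimal sketch of one way to write it. It assumes the eval set is an iterable of (text, gold_label) pairs and that the predictor is any function mapping a string to a string. The names collect_predictions, UNPARSEABLE and stub_predict are hypothetical and not taken from the notebook; the stub stands in for the model so the snippet runs without a GPU.

```python
from typing import Callable, Iterable, List, Tuple

UNPARSEABLE = "__unparseable__"  # hypothetical sentinel, not from the notebook


def collect_predictions(
    examples: Iterable[Tuple[str, str]],
    predict_fn: Callable[[str], str],
    label_names: List[str],
    limit: int = 400,
) -> Tuple[List[str], List[str]]:
    """Run predict_fn over up to `limit` examples and return (y_true, y_pred)."""
    valid = set(label_names)
    y_true: List[str] = []
    y_pred: List[str] = []
    for i, (text, gold) in enumerate(examples):
        if i >= limit:
            break
        pred = predict_fn(text)
        # Anything outside the label space is recorded as unparseable rather
        # than dropped, so it still counts as an error in every metric.
        y_pred.append(pred if pred in valid else UNPARSEABLE)
        y_true.append(gold)
    return y_true, y_pred


# Stub predictor standing in for predict_one (assumption: same str -> str shape).
def stub_predict(text: str) -> str:
    return "card_arrival" if "card" in text else "not a label"


labels = ["card_arrival", "lost_or_stolen_card"]
data = [
    ("where is my card", "card_arrival"),
    ("my card was stolen", "lost_or_stolen_card"),
    ("hello", "card_arrival"),
]
y_true, y_pred = collect_predictions(data, stub_predict, labels)
print(y_true)
print(y_pred)
```

Keeping the predictor as a parameter means the same loop can score the real model, a stub, or a different prompt without touching the metric code.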

Metric #1 — Accuracy (the number everyone quotes)

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true, y_pred)
# Accuracy: 50.0%
# Unparseable / no-match predictions: 0 (0.0%)

50%. Mediocre, but "functional" — the kind of number you'd note and move on from. And notably, 0% unparseable: every prediction was a clean, valid intent label. By every surface check, the model looked fine.
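Both surface checks are a few lines each when written by hand. The sketch below is a plain-Python version under made-up toy labels (a, b, c), not the Banking77 data. The function names are my own. It reproduces the same shape of result: half the answers right, nothing unparseable, and a model that only ever says one thing.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of positions where the prediction equals the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def unparseable_rate(y_pred: Sequence[str], label_names: List[str]) -> float:
    """Fraction of predictions that are not valid labels."""
    if not y_pred:
        return 0.0
    valid = set(label_names)
    return sum(p not in valid for p in y_pred) / len(y_pred)


# Toy data (made up): the model answers "a" for every input.
labels = ["a", "b", "c"]
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a"]

print(accuracy(y_true, y_pred))          # 0.5
print(unparseable_rate(y_pred, labels))  # 0.0
```

Neither number can tell a constant-answer model from a genuinely mediocre one, which is why the breakdowns that follow matter.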

Metric #2 — Per-class precision / recall / F1

Now the same predictions, broken out by intent:

from sklearn.metrics import classification_report
report = classification_report(y_true, y_pred, labels=label_names,
                               output_dict=True, zero_division=0)

[Figure: per-class precision]

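The ten worst intents had F1 = 0.0 and support = 0.0. One caveat on reading that report: scikit-learn's support column counts the true examples of each class in y_true, not how often the model predicted it, so a support of 0 means the class never appeared in the eval sample. Whether a class was never predicted has to be checked by counting y_pred directly. For the intents that were present but predicted literally zero times, the model wasn't getting them wrong. It was pretending they didn't exist.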
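To keep "never predicted" and "absent from the sample" apart, it helps to track both counts per class. The sketch below computes per-class precision, recall and F1 by hand and reports the gold count (support) next to the predicted count. The function name per_class_report and the toy labels are assumptions for illustration, and zero denominators are scored as 0.0 to mirror the zero_division=0 choice above.

```python
from collections import Counter
from typing import Dict, List, Sequence


def per_class_report(
    y_true: Sequence[str], y_pred: Sequence[str], label_names: List[str]
) -> Dict[str, Dict[str, float]]:
    """Per-class precision, recall, F1, support (gold count) and predicted count."""
    support = Counter(y_true)    # how often each class is the gold label
    predicted = Counter(y_pred)  # how often the model outputs each class
    true_pos = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    report: Dict[str, Dict[str, float]] = {}
    for label in label_names:
        tp = true_pos[label]
        precision = tp / predicted[label] if predicted[label] else 0.0
        recall = tp / support[label] if support[label] else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        report[label] = {
            "precision": precision, "recall": recall, "f1": f1,
            "support": support[label], "predicted": predicted[label],
        }
    return report


# Toy data (made up): "d" never appears as a gold label, "b" and "c" do.
labels = ["a", "b", "c", "d"]
y_true = ["a", "a", "b", "c"]
y_pred = ["a", "a", "a", "a"]
report = per_class_report(y_true, y_pred, labels)

never_predicted = [l for l in labels
                   if report[l]["support"] > 0 and report[l]["predicted"] == 0]
absent_from_sample = [l for l in labels if report[l]["support"] == 0]
print(never_predicted)     # ['b', 'c']
print(absent_from_sample)  # ['d']
```

All three of b, c and d score F1 = 0.0 here, but only b and c are classes the model abandoned; d simply wasn't tested.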

Metric #3 — Macro vs. micro F1

This is where the headline falls apart:

from sklearn.metrics import f1_score
micro = f1_score(y_true, y_pred, labels=label_names, average='micro', zero_division=0)  # 50.0%
macro = f1_score(y_true, y_pred, labels=label_names, average='macro', zero_division=0)  # 7.5%
# Gap: 42.5 percentage points

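Micro F1: 50%. Macro F1: 7.5%. That 42.5-point gap is the story. Micro pools every prediction before scoring, so in a single-label task where every prediction is a valid label it comes out equal to accuracy, and the common classes dominate. Macro computes F1 per intent and averages them with equal weight, so all the abandoned classes drag it to the floor. When micro ≫ macro, your model is carrying a few common cases and ignoring the rest.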
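The arithmetic behind that gap is easy to reproduce by hand. The sketch below is a from-scratch micro and macro F1 on made-up toy data (ten examples, four classes, a model that always answers "a"); the function name micro_macro_f1 is my own. Classes with no true positives, false positives or false negatives are scored as 0.0, which is an assumption chosen to mirror zero_division=0.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def _f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; 0.0 when the class has no counts at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(
    y_true: Sequence[str], y_pred: Sequence[str], label_names: List[str]
) -> Tuple[float, float]:
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # p was predicted but wrong
            fn[t] += 1  # t was the gold label but missed
    # Micro: pool the counts over all classes, then compute one F1.
    micro = _f1(sum(tp[l] for l in label_names),
                sum(fp[l] for l in label_names),
                sum(fn[l] for l in label_names))
    # Macro: one F1 per class, then an unweighted mean.
    per_class = [_f1(tp[l], fp[l], fn[l]) for l in label_names]
    macro = sum(per_class) / len(per_class)
    return micro, macro


labels = ["a", "b", "c", "d"]
y_true = ["a"] * 5 + ["b"] * 2 + ["c"] * 2 + ["d"]
y_pred = ["a"] * 10  # the model collapses everything into "a"

micro, macro = micro_macro_f1(y_true, y_pred, labels)
print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.500 macro=0.167
```

Class "a" gets F1 = 10/15 ≈ 0.667 and the other three get 0, so macro is 0.667 / 4 ≈ 0.167, while the pooled counts (5 TP, 5 FP, 5 FN) give micro = 0.5.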

Metric #4 — The confusion matrix

Accuracy says how often. The confusion matrix says what gets mistaken for what — and it's the only view that shows the model collapsing 77 intents into a handful of favorites:

from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_true, y_pred, labels=label_names)
sns.heatmap(cm, cmap='magma', square=True)

The matrix had almost no diagonal, bright vertical streaks (in scikit-learn's layout rows are true labels and columns are predictions, so a bright column is one of the model's favorite default labels absorbing many true intents), and a near-empty lower half (dozens of intents never predicted at all). I also ranked the off-diagonal cells to name the worst confusions in plain English, which is the actually actionable output.
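Ranking the off-diagonal cells doesn't need the full matrix. A minimal sketch, assuming y_true and y_pred are parallel lists of label strings: count every (true, predicted) pair where the two differ and sort by frequency. The function name top_confusions and the toy data are hypothetical; the notebook's own ranking code may look different.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(
    y_true: Sequence[str], y_pred: Sequence[str], k: int = 5
) -> List[Tuple[str, str, int]]:
    """Return the k most frequent (true, predicted, count) error pairs."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    # Sort by count descending, then by label names so ties are deterministic.
    ranked = sorted(pairs.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(t, p, n) for (t, p), n in ranked[:k]]


# Toy data (made up) using three Banking77 label names.
y_true = ["card_arrival", "card_arrival", "lost_or_stolen_card",
          "lost_or_stolen_card", "lost_or_stolen_card", "pin_blocked"]
y_pred = ["card_arrival", "lost_or_stolen_card", "card_arrival",
          "card_arrival", "card_arrival", "card_arrival"]

worst = top_confusions(y_true, y_pred, k=3)
for true_label, pred_label, count in worst:
    print(f"{count}x true '{true_label}' predicted as '{pred_label}'")
```

Each printed line is one cell of the confusion matrix in sentence form, which is usually easier to act on than a 77 by 77 heatmap.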

The point

I scored the exact same predictions five ways and got five different impressions:

  • Accuracy said 50% — "functional."
  • Macro F1 said 7.5% — the model abandoned most classes.
  • Per-class F1 and the never-predicted list named which classes were abandoned (zero recall).
  • The confusion matrix showed what it collapses into what.
  • 0% unparseable meant none of this showed up in any surface check.

A metric doesn't just measure your model. It decides what you're allowed to notice. Pick the wrong one and you'll ship blind spots you never knew were there — not from carelessness, but because your one number was never capable of showing them to you.

What's next

Part 2: when there's no label to compare against — paragraphs, summaries, support replies — people reach for an LLM to grade the LLM. I build that judge from scratch and check whether it agrees with actual humans. (Spoiler: not as often as you'd hope.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep01-eval-harness-from-scratch


Built with PyTorch + Hugging Face Transformers + scikit-learn. Questions or corrections welcome in the comments.
