Breaking down the accuracy number: Building an LLM Eval Harness From Scratch

#ai #machinelearning #python #llm

In my last series I fine-tuned models and kept quoting one proud number: ~96% accuracy. This series is about the thing I didn't do carefully enough back then — actually checking what that number meant.

Here's the trap. Accuracy is a single number trying to summarize a task with many possible answers. It blends the cases the model nails together with the ones it quietly fails, and hands you back one confident percentage. So I built a small eval harness from scratch — no evaluate, no lm-eval-harness — and ran it on the base Qwen2.5-1.5B-Instruct (no fine-tuning, so anyone can run it cold on a Kaggle T4).

The point isn't "frameworks are bad." It's that once you write the loop yourself, you understand exactly what those frameworks are doing for you — and why a single metric can hide a broken model.

The task and the loop

I reused the Banking77 intent-classification dataset from my fine-tuning series (77 customer-support intents) and the same parse_prediction() helper, so the parsing here is identical to what I trained/served with. Evaluating with a different parser than you served with is a classic way to produce numbers that don't mean anything.

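import torch

N_EVAL = 400
LABELS_BLOCK = ', '.join(label_names)   # give the base model the label space

def predict_one(query: str) -> str:
    # build_chat_prompt, tokenizer, model, DEVICE and label_names are defined elsewhere in the notebook
    prompt = build_chat_prompt(tokenizer, query)
    # Append the label list to the instruction. str.replace is a silent no-op
    # if the phrase is missing, so check the template if predictions look off.
    prompt = prompt.replace('and nothing else.',
                            f'and nothing else. Valid intents: {LABELS_BLOCK}.')
    inputs = tokenizer(prompt, return_tensors='pt').to(DEVICE)
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=16,
                             do_sample=False, pad_token_id=tokenizer.eos_token_id)
    # Decode only the newly generated tokens, not the echoed prompt.
    gen = tokenizer.decode(out[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    return parse_prediction(gen, label_names)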

Greedy decoding (do_sample=False) keeps the eval deterministic. Now the fun part: score the same predictions five different ways.
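For the scoring that follows, the predictions only need to be collected once into two aligned lists. Here is a minimal sketch of that loop. The names `run_eval`, `fake_predict`, and `toy_examples` are hypothetical stand-ins (not from the notebook); in the real run, `predict_one` would take the place of `fake_predict`.

```python
from typing import Callable, Iterable, List, Tuple


def run_eval(examples: Iterable[Tuple[str, str]],
             predict_fn: Callable[[str], str]) -> Tuple[List[str], List[str]]:
    """Run predict_fn over (query, gold_label) pairs and return aligned lists."""
    y_true: List[str] = []
    y_pred: List[str] = []
    for query, gold in examples:
        y_true.append(gold)
        y_pred.append(predict_fn(query))
    return y_true, y_pred


# Hypothetical stand-in for predict_one so the sketch runs without a model.
def fake_predict(query: str) -> str:
    return "card_arrival" if "card" in query else "exchange_rate"


toy_examples = [
    ("Where is my card?", "card_arrival"),
    ("What is the exchange rate?", "exchange_rate"),
    ("My card payment failed", "declined_card_payment"),
]

y_true_toy, y_pred_toy = run_eval(toy_examples, fake_predict)
print(list(zip(y_true_toy, y_pred_toy)))
```

Keeping generation separate from scoring means every metric sees exactly the same predictions, and the slow model pass runs only once.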

Metric #1 — Accuracy (the number everyone quotes)

from sklearn.metrics import accuracy_score
acc = accuracy_score(y_true, y_pred)
# Accuracy: 50.0%
# Unparseable / no-match predictions: 0 (0.0%)

50%. Mediocre, but "functional" — the kind of number you'd note and move on from. And notably, 0% unparseable: every prediction was a clean, valid intent label. By every surface check, the model looked fine.
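Both surface checks are a few lines without scikit-learn. The sketch below is an illustration on toy data, not the notebook's code; in particular `NO_MATCH` is a hypothetical sentinel, since what `parse_prediction()` returns on a failed parse isn't shown here.

```python
from typing import Sequence

NO_MATCH = "UNKNOWN"  # hypothetical sentinel; the real parser may return something else


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot score an empty eval set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unparseable_rate(y_pred: Sequence[str], valid_labels: Sequence[str]) -> float:
    """Fraction of predictions that are not one of the valid labels."""
    if not y_pred:
        raise ValueError("cannot score an empty prediction list")
    valid = set(valid_labels)
    bad = sum(p not in valid for p in y_pred)
    return bad / len(y_pred)


toy_labels = ["a", "b", "c"]
toy_gold = ["a", "a", "b", "c"]
toy_pred = ["a", "b", "b", NO_MATCH]

toy_acc = accuracy(toy_gold, toy_pred)                     # 2 of 4 correct
toy_unparseable = unparseable_rate(toy_pred, toy_labels)   # 1 of 4 invalid
print(f"accuracy={toy_acc:.2f} unparseable={toy_unparseable:.2f}")
```

Reporting the two together matters: a low unparseable rate only says the outputs are well-formed, not that they are right.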

Metric #2 — Per-class precision / recall / F1

Now the same predictions, broken out by intent:

from sklearn.metrics import classification_report
report = classification_report(y_true, y_pred, labels=label_names,
                               output_dict=True, zero_division=0)

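The ten worst intents had F1 = 0.0 and support = 0.0. One caveat on reading that: in scikit-learn's report, support counts how many true examples of a class are in the eval set, not how often the model predicted it, so support = 0 means the intent never appeared among the gold labels that were scored. The signal for "never predicted" is a class with nonzero support and zero recall, or simply a label that never shows up in y_pred. For those classes, the model wasn't getting them wrong. It was pretending they didn't exist.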
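Because `support` in `classification_report` counts gold examples rather than predictions, one way to separate "never predicted" from "absent from the sample" is to count both sides directly. This is a minimal sketch on toy data; the function names are hypothetical and not taken from the notebook.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def class_counts(y_true: Sequence[str], y_pred: Sequence[str],
                 labels: Sequence[str]) -> Dict[str, Tuple[int, int]]:
    """Map each label to (times it is the gold label, times it is predicted)."""
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    return {lab: (true_counts[lab], pred_counts[lab]) for lab in labels}


def never_predicted(y_true: Sequence[str], y_pred: Sequence[str],
                    labels: Sequence[str]) -> List[str]:
    """Labels with at least one gold example that the model never output."""
    counts = class_counts(y_true, y_pred, labels)
    return [lab for lab in labels if counts[lab][0] > 0 and counts[lab][1] == 0]


def absent_from_sample(y_true: Sequence[str], y_pred: Sequence[str],
                       labels: Sequence[str]) -> List[str]:
    """Labels with no gold examples in the eval sample (support == 0)."""
    counts = class_counts(y_true, y_pred, labels)
    return [lab for lab in labels if counts[lab][0] == 0]


toy_labels = ["a", "b", "c", "d"]
toy_gold = ["a", "a", "b", "c"]
toy_pred = ["a", "a", "a", "a"]   # a model that collapsed onto one label

ignored = never_predicted(toy_gold, toy_pred, toy_labels)
unsampled = absent_from_sample(toy_gold, toy_pred, toy_labels)
print("never predicted:", ignored)
print("absent from sample:", unsampled)
```

The two lists call for different fixes: the first points at the model, the second at the eval sample being too small to cover every intent.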

Metric #3 — Macro vs. micro F1

This is where the headline falls apart:

from sklearn.metrics import f1_score
micro = f1_score(y_true, y_pred, labels=label_names, average='micro', zero_division=0)  # 50.0%
macro = f1_score(y_true, y_pred, labels=label_names, average='macro', zero_division=0)  # 7.5%
# Gap: 42.5 percentage points

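Micro F1: 50%. Macro F1: 7.5%. That 42.5-point gap is the story. Micro F1 pools every prediction before scoring, so in a single-label task where every prediction is a valid label it equals accuracy (hence the same 50%), and the common classes dominate. Macro averages the per-class F1 scores with equal weight, so every class that scores 0 drags it toward the floor. When micro ≫ macro, your model is carrying a few common cases and ignoring the rest.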
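To see where the gap comes from, here is a from-scratch sketch of both averages on toy data (not the Banking77 results). It uses the identity F1 = 2·TP / (2·TP + FP + FN) and treats an undefined F1 as 0, which is what `zero_division=0` does in the scikit-learn calls. The function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str],
                 labels: Sequence[str]) -> Tuple[Dict[str, float], Counter, Counter, Counter]:
    """Return per-label F1 plus the raw TP / FP / FN counters."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # p was predicted but wrong
            fn[t] += 1   # t was the answer but missed
    scores = {}
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        scores[lab] = 2 * tp[lab] / denom if denom else 0.0
    return scores, tp, fp, fn


def micro_macro_f1(y_true: Sequence[str], y_pred: Sequence[str],
                   labels: Sequence[str]) -> Tuple[float, float]:
    scores, tp, fp, fn = per_class_f1(y_true, y_pred, labels)
    # Macro: average the per-class scores, every class weighted equally.
    macro = sum(scores.values()) / len(labels)
    # Micro: pool the counts across classes first, then compute one F1.
    total_tp = sum(tp[lab] for lab in labels)
    total_fp = sum(fp[lab] for lab in labels)
    total_fn = sum(fn[lab] for lab in labels)
    denom = 2 * total_tp + total_fp + total_fn
    micro = 2 * total_tp / denom if denom else 0.0
    return micro, macro


toy_labels = ["a", "b", "c", "d"]
toy_gold = ["a"] * 7 + ["b", "c", "d"]
toy_pred = ["a"] * 10                      # always predicts the common class

toy_micro, toy_macro = micro_macro_f1(toy_gold, toy_pred, toy_labels)
print(f"micro={toy_micro:.3f} macro={toy_macro:.3f}")  # micro=0.700 macro=0.206
```

In this toy case the model is right 7 times out of 10, so micro F1 is 0.70, while three of the four classes score 0 and pull macro down to about 0.21.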

Metric #4 — The confusion matrix

Accuracy says how often. The confusion matrix says what gets mistaken for what — and it's the only view that shows the model collapsing 77 intents into a handful of favorites:

from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_true, y_pred, labels=label_names)
sns.heatmap(cm, cmap='magma', square=True)

In scikit-learn's layout, rows are true intents and columns are predicted intents. The matrix had almost no diagonal, bright vertical streaks (the columns for the model's favorite default labels, absorbing many true intents), and a near-empty lower half (dozens of intents never predicted at all, each of which shows up as an all-zero column). I also ranked the off-diagonal cells to name the worst confusions in plain English, which is the most actionable output.
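Ranking the off-diagonal cells doesn't need the full matrix: counting (true, predicted) pairs where the two differ gives the same ordering. A minimal sketch on toy data follows; `top_confusions` is a hypothetical helper, not the notebook's function.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def top_confusions(y_true: Sequence[str], y_pred: Sequence[str],
                   k: int = 10) -> List[Tuple[Tuple[str, str], int]]:
    """Return the k most frequent (true_label, predicted_label) mistakes."""
    pairs = Counter((t, p) for t, p in zip(y_true, y_pred) if t != p)
    return pairs.most_common(k)


toy_gold = ["card_arrival", "card_arrival",
            "exchange_rate", "exchange_rate", "exchange_rate"]
toy_pred = ["card_arrival", "lost_or_stolen_card",
            "card_arrival", "card_arrival", "exchange_rate"]

worst = top_confusions(toy_gold, toy_pred, k=3)
for (true_label, pred_label), n in worst:
    print(f"{n}x '{true_label}' mistaken for '{pred_label}'")
```

A ranked list like this tends to be easier to act on than a 77×77 heatmap, since each line names one specific pair of intents to look at.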

The point

I scored the exact same predictions five ways and got five different impressions:

Accuracy said 50% — "functional."
Macro F1 said 7.5% — the model abandoned most classes.
Per-class F1 and the never-predicted list named exactly which classes had zero recall.
The confusion matrix showed what it collapses into what.
0% unparseable meant none of this showed up in any surface check.

A metric doesn't just measure your model. It decides what you're allowed to notice. Pick the wrong one and you'll ship blind spots you never knew were there — not from carelessness, but because your one number was never capable of showing them to you.

What's next

Part 2: when there's no label to compare against — paragraphs, summaries, support replies — people reach for an LLM to grade the LLM. I build that judge from scratch and check whether it agrees with actual humans. (Spoiler: not as often as you'd hope.)

📓 Full runnable notebook on Kaggle: https://www.kaggle.com/code/sumannath88/ep01-eval-harness-from-scratch

Built with PyTorch + Hugging Face Transformers + scikit-learn. Questions or corrections welcome in the comments.