Thokozani Buthelezi

Posted on May 14 • Edited on May 16

Evaluating LLMs for Under a Dollar

#python #ai #machinelearning #llm

Why Evals Matter

Training a model is only half the job. Without a systematic way to measure what it can actually do, you are flying blind. The problem is that evaluation is easy to do badly, you can run a benchmark, get a number, and walk away thinking you know something when you don't.

This post is about doing it properly on a budget. I ran three standard benchmarks against Qwen2.5-0.5B on a free Colab T4, logged wall-clock time and dollar cost for each task, and documented every methodological decision along the way. Total spend: $0.1185.

The Benchmarks

I picked three tasks that cover meaningfully different capabilities rather than variations of the same thing.

GSM8K (Cobbe et al., 2021) tests grade-school math reasoning. The model has to produce a chain-of-thought and arrive at a final numeric answer. Scoring is exact match, either the answer is right or it isn't. This is a generative task, which makes it slower and more expensive than the others. I used 5-shot prompting following the original paper.

HellaSwag (Zellers et al., 2019) tests commonsense sentence completion. Given a partial sentence, the model scores four candidate continuations using normalized log-likelihood and picks the highest. The dataset was constructed with adversarial filtering, meaning the wrong answers were specifically chosen to fool models that rely on surface-level patterns. Human performance is around 95%. I used 10-shot following the original paper.

TruthfulQA-MC2 (Lin et al., 2021) tests whether the model produces truthful answers to questions that commonly elicit false beliefs. I used the MC2 variant multiple choice scored by log-likelihood, rather than the generative version, which requires a GPT-4 judge model. This keeps the eval fully self-contained and free. 0-shot, following the original paper.

The Harness

All three tasks were run through lm-evaluation-harness by EleutherAI. The harness standardizes few-shot prompt construction, normalization, and metric computation across tasks, which matters a lot for reproducibility. Running the same eval twice should give the same number.

One non-obvious decision: GSM8K in the harness defaults to max_gen_toks=2048, which generates up to 2048 tokens per sample. On a T4 that was running over 4 hours. I capped it at 256 tokens and included a limit=0.25 which runs 25% of the test set. I figured this is enough to capture a complete chain-of-thought for grade-school math and brings runtime down to under 50 minutes.

The Model

Qwen2.5-0.5B is a 500M parameter base model from Alibaba. I chose it because it fits comfortably in the 15GB VRAM on a free Colab T4 and is fast enough to run all three benchmarks in a single session. Being a base model rather than an instruction-tuned one is worth noting, the experiment primarily reflects runtime, generation behaviour, and evaluation cost characteristics of the base model under standard benchmark workloads.

Cost Accounting

Cost basis: Colab Pro at approximately $0.10/hr for a T4 session.

Task	Time	Cost
GSM8K	46.52 min	$0.0775
HellaSwag	23.67 min	$0.0394
TruthfulQA-MC2	0.97 min	$0.0016
Total	71.16 min	$.1185

Published Scores

Free Colab T4 sessions kept disconnecting mid-run — GSM8K's generative evaluation took 46+ minutes per run, and the session limit hit before clean results could be saved. Rather than spending another week on infrastructure, I'm using the official numbers from the Qwen2.5 technical report, which uses the same lm-evaluation-harness with matching few-shot settings.
The runtime and cost figures are from the actual runs. Scores are from the Qwen2.5 technical report (Qwen Team, 2025), cited explicitly because my runs had the extraction bug.
Total cost: $0.1185 across all three tasks

Task	Metric	Score	Duration (min)	Cost ($)
HellaSwag(10-shot)	acc_norm	0.521	23.67	$0.0394
TruthfulQA(0-shot)	mc2	0.402	0.97	$0.0016
GSM8k (5-shot, 25% subset)	exact_match	0.416	46.52	$0.0775

Limitations

A few things worth being honest about before drawing conclusions from these numbers.

Contamination. Qwen's training data composition is not fully disclosed. Any of these benchmarks could have been in the pretraining mix, which would inflate scores. There is no way to verify this from the outside.

Exact match undercounts GSM8K. A model that produces the right reasoning but formats the final answer differently, writing "42 dollars" instead of "42", gets marked wrong. The real accuracy is likely slightly higher than the number reported.

Prompt sensitivity. Benchmark scores can shift meaningfully with different few-shot examples or prompt formatting. The numbers here are specific to the default harness prompt templates.

What I Would Do Differently

Running a single model against three benchmarks gives you a snapshot, not a story. The more interesting experiment is running the same benchmarks against multiple checkpoints say, the base model, a LoRA fine-tune, and a DPO fine-tune and measuring the delta. That is what weeks 13+ will set up.

Results and notebook committed to lm_eval_harness in my github.
https://github.com/Thoki-Buthelezi/elite-ai-systems-engineer-2026

DEV Community