DEV Community

kol kol

Posted on May 14

I Benchmarked 3 LLM Tasks for $0.12. Here's What the Cost Breakdown Reveals About AI Evaluation

#machinelearning #ai #llm #performance

TL;DR: Running a three-task LLM benchmark suite (GSM8K + HellaSwag + TruthfulQA) on a single T4 GPU cost about $0.12 in total.

Most teams treat LLM evaluation as a monolithic black box. Here is what I found when I broke down the compute costs.

The Cost Breakdown

Task	Method	Runtime	Cost
GSM8K	Generative	46.5 min	$0.0775
HellaSwag	Log-Likelihood	23.7 min	$0.0394
TruthfulQA	Log-Likelihood	0.97 min	$0.0016

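Generative tasks dominate cost: GSM8K alone accounts for about 65% of the total. Log-likelihood tasks only score fixed candidate answers, so they need no token-by-token generation and can be processed in parallel.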
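
To make the arithmetic behind the table explicit, here is a small sketch that recomputes the totals from the runtimes and costs listed above. The hourly rate is not stated in the post; it is only what the table's numbers imply (about $0.10 per GPU-hour), and the variable names are illustrative.

```python
# Runtime (minutes) and cost (USD) per task, copied from the table above.
tasks = {
    "GSM8K": (46.5, 0.0775),
    "HellaSwag": (23.7, 0.0394),
    "TruthfulQA": (0.97, 0.0016),
}

total_minutes = sum(minutes for minutes, _ in tasks.values())
total_cost = sum(cost for _, cost in tasks.values())

# Implied hourly GPU rate (an inference from the table, not a quoted price).
hourly_rate = total_cost / (total_minutes / 60)

# Share of the total bill attributable to each task.
cost_share = {name: cost / total_cost for name, (_, cost) in tasks.items()}

print(f"total: {total_minutes:.2f} min, ${total_cost:.4f}")
print(f"implied rate: ${hourly_rate:.2f}/hour")
for name, share in cost_share.items():
    print(f"{name}: {share:.1%} of cost")
```

The three costs sum to $0.1185, which rounds to the $0.12 headline figure.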

Key Rules

Cap generation at 256 tokens (not the 2048 default) - cuts runtime by 75%
A 25% stratified sample captures the variance of the full set
TruthfulQA MC2 scoring needs no external LLM judge
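
The last two rules can be sketched in a few lines of Python. This is a minimal illustration, not the code used for the benchmark: `stratified_sample`, `mc2_score`, and the `category` field are hypothetical names, and the stratification key is an assumption, since the post does not say what the sample was stratified on. The MC2 function follows the usual TruthfulQA definition: the share of probability mass the model puts on the true answers, computed from log-likelihoods alone.

```python
import math
import random
from collections import defaultdict


def stratified_sample(items, key, fraction=0.25, seed=0):
    """Draw roughly `fraction` of the items from each stratum (at least one)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample


def mc2_score(true_logprobs, false_logprobs):
    """Normalized probability mass assigned to the true answers."""
    p_true = sum(math.exp(lp) for lp in true_logprobs)
    p_false = sum(math.exp(lp) for lp in false_logprobs)
    return p_true / (p_true + p_false)


# Hypothetical data: 40 questions split evenly across two categories.
questions = [{"id": i, "category": "Health" if i % 2 else "Law"} for i in range(40)]
subset = stratified_sample(questions, key=lambda q: q["category"])

# Hypothetical log-likelihoods for two true answers and one false answer.
score = mc2_score([math.log(0.3), math.log(0.2)], [math.log(0.5)])
```

Because the score is computed directly from the evaluated model's own log-likelihoods, no second model is needed to grade the answers.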

Full article: https://www.codcompass.com/blog/crawl-5777e247a1e2fb
