DEV Community

kol kol

I Benchmarked 3 LLM Tasks for $0.12. Here's What the Cost Breakdown Reveals About AI Evaluation

TL;DR: Running a full LLM benchmark suite (GSM8K + HellaSwag + TruthfulQA) on a single T4 GPU costs just $0.12.

Most teams treat LLM evaluation as a monolithic black box. Here is what I found when I broke down the compute costs.

The Cost Breakdown

Task        Method          Runtime    Cost
GSM8K       Generative      46.5 min   $0.0775
HellaSwag   Log-Likelihood  23.7 min   $0.0394
TruthfulQA  Log-Likelihood  0.97 min   $0.0016
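
The arithmetic behind the table can be sketched in a few lines. The post does not state an hourly GPU price; the $0.10 per hour used here is an assumption inferred from the GSM8K row (46.5 min for $0.0775), and the names `ASSUMED_USD_PER_GPU_HOUR` and `task_cost` are illustrative only.

```python
# Assumption: hourly rate inferred from the table (46.5 min -> $0.0775).
# The post itself does not state the T4 price.
ASSUMED_USD_PER_GPU_HOUR = 0.10

# Runtimes in minutes, copied from the cost breakdown table.
RUNTIME_MIN = {"GSM8K": 46.5, "HellaSwag": 23.7, "TruthfulQA": 0.97}


def task_cost(runtime_min, usd_per_hour=ASSUMED_USD_PER_GPU_HOUR):
    """Convert a runtime in minutes into a cost in USD."""
    return runtime_min / 60.0 * usd_per_hour


costs = {task: task_cost(minutes) for task, minutes in RUNTIME_MIN.items()}
total = sum(costs.values())

# Print each task's cost and its share of the total bill.
for task, cost in costs.items():
    print(f"{task:<11} ${cost:.4f}  {cost / total:6.1%}")
print(f"{'Total':<11} ${total:.4f}")
```

Under that assumed rate the per-task figures match the table to within rounding, and the total rounds to the $0.12 quoted in the TL;DR.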

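Generative tasks dominate the cost: GSM8K alone accounts for about 65% of the total. Log-likelihood tasks are far cheaper because their candidate answers are scored in parallel.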

Key Rules

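  1. Cap generation at 256 tokens instead of the 2048-token default. This cuts runtime by 75%.
  2. Evaluate on a 25% stratified sample, which is enough to capture the variance.
  3. Use MC2 scoring for TruthfulQA. It is computed from the model's own log-likelihoods, so it needs no external LLM judge.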
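
Rules 2 and 3 can be illustrated with a minimal sketch. It assumes each benchmark item carries a field to stratify on (the post does not say which one; `category` below is hypothetical) and that per-answer log-probabilities have already been obtained from the model. `stratified_sample` and `mc2_score` are illustrative helpers, not part of any particular evaluation library.

```python
import math
import random
from collections import defaultdict


def stratified_sample(items, key, fraction=0.25, seed=0):
    """Draw roughly `fraction` of the items from each stratum given by key(item)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # keep at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample


def mc2_score(true_logprobs, false_logprobs):
    """MC2: normalized probability mass the model assigns to the true answers."""
    true_mass = sum(math.exp(lp) for lp in true_logprobs)
    false_mass = sum(math.exp(lp) for lp in false_logprobs)
    return true_mass / (true_mass + false_mass)


# Hypothetical data: 100 questions split across two made-up categories.
questions = [{"id": i, "category": "a" if i < 40 else "b"} for i in range(100)]
subset = stratified_sample(questions, key=lambda q: q["category"])

# Hypothetical log-probabilities for one question's reference answers.
score = mc2_score([math.log(0.3), math.log(0.3)], [math.log(0.4)])
```

Because `mc2_score` only needs the log-likelihoods of the reference answers, the score comes straight from the evaluated model and no second model is called to grade the output.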

Full article: https://www.codcompass.com/blog/crawl-5777e247a1e2fb
