TL;DR: Running a full LLM benchmark suite (GSM8K + HellaSwag + TruthfulQA) on a single T4 GPU costs about $0.12.
Most teams treat LLM evaluation as a monolithic black box. Here is what I found when I broke down the compute costs.
The Cost Breakdown
| Task | Method | Runtime | Cost |
|---|---|---|---|
| GSM8K | Generative | 46.5 min | $0.0775 |
| HellaSwag | Log-Likelihood | 23.7 min | $0.0394 |
| TruthfulQA | Log-Likelihood | 0.97 min | $0.0016 |
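Generative tasks dominate cost: GSM8K alone accounts for roughly 65% of the total. Log-likelihood tasks are far cheaper because candidate answers are scored in batched forward passes, with no token-by-token decoding.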
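The costs in the table can be reproduced with simple arithmetic. The sketch below assumes a rate of $0.10 per GPU-hour, which is inferred from the table (for example, $0.0775 for 46.5 minutes) and is not stated in the post. The name `ASSUMED_USD_PER_HOUR` is illustrative.

```python
# Assumed rate: inferred from the table ($0.0775 / 46.5 min ~= $0.10/hour).
# It is not stated in the post; replace it with your provider's T4 price.
ASSUMED_USD_PER_HOUR = 0.10

# Runtimes in minutes, copied from the cost breakdown table.
RUNTIMES_MIN = {
    "GSM8K": 46.5,
    "HellaSwag": 23.7,
    "TruthfulQA": 0.97,
}


def task_cost(runtime_min, usd_per_hour=ASSUMED_USD_PER_HOUR):
    """Convert a runtime in minutes to a cost in USD."""
    return runtime_min / 60.0 * usd_per_hour


costs = {task: task_cost(minutes) for task, minutes in RUNTIMES_MIN.items()}
total = sum(costs.values())

for task, cost in costs.items():
    print(f"{task:<11} ${cost:.4f}  ({cost / total:.0%} of total)")
print(f"{'Total':<11} ${total:.4f}")
```

Under that assumed rate, the per-task figures agree with the table to within rounding, and the total rounds to the $0.12 quoted in the TL;DR.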
Key Rules
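- Cap generation at 256 tokens instead of the 2048 default; this cuts runtime by 75%
- Evaluate on a 25% stratified sample, which is enough to capture the variance of the full set
- Use MC2 scoring for TruthfulQA; it is computed from log-likelihoods, so it needs no external LLM judge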
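The sampling and scoring rules above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the stratum key (a hypothetical `bucket` field), the fixed seed, and the function names are all assumptions. The MC2 formula used here is the usual one, the share of probability mass assigned to the true reference answers.

```python
import math
import random
from collections import defaultdict


def stratified_sample(items, key, fraction=0.25, seed=0):
    """Sample `fraction` of the items from each stratum.

    The per-stratum count is rounded to the nearest integer,
    with at least one item kept per stratum.
    """
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    sample = []
    for group in strata.values():
        k = min(len(group), max(1, round(len(group) * fraction)))
        sample.extend(rng.sample(group, k))
    return sample


def mc2_score(ll_true, ll_false):
    """MC2 for one question: normalized probability mass on true answers.

    `ll_true` and `ll_false` are the model's log-likelihoods for the
    true and false reference answers, so no judge model is involved.
    """
    p_true = [math.exp(ll) for ll in ll_true]
    p_false = [math.exp(ll) for ll in ll_false]
    return sum(p_true) / (sum(p_true) + sum(p_false))


# Hypothetical dataset: 80 questions spread evenly over 4 strata.
questions = [{"id": i, "bucket": i % 4} for i in range(80)]
subset = stratified_sample(questions, key=lambda q: q["bucket"])
print(len(subset), "of", len(questions), "questions kept")

# Toy log-likelihoods for a single question.
print(mc2_score(ll_true=[math.log(0.3)], ll_false=[math.log(0.1)]))
```

What to stratify on (difficulty, topic, answer length) depends on the benchmark; the post does not specify it, so the `bucket` field here is only a placeholder.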
Full article: https://www.codcompass.com/blog/crawl-5777e247a1e2fb