TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.
Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.
The cloud bill was not lovely.
They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours. The GPU sat idle for about 70% of each run, because most of the time went to cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.
I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.
The problem with "GPU per job"
CI runners are designed for stateless, throwaway compute. That model breaks the second you involve a 7B+ parameter model that takes 30-90 seconds to load into VRAM.
Here's the rough breakdown of a typical eval job we measured:
| Phase | Time (avg) | GPU utilization |
|---|---|---|
| Cold start (instance boot) | 45s | 0% |
| Model download from S3 | 60s | 0% |
| Model load into VRAM | 25s | ~10% |
| Actual inference (50 prompts) | 40s | ~85% |
| Result upload + teardown | 15s | 0% |
So out of about 3 minutes of billable GPU time, you're getting 40 seconds of useful work. That's brutal economics.
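If you want to run the same sum on your own jobs, here's a minimal sketch of the arithmetic. The phase timings are copied from the table above; the dictionary keys and the `useful_fraction` helper are names I made up for illustration.

```python
# Phase timings in seconds, copied from the table above.
PHASES = {
    "cold_start": 45,
    "model_download": 60,
    "model_load": 25,
    "inference": 40,
    "upload_teardown": 15,
}


def useful_fraction(phases: dict, useful_phase: str = "inference") -> float:
    """Share of billable time spent in the one phase that does real work."""
    total = sum(phases.values())
    return phases[useful_phase] / total


total_seconds = sum(PHASES.values())  # 185 seconds, just over 3 minutes
fraction = useful_fraction(PHASES)    # 40 / 185
print(f"{total_seconds}s billed, {fraction:.0%} spent on inference")
# prints "185s billed, 22% spent on inference"
```

Swap in your own measured timings and the same ratio tells you how much of the bill is overhead.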
Batching: the boring fix that works
The trick isn't fancy. You stop spinning up a GPU per job and start treating the GPU like a long-lived service that consumes jobs from a queue.
We run a small pool of g5.xlarge instances (usually 2-4 depending on load) that stay warm. Each runner has the model preloaded in VRAM. CI jobs push eval requests to an SQS queue, runners pull from the queue, batch up to N prompts per inference pass, and post results back.
Rough sketch of the runner config:
    runner:
      instance_type: g5.xlarge
      pool_size_min: 2
      pool_size_max: 6
      scale_metric: queue_depth
      scale_threshold: 25  # jobs in queue
    model:
      name: llama-3.1-8b-instruct
      preload: true
      keep_warm_seconds: 1800
    batching:
      max_batch_size: 16
      max_wait_ms: 2000
    job_types:
      smoke_eval:
        priority: high
        max_prompts: 10
      full_regression:
        priority: low
        max_prompts: 500
        window: nightly_only
That max_wait_ms is doing the heavy lifting. The runner waits up to 2 seconds to gather a batch before firing inference. For CI, 2 seconds of latency is nothing. For inference throughput, it's everything.
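Here's a rough sketch of what that gather step can look like. It's not our actual runner code: it uses Python's in-process `queue.Queue` as a stand-in for SQS, and `gather_batch` is a hypothetical helper name. The two defaults mirror `max_batch_size` and `max_wait_ms` from the config.

```python
import queue
import time

MAX_BATCH_SIZE = 16  # batching.max_batch_size from the config
MAX_WAIT_MS = 2000   # batching.max_wait_ms from the config


def gather_batch(jobs: queue.Queue,
                 max_batch_size: int = MAX_BATCH_SIZE,
                 max_wait_ms: int = MAX_WAIT_MS) -> list:
    """Block for the first job, then keep collecting until the batch is
    full or max_wait_ms has passed since that first job arrived."""
    batch = [jobs.get()]  # wait as long as it takes for the first job
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # waited long enough, fire with what we have
        try:
            batch.append(jobs.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch


# Usage: 20 queued prompts come out as one full batch and one partial one.
q = queue.Queue()
for i in range(20):
    q.put(f"prompt-{i}")
first = gather_batch(q, max_wait_ms=50)
second = gather_batch(q, max_wait_ms=50)
print(len(first), len(second))  # prints "16 4"
```

A full batch fires immediately, and a partial one fires when the timer runs out, so a busy queue never pays the 2 seconds.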
Routing matters too
Once you've got a warm pool, you might as well route different model calls through one place. We have eval suites that hit a mix of self-hosted Llama, Claude via API, and OpenAI. Instead of each CI job authenticating separately and managing keys, we put a gateway in front.
There's a bunch of options here. LiteLLM is popular, Bifrost (https://github.com/maximhq/bifrost) is another one that does the same kind of multi-provider routing with rate limit handling, and you can roll your own with a thin FastAPI wrapper if you're feeling keen. The point is you stop scattering API keys across twenty CI configs.
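If you do roll your own, the core of it is just a routing table that lives in one place. This is a minimal sketch, not how LiteLLM or Bifrost work internally. The `Route` type, the prefix table, and both helper functions are hypothetical, and the environment variable names are only illustrative.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    provider: str
    api_key_env: Optional[str]  # env var holding the key; None if self-hosted


# Hypothetical routing table: model-name prefix -> provider.
ROUTES = {
    "llama-": Route("self_hosted", None),
    "claude-": Route("anthropic", "ANTHROPIC_API_KEY"),
    "gpt-": Route("openai", "OPENAI_API_KEY"),
}


def resolve_route(model: str) -> Route:
    """Find the provider for a model name, failing loudly if none matches."""
    for prefix, route in ROUTES.items():
        if model.startswith(prefix):
            return route
    raise ValueError(f"no route configured for model {model!r}")


def api_key_for(route: Route) -> Optional[str]:
    """Read the key from the gateway's environment, not from each CI job."""
    if route.api_key_env is None:
        return None
    return os.environ[route.api_key_env]


route = resolve_route("llama-3.1-8b-instruct")
print(route.provider)  # prints "self_hosted"
```

CI jobs then only need the gateway's address, and the keys stay in the gateway's environment.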
Job classification: don't run a full eval on every commit
This was the biggest single win, honestly. We split eval jobs into tiers:
- Smoke evals: 5-10 prompts, run on every PR, catches obvious regressions
- Standard evals: 50-100 prompts, run on merge to main
- Full regression: 500+ prompts, run nightly on main
Before this, every PR triggered the full 500-prompt suite because nobody had bothered to think about what they actually needed to know per PR. The answer is "did this change break something obvious?", not "is this model production-ready?"
That change alone cut our GPU-hours by about 40%, before any of the batching work.
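The tiering logic itself can be tiny. Here's a sketch under a few assumptions: the trigger names are made up (use whatever your CI system emits), and `standard_eval` is a label I'm inventing for the middle tier, since the config sketch earlier only names the smoke and full regression job types.

```python
from enum import Enum


class Tier(Enum):
    SMOKE = "smoke_eval"        # 5-10 prompts, every PR
    STANDARD = "standard_eval"  # 50-100 prompts, merge to main (assumed name)
    FULL = "full_regression"    # 500+ prompts, nightly on main


# Hypothetical CI trigger names mapped to the tiers listed above.
TRIGGER_TO_TIER = {
    "pull_request": Tier.SMOKE,
    "merge_to_main": Tier.STANDARD,
    "nightly": Tier.FULL,
}


def classify(trigger: str) -> Tier:
    """Pick the eval tier for a CI trigger, failing loudly on unknown ones."""
    try:
        return TRIGGER_TO_TIER[trigger]
    except KeyError:
        raise ValueError(f"unknown CI trigger: {trigger!r}") from None


print(classify("pull_request").value)  # prints "smoke_eval"
```

Failing on an unknown trigger is a deliberate choice here: a silent fallback to the full suite is exactly how you end up back where you started.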
What the numbers looked like
After about three weeks of running the new setup:
| Metric | Before | After |
|---|---|---|
| GPU-hours per day | 38 | 14 |
| Avg PR feedback time | 4m 20s | 1m 50s |
| Monthly GPU spend (eval only) | ~$8,200 | ~$3,100 |
| Queue p99 wait time | n/a | 8s |
Faster and cheaper, which is the dream combination you almost never get.
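For anyone checking the headline figure, here's the arithmetic on the table above. The `pct_reduction` helper is just an illustrative name, and the feedback times are converted to seconds (4m 20s is 260s, 1m 50s is 110s).

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from before to after."""
    return (before - after) / before * 100


gpu_hours = pct_reduction(38, 14)     # GPU-hours per day
spend = pct_reduction(8200, 3100)     # monthly GPU spend, approximate dollars
feedback = pct_reduction(260, 110)    # average PR feedback time in seconds

print(f"GPU-hours: {gpu_hours:.0f}%")     # prints "GPU-hours: 63%"
print(f"Spend: {spend:.0f}%")             # prints "Spend: 62%"
print(f"Feedback time: {feedback:.0f}%")  # prints "Feedback time: 58%"
```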
Trade-offs and limitations
Nothing's free, so here's what actually hurt:
Cold start on scale-up is still painful. When the queue spikes past what the warm pool can handle, the new runners take 90+ seconds to come online with the model loaded. We mitigated by being more aggressive on the scale_threshold than felt comfortable, which means we're occasionally paying for idle capacity. You can't have both.
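To make the threshold trade-off concrete, here's one possible scaling rule. This is an assumption, not a description of our autoscaler: it asks for one runner per `scale_threshold` queued jobs, clamped between `pool_size_min` and `pool_size_max` from the config. The function name is hypothetical.

```python
import math


def desired_pool_size(queue_depth: int,
                      pool_min: int = 2,
                      pool_max: int = 6,
                      scale_threshold: int = 25) -> int:
    """Assumed rule: one runner per scale_threshold queued jobs,
    never below pool_min and never above pool_max."""
    needed = math.ceil(queue_depth / scale_threshold)
    return max(pool_min, min(pool_max, needed))


print(desired_pool_size(0))     # prints 2 (warm pool floor)
print(desired_pool_size(60))    # prints 3
print(desired_pool_size(4000))  # prints 6 (capped at pool_size_max)
```

Under this rule, lowering `scale_threshold` brings new runners online at a shallower queue, which hides the 90+ second cold start at the cost of more idle capacity.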
Batching adds latency variance. A job that arrives just after a batch fires waits the full max_wait_ms. For CI this is fine. For production inference it might not be, so don't blindly copy this config to your prod inference pipeline.
Pool exhaustion is a real failure mode. If your queue grows faster than you can scale, jobs back up. We had a Friday afternoon where a misconfigured test suite generated 4,000 eval jobs in 10 minutes and the queue depth alert woke me up at 11pm. Add circuit breakers and per-team quotas early, not after the first incident.
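A per-team quota doesn't need to be clever. Here's a minimal sliding-window sketch that would sit in front of the queue. The `TeamQuota` class, its limits, and the team names are all hypothetical, and a real deployment would need shared state rather than an in-process dict.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class TeamQuota:
    """Allow at most max_jobs submissions per team in any rolling window."""

    def __init__(self, max_jobs: int, window_seconds: float):
        self.max_jobs = max_jobs
        self.window_seconds = window_seconds
        self._submitted = defaultdict(deque)  # team -> submission timestamps

    def allow(self, team: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        recent = self._submitted[team]
        # Drop timestamps that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_jobs:
            return False  # over quota: reject instead of enqueueing
        recent.append(now)
        return True


# Usage with explicit timestamps so the behaviour is easy to follow.
quota = TeamQuota(max_jobs=3, window_seconds=600)
results = [quota.allow("team-a", now=0.0) for _ in range(4)]
print(results)  # prints "[True, True, True, False]"
```

With something like this in place, a runaway test suite burns through its own team's quota instead of everyone's queue.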
Model updates are now an event. When you preload models, swapping versions means a rolling restart of the pool. We do this during low-traffic windows but it's added operational overhead that didn't exist with the per-job model.
Further Reading
- SQS as a job queue for ML workloads - the AWS docs are surprisingly readable on this
- vLLM batching internals - if you want to understand continuous batching at the inference layer
- Karpenter for GPU autoscaling - if you're on EKS and want smarter node scaling
- The Cost of LLM Inference - good benchmarks for sizing decisions
- LiteLLM gateway docs - one of the multi-provider routing options
No worries if your setup looks different. The general shape holds: warm pools, batched jobs, classified workloads. Apply where it fits.