Stop paying for idle GPUs in your CI: batching LLM eval jobs

#devops #mlops #llm #sre

TL;DR: Running LLM evaluations on every PR will burn your GPU budget faster than you can blink. We cut our eval spend by about 60% by batching jobs into windowed runs on shared GPU pools, plus a smarter queue that knows the difference between a "smoke test" eval and a full regression run. Here's how, and where the trade-offs hurt.

Right, so a few months back I got pulled into a conversation that's becoming pretty familiar around here. A team had wired up an LLM-based evaluation suite into their CI. Every PR triggered a run against a set of prompts, scored the outputs, and posted results back to the PR. Lovely in theory.

The cloud bill was not lovely.

They were spinning up a g5.xlarge per PR, sometimes three or four in parallel during peak hours, and the GPU sat idle for about 70% of the run because most of the time was spent on cold starts, model loading, and prompt formatting. Classic case of treating GPUs like CPUs.

I reckon a lot of teams are hitting this wall right now. So let's talk about what actually works.

The problem with "GPU per job"

CI runners are designed for stateless, throwaway compute. That model breaks the second you involve a 7B+ parameter model that takes 30-90 seconds to fetch and load into VRAM.

Here's the rough breakdown of a typical eval job we measured:

Phase	Time (avg)	GPU utilization
Cold start (instance boot)	45s	0%
Model download from S3	60s	0%
Model load into VRAM	25s	~10%
Actual inference (50 prompts)	40s	~85%
Result upload + teardown	15s	0%

So out of about 3 minutes of billable GPU time (185 seconds if you add up the table), you're getting 40 seconds of useful work, roughly 22%. That's brutal economics.
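If you want to run the same sums on your own job, here's a minimal sketch. The timings and utilization figures are copied from the table above; the function name and the time-weighted average are my own framing, not something our tooling reports.

```python
# Phase timings from the table above: (name, seconds, approx GPU utilization).
PHASES = [
    ("cold start", 45, 0.00),
    ("model download", 60, 0.00),
    ("model load", 25, 0.10),
    ("inference", 40, 0.85),
    ("upload + teardown", 15, 0.00),
]


def summarize(phases):
    """Return (total seconds, inference share of time, time-weighted GPU util)."""
    total = sum(sec for _, sec, _ in phases)
    inference = next(sec for name, sec, _ in phases if name == "inference")
    weighted = sum(sec * util for _, sec, util in phases) / total
    return total, inference / total, weighted


total_s, useful_fraction, avg_util = summarize(PHASES)
print(f"billable: {total_s}s, useful: {useful_fraction:.0%}, avg GPU util: {avg_util:.0%}")
# prints "billable: 185s, useful: 22%, avg GPU util: 20%"
```

Swap in your own phase timings and the useful share tells you how much of the bill is overhead.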

Batching: the boring fix that works

The trick isn't fancy. You stop spinning up a GPU per job and start treating the GPU like a long-lived service that consumes jobs from a queue.

We run a small pool of g5.xlarge instances (usually 2-4 depending on load, with room to scale to 6) that stay warm. Each runner has the model preloaded in VRAM. CI jobs push eval requests to an SQS queue; runners pull from the queue, batch up to N prompts per inference pass (16 in the config below), and post results back.

Rough sketch of the runner config:

runner:
  instance_type: g5.xlarge
  pool_size_min: 2
  pool_size_max: 6
  scale_metric: queue_depth
  scale_threshold: 25  # jobs in queue

  model:
    name: llama-3.1-8b-instruct
    preload: true
    keep_warm_seconds: 1800

  batching:
    max_batch_size: 16
    max_wait_ms: 2000

  job_types:
    smoke_eval:
      priority: high
      max_prompts: 10
    full_regression:
      priority: low
      max_prompts: 500
      window: nightly_only

That max_wait_ms is doing the heavy lifting. The runner waits up to 2 seconds, or until it has max_batch_size prompts, before firing inference. For CI, 2 seconds of latency is nothing. For inference throughput, it's everything.
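Here's roughly what that gather step looks like, as a sketch rather than our actual runner code. It uses an in-process queue.Queue as a stand-in for SQS, and it assumes the wait window starts when the first request of a batch arrives. The function name collect_batch is made up for illustration.

```python
import queue
import time


def collect_batch(q, max_batch_size=16, max_wait_ms=2000):
    """Block for the first item, then gather more until the batch is full
    or max_wait_ms has elapsed since that first item arrived."""
    batch = [q.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_wait_ms / 1000
    while len(batch) < max_batch_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break  # window closed: fire with whatever we have
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # nothing else arrived before the deadline
    return batch
```

With 20 requests already queued, the first call returns 16 straight away and the second returns the remaining 4 once the wait window expires. A full batch never waits, which is why the 2 seconds only bites when traffic is light.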

Routing matters too

Once you've got a warm pool, you might as well route different model calls through one place. We have eval suites that hit a mix of self-hosted Llama, Claude via API, and OpenAI. Instead of each CI job authenticating separately and managing keys, we put a gateway in front.

There's a bunch of options here. LiteLLM is popular, Bifrost (https://github.com/maximhq/bifrost) is another one that does the same kind of multi-provider routing with rate limit handling, and you can roll your own with a thin FastAPI wrapper if you're feeling keen. The point is you stop scattering API keys across twenty CI configs.
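If you do roll your own, the core of it is just a routing table that owns the credentials. A minimal sketch, assuming model names can be routed by prefix; the Route type, the resolve function, the environment variable names, and the internal Llama URL are all hypothetical, and a real gateway would add the HTTP layer, retries, and rate limiting on top.

```python
import os
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    provider: str
    base_url: str
    key_env: Optional[str]  # env var holding the API key; None for self-hosted


# Hypothetical routing table: model-name prefix -> where the call goes.
ROUTES = {
    "llama-": Route("self_hosted", "http://llama-pool.internal:8000", None),
    "claude-": Route("anthropic", "https://api.anthropic.com", "ANTHROPIC_API_KEY"),
    "gpt-": Route("openai", "https://api.openai.com", "OPENAI_API_KEY"),
}


def resolve(model, env=os.environ):
    """Return (route, api_key) for a model name, reading keys only on the gateway."""
    for prefix, route in ROUTES.items():
        if model.startswith(prefix):
            key = env.get(route.key_env) if route.key_env else None
            if route.key_env and not key:
                raise RuntimeError(f"{route.key_env} is not set on the gateway")
            return route, key
    raise ValueError(f"no route for model {model!r}")
```

CI jobs then only need the gateway's address and a model name; the keys live in one environment instead of twenty.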

Job classification: don't run a full eval on every commit

This was the biggest single win, honestly. We split eval jobs into tiers:

Smoke evals: 5-10 prompts, run on every PR, catches obvious regressions
Standard evals: 50-100 prompts, run on merge to main
Full regression: 500+ prompts, run nightly on main

Before this, every PR triggered the full 500-prompt suite because nobody had bothered to think about what they actually needed to know per PR. The answer is "did this change break something obvious?", not "is this model production-ready?"
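The tiering logic itself is tiny. A sketch of how the three tiers above might map onto CI triggers: the event names (pull_request, push_main, nightly) and the function names are hypothetical, and taking the first N prompts stands in for what should really be a curated subset per tier.

```python
# Tiers from the list above. Event names are hypothetical CI trigger labels.
TIERS = {
    "smoke": {"trigger": "pull_request", "max_prompts": 10},
    "standard": {"trigger": "push_main", "max_prompts": 100},
    "full_regression": {"trigger": "nightly", "max_prompts": None},  # 500+, no cap assumed
}


def select_tier(event):
    """Map a CI trigger to the eval tier that should run for it."""
    for name, tier in TIERS.items():
        if tier["trigger"] == event:
            return name
    raise ValueError(f"no eval tier for event {event!r}")


def pick_prompts(prompts, tier_name):
    """Trim the prompt list to the tier's cap (first N, as a placeholder policy)."""
    cap = TIERS[tier_name]["max_prompts"]
    return prompts if cap is None else prompts[:cap]
```

So a PR run with a 600-prompt suite sends 10 prompts to the queue, a merge sends 100, and only the nightly run sends the lot.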

That change alone cut our GPU-hours by about 40%, before any of the batching work.

What the numbers looked like

After about three weeks of running the new setup:

Metric	Before	After
GPU-hours per day	38	14
Avg PR feedback time	4m 20s	1m 50s
Monthly GPU spend (eval only)	~$8,200	~$3,100
Queue p99 wait time	n/a	8s

Faster and cheaper, which is the dream combination you almost never get.
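For anyone checking the "about 60%" in the TL;DR against the table, here's the arithmetic as a quick sketch. The figures are the ones in the table (feedback times converted to seconds); the helper name is mine.

```python
def reduction(before, after):
    """Fractional drop from before to after."""
    return (before - after) / before


# (before, after) pairs taken from the table above.
ROWS = {
    "GPU-hours per day": (38, 14),
    "PR feedback time (s)": (4 * 60 + 20, 1 * 60 + 50),
    "Monthly GPU spend ($)": (8200, 3100),
}

for name, (before, after) in ROWS.items():
    print(f"{name}: {reduction(before, after):.0%} lower")
# prints:
# GPU-hours per day: 63% lower
# PR feedback time (s): 58% lower
# Monthly GPU spend ($): 62% lower
```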

Trade-offs and limitations

Nothing's free, so here's what actually hurt:

Cold start on scale-up is still painful. When the queue spikes past what the warm pool can handle, the new runners take 90+ seconds to come online with the model loaded. We mitigated by setting scale_threshold lower than felt comfortable, so the pool scales up earlier, which means we're occasionally paying for idle capacity. You can't have both.
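To make the threshold trade-off concrete, here's a sketch of one way a queue-depth scaler could turn the config values into a pool size. This assumes scale_threshold means "one runner per this many queued jobs", which is my reading for illustration and not necessarily how your autoscaler interprets it; the function name is hypothetical.

```python
import math


def desired_pool_size(queue_depth, scale_threshold=25, pool_min=2, pool_max=6):
    """One runner per scale_threshold queued jobs, clamped to the pool bounds."""
    wanted = math.ceil(queue_depth / scale_threshold)
    return max(pool_min, min(pool_max, wanted))


print(desired_pool_size(60))                      # 3
print(desired_pool_size(60, scale_threshold=15))  # 4: lower threshold scales up sooner
print(desired_pool_size(4000))                    # 6: capped at pool_max
```

Dropping the threshold buys you runners before the backlog builds, at the cost of extra warm capacity when the spike turns out to be small.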

Batching adds latency variance. A job that arrives just after a batch fires can wait up to the full max_wait_ms for the next one. For CI this is fine. For production inference it might not be, so don't blindly copy this config to your prod inference pipeline.

Pool exhaustion is a real failure mode. If your queue grows faster than you can scale, jobs back up. We had a Friday afternoon where a misconfigured test suite generated 4,000 eval jobs in 10 minutes and the queue depth alert woke me up at 11pm. Add circuit breakers and per-team quotas early, not after the first incident.
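A per-team quota doesn't need to be clever. Here's a minimal sliding-window sketch that would sit in front of the queue and reject submissions once a team is over its limit. The class name, the limits, and the injectable clock are all illustrative choices, and it keeps state in memory, so a real deployment with multiple submitters would need shared storage.

```python
import time
from collections import defaultdict, deque


class TeamQuota:
    """Allow at most max_jobs submissions per team within a sliding window."""

    def __init__(self, max_jobs, window_seconds, clock=time.monotonic):
        self.max_jobs = max_jobs
        self.window = window_seconds
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self._accepted = defaultdict(deque)  # team -> timestamps of accepted jobs

    def allow(self, team):
        now = self.clock()
        stamps = self._accepted[team]
        # Drop timestamps that have aged out of the window.
        while stamps and now - stamps[0] >= self.window:
            stamps.popleft()
        if len(stamps) >= self.max_jobs:
            return False  # over quota: reject instead of enqueueing
        stamps.append(now)
        return True
```

With something like TeamQuota(max_jobs=200, window_seconds=600) in the submit path, a runaway suite gets rejected after its first 200 jobs instead of flooding the pool, and other teams' jobs keep flowing.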

Model updates are now an event. When you preload models, swapping versions means a rolling restart of the pool. We do this during low-traffic windows but it's added operational overhead that didn't exist with the per-job model.