Idle GPUs burn money too: a Kubernetes operator that scales large models down to zero

#ai #kubernetes #llm #showdev

It's early — come build it with me

Hearth is moving fast and contributions are very welcome — especially validating the Ascend backend on real NPUs, plus the roadmap's P0/P1 items. There are good first issues waiting.

⭐ Star + follow along: github.com/hearth-project/hearth


If you self-host open-source LLMs on Kubernetes, you've hit the same wall I did:

A GPU pinned to a model that gets traffic 3 hours a day still costs you 24 hours a day.
Every serving stack assumes NVIDIA-first, English-first — awkward if you're running Qwen, DeepSeek, or GLM, or deploying on Ascend / domestic chips.
"Just use KServe" means dragging in Knative + Istio to serve one model on one GPU.

🔥 Hearth — a vendor-neutral Kubernetes operator that turns "run Qwen on my private cluster" into a single LLMService manifest, with scale-to-zero built in.

One manifest. Scale-to-zero. Pick your chip.

apiVersion: serving.hearth.dev/v1alpha1
kind: LLMService
metadata:
  name: qwen3-8b
  namespace: ai
spec:
  model:
    source:
      uri: modelscope://Qwen/Qwen3-8B-Instruct
  runtime:
    selector: { vendor: [nvidia, ascend] }   # auto-pick a backend
  resources:
    accelerators: 1
  scaling:
    min: 0        # 👈 scale-to-zero
    max: 3
    metric: queueDepth
    target: 10

$ kubectl apply -f qwen3-8b.yaml
$ kubectl get llmservice -n ai
NAME       PHASE          RUNTIME       REPLICAS   AGE
qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
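
To make the scaling block concrete, here is a rough sketch of how a queue-depth target could map to a replica count. The formula (ceil of queue depth over target, clamped to min and max) is an assumption in the spirit of the Horizontal Pod Autoscaler's target-value math, not something lifted from Hearth's source, and desired_replicas is a hypothetical name.

```python
import math


def desired_replicas(queue_depth: int, target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical helper: replicas needed to keep ~`target` queued requests each.

    Assumption: the controller divides total queue depth by the per-replica
    target and clamps the result to [min_replicas, max_replicas].
    """
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(queue_depth / target)
    return max(min_replicas, min(max_replicas, wanted))


# Values from the manifest above: min 0, max 3, target 10.
print(desired_replicas(0, 10, 0, 3))   # 0  (nothing queued, stay at zero)
print(desired_replicas(1, 10, 0, 3))   # 1  (first request wakes one replica)
print(desired_replicas(25, 10, 0, 3))  # 3  (ceil(2.5) = 3, already at max)
```

With min set to 0, an empty queue yields zero replicas, which matches the ScaledToZero phase in the output above.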

When a request arrives, Hearth's gateway buffers it, scales the model 0 → 1, holds the client connection alive with SSE heartbeats through the cold start, then streams tokens back. Idle again? Back to zero GPUs.
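
The cold-start path above can be sketched as a generator of SSE frames. This is a minimal illustration, not Hearth's gateway code: stream_with_heartbeats, is_ready, tokens, and wait are hypothetical names, and the exact frames Hearth sends are an assumption. What it does rely on is standard SSE behaviour: a line starting with a colon is a comment that clients ignore, so it keeps the connection busy without delivering data.

```python
from typing import Callable, Iterable, Iterator


def stream_with_heartbeats(
    is_ready: Callable[[], bool],
    tokens: Callable[[], Iterable[str]],
    wait: Callable[[], None],
    max_polls: int = 600,
) -> Iterator[str]:
    """Yield SSE comment heartbeats until the backend is ready, then data events.

    Assumptions: `is_ready` polls the model pod, `wait` sleeps between polls,
    and tokens contain no newlines (a newline would need one `data:` line each).
    """
    polls = 0
    while not is_ready():
        if polls >= max_polls:
            # Give up instead of holding the client forever.
            yield "event: error\ndata: cold start timed out\n\n"
            return
        yield ": keepalive\n\n"  # SSE comment line, ignored by clients
        wait()
        polls += 1
    for tok in tokens():
        yield f"data: {tok}\n\n"


# Simulated cold start: the backend reports ready on the third poll.
calls = {"n": 0}

def fake_ready() -> bool:
    calls["n"] += 1
    return calls["n"] > 2

for frame in stream_with_heartbeats(fake_ready, lambda: ["Hello", "world"], lambda: None):
    print(repr(frame))
```

In a real gateway, wait would be something like a short sleep, and the poll budget would be sized to the longest model load you are willing to tolerate.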

The same manifest runs on an Ascend cluster by making vllm-ascend the available runtime — no spec change. That portability is the whole point.
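
Here is one way the vendor selector could resolve to a runtime, as a sketch only. Runtime and pick_runtime are hypothetical names, and treating the vendor list as an ordered preference is an assumption; the source only says the backend is auto-picked.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Runtime:
    """Hypothetical stand-in for an installed InferenceRuntime."""
    name: str
    vendor: str


def pick_runtime(vendors: Sequence[str],
                 available: Sequence[Runtime]) -> Optional[Runtime]:
    """Return the first available runtime whose vendor appears in `vendors`.

    Assumption: earlier entries in `vendors` are preferred over later ones.
    """
    for vendor in vendors:
        for runtime in available:
            if runtime.vendor == vendor:
                return runtime
    return None


selector = ["nvidia", "ascend"]  # same selector as the manifest above

nvidia_cluster = [Runtime("vllm-nvidia", "nvidia")]
ascend_cluster = [Runtime("vllm-ascend", "ascend")]

print(pick_runtime(selector, nvidia_cluster).name)  # vllm-nvidia
print(pick_runtime(selector, ascend_cluster).name)  # vllm-ascend
```

The selector never changes between the two clusters; only the set of installed runtimes does.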

What makes it different

Hearth deliberately does not re-implement the things that already work:

Backends are described declaratively in a cluster-scoped InferenceRuntime (image, args, accelerator resource, probes, metrics). Adding a new chip is a thin adapter, not a rewrite.
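
To show why a declarative runtime keeps adapters thin, here is a sketch that renders a container spec from a small description. Everything in it is illustrative: RuntimeSpec and render_container are hypothetical, the field names are not the real InferenceRuntime schema, the images are placeholders, and the resource names (nvidia.com/gpu, huawei.com/Ascend910) are assumptions based on the usual device-plugin conventions. Metrics are left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class RuntimeSpec:
    """Hypothetical, simplified view of what a runtime declares."""
    image: str
    args: List[str]
    accelerator_resource: str  # extended resource name exposed by the device plugin
    probe_path: str
    port: int


def render_container(rt: RuntimeSpec, accelerators: int) -> Dict[str, Any]:
    """Build a Kubernetes-style container dict from the declarative spec."""
    return {
        "name": "model-server",
        "image": rt.image,
        "args": list(rt.args),
        "resources": {"limits": {rt.accelerator_resource: accelerators}},
        "readinessProbe": {"httpGet": {"path": rt.probe_path, "port": rt.port}},
    }


# Placeholder images; only the data differs between vendors, not the code.
nvidia = RuntimeSpec("example.invalid/vllm-nvidia:dev", ["--port", "8000"],
                     "nvidia.com/gpu", "/health", 8000)
ascend = RuntimeSpec("example.invalid/vllm-ascend:dev", ["--port", "8000"],
                     "huawei.com/Ascend910", "/health", 8000)

for spec in (nvidia, ascend):
    print(render_container(spec, accelerators=1)["resources"])
```

The rendering function is shared; supporting another chip means supplying another RuntimeSpec-like record.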

What's actually working today

I'm being honest about maturity — this is pre-release v0.1.0 (alpha):

✅ NVIDIA backend + the full scale-to-zero path verified end-to-end on real A100s — cold-start keepalive, graceful drain (in-flight streams survive scale-down), model caching/prewarm, 1→N autoscaling, Grafana dashboard.
🧪 Ascend backend is scaffolded and golden-tested (renders correct manifests) — real-NPU validation is the v1 milestone.
⚠️ Not production-ready yet: no auth, no multi-tenancy. It's a strong fit today for internal / dev, latency-tolerant, cost-sensitive serving, where scale-to-zero packs many idle models onto few GPUs.
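
The graceful drain mentioned in the list above (in-flight streams survive scale-down) boils down to: stop admitting new requests, then wait for the active ones to finish or for a deadline to pass. A minimal sketch of that idea follows; DrainTracker is a hypothetical class, not Hearth's implementation.

```python
import threading


class DrainTracker:
    """Hypothetical in-flight counter used to delay scale-down until streams end."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._in_flight = 0
        self._draining = False
        self._idle = threading.Event()
        self._idle.set()  # no requests yet, so we start idle

    def try_start(self) -> bool:
        """Admit a request unless a drain has begun."""
        with self._lock:
            if self._draining:
                return False
            self._in_flight += 1
            self._idle.clear()
            return True

    def finish(self) -> None:
        """Mark one admitted request as complete."""
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.set()

    def drain(self, timeout: float) -> bool:
        """Refuse new work, then wait up to `timeout` seconds for streams to end.

        Returns True if everything finished, False if the deadline passed.
        """
        with self._lock:
            self._draining = True
        return self._idle.wait(timeout)
```

A scale-down would call drain before deleting the pod and only proceed (or force the issue) once it returns.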

Try it in 60 seconds — no GPU required

You can exercise the whole control plane on kind:

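make install      # CRDs into your kube-context
make run          # run the operator (keep it running; use a second terminal for the rest)
kubectl create namespace ai   # namespace targeted by -n below; skip if it already exists
kubectl apply -f config/samples/serving_v1alpha1_inferenceruntime.yaml
kubectl apply -f config/samples/serving_v1alpha1_llmservice.yaml -n ai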
