Hearth: scale-to-zero LLM serving on Kubernetes — and you can hack on it without a GPU

kube-gopher — Sun, 07 Jun 2026 12:59:16 +0000

Repo: github.com/hearth-project/hearth · Apache-2.0 · v0.1.0, alpha.

I've been building Hearth, a Kubernetes operator that serves open-source LLMs (Qwen, DeepSeek, GLM, …) declaratively and scales them to zero when idle. It's at a point where the core works end-to-end on real GPUs, and I'm looking for people to build it with me. The thing I most want you to know up front: you can contribute without owning an accelerator. More on that below.

## The one interesting problem

Self-hosting an LLM on K8s is easy until you notice the GPU is burning money while nobody's using the model. The obvious fix — "scale to zero" — runs straight into a chicken-and-egg problem: a stock HPA can't scale up from zero, because zero replicas means zero metrics, which means it never wakes up.

Hearth puts a small gateway (an OpenAI-compatible reverse proxy) in front of each model. When a request arrives at a scaled-to-zero backend, the gateway accepts it, holds the connection open (SSE keepalive heartbeats so nothing times out), and bumps a pending counter exposed at /hearth/queue. KEDA polls that endpoint, sees pending > 0, and scales the backend 0 → 1. The pod loads weights from a warm cache, becomes Ready, and the gateway forwards the buffered request and streams tokens back. Idle again → KEDA scales it back to 0.
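
The hold-and-count step can be sketched in a few lines. This is a Python illustration written for brevity, not Hearth's gateway code: `ColdStartGate`, its method names, and the `{"pending": n}` payload shape are all hypothetical, and it assumes a request stays counted as pending until its response has finished.

```python
import threading


class ColdStartGate:
    """Parks requests while the backend has no ready replica (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = 0
        self._ready = threading.Event()

    def queue_depth(self):
        # What a handler for /hearth/queue could return to the autoscaler.
        # The payload shape is an assumption.
        with self._lock:
            return {"pending": self._pending}

    def mark_ready(self):
        # Called once the backend pod passes its readiness probe.
        self._ready.set()

    def mark_scaled_to_zero(self):
        # Called when the backend goes back to zero replicas.
        self._ready.clear()

    def forward_when_ready(self, request, send, timeout=None):
        # Count the request first so the autoscaler can see it, then block
        # until the backend is ready and hand the request to `send`.
        with self._lock:
            self._pending += 1
        try:
            if not self._ready.wait(timeout):
                raise TimeoutError("backend did not become ready in time")
            return send(request)
        finally:
            with self._lock:
                self._pending -= 1
```

While a caller is blocked inside `forward_when_ready`, `queue_depth()` reports `{"pending": 1}`, which is the signal an autoscaler polling the endpoint needs in order to go 0 → 1.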

The whole thing is one manifest:

  apiVersion: serving.hearth.dev/v1alpha1
  kind: LLMService
  metadata: { name: qwen3-8b, namespace: ai }
  spec:
    model:
      source: { uri: modelscope://Qwen/Qwen3-8B-Instruct }   # or hf://
    runtime:
      selector: { vendor: [nvidia, ascend] }   # auto-pick a backend, in order
    resources: { accelerators: 1 }
    scaling: { min: 0, max: 3, metric: queueDepth, target: 10 }

  $ kubectl get llmservice -n ai
  NAME       PHASE          RUNTIME       REPLICAS   AGE
  qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
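
To make the `scaling` line concrete, here is a rough sketch of the replica arithmetic in Python. It assumes `target: 10` is read as an average queue depth per replica, which matches the HPA's average-value calculation; how Hearth actually maps these fields onto KEDA is an assumption, and `desired_replicas` is a hypothetical helper.

```python
import math


def desired_replicas(pending, target=10, min_replicas=0, max_replicas=3):
    """Replica count for a given queue depth, clamped to [min, max]."""
    if pending <= 0:
        # Nothing waiting: fall back to the floor (0 means scaled to zero).
        return min_replicas
    # Average-value scaling: enough replicas that each sees <= target items.
    wanted = math.ceil(pending / target)
    # Any pending work needs at least one replica; never exceed the cap.
    return max(min_replicas, min(max_replicas, max(1, wanted)))
```

Under these assumptions a single parked request is enough to wake the model (`desired_replicas(1)` is 1), 25 pending requests ask for 3 replicas, and anything beyond that stays capped at `max: 3`.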

It's deliberately vendor-neutral: backends (NVIDIA-vLLM, vLLM-Ascend, …) are described as data in a cluster-scoped InferenceRuntime CRD — image, args, the device-plugin resource name, probes, metrics paths. Adding a chip is a thin adapter that does K8s-layer adaptation only; it never re-implements vLLM or touches kernels. The same LLMService is meant to run unchanged on NVIDIA or Ascend.
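
A minimal sketch of what "backends as data" plus an ordered `vendor` selector could look like, in Python for brevity. The dictionary fields, `pick_runtime`, and `container_resources` are hypothetical and do not mirror the real InferenceRuntime schema; the resource names are the ones the NVIDIA and Ascend device plugins commonly advertise and are used here only as examples.

```python
# Illustrative runtime descriptions; the real InferenceRuntime CRD carries
# more fields (image, args, probes, metrics paths).
RUNTIMES = [
    {"name": "vllm-nvidia", "vendor": "nvidia", "resource": "nvidia.com/gpu"},
    {"name": "vllm-ascend", "vendor": "ascend", "resource": "huawei.com/Ascend910"},
]


def pick_runtime(vendor_order, runtimes, allocatable):
    """Return the first runtime, in vendor preference order, whose
    accelerator resource the cluster actually advertises."""
    for vendor in vendor_order:
        for runtime in runtimes:
            if runtime["vendor"] == vendor and allocatable.get(runtime["resource"], 0) > 0:
                return runtime
    return None


def container_resources(runtime, accelerators):
    """Translate `resources.accelerators` into a vendor-specific limit."""
    return {"limits": {runtime["resource"]: accelerators}}
```

With `vendor_order=["nvidia", "ascend"]`, a cluster that only advertises the Ascend resource resolves to `vllm-ascend`, and the same `accelerators: 1` becomes a limit on that vendor's resource name without the LLMService changing.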

Hearth deliberately stays in its lane: it's the K8s orchestration/lifecycle layer. The engine is vLLM; scheduling is device-plugins / HAMi / Volcano; datacenter-scale serving is KServe / llm-d. Hearth is the few-GPU, scale-to-zero, private end of that spectrum.

## Why you can contribute without a GPU

This is the part I'm proud of and the reason I'm posting. A vendor-neutral project is useless to contributors if every change needs a rack of hardware. So there's a full no-GPU test path: a CPU vllm-stub that fakes startup delay, streaming, and /metrics, plus a fake extended resource on the node. On a plain kind cluster, with no accelerator, one command —

  make test-scale-e2e

— runs the entire 0 → 1 → N → 0 loop, including cold-start keepalive and graceful drain. A laptop is enough to develop and verify the core behavior.
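
For a feel of what such a CPU stub has to fake, here is a small Python sketch. It is not the project's vllm-stub: `VLLMStub`, the `stub_requests_served_total` metric name, and the echo behaviour are hypothetical. It assumes OpenAI-style streaming chunks (`data: {...}` events ending with `data: [DONE]`).

```python
import json
import time


class VLLMStub:
    """Pretends to be an inference engine: slow to start, then streams."""

    def __init__(self, startup_delay_s, clock=time.monotonic):
        # The clock is injectable so tests can fast-forward the cold start.
        self._clock = clock
        self._ready_at = clock() + startup_delay_s
        self._served = 0

    def ready(self):
        # Backs a readiness probe: false while "weights are loading".
        return self._clock() >= self._ready_at

    def stream(self, prompt):
        # Refuse work before the fake startup delay has elapsed.
        if not self.ready():
            raise RuntimeError("still loading weights")
        self._served += 1
        return self._events(prompt)

    def _events(self, prompt):
        # Echo the prompt back word by word as SSE chunks.
        for word in ("echo:", *prompt.split()):
            chunk = {"choices": [{"delta": {"content": word + " "}}]}
            yield f"data: {json.dumps(chunk)}\n\n"
        yield "data: [DONE]\n\n"

    def metrics(self):
        # Prometheus text exposition format with a made-up counter name.
        return f"stub_requests_served_total {self._served}\n"
```

Because the startup delay is just a timestamp comparison, an end-to-end test can exercise the cold-start path with a delay of a few seconds instead of a real weight load.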

## Honest status

I won't oversell it. As of v0.1.0:

Works, verified end-to-end on real NVIDIA GPUs: multi-backend abstraction, model caching/prewarm, gateway + KEDA scale-to-zero, cold-start keepalive, graceful drain, 1→N autoscaling, Helm install, Grafana dashboard.
Scaffolded + golden-tested, not yet on real hardware: the Ascend backend renders correct manifests but hasn't been validated on real NPUs. This is the big v1 gap, blocked purely on hardware access.
Not there yet: auth, multi-tenancy. It's v1alpha1 and not production-ready — a strong fit today for internal/dev, latency-tolerant, cost-sensitive serving.

## Where I'd love help

Got Ascend (or Cambricon) hardware? Validating the Ascend backend on a real NPU is the single most valuable thing right now.
No special hardware? Grab a good-first-issue (https://github.com/hearth-project/hearth/issues) — the no-GPU path above means you can build, test, and verify locally.
Just curious? Try the kind quickstart, poke holes, open an issue, or ⭐ and follow along.

If any of this resonates, the Welcome issue (#1, https://github.com/hearth-project/hearth/issues/1) is the place to say hi. Thanks for reading.

Your models, your hearth. 🔥

Idle GPUs also burn money — a Kubernetes Operator that can scale large models down to zero

kube-gopher — Thu, 04 Jun 2026 14:30:29 +0000

## It's early — come build it with me

Hearth is moving fast and contributions are very welcome — especially validating the Ascend backend on real NPUs, plus the roadmap's P0/P1 items. There are good first issues waiting.

⭐ Star + follow along: github.com/hearth-project/hearth

Your idle GPUs are burning money. Here's a Kubernetes operator that fixes it

If you self-host open-source LLMs on Kubernetes, you've hit the same wall I did:

A GPU pinned to a model that gets traffic 3 hours a day still costs you 24 hours a day.
Every serving stack assumes NVIDIA-first, English-first — awkward if you're running Qwen, DeepSeek, or GLM, or deploying on Ascend / domestic chips.
"Just use KServe" means dragging in Knative + Istio to serve one model on one GPU.

🔥 Hearth — a vendor-neutral Kubernetes operator that turns "run Qwen on my private cluster" into a single LLMService manifest, with scale-to-zero built in.

## One manifest. Scale-to-zero. Pick your chip.

apiVersion: serving.hearth.dev/v1alpha1
kind: LLMService
metadata:
  name: qwen3-8b
  namespace: ai
spec:
  model:
    source:
      uri: modelscope://Qwen/Qwen3-8B-Instruct
  runtime:
    selector: { vendor: [nvidia, ascend] }   # auto-pick a backend
  resources:
    accelerators: 1
  scaling:
    min: 0        # 👈 scale-to-zero
    max: 3
    metric: queueDepth
    target: 10

$ kubectl apply -f qwen3-8b.yaml
$ kubectl get llmservice -n ai
NAME       PHASE          RUNTIME       REPLICAS   AGE
qwen3-8b   ScaledToZero   vllm-nvidia   0          30s

When a request arrives, Hearth's gateway buffers it, scales the model 0 → 1, holds the client connection alive with SSE heartbeats through the cold start, then streams tokens back. Idle again? Back to zero GPUs.
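
The heartbeat trick relies on a detail of Server-Sent Events: a line starting with `:` is a comment that clients ignore, so it keeps proxies and clients from timing out without polluting the token stream. Below is a minimal Python sketch of that idea; `hold_then_stream`, its parameters, and the in-stream error event are hypothetical, not Hearth's implementation.

```python
import threading


def hold_then_stream(ready, backend_events, heartbeat_s=10.0, max_heartbeats=30):
    """Yield SSE keepalive comments until `ready` is set, then relay
    the backend's events. `ready` is a threading.Event."""
    sent = 0
    # Event.wait returns False on timeout, True once the backend is ready.
    while not ready.wait(heartbeat_s):
        if sent >= max_heartbeats:
            # The response has already started, so a failure has to be
            # reported inside the stream rather than as an HTTP status.
            yield 'event: error\ndata: {"error": "cold start timed out"}\n\n'
            return
        sent += 1
        yield ": keepalive\n\n"  # SSE comment line, ignored by clients
    yield from backend_events()
```

If the backend is already warm, the loop body never runs and the client sees only the backend's events, so the same code path serves both cold and warm requests.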

The same manifest is designed to run on an Ascend cluster by making vllm-ascend the available runtime, with no spec change. That portability is the whole point.

## What makes it different

Hearth deliberately does not re-implement the things that already work:

Backends are described declaratively in a cluster-scoped InferenceRuntime (image, args, accelerator resource, probes, metrics). Adding a new chip is a thin adapter, not a rewrite.

## What's actually working today

I'm being honest about maturity — this is pre-release v0.1.0 (alpha):

✅ NVIDIA backend + the full scale-to-zero path verified end-to-end on real A100s — cold-start keepalive, graceful drain (in-flight streams survive scale-down), model caching/prewarm, 1→N autoscaling, Grafana dashboard.
🧪 Ascend backend is scaffolded and golden-tested (renders correct manifests) — real-NPU validation is the v1 milestone.
⚠️ Not production-ready yet: no auth, no multi-tenancy. It's a strong fit today for internal / dev, latency-tolerant, cost-sensitive serving, where scale-to-zero packs many idle models onto few GPUs.

## Try it in 60 seconds — no GPU required

You can exercise the whole control plane on kind:

make install      # CRDs into your kube-context
make run          # run the operator
kubectl apply -f config/samples/serving_v1alpha1_inferenceruntime.yaml
kubectl apply -f config/samples/serving_v1alpha1_llmservice.yaml -n ai

DEV Community: kube-gopher
