DEV Community

kube-gopher


Hearth: scale-to-zero LLM serving on Kubernetes — and you can hack on it without a GPU

Repo: github.com/hearth-project/hearth · Apache-2.0 · v0.1.0, alpha.

I've been building Hearth, a Kubernetes operator that serves open-source LLMs (Qwen, DeepSeek, GLM, …) declaratively and scales them to zero when idle. It's at a point where the core works end-to-end on real GPUs, and I'm looking for people to build it with me. The thing I most want you to know up front: you can contribute without owning an accelerator. More on that below.

## The one interesting problem

Self-hosting an LLM on K8s is easy until you notice the GPU is burning money while nobody's using the model. The obvious fix — "scale to zero" — runs straight into a chicken-and-egg problem: a stock HPA can't scale up from zero, because zero replicas means zero metrics, which means it never wakes up.
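The dead end also shows up in arithmetic. The Kubernetes documentation gives the HPA's core ratio as `ceil(currentReplicas * currentMetric / targetMetric)`. The sketch below only evaluates that ratio; the function name is illustrative, and it ignores tolerances, stabilization windows, and the fact that at zero replicas there may be no metric to read at all.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Evaluate the documented HPA ratio, with no tolerance or stabilization logic."""
    return math.ceil(current_replicas * (current_metric / target_metric))


# With pods running, load above target scales the workload up.
print(hpa_desired_replicas(2, 30.0, 10.0))  # 6

# Starting from zero, the product is zero whatever the demand looks like.
print(hpa_desired_replicas(0, 30.0, 10.0))  # 0
```

Something outside that loop has to notice demand and perform the first 0 → 1 step.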

Hearth puts a small gateway (an OpenAI-compatible reverse proxy) in front of each model. When a request arrives at a scaled-to-zero backend, the gateway accepts it, holds the connection open (SSE keepalive heartbeats so nothing times out), and bumps a pending counter exposed at /hearth/queue. KEDA polls that endpoint, sees pending > 0, and scales the backend 0 → 1. The pod loads weights from a warm cache, becomes Ready, and the gateway forwards the buffered request and streams tokens back. Idle again → KEDA scales it back to 0.
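The hold-and-forward step can be sketched in a few lines. This is a simplified model, not the gateway's actual code: `PendingCounter` and `hold_and_forward` are hypothetical names, the counter is assumed to count only requests still waiting for a backend, and the heartbeat is assumed to be an SSE comment line (a line starting with `:`), which SSE clients ignore.

```python
import threading
from typing import Callable, Iterable, Iterator


class PendingCounter:
    """Thread-safe count of requests waiting for a backend (what a queue endpoint could report)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._n = 0

    def inc(self) -> None:
        with self._lock:
            self._n += 1

    def dec(self) -> None:
        with self._lock:
            self._n -= 1

    def value(self) -> int:
        with self._lock:
            return self._n


def hold_and_forward(
    pending: PendingCounter,
    backend_ready: Callable[[], bool],
    forward: Callable[[], Iterable[str]],
    wait: Callable[[], None],
) -> Iterator[str]:
    """Yield heartbeats while the backend is cold, then relay its chunks as SSE data events."""
    pending.inc()
    try:
        while not backend_ready():
            yield ": keepalive\n\n"  # SSE comment: keeps the connection alive, carries no data
            wait()
    finally:
        # Runs when the backend becomes ready and also if the client disconnects early.
        pending.dec()
    for chunk in forward():
        yield f"data: {chunk}\n\n"


# Tiny simulation: the backend reports ready on the third readiness check.
checks = {"n": 0}


def ready_on_third_check() -> bool:
    checks["n"] += 1
    return checks["n"] >= 3


counter = PendingCounter()
for event in hold_and_forward(counter, ready_on_third_check, lambda: ["Hel", "lo"], lambda: None):
    print(repr(event))
```

In a real gateway `wait` would sleep for the heartbeat interval and `backend_ready` would check pod readiness; both are injected here so the flow can be exercised without a cluster.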

The whole thing is one manifest:

  apiVersion: serving.hearth.dev/v1alpha1
  kind: LLMService
  metadata: { name: qwen3-8b, namespace: ai }
  spec:
    model:
      source: { uri: modelscope://Qwen/Qwen3-8B-Instruct }   # or hf://
    runtime:
      selector: { vendor: [nvidia, ascend] }   # auto-pick a backend, in order
    resources: { accelerators: 1 }
    scaling: { min: 0, max: 3, metric: queueDepth, target: 10 }

  $ kubectl get llmservice -n ai
  NAME       PHASE          RUNTIME       REPLICAS   AGE
  qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
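To make the `scaling` line concrete, here is a minimal sketch of how `min`, `max`, and `target` could combine with a queue depth. The rounding and clamping are an assumption modelled on the usual KEDA/HPA average-value arithmetic, and `replicas_for_queue` is a hypothetical helper, not Hearth code.

```python
import math


def replicas_for_queue(queue_depth: int, target: int, min_replicas: int, max_replicas: int) -> int:
    """Map a queue depth to a replica count, clamped to [min_replicas, max_replicas]."""
    if queue_depth <= 0:
        return min_replicas  # idle: fall back to the floor, which may be zero
    wanted = math.ceil(queue_depth / target)
    # Any pending work needs at least one replica, even when the floor is zero.
    return max(min_replicas, min(max_replicas, max(1, wanted)))


# Values from the manifest above: min 0, max 3, target 10.
for depth in (0, 1, 10, 11, 25, 100):
    print(depth, replicas_for_queue(depth, target=10, min_replicas=0, max_replicas=3))
```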

It's deliberately vendor-neutral: backends (NVIDIA-vLLM, vLLM-Ascend, …) are described as data in a cluster-scoped InferenceRuntime CRD — image, args, the device-plugin resource name, probes, metrics paths. Adding a chip is a thin adapter that does K8s-layer adaptation only; it never re-implements vLLM or touches kernels. The same LLMService is meant to run unchanged on NVIDIA or Ascend.
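A rough illustration of "backends as data": the sketch below treats each runtime as a plain record and derives a container spec from it. Every field name, the placeholder images, the `vllm-ascend` runtime name, and the rule that a vendor is picked only when its device resource is allocatable are assumptions for illustration; the real InferenceRuntime schema may differ.

```python
from typing import Optional

# Hypothetical stand-ins for InferenceRuntime objects: data only, no vendor logic.
RUNTIMES = {
    "vllm-nvidia": {
        "vendor": "nvidia",
        "image": "registry.example/vllm-nvidia:tag",  # placeholder image
        "args": ["serve", "{model_path}"],
        "resourceName": "nvidia.com/gpu",
    },
    "vllm-ascend": {
        "vendor": "ascend",
        "image": "registry.example/vllm-ascend:tag",  # placeholder image
        "args": ["serve", "{model_path}"],
        "resourceName": "huawei.com/Ascend910",
    },
}


def pick_runtime(vendors: list[str], allocatable: dict[str, int]) -> Optional[str]:
    """Return the first runtime, in selector order, whose device resource the cluster offers."""
    for vendor in vendors:
        for name, runtime in RUNTIMES.items():
            if runtime["vendor"] == vendor and allocatable.get(runtime["resourceName"], 0) > 0:
                return name
    return None


def render_container(runtime_name: str, model_path: str, accelerators: int) -> dict:
    """Build a container spec purely from the runtime record plus per-service values."""
    runtime = RUNTIMES[runtime_name]
    return {
        "name": "engine",
        "image": runtime["image"],
        "args": [arg.format(model_path=model_path) for arg in runtime["args"]],
        "resources": {"limits": {runtime["resourceName"]: str(accelerators)}},
    }


# A cluster that only exposes Ascend devices falls through to the second vendor.
chosen = pick_runtime(["nvidia", "ascend"], {"huawei.com/Ascend910": 8})
print(chosen, render_container(chosen, "/models/qwen3-8b", 1))
```

The point of the shape is that supporting another chip means adding a record, not a code path.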

Hearth deliberately stays in its lane: it's the K8s orchestration/lifecycle layer. The engine is vLLM; scheduling is device-plugins / HAMi / Volcano; datacenter-scale serving is KServe / llm-d. Hearth is the few-GPU, scale-to-zero, private end of that spectrum.

## Why you can contribute without a GPU

This is the part I'm proud of and the reason I'm posting. A vendor-neutral project is useless to contributors if every change needs a rack of hardware. So there's a full no-GPU test path: a CPU vllm-stub that fakes startup delay, streaming, and /metrics, plus a fake extended resource on the node. On a plain kind cluster, with no accelerator, one command —

make test-scale-e2e

— runs the entire 0 → 1 → N → 0 loop, including cold-start keepalive and graceful drain. A laptop is enough to develop and verify the core behavior.
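For a feel of what a CPU stub involves, here is a small standard-library sketch that fakes the three behaviors mentioned: a startup delay before the health check passes, an SSE-framed completion response, and a `/metrics` endpoint. It is not the project's vllm-stub. The `make_stub` helper, the metric name, and the payload shape are made up for illustration, and the response is written in one go rather than trickled out token by token.

```python
import json
import time
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from typing import Callable


def make_stub(
    startup_delay_s: float,
    tokens: list[str],
    clock: Callable[[], float] = time.monotonic,
) -> ThreadingHTTPServer:
    """Build an HTTP server that pretends to be a slow-starting inference engine."""
    ready_at = clock() + startup_delay_s
    served = {"requests": 0}

    class Handler(BaseHTTPRequestHandler):
        def log_message(self, *args) -> None:
            pass  # keep test output quiet

        def _send(self, code: int, body: str, ctype: str = "text/plain") -> None:
            data = body.encode()
            self.send_response(code)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(data)))
            self.end_headers()
            self.wfile.write(data)

        def do_GET(self) -> None:
            if self.path == "/health":
                ok = clock() >= ready_at  # 503 until the fake weight load finishes
                self._send(200 if ok else 503, "ok" if ok else "loading")
            elif self.path == "/metrics":
                self._send(200, f"stub_requests_total {served['requests']}\n")
            else:
                self._send(404, "not found")

        def do_POST(self) -> None:
            self.rfile.read(int(self.headers.get("Content-Length", 0)))
            if self.path != "/v1/chat/completions":
                self._send(404, "not found")
                return
            served["requests"] += 1
            events = ["data: " + json.dumps({"token": t}) + "\n\n" for t in tokens]
            self._send(200, "".join(events) + "data: [DONE]\n\n", "text/event-stream")

    # Port 0 asks the OS for any free port.
    return ThreadingHTTPServer(("127.0.0.1", 0), Handler)


server = make_stub(startup_delay_s=2.0, tokens=["Hello", " world"])
print("stub bound to port", server.server_address[1])
server.server_close()
```

Calling `serve_forever()` on the returned server runs it; the injectable `clock` lets a test move past the startup delay without sleeping.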

## Honest status

I won't oversell it. As of v0.1.0:

  • Works, verified end-to-end on real NVIDIA GPUs: multi-backend abstraction, model caching/prewarm, gateway + KEDA scale-to-zero, cold-start keepalive, graceful drain, 1→N autoscaling, Helm install, Grafana dashboard.
  • Scaffolded + golden-tested, not yet on real hardware: the Ascend backend renders correct manifests but hasn't been validated on real NPUs. This is the big v1 gap, blocked purely on hardware access.
  • Not there yet: auth, multi-tenancy. It's v1alpha1 and not production-ready — a strong fit today for internal/dev, latency-tolerant, cost-sensitive serving.

## Where I'd love help

  • Got Ascend (or Cambricon) hardware? Validating the Ascend backend on a real NPU is the single most valuable thing right now.
  • No special hardware? Grab a good-first-issue (https://github.com/hearth-project/hearth/issues) — the no-GPU path above means you can build, test, and verify locally.
  • Just curious? Try the kind quickstart, poke holes, open an issue, or ⭐ and follow along.

If any of this resonates, the Welcome issue (#1) (https://github.com/hearth-project/hearth/issues/1) is the place to say hi. Thanks for reading.

Your models, your hearth. 🔥

Top comments (1)

kube-gopher

Discussions and contributions from everyone are very welcome.