It's early — come build it with me
Hearth is moving fast and contributions are very welcome — especially validating the Ascend backend on real NPUs, plus the roadmap's P0/P1 items. There are good first issues waiting.
⭐ Star + follow along: github.com/hearth-project/hearth
Your idle GPUs are burning money. Here's a Kubernetes operator that fixes it
If you self-host open-source LLMs on Kubernetes, you've hit the same wall I did:
- A GPU pinned to a model that gets traffic 3 hours a day still costs you 24 hours a day.
- Every serving stack assumes NVIDIA-first, English-first — awkward if you're running Qwen, DeepSeek, or GLM, or deploying on Ascend / domestic chips.
- "Just use KServe" means dragging in Knative + Istio to serve one model on one GPU.
🔥 Hearth — a vendor-neutral Kubernetes operator that turns "run Qwen on my private cluster" into a single LLMService manifest, with scale-to-zero built in.
One manifest. Scale-to-zero. Pick your chip.
apiVersion: serving.hearth.dev/v1alpha1
kind: LLMService
metadata:
  name: qwen3-8b
  namespace: ai
spec:
  model:
    source:
      uri: modelscope://Qwen/Qwen3-8B-Instruct
  runtime:
    selector: { vendor: [nvidia, ascend] }   # auto-pick a backend
  resources:
    accelerators: 1
  scaling:
    min: 0                                   # 👈 scale-to-zero
    max: 3
    metric: queueDepth
    target: 10
$ kubectl apply -f qwen3-8b.yaml
$ kubectl get llmservice -n ai
NAME       PHASE          RUNTIME       REPLICAS   AGE
qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
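To make the scaling block in the manifest concrete, here is a rough Python sketch of how a queueDepth target could translate into a replica count. It uses the common target-tracking rule (ceiling of queue depth over target, clamped to min and max). That rule, the reading of target as "queued requests per replica", and the function name desired_replicas are all assumptions for illustration; Hearth's real controller may compute this differently.

```python
import math


def desired_replicas(queue_depth: int, target: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Hypothetical target-tracking rule, not Hearth's actual code.

    Ask for enough replicas that each one holds roughly `target`
    queued requests, then clamp to the configured bounds.
    """
    if target <= 0:
        raise ValueError("target must be positive")
    wanted = math.ceil(queue_depth / target)
    return max(min_replicas, min(max_replicas, wanted))


# Values taken from the manifest: min=0, max=3, target=10.
for depth in (0, 1, 10, 11, 45):
    print(depth, desired_replicas(depth, target=10,
                                  min_replicas=0, max_replicas=3))
```

With min set to 0, an empty queue maps to zero replicas, which is the scale-to-zero case; a single queued request is enough to ask for one replica again.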
When a request arrives, Hearth's gateway buffers it, scales the model 0 → 1, holds the client connection alive with SSE heartbeats through the cold start, then streams tokens back. Idle again? Back to zero GPUs.
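The cold-start flow can be sketched in a few lines of Python. This is an illustration of the idea, not Hearth's gateway code: the function name stream_with_keepalive, its parameters, and the heartbeat limit are hypothetical. The one part that comes from the SSE format itself is that a line starting with a colon is a comment, which clients ignore, so it works as a heartbeat.

```python
from typing import Callable, Iterable, Iterator


def stream_with_keepalive(
    is_ready: Callable[[], bool],
    tokens: Callable[[], Iterable[str]],
    wait: Callable[[], None],
    max_heartbeats: int = 600,
) -> Iterator[str]:
    """Yield SSE frames: heartbeats while the model is cold, then tokens.

    Hypothetical sketch. `is_ready` polls the backend, `wait` pauses
    between polls (for example time.sleep(1) in a real gateway), and
    `tokens` produces the model output once a replica is up.
    """
    for _ in range(max_heartbeats):
        if is_ready():
            break
        yield ": keepalive\n\n"  # SSE comment frame, ignored by clients
        wait()
    else:
        # Loop ended without the backend ever reporting ready.
        yield "event: error\ndata: cold start timed out\n\n"
        return
    for tok in tokens():
        # Assumes tokens contain no newlines; multi-line payloads
        # would need one "data:" line per line of text.
        yield f"data: {tok}\n\n"


# Simulated cold start: not ready twice, then ready.
readiness = iter([False, False, True])
frames = list(stream_with_keepalive(
    is_ready=lambda: next(readiness),
    tokens=lambda: ["Hello", "world"],
    wait=lambda: None,
))
print(frames)
```

The client sees a stream that opens immediately and stays alive, so proxies and HTTP clients with idle timeouts do not drop the connection while the GPU pod is still starting.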
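The same manifest is meant to run unchanged on an Ascend cluster: make vllm-ascend the available runtime and the spec stays exactly as it is (real-NPU validation is still pending, see the maturity notes below). That portability is the whole point.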
What makes it different
Hearth deliberately does not re-implement the things that already work:
Backends are described declaratively in a cluster-scoped InferenceRuntime (image, args, accelerator resource, probes, metrics). Adding a new chip is a thin adapter, not a rewrite.
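As a rough mental model of how a vendor selector could resolve against the runtimes installed in a cluster, here is a small Python sketch. The Runtime class, the pick_runtime function, and the rule "first vendor in the selector with an available runtime wins" are assumptions made for illustration; the actual matching logic in Hearth may differ.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Runtime:
    """Hypothetical stand-in for an InferenceRuntime object."""
    name: str    # e.g. "vllm-nvidia"
    vendor: str  # e.g. "nvidia"


def pick_runtime(vendors: Sequence[str],
                 available: Sequence[Runtime]) -> Optional[Runtime]:
    """Return the first available runtime matching the selector.

    Vendors are tried in selector order, so the order is treated
    as a preference (an assumption of this sketch).
    """
    for vendor in vendors:
        for runtime in available:
            if runtime.vendor == vendor:
                return runtime
    return None


selector = ["nvidia", "ascend"]  # from the LLMService manifest

nvidia_cluster = [Runtime("vllm-nvidia", "nvidia")]
ascend_cluster = [Runtime("vllm-ascend", "ascend")]

print(pick_runtime(selector, nvidia_cluster))
print(pick_runtime(selector, ascend_cluster))
print(pick_runtime(selector, []))
```

The same selector resolves to a different backend depending only on which runtimes the cluster offers, which is why the LLMService spec itself does not need to change between clusters.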
What's actually working today
I'm being honest about maturity — this is pre-release v0.1.0 (alpha):
- ✅ NVIDIA backend + the full scale-to-zero path verified end-to-end on real A100s — cold-start keepalive, graceful drain (in-flight streams survive scale-down), model caching/prewarm, 1→N autoscaling, Grafana dashboard.
- 🧪 Ascend backend is scaffolded and golden-tested (renders correct manifests) — real-NPU validation is the v1 milestone.
- ⚠️ Not production-ready yet: no auth, no multi-tenancy. It's a strong fit today for internal / dev, latency-tolerant, cost-sensitive serving, where scale-to-zero packs many idle models onto few GPUs.
Try it in 60 seconds — no GPU required
You can exercise the whole control plane on kind:
make install # CRDs into your kube-context
make run # run the operator
kubectl apply -f config/samples/serving_v1alpha1_inferenceruntime.yaml
kubectl apply -f config/samples/serving_v1alpha1_llmservice.yaml -n ai
