<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kube-gopher</title>
    <description>The latest articles on DEV Community by kube-gopher (@kubegopher).</description>
    <link>https://dev.to/kubegopher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3967235%2F05843d56-4399-4d43-ac79-00620cc98d5a.jpeg</url>
      <title>DEV Community: kube-gopher</title>
      <link>https://dev.to/kubegopher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kubegopher"/>
    <language>en</language>
    <item>
      <title>Hearth: scale-to-zero LLM serving on Kubernetes — and you can hack on it without a GPU</title>
      <dc:creator>kube-gopher</dc:creator>
      <pubDate>Sun, 07 Jun 2026 12:59:16 +0000</pubDate>
      <link>https://dev.to/kubegopher/hearth-scale-to-zero-llm-serving-on-kubernetes-and-you-can-hack-on-it-without-a-gpu-bn2</link>
      <guid>https://dev.to/kubegopher/hearth-scale-to-zero-llm-serving-on-kubernetes-and-you-can-hack-on-it-without-a-gpu-bn2</guid>
      <description>&lt;p&gt;&lt;em&gt;Repo: &lt;a href="https://github.com/hearth-project/hearth" rel="noopener noreferrer"&gt;github.com/hearth-project/hearth&lt;/a&gt; · Apache-2.0 · &lt;code&gt;v0.1.0&lt;/code&gt;, alpha.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I've been building &lt;strong&gt;Hearth&lt;/strong&gt;, a Kubernetes operator that serves open-source LLMs (Qwen, DeepSeek, GLM, …) declaratively and &lt;strong&gt;scales them to zero when idle&lt;/strong&gt;. It's at a point where the core works end-to-end on real GPUs, and I'm looking for people to build it with me. The thing I most want you to know up front: &lt;strong&gt;you can contribute without owning an accelerator.&lt;/strong&gt; More on that below.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The one interesting problem&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Self-hosting an LLM on K8s is easy until you notice the GPU is burning money while nobody's using the model. The obvious fix — "scale to zero" — runs straight into a chicken-and-egg problem: a stock HPA can't scale &lt;em&gt;up from zero&lt;/em&gt;, because zero replicas means zero metrics, which means it never wakes up.&lt;/p&gt;

&lt;p&gt;Hearth puts a small &lt;strong&gt;gateway&lt;/strong&gt; (an OpenAI-compatible reverse proxy) in front of each model. When a request arrives at a scaled-to-zero backend, the gateway accepts it, holds the connection open (SSE keepalive heartbeats so nothing times out), and bumps a &lt;code&gt;pending&lt;/code&gt; counter exposed at &lt;code&gt;/hearth/queue&lt;/code&gt;. &lt;strong&gt;KEDA&lt;/strong&gt; polls that endpoint, sees &lt;code&gt;pending &amp;gt; 0&lt;/code&gt;, and scales the backend &lt;code&gt;0 → 1&lt;/code&gt;. The pod loads weights from a warm cache, becomes Ready, and the gateway forwards the buffered request and streams tokens back. Idle again → KEDA scales it back to &lt;code&gt;0&lt;/code&gt;.&lt;/p&gt;
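
&lt;p&gt;To make the KEDA half of that loop concrete, here is a rough, hand-written sketch of a &lt;code&gt;ScaledObject&lt;/code&gt; that would give similar behavior using KEDA's &lt;code&gt;metrics-api&lt;/code&gt; scaler. It is an illustration, not what the operator actually renders: the Deployment name, the gateway Service URL and port, the polling interval, and the assumption that &lt;code&gt;/hearth/queue&lt;/code&gt; returns JSON with a top-level &lt;code&gt;pending&lt;/code&gt; field are all hypothetical.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: names, URL, port and JSON shape are assumptions, not Hearth output.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: qwen3-8b            # hypothetical name
  namespace: ai
spec:
  scaleTargetRef:
    name: qwen3-8b          # hypothetical backend Deployment
  minReplicaCount: 0        # allow scale-to-zero
  maxReplicaCount: 3
  pollingInterval: 5        # seconds between polls (assumed value)
  cooldownPeriod: 300       # idle seconds before returning to 0 (KEDA default)
  triggers:
    - type: metrics-api
      metadata:
        # hypothetical in-cluster address of the gateway
        url: "http://qwen3-8b-gateway.ai.svc.cluster.local:8080/hearth/queue"
        format: "json"
        valueLocation: "pending"       # assumes a body like {"pending": 2}
        targetValue: "10"              # pending requests per replica
        activationTargetValue: "0"     # wake from zero once pending exceeds 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With values like these, KEDA activates the workload as soon as the pending count rises above zero, and the HPA it manages then aims for roughly 10 pending requests per replica, capped at 3 replicas.&lt;/p&gt;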

&lt;p&gt;The whole thing is one manifest:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;  &lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.hearth.dev/v1alpha1&lt;/span&gt;
  &lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLMService&lt;/span&gt;
  &lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;qwen3-8b&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;ai&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;modelscope&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;&lt;span class="nv"&gt;//Qwen/Qwen3-8B-Instruct&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# or hf://&lt;/span&gt;
    &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;vendor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;nvidia&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ascend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# auto-pick a backend, in order&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;accelerators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;1&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;scaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;0&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;3&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;queueDepth&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nv"&gt;10&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;

  &lt;span class="s"&gt;$ kubectl get llmservice -n ai&lt;/span&gt;
  &lt;span class="s"&gt;NAME       PHASE          RUNTIME       REPLICAS   AGE&lt;/span&gt;
  &lt;span class="s"&gt;qwen3-8b   ScaledToZero   vllm-nvidia   0          30s&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It's deliberately vendor-neutral: backends (NVIDIA-vLLM, vLLM-Ascend, …) are described as data in a cluster-scoped InferenceRuntime CRD — image, args, the device-plugin resource name, probes, metrics paths. Adding a chip is a thin adapter that does K8s-layer adaptation only; it never re-implements vLLM or touches kernels. The same LLMService is meant to run unchanged on NVIDIA or Ascend.&lt;/p&gt;

&lt;p&gt;Hearth deliberately stays in its lane: it's the K8s orchestration/lifecycle layer. The engine is vLLM; scheduling is device-plugins / HAMi / Volcano; datacenter-scale serving is KServe / llm-d. Hearth is the few-GPU, scale-to-zero, private end of that spectrum.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why you can contribute without a GPU&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the part I'm proud of and the reason I'm posting. A vendor-neutral project is useless to contributors if every change needs a rack of hardware. So there's a full no-GPU test path: a CPU vllm-stub that fakes startup delay, streaming, and /metrics, plus a fake extended resource on the node. On a plain kind cluster, with no accelerator, one command —&lt;/p&gt;

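&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make test-scale-e2e
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;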

&lt;p&gt;— runs the entire 0 → 1 → N → 0 loop, including cold-start keepalive and graceful drain. A laptop is enough to develop and verify the core behavior.&lt;/p&gt;
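
&lt;p&gt;The fake extended resource is what lets a CPU-only pod go through the same scheduling path as a real accelerator pod. As a sketch only, with a made-up resource name and image (neither is taken from the Hearth repo), a stub pod would request it the same way it would request a device-plugin resource:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Sketch only: the image and resource name below are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-stub-example
  namespace: ai
spec:
  containers:
    - name: vllm-stub
      image: example.invalid/vllm-stub:dev     # hypothetical CPU-only stub image
      resources:
        limits:
          example.com/fake-accelerator: "1"    # hypothetical extended resource
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Kubernetes schedules a pod like this only onto a node whose &lt;code&gt;status.capacity&lt;/code&gt; advertises that resource, which on a kind node can be done by patching the node status instead of running a device plugin.&lt;/p&gt;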

&lt;p&gt;&lt;strong&gt;Honest status&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I won't oversell it. As of v0.1.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Works, verified end-to-end on real NVIDIA GPUs: multi-backend abstraction, model caching/prewarm, gateway + KEDA scale-to-zero, cold-start keepalive, graceful drain, 1→N autoscaling, Helm install, Grafana dashboard.&lt;/li&gt;
&lt;li&gt;Scaffolded + golden-tested, not yet on real hardware: the Ascend backend renders correct manifests but hasn't been validated on real NPUs. This is the big v1 gap, blocked purely on hardware access.&lt;/li&gt;
&lt;li&gt;Not there yet: auth, multi-tenancy. It's v1alpha1 and not production-ready — a strong fit today for internal/dev, latency-tolerant, cost-sensitive serving.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Where I'd love help&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Got Ascend (or Cambricon) hardware? Validating the Ascend backend on a real NPU is the single most valuable thing right now.&lt;/li&gt;
&lt;li&gt;No special hardware? Grab a &lt;a href="https://github.com/hearth-project/hearth/issues" rel="noopener noreferrer"&gt;good-first-issue&lt;/a&gt;: the no-GPU path above means you can build, test, and verify locally.&lt;/li&gt;
&lt;li&gt;Just curious? Try the kind quickstart, poke holes, open an issue, or ⭐ and follow along.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of this resonates, the &lt;a href="https://github.com/hearth-project/hearth/issues/1" rel="noopener noreferrer"&gt;Welcome issue (#1)&lt;/a&gt; is the place to say hi. Thanks for reading.&lt;/p&gt;

&lt;p&gt;Your models, your hearth. 🔥&lt;/p&gt;

</description>
      <category>kubernetes</category>
      <category>llm</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>Idle GPUs also burn money — a Kubernetes Operator that can scale large models down to zero</title>
      <dc:creator>kube-gopher</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:30:29 +0000</pubDate>
      <link>https://dev.to/kubegopher/idle-gpus-also-burn-money-a-kubernetes-operator-that-can-scale-large-models-down-to-zero-ofa</link>
      <guid>https://dev.to/kubegopher/idle-gpus-also-burn-money-a-kubernetes-operator-that-can-scale-large-models-down-to-zero-ofa</guid>
      <description>&lt;p&gt;&lt;strong&gt;It's early — come build it with me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hearth is moving fast and contributions are very welcome — especially validating the Ascend backend on real NPUs, plus the roadmap's P0/P1 items. There are good first issues waiting.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star + follow along&lt;/strong&gt;: &lt;a href="https://github.com/hearth-project/hearth" rel="noopener noreferrer"&gt;github.com/hearth-project/hearth&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your idle GPUs are burning money. Here's a Kubernetes operator that fixes it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you self-host open-source LLMs on Kubernetes, you've hit the same wall I did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GPU pinned to a model that gets traffic 3 hours a day still costs you 24 hours a day.&lt;/li&gt;
&lt;li&gt;Most serving stacks assume NVIDIA-first, English-first, which is awkward if you're running Qwen, DeepSeek, or GLM, or deploying on Ascend or other Chinese domestic chips.&lt;/li&gt;
&lt;li&gt;"Just use KServe" means dragging in Knative + Istio to serve one model on one GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔥 Hearth — a vendor-neutral Kubernetes operator that turns "run Qwen on my private cluster" into a single LLMService manifest, with scale-to-zero built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One manifest. Scale-to-zero. Pick your chip.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.hearth.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLMService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3-8b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;modelscope://Qwen/Qwen3-8B-Instruct&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;vendor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;nvidia&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ascend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# auto-pick a backend&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;scaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;        &lt;span class="c1"&gt;# 👈 scale-to-zero&lt;/span&gt;
    &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;queueDepth&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; qwen3-8b.yaml
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get llmservice &lt;span class="nt"&gt;-n&lt;/span&gt; ai
&lt;span class="go"&gt;NAME       PHASE          RUNTIME       REPLICAS   AGE
qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a request arrives, Hearth's gateway buffers it, scales the model 0 → 1, holds the client connection alive with SSE heartbeats through the cold start, then streams tokens back. Idle again? Back to zero GPUs.&lt;/p&gt;

&lt;p&gt;The same manifest runs on an Ascend cluster by making vllm-ascend the available runtime — no spec change. That portability is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes it different&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hearth deliberately does not re-implement the things that already work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhhkqvsi31m5gdb3mmkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhhkqvsi31m5gdb3mmkp.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backends are described declaratively in a cluster-scoped InferenceRuntime (image, args, accelerator resource, probes, metrics). Adding a new chip is a thin adapter, not a rewrite.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's actually working today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm being honest about maturity — this is pre-release v0.1.0 (alpha):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ NVIDIA backend + the full scale-to-zero path verified end-to-end on real A100s — cold-start keepalive, graceful drain (in-flight streams survive scale-down), model caching/prewarm, 1→N autoscaling, Grafana dashboard.&lt;/li&gt;
&lt;li&gt;🧪 Ascend backend is scaffolded and golden-tested (renders correct manifests) — real-NPU validation is the v1 milestone.&lt;/li&gt;
&lt;li&gt;⚠️ Not production-ready yet: no auth, no multi-tenancy. It's a strong fit today for internal / dev, latency-tolerant, cost-sensitive serving, since scale-to-zero packs many idle models onto few GPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Try it in 60 seconds — no GPU required&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can exercise the whole control plane on kind:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install&lt;/span&gt;      &lt;span class="c"&gt;# CRDs into your kube-context&lt;/span&gt;
make run          &lt;span class="c"&gt;# run the operator&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/samples/serving_v1alpha1_inferenceruntime.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/samples/serving_v1alpha1_llmservice.yaml &lt;span class="nt"&gt;-n&lt;/span&gt; ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
