<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: kube-gopher</title>
    <description>The latest articles on DEV Community by kube-gopher (@kubegopher).</description>
    <link>https://dev.to/kubegopher</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3967235%2F05843d56-4399-4d43-ac79-00620cc98d5a.jpeg</url>
      <title>DEV Community: kube-gopher</title>
      <link>https://dev.to/kubegopher</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kubegopher"/>
    <language>en</language>
    <item>
      <title>Idle GPUs also burn money — a Kubernetes Operator that can scale large models down to zero</title>
      <dc:creator>kube-gopher</dc:creator>
      <pubDate>Thu, 04 Jun 2026 14:30:29 +0000</pubDate>
      <link>https://dev.to/kubegopher/idle-gpus-also-burn-money-a-kubernetes-operator-that-can-scale-large-models-down-to-zero-ofa</link>
      <guid>https://dev.to/kubegopher/idle-gpus-also-burn-money-a-kubernetes-operator-that-can-scale-large-models-down-to-zero-ofa</guid>
      <description>&lt;p&gt;&lt;strong&gt;It's early — come build it with me&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hearth is moving fast and contributions are very welcome — especially validating the Ascend backend on real NPUs, plus the roadmap's P0/P1 items. There are good first issues waiting.&lt;/p&gt;

&lt;p&gt;⭐ &lt;strong&gt;Star + follow along&lt;/strong&gt;: github.com/hearth-project/hearth&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Your idle GPUs are burning money. Here's a Kubernetes operator that fixes it&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If you self-host open-source LLMs on Kubernetes, you've hit the same wall I did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A GPU pinned to a model that gets traffic 3 hours a day still costs you 24 hours a day: that is 12.5% utilization, with the other 87.5% of the bill paying for idle time.&lt;/li&gt;
&lt;li&gt;Most serving stacks assume NVIDIA-first, English-first, which is awkward if you're running Qwen, DeepSeek, or GLM, or deploying on Ascend or other Chinese domestic chips.&lt;/li&gt;
&lt;li&gt;"Just use KServe" means dragging in Knative + Istio to serve one model on one GPU.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;🔥 Hearth is a vendor-neutral Kubernetes operator that turns "run Qwen on my private cluster" into a single LLMService manifest, with scale-to-zero built in.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One manifest. Scale-to-zero. Pick your chip.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;serving.hearth.dev/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLMService&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;qwen3-8b&lt;/span&gt;
  &lt;span class="na"&gt;namespace&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ai&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;uri&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;modelscope://Qwen/Qwen3-8B-Instruct&lt;/span&gt;
  &lt;span class="na"&gt;runtime&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;selector&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;vendor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;nvidia&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;ascend&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;   &lt;span class="c1"&gt;# auto-pick a backend&lt;/span&gt;
  &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;accelerators&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
  &lt;span class="na"&gt;scaling&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;0&lt;/span&gt;        &lt;span class="c1"&gt;# 👈 scale-to-zero&lt;/span&gt;
    &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;3&lt;/span&gt;
    &lt;span class="na"&gt;metric&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;queueDepth&lt;/span&gt;
    &lt;span class="na"&gt;target&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;10&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; qwen3-8b.yaml
&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;kubectl get llmservice &lt;span class="nt"&gt;-n&lt;/span&gt; ai
&lt;span class="go"&gt;NAME       PHASE          RUNTIME       REPLICAS   AGE
qwen3-8b   ScaledToZero   vllm-nvidia   0          30s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a request arrives, Hearth's gateway buffers it, scales the model 0 → 1, keeps the client connection alive with SSE heartbeats through the cold start, then streams tokens back. Idle again? Back to zero GPUs.&lt;/p&gt;

&lt;p&gt;The same manifest targets an Ascend cluster when vllm-ascend is the available runtime, with no spec change (real-NPU validation is still pending). That portability is the whole point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What makes it different&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Hearth deliberately does not re-implement the things that already work:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhhkqvsi31m5gdb3mmkp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkhhkqvsi31m5gdb3mmkp.png" alt=" "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Backends are described declaratively in a cluster-scoped InferenceRuntime (image, args, accelerator resource, probes, metrics). Adding a new chip is a thin adapter, not a rewrite.&lt;/p&gt;
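
&lt;p&gt;To make that concrete, here is a rough sketch of what an NVIDIA InferenceRuntime could look like. Only the API group, the kind, the vllm-nvidia name, and the five categories (image, args, accelerator resource, probes, metrics) come from this post; every field name under spec is a guess for illustration, not Hearth's actual schema. The image, port, and paths are vLLM's usual defaults, and nvidia.com/gpu is the standard NVIDIA device-plugin resource. The real sample lives in config/samples/serving_v1alpha1_inferenceruntime.yaml.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Hypothetical sketch: field names under spec are assumptions, not the real CRD schema.
apiVersion: serving.hearth.dev/v1alpha1
kind: InferenceRuntime
metadata:
  name: vllm-nvidia              # cluster-scoped, so no namespace
spec:
  vendor: nvidia                 # assumed to be what runtime.selector.vendor matches on
  image: vllm/vllm-openai:latest # pin a specific tag in practice
  args:
    - --port=8000
  acceleratorResource: nvidia.com/gpu
  probes:
    readiness:
      path: /health              # vLLM's health endpoint
      port: 8000
  metrics:
    path: /metrics               # vLLM's Prometheus endpoint
    port: 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Under this shape, an Ascend adapter would be a second object of the same kind with a different image and accelerator resource, which is why the LLMService itself would not need to change.&lt;/p&gt;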

&lt;p&gt;&lt;strong&gt;What's actually working today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm being honest about maturity — this is pre-release v0.1.0 (alpha):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ NVIDIA backend + the full scale-to-zero path verified end-to-end on real A100s — cold-start keepalive, graceful drain (in-flight streams survive scale-down), model caching/prewarm, 1→N autoscaling, Grafana dashboard.&lt;/li&gt;
&lt;li&gt;🧪 Ascend backend is scaffolded and golden-tested (renders correct manifests) — real-NPU validation is the v1 milestone.&lt;/li&gt;
&lt;li&gt;⚠️ Not production-ready yet: no auth, no multi-tenancy. It's a strong fit today for internal / dev, latency-tolerant, cost-sensitive serving: scale-to-zero packs many idle models onto few GPUs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Try it in 60 seconds — no GPU required&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can exercise the whole control plane on kind:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make &lt;span class="nb"&gt;install&lt;/span&gt;      &lt;span class="c"&gt;# CRDs into your kube-context&lt;/span&gt;
make run          &lt;span class="c"&gt;# run the operator&lt;/span&gt;
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/samples/serving_v1alpha1_inferenceruntime.yaml
kubectl apply &lt;span class="nt"&gt;-f&lt;/span&gt; config/samples/serving_v1alpha1_llmservice.yaml &lt;span class="nt"&gt;-n&lt;/span&gt; ai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>ai</category>
      <category>kubernetes</category>
      <category>llm</category>
      <category>showdev</category>
    </item>
  </channel>
</rss>
