<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Yash Sharma</title>
    <description>The latest articles on DEV Community by Yash Sharma (@yashsharam_f).</description>
    <link>https://dev.to/yashsharam_f</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3913493%2F5a992fae-1fda-41cd-972c-c8f79ee7a84b.jpg</url>
      <title>DEV Community: Yash Sharma</title>
      <link>https://dev.to/yashsharam_f</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yashsharam_f"/>
    <language>en</language>
    <item>
      <title>We Got 2x LLM Inference Speed With Three Kubernetes Settings</title>
      <dc:creator>Yash Sharma</dc:creator>
      <pubDate>Tue, 19 May 2026 09:48:44 +0000</pubDate>
      <link>https://dev.to/yashsharam_f/we-got-2x-llm-inference-speed-with-three-kubernetes-settings-3l3n</link>
      <guid>https://dev.to/yashsharam_f/we-got-2x-llm-inference-speed-with-three-kubernetes-settings-3l3n</guid>
      <description>&lt;p&gt;Serving LLMs is not easy, especially at scale. We have to optimise the infrastructure to make sure we can actually serve inference to our customers.&lt;/p&gt;

&lt;p&gt;And when you start scaling LLM inference on Kubernetes, two problems quietly show up and cost you real money. The first one is where you put the model weights, because these files are huge, and every pod needs them. &lt;/p&gt;

&lt;p&gt;The second one is how fast your nodes can actually read those weights off shared storage, because if the network isn't tuned right, you leave a lot of throughput on the table.&lt;/p&gt;

&lt;p&gt;In today's video, I'll walk you through how we solved both at DigitalOcean using a reference architecture we built: vLLM on DOKS, with Managed NFS for shared model storage. The whole thing is open source: Terraform, Kubernetes manifests, everything. Link in the description.&lt;/p&gt;

&lt;p&gt;Let's go.&lt;/p&gt;

&lt;h2&gt;
  
  
  SECTION 1: ARCHITECTURE OVERVIEW
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vftt2u1p23wzt9uqto7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vftt2u1p23wzt9uqto7.png" alt=" " width="800" height="718"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me start with the full picture, so the rest of the video makes sense.&lt;/p&gt;

&lt;p&gt;Everything lives inside one VPC. Inside that VPC, there's a DOKS cluster with two node pools. The first is a management pool of regular droplets, which run system services and the model download job. The second is a GPU pool of H100 droplets, which is where vLLM actually runs.&lt;/p&gt;

&lt;p&gt;Next to the cluster, there's a Managed NFS share. The download job writes the model weights to it once. Every vLLM pod mounts it and reads from it. That's the whole storage story in one sentence.&lt;/p&gt;

&lt;p&gt;The whole thing is deployed as two Terraform stacks. Stack one builds the infrastructure: VPC, cluster, NFS. Stack two deploys everything inside Kubernetes: the namespace, the persistent volume, the download job, vLLM, the gateway. They're split on purpose, so you can redeploy vLLM, swap models, or change replica counts without touching the underlying infrastructure.&lt;/p&gt;

&lt;p&gt;Now let me explain the two decisions that matter most: why NFS, and the network tuning piece.&lt;/p&gt;

&lt;h2&gt;
  
  
  SECTION 2: WHY NFS FOR MODEL STORAGE
&lt;/h2&gt;

&lt;p&gt;When you scale LLM inference, every pod needs the model weights. And these files are huge: a 70B model is around 140 gigs.&lt;/p&gt;

&lt;p&gt;There are usually three approaches people try.&lt;/p&gt;

&lt;p&gt;The first is to download the weights from object storage on pod startup. So every pod, on every restart, pulls the model from something like S3 or Spaces. For a 140-gig model, that's 15 to 20 minutes on a cold node. Every autoscale event, you pay that cost again.&lt;/p&gt;

&lt;p&gt;The second is to bake the weights into the container image. Now your image is 140 gigs. Slow to pull, painful to update, and switching models means rebuilding the image every time.&lt;/p&gt;

&lt;p&gt;The third is block storage. This works fine for one pod, but block volumes are ReadWriteOnce — the moment you want multiple replicas, you're stuck.&lt;/p&gt;

&lt;p&gt;What we actually want is "download once, use many": one copy of the weights, and every pod reads from it. That's exactly what NFS gives us. Specifically, DigitalOcean Managed NFS, which supports ReadWriteMany and is available in the same regions as our H100 droplets (NYC2 and ATL1 right now).&lt;/p&gt;

&lt;p&gt;The flow is simple. A Kubernetes Job runs once and pulls the model from HuggingFace onto the NFS share. Every vLLM pod mounts that share and reads from it. Scaling from one replica to three takes 20 to 30 seconds, because the weights are already there: no re-downloading, no waiting.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai62cjdgwclaau3gmewn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fai62cjdgwclaau3gmewn.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A quick note on vLLM itself, since it's the serving layer. vLLM is an open-source inference engine, and it exposes an OpenAI-compatible API, the same &lt;code&gt;/v1/chat/completions&lt;/code&gt; endpoint you're used to. For our architecture, vLLM just needs a folder with model files in it. It doesn't care that the folder is actually an NFS mount. That's why this setup works so cleanly.&lt;/p&gt;

&lt;h2&gt;
  
  
  SECTION 3: THE NETWORK TUNER
&lt;/h2&gt;

&lt;p&gt;Okay, now the part that quietly matters the most.&lt;/p&gt;

&lt;p&gt;First, let's understand what MTU means. The Maximum Transmission Unit (MTU) is the maximum size of a network packet that can be sent over a network interface without fragmentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg02f0nc3vkiz1wo39vkq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg02f0nc3vkiz1wo39vkq.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By default, DOKS nodes use a 1500-byte MTU and pretty small TCP buffer sizes. For most workloads, that's totally fine. But for NFS reads of multi-gigabyte weight files, it leaves a lot of throughput on the table, and we want to use all of it.&lt;/p&gt;

&lt;p&gt;We benchmarked it. With default settings, we got around 420 MB/s loading model weights from NFS. With tuning applied, around 880 MB/s. That's roughly 2x faster — same hardware, same NFS share, same model.&lt;/p&gt;

&lt;p&gt;Three changes get us there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The first is jumbo frames&lt;/strong&gt;. We bump the MTU on the private network from 1500 bytes up to 9000. Reading a 140-gigabyte model from NFS takes about 93 million packets at an MTU of 1500; at an MTU of 9000, it takes about 15.6 million.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl693pbny1vsxn7m67zo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzl693pbny1vsxn7m67zo.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bigger packets mean fewer packets, and so less total per-packet overhead during large transfers. One thing to note here: this only works on GPU droplets. Standard droplets don't support jumbo frames on the private network.&lt;/p&gt;
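
&lt;p&gt;If you want to check those packet counts yourself, here's a rough sketch in Python. It assumes decimal gigabytes and ignores IP and TCP header overhead, so treat the output as a ballpark rather than a measurement. The function name is just for illustration.&lt;/p&gt;

```python
def packets_needed(total_bytes, mtu_bytes):
    """Rough packet count: total bytes divided by the MTU, rounded up.

    Header overhead is ignored, so real counts are slightly higher.
    """
    return -(-total_bytes // mtu_bytes)  # integer ceiling division


MODEL_BYTES = 140 * 10**9  # 140 GB, assuming decimal gigabytes

standard = packets_needed(MODEL_BYTES, 1500)
jumbo = packets_needed(MODEL_BYTES, 9000)

print(f"MTU 1500: {standard / 1e6:.1f} million packets")
print(f"MTU 9000: {jumbo / 1e6:.1f} million packets")
print(f"reduction: {standard / jumbo:.1f}x")
```

&lt;p&gt;The ratio is simply 9000 divided by 1500, so the packet count drops by about six times no matter how big the model is.&lt;/p&gt;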

&lt;p&gt;&lt;strong&gt;The second is bigger TCP buffers.&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;rmem_max, wmem_max, tcp_rmem, tcp_wmem
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We raise all of these to 16 megabytes. This lets the kernel actually use the bandwidth that's available during high-throughput NFS reads.&lt;/p&gt;
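
&lt;p&gt;To make that concrete, here's a small Python sketch that renders the four settings as sysctl lines. A few assumptions on my side: "16 megabytes" is taken as 16 MiB, the full key names are the standard Linux ones (&lt;code&gt;net.core.*&lt;/code&gt; and &lt;code&gt;net.ipv4.*&lt;/code&gt;), and the min and default values in the three-value &lt;code&gt;tcp_rmem&lt;/code&gt; and &lt;code&gt;tcp_wmem&lt;/code&gt; settings are placeholders, not the values from the reference architecture.&lt;/p&gt;

```python
SIXTEEN_MIB = 16 * 1024 * 1024  # 16777216 bytes, assuming MiB


def tcp_buffer_sysctls(max_bytes=SIXTEEN_MIB, min_bytes=4096, default_bytes=87380):
    """Return sysctl key/value pairs that raise the TCP buffer ceilings.

    tcp_rmem and tcp_wmem each take three values (min, default, max).
    Only the max comes from the text; min and default are assumed.
    """
    triple = f"{min_bytes} {default_bytes} {max_bytes}"
    return {
        "net.core.rmem_max": str(max_bytes),
        "net.core.wmem_max": str(max_bytes),
        "net.ipv4.tcp_rmem": triple,
        "net.ipv4.tcp_wmem": triple,
    }


def render_sysctl_conf(settings):
    """Format the settings the way a sysctl.conf file expects them."""
    return "\n".join(f"{key} = {value}" for key, value in settings.items())


print(render_sysctl_conf(tcp_buffer_sysctls()))
```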

&lt;p&gt;&lt;strong&gt;The third is on the NFS mount itself&lt;/strong&gt; — nconnect=8. Instead of opening one TCP connection per mount, each pod opens eight. More connections, more aggregate throughput.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu7mubn8soqi3fy14yqo.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwu7mubn8soqi3fy14yqo.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All of this tuning is applied by a DaemonSet when a node freshly joins the cluster.&lt;/p&gt;

&lt;p&gt;Now here's the tricky part, and this is what I was hinting at in the intro.&lt;/p&gt;

&lt;p&gt;When a fresh GPU node joins the cluster, two things want to run on it right away: the network tuner and the vLLM pod. If the vLLM pod wins that race and mounts NFS first, TCP negotiates its maximum segment size based on the old 1500-byte MTU. And here's the thing: that number never changes after the handshake. So even if the tuner runs five seconds later and raises the MTU to 9000, that specific NFS mount is locked at degraded speed for its entire lifetime. The only way to fix it is to remount.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vihszsjljqkijlbk2eq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8vihszsjljqkijlbk2eq.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The fix is a node taint.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When a GPU node joins the cluster, it comes up with a taint called network-not-tuned.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;node.digitalocean.com/network-not-tuned:NoSchedule
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That taint blocks every workload pod from being scheduled on the node. But the network tuner DaemonSet is built to tolerate it, so it schedules right away. It tunes the network (sets jumbo frames, raises the TCP buffers) and then removes the taint. Only after that do vLLM pods get to schedule.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkhy3e7fjzs4v711exy5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkhy3e7fjzs4v711exy5.gif" alt=" " width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the guarantee is simple: if a vLLM pod is running on a node, the network on that node was already tuned before the pod ever touched NFS.&lt;/p&gt;

&lt;p&gt;This matters most with autoscaling. Every time the cluster brings up a new GPU node, it goes through this exact sequence, which is how we avoid the race condition.&lt;/p&gt;
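
&lt;p&gt;Here's a toy model of that scheduling gate in Python. It is a simplified version of how Kubernetes matches tolerations against taints, not the scheduler's real code, and the pod definitions are hypothetical. The only thing taken from the setup above is the taint key.&lt;/p&gt;

```python
TAINT_KEY = "node.digitalocean.com/network-not-tuned"


def tolerates(toleration, taint):
    """Simplified toleration matching: key (and value) and effect must agree."""
    if toleration.get("operator", "Equal") == "Exists":
        key_ok = toleration.get("key") in (None, taint["key"])
    else:
        key_ok = (
            toleration.get("key") == taint["key"]
            and toleration.get("value") == taint.get("value")
        )
    # A toleration with no effect matches every effect.
    effect_ok = toleration.get("effect") in (None, taint["effect"])
    return key_ok and effect_ok


def can_schedule(pod_tolerations, node_taints):
    """A pod fits only if every NoSchedule taint on the node is tolerated."""
    return all(
        any(tolerates(t, taint) for t in pod_tolerations)
        for taint in node_taints
        if taint["effect"] == "NoSchedule"
    )


fresh_node = [{"key": TAINT_KEY, "effect": "NoSchedule"}]
tuned_node = []  # the tuner has removed the taint

tuner_pod = [{"key": TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}]
vllm_pod = []  # no toleration for the tuning taint

print("tuner on fresh node:", can_schedule(tuner_pod, fresh_node))
print("vLLM on fresh node: ", can_schedule(vllm_pod, fresh_node))
print("vLLM on tuned node: ", can_schedule(vllm_pod, tuned_node))
```

&lt;p&gt;The ordering falls out of the scheduler itself, so nothing in the vLLM pod has to wait or retry.&lt;/p&gt;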

&lt;h2&gt;
  
  
  SECTION 4: HA CHOICES — AND WHY
&lt;/h2&gt;

&lt;p&gt;Before the demo, a few production choices worth explaining quickly, because the reasoning matters more than the config itself.&lt;/p&gt;

&lt;p&gt;For rolling updates, we use maxSurge: 0, maxUnavailable: 1. The default behaviour is to spin up an extra pod during a rollout, but on GPU-constrained clusters, you often don't have a spare H100 sitting around. So we'd rather accept one moment of reduced capacity than block the rollout waiting for a GPU that isn't coming.&lt;/p&gt;

&lt;p&gt;The startup probe is set to 120 seconds, because loading a 70B model from NFS into VRAM takes real time: 220 seconds in our case. Default health probes would kill the pod before it ever finished loading.&lt;/p&gt;
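
&lt;p&gt;A quick way to see the trade-off is to compute the pod-count limits a Deployment keeps during a rollout. This is a sketch with a hypothetical three-replica deployment, assuming one GPU per pod.&lt;/p&gt;

```python
def rollout_bounds(replicas, max_surge, max_unavailable):
    """Pod-count limits a Deployment respects during a rolling update."""
    return {
        "min_available": replicas - max_unavailable,
        "max_total": replicas + max_surge,
    }


# The setting described above: never more pods than replicas.
print(rollout_bounds(3, max_surge=0, max_unavailable=1))

# A surge-style rollout: needs one extra GPU for the duration.
print(rollout_bounds(3, max_surge=1, max_unavailable=0))
```

&lt;p&gt;With maxSurge at 0, the total never goes above the replica count, so the rollout never asks for a GPU you don't already have.&lt;/p&gt;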

&lt;p&gt;There's also a PreStop hook that drains in-flight requests before the pod terminates. vLLM batches requests on the GPU, so if you just send SIGTERM, those requests get dropped. The hook polls vLLM's metrics endpoint, waits until the queue is empty, then shuts down cleanly.&lt;/p&gt;
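
&lt;p&gt;The drain logic can be sketched in a few lines of Python. This is not the hook from the repo, just an illustration of the polling idea. The metric names are an assumption about what vLLM exposes in Prometheus text format, and the metrics fetcher is injected so the example runs without a live server.&lt;/p&gt;

```python
import time

# Assumed gauge names; check them against your vLLM version.
QUEUE_GAUGES = ("vllm:num_requests_running", "vllm:num_requests_waiting")


def parse_gauge(metrics_text, name):
    """Sum every sample of one metric in Prometheus text format."""
    total = 0.0
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        sample, _, value = line.rpartition(" ")
        if sample == name or sample.startswith(name + "{"):
            total += float(value)
    return total


def wait_until_drained(fetch_metrics, names=QUEUE_GAUGES, poll_seconds=1.0,
                       timeout_seconds=60.0, sleep=time.sleep):
    """Poll until all listed gauges read zero; return False on timeout."""
    waited = 0.0
    while True:
        text = fetch_metrics()
        if sum(parse_gauge(text, n) for n in names) == 0:
            return True
        if waited >= timeout_seconds:
            return False
        sleep(poll_seconds)
        waited += poll_seconds


# Fake metrics: busy on the first poll, idle on the second.
snapshots = iter([
    'vllm:num_requests_running{model_name="demo"} 2.0\n'
    'vllm:num_requests_waiting{model_name="demo"} 1.0',
    'vllm:num_requests_running{model_name="demo"} 0.0\n'
    'vllm:num_requests_waiting{model_name="demo"} 0.0',
])
print(wait_until_drained(lambda: next(snapshots), sleep=lambda s: None))
```

&lt;p&gt;In a real hook, the fetcher would be an HTTP GET against the pod's metrics endpoint, and the timeout should stay below the pod's termination grace period.&lt;/p&gt;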

&lt;h2&gt;
  
  
  SECTION 5: DEMO
&lt;/h2&gt;

&lt;p&gt;For a quick demo, check out the video at &lt;/p&gt;

&lt;h2&gt;
  
  
  SECTION 6: WRAP
&lt;/h2&gt;

&lt;p&gt;That's the architecture. Managed NFS for shared weights, a taint-plus-DaemonSet pattern to make sure the network is tuned before any NFS mount happens, and a few HA choices that make sense specifically for GPU-constrained clusters.&lt;/p&gt;

&lt;p&gt;The full reference architecture is open source in the &lt;a href="https://github.com/digitalocean/scale-with-simplicity/tree/main/reference-architectures/vllm-nfs" rel="noopener noreferrer"&gt;scale-with-simplicity repo&lt;/a&gt;. See you in the next one.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>kubernetes</category>
    </item>
  </channel>
</rss>
