<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Devashish</title>
    <description>The latest articles on DEV Community by Devashish (@ric03uec).</description>
    <link>https://dev.to/ric03uec</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F104592%2F5802344f-b876-4875-8d65-d8226b7c8598.jpeg</url>
      <title>DEV Community: Devashish</title>
      <link>https://dev.to/ric03uec</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/ric03uec"/>
    <language>en</language>
    <item>
      <title>Two Qwen3 Models on One DGX Spark: The Residency Math for Local LLM Coding</title>
      <dc:creator>Devashish</dc:creator>
      <pubDate>Tue, 16 Jun 2026 20:23:10 +0000</pubDate>
      <link>https://dev.to/ric03uec/two-qwen3-models-on-one-dgx-spark-the-residency-math-for-local-llm-coding-5bpj</link>
      <guid>https://dev.to/ric03uec/two-qwen3-models-on-one-dgx-spark-the-residency-math-for-local-llm-coding-5bpj</guid>
      <description>&lt;p&gt;My agent stack with Hermes runs on a workstation. The models run on a DGX Spark on the same LAN. The split is deliberate: the workstation stays responsive, the Spark does the GPU work, and they talk over an HTTP proxy.&lt;/p&gt;

&lt;p&gt;Since I started managing the agent fleet through &lt;a href="//github.com/ric03uec/clawrium/"&gt;Clawrium&lt;/a&gt;, the Hermes count has climbed. More agents on more hosts, more concurrent traffic, all hitting the same Spark. What was a one-laptop, one-model setup is now a small fleet against a single backend — and the shape of the load is exactly what a single-model server can't serve.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febx9lrmv2zjy9b7mlx8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Febx9lrmv2zjy9b7mlx8x.png" alt="Fleet management using Clawrium" width="799" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Spark served models through ollama for months. It worked. One model up, single config, easy to bring down.&lt;/p&gt;

&lt;p&gt;But ollama owns the card. There's no per-process memory budget, no &lt;code&gt;gpu_memory_utilization&lt;/code&gt; knob, no straightforward way to keep a heavy model for reasoning and a fast model for quick turns coresident. KV cache management is whatever the underlying llama.cpp backend gives you. PagedAttention isn't there.&lt;/p&gt;

&lt;p&gt;vLLM fixes all of that.&lt;/p&gt;

&lt;p&gt;PagedAttention allocates the KV cache in fixed-size blocks and reclaims them as sequences finish, instead of pinning one contiguous region per sequence.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;gpu_memory_utilization&lt;/code&gt; gives you a per-container budget.&lt;/p&gt;

&lt;p&gt;One Spark (GB10, 119.67 GiB unified memory) can run multiple vLLM containers behind a LiteLLM proxy on &lt;code&gt;:4000&lt;/code&gt;, and Hermes hits one URL to route to either model. The promise: serve Qwen3-Next-80B-Instruct-FP8 for the heavy work and Qwen3-4B-Instruct-2507 for fast turns, coresident, both reachable from a single endpoint.&lt;/p&gt;

&lt;p&gt;That's the why. What follows is what it took to make the promise hold.&lt;/p&gt;

&lt;p&gt;Spark hardware will happily hold two Qwen3 models if the numbers line up. They didn't, for several days. That's where my last weekend went.&lt;/p&gt;

&lt;h2&gt;Attempt one: trust the target&lt;/h2&gt;

&lt;p&gt;First 80B config: &lt;code&gt;gpu_memory_utilization: 0.75&lt;/code&gt;, &lt;code&gt;max_model_len: 65536&lt;/code&gt;, &lt;code&gt;max_num_seqs: 4&lt;/code&gt;. vLLM's KV cache init crashed with &lt;em&gt;"No available memory for the cache blocks."&lt;/em&gt; Qwen3-Next is a hybrid whose layers are mostly linear attention (Gated DeltaNet), which vLLM manages through its Mamba-style state cache; the per-block page alignment pushes KV pool demand higher than the ~14 GiB left over after weights.&lt;/p&gt;

&lt;p&gt;Bumped to 0.85. Now the free-memory check crashed: &lt;em&gt;"Free memory on device (98.51/119.67 GiB) is less than desired GPU memory utilization (0.85, 101.72 GiB)."&lt;/em&gt; The 4B was already resident at ~16 GiB. The 80B's 0.85 target was computed against the whole card, not against what was free.&lt;/p&gt;

&lt;p&gt;That's the first lesson. &lt;code&gt;gpu_memory_utilization&lt;/code&gt; is a fraction of total GPU memory, not free memory.&lt;/p&gt;

&lt;p&gt;Two co-resident vLLM processes need their fractions to sum below ~0.95 to leave room for CUDA framework overhead. If your math assumes the fraction applies to free memory, you'll oscillate between OOMs and silent KV starvation.&lt;/p&gt;
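
&lt;p&gt;The arithmetic behind that error message can be sketched in a few lines. This is a simplified model of the check as the log line describes it, not vLLM's actual code; the function names are made up for illustration, and the fraction pairs at the end are just example inputs.&lt;/p&gt;

```python
TOTAL_GIB = 119.67  # GB10 unified memory, as reported in the error message


def requested_gib(fraction, total_gib=TOTAL_GIB):
    """Memory a process asks for: a fraction of TOTAL memory, not free memory."""
    return fraction * total_gib


def passes_startup_check(fraction, free_gib, total_gib=TOTAL_GIB):
    """Simplified model of the logged check: free memory must cover the request."""
    return free_gib >= requested_gib(fraction, total_gib)


def fractions_fit(fractions, ceiling=0.95):
    """Co-resident fractions should sum to no more than the ~0.95 ceiling."""
    return ceiling >= sum(fractions)


print(round(requested_gib(0.85), 2))      # 101.72
print(passes_startup_check(0.85, 98.51))  # False
print(passes_startup_check(0.80, 98.51))  # True
print(fractions_fit([0.80, 0.10]))        # True
print(fractions_fit([0.85, 0.12]))        # False
```

&lt;p&gt;The point of writing it down is that the free-memory figure only appears on one side of the comparison. Raising the fraction raises the request even when nothing else on the card has changed.&lt;/p&gt;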

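&lt;p&gt;Settled at 0.80 / 32k / 2 for the 80B (&lt;code&gt;gpu_memory_utilization&lt;/code&gt; / &lt;code&gt;max_model_len&lt;/code&gt; / &lt;code&gt;max_num_seqs&lt;/code&gt;). Loaded clean. KV pool ~20.8 GiB after weights.&lt;/p&gt;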

&lt;h2&gt;Attempt two: point Hermes at it&lt;/h2&gt;

&lt;p&gt;Then Hermes came online and tool calls came back as plain text. &lt;code&gt;&amp;lt;tool_call&amp;gt;&lt;/code&gt; JSON sitting inside &lt;code&gt;content&lt;/code&gt;. &lt;code&gt;tool_calls: []&lt;/code&gt;. &lt;code&gt;finish_reason: stop&lt;/code&gt;. Hermes never executed it.&lt;/p&gt;

&lt;p&gt;A day of parser triage produced nothing actionable. Both &lt;code&gt;hermes_tool_parser.py&lt;/code&gt; and &lt;code&gt;qwen3xml_tool_parser.py&lt;/code&gt; look for &lt;code&gt;&amp;lt;tool_call&amp;gt;&lt;/code&gt; (singular). The &lt;code&gt;&amp;lt;tools&amp;gt;&lt;/code&gt; plural tag is the system-prompt definition, not the output. The parser wasn't wrong. The model wasn't emitting.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;tool_choice: "required"&lt;/code&gt; worked. &lt;code&gt;tool_choice: "auto"&lt;/code&gt; came back empty: &lt;code&gt;tool_calls: []&lt;/code&gt;, &lt;code&gt;content: ""&lt;/code&gt;, 619 characters of reasoning inside &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; concluding &lt;em&gt;"Alright, that's it"&lt;/em&gt; without emitting the call.&lt;/p&gt;

&lt;p&gt;Qwen's own model card states it plainly: Qwen3-Next-80B-Thinking supports only thinking mode. &lt;code&gt;enable_thinking: false&lt;/code&gt; is a structural no-op on this checkpoint. &lt;code&gt;/no_think&lt;/code&gt; in the prompt is ignored. The model reasons inside &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;, decides, and never emits.&lt;/p&gt;

&lt;p&gt;That's an unrecoverable failure for any agent SDK that defaults to &lt;code&gt;tool_choice: "auto"&lt;/code&gt;. The fix wasn't a parser flag. It was swapping the whole 80B backbone from Thinking to Instruct.&lt;/p&gt;

&lt;p&gt;77 GiB pre-pull. Drain GPU. Bring up with &lt;code&gt;--enable-auto-tool-choice --tool-call-parser hermes&lt;/code&gt;, no &lt;code&gt;--reasoning-parser&lt;/code&gt;. Three LiteLLM aliases (writer / reviewer / sources) all passed &lt;code&gt;tool_choice: "auto"&lt;/code&gt; cleanly with &lt;code&gt;finish_reason: tool_calls&lt;/code&gt;. Trade accepted: reviewer loses native &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; traces. Reasoning moved into the prompt.&lt;/p&gt;
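
&lt;p&gt;A quick way to tell the two behaviours apart is to inspect the response body directly. The sketch below assumes the OpenAI-style chat completion shape that the fields quoted above come from; the helper name, the sample payloads, and the &lt;code&gt;search&lt;/code&gt; tool are hypothetical.&lt;/p&gt;

```python
def emitted_tool_call(response):
    """Return True if a chat completion carries a structured tool call."""
    choice = response["choices"][0]
    message = choice["message"]
    has_calls = bool(message.get("tool_calls"))
    return has_calls and choice.get("finish_reason") == "tool_calls"


# Hypothetical payloads modelled on the two behaviours described above.
broken = {
    "choices": [{
        "finish_reason": "stop",
        "message": {"content": "", "tool_calls": []},
    }]
}
working = {
    "choices": [{
        "finish_reason": "tool_calls",
        "message": {
            "content": None,
            "tool_calls": [{
                "type": "function",
                "function": {"name": "search", "arguments": "{}"},
            }],
        },
    }]
}

print(emitted_tool_call(broken))   # False
print(emitted_tool_call(working))  # True
```

&lt;p&gt;Running a check like this against each alias with &lt;code&gt;tool_choice: "auto"&lt;/code&gt; separates a parser problem (call text present in &lt;code&gt;content&lt;/code&gt;) from a model that never emits the call at all.&lt;/p&gt;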

&lt;h2&gt;Attempt three: the bump that broke coresidency&lt;/h2&gt;

&lt;p&gt;Reviewer agent (running on Hermes) needed 64k context. Bumped the 80B to &lt;code&gt;0.85 / 65536 / 2&lt;/code&gt;. 80B loaded healthy. The 4B's restart loop kicked in 19 times: &lt;em&gt;"Free memory on device (12.58/119.67 GiB) is less than desired GPU memory utilization (0.12, 14.36 GiB)."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The 80B's actual residency at 0.85 was 101.5 GiB. Add ~5 GiB of CUDA framework overhead and only ~12.5 GiB was left free. The 4B's 0.12 target needed 14.36 GiB. No room.&lt;/p&gt;

&lt;p&gt;Toned the 80B back to 0.80, dropped the 4B to &lt;code&gt;0.10 / 16384 / 8&lt;/code&gt;. Both came up healthy. The 4B's &lt;code&gt;max_model_len&lt;/code&gt; had to drop because the 0.10 allocation leaves only ~3.5 GiB for KV pool — 32k single-seq KV demand (~4.8 GiB) doesn't fit; 16k (~2.4 GiB) does.&lt;/p&gt;
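
&lt;p&gt;The 32k versus 16k decision is a one-line proportion. The sketch below uses only the figures quoted above and assumes KV demand scales linearly with context length, which is a fair approximation for a pure-attention model like the 4B; the names are illustrative.&lt;/p&gt;

```python
KV_GIB_AT_32K = 4.8  # single-sequence KV demand at 32k, figure from the text
KV_POOL_GIB = 3.5    # KV pool left by the 0.10 allocation, figure from the text


def kv_demand_gib(max_model_len, ref_len=32768, ref_gib=KV_GIB_AT_32K):
    """Scale the 32k figure linearly with context length."""
    return ref_gib * max_model_len / ref_len


def fits(max_model_len, pool_gib=KV_POOL_GIB):
    """A context length fits if one full-length sequence fits in the pool."""
    return pool_gib >= kv_demand_gib(max_model_len)


for length in (32768, 16384):
    print(length, round(kv_demand_gib(length), 1), fits(length))
# 32768 4.8 False
# 16384 2.4 True
```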

&lt;h2&gt;The residency math&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m4mdv6o8nvvkdmpw4in.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8m4mdv6o8nvvkdmpw4in.png" alt=" " width="800" height="577"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is the table I wish I'd built on day one:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Allocation target&lt;/th&gt;
&lt;th&gt;Actual resident&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Next-80B-Instruct-FP8 at 0.80&lt;/td&gt;
&lt;td&gt;~95 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;87.8 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-4B-Instruct-2507 at 0.10&lt;/td&gt;
&lt;td&gt;~12 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;13.8 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~107 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;101.6 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Free headroom&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~12 GiB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~18 GiB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three observations from the actuals.&lt;/p&gt;

&lt;p&gt;The 80B's actual residency at 0.80 ran 8 GiB &lt;em&gt;under&lt;/em&gt; allocation. That cushion is the only reason the 4B's restart variability doesn't break the deployment. At 0.85, the cushion disappeared (101.5 GiB resident against a 101.72 GiB target) and the 4B no longer fit. Same hardware, same models, same vLLM build.&lt;/p&gt;

&lt;p&gt;The 4B at 0.10 actually resides at 13.8 GiB, not the 12 GiB the target implies. CUDA framework overhead doesn't disappear at small allocations.&lt;/p&gt;

&lt;p&gt;On Qwen3-Next specifically, KV pool demand does not scale simply with &lt;code&gt;max_model_len × max_num_seqs&lt;/code&gt;. It is dominated by Mamba state alignment, not attention KV. Halving &lt;code&gt;max_model_len&lt;/code&gt; doesn't halve KV pool demand the way it does on a pure attention model. Plan KV against Mamba page sizes, not against intuition from Llama-class models.&lt;/p&gt;
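
&lt;p&gt;The table's target and headroom columns can be reproduced from the fractions and the measured residencies. A small sketch, using only numbers from the table and the 119.67 GiB total:&lt;/p&gt;

```python
TOTAL_GIB = 119.67

# (gpu_memory_utilization, measured residency in GiB), both from the table
models = {
    "80B": (0.80, 87.8),
    "4B": (0.10, 13.8),
}

for name, (fraction, actual) in models.items():
    target = fraction * TOTAL_GIB
    print(f"{name}: target {target:.1f} GiB, actual {actual:.1f} GiB, "
          f"delta {actual - target:+.1f} GiB")

total_actual = sum(actual for _, actual in models.values())
headroom = TOTAL_GIB - total_actual
print(f"total {total_actual:.1f} GiB, headroom {headroom:.1f} GiB")
```

&lt;p&gt;The deltas come out at roughly 7.9 GiB under target for the 80B and 1.8 GiB over for the 4B, with about 18.1 GiB of headroom, which is where the rounded figures in the table come from.&lt;/p&gt;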

&lt;p&gt;Once the wiring was complete, LiteLLM showed all the aliases for the same two models running on the Spark.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw17gyy0bpuec3ykwvq0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw17gyy0bpuec3ykwvq0t.png" alt=" " width="800" height="573"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;The insight&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;gpu_memory_utilization&lt;/code&gt; is a fraction of total card memory, and vLLM checks it once, at process start, against whatever is free at that moment. It is not a target against free memory. CUDA contexts from prior failed attempts can transiently inflate residency and trip the check spuriously. Co-resident processes don't negotiate; they race.&lt;/p&gt;

&lt;p&gt;The only number that matters is actual residency after both processes have stabilized, measured against the headroom the harder-to-restart model needs to come back from a crash. Target allocations are a planning input; actuals are the ground truth.&lt;/p&gt;

&lt;p&gt;For a two-model Spark deployment, the playbook is: load the bigger model first, let it settle, run &lt;code&gt;nvidia-smi&lt;/code&gt; to read actual residency, then size the smaller model's &lt;code&gt;gpu_memory_utilization&lt;/code&gt; against the free pool minus ~5 GiB for its own framework overhead. Recheck after both restart cleanly twice.&lt;/p&gt;
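
&lt;p&gt;The sizing step of that playbook reduces to one formula. The helper below is a hypothetical sketch: it takes the bigger model's measured residency, subtracts the ~5 GiB overhead allowance, and converts the remainder back into a fraction of total memory, since that is the unit &lt;code&gt;gpu_memory_utilization&lt;/code&gt; is expressed in.&lt;/p&gt;

```python
def max_small_model_fraction(total_gib, resident_gib, overhead_gib=5.0):
    """Upper bound on the smaller model's gpu_memory_utilization.

    free pool = total - measured residency of the bigger model;
    the budget is that pool minus the smaller process's own overhead,
    expressed as a fraction of TOTAL memory.
    """
    free_gib = total_gib - resident_gib
    budget_gib = max(0.0, free_gib - overhead_gib)
    return budget_gib / total_gib


# Figures from the text: 119.67 GiB total; the 80B resident at
# 87.8 GiB (at 0.80) or 101.5 GiB (at 0.85).
print(round(max_small_model_fraction(119.67, 87.8), 2))   # 0.22
print(round(max_small_model_fraction(119.67, 101.5), 2))  # 0.11
```

&lt;p&gt;With the 80B at 0.85 the bound lands near 0.11, below the 0.12 the 4B was requesting, which lines up with the restart loop. At 0.80 the bound is about 0.22, so the 0.10 setting sits well inside it.&lt;/p&gt;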

&lt;h2&gt;The 24-hour action&lt;/h2&gt;

&lt;p&gt;If you have a vLLM deployment running right now, run this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;nvidia-smi &lt;span class="nt"&gt;--query-gpu&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;memory.used &lt;span class="nt"&gt;--format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;csv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Compare the actual number to what your &lt;code&gt;gpu_memory_utilization&lt;/code&gt; target implies. If the two diverge by more than 10%, your sizing model is wrong. Fix it before you ship anything that depends on coresidency — agent stacks, parallel workers, fallback chains. The math has to be empirical, not aspirational.&lt;/p&gt;
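
&lt;p&gt;The comparison itself is easy to script. The sketch below assumes the command prints a header line followed by rows of the form &lt;code&gt;N MiB&lt;/code&gt;; a row such as &lt;code&gt;[N/A]&lt;/code&gt; would need separate handling. The sample capture is hypothetical, and the 0.90 combined fraction is the 0.80 + 0.10 pair from this deployment.&lt;/p&gt;

```python
def parse_used_mib(csv_text):
    """Sum the 'N MiB' rows of the CSV output, skipping the header line."""
    rows = [line.strip() for line in csv_text.strip().splitlines()[1:]]
    return sum(float(row.split()[0]) for row in rows if row)


def divergence(actual_gib, fraction, total_gib):
    """Relative gap between measured residency and the configured target."""
    target_gib = fraction * total_gib
    return abs(actual_gib - target_gib) / target_gib


# Hypothetical capture, shaped like the command's CSV output.
sample = "memory.used [MiB]\n104038 MiB\n"
used_gib = parse_used_mib(sample) / 1024
gap = divergence(used_gib, 0.90, 119.67)  # combined target for both models
print(f"{used_gib:.1f} GiB used, {gap:.1%} off target")
if gap > 0.10:
    print("sizing model is off by more than 10%")
```

&lt;p&gt;Feeding it the real output instead of &lt;code&gt;sample&lt;/code&gt; turns the 10% rule into something that can run after every restart.&lt;/p&gt;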




&lt;p&gt;If you're standing up a similar local-LLM stack — DGX Spark (or other hardware), vLLM, multiple coresident models, or wiring a remote agent fleet to a single inference backend — I'd love to compare notes.&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>vllm</category>
      <category>ai</category>
      <category>nvidia</category>
    </item>
    <item>
      <title>AIE Code 2025 Wrapup</title>
      <dc:creator>Devashish</dc:creator>
      <pubDate>Tue, 23 Dec 2025 06:56:57 +0000</pubDate>
      <link>https://dev.to/ric03uec/httpswwwdevashishmepaie-code-2025-wrapup-365l</link>
      <guid>https://dev.to/ric03uec/httpswwwdevashishmepaie-code-2025-wrapup-365l</guid>
      <description>&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.devashish.me/p/aie-code-2025-wrapup" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21JZD1%21%2Cw_1200%2Ch_600%2Cc_fill%2Cf_jpg%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Cg_auto%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F5eccc34d-86fe-482c-9ae9-86e3b2fc0a0d_1252x970.jpeg" height="400" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.devashish.me/p/aie-code-2025-wrapup" rel="noopener noreferrer" class="c-link"&gt;
            AIE Code 2025 Wrapup - devashish.me
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Leadership and engineering takeaways from AIE CODE 2025.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsubstackcdn.com%2Fimage%2Ffetch%2F%24s_%21fEfd%21%2Cf_auto%2Cq_auto%3Agood%2Cfl_progressive%3Asteep%2Fhttps%253A%252F%252Fsubstack-post-media.s3.amazonaws.com%252Fpublic%252Fimages%252F0d92f0f4-93b1-4725-8c15-c140459d9507%252Ffavicon.ico" width="64" height="64"&gt;
          devashish.me
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


</description>
    </item>
  </channel>
</rss>
