<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dharamendra Kumar</title>
    <description>The latest articles on DEV Community by Dharamendra Kumar (@dharamendra1314).</description>
    <link>https://dev.to/dharamendra1314</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3951812%2F3c92b80c-a33c-4766-882c-03538cf96286.jpeg</url>
      <title>DEV Community: Dharamendra Kumar</title>
      <link>https://dev.to/dharamendra1314</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dharamendra1314"/>
    <language>en</language>
    <item>
      <title>Serving a Fleet of SLMs on One RTX 5080: Multi-Model on a Single Consumer GPU</title>
      <dc:creator>Dharamendra Kumar</dc:creator>
      <pubDate>Tue, 26 May 2026 16:15:38 +0000</pubDate>
      <link>https://dev.to/dharamendra1314/serving-a-fleet-of-slms-on-one-rtx-5080-multi-model-on-a-single-consumer-gpu-48hg</link>
      <guid>https://dev.to/dharamendra1314/serving-a-fleet-of-slms-on-one-rtx-5080-multi-model-on-a-single-consumer-gpu-48hg</guid>
      <description>&lt;p&gt;&lt;em&gt;Every number below was measured on a single RTX 5080 (16 GB) and is reproducible&lt;br&gt;
from the repo. Each result states the exact config it was measured under; I don't&lt;br&gt;
compare numbers across configs, and I flag anything we did **not&lt;/em&gt;* cleanly measure.&lt;/p&gt;
&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;You can serve &lt;strong&gt;several small chat LLMs from one 16 GB RTX 5080&lt;/strong&gt;, behind a single&lt;br&gt;
OpenAI-compatible endpoint, by &lt;strong&gt;reusing an existing router&lt;/strong&gt; (the Shepherd Model&lt;br&gt;
Gateway) plus &lt;strong&gt;~150 lines of shell&lt;/strong&gt; — no custom router, no inference engine, and&lt;br&gt;
&lt;strong&gt;no Python or Rust in the serving stack&lt;/strong&gt;. Three controlled findings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A 0.5B model serves ~12,800 tok/s&lt;/strong&gt; (CUDA graphs on, concurrency 48).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Giving a model 1/3 of the GPU's memory instead of all of it made no difference&lt;/strong&gt;
to throughput (12,766 vs 12,838 tok/s).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prefix caching doubled throughput&lt;/strong&gt; on a prefill-heavy workload; &lt;strong&gt;cache-aware
routing &lt;em&gt;lost&lt;/em&gt; 20%&lt;/strong&gt; to plain round-robin in one regime.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Why this works on a 16 GB card
&lt;/h2&gt;

&lt;p&gt;Small models are tiny: a quantized ~1B model is roughly 0.5–2 GB. So several fit in&lt;br&gt;
16 GB at once, and you can serve them concurrently behind one endpoint. The only&lt;br&gt;
real question is &lt;em&gt;how&lt;/em&gt; to place and route them cleanly on one card.&lt;/p&gt;
&lt;h2&gt;
  
  
  The key idea: a router routes, it does not &lt;em&gt;place&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;Mature routers already exist (NVIDIA Dynamo, vLLM's router, the Shepherd Model&lt;br&gt;
Gateway). They give you an OpenAI endpoint and cache/load-aware routing across&lt;br&gt;
workers — but they &lt;strong&gt;route to workers that already exist&lt;/strong&gt;; they don't start the&lt;br&gt;
workers or divide GPU memory. So "many models on one GPU" only needs a &lt;strong&gt;placement&lt;/strong&gt;&lt;br&gt;
step: start N model servers, each memory-capped to co-fit, and register them. That's&lt;br&gt;
a shell script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clients → SMG (reused binary) → N vLLM workers on one GPU
            ↑ a ~15-line bash launcher places + registers the workers
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(The only Python anywhere is an offline script that renders the charts below — the&lt;br&gt;
serving path is shell + reused binaries.)&lt;/p&gt;

&lt;p&gt;I ran &lt;strong&gt;three chat models&lt;/strong&gt; — Qwen2.5-0.5B-Instruct, Qwen2.5-1.5B-Instruct,&lt;br&gt;
SmolLM2-360M-Instruct — co-resident behind one gateway, each reachable by model name.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three gotchas we actually hit on the 5080
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;flashinfer's JIT needs &lt;code&gt;ninja&lt;/code&gt; and &lt;code&gt;nvcc&lt;/code&gt; on PATH.&lt;/strong&gt; Launch a vLLM worker from a
bare venv path and it dies with &lt;code&gt;FileNotFoundError: 'ninja'&lt;/code&gt; inside kernel
compilation. Activate the venv &lt;em&gt;and&lt;/em&gt; put the CUDA toolkit on PATH first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Don't start co-located workers concurrently.&lt;/strong&gt; vLLM measures
&lt;code&gt;--gpu-memory-utilization&lt;/code&gt; against &lt;em&gt;total&lt;/em&gt; memory at startup; launch two at once
and they race → one gets "No available memory for the cache blocks." Start them
&lt;strong&gt;sequentially&lt;/strong&gt; (health-check each before the next).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three models + CUDA graphs don't fit 16 GB.&lt;/strong&gt; With graphs on, the third worker
OOM'd. Co-locating 3 means running with graphs off (&lt;code&gt;--enforce-eager&lt;/code&gt;) or fewer
models.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Finding 1: the memory split didn't matter (controlled)
&lt;/h2&gt;

&lt;p&gt;I gave one 0.5B the whole GPU vs. just 30% of it, holding everything else constant&lt;br&gt;
(concurrency 48, CUDA graphs on):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flod0ok2qa8kcsonlxelj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flod0ok2qa8kcsonlxelj.png" alt="Memory split has no effect" width="715" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;12,766 vs 12,838 tok/s — identical.&lt;/strong&gt; At this concurrency the KV cache wasn't the&lt;br&gt;
bottleneck, so shrinking it to make room for neighbors cost nothing. Good news for&lt;br&gt;
co-location: you can hand most of the card to other models without hurting a small&lt;br&gt;
model's throughput.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 2: prefix caching doubled throughput (when prefill dominates)
&lt;/h2&gt;

&lt;p&gt;vLLM's automatic prefix caching is on by default, but a random-prompt benchmark&lt;br&gt;
&lt;em&gt;hides&lt;/em&gt; it (no shared prefix). Same model, same config (graphs off), only the&lt;br&gt;
workload's shared-prefix fraction changes:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtixhfbj2rx7huscd9at.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhtixhfbj2rx7huscd9at.png" alt="Prefix cache A/B" width="800" height="380"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a 2048-token shared prefix and 32-token output (prefill-heavy), throughput&lt;br&gt;
&lt;strong&gt;doubled (1,153 → 2,316 tok/s)&lt;/strong&gt; and p99 TTFT dropped &lt;strong&gt;64%&lt;/strong&gt;. With a short prefix&lt;br&gt;
and long output it was a modest ~15%. So a shared system prompt / RAG context is&lt;br&gt;
worth a lot; random prompts get nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Finding 3: "smart" routing isn't always smart
&lt;/h2&gt;

&lt;p&gt;Cache-aware routing only matters with &lt;em&gt;multiple replicas&lt;/em&gt; of one model (pin&lt;br&gt;
same-prefix requests to the same replica). Holding config constant and sweeping the&lt;br&gt;
prefix working set against constrained per-replica caches:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8k8x3xjficfplz9hwn0y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8k8x3xjficfplz9hwn0y.png" alt="Routing regimes" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Prefix working set&lt;/th&gt;
&lt;th&gt;cache_aware&lt;/th&gt;
&lt;th&gt;round_robin&lt;/th&gt;
&lt;th&gt;result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;small (fits everywhere)&lt;/td&gt;
&lt;td&gt;62.4&lt;/td&gt;
&lt;td&gt;60.6 req/s&lt;/td&gt;
&lt;td&gt;tie&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;sweet spot&lt;/td&gt;
&lt;td&gt;55.2&lt;/td&gt;
&lt;td&gt;53.4&lt;/td&gt;
&lt;td&gt;+3.5% (within noise)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;oversized&lt;/td&gt;
&lt;td&gt;38.5&lt;/td&gt;
&lt;td&gt;46.3&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;round_robin +20%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When the working set overflows a replica's cache, cache-aware pinning sacrifices&lt;br&gt;
load balance and &lt;strong&gt;loses by 20%&lt;/strong&gt;. On these small models, plain round-robin /&lt;br&gt;
power-of-two was the better default.&lt;/p&gt;

&lt;h2&gt;
  
  
  What we did NOT measure cleanly
&lt;/h2&gt;

&lt;p&gt;To be straight about the limits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The multi-model &lt;em&gt;aggregate&lt;/em&gt; throughput under contention.&lt;/strong&gt; Our controlled
3-model run had a worker OOM (graphs on, 16 GB), so we don't have a clean
contention number — it's omitted rather than guessed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;One GPU.&lt;/strong&gt; Everything here is the RTX 5080; we make no claims about other
hardware.&lt;/li&gt;
&lt;li&gt;The prefix-cache and routing runs used CUDA graphs &lt;strong&gt;off&lt;/strong&gt;, so their absolute
tok/s aren't comparable to Finding 1's — only the &lt;em&gt;relative&lt;/em&gt; effects (2×, −20%)
are the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Reproduce it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;scripts/launch_workers.sh   &lt;span class="c"&gt;# probe GPU, size + start N capped workers (sequential)&lt;/span&gt;
scripts/run_gateway.sh      &lt;span class="c"&gt;# smg launch in front&lt;/span&gt;
bench/sweep.sh              &lt;span class="c"&gt;# QPS + goodput sweep&lt;/span&gt;
bench/chart.sh              &lt;span class="c"&gt;# self-contained HTML report&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;The takeaway isn't a throughput record — it's that &lt;strong&gt;reuse + a shell script&lt;/strong&gt; gets a&lt;br&gt;
working multi-model serving stack onto a consumer GPU, and that controlled&lt;br&gt;
measurement beats intuition: memory split didn't matter, prefix caching was a 2× win&lt;br&gt;
&lt;em&gt;only&lt;/em&gt; with shared prefixes, and cache-aware routing &lt;em&gt;lost&lt;/em&gt; in the wrong regime.&lt;br&gt;
Measure your own workload.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repo, scripts, and raw benchmark JSON: **&lt;a href="https://github.com/dk67604/monogpu" rel="noopener noreferrer"&gt;https://github.com/dk67604/monogpu&lt;/a&gt;&lt;/em&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
