<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jovan Chan</title>
    <description>The latest articles on DEV Community by Jovan Chan (@jovan_chan_9500711396d4e6).</description>
    <link>https://dev.to/jovan_chan_9500711396d4e6</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3945669%2Fa08789e6-d856-4f19-ae91-169593c75a9c.png</url>
      <title>DEV Community: Jovan Chan</title>
      <link>https://dev.to/jovan_chan_9500711396d4e6</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jovan_chan_9500711396d4e6"/>
    <language>en</language>
    <item>
      <title>Intel Arc B580 12GB for Local AI in 2026: Real Benchmarks and the CUDA-Free Reality</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Tue, 16 Jun 2026 07:00:39 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/intel-arc-b580-12gb-for-local-ai-in-2026-real-benchmarks-and-the-cuda-free-reality-31b5</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/intel-arc-b580-12gb-for-local-ai-in-2026-real-benchmarks-and-the-cuda-free-reality-31b5</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/intel-arc-b580-local-ai-guide-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The &lt;a href="https://www.amazon.com/s?k=Intel+Arc+B580&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Intel Arc B580&lt;/a&gt; is the cheapest way to get 12GB of VRAM on a new GPU in 2026 — $249 MSRP, 456 GB/s bandwidth, and ~28 tokens/sec on Llama 3.1 8B Q4_K_M via llama.cpp's Vulkan backend. It works well for 7–13B LLMs and Stable Diffusion. The trade-off is real: no CUDA means 30–60 extra minutes of setup friction, and some tools simply don't run on Arc yet.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Arc B580 (new)&lt;/th&gt;
&lt;th&gt;RTX 3060 12GB (used)&lt;/th&gt;
&lt;th&gt;RTX 4060 Ti 16GB (new)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Max VRAM on a new GPU under $300&lt;/td&gt;
&lt;td&gt;Drop-in Ollama, zero friction&lt;/td&gt;
&lt;td&gt;VRAM headroom for 20B+ models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$249–$299 new&lt;/td&gt;
&lt;td&gt;~$241 used eBay (Jun 2026)&lt;/td&gt;
&lt;td&gt;~$400 new&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;456 GB/s&lt;/td&gt;
&lt;td&gt;360 GB/s&lt;/td&gt;
&lt;td&gt;288 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM speed (8B Q4)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~28 tok/s Vulkan&lt;/td&gt;
&lt;td&gt;~32 tok/s CUDA&lt;/td&gt;
&lt;td&gt;~24 tok/s CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The catch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No CUDA; IPEX-LLM or Vulkan only&lt;/td&gt;
&lt;td&gt;Older architecture&lt;/td&gt;
&lt;td&gt;Less bandwidth per dollar&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: Buy the B580 if you're comfortable with a slightly rougher setup experience and want the best new GPU under $300 for LLMs. If you want zero friction today, a used RTX 3060 12GB is faster at the same price — but the B580 has better bandwidth and a longer useful life.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The 12GB argument, and why bandwidth matters more than people think
&lt;/h2&gt;

&lt;p&gt;Two years ago, 12GB VRAM for under $300 meant a used RTX 3080 or RTX 3060. Today the Arc B580 gives you 12GB on a new GPU with a warranty, driver support through at least 2028, and memory bandwidth that beats the RTX 3060 by 27%.&lt;/p&gt;

&lt;p&gt;That bandwidth number — 456 GB/s vs 360 GB/s — matters specifically for LLM inference. Unlike gaming or training, autoregressive text generation is almost entirely memory-bandwidth-bound at a single user. The GPU's compute cores sit idle while the model weights stream from VRAM into the shader units for each token. More bandwidth equals more tokens per second, roughly linearly, all else equal.&lt;/p&gt;

&lt;p&gt;So on paper, the B580 should outperform the RTX 3060 12GB by 20–25% on LLM generation. In practice, software overhead on the non-CUDA path erases much of that advantage. More on that in the benchmarks section.&lt;/p&gt;

&lt;p&gt;The card launched in December 2024 at $249. As of June 2026, the Intel Limited Edition sits at &lt;a href="https://www.amazon.com/Intel-B580-Limited-Graphics-Card/dp/B0DPM9923G?tag=runaihome-20" rel="noopener noreferrer"&gt;$303 on Amazon&lt;/a&gt; and partner models start at $249–$269 on Newegg. Used RTX 3060 12GB cards are selling for ~$241 on eBay right now. The prices are nearly identical, which makes the comparison direct.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the specs actually mean for local AI
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;12GB GDDR6 @ 456 GB/s.&lt;/strong&gt; At Q4_K_M quantization, this fits comfortably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.1 8B: ~5.0 GB weights + ~1.5 GB KV cache at 4K context = &lt;strong&gt;6.5 GB total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Mistral 7B: ~5.2 GB weights + ~1.4 GB KV cache = &lt;strong&gt;6.6 GB total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Gemma 2 9B: ~5.8 GB weights + ~1.6 GB KV cache = &lt;strong&gt;7.4 GB total&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Llama 3.1 13B Q4_K_M: ~8.5 GB weights + ~2.0 GB KV cache = &lt;strong&gt;10.5 GB total&lt;/strong&gt; (fits, tight)&lt;/li&gt;
&lt;li&gt;Llama 3.3 70B Q4_K_M: ~43 GB — doesn't fit, won't load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 12GB ceiling is real. If you're planning to run 30B+ models, look at a used RTX 3090 24GB instead (see our &lt;a href="https://dev.to/blog/used-rtx-3090-ai-value-king-2026/"&gt;RTX 3090 value guide&lt;/a&gt; for current pricing).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;190W TDP.&lt;/strong&gt; Under actual LLM inference load — which is less demanding than sustained gaming — the card draws 130–150W based on the pattern seen in gaming benchmarks where it typically runs well below its 190W TBP. At $0.12/kWh, that's &lt;strong&gt;$0.018–$0.022 per hour&lt;/strong&gt; of inference. Running it 4 hours a day costs about $2.50/month.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No CUDA.&lt;/strong&gt; This is the whole story. The B580 uses Intel's Xe2 architecture and supports Vulkan, DirectML, SYCL (via Intel's oneAPI), and OpenCL — but not NVIDIA's CUDA. The majority of local AI guides, model files, and troubleshooting posts assume CUDA. PyTorch training, fine-tuning with Axolotl, and many ComfyUI custom nodes won't work without extra effort.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark numbers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  llama.cpp Vulkan backend (recommended)
&lt;/h3&gt;

&lt;p&gt;The Vulkan path requires no Intel toolkit — just llama.cpp compiled with Vulkan support and up-to-date Intel Arc drivers. It's the quickest path to a working setup.&lt;/p&gt;

&lt;p&gt;Tested results on Arc B580 (llama.cpp build b3xxx, Vulkan, Intel Arc driver 31.0.x):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Generation (tok/s)&lt;/th&gt;
&lt;th&gt;VRAM used&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B Instruct&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28.1 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B v0.3&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;31.4 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6.6 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B Instruct&lt;/td&gt;
&lt;td&gt;Q5_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;23.8 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.2 13B Instruct&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;17.2 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10.5 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 2 9B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;26.5 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7.4 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prompt processing (prefill) on the B580 is noticeably fast — 590–640 tokens/sec for the 8B models — so long-context ingestion is snappy even if generation is slower.&lt;/p&gt;

&lt;p&gt;For comparison: a used RTX 3060 12GB running the same Llama 3.1 8B Q4_K_M via CUDA in Ollama produces ~32–35 tok/s. The B580 is about 15–20% slower on generation despite its bandwidth advantage, because the Vulkan backend has more driver overhead than CUDA.&lt;/p&gt;

&lt;h3&gt;
  
  
  IPEX-LLM on Linux
&lt;/h3&gt;

&lt;p&gt;Intel's IPEX-LLM library uses the SYCL/oneAPI backend, which requires installing Intel's oneAPI base toolkit (~3 GB). The payoff: more stable long sessions, better integration with Ollama's API, and access to Intel-optimized kernels.&lt;/p&gt;

&lt;p&gt;On Ubuntu 22.04 with IPEX-LLM's Ollama bridge, the B580 achieves &lt;strong&gt;32–38 tok/s on 14B models&lt;/strong&gt; according to reported benchmarks — faster than the raw Vulkan numbers because IPEX-LLM's INT4 kernels are specifically tuned for Xe2 matrix units. However, this requires the full oneAPI stack and a longer setup process.&lt;/p&gt;




&lt;h2&gt;
  
  
  How to set this up
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option A: llama.cpp Vulkan (Windows or Linux, 20 minutes)
&lt;/h3&gt;

&lt;p&gt;This is the path for most people. No Intel toolkit, no conda, just a driver update and a build step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Update Intel Arc drivers.&lt;/strong&gt; Download from the Intel Download Center. Drivers from late 2025 or newer are required; the SPIRV compiler that ships with older drivers has a bug that causes random crashes during model loading.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Install the Vulkan SDK.&lt;/strong&gt; On Windows, download from LunarG. On Ubuntu:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;vulkan-tools libvulkan-dev
vulkaninfo | &lt;span class="nb"&gt;grep &lt;/span&gt;deviceName  &lt;span class="c"&gt;# should show your Arc GPU&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3: Build llama.cpp with Vulkan support:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
&lt;span class="nb"&gt;mkdir &lt;/span&gt;build &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;build
cmake .. &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Grab a model:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5: Run inference:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Explain PCIe bandwidth limits in one paragraph"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 200
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output: first tokens appear in 1–2 seconds, sustained generation at ~28 tok/s. If generation is below 10 tok/s, you're missing &lt;code&gt;-ngl 99&lt;/code&gt; and the model is running on CPU.&lt;/p&gt;

&lt;p&gt;For a persistent API server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bin/llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you an OpenAI-compatible API endpoint that works with Open WebUI, Continue.dev for VS Code, or any OpenAI SDK.&lt;/p&gt;




&lt;h3&gt;
  
  
  Option B: IPEX-LLM + Ollama via Docker (Linux, 30 minutes)
&lt;/h3&gt;

&lt;p&gt;Intel maintains a pre-built Docker image with everything bundled. No oneAPI installation required when using Docker.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--device&lt;/span&gt; /dev/dri &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 11434:11434 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;OLLAMA_INTEL_GPU&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ZES_ENABLE_SYSMAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;ONEAPI_DEVICE_SELECTOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;level_zero:0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; ollama-arc &lt;span class="se"&gt;\&lt;/span&gt;
  intelanalytics/ipex-llm-inference-cpp-xpu:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, pull and test a model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;ollama-arc ollama pull llama3.1:8b
docker &lt;span class="nb"&gt;exec &lt;/span&gt;ollama-arc ollama run llama3.1:8b &lt;span class="s2"&gt;"What is 7 * 8?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first pull takes 3–5 minutes. After that, the Ollama API is available at &lt;code&gt;localhost:11434&lt;/code&gt; — same as a standard Ollama install, so Open WebUI, Continue.dev, and any Ollama-compatible &lt;/p&gt;

</description>
      <category>gpu</category>
      <category>localai</category>
      <category>intelarc</category>
      <category>llm</category>
    </item>
    <item>
      <title>FLUX.1 Kontext Dev for Local AI in 2026: Image Editing on Consumer GPUs Without the API Bills</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Mon, 15 Jun 2026 07:01:57 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/flux1-kontext-dev-for-local-ai-in-2026-image-editing-on-consumer-gpus-without-the-api-bills-3nd3</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/flux1-kontext-dev-for-local-ai-in-2026-image-editing-on-consumer-gpus-without-the-api-bills-3nd3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/flux-kontext-dev-local-comfyui-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: FLUX.1 Kontext dev is a 12B open-weight image-editing model from Black Forest Labs. The FP8 checkpoint runs in 12GB VRAM at roughly 2× the speed of the raw BF16 model; an aggressive NF4 quantization squeezes it to 7GB. The API is $0.04 per image — local breaks even in under 13,000 edits.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;RTX 4090 (FP8)&lt;/th&gt;
&lt;th&gt;RTX 4070 / 3060 12GB (FP8)&lt;/th&gt;
&lt;th&gt;8GB GPU (NF4/GGUF)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Full-speed editing, FP4 on RTX 50-series&lt;/td&gt;
&lt;td&gt;Sweet spot: quality + hardware you may already own&lt;/td&gt;
&lt;td&gt;Budget entry, slower output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VRAM used&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;12–14 GB (headroom for FP8)&lt;/td&gt;
&lt;td&gt;12–14 GB&lt;/td&gt;
&lt;td&gt;7–8 GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Speed&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2.29 iter/s at NF4 / faster at FP8 TensorRT&lt;/td&gt;
&lt;td&gt;~1.5–2.0 iter/s at FP8&lt;/td&gt;
&lt;td&gt;~0.6–1.0 iter/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The catch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hardware cost is steep if you don't own one&lt;/td&gt;
&lt;td&gt;T5 encoder adds ~6–9 GB RAM overhead&lt;/td&gt;
&lt;td&gt;Visible quality loss vs FP8&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: If you own a 12GB+ GPU, run the FP8 checkpoint locally — the setup takes 20 minutes and you'll break even against API costs in a weekend of editing. Below 12GB, the quality compromise from NF4 is real enough to just use the API unless you're doing hundreds of edits daily.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Flux Kontext Is (and Isn't)
&lt;/h2&gt;

&lt;p&gt;Black Forest Labs released FLUX.1 Kontext Pro on June 1, 2025 as the first model in its Kontext suite. The open-weight [dev] variant followed shortly after. The key distinction: Kontext is not a text-to-image model. It is an &lt;strong&gt;image-editing model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;You hand it an existing image and a text instruction — &lt;em&gt;"change the jacket color to red"&lt;/em&gt;, &lt;em&gt;"replace the background with a forest"&lt;/em&gt;, &lt;em&gt;"make her hold an umbrella"&lt;/em&gt; — and it applies that edit while preserving everything else: face identity, lighting, background elements, stylistic consistency. That consistency-across-edits capability is what sets it apart from running an inpaint workflow in standard FLUX.1 dev.&lt;/p&gt;

&lt;p&gt;The architecture accepts both a text prompt and one or more reference images as conditioning inputs. Internally, it's a 12B parameter flow-matching diffusion transformer — same family as FLUX.1 dev, but trained on instruction-following editing tasks rather than pure text-to-image generation. The Pro and Max variants are closed API; the [dev] model is open-weight under the FLUX.1 Non-Commercial License, which restricts the model weights to non-commercial use but permits commercial use of the generated outputs under certain conditions.&lt;/p&gt;

&lt;p&gt;If you're already running &lt;a href="https://dev.to/blog/comfyui-windows-setup-guide/"&gt;ComfyUI&lt;/a&gt; or &lt;a href="https://dev.to/blog/comfyui-linux-production-setup-2026/"&gt;ComfyUI on Linux&lt;/a&gt;, the Kontext dev workflow slots in without a framework change.&lt;/p&gt;




&lt;h2&gt;
  
  
  The VRAM Reality: 24GB Native, 7GB Quantized
&lt;/h2&gt;

&lt;p&gt;The raw BF16 safetensors file weighs in at approximately &lt;strong&gt;24 GB&lt;/strong&gt; on disk — right at the VRAM ceiling of an &lt;a href="https://www.amazon.com/s?k=RTX+3090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 3090&lt;/a&gt; or &lt;a href="https://www.amazon.com/s?k=RTX+4090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4090&lt;/a&gt;. In practice, you need a few GB of headroom for KV cache and activations, so BF16 is tight on 24GB cards and requires lowering resolution or step count to stay within bounds.&lt;/p&gt;

&lt;p&gt;The practical tiers, all of which Black Forest Labs and the community have released as ready-to-use checkpoints:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FP8 Scaled (12 GB VRAM required)&lt;/strong&gt;&lt;br&gt;
The recommended path for RTX 40/30-series cards. The file &lt;code&gt;flux1-dev-kontext_fp8_scaled.safetensors&lt;/code&gt; is ~12 GB. NVIDIA's own benchmarks show &lt;strong&gt;2× faster inference vs BF16 PyTorch&lt;/strong&gt; when running on RTX 40-series hardware, which has FP8 tensor core acceleration. This is the sweet spot: near-full quality, half the memory, faster output.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;NF4 / Q4 Quantization (7–8 GB VRAM required)&lt;/strong&gt;&lt;br&gt;
Community GGUF and NF4 checkpoints bring the model to ~7 GB on disk. Black Forest Labs benchmarking reported &lt;strong&gt;97% quality retention&lt;/strong&gt; vs the full BF16 model at NF4 precision. On an RTX 4090 using NF4, real-world edits benchmark at approximately &lt;strong&gt;2.29 iterations/second&lt;/strong&gt; — roughly 9 seconds per edit at 20 sampling steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FP4 via TensorRT (Blackwell RTX 50-series only)&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti&lt;/a&gt; and other Blackwell GPUs with native FP4 tensor cores can load Kontext at 7 GB through NVIDIA's TensorRT-RTX. The FP4 path hits similar speeds to FP8 on Ada — the model is smaller in memory, the throughput is comparable, and the quality is close to NF4. This requires the TensorRT-RTX library and NVIDIA's NIM microservice or a ComfyUI-TensorRT node, not the standard safetensors path.&lt;/p&gt;


&lt;h2&gt;
  
  
  GPU Tier Guide
&lt;/h2&gt;
&lt;h3&gt;
  
  
  24GB Cards (RTX 3090, RTX 4090): Run BF16 or FP8
&lt;/h3&gt;

&lt;p&gt;Both the &lt;a href="https://dev.to/blog/used-rtx-3090-ai-value-king-2026/"&gt;RTX 3090&lt;/a&gt; and RTX 4090 comfortably handle the FP8 checkpoint. The RTX 4090 gains the additional TensorRT 2× speedup from FP8 tensor core acceleration; the RTX 3090 runs FP8 at full quality but without the same hardware-accelerated path, so expect speeds comparable to FP8 on a 40-series midrange rather than the flagship.&lt;/p&gt;

&lt;p&gt;If you want to run BF16 on a 24GB card, keep your output resolution at 1024×1024 or below and use 20 steps. Above that, you will hit OOM errors. FP8 is strictly better here — same quality, half the memory, faster.&lt;/p&gt;
&lt;h3&gt;
  
  
  12–16GB Cards (RTX 4070 12GB, RTX 4060 Ti 16GB, RTX 3060 12GB): FP8 Sweet Spot
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.amazon.com/s?k=RTX+4070&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4070&lt;/a&gt; with 12GB and the RTX 4060 Ti 16GB are arguably the most practical targets for Kontext dev. The FP8 checkpoint fits with 0–2 GB headroom. Speed lands somewhere between a 3090 and 4090 depending on architecture — for Kontext's editing workload, you're looking at around 1.5–2.0 iterations/second at 20 steps, so 10–15 seconds per edit.&lt;/p&gt;

&lt;p&gt;The RTX 3060 12GB is the minimum for running FP8 without offloading. It works; the speed is modest (~12–18 seconds per edit at FP8 estimated), and you will need to keep context length conservative. But it runs.&lt;/p&gt;

&lt;p&gt;One practical issue on 12GB cards: the T5-XXL text encoder is a 4–9 GB RAM consumer depending on precision. If you load it at FP16, it adds roughly 9 GB of system RAM usage. Use the FP8-scaled T5 encoder (&lt;code&gt;t5xxl_fp8_e4m3fn_scaled.safetensors&lt;/code&gt;) to keep RAM pressure manageable.&lt;/p&gt;
&lt;h3&gt;
  
  
  8GB Cards (RTX 3060 8GB, RTX 4060 8GB, RTX 5060 8GB): NF4/GGUF Only
&lt;/h3&gt;

&lt;p&gt;An 8GB card requires NF4 or a GGUF quantization. With the 7GB NF4 checkpoint, there's 1 GB of headroom — fine for small resolution (768×768), tight for 1024×1024. Black Forest Labs reported 97% quality retention at NF4; in practice, you'll notice softened fine detail in complex scenes and slightly reduced text rendering compared to FP8, but for most portrait and product edits the output is usable.&lt;/p&gt;

&lt;p&gt;GGUF variants in the Q4 range (4–7 GB) are available from the QuantStack repository on Hugging Face. Load these through the ComfyUI-GGUF custom node into the &lt;code&gt;models/unet/&lt;/code&gt; directory rather than the standard diffusion model loader.&lt;/p&gt;


&lt;h2&gt;
  
  
  ComfyUI Setup: 20 Minutes Start to First Edit
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;ComfyUI v0.3.42 or newer — the Kontext workflow nodes were added in this release and are not available in older builds&lt;/li&gt;
&lt;li&gt;30–50 GB of free storage (accounting for model files + working cache)&lt;/li&gt;
&lt;li&gt;Python 3.11 or 3.12 with PyTorch 2.4+&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Download the Model Files
&lt;/h3&gt;

&lt;p&gt;You need four components:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Diffusion model&lt;/strong&gt; — place in &lt;code&gt;ComfyUI/models/diffusion_models/&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;For FP8 (recommended for 12GB+ VRAM):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flux1-dev-kontext_fp8_scaled.safetensors  (~12 GB)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Download from the Black Forest Labs Hugging Face repository.&lt;/p&gt;

&lt;p&gt;For NF4/GGUF (8–12GB VRAM):&lt;br&gt;
Use any Q4–Q8 GGUF from QuantStack's &lt;code&gt;FLUX.1-Kontext-dev-GGUF&lt;/code&gt; repo. Place in &lt;code&gt;ComfyUI/models/unet/&lt;/code&gt; and use the GGUF loader node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. VAE&lt;/strong&gt; — place in &lt;code&gt;ComfyUI/models/vae/&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ae.safetensors
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is shared with standard FLUX.1 dev — you likely already have it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Text encoders&lt;/strong&gt; — place in &lt;code&gt;ComfyUI/models/text_encoders/&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip_l.safetensors
t5xxl_fp8_e4m3fn_scaled.safetensors   (FP8, recommended — saves ~5 GB RAM vs FP16)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Load the Workflow
&lt;/h3&gt;

&lt;p&gt;The ComfyUI docs provide an official native workflow JSON. Download it, drag it onto your ComfyUI canvas. If nodes appear red afte&lt;/p&gt;

</description>
      <category>flux</category>
      <category>comfyui</category>
      <category>imageediting</category>
      <category>localai</category>
    </item>
    <item>
      <title>WWDC 2026 Preview: Apple Foundation Models and Core AI — What On-Device AI Actually Means for Home Lab Builders</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Mon, 15 Jun 2026 07:01:13 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/wwdc-2026-preview-apple-foundation-models-and-core-ai-what-on-device-ai-actually-means-for-home-42bg</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/wwdc-2026-preview-apple-foundation-models-and-core-ai-what-on-device-ai-actually-means-for-home-42bg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/wwdc-2026-apple-foundation-models-core-ai-home-lab-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Apple's WWDC 2026 (June 8–12) is expected to replace Core ML with a new Core AI framework, ship a Gemini-trained Foundation Model to power a chatbot-capable Siri, and expand the on-device Foundation Models developer API. The existing 3B on-device model already runs at ~30 tokens/second on iPhone 15 Pro with zero API cost. For home lab builders this matters in a specific, narrow way: if you write iOS/macOS apps, the free inference is real and the privacy story is solid. If you run open-source LLMs, Foundation Models is a separate ecosystem that doesn't replace Ollama or llama.cpp.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Apple Foundation Models API&lt;/th&gt;
&lt;th&gt;Open-source LLMs on Apple Silicon&lt;/th&gt;
&lt;th&gt;NVIDIA GPU + Ollama&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;iOS/macOS app developers&lt;/td&gt;
&lt;td&gt;Running 7B–70B open models locally&lt;/td&gt;
&lt;td&gt;Maximum tok/s, widest model choice&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Free (on-device inference, no API key)&lt;/td&gt;
&lt;td&gt;Device cost only&lt;/td&gt;
&lt;td&gt;GPU cost + ~$420/year electricity&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The catch&lt;/td&gt;
&lt;td&gt;Apple's model only, no fine-tuning, Apple devices required&lt;/td&gt;
&lt;td&gt;Needs 48GB+ for 70B models&lt;/td&gt;
&lt;td&gt;24GB VRAM ceiling, 350–450W draw&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: If you write Swift apps and want on-device AI with no API bill, enable the Foundation Models framework today — it's already shipping. If you run Llama, Qwen, or Mistral models in Ollama, Core AI doesn't change your setup at all.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What WWDC 2026 Is Actually Announcing
&lt;/h2&gt;

&lt;p&gt;The keynote opens June 8 at 10 AM PT. Based on reporting from Bloomberg's Mark Gurman, AppleInsider, 9to5Mac, and TechCrunch, three AI-specific things are coming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Core AI replaces Core ML.&lt;/strong&gt; Apple's Core ML framework dates to 2017, when "machine learning" was the industry term and "AI" still felt like science fiction. Core AI is its modernized replacement: same underlying function (local inference on the Neural Engine, GPU, and CPU), but with a broader mandate. Core AI introduces a standardized API for developers to plug in third-party model weights alongside Apple's own models — a direct response to the fact that developers increasingly want to ship custom weights, not just Apple's. Core ML will continue running the existing model zoo in compatibility mode; Core AI takes the forward path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Updated Foundation Models with Gemini-trained weights.&lt;/strong&gt; Apple and Google announced a multi-year collaboration under which the next generation of Apple Foundation Models will be based on Google's Gemini architecture and training infrastructure. The current on-device model is a 3B parameter Apple-trained model. The WWDC 2026 version is expected to be larger, more capable, and significantly better at multi-turn conversation. The expanded context window is one of the explicit improvements Apple has signaled.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Siri becomes a chatbot.&lt;/strong&gt; The rebuilt Siri arriving with iOS 27/macOS 27 gets a dedicated app, full conversation history, and text-plus-voice input. The underlying model is reportedly a 1.2 trillion parameter system developed in collaboration with Google. Unlike the current Foundation Models 3B model that runs fully on-device, the full Siri chatbot routes through Apple's Private Cloud Compute infrastructure — not on your local hardware. The developer framework to build Siri-like experiences in your own apps, however, remains on-device.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Foundation Models Framework Today: What Already Ships
&lt;/h2&gt;

&lt;p&gt;Before getting to the WWDC 2026 announcements, it's worth being clear about what exists right now, because the framework has been available since iOS 26 shipped and is already useful.&lt;/p&gt;

&lt;p&gt;The Foundation Models framework gives Swift developers direct API access to the 3B parameter on-device model that powers writing tools, summaries, and Smart Replies in Apple Intelligence. Performance from Apple's own technical documentation: &lt;strong&gt;~30 tokens/second&lt;/strong&gt; on iPhone 15 Pro and iPhone 17 Pro, with time-to-first-token latency under 1 millisecond per prompt token. For context, that's slower than running Llama 3 8B on an &lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti&lt;/a&gt; (55–60 tok/s), but the 3B model runs on a phone with no power plug, no API call, and no data leaving the device.&lt;/p&gt;

&lt;p&gt;The Swift API to use it is deliberately minimal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;FoundationModels&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;LanguageModelSession&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;respond&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;to&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Summarize this support ticket in one sentence."&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines. Apple handles memory management, quantization, and Neural Engine scheduling. The more interesting part is the &lt;code&gt;@Generable&lt;/code&gt; macro for structured output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@Generable&lt;/span&gt; &lt;span class="kd"&gt;struct&lt;/span&gt; &lt;span class="kt"&gt;TicketClassification&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Urgency level based on customer tone"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;@Guide&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;anyOf&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"medium"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"critical"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;priority&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This constrained decoding approach doesn't just limit output to the four priority values — Apple's documentation reports that guided generation &lt;em&gt;improves&lt;/em&gt; accuracy compared to free-form output, because constraining the generation space reduces hallucination probability. That's a real technical advantage for extraction and classification tasks, regardless of model size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware requirements:&lt;/strong&gt; Apple Intelligence must be enabled, which requires iPhone 15 Pro/15 Pro Max or any iPhone 16+, iPad with M1 or A17 Pro, or any Apple Silicon Mac (M1 or later). Intel Macs and older iPhones are excluded.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two Different Things Home Lab Builders Need to Keep Separate
&lt;/h2&gt;

&lt;p&gt;There is a conflation in most Apple AI coverage that creates real confusion for home lab builders: the &lt;strong&gt;Foundation Models developer API&lt;/strong&gt; and &lt;strong&gt;Apple Silicon as a platform for open-source LLMs&lt;/strong&gt; are separate stories with separate hardware considerations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Foundation Models: the developer-facing story
&lt;/h3&gt;

&lt;p&gt;If you write iOS or macOS apps, the WWDC 2026 Core AI framework announcement is relevant. You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Inference at zero API cost (no key, no billing, no rate limits)&lt;/li&gt;
&lt;li&gt;Privacy guarantees: data stays on device by default, no telemetry&lt;/li&gt;
&lt;li&gt;Swift-native type safety via guided generation&lt;/li&gt;
&lt;li&gt;Apple handles all hardware-specific optimization per chip generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The hard constraint is that you use Apple's model. You can't swap in your own weights, you can't fine-tune on private data, and deployment is limited to Apple platforms. If your app needs a specific domain or language not well-represented in the Foundation Model's training data, you're engineering around the model through prompting, not through retraining.&lt;/p&gt;

&lt;p&gt;For AI coding tools built around Xcode and Apple's platform ecosystem, the Core AI developer story has direct implications. &lt;a href="https://aicoderscope.com" rel="noopener noreferrer"&gt;Aicoderscope.com&lt;/a&gt; covers that angle in depth.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apple Silicon for open-source LLMs: an independent story
&lt;/h3&gt;

&lt;p&gt;This is completely independent of Foundation Models. Ollama, llama.cpp, LM Studio, and every other open inference tool runs on Apple Silicon through the Metal and (as of Ollama 0.19 in March 2026) MLX backends. The Foundation Models 3B model and Llama 3.3 70B running in Ollama do not share inference infrastructure, don't compete for the same memory pool, and aren't connected in any way.&lt;/p&gt;

&lt;p&gt;The performance picture for open-source inference on Apple hardware in 2026, verified across multiple benchmark sources:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Hardware&lt;/th&gt;
&lt;th&gt;Unified Memory&lt;/th&gt;
&lt;th&gt;Memory BW&lt;/th&gt;
&lt;th&gt;Llama 3.3 70B Q4_K_M&lt;/th&gt;
&lt;th&gt;Annual power cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.amazon.com/s?k=Mac+Mini+M4&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Mac Mini M4&lt;/a&gt; 16GB&lt;/td&gt;
&lt;td&gt;16GB&lt;/td&gt;
&lt;td&gt;120 GB/s&lt;/td&gt;
&lt;td&gt;Won't fit&lt;/td&gt;
&lt;td&gt;~$13/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4 32GB&lt;/td&gt;
&lt;td&gt;32GB&lt;/td&gt;
&lt;td&gt;120 GB/s&lt;/td&gt;
&lt;td&gt;Won't fit (needs ~43GB)&lt;/td&gt;
&lt;td&gt;~$17/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Mini M4 Pro 48GB&lt;/td&gt;
&lt;td&gt;48GB&lt;/td&gt;
&lt;td&gt;273 GB/s&lt;/td&gt;
&lt;td&gt;~18 tok/s&lt;/td&gt;
&lt;td&gt;~$37/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;a href="https://www.amazon.com/s?k=Mac+Studio+M4+Max&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Mac Studio M4 Max&lt;/a&gt; 64GB&lt;/td&gt;
&lt;td&gt;64GB&lt;/td&gt;
&lt;td&gt;546 GB/s&lt;/td&gt;
&lt;td&gt;~24 tok/s&lt;/td&gt;
&lt;td&gt;~$68/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M4 Max 128GB&lt;/td&gt;
&lt;td&gt;128GB&lt;/td&gt;
&lt;td&gt;546 GB/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;28 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$82/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mac Studio M3 Ultra 192GB&lt;/td&gt;
&lt;td&gt;192GB&lt;/td&gt;
&lt;td&gt;800 GB/s&lt;/td&gt;
&lt;td&gt;~40 tok/s&lt;/td&gt;
&lt;td&gt;~$121/yr&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The M4 Max 128GB at 28 tok/s on Llama 3.3 70B Q4_K_M is the Apple Silicon sweet spot for home lab work in 2026. The Q4_K_M quantization uses ~43GB of the &lt;/p&gt;

</description>
      <category>apple</category>
      <category>wwdc2026</category>
      <category>foundationmodels</category>
      <category>ondeviceai</category>
    </item>
    <item>
      <title>Wan 2.1, 2.2, and 2.7 for Local AI Video Generation: Which GPU Can Actually Run It (2026 Guide)</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Mon, 15 Jun 2026 07:00:28 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/wan-21-22-and-27-for-local-ai-video-generation-which-gpu-can-actually-run-it-2026-guide-12ka</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/wan-21-22-and-27-for-local-ai-video-generation-which-gpu-can-actually-run-it-2026-guide-12ka</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/wan-video-local-ai-gpu-guide-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The Wan 2.2 14B is today's best open-source local video model, but at full precision it needs 54+ GB of VRAM — datacenter territory. The fix is a two-step trick (GGUF quantization + T5-XXL CPU offload) that drops GPU VRAM from 54 GB to 6–8 GB for 480p or 12–16 GB for 720p. At 16 GB VRAM, you get 720p clips in 2–4 minutes. Wan 2.7 (April 2026) raises the bar to 4K but still targets 24 GB as its practical minimum.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;RTX 4070 12GB&lt;/th&gt;
&lt;th&gt;RTX 5060 Ti 16GB&lt;/th&gt;
&lt;th&gt;RTX 4090 24GB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Wan 14B at 480p (GGUF)&lt;/td&gt;
&lt;td&gt;Wan 14B at 720p (FP8)&lt;/td&gt;
&lt;td&gt;Wan 2.7, no compromises&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Street price (Jun 2026)&lt;/td&gt;
&lt;td&gt;~$430 used&lt;/td&gt;
&lt;td&gt;$429 new&lt;/td&gt;
&lt;td&gt;$2,200–2,755&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Peak VRAM (GGUF + offload)&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;td&gt;~12–14 GB&lt;/td&gt;
&lt;td&gt;~22 GB full FP8&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;480p 5-sec clip (Wan 2.2)&lt;/td&gt;
&lt;td&gt;~18–22 min&lt;/td&gt;
&lt;td&gt;~8–12 min&lt;/td&gt;
&lt;td&gt;~5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;720p 5-sec clip (Wan 2.2)&lt;/td&gt;
&lt;td&gt;impractical (&amp;gt;60 min)&lt;/td&gt;
&lt;td&gt;2–4 min&lt;/td&gt;
&lt;td&gt;3–5 min&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The catch&lt;/td&gt;
&lt;td&gt;VRAM ceiling blocks 720p&lt;/td&gt;
&lt;td&gt;128-bit bus limits bandwidth vs. 4070&lt;/td&gt;
&lt;td&gt;Supply-constrained, $2,200+ entry&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: The RTX 5060 Ti 16GB at $429 is the new sweet spot for Wan 2.2. At 16 GB GDDR7 and 448 GB/s bandwidth, it handles 720p clips in 2–4 minutes — the same tier as the $900+ RTX 4080 Super — for less than half the price. The RTX 4090 is worth it only if you need Wan 2.7 or are running production-scale batches.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What the Wan Series Actually Is
&lt;/h2&gt;

&lt;p&gt;Wan (万象, "ten thousand forms") is Alibaba's open-source AI video generation model family, released under Apache 2.0. Unlike most commercial video generators that require cloud API access, the Wan weights are available to download and self-host. There are no per-minute charges once you have the model locally.&lt;/p&gt;

&lt;p&gt;Four major versions have shipped since early 2025:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Wan 2.1&lt;/strong&gt;: Dense transformer architecture, text-to-video and image-to-video. The version that put open-source video generation on the map for home lab builders.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wan 2.2&lt;/strong&gt;: Switched to Mixture of Experts (MoE) — 27B total parameters with 14B active per step. Better quality than 2.1 at similar compute cost, and now capable of 720p on consumer hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wan 2.5 / 2.6&lt;/strong&gt;: Iterative improvements — camera control, better prompt adherence, consistent character generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Wan 2.7&lt;/strong&gt; (released April 22, 2026): 4K-capable, up to 20-second clips, richer instruction following. Same 14B architecture, heavier output demands.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All versions share the same inference stack. A machine you build for Wan 2.2 today will run Wan 2.7 — you swap the checkpoint, not the hardware.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Model Sizes, Three Use Cases
&lt;/h2&gt;

&lt;p&gt;The Wan family ships in three sizes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1.3B (text-to-video only)&lt;/strong&gt; — the GPU-poor tier. The T2V-1.3B checkpoint needs 8.19 GB VRAM with no tricks. An RTX 4060 8GB generates a 5-second 480p clip in around 4–6 minutes. Quality is noticeably lower than the 14B model, but it's usable for rapid prompt iteration and creative experimentation on budget hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5B (Wan 2.2 and later)&lt;/strong&gt; — the mid-tier. Introduced with Wan 2.2's MoE architecture. Runs cleanly at 480p on any 12 GB card without heavy optimization, and can generate 720p @ 24 fps on a single RTX 4090. A better choice than the 14B if your card has exactly 12 GB VRAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;14B (text-to-video + image-to-video)&lt;/strong&gt; — the quality tier. This is where Wan competes with commercial video APIs. The 14B produces the cinematic motion, coherent character movement, and high fidelity that made the model famous. It's also where the VRAM math gets painful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The VRAM Ceiling Problem — and the Fix
&lt;/h2&gt;

&lt;p&gt;The Wan 2.2 14B pipeline has two major memory consumers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;The video diffusion transformer itself&lt;/strong&gt;: ~14 GB in FP8, ~28 GB in FP16&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The T5-XXL text encoder&lt;/strong&gt;: ~9.4 GB at FP16&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;At full precision, the combined pipeline needs &lt;strong&gt;54–65 GB VRAM&lt;/strong&gt;. No consumer GPU has that. Even the RTX 5090's 32 GB falls short.&lt;/p&gt;

&lt;p&gt;The community has converged on a two-step fix that makes Wan 14B viable on surprisingly modest hardware:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Quantize the transformer.&lt;/strong&gt; GGUF Q4 or Q5 weights reduce the main Wan 14B model from ~28 GB to approximately 8–8.5 GB. Quality loss versus FP16 is minimal at 480p — most viewers can't identify the difference in blind tests. At 720p there's a subtle softening in fine detail, but the practical output remains strong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2 — Offload T5-XXL to CPU RAM.&lt;/strong&gt; T5-XXL is only used during the conditioning pass at the start of each generation. If you have 32+ GB of system RAM, T5 can live in CPU RAM and be called when needed. This costs you 20–30 seconds of extra conditioning time per clip but saves 9+ GB of GPU VRAM. With both tricks applied:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;GPU VRAM at 480p&lt;/strong&gt;: ~6–8 GB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPU VRAM at 720p&lt;/strong&gt;: ~12–16 GB&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is how the RTX 4070 12GB runs the Wan 14B at all — not natively, but via GGUF + T5 offload.&lt;/p&gt;

&lt;p&gt;One requirement that trips up first-timers: &lt;strong&gt;you need at least 32 GB of system RAM&lt;/strong&gt;. With T5-XXL parked in CPU RAM and your diffusion model in VRAM, 16 GB of system RAM will hit swap during the conditioning pass and cause either errors or extremely slow generation. 32 GB is the minimum; 64 GB is comfortable.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark Data: Real Generation Times
&lt;/h2&gt;

&lt;p&gt;The table below comes from SaladCloud's published Wan 2.1 T2V-14B benchmarks, testing a 5-second clip at 480p and 720p with no quantization or offloading — full precision, official inference script.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;480p (5-sec clip)&lt;/th&gt;
&lt;th&gt;720p (5-sec clip)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;H100 SXM&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;85 sec&lt;/td&gt;
&lt;td&gt;284 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A100 SXM&lt;/td&gt;
&lt;td&gt;80 GB&lt;/td&gt;
&lt;td&gt;170 sec&lt;/td&gt;
&lt;td&gt;523 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A40&lt;/td&gt;
&lt;td&gt;48 GB&lt;/td&gt;
&lt;td&gt;501 sec&lt;/td&gt;
&lt;td&gt;1,083 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;281 sec&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;OOM&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Two things stand out:&lt;/p&gt;

&lt;p&gt;First, the RTX 4090 at 281 seconds beats the enterprise A40 at 501 seconds despite the A40 having twice the VRAM. GDDR6X bandwidth (1,018 GB/s on the 4090 vs. PCIe A40) matters more than raw CUDA core count for diffusion inference — the model is memory-bandwidth-bound, not compute-bound.&lt;/p&gt;

&lt;p&gt;Second, both the RTX 4090 and RTX 3090 OOM at 720p with Wan 2.1 full precision. Running Wan 14B at 720p full-precision requires more VRAM than any consumer GPU has.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wan 2.2 changes the 720p picture.&lt;/strong&gt; The switch to MoE architecture (27B total, 14B active) enables efficient high-resolution generation with quantization. With FP8 + T5 offload, the RTX 4090 can now generate 720p clips. At 16 GB, the RTX 4080 Super generates 720p clips in 2–4 minutes with the same setup.&lt;/p&gt;

&lt;p&gt;For the RTX 3090 specifically: a community benchmark running Wan 2.2-Animate on a 3090 recorded approximately 7 seconds per frame at 640×480 — meaning a 5-second, 81-frame clip takes roughly 9–10 minutes. At 720p that climbs to ~18 seconds per frame, or around 24 minutes per clip. Workable for overnight batches or one-off generates; not for rapid iteration.&lt;/p&gt;

&lt;h2&gt;
  
  
  GPU Tier Guide
&lt;/h2&gt;

&lt;h3&gt;
  
  
  8 GB VRAM — Wan 1.3B or 5B only
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.amazon.com/s?k=RTX+4060+8GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4060 8GB&lt;/a&gt;, &lt;a href="https://www.amazon.com/s?k=RTX+5060+8GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 8GB&lt;/a&gt;, and &lt;a href="https://www.amazon.com/s?k=RX+7700+XT&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RX 7700 XT&lt;/a&gt; sit at the 8 GB tier. Wan 1.3B is native; Wan 2.2 5B runs with light quantization at 480p. The 14B is technically possible with aggressive GGUF + CPU offload, but generation times run 20–30 minutes per 5-second clip — barely usable for iteration.&lt;/p&gt;

&lt;p&gt;If your GPU is 8 GB, use Wan 2.2 5B rather than fighting the 14B. The 5B at 8 GB produces output that's meaningfully better than the 1.3B, without the wait.&lt;/p&gt;

&lt;h3&gt;
  
  
  12 GB VRAM — Wan 14B at 480p (slow but real)
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.amazon.com/s?k=RTX+4070+12GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4070 12GB&lt;/a&gt; and RTX 3060 12GB can run Wan 14B GGUF + T5-CPU offload at 480p. Peak GPU VRAM during generation: ~8 GB, leaving about 4 GB headroom. Generation times are 18–22 minutes per 5-second 480p clip.&lt;/p&gt;

&lt;p&gt;The RTX 4070 has 504 GB/s bandwidth (GDDR6X, 192-bit bus). Bandwidth isn't the limiter here — VRAM is. You have enough bandwidth for Wan 14B; you don't have enough VRAM to skip the offloading tricks, which is what slows you down.&lt;/p&gt;

&lt;p&gt;At 720p on 12 GB: possible with extreme quantization (Q3 or lower), but ge&lt;/p&gt;

</description>
      <category>localai</category>
      <category>gpu</category>
      <category>videogeneration</category>
      <category>wan</category>
    </item>
    <item>
      <title>AMD Ryzen AI Max+ 395 (Strix Halo) for Local LLMs in 2026: 128GB Unified Memory, 100 t/s on 30B Models, and Whether It Beats a Discrete GPU</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:03:28 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/amd-ryzen-ai-max-395-strix-halo-for-local-llms-in-2026-128gb-unified-memory-100-ts-on-30b-2875</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/amd-ryzen-ai-max-395-strix-halo-for-local-llms-in-2026-128gb-unified-memory-100-ts-on-30b-2875</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/ryzen-ai-max-395-strix-halo-local-llm-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The AMD Ryzen AI Max+ 395 hits 100 t/s on Qwen3-30B and runs 120B models that physically don't fit on any single consumer discrete GPU — in a $1,499–$1,999 mini PC. It's bandwidth-constrained (256 GB/s vs 1,792 GB/s on an RTX 5090), so for models under 32B a discrete GPU is faster. The machine earns its price for one audience: people who need 70B+ fully in GPU memory, without a dedicated GPU tower.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Strix Halo Mini PC&lt;/th&gt;
&lt;th&gt;RTX 5060 Ti 16GB Build&lt;/th&gt;
&lt;th&gt;Mac Mini M4 Pro 48GB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;70B–120B models entirely in GPU memory&lt;/td&gt;
&lt;td&gt;≤13B at 80–130 t/s, budget build&lt;/td&gt;
&lt;td&gt;30–70B, silent, efficient&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Price&lt;/td&gt;
&lt;td&gt;$1,499–$1,999 (complete)&lt;/td&gt;
&lt;td&gt;~$1,400 (complete build)&lt;/td&gt;
&lt;td&gt;$1,399 (complete)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The catch&lt;/td&gt;
&lt;td&gt;Bandwidth bottleneck; Linux preferred&lt;/td&gt;
&lt;td&gt;Hard ceiling at 16GB VRAM&lt;/td&gt;
&lt;td&gt;48GB max without $4,999 Ultra&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: For Llama 3.3 70B or DeepSeek R1 70B fully in GPU memory without a dedicated GPU tower, the GMKtec EVO-X2 at $1,499 is hard to beat on x86 — but if you can live with 48GB, a Mac Mini M4 Pro is simpler and draws less than half the power.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What Strix Halo actually is
&lt;/h2&gt;

&lt;p&gt;Strix Halo is AMD's internal codename for the die inside the Ryzen AI Max+ 395. The unusual part is the memory architecture: 128 GB of LPDDR5X-8000 on a 256-bit bus, shared between the CPU and an integrated 40-compute-unit RDNA 3.5 GPU (the Radeon 8060S). There's no PCIe bottleneck, no VRAM ceiling separate from system RAM — the GPU sees all 128 GB at full memory bandwidth.&lt;/p&gt;

&lt;p&gt;In practice, the chip can allocate up to 96 GB to the GPU, leaving the remaining 32 GB for the OS and CPU-side workloads. That's a larger GPU memory pool than any consumer discrete GPU, including the RTX 5090 (32 GB).&lt;/p&gt;

&lt;p&gt;The rest of the spec sheet: 16 Zen 5 CPU cores clocked up to 5.1 GHz, a 50+ TOPS XDNA 2 NPU, and a configurable TDP range of 45W–120W with a 55W default. AMD fabbed it on TSMC 4nm. These chips ship in mini PCs — no PCIe card to install, no separate PSU math to worry about.&lt;/p&gt;

&lt;h2&gt;
  
  
  The actual benchmark numbers
&lt;/h2&gt;

&lt;p&gt;All results below come from community testing on a Beelink GTR9 Pro running Ubuntu 24.04 with Mesa RADV (kisak PPA, version 26.0.6–26.1.1), llama.cpp builds b9049–b9467, Ollama 0.23.1, and &lt;code&gt;AMD_VULKAN_ICD=RADV&lt;/code&gt; set for the Vulkan backend. Tested May–June 2026.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Generation (t/s)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-30B-A3B (MoE)&lt;/td&gt;
&lt;td&gt;IQ4_XS&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100.04&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder 30B-A3B&lt;/td&gt;
&lt;td&gt;Q4_K_S&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;98.51&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder 30B-A3B&lt;/td&gt;
&lt;td&gt;UD-Q4_K_XL&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;96.76&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;MXFP4&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;55.57&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6&lt;/td&gt;
&lt;td&gt;Q4_0 (speed-first)&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;~81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6&lt;/td&gt;
&lt;td&gt;balanced&lt;/td&gt;
&lt;td&gt;RADV Vulkan&lt;/td&gt;
&lt;td&gt;~63&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;100 t/s on a 30B model is comfortable real-time speed for single-user inference. The more striking number is GPT-OSS 120B at 55 t/s: a 120-billion-parameter model running entirely in unified memory at a speed that makes it useful for single-user chat.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why MoE models run faster here&lt;/strong&gt;: the 30B-A3B variants (Qwen3's Mixture-of-Experts architecture) activate only ~3B parameters per forward pass despite having 30B total weights. On a bandwidth-constrained system, fewer weights loaded per token means directly higher tokens/sec. If you're running Strix Halo hardware, prioritize MoE-architecture models — the performance advantage is significant.&lt;/p&gt;

&lt;p&gt;The real-world bandwidth measurement confirms the constraint: the system delivers ~215 GB/s measured versus the theoretical 256 GB/s peak, a ~16% gap typical for LPDDR5X under mixed CPU+GPU load.&lt;/p&gt;

&lt;h2&gt;
  
  
  What fits in memory — and what doesn't
&lt;/h2&gt;

&lt;p&gt;The GPU can access up to 96 GB of the 128 GB pool. At Q4_K_M quantization:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Approx. VRAM needed&lt;/th&gt;
&lt;th&gt;Fits on Strix Halo?&lt;/th&gt;
&lt;th&gt;Fits on RTX 4090 (24GB)?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;td&gt;~42–48 GB&lt;/td&gt;
&lt;td&gt;Yes — ~48 GB headroom left&lt;/td&gt;
&lt;td&gt;No (CPU offload needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-30B (dense)&lt;/td&gt;
&lt;td&gt;~18 GB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1 70B distill&lt;/td&gt;
&lt;td&gt;~42 GB&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No (CPU offload needed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;~65–70 GB&lt;/td&gt;
&lt;td&gt;Yes (tight)&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek R1 671B&lt;/td&gt;
&lt;td&gt;~380 GB&lt;/td&gt;
&lt;td&gt;No — needs multi-node&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 4 Maverick 402B&lt;/td&gt;
&lt;td&gt;~230+ GB&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An &lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti+16GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti 16GB&lt;/a&gt; hits its ceiling around 13B Q4_K_M. An &lt;a href="https://www.amazon.com/s?k=NVIDIA+RTX+4090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4090&lt;/a&gt; at 24 GB tops out near 20B before requiring CPU offloading. On Strix Halo, Llama 3.3 70B loads entirely into the GPU memory pool — no CPU offloading, no PCIe bottlenecking. The VRAM math behind these numbers is covered in detail in &lt;a href="https://dev.to/blog/how-much-vram-llama-models/"&gt;How Much VRAM Do You Need for Llama Models&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Strix Halo vs a discrete GPU build
&lt;/h2&gt;

&lt;p&gt;This is the decision that actually matters, and the answer is unambiguous in both directions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where discrete GPUs win&lt;/strong&gt;: every model under 32 GB. An RTX 5090 generates ~186 t/s on Qwen3 8B Q4_K_M. The same model on Strix Halo runs around 80–90 t/s. Memory bandwidth is the reason: 1,792 GB/s on the RTX 5090 vs ~215 GB/s real-world on Strix Halo. For a daily 7B or 14B coding assistant — see the &lt;a href="https://dev.to/blog/continue-dev-ollama-local-ai-coding-stack-2026/"&gt;local AI coding stack with Continue.dev + Ollama&lt;/a&gt; — a mid-range discrete GPU outperforms Strix Halo and often costs less.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where Strix Halo wins&lt;/strong&gt;: any model above 24 GB that you need running fully in GPU memory. An RTX 4090 can't load Llama 3.3 70B without splitting layers to CPU RAM, which drops generation speed to 2–5 t/s. Strix Halo loads it in ~40 seconds and generates at ~30–35 t/s. That's a 6–15× speed difference on the same model.&lt;/p&gt;

&lt;p&gt;The cloud comparison matters here too. Running Llama 3.3 70B on &lt;a href="https://runpod.io?ref=cjrwwd27" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; costs $0.29–$0.59/hour depending on GPU availability. At $1,499 for a GMKtec EVO-X2 running 6 hours/day, you break even at roughly 700–1,400 hours of use — around 4–8 months of daily active use. After that, every inference is free. We ran this calculation in detail in the &lt;a href="https://dev.to/blog/runpod-vs-local-gpu-rent-or-buy/"&gt;RunPod vs local GPU: when to rent vs buy&lt;/a&gt; article.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Mac comparison
&lt;/h2&gt;

&lt;p&gt;The closest comparison is the &lt;a href="https://dev.to/blog/mac-mini-m4-pro-local-ai-2026/"&gt;Mac Mini M4 Pro&lt;/a&gt;, which starts at $1,399 with 24 GB unified memory and maxes out at 48 GB for $1,799. Its memory bandwidth is 273 GB/s — slightly above Strix Halo's real-world 215 GB/s.&lt;/p&gt;

&lt;p&gt;For models that fit in 48 GB, the Mac Mini M4 Pro holds three advantages: substantially better power efficiency (20–30W under LLM load vs 60–120W for Strix Halo mini PCs), meaningfully more mature software (Metal via MLX is better-tuned than AMD's Vulkan/ROCm path on Linux), and quieter operation under sustained load.&lt;/p&gt;

&lt;p&gt;Strix Halo's advantage is the 128 GB tier. If you need the full 96 GB GPU pool for 70B+ models, the Mac route to 128 GB requires the Mac Studio M4 Ultra at $4,999. A $1,499–$1,999 Strix Halo mini PC delivers 96 GB GPU memory at roughly one-third the price — the software experience is rougher, but the hardware value is real.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you can actually buy right now
&lt;/h2&gt;

&lt;p&gt;Prices verified June 2026:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Machine&lt;/th&gt;
&lt;th&gt;Memory&lt;/th&gt;
&lt;th&gt;Storage&lt;/th&gt;
&lt;th&gt;Price&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.amazon.com/s?k=GMKtec+EVO-X2+Ryzen+AI+Max+395&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;GMKtec EVO-X2&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;128GB LPDDR5X&lt;/td&gt;
&lt;td&gt;2TB NVMe&lt;/td&gt;
&lt;td&gt;~$1,499&lt;/td&gt;
&lt;td&gt;Best value, 2.5GbE&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.amazon.com/s?k=Beelink+GTR9+Pro+Ryzen+AI+Max+395&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Beelink GTR9 Pro&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;128GB LPDDR5X&lt;/td&gt;
&lt;td&gt;2TB NVMe&lt;/td&gt;
&lt;td&gt;$1,899–$1,999&lt;/td&gt;
&lt;td&gt;Dual 10GbE, better cooling&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MINISFORUM MS-S1 Max&lt;/td&gt;
&lt;td&gt;128GB LPDDR5X&lt;/td&gt;
&lt;td&gt;2TB NVMe&lt;/td&gt;
&lt;td&gt;~$2,299&lt;/td&gt;
&lt;td&gt;Available on Newegg&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GMKtec EVO-X2 (64GB)&lt;/td&gt;
&lt;td&gt;64GB LPDDR5X&lt;/td&gt;
&lt;td&gt;1TB NVMe&lt;/td&gt;
&lt;td&gt;~$1,099&lt;/td&gt;
&lt;td&gt;GPU pool ~48GB, still runs 70B&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The GMKtec EVO-X2 at $1,499 is the price-performance sweet spot. It has the same CPU and GPU as the Beelink GTR9 Pro and omits the dual 10GbE NICs — which you don't need for single-user home inference. The Beelink's dual 10GbE matters if you're running a shared home AI server. For that use case, the &lt;a href="https://dev.to/blog/open-webui-multi-user-auth-family-setup-2026/"&gt;Open WebUI multi-user setup guide&lt;/a&gt; covers the server con&lt;/p&gt;

</description>
      <category>amd</category>
      <category>ryzenaimax</category>
      <category>strixhalo</category>
      <category>localllm</category>
    </item>
    <item>
      <title>ROCm 7.2 on Ubuntu 24.04 for Local LLMs in 2026: Full Setup Guide for AMD GPUs</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:02:42 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/rocm-72-on-ubuntu-2404-for-local-llms-in-2026-full-setup-guide-for-amd-gpus-30fa</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/rocm-72-on-ubuntu-2404-for-local-llms-in-2026-full-setup-guide-for-amd-gpus-30fa</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/rocm-7-ubuntu-local-llm-setup-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: ROCm 7.2.3 (released May 4, 2026) is the stable Ubuntu path for AMD GPU inference — RDNA 3 setup is rock-solid in under 20 minutes, RDNA 4 works with one Docker workaround for a known gfx1201 bug. AMD delivers 85–92% of equivalent NVIDIA throughput at a lower price point on Ubuntu.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What you'll be able to do after this guide:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install ROCm 7.2.3 on Ubuntu 24.04 LTS and run &lt;code&gt;ollama serve&lt;/code&gt; with full GPU acceleration in under 20 minutes&lt;/li&gt;
&lt;li&gt;Build llama.cpp with the HIP backend for maximum throughput on RDNA 3 and RDNA 4&lt;/li&gt;
&lt;li&gt;Identify and work around the gfx1201 rocBLASLt crash that kills model loads on the RX 9070 XT&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: On Ubuntu, a used &lt;a href="https://www.amazon.com/s?k=AMD+Radeon+RX+7900+XTX&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;AMD Radeon RX 7900 XTX&lt;/a&gt; at ~$800 is the best AMD card for local LLMs in 2026 — 24 GB VRAM, ~96 tok/s on Llama 3.1 8B Q4_K_M, and ROCm 7.2.3 installs without a single environment variable hack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AMD's local AI story on Linux has changed substantially. A year ago you'd fight missing kernel modules and half-broken pip wheels. Today, if you're on a supported card and Ubuntu 24.04, setup is close to the CUDA experience: download a &lt;code&gt;.deb&lt;/code&gt;, add yourself to two groups, reboot once, and &lt;code&gt;ollama pull llama3.1:8b&lt;/code&gt; works.&lt;/p&gt;

&lt;p&gt;The catches are smaller than they used to be, but they exist. RDNA 4 support (RX 9000 series) is still maturing in a specific way — one rocBLASLt lookup bug can SIGKILL your model load at the 2-minute mark every single time. Knowing where the landmine is before you start saves 90 minutes of frustrating debugging.&lt;/p&gt;

&lt;p&gt;This guide covers the native Ubuntu install path. If you need AMD on Windows, see the &lt;a href="https://dev.to/blog/amd-rocm-local-ai-2026/"&gt;AMD ROCm 7.2 on Windows guide&lt;/a&gt; — RDNA 3 is Linux-only for ROCm and the Windows path is a different story entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which cards are actually supported on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;ROCm 7.2.3 divides AMD's consumer lineup into three buckets:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fully supported on Linux:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RDNA 4&lt;/strong&gt;: RX 9070 XT, RX 9070, RX 9060 XT LP, Radeon AI PRO R9600D (gfx1201 / gfx1200)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RDNA 3&lt;/strong&gt;: RX 7900 XTX, RX 7900 XT, RX 7900 GRE, RX 7800 XT, RX 7700 XT (gfx1100 / gfx1101 / gfx1102)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RDNA 3 consumer cards are Linux-only for ROCm. On Windows, the ROCm stack officially supports only RDNA 4 chips — see above.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supported via Vulkan only (no ROCm HIP):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RX 7600, RX 6000 series, anything older. These cards can run inference through llama.cpp's Vulkan backend, but won't get vLLM, PyTorch ROCm, or HIP acceleration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not supported at all:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RDNA 1 (RX 5000 series) and older. Vulkan may work, but inference speed makes these impractical for anything beyond tiny models.&lt;/p&gt;

&lt;h3&gt;
  
  
  VRAM is still the ceiling
&lt;/h3&gt;

&lt;p&gt;For context on what each card can run:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;RX 7900 XTX (24 GB)&lt;/strong&gt;: Qwen3-30B-A3B at Q4_K_M fits cleanly. Llama 3.3 70B Q4 in CPU-offload mode. Anything under 20B at Q4 is comfortable.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RX 9070 XT (16 GB)&lt;/strong&gt;: Llama 3.1 8B at full speed, Qwen3-14B at Q4, 27B MoE models technically fit but saturate the memory bus (6.3 tok/s on Qwen3.5-27B-A3B at Q4).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RX 9070 GRE (16 GB)&lt;/strong&gt;: Launched globally at $549 on June 2, 2026 — same VRAM and gfx1201 architecture as the 9070 XT, slightly less shader compute.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a broader AMD vs NVIDIA VRAM comparison at the 16 GB tier, see &lt;a href="https://dev.to/blog/rx-9070-xt-vs-rtx-5060-ti-16gb-local-ai-2026/"&gt;AMD RX 9070 XT vs RTX 5060 Ti 16GB&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Install ROCm 7.2.3 on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;Start from Ubuntu 24.04.3 LTS. The &lt;code&gt;amdgpu-install&lt;/code&gt; tool handles both the kernel driver and the ROCm userspace stack in a single package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Download the installer for Ubuntu 24.04 (noble)&lt;/span&gt;
wget https://repo.radeon.com/amdgpu-install/7.2.3/ubuntu/noble/amdgpu-install_7.2.3.70203-1_all.deb

&lt;span class="c"&gt;# Install it&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; ./amdgpu-install_7.2.3.70203-1_all.deb
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update

&lt;span class="c"&gt;# Install ROCm with the rocm usecase&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;amdgpu-install &lt;span class="nt"&gt;--usecase&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;rocm &lt;span class="nt"&gt;--no-dkms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;--no-dkms&lt;/code&gt; flag skips DKMS kernel module compilation. On Ubuntu 24.04.3 with a 6.8.x kernel, RDNA 3 and RDNA 4 are already supported by the packaged kernel — invoking DKMS wastes 10 minutes and sometimes fails on systems with secure boot or custom kernels.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Add user groups and reboot
&lt;/h2&gt;

&lt;p&gt;This step trips up almost every first-time installer and the error messages when you skip it are not helpful. The ROCm compute stack requires your user to be in the &lt;code&gt;render&lt;/code&gt; and &lt;code&gt;video&lt;/code&gt; groups to access &lt;code&gt;/dev/kfd&lt;/code&gt; (the GPU compute device node) without root.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-a&lt;/span&gt; &lt;span class="nt"&gt;-G&lt;/span&gt; render,video &lt;span class="nv"&gt;$USER&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reboot now.&lt;/strong&gt; A &lt;code&gt;newgrp&lt;/code&gt; session is not sufficient — the group membership must be part of your login session from the start. After reboot, verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;groups&lt;/span&gt;
&lt;span class="c"&gt;# Expected: ... render video ...&lt;/span&gt;

rocminfo | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-25&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output from &lt;code&gt;rocminfo&lt;/code&gt; on an RX 9070 XT:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ROCk module is loaded
...
Agent 2
  Name:                    gfx1201
  Uuid:                    GPU-XXXXXXXXXXXXXX
  Marketing Name:          Radeon RX 9070 XT
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;code&gt;rocminfo&lt;/code&gt; hangs for more than 30 seconds or shows only Agent 1 (the CPU), you have a group issue or driver conflict. Check &lt;code&gt;dmesg | grep amdgpu&lt;/code&gt; first — firmware errors here usually mean you need a &lt;code&gt;linux-firmware&lt;/code&gt; update.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Verify with rocm-smi
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;rocm-smi
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This displays real-time GPU stats including temperature, power draw, and memory usage. At idle you'll see 0% utilization — that's normal. Run a model in the next step and check again to confirm the GPU is actually being used.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Install Ollama with ROCm support
&lt;/h2&gt;

&lt;p&gt;Ollama ships its own bundled ROCm libraries and auto-detects AMD GPUs on Linux:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-fsSL&lt;/span&gt; https://ollama.com/install.sh | sh
systemctl start ollama
ollama pull llama3.1:8b-instruct-q4_K_M
ollama run llama3.1:8b-instruct-q4_K_M
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;While inference is running, open a second terminal and run &lt;code&gt;watch -n 1 rocm-smi&lt;/code&gt;. You should see GPU memory jump to ~5.5 GB and compute utilization hit 90–95%.&lt;/p&gt;

&lt;p&gt;If memory shows 0 MB allocated despite the model loading, Ollama may be using CPU. Run &lt;code&gt;OLLAMA_DEBUG=1 ollama serve&lt;/code&gt; and check the startup logs — it will report which ROCm libraries it found and whether the GPU was initialized.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5 (optional): Build llama.cpp with HIP
&lt;/h2&gt;

&lt;p&gt;For direct llama.cpp inference — more control over layer offloading and context window than Ollama provides — the HIP backend delivers the best AMD throughput:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;cmake git build-essential

git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_HIP&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DAMDGPU_TARGETS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gfx1201 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Replace &lt;code&gt;gfx1201&lt;/code&gt; with your card's architecture:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Card&lt;/th&gt;
&lt;th&gt;Target arch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT / 9070 / 9060 XT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gfx1201&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 7900 XTX / 7900 XT / 7900 GRE&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gfx1100&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 7800 XT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gfx1101&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 7700 XT&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gfx1101&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Run the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; /path/to/model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-gpu-layers&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;--n-gpu-layers 99&lt;/code&gt; offloads all layers to GPU. If you hit VRAM limits, reduce this number to shift layers to CPU — useful when running 27B+ models on a 16 GB card.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real benchmarks
&lt;/h2&gt;

&lt;p&gt;Results from community benchmarks on ROCm 7.x, Ubuntu 24.04, Ollama and llama.cpp HIP:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Card&lt;/th&gt;
&lt;th&gt;VRAM&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;tok/s&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RX 7900 XTX&lt;/td&gt;
&lt;td&gt;24 GB&lt;/td&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;66–96&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~56&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Qwen3:14B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;52.2&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;GPT-OSS:20B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;91.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT&lt;/td&gt;
&lt;td&gt;16 GB&lt;/td&gt;
&lt;td&gt;Qwen3.5:27B-A3B&lt;/td&gt;
&lt;td&gt;Q4 (MoE)&lt;/td&gt;
&lt;td&gt;6.3&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 66–96 tok/s variance on the RX 7900 XTX reflects different llama.cpp versions and batch size settings across community tests. Mid-range is roughly 80 tok/s for a clean Q4_K_M Llama 8B run.&lt;/p&gt;

&lt;p&gt;For comparison: an RTX 4070 Super (12 GB, 504 GB/s) delivers roughly 62–70 tok/s on the same model. The RX 9070 XT at 640 GB/s memory bandwidth edges it out &lt;/p&gt;

</description>
      <category>amd</category>
      <category>rocm</category>
      <category>ubuntu</category>
      <category>localllm</category>
    </item>
    <item>
      <title>Intel Arc B770 vs RTX 5060 for Local AI in 2026: The 16GB Budget War That Never Happened</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:01:58 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/intel-arc-b770-vs-rtx-5060-for-local-ai-in-2026-the-16gb-budget-war-that-never-happened-2e72</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/intel-arc-b770-vs-rtx-5060-for-local-ai-in-2026-the-16gb-budget-war-that-never-happened-2e72</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/intel-arc-b770-rumors-vs-rtx-5060-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Intel canceled the Arc B770 "Big Battlemage" — the 16GB budget GPU that was supposed to challenge the RTX 5060 Ti market — citing GDDR memory costs and lack of financial viability. NVIDIA filled the slot with the RTX 5060, but shipped it with only 8GB of VRAM. The result: a $200 gap between 8GB and 16GB consumer cards, no affordable Intel challenger anywhere in the picture, and the B770 silicon surviving only as the $949 Arc Pro B70 workstation card.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;RTX 5060&lt;/th&gt;
&lt;th&gt;RTX 5060 Ti&lt;/th&gt;
&lt;th&gt;Arc Pro B70&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8GB GDDR7&lt;/td&gt;
&lt;td&gt;16GB GDDR7&lt;/td&gt;
&lt;td&gt;32GB GDDR6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;448 GB/s&lt;/td&gt;
&lt;td&gt;448 GB/s&lt;/td&gt;
&lt;td&gt;608 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Price (Jun 2026)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;$299–$339&lt;/td&gt;
&lt;td&gt;$429–$479&lt;/td&gt;
&lt;td&gt;$949&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;7B models only&lt;/td&gt;
&lt;td&gt;Up to 20B models&lt;/td&gt;
&lt;td&gt;30B+ models, pro workflows&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The catch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Hard wall at 8GB&lt;/td&gt;
&lt;td&gt;$200 more than 5060&lt;/td&gt;
&lt;td&gt;$500 more than 5060 Ti; no CUDA&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: If your budget tops out at $350, the RTX 5060 is fast and frictionless at 30 tok/s on 7B models. If you ever want to run a 13B or 30B model, stretch to the RTX 5060 Ti. Intel is not your friend at this price point in 2026.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Intel promised
&lt;/h2&gt;

&lt;p&gt;For most of 2025, Intel's roadmap included a second Battlemage GPU — the Arc B770, internally designated BMG-G31. Where the Arc B580 uses the smaller BMG-G21 die with 20 Xe2 cores, the B770 was designed around the full 32-core die with these specs (per leaked hardware repository entries and partner briefings):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;16GB GDDR6&lt;/strong&gt; on a 256-bit bus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;608 GB/s&lt;/strong&gt; memory bandwidth&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;32 Xe2 cores&lt;/strong&gt; (vs. 20 on the B580)&lt;/li&gt;
&lt;li&gt;~&lt;strong&gt;300W&lt;/strong&gt; TDP&lt;/li&gt;
&lt;li&gt;PCIe Gen5 x16&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those numbers were actually compelling for local AI. 608 GB/s beats everything in NVIDIA's current consumer lineup including the RTX 5060 Ti's 448 GB/s. 16GB of VRAM at a rumored $350–$400 would have undercut the RTX 5060 Ti on price while matching it on memory capacity. A 13B model at Q4_K_M fits in 16GB with room to spare for context. A 27B model at Q4 would have been reachable.&lt;/p&gt;

&lt;p&gt;That card doesn't exist. Here's why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Intel canceled it
&lt;/h2&gt;

&lt;p&gt;According to reports from multiple sources including Tom's Hardware and PC Gamer, the B770 was deemed "not financially viable." The proximate cause was the GDDR6 memory shortage of 2025–2026 — the same AI buildout driving data-center VRAM demand made consumer DRAM expensive enough to erode whatever margin Intel had modeled.&lt;/p&gt;

&lt;p&gt;The structural problem runs deeper. NVIDIA has CUDA. AMD has a maturing ROCm stack. Intel's Arc ecosystem requires users to install Intel's IPEX-LLM fork, use llama.cpp's Vulkan backend, or accept reduced compatibility with tools that assume CUDA. Asking those users to pay $350–$400 for a card that adds 30–60 minutes of setup friction — and still breaks with some AI tools — is a hard sell against a $300 RTX 5060 that just works.&lt;/p&gt;

&lt;p&gt;Intel concluded that marketing costs, driver maintenance, and validation overhead would not produce a return. The B770 was shelved. Intel's next discrete GPU launch was the workstation-focused Arc Pro B70 — same silicon, different market, much higher price.&lt;/p&gt;




&lt;h2&gt;
  
  
  What NVIDIA delivered instead
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://www.amazon.com/s?k=RTX+5060&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060&lt;/a&gt; launched in spring 2026. Specs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,840 CUDA cores&lt;/strong&gt; (Blackwell GB206 die)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8GB GDDR7&lt;/strong&gt; memory, 128-bit bus&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;448 GB/s&lt;/strong&gt; memory bandwidth&lt;/li&gt;
&lt;li&gt;Boost clock: 2,625 MHz&lt;/li&gt;
&lt;li&gt;Launch MSRP: $299; street price June 2026: $299–$339 new, ~$285 used on eBay&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The local AI performance story is straightforward. The RTX 5060 posts around &lt;strong&gt;30 tokens/sec on Llama 3.1 8B Q4_K_M&lt;/strong&gt; via Ollama — fast enough for real-time chat, comfortable coding assistant use, and single-user inference. CUDA means zero-friction setup: Ollama, vLLM, ExLlamaV2, AutoGPTQ all work without extra configuration. Install Ollama, pull a model, run it.&lt;/p&gt;

&lt;p&gt;The problem is the 8GB ceiling. Here's what actually fits:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;VRAM needed&lt;/th&gt;
&lt;th&gt;Runs on RTX 5060?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.5 GB&lt;/td&gt;
&lt;td&gt;✅ Yes, ~30 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5 7B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.0 GB&lt;/td&gt;
&lt;td&gt;✅ Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mistral 7B&lt;/td&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;~8.5 GB&lt;/td&gt;
&lt;td&gt;❌ Fails to load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.1 13B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~8.5 GB&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5 14B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~9.5 GB&lt;/td&gt;
&lt;td&gt;❌ No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2.5 32B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~19 GB&lt;/td&gt;
&lt;td&gt;❌ CPU offload only&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The failure mode for 13B and above is a hard one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;ollama run qwen2.5:14b
&lt;span class="go"&gt;Error: model requires 9.5 GB VRAM, only 8.0 GB available
      Try reducing context size (--ctx-size) or switching to a smaller model
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;CPU offloading kicks in and drops you from 30 tok/s to roughly 3–5 tok/s — unusable for interactive use. The 8GB wall is real and not negotiable without changing cards.&lt;/p&gt;

&lt;p&gt;This is precisely where the B770 would have mattered. 16GB at 608 GB/s for $350 would have introduced real competitive pressure on the RTX 5060 Ti. NVIDIA doesn't have that pressure right now, and the pricing reflects it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The 8GB-to-16GB gap, and who fills it
&lt;/h2&gt;

&lt;p&gt;If 8GB isn't enough, your options for a new card are limited:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti&lt;/a&gt; 16GB — $429–$479&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Same 448 GB/s bandwidth as the RTX 5060. Double the VRAM. That extra 8GB changes what's possible: Qwen2.5 14B at Q4_K_M fits with room, Llama 3.3 70B runs at reduced quantization with some CPU offload, and 30B models become viable. Benchmarks from Hardware-Corner show &lt;strong&gt;32.9 tok/s on 14B models at 16k context&lt;/strong&gt; via Ollama. For most home AI users, this is the right call if the budget allows it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Used RTX 3090 24GB — $480–$550 (eBay, June 2026)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;24GB GDDR6 at 936 GB/s bandwidth. For sheer throughput on large models, nothing in the sub-$600 consumer market touches the RTX 3090. Trade-offs: ~350W power draw, no warranty, age. We covered the value calculus in depth in &lt;a href="https://dev.to/blog/used-rtx-3090-ai-value-king-2026/"&gt;the RTX 3090 analysis&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AMD RX 9070 XT 16GB — ~$499&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;640 GB/s bandwidth, 16GB GDDR6. ROCm has improved substantially in 2026 and the Vulkan/ROCM llama.cpp path is now reasonably stable. Covered in the &lt;a href="https://dev.to/blog/rx-9070-xt-vs-rtx-5060-ti-16gb-local-ai-2026/"&gt;RX 9070 XT vs RTX 5060 Ti comparison&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Intel contributes nothing to this list with a consumer card.&lt;/p&gt;




&lt;h2&gt;
  
  
  Arc Pro B70: the B770 silicon at a different price
&lt;/h2&gt;

&lt;p&gt;Intel didn't scrap the BMG-G31 die. The Arc Pro B70 launched in March 2026 at &lt;strong&gt;$949&lt;/strong&gt;, using the full 32 Xe2-core configuration with workstation-class features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;32GB GDDR6&lt;/strong&gt; on a 256-bit bus (608 GB/s bandwidth)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;367 TOPS&lt;/strong&gt; INT8 AI inference performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;22.94 TFLOPS&lt;/strong&gt; FP32 compute&lt;/li&gt;
&lt;li&gt;PCIe 5.0 x16&lt;/li&gt;
&lt;li&gt;ISV-certified professional drivers&lt;/li&gt;
&lt;li&gt;Multi-GPU support on Linux via oneAPI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 32GB is the pitch for local AI. At 32GB you can load Qwen2.5 32B at Q4_K_M (~19GB) comfortably, run Llama 3.3 70B at Q4_K_M (~42GB) with partial CPU offloading, and fit every 13B or 27B model at full Q8 quality. The 608 GB/s bandwidth also means larger models run faster per-token than they would on the RTX 5060 Ti's 448 GB/s.&lt;/p&gt;

&lt;p&gt;Available at Newegg and Micro Center for $949.&lt;/p&gt;

&lt;p&gt;The problem: $949 is not a budget play. At that price, you're competing with used RTX A5000 24GB cards with mature CUDA driver support, and you're sitting $470 above an RTX 5060 Ti. The software tax hasn't disappeared — the B70 runs local AI via IPEX-LLM and OpenVINO on Linux, not via Ollama's default CUDA path. Windows support exists but is rougher.&lt;/p&gt;

&lt;p&gt;The B70 makes sense in a professional Linux workstation with an AI workflow already built on Intel's oneAPI toolchain. It does not make sense as an Ollama drop-in for a Windows home-lab machine where the RTX 5060 Ti does 90% of the same job with zero friction for half the price.&lt;/p&gt;

&lt;p&gt;If you're on the fence between renting and buying during the current GPU market confusion, &lt;a href="https://runpod.io?ref=cjrwwd27" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; has A100 80GB instances at $1.89/hr — useful for large model testing before committing to hardware.&lt;/p&gt;




&lt;h2&gt;
  
  
  Arc B580: the real Intel option right now
&lt;/h2&gt;

&lt;p&gt;The gaming B770 was supposed to land above the &lt;a href="https://www.amazon.com/s?k=Intel+Arc+B580&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Intel Arc B580&lt;/a&gt;. S&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>intelarc</category>
      <category>rtx5060</category>
      <category>localai</category>
    </item>
    <item>
      <title>ComfyUI API Tutorial 2026: Automate Image Generation</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:01:04 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/comfyui-api-tutorial-2026-automate-image-generation-3j2j</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/comfyui-api-tutorial-2026-automate-image-generation-3j2j</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://aifoss.dev/blog/comfyui-api-tutorial-2026/" rel="noopener noreferrer"&gt;aifoss.dev&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: ComfyUI's built-in HTTP server accepts workflow JSON on port 8188 — the same JSON the GUI exports when you click "Save (API Format)". You can queue a prompt, poll for completion, and download the result in under 50 lines of Python. No GUI, no browser tab, no manual clicking.&lt;/p&gt;

&lt;p&gt;What you'll have running after this guide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A Python function that submits any ComfyUI workflow and blocks until images are downloaded locally&lt;/li&gt;
&lt;li&gt;A batch script that generates 100 images with automated seed variation and organized output folders&lt;/li&gt;
&lt;li&gt;Working API-format JSON for an SDXL baseline workflow and a Flux.1 Schnell workflow you can adapt immediately&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Why drive ComfyUI from the API
&lt;/h2&gt;

&lt;p&gt;The GUI is fine for building and testing workflows. It's a problem once you need more than a few images, or need to integrate generation into a pipeline.&lt;/p&gt;

&lt;p&gt;Common use cases where the API makes sense:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Batch generation&lt;/strong&gt; — 50–1000 images with systematic prompt or seed variation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CI pipelines&lt;/strong&gt; — trigger image generation on a new product SKU, game asset, or dataset update&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Headless servers&lt;/strong&gt; — GPU box in a closet or a &lt;a href="https://runpod.io?ref=cjrwwd27" rel="noopener noreferrer"&gt;RunPod&lt;/a&gt; cloud instance where running a browser makes no sense&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;App backends&lt;/strong&gt; — a FastAPI endpoint that accepts user prompts and returns generated images&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you're generating one image at a time and want to tweak nodes visually, stay in the GUI. The API is for automation.&lt;/p&gt;




&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;ComfyUI v0.23.0 (latest as of June 2026) installed and working in the GUI&lt;/li&gt;
&lt;li&gt;Python 3.10+&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pip install requests websocket-client&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;At least one working workflow in your ComfyUI GUI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you haven't installed ComfyUI yet, the &lt;a href="https://dev.to/blog/comfyui-review-2026/"&gt;ComfyUI review&lt;/a&gt; covers installation from scratch. For GPU hardware requirements, the &lt;a href="https://dev.to/blog/stable-diffusion-8gb-vram-guide-2026/"&gt;Stable Diffusion 8GB VRAM guide&lt;/a&gt; has a good breakdown of what each model tier needs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Start ComfyUI with API access enabled
&lt;/h2&gt;

&lt;p&gt;By default, ComfyUI only listens on &lt;code&gt;127.0.0.1&lt;/code&gt; — accessible only from the same machine. That's fine if you're scripting locally. For a remote GPU box or Docker container, add &lt;code&gt;--listen&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Local access only (same machine)&lt;/span&gt;
python main.py &lt;span class="nt"&gt;--port&lt;/span&gt; 8188

&lt;span class="c"&gt;# All interfaces (for remote scripting or Docker)&lt;/span&gt;
python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188

&lt;span class="c"&gt;# With VRAM optimization for 8–12 GB cards&lt;/span&gt;
python main.py &lt;span class="nt"&gt;--listen&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8188 &lt;span class="nt"&gt;--lowvram&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, verify it's up:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://127.0.0.1:8188/system_stats
&lt;span class="c"&gt;# {"system": {"os": "posix", "python_version": "3.10.x", ...}}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The server starts a REST API and a WebSocket server on the same port. There is no authentication by default — if you expose this to a network, use a firewall rule or reverse proxy with auth.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 2: Export your workflow in API format
&lt;/h2&gt;

&lt;p&gt;The GUI workflow JSON and the API workflow JSON are different formats. The GUI format includes node positions, colors, and UI state. The API format is stripped down to just inputs and connections — which is all the server needs.&lt;/p&gt;

&lt;p&gt;To export:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open ComfyUI in your browser&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Settings&lt;/strong&gt; (gear icon) → enable &lt;strong&gt;"Dev Mode Options"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;A new button appears in the menu: &lt;strong&gt;"Save (API Format)"&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click it — this downloads &lt;code&gt;workflow_api.json&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The API format uses numeric string keys (&lt;code&gt;"1"&lt;/code&gt;, &lt;code&gt;"2"&lt;/code&gt;, &lt;code&gt;"3"&lt;/code&gt;) as node IDs. Each node has a &lt;code&gt;class_type&lt;/code&gt; and an &lt;code&gt;inputs&lt;/code&gt; object. Connections between nodes are expressed as &lt;code&gt;["source_node_id", output_index]&lt;/code&gt; arrays rather than named references.&lt;/p&gt;

&lt;p&gt;Here is a minimal SDXL baseline workflow in API format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CheckpointLoaderSimple"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"ckpt_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sd_xl_base_1.0.safetensors"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"EmptyLatentImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"batch_size"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"height"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"width"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"6"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLIPTextEncode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"clip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"a photorealistic red fox in snow, golden hour lighting"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"7"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CLIPTextEncode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"clip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"text"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blurry, low quality, watermark, text"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"KSampler"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"cfg"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;7.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"denoise"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"latent_image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"negative"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"7"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"positive"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"sampler_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"euler"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"scheduler"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"normal"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"seed"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"steps"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"8"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VAEDecode"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"samples"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"vae"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"9"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"class_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"SaveImage"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"inputs"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"filename_prefix"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api_output"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"images"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"8"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For SDXL specifically, node &lt;code&gt;"4"&lt;/code&gt; connects to both CLIP encoders and the KSampler via its three outputs: &lt;code&gt;[0]&lt;/code&gt; = model, &lt;code&gt;[1]&lt;/code&gt; = CLIP, &lt;code&gt;[2]&lt;/code&gt; = VAE. The connection syntax &lt;code&gt;["4", 1]&lt;/code&gt; means "output index 1 of node 4."&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Queue a prompt from Python
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;

&lt;span class="n"&gt;SERVER&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;127.0.0.1:8188&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Submit a workflow. Returns the prompt_id for tracking.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;client_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SERVER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;client_id&lt;/code&gt; is a UUID you generate once per session. It ties your WebSocket connection to your HTTP requests so the server routes status messages back to you. You can skip it for pure HTTP polling, but you need it for WebSocket tracking.&lt;/p&gt;

&lt;p&gt;The server responds immediately with &lt;code&gt;{"prompt_id": "&amp;lt;uuid&amp;gt;", "number": &amp;lt;queue_position&amp;gt;}&lt;/code&gt;. Generation hasn't started yet — the prompt is in queue.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Poll for completion
&lt;/h2&gt;

&lt;p&gt;Two approaches — choose based on your use case:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Latency&lt;/th&gt;
&lt;th&gt;Complexity&lt;/th&gt;
&lt;th&gt;Best for&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;HTTP polling&lt;/strong&gt; &lt;code&gt;/history/{id}&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;~1s overhead&lt;/td&gt;
&lt;td&gt;Low — no extra library&lt;/td&gt;
&lt;td&gt;Scripts, batch jobs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;WebSocket&lt;/strong&gt; &lt;code&gt;/ws?clientId=...&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Near-real-time&lt;/td&gt;
&lt;td&gt;Medium — event loop&lt;/td&gt;
&lt;td&gt;Apps that show progress&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  HTTP polling (simpler)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;wait_for_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poll_secs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Block until the prompt finishes. Returns the history entry.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SERVER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/history/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;raise_for_status&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poll_secs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;/history/{prompt_id}&lt;/code&gt; endpoint returns an empty dict &lt;code&gt;{}&lt;/code&gt; while the prompt is queued or running, and the full result object once done. Polling every second adds at most 1 second of latency to your total generation time — acceptable for batch scripts.&lt;/p&gt;

&lt;h3&gt;
  
  
  WebSocket approach (for real-time apps)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_with_ws&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Queue and wait for completion via WebSocket. Returns prompt_id.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;ws&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;websocket&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;WebSocket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ws://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;SERVER&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/ws?clientId=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;CLIENT_ID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;prompt_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;queue_prompt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;workflow&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;recv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;  &lt;span class="c1"&gt;# binary preview frames — skip
&lt;/span&gt;        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;executing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;node&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;break&lt;/span&gt;  &lt;span class="c1"&gt;# null node = generation finished
&lt;/span&gt;
    &lt;span class="n"&gt;ws&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;prompt_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;executing&lt;/code&gt; message fires for each node as it runs. When &lt;code&gt;node&lt;/code&gt; is &lt;code&gt;None&lt;/code&gt; and the &lt;code&gt;prompt_id&lt;/code&gt; matches yours, the entire graph has finished executing.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 5: Download the generated images
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
python
import os

def download_images(history_entry: dict, output_dir: str = "output") -&amp;gt; list[str]:
    """Download all output images from a completed prompt."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>comfyui</category>
      <category>api</category>
      <category>python</category>
      <category>stablediffusion</category>
    </item>
    <item>
      <title>AMD Lemonade Review 2026: GPU, NPU, and Multi-Modal</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sun, 14 Jun 2026 07:00:20 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/amd-lemonade-review-2026-gpu-npu-and-multi-modal-3gd9</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/amd-lemonade-review-2026-gpu-npu-and-multi-modal-3gd9</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://aifoss.dev/blog/amd-lemonade-llm-server-review-2026/" rel="noopener noreferrer"&gt;aifoss.dev&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: Lemonade v10.6 is AMD's open-source LLM server that adds NPU prefill acceleration, image gen, and speech to one OpenAI-compatible endpoint. NPU acceleration works only on Ryzen AI 300/400 chips — on other hardware, Ollama's ecosystem is wider. AMD Ryzen AI users should pick Lemonade; everyone else should consider Ollama first.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Lemonade v10.6&lt;/th&gt;
&lt;th&gt;Ollama v0.6&lt;/th&gt;
&lt;th&gt;LocalAI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;AMD GPU + NPU hybrid, multi-modal&lt;/td&gt;
&lt;td&gt;Cross-platform, broadest ecosystem&lt;/td&gt;
&lt;td&gt;OpenAI API proxy, any hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Install&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;winget&lt;/code&gt; or Snap&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;curl&lt;/code&gt; one-liner&lt;/td&gt;
&lt;td&gt;Docker Compose&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hardware&lt;/td&gt;
&lt;td&gt;AMD RDNA3+, NVIDIA, Apple M, CPU&lt;/td&gt;
&lt;td&gt;Any GPU&lt;/td&gt;
&lt;td&gt;Any hardware&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model formats&lt;/td&gt;
&lt;td&gt;GGUF, ONNX, FLM, SafeTensors&lt;/td&gt;
&lt;td&gt;GGUF (Ollama manifest)&lt;/td&gt;
&lt;td&gt;GGUF, OpenVINO, more&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-modal&lt;/td&gt;
&lt;td&gt;LLM + image gen + Whisper + TTS&lt;/td&gt;
&lt;td&gt;LLM + vision models&lt;/td&gt;
&lt;td&gt;LLM + Whisper + SD&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The catch&lt;/td&gt;
&lt;td&gt;NPU only on Ryzen AI 300/400&lt;/td&gt;
&lt;td&gt;No NPU acceleration&lt;/td&gt;
&lt;td&gt;High setup complexity&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: On a Ryzen AI 300-series machine, Lemonade is the better daily driver — it uses hardware that Ollama leaves idle and bundles image gen plus speech in one package. On Nvidia hardware or wherever you need maximum integration coverage, stick with Ollama.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What Lemonade Is and Why AMD Built It
&lt;/h2&gt;

&lt;p&gt;Ollama solved cross-platform local LLM deployment cleanly. But it left AMD NPU owners with idle hardware — the dedicated AI accelerators in Ryzen AI chips sat unused because Ollama has no FastFlowLM backend.&lt;/p&gt;

&lt;p&gt;Lemonade is AMD's answer. Released under Apache 2.0 and available at &lt;a href="https://github.com/lemonade-sdk/lemonade" rel="noopener noreferrer"&gt;github.com/lemonade-sdk/lemonade&lt;/a&gt;, it bundles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An OpenAI-compatible HTTP API at &lt;code&gt;http://localhost:13305/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;llama.cpp with Vulkan backend for AMD and NVIDIA GPUs&lt;/li&gt;
&lt;li&gt;FastFlowLM for XDNA2 NPU acceleration on Ryzen AI chips&lt;/li&gt;
&lt;li&gt;Stable Diffusion image generation&lt;/li&gt;
&lt;li&gt;Whisper speech-to-text&lt;/li&gt;
&lt;li&gt;Kokoro text-to-speech&lt;/li&gt;
&lt;li&gt;A model manager with one-command downloads from Hugging Face&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core design difference from Ollama is hardware-tier splitting. On a Ryzen AI 300-series chip, prompt processing (prefill) goes to the NPU while token generation (decode) goes to the iGPU. This is not marketing — the NPU has better compute throughput for dense matrix math during prefill, and the iGPU has better memory bandwidth for sequential token generation. The result is lower Time to First Token on long system prompts and agentic chains.&lt;/p&gt;

&lt;p&gt;Current version: &lt;strong&gt;v10.6.0&lt;/strong&gt; (released May 21, 2026). Linux NPU support shipped with Lemonade 10.0 in March 2026 via the FastFlowLM runtime.&lt;/p&gt;




&lt;h2&gt;
  
  
  Hardware Compatibility
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;AMD Ryzen AI 300/400 (XDNA2)&lt;/td&gt;
&lt;td&gt;FastFlowLM NPU + Vulkan iGPU&lt;/td&gt;
&lt;td&gt;Strix Halo supports up to 128 GB unified memory&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD Radeon discrete (RDNA2/3/4)&lt;/td&gt;
&lt;td&gt;llama.cpp + Vulkan&lt;/td&gt;
&lt;td&gt;Standard VRAM limits; add 2–4 GB overhead&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NVIDIA (Turing–Blackwell)&lt;/td&gt;
&lt;td&gt;llama.cpp + Vulkan or CUDA&lt;/td&gt;
&lt;td&gt;CUDA backend available since v10+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Apple Silicon (M1–M4)&lt;/td&gt;
&lt;td&gt;Metal via llama.cpp&lt;/td&gt;
&lt;td&gt;Unified memory; M4 Max competitive at large models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;x86_64 CPU&lt;/td&gt;
&lt;td&gt;llama.cpp CPU&lt;/td&gt;
&lt;td&gt;Small models only; no hardware acceleration&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;NPU acceleration requires &lt;strong&gt;Ryzen AI 300-series or 400-series&lt;/strong&gt; specifically — the XDNA2 architecture. Earlier Ryzen AI chips (7000, 8000, 200-series) have NPUs that no current runtime supports for LLM inference. On those systems, Lemonade falls back to Vulkan on the GPU, which is functionally the same as running Ollama.&lt;/p&gt;

&lt;p&gt;Supported Linux distros: Ubuntu 24.04+, Fedora 43+, Debian Trixie+, Arch. Docker and Snap packages are available. For hardware context on AMD GPU builds, see &lt;a href="https://runaihome.com" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt; for current RDNA4 GPU benchmarks and build guides.&lt;/p&gt;




&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Windows
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight powershell"&gt;&lt;code&gt;&lt;span class="n"&gt;winget&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;install&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nx"&gt;AMD.LemonadeServer&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This installs the server and a Tauri desktop app (system-tray GUI for model downloads and server management). Alternatively, grab the &lt;code&gt;.msi&lt;/code&gt; from the GitHub releases page. After install, the server starts automatically on port 13305.&lt;/p&gt;

&lt;h3&gt;
  
  
  Linux (Ubuntu 24.04+)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Snap — works across Ubuntu 24.04+, Fedora 43+, Arch&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;lemonade

&lt;span class="c"&gt;# Docker&lt;/span&gt;
docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="nt"&gt;--gpus&lt;/span&gt; all &lt;span class="nt"&gt;-p&lt;/span&gt; 13305:13305 lemonadesdk/lemonade:latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For NPU support on Linux, you need the XDNA driver and FastFlowLM runtime installed separately — the Lemonade docs cover the dependency chain. It is more involved than the Windows path. For most Linux users without a Ryzen AI 300/400 chip, the Snap install with Vulkan fallback is the practical path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the server is running
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:13305/v1/models
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Expected output on a fresh install with no models downloaded:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="s2"&gt;"list"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:[]}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Running Your First Model
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;lemonade run Gemma-4-E2B-it-GGUF
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pulls the model from Hugging Face (if not cached) and starts a chat session in your terminal. The model manager uses Hugging Face slug format — you can also import any custom GGUF or ONNX model from Hugging Face directly.&lt;/p&gt;

&lt;p&gt;Check which backend Lemonade selected for your hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:13305/stats
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response includes the active inference engine: &lt;code&gt;vulkan&lt;/code&gt;, &lt;code&gt;fastflowlm&lt;/code&gt;, &lt;code&gt;rocm&lt;/code&gt;, or &lt;code&gt;cpu&lt;/code&gt;. If you expected &lt;code&gt;fastflowlm&lt;/code&gt; and got &lt;code&gt;vulkan&lt;/code&gt;, check that your XDNA driver is installed and you're on a Ryzen AI 300/400 chip.&lt;/p&gt;

&lt;p&gt;To test image generation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl http://localhost:13305/v1/images/generations &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"prompt": "a terminal screen in a dark room", "n": 1}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  NPU + GPU Hybrid: Numbers From Real Hardware
&lt;/h2&gt;

&lt;p&gt;On a &lt;a href="https://www.amazon.com/s?k=Ryzen+AI+Max+395&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;Ryzen AI Max+ 395&lt;/a&gt; (Strix Halo, 128 GB unified memory), the NPU handles prompt processing and the iGPU handles decode. Community benchmarks from May–June 2026 on this configuration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-OSS 120B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~50 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-122B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~35 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-Next&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;~43 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For comparison: an &lt;a href="https://www.amazon.com/s?k=RTX+4090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4090&lt;/a&gt; running llama.cpp hits 50–80 tok/s on 7B models at Q4 but stalls on 70B+ without aggressive quantization (limited by 24 GB VRAM). The Strix Halo runs 120B at full Q4_K_M in 128 GB of unified memory — a different tier of capability.&lt;/p&gt;

&lt;p&gt;On smaller Ryzen AI 300 systems (Strix Point, 32–64 GB), expect:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Llama 3.2-3B on NPU: ~28 tok/s at under 2 W&lt;/li&gt;
&lt;li&gt;Models above 8B: fall back to iGPU via Vulkan&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;FastFlowLM 0.9.35, the current NPU runtime bundled in Lemonade 10.6, supports context windows up to 256k tokens on XDNA2 NPUs.&lt;/p&gt;




&lt;h2&gt;
  
  
  Multi-Modal in One Server
&lt;/h2&gt;

&lt;p&gt;Lemonade bundles three additional inference backends behind the same API port:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Image generation&lt;/strong&gt;: SDXL-Turbo via &lt;code&gt;/v1/images/generations&lt;/code&gt;. Any client that supports the OpenAI image endpoint works — including the ComfyUI API adapter. See our &lt;a href="https://dev.to/blog/comfyui-api-tutorial-2026/"&gt;ComfyUI API tutorial&lt;/a&gt; for chaining this into automated pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Speech-to-text&lt;/strong&gt;: Whisper backend via &lt;code&gt;/v1/audio/transcriptions&lt;/code&gt;. Uses the same model weights as whisper.cpp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text-to-speech&lt;/strong&gt;: Kokoro TTS via &lt;code&gt;/v1/audio/speech&lt;/code&gt;. Known limitation as of v10.6: voices not in the pre-configured list produce muted audio. Custom voice loading is not yet supported.&lt;/p&gt;

&lt;p&gt;Running these three modalities as separate services (Ollama + ComfyUI + a Whisper server) adds coordination overhead — three processes, three ports, three model caches. Lemonade consolidates them into one service with one model manager. For a home server running all three, that's meaningful.&lt;/p&gt;




&lt;h2&gt;
  
  
  Connecting to Open WebUI
&lt;/h2&gt;

&lt;p&gt;Open WebUI supports custom OpenAI-compatible endpoints. To add Lemonade:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open WebUI settings → &lt;strong&gt;Connections&lt;/strong&gt; → &lt;strong&gt;Add Connection&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;API URL: &lt;code&gt;http://localhost:13305/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;API key: leave blank (Lemonade does not validate keys)&lt;/li&gt;
&lt;li&gt;Save and confirm models appear in the model list&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you're running Open WebUI in Docker and Lemonade natively on the host:&lt;/p&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

http://host.docker.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

</description>
      <category>amd</category>
      <category>llm</category>
      <category>selfhosted</category>
      <category>npu</category>
    </item>
    <item>
      <title>DeepSeek V4 vs Qwen3 for Local AI in 2026: Which Model Family Fits Your GPU?</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sat, 13 Jun 2026 07:06:04 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/deepseek-v4-vs-qwen3-for-local-ai-in-2026-which-model-family-fits-your-gpu-17ig</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/deepseek-v4-vs-qwen3-for-local-ai-in-2026-which-model-family-fits-your-gpu-17ig</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/deepseek-v4-vs-qwen3-local-ai-gpu-guide-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: DeepSeek V4 Flash and Qwen3 both landed in late April 2026 and rewrote the open-weights leaderboards. But for home-lab inference, they serve completely different audiences: Qwen3's MoE variants run on a single consumer GPU and hit 120 tok/s on an RTX 3090, while V4 Flash's lightest usable quantization requires 103 GB of VRAM — more than four RTX 4090s combined. The model family that "fits your GPU" is almost certainly Qwen3.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Qwen3 small/mid (8B–32B)&lt;/th&gt;
&lt;th&gt;Qwen3 MoE (30B-A3B / 35B-A3B)&lt;/th&gt;
&lt;th&gt;DeepSeek V4 Flash (284B)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Best for&lt;/td&gt;
&lt;td&gt;Budget single-GPU builds&lt;/td&gt;
&lt;td&gt;Consumer GPU sweet spot&lt;/td&gt;
&lt;td&gt;Multi-GPU server or API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Min VRAM&lt;/td&gt;
&lt;td&gt;5 GB (8B Q4)&lt;/td&gt;
&lt;td&gt;~17 GB (Q4)&lt;/td&gt;
&lt;td&gt;~103 GB (Q2_K)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Single RTX 3090 speed&lt;/td&gt;
&lt;td&gt;23–60 tok/s (14B–32B)&lt;/td&gt;
&lt;td&gt;~120 tok/s (Q3, Unsloth)&lt;/td&gt;
&lt;td&gt;Not viable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The catch&lt;/td&gt;
&lt;td&gt;Smaller models, less reasoning depth&lt;/td&gt;
&lt;td&gt;Needs 24 GB for Q4 headroom&lt;/td&gt;
&lt;td&gt;Needs $10k+ hardware to run well&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: For 99% of home-lab builders, Qwen3 30B-A3B or 35B-A3B is the answer regardless of budget. DeepSeek V4 Flash is an excellent API model — use it that way, at $0.10/M input tokens, rather than trying to run it locally unless you have a purpose-built multi-GPU server.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What shipped and when
&lt;/h2&gt;

&lt;p&gt;DeepSeek-V4 launched April 24, 2026, in two variants:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;V4 Flash&lt;/strong&gt;: 284B total parameters, 13B activated per token (MoE)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;V4 Pro&lt;/strong&gt;: 1.6T total parameters, 49B activated per token (MoE)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Alibaba's Qwen team shipped the Qwen3 family days later, spanning 0.6B through 235B, including two MoE variants designed explicitly for consumer hardware: the 30B-A3B (3B active/token) and the newer 35B-A3B from the Qwen3.6 update.&lt;/p&gt;

&lt;p&gt;Both families use Mixture-of-Experts architectures that activate only a fraction of parameters per token, which makes them faster to serve than dense models of equivalent total size. The key difference is how aggressively each family scaled the MoE — and who they optimized for.&lt;/p&gt;

&lt;h2&gt;
  
  
  The VRAM numbers, tier by tier
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Consumer single-GPU: 8–24 GB VRAM
&lt;/h3&gt;

&lt;p&gt;Qwen3 owns this tier entirely.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;VRAM required&lt;/th&gt;
&lt;th&gt;Minimum GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~5 GB&lt;/td&gt;
&lt;td&gt;&lt;a href="https://www.amazon.com/s?k=RTX+4060&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4060 8GB&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 14B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~8 GB&lt;/td&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 30B-A3B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~16.8 GB&lt;/td&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-A3B&lt;/td&gt;
&lt;td&gt;Q3 (Unsloth)&lt;/td&gt;
&lt;td&gt;~23 GB&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.amazon.com/s?k=RTX+3090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 3090&lt;/a&gt; 24GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3 32B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~20 GB&lt;/td&gt;
&lt;td&gt;RTX 3090 (tight)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;DeepSeek V4 Flash doesn't appear in this table because it cannot run in this tier. Its IQ1_S-XL quantization — the most aggressive lossy compression available — compresses the 284B model from its FP8 footprint down to 57.3 GB. That's still more than two &lt;a href="https://www.amazon.com/s?k=RTX+4090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4090&lt;/a&gt; GPUs combined. The first viable quantization for practical use is Q2_K at 103 GB.&lt;/p&gt;

&lt;p&gt;If you have a single GPU with 24 GB or less, V4 Flash is a non-starter. Any attempt at local inference involves offloading almost everything to system RAM, with generation speeds under 2 tok/s — impractical for interactive use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prosumer multi-GPU: 2–4× consumer cards
&lt;/h3&gt;

&lt;p&gt;Two RTX 4090s in PCIe tensor-parallel gives 48 GB of pooled VRAM. That's still 55 GB short of V4 Flash's Q2_K threshold.&lt;/p&gt;

&lt;p&gt;The workaround is mixed VRAM/CPU offloading: load the attention layers in VRAM, offload the MoE expert weights to system RAM using llama.cpp's &lt;code&gt;-cmoe -ub 128&lt;/code&gt; flags. On a 4× RTX 4090 rig (96 GB total VRAM, plus 128 GB system RAM), community tests put V4 Flash throughput at &lt;strong&gt;8–12 tok/s&lt;/strong&gt;. Usable for batch processing; painful for interactive chat.&lt;/p&gt;

&lt;p&gt;Qwen3 235B-A22B at Q4 needs approximately 132 GB, putting it into similar territory — viable with partial offloading but not fast.&lt;/p&gt;

&lt;p&gt;At the 2–4× consumer GPU tier, the better answer remains Qwen3's MoE variants running on a single card, with spare GPUs handling separate workloads or parallel inference instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  Professional multi-GPU: dedicated server hardware
&lt;/h3&gt;

&lt;p&gt;This is where V4 Flash becomes genuinely useful. Running on dual NVIDIA RTX PRO 6000 Max-Q GPUs with W4A16+FP8 quantization and MTP self-speculation, V4 Flash hits approximately &lt;strong&gt;111 tok/s&lt;/strong&gt; for single-stream requests at 128K context. That figure comes from NVIDIA's own testing and represents the current high-water mark for affordable (relative to H100 clusters) V4 Flash inference.&lt;/p&gt;

&lt;p&gt;On a single H100 80GB with GPU offloading enabled, throughput is around &lt;strong&gt;20 tok/s&lt;/strong&gt; — better, but H100s cost $25,000+ used. The official V4 Flash API, measured by Artificial Analysis, runs at approximately &lt;strong&gt;84 tok/s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Qwen3 235B-A22B needs a minimum of 4× H100 80GB GPUs (320 GB total) to run Q4 with practical context and KV cache headroom — an eight-figure hardware investment for most people. The 30B-A3B and 35B-A3B MoE variants, by contrast, run on hardware anyone reading this article can actually buy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Speed on consumer hardware: what you'll actually see
&lt;/h2&gt;

&lt;p&gt;These figures are from community benchmarks tested in April–May 2026 using llama.cpp and Ollama.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Speed&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;Qwen3 8B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~42 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;Qwen3 14B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~23–29 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4070&lt;/td&gt;
&lt;td&gt;Qwen3 14B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;~60 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;td&gt;Qwen3.6-35B-A3B&lt;/td&gt;
&lt;td&gt;Q3 (Unsloth)&lt;/td&gt;
&lt;td&gt;~120 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3090 24GB&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;Not viable (103 GB min)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;96 GB pooled + CPU offload&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;Q2_K&lt;/td&gt;
&lt;td&gt;8–12 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Dual RTX PRO 6000&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;W4A16+FP8&lt;/td&gt;
&lt;td&gt;~111 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;H100 80GB (single)&lt;/td&gt;
&lt;td&gt;DeepSeek V4 Flash&lt;/td&gt;
&lt;td&gt;FP8 + offload&lt;/td&gt;
&lt;td&gt;~20 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Qwen3.6-35B-A3B number deserves attention: 120 tok/s on a single RTX 3090 with Unsloth's Q3 GGUF, at 23 GB VRAM. That's faster than real-time reading speed, from a 35B-parameter model, on a GPU that costs under $600 used. The model achieves this via its extreme MoE sparsity: 35B total parameters but only 3B active per token, so the compute per token is closer to a 3B dense model while retaining the quality of a much larger network.&lt;/p&gt;

&lt;p&gt;For the original Qwen3 30B-A3B (slightly smaller than the 35B update), the Q4_K_M quantization fits in approximately 16.8 GB — giving comfortable headroom on a 24 GB card for KV cache. The &lt;a href="https://dev.to/blog/qwen3-30b-a3b-local-ai-guide-2026/"&gt;full setup guide for Qwen3 30B-A3B&lt;/a&gt; walks through the Ollama and llama.cpp install paths.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmark quality: what you're trading away
&lt;/h2&gt;

&lt;p&gt;Speed means nothing if the model can't reason. Here's where the two families stand on standard benchmarks as of May 2026.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DeepSeek V4 Flash (at or near full precision):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AIME 2025: 99.4%&lt;/li&gt;
&lt;li&gt;MMLU-Pro: 92.8%&lt;/li&gt;
&lt;li&gt;BenchLM vs Qwen3 235B: 71 vs 33 (provisional)&lt;/li&gt;
&lt;li&gt;Context window: 1,000,000 tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Qwen3 235B-A22B (frontier variant):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Context window: 128K tokens&lt;/li&gt;
&lt;li&gt;Competitive with frontier proprietary models on most tasks&lt;/li&gt;
&lt;li&gt;Requires 4+ H100s to run&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Qwen3.6-35B-A3B (consumer GPU tier):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPQA: 86.0%&lt;/li&gt;
&lt;li&gt;AIME 2026: 92.7%&lt;/li&gt;
&lt;li&gt;Coding: competitive with much larger dense models&lt;/li&gt;
&lt;li&gt;Fits on a single RTX 3090 at Q3&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The takeaway: at full or near-full precision, V4 Flash is clearly ahead of anything that runs on consumer hardware. But V4 Flash at Q2_K (the minimum viable local quant) suffers measurable quality loss — community testing suggests 1–3% degradation moving from FP8 to aggressive quantization, and more on harder reasoning chains. The practical question isn't "which model family is better at ideal conditions?" but "which model family am I actually able to run well?"&lt;/p&gt;

&lt;p&gt;For more on how quantization affects reasoning quality at each level, see the &lt;a href="https://dev.to/blog/quantization-q4-q5-q6-q8-quality-loss-2026/"&gt;Q4 vs Q8 quality loss analysis&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 1M-token context window: useful or marketing?
&lt;/h2&gt;

&lt;p&gt;V4 Flash supports a 1M-token context window vs Qwen3's 128K cap. That sounds like a decisive advantage for long-document work.&lt;/p&gt;

&lt;p&gt;In practice, on consumer or even prosumer local hardware, 1M context is theoretical. Processing 1M tokens at 8–12 tok/s (the realistic local speed for V4 Flash with par&lt;/p&gt;

</description>
      <category>localllm</category>
      <category>deepseek</category>
      <category>qwen3</category>
      <category>gpu</category>
    </item>
    <item>
      <title>Apple MacBook Pro M5 Max for Local AI in 2026: 128GB Unified Memory, Neural Accelerators, and Whether It Beats a Discrete GPU Tower</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sat, 13 Jun 2026 07:05:20 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/apple-macbook-pro-m5-max-for-local-ai-in-2026-128gb-unified-memory-neural-accelerators-and-6o0</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/apple-macbook-pro-m5-max-for-local-ai-in-2026-128gb-unified-memory-neural-accelerators-and-6o0</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/apple-m5-max-macbook-local-ai-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: The &lt;a href="https://www.amazon.com/s?k=MacBook+Pro+M5+Max&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;MacBook Pro M5 Max&lt;/a&gt; with 128GB unified memory runs 70B parameter models at 18–25 tok/s — models an RTX 4090 or RTX 5090 literally cannot load. The new Neural Accelerators cut prompt processing time roughly 4× versus the M4 Max on compute-intensive workloads. The catch: 128GB configurations start around $5,499, and you pay that premium specifically to run large models — for 8B and 13B work, a $1,699 Mac Mini M4 Pro nearly keeps up.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;MacBook Pro M5 Max 128GB&lt;/th&gt;
&lt;th&gt;MacBook Pro M5 Max 36GB&lt;/th&gt;
&lt;th&gt;RTX 4090 Tower Build&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;70B+ models, portability, multi-model&lt;/td&gt;
&lt;td&gt;13B–32B daily use, better value entry&lt;/td&gt;
&lt;td&gt;Raw speed on ≤24GB models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory / VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;128GB unified&lt;/td&gt;
&lt;td&gt;36GB unified&lt;/td&gt;
&lt;td&gt;24GB GDDR6X&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LLM bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;614 GB/s&lt;/td&gt;
&lt;td&gt;614 GB/s&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;70B Q4 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;18–25&lt;/td&gt;
&lt;td&gt;~9–12 (partial offload)&lt;/td&gt;
&lt;td&gt;❌ can't load&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;8B Q4 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~82 (MLX)&lt;/td&gt;
&lt;td&gt;~82 (MLX)&lt;/td&gt;
&lt;td&gt;100–150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Power under load&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;60–90W&lt;/td&gt;
&lt;td&gt;60–90W&lt;/td&gt;
&lt;td&gt;450–600W (full system)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Starting price&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$5,499&lt;/td&gt;
&lt;td&gt;$3,899 (16-inch)&lt;/td&gt;
&lt;td&gt;~$3,500–$4,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;The catch&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Expensive; no CUDA; no eGPU&lt;/td&gt;
&lt;td&gt;70B won't fit cleanly&lt;/td&gt;
&lt;td&gt;Loud, hot, can't leave your desk&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: Buy the M5 Max 128GB if you need 70B models to run without VRAM gymnastics and you want a laptop. Build an RTX 4090 tower if you're doing multi-user serving, batch inference, or fine-tuning — CUDA still wins there. Neither choice is obviously wrong; they serve different workflows.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The Neural Accelerator story
&lt;/h2&gt;

&lt;p&gt;From M1 through M4, Apple's GPU had no dedicated matrix-multiplication hardware. All the linear algebra that drives LLM inference ran through standard floating-point ALUs shared with graphics workloads — the same shader pipeline that renders a game frame also processed your model's attention layers.&lt;/p&gt;

&lt;p&gt;M5 changes this. Apple built a dedicated Neural Accelerator into each GPU core for the first time. On the M5 Max with its 40-core GPU, that means 40 independent Neural Accelerators sitting on the same die, sharing the same 614 GB/s memory path as the GPU shaders.&lt;/p&gt;

&lt;p&gt;Each Neural Accelerator performs 1,024 FP16 fused multiply-accumulate operations per cycle. Apple claims the M5 Max delivers over four times the peak GPU compute for AI workloads compared to M4 Max.&lt;/p&gt;

&lt;p&gt;In practice, where this shows up is &lt;strong&gt;prompt processing&lt;/strong&gt; — the prefill phase where the model reads your input context before generating the first token. Prefill is compute-bound, not memory-bandwidth-bound, so the Neural Accelerators directly attack the bottleneck. Early benchmarks from Apple's MLX team and third-party testing show prefill roughly &lt;strong&gt;4× faster&lt;/strong&gt; on M5 Max versus M4 Max for models like Qwen3-14B.&lt;/p&gt;

&lt;p&gt;Token generation — the word-by-word output phase — doesn't benefit as much because it's still primarily memory-bandwidth-limited. The ~12% bandwidth increase (546 GB/s M4 Max → 614 GB/s M5 Max) translates to about a 20–28% improvement in sustained token generation speed.&lt;/p&gt;

&lt;p&gt;One critical caveat: &lt;strong&gt;MLX is currently the only inference framework that fully exploits the M5 Neural Engine.&lt;/strong&gt; Ollama, which uses llama.cpp under the hood, does not yet leverage the Neural Accelerators as of June 2026. If you're running Ollama today, you'll see the bandwidth gains but not the 4× prefill boost. MLX-native tools (mlx-lm, LM Studio with MLX backend, Open WebUI with MLX server) are where the real M5 performance lives.&lt;/p&gt;




&lt;h2&gt;
  
  
  Specs: what's actually inside
&lt;/h2&gt;

&lt;p&gt;The M5 Max comes in two GPU configurations. For LLM work, which configuration matters:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Spec&lt;/th&gt;
&lt;th&gt;M5 Max 32-core GPU&lt;/th&gt;
&lt;th&gt;M5 Max 40-core GPU&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU cores&lt;/td&gt;
&lt;td&gt;14 (12P + 2E)&lt;/td&gt;
&lt;td&gt;18 (16P + 2E)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory bandwidth&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;460 GB/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;614 GB/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural Engine&lt;/td&gt;
&lt;td&gt;16-core&lt;/td&gt;
&lt;td&gt;16-core&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Max unified memory&lt;/td&gt;
&lt;td&gt;64 GB&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;128 GB&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Neural Accelerators&lt;/td&gt;
&lt;td&gt;32&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;40&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI compute&lt;/td&gt;
&lt;td&gt;~46 TFLOPS FP16&lt;/td&gt;
&lt;td&gt;~70 TFLOPS FP16&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The 128GB memory ceiling requires the 40-core GPU variant — same pattern as M4 Max. If you're reading this because you want to run 70B models, you need the 40-core configuration.&lt;/p&gt;

&lt;p&gt;The M5 Max is announced on TSMC 3nm, uses Apple's Fusion Architecture that connects two dies with advanced IP blocks, and was introduced in March 2026 alongside the updated 14-inch and 16-inch MacBook Pro lineup.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real benchmark numbers
&lt;/h2&gt;

&lt;p&gt;These benchmarks come from MLX-powered inference (mlx-lm), which represents best-case M5 performance. Ollama users will see the token generation numbers but not the prefill improvements.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Quantization&lt;/th&gt;
&lt;th&gt;M5 Max tok/s&lt;/th&gt;
&lt;th&gt;M4 Max tok/s&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3 8B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;82&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;+28%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 30B-A3B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;58&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;45&lt;/td&gt;
&lt;td&gt;+29%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Llama 3.3 70B&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;18–25&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~14–18&lt;/td&gt;
&lt;td&gt;+20–28%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 E2B&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~158&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~120&lt;/td&gt;
&lt;td&gt;+32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Phi-4 Mini&lt;/td&gt;
&lt;td&gt;Q4&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~135&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~100&lt;/td&gt;
&lt;td&gt;+35%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Prefill (prompt processing) numbers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Qwen3-14B 16K context on M5 Max via MLX: roughly &lt;strong&gt;8–10 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Same workload on M4 Max: roughly &lt;strong&gt;30–40 seconds&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;That's the Neural Accelerators doing actual work&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For context: M4 Max (40-core GPU) was benchmarked at &lt;strong&gt;83.06 tok/s on LLaMA 7B Q4_0&lt;/strong&gt; in the llama.cpp community benchmark thread (Discussion #4167). M5 Max isn't yet in that thread as of June 2026 — the numbers above come from third-party MLX benchmarks and Apple's MLX team test results.&lt;/p&gt;

&lt;h3&gt;
  
  
  The 70B model reality check
&lt;/h3&gt;

&lt;p&gt;A Llama 3.3 70B model at Q4_K_M quantization occupies approximately &lt;strong&gt;43 GB&lt;/strong&gt;. That figure is the hard floor for running it without CPU offload, which tanks performance.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;RTX 4090: 24GB VRAM. Doesn't fit. ❌&lt;/li&gt;
&lt;li&gt;RTX 5090: 32GB VRAM. Doesn't fit. ❌&lt;/li&gt;
&lt;li&gt;M5 Max 36GB config: Doesn't fit cleanly. You'd be splitting layers to CPU RAM, which drops tok/s to roughly 3–5. ❌&lt;/li&gt;
&lt;li&gt;M5 Max 96GB config: Fits. ~19 tok/s. ✅&lt;/li&gt;
&lt;li&gt;M5 Max 128GB config: Fits with 85GB to spare. 18–25 tok/s depending on framework. ✅&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That "spare" capacity matters too. With 128GB you can simultaneously load a 70B model and a second 7B assistant model, or keep a large embedding model resident, or run retrieval-augmented generation workflows without swapping.&lt;/p&gt;




&lt;h2&gt;
  
  
  The M5 Max vs NVIDIA question
&lt;/h2&gt;

&lt;p&gt;This comparison comes up constantly and the framing is almost always wrong. It's not "which is faster?" — the answer depends entirely on model size.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For models ≤24GB (8B, 13B, most 30B at Q4):&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RTX 4090 wins on raw token generation: 100–150 tok/s versus M5 Max's ~82 tok/s. The &lt;a href="https://www.amazon.com/s?k=RTX+4090&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 4090&lt;/a&gt; has 1,008 GB/s GDDR6X bandwidth — 64% more than M5 Max's 614 GB/s — and for small models that fit entirely in 24GB, that bandwidth advantage is fully realized. A well-tuned llama.cpp or vLLM setup on a 4090 runs Llama 3.1 8B at real-time speeds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For 70B models and above:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;M5 Max 128GB is the only sub-$10K option that runs them without compromises. The RTX 5090 at $1,999+ still only has 32GB, and 70B at Q4_K_M needs 43GB. Unless you pair two RTX 5090s in a dual-GPU setup with NVLink (expensive, complex, desktop-only), the Mac is the practical answer.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For MoE (Mixture of Experts) models:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Models like Qwen3.5 30B-A3B only activate ~3B parameters per forward pass, which means they need less bandwidth for token generation. The M5 Max's 614 GB/s is more than enough here, and 128GB gives you room for the full parameter set. At 58 tok/s, Qwen3.5 30B-A3B on M5 Max feels genuinely fast.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Power consumption:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;M5 Max draws &lt;strong&gt;60–90W&lt;/strong&gt; during sustained inference. An RTX 4090 system (GPU + CPU + RAM + cooling) pulls &lt;strong&gt;450–600W&lt;/strong&gt; under the same workload. At $0.12/kWh, that gap is $0.048–0.061 per hour, or roughly &lt;strong&gt;$420–$530 per year&lt;/strong&gt; if you're running inference 8 hours daily. Over three years the electricity difference alone is $1,260–$1,590. That meaningfully narrows the price premium of the Mac in total cost of ownership terms.&lt;/p&gt;

&lt;p&gt;Want to compare cloud GPU costs against owning either? We ran that math in the [RunPod vs Local GPU analysis](/blog/runpod-vs-local-gpu-rent-&lt;/p&gt;

</description>
      <category>applesilicon</category>
      <category>m5max</category>
      <category>macbookpro</category>
      <category>localai</category>
    </item>
    <item>
      <title>$200 Modded Tesla V100 for Local AI in 2026: Cheaper Than an RTX 5060 Ti and Surprisingly Competitive</title>
      <dc:creator>Jovan Chan</dc:creator>
      <pubDate>Sat, 13 Jun 2026 07:04:23 +0000</pubDate>
      <link>https://dev.to/jovan_chan_9500711396d4e6/200-modded-tesla-v100-for-local-ai-in-2026-cheaper-than-an-rtx-5060-ti-and-surprisingly-1ihp</link>
      <guid>https://dev.to/jovan_chan_9500711396d4e6/200-modded-tesla-v100-for-local-ai-in-2026-cheaper-than-an-rtx-5060-ti-and-surprisingly-1ihp</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This article was originally published on &lt;a href="https://runaihome.com/blog/modded-tesla-v100-budget-ai-gpu-2026/" rel="noopener noreferrer"&gt;runaihome.com&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;: A modded NVIDIA Tesla V100 SXM2 with a PCIe adapter costs around $200 total and outperforms the &lt;a href="https://www.amazon.com/s?k=RTX+3060&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 3060&lt;/a&gt; by 42% on local LLM inference. Against an &lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti+16GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti 16GB&lt;/a&gt; at $499–$589, the value argument is real — until you account for Ollama's broken support, a 300W power draw, and zero display output.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Modded V100 SXM2 16GB&lt;/th&gt;
&lt;th&gt;RTX 5060 Ti 16GB&lt;/th&gt;
&lt;th&gt;RTX 3060 12GB&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best for&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20–30B models on a tight budget&lt;/td&gt;
&lt;td&gt;Balanced daily-driver LLM rig&lt;/td&gt;
&lt;td&gt;7–13B models, display included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Memory bandwidth&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;900 GB/s&lt;/td&gt;
&lt;td&gt;448 GB/s&lt;/td&gt;
&lt;td&gt;360 GB/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;VRAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;16GB HBM2&lt;/td&gt;
&lt;td&gt;16GB GDDR7&lt;/td&gt;
&lt;td&gt;12GB GDDR6&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total cost (June 2026)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~$200–270&lt;/td&gt;
&lt;td&gt;~$499–589&lt;/td&gt;
&lt;td&gt;~$200–250 used&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;TDP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;300W&lt;/td&gt;
&lt;td&gt;180W&lt;/td&gt;
&lt;td&gt;170W&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Display output&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Ollama support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Broken in v0.30+ (fix below)&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;td&gt;Full&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Honest take&lt;/strong&gt;: If you already have an iGPU or a second card for display, can compile llama.cpp from source, and want the best raw bandwidth per dollar under $300, the modded V100 is genuinely interesting. If you want something that just works, pay for the RTX 5060 Ti.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  The mod: what it actually is
&lt;/h2&gt;

&lt;p&gt;The Tesla V100 comes in two main physical formats. The PCIe version plugs into a desktop motherboard like any consumer card but is expensive and increasingly rare. The SXM2 version is a bare die designed for NVIDIA's DGX server backplane — faster (900 GB/s vs 897 GB/s) but it has no PCIe connector, no display output, and no cooling solution on its own.&lt;/p&gt;

&lt;p&gt;The mod bridges that gap. A third-party PCIe adapter board (widely available on eBay) converts the SXM2 socket to a standard PCIe x16 slot. Add an external power supply (the adapter needs dual 8-pin PCIe connectors), strap on a 80mm Noctua fan with a 3D-printed shroud because the SXM2 module relies on server-chassis airflow, and you have a desktop AI accelerator that cost ~$200 in parts.&lt;/p&gt;

&lt;p&gt;YouTuber Hardware Haven documented this build in detail and ran it against consumer GPUs in Ollama. The V100 hit &lt;strong&gt;130 tokens/second on GPT-OSS-20B&lt;/strong&gt;, outpacing the RX 7800 XT (90 tok/s) and the RTX 3060 12GB by 42%.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why 900 GB/s matters for LLM inference
&lt;/h2&gt;

&lt;p&gt;Memory bandwidth is the primary bottleneck for autoregressive LLM inference — not compute. During token generation, the GPU streams the model's weight matrix through memory on every forward pass. A card with twice the bandwidth generates roughly twice the tokens per second on the same model, all else being equal.&lt;/p&gt;

&lt;p&gt;That's why the V100 SXM2's &lt;strong&gt;900 GB/s&lt;/strong&gt; matters more than its aging Volta architecture when you're running quantized local models:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Memory bandwidth&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tesla V100 SXM2&lt;/td&gt;
&lt;td&gt;900 GB/s&lt;/td&gt;
&lt;td&gt;Volta (2017)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti+16GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti 16GB&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;448 GB/s&lt;/td&gt;
&lt;td&gt;Blackwell (2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 4090&lt;/td&gt;
&lt;td&gt;1,008 GB/s&lt;/td&gt;
&lt;td&gt;Ada Lovelace (2022)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 9070 XT&lt;/td&gt;
&lt;td&gt;640 GB/s&lt;/td&gt;
&lt;td&gt;RDNA 4 (2025)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;360 GB/s&lt;/td&gt;
&lt;td&gt;Ampere (2021)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;A 2025 Blackwell GPU with half the V100's bandwidth will lose on raw throughput for a single-user, low-batch LLM inference workload. The RTX 5060 Ti's 448 GB/s is solid — it's roughly what you'd expect for a $500 mid-range card — but the V100 SXM2 is nearly twice as wide.&lt;/p&gt;

&lt;p&gt;The V100 also carries &lt;strong&gt;125 TFLOPS of FP16&lt;/strong&gt; compute from its 640 Tensor Cores, meaning prefill (processing your prompt) is fast. In benchmarks from the llama.cpp community (Discussion #15396), a V100 16GB processed a 2,048-token prompt at &lt;strong&gt;3,526 tok/s&lt;/strong&gt; and generated subsequent tokens at &lt;strong&gt;117.71 tok/s&lt;/strong&gt; with GPT-OSS-20B at MXFP4 quantization.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real benchmark numbers
&lt;/h2&gt;

&lt;p&gt;These are the numbers from the Hardware Haven build and the llama.cpp community, not marketing estimates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware Haven mod test (Ollama, GPT-OSS-20B)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V100 SXM2 16GB (modded)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;130 tok/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Custom PCIe adapter, Noctua fan&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RX 7800 XT 16GB&lt;/td&gt;
&lt;td&gt;90 tok/s&lt;/td&gt;
&lt;td&gt;Daily-driver GPU in the same rig&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;~92 tok/s&lt;/td&gt;
&lt;td&gt;Best NVIDIA card available for comparison&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The V100 is 42% faster than the RTX 3060. At 100W power cap (to compare apples-to-apples), the V100 hit 95 tok/s at 170W wall draw vs. the RTX 3060 at 68 tok/s at 171W wall draw — same wall power, 40% more output.&lt;/p&gt;

&lt;h3&gt;
  
  
  llama.cpp benchmark (V100 16GB, GPT-OSS-20B, MXFP4)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scenario&lt;/th&gt;
&lt;th&gt;Tokens/sec&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Prefill pp2048&lt;/td&gt;
&lt;td&gt;3,527 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill pp8192&lt;/td&gt;
&lt;td&gt;3,321 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prefill pp16384&lt;/td&gt;
&lt;td&gt;2,769 t/s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Token generation tg128&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;117.71 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The command that produced these results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;llama-server &lt;span class="nt"&gt;-hf&lt;/span&gt; ggml-org/gpt-oss-20b-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ctx-size&lt;/span&gt; 32768 &lt;span class="nt"&gt;--jinja&lt;/span&gt; &lt;span class="nt"&gt;-ub&lt;/span&gt; 4096 &lt;span class="nt"&gt;-b&lt;/span&gt; 4096
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;GPT-OSS-20B in MXFP4 fits within 16GB at up to 32K context. Beyond 32K, you'll hit OOM on the 16GB variant.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Ollama problem you'll hit immediately
&lt;/h2&gt;

&lt;p&gt;If you buy a V100, set up the adapter, boot Linux, install Ollama, and try to run a model, you'll get this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;CUDA error: device kernel image is invalid
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Ollama v0.30.0 dropped support for CUDA compute capability 7.0 (Volta/V100). The prebuilt CUDA libraries bundled with Ollama no longer include sm_70 kernels. Older versions (v0.24.0 and earlier) work fine, but you'd be running outdated software on a production setup.&lt;/p&gt;

&lt;p&gt;LM Studio has the same issue — its bundled llama.cpp runtime doesn't include sm_70 kernels either (tracked in lmstudio-bug-tracker issue #1758).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The working solution&lt;/strong&gt;: compile llama.cpp from source with explicit architecture support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;CUDA_HOME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda &lt;span class="se"&gt;\&lt;/span&gt;
&lt;span class="nv"&gt;CUDACXX&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/usr/local/cuda/bin/nvcc &lt;span class="se"&gt;\&lt;/span&gt;
cmake &lt;span class="nt"&gt;-S&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_CUDA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CUDA_ARCHITECTURES&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"70;86"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release

cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-t&lt;/span&gt; llama-server &lt;span class="nt"&gt;--&lt;/span&gt; &lt;span class="nt"&gt;-j&lt;/span&gt; 16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;70&lt;/code&gt; in &lt;code&gt;DCMAKE_CUDA_ARCHITECTURES&lt;/code&gt; is the compute capability for Volta. You'll also want &lt;code&gt;86&lt;/code&gt; if you ever add an Ampere card. After compiling, &lt;code&gt;llama-server&lt;/code&gt; runs natively on the V100 with full GPU offload.&lt;/p&gt;

&lt;p&gt;If you want to stick with Ollama, pin to v0.24.0. It's not ideal for long-term use but works as a stopgap.&lt;/p&gt;




&lt;h2&gt;
  
  
  Build cost breakdown (June 2026)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;What to buy&lt;/th&gt;
&lt;th&gt;Price range&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V100 SXM2 16GB&lt;/td&gt;
&lt;td&gt;eBay used&lt;/td&gt;
&lt;td&gt;$100–150&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SXM2-to-PCIe adapter&lt;/td&gt;
&lt;td&gt;eBay (various sellers, primarily China)&lt;/td&gt;
&lt;td&gt;$50–100&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;80mm Noctua fan + 3D-printed shroud&lt;/td&gt;
&lt;td&gt;Noctua + print locally&lt;/td&gt;
&lt;td&gt;~$20–30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6+2-pin PCIe power cable (×2)&lt;/td&gt;
&lt;td&gt;Already on most PSUs&lt;/td&gt;
&lt;td&gt;$0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~$170–280&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The variation is wide because these are secondhand parts with no fixed retail. V100 SXM2 modules sell in the $80–180 range on eBay depending on seller, condition, and shipping origin. Budget $200 as your planning number and budget $280 if you want to be safe.&lt;/p&gt;

&lt;p&gt;Complete kits (V100 SXM2 + PCIe adapter together) appear on eBay for $200–270, which is often the safer route — the adapter and card are tested as a pair.&lt;/p&gt;

&lt;p&gt;For comparison, a new &lt;a href="https://www.amazon.com/s?k=RTX+5060+Ti+16GB&amp;amp;tag=runaihome-20" rel="noopener noreferrer"&gt;RTX 5060 Ti 16GB&lt;/a&gt; runs $499–589 at Newegg and Amazon as of June 2026, against an MSRP of $429 that's mostly theoretical at this point.&lt;/p&gt;




&lt;h2&gt;
  
  
  Power cost at 300W TDP
&lt;/h2&gt;

&lt;p&gt;The V100 SXM2 has a &lt;strong&gt;300W TDP&lt;/strong&gt;. The RTX 5060 Ti pulls 180W. That gap is real money over time.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;GPU&lt;/th&gt;
&lt;th&gt;TDP&lt;/th&gt;
&lt;th&gt;$/hour @ $0.12/kWh&lt;/th&gt;
&lt;th&gt;$/month (8 hrs/day)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;V100 SXM2&lt;/td&gt;
&lt;td&gt;300W&lt;/td&gt;
&lt;td&gt;$0.036&lt;/td&gt;
&lt;td&gt;~$8.64&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 5060 Ti 16GB&lt;/td&gt;
&lt;td&gt;180W&lt;/td&gt;
&lt;td&gt;$0.0216&lt;/td&gt;
&lt;td&gt;~$5.18&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;RTX 3060 12GB&lt;/td&gt;
&lt;td&gt;170W&lt;/td&gt;
&lt;td&gt;$0.0204&lt;/td&gt;
&lt;td&gt;~$4.90&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;That's ~$3.50/month more for the V100 at 8 hours/day of inference — $42/year. Over the 3-year life of the hardware, it adds up to roughly $126 extra in electricity. Not dealbreaking, but factor it in.&lt;/p&gt;

&lt;p&gt;If you're running inference 24/7 — say, a shared family LLM server — that gap triples. And at 300W, your PSU needs to handle it: budget a minimum &lt;strong&gt;750W 80+ Gold unit&lt;/strong&gt; for a V100 build.&lt;/p&gt;




&lt;h2&gt;
  
  
  What the V100 16GB can and can't run
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Fits cleanly in 16GB
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;GPT-OSS-20B MXFP4: 11.27 GiB — ful&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>gpu</category>
      <category>localllm</category>
      <category>budget</category>
      <category>nvidia</category>
    </item>
  </channel>
</rss>
