<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hermes Rodríguez</title>
    <description>The latest articles on DEV Community by Hermes Rodríguez (@hrodrig).</description>
    <link>https://dev.to/hrodrig</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1307171%2F7f243604-d498-4a13-87d1-7bf6401a5a8d.jpeg</url>
      <title>DEV Community: Hermes Rodríguez</title>
      <link>https://dev.to/hrodrig</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hrodrig"/>
    <language>en</language>
    <item>
      <title>~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:46:26 +0000</pubDate>
      <link>https://dev.to/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</link>
      <guid>https://dev.to/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</guid>
      <description>&lt;p&gt;Hands-on guide based on a real setup: &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, &lt;strong&gt;AMD Radeon 760M&lt;/strong&gt; (Ryzen iGPU), &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. 96 GiB), &lt;strong&gt;llama.cpp&lt;/strong&gt; built with &lt;strong&gt;GGML_VULKAN&lt;/strong&gt;, OpenAI-compatible API via &lt;strong&gt;llama-server&lt;/strong&gt;, &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker, and &lt;strong&gt;OpenCode&lt;/strong&gt; or &lt;strong&gt;VS Code&lt;/strong&gt; (§11) using the same API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; if you buy (or plan to buy) a &lt;strong&gt;mini PC&lt;/strong&gt; or small tower with &lt;strong&gt;plenty of RAM and disk&lt;/strong&gt;, this walkthrough gets you to &lt;strong&gt;local inference&lt;/strong&gt; — GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is &lt;strong&gt;AMD iGPU + Vulkan&lt;/strong&gt;; if your hardware differs, keep the &lt;strong&gt;Ubuntu → llama.cpp → weights → server&lt;/strong&gt; flow and adjust &lt;strong&gt;§5–§6&lt;/strong&gt; (deps and build) for your GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference hardware (validated while writing this guide):&lt;/strong&gt; &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; mini PC (&lt;em&gt;Device Type: MINI PC&lt;/em&gt; on the chassis label; vendor &lt;strong&gt;Minisforum&lt;/strong&gt; / &lt;strong&gt;Micro Computer (HK) Tech Limited&lt;/strong&gt;) with &lt;strong&gt;AMD Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M Graphics&lt;/strong&gt;, &lt;strong&gt;96 GiB&lt;/strong&gt; &lt;strong&gt;DDR5&lt;/strong&gt; RAM, &lt;strong&gt;~1 TiB&lt;/strong&gt; NVMe, &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;. This is not a minimum-requirements bar—it &lt;strong&gt;anchors&lt;/strong&gt; compile times, download comfort, and token throughput vs other CPUs, RAM, or disks. To &lt;strong&gt;verify memory type and size&lt;/strong&gt; on your box, see §3 (&lt;em&gt;Quick hardware inventory&lt;/em&gt;). A &lt;strong&gt;photo of the box&lt;/strong&gt; is at the &lt;strong&gt;end&lt;/strong&gt;, under Closing thoughts.&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;YOUR_USER&lt;/code&gt;, model paths, and hostname as needed. If the machine is &lt;strong&gt;server-only&lt;/strong&gt; (no monitor), start with §4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" alt="local LLM stack on Ubuntu — reference illustration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Too long; didn’t read&lt;/em&gt; — one-minute skim before the full guide. Full table of contents →&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you’re building:&lt;/strong&gt; &lt;strong&gt;local&lt;/strong&gt; inference on &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; + &lt;strong&gt;Vulkan&lt;/strong&gt;, a &lt;strong&gt;GGUF&lt;/strong&gt; weights file, OpenAI-style API via &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;:8080&lt;/code&gt;&lt;/strong&gt;); optional &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker (&lt;strong&gt;&lt;code&gt;:3000&lt;/code&gt;&lt;/strong&gt;); &lt;strong&gt;OpenCode&lt;/strong&gt; and &lt;strong&gt;Visual Studio Code&lt;/strong&gt; can talk to the same &lt;strong&gt;&lt;code&gt;http://…:8080/v1&lt;/code&gt;&lt;/strong&gt; base URL as an OpenAI-compatible provider (&lt;strong&gt;§11&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest path:&lt;/strong&gt; &lt;strong&gt;BIOS/UMA&lt;/strong&gt; if relevant (§2) → deps + &lt;strong&gt;Vulkan&lt;/strong&gt; (§5) → build &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6) → download &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt; (§7: &lt;strong&gt;&lt;code&gt;wget --continue&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;; &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; for long SSH sessions) → smoke-test &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; → run &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; manually or under &lt;strong&gt;systemd&lt;/strong&gt; (§8–§9) → point &lt;strong&gt;Open WebUI&lt;/strong&gt; at the host (§10) → &lt;strong&gt;optional:&lt;/strong&gt; &lt;strong&gt;OpenCode&lt;/strong&gt; / &lt;strong&gt;VS Code&lt;/strong&gt; (§11).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight RAM / OOM:&lt;/strong&gt; test as the same user the service runs as; match &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;; if it fails, drop &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt;) before chasing &lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; (full offload). Don’t &lt;strong&gt;enable&lt;/strong&gt; the unit until the GGUF is &lt;strong&gt;fully&lt;/strong&gt; downloaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More models:&lt;/strong&gt; §7 covers &lt;strong&gt;Gemma 4&lt;/strong&gt;, &lt;strong&gt;Qwen Coder&lt;/strong&gt;, &lt;strong&gt;DeepSeek Lite&lt;/strong&gt;, &lt;strong&gt;Llama 3.1&lt;/strong&gt; (downloads, &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;, quick tests).&lt;/li&gt;
&lt;li&gt;Swap in &lt;strong&gt;&lt;code&gt;YOUR_USER&lt;/code&gt;&lt;/strong&gt;, paths, and hostname; &lt;strong&gt;server-only&lt;/strong&gt; box → start at §4.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Links jump to headings on GitHub, Cursor, and most Markdown viewers. If a link does not match your renderer, search for the heading in the file.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TL;DR&lt;/li&gt;
&lt;li&gt;1. Context and choices&lt;/li&gt;
&lt;li&gt;2. BIOS (before or right after installing Ubuntu)&lt;/li&gt;
&lt;li&gt;
3. Installing Ubuntu

&lt;ul&gt;
&lt;li&gt;Quick hardware inventory (optional)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

4. Ubuntu Server without a desktop (headless)

&lt;ul&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)&lt;/li&gt;
&lt;li&gt;Rest of this guide&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;5. Base dependencies and Vulkan check&lt;/li&gt;

&lt;li&gt;

6. Building llama.cpp with Vulkan

&lt;ul&gt;
&lt;li&gt;Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

7. GGUF models and paths

&lt;ul&gt;
&lt;li&gt;What GGUF is (name, role, trade-offs)&lt;/li&gt;
&lt;li&gt;Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)&lt;/li&gt;
&lt;li&gt;Where models live and how to list them&lt;/li&gt;
&lt;li&gt;Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)&lt;/li&gt;
&lt;li&gt;Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)&lt;/li&gt;
&lt;li&gt;Example: local Llama 3.1 8B Instruct Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)&lt;/li&gt;
&lt;li&gt;Quick terminal test&lt;/li&gt;
&lt;li&gt;Adding or switching models&lt;/li&gt;
&lt;li&gt;Experimenting with more models: setup, testing, and limits&lt;/li&gt;
&lt;li&gt;One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)&lt;/li&gt;
&lt;li&gt;Common steps (every model swap)&lt;/li&gt;
&lt;li&gt;Reference table (repos + sample file)&lt;/li&gt;
&lt;li&gt;Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)&lt;/li&gt;
&lt;li&gt;Per-model quick test (right after download)&lt;/li&gt;
&lt;li&gt;Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;9. systemd service (start on boot)&lt;/li&gt;

&lt;li&gt;

10. Open WebUI with Docker (port 3000 → backend on 8080)

&lt;ul&gt;
&lt;li&gt;Connect Open WebUI to llama-server&lt;/li&gt;
&lt;li&gt;Chat up and running (example)&lt;/li&gt;
&lt;li&gt;No browsing or GitHub fetch: real limits (and confident wrong answers)&lt;/li&gt;
&lt;li&gt;Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed&lt;/li&gt;
&lt;li&gt;“Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)&lt;/li&gt;
&lt;li&gt;Updating Open WebUI (Docker)&lt;/li&gt;
&lt;li&gt;If you also run Ollama&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode&lt;/li&gt;
&lt;li&gt;Visual Studio Code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04

&lt;ul&gt;
&lt;li&gt;12.1 Universe repository and packages&lt;/li&gt;
&lt;li&gt;12.2 LunarG repository (Vulkan SDK)&lt;/li&gt;
&lt;li&gt;12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc&lt;/li&gt;
&lt;li&gt;12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

13. Performance and models (rough guide)

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)&lt;/li&gt;
&lt;li&gt;AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

14. Remote desktop (Ubuntu 24.04 Desktop, LAN)

&lt;ul&gt;
&lt;li&gt;14.1 Enable on the mini PC&lt;/li&gt;
&lt;li&gt;14.2 Connect from another machine&lt;/li&gt;
&lt;li&gt;14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;14.4 If connection fails&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Final checklist&lt;/li&gt;

&lt;li&gt;Quick port reference&lt;/li&gt;

&lt;li&gt;Closing thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Context and choices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04 LTS (desktop or server; server without a GUI saves RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD iGPU&lt;/td&gt;
&lt;td&gt;Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; format; Q4_K_M quantization (balance) or Q8_0 (higher quality, larger).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; with &lt;code&gt;-DGGML_VULKAN=1&lt;/code&gt; uses the &lt;strong&gt;GPU&lt;/strong&gt; for layers (&lt;code&gt;-ngl&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lots of RAM&lt;/td&gt;
&lt;td&gt;You can load large models in system RAM even if the iGPU has little dedicated VRAM; the BIOS can give the GPU a larger framebuffer (see §2).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reference diagram (browser / container / host):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" alt="Reference diagram (browser / container / host)" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" alt="Illustration: browser and IDE → Open WebUI container → llama-server and GGUF on the host" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. BIOS (before or right after installing Ubuntu)
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;Minisforum&lt;/strong&gt; boxes (e.g. &lt;strong&gt;UM760 Slim&lt;/strong&gt;) with AMI BIOS and Ryzen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter BIOS (&lt;strong&gt;Del&lt;/strong&gt;, &lt;strong&gt;F2&lt;/strong&gt;, or &lt;strong&gt;F7&lt;/strong&gt; on many systems).&lt;/li&gt;
&lt;li&gt;Typical path: &lt;strong&gt;Advanced → AMD CBS → NBIO Common Options → GFX Configuration&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;UMA Frame Buffer Size&lt;/strong&gt; (or similar) from &lt;em&gt;Auto&lt;/em&gt; / 2 GiB to &lt;strong&gt;8 G&lt;/strong&gt; or &lt;strong&gt;16 G&lt;/strong&gt; if available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Goal: give the iGPU more unified memory for model layers; with plenty of system RAM the trade-off is usually worth it.&lt;/p&gt;
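&lt;p&gt;To confirm how big a carve-out the driver actually received after a BIOS change, amdgpu exposes the framebuffer size (in bytes) under sysfs. A minimal sketch, assuming the usual Ubuntu 24.04 path; &lt;code&gt;bytes_to_gib&lt;/code&gt; is a tiny helper invented for this guide, not a system tool:&lt;/p&gt;

```shell
# bytes_to_gib: helper for this guide (not a system tool) that converts the
# byte count amdgpu reports into GiB for a quick sanity check.
bytes_to_gib() {
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# On a real box, feed it the sysfs value (card index may be card1, not card0):
#   bytes_to_gib "$(cat /sys/class/drm/card0/device/mem_info_vram_total)"
# Sample: a 16 GiB UMA carve-out reported as bytes
bytes_to_gib 17179869184   # prints 16.0
```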




&lt;h2&gt;
  
  
  3. Installing Ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;strong&gt;third-party software&lt;/strong&gt; for graphics and Wi‑Fi if you use the graphical installer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;minimal&lt;/strong&gt; install drops extra packages if the box is mainly an inference server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical order of this guide (§4 and §10 are optional depending on your setup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" alt="Tipical installation steps" width="646" height="1250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick hardware inventory (optional)
&lt;/h3&gt;

&lt;p&gt;Before picking huge models and quantizations, check &lt;strong&gt;RAM&lt;/strong&gt;, &lt;strong&gt;disk on &lt;code&gt;/&lt;/code&gt;&lt;/strong&gt;, and whether the &lt;strong&gt;integrated GPU&lt;/strong&gt; shows up on the PCI bus (this does not replace a Vulkan test, but it sets expectations).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lspci | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'vga|3d|display'&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for in &lt;code&gt;lspci&lt;/code&gt;:&lt;/strong&gt; on &lt;strong&gt;Ryzen Phoenix / Hawk Point&lt;/strong&gt; boards you often see something like &lt;strong&gt;&lt;code&gt;VGA compatible controller: … Phoenix1&lt;/code&gt;&lt;/strong&gt; plus an AMD &lt;strong&gt;HDMI audio&lt;/strong&gt; line. The marketing name “Radeon 760M” may not appear verbatim; the real check is that an &lt;strong&gt;AMD VGA/Display&lt;/strong&gt; controller exists and that &lt;strong&gt;&lt;code&gt;vulkaninfo&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; see &lt;strong&gt;RADV&lt;/strong&gt; (§4–§5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;free&lt;/code&gt;:&lt;/strong&gt; total and &lt;strong&gt;available&lt;/strong&gt; RAM tell you how large a GGUF you can keep &lt;strong&gt;comfortably&lt;/strong&gt; in memory alongside the OS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;df&lt;/code&gt;:&lt;/strong&gt; each &lt;code&gt;.gguf&lt;/code&gt; costs whatever the card lists (e.g. ~8 GiB for an 8B Q8_0); leave headroom for updates, Docker, and rebuilds.&lt;/p&gt;
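&lt;p&gt;A rough way to budget that disk (and RAM) cost before downloading: parameters in billions times bits-per-weight, divided by 8, approximates the file size in GiB; tokenizer and metadata overhead are ignored, so treat it as a floor. &lt;code&gt;gguf_size_gib&lt;/code&gt; and the bpw figures below are ballpark assumptions for this guide, not exact specs:&lt;/p&gt;

```shell
# gguf_size_gib: rough .gguf size estimate used in this guide (not an exact
# spec): params (billions) * bits-per-weight / 8 is roughly GiB on disk.
gguf_size_gib() {
  LC_ALL=C awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

gguf_size_gib 8 8.5   # 8B at Q8_0 (about 8.5 bpw)   -> prints 8.5
gguf_size_gib 8 4.8   # 8B at Q4_K_M (about 4.8 bpw) -> prints 4.8
```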

&lt;p&gt;&lt;strong&gt;DDR4 vs DDR5 (re-check RAM type):&lt;/strong&gt; data comes from firmware &lt;strong&gt;SMBIOS&lt;/strong&gt;. Install it with &lt;strong&gt;&lt;code&gt;sudo apt install -y dmidecode&lt;/code&gt;&lt;/strong&gt; if needed. &lt;strong&gt;Note:&lt;/strong&gt; some &lt;code&gt;dmidecode&lt;/code&gt; builds indent fields with &lt;strong&gt;spaces&lt;/strong&gt;, not tabs—an overly strict &lt;code&gt;grep&lt;/code&gt; can print &lt;strong&gt;nothing&lt;/strong&gt; even when DMI works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One line per interesting field (tab- or space-indented)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s1"&gt;'Locator|Size:|Type:|Speed:|Configured Memory Speed:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that is still empty, dump the start of the table—some boards expose only a subset of fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each populated slot, &lt;strong&gt;&lt;code&gt;Type:&lt;/code&gt;&lt;/strong&gt; should read &lt;strong&gt;&lt;code&gt;DDR5&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;DDR4&lt;/code&gt;&lt;/strong&gt;, etc. All-&lt;strong&gt;&lt;code&gt;Unknown&lt;/code&gt;&lt;/strong&gt; or an empty dump may mean a &lt;strong&gt;locked&lt;/strong&gt; BIOS, a &lt;strong&gt;hypervisor&lt;/strong&gt; restriction, or firmware in need of an update—cross-check the &lt;strong&gt;mini PC spec sheet&lt;/strong&gt; or &lt;strong&gt;DIMM/SODIMM silkscreen/label&lt;/strong&gt;. &lt;strong&gt;Ryzen 7040&lt;/strong&gt; mobile (e.g. 7640HS) is usually &lt;strong&gt;DDR5-only&lt;/strong&gt; on recent kits; still verify through one of these paths.&lt;/p&gt;
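&lt;p&gt;To see what a healthy slot looks like before running the command on real firmware, here is the same field filter over a fabricated, space-indented sample dump (the values are illustrative only):&lt;/p&gt;

```shell
# The §3 field filter run over a made-up sample dump (space-indented, as some
# dmidecode builds emit), showing what a populated DDR5 slot reports.
sample='Memory Device
        Size: 48 GB
        Type: DDR5
        Speed: 5600 MT/s
        Locator: DIMM 0'
printf '%s\n' "$sample" | grep -iE 'Locator|Size:|Type:|Speed:'
```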




&lt;h2&gt;
  
  
  4. Ubuntu Server without a desktop (headless)
&lt;/h2&gt;

&lt;p&gt;When the mini PC only serves the model (SSH + browser on another machine), &lt;strong&gt;Ubuntu Server 24.04 LTS&lt;/strong&gt; saves RAM and attack surface by skipping GNOME and desktop services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download the &lt;strong&gt;Ubuntu Server&lt;/strong&gt; ISO from &lt;a href="https://ubuntu.com/download/server" rel="noopener noreferrer"&gt;ubuntu.com/download/server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In the installer, enable &lt;strong&gt;OpenSSH&lt;/strong&gt; for remote administration.&lt;/li&gt;
&lt;li&gt;Create a normal user with &lt;code&gt;sudo&lt;/code&gt; (this guide assumes that user’s &lt;code&gt;$HOME&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;BIOS (§2) is configured the same as on a desktop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;After first boot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open only what you need in the firewall (e.g. SSH, and later 8080/3000 if not using VPN only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; ufw
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow OpenSSH
&lt;span class="c"&gt;# Optional: sudo ufw allow 8080/tcp &amp;amp;&amp;amp; sudo ufw allow 3000/tcp&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)
&lt;/h3&gt;

&lt;p&gt;Server images have no display server by default: &lt;strong&gt;you cannot run &lt;code&gt;vkcube&lt;/code&gt;&lt;/strong&gt; unless you add a minimal GUI just for that test. To validate Vulkan from the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-tools
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt; besides the instance version (e.g. &lt;code&gt;Vulkan Instance Version: 1.4.x&lt;/code&gt;), the &lt;strong&gt;&lt;code&gt;Devices:&lt;/code&gt;&lt;/strong&gt; section should list &lt;strong&gt;your AMD GPU&lt;/strong&gt; (&lt;code&gt;deviceName&lt;/code&gt; like &lt;em&gt;Radeon …&lt;/em&gt;, &lt;code&gt;deviceType&lt;/code&gt; &lt;em&gt;INTEGRATED_GPU&lt;/em&gt; or &lt;em&gt;DISCRETE_GPU&lt;/em&gt;, &lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt; on AMD hardware).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world sample (trimmed):&lt;/strong&gt; you often see the instance and a long extension list first; &lt;code&gt;Devices:&lt;/code&gt; comes later. As a &lt;strong&gt;normal user&lt;/strong&gt; you may see &lt;strong&gt;only&lt;/strong&gt; a software device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vulkan Instance Version: 1.4.313
...
Devices:
========
GPU0:
    apiVersion         = 1.4.318
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …, 256 bits)
    driverName         = llvmpipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same machine, but &lt;code&gt;sudo&lt;/code&gt; shows the Radeon:&lt;/strong&gt; if your user only gets &lt;code&gt;llvmpipe&lt;/code&gt; but &lt;strong&gt;root&lt;/strong&gt; sees e.g. &lt;strong&gt;GPU0&lt;/strong&gt; &lt;code&gt;AMD Radeon 760M Graphics (RADV PHOENIX)&lt;/code&gt; (&lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt;, &lt;code&gt;INTEGRATED_GPU&lt;/code&gt;) &lt;strong&gt;and&lt;/strong&gt; &lt;strong&gt;GPU1&lt;/strong&gt; &lt;code&gt;llvmpipe&lt;/code&gt;, the kernel and Mesa are fine; your user lacks &lt;strong&gt;permission&lt;/strong&gt; on the DRM nodes (&lt;code&gt;/dev/dri/renderD*&lt;/code&gt;). You should &lt;strong&gt;not&lt;/strong&gt; run &lt;code&gt;llama-server&lt;/code&gt; as root long-term to “fix” Vulkan—fix group membership instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;groups&lt;/span&gt;                    &lt;span class="c"&gt;# should include render and video&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /dev/dri/
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; render,video &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out of the desktop session or reboot, then (tighter grep: a broad&lt;/span&gt;
&lt;span class="c"&gt;# GPU|deviceName|deviceType pattern may also match layer descriptions containing "GPU"):&lt;/span&gt;
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'^GPU[0-9]+:|^[[:space:]]+device(Name|Type)'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output without &lt;code&gt;sudo&lt;/code&gt;&lt;/strong&gt; (RADV as &lt;strong&gt;GPU0&lt;/strong&gt;, &lt;code&gt;llvmpipe&lt;/code&gt; as an extra device):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 20.1.2, 256 bits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical “before” example:&lt;/strong&gt; if &lt;code&gt;groups&lt;/code&gt; &lt;strong&gt;does not&lt;/strong&gt; list &lt;code&gt;render&lt;/code&gt; or &lt;code&gt;video&lt;/code&gt;, and you only see entries like &lt;code&gt;adm cdrom sudo dip plugdev users lpadmin docker&lt;/code&gt;, that matches “Vulkan as your user = llvmpipe only; as root = RADV + llvmpipe”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After &lt;code&gt;usermod&lt;/code&gt;:&lt;/strong&gt; the command may print nothing, but &lt;strong&gt;your already-running session keeps the old group set&lt;/strong&gt;—&lt;code&gt;groups&lt;/code&gt; in the same shell will not change until you &lt;strong&gt;log out of the desktop&lt;/strong&gt; (or &lt;strong&gt;reboot&lt;/strong&gt;). Open a new terminal and check again; &lt;strong&gt;&lt;code&gt;id -nG&lt;/code&gt;&lt;/strong&gt; is a handy way to list all group names. For a quick test without logging out of the whole session: &lt;strong&gt;&lt;code&gt;newgrp render&lt;/code&gt;&lt;/strong&gt; (spawns a subshell with that group active; fine for testing only).&lt;/p&gt;

&lt;p&gt;On Ubuntu 24.04 the groups are usually &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt;. Once the new session includes them, &lt;code&gt;vulkaninfo&lt;/code&gt; &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;sudo&lt;/code&gt; should list the AMD device as well as &lt;code&gt;llvmpipe&lt;/code&gt;.&lt;/p&gt;
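&lt;p&gt;The membership check can be sketched as a tiny function; &lt;code&gt;check_groups&lt;/code&gt; is our name for it (not a standard command), written so you can dry-run it against a literal group list as well as the live &lt;code&gt;id -nG&lt;/code&gt; output:&lt;/p&gt;

```shell
# check_groups: verify each required group appears in a space-separated list.
# Helper written for this guide; pass "$(id -nG)" to check the live session.
check_groups() {
  have=$1; shift
  missing=""
  for g in "$@"; do
    case " $have " in
      *" $g "*) ;;                    # group present
      *) missing="$missing $g" ;;     # group absent
    esac
  done
  if [ -z "$missing" ]; then echo "OK"; else echo "missing:$missing"; fi
}

check_groups "adm sudo docker" render video   # prints: missing: render video
# After re-login, on your box:  check_groups "$(id -nG)" render video
```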

&lt;p&gt;A healthy summary often has the Radeon as &lt;strong&gt;GPU0&lt;/strong&gt; and llvmpipe as an extra entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    vendorID           = 0x1002
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
    driverName         = radv
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Only &lt;code&gt;llvmpipe&lt;/code&gt; even as root:&lt;/strong&gt; then &lt;code&gt;llvmpipe&lt;/code&gt; / &lt;code&gt;PHYSICAL_DEVICE_TYPE_CPU&lt;/code&gt; is &lt;strong&gt;CPU-only&lt;/strong&gt; Vulkan (Mesa) and the iGPU is not in the Vulkan device list. Check &lt;code&gt;lspci -nn | grep -i vga&lt;/code&gt;, the &lt;strong&gt;&lt;code&gt;amdgpu&lt;/code&gt;&lt;/strong&gt; module, &lt;code&gt;mesa-vulkan-drivers&lt;/code&gt;, and BIOS. On very minimal servers the render stack may still need setup before Vulkan enumerates the chip.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rest of this guide
&lt;/h3&gt;

&lt;p&gt;Install the same packages as §5, build llama.cpp in §6, and use &lt;strong&gt;Open WebUI from another PC&lt;/strong&gt; at &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;. Docker + &lt;code&gt;llama-server&lt;/code&gt; does not require a graphical session on the server.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Base dependencies and Vulkan check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential cmake git libvulkan-dev vulkan-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm the GPU is visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vkcube
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A window with a spinning cube should open. Close it when done.&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;vkcube&lt;/strong&gt; works but &lt;code&gt;vulkaninfo --summary&lt;/code&gt; as your user still shows only &lt;code&gt;llvmpipe&lt;/code&gt;, add the same &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; groups as in §4 (and log out/in).&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Building llama.cpp with Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;cmake&lt;/strong&gt; fails with &lt;em&gt;Could NOT find Vulkan&lt;/em&gt; or &lt;em&gt;missing: glslc&lt;/em&gt;, go to §12 (common on Ubuntu 24.04).&lt;/p&gt;

&lt;h3&gt;
  
  
  Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Newer GGUF architectures&lt;/strong&gt; (Gemma 4, recent MoE builds, etc.) often need a &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Before blaming the weight file, update the tree and rebuild the &lt;strong&gt;same &lt;code&gt;build&lt;/code&gt;&lt;/strong&gt; folder (or wipe &lt;code&gt;build&lt;/code&gt; and rerun CMake if CMakeLists changed a lot):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
git pull
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; changes CMake heavily and linking fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rebuilding, if you use &lt;strong&gt;§9&lt;/strong&gt;, restart so the service picks up new binaries: &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;. Check &lt;code&gt;journalctl -u llama-web.service -n 30 --no-pager&lt;/code&gt; if a GGUF is rejected.&lt;/p&gt;

&lt;p&gt;Useful binaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-cli&lt;/code&gt; — terminal tests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-server&lt;/code&gt; — HTTP API compatible with OpenAI-style clients.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. GGUF models and paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What GGUF is (name, role, trade-offs)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GGUF&lt;/strong&gt; (&lt;strong&gt;G&lt;/strong&gt;GML &lt;strong&gt;U&lt;/strong&gt;niversal &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;F&lt;/strong&gt;ormat) is a &lt;strong&gt;single-file&lt;/strong&gt; container aimed at &lt;strong&gt;inference&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; and friends: it packs &lt;strong&gt;weights&lt;/strong&gt; in a tensor layout tuned for efficient loading, &lt;strong&gt;metadata&lt;/strong&gt;, and—in practice—what you need to &lt;strong&gt;tokenize&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the model without pulling in the full PyTorch/JAX training stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters here:&lt;/strong&gt; you download a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;, pass its path as &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;llama-cli&lt;/code&gt; / &lt;code&gt;llama-server&lt;/code&gt;, and the engine runs &lt;strong&gt;locally&lt;/strong&gt; (CPU, and in this guide &lt;strong&gt;Vulkan&lt;/strong&gt; on the GPU). You do not need the original framework runtime just to serve the converted file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typical upsides:&lt;/strong&gt; &lt;strong&gt;one portable blob&lt;/strong&gt;; &lt;strong&gt;quantized&lt;/strong&gt; variants (Q4_K_M, Q8_0, IQ*, …) trade a bit of quality for &lt;strong&gt;disk / RAM / VRAM&lt;/strong&gt;; &lt;strong&gt;huge Hugging Face catalog&lt;/strong&gt; (community repos such as &lt;em&gt;TheBloke&lt;/em&gt;, &lt;em&gt;bartowski&lt;/em&gt;, Unsloth, …); first-class support in &lt;strong&gt;llama.cpp&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; &lt;strong&gt;quality&lt;/strong&gt; depends on &lt;strong&gt;quant level&lt;/strong&gt; and conversion tooling; &lt;strong&gt;brand-new&lt;/strong&gt; architectures may need a &lt;strong&gt;fresh llama.cpp build&lt;/strong&gt; or lack mature GGUFs yet; &lt;strong&gt;training / fine-tuning&lt;/strong&gt; usually happens elsewhere, then you &lt;strong&gt;convert/export&lt;/strong&gt; to GGUF; it is not a full cloud SaaS substitute without extra plumbing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this section assumes a &lt;strong&gt;ready-to-run GGUF&lt;/strong&gt;; paths and downloads always point at that file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)
&lt;/h3&gt;

&lt;p&gt;Repos list GGUFs with prefixes like &lt;strong&gt;Q2_&lt;/strong&gt;, &lt;strong&gt;Q3_&lt;/strong&gt;, &lt;strong&gt;Q4_&lt;/strong&gt;, &lt;strong&gt;Q5_&lt;/strong&gt;, &lt;strong&gt;Q6_&lt;/strong&gt;, &lt;strong&gt;Q8_&lt;/strong&gt; and cousins (&lt;strong&gt;IQ2_&lt;/strong&gt;, &lt;strong&gt;IQ3_&lt;/strong&gt;, …). The naming does not follow one single standard, but &lt;strong&gt;in practice&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Q&lt;/strong&gt; and &lt;strong&gt;number&lt;/strong&gt; hint at &lt;strong&gt;quantization depth&lt;/strong&gt;—roughly how many &lt;strong&gt;bits&lt;/strong&gt; are used for weights (&lt;strong&gt;simplified&lt;/strong&gt;). &lt;strong&gt;Lower&lt;/strong&gt; → &lt;strong&gt;smaller&lt;/strong&gt; file, less &lt;strong&gt;RAM/VRAM&lt;/strong&gt;, sometimes &lt;strong&gt;more&lt;/strong&gt; quality loss; &lt;strong&gt;higher&lt;/strong&gt; (e.g. &lt;strong&gt;Q8&lt;/strong&gt;) → heavier and often closer to “full” model behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffixes&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;_K_M&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_S&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_L&lt;/code&gt;&lt;/strong&gt;, … are &lt;strong&gt;llama.cpp k-quant&lt;/strong&gt; schemes: they &lt;strong&gt;mix&lt;/strong&gt; layers/blocks at different precisions to balance quality vs size—it is not “literally 4-bit everything.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IQ&lt;/strong&gt; (&lt;em&gt;imatrix&lt;/em&gt; / importance-weighted) lines aim for &lt;strong&gt;aggressive&lt;/strong&gt; compression while protecting weights that matter most for output quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For this guide:&lt;/strong&gt; &lt;strong&gt;Q4_K_M&lt;/strong&gt; is a common &lt;strong&gt;sweet spot&lt;/strong&gt; for &lt;strong&gt;disk&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;quality&lt;/strong&gt;; &lt;strong&gt;Q8_0&lt;/strong&gt;-class files if you favor quality and have RAM to spare. If names feel overwhelming, sort by &lt;strong&gt;MiB/GiB&lt;/strong&gt; under the repo’s &lt;em&gt;Files&lt;/em&gt; tab and pick the largest file that &lt;strong&gt;fits&lt;/strong&gt; your machine comfortably.&lt;/li&gt;
&lt;/ul&gt;
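
&lt;p&gt;To sanity-check whether a quant will fit before downloading, a rough back-of-envelope helps: &lt;strong&gt;params × effective bits-per-weight ÷ 8&lt;/strong&gt;. The bits-per-weight values below are approximations that vary per model family; 8.5 and 5.4 roughly reproduce the file sizes shown in the bench tables later in this guide:&lt;br&gt;
&lt;/p&gt;

```shell
# Rough GGUF size: params * effective bits-per-weight / 8, in GiB.
# Effective bpw is an approximation (quant scales and mixed blocks add overhead).
estimate_gib() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1024^3 }'
}
estimate_gib 8.03 8.5    # Llama 8B at ~Q8_0   -> prints 7.9
estimate_gib 25.23 5.4   # 26B MoE at ~Q4_K_M  -> prints 15.9
```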

&lt;p&gt;&lt;strong&gt;Hugging Face CLI (&lt;code&gt;huggingface-cli&lt;/code&gt;):&lt;/strong&gt; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; ships an &lt;em&gt;externally managed&lt;/em&gt; system Python (&lt;strong&gt;PEP 668&lt;/strong&gt;), so &lt;strong&gt;&lt;code&gt;python3 -m pip install …&lt;/code&gt; fails&lt;/strong&gt; with &lt;code&gt;externally-managed-environment&lt;/code&gt;. Prefer a small &lt;strong&gt;virtualenv&lt;/strong&gt; for this tool. This guide uses &lt;strong&gt;&lt;code&gt;$HOME/.venv/huggingface&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install &lt;strong&gt;&lt;code&gt;python3-venv&lt;/code&gt;&lt;/strong&gt; and create the venv &lt;strong&gt;once&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;&lt;code&gt;source …/bin/activate&lt;/code&gt;&lt;/strong&gt; before &lt;code&gt;pip&lt;/code&gt; / &lt;code&gt;huggingface-cli&lt;/code&gt;, or call &lt;strong&gt;&lt;code&gt;"$HOME/.venv/huggingface/bin/huggingface-cli"&lt;/code&gt;&lt;/strong&gt; directly.&lt;/li&gt;
&lt;li&gt;Avoid &lt;strong&gt;&lt;code&gt;--break-system-packages&lt;/code&gt;&lt;/strong&gt; unless you understand the risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;pipx install 'huggingface_hub[cli]'&lt;/code&gt;&lt;/strong&gt; (after &lt;strong&gt;&lt;code&gt;sudo apt install pipx&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;pipx ensurepath&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use one consistent directory (avoid mixing &lt;code&gt;~/models&lt;/code&gt; and &lt;code&gt;llama.cpp/models&lt;/code&gt; by mistake):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where models live and how to list them
&lt;/h3&gt;

&lt;p&gt;llama.cpp has &lt;strong&gt;no&lt;/strong&gt; built-in model catalog: a model is a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; file&lt;/strong&gt;. You always pass the path with &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (absolute paths are best in &lt;code&gt;systemd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the usual folder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.gguf 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that prints nothing, you may still have GGUFs elsewhere (Downloads, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search under your home (limited depth, faster):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sort by size:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\t%p\n'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Open WebUI does &lt;strong&gt;not&lt;/strong&gt; enumerate “every GGUF on disk”. What matters is whichever file &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; loads via &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;. To “use another model”, change that &lt;code&gt;-m&lt;/code&gt; (and restart the process or service §9), or run &lt;strong&gt;another&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; on &lt;strong&gt;another&lt;/strong&gt; port (advanced; not detailed here).&lt;/p&gt;

&lt;p&gt;Generic example (swap the URL for the file link under the repo’s &lt;em&gt;Files&lt;/em&gt; tab on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/model-name.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/ORG/REPO/resolve/main/file.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
&lt;/h3&gt;

&lt;p&gt;Recent quantized model (&lt;strong&gt;Apache 2.0&lt;/strong&gt;), &lt;strong&gt;Gemma 4&lt;/strong&gt; / MoE architecture; a good fit for machines with &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. ~96 GiB). Full file list and sizes: &lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Reasonable disk/RAM use: &lt;strong&gt;Q4_K_M&lt;/strong&gt; (~17 GiB per the model card). Maximum quality in this repo: &lt;strong&gt;Q8_0&lt;/strong&gt; (~27 GiB).&lt;/p&gt;
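
&lt;p&gt;That “largest quant that fits” choice can be sketched as a tiny helper (the names and budgets below are illustrative; the sizes come from the model card above):&lt;br&gt;
&lt;/p&gt;

```shell
# Pick the largest quant whose size (GiB) fits a memory budget.
# Expects name:size pairs sorted ascending by size; values are examples.
pick_quant() {
  budget=$1; shift
  printf '%s\n' "$@" | awk -v b="$budget" -F: '$2 <= b { best = $1 } END { print best }'
}
pick_quant 24 "Q4_K_M:17" "Q8_0:27"   # prints Q4_K_M (27 GiB would not fit)
pick_quant 32 "Q4_K_M:17" "Q8_0:27"   # prints Q8_0
```

&lt;p&gt;Note the budget is not your total RAM: leave headroom for the &lt;strong&gt;KV cache&lt;/strong&gt;, the OS, and anything else running.&lt;/p&gt;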

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; you need a &lt;strong&gt;recent llama.cpp&lt;/strong&gt; with Gemma 4 support (before building: &lt;code&gt;cd llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt;). If loading the GGUF reports architecture or tokenizer errors, update and rebuild (§6).&lt;/p&gt;

&lt;p&gt;Recommended download (Q4_K_M):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher-quality option (Q8_0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q8_0.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Equivalent using &lt;a href="https://huggingface.co/docs/huggingface_hub/guides/cli" rel="noopener noreferrer"&gt;huggingface-cli&lt;/a&gt; (handy for resumable downloads):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Hugging Face the model is tagged &lt;strong&gt;Image-Text-to-Text&lt;/strong&gt;; for text-only chat, &lt;code&gt;llama-server&lt;/code&gt; / Open WebUI usually work with the GGUF and embedded template. If message formatting breaks, check the &lt;em&gt;Prompt format&lt;/em&gt; section on the model card.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resolve/main/...&lt;/code&gt; URLs can break if files are renamed; if so, open the repo and copy the &lt;em&gt;download&lt;/em&gt; link for the exact &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
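
&lt;p&gt;The &lt;code&gt;resolve/main&lt;/code&gt; links above follow one pattern, so a small helper can compose them (assuming the file keeps its name in the repo; verify against the &lt;em&gt;Files&lt;/em&gt; tab):&lt;br&gt;
&lt;/p&gt;

```shell
# Compose a Hugging Face direct-download URL from ORG/REPO and a filename.
hf_url() {
  printf 'https://huggingface.co/%s/resolve/main/%s?download=true\n' "$1" "$2"
}
hf_url bartowski/google_gemma-4-26B-A4B-it-GGUF google_gemma-4-26B-A4B-it-Q4_K_M.gguf
```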

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when running &lt;code&gt;llama-cli&lt;/code&gt; or &lt;code&gt;llama-server&lt;/code&gt;, use the real path to the &lt;code&gt;.gguf&lt;/code&gt; (absolute or relative to your current working directory).&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;very large&lt;/strong&gt; MoE (~32 B activated params / 1 T total per the model card). Community GGUFs: &lt;a href="https://huggingface.co/unsloth/Kimi-K2-Instruct-0905-GGUF" rel="noopener noreferrer"&gt;unsloth/Kimi-K2-Instruct-0905-GGUF&lt;/a&gt;. Run guide and flags: &lt;a href="https://docs.unsloth.ai/basics/kimi-k2" rel="noopener noreferrer"&gt;Unsloth — Kimi K2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware warning:&lt;/strong&gt; Unsloth’s README recommends &lt;strong&gt;≥ 128 GB unified RAM&lt;/strong&gt; even for “small” quants. Boxes in the ~64–80 GiB range may &lt;strong&gt;fail to load&lt;/strong&gt;, run &lt;strong&gt;very slowly&lt;/strong&gt;, or thrash &lt;strong&gt;swap&lt;/strong&gt;—treat it as an experiment (see §7 &lt;em&gt;Experimenting with more models&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face:&lt;/strong&gt; access may be &lt;strong&gt;gated&lt;/strong&gt;; sign in, accept terms on the model page, and use &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt; if required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; each quantization lives in a folder (&lt;code&gt;UD-TQ1_0/&lt;/code&gt;, &lt;code&gt;UD-IQ1_S/&lt;/code&gt;, &lt;code&gt;IQ4_XS/&lt;/code&gt;, …) with files like &lt;code&gt;…-00001-of-00006.gguf&lt;/code&gt; and so on. Download &lt;strong&gt;every&lt;/strong&gt; &lt;code&gt;.gguf&lt;/code&gt; in &lt;strong&gt;that&lt;/strong&gt; folder. For &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; must point at the &lt;strong&gt;first&lt;/strong&gt; shard (&lt;code&gt;…-00001-of-….gguf&lt;/code&gt;); current &lt;code&gt;llama.cpp&lt;/code&gt; loaders pick up sibling shards in the same directory.&lt;/p&gt;
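
&lt;p&gt;A quick way to script that completeness check before pointing &lt;code&gt;-m&lt;/code&gt; at the first shard (the demo creates dummy files in a temp dir so it runs anywhere; on a real box set &lt;code&gt;dir&lt;/code&gt; and &lt;code&gt;prefix&lt;/code&gt; to your quant folder and filenames):&lt;br&gt;
&lt;/p&gt;

```shell
# Verify all NNNNN-of-TOTAL shards are present before loading the first one.
# Demo data is synthetic (shard 4 is deliberately missing from a temp dir).
dir=$(mktemp -d)
total=6
prefix="demo-UD-TQ1_0"
for i in 1 2 3 5 6; do
  : > "$dir/$prefix-$(printf '%05d' "$i")-of-$(printf '%05d' "$total").gguf"
done

missing=0
i=1
while [ "$i" -le "$total" ]; do
  f="$dir/$prefix-$(printf '%05d' "$i")-of-$(printf '%05d' "$total").gguf"
  if [ ! -f "$f" ]; then
    echo "missing: $(basename "$f")"
    missing=1
  fi
  i=$((i + 1))
done
if [ "$missing" -eq 0 ]; then
  echo "all shards present"
fi
rm -rf "$dir"
```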

&lt;p&gt;Download &lt;strong&gt;one&lt;/strong&gt; folder (example &lt;strong&gt;UD-TQ1_0&lt;/strong&gt;, six parts; confirm names under &lt;em&gt;Files&lt;/em&gt; on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli login    &lt;span class="c"&gt;# if token or gated access is required&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
huggingface-cli download unsloth/Kimi-K2-Instruct-0905-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"UD-TQ1_0/*.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other folders in the same repo are other quants (more disk / more quality). Pick based on &lt;strong&gt;free disk&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before loading: &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; and rebuild &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6). Short smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905/UD-TQ1_0/Kimi-K2-Instruct-0905-UD-TQ1_0-00001-of-00006.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say hi in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;; on architecture/tokenizer errors, update and rebuild. For &lt;strong&gt;§9&lt;/strong&gt; / Open WebUI, &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; uses the same path to the &lt;strong&gt;first&lt;/strong&gt; shard; read the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from &lt;code&gt;/v1/models&lt;/code&gt; via &lt;code&gt;curl&lt;/code&gt; once &lt;code&gt;llama-server&lt;/code&gt; is up for &lt;em&gt;Model IDs&lt;/em&gt;.&lt;/p&gt;
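
&lt;p&gt;Reading that &lt;code&gt;id&lt;/code&gt; can be scripted too. The JSON below is a hand-written sample of the OpenAI-style response shape (an assumption, not captured output), and the commented &lt;code&gt;curl&lt;/code&gt; line shows the live call:&lt;br&gt;
&lt;/p&gt;

```shell
# Extract the model "id" from an OpenAI-style /v1/models response.
# Sample JSON mimics llama-server's shape; it is not real captured output.
response='{"object":"list","data":[{"id":"gemma-4-26B-A4B-it-Q4_K_M","object":"model"}]}'
# Live version (adjust host/port to your llama-server):
#   response=$(curl -s http://127.0.0.1:8080/v1/models)
printf '%s\n' "$response" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p'
```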

&lt;h3&gt;
  
  
  Example: local Llama 3.1 8B Instruct Q8_0
&lt;/h3&gt;

&lt;p&gt;If you already have e.g. &lt;strong&gt;&lt;code&gt;$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf&lt;/code&gt;&lt;/strong&gt; (~8 GiB on disk), &lt;strong&gt;replace&lt;/strong&gt; every &lt;code&gt;-m&lt;/code&gt; path in this guide with yours. &lt;strong&gt;Q8_0&lt;/strong&gt; favors quality over speed; for higher &lt;strong&gt;tok/s&lt;/strong&gt; on an iGPU, try a &lt;strong&gt;Q4_K_M&lt;/strong&gt; in the same model family.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)
&lt;/h3&gt;

&lt;p&gt;Use this to compare &lt;strong&gt;the same machine&lt;/strong&gt; with different &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, different GGUFs, or different builds (CPU vs Vulkan), without UI noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the binary&lt;/strong&gt; (size and date hint at whether the last rebuild refreshed it):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; build/bin/llama-bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;If it is &lt;strong&gt;missing&lt;/strong&gt;, rebuild the project (§6); most full builds already include &lt;code&gt;llama-bench&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flags&lt;/strong&gt; change across versions—always start from help:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="nt"&gt;--help&lt;/span&gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Minimal example&lt;/strong&gt; (swap the path):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;: path to the &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;: GPU layers; many builds accept &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; as “as many as possible”. If rejected, try &lt;strong&gt;&lt;code&gt;35&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;45&lt;/code&gt;&lt;/strong&gt;, etc., and increase until it breaks or slows down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;: generated tokens per benchmark run (tune for longer runs).&lt;/li&gt;
&lt;/ul&gt;
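
&lt;p&gt;That &lt;code&gt;-ngl&lt;/code&gt; trial-and-error can be wrapped in a sweep. In this sketch &lt;code&gt;BENCH&lt;/code&gt; defaults to &lt;code&gt;echo&lt;/code&gt; so it dry-runs anywhere; point it at &lt;code&gt;./build/bin/llama-bench&lt;/code&gt; for real measurements:&lt;br&gt;
&lt;/p&gt;

```shell
# Sweep -ngl with the same model and -n for fair comparisons.
# BENCH defaults to echo (dry run); override with the real binary:
#   BENCH=./build/bin/llama-bench
BENCH=${BENCH:-echo}
MODEL="${MODEL:-$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf}"
for ngl in 20 35 45 999; do
  $BENCH -m "$MODEL" -ngl "$ngl" -n 128
done
```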

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reading output:&lt;/strong&gt; you usually see &lt;em&gt;prompt processing&lt;/em&gt; vs &lt;em&gt;generation&lt;/em&gt; tok/s. If numbers are tiny and logs show &lt;strong&gt;no&lt;/strong&gt; Vulkan / &lt;code&gt;ggml_vulkan&lt;/code&gt;, the binary might lack &lt;code&gt;GGML_VULKAN&lt;/code&gt;, or &lt;code&gt;/dev/dri&lt;/code&gt; permissions were wrong at build/run time (§4).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fair comparisons:&lt;/strong&gt; same &lt;code&gt;llama-bench&lt;/code&gt; build, same model, same &lt;code&gt;-n&lt;/code&gt;, only change &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; or the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Sample real output&lt;/strong&gt; (same command as above; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt;, &lt;strong&gt;Radeon 760M RADV&lt;/strong&gt;, &lt;strong&gt;Llama 3.1 8B Instruct Q8_0&lt;/strong&gt;; numbers shift with BIOS, thermals, quantization, and llama.cpp revision):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           pp512 |        235.96 ± 0.19 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |

build: 4d688f9eb (8016)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;ggml_vulkan&lt;/code&gt;&lt;/strong&gt; lines show &lt;strong&gt;one&lt;/strong&gt; Vulkan device and that the bench is on &lt;strong&gt;RADV&lt;/strong&gt; (not &lt;code&gt;llvmpipe&lt;/code&gt; only). Errors or zero devices → revisit §4–§5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;: prompt processing — tok/s for a ~512-token prefill; usually &lt;strong&gt;higher&lt;/strong&gt; than generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;: token generation — tok/s while emitting &lt;strong&gt;128&lt;/strong&gt; output tokens; closest bench metric to “reply speed” in chat. Here ≈&lt;strong&gt;9.8 t/s&lt;/strong&gt; for Q8_0 on this iGPU.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; line is your llama.cpp &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; commit; it changes after &lt;code&gt;git pull&lt;/code&gt; + rebuild.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Another sample&lt;/strong&gt; (&lt;strong&gt;same mini PC class&lt;/strong&gt;, &lt;strong&gt;Gemma 4 26B&lt;/strong&gt; Instruct &lt;strong&gt;Q4_K_M&lt;/strong&gt; — the model this guide uses in many examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           pp512 |        239.04 ± 1.97 |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           tg128 |         20.94 ± 0.02 |

build: d12cc3d1c (8720)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;model&lt;/code&gt;&lt;/strong&gt; column is &lt;strong&gt;unreliable&lt;/strong&gt; on some &lt;code&gt;llama-bench&lt;/code&gt; builds: you may see &lt;strong&gt;&lt;code&gt;gemma4 ?B&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;gemma4 7B&lt;/code&gt;&lt;/strong&gt;, or similar &lt;strong&gt;even for Gemma 4 26B A4B&lt;/strong&gt; GGUFs. Trust &lt;strong&gt;size&lt;/strong&gt; (~&lt;strong&gt;15.85 GiB&lt;/strong&gt;), &lt;strong&gt;params&lt;/strong&gt; (~&lt;strong&gt;25.23 B&lt;/strong&gt;), and your &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; path to &lt;code&gt;…26B…Q4_K_M.gguf&lt;/code&gt;: &lt;em&gt;llama-bench mislabels the first column; this run is Gemma 4 &lt;strong&gt;26B&lt;/strong&gt; Q4_K_M&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What this run says:&lt;/strong&gt; with &lt;strong&gt;Vulkan&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt; 999&lt;/strong&gt;, expect on the order of &lt;strong&gt;~239 tok/s&lt;/strong&gt; for &lt;strong&gt;prefill&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;) and &lt;strong&gt;~21 tok/s&lt;/strong&gt; for &lt;strong&gt;generation&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;). That &lt;strong&gt;~21 t/s&lt;/strong&gt; is the most useful single number for “raw” reply speed (no Open WebUI overhead, no long reasoning block, no huge prompts); real chat often lands near this ballpark or a bit lower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Other GGUFs&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; revisions will move &lt;strong&gt;&lt;code&gt;tg*&lt;/code&gt;&lt;/strong&gt; a lot; record your own table after major changes.&lt;/p&gt;
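&lt;p&gt;A lightweight way to do that: append each run to a dated history file and diff it after rebuilds. A minimal sketch (the log path and the sample result line are illustrative, not part of the original setup):&lt;/p&gt;

```shell
# Append each result to a dated history file so build-to-build tg/pp
# changes are easy to diff. LOG path and RESULT line are illustrative.
LOG="${LOG:-bench-history.md}"
RESULT='| gemma4 26B Q4_K - Medium | 15.85 GiB | 25.23 B | Vulkan | 999 | tg128 | 20.94 |'
printf '\n## %s\n%s\n' "$(date -u +%Y-%m-%dT%H:%MZ)" "$RESULT" | tee -a "$LOG"
tail -n 1 "$LOG"
```

In practice you would pipe real &lt;code&gt;llama-bench&lt;/code&gt; output into the same &lt;code&gt;tee -a&lt;/code&gt; instead of the hard-coded sample line.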

&lt;h3&gt;
  
  
  Quick terminal test
&lt;/h3&gt;

&lt;p&gt;From the &lt;code&gt;llama.cpp&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemma 4 and on-screen reasoning (&lt;code&gt;[Start thinking]&lt;/code&gt; … &lt;code&gt;[End thinking]&lt;/code&gt;):&lt;/strong&gt; many &lt;strong&gt;Instruct&lt;/strong&gt; GGUFs emit a “thinking” block before the final answer. On a &lt;strong&gt;recent &lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;code&gt;--help&lt;/code&gt; normally documents (verify with &lt;code&gt;./build/bin/llama-cli --help | grep -iE 'reason|think|template'&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-rea, --reasoning on|off|auto&lt;/code&gt;&lt;/strong&gt; — default &lt;strong&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/strong&gt; (template decides). For &lt;strong&gt;clean screenshots&lt;/strong&gt;, use &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (short &lt;strong&gt;&lt;code&gt;-rea off&lt;/code&gt;&lt;/strong&gt; if your build prints it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget N&lt;/code&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;0&lt;/code&gt;&lt;/strong&gt; ends the thinking block immediately; &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; is unrestricted. Pair with &lt;strong&gt;&lt;code&gt;off&lt;/code&gt;&lt;/strong&gt; if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--chat-template-kwargs STRING&lt;/code&gt;&lt;/strong&gt; — JSON for the template parser (e.g. &lt;strong&gt;&lt;code&gt;'{"enable_thinking": false}'&lt;/code&gt;&lt;/strong&gt; in bash with outer single quotes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-format FORMAT&lt;/code&gt;&lt;/strong&gt; — tag handling / extraction (DeepSeek-style paths); &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; is usually enough for Gemma in interactive CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Screenshot-friendly example (same command as above + reasoning disabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reference run&lt;/strong&gt; (validated hardware in the intro; &lt;strong&gt;no&lt;/strong&gt; &lt;code&gt;[Start thinking]&lt;/code&gt; block; &lt;strong&gt;t/s&lt;/strong&gt; are indicative):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" alt="llama-cli: Gemma 4 26B Q4_K_M with --reasoning off, one-sentence answer and prompt/generation t/s." width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also export the env vars mentioned in &lt;code&gt;--help&lt;/code&gt; (&lt;strong&gt;&lt;code&gt;LLAMA_ARG_REASONING&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;LLAMA_ARG_THINK_BUDGET&lt;/code&gt;&lt;/strong&gt;, …) if you prefer not to repeat flags.&lt;/p&gt;
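&lt;p&gt;If you prefer the env-var route, a shell-profile snippet along these lines works; the variable names are the ones quoted from &lt;code&gt;--help&lt;/code&gt; above, so verify them against your build before relying on them:&lt;/p&gt;

```shell
# Disable on-screen reasoning for every llama.cpp tool started from this shell.
# Variable names follow the --help text quoted above; confirm on your build.
export LLAMA_ARG_REASONING=off
export LLAMA_ARG_THINK_BUDGET=0
echo "reasoning=$LLAMA_ARG_REASONING budget=$LLAMA_ARG_THINK_BUDGET"
```

Put the two &lt;code&gt;export&lt;/code&gt; lines in &lt;code&gt;~/.bashrc&lt;/code&gt; (or the systemd unit's &lt;code&gt;Environment=&lt;/code&gt;) to make them persistent.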

&lt;p&gt;For &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (§8–§9), add the same switches to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--reasoning-budget 0&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--chat-template-kwargs …&lt;/code&gt;&lt;/strong&gt;), whichever of them your binary supports. If &lt;strong&gt;nothing&lt;/strong&gt; disables the thinking block, try another GGUF variant, or another model for a one-off capture (e.g. the Llama example later in this same §7).&lt;/p&gt;

&lt;p&gt;Example with a local &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; (single-turn demo; chat template depends on the GGUF). An overly vague &lt;strong&gt;&lt;code&gt;-p&lt;/code&gt;&lt;/strong&gt; (“summarize llama.cpp”) may yield “I don’t have that information”; give &lt;strong&gt;context&lt;/strong&gt; in the question (e.g. open-source inference, GGUF, local execution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in exactly one sentence: What does the llama.cpp project do for running language models locally?"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual reference screenshot&lt;/strong&gt; (same &lt;strong&gt;validated&lt;/strong&gt; hardware in the intro: Ryzen 5 &lt;strong&gt;7640HS&lt;/strong&gt;, Radeon &lt;strong&gt;760M&lt;/strong&gt;, &lt;strong&gt;DDR5&lt;/strong&gt;; &lt;strong&gt;t/s&lt;/strong&gt; varies with thermals, BIOS, and &lt;code&gt;llama.cpp&lt;/code&gt; commit):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" alt="llama-cli: Llama 3.1 8B Instruct Q8_0 — answer about llama.cpp and prompt/generation t/s." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt; / &lt;code&gt;999&lt;/code&gt;&lt;/strong&gt;: offloads up to that many layers to the GPU, which in practice means “all of them”; on large models or a small unified VRAM budget you may need to &lt;strong&gt;lower&lt;/strong&gt; &lt;code&gt;-ngl&lt;/code&gt; or increase the BIOS framebuffer (§2).&lt;/li&gt;
&lt;li&gt;On startup, look for lines like &lt;code&gt;ggml_vulkan:&lt;/code&gt; and your GPU name (e.g. Radeon 760M) to confirm Vulkan.&lt;/li&gt;
&lt;/ul&gt;
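&lt;p&gt;Rather than scanning the whole startup spew, you can grep for that line; the sample below stands in for a captured log so the pipeline is reproducible (capture your real one with &lt;code&gt;tee&lt;/code&gt;):&lt;/p&gt;

```shell
# Confirm the Vulkan backend picked up the iGPU without reading the full log.
# log_line mimics the real ggml_vulkan output quoted earlier in this section.
log_line='ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv)'
printf '%s\n' "$log_line" | grep -i 'ggml_vulkan' | grep -io 'radeon 760m'
```

If the first &lt;code&gt;grep&lt;/code&gt; matches nothing on a real log, the build is running CPU-only and §6 (the Vulkan build flags) is the place to look.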

&lt;h3&gt;
  
  
  Adding or switching models
&lt;/h3&gt;

&lt;p&gt;Each &lt;strong&gt;additional model&lt;/strong&gt; you want to run—another family, quantization, or file from Hugging Face—is &lt;strong&gt;one&lt;/strong&gt; more &lt;code&gt;.gguf&lt;/code&gt; in your folder (e.g. &lt;code&gt;$HOME/models&lt;/code&gt;). ML slang often says &lt;strong&gt;“weights”&lt;/strong&gt; for the &lt;strong&gt;trained parameters&lt;/strong&gt; inside that file; here it is enough to think “another &lt;code&gt;.gguf&lt;/code&gt;.” The flow is always &lt;strong&gt;download → test → point the server&lt;/strong&gt; at that path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; using the same pattern as above (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, or the repo’s &lt;em&gt;download&lt;/em&gt; link on Hugging Face).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test in the terminal&lt;/strong&gt; with &lt;code&gt;llama-cli -m "$HOME/models/your-new-file.gguf"&lt;/code&gt; (like the quick test). If the architecture is brand new and load fails, update and rebuild llama.cpp (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual &lt;code&gt;llama-server&lt;/code&gt; (§8):&lt;/strong&gt; stop the process (&lt;strong&gt;Ctrl+C&lt;/strong&gt;) and start it again with &lt;code&gt;-m&lt;/code&gt; pointing at the new file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd service (§9):&lt;/strong&gt; edit &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, change only the &lt;code&gt;-m /full/path/new.gguf&lt;/code&gt; argument inside &lt;code&gt;ExecStart&lt;/code&gt;, save, then run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI (§10):&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loads &lt;strong&gt;one&lt;/strong&gt; model at a time (whichever you set at startup). After restarting the service, reload the UI; the model dropdown may show the filename or a generic label (&lt;code&gt;default&lt;/code&gt;), depending on the version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode / VS Code (§11):&lt;/strong&gt; same host and port (&lt;code&gt;…:8080/v1&lt;/code&gt;); in editors use the server IP or &lt;code&gt;127.0.0.1&lt;/code&gt; depending on where the IDE runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Serving &lt;strong&gt;several models at once&lt;/strong&gt; requires multiple &lt;code&gt;llama-server&lt;/code&gt; processes on &lt;strong&gt;different ports&lt;/strong&gt; (and matching entries in Open WebUI or more containers); that advanced layout is not spelled out here.&lt;/p&gt;
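&lt;p&gt;If you do want to try that layout, a minimal &lt;strong&gt;dry-run&lt;/strong&gt; sketch looks like this; the ports and model paths are illustrative, and remember each server needs enough RAM for its own weights:&lt;/p&gt;

```shell
# Dry run: print one llama-server command per (port, model) pair.
# Remove the leading `echo` to launch for real; ports/paths are illustrative.
declare -A MODELS=(
  [8080]="$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"
  [8081]="$HOME/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
)
for port in $(printf '%s\n' "${!MODELS[@]}" | sort); do
  echo ./build/bin/llama-server -m "${MODELS[$port]}" --host 0.0.0.0 --port "$port" -ngl 999
done
```

Each extra server then needs its own Open WebUI connection entry pointing at its port.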

&lt;h3&gt;
  
  
  Experimenting with more models: setup, testing, and limits
&lt;/h3&gt;

&lt;p&gt;If you want to &lt;strong&gt;try multiple GGUFs&lt;/strong&gt;, follow a clear flow and know your hardware ceiling—this avoids pointless downloads and false “it’s broken” moments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check disk and RAM&lt;/strong&gt; (&lt;code&gt;free -h&lt;/code&gt;, &lt;code&gt;df -h /&lt;/code&gt;, §3). Each quantization costs what the model card says; keep headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; when the model is new (§6, &lt;em&gt;Update and rebuild&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; into &lt;code&gt;$HOME/models&lt;/code&gt; (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test&lt;/strong&gt; with &lt;code&gt;llama-cli&lt;/code&gt; and &lt;strong&gt;short&lt;/strong&gt; generations; confirm &lt;code&gt;ggml_vulkan&lt;/code&gt; if the GPU should participate (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; &lt;code&gt;llama-bench&lt;/code&gt; with the same &lt;code&gt;-ngl&lt;/code&gt; you plan for production to compare quantizations (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change &lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; in &lt;strong&gt;§9&lt;/strong&gt; (or manual §8), &lt;code&gt;daemon-reload&lt;/code&gt; + &lt;code&gt;restart&lt;/code&gt;, then &lt;strong&gt;&lt;code&gt;curl /v1/models&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;Open WebUI&lt;/strong&gt; (Admin → Connections; &lt;strong&gt;Model IDs&lt;/strong&gt; if needed).&lt;/li&gt;
&lt;/ol&gt;
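&lt;p&gt;Step 1 can be scripted; a minimal headroom check (the needed size is an example figure, read the real one off the model card):&lt;/p&gt;

```shell
# Quick headroom check before downloading: available RAM and disk vs. the
# GGUF size from the model card (NEED is an example figure, not a constant).
NEED=17
avail_ram=$(awk '/MemAvailable/ {printf "%d", $2/1048576}' /proc/meminfo)
avail_disk=$(df --output=avail -BG "$HOME" | tail -n 1 | tr -dc '0-9')
echo "RAM ${avail_ram} GiB, disk ${avail_disk} GB free, model needs ~${NEED} GB"
```

Keep a margin beyond the file size: the OS, the KV cache for your &lt;code&gt;-c&lt;/code&gt;, and Open WebUI all eat into the same pool.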

&lt;p&gt;&lt;strong&gt;Typical limits on a mini PC with an iGPU&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GGUF size + OS + context cannot grow without limit; huge &lt;strong&gt;MoE&lt;/strong&gt; releases (e.g. &lt;strong&gt;Kimi K2&lt;/strong&gt;-class GGUFs) can &lt;strong&gt;exceed&lt;/strong&gt; usable RAM on 64–96 GiB class boxes or crawl at &lt;strong&gt;extremely&lt;/strong&gt; low tok/s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iGPU Vulkan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The iGPU caps &lt;strong&gt;tok/s&lt;/strong&gt;; lots of RAM lets you &lt;strong&gt;load&lt;/strong&gt; bigger weights, but it does not stand in for a big discrete GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One active model per &lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switching models means changing &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;restarting&lt;/strong&gt; (or a second server on another port).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templates / chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weird chat in Open WebUI may be the GGUF &lt;strong&gt;chat template&lt;/strong&gt;; check the Hugging Face card or try another frontend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network / disk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large downloads take time; use &lt;code&gt;wget --continue&lt;/code&gt; or resumable &lt;code&gt;huggingface-cli&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Set expectations:&lt;/strong&gt; an &lt;strong&gt;8B–13B&lt;/strong&gt; or a quantized &lt;strong&gt;26B&lt;/strong&gt; can be a great fit with ample RAM; &lt;strong&gt;datacenter-scale&lt;/strong&gt; GGUF may &lt;strong&gt;not fit&lt;/strong&gt; or run &lt;strong&gt;under ~1–2 tok/s&lt;/strong&gt; with aggressive paging—that is a &lt;strong&gt;memory bandwidth&lt;/strong&gt; issue, not an Ubuntu bug.&lt;/p&gt;
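&lt;p&gt;The bandwidth claim is easy to sanity-check with back-of-envelope math: decode speed on a bandwidth-bound box is at best memory bandwidth divided by bytes read per token. A sketch, assuming ~85 GB/s for dual-channel DDR5 and that &lt;strong&gt;A4B&lt;/strong&gt; means roughly 4 B active parameters per token (both are assumptions, not measured values):&lt;/p&gt;

```shell
# Decode ceiling ~ memory bandwidth / bytes read per token (bandwidth-bound).
# 85 GB/s dual-channel DDR5 and "A4B = ~4B active params" are assumptions.
awk 'BEGIN {
  bw = 85                        # GB/s, assumed
  dense_gib = 15.85              # GiB touched per token if all ~25B params were active
  moe_gib = dense_gib * 4 / 25   # MoE: only ~4B of ~25B params active per token
  printf "dense-style ceiling: %.1f tok/s\n", bw / (dense_gib * 1.074)
  printf "MoE (A4B) ceiling:   %.1f tok/s\n", bw / (moe_gib * 1.074)
}'
```

&lt;p&gt;The measured ~21 tok/s sits under the rough ~31 tok/s MoE ceiling, which is the expected direction; a dense 26B at the same quant would be capped near ~5 tok/s on this memory system.&lt;/p&gt;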

&lt;h3&gt;
  
  
  One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;mini PC–style&lt;/strong&gt; setup: Ubuntu 24.04, &lt;strong&gt;AMD iGPU Vulkan&lt;/strong&gt;, &lt;strong&gt;~64–96 GiB&lt;/strong&gt; RAM, &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;8080&lt;/strong&gt;, &lt;strong&gt;systemd&lt;/strong&gt; §9, &lt;strong&gt;Open WebUI&lt;/strong&gt; §10. Swap in your paths and username.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common steps (every model swap)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Refresh the engine&lt;/strong&gt; if the weight is new or load fails: &lt;code&gt;cd ~/llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt; and rebuild (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; (per-family commands below). &lt;strong&gt;Verify&lt;/strong&gt; the filename under Hugging Face → &lt;em&gt;Files&lt;/em&gt;; if it is renamed, fix the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test&lt;/strong&gt; (tune &lt;code&gt;-ngl&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt;); or use the &lt;strong&gt;copy-paste commands per model&lt;/strong&gt; under &lt;em&gt;Per-model quick test&lt;/em&gt; below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
./build/bin/llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"/absolute/path/to/file.gguf"&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; on &lt;strong&gt;OOM&lt;/strong&gt;, &lt;strong&gt;hangs&lt;/strong&gt;, or very slow output, &lt;strong&gt;lower &lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. 50, 35) and/or &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. 2048). &lt;strong&gt;Unified&lt;/strong&gt; iGPU memory is usually the limiter, not raw RAM alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (optional, §7) with the same path and &lt;code&gt;-ngl&lt;/code&gt; to compare quants or families.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd (§9):&lt;/strong&gt; in &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: same path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, and match &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to what worked in the smoke test.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;
&lt;strong&gt;API check:&lt;/strong&gt; &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI:&lt;/strong&gt; Admin → Connections → OpenAI (&lt;code&gt;host.docker.internal:8080/v1&lt;/code&gt;). If the picker stays empty, paste the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from that JSON into &lt;strong&gt;Model IDs&lt;/strong&gt;, save, and hard-refresh.&lt;/li&gt;
&lt;/ol&gt;
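&lt;p&gt;The &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; you paste into &lt;strong&gt;Model IDs&lt;/strong&gt; can be pulled out mechanically; the response shape below is an assumed example, so match it against what your server actually returns:&lt;/p&gt;

```shell
# Extract the model id from an OpenAI-style /v1/models response.
# `sample` stands in for: curl -s http://127.0.0.1:8080/v1/models
sample='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'
model_id=$(printf '%s' "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$model_id"
```

Swap the &lt;code&gt;sample&lt;/code&gt; assignment for the real &lt;code&gt;curl&lt;/code&gt; and the same one-liner gives you the exact string to paste.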

&lt;h4&gt;
  
  
  Reference table (repos + sample file)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Hugging Face repo&lt;/th&gt;
&lt;th&gt;Sample file (quant)&lt;/th&gt;
&lt;th&gt;Notes (~machine with plenty of RAM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; 26B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~16 GiB (≈17 GB) on disk; usually needs &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Start &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; around &lt;strong&gt;4096&lt;/strong&gt;–&lt;strong&gt;8192&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen2.5 Coder&lt;/strong&gt; 7B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Much lighter than Gemma 26B. For &lt;strong&gt;14B / 32B&lt;/strong&gt;, check &lt;em&gt;Files&lt;/em&gt; sizes; 32B Q4 is often &lt;strong&gt;~18–20 GiB+&lt;/strong&gt; and heavier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;DeepSeek Coder V2 Lite&lt;/strong&gt; Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;“Lite” ≈ &lt;strong&gt;~10 GiB&lt;/strong&gt; class in Q4_K_M; solid code/disk trade-off locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Llama 3.1&lt;/strong&gt; 8B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Meta-Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/code&gt; or &lt;code&gt;-Q8_0.gguf&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt; faster; &lt;strong&gt;Q8_0&lt;/strong&gt; heavier / often higher quality. If your file name differs, keep your real path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)
&lt;/h4&gt;

&lt;p&gt;If you use &lt;strong&gt;SSH&lt;/strong&gt; and the download runs a long time, run it inside &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; so a dropped connection does not kill the job. Example with &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; (install if needed: &lt;code&gt;sudo apt install -y screen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;screen &lt;span class="nt"&gt;-S&lt;/span&gt; hf-models
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;span class="c"&gt;# When this wget finishes, you can paste the next command from the block below without leaving screen.&lt;/span&gt;

&lt;span class="c"&gt;# Detach (leave download running): Ctrl+A, release, D&lt;/span&gt;
&lt;span class="c"&gt;# Reattach later: screen -r hf-models&lt;/span&gt;
&lt;span class="c"&gt;# List sessions: screen -ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern works for the other URLs in this section or for &lt;strong&gt;&lt;code&gt;huggingface-cli download&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

&lt;span class="c"&gt;# Gemma 4 26B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Qwen2.5 Coder 7B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# DeepSeek Coder V2 Lite Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF/resolve/main/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Llama 3.1 8B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Meta / Llama (gated):&lt;/strong&gt; if &lt;code&gt;wget&lt;/code&gt; returns &lt;strong&gt;403&lt;/strong&gt; or Hugging Face asks you to sign in, open the model page while logged in, &lt;strong&gt;accept the license&lt;/strong&gt;, create a &lt;strong&gt;read&lt;/strong&gt; token, and run &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt;. &lt;em&gt;Gated&lt;/em&gt; repos usually need &lt;strong&gt;&lt;code&gt;huggingface-cli download ...&lt;/code&gt;&lt;/strong&gt;, not anonymous &lt;code&gt;wget&lt;/code&gt; to &lt;code&gt;resolve/main/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt; alternative&lt;/strong&gt; (resumable; each command pulls &lt;strong&gt;one&lt;/strong&gt; GGUF under &lt;code&gt;--local-dir&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
&lt;span class="c"&gt;# huggingface-cli login   # required for *gated* repos (e.g. Llama/Meta); optional otherwise&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the CLI version, the &lt;code&gt;.gguf&lt;/code&gt; may end up in a &lt;strong&gt;subfolder&lt;/strong&gt; under &lt;code&gt;--local-dir&lt;/code&gt;. Point &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; at the real absolute path (for example &lt;code&gt;find "$HOME/models" -name '*.gguf'&lt;/code&gt;).&lt;/p&gt;
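&lt;p&gt;That &lt;code&gt;find&lt;/code&gt; can also feed &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; directly; a small sketch with an overridable directory so you can try it anywhere:&lt;/p&gt;

```shell
# Resolve the first .gguf under the models dir to an absolute path for -m.
# MODELS_DIR is overridable; mkdir -p keeps find from erroring on a fresh box.
MODELS_DIR="${MODELS_DIR:-$HOME/models}"
mkdir -p "$MODELS_DIR"
MODEL_PATH=$(find "$MODELS_DIR" -type f -name '*.gguf' | sort | head -n 1)
echo "gguf: ${MODEL_PATH:-none found}"
```

With several files, swap &lt;code&gt;head&lt;/code&gt; for a grep on the family name you want before pasting the path into &lt;code&gt;ExecStart&lt;/code&gt;.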

&lt;h4&gt;
  
  
  Per-model quick test (right after download)
&lt;/h4&gt;

&lt;p&gt;Run &lt;strong&gt;one&lt;/strong&gt; block (paths match the &lt;code&gt;wget&lt;/code&gt; names above). &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt; caps generated tokens so the run stays short; if your &lt;code&gt;llama-cli&lt;/code&gt; rejects &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;, check &lt;code&gt;./build/bin/llama-cli --help&lt;/code&gt; (sometimes &lt;code&gt;--predict&lt;/code&gt; or another alias). Earlier in §7, &lt;em&gt;Quick terminal test&lt;/em&gt; shows a &lt;strong&gt;&lt;code&gt;-cnv&lt;/code&gt;&lt;/strong&gt; example for Gemma and a Llama variant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence what a tensor is in machine learning."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5 Coder 7B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a one-line Python factorial(n) function; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DeepSeek Coder V2 Lite Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a JavaScript arrow function that adds two numbers; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say in one sentence what llama.cpp is for."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On startup you should see &lt;strong&gt;&lt;code&gt;ggml:&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;ggml_vulkan:&lt;/code&gt;&lt;/strong&gt; lines naming your GPU when Vulkan is in use (§4–§5).&lt;/p&gt;
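&lt;p&gt;If you want a scriptable check instead of eyeballing the log, grep for the &lt;code&gt;ggml_vulkan:&lt;/code&gt; prefix. The sample line below imitates that output (exact wording varies by llama.cpp version); on a real run, pipe &lt;code&gt;llama-cli&lt;/code&gt;’s combined output through the same grep:&lt;/p&gt;

```shell
# Sketch: the sample line imitates llama.cpp startup output (wording varies by
# build). On a real run, use:  ./build/bin/llama-cli ... 2>&1 | grep ggml_vulkan
log='ggml_vulkan: 0 = AMD Radeon 760M (RADV PHOENIX) | uma: 1 | fp16: 1'
result="CPU-only: recheck the build flags and /dev/dri groups"
if printf '%s\n' "$log" | grep -q '^ggml_vulkan:'; then
  result="Vulkan backend active"
fi
echo "$result"
```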

&lt;h4&gt;
  
  
  Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)
&lt;/h4&gt;

&lt;p&gt;Same shape as §9; only &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (and possibly &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;) change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…/llama-server \
    -m /home/YOUR_USER/models/THE_FILE_YOU_TESTED.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 8192 \
    -ngl 999 \
    --n-predict -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;Gemma 26B Q4&lt;/strong&gt; or another big model &lt;strong&gt;OOM&lt;/strong&gt;s on a box with only &lt;strong&gt;~16 GiB&lt;/strong&gt; of RAM, start with a lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less), and only push toward &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt; once that runs cleanly. Always validate with &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; using the same &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; you plan to put in &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;, then automate with systemd (§9).&lt;/p&gt;
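&lt;p&gt;For a rough feel of whether a model will fit before you launch it, compare the file size plus a context allowance against available RAM. A back-of-the-envelope sketch; the &lt;strong&gt;16 GiB&lt;/strong&gt; and &lt;strong&gt;3 GiB&lt;/strong&gt; numbers are illustrative assumptions, not llama.cpp’s real accounting:&lt;/p&gt;

```shell
# Back-of-the-envelope pre-flight (rough assumptions, not llama.cpp's real
# accounting): weights ≈ .gguf file size, plus a margin for KV cache + buffers.
model_gib=16      # e.g. a ~16 GiB Q4_K_M file; use the real size of your .gguf
margin_gib=3      # rough allowance for a -c 8192 KV cache and runtime overhead
need_gib=$((model_gib + margin_gib))
free_gib=$(awk '/MemAvailable/ {printf "%d", $2/1024/1024}' /proc/meminfo 2>/dev/null)
free_gib=${free_gib:-0}
if [ "$free_gib" -lt "$need_gib" ]; then
  verdict="tight: start with lower -c / -ngl (need ~${need_gib} GiB, have ${free_gib} GiB)"
else
  verdict="likely fits (need ~${need_gib} GiB, have ${free_gib} GiB)"
fi
echo "$verdict"
```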




&lt;h2&gt;
  
  
  8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Run manually, listening on all interfaces on port &lt;strong&gt;8080&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On another machine: &lt;code&gt;http://SERVER_IP:8080&lt;/code&gt; (llama.cpp’s built-in UI is very basic).&lt;/p&gt;
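&lt;p&gt;The same server also speaks the OpenAI-style REST API, which you can smoke-test from any shell. A minimal sketch, assuming the server above is running on &lt;code&gt;127.0.0.1:8080&lt;/code&gt;; it prints a hint instead of failing when nothing is listening:&lt;/p&gt;

```shell
# Hit the OpenAI-style endpoint (only meaningful while llama-server from the
# block above is running; otherwise this falls through to the hint line).
URL=http://127.0.0.1:8080/v1/chat/completions
result=$(curl -sS --max-time 10 "$URL" \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with one short sentence."}],"max_tokens":32}' \
  2>/dev/null || echo "no llama-server answering on :8080 yet")
printf '%s\n' "$result"
```

&lt;p&gt;On success the reply is JSON with the generated text under &lt;code&gt;choices[0].message.content&lt;/code&gt;.&lt;/p&gt;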




&lt;h2&gt;
  
  
  9. systemd service (start on boot)
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt; (e.g. with &lt;code&gt;sudo nano&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Llama.cpp API server (Vulkan)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="c"&gt;# Vulkan on AMD: the service user must access /dev/dri (groups in §4).
# If the service loads the model on CPU only, check `groups` / `id` for that user.
&lt;/span&gt;&lt;span class="py"&gt;SupplementaryGroups&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;render video&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-m /home/YOUR_USER/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--port 8080 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-c 8192 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-ngl 99 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--n-predict -1&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended order (tight RAM):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;.gguf&lt;/code&gt; must be &lt;strong&gt;fully downloaded&lt;/strong&gt;; a truncated file makes the unit &lt;strong&gt;fail&lt;/strong&gt; or &lt;strong&gt;restart in a loop&lt;/strong&gt; (&lt;code&gt;Restart=always&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test with &lt;code&gt;llama-cli&lt;/code&gt; first&lt;/strong&gt; as the &lt;strong&gt;same user&lt;/strong&gt; as the systemd unit, with the &lt;strong&gt;same&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; as in &lt;code&gt;ExecStart&lt;/code&gt; (§7 &lt;em&gt;Per-model quick test&lt;/em&gt; or step 3’s generic example). If that already OOMs or hangs, &lt;strong&gt;tune flags&lt;/strong&gt; before &lt;code&gt;enable --now&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;journalctl&lt;/code&gt; shows &lt;strong&gt;OOM&lt;/strong&gt;, the process &lt;strong&gt;dies and respawns&lt;/strong&gt; every few seconds, or the kernel kills the worker, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) rather than insisting on &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;, then &lt;code&gt;sudo systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;; repeat until &lt;code&gt;status&lt;/code&gt; shows a stable &lt;strong&gt;active (running)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If startup fails, check logs: &lt;code&gt;journalctl -u llama-web.service -n 80 --no-pager&lt;/code&gt; (GGUF path, &lt;code&gt;/dev/dri&lt;/code&gt; permissions, &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, Vulkan).&lt;/p&gt;
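&lt;p&gt;A quick triage pattern for those logs. The heredoc below imitates a typical systemd/kernel OOM excerpt so the command has something to match; on a real box, feed it &lt;code&gt;journalctl -u llama-web.service -n 200 --no-pager&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Triage sketch on a canned journal excerpt (imitation sample lines); replace
# the heredoc with:  journalctl -u llama-web.service -n 200 --no-pager
verdict="no OOM signature in this excerpt"
if grep -qiE 'out of memory|oom|status=9/KILL' <<'EOF'
Apr 10 12:01:03 mini systemd[1]: llama-web.service: Main process exited, code=killed, status=9/KILL
Apr 10 12:01:03 mini kernel: Out of memory: Killed process 4242 (llama-server)
EOF
then
  verdict="OOM signature found: lower -c / -ngl in ExecStart first"
fi
echo "$verdict"
```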




&lt;h2&gt;
  
  
  10. Open WebUI with Docker (port 3000 → backend on 8080)
&lt;/h2&gt;

&lt;p&gt;Install Docker if needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out again, or run: newgrp docker&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container (UI on &lt;strong&gt;3000&lt;/strong&gt;; engine stays on host &lt;strong&gt;8080&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; open-webui:/app/backend/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; open-webui &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the browser: &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;.&lt;/p&gt;
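&lt;p&gt;From the host you can also probe the UI port before reaching for the browser; a small sketch (&lt;code&gt;000&lt;/code&gt; means nothing answered on &lt;strong&gt;3000&lt;/strong&gt;):&lt;/p&gt;

```shell
# Reachability probe for the UI port; 000 = nothing answered (container down,
# wrong -p mapping, or firewall). Any 2xx/3xx means the UI is serving.
status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:3000/ 2>/dev/null)
status=${status:-000}
echo "Open WebUI on :3000 -> HTTP $status"
```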

&lt;h3&gt;
  
  
  Connect Open WebUI to llama-server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not the same as “External tools”.&lt;/strong&gt; In regular user settings you may see &lt;strong&gt;External tools&lt;/strong&gt; (&lt;em&gt;Manage tool servers&lt;/em&gt;, &lt;code&gt;openapi.json&lt;/code&gt;): that is for optional &lt;strong&gt;tool&lt;/strong&gt; servers, &lt;strong&gt;not&lt;/strong&gt; for the main LLM backend. Putting your URL only there leaves the model picker empty.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Admin Settings&lt;/strong&gt;, not the gear icon that only shows &lt;em&gt;General / Interface / External tools&lt;/em&gt; (&lt;a href="https://docs.openwebui.com/getting-started/quick-start/settings/" rel="noopener noreferrer"&gt;personal user settings&lt;/a&gt;). Typical path: &lt;strong&gt;profile avatar&lt;/strong&gt; → &lt;strong&gt;Admin Settings&lt;/strong&gt; / &lt;strong&gt;Administration&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Connections&lt;/strong&gt; → &lt;strong&gt;OpenAI&lt;/strong&gt; → &lt;strong&gt;Add connection&lt;/strong&gt;. If &lt;em&gt;Admin Settings&lt;/em&gt; is missing, your account is not an instance admin (the first registered user usually is). Docs: &lt;a href="https://docs.openwebui.com/getting-started/quick-start/connect-a-provider/starting-with-openai-compatible/" rel="noopener noreferrer"&gt;OpenAI-Compatible&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Admin panel → Settings → Connections&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; section (llama-server mimics the OpenAI API):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL:&lt;/strong&gt; &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key:&lt;/strong&gt; any string (e.g. &lt;code&gt;sk-no-key-required&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and use &lt;strong&gt;verify connection&lt;/strong&gt; if shown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn off “Direct connections”&lt;/strong&gt; (or equivalent) if you enabled it: otherwise the browser will try to resolve &lt;code&gt;host.docker.internal&lt;/code&gt; outside Docker and fail. The UI should proxy to the backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Chat up and running (example)
&lt;/h3&gt;

&lt;p&gt;With the backend wired, pick a model in chat (often the same label as the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; filename&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loaded), send a prompt, and the reply is generated on the host. The screenshot shows &lt;strong&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/strong&gt;: the header dropdown reflects that file, and you get a &lt;strong&gt;“Thought for …”&lt;/strong&gt;-style block (internal reasoning before the visible answer). That &lt;strong&gt;adds latency&lt;/strong&gt; before you see the final text; for &lt;strong&gt;terminal&lt;/strong&gt; use and less explicit “thinking” output with Gemma, try &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (§7 &lt;em&gt;Quick terminal test&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" alt="Open WebUI: chat with Gemma 4 26B Q4_K_M, GGUF picker, and reasoning (“Thought for …”)." width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No browsing or GitHub fetch: real limits (and confident wrong answers)
&lt;/h3&gt;

&lt;p&gt;With &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;Open WebUI&lt;/strong&gt; as wired here, the model is &lt;strong&gt;text → text&lt;/strong&gt; only: it does &lt;strong&gt;not&lt;/strong&gt; browse the web, issue its own &lt;strong&gt;internet&lt;/strong&gt; requests, download a &lt;strong&gt;&lt;code&gt;https://github.com/...&lt;/code&gt;&lt;/strong&gt; tree, or run code in a sandbox. All it “sees” is what &lt;strong&gt;you&lt;/strong&gt; type (plus whatever context the UI forwards) and knowledge &lt;strong&gt;frozen&lt;/strong&gt; inside the &lt;strong&gt;GGUF&lt;/strong&gt; up to training cutoff.&lt;/p&gt;

&lt;p&gt;It may still answer &lt;strong&gt;very confidently&lt;/strong&gt; as if it had tools, for example claiming it &lt;strong&gt;“can analyze a public repo if you share the link”&lt;/strong&gt; or outlining how it will &lt;strong&gt;“read”&lt;/strong&gt; a remote &lt;code&gt;README&lt;/code&gt;. In this stack &lt;strong&gt;that is false&lt;/strong&gt; if you only paste a URL: the backend &lt;strong&gt;never fetches&lt;/strong&gt; the HTML or the repo; Gemma (or any local GGUF) &lt;strong&gt;hallucinates&lt;/strong&gt; or repeats patterns from training. Real analysis needs &lt;strong&gt;you to paste files&lt;/strong&gt; / diffs, or &lt;strong&gt;separate&lt;/strong&gt; plumbing (RAG, &lt;strong&gt;Open WebUI functions&lt;/strong&gt;, agents, APIs) that this guide does &lt;strong&gt;not&lt;/strong&gt; set up.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;“Thought for …”&lt;/strong&gt; / reasoning block (§7, §10) does &lt;strong&gt;not&lt;/strong&gt; verify anything online—it only extends generation and can read like a &lt;strong&gt;super-capable assistant&lt;/strong&gt;; double-check claims about repos, “current” versions, or anything that depends on &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same stack, different tone:&lt;/strong&gt; ask bluntly &lt;em&gt;can you browse the Internet for new info?&lt;/em&gt; and Gemma may &lt;strong&gt;plainly refuse&lt;/strong&gt;—no live search, only training data plus whatever &lt;strong&gt;you&lt;/strong&gt; paste. That does &lt;strong&gt;not&lt;/strong&gt; undo the GitHub-URL problem above: the model &lt;strong&gt;shifts persona&lt;/strong&gt; with prompt framing (literal capability question vs. “please review this repo”). &lt;strong&gt;Ground truth&lt;/strong&gt; is unchanged: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; still issues no HTTP&lt;/strong&gt; on its own until you wire tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" alt="Open WebUI (English): *Can you browse the Internet…?* — honest “no live web” reply; same stack, still no automatic fetch." width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo (the joke writes itself):&lt;/strong&gt; the assistant just told you to &lt;em&gt;“send the link”&lt;/em&gt;; you reply &lt;em&gt;analyze &lt;code&gt;https://github.com/…/pgwd&lt;/code&gt; and tell me what to improve&lt;/em&gt;—or the &lt;strong&gt;same&lt;/strong&gt; request in &lt;strong&gt;Spanish&lt;/strong&gt; (or any other language you type in the UI); &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; does not switch behavior by chat language&lt;/strong&gt;. Open WebUI shows &lt;strong&gt;Thinking…&lt;/strong&gt; and Gemma looks busy, but &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; never fetched that repo&lt;/strong&gt;: it only sees the &lt;strong&gt;message string&lt;/strong&gt;. The answer may sound technical yet be &lt;strong&gt;untethered from the real tree&lt;/strong&gt;—paste files, use &lt;strong&gt;git&lt;/strong&gt; yourself, or wire tools if you want grounded review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" alt="Open WebUI: after “analyze this GitHub repo…”, the model shows Thinking… — no URL fetch in this stack." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same experiment, a minute later:&lt;/strong&gt; the model may return &lt;strong&gt;Thought for ~45–60s&lt;/strong&gt; and a long “review” that &lt;strong&gt;reads like a real audit&lt;/strong&gt;. The screenshot below is &lt;strong&gt;English&lt;/strong&gt; (&lt;em&gt;analyze in details…&lt;/em&gt;): it leans into &lt;strong&gt;Flask&lt;/strong&gt; and &lt;strong&gt;Blueprints&lt;/strong&gt;; in &lt;strong&gt;another&lt;/strong&gt; chat the same Gemma might rattle off &lt;strong&gt;Go&lt;/strong&gt; &lt;code&gt;cmd/&lt;/code&gt;/&lt;code&gt;internal/&lt;/code&gt;—still with &lt;strong&gt;no&lt;/strong&gt; tree read. That is template + guesswork, not repository access: some bullets may match the name (&lt;em&gt;pgwd&lt;/em&gt;, “dashboard”, …), some may be &lt;strong&gt;wrong&lt;/strong&gt;; &lt;strong&gt;length&lt;/strong&gt; and &lt;strong&gt;“thought”&lt;/strong&gt; time are not a substitute for &lt;strong&gt;cloning&lt;/strong&gt; and diffing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" alt="Open WebUI (English example): detailed reply after a bare GitHub URL with no fetch — “Thought for …” plus persuasive text; verify against real code." width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed
&lt;/h3&gt;

&lt;p&gt;This almost never means “the &lt;code&gt;.gguf&lt;/code&gt; is missing on disk”; it means &lt;strong&gt;Open WebUI is not getting &lt;code&gt;/v1/models&lt;/code&gt;&lt;/strong&gt; from the backend you configured. Walk through in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; must be running&lt;/strong&gt; on the same host as Docker (§8 manual or §9 &lt;code&gt;systemd&lt;/code&gt;). Nothing listening on &lt;strong&gt;8080&lt;/strong&gt; → empty list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On the host&lt;/strong&gt; (mini PC shell), hit the API:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://127.0.0.1:8080/v1/models | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see JSON (&lt;code&gt;data&lt;/code&gt;, at least one &lt;code&gt;id&lt;/code&gt;). &lt;strong&gt;Connection refused&lt;/strong&gt; → start or fix &lt;code&gt;llama-server&lt;/code&gt;. If it bound only &lt;code&gt;127.0.0.1&lt;/code&gt; or another specific interface, use &lt;strong&gt;&lt;code&gt;--host 0.0.0.0&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;ExecStart&lt;/code&gt;: loopback alone is not enough once LAN clients or the Docker container need port 8080.&lt;/p&gt;
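&lt;p&gt;To pull just the model &lt;code&gt;id&lt;/code&gt;s out of that JSON, a small sketch; the payload below is a canned sample shaped like the reply, so on your box pipe the real &lt;code&gt;curl&lt;/code&gt; output in instead:&lt;/p&gt;

```shell
# Extract model ids from a /v1/models-style payload. The sample string stands in
# for:  curl -sS http://127.0.0.1:8080/v1/models
sample='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'
ids=$(printf '%s' "$sample" | python3 -c \
  'import json,sys; print("\n".join(m["id"] for m in json.load(sys.stdin)["data"]))')
echo "$ids"
```

&lt;p&gt;Whatever &lt;code&gt;id&lt;/code&gt;s print here are exactly the labels Open WebUI can show in its model picker.&lt;/p&gt;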

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;From the Open WebUI container&lt;/strong&gt;, the host port must be reachable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;open-webui sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'wget -qO- http://host.docker.internal:8080/v1/models 2&amp;gt;/dev/null || curl -sS http://host.docker.internal:8080/v1/models'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this fails but step 2 works, you are missing &lt;strong&gt;&lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;docker run&lt;/code&gt; (§10), or a firewall blocks Docker bridge → host (&lt;code&gt;ufw&lt;/code&gt; may need a rule; many setups allow it by default).&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI wiring:&lt;/strong&gt; &lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or &lt;strong&gt;Admin&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt;, depending on version), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; required&lt;/strong&gt;). Save a dummy API key and &lt;strong&gt;verify&lt;/strong&gt; if offered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not mix with Ollama:&lt;/strong&gt; putting the &lt;code&gt;llama-server&lt;/code&gt; URL only under &lt;strong&gt;Ollama&lt;/strong&gt;, or using port 8080 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;/v1&lt;/code&gt;, can leave the dropdown empty. See the table below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After fixing, &lt;strong&gt;hard-refresh&lt;/strong&gt; the UI. The model label may match the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; name&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;default&lt;/code&gt;&lt;/strong&gt;, or whatever &lt;code&gt;id&lt;/code&gt; appears in the JSON from step 2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  “Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;Settings → Models → Manage Models&lt;/strong&gt; shows the &lt;strong&gt;Ollama&lt;/strong&gt; service with URL &lt;code&gt;http://host.docker.internal:8080&lt;/code&gt; (and nothing else), you often get &lt;strong&gt;Failed to fetch models&lt;/strong&gt;. That usually means &lt;strong&gt;two different backends are mixed up&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you run&lt;/th&gt;
&lt;th&gt;Typical port&lt;/th&gt;
&lt;th&gt;Where to configure it in Open WebUI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;llama-server&lt;/strong&gt; (this guide)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;8080&lt;/strong&gt;, OpenAI-style API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or equivalent), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (the &lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; suffix is required&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (only if installed separately)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;11434&lt;/strong&gt;, Ollama API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; connection / model management, typically &lt;code&gt;http://host.docker.internal:11434&lt;/code&gt; (only if Ollama listens on the host and the container can reach it).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; is &lt;strong&gt;not&lt;/strong&gt; Ollama. If you put the llama-server URL in the &lt;strong&gt;Ollama&lt;/strong&gt; field, the UI uses the wrong protocol and fails even when port 8080 is open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you only use llama-server:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep &lt;strong&gt;Connections → OpenAI&lt;/strong&gt; exactly as above (&lt;code&gt;…8080/v1&lt;/code&gt;, dummy key, verify).&lt;/li&gt;
&lt;li&gt;If you do not run Ollama, &lt;strong&gt;clear or disable&lt;/strong&gt; the Ollama URL (do not point it at 8080).&lt;/li&gt;
&lt;li&gt;Return to &lt;strong&gt;Models&lt;/strong&gt; or chat: available models follow whatever &lt;code&gt;llama-server&lt;/code&gt; loaded with &lt;code&gt;-m&lt;/code&gt; (§8–§9).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;host.docker.internal&lt;/code&gt; does not resolve&lt;/strong&gt; inside the container, confirm your &lt;code&gt;docker run&lt;/code&gt; includes &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt; (§10). On Linux that hostname is not defined by default without it.&lt;/p&gt;
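&lt;p&gt;One way to confirm the alias actually resolves inside the container (a sketch; assumes the container name &lt;code&gt;open-webui&lt;/code&gt; from §10 and prints a hint when Docker is not available):&lt;/p&gt;

```shell
# Check the host.docker.internal alias from inside the container. With
# --add-host=...:host-gateway it resolves to the host's gateway IP; without it,
# the lookup fails on Linux.
out=$(docker exec open-webui getent hosts host.docker.internal 2>/dev/null \
  || echo "docker not available here, or the open-webui container is not running")
printf '%s\n' "$out"
```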

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" alt="Illustration: conceptual flow for upgrading the UI (image pull, recreate container, persistent volume)" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating Open WebUI (Docker)
&lt;/h3&gt;

&lt;p&gt;The UI often shows a banner like &lt;em&gt;“A new version (v0.x.y) is now available…”&lt;/em&gt; when a newer image exists. Your &lt;strong&gt;chats and settings&lt;/strong&gt; live in the &lt;strong&gt;&lt;code&gt;open-webui&lt;/code&gt; named volume&lt;/strong&gt;; they are kept when you recreate the container as long as you mount the same &lt;code&gt;-v open-webui:/app/backend/data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" alt="Updating Open WebUI" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; the updated image (same tag you used at install; this guide uses &lt;code&gt;main&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stop and remove&lt;/strong&gt; only the container (the volume stays intact):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop open-webui
docker &lt;span class="nb"&gt;rm &lt;/span&gt;open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the &lt;strong&gt;same&lt;/strong&gt; &lt;code&gt;docker run&lt;/code&gt; block from §10 again (same &lt;code&gt;-p 3000:8080&lt;/code&gt;, &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;, &lt;code&gt;-v open-webui:…&lt;/code&gt;, container name &lt;code&gt;open-webui&lt;/code&gt;, etc.). The new container starts from the image you just pulled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you originally used a &lt;strong&gt;different tag&lt;/strong&gt; (e.g. &lt;code&gt;v0.8.12&lt;/code&gt; or a &lt;code&gt;cuda&lt;/code&gt; variant) instead of &lt;code&gt;main&lt;/code&gt;, substitute that tag in both &lt;code&gt;docker pull&lt;/code&gt; and &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt; updating the UI does &lt;strong&gt;not&lt;/strong&gt; update &lt;code&gt;llama-server&lt;/code&gt; or your GGUF weights; the engine is still §6–§9. If you do not want to track &lt;code&gt;main&lt;/code&gt;, pin an explicit image tag in &lt;code&gt;docker run&lt;/code&gt; and repeat this flow when you choose to upgrade.&lt;/p&gt;
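&lt;p&gt;Before pulling a new image, you can optionally snapshot the volume first (a sketch; the &lt;code&gt;alpine&lt;/code&gt; helper image and the backup file name are illustrative):&lt;/p&gt;

```shell
# Archive the open-webui named volume to the current directory before upgrading
docker run --rm \
  -v open-webui:/data:ro \
  -v "$PWD":/backup \
  alpine tar czf /backup/open-webui-backup.tar.gz -C /data .
```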

&lt;h3&gt;
  
  
  If you also run Ollama
&lt;/h3&gt;

&lt;p&gt;If Ollama is installed, Open WebUI may auto-detect its endpoint on port &lt;strong&gt;11434&lt;/strong&gt;. To keep using &lt;strong&gt;your&lt;/strong&gt; Vulkan &lt;code&gt;llama-server&lt;/code&gt; (with the &lt;code&gt;-ngl&lt;/code&gt;/RAM behavior you configured), keep the OpenAI connection pointing at &lt;code&gt;:8080/v1&lt;/code&gt; first in the list and do not rely on the Ollama backend for those models.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Same API surface as Open WebUI: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible endpoint&lt;/strong&gt; at &lt;code&gt;http://HOST:8080/v1&lt;/code&gt; (keep §8 or §9 running). Use the mini PC’s IP instead of &lt;code&gt;127.0.0.1&lt;/code&gt; when you work from another machine on the LAN (and open port 8080 in the firewall if needed).&lt;/p&gt;
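&lt;p&gt;The local-versus-LAN distinction boils down to one base URL. A tiny helper makes the pattern explicit (port 8080 from §8; the example LAN IP is an assumption):&lt;/p&gt;

```shell
# Build the base URL a client needs, local or remote (port 8080 assumed from §8)
base_url() { printf 'http://%s:8080/v1\n' "${1:-127.0.0.1}"; }

base_url                 # same machine
base_url 192.168.1.50    # example LAN IP of the mini PC
```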

&lt;h3&gt;
  
  
  OpenCode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; can use &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; providers through &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt;. The official docs include a &lt;strong&gt;llama.cpp / llama-server&lt;/strong&gt; example: &lt;a href="https://opencode.ai/docs/providers/" rel="noopener noreferrer"&gt;Providers — llama.cpp&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; answers (e.g. &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Create or edit &lt;strong&gt;&lt;code&gt;opencode.json&lt;/code&gt;&lt;/strong&gt; for your project or OpenCode’s config path (&lt;code&gt;$schema&lt;/code&gt;: &lt;code&gt;https://opencode.ai/config.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add a provider with &lt;code&gt;"npm": "@ai-sdk/openai-compatible"&lt;/code&gt; and &lt;code&gt;"options.baseURL": "http://127.0.0.1:8080/v1"&lt;/code&gt; (or the remote IP).&lt;/li&gt;
&lt;li&gt;Under &lt;code&gt;provider.&amp;lt;id&amp;gt;.models&lt;/code&gt;, add keys that match what the API expects. If unsure, read the &lt;code&gt;id&lt;/code&gt; field from &lt;code&gt;/v1/models&lt;/code&gt;; it is often the &lt;code&gt;.gguf&lt;/code&gt; filename or &lt;code&gt;default&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In OpenCode, use &lt;code&gt;/models&lt;/code&gt; to pick &lt;code&gt;provider_id/model_id&lt;/code&gt;, or set &lt;code&gt;"model": "provider_id/model_id"&lt;/code&gt; in the JSON.&lt;/li&gt;
&lt;/ol&gt;
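&lt;p&gt;If you have &lt;code&gt;jq&lt;/code&gt; installed (an assumption: &lt;code&gt;sudo apt install -y jq&lt;/code&gt;), listing the IDs from step 4 is a one-liner:&lt;/p&gt;

```shell
# Print the model IDs llama-server reports; use these as keys under "models"
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```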

&lt;p&gt;Minimal example (adjust IDs to your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local model (default)"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-local/default"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenCode cannot see the model, align &lt;code&gt;models&lt;/code&gt; keys with &lt;code&gt;/v1/models&lt;/code&gt;. Tools and heavy agentic flows &lt;strong&gt;depend on the GGUF&lt;/strong&gt;; a general chat model may underperform on coding-agent tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Studio Code
&lt;/h3&gt;

&lt;p&gt;VS Code does not talk to your server by itself—you need an &lt;strong&gt;extension&lt;/strong&gt; that supports a custom OpenAI-style endpoint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common picks: &lt;strong&gt;&lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue&lt;/a&gt;&lt;/strong&gt; and others advertising &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; or “local LLM”. You typically set &lt;strong&gt;Base URL&lt;/strong&gt; to &lt;code&gt;http://127.0.0.1:8080/v1&lt;/code&gt; (or the server IP) and &lt;strong&gt;API key&lt;/strong&gt; to any placeholder (e.g. &lt;code&gt;sk-local&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot&lt;/strong&gt; in VS Code does not route through your &lt;code&gt;llama-server&lt;/code&gt;; it is a separate cloud service.&lt;/li&gt;
&lt;li&gt;From another PC, use the host IP where &lt;code&gt;llama-server&lt;/code&gt; runs—not &lt;code&gt;host.docker.internal&lt;/code&gt; (that name is for containers such as Open WebUI).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models driven through these extensions usually trail cloud offerings on tool use and very large contexts. Start on the same machine you already validated with &lt;code&gt;llama-cli&lt;/code&gt; or Open WebUI.&lt;/p&gt;
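&lt;p&gt;Before wiring up an extension, a quick smoke test of the exact endpoint it will call can save debugging time (the model name and placeholder key are assumptions; adjust to what &lt;code&gt;/v1/models&lt;/code&gt; reports):&lt;/p&gt;

```shell
# Minimal chat-completions request against llama-server (OpenAI-compatible)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-local' \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hi"}]}'
```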




&lt;h2&gt;
  
  
  12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;Typical CMake symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Could NOT find Vulkan (missing: ... glslc)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Vulkan found but &lt;code&gt;glslc&lt;/code&gt; still missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suggested order (simplest first):&lt;/p&gt;

&lt;h3&gt;
  
  
  12.1 Universe repository and packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository universe
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libvulkan-dev vulkan-tools shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; glslc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; glslc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and reconfigure the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 LunarG repository (Vulkan SDK)
&lt;/h3&gt;

&lt;p&gt;If your Ubuntu mirror does not offer &lt;code&gt;shaderc&lt;/code&gt; or &lt;code&gt;glslc&lt;/code&gt; is still missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; https://packages.lunarg.com/lunarg-signing-key-pub.asc &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/trusted.gpg.d/lunarg.asc
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget &lt;span class="nt"&gt;-qO&lt;/span&gt; /etc/apt/sources.list.d/lunarg-vulkan-noble.list &lt;span class="se"&gt;\&lt;/span&gt;
  https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;rm -rf build&lt;/code&gt; and run &lt;code&gt;cmake&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;dpkg&lt;/code&gt; complains about overwriting files between packages, as a last resort you can force-remove the blocking package, then repair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--remove&lt;/span&gt; &lt;span class="nt"&gt;--force-depends&lt;/span&gt; libshaderc-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--fix-broken&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only do this if you understand that mixing repositories can leave dependencies in a messy state; often, sticking to &lt;strong&gt;either&lt;/strong&gt; LunarG &lt;strong&gt;or&lt;/strong&gt; Ubuntu for the Shaderc dev packages is enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;google-shaderc
&lt;span class="nb"&gt;sudo ln&lt;/span&gt; &lt;span class="nt"&gt;-sf&lt;/span&gt; /snap/bin/glslc /usr/local/bin/glslc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;glslc --version&lt;/code&gt; again and retry CMake.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Performance and models (rough guide)
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;lots of RAM&lt;/strong&gt; and a &lt;strong&gt;modest iGPU&lt;/strong&gt;, tokens/s is capped by unified-memory bandwidth and by how many layers &lt;code&gt;-ngl&lt;/code&gt; offloads; larger models can spill into system RAM and slow down further.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B A4B (e.g. Q4_K_M ~17 GiB)&lt;/td&gt;
&lt;td&gt;Good balance with high RAM; needs an up-to-date llama.cpp.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same family Q8_0 (~27 GiB)&lt;/td&gt;
&lt;td&gt;Better quality; more pressure on RAM/unified VRAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixtral 8×7B, 70B, others&lt;/td&gt;
&lt;td&gt;Feasible mainly thanks to RAM; slower.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use a lower quantization (e.g. Q4_K_M) if you prioritize &lt;strong&gt;speed&lt;/strong&gt; over &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For hard numbers &lt;strong&gt;on your&lt;/strong&gt; box, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7): it is the most direct way to compare &lt;code&gt;-ngl&lt;/code&gt; and quantizations without the web UI in the way.&lt;/p&gt;
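&lt;p&gt;A sketch of such a comparison (the file names are illustrative; &lt;code&gt;llama-bench&lt;/code&gt; accepts comma-separated values to sweep a parameter):&lt;/p&gt;

```shell
# Sweep GPU offload levels for one quantization, then compare against another
./build/bin/llama-bench -m ~/models/model-q4_k_m.gguf -ngl 0,16,32
./build/bin/llama-bench -m ~/models/model-q8_0.gguf -ngl 32
```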

&lt;h3&gt;
  
  
  &lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt; shows &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;low CPU&lt;/strong&gt; across cores and only a &lt;strong&gt;few GiB&lt;/strong&gt; of &lt;strong&gt;RES&lt;/strong&gt;, that is often &lt;strong&gt;expected&lt;/strong&gt; when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; leaves much of the model on the &lt;strong&gt;iGPU&lt;/strong&gt; — heavy matmul runs on the graphics core; the &lt;strong&gt;CPU&lt;/strong&gt; orchestrates and shuffles data, so you may &lt;strong&gt;not&lt;/strong&gt; see all cores pegged at 100%.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;GGUF is small&lt;/strong&gt; (e.g. 7B/8B &lt;strong&gt;Q4&lt;/strong&gt;) — small &lt;strong&gt;resident&lt;/strong&gt; RAM footprint; a &lt;strong&gt;26B&lt;/strong&gt; run would show much more &lt;strong&gt;RES&lt;/strong&gt; if most weights live in system memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursts&lt;/strong&gt; happen while scoring the prompt and &lt;strong&gt;generating&lt;/strong&gt; tokens; between turns or while you read output, usage &lt;strong&gt;drops&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;unified memory (UMA)&lt;/strong&gt;, some model cost may &lt;strong&gt;not&lt;/strong&gt; show up as a huge process RSS: the &lt;strong&gt;GPU&lt;/strong&gt; also holds part of the working set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; assume nothing is working just because &lt;code&gt;htop&lt;/code&gt; stays calm: check &lt;strong&gt;t/s&lt;/strong&gt; in &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7), or a &lt;strong&gt;GPU&lt;/strong&gt; monitor if you want to see graphics load.&lt;/p&gt;
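&lt;p&gt;On &lt;code&gt;amdgpu&lt;/code&gt;, one quick way to watch graphics load without extra tools is the sysfs utilization counter (the &lt;code&gt;card*&lt;/code&gt; glob is an assumption; the exact card number varies per machine):&lt;/p&gt;

```shell
# Refresh iGPU utilization once per second while a prompt is generating
watch -n1 'cat /sys/class/drm/card*/device/gpu_busy_percent'
```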

&lt;p&gt;&lt;strong&gt;Reference screenshot&lt;/strong&gt; (same class of mini PC as the validated hardware; &lt;strong&gt;SSH&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;llama.cpp&lt;/code&gt; around &lt;strong&gt;~5 GiB RES&lt;/strong&gt; and &lt;strong&gt;moderate&lt;/strong&gt; CPU on one core—consistent with a &lt;strong&gt;non-huge&lt;/strong&gt; model and &lt;strong&gt;GPU&lt;/strong&gt;-bound &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" alt="htop during inference: llama.cpp with moderate CPU and RAM (Vulkan / -ngl)." width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Many snippets use &lt;strong&gt;&lt;code&gt;/sys/kernel/debug/dri/0/amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt;. On Ryzen mini PCs with &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;dri/0&lt;/code&gt; often does not exist&lt;/strong&gt;: the kernel exposes the GPU under a &lt;strong&gt;PCI BDF&lt;/strong&gt; directory (&lt;code&gt;0000:c4:00.0&lt;/code&gt;, …) and provides &lt;strong&gt;symlinks&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;dri/1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;dri/128&lt;/code&gt;&lt;/strong&gt; into the same tree. If &lt;code&gt;cat&lt;/code&gt; returns &lt;em&gt;No such file or directory&lt;/em&gt;, inspect first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount | &lt;span class="nb"&gt;grep &lt;/span&gt;debugfs   &lt;span class="c"&gt;# expect debugfs on /sys/kernel/debug&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /sys/kernel/debug/dri/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then read &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt; using the &lt;strong&gt;&lt;code&gt;N&lt;/code&gt;&lt;/strong&gt; or PCI path that belongs to your AMDGPU (&lt;strong&gt;&lt;code&gt;1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;0000:…:….0&lt;/code&gt;&lt;/strong&gt; usually works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /sys/kernel/debug/dri/1/amdgpu_pm_info
&lt;span class="c"&gt;# same content if 1 → 0000:c4:00.0:&lt;/span&gt;
&lt;span class="c"&gt;# sudo cat /sys/kernel/debug/dri/0000:c4:00.0/amdgpu_pm_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the directory exists but &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt; is missing&lt;/strong&gt;, your kernel may &lt;strong&gt;not export&lt;/strong&gt; that node; try &lt;code&gt;ls … | grep -i pm&lt;/code&gt;. That does &lt;strong&gt;not&lt;/strong&gt; mean Vulkan is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to read it (sample text, idle mini PC):&lt;/strong&gt; &lt;strong&gt;GPU Load: 0 %&lt;/strong&gt; with &lt;strong&gt;VCN powered down&lt;/strong&gt; matches &lt;strong&gt;idle&lt;/strong&gt;. While &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; runs a long &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt; job, run &lt;code&gt;cat&lt;/code&gt; &lt;strong&gt;during&lt;/strong&gt; generation: you should usually see &lt;strong&gt;Load &amp;gt; 0 %&lt;/strong&gt; (the counter may not peg the iGPU). For a live view, &lt;strong&gt;&lt;code&gt;radeontop&lt;/code&gt;&lt;/strong&gt; is often easier (&lt;code&gt;sudo apt install -y radeontop&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GFX Clocks and Power:
    2800 MHz (MCLK)
    800 MHz (SCLK)
    ...
GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %
VCN: Powered down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Illustrative excerpt; clocks, millivolts, and watts vary with BIOS, governor, and workload.)&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
&lt;/h2&gt;

&lt;p&gt;When the mini PC runs &lt;strong&gt;GNOME&lt;/strong&gt; and you want the full desktop from &lt;strong&gt;another machine on the same network&lt;/strong&gt; (Windows, Mac, Linux), &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; usually ships &lt;strong&gt;RDP&lt;/strong&gt; built in; you often &lt;strong&gt;do not&lt;/strong&gt; need &lt;strong&gt;xrdp&lt;/strong&gt; unless you want different behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.1 Enable on the mini PC
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;System&lt;/strong&gt; → &lt;strong&gt;Remote Desktop&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn &lt;strong&gt;Remote Desktop&lt;/strong&gt; on.&lt;/li&gt;
&lt;li&gt;Finish the assistant (password / auth as GNOME shows).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Underlying package: &lt;strong&gt;&lt;code&gt;gnome-remote-desktop&lt;/code&gt;&lt;/strong&gt;. If the toggle is missing or fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--reinstall&lt;/span&gt; gnome-remote-desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log out or reboot and open Settings again.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Connect from another machine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Native &lt;strong&gt;RDP&lt;/strong&gt; clients: &lt;strong&gt;Windows&lt;/strong&gt; (Remote Desktop Connection / &lt;code&gt;mstsc&lt;/code&gt;), &lt;strong&gt;macOS&lt;/strong&gt; (Microsoft Remote Desktop from the App Store), &lt;strong&gt;Linux&lt;/strong&gt; (e.g. Remmina, RDP protocol).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; the Ubuntu box’s &lt;strong&gt;LAN IP&lt;/strong&gt; (&lt;code&gt;hostname -I | awk '{print $1}'&lt;/code&gt; on the mini PC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port:&lt;/strong&gt; &lt;strong&gt;3389/TCP&lt;/strong&gt; by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;ufw&lt;/code&gt; is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 3389/tcp comment &lt;span class="s1"&gt;'GNOME RDP'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.4 If connection fails
&lt;/h3&gt;

&lt;p&gt;On the &lt;strong&gt;Ubuntu host&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;3389 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Remote Desktop enabled, something should listen on &lt;strong&gt;3389&lt;/strong&gt;. Confirm the client is on the &lt;strong&gt;same LAN&lt;/strong&gt; and that no AP isolation blocks client-to-client Wi‑Fi.&lt;/p&gt;

&lt;p&gt;If GNOME/RDP misbehaves on &lt;strong&gt;Wayland&lt;/strong&gt;, try the &lt;strong&gt;Ubuntu on Xorg&lt;/strong&gt; session on the login screen and enable Remote Desktop again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; exposing RDP to the &lt;strong&gt;public Internet&lt;/strong&gt; without VPN/tunnel is a bad idea; keep it on a &lt;strong&gt;trusted LAN&lt;/strong&gt; or behind VPN/WireGuard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] BIOS: UMA / VRAM for iGPU adjusted if applicable.&lt;/li&gt;
&lt;li&gt;[ ] Vulkan OK: on desktop &lt;code&gt;vkcube&lt;/code&gt;; on &lt;strong&gt;Ubuntu Server&lt;/strong&gt; &lt;code&gt;vulkaninfo --summary&lt;/code&gt; shows the GPU.&lt;/li&gt;
&lt;li&gt;[ ] User is in &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;id -nG&lt;/code&gt;); if you ran &lt;code&gt;usermod&lt;/code&gt;, you &lt;strong&gt;logged out or rebooted&lt;/strong&gt; (an old shell session does not pick up new groups).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;cmake -B build -DGGML_VULKAN=1&lt;/code&gt; succeeds; build reaches 100 %.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;git pull&lt;/code&gt;, rebuild §6) and follow &lt;strong&gt;try model → systemd → Open WebUI&lt;/strong&gt; when experimenting with new GGUFs (§7, &lt;em&gt;Experimenting…&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-cli&lt;/code&gt; shows the Vulkan device when loading the model.&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-server&lt;/code&gt; responds on &lt;code&gt;:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] Open WebUI on &lt;code&gt;:3000&lt;/code&gt; with &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt; and &lt;strong&gt;Direct connections&lt;/strong&gt; off.&lt;/li&gt;
&lt;li&gt;[ ] You know the model does &lt;strong&gt;not&lt;/strong&gt; browse or read GitHub from a URL alone; it may &lt;strong&gt;hallucinate&lt;/strong&gt; capabilities (§10 &lt;em&gt;No browsing or GitHub fetch&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to upgrade Open WebUI&lt;/strong&gt;: &lt;code&gt;docker pull&lt;/code&gt;, &lt;code&gt;stop&lt;/code&gt;/&lt;code&gt;rm&lt;/code&gt; the container, rerun the same &lt;code&gt;docker run&lt;/code&gt; with the &lt;code&gt;open-webui&lt;/code&gt; volume (§10).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;systemd&lt;/code&gt; service enabled if you want a persistent boot setup.&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to switch models&lt;/strong&gt;: after adding another &lt;code&gt;.gguf&lt;/code&gt;, you update &lt;code&gt;-m&lt;/code&gt; in &lt;code&gt;llama-web.service&lt;/code&gt; (or in the manual command), run &lt;code&gt;sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart llama-web.service&lt;/code&gt;, and reload Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;list&lt;/strong&gt; your &lt;code&gt;.gguf&lt;/code&gt; files (&lt;code&gt;ls&lt;/code&gt; / &lt;code&gt;find&lt;/code&gt;, §7) and &lt;strong&gt;measure&lt;/strong&gt; throughput with &lt;code&gt;llama-bench&lt;/code&gt; (§7) when comparing quantizations or &lt;code&gt;-ngl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] You can follow the &lt;strong&gt;unified playbook&lt;/strong&gt; for Gemma 4 / Qwen Coder / DeepSeek Lite / Llama 3.1 (§7): download → &lt;code&gt;llama-cli&lt;/code&gt; → &lt;code&gt;systemd&lt;/code&gt; → &lt;code&gt;/v1/models&lt;/code&gt; → Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] (Optional) &lt;strong&gt;Remote desktop&lt;/strong&gt; §14: RDP enabled in Settings, &lt;strong&gt;3389&lt;/strong&gt; allowed in &lt;code&gt;ufw&lt;/code&gt; if needed, smoke tested from another PC on the LAN.&lt;/li&gt;
&lt;/ul&gt;
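&lt;p&gt;The model-switch item above can be sketched as (unit name from §9; the model path is illustrative):&lt;/p&gt;

```shell
# Point the unit at another GGUF, then restart and watch it load
sudo systemctl edit --full llama-web.service   # update -m /path/to/model.gguf
sudo systemctl daemon-reload
sudo systemctl restart llama-web.service
journalctl -u llama-web.service -f             # confirm the new model loads
```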




&lt;h2&gt;
  
  
  Quick port reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama-server&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote desktop (GNOME RDP)&lt;/td&gt;
&lt;td&gt;3389 TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (optional)&lt;/td&gt;
&lt;td&gt;11434&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Running local inference on Ubuntu with Vulkan and an AMD iGPU is not a one-click setup, but it is worth it: a model that answers &lt;strong&gt;on your LAN&lt;/strong&gt;, without routing every request through a third-party API, and with the freedom to swap GGUFs or quantizations when you need to.&lt;/p&gt;

&lt;p&gt;The stack moves fast: &lt;strong&gt;llama.cpp&lt;/strong&gt;, Ubuntu packages, and Hugging Face repos &lt;strong&gt;change&lt;/strong&gt; over time. If a command or package name no longer matches this guide, &lt;code&gt;cmake&lt;/code&gt; and &lt;code&gt;apt&lt;/code&gt; errors usually point you in the right direction; double-check the project’s current docs.&lt;/p&gt;

&lt;p&gt;Once the checklist is green, the natural next step is tuning &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, context size (&lt;code&gt;-c&lt;/code&gt;), and the model until you get the quality-vs-tokens-per-second balance you want &lt;strong&gt;on your&lt;/strong&gt; hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the mini PC&lt;/strong&gt; we used for the tests and validation in this guide: &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; (&lt;strong&gt;Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M&lt;/strong&gt;), &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, plenty of &lt;strong&gt;DDR5&lt;/strong&gt; RAM and NVMe — the same box behind the &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; runs, &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; screenshots, &lt;strong&gt;Open WebUI&lt;/strong&gt; examples, and the other reference captures. The photo is the &lt;strong&gt;actual&lt;/strong&gt; machine (powered on, front panel as shown), not a marketing render.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" alt="Minisforum UM760 Slim — the physical box used to validate this guide (Ryzen 5 7640HS, Radeon 760M, Ubuntu 24.04)." width="732" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now go tinker:&lt;/strong&gt; this walkthrough is rooted in &lt;strong&gt;Ryzen + iGPU&lt;/strong&gt;, but the playbook travels—&lt;strong&gt;mini PCs&lt;/strong&gt; (Minisforum, Beelink, &lt;strong&gt;ASUS ExpertCenter PN&lt;/strong&gt;, &lt;strong&gt;ZOTAC ZBOX&lt;/strong&gt;, modern &lt;strong&gt;Intel NUC-class&lt;/strong&gt; boxes…), &lt;strong&gt;Mac mini&lt;/strong&gt; / &lt;strong&gt;Mac Studio&lt;/strong&gt; on Apple Silicon if that is your stack, or compact power boxes like &lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; when budget and goals match. Build &lt;strong&gt;llama.cpp&lt;/strong&gt; (or your preferred runtime), stress &lt;strong&gt;GGUF&lt;/strong&gt; quantizations, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;your&lt;/strong&gt; iron, and tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; until the ceiling feels right. &lt;strong&gt;Share&lt;/strong&gt; what you learn—a &lt;strong&gt;dev.to&lt;/strong&gt; post, a blog, &lt;strong&gt;Mastodon&lt;/strong&gt;, article comments, or whatever community you use; real numbers beat brochure claims every time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One quiet takeaway:&lt;/em&gt; on &lt;strong&gt;your&lt;/strong&gt; codebases the model usually helps more as a &lt;strong&gt;copilot you feed&lt;/strong&gt;—a diff, a log slice, a trimmed README—than as an &lt;strong&gt;all-knowing reviewer&lt;/strong&gt; handed a bare URL or a polished persona. When the answer feels &lt;em&gt;too&lt;/em&gt; slick without anything concrete in the prompt, the limit is rarely the mini PC: the model is strictly &lt;strong&gt;text-in, text-out&lt;/strong&gt;, and nothing is reading your disk on its behalf. §10 walks through the evidence; day-to-day, &lt;strong&gt;you&lt;/strong&gt; supply the ground truth.&lt;/p&gt;

&lt;p&gt;AI disclosure: I wrote the technical walkthrough from my own setup (Ubuntu 24.04, llama.cpp + Vulkan, Minisforum mini PC, real llama-bench numbers and screenshots). I used AI tools (e.g. ChatGPT/Gemini/Cursor-style assistants) for brainstorming titles, structure, and Reddit post wording, and for editing clarity in places—not for inventing the commands, hardware facts, or benchmarks, which I ran and documented myself. The project itself (self-hosted stack) does not require callers to use cloud LLMs; it’s local inference. Happy to clarify further if needed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>gghstats-selfhosted: production-shaped manifests for gghstats</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 03 Apr 2026 15:43:55 +0000</pubDate>
      <link>https://dev.to/hrodrig/gghstats-selfhosted-production-shaped-manifests-for-gghstats-56j9</link>
      <guid>https://dev.to/hrodrig/gghstats-selfhosted-production-shaped-manifests-for-gghstats-56j9</guid>
      <description>&lt;p&gt;You already read &lt;strong&gt;&lt;a href="https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days"&gt;gghstats: Keep GitHub traffic past 14 days&lt;/a&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;gghstats&lt;/a&gt;&lt;/strong&gt; is the small Go service that keeps &lt;strong&gt;GitHub traffic history&lt;/strong&gt; in &lt;strong&gt;SQLite&lt;/strong&gt; instead of losing it after GitHub’s ~14-day window. The app ships binaries, a multi-arch &lt;strong&gt;GHCR&lt;/strong&gt; image, and a focused README.&lt;/p&gt;

&lt;p&gt;What lives &lt;em&gt;outside&lt;/em&gt; the application repo is everything that answers: &lt;strong&gt;how do I run this for real on my box, VPS, or cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted" rel="noopener noreferrer"&gt;gghstats-selfhosted&lt;/a&gt;&lt;/strong&gt; — a separate repository with &lt;strong&gt;deployment manifests only&lt;/strong&gt;: &lt;strong&gt;&lt;code&gt;docker run&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;Docker Compose&lt;/strong&gt; (minimal, Traefik + HTTPS, optional observability), and a &lt;strong&gt;Helm chart&lt;/strong&gt; for Kubernetes. No Go code here; it stays easy to fork, pin, and diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live app demo (read-only):&lt;/strong&gt; &lt;a href="https://gghstats.hermesrodriguez.com" rel="noopener noreferrer"&gt;gghstats.hermesrodriguez.com&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the app looks like
&lt;/h3&gt;

&lt;p&gt;Flat screenshots with a &lt;strong&gt;white&lt;/strong&gt; backdrop, perspective, and soft shadow (local asset pipeline — not in the GitHub repos):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Main dashboard (repository list):&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0hgzcvirisp5vwhtwss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0hgzcvirisp5vwhtwss.png" alt="gghstats — main dashboard" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repository detail (charts and tables):&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d3kh7t05ku9zv4fdue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d3kh7t05ku9zv4fdue.png" alt="gghstats — repository detail" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why split “app” and “how to run it”?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gghstats&lt;/strong&gt; = releases, security advisories, feature issues, container tags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gghstats-selfhosted&lt;/strong&gt; = Compose files, Helm templates, env patterns, and docs that change when &lt;em&gt;deployment&lt;/em&gt; stories evolve (Traefik labels, persistence, layout under &lt;strong&gt;&lt;code&gt;run/&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can ignore the self-hosted repo forever and still run from &lt;strong&gt;GHCR&lt;/strong&gt; with a one-liner — but if you want &lt;strong&gt;opinionated layouts&lt;/strong&gt; (shared &lt;strong&gt;&lt;code&gt;GGHSTATS_HOST_DATA&lt;/code&gt;&lt;/strong&gt;, secrets outside the git clone, optional Prometheus/Grafana/Loki behind the same Traefik network), the split keeps each repo readable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Who this is for
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosters&lt;/strong&gt; who already run a &lt;strong&gt;VPS&lt;/strong&gt;, &lt;strong&gt;Compose&lt;/strong&gt;, or a &lt;strong&gt;small Kubernetes&lt;/strong&gt; cluster and want a &lt;strong&gt;repeatable&lt;/strong&gt; layout instead of a one-off paste from Stack Overflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operators&lt;/strong&gt; who care about &lt;strong&gt;pinning an image tag&lt;/strong&gt;, a &lt;strong&gt;single persistent path&lt;/strong&gt; for SQLite, and &lt;strong&gt;secrets that never live in git&lt;/strong&gt; — the same reasons larger teams split “app” and “platform,” just with a tiny footprint.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What this layout tries to spare you
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Wiring &lt;strong&gt;Traefik + Let’s Encrypt&lt;/strong&gt; from scratch on every new service.&lt;/li&gt;
&lt;li&gt;Stuffing &lt;strong&gt;Compose samples, Helm, and env docs&lt;/strong&gt; into the &lt;strong&gt;application&lt;/strong&gt; repo (noisier release notes, harder security reviews).&lt;/li&gt;
&lt;li&gt;Forgetting &lt;strong&gt;which volume&lt;/strong&gt; holds &lt;strong&gt;&lt;code&gt;gghstats.db&lt;/code&gt;&lt;/strong&gt; when you bump the image six months later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Context (other approaches)
&lt;/h3&gt;

&lt;p&gt;GitHub’s Traffic view only goes back about &lt;strong&gt;14 days&lt;/strong&gt;. Other open-source projects chase “GitHub stats” with different goals (dashboards, exporters, hosted analytics). &lt;strong&gt;gghstats&lt;/strong&gt; stays narrow: &lt;strong&gt;persist&lt;/strong&gt; traffic via the API into &lt;strong&gt;SQLite&lt;/strong&gt;, ship one &lt;strong&gt;Go&lt;/strong&gt; binary or &lt;strong&gt;GHCR&lt;/strong&gt; image, and let a &lt;strong&gt;separate&lt;/strong&gt; repo own &lt;strong&gt;how&lt;/strong&gt; you run it. If that trade-off fits you, these manifests are the glue.&lt;/p&gt;


&lt;h2&gt;
  
  
  What you get in &lt;code&gt;run/&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Roughly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/standalone/{linux,macos,windows}/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notes for the binary-only path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single-container &lt;code&gt;docker run&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/minimal/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One service, quick VPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/traefik/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTPS + Let’s Encrypt + edge network for the app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/observability/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional Prometheus / Grafana / Loki (after Traefik)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/kubernetes/helm/gghstats/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Helm chart &lt;strong&gt;&lt;code&gt;gghstats&lt;/code&gt;&lt;/strong&gt; (same name as the app; &lt;strong&gt;not&lt;/strong&gt; the GitHub repo name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/kubernetes/manifests/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plain YAML if you prefer not to use Helm&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table in the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#pick-a-path" rel="noopener noreferrer"&gt;README&lt;/a&gt;&lt;/strong&gt; is the fastest way to jump to the flow you want.&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick starts (copy, adjust, run)
&lt;/h2&gt;

&lt;p&gt;Pick one path. Replace &lt;strong&gt;&lt;code&gt;ghp_xxx&lt;/code&gt;&lt;/strong&gt;, host paths, &lt;strong&gt;&lt;code&gt;your-github-user/*&lt;/code&gt;&lt;/strong&gt;, image tags, and domains with yours. Pin &lt;strong&gt;&lt;code&gt;ghcr.io/hrodrig/gghstats:&lt;/code&gt;&lt;/strong&gt; to a tag that exists on &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;GHCR / releases&lt;/a&gt;&lt;/strong&gt; (example below uses &lt;strong&gt;&lt;code&gt;v0.1.2&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  GitHub token (scopes and safety)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;GGHSTATS_GITHUB_TOKEN&lt;/code&gt;&lt;/strong&gt; must be a &lt;strong&gt;Personal Access Token&lt;/strong&gt; that can reach the repos matched by &lt;strong&gt;&lt;code&gt;GGHSTATS_FILTER&lt;/code&gt;&lt;/strong&gt;. Follow &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats#token-setup" rel="noopener noreferrer"&gt;gghstats — Token setup&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token type&lt;/th&gt;
&lt;th&gt;Scopes / access (summary)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Classic PAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;public_repo&lt;/code&gt;&lt;/strong&gt; — enough for &lt;strong&gt;public&lt;/strong&gt; repos only. Use &lt;strong&gt;&lt;code&gt;repo&lt;/code&gt;&lt;/strong&gt; if you track &lt;strong&gt;private&lt;/strong&gt; repositories (or use &lt;strong&gt;&lt;code&gt;GGHSTATS_INCLUDE_PRIVATE=true&lt;/code&gt;&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine-grained PAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grant access to the repositories you need; include whatever &lt;strong&gt;repository permissions&lt;/strong&gt; GitHub requires for the &lt;strong&gt;Traffic&lt;/strong&gt; and repo &lt;strong&gt;metadata&lt;/strong&gt; APIs for those repos (the token wizard lists them per permission).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Safety:&lt;/strong&gt; do &lt;strong&gt;not&lt;/strong&gt; commit the token, put it in a public gist, or paste it into issues. Prefer env vars, Compose &lt;strong&gt;&lt;code&gt;env_file&lt;/code&gt;&lt;/strong&gt;, or Kubernetes &lt;strong&gt;Secrets&lt;/strong&gt;. The app’s startup banner only shows a &lt;strong&gt;masked&lt;/strong&gt; token. If the dashboard is empty after sync, verify &lt;strong&gt;filter&lt;/strong&gt; rules and &lt;strong&gt;token scope&lt;/strong&gt; (see &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats#dashboard-shows-no-repositories" rel="noopener noreferrer"&gt;Troubleshooting&lt;/a&gt;&lt;/strong&gt; in the app README).&lt;/p&gt;
&lt;h3&gt;
  
  
  Binary (no Docker)
&lt;/h3&gt;

&lt;p&gt;Grab a release binary from &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;gghstats Releases&lt;/a&gt;&lt;/strong&gt;, extract, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx
./gghstats serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker (one container)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GGHSTATS_GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GGHSTATS_FILTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-github-user/*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; gghstats &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/hrodrig/gghstats:v0.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose (minimal, one service)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats-selfhosted.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gghstats-selfhosted
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;run/common/.env.example &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt;
&lt;span class="c"&gt;# Edit "${GGHSTATS_HOST_DATA}/.env" — at least GGHSTATS_GITHUB_TOKEN, GGHSTATS_VERSION, GGHSTATS_HOST_DATA&lt;/span&gt;

docker compose &lt;span class="nt"&gt;--env-file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; run/docker-compose/minimal/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose + Traefik (HTTPS, production-shaped)
&lt;/h3&gt;

&lt;p&gt;Requires a DNS &lt;strong&gt;A/AAAA&lt;/strong&gt; record pointing at this host, with ports &lt;strong&gt;80&lt;/strong&gt; and &lt;strong&gt;443&lt;/strong&gt; reachable from the internet (Let’s Encrypt validation happens over them).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats-selfhosted.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gghstats-selfhosted
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;run/common/.env.example &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt;
&lt;span class="c"&gt;# Edit "${GGHSTATS_HOST_DATA}/.env" — token, GGHSTATS_HOSTNAME, ACME_EMAIL, GGHSTATS_VERSION, GGHSTATS_HOST_DATA, …&lt;/span&gt;

docker compose &lt;span class="nt"&gt;--env-file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; run/docker-compose/traefik/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes (Helm)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add gghstats https://hrodrig.github.io/gghstats-selfhosted
helm repo update
helm show values gghstats/gghstats &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; my-values.yaml
&lt;span class="c"&gt;# Edit my-values.yaml — e.g. image.tag, persistence, resources; keep githubToken.value empty (PAT goes in the Secret below)&lt;/span&gt;

kubectl create namespace gghstats
kubectl create secret generic gghstats-secret &lt;span class="nt"&gt;-n&lt;/span&gt; gghstats &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;github-token&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx
helm &lt;span class="nb"&gt;install &lt;/span&gt;gghstats gghstats/gghstats &lt;span class="nt"&gt;-n&lt;/span&gt; gghstats &lt;span class="nt"&gt;-f&lt;/span&gt; my-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;my-values.yaml&lt;/code&gt;:&lt;/strong&gt; start from &lt;strong&gt;&lt;code&gt;helm show values&lt;/code&gt;&lt;/strong&gt; (above) so you inherit defaults and &lt;strong&gt;&lt;code&gt;values.schema.json&lt;/code&gt;&lt;/strong&gt; constraints (e.g. &lt;strong&gt;&lt;code&gt;resources&lt;/code&gt;&lt;/strong&gt;). Do &lt;strong&gt;not&lt;/strong&gt; put the PAT in that file — use the &lt;strong&gt;&lt;code&gt;Secret&lt;/code&gt;&lt;/strong&gt; and leave &lt;strong&gt;&lt;code&gt;githubToken.value&lt;/code&gt;&lt;/strong&gt; empty. Details: &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#kubernetes-helm" rel="noopener noreferrer"&gt;README — Kubernetes Helm&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example &lt;code&gt;my-values.yaml&lt;/code&gt; fragment&lt;/strong&gt; — token only in the &lt;strong&gt;&lt;code&gt;Secret&lt;/code&gt;&lt;/strong&gt; created above; the chart reads it via &lt;strong&gt;&lt;code&gt;githubToken.existingSecret&lt;/code&gt;&lt;/strong&gt; (or default &lt;strong&gt;&lt;code&gt;secretName&lt;/code&gt;&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Excerpt — always start from: helm show values gghstats/gghstats &amp;gt; my-values.yaml&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v0.1.2"&lt;/span&gt;

&lt;span class="na"&gt;githubToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;existingSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gghstats-secret"&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Helm chart security (defaults):&lt;/strong&gt; the workload runs &lt;strong&gt;non-root&lt;/strong&gt; (UID/GID &lt;strong&gt;1000&lt;/strong&gt;), with &lt;strong&gt;&lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;capabilities.drop: [ALL]&lt;/code&gt;&lt;/strong&gt;, and a &lt;strong&gt;&lt;code&gt;RuntimeDefault&lt;/code&gt; seccomp profile&lt;/strong&gt;; SQLite lives under the &lt;strong&gt;&lt;code&gt;/data&lt;/code&gt;&lt;/strong&gt; mount and &lt;strong&gt;&lt;code&gt;/tmp&lt;/code&gt;&lt;/strong&gt; is a small &lt;strong&gt;&lt;code&gt;emptyDir&lt;/code&gt;&lt;/strong&gt;. Adjust only if your image requires it — see the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/tree/main/run/kubernetes/helm/gghstats#readme" rel="noopener noreferrer"&gt;chart README&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
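&lt;p&gt;In rendered-manifest terms, those defaults correspond to a pod spec roughly like the excerpt below (illustrative only; the chart’s rendered output is the source of truth):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative excerpt, not the chart's exact output.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: gghstats
      securityContext:
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: data       # SQLite lives here
          mountPath: /data
        - name: tmp        # small emptyDir
          mountPath: /tmp
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;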

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;0.1.x&lt;/code&gt;&lt;/strong&gt; is still early; &lt;strong&gt;pin tags&lt;/strong&gt;, read &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/strong&gt; on upgrades, and expect manifests to evolve with releases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two repos, one story
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How the two repos relate&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  gghstats (app repo)              gghstats-selfhosted (deploy repo)
           |                                      |
           | builds                               | Compose, Helm, run/
           v                                      v
      GHCR image ─────────────┬──────────── Manifests
                              │
                              v
                       your VPS / Kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvys8aa0n6ajrawke36bs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvys8aa0n6ajrawke36bs.png" alt="How the two repos relate" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment manifests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats-selfhosted&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helm index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://hrodrig.github.io/gghstats-selfhosted/index.yaml" rel="noopener noreferrer"&gt;hrodrig.github.io/gghstats-selfhosted/index.yaml&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Versioning, contributing, changelog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#versioning" rel="noopener noreferrer"&gt;README — Versioning&lt;/a&gt; · &lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING&lt;/a&gt; · &lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you’re already self-hosting databases, proxies, and dashboards, &lt;strong&gt;gghstats&lt;/strong&gt; is one more small service — and &lt;strong&gt;gghstats-selfhosted&lt;/strong&gt; is the folder structure I wished existed when I wired mine up: copy &lt;strong&gt;&lt;code&gt;run/common/.env.example&lt;/code&gt;&lt;/strong&gt;, set &lt;strong&gt;&lt;code&gt;GGHSTATS_HOST_DATA&lt;/code&gt;&lt;/strong&gt;, choose Compose or Helm, and keep your PAT out of git.&lt;/p&gt;

&lt;p&gt;Questions and PRs welcome on &lt;strong&gt;&lt;code&gt;develop&lt;/code&gt;&lt;/strong&gt;; merge and releases follow the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;repo docs&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cross-posted from the author’s notes; exact commands, versioning, and release policy are always the repositories linked above.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclosure (Dev.to / transparency):&lt;/strong&gt; The author used &lt;strong&gt;AI-assisted editing&lt;/strong&gt; (e.g. drafting structure, wording, and Markdown) and &lt;strong&gt;reviewed and approved&lt;/strong&gt; the final text. Technical claims are meant to match the linked repositories at publish time; if something drifts, trust the repos and &lt;strong&gt;CHANGELOG&lt;/strong&gt; over this post.&lt;/p&gt;

</description>
      <category>gghstats</category>
      <category>selfhosted</category>
      <category>helm</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>gghstats: Keep GitHub traffic past 14 days</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Sun, 29 Mar 2026 04:41:28 +0000</pubDate>
      <link>https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days-3ckg</link>
      <guid>https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days-3ckg</guid>
      <description>&lt;p&gt;We've all been there. You ship an open-source project, a tiny CLI, or a docs site. You watch &lt;strong&gt;Insights → Traffic&lt;/strong&gt; for a week: views spike, clones climb, life is good.&lt;/p&gt;

&lt;p&gt;Then you come back a month later and ask a simple question: &lt;em&gt;did that blog post actually move the needle over time?&lt;/em&gt; GitHub’s answer is blunt: &lt;strong&gt;detailed traffic (views and clones) only lives in a rolling 14-day window.&lt;/strong&gt; Past that, the granularity is gone unless you exported it yourself.&lt;/p&gt;

&lt;p&gt;I wanted &lt;strong&gt;historical&lt;/strong&gt; traffic — without a SaaS middleman, without babysitting CSV exports, and with something I could run beside my other self-hosted stuff. That’s why I built &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;gghstats&lt;/a&gt;&lt;/strong&gt;. The first stable line is &lt;strong&gt;v0.1.0&lt;/strong&gt; (binaries on &lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;, multi-arch image on &lt;a href="https://github.com/hrodrig/gghstats/pkgs/container/gghstats" rel="noopener noreferrer"&gt;GHCR&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem in one sentence
&lt;/h2&gt;

&lt;p&gt;GitHub is a great place to host code; it is &lt;strong&gt;not&lt;/strong&gt; a long-term analytics warehouse for repository traffic. If you care about trends, seasonality, or “what happened after launch,” you need your &lt;strong&gt;own&lt;/strong&gt; copy of that data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What gghstats does
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;gghstats&lt;/strong&gt; is a small &lt;strong&gt;Go&lt;/strong&gt; service that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uses the GitHub API (with a personal access token) to &lt;strong&gt;pull&lt;/strong&gt; traffic metrics on a schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merges&lt;/strong&gt; them into a local &lt;strong&gt;SQLite&lt;/strong&gt; database so history accumulates instead of disappearing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serves&lt;/strong&gt; a web UI and JSON API so you can browse aggregates and per-repo charts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On &lt;strong&gt;startup&lt;/strong&gt; it runs a &lt;strong&gt;full sync&lt;/strong&gt; once (so repo discovery matches your filter right away), then repeats on &lt;strong&gt;&lt;code&gt;GGHSTATS_SYNC_INTERVAL&lt;/code&gt;&lt;/strong&gt; (default &lt;code&gt;1h&lt;/code&gt;). No waiting for the first tick to see data.&lt;/p&gt;
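&lt;p&gt;Once that first sync finishes, you can also hit the JSON API directly. The exact routes are documented in the app README; the path below is &lt;em&gt;illustrative only&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The /api/... path is a placeholder; check the gghstats README for the real routes.
curl -s http://localhost:8080/api/repos | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;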

&lt;p&gt;It’s deliberately boring technology: one binary, one file for the DB, backups = copy &lt;code&gt;gghstats.db&lt;/code&gt;.&lt;/p&gt;
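
&lt;p&gt;A minimal sketch of that backup story (paths here are illustrative, not from the repo). With the service stopped, a plain copy is enough; while it is writing, SQLite's online &lt;code&gt;.backup&lt;/code&gt; command is the safer variant, since it won't catch a half-written file:&lt;/p&gt;

```shell
# Date-stamped snapshot of the gghstats database (illustrative paths).
DB=/tmp/gghstats.db
BACKUP_DIR=/tmp/gghstats-backups

printf 'demo' > "$DB"    # stand-in for the real gghstats.db
mkdir -p "$BACKUP_DIR"

# Service stopped: a plain copy is fine. Service running: prefer
#   sqlite3 "$DB" ".backup $BACKUP_DIR/gghstats-$(date +%F).db"
cp "$DB" "$BACKUP_DIR/gghstats-$(date +%F).db"
```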

&lt;p&gt;&lt;strong&gt;Live demo (read-only UI):&lt;/strong&gt; &lt;a href="https://gghstats.hermesrodriguez.com" rel="noopener noreferrer"&gt;gghstats.hermesrodriguez.com&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack (opinionated and minimal)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Piece&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, single binary, easy to ship in Docker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No separate DB server; ship backups with the rest of your backups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chart.js&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Charts in the dashboard without a heavy frontend framework.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bootstrap grid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layout and responsive behavior without reinventing CSS — the UI is intentionally &lt;strong&gt;neo-brutalist&lt;/strong&gt; (hard borders, monospace, loud accents) so it feels like a tool, not a marketing site.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  “Works on my machine” wasn’t enough
&lt;/h2&gt;

&lt;p&gt;I wanted a &lt;strong&gt;production-shaped&lt;/strong&gt; repo, not just &lt;code&gt;go run&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker / Docker Compose&lt;/strong&gt; for local runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;Traefik&lt;/strong&gt;, Let’s Encrypt, and no public port on the app container — only 80/443 on the proxy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm chart&lt;/strong&gt; under &lt;code&gt;charts/gghstats&lt;/code&gt; for Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GoReleaser&lt;/strong&gt; + &lt;strong&gt;GitHub Actions&lt;/strong&gt; for releases, artifacts, and &lt;strong&gt;multi-arch&lt;/strong&gt; images (&lt;code&gt;linux/amd64&lt;/code&gt;, &lt;code&gt;linux/arm64&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever maintained a side project, you know the drag of “I’ll dockerize it later.” I put the boring work upfront so future me doesn’t hate present me.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works (high level)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zgc9f1or1s90abkbo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zgc9f1or1s90abkbo4.png" alt="gghstats — high-level data flow: GitHub API, gghstats service, SQLite, web dashboard, JSON API, browser, and scripts" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetch&lt;/strong&gt; — Scheduled sync using your token (scope: &lt;code&gt;repo&lt;/code&gt; for private repos you care about).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store&lt;/strong&gt; — Upserts into SQLite so you keep a &lt;strong&gt;timeline&lt;/strong&gt;, not a snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serve&lt;/strong&gt; — Dashboard for humans and JSON for scripts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Filtering (&lt;code&gt;GGHSTATS_FILTER&lt;/code&gt;, exclusions like &lt;code&gt;!fork&lt;/code&gt;, etc.) lives in env vars so you can keep the sync set tight.&lt;/p&gt;
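
&lt;p&gt;As a hedged example, a tight sync set could look like this (the token value is a placeholder, and the full &lt;code&gt;GGHSTATS_FILTER&lt;/code&gt; syntax beyond &lt;code&gt;!fork&lt;/code&gt; is documented in the repo README):&lt;/p&gt;

```shell
# Illustrative env config; check the README for the full variable list.
export GGHSTATS_GITHUB_TOKEN="ghp_placeholder"  # PAT; repo scope for private repos
export GGHSTATS_FILTER='!fork'                  # e.g. exclude forks from the sync set
export GGHSTATS_SYNC_INTERVAL=1h                # default per the docs
```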




&lt;h2&gt;
  
  
  Two numbers that matter (aggregate vs history)
&lt;/h2&gt;

&lt;p&gt;On the &lt;strong&gt;main&lt;/strong&gt; screen you see rollups: totals across the repos you track. That’s the “at a glance” view.&lt;/p&gt;

&lt;p&gt;The real payoff is opening &lt;strong&gt;one repository&lt;/strong&gt;: you see &lt;strong&gt;interval&lt;/strong&gt; stats (what GitHub is showing &lt;em&gt;now&lt;/em&gt; for the last ~14 days) &lt;strong&gt;next to&lt;/strong&gt; &lt;strong&gt;lifetime&lt;/strong&gt; totals from &lt;strong&gt;your&lt;/strong&gt; database. The gap between “what GitHub is willing to remember” and “what you kept” is the whole point — and the charts (clones, views, stars over time) are where SQLite stops being a file and starts being a &lt;strong&gt;memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main dashboard&lt;/strong&gt; (rollups across tracked repos):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff744n4t6sl6n4cip4mod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff744n4t6sl6n4cip4mod.png" alt="gghstats main dashboard — neo-brutalist UI, repo list and aggregates" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository detail&lt;/strong&gt; — interval stats from GitHub’s window next to &lt;strong&gt;lifetime&lt;/strong&gt; totals from SQLite, plus Chart.js trends (clones, views, stars):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetzy7upnzf4wddn8l346.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetzy7upnzf4wddn8l346.png" alt="gghstats repository detail — interval vs lifetime stats and historical charts" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Who should try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Maintainers who want &lt;strong&gt;long-term&lt;/strong&gt; traffic context.&lt;/li&gt;
&lt;li&gt;People who already self-host and want &lt;strong&gt;data sovereignty&lt;/strong&gt; (your DB, your VPS, your rules).&lt;/li&gt;
&lt;li&gt;Anyone allergic to “sign up for another analytics product to see GitHub stats.”&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From the repo (Compose):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats.git &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;gghstats
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# set GGHSTATS_GITHUB_TOKEN, tune GGHSTATS_FILTER if needed&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# open http://localhost:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Published image only&lt;/strong&gt; (no clone — see README for all env vars):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/hrodrig/gghstats:v0.1.0
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; GGHSTATS_GITHUB_TOKEN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; GGHSTATS_FILTER &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;/gghstats-data:/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/hrodrig/gghstats:v0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production-oriented compose (Traefik, TLS) lives in &lt;strong&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/strong&gt; — see the repo README for env vars like &lt;code&gt;GGHSTATS_HOSTNAME&lt;/code&gt; and &lt;code&gt;ACME_EMAIL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues and PRs welcome. If this saves you from losing another year of traffic history, it was worth writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you capture or export GitHub traffic today&lt;/strong&gt; — CSV dumps, scripts, or nothing at all? I'm curious to hear what others do in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;This article was drafted from my own notes and a long brainstorming thread with Gemini (analysis, structure, and image ideas). The code, the rough edges, and the neo-brutalist UI are mine — blame me for the bugs, not the LLM.&lt;/p&gt;

</description>
      <category>github</category>
      <category>go</category>
      <category>selfhosted</category>
      <category>devtools</category>
    </item>
    <item>
      <title>pgwd in Production: From Alerts to Runbook</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:16:50 +0000</pubDate>
      <link>https://dev.to/hrodrig/pgwd-in-production-from-alerts-to-runbook-2k42</link>
      <guid>https://dev.to/hrodrig/pgwd-in-production-from-alerts-to-runbook-2k42</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a production-focused follow-up to my original post: &lt;a href="https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg"&gt;pgwd: A Watchdog for Your PostgreSQL Connections&lt;/a&gt;. In this one, I show what pgwd looked like in action, how we responded, and what we changed in a controlled way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When PostgreSQL connection pressure builds up, the real problem is not just crossing &lt;code&gt;max_connections&lt;/code&gt;; it is crossing it without operational context.&lt;/p&gt;

&lt;p&gt;That is where &lt;code&gt;pgwd&lt;/code&gt; helped us most: not as "just another alert sender," but as a signal-to-action layer for on-call decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened (anonymized timeline)
&lt;/h2&gt;

&lt;p&gt;In one of our production environments, we saw a fast escalation pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;attention&lt;/code&gt; (75%)&lt;/li&gt;
&lt;li&gt;then &lt;code&gt;alert&lt;/code&gt; (85%)&lt;/li&gt;
&lt;li&gt;then repeated &lt;code&gt;danger&lt;/code&gt; (95%+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All within a relatively short window.&lt;/p&gt;

&lt;p&gt;The key signal from Slack was not only the threshold level, but also the breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;active&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;idle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_connections&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;plus &lt;code&gt;cluster&lt;/code&gt;, &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, and &lt;code&gt;client&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That context let us decide quickly whether we were seeing true workload pressure, connection churn, or an idle-heavy pattern that still threatened capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsgqf3jnk26cec8twwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsgqf3jnk26cec8twwj.png" alt="Anonymized pgwd Slack alerts timeline" width="526" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why threshold levels matter in production
&lt;/h2&gt;

&lt;p&gt;A 3-tier model (75/85/95) maps well to real operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention (75%)&lt;/strong&gt;: observe trend, prepare people&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert (85%)&lt;/strong&gt;: start mitigation planning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Danger (95%)&lt;/strong&gt;: execute containment now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevented us from waiting for &lt;code&gt;FATAL: sorry, too many clients already&lt;/code&gt; as the first real signal.&lt;/p&gt;
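
&lt;p&gt;For a ceiling of, say, &lt;code&gt;max_connections = 2048&lt;/code&gt; (the value this environment started from), the three tiers land at concrete trip points; a quick sketch:&lt;/p&gt;

```shell
# Where the 3-tier thresholds land for a 2048-connection ceiling.
MAX=2048
for pct in 75 85 95; do
  echo "${pct}% of ${MAX} = $(( MAX * pct / 100 )) connections"
done
# -> 1536, 1740, and 1945 connections
```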

&lt;h2&gt;
  
  
  The runbook we used (manual-first)
&lt;/h2&gt;

&lt;p&gt;For this rollout, we intentionally chose controlled, operator-present execution.&lt;br&gt;&lt;br&gt;
No unattended automation for critical steps yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Attention (&lt;code&gt;&amp;gt;=75%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Confirm trend across intervals (not just one spike)&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;active&lt;/code&gt; vs &lt;code&gt;idle&lt;/code&gt; ratio&lt;/li&gt;
&lt;li&gt;Verify affected DB scope (single DB vs multiple)&lt;/li&gt;
&lt;li&gt;Open an observation incident thread&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) Alert (&lt;code&gt;&amp;gt;=85%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Engage app + platform on-call&lt;/li&gt;
&lt;li&gt;Correlate with scheduled jobs, batch windows, and maintenance events&lt;/li&gt;
&lt;li&gt;Reduce non-critical pressure if possible&lt;/li&gt;
&lt;li&gt;Prepare containment action&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3) Danger (&lt;code&gt;&amp;gt;=95%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Execute mitigation immediately (controlled maintenance/throttling based on internal SOP)&lt;/li&gt;
&lt;li&gt;Prioritize availability restoration&lt;/li&gt;
&lt;li&gt;Capture timestamps for post-incident learning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we are changing now: &lt;code&gt;max_connections&lt;/code&gt; to &lt;code&gt;3192&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Based on this run, one concrete action in our runbook is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase PostgreSQL &lt;code&gt;max_connections&lt;/code&gt; from &lt;code&gt;2048&lt;/code&gt; to &lt;code&gt;3192&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Apply the change in a controlled session with operators present&lt;/li&gt;
&lt;li&gt;Monitor closely after the change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not "increase and forget."&lt;br&gt;&lt;br&gt;
It is "increase, observe, validate, and adjust."&lt;/p&gt;
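
&lt;p&gt;On a self-managed PostgreSQL, the controlled session boils down to something like the sketch below; operators on Kubernetes or a managed service would apply the same value through their operator/CRD or parameter group instead, so treat the restart step as a platform assumption:&lt;/p&gt;

```shell
# ALTER SYSTEM persists the value to postgresql.auto.conf;
# max_connections only takes effect after a restart.
psql -c "ALTER SYSTEM SET max_connections = 3192;"
sudo systemctl restart postgresql   # or your platform's restart procedure
psql -Atc "SHOW max_connections;"   # verify the new ceiling before closing the session
```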

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ewu5u8t93rh5ie690.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ewu5u8t93rh5ie690.png" alt="Configuration update with max_connections and resource values" width="316" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important guardrail: do not solve pressure by over-sizing blindly
&lt;/h2&gt;

&lt;p&gt;Raising connection limits without infrastructure awareness can create a different failure mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more connections -&amp;gt; more backend memory/CPU pressure&lt;/li&gt;
&lt;li&gt;more pressure -&amp;gt; noisy performance and unstable pods/nodes&lt;/li&gt;
&lt;li&gt;teams then over-allocate resources reactively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So our runbook explicitly includes this guardrail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track infrastructure headroom (CPU/memory) after increasing limits&lt;/li&gt;
&lt;li&gt;Validate DB and app behavior under the new ceiling&lt;/li&gt;
&lt;li&gt;Avoid resource over-sizing unless data supports it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Minimal commands (hybrid style)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic daemon mode&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_DB_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres://..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_SLACK_WEBHOOK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://hooks.slack.com/..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60
pgwd

&lt;span class="c"&gt;# Verify notifier delivery before/after changes&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-force-notification&lt;/span&gt;

&lt;span class="c"&gt;# Optional: run against Postgres service in Kubernetes&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; &amp;lt;namespace&amp;gt;/svc/postgres &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:...@localhost:5432/db"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Post-change verification checklist (24h / 72h)
&lt;/h2&gt;

&lt;p&gt;After increasing to &lt;code&gt;3192&lt;/code&gt;, we track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert frequency by level (&lt;code&gt;attention&lt;/code&gt; / &lt;code&gt;alert&lt;/code&gt; / &lt;code&gt;danger&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;active&lt;/code&gt; / &lt;code&gt;idle&lt;/code&gt; behavior by database&lt;/li&gt;
&lt;li&gt;Peak total connections vs new headroom&lt;/li&gt;
&lt;li&gt;App error rates and latency around peak windows&lt;/li&gt;
&lt;li&gt;Infrastructure utilization trend (not just point-in-time)&lt;/li&gt;
&lt;/ul&gt;
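
&lt;p&gt;The &lt;code&gt;active&lt;/code&gt;/&lt;code&gt;idle&lt;/code&gt; line in that checklist is the same data pgwd reads; to spot-check it by hand during the verification window (requires a live connection, sketched here with &lt;code&gt;psql&lt;/code&gt;):&lt;/p&gt;

```shell
# Per-database connection states, straight from pg_stat_activity.
psql -c "SELECT datname, state, count(*)
           FROM pg_stat_activity
          WHERE datname IS NOT NULL
          GROUP BY datname, state
          ORDER BY datname, state;"
```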

&lt;p&gt;Success criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No sustained &lt;code&gt;danger&lt;/code&gt; periods&lt;/li&gt;
&lt;li&gt;Fewer repeated escalations&lt;/li&gt;
&lt;li&gt;Stable app behavior&lt;/li&gt;
&lt;li&gt;No unjustified resource inflation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pgwd&lt;/code&gt; is most valuable when tied to a runbook, not only to notifications.&lt;/li&gt;
&lt;li&gt;Alert levels should map to explicit operator actions.&lt;/li&gt;
&lt;li&gt;Controlled, manual-first execution is safer for critical production changes.&lt;/li&gt;
&lt;li&gt;Increasing &lt;code&gt;max_connections&lt;/code&gt; can be right, if paired with disciplined capacity monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Community note
&lt;/h2&gt;

&lt;p&gt;If you want a complementary intro in French (installation + quick setup), this community write-up is also useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cyberplanete.net/pgwd-surveillance-postgresql/" rel="noopener noreferrer"&gt;Surveillez votre base de données PostgreSQL avec pgwd&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you run PostgreSQL in production, I would love to hear your threshold strategy and runbook design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original intro post: &lt;a href="https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg"&gt;pgwd: A Watchdog for Your PostgreSQL Connections&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install: &lt;code&gt;go install github.com/hrodrig/pgwd@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Repo/docs/releases: &lt;a href="https://github.com/hrodrig/pgwd" rel="noopener noreferrer"&gt;github.com/hrodrig/pgwd&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Disclosure: This post was drafted with AI assistance and reviewed by the author.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>postgres</category>
    </item>
    <item>
      <title>pgwd: A Watchdog for Your PostgreSQL Connections</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Mon, 02 Mar 2026 03:37:45 +0000</pubDate>
      <link>https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg</link>
      <guid>https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Stop guessing when your database is about to run out of connections.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You’ve seen it before: an app starts failing with &lt;strong&gt;"sorry, too many clients already"&lt;/strong&gt;, and you only notice when users complain. By then, the database is saturated, and even your admin tools can’t connect. &lt;strong&gt;pgwd&lt;/strong&gt; (Postgres Watch Dog) is a small Go CLI that watches your connection counts and alerts you &lt;em&gt;before&lt;/em&gt; you hit the limit—and when you can’t connect at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has a &lt;code&gt;max_connections&lt;/code&gt; limit. When you exceed it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New connections are rejected with &lt;strong&gt;FATAL: sorry, too many clients already (SQLSTATE 53300)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If your app uses a superuser (or a role that can use all slots), even DBA access can be blocked unless you’ve reserved slots with &lt;code&gt;superuser_reserved_connections&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
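
&lt;p&gt;Both numbers are easy to inspect by hand before wiring up a watchdog (sketched with &lt;code&gt;psql&lt;/code&gt;; &lt;code&gt;superuser_reserved_connections&lt;/code&gt; defaults to 3):&lt;/p&gt;

```shell
# Current usage vs. the ceiling, checked manually.
psql -Atc "SHOW max_connections;"
psql -Atc "SHOW superuser_reserved_connections;"    # slots held back for superusers
psql -Atc "SELECT count(*) FROM pg_stat_activity;"  # connections in use right now
```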

&lt;p&gt;Without something watching connection usage, you only find out when things are already broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10z21k0g3zzo7p2opor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10z21k0g3zzo7p2opor.png" alt="Flow: PostgreSQL → pgwd (thresholds) → Slack / Loki" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How pgwd fits in: it watches your Postgres and pushes alerts to Slack and/or Loki when thresholds are exceeded or the connection fails.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The idea: watch and alert
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pgwd&lt;/strong&gt; connects to your Postgres (directly or via Kubernetes), reads connection stats from &lt;code&gt;pg_stat_activity&lt;/code&gt;, and sends alerts to &lt;strong&gt;Slack&lt;/strong&gt; and/or &lt;strong&gt;Loki&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total&lt;/strong&gt; or &lt;strong&gt;active&lt;/strong&gt; connections cross a threshold. By default, pgwd uses &lt;strong&gt;3-tier levels&lt;/strong&gt; (75%, 85%, 95%) with distinct severities: &lt;strong&gt;attention&lt;/strong&gt;, &lt;strong&gt;alert&lt;/strong&gt;, and &lt;strong&gt;danger&lt;/strong&gt;—so you can escalate response as usage grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle&lt;/strong&gt; or &lt;strong&gt;stale&lt;/strong&gt; connections exceed limits (useful for spotting connection leaks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The connection fails&lt;/strong&gt;—including "too many clients"—so you get an urgent notification even when pgwd itself can’t connect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you get notified when you’re &lt;em&gt;approaching&lt;/em&gt; the limit, and again when the instance is &lt;em&gt;already&lt;/em&gt; saturated. Loki streams include &lt;strong&gt;labels&lt;/strong&gt; (&lt;code&gt;database&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;) for LogQL filtering and Grafana alert rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pzqlxhqlynaq0kgib8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pzqlxhqlynaq0kgib8x.png" alt="Example Slack alert when Postgres returns " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;When the DB is saturated, pgwd sends an urgent alert like this to your notifiers.&lt;/em&gt;&lt;/p&gt;
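
&lt;p&gt;Those labels make the streams easy to slice; for example, with Grafana's &lt;code&gt;logcli&lt;/code&gt; (the address and label values below are assumptions for illustration):&lt;/p&gt;

```shell
# Pull only danger-level pgwd events for one database from Loki.
export LOKI_ADDR=http://localhost:3100   # example Loki address
logcli query '{database="mydb", cluster="prod"} |= "danger"'
```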
&lt;h2&gt;
  
  
  pgwd in action
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Minimal setup (one-shot from cron)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Alert at 75%, 85%, 95% of max_connections (default 3-tier levels)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;default&lt;span class="o"&gt;)&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:pass@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://hooks.slack.com/services/..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No need to set thresholds if you’re fine with the defaults: pgwd reads &lt;code&gt;max_connections&lt;/code&gt; from the server and applies 75%, 85%, 95% (attention, alert, danger). Use &lt;code&gt;-threshold-levels&lt;/code&gt; to customize.&lt;/p&gt;
&lt;h3&gt;
  
  
  Daemon mode (continuous watch)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_DB_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres://localhost:5432/mydb"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_SLACK_WEBHOOK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://hooks.slack.com/..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60

pgwd
&lt;span class="c"&gt;# Runs every 60 seconds; exit with SIGTERM.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Catch connection leaks (stale connections)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://..."&lt;/span&gt; &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-stale-age&lt;/span&gt; 600 &lt;span class="nt"&gt;-threshold-stale&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alerts when any connection has been open longer than 10 minutes—handy for spotting leaks or long-running transactions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Postgres in Kubernetes
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; default/svc/postgres &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:DISCOVER_MY_PASSWORD@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;pgwd runs &lt;code&gt;kubectl port-forward&lt;/code&gt;, can read the password from the pod’s environment, and connects to localhost. Alerts include cluster/namespace/service context.&lt;/p&gt;
&lt;h3&gt;
  
  
  Loki inside Kubernetes
&lt;/h3&gt;

&lt;p&gt;When Loki runs inside the cluster and pgwd runs outside (e.g. a VM with cron), use &lt;code&gt;-kube-loki&lt;/code&gt; so pgwd can port-forward to Loki as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; default/svc/postgres &lt;span class="nt"&gt;-kube-loki&lt;/span&gt; monitoring/svc/loki &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:DISCOVER_MY_PASSWORD@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pgwd port-forwards to Loki and pushes to &lt;code&gt;localhost:3100&lt;/code&gt; automatically; no &lt;code&gt;-loki-url&lt;/code&gt; needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test alerts without touching Postgres
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;-test-max-connections N&lt;/code&gt; to simulate exceeded thresholds, even against production. pgwd treats &lt;code&gt;N&lt;/code&gt; as the effective &lt;code&gt;max_connections&lt;/code&gt; for threshold calculation—stats stay real, only the denominator changes. Handy for validating Slack/Loki/Grafana alerts without modifying Postgres config. See &lt;code&gt;docs/testing-alert-levels.md&lt;/code&gt; for the full procedure.&lt;/p&gt;

&lt;h3&gt;
  
  
  When things go wrong
&lt;/h3&gt;

&lt;p&gt;If the database is unreachable or returns "too many clients", pgwd &lt;strong&gt;always&lt;/strong&gt; sends an urgent alert to your notifiers (when configured)—no extra flag needed. So even in the worst case, you get a Slack/Loki message instead of silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;go install github.com/hrodrig/pgwd@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repo and docs:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd" rel="noopener noreferrer"&gt;github.com/hrodrig/pgwd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Releases:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; (v0.5.0 — binaries, Docker &lt;code&gt;ghcr.io/hrodrig/pgwd&lt;/code&gt;, Homebrew)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki + Grafana:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/blob/main/docs/loki-grafana-alerts.md" rel="noopener noreferrer"&gt;docs/loki-grafana-alerts.md&lt;/a&gt; — labels, LogQL, alert rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test alerts safely:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/blob/main/docs/testing-alert-levels.md" rel="noopener noreferrer"&gt;docs/testing-alert-levels.md&lt;/a&gt; — trigger attention/alert/danger without changing Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One-shot, daemon, or cron—with Slack and/or Loki you can stop flying blind on connection usage and get ahead of "too many clients" before it hits production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: This post was drafted with AI assistance and reviewed by the author.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
