Hands-on guide based on a real setup: Ubuntu 24.04 LTS, AMD Radeon 760M (Ryzen iGPU), lots of RAM (e.g. 96 GiB), llama.cpp built with GGML_VULKAN, OpenAI-compatible API via llama-server, Open WebUI in Docker, and OpenCode or VS Code (§11) using the same API.
Who this is for: if you buy (or plan to buy) a mini PC or small tower with plenty of RAM and disk, this walkthrough gets you to local inference — GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is AMD iGPU + Vulkan; if your hardware differs, keep the Ubuntu → llama.cpp → weights → server flow and adjust §5–§6 (deps and build) for your GPU.
Reference hardware (validated while writing this guide): Minisforum UM760 Slim mini PC (Device Type: MINI PC on the chassis label; vendor Minisforum / Micro Computer (HK) Tech Limited) with AMD Ryzen 5 7640HS, Radeon 760M Graphics, 96 GiB DDR5 RAM, ~1 TiB NVMe, Ubuntu 24.04 LTS. This is not a minimum-requirements bar—it anchors compile times, download comfort, and token throughput vs other CPUs, RAM, or disks. To verify memory type and size on your box, see §3 (Quick hardware inventory). A photo of the box is at the end, under Closing thoughts.
Replace YOUR_USER, model paths, and hostname as needed. If the machine is server-only (no monitor), start with §4.
TL;DR
Too long; didn’t read — one-minute skim before the full guide. Full table of contents →
- What you’re building: local inference on Ubuntu 24.04 with llama.cpp + Vulkan, a GGUF weights file, an OpenAI-style API via `llama-server` (`:8080`); optional Open WebUI in Docker (`:3000`); OpenCode and Visual Studio Code can talk to the same `http://…:8080/v1` base URL as an OpenAI-compatible provider (§11).
- Shortest path: BIOS/UMA if relevant (§2) → deps + Vulkan (§5) → build llama.cpp (§6) → download the `.gguf` (§7: `wget --continue` or `huggingface-cli`; `screen`/`tmux` for long SSH sessions) → smoke-test `llama-cli` → run `llama-server` manually or under systemd (§8–§9) → point Open WebUI at the host (§10) → optional: OpenCode / VS Code (§11).
- Tight RAM / OOM: run `llama-cli` as the same user as the service; match its `-c`/`-ngl` to `ExecStart`; if it fails, drop `-c` (e.g. 4096) and `-ngl` (e.g. 40) before chasing 99 / 999. Don’t enable the unit until the GGUF is fully downloaded.
- More models: §7 covers Gemma 4, Qwen Coder, DeepSeek Lite, Llama 3.1 (downloads, `huggingface-cli`, quick tests).
- Swap in `YOUR_USER`, paths, and hostname; server-only box → start at §4.
Table of contents
Links jump to headings on GitHub, Cursor, and most Markdown viewers. If a link does not resolve in your viewer, search for the heading text in the file.
- TL;DR
- 1. Context and choices
- 2. BIOS (before or right after installing Ubuntu)
- 3. Installing Ubuntu
- 4. Ubuntu Server without a desktop (headless)
- 5. Base dependencies and Vulkan check
- 6. Building llama.cpp with Vulkan
- 7. GGUF models and paths
  - What GGUF is (name, role, trade-offs)
  - Quant labels in filenames (Q2, Q4, Q8, suffixes like `_K_M`, IQ…)
  - Where models live and how to list them
  - Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
  - Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
  - Example: local Llama 3.1 8B Instruct Q8_0
  - `llama-bench`: measure throughput (tokens/s)
  - Quick terminal test
  - Adding or switching models
  - Experimenting with more models: setup, testing, and limits
  - One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
    - Common steps (every model swap)
    - Reference table (repos + sample file)
    - Download (`wget --continue`, one file per command)
    - Per-model quick test (right after download)
    - Typical `ExecStart` tweaks (example)
- 8. Minimal web server (`llama-server`)
- 9. systemd service (start on boot)
- 10. Open WebUI with Docker (port 3000 → backend on 8080)
  - Connect Open WebUI to llama-server
  - Chat up and running (example)
  - No browsing or GitHub fetch: real limits (and confident wrong answers)
  - Model picker shows “No results found” / no models listed
  - “Failed to fetch models” under Ollama (Settings → Models)
  - Updating Open WebUI (Docker)
  - If you also run Ollama
- 11. OpenCode and VS Code with your `llama-server`
- 12. Troubleshooting: Vulkan / `glslc` on Ubuntu 24.04
- 13. Performance and models (rough guide)
- 14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
- Final checklist
- Quick port reference
- Closing thoughts
1. Context and choices
| Topic | Recommendation |
|---|---|
| OS | Ubuntu 24.04 LTS (desktop or server; server without a GUI saves RAM). |
| AMD iGPU | Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference. |
| Models | GGUF format; Q4_K_M quantization (balance) or Q8_0 (higher quality, larger). |
| Engine |
llama.cpp with -DGGML_VULKAN=1 uses the GPU for layers (-ngl). |
| Lots of RAM | You can load large models in system RAM even if the iGPU has little dedicated VRAM; the BIOS can give the GPU a larger framebuffer (see §2). |
Reference diagram (browser / container / host):
2. BIOS (before or right after installing Ubuntu)
On Minisforum boxes (e.g. UM760 Slim) with AMI BIOS and Ryzen:
- Enter BIOS (Del, F2, or F7 on many systems).
- Typical path: Advanced → AMD CBS → NBIO Common Options → GFX Configuration.
- Set UMA Frame Buffer Size (or similar) from Auto / 2 GiB to 8 GiB or 16 GiB if available.
Goal: give the iGPU more unified memory for model layers; with plenty of system RAM the trade-off is usually worth it.
3. Installing Ubuntu
- Enable third-party software for graphics and Wi‑Fi if you use the graphical installer.
- The minimal install drops extra packages if the box is mainly an inference server.
Typical order of this guide (§4 and §10 are optional depending on your setup):
Quick hardware inventory (optional)
Before picking huge models and quantizations, check RAM, disk on /, and whether the integrated GPU shows up on the PCI bus (this does not replace a Vulkan test, but it sets expectations).
sudo lspci | grep -i -E 'vga|3d|display'
free -h
df -h /
What to look for in lspci: on Ryzen Phoenix / Hawk Point boards you often see something like VGA compatible controller: … Phoenix1 plus an AMD HDMI audio line. The marketing name “Radeon 760M” may not appear verbatim; the real check is that an AMD VGA/Display controller exists and that vulkaninfo / llama-cli see RADV (§4–§5).
- `free`: total and available RAM tell you how large a GGUF you can keep comfortably in memory alongside the OS.
- `df`: each `.gguf` costs whatever the card lists (e.g. ~8 GiB for an 8B Q8_0); leave headroom for updates, Docker, and rebuilds.
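Those two numbers can be turned into a rough go/no-go check before committing to a multi-GiB download. A minimal sketch — the `GGUF_GIB` value and the 8 GiB headroom are assumptions to adjust, not rules:

```shell
# Rough fit check: compare the quant size from the model card against the
# "available" column of free. GGUF_GIB and the 8 GiB headroom are assumptions.
GGUF_GIB=17   # e.g. the Q4_K_M size listed on the model card
AVAIL_GIB="$(free -g | awk '/^Mem:/ {print $7}')"
if [ "$AVAIL_GIB" -gt $((GGUF_GIB + 8)) ]; then
  echo "likely fits: ${AVAIL_GIB} GiB available for a ${GGUF_GIB} GiB GGUF"
else
  echo "tight: consider a smaller quant (e.g. Q4 instead of Q8)"
fi
```

Context (`-c`) and other processes eat into that margin too, so treat a pass here as “worth downloading,” not a guarantee.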
DDR4 vs DDR5 (re-check RAM type): data comes from firmware SMBIOS. Install sudo apt install -y dmidecode if needed. Note: some dmidecode builds indent fields with spaces, not tabs—an overly strict grep can print nothing even when DMI works.
# One line per interesting field (tab- or space-indented)
sudo dmidecode -t memory 2>/dev/null | grep -iE 'Locator|Size:|Type:|Speed:|Configured Memory Speed:'
If that is still empty, dump the start of the table—some boards expose only a subset of fields:
sudo dmidecode -t memory | head -n 120
For each populated slot, Type: should read DDR5, DDR4, etc. All-Unknown fields or an empty dump may mean a locked-down BIOS, a hypervisor restriction, or firmware that needs an update—cross-check the mini PC spec sheet or the DIMM/SODIMM silkscreen/label. Ryzen 7040 mobile (e.g. 7640HS) is usually DDR5-only on recent kits; still verify through one of these paths.
4. Ubuntu Server without a desktop (headless)
When the mini PC only serves the model (SSH + browser on another machine), Ubuntu Server 24.04 LTS saves RAM and attack surface by skipping GNOME and desktop services.
Installation
- Download the Ubuntu Server ISO from ubuntu.com/download/server.
- In the installer, enable OpenSSH for remote administration.
- Create a normal user with `sudo` (this guide assumes that user’s `$HOME`).
- BIOS (§2) is configured the same as on a desktop.
Networking
After first boot:
hostname -I
sudo systemctl status ssh
Open only what you need in the firewall (e.g. SSH, and later 8080/3000 if not using VPN only):
sudo apt install -y ufw
sudo ufw allow OpenSSH
# Optional: sudo ufw allow 8080/tcp && sudo ufw allow 3000/tcp
sudo ufw enable
Vulkan without a display (vkcube not applicable)
Server images have no display server by default: you cannot run vkcube unless you add a minimal GUI just for that test. To validate Vulkan from the console:
sudo apt update
sudo apt install -y vulkan-tools
vulkaninfo --summary 2>/dev/null | head -n 80
What to look for: besides the instance version (e.g. Vulkan Instance Version: 1.4.x), the Devices: section should list your AMD GPU (deviceName like Radeon …, deviceType INTEGRATED_GPU or DISCRETE_GPU, vendorID 0x1002 on AMD hardware).
Real-world sample (trimmed): you often see the instance and a long extension list first; Devices: comes later. As a normal user you may see only a software device:
Vulkan Instance Version: 1.4.313
...
Devices:
========
GPU0:
apiVersion = 1.4.318
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM …, 256 bits)
driverName = llvmpipe
Same machine, but sudo shows the Radeon: if your user only gets llvmpipe but root sees e.g. GPU0 AMD Radeon 760M Graphics (RADV PHOENIX) (vendorID 0x1002, INTEGRATED_GPU) and GPU1 llvmpipe, the kernel and Mesa are fine; your user lacks permission on the DRM nodes (/dev/dri/renderD*). You should not run llama-server as root long-term to “fix” Vulkan—fix group membership instead.
groups # should include render and video
ls -l /dev/dri/
sudo usermod -aG render,video "$USER"
# Log out of the desktop session or reboot, then (tighter grep: a broad
# GPU|deviceName|deviceType pattern may also match layer descriptions containing "GPU"):
vulkaninfo --summary 2>/dev/null | grep -E '^GPU[0-9]+:|^[[:space:]]+device(Name|Type)' | head -n 30
Expected output without sudo (RADV as GPU0, llvmpipe as an extra device):
GPU0:
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon 760M Graphics (RADV PHOENIX)
GPU1:
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM 20.1.2, 256 bits)
Typical “before” example: if groups does not list render or video, and you only see entries like adm cdrom sudo dip plugdev users lpadmin docker, that matches “Vulkan as your user = llvmpipe only; as root = RADV + llvmpipe”.
After usermod: the command may print nothing, but your already-running session keeps the old group set—groups in the same shell will not change until you log out of the desktop (or reboot). Open a new terminal and check again; id -nG is a handy way to list all group names. For a quick test without logging out of the whole session: newgrp render (spawns a subshell with that group active; fine for testing only).
On Ubuntu 24.04 the groups are usually render and video. Once the new session includes them, vulkaninfo without sudo should list the AMD device as well as llvmpipe.
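The membership check can be scripted so you stop squinting at `groups` output; this small loop only uses commands already shown in this section:

```shell
# Check the current session for the two DRM groups; "missing" right after
# usermod usually just means you have not logged out and back in yet.
for g in render video; do
  if id -nG | tr ' ' '\n' | grep -qx "$g"; then
    echo "$g: OK"
  else
    echo "$g: missing (usermod done? log out and back in)"
  fi
done
```

Run it in a fresh terminal after logging back in; both lines should say OK before you expect RADV without sudo.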
A healthy summary often has the Radeon as GPU0 and llvmpipe as an extra entry:
GPU0:
vendorID = 0x1002
deviceType = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
deviceName = AMD Radeon 760M Graphics (RADV PHOENIX)
driverName = radv
GPU1:
deviceType = PHYSICAL_DEVICE_TYPE_CPU
deviceName = llvmpipe (LLVM …)
Only llvmpipe even as root: llvmpipe (PHYSICAL_DEVICE_TYPE_CPU) is Mesa’s CPU-only Vulkan implementation, so the iGPU is not in the Vulkan device list at all. Check lspci -nn | grep -i vga, the amdgpu module, mesa-vulkan-drivers, and BIOS. On very minimal servers the render stack may still need setup before Vulkan enumerates the chip.
Rest of this guide
Install the same packages as §5, build llama.cpp in §6, and use Open WebUI from another PC at http://SERVER_IP:3000. Docker + llama-server does not require a graphical session on the server.
5. Base dependencies and Vulkan check
sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git libvulkan-dev vulkan-tools
Confirm the GPU is visible:
vkcube
A window with a spinning cube should open. Close it when done.
If vkcube works but vulkaninfo --summary as your user still shows only llvmpipe, add the same render and video groups as in §4 (and log out/in).
6. Building llama.cpp with Vulkan
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j"$(nproc)"
If cmake fails with Could NOT find Vulkan or missing: glslc, go to §12 (common on Ubuntu 24.04).
Update and rebuild llama.cpp
Newer GGUF architectures (Gemma 4, recent MoE builds, etc.) often need a fresh llama.cpp. Before blaming the weight file, update the tree and rebuild the same build folder (or wipe build and rerun CMake if CMakeLists changed a lot):
cd "$HOME/llama.cpp"
git pull
cmake --build build --config Release -j"$(nproc)"
If git pull changes CMake heavily and linking fails:
rm -rf build
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j"$(nproc)"
After rebuilding, if you use §9, restart so the service picks up new binaries: sudo systemctl restart llama-web.service. Check journalctl -u llama-web.service -n 30 --no-pager if a GGUF is rejected.
Useful binaries:
- `build/bin/llama-cli` — terminal tests.
- `build/bin/llama-server` — HTTP API compatible with OpenAI-style clients.
7. GGUF models and paths
What GGUF is (name, role, trade-offs)
GGUF (GGML Universal File Format) is a single-file container aimed at inference with llama.cpp and friends: it packs weights in a tensor layout tuned for efficient loading, metadata, and—in practice—what you need to tokenize and run the model without pulling in the full PyTorch/JAX training stack.
- Why it matters here: you download a `.gguf`, pass its path as `-m` to `llama-cli` / `llama-server`, and the engine runs locally (CPU, and in this guide Vulkan on the GPU). You do not need the original framework runtime just to serve the converted file.
- Typical upsides: one portable blob; quantized variants (Q4_K_M, Q8_0, IQ*, …) trade a bit of quality for disk / RAM / VRAM; huge Hugging Face catalog (community repos such as TheBloke, bartowski, Unsloth, …); first-class support in llama.cpp.
- Limitations: quality depends on quant level and conversion tooling; brand-new architectures may need a fresh llama.cpp build or lack mature GGUFs yet; training / fine-tuning usually happens elsewhere, then you convert/export to GGUF; it is not a full cloud SaaS substitute without extra plumbing.
The rest of this section assumes a ready-to-run GGUF; paths and downloads always point at that file.
Quant labels in filenames (Q2, Q4, Q8, suffixes like _K_M, IQ…)
Repos list GGUFs with prefixes like Q2_, Q3_, Q4_, Q5_, Q6_, Q8_ and cousins (IQ2_, IQ3_, …). Naming is not one single marketing standard, but in practice:
- The Q and number hint at quantization depth—roughly how many bits are used for weights (simplified). Lower → smaller file, less RAM/VRAM, sometimes more quality loss; higher (e.g. Q8) → heavier and often closer to “full” model behavior.
- Suffixes such as `_K_M`, `_K_S`, `_K_L`, … are llama.cpp k-quant schemes: they mix layers/blocks at different precisions to balance quality vs size—it is not “literally 4-bit everything.”
- IQ (imatrix / importance-weighted) lines aim for aggressive compression while protecting weights that matter most for output quality.
- For this guide: Q4_K_M is a common sweet spot for disk, memory, and quality; Q8_0-class files if you favor quality and have RAM to spare. If names feel overwhelming, sort by MiB/GiB under the repo’s Files tab and pick the largest file that fits your machine comfortably.
Hugging Face CLI (huggingface-cli): Ubuntu 24.04 ships externally managed system Python (PEP 668), so python3 -m pip install … fails with externally-managed-environment. Prefer a small virtualenv for this tool. This guide uses $HOME/.venv/huggingface: install python3-venv, create the venv once, run source …/bin/activate before pip / huggingface-cli, or call "$HOME/.venv/huggingface/bin/huggingface-cli" directly. Avoid --break-system-packages unless you understand the risk. Alternative: pipx install 'huggingface_hub[cli]' (after sudo apt install pipx and pipx ensurepath).
Use one consistent directory (avoid mixing ~/models and llama.cpp/models by mistake):
mkdir -p "$HOME/models"
Where models live and how to list them
llama.cpp has no built-in model catalog: a model is a .gguf file. You always pass the path with -m (absolute paths are best in systemd).
List the usual folder:
ls -lh "$HOME/models"/*.gguf 2>/dev/null
If that prints nothing, you may still have GGUFs elsewhere (Downloads, etc.).
Search under your home (limited depth, faster):
find "$HOME" -maxdepth 5 -name '*.gguf' 2>/dev/null -ls
Sort by size:
find "$HOME" -maxdepth 5 -name '*.gguf' 2>/dev/null -printf '%s\t%p\n' | sort -n
Important: Open WebUI does not enumerate “every GGUF on disk”. What matters is whichever file llama-server loads via -m. To “use another model”, change that -m (and restart the process or service §9), or run another llama-server on another port (advanced; not detailed here).
Generic example (swap the URL for the file link under the repo’s Files tab on Hugging Face):
wget -O "$HOME/models/model-name.gguf" \
"https://huggingface.co/ORG/REPO/resolve/main/file.gguf?download=true"
Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
Recent quantized model (Apache 2.0), Gemma 4 / MoE architecture; a good fit for machines with lots of RAM (e.g. ~96 GiB). Full file list and sizes: bartowski/google_gemma-4-26B-A4B-it-GGUF.
Reasonable disk/RAM use: Q4_K_M (~17 GiB per the model card). Maximum quality in this repo: Q8_0 (~27 GiB).
Important: you need a recent llama.cpp with Gemma 4 support (before building: cd llama.cpp && git pull). If loading the GGUF reports architecture or tokenizer errors, update and rebuild (§6).
Recommended download (Q4_K_M):
mkdir -p "$HOME/models"
wget --continue -O "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"
Higher-quality option (Q8_0):
wget --continue -O "$HOME/models/google_gemma-4-26B-A4B-it-Q8_0.gguf" \
"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q8_0.gguf?download=true"
Equivalent using huggingface-cli (handy for resumable downloads):
sudo apt install -y python3-venv
python3 -m venv "$HOME/.venv/huggingface" # once; skip if this directory already exists
source "$HOME/.venv/huggingface/bin/activate"
pip install -U "huggingface_hub[cli]"
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
--include "google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
--local-dir "$HOME/models"
Notes:
- On Hugging Face the model is tagged Image-Text-to-Text; for text-only chat, `llama-server` / Open WebUI usually work with the GGUF and embedded template. If message formatting breaks, check the Prompt format section on the model card.
- `resolve/main/...` URLs can break if files are renamed; if so, open the repo and copy the download link for the exact `.gguf`.
Important: when running llama-cli or llama-server, use the real path to the .gguf (absolute or relative to your current working directory).
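Before pointing the service at a multi-GiB download, it is worth confirming the file is complete. A sketch that checks the SHA256 Hugging Face shows next to each file under Files (the hash argument below is a placeholder you copy from the file page):

```shell
# verify_gguf FILE EXPECTED_SHA256 — compare a local file against the checksum
# copied from the Hugging Face file page. Both arguments are yours to fill in.
verify_gguf() {
  local actual
  actual="$(sha256sum "$1" | awk '{print $1}')"
  if [ "$actual" = "$2" ]; then
    echo "checksum OK: $1"
  else
    echo "checksum MISMATCH: $1 (resume with wget --continue)" >&2
    return 1
  fi
}
# Usage (placeholder hash — copy the real one from the repo's file page):
# verify_gguf "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" "<sha256 from the file page>"
```

A truncated download (e.g. a dropped SSH session without `screen`) otherwise shows up later as a confusing “invalid GGUF” load error.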
Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
A very large MoE (~32 B activated params / 1 T total per the model card). Community GGUFs: unsloth/Kimi-K2-Instruct-0905-GGUF. Run guide and flags: Unsloth — Kimi K2.
Hardware warning: Unsloth’s README recommends ≥ 128 GB unified RAM even for “small” quants. Boxes in the ~64–80 GiB range may fail to load, run very slowly, or thrash swap—treat it as an experiment (see §7 Experimenting with more models).
Hugging Face: access may be gated; sign in, accept terms on the model page, and use huggingface-cli login if required.
Shards: each quantization lives in a folder (UD-TQ1_0/, UD-IQ1_S/, IQ4_XS/, …) with files like …-00001-of-00006.gguf, … Download every .gguf in that folder. For llama-cli and llama-server, -m must point at the first shard (…-00001-of-….gguf); current llama.cpp loaders pick up sibling shards in the same directory.
Download one folder (example UD-TQ1_0, six parts; confirm names under Files on Hugging Face):
sudo apt install -y python3-venv
python3 -m venv "$HOME/.venv/huggingface" # once; skip if this directory already exists
source "$HOME/.venv/huggingface/bin/activate"
pip install -U "huggingface_hub[cli]"
huggingface-cli login # if token or gated access is required
mkdir -p "$HOME/models/kimi-k2-0905"
huggingface-cli download unsloth/Kimi-K2-Instruct-0905-GGUF \
--include "UD-TQ1_0/*.gguf" \
--local-dir "$HOME/models/kimi-k2-0905"
Other folders in the same repo are other quants (more disk / more quality). Pick based on free disk and RAM.
Before loading: git pull and rebuild llama.cpp (§6). Short smoke test:
cd "$HOME/llama.cpp"
./build/bin/llama-cli \
-m "$HOME/models/kimi-k2-0905/UD-TQ1_0/Kimi-K2-Instruct-0905-UD-TQ1_0-00001-of-00006.gguf" \
-c 4096 \
-ngl 80 \
-p "Say hi in one sentence."
Tune -ngl and -c; on architecture/tokenizer errors, update and rebuild. For §9 / Open WebUI, ExecStart uses the same path to the first shard; once llama-server is up, read the model id from /v1/models via curl for the Model IDs field.
Example: local Llama 3.1 8B Instruct Q8_0
If you already have e.g. $HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf (~8 GiB on disk), replace every -m path in this guide with yours. Q8_0 favors quality over speed; for higher tok/s on an iGPU, try a Q4_K_M in the same model family.
llama-bench: measure throughput (tokens/s)
Use this to compare the same machine with different -ngl, different GGUFs, or different builds (CPU vs Vulkan), without UI noise.
- Verify the binary (size/date are hints; it should refresh after rebuilds):
cd "$HOME/llama.cpp"
ls -lh build/bin/llama-bench
If it is missing, rebuild the project (§6); most full builds already include `llama-bench`.
- Flags change across versions—always start from help:
./build/bin/llama-bench --help | less
- Minimal example (swap the path):
./build/bin/llama-bench \
-m "$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf" \
-ngl 999 \
-n 128
- `-m`: path to the `.gguf`.
- `-ngl`: GPU layers; many builds accept `999` or `-1` as “as many as possible”. If rejected, try `35`, `45`, etc., and increase until it breaks or slows down.
- `-n`: generated tokens per benchmark run (increase for longer runs).
- Reading output: you usually see prompt processing vs generation tok/s. If numbers are tiny and logs show no Vulkan / `ggml_vulkan`, the binary might lack `GGML_VULKAN`, or `/dev/dri` permissions were wrong at build/run time (§4).
- Fair comparisons: same `llama-bench` build, same model, same `-n`; only change `-ngl` or the `.gguf`.
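The “only change `-ngl`” rule is easy to script as a sweep. A sketch that prints each command before it would run (the benchmark line is commented out so nothing heavy starts by accident; the `-ngl` ladder is an assumption to adapt to your GPU):

```shell
# Sweep -ngl with everything else pinned, for comparable rows in your notes.
MODEL="$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf"   # swap for your .gguf
for NGL in 0 20 35 45 999; do
  CMD="./build/bin/llama-bench -m \"$MODEL\" -ngl $NGL -n 128"
  echo "== $CMD"
  # eval "$CMD"   # uncomment to actually run each benchmark
done
```

Uncomment the `eval` line when you are ready; each run takes a while, so a ladder of five values is usually enough to find the knee.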
Sample real output (same command order as above; Ubuntu 24.04, Radeon 760M RADV, Llama 3.1 8B Instruct Q8_0; numbers shift with BIOS, thermals, quantization, and llama.cpp revision):
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 999 | pp512 | 235.96 ± 0.19 |
| llama 8B Q8_0 | 7.95 GiB | 8.03 B | Vulkan | 999 | tg128 | 9.80 ± 0.00 |
build: 4d688f9eb (8016)
- The `ggml_vulkan` lines show one Vulkan device and that the bench runs on RADV (not `llvmpipe` only). Errors or zero devices → revisit §4–§5.
- `pp512`: prompt processing — tok/s for a ~512-token prefill; usually higher than generation.
- `tg128`: token generation — tok/s while emitting 128 output tokens; the bench metric closest to “reply speed” in chat. Here ≈9.8 t/s for Q8_0 on this iGPU.
- The `build:` line is the llama.cpp commit your `llama-bench` was built from; it changes after `git pull` + rebuild.
Another sample (same mini PC class, Gemma 4 26B Instruct Q4_K_M — the model this guide uses in many examples):
./build/bin/llama-bench \
-m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-ngl 999 \
-n 128
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model | size | params | backend | ngl | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q4_K - Medium | 15.85 GiB | 25.23 B | Vulkan | 999 | pp512 | 239.04 ± 1.97 |
| gemma4 ?B Q4_K - Medium | 15.85 GiB | 25.23 B | Vulkan | 999 | tg128 | 20.94 ± 0.02 |
build: d12cc3d1c (8720)
- The `gemma4 ?B` label is cosmetic on some `llama-bench` builds; trust the size (~15.85 GiB), params (~25.23 B), and your `-m` path.
- What this run says: with Vulkan and `-ngl 999`, expect on the order of ~239 tok/s for prefill (pp512) and ~21 tok/s for generation (tg128). That ~21 t/s is the most useful single number for “raw” reply speed (no Open WebUI overhead, no long reasoning block, no huge prompts); real chat often lands near this ballpark or a bit lower.
Other GGUFs, `-ngl` values, or build revisions will move `tg*` a lot; record your own table after major changes.
Quick terminal test
From the llama.cpp directory:
./build/bin/llama-cli \
-m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-p "Answer in one sentence what Linux is." \
-cnv \
-ngl 99
Gemma 4 and on-screen reasoning ([Start thinking] … [End thinking]): many Instruct GGUFs emit a “thinking” block before the final answer. On a recent llama-cli, --help normally documents (verify with ./build/bin/llama-cli --help | grep -iE 'reason|think|template'):
- `-rea, --reasoning on|off|auto` — default `auto` (template decides). For clean screenshots, use `--reasoning off` (short `-rea off` if your build prints it).
- `--reasoning-budget N` — `0` ends the thinking block immediately; `-1` is unrestricted. Pair with `off` if needed.
- `--chat-template-kwargs STRING` — JSON for the template parser (e.g. `'{"enable_thinking": false}'` in bash with outer single quotes).
- `--reasoning-format FORMAT` — tag handling / extraction (DeepSeek-style paths); `--reasoning off` is usually enough for Gemma in interactive CLI.
Screenshot-friendly example (same command as above + reasoning disabled):
./build/bin/llama-cli \
-m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-p "Answer in one sentence what Linux is." \
-cnv -ngl 99 \
--reasoning off
Reference run (validated hardware in the intro; no [Start thinking] block; t/s are indicative):
You can also export the env vars mentioned in --help (LLAMA_ARG_REASONING, LLAMA_ARG_THINK_BUDGET, …) if you prefer not to repeat flags.
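If your `--help` does list them, the env-var route looks like this — the variable names below come from that help text, so confirm them on your build before relying on this:

```shell
# Same effect as --reasoning off / --reasoning-budget 0, assuming your
# llama-cli --help documents these variables (verify before relying on them).
export LLAMA_ARG_REASONING=off
export LLAMA_ARG_THINK_BUDGET=0
# ./build/bin/llama-cli -m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" -cnv -ngl 99
```

Putting the exports in the shell profile (or a systemd `Environment=` line) keeps every invocation consistent without repeating flags.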
For llama-server (§8–§9), add the same switches to ExecStart (--reasoning off, --reasoning-budget 0, --chat-template-kwargs …) as your binary supports. If nothing disables it, try another GGUF/variant, or another model for a one-off capture (e.g. Llama in this same §7).
Example with a local Llama 3.1 8B (single-turn demo; chat template depends on the GGUF). An overly vague -p (“summarize llama.cpp”) may yield “I don’t have that information”; give context in the question (e.g. open-source inference, GGUF, local execution).
./build/bin/llama-cli \
-m "$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf" \
-p "Answer in exactly one sentence: What does the llama.cpp project do for running language models locally?" \
-ngl 999
Actual reference screenshot (same validated hardware in the intro: Ryzen 5 7640HS, Radeon 760M, DDR5; t/s varies with thermals, BIOS, and llama.cpp commit):
- `-ngl 99` / `999`: tries to offload many layers to the GPU; on large models or a small unified VRAM budget you may need to lower `-ngl` or increase the BIOS framebuffer (§2).
- On startup, look for lines like `ggml_vulkan:` and your GPU name (e.g. Radeon 760M) to confirm Vulkan.
Adding or switching models
Each additional model you want to run—another family, quantization, or file from Hugging Face—is one more .gguf in your folder (e.g. $HOME/models). ML slang often says “weights” for the trained parameters inside that file; here it is enough to think “another .gguf.” The flow is always download → test → point the server at that path.
- Download using the same pattern as above (`wget`, `huggingface-cli`, or the repo’s download link on Hugging Face).
- Smoke-test in the terminal with `llama-cli -m "$HOME/models/your-new-file.gguf"` (like the quick test). If the architecture is brand new and load fails, update and rebuild llama.cpp (§6).
- Manual `llama-server` (§8): stop the process (Ctrl+C) and start it again with `-m` pointing at the new file.
- systemd service (§9): edit `/etc/systemd/system/llama-web.service`, change only the `-m /full/path/new.gguf` argument inside `ExecStart`, save, then run:
sudo systemctl daemon-reload
sudo systemctl restart llama-web.service
sudo systemctl status llama-web.service
- Open WebUI (§10): `llama-server` loads one model at a time (whichever you set at startup). After restarting the service, reload the UI; the model dropdown may show the filename or a generic label (default), depending on the version.
- OpenCode / VS Code (§11): same host and port (`…:8080/v1`); in editors use the server IP or `127.0.0.1` depending on where the IDE runs.
Serving several models at once requires multiple llama-server processes on different ports (and matching entries in Open WebUI or more containers); that advanced layout is not spelled out here.
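Before wiring an editor or frontend in, a plain request against the same `/v1` base URL confirms the server side works. A sketch that builds the payload first (an OpenAI-style client must send a model string, but a single-model llama-server typically answers with whatever it loaded; run the commented curl once the server is up):

```shell
# Minimal OpenAI-style chat request — the same shape Open WebUI, OpenCode,
# and VS Code send to this base URL.
BASE_URL="http://127.0.0.1:8080/v1"
PAYLOAD='{"model":"local","messages":[{"role":"user","content":"Say hi in one sentence."}],"max_tokens":64}'
echo "$PAYLOAD"
# Once llama-server is running:
# curl -s "$BASE_URL/chat/completions" -H "Content-Type: application/json" -d "$PAYLOAD"
# curl -s "$BASE_URL/models"   # lists the loaded model's id
```

If the curl works from the server itself but not from another machine, check the ufw rules from §4 and whether the server binds to `0.0.0.0`.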
Experimenting with more models: setup, testing, and limits
If you want to try multiple GGUFs, follow a clear flow and know your hardware ceiling—this avoids pointless downloads and false “it’s broken” moments.
Recommended flow
- Check disk and RAM (`free -h`, `df -h /`, §3). Each quantization costs what the model card says; keep headroom.
- Update `llama.cpp` when the model is new (§6, Update and rebuild).
- Download the `.gguf` into `$HOME/models` (`wget`, `huggingface-cli`, etc.).
- Smoke-test with `llama-cli` and short generations; confirm `ggml_vulkan` appears if the GPU should participate (§7).
- Optional: `llama-bench` with the same `-ngl` you plan for production to compare quantizations (§7).
- Change `-m` in §9 (or manual §8), `daemon-reload` + `restart`, then `curl /v1/models` and Open WebUI (Admin → Connections; Model IDs if needed).
Typical limits on a mini PC with an iGPU
| Topic | What it means |
|---|---|
| RAM | GGUF size + OS + context cannot grow without limit; huge MoE releases (e.g. Kimi K2-class GGUFs) can exceed usable RAM on 64–96 GiB class boxes or crawl at extremely low tok/s. |
| iGPU Vulkan | Caps tok/s on GPU; lots of RAM helps you load weights, not mimic a big discrete GPU. |
| One active model per `llama-server` | Switching models means changing `-m` and restarting (or a second server on another port). |
| Templates / chat | Weird chat in Open WebUI may be the GGUF chat template; check the Hugging Face card or try another frontend. |
| Network / disk | Large downloads take time; use wget --continue or resumable huggingface-cli. |
Set expectations: an 8B–13B or a quantized 26B can be a great fit with ample RAM; datacenter-scale GGUF may not fit or run under ~1–2 tok/s with aggressive paging—that is a memory bandwidth issue, not an Ubuntu bug.
One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
For a mini PC–style setup: Ubuntu 24.04, AMD iGPU Vulkan, ~64–96 GiB RAM, llama-server on 8080, systemd §9, Open WebUI §10. Swap in your paths and username.
Common steps (every model swap)
- Refresh the engine if the weights are new or the load fails: `cd ~/llama.cpp && git pull` and rebuild (§6).
- Download the `.gguf` (per-family commands below). Verify the filename under Hugging Face → Files; if it has been renamed, fix the URL.
- Smoke test (tune `-ngl` and `-c`), or use the copy-paste commands per model under Per-model quick test below:
cd ~/llama.cpp
./build/bin/llama-cli -m "/absolute/path/to/file.gguf" -ngl 999 -c 4096 -n 80 -p "Answer in one short sentence."
- Tuning: on OOM, hangs, or very slow output, lower `-ngl` (e.g. 50, 35) and/or `-c` (e.g. 2048). Unified iGPU memory is usually the limiter, not raw RAM alone.
- `llama-bench` (optional, §7) with the same path and `-ngl` to compare quants or families.
- systemd (§9): in `/etc/systemd/system/llama-web.service`, edit `ExecStart`: same path in `-m`, and match `-c`/`-ngl` to what worked in the smoke test.
sudo systemctl daemon-reload
sudo systemctl restart llama-web.service
sudo systemctl status llama-web.service
- API check: `curl -s http://127.0.0.1:8080/v1/models`
- Open WebUI: Admin → Connections → OpenAI (`http://host.docker.internal:8080/v1`). If the picker stays empty, paste the `id` from that JSON into Model IDs, save, and hard-refresh.
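If you want to script the API check, the `id` values are easy to pull out of the `/v1/models` JSON with a few lines of shell. The response below is a hypothetical example of the shape llama-server returns; on your box the `id` will match whatever `.gguf` the server loaded:

```shell
# Hypothetical /v1/models response; get the real one with:
#   json="$(curl -s http://127.0.0.1:8080/v1/models)"
json='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'

# Extract the first model id — this is the string Open WebUI's "Model IDs" field expects
model_id="$(printf '%s' "$json" | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')"
echo "$model_id"
```

The same one-liner works against the live endpoint once the server is up.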
Reference table (repos + sample file)
| Family | Hugging Face repo | Sample file (quant) | Notes (~machine with plenty of RAM) |
|---|---|---|---|
| Gemma 4 26B Instruct | `bartowski/google_gemma-4-26B-A4B-it-GGUF` | `google_gemma-4-26B-A4B-it-Q4_K_M.gguf` | ~17 GiB on disk; usually needs fresh llama.cpp. Start `-c` around 4096–8192. |
| Qwen2.5 Coder 7B | `bartowski/Qwen2.5-Coder-7B-Instruct-GGUF` | `Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf` | Much lighter than Gemma 26B. For 14B / 32B, check Files sizes; 32B Q4 is often ~18–20 GiB+ and heavier. |
| DeepSeek Coder V2 Lite Instruct | `bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF` | `DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf` | “Lite” ≈ ~10 GiB class in Q4_K_M; solid code/disk trade-off locally. |
| Llama 3.1 8B Instruct | `bartowski/Meta-Llama-3.1-8B-Instruct-GGUF` | `Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf` or `-Q8_0.gguf` | Q4_K_M faster; Q8_0 heavier / often higher quality. If your file name differs, keep your real path in `-m`. |
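Before pulling one of these files, it can be worth checking free disk space against the approximate sizes in the table. A rough sketch, assuming GNU `df` (the `need_gib` value is the table's estimate, not an exact figure):

```shell
# Approximate download size from the table (GiB); adjust per file
need_gib=17   # e.g. the Gemma Q4_K_M row

# Free space (GiB, rounded) on the filesystem holding $HOME
free_gib="$(df --output=avail -BG "$HOME" | tail -1 | tr -dc '0-9')"

if [ "$free_gib" -ge "$need_gib" ]; then
  echo "ok: ${free_gib}G free, need ~${need_gib}G"
else
  echo "low: ${free_gib}G free, need ~${need_gib}G"
fi
```

Remember a `wget --continue` restart needs the partial file plus the remainder, so leave some headroom.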
Download (wget --continue, one file per command)
If you use SSH and the download runs a long time, run it inside screen or tmux so a dropped connection does not kill the job. Example with screen (install if needed: sudo apt install -y screen):
screen -S hf-models
mkdir -p "$HOME/models"
wget --continue -O "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"
# When this wget finishes, you can paste the next command from the block below without leaving screen.
# Detach (leave download running): Ctrl+A, release, D
# Reattach later: screen -r hf-models
# List sessions: screen -ls
The same pattern works for the other URLs in this section or for huggingface-cli download.
mkdir -p "$HOME/models"
# Gemma 4 26B Q4_K_M
wget --continue -O "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"
# Qwen2.5 Coder 7B Q4_K_M
wget --continue -O "$HOME/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf?download=true"
# DeepSeek Coder V2 Lite Q4_K_M
wget --continue -O "$HOME/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF/resolve/main/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf?download=true"
# Llama 3.1 8B Q4_K_M
wget --continue -O "$HOME/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true"
Meta / Llama (gated): if wget returns 403 or Hugging Face asks you to sign in, open the model page while logged in, accept the license, create a read token, and run huggingface-cli login. Gated repos usually need huggingface-cli download ..., not anonymous wget to resolve/main/....
huggingface-cli alternative (resumable; each command pulls one GGUF under --local-dir):
sudo apt install -y python3-venv
python3 -m venv "$HOME/.venv/huggingface" # once; skip if this directory already exists
source "$HOME/.venv/huggingface/bin/activate"
pip install -U "huggingface_hub[cli]"
# huggingface-cli login # required for *gated* repos (e.g. Llama/Meta); optional otherwise
mkdir -p "$HOME/models"
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF \
--include "google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
--local-dir "$HOME/models"
huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF \
--include "Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" \
--local-dir "$HOME/models"
huggingface-cli download bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF \
--include "DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf" \
--local-dir "$HOME/models"
huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF \
--include "Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
--local-dir "$HOME/models"
Depending on the CLI version, the .gguf may end up in a subfolder under --local-dir. Point -m at the real absolute path (for example find "$HOME/models" -name '*.gguf').
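A small demo of that `find` advice, using a throwaway directory to stand in for `--local-dir` (the nested layout and filename here are made up for illustration):

```shell
# Fake layout: some CLI versions nest the file in a subfolder under --local-dir
demo="$(mktemp -d)"
mkdir -p "$demo/models/sub"
: > "$demo/models/sub/example-Q4_K_M.gguf"   # placeholder, stands in for a real download

# -print -quit stops at the first match; the result is the absolute path to pass to -m
MODEL="$(find "$demo/models" -name '*.gguf' -print -quit)"
echo "$MODEL"
rm -rf "$demo"
```

On your box, swap `"$demo/models"` for `"$HOME/models"` and feed `$MODEL` straight to `llama-cli -m "$MODEL"`.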
Per-model quick test (right after download)
Run one block (paths match the wget names above). -n caps generated tokens so the run stays short; if your llama-cli rejects -n, check ./build/bin/llama-cli --help (sometimes --predict or another alias). Earlier in §7, Quick terminal test shows a -cnv example for Gemma and a Llama variant.
Gemma 4 26B Q4_K_M
cd "$HOME/llama.cpp"
./build/bin/llama-cli \
-m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
-c 4096 -ngl 999 -n 80 \
-p "Answer in one short sentence what a tensor is in machine learning."
Qwen2.5 Coder 7B Q4_K_M
cd "$HOME/llama.cpp"
./build/bin/llama-cli \
-m "$HOME/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf" \
-c 4096 -ngl 999 -n 128 \
-p "Write a one-line Python factorial(n) function; code only."
DeepSeek Coder V2 Lite Q4_K_M
cd "$HOME/llama.cpp"
./build/bin/llama-cli \
-m "$HOME/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf" \
-c 4096 -ngl 999 -n 128 \
-p "Write a JavaScript arrow function that adds two numbers; code only."
Llama 3.1 8B Instruct Q4_K_M
cd "$HOME/llama.cpp"
./build/bin/llama-cli \
-m "$HOME/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf" \
-c 4096 -ngl 999 -n 80 \
-p "Say in one sentence what llama.cpp is for."
On startup you should see ggml: / ggml_vulkan: lines naming your GPU when Vulkan is in use (§4–§5).
Typical ExecStart tweaks (example)
Same shape as §9; only -m (and possibly -c / -ngl) change:
…/llama-server \
-m /home/YOUR_USER/models/THE_FILE_YOU_TESTED.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 999 \
--n-predict -1
If Gemma 26B Q4 or another big model OOMs on a box with only ~16 GiB RAM, lower -c (e.g. 4096) and -ngl (e.g. 40 or less) before pushing 99 / 999. Always validate with llama-cli using the same -m, -c, and -ngl you plan in ExecStart, then automate with systemd (§9).
8. Minimal web server (llama-server)
Run manually, listening on all interfaces on port 8080:
cd "$HOME/llama.cpp"
./build/bin/llama-server \
-m "$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf" \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 99 \
--n-predict -1
On another machine: http://SERVER_IP:8080 (llama.cpp’s built-in UI is very basic).
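Beyond the built-in UI, you can hit the OpenAI-style endpoint directly. A sketch that builds a minimal chat payload and validates the JSON before sending; `"default"` is a placeholder model id (use whatever `/v1/models` reports), and the `curl` line is left commented because it assumes the server above is running:

```shell
# Minimal OpenAI-style chat request; "default" is a placeholder model id
payload='{"model":"default","messages":[{"role":"user","content":"Say hello in five words."}],"max_tokens":32}'

# Sanity-check the JSON before sending it anywhere
printf '%s' "$payload" | python3 -m json.tool > /dev/null && echo "payload ok"

# With llama-server up (this section or §9):
# curl -s http://127.0.0.1:8080/v1/chat/completions \
#   -H "Content-Type: application/json" \
#   -d "$payload"
```

The reply comes back as a JSON object with a `choices` array, same shape Open WebUI and the §11 clients consume.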
9. systemd service (start on boot)
Create /etc/systemd/system/llama-web.service (e.g. with sudo nano):
[Unit]
Description=Llama.cpp API server (Vulkan)
After=network.target
[Service]
Type=simple
User=YOUR_USER
Group=YOUR_USER
# Vulkan on AMD: the service user must access /dev/dri (groups in §4).
# If the service loads the model on CPU only, check `groups` / `id` for that user.
SupplementaryGroups=render video
WorkingDirectory=/home/YOUR_USER/llama.cpp
ExecStart=/home/YOUR_USER/llama.cpp/build/bin/llama-server \
-m /home/YOUR_USER/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-c 8192 \
-ngl 99 \
--n-predict -1
Restart=always
RestartSec=3
[Install]
WantedBy=multi-user.target
Enable it:
sudo systemctl daemon-reload
sudo systemctl enable --now llama-web.service
sudo systemctl status llama-web.service
Recommended order (tight RAM):
- The `.gguf` must be fully downloaded; a truncated file makes the unit fail or restart in a loop (`Restart=always`).
- Smoke-test with `llama-cli` first as the same user as the systemd unit, with the same `-m`, `-c`, and `-ngl` as in `ExecStart` (§7 Per-model quick test or step 3’s generic example). If that already OOMs or hangs, tune flags before `enable --now`.
- If systemd shows OOM in `journalctl`, the process dies and respawns every few seconds, or the kernel kills the worker, edit `ExecStart`: drop `-c` (e.g. 4096) and `-ngl` (e.g. 40 or less) instead of staying on 99 / 999 until `status` shows a stable `active (running)`; then `sudo systemctl daemon-reload` and `sudo systemctl restart llama-web.service`.
If startup fails, check logs: journalctl -u llama-web.service -n 80 --no-pager (GGUF path, /dev/dri permissions, -ngl, Vulkan).
10. Open WebUI with Docker (port 3000 → backend on 8080)
Install Docker if needed:
sudo apt install -y docker.io
sudo usermod -aG docker "$USER"
# Log out again, or run: newgrp docker
Container (UI on 3000; engine stays on host 8080):
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
In the browser: http://SERVER_IP:3000.
Connect Open WebUI to llama-server
Not the same as “External tools”. In regular user settings you may see External tools (Manage tool servers, openapi.json): that is for optional tool servers, not for the main LLM backend. Putting your URL only there leaves the model picker empty.
Use Admin Settings, not the gear icon that only shows General / Interface / External tools (personal user settings). Typical path: profile avatar → Admin Settings / Administration → Settings → Connections → OpenAI → Add connection. If Admin Settings is missing, your account is not an instance admin (the first registered user usually is). Docs: OpenAI-Compatible.
- Admin panel → Settings → Connections.
- OpenAI section (llama-server mimics the OpenAI API):
  - Base URL: `http://host.docker.internal:8080/v1`
  - API key: any string (e.g. `sk-no-key-required`).
- Save and use verify connection if shown.
- Turn off “Direct connections” (or equivalent) if you enabled it: otherwise the browser will try to resolve `host.docker.internal` outside Docker and fail. The UI should proxy to the backend.
Chat up and running (example)
With the backend wired, pick a model in chat (often the same label as the .gguf filename llama-server loaded), send a prompt, and the reply is generated on the host. The screenshot shows google_gemma-4-26B-A4B-it-Q4_K_M.gguf: the header dropdown reflects that file, and you get a “Thought for …”-style block (internal reasoning before the visible answer). That adds latency before you see the final text; for terminal use and less explicit “thinking” output with Gemma, try llama-cli with --reasoning off (§7 Quick terminal test).
No browsing or GitHub fetch: real limits (and confident wrong answers)
With llama-server + Open WebUI as wired here, the model is text → text only: it does not browse the web, issue its own internet requests, download a https://github.com/... tree, or run code in a sandbox. All it “sees” is what you type (plus whatever context the UI forwards) and knowledge frozen inside the GGUF up to training cutoff.
It may still answer very confidently as if it had tools—for example claiming it “can analyze a public repo if you share the link” or outlining how it will “read” a remote README. In this stack that is false if you only paste a URL: the backend never fetches HTML or the repo; Gemma (or any local GGUF) hallucinates or repeats patterns from training. Real analysis needs you to paste files / diffs, or separate plumbing (RAG, Open WebUI functions, agents, APIs) that this guide does not set up.
A “Thought for …” / reasoning block (§7, §10) does not verify anything online—it only extends generation and can read like a super-capable assistant; double-check claims about repos, “current” versions, or anything that depends on today.
Same stack, different tone: ask bluntly “can you browse the Internet for new info?” and Gemma may plainly refuse—no live search, only training data plus whatever you paste. That does not undo the GitHub-URL problem above: the model shifts persona with prompt framing (literal capability question vs. “please review this repo”). Ground truth is unchanged: llama-server still issues no HTTP on its own until you wire tools.
Live demo (the joke writes itself): the assistant just told you to “send the link”; you reply “analyze https://github.com/…/pgwd and tell me what to improve”—or the same request in Spanish (or any other language you type in the UI); llama-server does not switch behavior by chat language. Open WebUI shows Thinking… and Gemma looks busy, but llama-server never fetched that repo: it only sees the message string. The answer may sound technical yet be untethered from the real tree—paste files, use git yourself, or wire tools if you want grounded review.
Same experiment, a minute later: the model may return Thought for ~45–60s and a long “review” that reads like a real audit. The screenshot below is English (analyze in details…): it leans into Flask and Blueprints; in another chat the same Gemma might rattle off Go cmd//internal/—still with no tree read. That is template + guesswork, not repository access: some bullets may match the name (pgwd, “dashboard”, …), some may be wrong; length and “thought” time are not a substitute for cloning and diffing.
Model picker shows “No results found” / no models listed
This almost never means “the .gguf is missing on disk”; it means Open WebUI is not getting /v1/models from the backend you configured. Walk through in order:
1. `llama-server` must be running on the same host as Docker (§8 manual or §9 systemd). Nothing listening on 8080 → empty list.
2. On the host (mini PC shell), hit the API:

curl -sS http://127.0.0.1:8080/v1/models | head

You should see JSON (`data`, at least one `id`). Connection refused → start or fix `llama-server`. If it only bound a weird interface, use `--host 0.0.0.0` in `ExecStart` (not only 127.0.0.1 if LAN clients need 8080; for Docker→host this is the usual choice).

3. From the Open WebUI container, the host port must be reachable:

docker exec open-webui sh -c 'wget -qO- http://host.docker.internal:8080/v1/models 2>/dev/null || curl -sS http://host.docker.internal:8080/v1/models' | head

If this fails but step 2 works, you are missing `--add-host=host.docker.internal:host-gateway` in `docker run` (§10), or a firewall blocks Docker bridge → host (`ufw` may need a rule; many setups allow it by default).

4. UI wiring: Settings → Connections → OpenAI (or Admin → Settings, depending on version), base URL `http://host.docker.internal:8080/v1` (`/v1` required). Save a dummy API key and verify if offered.
5. Do not mix with Ollama: putting the `llama-server` URL only under Ollama, or using port 8080 without `/v1`, can leave the dropdown empty. See the table below.
6. After fixing, hard-refresh the UI. The model label may match the `.gguf` name, `default`, or whatever `id` appears in the JSON from step 2.
“Failed to fetch models” under Ollama (Settings → Models)
If Settings → Models → Manage Models shows the Ollama service with URL http://host.docker.internal:8080 (and nothing else), you often get Failed to fetch models. That usually means two different backends are mixed up:
| What you run | Typical port | Where to configure it in Open WebUI |
|---|---|---|
| llama-server (this guide) | 8080, OpenAI-style API | Settings → Connections → OpenAI (or equivalent), base URL `http://host.docker.internal:8080/v1` (the `/v1` suffix is required). |
| Ollama (only if installed separately) | 11434, Ollama API | Ollama connection / model management, typically `http://host.docker.internal:11434` (only if Ollama listens on the host and the container can reach it). |
llama-server is not Ollama. If you put the llama-server URL in the Ollama field, the UI uses the wrong protocol and fails even when port 8080 is open.
If you only use llama-server:
- Keep Connections → OpenAI exactly as above (`…8080/v1`, dummy key, verify).
- If you do not run Ollama, clear or disable the Ollama URL (do not point it at 8080).
- Return to Models or chat: available models follow whatever `llama-server` loaded with `-m` (§8–§9).
If host.docker.internal does not resolve inside the container, confirm your docker run includes --add-host=host.docker.internal:host-gateway (§10). On Linux that hostname is not defined by default without it.
Updating Open WebUI (Docker)
The UI often shows a banner like “A new version (v0.x.y) is now available…” when a newer image exists. Your chats and settings live in the open-webui named volume; they are kept when you recreate the container as long as you mount the same -v open-webui:/app/backend/data.
- Pull the updated image (same tag you used at install; this guide uses `main`):

docker pull ghcr.io/open-webui/open-webui:main

- Stop and remove only the container (the volume stays intact):

docker stop open-webui
docker rm open-webui

- Run the same `docker run` block from §10 again (same `-p 3000:8080`, `--add-host=host.docker.internal:host-gateway`, `-v open-webui:…`, container name `open-webui`, etc.). The new container starts from the image you just pulled.
If you originally used a different tag (e.g. v0.8.12 or a cuda variant) instead of main, substitute that tag in both docker pull and docker run.
Notes: updating the UI does not update llama-server or your GGUF weights; the engine is still §6–§9. If you do not want to track main, pin an explicit image tag in docker run and repeat this flow when you choose to upgrade.
If you also run Ollama
A default endpoint may appear on port 11434. To keep using your Vulkan llama-server with the same -ngl/RAM behavior, prioritize the OpenAI entry pointing at :8080/v1 and do not rely on Ollama for that backend.
11. OpenCode and VS Code with your llama-server
Same API surface as Open WebUI: llama-server exposes an OpenAI-compatible endpoint at http://HOST:8080/v1 (keep §8 or §9 running). Use the mini PC’s IP instead of 127.0.0.1 when you work from another machine on the LAN (and open port 8080 in the firewall if needed).
OpenCode
OpenCode can use OpenAI-compatible providers through @ai-sdk/openai-compatible. The official docs include a llama.cpp / llama-server example: Providers — llama.cpp.
- Confirm `llama-server` answers (e.g. `curl -s http://127.0.0.1:8080/v1/models`).
- Create or edit `opencode.json` for your project or OpenCode’s config path (`$schema`: `https://opencode.ai/config.json`).
- Add a provider with `"npm": "@ai-sdk/openai-compatible"` and `"options.baseURL": "http://127.0.0.1:8080/v1"` (or the remote IP).
- Under `provider.<id>.models`, add keys that match what the API expects. If unsure, read the `id` field from `/v1/models`; it is often the `.gguf` filename or `default`.
- In OpenCode, use `/models` to pick `provider_id/model_id`, or set `"model": "provider_id/model_id"` in the JSON.
Minimal example (adjust IDs to your setup):
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"llama-local": {
"npm": "@ai-sdk/openai-compatible",
"name": "llama-server (local)",
"options": {
"baseURL": "http://127.0.0.1:8080/v1"
},
"models": {
"default": {
"name": "Local model (default)"
}
}
}
},
"model": "llama-local/default"
}
If OpenCode cannot see the model, align models keys with /v1/models. Tools and heavy agentic flows depend on the GGUF; a general chat model may underperform on coding-agent tasks.
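A broken `opencode.json` is another easy way to end up with an invisible provider. A quick syntax check with Python's stdlib; this sketch writes a trimmed copy of the example config to a temp file just to demo the check — in practice, point `python3 -m json.tool` at your real `opencode.json`:

```shell
# Demo config (abbreviated from the example above); replace with your real opencode.json path
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
{
  "$schema": "https://opencode.ai/config.json",
  "model": "llama-local/default"
}
EOF

# json.tool exits non-zero on a syntax error (trailing comma, missing quote, ...)
if python3 -m json.tool < "$cfg" > /dev/null; then
  status="valid"
else
  status="invalid"
fi
echo "opencode.json: $status JSON"
rm -f "$cfg"
```

It only catches JSON syntax errors, not wrong provider IDs, but those are the most common silent failure.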
Visual Studio Code
VS Code does not talk to your server by itself—you need an extension that supports a custom OpenAI-style endpoint.
- Common picks: Continue and others advertising OpenAI-compatible API or “local LLM”. You typically set Base URL to `http://127.0.0.1:8080/v1` (or the server IP) and API key to any placeholder (e.g. `sk-local`).
- GitHub Copilot does not route through your `llama-server`; it is a separate service.
- From another PC, use the host IP where `llama-server` runs—not `host.docker.internal` (that name is for containers such as Open WebUI).
Extensions usually trail cloud models on tools and huge context. Start on the same machine you already validated with llama-cli or Open WebUI.
12. Troubleshooting: Vulkan / glslc on Ubuntu 24.04
Typical CMake symptoms:
- `Could NOT find Vulkan (missing: ... glslc)`
- Vulkan found but `glslc` still missing
Suggested order (simplest first):
12.1 Universe repository and packages
sudo add-apt-repository universe
sudo apt update
sudo apt install -y libvulkan-dev vulkan-tools shaderc
Verify:
command -v glslc && glslc --version
Clean and reconfigure the build:
cd ~/llama.cpp
rm -rf build
cmake -B build -DGGML_VULKAN=1
cmake --build build --config Release -j"$(nproc)"
12.2 LunarG repository (Vulkan SDK)
If your Ubuntu mirror does not offer shaderc or glslc is still missing:
wget -qO- https://packages.lunarg.com/lunarg-signing-key-pub.asc \
| sudo tee /etc/apt/trusted.gpg.d/lunarg.asc
sudo wget -qO /etc/apt/sources.list.d/lunarg-vulkan-noble.list \
https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
sudo apt update
sudo apt install -y vulkan-sdk
Then rm -rf build and run cmake again.
12.3 Conflict between Ubuntu’s libshaderc-dev and LunarG’s Shaderc
If dpkg complains about overwriting files between packages, as a last resort you can force-remove the blocking package, then repair:
sudo dpkg --remove --force-depends libshaderc-dev
sudo apt --fix-broken install -y
sudo apt install -y shaderc
Only do this if you understand mixed repos can leave messy dependencies; often sticking to either LunarG or Ubuntu for Shaderc dev packages is enough.
12.4 Snap fallback for glslc
sudo snap install google-shaderc
sudo ln -sf /snap/bin/glslc /usr/local/bin/glslc
Check glslc --version again and retry CMake.
13. Performance and models (rough guide)
With lots of RAM and a modest iGPU, unified VRAM and -ngl cap GPU tokens/s; larger models can spill into system RAM.
| Scale | Notes |
|---|---|
| Gemma 4 26B A4B (e.g. Q4_K_M ~17 GiB) | Good balance with high RAM; needs an up-to-date llama.cpp. |
| Same family Q8_0 (~27 GiB) | Better quality; more pressure on RAM/unified VRAM. |
| Mixtral 8×7B, 70B, others | Feasible mainly thanks to RAM; slower. |
Use a lower quantization (e.g. Q4_K_M) if you prioritize speed over quality.
For hard numbers on your box, run llama-bench (§7): it is the most direct way to compare -ngl and quantizations without the web UI in the way.
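`llama-bench` prints a small table with a tokens-per-second column per run; once you have numbers for two quants, picking the faster one is a one-liner. The figures below are made up for illustration — the real invocations are sketched in the comments:

```shell
# Real runs (one per quant) look roughly like:
#   ./build/bin/llama-bench -m "$HOME/models/FILE-Q4_K_M.gguf" -ngl 99
#   ./build/bin/llama-bench -m "$HOME/models/FILE-Q8_0.gguf"  -ngl 99
# Collect "quant tok/s" pairs from those tables, then sort numerically by the second field:
fastest="$(printf '%s\n' 'Q4_K_M 12.4' 'Q8_0 7.9' | sort -k2 -rn | head -1)"
echo "fastest: $fastest"
```

Keep `-m` and `-ngl` identical across runs so the comparison isolates the quantization.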
htop looks “light” while you chat (is that normal?)
If htop shows llama-server / llama-cli with low CPU across cores and only a few GiB of RES, that is often expected when:
- `-ngl` leaves much of the model on the iGPU — heavy matmul runs on the graphics core; the CPU orchestrates and shuffles data, so you may not see all cores pegged at 100%.
- The GGUF is small (e.g. 7B/8B Q4) — small resident RAM footprint; a 26B run would show much more RES if most weights live in system memory.
- Bursts happen while scoring the prompt and generating tokens; between turns or while you read output, usage drops.
- With unified memory (UMA), some model cost may not show up as a huge process RSS: the GPU also holds part of the working set.
Do not assume nothing is working just because htop stays calm: check t/s in llama-cli, llama-bench (§7), or a GPU monitor if you want to see graphics load.
Reference screenshot (same class of mini PC as the validated hardware; SSH + htop: llama.cpp around ~5 GiB RES and moderate CPU on one core—consistent with a non-huge model and GPU-bound ‑ngl):
AMD: amdgpu_pm_info and dri/N (not always dri/0)
Many snippets use /sys/kernel/debug/dri/0/amdgpu_pm_info. On Ryzen mini PCs with amdgpu, dri/0 often does not exist: the kernel exposes the GPU under a PCI BDF directory (0000:c4:00.0, …) and provides symlinks such as dri/1 or dri/128 into the same tree. If cat returns No such file or directory, inspect first:
mount | grep debugfs # expect debugfs on /sys/kernel/debug
ls -la /sys/kernel/debug/dri/
Then read amdgpu_pm_info using the N or PCI path that belongs to your AMDGPU (1 or 0000:…:….0 usually works):
sudo cat /sys/kernel/debug/dri/1/amdgpu_pm_info
# same content if 1 → 0000:c4:00.0:
# sudo cat /sys/kernel/debug/dri/0000:c4:00.0/amdgpu_pm_info
If the directory exists but amdgpu_pm_info is missing, your kernel may not export that node; try ls … | grep -i pm. That does not mean Vulkan is broken.
How to read it (sample text, idle mini PC): GPU Load: 0 % with VCN powered down matches idle. While llama-cli / llama-server runs a long ‑ngl job, run cat during generation: you should usually see Load > 0 % (the counter may not peg the iGPU). For a live view, radeontop is often easier (sudo apt install -y radeontop).
GFX Clocks and Power:
2800 MHz (MCLK)
800 MHz (SCLK)
...
GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %
VCN: Powered down
(Illustrative excerpt; clocks, millivolts, and watts vary with BIOS, governor, and workload.)
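If you want to watch the load from a script rather than eyeball the whole dump, the `GPU Load` line is easy to pull out with `awk`. Here it runs against sample text shaped like the excerpt above; live, you would pipe `sudo cat .../amdgpu_pm_info` in instead (with the `dri/N` path that matches your box, as discussed):

```shell
# Sample amdgpu_pm_info-style text (same shape as the excerpt above)
pm_info='GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %'

# Grab the value after "GPU Load:".
# Live:  sudo cat /sys/kernel/debug/dri/1/amdgpu_pm_info | awk -F': ' '/^GPU Load/ {print $2}'
load="$(printf '%s\n' "$pm_info" | awk -F': ' '/^GPU Load/ {print $2}')"
echo "GPU load: $load"
```

Wrapping the live variant in `watch -n1` gives a poor man's GPU monitor when `radeontop` is not installed.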
14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
When the mini PC runs GNOME and you want the full desktop from another machine on the same network (Windows, Mac, Linux), Ubuntu 24.04 usually ships RDP built in; you often do not need xrdp unless you want different behavior.
14.1 Enable on the mini PC
- Settings → System → Remote Desktop.
- Turn Remote Desktop on.
- Finish the assistant (password / auth as GNOME shows).
Underlying package: gnome-remote-desktop. If the toggle is missing or fails:
sudo apt update
sudo apt install --reinstall gnome-remote-desktop
Log out or reboot and open Settings again.
14.2 Connect from another machine
- Native RDP clients: Windows (Remote Desktop Connection / `mstsc`), macOS (Microsoft Remote Desktop from the App Store), Linux (e.g. Remmina, RDP protocol).
- Host: the Ubuntu box’s LAN IP (`hostname -I | awk '{print $1}'` on the mini PC).
14.3 Firewall (ufw)
If ufw is enabled:
sudo ufw allow 3389/tcp comment 'GNOME RDP'
sudo ufw status
14.4 If connection fails
On the Ubuntu host:
hostname -I
sudo ss -tlnp | grep 3389 || true
With Remote Desktop enabled, something should listen on 3389. Confirm the client is on the same LAN and that no AP isolation blocks client-to-client Wi‑Fi.
If GNOME/RDP misbehaves on Wayland, try the Ubuntu on Xorg session on the login screen and enable Remote Desktop again.
Security: exposing RDP to the public Internet without VPN/tunnel is a bad idea; keep it on a trusted LAN or behind VPN/WireGuard.
Final checklist
- [ ] BIOS: UMA / VRAM for iGPU adjusted if applicable.
- [ ] Vulkan OK: on desktop `vkcube`; on Ubuntu Server `vulkaninfo --summary` shows the GPU.
- [ ] User is in `render` and `video` (`id -nG`); if you ran `usermod`, you logged out or rebooted (an old shell session does not pick up new groups).
- [ ] `cmake -B build -DGGML_VULKAN=1` succeeds; build reaches 100 %.
- [ ] You can update `llama.cpp` (`git pull`, rebuild §6) and follow try model → systemd → Open WebUI when experimenting with new GGUFs (§7, Experimenting…).
- [ ] `llama-cli` shows the Vulkan device when loading the model.
- [ ] `llama-server` responds on `:8080`.
- [ ] Open WebUI on `:3000` with `http://host.docker.internal:8080/v1` and Direct connections off.
- [ ] You know the model does not browse or read GitHub from a URL alone; it may hallucinate capabilities (§10 No browsing or GitHub fetch).
- [ ] You know how to upgrade Open WebUI: `docker pull`, `stop`/`rm` the container, rerun the same `docker run` with the `open-webui` volume (§10).
- [ ] `systemd` service enabled if you want a persistent boot setup.
- [ ] You know how to switch models: after adding another `.gguf`, you update `-m` in `llama-web.service` (or in the manual command), run `sudo systemctl daemon-reload && sudo systemctl restart llama-web.service`, and reload Open WebUI.
- [ ] You can list your `.gguf` files (`ls`/`find`, §7) and measure throughput with `llama-bench` (§7) when comparing quantizations or `-ngl`.
- [ ] You can follow the unified playbook for Gemma 4 / Qwen Coder / DeepSeek Lite / Llama 3.1 (§7): download → `llama-cli` → `systemd` → `/v1/models` → Open WebUI.
- [ ] Remote desktop §14: RDP enabled in Settings, 3389 allowed in `ufw` if needed, smoke tested from another PC on the LAN.
Quick port reference
| Service | Port |
|---|---|
| llama-server | 8080 |
| Open WebUI | 3000 |
| Remote desktop (GNOME RDP) | 3389 TCP |
| Ollama (optional) | 11434 |
Closing thoughts
Running local inference on Ubuntu with Vulkan and an AMD iGPU is not a one-click setup, but it is worth it: a model that answers on your LAN, without routing every request through a third-party API, and with the freedom to swap GGUFs or quantizations when you need to.
The stack moves fast: llama.cpp, Ubuntu packages, and Hugging Face repos change over time. If a command or package name no longer matches this guide, cmake and apt errors usually point you in the right direction; double-check the project’s current docs.
Once the checklist is green, the natural next step is tuning -ngl, context size (-c), and the model until you get the quality-vs-tokens-per-second balance you want on your hardware.
This is the mini PC we used for the tests and validation in this guide: Minisforum UM760 Slim (Ryzen 5 7640HS, Radeon 760M), Ubuntu 24.04 LTS, plenty of DDR5 RAM and NVMe — the same box behind the llama-bench runs, llama-cli screenshots, Open WebUI examples, and the other reference captures. The photo is the actual machine (powered on, front panel as shown), not a marketing render.
Now go tinker: this walkthrough is rooted in Ryzen + iGPU, but the playbook travels—mini PCs (Minisforum, Beelink, ASUS ExpertCenter PN, ZOTAC ZBOX, modern Intel NUC-class boxes…), Mac mini / Mac Studio on Apple Silicon if that is your stack, or compact power boxes like NVIDIA DGX Spark when budget and goals match. Build llama.cpp (or your preferred runtime), stress GGUF quantizations, run llama-bench on your iron, and tune -ngl until the ceiling feels right. Share what you learn—a dev.to post, a blog, Mastodon, article comments, or whatever community you use; real numbers beat brochure claims every time.
One quiet takeaway: on your codebases the model usually helps more as a copilot you feed—a diff, a log slice, a trimmed README—than as an all-knowing reviewer from a bare URL or a polished persona. When the answer feels too slick without anything concrete in the prompt, the limit is rarely the mini PC: it is text-in, text-out with nobody else reading disk for you. §10 walks the receipts; day-to-day, you supply the ground truth.














Top comments (0)