<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Hermes Rodríguez</title>
    <description>The latest articles on DEV Community by Hermes Rodríguez (@hrodrig).</description>
    <link>https://dev.to/hrodrig</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1307171%2F7f243604-d498-4a13-87d1-7bf6401a5a8d.jpeg</url>
      <title>DEV Community: Hermes Rodríguez</title>
      <link>https://dev.to/hrodrig</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/hrodrig"/>
    <language>en</language>
    <item>
      <title>~21 tok/s Gemma 4 on a Ryzen mini PC: llama.cpp, Vulkan, and the messy truth about local chat</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 10 Apr 2026 12:46:26 +0000</pubDate>
      <link>https://dev.to/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</link>
      <guid>https://dev.to/hrodrig/21-toks-gemma-4-on-a-ryzen-mini-pc-llamacpp-vulkan-and-the-messy-truth-about-local-chat-m82</guid>
      <description>&lt;p&gt;Hands-on guide based on a real setup: &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, &lt;strong&gt;AMD Radeon 760M&lt;/strong&gt; (Ryzen iGPU), &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. 96 GiB), &lt;strong&gt;llama.cpp&lt;/strong&gt; built with &lt;strong&gt;GGML_VULKAN&lt;/strong&gt;, OpenAI-compatible API via &lt;strong&gt;llama-server&lt;/strong&gt;, &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker, and &lt;strong&gt;OpenCode&lt;/strong&gt; or &lt;strong&gt;VS Code&lt;/strong&gt; (§11) using the same API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who this is for:&lt;/strong&gt; if you buy (or plan to buy) a &lt;strong&gt;mini PC&lt;/strong&gt; or small tower with &lt;strong&gt;plenty of RAM and disk&lt;/strong&gt;, this walkthrough gets you to &lt;strong&gt;local inference&lt;/strong&gt; — GGUF weights on your box, chat and APIs on your LAN, without treating a cloud vendor as mandatory for every request. The documented path is &lt;strong&gt;AMD iGPU + Vulkan&lt;/strong&gt;; if your hardware differs, keep the &lt;strong&gt;Ubuntu → llama.cpp → weights → server&lt;/strong&gt; flow and adjust &lt;strong&gt;§5–§6&lt;/strong&gt; (deps and build) for your GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference hardware (validated while writing this guide):&lt;/strong&gt; &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; mini PC (&lt;em&gt;Device Type: MINI PC&lt;/em&gt; on the chassis label; vendor &lt;strong&gt;Minisforum&lt;/strong&gt; / &lt;strong&gt;Micro Computer (HK) Tech Limited&lt;/strong&gt;) with &lt;strong&gt;AMD Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M Graphics&lt;/strong&gt;, &lt;strong&gt;96 GiB&lt;/strong&gt; &lt;strong&gt;DDR5&lt;/strong&gt; RAM, &lt;strong&gt;~1 TiB&lt;/strong&gt; NVMe, &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;. This is not a minimum-requirements bar—it &lt;strong&gt;anchors&lt;/strong&gt; compile times, download comfort, and token throughput vs other CPUs, RAM, or disks. To &lt;strong&gt;verify memory type and size&lt;/strong&gt; on your box, see §3 (&lt;em&gt;Quick hardware inventory&lt;/em&gt;). A &lt;strong&gt;photo of the box&lt;/strong&gt; is at the &lt;strong&gt;end&lt;/strong&gt;, under Closing thoughts.&lt;/p&gt;

&lt;p&gt;Replace &lt;code&gt;YOUR_USER&lt;/code&gt;, model paths, and hostname as needed. If the machine is &lt;strong&gt;server-only&lt;/strong&gt; (no monitor), start with §4.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdjziq4fb01nlq7tlbzvx.png" alt="local LLM stack on Ubuntu — reference illustration" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Too long; didn’t read&lt;/em&gt; — one-minute skim before the full guide. Full table of contents →&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;What you’re building:&lt;/strong&gt; &lt;strong&gt;local&lt;/strong&gt; inference on &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; + &lt;strong&gt;Vulkan&lt;/strong&gt;, a &lt;strong&gt;GGUF&lt;/strong&gt; weights file, OpenAI-style API via &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;:8080&lt;/code&gt;&lt;/strong&gt;); optional &lt;strong&gt;Open WebUI&lt;/strong&gt; in Docker (&lt;strong&gt;&lt;code&gt;:3000&lt;/code&gt;&lt;/strong&gt;); &lt;strong&gt;OpenCode&lt;/strong&gt; and &lt;strong&gt;Visual Studio Code&lt;/strong&gt; can talk to the same &lt;strong&gt;&lt;code&gt;http://…:8080/v1&lt;/code&gt;&lt;/strong&gt; base URL as an OpenAI-compatible provider (&lt;strong&gt;§11&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shortest path:&lt;/strong&gt; &lt;strong&gt;BIOS/UMA&lt;/strong&gt; if relevant (§2) → deps + &lt;strong&gt;Vulkan&lt;/strong&gt; (§5) → build &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6) → download &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt; (§7: &lt;strong&gt;&lt;code&gt;wget --continue&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;; &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; for long SSH sessions) → smoke-test &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; → run &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; manually or under &lt;strong&gt;systemd&lt;/strong&gt; (§8–§9) → point &lt;strong&gt;Open WebUI&lt;/strong&gt; at the host (§10) → &lt;strong&gt;optional:&lt;/strong&gt; &lt;strong&gt;OpenCode&lt;/strong&gt; / &lt;strong&gt;VS Code&lt;/strong&gt; (§11).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tight RAM / OOM:&lt;/strong&gt; test as the same user the service runs as; match &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;; if it fails, drop &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt;) before chasing &lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; (full offload). Don’t &lt;strong&gt;enable&lt;/strong&gt; the unit until the GGUF is &lt;strong&gt;fully&lt;/strong&gt; downloaded.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;More models:&lt;/strong&gt; §7 covers &lt;strong&gt;Gemma 4&lt;/strong&gt;, &lt;strong&gt;Qwen Coder&lt;/strong&gt;, &lt;strong&gt;DeepSeek Lite&lt;/strong&gt;, &lt;strong&gt;Llama 3.1&lt;/strong&gt; (downloads, &lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt;&lt;/strong&gt;, quick tests).&lt;/li&gt;
&lt;li&gt;Swap in &lt;strong&gt;&lt;code&gt;YOUR_USER&lt;/code&gt;&lt;/strong&gt;, paths, and hostname; &lt;strong&gt;server-only&lt;/strong&gt; box → start at §4.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Table of contents
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;Links jump to headings on GitHub, Cursor, and most Markdown viewers. If a link does not match your renderer, search for the heading in the file.&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TL;DR&lt;/li&gt;
&lt;li&gt;1. Context and choices&lt;/li&gt;
&lt;li&gt;2. BIOS (before or right after installing Ubuntu)&lt;/li&gt;
&lt;li&gt;
3. Installing Ubuntu

&lt;ul&gt;
&lt;li&gt;Quick hardware inventory (optional)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

4. Ubuntu Server without a desktop (headless)

&lt;ul&gt;
&lt;li&gt;Installation&lt;/li&gt;
&lt;li&gt;Networking&lt;/li&gt;
&lt;li&gt;Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)&lt;/li&gt;
&lt;li&gt;Rest of this guide&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;5. Base dependencies and Vulkan check&lt;/li&gt;

&lt;li&gt;

6. Building llama.cpp with Vulkan

&lt;ul&gt;
&lt;li&gt;Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

7. GGUF models and paths

&lt;ul&gt;
&lt;li&gt;What GGUF is (name, role, trade-offs)&lt;/li&gt;
&lt;li&gt;Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)&lt;/li&gt;
&lt;li&gt;Where models live and how to list them&lt;/li&gt;
&lt;li&gt;Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)&lt;/li&gt;
&lt;li&gt;Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)&lt;/li&gt;
&lt;li&gt;Example: local Llama 3.1 8B Instruct Q8_0&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)&lt;/li&gt;
&lt;li&gt;Quick terminal test&lt;/li&gt;
&lt;li&gt;Adding or switching models&lt;/li&gt;
&lt;li&gt;Experimenting with more models: setup, testing, and limits&lt;/li&gt;
&lt;li&gt;One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)&lt;/li&gt;
&lt;li&gt;Common steps (every model swap)&lt;/li&gt;
&lt;li&gt;Reference table (repos + sample file)&lt;/li&gt;
&lt;li&gt;Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)&lt;/li&gt;
&lt;li&gt;Per-model quick test (right after download)&lt;/li&gt;
&lt;li&gt;Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)&lt;/li&gt;

&lt;li&gt;9. systemd service (start on boot)&lt;/li&gt;

&lt;li&gt;

10. Open WebUI with Docker (port 3000 → backend on 8080)

&lt;ul&gt;
&lt;li&gt;Connect Open WebUI to llama-server&lt;/li&gt;
&lt;li&gt;Chat up and running (example)&lt;/li&gt;
&lt;li&gt;No browsing or GitHub fetch: real limits (and confident wrong answers)&lt;/li&gt;
&lt;li&gt;Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed&lt;/li&gt;
&lt;li&gt;“Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)&lt;/li&gt;
&lt;li&gt;Updating Open WebUI (Docker)&lt;/li&gt;
&lt;li&gt;If you also run Ollama&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;

&lt;ul&gt;
&lt;li&gt;OpenCode&lt;/li&gt;
&lt;li&gt;Visual Studio Code&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04

&lt;ul&gt;
&lt;li&gt;12.1 Universe repository and packages&lt;/li&gt;
&lt;li&gt;12.2 LunarG repository (Vulkan SDK)&lt;/li&gt;
&lt;li&gt;12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc&lt;/li&gt;
&lt;li&gt;12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

13. Performance and models (rough guide)

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)&lt;/li&gt;
&lt;li&gt;AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

14. Remote desktop (Ubuntu 24.04 Desktop, LAN)

&lt;ul&gt;
&lt;li&gt;14.1 Enable on the mini PC&lt;/li&gt;
&lt;li&gt;14.2 Connect from another machine&lt;/li&gt;
&lt;li&gt;14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;14.4 If connection fails&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Final checklist&lt;/li&gt;

&lt;li&gt;Quick port reference&lt;/li&gt;

&lt;li&gt;Closing thoughts&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  1. Context and choices
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;Recommendation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;OS&lt;/td&gt;
&lt;td&gt;Ubuntu 24.04 LTS (desktop or server; server without a GUI saves RAM).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AMD iGPU&lt;/td&gt;
&lt;td&gt;Vulkan + Mesa is usually simpler than ROCm for llama.cpp inference.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Models&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;GGUF&lt;/strong&gt; format; Q4_K_M quantization (balance) or Q8_0 (higher quality, larger).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Engine&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; with &lt;code&gt;-DGGML_VULKAN=1&lt;/code&gt; uses the &lt;strong&gt;GPU&lt;/strong&gt; for layers (&lt;code&gt;-ngl&lt;/code&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Lots of RAM&lt;/td&gt;
&lt;td&gt;You can load large models in system RAM even if the iGPU has little dedicated VRAM; the BIOS can give the GPU a larger framebuffer (see §2).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Reference diagram (browser / container / host):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzdytd7gma9fyqudcg4v6.png" alt="Reference diagram (browser / container / host)" width="800" height="173"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frygeqyvo5foqvdwlgurs.png" alt="Illustration: browser and IDE → Open WebUI container → llama-server and GGUF on the host" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  2. BIOS (before or right after installing Ubuntu)
&lt;/h2&gt;

&lt;p&gt;On &lt;strong&gt;Minisforum&lt;/strong&gt; boxes (e.g. &lt;strong&gt;UM760 Slim&lt;/strong&gt;) with AMI BIOS and Ryzen:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Enter BIOS (&lt;strong&gt;Del&lt;/strong&gt;, &lt;strong&gt;F2&lt;/strong&gt;, or &lt;strong&gt;F7&lt;/strong&gt; on many systems).&lt;/li&gt;
&lt;li&gt;Typical path: &lt;strong&gt;Advanced → AMD CBS → NBIO Common Options → GFX Configuration&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Set &lt;strong&gt;UMA Frame Buffer Size&lt;/strong&gt; (or similar) from &lt;em&gt;Auto&lt;/em&gt; / 2 GiB to &lt;strong&gt;8 G&lt;/strong&gt; or &lt;strong&gt;16 G&lt;/strong&gt; if available.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Goal: give the iGPU more unified memory for model layers; with plenty of system RAM the trade-off is usually worth it.&lt;/p&gt;
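&lt;p&gt;To confirm how big a carve-out the driver actually received after a BIOS change, amdgpu exposes the framebuffer size (in bytes) under sysfs. A minimal sketch, assuming the usual Ubuntu 24.04 path; &lt;code&gt;bytes_to_gib&lt;/code&gt; is a tiny helper invented for this guide, not a system tool:&lt;/p&gt;

```shell
# bytes_to_gib: helper for this guide (not a system tool) that converts the
# byte count amdgpu reports into GiB for a quick sanity check.
bytes_to_gib() {
  LC_ALL=C awk -v b="$1" 'BEGIN { printf "%.1f\n", b / (1024 * 1024 * 1024) }'
}

# On a real box, feed it the sysfs value (card index may be card1, not card0):
#   bytes_to_gib "$(cat /sys/class/drm/card0/device/mem_info_vram_total)"
# Sample: a 16 GiB UMA carve-out reported as bytes
bytes_to_gib 17179869184   # prints 16.0
```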




&lt;h2&gt;
  
  
  3. Installing Ubuntu
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;strong&gt;third-party software&lt;/strong&gt; for graphics and Wi‑Fi if you use the graphical installer.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;minimal&lt;/strong&gt; install drops extra packages if the box is mainly an inference server.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Typical order of this guide (§4 and §10 are optional depending on your setup):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffstg5ebfb8y5i6pnmrmh.png" alt="Tipical installation steps" width="646" height="1250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick hardware inventory (optional)
&lt;/h3&gt;

&lt;p&gt;Before picking huge models and quantizations, check &lt;strong&gt;RAM&lt;/strong&gt;, &lt;strong&gt;disk on &lt;code&gt;/&lt;/code&gt;&lt;/strong&gt;, and whether the &lt;strong&gt;integrated GPU&lt;/strong&gt; shows up on the PCI bus (this does not replace a Vulkan test, but it sets expectations).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;lspci | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'vga|3d|display'&lt;/span&gt;
free &lt;span class="nt"&gt;-h&lt;/span&gt;
&lt;span class="nb"&gt;df&lt;/span&gt; &lt;span class="nt"&gt;-h&lt;/span&gt; /
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for in &lt;code&gt;lspci&lt;/code&gt;:&lt;/strong&gt; on &lt;strong&gt;Ryzen Phoenix / Hawk Point&lt;/strong&gt; boards you often see something like &lt;strong&gt;&lt;code&gt;VGA compatible controller: … Phoenix1&lt;/code&gt;&lt;/strong&gt; plus an AMD &lt;strong&gt;HDMI audio&lt;/strong&gt; line. The marketing name “Radeon 760M” may not appear verbatim; the real check is that an &lt;strong&gt;AMD VGA/Display&lt;/strong&gt; controller exists and that &lt;strong&gt;&lt;code&gt;vulkaninfo&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; see &lt;strong&gt;RADV&lt;/strong&gt; (§4–§5).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;free&lt;/code&gt;:&lt;/strong&gt; total and &lt;strong&gt;available&lt;/strong&gt; RAM tell you how large a GGUF you can keep &lt;strong&gt;comfortably&lt;/strong&gt; in memory alongside the OS.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;df&lt;/code&gt;:&lt;/strong&gt; each &lt;code&gt;.gguf&lt;/code&gt; costs whatever the card lists (e.g. ~8 GiB for an 8B Q8_0); leave headroom for updates, Docker, and rebuilds.&lt;/p&gt;
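&lt;p&gt;A rough way to budget that disk (and RAM) cost before downloading: parameters in billions times bits-per-weight, divided by 8, approximates the file size in GiB; tokenizer and metadata overhead are ignored, so treat it as a floor. &lt;code&gt;gguf_size_gib&lt;/code&gt; and the bpw figures below are ballpark assumptions for this guide, not exact specs:&lt;/p&gt;

```shell
# gguf_size_gib: rough .gguf size estimate used in this guide (not an exact
# spec): params (billions) * bits-per-weight / 8 is roughly GiB on disk.
gguf_size_gib() {
  LC_ALL=C awk -v p="$1" -v bpw="$2" 'BEGIN { printf "%.1f\n", p * bpw / 8 }'
}

gguf_size_gib 8 8.5   # 8B at Q8_0 (about 8.5 bpw)   -> prints 8.5
gguf_size_gib 8 4.8   # 8B at Q4_K_M (about 4.8 bpw) -> prints 4.8
```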

&lt;p&gt;&lt;strong&gt;DDR4 vs DDR5 (re-check RAM type):&lt;/strong&gt; data comes from firmware &lt;strong&gt;SMBIOS&lt;/strong&gt;. Install it with &lt;strong&gt;&lt;code&gt;sudo apt install -y dmidecode&lt;/code&gt;&lt;/strong&gt; if needed. &lt;strong&gt;Note:&lt;/strong&gt; some &lt;code&gt;dmidecode&lt;/code&gt; builds indent fields with &lt;strong&gt;spaces&lt;/strong&gt;, not tabs—an overly strict &lt;code&gt;grep&lt;/code&gt; can print &lt;strong&gt;nothing&lt;/strong&gt; even when DMI works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# One line per interesting field (tab- or space-indented)&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-iE&lt;/span&gt; &lt;span class="s1"&gt;'Locator|Size:|Type:|Speed:|Configured Memory Speed:'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that is still empty, dump the start of the table—some boards expose only a subset of fields:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dmidecode &lt;span class="nt"&gt;-t&lt;/span&gt; memory | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 120
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For each populated slot, &lt;strong&gt;&lt;code&gt;Type:&lt;/code&gt;&lt;/strong&gt; should read &lt;strong&gt;&lt;code&gt;DDR5&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;DDR4&lt;/code&gt;&lt;/strong&gt;, etc. All-&lt;strong&gt;&lt;code&gt;Unknown&lt;/code&gt;&lt;/strong&gt; or an empty dump may mean a &lt;strong&gt;locked&lt;/strong&gt; BIOS, a &lt;strong&gt;hypervisor&lt;/strong&gt; restriction, or firmware in need of an update—cross-check the &lt;strong&gt;mini PC spec sheet&lt;/strong&gt; or &lt;strong&gt;DIMM/SODIMM silkscreen/label&lt;/strong&gt;. &lt;strong&gt;Ryzen 7040&lt;/strong&gt; mobile (e.g. 7640HS) is usually &lt;strong&gt;DDR5-only&lt;/strong&gt; on recent kits; still verify through one of these paths.&lt;/p&gt;
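&lt;p&gt;To see what a healthy slot looks like before running the command on real firmware, here is the same field filter over a fabricated, space-indented sample dump (the values are illustrative only):&lt;/p&gt;

```shell
# The §3 field filter run over a made-up sample dump (space-indented, as some
# dmidecode builds emit), showing what a populated DDR5 slot reports.
sample='Memory Device
        Size: 48 GB
        Type: DDR5
        Speed: 5600 MT/s
        Locator: DIMM 0'
printf '%s\n' "$sample" | grep -iE 'Locator|Size:|Type:|Speed:'
```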




&lt;h2&gt;
  
  
  4. Ubuntu Server without a desktop (headless)
&lt;/h2&gt;

&lt;p&gt;When the mini PC only serves the model (SSH + browser on another machine), &lt;strong&gt;Ubuntu Server 24.04 LTS&lt;/strong&gt; saves RAM and attack surface by skipping GNOME and desktop services.&lt;/p&gt;

&lt;h3&gt;
  
  
  Installation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Download the &lt;strong&gt;Ubuntu Server&lt;/strong&gt; ISO from &lt;a href="https://ubuntu.com/download/server" rel="noopener noreferrer"&gt;ubuntu.com/download/server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;In the installer, enable &lt;strong&gt;OpenSSH&lt;/strong&gt; for remote administration.&lt;/li&gt;
&lt;li&gt;Create a normal user with &lt;code&gt;sudo&lt;/code&gt; (this guide assumes that user’s &lt;code&gt;$HOME&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;BIOS (§2) is configured the same as on a desktop.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Networking
&lt;/h3&gt;

&lt;p&gt;After first boot:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status ssh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open only what you need in the firewall (e.g. SSH, and later 8080/3000 if not using VPN only):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; ufw
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow OpenSSH
&lt;span class="c"&gt;# Optional: sudo ufw allow 8080/tcp &amp;amp;&amp;amp; sudo ufw allow 3000/tcp&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw &lt;span class="nb"&gt;enable&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Vulkan without a display (&lt;code&gt;vkcube&lt;/code&gt; not applicable)
&lt;/h3&gt;

&lt;p&gt;Server images have no display server by default: &lt;strong&gt;you cannot run &lt;code&gt;vkcube&lt;/code&gt;&lt;/strong&gt; unless you add a minimal GUI just for that test. To validate Vulkan from the console:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-tools
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 80
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What to look for:&lt;/strong&gt; besides the instance version (e.g. &lt;code&gt;Vulkan Instance Version: 1.4.x&lt;/code&gt;), the &lt;strong&gt;&lt;code&gt;Devices:&lt;/code&gt;&lt;/strong&gt; section should list &lt;strong&gt;your AMD GPU&lt;/strong&gt; (&lt;code&gt;deviceName&lt;/code&gt; like &lt;em&gt;Radeon …&lt;/em&gt;, &lt;code&gt;deviceType&lt;/code&gt; &lt;em&gt;INTEGRATED_GPU&lt;/em&gt; or &lt;em&gt;DISCRETE_GPU&lt;/em&gt;, &lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt; on AMD hardware).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world sample (trimmed):&lt;/strong&gt; you often see the instance and a long extension list first; &lt;code&gt;Devices:&lt;/code&gt; comes later. As a &lt;strong&gt;normal user&lt;/strong&gt; you may see &lt;strong&gt;only&lt;/strong&gt; a software device:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Vulkan Instance Version: 1.4.313
...
Devices:
========
GPU0:
    apiVersion         = 1.4.318
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …, 256 bits)
    driverName         = llvmpipe
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Same machine, but &lt;code&gt;sudo&lt;/code&gt; shows the Radeon:&lt;/strong&gt; if your user only gets &lt;code&gt;llvmpipe&lt;/code&gt; but &lt;strong&gt;root&lt;/strong&gt; sees e.g. &lt;strong&gt;GPU0&lt;/strong&gt; &lt;code&gt;AMD Radeon 760M Graphics (RADV PHOENIX)&lt;/code&gt; (&lt;code&gt;vendorID&lt;/code&gt; &lt;strong&gt;0x1002&lt;/strong&gt;, &lt;code&gt;INTEGRATED_GPU&lt;/code&gt;) &lt;strong&gt;and&lt;/strong&gt; &lt;strong&gt;GPU1&lt;/strong&gt; &lt;code&gt;llvmpipe&lt;/code&gt;, the kernel and Mesa are fine; your user lacks &lt;strong&gt;permission&lt;/strong&gt; on the DRM nodes (&lt;code&gt;/dev/dri/renderD*&lt;/code&gt;). You should &lt;strong&gt;not&lt;/strong&gt; run &lt;code&gt;llama-server&lt;/code&gt; as root long-term to “fix” Vulkan—fix group membership instead.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;groups&lt;/span&gt;                    &lt;span class="c"&gt;# should include render and video&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-l&lt;/span&gt; /dev/dri/
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; render,video &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out of the desktop session or reboot, then (tighter grep: a broad&lt;/span&gt;
&lt;span class="c"&gt;# GPU|deviceName|deviceType pattern may also match layer descriptions containing "GPU"):&lt;/span&gt;
vulkaninfo &lt;span class="nt"&gt;--summary&lt;/span&gt; 2&amp;gt;/dev/null | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s1"&gt;'^GPU[0-9]+:|^[[:space:]]+device(Name|Type)'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt; 30
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Expected output without &lt;code&gt;sudo&lt;/code&gt;&lt;/strong&gt; (RADV as &lt;strong&gt;GPU0&lt;/strong&gt;, &lt;code&gt;llvmpipe&lt;/code&gt; as an extra device):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM 20.1.2, 256 bits)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Typical “before” example:&lt;/strong&gt; if &lt;code&gt;groups&lt;/code&gt; &lt;strong&gt;does not&lt;/strong&gt; list &lt;code&gt;render&lt;/code&gt; or &lt;code&gt;video&lt;/code&gt;, and you only see entries like &lt;code&gt;adm cdrom sudo dip plugdev users lpadmin docker&lt;/code&gt;, that matches “Vulkan as your user = llvmpipe only; as root = RADV + llvmpipe”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After &lt;code&gt;usermod&lt;/code&gt;:&lt;/strong&gt; the command may print nothing, but &lt;strong&gt;your already-running session keeps the old group set&lt;/strong&gt;—&lt;code&gt;groups&lt;/code&gt; in the same shell will not change until you &lt;strong&gt;log out of the desktop&lt;/strong&gt; (or &lt;strong&gt;reboot&lt;/strong&gt;). Open a new terminal and check again; &lt;strong&gt;&lt;code&gt;id -nG&lt;/code&gt;&lt;/strong&gt; is a handy way to list all group names. For a quick test without logging out of the whole session: &lt;strong&gt;&lt;code&gt;newgrp render&lt;/code&gt;&lt;/strong&gt; (spawns a subshell with that group active; fine for testing only).&lt;/p&gt;

&lt;p&gt;On Ubuntu 24.04 the groups are usually &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt;. Once the new session includes them, &lt;code&gt;vulkaninfo&lt;/code&gt; &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;sudo&lt;/code&gt; should list the AMD device as well as &lt;code&gt;llvmpipe&lt;/code&gt;.&lt;/p&gt;
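&lt;p&gt;The membership check can be sketched as a tiny function; &lt;code&gt;check_groups&lt;/code&gt; is our name for it (not a standard command), written so you can dry-run it against a literal group list as well as the live &lt;code&gt;id -nG&lt;/code&gt; output:&lt;/p&gt;

```shell
# check_groups: verify each required group appears in a space-separated list.
# Helper written for this guide; pass "$(id -nG)" to check the live session.
check_groups() {
  have=$1; shift
  missing=""
  for g in "$@"; do
    case " $have " in
      *" $g "*) ;;                    # group present
      *) missing="$missing $g" ;;     # group absent
    esac
  done
  if [ -z "$missing" ]; then echo "OK"; else echo "missing:$missing"; fi
}

check_groups "adm sudo docker" render video   # prints: missing: render video
# After re-login, on your box:  check_groups "$(id -nG)" render video
```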

&lt;p&gt;A healthy summary often has the Radeon as &lt;strong&gt;GPU0&lt;/strong&gt; and llvmpipe as an extra entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GPU0:
    vendorID           = 0x1002
    deviceType         = PHYSICAL_DEVICE_TYPE_INTEGRATED_GPU
    deviceName         = AMD Radeon 760M Graphics (RADV PHOENIX)
    driverName         = radv
GPU1:
    deviceType         = PHYSICAL_DEVICE_TYPE_CPU
    deviceName         = llvmpipe (LLVM …)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Only &lt;code&gt;llvmpipe&lt;/code&gt; even as root:&lt;/strong&gt; then &lt;code&gt;llvmpipe&lt;/code&gt; / &lt;code&gt;PHYSICAL_DEVICE_TYPE_CPU&lt;/code&gt; is &lt;strong&gt;CPU-only&lt;/strong&gt; Vulkan (Mesa) and the iGPU is not in the Vulkan device list. Check &lt;code&gt;lspci -nn | grep -i vga&lt;/code&gt;, the &lt;strong&gt;&lt;code&gt;amdgpu&lt;/code&gt;&lt;/strong&gt; module, &lt;code&gt;mesa-vulkan-drivers&lt;/code&gt;, and BIOS. On very minimal servers the render stack may still need setup before Vulkan enumerates the chip.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rest of this guide
&lt;/h3&gt;

&lt;p&gt;Install the same packages as §5, build llama.cpp in §6, and use &lt;strong&gt;Open WebUI from another PC&lt;/strong&gt; at &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;. Docker + &lt;code&gt;llama-server&lt;/code&gt; does not require a graphical session on the server.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Base dependencies and Vulkan check
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt upgrade &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; build-essential cmake git libvulkan-dev vulkan-tools
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Confirm the GPU is visible:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;vkcube
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A window with a spinning cube should open. Close it when done.&lt;/p&gt;

&lt;p&gt;If &lt;strong&gt;vkcube&lt;/strong&gt; works but &lt;code&gt;vulkaninfo --summary&lt;/code&gt; as your user still shows only &lt;code&gt;llvmpipe&lt;/code&gt;, add the same &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; groups as in §4 (and log out/in).&lt;/p&gt;




&lt;h2&gt;
  
  
  6. Building llama.cpp with Vulkan
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;cmake&lt;/strong&gt; fails with &lt;em&gt;Could NOT find Vulkan&lt;/em&gt; or &lt;em&gt;missing: glslc&lt;/em&gt;, go to §12 (common on Ubuntu 24.04).&lt;/p&gt;

&lt;h3&gt;
  
  
  Update and rebuild &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Newer GGUF architectures&lt;/strong&gt; (Gemma 4, recent MoE builds, etc.) often need a &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Before blaming the weight file, update the tree and rebuild the &lt;strong&gt;same &lt;code&gt;build&lt;/code&gt;&lt;/strong&gt; folder (or wipe &lt;code&gt;build&lt;/code&gt; and rerun CMake if CMakeLists changed a lot):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
git pull
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; changes CMake heavily and linking fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After rebuilding, if you use &lt;strong&gt;§9&lt;/strong&gt;, restart so the service picks up new binaries: &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;. Check &lt;code&gt;journalctl -u llama-web.service -n 30 --no-pager&lt;/code&gt; if a GGUF is rejected.&lt;/p&gt;

&lt;p&gt;Useful binaries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-cli&lt;/code&gt; — terminal tests.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;build/bin/llama-server&lt;/code&gt; — HTTP API compatible with OpenAI-style clients.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  7. GGUF models and paths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What GGUF is (name, role, trade-offs)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;GGUF&lt;/strong&gt; (&lt;strong&gt;G&lt;/strong&gt;GML &lt;strong&gt;U&lt;/strong&gt;niversal &lt;strong&gt;F&lt;/strong&gt;ile &lt;strong&gt;F&lt;/strong&gt;ormat) is a &lt;strong&gt;single-file&lt;/strong&gt; container aimed at &lt;strong&gt;inference&lt;/strong&gt; with &lt;strong&gt;llama.cpp&lt;/strong&gt; and friends: it packs &lt;strong&gt;weights&lt;/strong&gt; in a tensor layout tuned for efficient loading, &lt;strong&gt;metadata&lt;/strong&gt;, and—in practice—what you need to &lt;strong&gt;tokenize&lt;/strong&gt; and &lt;strong&gt;run&lt;/strong&gt; the model without pulling in the full PyTorch/JAX training stack.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Why it matters here:&lt;/strong&gt; you download a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;, pass its path as &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; to &lt;code&gt;llama-cli&lt;/code&gt; / &lt;code&gt;llama-server&lt;/code&gt;, and the engine runs &lt;strong&gt;locally&lt;/strong&gt; (CPU, and in this guide &lt;strong&gt;Vulkan&lt;/strong&gt; on the GPU). You do not need the original framework runtime just to serve the converted file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Typical upsides:&lt;/strong&gt; &lt;strong&gt;one portable blob&lt;/strong&gt;; &lt;strong&gt;quantized&lt;/strong&gt; variants (Q4_K_M, Q8_0, IQ*, …) trade a bit of quality for &lt;strong&gt;disk / RAM / VRAM&lt;/strong&gt;; &lt;strong&gt;huge Hugging Face catalog&lt;/strong&gt; (community repos such as &lt;em&gt;TheBloke&lt;/em&gt;, &lt;em&gt;bartowski&lt;/em&gt;, Unsloth, …); first-class support in &lt;strong&gt;llama.cpp&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limitations:&lt;/strong&gt; &lt;strong&gt;quality&lt;/strong&gt; depends on &lt;strong&gt;quant level&lt;/strong&gt; and conversion tooling; &lt;strong&gt;brand-new&lt;/strong&gt; architectures may need a &lt;strong&gt;fresh llama.cpp build&lt;/strong&gt; or lack mature GGUFs yet; &lt;strong&gt;training / fine-tuning&lt;/strong&gt; usually happens elsewhere, then you &lt;strong&gt;convert/export&lt;/strong&gt; to GGUF; it is not a full cloud SaaS substitute without extra plumbing.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The rest of this section assumes a &lt;strong&gt;ready-to-run GGUF&lt;/strong&gt;; paths and downloads always point at that file.&lt;/p&gt;

&lt;h3&gt;
  
  
  Quant labels in filenames (Q2, Q4, Q8, suffixes like &lt;code&gt;_K_M&lt;/code&gt;, IQ…)
&lt;/h3&gt;

&lt;p&gt;Repos list GGUFs with prefixes like &lt;strong&gt;Q2_&lt;/strong&gt;, &lt;strong&gt;Q3_&lt;/strong&gt;, &lt;strong&gt;Q4_&lt;/strong&gt;, &lt;strong&gt;Q5_&lt;/strong&gt;, &lt;strong&gt;Q6_&lt;/strong&gt;, &lt;strong&gt;Q8_&lt;/strong&gt; and cousins (&lt;strong&gt;IQ2_&lt;/strong&gt;, &lt;strong&gt;IQ3_&lt;/strong&gt;, …). The naming does not follow one single standard, but &lt;strong&gt;in practice&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;Q&lt;/strong&gt; and &lt;strong&gt;number&lt;/strong&gt; hint at &lt;strong&gt;quantization depth&lt;/strong&gt;—roughly how many &lt;strong&gt;bits&lt;/strong&gt; are used for weights (&lt;strong&gt;simplified&lt;/strong&gt;). &lt;strong&gt;Lower&lt;/strong&gt; → &lt;strong&gt;smaller&lt;/strong&gt; file, less &lt;strong&gt;RAM/VRAM&lt;/strong&gt;, sometimes &lt;strong&gt;more&lt;/strong&gt; quality loss; &lt;strong&gt;higher&lt;/strong&gt; (e.g. &lt;strong&gt;Q8&lt;/strong&gt;) → heavier and often closer to “full” model behavior.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Suffixes&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;_K_M&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_S&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;_K_L&lt;/code&gt;&lt;/strong&gt;, … are &lt;strong&gt;llama.cpp k-quant&lt;/strong&gt; schemes: they &lt;strong&gt;mix&lt;/strong&gt; layers/blocks at different precisions to balance quality vs size—it is not “literally 4-bit everything.”&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;IQ&lt;/strong&gt; (&lt;em&gt;imatrix&lt;/em&gt; / importance-weighted) lines aim for &lt;strong&gt;aggressive&lt;/strong&gt; compression while protecting weights that matter most for output quality.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For this guide:&lt;/strong&gt; &lt;strong&gt;Q4_K_M&lt;/strong&gt; is a common &lt;strong&gt;sweet spot&lt;/strong&gt; for &lt;strong&gt;disk&lt;/strong&gt;, &lt;strong&gt;memory&lt;/strong&gt;, and &lt;strong&gt;quality&lt;/strong&gt;; &lt;strong&gt;Q8_0&lt;/strong&gt;-class files if you favor quality and have RAM to spare. If names feel overwhelming, sort by &lt;strong&gt;MiB/GiB&lt;/strong&gt; under the repo’s &lt;em&gt;Files&lt;/em&gt; tab and pick the largest file that &lt;strong&gt;fits&lt;/strong&gt; your machine comfortably.&lt;/li&gt;
&lt;/ul&gt;
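
&lt;p&gt;To sanity-check whether a quant will fit before downloading, a rough back-of-envelope helps: &lt;strong&gt;params × effective bits-per-weight ÷ 8&lt;/strong&gt;. The bits-per-weight values below are approximations that vary per model family; 8.5 and 5.4 roughly reproduce the file sizes shown in the bench tables later in this guide:&lt;br&gt;
&lt;/p&gt;

```shell
# Rough GGUF size: params * effective bits-per-weight / 8, in GiB.
# Effective bpw is an approximation (quant scales and mixed blocks add overhead).
estimate_gib() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * 1e9 * b / 8 / 1024^3 }'
}
estimate_gib 8.03 8.5    # Llama 8B at ~Q8_0   -> prints 7.9
estimate_gib 25.23 5.4   # 26B MoE at ~Q4_K_M  -> prints 15.9
```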

&lt;p&gt;&lt;strong&gt;Hugging Face CLI (&lt;code&gt;huggingface-cli&lt;/code&gt;):&lt;/strong&gt; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; ships an &lt;em&gt;externally managed&lt;/em&gt; system Python (&lt;strong&gt;PEP 668&lt;/strong&gt;), so &lt;strong&gt;&lt;code&gt;python3 -m pip install …&lt;/code&gt; fails&lt;/strong&gt; with &lt;code&gt;externally-managed-environment&lt;/code&gt;. Prefer a small &lt;strong&gt;virtualenv&lt;/strong&gt; for this tool. This guide uses &lt;strong&gt;&lt;code&gt;$HOME/.venv/huggingface&lt;/code&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Install &lt;strong&gt;&lt;code&gt;python3-venv&lt;/code&gt;&lt;/strong&gt; and create the venv &lt;strong&gt;once&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Run &lt;strong&gt;&lt;code&gt;source …/bin/activate&lt;/code&gt;&lt;/strong&gt; before &lt;code&gt;pip&lt;/code&gt; / &lt;code&gt;huggingface-cli&lt;/code&gt;, or call &lt;strong&gt;&lt;code&gt;"$HOME/.venv/huggingface/bin/huggingface-cli"&lt;/code&gt;&lt;/strong&gt; directly.&lt;/li&gt;
&lt;li&gt;Avoid &lt;strong&gt;&lt;code&gt;--break-system-packages&lt;/code&gt;&lt;/strong&gt; unless you understand the risk.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alternative:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;pipx install 'huggingface_hub[cli]'&lt;/code&gt;&lt;/strong&gt; (after &lt;strong&gt;&lt;code&gt;sudo apt install pipx&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;pipx ensurepath&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Use one consistent directory (avoid mixing &lt;code&gt;~/models&lt;/code&gt; and &lt;code&gt;llama.cpp/models&lt;/code&gt; by mistake):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Where models live and how to list them
&lt;/h3&gt;

&lt;p&gt;llama.cpp has &lt;strong&gt;no&lt;/strong&gt; built-in model catalog: a model is a &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; file&lt;/strong&gt;. You always pass the path with &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (absolute paths are best in &lt;code&gt;systemd&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;List the usual folder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;/&lt;span class="k"&gt;*&lt;/span&gt;.gguf 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If that prints nothing, you may still have GGUFs elsewhere (Downloads, etc.).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Search under your home (limited depth, faster):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Sort by size:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;find &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt; &lt;span class="nt"&gt;-maxdepth&lt;/span&gt; 5 &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s1"&gt;'*.gguf'&lt;/span&gt; 2&amp;gt;/dev/null &lt;span class="nt"&gt;-printf&lt;/span&gt; &lt;span class="s1"&gt;'%s\t%p\n'&lt;/span&gt; | &lt;span class="nb"&gt;sort&lt;/span&gt; &lt;span class="nt"&gt;-n&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Open WebUI does &lt;strong&gt;not&lt;/strong&gt; enumerate “every GGUF on disk”. What matters is whichever file &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; loads via &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;. To “use another model”, change that &lt;code&gt;-m&lt;/code&gt; (and restart the process or service §9), or run &lt;strong&gt;another&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; on &lt;strong&gt;another&lt;/strong&gt; port (advanced; not detailed here).&lt;/p&gt;

&lt;p&gt;Generic example (swap the URL for the file link under the repo’s &lt;em&gt;Files&lt;/em&gt; tab on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/model-name.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/ORG/REPO/resolve/main/file.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Concrete example: Gemma 4 26B A4B Instruct (GGUF, bartowski)
&lt;/h3&gt;

&lt;p&gt;Recent quantized model (&lt;strong&gt;Apache 2.0&lt;/strong&gt;), &lt;strong&gt;Gemma 4&lt;/strong&gt; / MoE architecture; a good fit for machines with &lt;strong&gt;lots of RAM&lt;/strong&gt; (e.g. ~96 GiB). Full file list and sizes: &lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Reasonable disk/RAM use: &lt;strong&gt;Q4_K_M&lt;/strong&gt; (~17 GiB per the model card). Maximum quality in this repo: &lt;strong&gt;Q8_0&lt;/strong&gt; (~27 GiB).&lt;/p&gt;
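
&lt;p&gt;That “largest quant that fits” choice can be sketched as a tiny helper (the names and budgets below are illustrative; the sizes come from the model card above):&lt;br&gt;
&lt;/p&gt;

```shell
# Pick the largest quant whose size (GiB) fits a memory budget.
# Expects name:size pairs sorted ascending by size; values are examples.
pick_quant() {
  budget=$1; shift
  printf '%s\n' "$@" | awk -v b="$budget" -F: '$2 <= b { best = $1 } END { print best }'
}
pick_quant 24 "Q4_K_M:17" "Q8_0:27"   # prints Q4_K_M (27 GiB would not fit)
pick_quant 32 "Q4_K_M:17" "Q8_0:27"   # prints Q8_0
```

&lt;p&gt;Note the budget is not your total RAM: leave headroom for the &lt;strong&gt;KV cache&lt;/strong&gt;, the OS, and anything else running.&lt;/p&gt;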

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; you need a &lt;strong&gt;recent llama.cpp&lt;/strong&gt; with Gemma 4 support (before building: &lt;code&gt;cd llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt;). If loading the GGUF reports architecture or tokenizer errors, update and rebuild (§6).&lt;/p&gt;

&lt;p&gt;Recommended download (Q4_K_M):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Higher-quality option (Q8_0):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q8_0.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Equivalent using &lt;a href="https://huggingface.co/docs/huggingface_hub/guides/cli" rel="noopener noreferrer"&gt;huggingface-cli&lt;/a&gt; (handy for resumable downloads):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;On Hugging Face the model is tagged &lt;strong&gt;Image-Text-to-Text&lt;/strong&gt;; for text-only chat, &lt;code&gt;llama-server&lt;/code&gt; / Open WebUI usually work with the GGUF and embedded template. If message formatting breaks, check the &lt;em&gt;Prompt format&lt;/em&gt; section on the model card.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;resolve/main/...&lt;/code&gt; URLs can break if files are renamed; if so, open the repo and copy the &lt;em&gt;download&lt;/em&gt; link for the exact &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
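
&lt;p&gt;The &lt;code&gt;resolve/main&lt;/code&gt; links above follow one pattern, so a small helper can compose them (assuming the file keeps its name in the repo; verify against the &lt;em&gt;Files&lt;/em&gt; tab):&lt;br&gt;
&lt;/p&gt;

```shell
# Compose a Hugging Face direct-download URL from ORG/REPO and a filename.
hf_url() {
  printf 'https://huggingface.co/%s/resolve/main/%s?download=true\n' "$1" "$2"
}
hf_url bartowski/google_gemma-4-26B-A4B-it-GGUF google_gemma-4-26B-A4B-it-Q4_K_M.gguf
```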

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; when running &lt;code&gt;llama-cli&lt;/code&gt; or &lt;code&gt;llama-server&lt;/code&gt;, use the real path to the &lt;code&gt;.gguf&lt;/code&gt; (absolute or relative to your current working directory).&lt;/p&gt;

&lt;h3&gt;
  
  
  Advanced example: Kimi K2 Instruct 0905 (Unsloth, split GGUF)
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;very large&lt;/strong&gt; MoE (~32 B activated params / 1 T total per the model card). Community GGUFs: &lt;a href="https://huggingface.co/unsloth/Kimi-K2-Instruct-0905-GGUF" rel="noopener noreferrer"&gt;unsloth/Kimi-K2-Instruct-0905-GGUF&lt;/a&gt;. Run guide and flags: &lt;a href="https://docs.unsloth.ai/basics/kimi-k2" rel="noopener noreferrer"&gt;Unsloth — Kimi K2&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware warning:&lt;/strong&gt; Unsloth’s README recommends &lt;strong&gt;≥ 128 GB unified RAM&lt;/strong&gt; even for “small” quants. Boxes in the ~64–80 GiB range may &lt;strong&gt;fail to load&lt;/strong&gt;, run &lt;strong&gt;very slowly&lt;/strong&gt;, or thrash &lt;strong&gt;swap&lt;/strong&gt;—treat it as an experiment (see §7 &lt;em&gt;Experimenting with more models&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hugging Face:&lt;/strong&gt; access may be &lt;strong&gt;gated&lt;/strong&gt;; sign in, accept terms on the model page, and use &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt; if required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Shards:&lt;/strong&gt; each quantization lives in a folder (&lt;code&gt;UD-TQ1_0/&lt;/code&gt;, &lt;code&gt;UD-IQ1_S/&lt;/code&gt;, &lt;code&gt;IQ4_XS/&lt;/code&gt;, …) with files like &lt;code&gt;…-00001-of-00006.gguf&lt;/code&gt; and so on. Download &lt;strong&gt;every&lt;/strong&gt; &lt;code&gt;.gguf&lt;/code&gt; in &lt;strong&gt;that&lt;/strong&gt; folder. For &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; must point at the &lt;strong&gt;first&lt;/strong&gt; shard (&lt;code&gt;…-00001-of-….gguf&lt;/code&gt;); current &lt;code&gt;llama.cpp&lt;/code&gt; loaders pick up sibling shards in the same directory.&lt;/p&gt;
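
&lt;p&gt;A quick way to script that completeness check before pointing &lt;code&gt;-m&lt;/code&gt; at the first shard (the demo creates dummy files in a temp dir so it runs anywhere; on a real box set &lt;code&gt;dir&lt;/code&gt; and &lt;code&gt;prefix&lt;/code&gt; to your quant folder and filenames):&lt;br&gt;
&lt;/p&gt;

```shell
# Verify all NNNNN-of-TOTAL shards are present before loading the first one.
# Demo data is synthetic (shard 4 is deliberately missing from a temp dir).
dir=$(mktemp -d)
total=6
prefix="demo-UD-TQ1_0"
for i in 1 2 3 5 6; do
  : > "$dir/$prefix-$(printf '%05d' "$i")-of-$(printf '%05d' "$total").gguf"
done

missing=0
i=1
while [ "$i" -le "$total" ]; do
  f="$dir/$prefix-$(printf '%05d' "$i")-of-$(printf '%05d' "$total").gguf"
  if [ ! -f "$f" ]; then
    echo "missing: $(basename "$f")"
    missing=1
  fi
  i=$((i + 1))
done
if [ "$missing" -eq 0 ]; then
  echo "all shards present"
fi
rm -rf "$dir"
```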

&lt;p&gt;Download &lt;strong&gt;one&lt;/strong&gt; folder (example &lt;strong&gt;UD-TQ1_0&lt;/strong&gt;, six parts; confirm names under &lt;em&gt;Files&lt;/em&gt; on Hugging Face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
huggingface-cli login    &lt;span class="c"&gt;# if token or gated access is required&lt;/span&gt;

&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
huggingface-cli download unsloth/Kimi-K2-Instruct-0905-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"UD-TQ1_0/*.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Other folders in the same repo are other quants (more disk / more quality). Pick based on &lt;strong&gt;free disk&lt;/strong&gt; and &lt;strong&gt;RAM&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before loading: &lt;strong&gt;&lt;code&gt;git pull&lt;/code&gt;&lt;/strong&gt; and rebuild &lt;strong&gt;llama.cpp&lt;/strong&gt; (§6). Short smoke test:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/kimi-k2-0905/UD-TQ1_0/Kimi-K2-Instruct-0905-UD-TQ1_0-00001-of-00006.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say hi in one sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;; on architecture/tokenizer errors, update and rebuild. For &lt;strong&gt;§9&lt;/strong&gt; / Open WebUI, &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; uses the same path to the &lt;strong&gt;first&lt;/strong&gt; shard; read the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from &lt;code&gt;/v1/models&lt;/code&gt; via &lt;code&gt;curl&lt;/code&gt; once &lt;code&gt;llama-server&lt;/code&gt; is up for &lt;em&gt;Model IDs&lt;/em&gt;.&lt;/p&gt;
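
&lt;p&gt;Reading that &lt;code&gt;id&lt;/code&gt; can be scripted too. The JSON below is a hand-written sample of the OpenAI-style response shape (an assumption, not captured output), and the commented &lt;code&gt;curl&lt;/code&gt; line shows the live call:&lt;br&gt;
&lt;/p&gt;

```shell
# Extract the model "id" from an OpenAI-style /v1/models response.
# Sample JSON mimics llama-server's shape; it is not real captured output.
response='{"object":"list","data":[{"id":"gemma-4-26B-A4B-it-Q4_K_M","object":"model"}]}'
# Live version (adjust host/port to your llama-server):
#   response=$(curl -s http://127.0.0.1:8080/v1/models)
printf '%s\n' "$response" | sed -n 's/.*"id":"\([^"]*\)".*/\1/p'
```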

&lt;h3&gt;
  
  
  Example: local Llama 3.1 8B Instruct Q8_0
&lt;/h3&gt;

&lt;p&gt;If you already have e.g. &lt;strong&gt;&lt;code&gt;$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf&lt;/code&gt;&lt;/strong&gt; (~8 GiB on disk), &lt;strong&gt;replace&lt;/strong&gt; every &lt;code&gt;-m&lt;/code&gt; path in this guide with yours. &lt;strong&gt;Q8_0&lt;/strong&gt; favors quality over speed; for higher &lt;strong&gt;tok/s&lt;/strong&gt; on an iGPU, try a &lt;strong&gt;Q4_K_M&lt;/strong&gt; in the same model family.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;code&gt;llama-bench&lt;/code&gt;: measure throughput (tokens/s)
&lt;/h3&gt;

&lt;p&gt;Use this to compare &lt;strong&gt;the same machine&lt;/strong&gt; with different &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, different GGUFs, or different builds (CPU vs Vulkan), without UI noise.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Verify the binary&lt;/strong&gt; (size and date hint at whether the last rebuild refreshed it):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-lh&lt;/span&gt; build/bin/llama-bench
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;If it is &lt;strong&gt;missing&lt;/strong&gt;, rebuild the project (§6); most full builds already include &lt;code&gt;llama-bench&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Flags&lt;/strong&gt; change across versions—always start from help:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="nt"&gt;--help&lt;/span&gt; | less
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Minimal example&lt;/strong&gt; (swap the path):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;: path to the &lt;code&gt;.gguf&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;: GPU layers; many builds accept &lt;strong&gt;&lt;code&gt;999&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; as “as many as possible”. If rejected, try &lt;strong&gt;&lt;code&gt;35&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;45&lt;/code&gt;&lt;/strong&gt;, etc., and increase until it breaks or slows down.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;: generated tokens per benchmark run (tune for longer runs).&lt;/li&gt;
&lt;/ul&gt;
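
&lt;p&gt;That &lt;code&gt;-ngl&lt;/code&gt; trial-and-error can be wrapped in a sweep. In this sketch &lt;code&gt;BENCH&lt;/code&gt; defaults to &lt;code&gt;echo&lt;/code&gt; so it dry-runs anywhere; point it at &lt;code&gt;./build/bin/llama-bench&lt;/code&gt; for real measurements:&lt;br&gt;
&lt;/p&gt;

```shell
# Sweep -ngl with the same model and -n for fair comparisons.
# BENCH defaults to echo (dry run); override with the real binary:
#   BENCH=./build/bin/llama-bench
BENCH=${BENCH:-echo}
MODEL="${MODEL:-$HOME/models/Llama-3.1-8B-Instruct-Q8_0.gguf}"
for ngl in 20 35 45 999; do
  $BENCH -m "$MODEL" -ngl "$ngl" -n 128
done
```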

&lt;ol start="5"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reading output:&lt;/strong&gt; you usually see &lt;em&gt;prompt processing&lt;/em&gt; vs &lt;em&gt;generation&lt;/em&gt; tok/s. If numbers are tiny and logs show &lt;strong&gt;no&lt;/strong&gt; Vulkan / &lt;code&gt;ggml_vulkan&lt;/code&gt;, the binary might lack &lt;code&gt;GGML_VULKAN&lt;/code&gt;, or &lt;code&gt;/dev/dri&lt;/code&gt; permissions were wrong at build/run time (§4).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Fair comparisons:&lt;/strong&gt; same &lt;code&gt;llama-bench&lt;/code&gt; build, same model, same &lt;code&gt;-n&lt;/code&gt;, only change &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; or the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Sample real output&lt;/strong&gt; (same command as above; &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt;, &lt;strong&gt;Radeon 760M RADV&lt;/strong&gt;, &lt;strong&gt;Llama 3.1 8B Instruct Q8_0&lt;/strong&gt;; numbers shift with BIOS, thermals, quantization, and llama.cpp revision):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           pp512 |        235.96 ± 0.19 |
| llama 8B Q8_0                  |   7.95 GiB |     8.03 B | Vulkan     | 999 |           tg128 |          9.80 ± 0.00 |

build: 4d688f9eb (8016)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;ggml_vulkan&lt;/code&gt;&lt;/strong&gt; lines show &lt;strong&gt;one&lt;/strong&gt; Vulkan device and that the bench is on &lt;strong&gt;RADV&lt;/strong&gt; (not &lt;code&gt;llvmpipe&lt;/code&gt; only). Errors or zero devices → revisit §4–§5.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;: prompt processing — tok/s for a ~512-token prefill; usually &lt;strong&gt;higher&lt;/strong&gt; than generation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;: token generation — tok/s while emitting &lt;strong&gt;128&lt;/strong&gt; output tokens; closest bench metric to “reply speed” in chat. Here ≈&lt;strong&gt;9.8 t/s&lt;/strong&gt; for Q8_0 on this iGPU.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; line is your llama.cpp &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; commit; it changes after &lt;code&gt;git pull&lt;/code&gt; + rebuild.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Another sample&lt;/strong&gt; (&lt;strong&gt;same mini PC class&lt;/strong&gt;, &lt;strong&gt;Gemma 4 26B&lt;/strong&gt; Instruct &lt;strong&gt;Q4_K_M&lt;/strong&gt; — the model this guide uses in many examples):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-bench &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-n&lt;/span&gt; 128
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           pp512 |        239.04 ± 1.97 |
| gemma4 ?B Q4_K - Medium        |  15.85 GiB |    25.23 B | Vulkan     | 999 |           tg128 |         20.94 ± 0.02 |

build: d12cc3d1c (8720)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;&lt;code&gt;model&lt;/code&gt;&lt;/strong&gt; column is &lt;strong&gt;unreliable&lt;/strong&gt; on some &lt;code&gt;llama-bench&lt;/code&gt; builds: you may see &lt;strong&gt;&lt;code&gt;gemma4 ?B&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;gemma4 7B&lt;/code&gt;&lt;/strong&gt;, or similar &lt;strong&gt;even for Gemma 4 26B A4B&lt;/strong&gt; GGUFs. Trust &lt;strong&gt;size&lt;/strong&gt; (~&lt;strong&gt;15.85 GiB&lt;/strong&gt;), &lt;strong&gt;params&lt;/strong&gt; (~&lt;strong&gt;25.23 B&lt;/strong&gt;), and your &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; path to &lt;code&gt;…26B…Q4_K_M.gguf&lt;/code&gt;: &lt;em&gt;llama-bench mislabels the first column; this run is Gemma 4 &lt;strong&gt;26B&lt;/strong&gt; Q4_K_M&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;What this run says:&lt;/strong&gt; with &lt;strong&gt;Vulkan&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt; 999&lt;/strong&gt;, expect on the order of &lt;strong&gt;~239 tok/s&lt;/strong&gt; for &lt;strong&gt;prefill&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;pp512&lt;/code&gt;&lt;/strong&gt;) and &lt;strong&gt;~21 tok/s&lt;/strong&gt; for &lt;strong&gt;generation&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;tg128&lt;/code&gt;&lt;/strong&gt;). That &lt;strong&gt;~21 t/s&lt;/strong&gt; is the most useful single number for “raw” reply speed (no Open WebUI overhead, no long reasoning block, no huge prompts); real chat often lands near this ballpark or a bit lower.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Other GGUFs&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;ngl&lt;/code&gt;&lt;/strong&gt;, or &lt;strong&gt;&lt;code&gt;build:&lt;/code&gt;&lt;/strong&gt; revisions will move &lt;strong&gt;&lt;code&gt;tg*&lt;/code&gt;&lt;/strong&gt; a lot; record your own table after major changes.&lt;/p&gt;
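&lt;p&gt;A lightweight way to do that: append each run to a dated history file and diff it after rebuilds. A minimal sketch (the log path and the sample result line are illustrative, not part of the original setup):&lt;/p&gt;

```shell
# Append each result to a dated history file so build-to-build tg/pp
# changes are easy to diff. LOG path and RESULT line are illustrative.
LOG="${LOG:-bench-history.md}"
RESULT='| gemma4 26B Q4_K - Medium | 15.85 GiB | 25.23 B | Vulkan | 999 | tg128 | 20.94 |'
printf '\n## %s\n%s\n' "$(date -u +%Y-%m-%dT%H:%MZ)" "$RESULT" | tee -a "$LOG"
tail -n 1 "$LOG"
```

In practice you would pipe real &lt;code&gt;llama-bench&lt;/code&gt; output into the same &lt;code&gt;tee -a&lt;/code&gt; instead of the hard-coded sample line.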

&lt;h3&gt;
  
  
  Quick terminal test
&lt;/h3&gt;

&lt;p&gt;From the &lt;code&gt;llama.cpp&lt;/code&gt; directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Gemma 4 and on-screen reasoning (&lt;code&gt;[Start thinking]&lt;/code&gt; … &lt;code&gt;[End thinking]&lt;/code&gt;):&lt;/strong&gt; many &lt;strong&gt;Instruct&lt;/strong&gt; GGUFs emit a “thinking” block before the final answer. On a &lt;strong&gt;recent &lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;code&gt;--help&lt;/code&gt; normally documents (verify with &lt;code&gt;./build/bin/llama-cli --help | grep -iE 'reason|think|template'&lt;/code&gt;):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-rea, --reasoning on|off|auto&lt;/code&gt;&lt;/strong&gt; — default &lt;strong&gt;&lt;code&gt;auto&lt;/code&gt;&lt;/strong&gt; (template decides). For &lt;strong&gt;clean screenshots&lt;/strong&gt;, use &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (short &lt;strong&gt;&lt;code&gt;-rea off&lt;/code&gt;&lt;/strong&gt; if your build prints it).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-budget N&lt;/code&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;code&gt;0&lt;/code&gt;&lt;/strong&gt; ends the thinking block immediately; &lt;strong&gt;&lt;code&gt;-1&lt;/code&gt;&lt;/strong&gt; is unrestricted. Pair with &lt;strong&gt;&lt;code&gt;off&lt;/code&gt;&lt;/strong&gt; if needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--chat-template-kwargs STRING&lt;/code&gt;&lt;/strong&gt; — JSON for the template parser (e.g. &lt;strong&gt;&lt;code&gt;'{"enable_thinking": false}'&lt;/code&gt;&lt;/strong&gt; in bash with outer single quotes).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;--reasoning-format FORMAT&lt;/code&gt;&lt;/strong&gt; — tag handling / extraction (DeepSeek-style paths); &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; is usually enough for Gemma in interactive CLI.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Screenshot-friendly example (same command as above + reasoning disabled):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one sentence what Linux is."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-cnv&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--reasoning&lt;/span&gt; off
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Reference run&lt;/strong&gt; (validated hardware in the intro; &lt;strong&gt;no&lt;/strong&gt; &lt;code&gt;[Start thinking]&lt;/code&gt; block; &lt;strong&gt;t/s&lt;/strong&gt; are indicative):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxt1nfyvk4sjdl30fchvu.png" alt="llama-cli: Gemma 4 26B Q4_K_M with --reasoning off, one-sentence answer and prompt/generation t/s." width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also export the env vars mentioned in &lt;code&gt;--help&lt;/code&gt; (&lt;strong&gt;&lt;code&gt;LLAMA_ARG_REASONING&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;LLAMA_ARG_THINK_BUDGET&lt;/code&gt;&lt;/strong&gt;, …) if you prefer not to repeat flags.&lt;/p&gt;
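&lt;p&gt;If you prefer the env-var route, a shell-profile snippet along these lines works; the variable names are the ones quoted from &lt;code&gt;--help&lt;/code&gt; above, so verify them against your build before relying on them:&lt;/p&gt;

```shell
# Disable on-screen reasoning for every llama.cpp tool started from this shell.
# Variable names follow the --help text quoted above; confirm on your build.
export LLAMA_ARG_REASONING=off
export LLAMA_ARG_THINK_BUDGET=0
echo "reasoning=$LLAMA_ARG_REASONING budget=$LLAMA_ARG_THINK_BUDGET"
```

Put the two &lt;code&gt;export&lt;/code&gt; lines in &lt;code&gt;~/.bashrc&lt;/code&gt; (or the systemd unit's &lt;code&gt;Environment=&lt;/code&gt;) to make them persistent.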

&lt;p&gt;For &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; (§8–§9), add the same switches to &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--reasoning-budget 0&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;--chat-template-kwargs …&lt;/code&gt;&lt;/strong&gt;), whichever of them your binary supports. If &lt;strong&gt;nothing&lt;/strong&gt; disables the thinking block, try another GGUF variant, or another model for a one-off capture (e.g. the Llama example later in this same §7).&lt;/p&gt;

&lt;p&gt;Example with a local &lt;strong&gt;Llama 3.1 8B&lt;/strong&gt; (single-turn demo; chat template depends on the GGUF). An overly vague &lt;strong&gt;&lt;code&gt;-p&lt;/code&gt;&lt;/strong&gt; (“summarize llama.cpp”) may yield “I don’t have that information”; give &lt;strong&gt;context&lt;/strong&gt; in the question (e.g. open-source inference, GGUF, local execution).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Llama-3.1-8B-Instruct-Q8_0.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in exactly one sentence: What does the llama.cpp project do for running language models locally?"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Actual reference screenshot&lt;/strong&gt; (same &lt;strong&gt;validated&lt;/strong&gt; hardware in the intro: Ryzen 5 &lt;strong&gt;7640HS&lt;/strong&gt;, Radeon &lt;strong&gt;760M&lt;/strong&gt;, &lt;strong&gt;DDR5&lt;/strong&gt;; &lt;strong&gt;t/s&lt;/strong&gt; varies with thermals, BIOS, and &lt;code&gt;llama.cpp&lt;/code&gt; commit):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwzjdukig8roakl1iozsl.png" alt="llama-cli: Llama 3.1 8B Instruct Q8_0 — answer about llama.cpp and prompt/generation t/s." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl 99&lt;/code&gt; / &lt;code&gt;999&lt;/code&gt;&lt;/strong&gt;: offloads up to that many layers to the GPU, which in practice means “all of them”; on large models or a small unified VRAM budget you may need to &lt;strong&gt;lower&lt;/strong&gt; &lt;code&gt;-ngl&lt;/code&gt; or increase the BIOS framebuffer (§2).&lt;/li&gt;
&lt;li&gt;On startup, look for lines like &lt;code&gt;ggml_vulkan:&lt;/code&gt; and your GPU name (e.g. Radeon 760M) to confirm Vulkan.&lt;/li&gt;
&lt;/ul&gt;
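&lt;p&gt;Rather than scanning the whole startup spew, you can grep for that line; the sample below stands in for a captured log so the pipeline is reproducible (capture your real one with &lt;code&gt;tee&lt;/code&gt;):&lt;/p&gt;

```shell
# Confirm the Vulkan backend picked up the iGPU without reading the full log.
# log_line mimics the real ggml_vulkan output quoted earlier in this section.
log_line='ggml_vulkan: 0 = AMD Radeon 760M Graphics (RADV PHOENIX) (radv)'
printf '%s\n' "$log_line" | grep -i 'ggml_vulkan' | grep -io 'radeon 760m'
```

If the first &lt;code&gt;grep&lt;/code&gt; matches nothing on a real log, the build is running CPU-only and §6 (the Vulkan build flags) is the place to look.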

&lt;h3&gt;
  
  
  Adding or switching models
&lt;/h3&gt;

&lt;p&gt;Each &lt;strong&gt;additional model&lt;/strong&gt; you want to run—another family, quantization, or file from Hugging Face—is &lt;strong&gt;one&lt;/strong&gt; more &lt;code&gt;.gguf&lt;/code&gt; in your folder (e.g. &lt;code&gt;$HOME/models&lt;/code&gt;). ML slang often says &lt;strong&gt;“weights”&lt;/strong&gt; for the &lt;strong&gt;trained parameters&lt;/strong&gt; inside that file; here it is enough to think “another &lt;code&gt;.gguf&lt;/code&gt;.” The flow is always &lt;strong&gt;download → test → point the server&lt;/strong&gt; at that path.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; using the same pattern as above (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, or the repo’s &lt;em&gt;download&lt;/em&gt; link on Hugging Face).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test in the terminal&lt;/strong&gt; with &lt;code&gt;llama-cli -m "$HOME/models/your-new-file.gguf"&lt;/code&gt; (like the quick test). If the architecture is brand new and load fails, update and rebuild llama.cpp (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Manual &lt;code&gt;llama-server&lt;/code&gt; (§8):&lt;/strong&gt; stop the process (&lt;strong&gt;Ctrl+C&lt;/strong&gt;) and start it again with &lt;code&gt;-m&lt;/code&gt; pointing at the new file.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd service (§9):&lt;/strong&gt; edit &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, change only the &lt;code&gt;-m /full/path/new.gguf&lt;/code&gt; argument inside &lt;code&gt;ExecStart&lt;/code&gt;, save, then run:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI (§10):&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loads &lt;strong&gt;one&lt;/strong&gt; model at a time (whichever you set at startup). After restarting the service, reload the UI; the model dropdown may show the filename or a generic label (&lt;code&gt;default&lt;/code&gt;), depending on the version.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenCode / VS Code (§11):&lt;/strong&gt; same host and port (&lt;code&gt;…:8080/v1&lt;/code&gt;); in editors use the server IP or &lt;code&gt;127.0.0.1&lt;/code&gt; depending on where the IDE runs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Serving &lt;strong&gt;several models at once&lt;/strong&gt; requires multiple &lt;code&gt;llama-server&lt;/code&gt; processes on &lt;strong&gt;different ports&lt;/strong&gt; (and matching entries in Open WebUI or more containers); that advanced layout is not spelled out here.&lt;/p&gt;
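&lt;p&gt;If you do want to try that layout, a minimal &lt;strong&gt;dry-run&lt;/strong&gt; sketch looks like this; the ports and model paths are illustrative, and remember each server needs enough RAM for its own weights:&lt;/p&gt;

```shell
# Dry run: print one llama-server command per (port, model) pair.
# Remove the leading `echo` to launch for real; ports/paths are illustrative.
declare -A MODELS=(
  [8080]="$HOME/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"
  [8081]="$HOME/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"
)
for port in $(printf '%s\n' "${!MODELS[@]}" | sort); do
  echo ./build/bin/llama-server -m "${MODELS[$port]}" --host 0.0.0.0 --port "$port" -ngl 999
done
```

Each extra server then needs its own Open WebUI connection entry pointing at its port.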

&lt;h3&gt;
  
  
  Experimenting with more models: setup, testing, and limits
&lt;/h3&gt;

&lt;p&gt;If you want to &lt;strong&gt;try multiple GGUFs&lt;/strong&gt;, follow a clear flow and know your hardware ceiling—this avoids pointless downloads and false “it’s broken” moments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Recommended flow&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Check disk and RAM&lt;/strong&gt; (&lt;code&gt;free -h&lt;/code&gt;, &lt;code&gt;df -h /&lt;/code&gt;, §3). Each quantization costs what the model card says; keep headroom.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; when the model is new (§6, &lt;em&gt;Update and rebuild&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; into &lt;code&gt;$HOME/models&lt;/code&gt; (&lt;code&gt;wget&lt;/code&gt;, &lt;code&gt;huggingface-cli&lt;/code&gt;, etc.).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test&lt;/strong&gt; with &lt;code&gt;llama-cli&lt;/code&gt; and &lt;strong&gt;short&lt;/strong&gt; generations; confirm &lt;code&gt;ggml_vulkan&lt;/code&gt; if the GPU should participate (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional:&lt;/strong&gt; &lt;code&gt;llama-bench&lt;/code&gt; with the same &lt;code&gt;-ngl&lt;/code&gt; you plan for production to compare quantizations (§7).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Change &lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; in &lt;strong&gt;§9&lt;/strong&gt; (or manual §8), &lt;code&gt;daemon-reload&lt;/code&gt; + &lt;code&gt;restart&lt;/code&gt;, then &lt;strong&gt;&lt;code&gt;curl /v1/models&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;Open WebUI&lt;/strong&gt; (Admin → Connections; &lt;strong&gt;Model IDs&lt;/strong&gt; if needed).&lt;/li&gt;
&lt;/ol&gt;
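&lt;p&gt;Step 1 can be scripted; a minimal headroom check (the needed size is an example figure, read the real one off the model card):&lt;/p&gt;

```shell
# Quick headroom check before downloading: available RAM and disk vs. the
# GGUF size from the model card (NEED is an example figure, not a constant).
NEED=17
avail_ram=$(awk '/MemAvailable/ {printf "%d", $2/1048576}' /proc/meminfo)
avail_disk=$(df --output=avail -BG "$HOME" | tail -n 1 | tr -dc '0-9')
echo "RAM ${avail_ram} GiB, disk ${avail_disk} GB free, model needs ~${NEED} GB"
```

Keep a margin beyond the file size: the OS, the KV cache for your &lt;code&gt;-c&lt;/code&gt;, and Open WebUI all eat into the same pool.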

&lt;p&gt;&lt;strong&gt;Typical limits on a mini PC with an iGPU&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Topic&lt;/th&gt;
&lt;th&gt;What it means&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;RAM&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;GGUF size + OS + context cannot grow without limit; huge &lt;strong&gt;MoE&lt;/strong&gt; releases (e.g. &lt;strong&gt;Kimi K2&lt;/strong&gt;-class GGUFs) can &lt;strong&gt;exceed&lt;/strong&gt; usable RAM on 64–96 GiB class boxes or crawl at &lt;strong&gt;extremely&lt;/strong&gt; low tok/s.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;iGPU Vulkan&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;The iGPU caps &lt;strong&gt;tok/s&lt;/strong&gt;; lots of RAM lets you &lt;strong&gt;load&lt;/strong&gt; bigger weights, but it does not stand in for a big discrete GPU.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;One active model per &lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Switching models means changing &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;restarting&lt;/strong&gt; (or a second server on another port).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Templates / chat&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Weird chat in Open WebUI may be the GGUF &lt;strong&gt;chat template&lt;/strong&gt;; check the Hugging Face card or try another frontend.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Network / disk&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large downloads take time; use &lt;code&gt;wget --continue&lt;/code&gt; or resumable &lt;code&gt;huggingface-cli&lt;/code&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Set expectations:&lt;/strong&gt; an &lt;strong&gt;8B–13B&lt;/strong&gt; or a quantized &lt;strong&gt;26B&lt;/strong&gt; can be a great fit with ample RAM; &lt;strong&gt;datacenter-scale&lt;/strong&gt; GGUF may &lt;strong&gt;not fit&lt;/strong&gt; or run &lt;strong&gt;under ~1–2 tok/s&lt;/strong&gt; with aggressive paging—that is a &lt;strong&gt;memory bandwidth&lt;/strong&gt; issue, not an Ubuntu bug.&lt;/p&gt;
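&lt;p&gt;The bandwidth claim is easy to sanity-check with back-of-envelope math: decode speed on a bandwidth-bound box is at best memory bandwidth divided by bytes read per token. A sketch, assuming ~85 GB/s for dual-channel DDR5 and that &lt;strong&gt;A4B&lt;/strong&gt; means roughly 4 B active parameters per token (both are assumptions, not measured values):&lt;/p&gt;

```shell
# Decode ceiling ~ memory bandwidth / bytes read per token (bandwidth-bound).
# 85 GB/s dual-channel DDR5 and "A4B = ~4B active params" are assumptions.
awk 'BEGIN {
  bw = 85                        # GB/s, assumed
  dense_gib = 15.85              # GiB touched per token if all ~25B params were active
  moe_gib = dense_gib * 4 / 25   # MoE: only ~4B of ~25B params active per token
  printf "dense-style ceiling: %.1f tok/s\n", bw / (dense_gib * 1.074)
  printf "MoE (A4B) ceiling:   %.1f tok/s\n", bw / (moe_gib * 1.074)
}'
```

&lt;p&gt;The measured ~21 tok/s sits under the rough ~31 tok/s MoE ceiling, which is the expected direction; a dense 26B at the same quant would be capped near ~5 tok/s on this memory system.&lt;/p&gt;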

&lt;h3&gt;
  
  
  One playbook: Gemma 4, Qwen Coder, DeepSeek Coder, and Llama 3.1 (download → Open WebUI)
&lt;/h3&gt;

&lt;p&gt;For a &lt;strong&gt;mini PC–style&lt;/strong&gt; setup: Ubuntu 24.04, &lt;strong&gt;AMD iGPU Vulkan&lt;/strong&gt;, &lt;strong&gt;~64–96 GiB&lt;/strong&gt; RAM, &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;8080&lt;/strong&gt;, &lt;strong&gt;systemd&lt;/strong&gt; §9, &lt;strong&gt;Open WebUI&lt;/strong&gt; §10. Swap in your paths and username.&lt;/p&gt;

&lt;h4&gt;
  
  
  Common steps (every model swap)
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Refresh the engine&lt;/strong&gt; if the weight is new or load fails: &lt;code&gt;cd ~/llama.cpp &amp;amp;&amp;amp; git pull&lt;/code&gt; and rebuild (§6).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Download&lt;/strong&gt; the &lt;code&gt;.gguf&lt;/code&gt; (per-family commands below). &lt;strong&gt;Verify&lt;/strong&gt; the filename under Hugging Face → &lt;em&gt;Files&lt;/em&gt;; if it is renamed, fix the URL.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke test&lt;/strong&gt; (tune &lt;code&gt;-ngl&lt;/code&gt; and &lt;code&gt;-c&lt;/code&gt;); or use the &lt;strong&gt;copy-paste commands per model&lt;/strong&gt; under &lt;em&gt;Per-model quick test&lt;/em&gt; below.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
./build/bin/llama-cli &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"/absolute/path/to/file.gguf"&lt;/span&gt; &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;Tuning:&lt;/strong&gt; on &lt;strong&gt;OOM&lt;/strong&gt;, &lt;strong&gt;hangs&lt;/strong&gt;, or very slow output, &lt;strong&gt;lower &lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. 50, 35) and/or &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. 2048). &lt;strong&gt;Unified&lt;/strong&gt; iGPU memory is usually the limiter, not raw RAM alone.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (optional, §7) with the same path and &lt;code&gt;-ngl&lt;/code&gt; to compare quants or families.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;systemd (§9):&lt;/strong&gt; in &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt;, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: same path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, and match &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; to what worked in the smoke test.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl restart llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="7"&gt;
&lt;li&gt;
&lt;strong&gt;API check:&lt;/strong&gt; &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Open WebUI:&lt;/strong&gt; Admin → Connections → OpenAI (&lt;code&gt;host.docker.internal:8080/v1&lt;/code&gt;). If the picker stays empty, paste the &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; from that JSON into &lt;strong&gt;Model IDs&lt;/strong&gt;, save, and hard-refresh.&lt;/li&gt;
&lt;/ol&gt;
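&lt;p&gt;The &lt;strong&gt;&lt;code&gt;id&lt;/code&gt;&lt;/strong&gt; you paste into &lt;strong&gt;Model IDs&lt;/strong&gt; can be pulled out mechanically; the response shape below is an assumed example, so match it against what your server actually returns:&lt;/p&gt;

```shell
# Extract the model id from an OpenAI-style /v1/models response.
# `sample` stands in for: curl -s http://127.0.0.1:8080/v1/models
sample='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'
model_id=$(printf '%s' "$sample" | python3 -c 'import json,sys; print(json.load(sys.stdin)["data"][0]["id"])')
echo "$model_id"
```

Swap the &lt;code&gt;sample&lt;/code&gt; assignment for the real &lt;code&gt;curl&lt;/code&gt; and the same one-liner gives you the exact string to paste.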

&lt;h4&gt;
  
  
  Reference table (repos + sample file)
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Family&lt;/th&gt;
&lt;th&gt;Hugging Face repo&lt;/th&gt;
&lt;th&gt;Sample file (quant)&lt;/th&gt;
&lt;th&gt;Notes (~machine with plenty of RAM)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4&lt;/strong&gt; 26B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF" rel="noopener noreferrer"&gt;bartowski/google_gemma-4-26B-A4B-it-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;~16 GiB (≈17 GB) on disk; usually needs &lt;strong&gt;fresh llama.cpp&lt;/strong&gt;. Start &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; around &lt;strong&gt;4096&lt;/strong&gt;–&lt;strong&gt;8192&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Qwen2.5 Coder&lt;/strong&gt; 7B&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Qwen2.5-Coder-7B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Much lighter than Gemma 26B. For &lt;strong&gt;14B / 32B&lt;/strong&gt;, check &lt;em&gt;Files&lt;/em&gt; sizes; 32B Q4 is often &lt;strong&gt;~18–20 GiB+&lt;/strong&gt; and heavier.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;DeepSeek Coder V2 Lite&lt;/strong&gt; Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;“Lite” ≈ &lt;strong&gt;~10 GiB&lt;/strong&gt; class in Q4_K_M; solid code/disk trade-off locally.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Llama 3.1&lt;/strong&gt; 8B Instruct&lt;/td&gt;
&lt;td&gt;&lt;a href="https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF" rel="noopener noreferrer"&gt;bartowski/Meta-Llama-3.1-8B-Instruct-GGUF&lt;/a&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf&lt;/code&gt; or &lt;code&gt;-Q8_0.gguf&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Q4_K_M&lt;/strong&gt; faster; &lt;strong&gt;Q8_0&lt;/strong&gt; heavier / often higher quality. If your file name differs, keep your real path in &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Download (&lt;code&gt;wget --continue&lt;/code&gt;, one file per command)
&lt;/h4&gt;

&lt;p&gt;If you use &lt;strong&gt;SSH&lt;/strong&gt; and the download runs a long time, run it inside &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;tmux&lt;/code&gt;&lt;/strong&gt; so a dropped connection does not kill the job. Example with &lt;strong&gt;&lt;code&gt;screen&lt;/code&gt;&lt;/strong&gt; (install if needed: &lt;code&gt;sudo apt install -y screen&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;screen &lt;span class="nt"&gt;-S&lt;/span&gt; hf-models
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;span class="c"&gt;# When this wget finishes, you can paste the next command from the block below without leaving screen.&lt;/span&gt;

&lt;span class="c"&gt;# Detach (leave download running): Ctrl+A, release, D&lt;/span&gt;
&lt;span class="c"&gt;# Reattach later: screen -r hf-models&lt;/span&gt;
&lt;span class="c"&gt;# List sessions: screen -ls&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The same pattern works for the other URLs in this section or for &lt;strong&gt;&lt;code&gt;huggingface-cli download&lt;/code&gt;&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

&lt;span class="c"&gt;# Gemma 4 26B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/google_gemma-4-26B-A4B-it-GGUF/resolve/main/google_gemma-4-26B-A4B-it-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Qwen2.5 Coder 7B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Qwen2.5-Coder-7B-Instruct-GGUF/resolve/main/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# DeepSeek Coder V2 Lite Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF/resolve/main/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;

&lt;span class="c"&gt;# Llama 3.1 8B Q4_K_M&lt;/span&gt;
wget &lt;span class="nt"&gt;--continue&lt;/span&gt; &lt;span class="nt"&gt;-O&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="s2"&gt;"https://huggingface.co/bartowski/Meta-Llama-3.1-8B-Instruct-GGUF/resolve/main/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf?download=true"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Meta / Llama (gated):&lt;/strong&gt; if &lt;code&gt;wget&lt;/code&gt; returns &lt;strong&gt;403&lt;/strong&gt; or Hugging Face asks you to sign in, open the model page while logged in, &lt;strong&gt;accept the license&lt;/strong&gt;, create a &lt;strong&gt;read&lt;/strong&gt; token, and run &lt;strong&gt;&lt;code&gt;huggingface-cli login&lt;/code&gt;&lt;/strong&gt;. &lt;em&gt;Gated&lt;/em&gt; repos usually need &lt;strong&gt;&lt;code&gt;huggingface-cli download ...&lt;/code&gt;&lt;/strong&gt;, not anonymous &lt;code&gt;wget&lt;/code&gt; to &lt;code&gt;resolve/main/...&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;huggingface-cli&lt;/code&gt; alternative&lt;/strong&gt; (resumable; each command pulls &lt;strong&gt;one&lt;/strong&gt; GGUF under &lt;code&gt;--local-dir&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; python3-venv
python3 &lt;span class="nt"&gt;-m&lt;/span&gt; venv &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface"&lt;/span&gt;   &lt;span class="c"&gt;# once; skip if this directory already exists&lt;/span&gt;
&lt;span class="nb"&gt;source&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/.venv/huggingface/bin/activate"&lt;/span&gt;
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-U&lt;/span&gt; &lt;span class="s2"&gt;"huggingface_hub[cli]"&lt;/span&gt;
&lt;span class="c"&gt;# huggingface-cli login   # required for *gated* repos (e.g. Llama/Meta); optional otherwise&lt;/span&gt;
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/google_gemma-4-26B-A4B-it-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Qwen2.5-Coder-7B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/DeepSeek-Coder-V2-Lite-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;

huggingface-cli download bartowski/Meta-Llama-3.1-8B-Instruct-GGUF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--include&lt;/span&gt; &lt;span class="s2"&gt;"Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--local-dir&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Depending on the CLI version, the &lt;code&gt;.gguf&lt;/code&gt; may end up in a &lt;strong&gt;subfolder&lt;/strong&gt; under &lt;code&gt;--local-dir&lt;/code&gt;. Point &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; at the real absolute path (for example &lt;code&gt;find "$HOME/models" -name '*.gguf'&lt;/code&gt;).&lt;/p&gt;
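&lt;p&gt;That &lt;code&gt;find&lt;/code&gt; can also feed &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; directly; a small sketch with an overridable directory so you can try it anywhere:&lt;/p&gt;

```shell
# Resolve the first .gguf under the models dir to an absolute path for -m.
# MODELS_DIR is overridable; mkdir -p keeps find from erroring on a fresh box.
MODELS_DIR="${MODELS_DIR:-$HOME/models}"
mkdir -p "$MODELS_DIR"
MODEL_PATH=$(find "$MODELS_DIR" -type f -name '*.gguf' | sort | head -n 1)
echo "gguf: ${MODEL_PATH:-none found}"
```

With several files, swap &lt;code&gt;head&lt;/code&gt; for a grep on the family name you want before pasting the path into &lt;code&gt;ExecStart&lt;/code&gt;.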

&lt;h4&gt;
  
  
  Per-model quick test (right after download)
&lt;/h4&gt;

&lt;p&gt;Run &lt;strong&gt;one&lt;/strong&gt; block (paths match the &lt;code&gt;wget&lt;/code&gt; names above). &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt; caps generated tokens so the run stays short; if your &lt;code&gt;llama-cli&lt;/code&gt; rejects &lt;strong&gt;&lt;code&gt;-n&lt;/code&gt;&lt;/strong&gt;, check &lt;code&gt;./build/bin/llama-cli --help&lt;/code&gt; (sometimes &lt;code&gt;--predict&lt;/code&gt; or another alias). Earlier in §7, &lt;em&gt;Quick terminal test&lt;/em&gt; shows a &lt;strong&gt;&lt;code&gt;-cnv&lt;/code&gt;&lt;/strong&gt; example for Gemma and a Llama variant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gemma 4 26B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Answer in one short sentence what a tensor is in machine learning."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Qwen2.5 Coder 7B Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Qwen2.5-Coder-7B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a one-line Python factorial(n) function; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DeepSeek Coder V2 Lite Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/DeepSeek-Coder-V2-Lite-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 128 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Write a JavaScript arrow function that adds two numbers; code only."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Llama 3.1 8B Instruct Q4_K_M&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-cli &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 4096 &lt;span class="nt"&gt;-ngl&lt;/span&gt; 999 &lt;span class="nt"&gt;-n&lt;/span&gt; 80 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"Say in one sentence what llama.cpp is for."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On startup you should see &lt;strong&gt;&lt;code&gt;ggml:&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;ggml_vulkan:&lt;/code&gt;&lt;/strong&gt; lines naming your GPU when Vulkan is in use (§4–§5).&lt;/p&gt;
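&lt;p&gt;If you want a scriptable check instead of eyeballing the log, grep for the &lt;code&gt;ggml_vulkan:&lt;/code&gt; prefix. The sample line below imitates that output (exact wording varies by llama.cpp version); on a real run, pipe &lt;code&gt;llama-cli&lt;/code&gt;’s combined output through the same grep:&lt;/p&gt;

```shell
# Sketch: the sample line imitates llama.cpp startup output (wording varies by
# build). On a real run, use:  ./build/bin/llama-cli ... 2>&1 | grep ggml_vulkan
log='ggml_vulkan: 0 = AMD Radeon 760M (RADV PHOENIX) | uma: 1 | fp16: 1'
result="CPU-only: recheck the build flags and /dev/dri groups"
if printf '%s\n' "$log" | grep -q '^ggml_vulkan:'; then
  result="Vulkan backend active"
fi
echo "$result"
```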

&lt;h4&gt;
  
  
  Typical &lt;code&gt;ExecStart&lt;/code&gt; tweaks (example)
&lt;/h4&gt;

&lt;p&gt;Same shape as §9; only &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt; (and possibly &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;) change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;…/llama-server \
    -m /home/YOUR_USER/models/THE_FILE_YOU_TESTED.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -c 8192 \
    -ngl 999 \
    --n-predict -1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If &lt;strong&gt;Gemma 26B Q4&lt;/strong&gt; or another big model &lt;strong&gt;OOM&lt;/strong&gt;s on a box with only &lt;strong&gt;~16 GiB&lt;/strong&gt; of RAM, start with a lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less), and only push toward &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt; once that runs cleanly. Always validate with &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; using the same &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; you plan to put in &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;, then automate with systemd (§9).&lt;/p&gt;
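&lt;p&gt;For a rough feel of whether a model will fit before you launch it, compare the file size plus a context allowance against available RAM. A back-of-the-envelope sketch; the &lt;strong&gt;16 GiB&lt;/strong&gt; and &lt;strong&gt;3 GiB&lt;/strong&gt; numbers are illustrative assumptions, not llama.cpp’s real accounting:&lt;/p&gt;

```shell
# Back-of-the-envelope pre-flight (rough assumptions, not llama.cpp's real
# accounting): weights ≈ .gguf file size, plus a margin for KV cache + buffers.
model_gib=16      # e.g. a ~16 GiB Q4_K_M file; use the real size of your .gguf
margin_gib=3      # rough allowance for a -c 8192 KV cache and runtime overhead
need_gib=$((model_gib + margin_gib))
free_gib=$(awk '/MemAvailable/ {printf "%d", $2/1024/1024}' /proc/meminfo 2>/dev/null)
free_gib=${free_gib:-0}
if [ "$free_gib" -lt "$need_gib" ]; then
  verdict="tight: start with lower -c / -ngl (need ~${need_gib} GiB, have ${free_gib} GiB)"
else
  verdict="likely fits (need ~${need_gib} GiB, have ${free_gib} GiB)"
fi
echo "$verdict"
```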




&lt;h2&gt;
  
  
  8. Minimal web server (&lt;code&gt;llama-server&lt;/code&gt;)
&lt;/h2&gt;

&lt;p&gt;Run manually, listening on all interfaces on port &lt;strong&gt;8080&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/llama.cpp"&lt;/span&gt;
./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$HOME&lt;/span&gt;&lt;span class="s2"&gt;/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--port&lt;/span&gt; 8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 8192 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--n-predict&lt;/span&gt; &lt;span class="nt"&gt;-1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;On another machine: &lt;code&gt;http://SERVER_IP:8080&lt;/code&gt; (llama.cpp’s built-in UI is very basic).&lt;/p&gt;
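&lt;p&gt;The same server also speaks the OpenAI-style REST API, which you can smoke-test from any shell. A minimal sketch, assuming the server above is running on &lt;code&gt;127.0.0.1:8080&lt;/code&gt;; it prints a hint instead of failing when nothing is listening:&lt;/p&gt;

```shell
# Hit the OpenAI-style endpoint (only meaningful while llama-server from the
# block above is running; otherwise this falls through to the hint line).
URL=http://127.0.0.1:8080/v1/chat/completions
result=$(curl -sS --max-time 10 "$URL" \
  -H 'Content-Type: application/json' \
  -d '{"messages":[{"role":"user","content":"Reply with one short sentence."}],"max_tokens":32}' \
  2>/dev/null || echo "no llama-server answering on :8080 yet")
printf '%s\n' "$result"
```

&lt;p&gt;On success the reply is JSON with the generated text under &lt;code&gt;choices[0].message.content&lt;/code&gt;.&lt;/p&gt;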




&lt;h2&gt;
  
  
  9. systemd service (start on boot)
&lt;/h2&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/llama-web.service&lt;/code&gt; (e.g. with &lt;code&gt;sudo nano&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Llama.cpp API server (Vulkan)&lt;/span&gt;
&lt;span class="py"&gt;After&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;network.target&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;simple&lt;/span&gt;
&lt;span class="py"&gt;User&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="py"&gt;Group&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;YOUR_USER&lt;/span&gt;
&lt;span class="c"&gt;# Vulkan on AMD: the service user must access /dev/dri (groups in §4).
# If the service loads the model on CPU only, check `groups` / `id` for that user.
&lt;/span&gt;&lt;span class="py"&gt;SupplementaryGroups&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;render video&lt;/span&gt;
&lt;span class="py"&gt;WorkingDirectory&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/home/YOUR_USER/llama.cpp/build/bin/llama-server &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-m /home/YOUR_USER/models/google_gemma-4-26B-A4B-it-Q4_K_M.gguf &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--host 0.0.0.0 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--port 8080 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-c 8192 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;-ngl 99 &lt;/span&gt;&lt;span class="se"&gt;\
&lt;/span&gt;    &lt;span class="s"&gt;--n-predict -1&lt;/span&gt;
&lt;span class="py"&gt;Restart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;always&lt;/span&gt;
&lt;span class="py"&gt;RestartSec&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Enable it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl daemon-reload
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable&lt;/span&gt; &lt;span class="nt"&gt;--now&lt;/span&gt; llama-web.service
&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl status llama-web.service
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Recommended order (tight RAM):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;.gguf&lt;/code&gt; must be &lt;strong&gt;fully downloaded&lt;/strong&gt;; a truncated file makes the unit &lt;strong&gt;fail&lt;/strong&gt; or &lt;strong&gt;restart in a loop&lt;/strong&gt; (&lt;code&gt;Restart=always&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Smoke-test with &lt;code&gt;llama-cli&lt;/code&gt; first&lt;/strong&gt; as the &lt;strong&gt;same user&lt;/strong&gt; as the systemd unit, with the &lt;strong&gt;same&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;-m&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt;, and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; as in &lt;code&gt;ExecStart&lt;/code&gt; (§7 &lt;em&gt;Per-model quick test&lt;/em&gt; or step 3’s generic example). If that already OOMs or hangs, &lt;strong&gt;tune flags&lt;/strong&gt; before &lt;code&gt;enable --now&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;If &lt;code&gt;journalctl&lt;/code&gt; shows &lt;strong&gt;OOM&lt;/strong&gt;, the process &lt;strong&gt;dies and respawns&lt;/strong&gt; every few seconds, or the kernel kills the worker, edit &lt;strong&gt;&lt;code&gt;ExecStart&lt;/code&gt;&lt;/strong&gt;: lower &lt;strong&gt;&lt;code&gt;-c&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;4096&lt;/strong&gt;) and &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; (e.g. &lt;strong&gt;40&lt;/strong&gt; or less) rather than insisting on &lt;strong&gt;99&lt;/strong&gt; / &lt;strong&gt;999&lt;/strong&gt;, then &lt;code&gt;sudo systemctl daemon-reload&lt;/code&gt; and &lt;code&gt;sudo systemctl restart llama-web.service&lt;/code&gt;; repeat until &lt;code&gt;status&lt;/code&gt; shows a stable &lt;strong&gt;active (running)&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If startup fails, check logs: &lt;code&gt;journalctl -u llama-web.service -n 80 --no-pager&lt;/code&gt; (GGUF path, &lt;code&gt;/dev/dri&lt;/code&gt; permissions, &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, Vulkan).&lt;/p&gt;
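&lt;p&gt;A quick triage pattern for those logs. The heredoc below imitates a typical systemd/kernel OOM excerpt so the command has something to match; on a real box, feed it &lt;code&gt;journalctl -u llama-web.service -n 200 --no-pager&lt;/code&gt; instead:&lt;/p&gt;

```shell
# Triage sketch on a canned journal excerpt (imitation sample lines); replace
# the heredoc with:  journalctl -u llama-web.service -n 200 --no-pager
verdict="no OOM signature in this excerpt"
if grep -qiE 'out of memory|oom|status=9/KILL' <<'EOF'
Apr 10 12:01:03 mini systemd[1]: llama-web.service: Main process exited, code=killed, status=9/KILL
Apr 10 12:01:03 mini kernel: Out of memory: Killed process 4242 (llama-server)
EOF
then
  verdict="OOM signature found: lower -c / -ngl in ExecStart first"
fi
echo "$verdict"
```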




&lt;h2&gt;
  
  
  10. Open WebUI with Docker (port 3000 → backend on 8080)
&lt;/h2&gt;

&lt;p&gt;Install Docker if needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; docker.io
&lt;span class="nb"&gt;sudo &lt;/span&gt;usermod &lt;span class="nt"&gt;-aG&lt;/span&gt; docker &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$USER&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="c"&gt;# Log out again, or run: newgrp docker&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Container (UI on &lt;strong&gt;3000&lt;/strong&gt;; engine stays on host &lt;strong&gt;8080&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 3000:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--add-host&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;host.docker.internal:host-gateway &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; open-webui:/app/backend/data &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; open-webui &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--restart&lt;/span&gt; always &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the browser: &lt;code&gt;http://SERVER_IP:3000&lt;/code&gt;.&lt;/p&gt;
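&lt;p&gt;From the host you can also probe the UI port before reaching for the browser; a small sketch (&lt;code&gt;000&lt;/code&gt; means nothing answered on &lt;strong&gt;3000&lt;/strong&gt;):&lt;/p&gt;

```shell
# Reachability probe for the UI port; 000 = nothing answered (container down,
# wrong -p mapping, or firewall). Any 2xx/3xx means the UI is serving.
status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 http://127.0.0.1:3000/ 2>/dev/null)
status=${status:-000}
echo "Open WebUI on :3000 -> HTTP $status"
```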

&lt;h3&gt;
  
  
  Connect Open WebUI to llama-server
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Not the same as “External tools”.&lt;/strong&gt; In regular user settings you may see &lt;strong&gt;External tools&lt;/strong&gt; (&lt;em&gt;Manage tool servers&lt;/em&gt;, &lt;code&gt;openapi.json&lt;/code&gt;): that is for optional &lt;strong&gt;tool&lt;/strong&gt; servers, &lt;strong&gt;not&lt;/strong&gt; for the main LLM backend. Putting your URL only there leaves the model picker empty.&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;Admin Settings&lt;/strong&gt;, not the gear icon that only shows &lt;em&gt;General / Interface / External tools&lt;/em&gt; (&lt;a href="https://docs.openwebui.com/getting-started/quick-start/settings/" rel="noopener noreferrer"&gt;personal user settings&lt;/a&gt;). Typical path: &lt;strong&gt;profile avatar&lt;/strong&gt; → &lt;strong&gt;Admin Settings&lt;/strong&gt; / &lt;strong&gt;Administration&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;Connections&lt;/strong&gt; → &lt;strong&gt;OpenAI&lt;/strong&gt; → &lt;strong&gt;Add connection&lt;/strong&gt;. If &lt;em&gt;Admin Settings&lt;/em&gt; is missing, your account is not an instance admin (the first registered user usually is). Docs: &lt;a href="https://docs.openwebui.com/getting-started/quick-start/connect-a-provider/starting-with-openai-compatible/" rel="noopener noreferrer"&gt;OpenAI-Compatible&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Admin panel → Settings → Connections&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;OpenAI&lt;/strong&gt; section (llama-server mimics the OpenAI API):

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Base URL:&lt;/strong&gt; &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API key:&lt;/strong&gt; any string (e.g. &lt;code&gt;sk-no-key-required&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Save and use &lt;strong&gt;verify connection&lt;/strong&gt; if shown.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Turn off “Direct connections”&lt;/strong&gt; (or equivalent) if you enabled it: otherwise the browser will try to resolve &lt;code&gt;host.docker.internal&lt;/code&gt; outside Docker and fail. The UI should proxy to the backend.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Chat up and running (example)
&lt;/h3&gt;

&lt;p&gt;With the backend wired, pick a model in chat (often the same label as the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; filename&lt;/strong&gt; &lt;code&gt;llama-server&lt;/code&gt; loaded), send a prompt, and the reply is generated on the host. The screenshot shows &lt;strong&gt;&lt;code&gt;google_gemma-4-26B-A4B-it-Q4_K_M.gguf&lt;/code&gt;&lt;/strong&gt;: the header dropdown reflects that file, and you get a &lt;strong&gt;“Thought for …”&lt;/strong&gt;-style block (internal reasoning before the visible answer). That &lt;strong&gt;adds latency&lt;/strong&gt; before you see the final text; for &lt;strong&gt;terminal&lt;/strong&gt; use and less explicit “thinking” output with Gemma, try &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;&lt;code&gt;--reasoning off&lt;/code&gt;&lt;/strong&gt; (§7 &lt;em&gt;Quick terminal test&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fin3u615sdllq33mujugj.png" alt="Open WebUI: chat with Gemma 4 26B Q4_K_M, GGUF picker, and reasoning (“Thought for …”)." width="800" height="724"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  No browsing or GitHub fetch: real limits (and confident wrong answers)
&lt;/h3&gt;

&lt;p&gt;With &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; + &lt;strong&gt;Open WebUI&lt;/strong&gt; as wired here, the model is &lt;strong&gt;text → text&lt;/strong&gt; only: it does &lt;strong&gt;not&lt;/strong&gt; browse the web, issue its own &lt;strong&gt;internet&lt;/strong&gt; requests, download a &lt;strong&gt;&lt;code&gt;https://github.com/...&lt;/code&gt;&lt;/strong&gt; tree, or run code in a sandbox. All it “sees” is what &lt;strong&gt;you&lt;/strong&gt; type (plus whatever context the UI forwards) and knowledge &lt;strong&gt;frozen&lt;/strong&gt; inside the &lt;strong&gt;GGUF&lt;/strong&gt; up to training cutoff.&lt;/p&gt;

&lt;p&gt;It may still answer &lt;strong&gt;very confidently&lt;/strong&gt; as if it had tools, for example claiming it &lt;strong&gt;“can analyze a public repo if you share the link”&lt;/strong&gt; or outlining how it will &lt;strong&gt;“read”&lt;/strong&gt; a remote &lt;code&gt;README&lt;/code&gt;. In this stack &lt;strong&gt;that is false&lt;/strong&gt; if you only paste a URL: the backend &lt;strong&gt;never fetches&lt;/strong&gt; the HTML or the repo; Gemma (or any local GGUF) &lt;strong&gt;hallucinates&lt;/strong&gt; or repeats patterns from training. Real analysis needs &lt;strong&gt;you to paste files&lt;/strong&gt; / diffs, or &lt;strong&gt;separate&lt;/strong&gt; plumbing (RAG, &lt;strong&gt;Open WebUI functions&lt;/strong&gt;, agents, APIs) that this guide does &lt;strong&gt;not&lt;/strong&gt; set up.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;“Thought for …”&lt;/strong&gt; / reasoning block (§7, §10) does &lt;strong&gt;not&lt;/strong&gt; verify anything online—it only extends generation and can read like a &lt;strong&gt;super-capable assistant&lt;/strong&gt;; double-check claims about repos, “current” versions, or anything that depends on &lt;em&gt;today&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same stack, different tone:&lt;/strong&gt; ask bluntly &lt;em&gt;can you browse the Internet for new info?&lt;/em&gt; and Gemma may &lt;strong&gt;plainly refuse&lt;/strong&gt;—no live search, only training data plus whatever &lt;strong&gt;you&lt;/strong&gt; paste. That does &lt;strong&gt;not&lt;/strong&gt; undo the GitHub-URL problem above: the model &lt;strong&gt;shifts persona&lt;/strong&gt; with prompt framing (literal capability question vs. “please review this repo”). &lt;strong&gt;Ground truth&lt;/strong&gt; is unchanged: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; still issues no HTTP&lt;/strong&gt; on its own until you wire tools.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiwix65pt2rzakkihw3v2.png" alt="Open WebUI (English): *Can you browse the Internet…?* — honest “no live web” reply; same stack, still no automatic fetch." width="800" height="339"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live demo (the joke writes itself):&lt;/strong&gt; the assistant just told you to &lt;em&gt;“send the link”&lt;/em&gt;; you reply &lt;em&gt;analyze &lt;code&gt;https://github.com/…/pgwd&lt;/code&gt; and tell me what to improve&lt;/em&gt;—or the &lt;strong&gt;same&lt;/strong&gt; request in &lt;strong&gt;Spanish&lt;/strong&gt; (or any other language you type in the UI); &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; does not switch behavior by chat language&lt;/strong&gt;. Open WebUI shows &lt;strong&gt;Thinking…&lt;/strong&gt; and Gemma looks busy, but &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; never fetched that repo&lt;/strong&gt;: it only sees the &lt;strong&gt;message string&lt;/strong&gt;. The answer may sound technical yet be &lt;strong&gt;untethered from the real tree&lt;/strong&gt;—paste files, use &lt;strong&gt;git&lt;/strong&gt; yourself, or wire tools if you want grounded review.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxi2swo3skn5aho4dr49t.png" alt="Open WebUI: after “analyze this GitHub repo…”, the model shows Thinking… — no URL fetch in this stack." width="800" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Same experiment, a minute later:&lt;/strong&gt; the model may return &lt;strong&gt;Thought for ~45–60s&lt;/strong&gt; and a long “review” that &lt;strong&gt;reads like a real audit&lt;/strong&gt;. The screenshot below is &lt;strong&gt;English&lt;/strong&gt; (&lt;em&gt;analyze in details…&lt;/em&gt;): it leans into &lt;strong&gt;Flask&lt;/strong&gt; and &lt;strong&gt;Blueprints&lt;/strong&gt;; in &lt;strong&gt;another&lt;/strong&gt; chat the same Gemma might rattle off &lt;strong&gt;Go&lt;/strong&gt; &lt;code&gt;cmd/&lt;/code&gt;/&lt;code&gt;internal/&lt;/code&gt;—still with &lt;strong&gt;no&lt;/strong&gt; tree read. That is template + guesswork, not repository access: some bullets may match the name (&lt;em&gt;pgwd&lt;/em&gt;, “dashboard”, …), some may be &lt;strong&gt;wrong&lt;/strong&gt;; &lt;strong&gt;length&lt;/strong&gt; and &lt;strong&gt;“thought”&lt;/strong&gt; time are not a substitute for &lt;strong&gt;cloning&lt;/strong&gt; and diffing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa02eg5ippo6ihtgj45p7.png" alt="Open WebUI (English example): detailed reply after a bare GitHub URL with no fetch — “Thought for …” plus persuasive text; verify against real code." width="800" height="714"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Model picker shows &lt;strong&gt;“No results found”&lt;/strong&gt; / no models listed
&lt;/h3&gt;

&lt;p&gt;This almost never means “the &lt;code&gt;.gguf&lt;/code&gt; is missing on disk”; it means &lt;strong&gt;Open WebUI is not getting &lt;code&gt;/v1/models&lt;/code&gt;&lt;/strong&gt; from the backend you configured. Walk through in order:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; must be running&lt;/strong&gt; on the same host as Docker (§8 manual or §9 &lt;code&gt;systemd&lt;/code&gt;). Nothing listening on &lt;strong&gt;8080&lt;/strong&gt; → empty list.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;On the host&lt;/strong&gt; (mini PC shell), hit the API:&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-sS&lt;/span&gt; http://127.0.0.1:8080/v1/models | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should see JSON (&lt;code&gt;data&lt;/code&gt;, at least one &lt;code&gt;id&lt;/code&gt;). &lt;strong&gt;Connection refused&lt;/strong&gt; → start or fix &lt;code&gt;llama-server&lt;/code&gt;. If it bound only &lt;code&gt;127.0.0.1&lt;/code&gt; or another specific interface, use &lt;strong&gt;&lt;code&gt;--host 0.0.0.0&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;ExecStart&lt;/code&gt;: loopback alone is not enough once LAN clients or the Docker container need port 8080.&lt;/p&gt;
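&lt;p&gt;To pull just the model &lt;code&gt;id&lt;/code&gt;s out of that JSON, a small sketch; the payload below is a canned sample shaped like the reply, so on your box pipe the real &lt;code&gt;curl&lt;/code&gt; output in instead:&lt;/p&gt;

```shell
# Extract model ids from a /v1/models-style payload. The sample string stands in
# for:  curl -sS http://127.0.0.1:8080/v1/models
sample='{"object":"list","data":[{"id":"google_gemma-4-26B-A4B-it-Q4_K_M.gguf","object":"model"}]}'
ids=$(printf '%s' "$sample" | python3 -c \
  'import json,sys; print("\n".join(m["id"] for m in json.load(sys.stdin)["data"]))')
echo "$ids"
```

&lt;p&gt;Whatever &lt;code&gt;id&lt;/code&gt;s print here are exactly the labels Open WebUI can show in its model picker.&lt;/p&gt;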

&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;From the Open WebUI container&lt;/strong&gt;, the host port must be reachable:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker &lt;span class="nb"&gt;exec &lt;/span&gt;open-webui sh &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s1"&gt;'wget -qO- http://host.docker.internal:8080/v1/models 2&amp;gt;/dev/null || curl -sS http://host.docker.internal:8080/v1/models'&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this fails but step 2 works, you are missing &lt;strong&gt;&lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;&lt;/strong&gt; in &lt;code&gt;docker run&lt;/code&gt; (§10), or a firewall blocks Docker bridge → host (&lt;code&gt;ufw&lt;/code&gt; may need a rule; many setups allow it by default).&lt;/p&gt;

&lt;ol start="4"&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI wiring:&lt;/strong&gt; &lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or &lt;strong&gt;Admin&lt;/strong&gt; → &lt;strong&gt;Settings&lt;/strong&gt;, depending on version), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (&lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; required&lt;/strong&gt;). Save a dummy API key and &lt;strong&gt;verify&lt;/strong&gt; if offered.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Do not mix with Ollama:&lt;/strong&gt; putting the &lt;code&gt;llama-server&lt;/code&gt; URL only under &lt;strong&gt;Ollama&lt;/strong&gt;, or using port 8080 &lt;strong&gt;without&lt;/strong&gt; &lt;code&gt;/v1&lt;/code&gt;, can leave the dropdown empty. See the table below.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After fixing, &lt;strong&gt;hard-refresh&lt;/strong&gt; the UI. The model label may match the &lt;strong&gt;&lt;code&gt;.gguf&lt;/code&gt; name&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;default&lt;/code&gt;&lt;/strong&gt;, or whatever &lt;code&gt;id&lt;/code&gt; appears in the JSON from step 2.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  “Failed to fetch models” under &lt;strong&gt;Ollama&lt;/strong&gt; (Settings → Models)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;Settings → Models → Manage Models&lt;/strong&gt; shows the &lt;strong&gt;Ollama&lt;/strong&gt; service with URL &lt;code&gt;http://host.docker.internal:8080&lt;/code&gt; (and nothing else), you often get &lt;strong&gt;Failed to fetch models&lt;/strong&gt;. That usually means &lt;strong&gt;two different backends are mixed up&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What you run&lt;/th&gt;
&lt;th&gt;Typical port&lt;/th&gt;
&lt;th&gt;Where to configure it in Open WebUI&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;llama-server&lt;/strong&gt; (this guide)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;8080&lt;/strong&gt;, OpenAI-style API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Settings → Connections → OpenAI&lt;/strong&gt; (or equivalent), base URL &lt;strong&gt;&lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt;&lt;/strong&gt; (the &lt;strong&gt;&lt;code&gt;/v1&lt;/code&gt; suffix is required&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; (only if installed separately)&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;11434&lt;/strong&gt;, Ollama API&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Ollama&lt;/strong&gt; connection / model management, typically &lt;code&gt;http://host.docker.internal:11434&lt;/code&gt; (only if Ollama listens on the host and the container can reach it).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;code&gt;llama-server&lt;/code&gt; is &lt;strong&gt;not&lt;/strong&gt; Ollama. If you put the llama-server URL in the &lt;strong&gt;Ollama&lt;/strong&gt; field, the UI uses the wrong protocol and fails even when port 8080 is open.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you only use llama-server:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Keep &lt;strong&gt;Connections → OpenAI&lt;/strong&gt; exactly as above (&lt;code&gt;…8080/v1&lt;/code&gt;, dummy key, verify).&lt;/li&gt;
&lt;li&gt;If you do not run Ollama, &lt;strong&gt;clear or disable&lt;/strong&gt; the Ollama URL (do not point it at 8080).&lt;/li&gt;
&lt;li&gt;Return to &lt;strong&gt;Models&lt;/strong&gt; or chat: available models follow whatever &lt;code&gt;llama-server&lt;/code&gt; loaded with &lt;code&gt;-m&lt;/code&gt; (§8–§9).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;host.docker.internal&lt;/code&gt; does not resolve&lt;/strong&gt; inside the container, confirm your &lt;code&gt;docker run&lt;/code&gt; includes &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt; (§10). On Linux that hostname is not defined by default without it.&lt;/p&gt;
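&lt;p&gt;One way to confirm the alias actually resolves inside the container (a sketch; assumes the container name &lt;code&gt;open-webui&lt;/code&gt; from §10 and prints a hint when Docker is not available):&lt;/p&gt;

```shell
# Check the host.docker.internal alias from inside the container. With
# --add-host=...:host-gateway it resolves to the host's gateway IP; without it,
# the lookup fails on Linux.
out=$(docker exec open-webui getent hosts host.docker.internal 2>/dev/null \
  || echo "docker not available here, or the open-webui container is not running")
printf '%s\n' "$out"
```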

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwev12dfqy1isqcvaduua.png" alt="Illustration: conceptual flow for upgrading the UI (image pull, recreate container, persistent volume)" width="800" height="447"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating Open WebUI (Docker)
&lt;/h3&gt;

&lt;p&gt;The UI often shows a banner like &lt;em&gt;“A new version (v0.x.y) is now available…”&lt;/em&gt; when a newer image exists. Your &lt;strong&gt;chats and settings&lt;/strong&gt; live in the &lt;strong&gt;&lt;code&gt;open-webui&lt;/code&gt; named volume&lt;/strong&gt;; they are kept when you recreate the container as long as you mount the same &lt;code&gt;-v open-webui:/app/backend/data&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpmqtynzg0nn5p8u3eh7t.png" alt="Updating Open WebUI" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Pull&lt;/strong&gt; the updated image (same tag you used at install; this guide uses &lt;code&gt;main&lt;/code&gt;):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/open-webui/open-webui:main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Stop and remove&lt;/strong&gt; only the container (the volume stays intact):
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker stop open-webui
docker &lt;span class="nb"&gt;rm &lt;/span&gt;open-webui
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Run&lt;/strong&gt; the &lt;strong&gt;same&lt;/strong&gt; &lt;code&gt;docker run&lt;/code&gt; block from §10 again (same &lt;code&gt;-p 3000:8080&lt;/code&gt;, &lt;code&gt;--add-host=host.docker.internal:host-gateway&lt;/code&gt;, &lt;code&gt;-v open-webui:…&lt;/code&gt;, container name &lt;code&gt;open-webui&lt;/code&gt;, etc.). The new container starts from the image you just pulled.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you originally used a &lt;strong&gt;different tag&lt;/strong&gt; (e.g. &lt;code&gt;v0.8.12&lt;/code&gt; or a &lt;code&gt;cuda&lt;/code&gt; variant) instead of &lt;code&gt;main&lt;/code&gt;, substitute that tag in both &lt;code&gt;docker pull&lt;/code&gt; and &lt;code&gt;docker run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt; updating the UI does &lt;strong&gt;not&lt;/strong&gt; update &lt;code&gt;llama-server&lt;/code&gt; or your GGUF weights; the engine is still §6–§9. If you do not want to track &lt;code&gt;main&lt;/code&gt;, pin an explicit image tag in &lt;code&gt;docker run&lt;/code&gt; and repeat this flow when you choose to upgrade.&lt;/p&gt;
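&lt;p&gt;Before pulling a new image, you can optionally snapshot the volume first (a sketch; the &lt;code&gt;alpine&lt;/code&gt; helper image and the backup file name are illustrative):&lt;/p&gt;

```shell
# Archive the open-webui named volume to the current directory before upgrading
docker run --rm \
  -v open-webui:/data:ro \
  -v "$PWD":/backup \
  alpine tar czf /backup/open-webui-backup.tar.gz -C /data .
```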

&lt;h3&gt;
  
  
  If you also run Ollama
&lt;/h3&gt;

&lt;p&gt;If Ollama is installed, Open WebUI may auto-detect its endpoint on port &lt;strong&gt;11434&lt;/strong&gt;. To keep using &lt;strong&gt;your&lt;/strong&gt; Vulkan &lt;code&gt;llama-server&lt;/code&gt; (with the &lt;code&gt;-ngl&lt;/code&gt;/RAM behavior you configured), keep the OpenAI connection pointing at &lt;code&gt;:8080/v1&lt;/code&gt; first in the list and do not rely on the Ollama backend for those models.&lt;/p&gt;




&lt;h2&gt;
  
  
  11. OpenCode and VS Code with your &lt;code&gt;llama-server&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Same API surface as Open WebUI: &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt; exposes an OpenAI-compatible endpoint&lt;/strong&gt; at &lt;code&gt;http://HOST:8080/v1&lt;/code&gt; (keep §8 or §9 running). Use the mini PC’s IP instead of &lt;code&gt;127.0.0.1&lt;/code&gt; when you work from another machine on the LAN (and open port 8080 in the firewall if needed).&lt;/p&gt;
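&lt;p&gt;The local-versus-LAN distinction boils down to one base URL. A tiny helper makes the pattern explicit (port 8080 from §8; the example LAN IP is an assumption):&lt;/p&gt;

```shell
# Build the base URL a client needs, local or remote (port 8080 assumed from §8)
base_url() { printf 'http://%s:8080/v1\n' "${1:-127.0.0.1}"; }

base_url                 # same machine
base_url 192.168.1.50    # example LAN IP of the mini PC
```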

&lt;h3&gt;
  
  
  OpenCode
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://opencode.ai/" rel="noopener noreferrer"&gt;OpenCode&lt;/a&gt; can use &lt;strong&gt;OpenAI-compatible&lt;/strong&gt; providers through &lt;code&gt;@ai-sdk/openai-compatible&lt;/code&gt;. The official docs include a &lt;strong&gt;llama.cpp / llama-server&lt;/strong&gt; example: &lt;a href="https://opencode.ai/docs/providers/" rel="noopener noreferrer"&gt;Providers — llama.cpp&lt;/a&gt;.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Confirm &lt;code&gt;llama-server&lt;/code&gt; answers (e.g. &lt;code&gt;curl -s http://127.0.0.1:8080/v1/models&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Create or edit &lt;strong&gt;&lt;code&gt;opencode.json&lt;/code&gt;&lt;/strong&gt; for your project or OpenCode’s config path (&lt;code&gt;$schema&lt;/code&gt;: &lt;code&gt;https://opencode.ai/config.json&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Add a provider with &lt;code&gt;"npm": "@ai-sdk/openai-compatible"&lt;/code&gt; and &lt;code&gt;"options.baseURL": "http://127.0.0.1:8080/v1"&lt;/code&gt; (or the remote IP).&lt;/li&gt;
&lt;li&gt;Under &lt;code&gt;provider.&amp;lt;id&amp;gt;.models&lt;/code&gt;, add keys that match what the API expects. If unsure, read the &lt;code&gt;id&lt;/code&gt; field from &lt;code&gt;/v1/models&lt;/code&gt;; it is often the &lt;code&gt;.gguf&lt;/code&gt; filename or &lt;code&gt;default&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;In OpenCode, use &lt;code&gt;/models&lt;/code&gt; to pick &lt;code&gt;provider_id/model_id&lt;/code&gt;, or set &lt;code&gt;"model": "provider_id/model_id"&lt;/code&gt; in the JSON.&lt;/li&gt;
&lt;/ol&gt;
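&lt;p&gt;If you have &lt;code&gt;jq&lt;/code&gt; installed (an assumption: &lt;code&gt;sudo apt install -y jq&lt;/code&gt;), listing the IDs from step 4 is a one-liner:&lt;/p&gt;

```shell
# Print the model IDs llama-server reports; use these as keys under "models"
curl -s http://127.0.0.1:8080/v1/models | jq -r '.data[].id'
```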

&lt;p&gt;Minimal example (adjust IDs to your setup):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"$schema"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://opencode.ai/config.json"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"provider"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"llama-local"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"npm"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"@ai-sdk/openai-compatible"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-server (local)"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"options"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"baseURL"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"http://127.0.0.1:8080/v1"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"models"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"default"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Local model (default)"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llama-local/default"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If OpenCode cannot see the model, align &lt;code&gt;models&lt;/code&gt; keys with &lt;code&gt;/v1/models&lt;/code&gt;. Tools and heavy agentic flows &lt;strong&gt;depend on the GGUF&lt;/strong&gt;; a general chat model may underperform on coding-agent tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visual Studio Code
&lt;/h3&gt;

&lt;p&gt;VS Code does not talk to your server by itself—you need an &lt;strong&gt;extension&lt;/strong&gt; that supports a custom OpenAI-style endpoint.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Common picks: &lt;strong&gt;&lt;a href="https://www.continue.dev/" rel="noopener noreferrer"&gt;Continue&lt;/a&gt;&lt;/strong&gt; and others advertising &lt;strong&gt;OpenAI-compatible API&lt;/strong&gt; or “local LLM”. You typically set &lt;strong&gt;Base URL&lt;/strong&gt; to &lt;code&gt;http://127.0.0.1:8080/v1&lt;/code&gt; (or the server IP) and &lt;strong&gt;API key&lt;/strong&gt; to any placeholder (e.g. &lt;code&gt;sk-local&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub Copilot&lt;/strong&gt; in VS Code does not route through your &lt;code&gt;llama-server&lt;/code&gt;; it is a separate cloud service.&lt;/li&gt;
&lt;li&gt;From another PC, use the host IP where &lt;code&gt;llama-server&lt;/code&gt; runs—not &lt;code&gt;host.docker.internal&lt;/code&gt; (that name is for containers such as Open WebUI).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Local models driven through these extensions usually trail cloud offerings on tool use and very large contexts. Start on the same machine you already validated with &lt;code&gt;llama-cli&lt;/code&gt; or Open WebUI.&lt;/p&gt;
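&lt;p&gt;Before wiring up an extension, a quick smoke test of the exact endpoint it will call can save debugging time (the model name and placeholder key are assumptions; adjust to what &lt;code&gt;/v1/models&lt;/code&gt; reports):&lt;/p&gt;

```shell
# Minimal chat-completions request against llama-server (OpenAI-compatible)
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -H 'Authorization: Bearer sk-local' \
  -d '{"model":"default","messages":[{"role":"user","content":"Say hi"}]}'
```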




&lt;h2&gt;
  
  
  12. Troubleshooting: Vulkan / &lt;code&gt;glslc&lt;/code&gt; on Ubuntu 24.04
&lt;/h2&gt;

&lt;p&gt;Typical CMake symptoms:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;Could NOT find Vulkan (missing: ... glslc)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Vulkan found but &lt;code&gt;glslc&lt;/code&gt; still missing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Suggested order (simplest first):&lt;/p&gt;

&lt;h3&gt;
  
  
  12.1 Universe repository and packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;add-apt-repository universe
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; libvulkan-dev vulkan-tools shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Verify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;command&lt;/span&gt; &lt;span class="nt"&gt;-v&lt;/span&gt; glslc &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; glslc &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clean and reconfigure the build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build
cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_VULKAN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;1
cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;--config&lt;/span&gt; Release &lt;span class="nt"&gt;-j&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;nproc&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  12.2 LunarG repository (Vulkan SDK)
&lt;/h3&gt;

&lt;p&gt;If your Ubuntu mirror does not offer &lt;code&gt;shaderc&lt;/code&gt; or &lt;code&gt;glslc&lt;/code&gt; is still missing:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;wget &lt;span class="nt"&gt;-qO-&lt;/span&gt; https://packages.lunarg.com/lunarg-signing-key-pub.asc &lt;span class="se"&gt;\&lt;/span&gt;
  | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /etc/apt/trusted.gpg.d/lunarg.asc
&lt;span class="nb"&gt;sudo &lt;/span&gt;wget &lt;span class="nt"&gt;-qO&lt;/span&gt; /etc/apt/sources.list.d/lunarg-vulkan-noble.list &lt;span class="se"&gt;\&lt;/span&gt;
  https://packages.lunarg.com/vulkan/lunarg-vulkan-noble.list
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; vulkan-sdk
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then &lt;code&gt;rm -rf build&lt;/code&gt; and run &lt;code&gt;cmake&lt;/code&gt; again.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.3 Conflict between Ubuntu’s &lt;code&gt;libshaderc-dev&lt;/code&gt; and LunarG’s Shaderc
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;dpkg&lt;/code&gt; complains about overwriting files between packages, as a last resort you can force-remove the blocking package, then repair:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--remove&lt;/span&gt; &lt;span class="nt"&gt;--force-depends&lt;/span&gt; libshaderc-dev
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--fix-broken&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; shaderc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Only do this if you understand that mixing repositories can leave dependencies in a messy state; often, sticking to &lt;strong&gt;either&lt;/strong&gt; LunarG &lt;strong&gt;or&lt;/strong&gt; Ubuntu for the Shaderc dev packages is enough.&lt;/p&gt;

&lt;h3&gt;
  
  
  12.4 Snap fallback for &lt;code&gt;glslc&lt;/code&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;snap &lt;span class="nb"&gt;install &lt;/span&gt;google-shaderc
&lt;span class="nb"&gt;sudo ln&lt;/span&gt; &lt;span class="nt"&gt;-sf&lt;/span&gt; /snap/bin/glslc /usr/local/bin/glslc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check &lt;code&gt;glslc --version&lt;/code&gt; again and retry CMake.&lt;/p&gt;




&lt;h2&gt;
  
  
  13. Performance and models (rough guide)
&lt;/h2&gt;

&lt;p&gt;With &lt;strong&gt;lots of RAM&lt;/strong&gt; and a &lt;strong&gt;modest iGPU&lt;/strong&gt;, tokens/s is capped by unified-memory bandwidth and by how many layers &lt;code&gt;-ngl&lt;/code&gt; offloads; larger models can spill into system RAM and slow down further.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 4 26B A4B (e.g. Q4_K_M ~17 GiB)&lt;/td&gt;
&lt;td&gt;Good balance with high RAM; needs an up-to-date llama.cpp.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Same family Q8_0 (~27 GiB)&lt;/td&gt;
&lt;td&gt;Better quality; more pressure on RAM/unified VRAM.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixtral 8×7B, 70B, others&lt;/td&gt;
&lt;td&gt;Feasible mainly thanks to RAM; slower.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Use a lower quantization (e.g. Q4_K_M) if you prioritize &lt;strong&gt;speed&lt;/strong&gt; over &lt;strong&gt;quality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For hard numbers &lt;strong&gt;on your&lt;/strong&gt; box, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7): it is the most direct way to compare &lt;code&gt;-ngl&lt;/code&gt; and quantizations without the web UI in the way.&lt;/p&gt;
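&lt;p&gt;A sketch of such a comparison (the file names are illustrative; &lt;code&gt;llama-bench&lt;/code&gt; accepts comma-separated values to sweep a parameter):&lt;/p&gt;

```shell
# Sweep GPU offload levels for one quantization, then compare against another
./build/bin/llama-bench -m ~/models/model-q4_k_m.gguf -ngl 0,16,32
./build/bin/llama-bench -m ~/models/model-q8_0.gguf -ngl 32
```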

&lt;h3&gt;
  
  
  &lt;code&gt;htop&lt;/code&gt; looks “light” while you chat (is that normal?)
&lt;/h3&gt;

&lt;p&gt;If &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt; shows &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;low CPU&lt;/strong&gt; across cores and only a &lt;strong&gt;few GiB&lt;/strong&gt; of &lt;strong&gt;RES&lt;/strong&gt;, that is often &lt;strong&gt;expected&lt;/strong&gt; when:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; leaves much of the model on the &lt;strong&gt;iGPU&lt;/strong&gt; — heavy matmul runs on the graphics core; the &lt;strong&gt;CPU&lt;/strong&gt; orchestrates and shuffles data, so you may &lt;strong&gt;not&lt;/strong&gt; see all cores pegged at 100%.&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;GGUF is small&lt;/strong&gt; (e.g. 7B/8B &lt;strong&gt;Q4&lt;/strong&gt;) — small &lt;strong&gt;resident&lt;/strong&gt; RAM footprint; a &lt;strong&gt;26B&lt;/strong&gt; run would show much more &lt;strong&gt;RES&lt;/strong&gt; if most weights live in system memory.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bursts&lt;/strong&gt; happen while scoring the prompt and &lt;strong&gt;generating&lt;/strong&gt; tokens; between turns or while you read output, usage &lt;strong&gt;drops&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;With &lt;strong&gt;unified memory (UMA)&lt;/strong&gt;, some model cost may &lt;strong&gt;not&lt;/strong&gt; show up as a huge process RSS: the &lt;strong&gt;GPU&lt;/strong&gt; also holds part of the working set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do &lt;strong&gt;not&lt;/strong&gt; assume nothing is working just because &lt;code&gt;htop&lt;/code&gt; stays calm: check &lt;strong&gt;t/s&lt;/strong&gt; in &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; (§7), or a &lt;strong&gt;GPU&lt;/strong&gt; monitor if you want to see graphics load.&lt;/p&gt;
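&lt;p&gt;On &lt;code&gt;amdgpu&lt;/code&gt;, one quick way to watch graphics load without extra tools is the sysfs utilization counter (the &lt;code&gt;card*&lt;/code&gt; glob is an assumption; the exact card number varies per machine):&lt;/p&gt;

```shell
# Refresh iGPU utilization once per second while a prompt is generating
watch -n1 'cat /sys/class/drm/card*/device/gpu_busy_percent'
```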

&lt;p&gt;&lt;strong&gt;Reference screenshot&lt;/strong&gt; (same class of mini PC as the validated hardware; &lt;strong&gt;SSH&lt;/strong&gt; + &lt;strong&gt;&lt;code&gt;htop&lt;/code&gt;&lt;/strong&gt;: &lt;code&gt;llama.cpp&lt;/code&gt; around &lt;strong&gt;~5 GiB RES&lt;/strong&gt; and &lt;strong&gt;moderate&lt;/strong&gt; CPU on one core—consistent with a &lt;strong&gt;non-huge&lt;/strong&gt; model and &lt;strong&gt;GPU&lt;/strong&gt;-bound &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt;):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flmywiw3u9wgcgka9647v.png" alt="htop during inference: llama.cpp with moderate CPU and RAM (Vulkan / -ngl)." width="800" height="939"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  AMD: &lt;code&gt;amdgpu_pm_info&lt;/code&gt; and &lt;code&gt;dri/N&lt;/code&gt; (not always &lt;code&gt;dri/0&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;Many snippets use &lt;strong&gt;&lt;code&gt;/sys/kernel/debug/dri/0/amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt;. On Ryzen mini PCs with &lt;strong&gt;amdgpu&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;dri/0&lt;/code&gt; often does not exist&lt;/strong&gt;: the kernel exposes the GPU under a &lt;strong&gt;PCI BDF&lt;/strong&gt; directory (&lt;code&gt;0000:c4:00.0&lt;/code&gt;, …) and provides &lt;strong&gt;symlinks&lt;/strong&gt; such as &lt;strong&gt;&lt;code&gt;dri/1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;dri/128&lt;/code&gt;&lt;/strong&gt; into the same tree. If &lt;code&gt;cat&lt;/code&gt; returns &lt;em&gt;No such file or directory&lt;/em&gt;, inspect first:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;mount | &lt;span class="nb"&gt;grep &lt;/span&gt;debugfs   &lt;span class="c"&gt;# expect debugfs on /sys/kernel/debug&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt; /sys/kernel/debug/dri/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then read &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt;&lt;/strong&gt; using the &lt;strong&gt;&lt;code&gt;N&lt;/code&gt;&lt;/strong&gt; or PCI path that belongs to your AMDGPU (&lt;strong&gt;&lt;code&gt;1&lt;/code&gt;&lt;/strong&gt; or &lt;strong&gt;&lt;code&gt;0000:…:….0&lt;/code&gt;&lt;/strong&gt; usually works):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo cat&lt;/span&gt; /sys/kernel/debug/dri/1/amdgpu_pm_info
&lt;span class="c"&gt;# same content if 1 → 0000:c4:00.0:&lt;/span&gt;
&lt;span class="c"&gt;# sudo cat /sys/kernel/debug/dri/0000:c4:00.0/amdgpu_pm_info&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If the directory exists but &lt;strong&gt;&lt;code&gt;amdgpu_pm_info&lt;/code&gt; is missing&lt;/strong&gt;, your kernel may &lt;strong&gt;not export&lt;/strong&gt; that node; try &lt;code&gt;ls … | grep -i pm&lt;/code&gt;. That does &lt;strong&gt;not&lt;/strong&gt; mean Vulkan is broken.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How to read it (sample text, idle mini PC):&lt;/strong&gt; &lt;strong&gt;GPU Load: 0 %&lt;/strong&gt; with &lt;strong&gt;VCN powered down&lt;/strong&gt; matches &lt;strong&gt;idle&lt;/strong&gt;. While &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; / &lt;strong&gt;&lt;code&gt;llama-server&lt;/code&gt;&lt;/strong&gt; runs a long &lt;strong&gt;&lt;code&gt;‑ngl&lt;/code&gt;&lt;/strong&gt; job, run &lt;code&gt;cat&lt;/code&gt; &lt;strong&gt;during&lt;/strong&gt; generation: you should usually see &lt;strong&gt;Load &amp;gt; 0 %&lt;/strong&gt; (the counter may not peg the iGPU). For a live view, &lt;strong&gt;&lt;code&gt;radeontop&lt;/code&gt;&lt;/strong&gt; is often easier (&lt;code&gt;sudo apt install -y radeontop&lt;/code&gt;).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GFX Clocks and Power:
    2800 MHz (MCLK)
    800 MHz (SCLK)
    ...
GPU Temperature: 36 C
GPU Load: 0 %
VCN Load: 0 %
VCN: Powered down
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(Illustrative excerpt; clocks, millivolts, and watts vary with BIOS, governor, and workload.)&lt;/p&gt;




&lt;h2&gt;
  
  
  14. Remote desktop (Ubuntu 24.04 Desktop, LAN)
&lt;/h2&gt;

&lt;p&gt;When the mini PC runs &lt;strong&gt;GNOME&lt;/strong&gt; and you want the full desktop from &lt;strong&gt;another machine on the same network&lt;/strong&gt; (Windows, Mac, Linux), &lt;strong&gt;Ubuntu 24.04&lt;/strong&gt; usually ships &lt;strong&gt;RDP&lt;/strong&gt; built in; you often &lt;strong&gt;do not&lt;/strong&gt; need &lt;strong&gt;xrdp&lt;/strong&gt; unless you want different behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.1 Enable on the mini PC
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Settings&lt;/strong&gt; → &lt;strong&gt;System&lt;/strong&gt; → &lt;strong&gt;Remote Desktop&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Turn &lt;strong&gt;Remote Desktop&lt;/strong&gt; on.&lt;/li&gt;
&lt;li&gt;Finish the assistant (password / auth as GNOME shows).&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Underlying package: &lt;strong&gt;&lt;code&gt;gnome-remote-desktop&lt;/code&gt;&lt;/strong&gt;. If the toggle is missing or fails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--reinstall&lt;/span&gt; gnome-remote-desktop
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Log out or reboot and open Settings again.&lt;/p&gt;

&lt;h3&gt;
  
  
  14.2 Connect from another machine
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Native &lt;strong&gt;RDP&lt;/strong&gt; clients: &lt;strong&gt;Windows&lt;/strong&gt; (Remote Desktop Connection / &lt;code&gt;mstsc&lt;/code&gt;), &lt;strong&gt;macOS&lt;/strong&gt; (Microsoft Remote Desktop from the App Store), &lt;strong&gt;Linux&lt;/strong&gt; (e.g. Remmina, RDP protocol).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Host:&lt;/strong&gt; the Ubuntu box’s &lt;strong&gt;LAN IP&lt;/strong&gt; (&lt;code&gt;hostname -I | awk '{print $1}'&lt;/code&gt; on the mini PC).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Port:&lt;/strong&gt; &lt;strong&gt;3389/TCP&lt;/strong&gt; by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  14.3 Firewall (&lt;code&gt;ufw&lt;/code&gt;)
&lt;/h3&gt;

&lt;p&gt;If &lt;code&gt;ufw&lt;/code&gt; is enabled:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw allow 3389/tcp comment &lt;span class="s1"&gt;'GNOME RDP'&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ufw status
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  14.4 If connection fails
&lt;/h3&gt;

&lt;p&gt;On the &lt;strong&gt;Ubuntu host&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;hostname&lt;/span&gt; &lt;span class="nt"&gt;-I&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;ss &lt;span class="nt"&gt;-tlnp&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;3389 &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With Remote Desktop enabled, something should listen on &lt;strong&gt;3389&lt;/strong&gt;. Confirm the client is on the &lt;strong&gt;same LAN&lt;/strong&gt; and that no AP isolation blocks client-to-client Wi‑Fi.&lt;/p&gt;

&lt;p&gt;If GNOME/RDP misbehaves on &lt;strong&gt;Wayland&lt;/strong&gt;, try the &lt;strong&gt;Ubuntu on Xorg&lt;/strong&gt; session on the login screen and enable Remote Desktop again.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security:&lt;/strong&gt; exposing RDP to the &lt;strong&gt;public Internet&lt;/strong&gt; without VPN/tunnel is a bad idea; keep it on a &lt;strong&gt;trusted LAN&lt;/strong&gt; or behind VPN/WireGuard.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final checklist
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;[ ] BIOS: UMA / VRAM for iGPU adjusted if applicable.&lt;/li&gt;
&lt;li&gt;[ ] Vulkan OK: on desktop &lt;code&gt;vkcube&lt;/code&gt;; on &lt;strong&gt;Ubuntu Server&lt;/strong&gt; &lt;code&gt;vulkaninfo --summary&lt;/code&gt; shows the GPU.&lt;/li&gt;
&lt;li&gt;[ ] User is in &lt;strong&gt;&lt;code&gt;render&lt;/code&gt;&lt;/strong&gt; and &lt;strong&gt;&lt;code&gt;video&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;id -nG&lt;/code&gt;); if you ran &lt;code&gt;usermod&lt;/code&gt;, you &lt;strong&gt;logged out or rebooted&lt;/strong&gt; (an old shell session does not pick up new groups).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;cmake -B build -DGGML_VULKAN=1&lt;/code&gt; succeeds; build reaches 100 %.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;update &lt;code&gt;llama.cpp&lt;/code&gt;&lt;/strong&gt; (&lt;code&gt;git pull&lt;/code&gt;, rebuild §6) and follow &lt;strong&gt;try model → systemd → Open WebUI&lt;/strong&gt; when experimenting with new GGUFs (§7, &lt;em&gt;Experimenting…&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-cli&lt;/code&gt; shows the Vulkan device when loading the model.&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;llama-server&lt;/code&gt; responds on &lt;code&gt;:8080&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] Open WebUI on &lt;code&gt;:3000&lt;/code&gt; with &lt;code&gt;http://host.docker.internal:8080/v1&lt;/code&gt; and &lt;strong&gt;Direct connections&lt;/strong&gt; off.&lt;/li&gt;
&lt;li&gt;[ ] You know the model does &lt;strong&gt;not&lt;/strong&gt; browse or read GitHub from a URL alone; it may &lt;strong&gt;hallucinate&lt;/strong&gt; capabilities (§10 &lt;em&gt;No browsing or GitHub fetch&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to upgrade Open WebUI&lt;/strong&gt;: &lt;code&gt;docker pull&lt;/code&gt;, &lt;code&gt;stop&lt;/code&gt;/&lt;code&gt;rm&lt;/code&gt; the container, rerun the same &lt;code&gt;docker run&lt;/code&gt; with the &lt;code&gt;open-webui&lt;/code&gt; volume (§10).&lt;/li&gt;
&lt;li&gt;[ ] &lt;code&gt;systemd&lt;/code&gt; service enabled if you want a persistent boot setup.&lt;/li&gt;
&lt;li&gt;[ ] You know &lt;strong&gt;how to switch models&lt;/strong&gt;: after adding another &lt;code&gt;.gguf&lt;/code&gt;, you update &lt;code&gt;-m&lt;/code&gt; in &lt;code&gt;llama-web.service&lt;/code&gt; (or in the manual command), run &lt;code&gt;sudo systemctl daemon-reload &amp;amp;&amp;amp; sudo systemctl restart llama-web.service&lt;/code&gt;, and reload Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] You can &lt;strong&gt;list&lt;/strong&gt; your &lt;code&gt;.gguf&lt;/code&gt; files (&lt;code&gt;ls&lt;/code&gt; / &lt;code&gt;find&lt;/code&gt;, §7) and &lt;strong&gt;measure&lt;/strong&gt; throughput with &lt;code&gt;llama-bench&lt;/code&gt; (§7) when comparing quantizations or &lt;code&gt;-ngl&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;[ ] You can follow the &lt;strong&gt;unified playbook&lt;/strong&gt; for Gemma 4 / Qwen Coder / DeepSeek Lite / Llama 3.1 (§7): download → &lt;code&gt;llama-cli&lt;/code&gt; → &lt;code&gt;systemd&lt;/code&gt; → &lt;code&gt;/v1/models&lt;/code&gt; → Open WebUI.&lt;/li&gt;
&lt;li&gt;[ ] (Optional) &lt;strong&gt;Remote desktop&lt;/strong&gt; §14: RDP enabled in Settings, &lt;strong&gt;3389&lt;/strong&gt; allowed in &lt;code&gt;ufw&lt;/code&gt; if needed, smoke tested from another PC on the LAN.&lt;/li&gt;
&lt;/ul&gt;
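&lt;p&gt;The model-switch item above can be sketched as (unit name from §9; the model path is illustrative):&lt;/p&gt;

```shell
# Point the unit at another GGUF, then restart and watch it load
sudo systemctl edit --full llama-web.service   # update -m /path/to/model.gguf
sudo systemctl daemon-reload
sudo systemctl restart llama-web.service
journalctl -u llama-web.service -f             # confirm the new model loads
```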




&lt;h2&gt;
  
  
  Quick port reference
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Service&lt;/th&gt;
&lt;th&gt;Port&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;llama-server&lt;/td&gt;
&lt;td&gt;8080&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Open WebUI&lt;/td&gt;
&lt;td&gt;3000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Remote desktop (GNOME RDP)&lt;/td&gt;
&lt;td&gt;3389 TCP&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Ollama (optional)&lt;/td&gt;
&lt;td&gt;11434&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;Running local inference on Ubuntu with Vulkan and an AMD iGPU is not a one-click setup, but it is worth it: a model that answers &lt;strong&gt;on your LAN&lt;/strong&gt;, without routing every request through a third-party API, and with the freedom to swap GGUFs or quantizations when you need to.&lt;/p&gt;

&lt;p&gt;The stack moves fast: &lt;strong&gt;llama.cpp&lt;/strong&gt;, Ubuntu packages, and Hugging Face repos &lt;strong&gt;change&lt;/strong&gt; over time. If a command or package name no longer matches this guide, &lt;code&gt;cmake&lt;/code&gt; and &lt;code&gt;apt&lt;/code&gt; errors usually point you in the right direction; double-check the project’s current docs.&lt;/p&gt;

&lt;p&gt;Once the checklist is green, the natural next step is tuning &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt;, context size (&lt;code&gt;-c&lt;/code&gt;), and the model until you get the quality-vs-tokens-per-second balance you want &lt;strong&gt;on your&lt;/strong&gt; hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the mini PC&lt;/strong&gt; we used for the tests and validation in this guide: &lt;strong&gt;Minisforum UM760 Slim&lt;/strong&gt; (&lt;strong&gt;Ryzen 5 7640HS&lt;/strong&gt;, &lt;strong&gt;Radeon 760M&lt;/strong&gt;), &lt;strong&gt;Ubuntu 24.04 LTS&lt;/strong&gt;, plenty of &lt;strong&gt;DDR5&lt;/strong&gt; RAM and NVMe — the same box behind the &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; runs, &lt;strong&gt;&lt;code&gt;llama-cli&lt;/code&gt;&lt;/strong&gt; screenshots, &lt;strong&gt;Open WebUI&lt;/strong&gt; examples, and the other reference captures. The photo is the &lt;strong&gt;actual&lt;/strong&gt; machine (powered on, front panel as shown), not a marketing render.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fattwa07dclf2y6szn3mn.png" alt="Minisforum UM760 Slim — the physical box used to validate this guide (Ryzen 5 7640HS, Radeon 760M, Ubuntu 24.04)." width="732" height="554"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now go tinker:&lt;/strong&gt; this walkthrough is rooted in &lt;strong&gt;Ryzen + iGPU&lt;/strong&gt;, but the playbook travels—&lt;strong&gt;mini PCs&lt;/strong&gt; (Minisforum, Beelink, &lt;strong&gt;ASUS ExpertCenter PN&lt;/strong&gt;, &lt;strong&gt;ZOTAC ZBOX&lt;/strong&gt;, modern &lt;strong&gt;Intel NUC-class&lt;/strong&gt; boxes…), &lt;strong&gt;Mac mini&lt;/strong&gt; / &lt;strong&gt;Mac Studio&lt;/strong&gt; on Apple Silicon if that is your stack, or compact power boxes like &lt;strong&gt;NVIDIA DGX Spark&lt;/strong&gt; when budget and goals match. Build &lt;strong&gt;llama.cpp&lt;/strong&gt; (or your preferred runtime), stress &lt;strong&gt;GGUF&lt;/strong&gt; quantizations, run &lt;strong&gt;&lt;code&gt;llama-bench&lt;/code&gt;&lt;/strong&gt; on &lt;strong&gt;your&lt;/strong&gt; iron, and tune &lt;strong&gt;&lt;code&gt;-ngl&lt;/code&gt;&lt;/strong&gt; until the ceiling feels right. &lt;strong&gt;Share&lt;/strong&gt; what you learn—a &lt;strong&gt;dev.to&lt;/strong&gt; post, a blog, &lt;strong&gt;Mastodon&lt;/strong&gt;, article comments, or whatever community you use; real numbers beat brochure claims every time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One quiet takeaway:&lt;/em&gt; on &lt;strong&gt;your&lt;/strong&gt; codebases the model usually helps more as a &lt;strong&gt;copilot you feed&lt;/strong&gt;—a diff, a log slice, a trimmed README—than as an &lt;strong&gt;all-knowing reviewer&lt;/strong&gt; handed a bare URL or a polished persona. When the answer feels &lt;em&gt;too&lt;/em&gt; slick without anything concrete in the prompt, the limit is rarely the mini PC: the model is strictly &lt;strong&gt;text-in, text-out&lt;/strong&gt;, and nothing is reading your disk on its behalf. §10 walks through the evidence; day-to-day, &lt;strong&gt;you&lt;/strong&gt; supply the ground truth.&lt;/p&gt;

&lt;p&gt;AI disclosure: I wrote the technical walkthrough from my own setup (Ubuntu 24.04, llama.cpp + Vulkan, Minisforum mini PC, real llama-bench numbers and screenshots). I used AI tools (e.g. ChatGPT/Gemini/Cursor-style assistants) for brainstorming titles, structure, and Reddit post wording, and for editing clarity in places—not for inventing the commands, hardware facts, or benchmarks, which I ran and documented myself. The project itself (self-hosted stack) does not require callers to use cloud LLMs; it’s local inference. Happy to clarify further if needed.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>gghstats-selfhosted: production-shaped manifests for gghstats</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Fri, 03 Apr 2026 15:43:55 +0000</pubDate>
      <link>https://dev.to/hrodrig/gghstats-selfhosted-production-shaped-manifests-for-gghstats-56j9</link>
      <guid>https://dev.to/hrodrig/gghstats-selfhosted-production-shaped-manifests-for-gghstats-56j9</guid>
      <description>&lt;p&gt;You already read &lt;strong&gt;&lt;a href="https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days"&gt;gghstats: Keep GitHub traffic past 14 days&lt;/a&gt;&lt;/strong&gt; — &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;gghstats&lt;/a&gt;&lt;/strong&gt; is the small Go service that keeps &lt;strong&gt;GitHub traffic history&lt;/strong&gt; in &lt;strong&gt;SQLite&lt;/strong&gt; instead of losing it after GitHub’s ~14-day window. The app ships binaries, a multi-arch &lt;strong&gt;GHCR&lt;/strong&gt; image, and a focused README.&lt;/p&gt;

&lt;p&gt;What lives &lt;em&gt;outside&lt;/em&gt; the application repo is everything that answers: &lt;strong&gt;how do I run this for real on my box, VPS, or cluster?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That’s &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted" rel="noopener noreferrer"&gt;gghstats-selfhosted&lt;/a&gt;&lt;/strong&gt; — a separate repository with &lt;strong&gt;deployment manifests only&lt;/strong&gt;: &lt;strong&gt;&lt;code&gt;docker run&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;Docker Compose&lt;/strong&gt; (minimal, Traefik + HTTPS, optional observability), and a &lt;strong&gt;Helm chart&lt;/strong&gt; for Kubernetes. No Go code here; it stays easy to fork, pin, and diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live app demo (read-only):&lt;/strong&gt; &lt;a href="https://gghstats.hermesrodriguez.com" rel="noopener noreferrer"&gt;gghstats.hermesrodriguez.com&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What the app looks like
&lt;/h3&gt;

&lt;p&gt;Flat screenshots with a &lt;strong&gt;white&lt;/strong&gt; backdrop, perspective, and soft shadow (local asset pipeline — not in the GitHub repos):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Main dashboard (repository list):&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0hgzcvirisp5vwhtwss.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc0hgzcvirisp5vwhtwss.png" alt="gghstats — main dashboard" width="800" height="490"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Repository detail (charts and tables):&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d3kh7t05ku9zv4fdue.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F17d3kh7t05ku9zv4fdue.png" alt="gghstats — repository detail" width="800" height="692"&gt;&lt;/a&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Why split “app” and “how to run it”?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;gghstats&lt;/strong&gt; = releases, security advisories, feature issues, container tags.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;gghstats-selfhosted&lt;/strong&gt; = Compose files, Helm templates, env patterns, and docs that change when &lt;em&gt;deployment&lt;/em&gt; stories evolve (Traefik labels, persistence, layout under &lt;strong&gt;&lt;code&gt;run/&lt;/code&gt;&lt;/strong&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can ignore the self-hosted repo forever and still run from &lt;strong&gt;GHCR&lt;/strong&gt; with a one-liner — but if you want &lt;strong&gt;opinionated layouts&lt;/strong&gt; (shared &lt;strong&gt;&lt;code&gt;GGHSTATS_HOST_DATA&lt;/code&gt;&lt;/strong&gt;, secrets outside the git clone, optional Prometheus/Grafana/Loki behind the same Traefik network), the split keeps each repo readable.&lt;/p&gt;
&lt;h3&gt;
  
  
  Who this is for
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Self-hosters&lt;/strong&gt; who already run a &lt;strong&gt;VPS&lt;/strong&gt;, &lt;strong&gt;Compose&lt;/strong&gt;, or a &lt;strong&gt;small Kubernetes&lt;/strong&gt; cluster and want a &lt;strong&gt;repeatable&lt;/strong&gt; layout instead of a one-off paste from Stack Overflow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operators&lt;/strong&gt; who care about &lt;strong&gt;pinning an image tag&lt;/strong&gt;, a &lt;strong&gt;single persistent path&lt;/strong&gt; for SQLite, and &lt;strong&gt;secrets that never live in git&lt;/strong&gt; — the same reasons larger teams split “app” and “platform,” just with a tiny footprint.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  What this layout tries to spare you
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Wiring &lt;strong&gt;Traefik + Let’s Encrypt&lt;/strong&gt; from scratch on every new service.&lt;/li&gt;
&lt;li&gt;Stuffing &lt;strong&gt;Compose samples, Helm, and env docs&lt;/strong&gt; into the &lt;strong&gt;application&lt;/strong&gt; repo (noisier release notes, harder security reviews).&lt;/li&gt;
&lt;li&gt;Forgetting &lt;strong&gt;which volume&lt;/strong&gt; holds &lt;strong&gt;&lt;code&gt;gghstats.db&lt;/code&gt;&lt;/strong&gt; when you bump the image six months later.&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Context (other approaches)
&lt;/h3&gt;

&lt;p&gt;GitHub’s Traffic view only goes back about &lt;strong&gt;14 days&lt;/strong&gt;. Other open-source projects chase “GitHub stats” with different goals (dashboards, exporters, hosted analytics). &lt;strong&gt;gghstats&lt;/strong&gt; stays narrow: &lt;strong&gt;persist&lt;/strong&gt; traffic via the API into &lt;strong&gt;SQLite&lt;/strong&gt;, ship one &lt;strong&gt;Go&lt;/strong&gt; binary or &lt;strong&gt;GHCR&lt;/strong&gt; image, and let a &lt;strong&gt;separate&lt;/strong&gt; repo own &lt;strong&gt;how&lt;/strong&gt; you run it. If that trade-off fits you, these manifests are the glue.&lt;/p&gt;


&lt;h2&gt;
  
  
  What you get in &lt;code&gt;run/&lt;/code&gt;
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Path&lt;/th&gt;
&lt;th&gt;Roughly&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/standalone/{linux,macos,windows}/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Notes for the binary-only path&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Single-container &lt;code&gt;docker run&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/minimal/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;One service, quick VPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/traefik/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HTTPS + Let’s Encrypt + edge network for the app&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/docker-compose/observability/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional Prometheus / Grafana / Loki (after Traefik)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/kubernetes/helm/gghstats/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Helm chart &lt;strong&gt;&lt;code&gt;gghstats&lt;/code&gt;&lt;/strong&gt; (same name as the app; &lt;strong&gt;not&lt;/strong&gt; the GitHub repo name)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;&lt;code&gt;run/kubernetes/manifests/&lt;/code&gt;&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Plain YAML if you prefer not to use Helm&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The table in the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#pick-a-path" rel="noopener noreferrer"&gt;README&lt;/a&gt;&lt;/strong&gt; is the fastest way to jump to the flow you want.&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick starts (copy, adjust, run)
&lt;/h2&gt;

&lt;p&gt;Pick one path. Replace &lt;strong&gt;&lt;code&gt;ghp_xxx&lt;/code&gt;&lt;/strong&gt;, host paths, &lt;strong&gt;&lt;code&gt;your-github-user/*&lt;/code&gt;&lt;/strong&gt;, image tags, and domains with yours. Pin &lt;strong&gt;&lt;code&gt;ghcr.io/hrodrig/gghstats:&lt;/code&gt;&lt;/strong&gt; to a tag that exists on &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;GHCR / releases&lt;/a&gt;&lt;/strong&gt; (example below uses &lt;strong&gt;&lt;code&gt;v0.1.2&lt;/code&gt;&lt;/strong&gt;).&lt;/p&gt;
&lt;h3&gt;
  
  
  GitHub token (scopes and safety)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;&lt;code&gt;GGHSTATS_GITHUB_TOKEN&lt;/code&gt;&lt;/strong&gt; must be a &lt;strong&gt;Personal Access Token&lt;/strong&gt; that can reach the repos matched by &lt;strong&gt;&lt;code&gt;GGHSTATS_FILTER&lt;/code&gt;&lt;/strong&gt;. Follow &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats#token-setup" rel="noopener noreferrer"&gt;gghstats — Token setup&lt;/a&gt;&lt;/strong&gt;:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Token type&lt;/th&gt;
&lt;th&gt;Scopes / access (summary)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Classic PAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&lt;code&gt;public_repo&lt;/code&gt;&lt;/strong&gt; — enough for &lt;strong&gt;public&lt;/strong&gt; repos only. Use &lt;strong&gt;&lt;code&gt;repo&lt;/code&gt;&lt;/strong&gt; if you track &lt;strong&gt;private&lt;/strong&gt; repositories (or use &lt;strong&gt;&lt;code&gt;GGHSTATS_INCLUDE_PRIVATE=true&lt;/code&gt;&lt;/strong&gt;).&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Fine-grained PAT&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Grant access to the repositories you need; include whatever &lt;strong&gt;repository permissions&lt;/strong&gt; GitHub requires for the &lt;strong&gt;Traffic&lt;/strong&gt; and repo &lt;strong&gt;metadata&lt;/strong&gt; APIs for those repos (the token wizard lists them per permission).&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Safety:&lt;/strong&gt; do &lt;strong&gt;not&lt;/strong&gt; commit the token, put it in a public gist, or paste it into issues. Prefer env vars, Compose &lt;strong&gt;&lt;code&gt;env_file&lt;/code&gt;&lt;/strong&gt;, or Kubernetes &lt;strong&gt;Secrets&lt;/strong&gt;. The app’s startup banner only shows a &lt;strong&gt;masked&lt;/strong&gt; token. If the dashboard is empty after sync, verify &lt;strong&gt;filter&lt;/strong&gt; rules and &lt;strong&gt;token scope&lt;/strong&gt; (see &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats#dashboard-shows-no-repositories" rel="noopener noreferrer"&gt;Troubleshooting&lt;/a&gt;&lt;/strong&gt; in the app README).&lt;/p&gt;
&lt;h3&gt;
  
  
  Binary (no Docker)
&lt;/h3&gt;

&lt;p&gt;Grab a release binary from &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;gghstats Releases&lt;/a&gt;&lt;/strong&gt;, extract, then:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx
./gghstats serve
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open &lt;strong&gt;&lt;a href="http://localhost:8080" rel="noopener noreferrer"&gt;http://localhost:8080&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Docker (one container)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;

docker run &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GGHSTATS_GITHUB_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nv"&gt;GGHSTATS_FILTER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your-github-user/*"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;:/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt; gghstats &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/hrodrig/gghstats:v0.1.2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose (minimal, one service)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats-selfhosted.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gghstats-selfhosted
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;run/common/.env.example &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt;
&lt;span class="c"&gt;# Edit "${GGHSTATS_HOST_DATA}/.env" — at least GGHSTATS_GITHUB_TOKEN, GGHSTATS_VERSION, GGHSTATS_HOST_DATA&lt;/span&gt;

docker compose &lt;span class="nt"&gt;--env-file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; run/docker-compose/minimal/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Docker Compose + Traefik (HTTPS, production-shaped)
&lt;/h3&gt;

&lt;p&gt;Requires a DNS &lt;strong&gt;A/AAAA&lt;/strong&gt; record pointing at this host, with ports &lt;strong&gt;80&lt;/strong&gt; and &lt;strong&gt;443&lt;/strong&gt; reachable from the internet (Let’s Encrypt validation happens over them).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats-selfhosted.git
&lt;span class="nb"&gt;cd &lt;/span&gt;gghstats-selfhosted
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/home/gghstats/gghstats-data
&lt;span class="nb"&gt;mkdir&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
&lt;span class="nb"&gt;cp &lt;/span&gt;run/common/.env.example &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt;
&lt;span class="c"&gt;# Edit "${GGHSTATS_HOST_DATA}/.env" — token, GGHSTATS_HOSTNAME, ACME_EMAIL, GGHSTATS_VERSION, GGHSTATS_HOST_DATA, …&lt;/span&gt;

docker compose &lt;span class="nt"&gt;--env-file&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="k"&gt;${&lt;/span&gt;&lt;span class="nv"&gt;GGHSTATS_HOST_DATA&lt;/span&gt;&lt;span class="k"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;/.env"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-f&lt;/span&gt; run/docker-compose/traefik/docker-compose.yml up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Kubernetes (Helm)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;helm repo add gghstats https://hrodrig.github.io/gghstats-selfhosted
helm repo update
helm show values gghstats/gghstats &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; my-values.yaml
&lt;span class="c"&gt;# Edit my-values.yaml — e.g. image.tag, persistence, resources; keep githubToken.value empty (PAT goes in the Secret below)&lt;/span&gt;

kubectl create namespace gghstats
kubectl create secret generic gghstats-secret &lt;span class="nt"&gt;-n&lt;/span&gt; gghstats &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--from-literal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;github-token&lt;span class="o"&gt;=&lt;/span&gt;ghp_xxx
helm &lt;span class="nb"&gt;install &lt;/span&gt;gghstats gghstats/gghstats &lt;span class="nt"&gt;-n&lt;/span&gt; gghstats &lt;span class="nt"&gt;-f&lt;/span&gt; my-values.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;&lt;code&gt;my-values.yaml&lt;/code&gt;:&lt;/strong&gt; start from &lt;strong&gt;&lt;code&gt;helm show values&lt;/code&gt;&lt;/strong&gt; (above) so you inherit defaults and &lt;strong&gt;&lt;code&gt;values.schema.json&lt;/code&gt;&lt;/strong&gt; constraints (e.g. &lt;strong&gt;&lt;code&gt;resources&lt;/code&gt;&lt;/strong&gt;). Do &lt;strong&gt;not&lt;/strong&gt; put the PAT in that file — use the &lt;strong&gt;&lt;code&gt;Secret&lt;/code&gt;&lt;/strong&gt; and leave &lt;strong&gt;&lt;code&gt;githubToken.value&lt;/code&gt;&lt;/strong&gt; empty. Details: &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#kubernetes-helm" rel="noopener noreferrer"&gt;README — Kubernetes Helm&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example &lt;code&gt;my-values.yaml&lt;/code&gt; fragment&lt;/strong&gt; — token only in the &lt;strong&gt;&lt;code&gt;Secret&lt;/code&gt;&lt;/strong&gt; created above; the chart reads it via &lt;strong&gt;&lt;code&gt;githubToken.existingSecret&lt;/code&gt;&lt;/strong&gt; (or default &lt;strong&gt;&lt;code&gt;secretName&lt;/code&gt;&lt;/strong&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Excerpt — always start from: helm show values gghstats/gghstats &amp;gt; my-values.yaml&lt;/span&gt;
&lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;tag&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v0.1.2"&lt;/span&gt;

&lt;span class="na"&gt;githubToken&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
  &lt;span class="na"&gt;existingSecret&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gghstats-secret"&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;requests&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;50m"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;128Mi"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
  &lt;span class="na"&gt;limits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;1"&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;512Mi"&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Helm chart security (defaults):&lt;/strong&gt; the workload runs &lt;strong&gt;non-root&lt;/strong&gt; (UID/GID &lt;strong&gt;1000&lt;/strong&gt;), with &lt;strong&gt;&lt;code&gt;readOnlyRootFilesystem: true&lt;/code&gt;&lt;/strong&gt;, &lt;strong&gt;&lt;code&gt;capabilities.drop: [ALL]&lt;/code&gt;&lt;/strong&gt;, and a &lt;strong&gt;&lt;code&gt;RuntimeDefault&lt;/code&gt; seccomp profile&lt;/strong&gt;; SQLite lives under the &lt;strong&gt;&lt;code&gt;/data&lt;/code&gt;&lt;/strong&gt; mount and &lt;strong&gt;&lt;code&gt;/tmp&lt;/code&gt;&lt;/strong&gt; is a small &lt;strong&gt;&lt;code&gt;emptyDir&lt;/code&gt;&lt;/strong&gt;. Adjust only if your image requires it — see the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/tree/main/run/kubernetes/helm/gghstats#readme" rel="noopener noreferrer"&gt;chart README&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;
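&lt;p&gt;In rendered-manifest terms, those defaults correspond to a pod spec roughly like the excerpt below (illustrative only; the chart’s rendered output is the source of truth):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# Illustrative excerpt, not the chart's exact output.
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000
    runAsGroup: 1000
    seccompProfile:
      type: RuntimeDefault
  containers:
    - name: gghstats
      securityContext:
        readOnlyRootFilesystem: true
        capabilities:
          drop: ["ALL"]
      volumeMounts:
        - name: data       # SQLite lives here
          mountPath: /data
        - name: tmp        # small emptyDir
          mountPath: /tmp
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;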

&lt;p&gt;&lt;strong&gt;Reality check:&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;0.1.x&lt;/code&gt;&lt;/strong&gt; is still early; &lt;strong&gt;pin tags&lt;/strong&gt;, read &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;&lt;/strong&gt; on upgrades, and expect manifests to evolve with releases.&lt;/p&gt;




&lt;h2&gt;
  
  
  Two repos, one story
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;How the two repos relate&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;  gghstats (app repo)              gghstats-selfhosted (deploy repo)
           |                                      |
           | builds                               | Compose, Helm, run/
           v                                      v
      GHCR image ─────────────┬──────────── Manifests
                              │
                              v
                       your VPS / Kubernetes
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvys8aa0n6ajrawke36bs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvys8aa0n6ajrawke36bs.png" alt="How the two repos relate" width="800" height="254"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Where&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Deployment manifests&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats-selfhosted&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Application&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Helm index&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;a href="https://hrodrig.github.io/gghstats-selfhosted/index.yaml" rel="noopener noreferrer"&gt;hrodrig.github.io/gghstats-selfhosted/index.yaml&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Versioning, contributing, changelog&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/hrodrig/gghstats-selfhosted#versioning" rel="noopener noreferrer"&gt;README — Versioning&lt;/a&gt; · &lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;CONTRIBUTING&lt;/a&gt; · &lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CHANGELOG.md" rel="noopener noreferrer"&gt;CHANGELOG&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Closing
&lt;/h2&gt;

&lt;p&gt;If you’re already self-hosting databases, proxies, and dashboards, &lt;strong&gt;gghstats&lt;/strong&gt; is one more small service — and &lt;strong&gt;gghstats-selfhosted&lt;/strong&gt; is the folder structure I wished existed when I wired mine up: copy &lt;strong&gt;&lt;code&gt;run/common/.env.example&lt;/code&gt;&lt;/strong&gt;, set &lt;strong&gt;&lt;code&gt;GGHSTATS_HOST_DATA&lt;/code&gt;&lt;/strong&gt;, choose Compose or Helm, and keep your PAT out of git.&lt;/p&gt;

&lt;p&gt;Questions and PRs welcome on &lt;strong&gt;&lt;code&gt;develop&lt;/code&gt;&lt;/strong&gt;; merge and releases follow the &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats-selfhosted/blob/main/CONTRIBUTING.md" rel="noopener noreferrer"&gt;repo docs&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Cross-posted from the author’s notes; exact commands, versioning, and release policy are always the repositories linked above.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclosure (Dev.to / transparency):&lt;/strong&gt; The author used &lt;strong&gt;AI-assisted editing&lt;/strong&gt; (e.g. drafting structure, wording, and Markdown) and &lt;strong&gt;reviewed and approved&lt;/strong&gt; the final text. Technical claims are meant to match the linked repositories at publish time; if something drifts, trust the repos and &lt;strong&gt;CHANGELOG&lt;/strong&gt; over this post.&lt;/p&gt;

</description>
      <category>gghstats</category>
      <category>selfhosted</category>
      <category>helm</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>gghstats: Keep GitHub traffic past 14 days</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Sun, 29 Mar 2026 04:41:28 +0000</pubDate>
      <link>https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days-3ckg</link>
      <guid>https://dev.to/hrodrig/gghstats-keep-github-traffic-past-14-days-3ckg</guid>
      <description>&lt;p&gt;We've all been there. You ship an open-source project, a tiny CLI, or a docs site. You watch &lt;strong&gt;Insights → Traffic&lt;/strong&gt; for a week: views spike, clones climb, life is good.&lt;/p&gt;

&lt;p&gt;Then you come back a month later and ask a simple question: &lt;em&gt;did that blog post actually move the needle over time?&lt;/em&gt; GitHub’s answer is blunt: &lt;strong&gt;detailed traffic (views and clones) only lives in a rolling 14-day window.&lt;/strong&gt; Past that, the granularity is gone unless you exported it yourself.&lt;/p&gt;

&lt;p&gt;I wanted &lt;strong&gt;historical&lt;/strong&gt; traffic — without a SaaS middleman, without babysitting CSV exports, and with something I could run beside my other self-hosted stuff. That’s why I built &lt;strong&gt;&lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;gghstats&lt;/a&gt;&lt;/strong&gt;. The first stable line is &lt;strong&gt;v0.1.0&lt;/strong&gt; (binaries on &lt;a href="https://github.com/hrodrig/gghstats/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt;, multi-arch image on &lt;a href="https://github.com/hrodrig/gghstats/pkgs/container/gghstats" rel="noopener noreferrer"&gt;GHCR&lt;/a&gt;).&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem in one sentence
&lt;/h2&gt;

&lt;p&gt;GitHub is a great place to host code; it is &lt;strong&gt;not&lt;/strong&gt; a long-term analytics warehouse for repository traffic. If you care about trends, seasonality, or “what happened after launch,” you need your &lt;strong&gt;own&lt;/strong&gt; copy of that data.&lt;/p&gt;




&lt;h2&gt;
  
  
  What gghstats does
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;gghstats&lt;/strong&gt; is a small &lt;strong&gt;Go&lt;/strong&gt; service that:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Uses the GitHub API (with a personal access token) to &lt;strong&gt;pull&lt;/strong&gt; traffic metrics on a schedule.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Merges&lt;/strong&gt; them into a local &lt;strong&gt;SQLite&lt;/strong&gt; database so history accumulates instead of disappearing.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serves&lt;/strong&gt; a web UI and JSON API so you can browse aggregates and per-repo charts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;On &lt;strong&gt;startup&lt;/strong&gt; it runs a &lt;strong&gt;full sync&lt;/strong&gt; once (so repo discovery matches your filter right away), then repeats on &lt;strong&gt;&lt;code&gt;GGHSTATS_SYNC_INTERVAL&lt;/code&gt;&lt;/strong&gt; (default &lt;code&gt;1h&lt;/code&gt;). No waiting for the first tick to see data.&lt;/p&gt;
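&lt;p&gt;Once that first sync finishes, you can also hit the JSON API directly. The exact routes are documented in the app README; the path below is &lt;em&gt;illustrative only&lt;/em&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;# The /api/... path is a placeholder; check the gghstats README for the real routes.
curl -s http://localhost:8080/api/repos | jq .
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;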

&lt;p&gt;It’s deliberately boring technology: one binary, one file for the DB, backups = copy &lt;code&gt;gghstats.db&lt;/code&gt;.&lt;/p&gt;
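
&lt;p&gt;A minimal sketch of that backup story (paths here are illustrative, not from the repo). With the service stopped, a plain copy is enough; while it is writing, SQLite's online &lt;code&gt;.backup&lt;/code&gt; command is the safer variant, since it won't catch a half-written file:&lt;/p&gt;

```shell
# Date-stamped snapshot of the gghstats database (illustrative paths).
DB=/tmp/gghstats.db
BACKUP_DIR=/tmp/gghstats-backups

printf 'demo' > "$DB"    # stand-in for the real gghstats.db
mkdir -p "$BACKUP_DIR"

# Service stopped: a plain copy is fine. Service running: prefer
#   sqlite3 "$DB" ".backup $BACKUP_DIR/gghstats-$(date +%F).db"
cp "$DB" "$BACKUP_DIR/gghstats-$(date +%F).db"
```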

&lt;p&gt;&lt;strong&gt;Live demo (read-only UI):&lt;/strong&gt; &lt;a href="https://gghstats.hermesrodriguez.com" rel="noopener noreferrer"&gt;gghstats.hermesrodriguez.com&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Stack (opinionated and minimal)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Piece&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Go&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Fast, single binary, easy to ship in Docker.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;SQLite&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;No separate DB server; ship backups with the rest of your backups.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chart.js&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Charts in the dashboard without a heavy frontend framework.&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Bootstrap grid&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Layout and responsive behavior without reinventing CSS — the UI is intentionally &lt;strong&gt;neo-brutalist&lt;/strong&gt; (hard borders, monospace, loud accents) so it feels like a tool, not a marketing site.&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  “Works on my machine” wasn’t enough
&lt;/h2&gt;

&lt;p&gt;I wanted a &lt;strong&gt;production-shaped&lt;/strong&gt; repo, not just &lt;code&gt;go run&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Docker / Docker Compose&lt;/strong&gt; for local runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/strong&gt; with &lt;strong&gt;Traefik&lt;/strong&gt;, Let’s Encrypt, and no public port on the app container — only 80/443 on the proxy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Helm chart&lt;/strong&gt; under &lt;code&gt;charts/gghstats&lt;/code&gt; for Kubernetes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GoReleaser&lt;/strong&gt; + &lt;strong&gt;GitHub Actions&lt;/strong&gt; for releases, artifacts, and &lt;strong&gt;multi-arch&lt;/strong&gt; images (&lt;code&gt;linux/amd64&lt;/code&gt;, &lt;code&gt;linux/arm64&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve ever maintained a side project, you know the drag of “I’ll dockerize it later.” I put the boring work upfront so future me doesn’t hate present me.&lt;/p&gt;




&lt;h2&gt;
  
  
  How it works (high level)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zgc9f1or1s90abkbo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz1zgc9f1or1s90abkbo4.png" alt="gghstats — high-level data flow: GitHub API, gghstats service, SQLite, web dashboard, JSON API, browser, and scripts" width="800" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Fetch&lt;/strong&gt; — Scheduled sync using your token (scope: &lt;code&gt;repo&lt;/code&gt; for private repos you care about).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Store&lt;/strong&gt; — Upserts into SQLite so you keep a &lt;strong&gt;timeline&lt;/strong&gt;, not a snapshot.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Serve&lt;/strong&gt; — Dashboard for humans and JSON for scripts.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Filtering (&lt;code&gt;GGHSTATS_FILTER&lt;/code&gt;, exclusions like &lt;code&gt;!fork&lt;/code&gt;, etc.) lives in env vars so you can keep the sync set tight.&lt;/p&gt;
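
&lt;p&gt;As a hedged example, a tight sync set could look like this (the token value is a placeholder, and the full &lt;code&gt;GGHSTATS_FILTER&lt;/code&gt; syntax beyond &lt;code&gt;!fork&lt;/code&gt; is documented in the repo README):&lt;/p&gt;

```shell
# Illustrative env config; check the README for the full variable list.
export GGHSTATS_GITHUB_TOKEN="ghp_placeholder"  # PAT; repo scope for private repos
export GGHSTATS_FILTER='!fork'                  # e.g. exclude forks from the sync set
export GGHSTATS_SYNC_INTERVAL=1h                # default per the docs
```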




&lt;h2&gt;
  
  
  Two numbers that matter (aggregate vs history)
&lt;/h2&gt;

&lt;p&gt;On the &lt;strong&gt;main&lt;/strong&gt; screen you see rollups: totals across the repos you track. That’s the “at a glance” view.&lt;/p&gt;

&lt;p&gt;The real payoff is opening &lt;strong&gt;one repository&lt;/strong&gt;: you see &lt;strong&gt;interval&lt;/strong&gt; stats (what GitHub is showing &lt;em&gt;now&lt;/em&gt; for the last ~14 days) &lt;strong&gt;next to&lt;/strong&gt; &lt;strong&gt;lifetime&lt;/strong&gt; totals from &lt;strong&gt;your&lt;/strong&gt; database. The gap between “what GitHub is willing to remember” and “what you kept” is the whole point — and the charts (clones, views, stars over time) are where SQLite stops being a file and starts being a &lt;strong&gt;memory&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Main dashboard&lt;/strong&gt; (rollups across tracked repos):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff744n4t6sl6n4cip4mod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff744n4t6sl6n4cip4mod.png" alt="gghstats main dashboard — neo-brutalist UI, repo list and aggregates" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repository detail&lt;/strong&gt; — interval stats from GitHub’s window next to &lt;strong&gt;lifetime&lt;/strong&gt; totals from SQLite, plus Chart.js trends (clones, views, stars):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetzy7upnzf4wddn8l346.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fetzy7upnzf4wddn8l346.png" alt="gghstats repository detail — interval vs lifetime stats and historical charts" width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Who should try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Maintainers who want &lt;strong&gt;long-term&lt;/strong&gt; traffic context.&lt;/li&gt;
&lt;li&gt;People who already self-host and want &lt;strong&gt;data sovereignty&lt;/strong&gt; (your DB, your VPS, your rules).&lt;/li&gt;
&lt;li&gt;Anyone allergic to “sign up for another analytics product to see GitHub stats.”&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;From the repo (Compose):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/hrodrig/gghstats.git &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;cd &lt;/span&gt;gghstats
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# set GGHSTATS_GITHUB_TOKEN, tune GGHSTATS_FILTER if needed&lt;/span&gt;
docker compose up &lt;span class="nt"&gt;-d&lt;/span&gt;
&lt;span class="c"&gt;# open http://localhost:8080&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Published image only&lt;/strong&gt; (no clone — see README for all env vars):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker pull ghcr.io/hrodrig/gghstats:v0.1.0
docker run &lt;span class="nt"&gt;--rm&lt;/span&gt; &lt;span class="nt"&gt;-p&lt;/span&gt; 8080:8080 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; GGHSTATS_GITHUB_TOKEN &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-e&lt;/span&gt; GGHSTATS_FILTER &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-v&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="nv"&gt;$PWD&lt;/span&gt;&lt;span class="s2"&gt;/gghstats-data:/data"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  ghcr.io/hrodrig/gghstats:v0.1.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Production-oriented compose (Traefik, TLS) lives in &lt;strong&gt;&lt;code&gt;docker-compose.prod.yml&lt;/code&gt;&lt;/strong&gt; — see the repo README for env vars like &lt;code&gt;GGHSTATS_HOSTNAME&lt;/code&gt; and &lt;code&gt;ACME_EMAIL&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Repo:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/gghstats" rel="noopener noreferrer"&gt;github.com/hrodrig/gghstats&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues and PRs welcome. If this saves you from losing another year of traffic history, it was worth writing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do you capture or export GitHub traffic today&lt;/strong&gt; — CSV dumps, scripts, or nothing at all? I'm curious to hear what others do in the comments.&lt;/p&gt;




&lt;h2&gt;
  
  
  Credits
&lt;/h2&gt;

&lt;p&gt;This article was drafted from my own notes and a long brainstorming thread with Gemini (analysis, structure, and image ideas). The code, the rough edges, and the neo-brutalist UI are mine — blame me for the bugs, not the LLM.&lt;/p&gt;

</description>
      <category>github</category>
      <category>go</category>
      <category>selfhosted</category>
      <category>devtools</category>
    </item>
    <item>
      <title>pgwd in Production: From Alerts to Runbook</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Thu, 26 Mar 2026 13:16:50 +0000</pubDate>
      <link>https://dev.to/hrodrig/pgwd-in-production-from-alerts-to-runbook-2k42</link>
      <guid>https://dev.to/hrodrig/pgwd-in-production-from-alerts-to-runbook-2k42</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a production-focused follow-up to my original post: &lt;a href="https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg"&gt;pgwd: A Watchdog for Your PostgreSQL Connections&lt;/a&gt;. In this one, I show what pgwd looked like in action, how we responded, and what we changed in a controlled way.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When PostgreSQL connection pressure builds up, the real problem is not just crossing &lt;code&gt;max_connections&lt;/code&gt;; it is crossing it without operational context.&lt;/p&gt;

&lt;p&gt;That is where &lt;code&gt;pgwd&lt;/code&gt; helped us most: not as "just another alert sender," but as a signal-to-action layer for on-call decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened (anonymized timeline)
&lt;/h2&gt;

&lt;p&gt;In one of our production environments, we saw a fast escalation pattern:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;attention&lt;/code&gt; (75%)&lt;/li&gt;
&lt;li&gt;then &lt;code&gt;alert&lt;/code&gt; (85%)&lt;/li&gt;
&lt;li&gt;then repeated &lt;code&gt;danger&lt;/code&gt; (95%+)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All within a relatively short window.&lt;/p&gt;

&lt;p&gt;The key signal from Slack was not only the threshold level, but also the breakdown:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;total&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;active&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;idle&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;max_connections&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;plus &lt;code&gt;cluster&lt;/code&gt;, &lt;code&gt;database&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, and &lt;code&gt;client&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That context let us decide quickly whether we were seeing true workload pressure, connection churn, or an idle-heavy pattern that still threatened capacity.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsgqf3jnk26cec8twwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vsgqf3jnk26cec8twwj.png" alt="Anonymized pgwd Slack alerts timeline" width="526" height="781"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why threshold levels matter in production
&lt;/h2&gt;

&lt;p&gt;A 3-tier model (75/85/95) maps well to real operations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Attention (75%)&lt;/strong&gt;: observe trend, prepare people&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alert (85%)&lt;/strong&gt;: start mitigation planning&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Danger (95%)&lt;/strong&gt;: execute containment now&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This prevented us from waiting for &lt;code&gt;FATAL: sorry, too many clients already&lt;/code&gt; as the first real signal.&lt;/p&gt;
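
&lt;p&gt;For a ceiling of, say, &lt;code&gt;max_connections = 2048&lt;/code&gt; (the value this environment started from), the three tiers land at concrete trip points; a quick sketch:&lt;/p&gt;

```shell
# Where the 3-tier thresholds land for a 2048-connection ceiling.
MAX=2048
for pct in 75 85 95; do
  echo "${pct}% of ${MAX} = $(( MAX * pct / 100 )) connections"
done
# -> 1536, 1740, and 1945 connections
```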

&lt;h2&gt;
  
  
  The runbook we used (manual-first)
&lt;/h2&gt;

&lt;p&gt;For this rollout, we intentionally chose controlled, operator-present execution.&lt;br&gt;&lt;br&gt;
No unattended automation for critical steps yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  1) Attention (&lt;code&gt;&amp;gt;=75%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Confirm trend across intervals (not just one spike)&lt;/li&gt;
&lt;li&gt;Check &lt;code&gt;active&lt;/code&gt; vs &lt;code&gt;idle&lt;/code&gt; ratio&lt;/li&gt;
&lt;li&gt;Verify affected DB scope (single DB vs multiple)&lt;/li&gt;
&lt;li&gt;Open an observation incident thread&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2) Alert (&lt;code&gt;&amp;gt;=85%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Engage app + platform on-call&lt;/li&gt;
&lt;li&gt;Correlate with scheduled jobs, batch windows, and maintenance events&lt;/li&gt;
&lt;li&gt;Reduce non-critical pressure if possible&lt;/li&gt;
&lt;li&gt;Prepare containment action&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3) Danger (&lt;code&gt;&amp;gt;=95%&lt;/code&gt;)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Execute mitigation immediately (controlled maintenance/throttling based on internal SOP)&lt;/li&gt;
&lt;li&gt;Prioritize availability restoration&lt;/li&gt;
&lt;li&gt;Capture timestamps for post-incident learning&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What we are changing now: &lt;code&gt;max_connections&lt;/code&gt; to &lt;code&gt;3192&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;Based on this run, one concrete action in our runbook is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Increase PostgreSQL &lt;code&gt;max_connections&lt;/code&gt; from &lt;code&gt;2048&lt;/code&gt; to &lt;code&gt;3192&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Apply the change in a controlled session with operators present&lt;/li&gt;
&lt;li&gt;Monitor closely after the change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not "increase and forget."&lt;br&gt;&lt;br&gt;
It is "increase, observe, validate, and adjust."&lt;/p&gt;
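
&lt;p&gt;On a self-managed PostgreSQL, the controlled session boils down to something like the sketch below; operators on Kubernetes or a managed service would apply the same value through their operator/CRD or parameter group instead, so treat the restart step as a platform assumption:&lt;/p&gt;

```shell
# ALTER SYSTEM persists the value to postgresql.auto.conf;
# max_connections only takes effect after a restart.
psql -c "ALTER SYSTEM SET max_connections = 3192;"
sudo systemctl restart postgresql   # or your platform's restart procedure
psql -Atc "SHOW max_connections;"   # verify the new ceiling before closing the session
```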

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ewu5u8t93rh5ie690.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc77ewu5u8t93rh5ie690.png" alt="Configuration update with max_connections and resource values" width="316" height="386"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Important guardrail: do not solve pressure by over-sizing blindly
&lt;/h2&gt;

&lt;p&gt;Raising connection limits without infrastructure awareness can create a different failure mode:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;more connections -&amp;gt; more backend memory/CPU pressure&lt;/li&gt;
&lt;li&gt;more pressure -&amp;gt; noisy performance and unstable pods/nodes&lt;/li&gt;
&lt;li&gt;teams then over-allocate resources reactively&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So our runbook explicitly includes this guardrail:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Track infrastructure headroom (CPU/memory) after increasing limits&lt;/li&gt;
&lt;li&gt;Validate DB and app behavior under the new ceiling&lt;/li&gt;
&lt;li&gt;Avoid resource over-sizing unless data supports it&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Minimal commands (hybrid style)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Basic daemon mode&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_DB_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres://..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_SLACK_WEBHOOK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://hooks.slack.com/..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60
pgwd

&lt;span class="c"&gt;# Verify notifier delivery before/after changes&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-force-notification&lt;/span&gt;

&lt;span class="c"&gt;# Optional: run against Postgres service in Kubernetes&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; &amp;lt;namespace&amp;gt;/svc/postgres &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:...@localhost:5432/db"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Post-change verification checklist (24h / 72h)
&lt;/h2&gt;

&lt;p&gt;After increasing to &lt;code&gt;3192&lt;/code&gt;, we track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert frequency by level (&lt;code&gt;attention&lt;/code&gt; / &lt;code&gt;alert&lt;/code&gt; / &lt;code&gt;danger&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;active&lt;/code&gt; / &lt;code&gt;idle&lt;/code&gt; behavior by database&lt;/li&gt;
&lt;li&gt;Peak total connections vs new headroom&lt;/li&gt;
&lt;li&gt;App error rates and latency around peak windows&lt;/li&gt;
&lt;li&gt;Infrastructure utilization trend (not just point-in-time)&lt;/li&gt;
&lt;/ul&gt;
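
&lt;p&gt;The &lt;code&gt;active&lt;/code&gt;/&lt;code&gt;idle&lt;/code&gt; line in that checklist is the same data pgwd reads; to spot-check it by hand during the verification window (requires a live connection, sketched here with &lt;code&gt;psql&lt;/code&gt;):&lt;/p&gt;

```shell
# Per-database connection states, straight from pg_stat_activity.
psql -c "SELECT datname, state, count(*)
           FROM pg_stat_activity
          WHERE datname IS NOT NULL
          GROUP BY datname, state
          ORDER BY datname, state;"
```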

&lt;p&gt;Success criteria:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No sustained &lt;code&gt;danger&lt;/code&gt; periods&lt;/li&gt;
&lt;li&gt;Fewer repeated escalations&lt;/li&gt;
&lt;li&gt;Stable app behavior&lt;/li&gt;
&lt;li&gt;No unjustified resource inflation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Lessons learned
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;pgwd&lt;/code&gt; is most valuable when tied to a runbook, not only to notifications.&lt;/li&gt;
&lt;li&gt;Alert levels should map to explicit operator actions.&lt;/li&gt;
&lt;li&gt;Controlled, manual-first execution is safer for critical production changes.&lt;/li&gt;
&lt;li&gt;Increasing &lt;code&gt;max_connections&lt;/code&gt; can be right, if paired with disciplined capacity monitoring.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Community note
&lt;/h2&gt;

&lt;p&gt;If you want a complementary intro in French (installation + quick setup), this community write-up is also useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.cyberplanete.net/pgwd-surveillance-postgresql/" rel="noopener noreferrer"&gt;Surveillez votre base de données PostgreSQL avec pgwd&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you run PostgreSQL in production, I would love to hear your threshold strategy and runbook design.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Original intro post: &lt;a href="https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg"&gt;pgwd: A Watchdog for Your PostgreSQL Connections&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Install: &lt;code&gt;go install github.com/hrodrig/pgwd@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Repo/docs/releases: &lt;a href="https://github.com/hrodrig/pgwd" rel="noopener noreferrer"&gt;github.com/hrodrig/pgwd&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Disclosure: This post was drafted with AI assistance and reviewed by the author.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>devops</category>
      <category>monitoring</category>
      <category>postgres</category>
    </item>
    <item>
      <title>pgwd: A Watchdog for Your PostgreSQL Connections</title>
      <dc:creator>Hermes Rodríguez</dc:creator>
      <pubDate>Mon, 02 Mar 2026 03:37:45 +0000</pubDate>
      <link>https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg</link>
      <guid>https://dev.to/hrodrig/pgwd-a-watchdog-for-your-postgresql-connections-1pjg</guid>
      <description>&lt;h2&gt;
  
  
  &lt;strong&gt;Stop guessing when your database is about to run out of connections.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You’ve seen it before: an app starts failing with &lt;strong&gt;"sorry, too many clients already"&lt;/strong&gt;, and you only notice when users complain. By then, the database is saturated, and even your admin tools can’t connect. &lt;strong&gt;pgwd&lt;/strong&gt; (Postgres Watch Dog) is a small Go CLI that watches your connection counts and alerts you &lt;em&gt;before&lt;/em&gt; you hit the limit—and when you can’t connect at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;PostgreSQL has a &lt;code&gt;max_connections&lt;/code&gt; limit. When you exceed it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New connections are rejected with &lt;strong&gt;FATAL: sorry, too many clients already (SQLSTATE 53300)&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If your app uses a superuser (or a role that can use all slots), even DBA access can be blocked unless you’ve reserved slots with &lt;code&gt;superuser_reserved_connections&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
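
&lt;p&gt;Both numbers are easy to inspect by hand before wiring up a watchdog (sketched with &lt;code&gt;psql&lt;/code&gt;; &lt;code&gt;superuser_reserved_connections&lt;/code&gt; defaults to 3):&lt;/p&gt;

```shell
# Current usage vs. the ceiling, checked manually.
psql -Atc "SHOW max_connections;"
psql -Atc "SHOW superuser_reserved_connections;"    # slots held back for superusers
psql -Atc "SELECT count(*) FROM pg_stat_activity;"  # connections in use right now
```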

&lt;p&gt;Without something watching connection usage, you only find out when things are already broken.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10z21k0g3zzo7p2opor.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz10z21k0g3zzo7p2opor.png" alt="Flow: PostgreSQL → pgwd (thresholds) → Slack / Loki" width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;How pgwd fits in: it watches your Postgres and pushes alerts to Slack and/or Loki when thresholds are exceeded or the connection fails.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The idea: watch and alert
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;pgwd&lt;/strong&gt; connects to your Postgres (directly or via Kubernetes), reads connection stats from &lt;code&gt;pg_stat_activity&lt;/code&gt;, and sends alerts to &lt;strong&gt;Slack&lt;/strong&gt; and/or &lt;strong&gt;Loki&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Total&lt;/strong&gt; or &lt;strong&gt;active&lt;/strong&gt; connections cross a threshold. By default, pgwd uses &lt;strong&gt;3-tier levels&lt;/strong&gt; (75%, 85%, 95%) with distinct severities: &lt;strong&gt;attention&lt;/strong&gt;, &lt;strong&gt;alert&lt;/strong&gt;, and &lt;strong&gt;danger&lt;/strong&gt;—so you can escalate response as usage grows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle&lt;/strong&gt; or &lt;strong&gt;stale&lt;/strong&gt; connections exceed limits (useful for spotting connection leaks).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The connection fails&lt;/strong&gt;—including "too many clients"—so you get an urgent notification even when pgwd itself can’t connect.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So you get notified when you’re &lt;em&gt;approaching&lt;/em&gt; the limit, and again when the instance is &lt;em&gt;already&lt;/em&gt; saturated. Loki streams include &lt;strong&gt;labels&lt;/strong&gt; (&lt;code&gt;database&lt;/code&gt;, &lt;code&gt;cluster&lt;/code&gt;) for LogQL filtering and Grafana alert rules.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pzqlxhqlynaq0kgib8x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3pzqlxhqlynaq0kgib8x.png" alt="Example Slack alert when Postgres returns " width="800" height="436"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;When the DB is saturated, pgwd sends an urgent alert like this to your notifiers.&lt;/em&gt;&lt;/p&gt;
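
&lt;p&gt;Those labels make the streams easy to slice; for example, with Grafana's &lt;code&gt;logcli&lt;/code&gt; (the address and label values below are assumptions for illustration):&lt;/p&gt;

```shell
# Pull only danger-level pgwd events for one database from Loki.
export LOKI_ADDR=http://localhost:3100   # example Loki address
logcli query '{database="mydb", cluster="prod"} |= "danger"'
```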
&lt;h2&gt;
  
  
  pgwd in action
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Minimal setup (one-shot from cron)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Alert at 75%, 85%, 95% of max_connections (default 3-tier levels)&lt;/span&gt;
&lt;span class="o"&gt;(&lt;/span&gt;default&lt;span class="o"&gt;)&lt;/span&gt;
pgwd &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:pass@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://hooks.slack.com/services/..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;No need to set thresholds if you’re fine with the defaults: pgwd reads &lt;code&gt;max_connections&lt;/code&gt; from the server and applies 75%, 85%, 95% (attention, alert, danger). Use &lt;code&gt;-threshold-levels&lt;/code&gt; to customize.&lt;/p&gt;
&lt;h3&gt;
  
  
  Daemon mode (continuous watch)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_DB_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"postgres://localhost:5432/mydb"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_SLACK_WEBHOOK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"https://hooks.slack.com/..."&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;PGWD_INTERVAL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;60

pgwd
&lt;span class="c"&gt;# Runs every 60 seconds; exit with SIGTERM.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Catch connection leaks (stale connections)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://..."&lt;/span&gt; &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-stale-age&lt;/span&gt; 600 &lt;span class="nt"&gt;-threshold-stale&lt;/span&gt; 1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Alerts when any connection has been open longer than 10 minutes—handy for spotting leaks or long-running transactions.&lt;/p&gt;
&lt;h3&gt;
  
  
  Postgres in Kubernetes
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; default/svc/postgres &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:DISCOVER_MY_PASSWORD@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;pgwd runs &lt;code&gt;kubectl port-forward&lt;/code&gt;, can read the password from the pod’s environment, and connects to localhost. Alerts include cluster/namespace/service context.&lt;/p&gt;
&lt;h3&gt;
  
  
  Loki inside Kubernetes
&lt;/h3&gt;

&lt;p&gt;When Loki runs inside the cluster and pgwd runs outside (e.g. a VM with cron), use &lt;code&gt;-kube-loki&lt;/code&gt; so pgwd can port-forward to Loki as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pgwd &lt;span class="nt"&gt;-kube-postgres&lt;/span&gt; default/svc/postgres &lt;span class="nt"&gt;-kube-loki&lt;/span&gt; monitoring/svc/loki &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-db-url&lt;/span&gt; &lt;span class="s2"&gt;"postgres://user:DISCOVER_MY_PASSWORD@localhost:5432/mydb"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
     &lt;span class="nt"&gt;-slack-webhook&lt;/span&gt; &lt;span class="s2"&gt;"https://..."&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;pgwd port-forwards to Loki and pushes to &lt;code&gt;localhost:3100&lt;/code&gt; automatically; no &lt;code&gt;-loki-url&lt;/code&gt; needed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test alerts without touching Postgres
&lt;/h3&gt;

&lt;p&gt;Use &lt;code&gt;-test-max-connections N&lt;/code&gt; to simulate exceeded thresholds, even against production. pgwd treats &lt;code&gt;N&lt;/code&gt; as the effective &lt;code&gt;max_connections&lt;/code&gt; for threshold calculation—stats stay real, only the denominator changes. Handy for validating Slack/Loki/Grafana alerts without modifying Postgres config. See &lt;code&gt;docs/testing-alert-levels.md&lt;/code&gt; for the full procedure.&lt;/p&gt;

&lt;h3&gt;
  
  
  When things go wrong
&lt;/h3&gt;

&lt;p&gt;If the database is unreachable or returns "too many clients", pgwd &lt;strong&gt;always&lt;/strong&gt; sends an urgent alert to your notifiers (when configured)—no extra flag needed. So even in the worst case, you get a Slack/Loki message instead of silence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Install:&lt;/strong&gt; &lt;code&gt;go install github.com/hrodrig/pgwd@latest&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Repo and docs:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd" rel="noopener noreferrer"&gt;github.com/hrodrig/pgwd&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Releases:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/releases" rel="noopener noreferrer"&gt;Releases&lt;/a&gt; (v0.5.0 — binaries, Docker &lt;code&gt;ghcr.io/hrodrig/pgwd&lt;/code&gt;, Homebrew)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Loki + Grafana:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/blob/main/docs/loki-grafana-alerts.md" rel="noopener noreferrer"&gt;docs/loki-grafana-alerts.md&lt;/a&gt; — labels, LogQL, alert rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Test alerts safely:&lt;/strong&gt; &lt;a href="https://github.com/hrodrig/pgwd/blob/main/docs/testing-alert-levels.md" rel="noopener noreferrer"&gt;docs/testing-alert-levels.md&lt;/a&gt; — trigger attention/alert/danger without changing Postgres&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One-shot, daemon, or cron—with Slack and/or Loki you can stop flying blind on connection usage and get ahead of "too many clients" before it hits production.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Disclosure: This post was drafted with AI assistance and reviewed by the author.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>postgres</category>
      <category>devops</category>
      <category>monitoring</category>
    </item>
  </channel>
</rss>
