Linux · llama.cpp · Local AI
A complete, battle-tested guide using llama.cpp with Vulkan GPU acceleration on an Intel Iris Xe laptop — from zero to interactive chat.
Ragib · March 2026 · 8 min read
Why I stopped looking for alternatives and just built it myself.
I wanted to run an AI model locally on my Arch Linux laptop — privately, offline, with no cloud dependency. Ollama seemed like the obvious choice, but I didn't want something opaque and heavy. I wanted control.
After a lot of trial, error, and outdated documentation, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.
🐧 Who this is for
Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.
🧩 System Specs
- OS: Arch Linux x86_64
- CPU: Intel Core i5-1135G7 (4C/8T)
- GPU: Intel Iris Xe
- RAM: 16 GB
- WM: Hyprland (Wayland)
01 — Install Dependencies
Everything you need is in the official Arch repos:
sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc
⚠️ Don't skip `vulkan-devel`
`vulkan-intel` gives you the runtime only. You need `vulkan-devel` for the headers, or the GPU build will silently fall back to CPU.
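Before building, it's worth confirming that the Vulkan runtime actually sees the iGPU. A quick check, assuming the `vulkan-tools` package (which provides `vulkaninfo` on Arch):

```shell
# Install the diagnostic tool (assumption: vulkan-tools provides vulkaninfo)
sudo pacman -S --needed vulkan-tools

# List detected Vulkan devices; the Iris Xe should show up here
vulkaninfo --summary | grep -i "deviceName"
```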
02 — Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
ℹ️ Clone the repo directly — don’t create the folder manually.
03 — Build with Vulkan (GPU Acceleration)
cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build
Check binary:
ls build/bin | grep llama-cli
Note: `./main` is gone → use `llama-cli`
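If you want to double-check that the Vulkan backend was really enabled (and not silently dropped), the CMake cache records every configured option. A sketch, assuming the default `build/` directory:

```shell
# The CMake cache stores the configured options; look for the Vulkan switch
grep GGML_VULKAN build/CMakeCache.txt
# You should see a value of ON if the flag was picked up
```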
04 — Download a Model
mkdir -p models
cd models
wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
cd ..
💡 If the file saves as `.gguf.1`, rename it:
mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf
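A quick sanity check on the download: every valid GGUF file begins with the four ASCII bytes `GGUF`, so a truncated download or a saved HTML error page is easy to spot:

```shell
# Print the file's magic bytes; a valid model prints "GGUF"
head -c 4 models/qwen2.5-3b-instruct-q4_k_m.gguf; echo
```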
05 — Run the Model
./build/bin/llama-cli \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-t 8 \
-ngl 20 \
-c 2048
Flags explained:
- `-t 8` → use all 8 CPU threads (4C/8T)
- `-ngl 20` → offload 20 model layers to the GPU
- `-c 2048` → context window of 2048 tokens
If Vulkan works:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics
⚡ Common Errors & Fixes
❌ Error
fish: Unknown command: ./main
✅ Fix
Use:
./build/bin/llama-cli
❌ Error
Makefile:6: Build system changed
✅ Fix
cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build
❌ Error
Could NOT find Vulkan
✅ Fix
sudo pacman -S vulkan-devel
❌ Error
LLAMA_VULKAN not used
✅ Fix
The build flag was renamed; use:
-DGGML_VULKAN=ON
❌ Error
invalid argument: -i
✅ Fix
Drop the flag; interactive mode is the default in llama-cli now.
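To avoid retyping the long binary path (the source of the `./main` confusion above), a small shell function works in bash and zsh; fish users would define the equivalent with `function llama`:

```shell
# Short wrapper around the new binary location; "$@" forwards all flags
llama() { ./build/bin/llama-cli "$@"; }

# Now you can run, e.g.:
# llama -m models/qwen2.5-3b-instruct-q4_k_m.gguf -ngl 20
```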
🧠 Which Model Should You Use?
| Model | Size | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen 2.5 3B Q4_K_M | ~2 GB | 2–3 GB | ⚡ Fast | Beginners |
| Mistral 7B Q4_K_M | ~4 GB | 5–6 GB | ▶ Medium | Balanced |
| Llama 3 8B Q4_K_M | ~5 GB | 6–8 GB | 🐢 Slow | Best quality |
+1 Optional: Local API Server
./build/bin/llama-server \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-ngl 20
Then open:
http://localhost:8080
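Besides the built-in web page, recent llama-server builds also expose an OpenAI-compatible API, so you can script against it with curl. A sketch assuming the default port 8080 and the `/v1/chat/completions` route:

```shell
# Query the server's OpenAI-compatible chat endpoint (assumes default port 8080)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```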
Final Thoughts
The biggest issues were outdated docs and breaking changes:
- `make` → replaced by CMake
- `LLAMA_VULKAN` → now `GGML_VULKAN`
- `vulkan-devel` is required for the headers
- the `-i` flag was removed
- the binary is now `llama-cli`
Once fixed, everything works smoothly — even on Intel Iris Xe.
What's next?
Next step: add a web UI (Open WebUI) and turn this into a full local ChatGPT alternative.
Happy hacking 🐧