Running LLMs Locally on Arch Linux — No Cloud. No Ollama.

Linux · llama.cpp · Local AI

A complete, battle-tested guide using llama.cpp with Vulkan GPU acceleration on an Intel Iris Xe laptop — from zero to interactive chat.

Ragib Hasan · March 2026 · 8 min read


Why I stopped looking for alternatives and just built it myself.

I wanted to run an AI model locally on my Arch Linux laptop — privately, offline, with no cloud dependency. Ollama seemed like the obvious choice, but I didn't want something opaque and heavy. I wanted control.

After a lot of trial, error, and outdated documentation, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.

🐧 Who this is for
Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.


🧩 System Specs

  • OS: Arch Linux x86_64
  • CPU: i5-1137G7 (4C/8T)
  • GPU: Intel Iris Xe
  • RAM: 16 GB
  • WM: Hyprland (Wayland)

01 — Install Dependencies

Everything you need is in the official Arch repos:

sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc

⚠️ Don't skip vulkan-devel
vulkan-intel provides only the runtime driver. You also need vulkan-devel for the development headers, or the GPU build will silently fall back to CPU.
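A quick way to confirm the headers are actually present before you spend time on the build (a simple sketch; `/usr/include/vulkan/vulkan.h` is where Arch's Vulkan headers normally land):

```shell
# Check for the Vulkan development headers that the GPU build needs.
if [ -f /usr/include/vulkan/vulkan.h ]; then
  msg="Vulkan headers found - GPU build should work"
else
  msg="Vulkan headers missing - run: sudo pacman -S vulkan-devel"
fi
echo "$msg"
```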


02 — Clone llama.cpp

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

ℹ️ git clone creates the llama.cpp directory for you; don't create the folder manually first.


03 — Build with Vulkan (GPU Acceleration)

cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build

Check binary:

ls build/bin | grep llama-cli

Note: the old ./main binary is gone; use llama-cli instead.


04 — Download a Model

mkdir -p models
cd models

wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf

cd ..

💡 If file saves as .gguf.1:

mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf
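The bare mv above errors out if no .gguf.1 file exists. A re-runnable variant (my own sketch, not from the llama.cpp docs) guards against an empty glob:

```shell
# Rename any *.gguf.1 duplicates back to *.gguf, but only if they exist,
# so the command is safe to run repeatedly.
for f in models/*.gguf.1; do
  [ -e "$f" ] || continue   # the glob matched nothing; skip
  mv -- "$f" "${f%.1}"      # strip the trailing ".1"
done
```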

05 — Run the Model

./build/bin/llama-cli \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -t 8 \
  -ngl 20 \
  -c 2048

Flags explained:

  • -t 8 → use 8 CPU threads (all of them on this 4C/8T chip)
  • -ngl 20 → offload 20 model layers to the GPU
  • -c 2048 → context window of 2048 tokens
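Hard-coding -t 8 only fits a 4-core/8-thread chip like this one. A small sketch that picks the thread count from the machine instead, using nproc from GNU coreutils:

```shell
# Detect the number of available CPU threads instead of hard-coding 8.
THREADS=$(nproc)
echo "Using $THREADS threads"
# Then pass it through:
#   ./build/bin/llama-cli -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
#     -t "$THREADS" -ngl 20 -c 2048
```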

If Vulkan works:

ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics

⚡ Common Errors & Fixes

❌ Error

fish: Unknown command: ./main

✅ Fix

Use:

./build/bin/llama-cli

❌ Error

Makefile:6: Build system changed

✅ Fix

cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build

❌ Error

Could NOT find Vulkan

✅ Fix

sudo pacman -S vulkan-devel

❌ Error

LLAMA_VULKAN not used

✅ Fix

Use:

-DGGML_VULKAN=ON

❌ Error

invalid argument: -i

✅ Fix

Interactive mode is now the default, so the -i flag was removed.


🧠 Which Model Should You Use?

| Model | Size | RAM | Speed | Best for |
|---|---|---|---|---|
| Qwen 2.5 3B Q4_K_M | ~2 GB | 2–3 GB | ⚡ Fast | Beginners |
| Mistral 7B Q4_K_M | ~4 GB | 5–6 GB | ▶ Medium | Balanced |
| Llama 3 8B Q4_K_M | ~5 GB | 6–8 GB | 🐢 Slow | Best quality |
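As a rough rule of thumb from the table (my own estimate, not an official number): a Q4_K_M model needs about its file size plus 1–2 GB of headroom for context and runtime overhead. In shell arithmetic:

```shell
# fits_in_ram MODEL_GB TOTAL_GB
# Rough estimate: required RAM ~= model file size + ~1.5 GB overhead.
# POSIX shell arithmetic is integer-only, so work in tenths of a GB.
fits_in_ram() {
  needed=$(( $1 * 10 + 15 ))
  have=$(( $2 * 10 ))
  if [ "$have" -ge "$needed" ]; then echo yes; else echo no; fi
}

fits_in_ram 2 16   # Qwen 2.5 3B on a 16 GB laptop -> yes
fits_in_ram 5 4    # Llama 3 8B on a 4 GB machine  -> no
```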

Bonus — Optional: Local API Server

./build/bin/llama-server \
  -m models/qwen2.5-3b-instruct-q4_k_m.gguf \
  -ngl 20

Then open http://localhost:8080 in your browser; llama-server serves a small built-in chat UI there.
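llama-server also exposes an OpenAI-compatible HTTP API, so you can script against it instead of using the browser page. A minimal request sketch (the server must already be running on port 8080, its default; the max_tokens value is arbitrary):

```shell
# Send one chat completion request to the local llama-server instance.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64
      }'
```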

Final Thoughts

The biggest issues were outdated docs and breaking changes:

  • make → replaced by CMake
  • LLAMA_VULKAN → now GGML_VULKAN
  • need vulkan-devel
  • -i flag removed
  • binary is llama-cli

Once fixed, everything works smoothly — even on Intel Iris Xe.


What's next?

Next step: add a web UI (Open WebUI) and turn this into a full local ChatGPT alternative.

Happy hacking 🐧
