Linux · llama.cpp · Local AI
A complete, battle-tested guide using llama.cpp with Vulkan GPU acceleration on an Intel Iris Xe laptop — from zero to interactive chat.
Ragib · March 2026 · 8 min read
Why I stopped looking for alternatives and just built it myself.
I wanted to run an AI model locally on my Arch Linux laptop — privately, offline, with no cloud dependency. Ollama seemed like the obvious choice, but I didn't want something opaque and heavy. I wanted control.
After a lot of trial, error, and outdated documentation, I got llama.cpp running with Vulkan GPU acceleration on my Intel Iris Xe. This is the guide I wish I had.
🐧 Who this is for
Any Linux user with 8–16 GB RAM who wants a private, lightweight, local LLM — without Docker, without heavy frameworks, without cloud APIs.
🧩 System Specs
- OS: Arch Linux x86_64
- CPU: Intel Core i5-1135G7 (4C/8T)
- GPU: Intel Iris Xe
- RAM: 16 GB
- WM: Hyprland (Wayland)
01 — Install Dependencies
Everything you need is in the official Arch repos:
sudo pacman -S git cmake ninja vulkan-intel vulkan-devel shaderc
⚠️ Don't skip `vulkan-devel`
`vulkan-intel` gives you the runtime only. You need `vulkan-devel` for the headers, or the GPU build will silently fall back to CPU.
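Before building, it's worth confirming that the Vulkan runtime actually sees the iGPU. A quick check, assuming the `vulkan-tools` package (which provides `vulkaninfo` on Arch):

```shell
# Install the diagnostic tool (assumption: vulkan-tools provides vulkaninfo)
sudo pacman -S --needed vulkan-tools

# List detected Vulkan devices; the Iris Xe should show up here
vulkaninfo --summary | grep -i "deviceName"
```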
02 — Clone llama.cpp
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
ℹ️ Clone the repo directly — don’t create the folder manually.
03 — Build with Vulkan (GPU Acceleration)
cmake -B build -G Ninja -DGGML_VULKAN=ON
cmake --build build
Check binary:
ls build/bin | grep llama-cli
Note: `./main` is gone → use `llama-cli`
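If you want to double-check that the Vulkan backend was really enabled (and not silently dropped), the CMake cache records every configured option. A sketch, assuming the default `build/` directory:

```shell
# The CMake cache stores the configured options; look for the Vulkan switch
grep GGML_VULKAN build/CMakeCache.txt
# You should see a value of ON if the flag was picked up
```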
04 — Download a Model
mkdir -p models
cd models
wget https://huggingface.co/Qwen/Qwen2.5-3B-Instruct-GGUF/resolve/main/qwen2.5-3b-instruct-q4_k_m.gguf
cd ..
💡 If the file saves as `.gguf.1`, rename it:
mv models/*.gguf.1 models/qwen2.5-3b-instruct-q4_k_m.gguf
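A quick sanity check on the download: every valid GGUF file begins with the four ASCII bytes `GGUF`, so a truncated download or a saved HTML error page is easy to spot:

```shell
# Print the file's magic bytes; a valid model prints "GGUF"
head -c 4 models/qwen2.5-3b-instruct-q4_k_m.gguf; echo
```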
05 — Run the Model
./build/bin/llama-cli \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-t 8 \
-ngl 20 \
-c 2048
Flags explained:
- `-t 8` → use all 8 CPU threads (4C/8T)
- `-ngl 20` → offload 20 model layers to the GPU
- `-c 2048` → context window of 2048 tokens
If Vulkan works:
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Intel(R) Iris(R) Xe Graphics
⚡ Common Errors & Fixes
❌ Error
fish: Unknown command: ./main
✅ Fix
Use:
./build/bin/llama-cli
❌ Error
Makefile:6: Build system changed
✅ Fix
cmake -B build -G Ninja -DGGML_VULKAN=ON && cmake --build build
❌ Error
Could NOT find Vulkan
✅ Fix
sudo pacman -S vulkan-devel
❌ Error
LLAMA_VULKAN not used
✅ Fix
The build flag was renamed; use:
-DGGML_VULKAN=ON
❌ Error
invalid argument: -i
✅ Fix
Drop the flag; interactive mode is the default in llama-cli now.
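To avoid retyping the long binary path (the source of the `./main` confusion above), a small shell function works in bash and zsh; fish users would define the equivalent with `function llama`:

```shell
# Short wrapper around the new binary location; "$@" forwards all flags
llama() { ./build/bin/llama-cli "$@"; }

# Now you can run, e.g.:
# llama -m models/qwen2.5-3b-instruct-q4_k_m.gguf -ngl 20
```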
🧠 Which Model Should You Use?
| Model | Size | RAM | Speed | Best For |
|---|---|---|---|---|
| Qwen 2.5 3B Q4_K_M | ~2 GB | 2–3 GB | ⚡ Fast | Beginners |
| Mistral 7B Q4_K_M | ~4 GB | 5–6 GB | ▶ Medium | Balanced |
| Llama 3 8B Q4_K_M | ~5 GB | 6–8 GB | 🐢 Slow | Best quality |
+1 Optional: Local API Server
./build/bin/llama-server \
-m models/qwen2.5-3b-instruct-q4_k_m.gguf \
-ngl 20
Then open:
http://localhost:8080
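Besides the built-in web page, recent llama-server builds also expose an OpenAI-compatible API, so you can script against it with curl. A sketch assuming the default port 8080 and the `/v1/chat/completions` route:

```shell
# Query the server's OpenAI-compatible chat endpoint (assumes default port 8080)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 64
      }'
```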
Final Thoughts
The biggest issues were outdated docs and breaking changes:
- `make` → replaced by CMake
- `LLAMA_VULKAN` → now `GGML_VULKAN`
- `vulkan-devel` is required for the headers
- the `-i` flag was removed
- the binary is now `llama-cli`
Once fixed, everything works smoothly — even on Intel Iris Xe.
What's next?
Next step: add a web UI (Open WebUI) and turn this into a full local ChatGPT alternative.
Happy hacking 🐧