DEV Community

I am Starrzan

Running Local LLMs on Intel Arc iGPU: A Complete Guide for Ubuntu on Mini-PC Hardware

System: GMKtec EVO-T1 (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5)
OS: Ubuntu 26.04 LTS (kernel 7.0)
Stack: llama.cpp SYCL backend + oneAPI 2026.0 (IntelLLVM icx/icpx) + Hermes Agent

All of the local model research, SYCL GPU debugging, production inference setup, benchmark design, and this blog article were implemented by Hermes Agent (an autonomous AI agent). The human directed goals; the agent executed everything — kernel flag surgery, compiler troubleshooting, benchmark design, and full documentation. This is an ongoing effort: new models are continuously tested with long-context, multi-agent benchmarks to validate 24/7 daily-driver reliability. Latest addition: 30B-Coder validated at 131K context as the new local daily driver, replacing Sushi 9B after passing all 8 hard multi-agent tests at 4.3x the speed.

Most local LLM guides assume you have an NVIDIA GPU. If you're running an Intel mini-PC like the GMKtec EVO-T1 with an integrated Arc 140T, the path to GPU-accelerated local inference is less traveled — but entirely workable. Here's exactly how to get there.


Why This Is Harder Than NVIDIA

NVIDIA's CUDA stack is turnkey: install driver, install PyTorch, done. Intel's GPU compute story is more fragmented:

  • SYCL is an open Khronos standard for cross-vendor GPU compute; it fills the role CUDA plays on NVIDIA hardware
  • oneAPI is Intel's implementation of SYCL, plus MKL (math libraries) and the IntelLLVM compiler toolchain
  • Level Zero is Intel's low-level GPU runtime (think: what Vulkan is to graphics, Level Zero is to compute)
  • llama.cpp added SYCL support, but it's designed for Intel's proprietary compiler stack, not the open-source alternatives

The core tension: llama.cpp's SYCL backend requires Intel MKL's SYCL BLAS library, which only ships with Intel's oneAPI toolkit. The open-source dpclang compiler cannot provide it. This is the single biggest blocker, and the one most guides don't explain clearly.


Hardware Context

The Intel Arc 140T is an integrated GPU (iGPU) built into the Core Ultra 9 285H "Arrow Lake" processor. It shares system memory; there is no dedicated VRAM. My system has 64GB DDR5-5600. How much of that the iGPU can address depends on BIOS settings and the driver; on this system about 58GB ends up available to the GPU.

Key hardware facts:

  • Architecture: Xe-LPG family (a scaled-down relative of the Arc A-series design)
  • Execution units: 8 Xe-cores (128 vector engines)
  • Shared memory: uses system DDR5 as VRAM (the fixed BIOS pre-allocation is configurable up to 16GB)
  • PCIe: appears as 00:02.0 VGA compatible controller: Intel Arrow Lake-P [Arc 130T/140T]
  • Memory available to GPU: ~58GB (of 64GB total system RAM)

Step 1: BIOS Configuration

Before touching Linux, configure the firmware. These settings are essential for iGPU compute workloads:

Memory & GPU

  • Above 4G Decoding: Enabled — Without this, the iGPU cannot access large memory regions required for LLM KV caches
  • Resizable BAR (ReBAR): Enabled — Lets the CPU map the entire GPU-visible memory space in one go
  • DVMT Pre-Allocated: 512MB or Max — Reserves system RAM for iGPU texture/compute operations
  • XMP/EXPO Profile: Enabled — Ensures DDR5 runs at rated speed (5600+ MT/s). Memory bandwidth directly impacts inference throughput since the iGPU is bandwidth-starved

Performance

  • Intel Turbo Boost: Enabled — Cores boost to 5.4GHz; critical for prompt prefill
  • Speed Shift (HWP): Enabled — Reduces P-state transition latency
  • Hyper-Threading: Enabled where the option exists (the Core Ultra 9 285H has 16 cores and 16 threads with no Hyper-Threading, so on this CPU there is nothing to toggle)
  • CPU C-States: C0/C1 only — Eliminates wake latency from deep sleep states during inference

Power

  • Restore on AC Power Loss: Power On — Server auto-restarts after power outage
  • ErP/EuP Ready: Disabled — Prevents deep S5 state that blocks wake-on-power

Step 2: Install Intel oneAPI 2026.0

This is where the journey diverges from NVIDIA. You need Intel's full proprietary compiler stack.

Download

Grab the oneAPI Base Toolkit for Linux from intel.com. You want the offline installer (~1GB for the compiler components).

Install

sudo dpkg -i intel-oneapi-dpcpp-cpp-2026.0_*.deb

If you encounter broken package state from a previous Intel install attempt (common — Intel's packages have had packaging bugs):

# Force-remove all phantom Intel packages (destructive: this edits dpkg's database by hand, so back up /var/lib/dpkg first)
sudo dpkg --remove --force-all $(dpkg -l | grep intel-oneapi | awk '{print $2}')
sudo rm -f /var/lib/dpkg/info/intel-oneapi-*.list
sudo rm -f /var/lib/dpkg/info/intel-oneapi-*.postinst
sudo rm -f /var/lib/dpkg/info/intel-oneapi-*.postrm
sudo dpkg --configure -a
sudo apt clean && sudo apt update
sudo apt --fix-broken install -y

Then reinstall.

Also install the missing runtime dependencies that 2026.0 doesn't pull in automatically:

sudo apt install intel-ocloc libsycl-dev libigc2 libigdfcl2

Verify

source /opt/intel/oneapi/setvars.sh --force
which icx && which icpx
icpx --version  # Should show IntelLLVM 2026.0
clang-offload-bundler --version  # Must exist or cmake will fail
echo $MKLROOT  # Should be /opt/intel/oneapi/mkl/2026.0

The Compiler Matters

You might see guides suggesting dpclang-6 (the open-source DPC++ compiler from apt). Do not use it for llama.cpp. Here's why:

| Aspect | IntelLLVM 2026.0 (icx/icpx) | dpclang-6 (open-source) |
|---|---|---|
| MKL BLAS SYCL | Bundled, cmake finds it | Not available, cmake fails |
| clang-offload-bundler | Included | Included |
| Level Zero | Native | Native |
| Status | Works | Cannot build SYCL+MKL |

Root cause: ggml/src/ggml-sycl/CMakeLists.txt in llama.cpp hardcodes find_package(MKL REQUIRED). IntelLLVM ships MKL with a proper CMake config; dpclang-6 does not provide MKL at all. You can patch CMakeLists.txt to make MKL optional, but you'll lose BLAS acceleration, which defeats the purpose of GPU inference.
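Because find_package(MKL REQUIRED) fails late and noisily, it can help to check for the MKL CMake package before configuring. The sketch below is a minimal pre-flight check, not part of llama.cpp. It assumes oneMKL installs its package config at lib/cmake/mkl/MKLConfig.cmake under the MKL root; the function name is hypothetical, and the relative path may need adjusting for your install.

```shell
#!/usr/bin/env bash
# Pre-flight check: does an MKL install expose a CMake package config?
# ASSUMPTION: oneMKL places its config at <root>/lib/cmake/mkl/MKLConfig.cmake.
# The function name is hypothetical; adjust the path if your layout differs.

mkl_cmake_config_present() {
  local root="$1"
  [ -n "$root" ] && [ -f "$root/lib/cmake/mkl/MKLConfig.cmake" ]
}

# Example: only reads the filesystem, so it is safe to run anywhere.
if mkl_cmake_config_present "${MKLROOT:-}"; then
  echo "MKL CMake config found under ${MKLROOT}"
else
  echo "No MKL CMake config found; source setvars.sh or pass -DMKL_ROOT"
fi
```

If the check fails after sourcing setvars.sh, the MKL component is probably not installed, and no compiler flag will fix that.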


Step 3: Build llama.cpp with SYCL

Clone (skip if you already have a checkout) and build:

git clone https://github.com/ggml-org/llama.cpp ~/llama.cpp
cd ~/llama.cpp
source /opt/intel/oneapi/setvars.sh --force

rm -rf build CMakeCache.txt CMakeFiles .cmake

cmake -Bbuild \
  -DGGML_SYCL=ON \
  -DGGML_SYCL_F16=OFF \
  -DGGML_SYCL_DNN=ON \
  -DGGML_SYCL_GRAPH=ON \
  -DGGML_SYCL_HOST_MEM_FALLBACK=ON \
  -DGGML_SYCL_STMT=ON \
  -DGGML_SYCL_SUPPORT_LEVEL_ZERO=ON \
  -DCMAKE_C_COMPILER=/opt/intel/oneapi/compiler/2026.0/bin/icx \
  -DCMAKE_CXX_COMPILER=/opt/intel/oneapi/compiler/2026.0/bin/icpx \
  -DCMAKE_BUILD_TYPE=Release \
  -DMKL_ROOT=/opt/intel/oneapi/mkl/2026.0 \
  -DBUILD_SHARED_LIBS=ON

cmake --build build -j4 --target llama-server

What's happening in these flags

  • GGML_SYCL=ON: Enable the SYCL GPU backend in ggml
  • GGML_SYCL_F16=OFF: Disable FP16 SYCL (crashes on Arc via dpct dev_mgr bug)
  • GGML_SYCL_DNN=ON: Use oneDNN for DNN operations (better than MKL for some ops)
  • GGML_SYCL_GRAPH=ON: Enable SYCL graph execution for reduced launch overhead
  • GGML_SYCL_HOST_MEM_FALLBACK=ON: Allow fallback to host memory when GPU memory is tight
  • GGML_SYCL_STMT=ON: Enable SYCL speculative token generation support
  • GGML_SYCL_SUPPORT_LEVEL_ZERO=ON: Use Level Zero runtime (Intel's native low-level GPU API)
  • icx / icpx: Intel's LLVM-based C/C++ compilers, required for SYCL + MKL interop
  • -DMKL_ROOT: Points to oneMKL so CMake can find the SYCL BLAS libraries
  • -DBUILD_SHARED_LIBS=ON: Build shared libraries (required for production deployment)
  • -j4: Limit parallel jobs to avoid OOM during compilation of large translation units

Critical Fix: RMS_NORM Crash at 131K Context

Symptom: Server crashes during model loading with Error OP RMS_NORM when context >= 32K.

Root Cause (two-part):

  1. The -ze-intel-greater-than-4GB-buffer-required linker flag in ggml/src/ggml-sycl/CMakeLists.txt:162 is only valid for GPU devices. When any SYCL operation falls back to the CPU device, the LLVM JIT compiler rejects the flag and crashes. This flag was unnecessary — the Arc 140T has 58GB shared VRAM.
  2. IntelLLVM 2026.0's sycl-kernel-reduce-cross-barrier-values LLVM pass crashes with free(): invalid pointer when compiling SYCL kernels for large KV cache allocations on the CPU SYCL device.

Fix:

# In ggml/src/ggml-sycl/CMakeLists.txt, line 162:
# COMMENT OUT this line:
# target_link_options(ggml-sycl PRIVATE -Xs -ze-intel-greater-than-4GB-buffer-required)

# Add before launching llama-server:
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
source /opt/intel/oneapi/setvars.sh --force

The ONEAPI_DEVICE_SELECTOR=level_zero:gpu prevents the CPU SYCL device from being registered entirely, avoiding both the flag incompatibility and the optimizer crash. All operations stay on the Arc GPU.
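Since forgetting the selector brings the crash back, a launch script can refuse to start without it. The guard below is a small sketch; the function name is hypothetical, and the strict string comparison is a deliberate simplification (other selector values that also resolve to the GPU are rejected).

```shell
#!/usr/bin/env bash
# Guard for a launch script: refuse to start unless SYCL is pinned to the
# Level Zero GPU, as this guide recommends. The function name is hypothetical.

require_gpu_only_selector() {
  if [ "${ONEAPI_DEVICE_SELECTOR:-}" != "level_zero:gpu" ]; then
    echo "ONEAPI_DEVICE_SELECTOR must be level_zero:gpu (got '${ONEAPI_DEVICE_SELECTOR:-unset}')" >&2
    return 1
  fi
}
```

In a wrapper script, calling `require_gpu_only_selector || exit 1` just before starting llama-server turns a silent misconfiguration into an immediate, readable error.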

Result: All tested 9B models now load successfully at 131K context with NGL=99.

Verify the build

./build/bin/llama-server --version
# Should show SYCL support in the build info

# Quick smoke test with a small model
./build/bin/llama-server -m ./models/your-model.gguf -ngl 99 --host 0.0.0.0 --port 8080

In the server log, look for:

[INFO] SYCL device: Intel(R) Arc 140T (GPU) (Level Zero)
[INFO] Offloading 99 layers to GPU

If you see those lines, GPU offload is working.


Step 4: Linux Tuning

CPU Governor

echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
echo "performance" | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/energy_performance_preference

Memory

# With 64GB RAM, swapping kills inference
sudo sysctl vm.swappiness=10

# Hugepages reduce TLB misses for large model allocations
echo 1024 | sudo tee /proc/sys/vm/nr_hugepages

# Mount /tmp as a 4GB tmpfs for faster temp I/O (this hides any files already in /tmp until it is unmounted)
sudo mount -t tmpfs -o size=4G tmpfs /tmp

I/O Scheduler

# NVMe: use 'none' (noop) to reduce latency
echo "none" | sudo tee /sys/block/nvme0n1/queue/scheduler

Make governor persistent

Create /etc/systemd/system/cpu-performance.service:

[Unit]
Description=Set CPU governor to performance

[Service]
Type=oneshot
ExecStart=/bin/bash -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'

[Install]
WantedBy=multi-user.target

Then enable the service:

sudo systemctl enable cpu-performance
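The memory settings earlier in this step are applied with sysctl and tee, so they reset on reboot just like the governor does. One way to persist them is a sysctl drop-in file. The sketch below only renders the file contents; the function name and the target filename are hypothetical, and the values are the ones used above.

```shell
#!/usr/bin/env bash
# Render a sysctl drop-in with the memory settings from this step so they
# survive a reboot. The function only prints; nothing is written to disk here.
# The function name is hypothetical.

render_llm_sysctl_conf() {
  cat <<'EOF'
# Memory tuning for local LLM inference
vm.swappiness = 10
vm.nr_hugepages = 1024
EOF
}

render_llm_sysctl_conf
```

To apply it, pipe the output into a file under /etc/sysctl.d (for example `render_llm_sysctl_conf | sudo tee /etc/sysctl.d/99-llm-inference.conf`) and reload with `sudo sysctl --system`.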

Step 5: Running Models on the iGPU

Memory Budget

The Arc 140T shares system DDR5. With -ngl 99 (all layers on GPU), the model weights sit in memory that the GPU can access. The remaining system RAM holds the KV cache and non-offloaded tensors.

Measured VRAM usage at 131K context, NGL=99:

| Model | Weights | KV Cache (131K ctx) | Total VRAM | Headroom (of 58GB) |
|---|---|---|---|---|
| Qwen3.5-9B-Sushi-Coder-RL Q4_K_M | ~5.3GB | ~44GB (maxCtx) / ~4.4GB (at 10K used) | ~9.7GB (at 10K ctx) | ~48GB |
| Qwopus3.5-9B-Coder-MTP Q4_K_M | ~5.4GB | Similar to above | ~9.8GB | ~48GB |
| Qwen3.5-9B-DS-V4-Flash Q4_K_M | ~5.3GB | Similar to above | ~9.7GB | ~48GB |
| Qwen3-Coder-30B-A3B Q3_K_M | ~14GB | ~6.5GB (at 10K ctx) | ~20.5GB | ~37GB |
| Qwen3.6-35B-UD-IQ4_NL | ~17GB | ~7GB (at 10K ctx) | ~24GB | ~34GB |

Key insight: At 131K context with NGL=99, the KV cache for a 9B model can consume up to ~44GB if the full context is utilized. In practice, agentic workloads with 10+ tool calls accumulating ~10K tokens use only ~4-5GB KV cache, leaving plenty of headroom. The 58GB shared memory is sufficient for 9B models at 131K ctx, but 27B+ models require careful context management.

Buffer size caution: Don't set context to 131K unless needed. Each token of context consumes memory even if unused. For most agentic work, 32K-65K context is the practical sweet spot.
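To reason about the budget above before loading a model, the textbook KV cache formula for standard attention is a useful first approximation: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per element. The sketch below applies it with hypothetical dimensions; they are not taken from any model in the table, and hybrid or sliding-window architectures will deviate from this estimate, so it is not expected to reproduce the measured figures.

```shell
#!/usr/bin/env bash
# Estimate KV cache size for a standard attention model:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * context * bytes_per_element
# Prints the result in MiB using integer arithmetic.
# The function name is hypothetical.

kv_cache_mib() {
  local layers="$1" kv_heads="$2" head_dim="$3" ctx="$4" elem_bytes="$5"
  echo $(( 2 * layers * kv_heads * head_dim * ctx * elem_bytes / 1024 / 1024 ))
}

# HYPOTHETICAL dimensions (32 layers, 8 KV heads, head_dim 128, f16 cache).
# Read the real values from your model's GGUF metadata.
kv_cache_mib 32 8 128 131072 2   # full 131K context: prints 16384
kv_cache_mib 32 8 128 10240 2    # ~10K tokens:       prints 1280
```

The estimate scales linearly with context, which is why halving -c is usually the quickest way to recover memory. Passing 1 as the last argument gives a rough stand-in for an 8-bit cache.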

Example Command

source /opt/intel/oneapi/setvars.sh --force
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu

./build/bin/llama-server \
  -m ./models/your-model.gguf \
  -ngl 99 \
  -c 131072 \
  -b 2048 \
  --ubatch-size 512 \
  --no-warmup \
  --mmap

Flags explained:

  • -ngl 99: Offload all transformer layers to GPU (the heavy compute)
  • -c 131072: Context length in tokens (130K practical for 9B models)
  • -b 2048: Logical batch size for prompt processing (higher = faster prefill)
  • --ubatch-size 512: Physical (micro) batch size, the largest chunk of a batch processed in a single pass
  • --no-warmup: Skip warmup (faster startup)
  • --mmap: Memory-map model files (lets OS manage page cache)
  • ONEAPI_DEVICE_SELECTOR=level_zero:gpu: Critical. Restricts SYCL to the GPU only and prevents CPU SYCL JIT crashes

Troubleshooting

CMake cannot find MKL

source /opt/intel/oneapi/setvars.sh --force
echo $MKLROOT  # Must show /opt/intel/oneapi/mkl/2026.0

# If empty, specify manually:
cmake -Bbuild -DMKL_ROOT=/opt/intel/oneapi/mkl/2026.0 ...

Build fails with "cannot find -lsycl"

# Clean rebuild from scratch
rm -rf build CMakeCache.txt CMakeFiles .cmake
source /opt/intel/oneapi/setvars.sh --force
# Re-run cmake

Server starts but completions hang

Check dmesg for GPU fence timeouts. The most likely cause is a context size too large for available memory: reduce -c from its current value to 8192 and confirm completions return before raising it again.

Phantom Intel packages after force deletion

The 2026.0 release had packaging bugs where dpkg tracks packages but files are missing. Always force-remove and clean dpkg state files before reinstalling.


Performance Notes

On the Arc 140T (8 Xe-cores / 128 vector engines, DDR5-5600 shared memory), measured with NGL=99:

| Model | Prefill (pp512) | Decode (tg128) | Context | Notes |
|---|---|---|---|---|
| Qwen3-Coder-30B-A3B Q3_K_M | 81 t/s | 11.52 t/s | 131K | Local default: MoE, fastest decode, 8/8 hard tests pass |
| Qwen3.5-9B-Sushi-Coder-RL Q4_K_M | 166 t/s | 8.24 t/s | 130K | General purpose: fastest prefill, RL-tuned |
| Qwen3.6-35B-UD-IQ4_NL | 76 t/s | 7.98 t/s | 65K | MoE reasoning, slower decode |

Memory bandwidth is the bottleneck — DDR5-5600 provides ~85 GB/s shared between CPU and GPU. For comparison, an NVIDIA RTX 3060 (12GB GDDR6, 360 GB/s) is 4-5x faster on memory-bound operations. The Arc iGPU is competitive with a laptop RTX 3050 for LLM inference despite having no dedicated VRAM.
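The bandwidth argument can be turned into a back-of-envelope ceiling. During decode, each generated token has to stream the active weights from memory roughly once, so tokens per second cannot exceed bandwidth divided by the bytes read per token. The sketch below is an estimate under stated assumptions, not a benchmark: the function names are hypothetical, the peak-bandwidth helper assumes 64-bit channels in a dual-channel setup, and the 5.3GB figure is the 9B Q4_K_M weight size from the memory table.

```shell
#!/usr/bin/env bash
# Back-of-envelope decode ceiling for a memory-bound model.
# Function names are hypothetical.

# Theoretical peak DDR bandwidth in GB/s.
# ASSUMPTION: each channel is 64 bits (8 bytes) wide.
ddr_peak_gbps() {
  awk -v mts="$1" -v ch="$2" 'BEGIN { printf "%.1f\n", mts * 8 * ch / 1000 }'
}

# Upper bound on decode speed: tokens/s <= bandwidth (GB/s) / weights read per token (GB).
decode_ceiling_tps() {
  awk -v bw="$1" -v gb="$2" 'BEGIN { printf "%.1f\n", bw / gb }'
}

ddr_peak_gbps 5600 2          # prints 89.6
decode_ceiling_tps 85 5.3     # prints 16.0
```

On these assumptions the measured 8.24 t/s for the 9B model is roughly half the ceiling, which is plausible given that the CPU shares the same bus. A MoE model reads only its active experts per token, so its ceiling is higher than its file size suggests, consistent with the 30B-A3B model decoding faster.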

The key advantage isn't raw speed — it's that this is an iGPU inside a mini-PC that sips ~45W at full load, fits in your pocket, and costs a fraction of a discrete GPU setup. And with the SYCL fix applied, it runs 131K context models without crashing.

Agentic Task Performance

Real-world agentic benchmark results (8 hard tests per model, 131K ctx):

| Model | Tests Pass | Total Time | Output Quality |
|---|---|---|---|
| Qwen3-Coder-30B-A3B | 8/8 | 228s | Clean, coherent, valid JSON; 4.3x faster than Sushi |
| Qwen3.5-9B-Sushi-Coder-RL | 8/8 | 994s | Clean, coherent, valid JSON |

Both models pass all 8 hard multi-agent tests at 131K context. 30B-Coder is the local default because it delivers the same quality at 4.3x the speed. Sushi remains the general-purpose option with 2x faster prefill and smaller disk footprint.
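The ratios quoted here follow directly from the two tables, and a one-line helper makes them easy to recompute as new models are added. This is plain arithmetic on the published numbers; the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Ratio helper for comparing benchmark numbers (hypothetical name).
speedup() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

speedup 994 228      # total time, Sushi vs 30B-Coder: prints 4.36
speedup 166 81       # prefill, Sushi vs 30B-Coder:    prints 2.05
speedup 11.52 8.24   # decode, 30B-Coder vs Sushi:     prints 1.40
```

The total-time ratio comes out at about 4.36, which the article quotes as 4.3x. It is much larger than the 1.40x raw decode advantage, which suggests the gap in the agentic runs comes from more than token speed alone (for instance, fewer generated tokens per task).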

Why the Defaults Changed (2026-07-30)

After completing the hard multi-agent benchmark suite (cross-doc reasoning, constrained JSON, subagent delegation, complex nested JSON, edge cases, multi-turn fact retention, arithmetic reasoning), the data showed 30B-Coder matched Sushi on every quality metric while being dramatically faster. The MoE architecture activates only ~3B parameters per token, delivering high quality at a fraction of the compute cost.

On the remote side, gemma-4-31b-it:free replaced owl-alpha as the OpenRouter default after benchmarking 13 free models. Gemma-4-31b passed all 5 tests with the highest quality outputs, while owl-alpha took 2x longer for the same pass rate.


What About IPEX-LLM and Ollama?

IPEX-LLM is Intel's LLM acceleration library built on Intel Extension for PyTorch (IPEX). It works on this hardware via the IPEX-LLM Python package, but it uses a different, PyTorch-based execution model than llama.cpp. Integration quality varies by model architecture. Worth experimenting with but less battle-tested than the llama.cpp SYCL path.

Ollama with Intel GPU support is maturing. As of 2026, Ollama can use Level Zero on Intel GPUs on Linux, but model selection and quantization options are more limited than llama.cpp's GGUF ecosystem. If you want the simplest possible setup and don't need fine-grained control, Ollama is worth trying first.


Ongoing Work

This is a live research project. Hermes Agent continues to:

  • Test new GGUF models as they're released, evaluating agentic capability at 131K+ context
  • Run long-duration multi-agent benchmarks (24/7 stability, context accumulation, memory pressure)
  • Profile VRAM usage across 9B, 27B, 30B, and 35B parameter models on the 58GB shared memory pool
  • Validate that the SYCL stack survives days of continuous inference without memory leaks or fence timeouts

The goal: a reliable, completely local daily-driver AI agent running on pocketable Intel hardware — no cloud dependency, no API costs, no rate limits.

Content & Monetization

Alongside the technical work, the system also drives content creation — blog posts documenting the research, benchmark results, and lessons learned. The content strategy:

  • Medium + dev.to — publish technical deep-dives for developer audiences

All content was researched and written by Hermes Agent. The agent handles research pipelines, draft production, cross-posting scheduling, and performance analytics.

Security Monitoring

The server runs automated security monitoring via Hermes cron:

  • Every 30 min — SSH brute-force detection, fail2ban status, new device discovery, firewall health, unexpected listening services, gateway status
  • Every 12 hours — CVE feed monitoring (Ubuntu, kernel, Docker, Freebox OS, general advisories)
  • Alerts posted to Discord with severity ratings (CRITICAL / HIGH / MEDIUM)
  • Pentest tools available on the server: nmap, masscan, tcpdump, arp-scan, netcat, wireshark

All monitoring was set up and configured by Hermes Agent.


All local model research, SYCL GPU debugging, production inference setup, benchmark design, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step.

Top comments (2)

Harjot Singh

Running local LLMs on Intel Arc iGPU on a mini PC is a quietly important direction, because the cost and privacy math of local inference changes the moment you don't need a $1500 discrete GPU to do it. Most local-LLM guides assume an NVIDIA card; showing it work on integrated Arc graphics lowers the barrier a lot, and cheap-enough-to-run-always is what makes local models actually useful rather than a novelty, because the whole advantage is zero marginal cost and data never leaving the box. The pragmatic framing I'd add for anyone following: local inference isn't an all-or-nothing replacement for hosted models, it's the cheap, private workhorse for the high-volume, latency-tolerant, good-enough tasks, while you still route the genuinely hard stuff to a frontier model. A mini PC humming away on the easy 80% of your workload, with the cloud reserved for the 20% that needs it, is a great cost architecture, and the iGPU angle makes that affordable. The detail that matters most in these setups is usually the quantization-vs-quality tradeoff and tokens/sec at a usable context length, since that's what decides which tasks it can actually own. Run the local box for the cheap majority, route up only when needed. That right-size-and-route instinct is core to how I think about cost in Moonshift. On the Arc iGPU, what model size and quant level did you land on as the sweet spot for usable speed?

I am Starrzan

Thank you for the comment and insight.

I found a Qwen 3.5:9b model with a specific coder tuning that runs at an acceptable speed, although extensive project benchmark tests showed that the actual quality and breadth of output were not great. It works well for general chat and as a tool-use agent, but not for generating high-quality output in coding or creative work. The model is Qwen3.5-9b-Sushi-Coder-RL-Q4_K_M.gguf.

I, however, settled on a Qwen 3 Coder:30b model that produces much higher-quality output but at a much slower pace. Is it acceptable? Yes, but it is not optimal, and I am currently looking to improve the speed using TurboQuant. I have forked llama.cpp (rather, my Hermes-Agent has) to add TurboQuant support for Intel Arc models, and will post a blog about it soon and report on whether it improves speed. The model is Qwen3-Coder-30B-A3B-Instruct-Q3_K_M.gguf.

If I can get the response time to a more useful speed, this model may tick all the boxes in terms of the sweet spot. I will also publish a blog post on the creative benchmark results shortly.