<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: I am Starrzan</title>
    <description>The latest articles on DEV Community by I am Starrzan (@starrzan).</description>
    <link>https://dev.to/starrzan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F167631%2F208d5f97-c24c-4cd2-9299-f322b15f16ec.jpg</url>
      <title>DEV Community: I am Starrzan</title>
      <link>https://dev.to/starrzan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/starrzan"/>
    <language>en</language>
    <item>
      <title>Running Local LLMs on Intel Arc iGPU: A Complete Guide for Ubuntu on Mini-PC Hardware</title>
      <dc:creator>I am Starrzan</dc:creator>
      <pubDate>Sun, 31 May 2026 18:57:55 +0000</pubDate>
      <link>https://dev.to/starrzan/running-local-llms-on-intel-arc-igpu-a-complete-guide-for-ubuntu-on-mini-pc-hardware-40bc</link>
      <guid>https://dev.to/starrzan/running-local-llms-on-intel-arc-igpu-a-complete-guide-for-ubuntu-on-mini-pc-hardware-40bc</guid>
      <description>&lt;p&gt;&lt;strong&gt;System:&lt;/strong&gt; GMKtec EVO-T1 (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5)&lt;br&gt;
&lt;strong&gt;OS:&lt;/strong&gt; Ubuntu 26.04 LTS (kernel 7.0)&lt;br&gt;
&lt;strong&gt;Stack:&lt;/strong&gt; llama.cpp SYCL backend + oneAPI 2026.0 (IntelLLVM icx/icpx) + Hermes Agent&lt;/p&gt;

&lt;p&gt;All of the local model research, SYCL GPU debugging, production inference setup, benchmark design, and this blog article were implemented by Hermes Agent (an autonomous AI agent). The human directed goals; the agent executed everything — kernel flag surgery, compiler troubleshooting, benchmark design, and full documentation. This is an ongoing effort: new models are continuously tested with long-context, multi-agent benchmarks to validate 24/7 daily-driver reliability. Latest addition: 30B-Coder validated at 131K context as the new local daily driver, replacing Sushi 9B after passing all 8 hard multi-agent tests at 4.3x the speed.&lt;/p&gt;

&lt;p&gt;Most local LLM guides assume you have an NVIDIA GPU. If you're running an Intel mini-PC like the GMKtec EVO-T1 with an integrated Arc 140T, the path to GPU-accelerated local inference is less traveled — but entirely workable. Here's exactly how to get there.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why This Is Harder Than NVIDIA
&lt;/h2&gt;

&lt;p&gt;NVIDIA's CUDA stack is turnkey: install driver, install PyTorch, done. Intel's GPU compute story is more fragmented:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SYCL&lt;/strong&gt; is the open standard that replaces CUDA for cross-vendor GPU compute&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;oneAPI&lt;/strong&gt; is Intel's implementation of SYCL, plus MKL (math libraries) and the IntelLLVM compiler toolchain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Level Zero&lt;/strong&gt; is Intel's low-level GPU runtime (think: what Vulkan is to graphics, Level Zero is to compute)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;llama.cpp&lt;/strong&gt; added SYCL support, but it's designed for Intel's proprietary compiler stack, not the open-source alternatives&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The core tension: llama.cpp's SYCL backend &lt;strong&gt;requires&lt;/strong&gt; Intel MKL's SYCL BLAS library, which only ships with Intel's oneAPI toolkit. The open-source dpclang compiler cannot provide it. This is the single biggest blocker, and the one most guides don't explain clearly.&lt;/p&gt;


&lt;h2&gt;
  
  
  Hardware Context
&lt;/h2&gt;

&lt;p&gt;The Intel Arc 140T is an integrated GPU (iGPU) built into the Core Ultra 9 285H "Arrow Lake" processor. It shares system memory — there's no dedicated VRAM. My system has 64GB DDR5-5600, of which the iGPU can address a significant portion (typically 16-32GB depending on BIOS settings).&lt;/p&gt;

&lt;p&gt;Key hardware facts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Architecture:&lt;/strong&gt; Xe-LPG (same as Arc A-series, just smaller)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution Units:&lt;/strong&gt; 128 Xe cores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared memory:&lt;/strong&gt; Uses system DDR5 as VRAM (configurable in BIOS up to 16GB)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PCIe:&lt;/strong&gt; Appears as &lt;code&gt;00:02.0 VGA compatible controller: Intel Arrow Lake-P [Arc 130T/140T]&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VRAM available to GPU:&lt;/strong&gt; ~58GB (of 64GB total system RAM)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 1: BIOS Configuration
&lt;/h2&gt;

&lt;p&gt;Before touching Linux, configure the firmware. These settings are essential for iGPU compute workloads:&lt;/p&gt;
&lt;h3&gt;
  
  
  Memory &amp;amp; GPU
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Above 4G Decoding: Enabled&lt;/strong&gt; — Without this, the iGPU cannot access large memory regions required for LLM KV caches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resizable BAR (ReBAR): Enabled&lt;/strong&gt; — Lets the CPU map the entire GPU-visible memory space in one go&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DVMT Pre-Allocated: 512MB or Max&lt;/strong&gt; — Reserves system RAM for iGPU texture/compute operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;XMP/EXPO Profile: Enabled&lt;/strong&gt; — Ensures DDR5 runs at rated speed (5600+ MT/s). Memory bandwidth directly impacts inference throughput since the iGPU is bandwidth-starved&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Performance
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Intel Turbo Boost: Enabled&lt;/strong&gt; — Cores boost to 5.4GHz; critical for prompt prefill&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Speed Shift (HWP): Enabled&lt;/strong&gt; — Reduces P-state transition latency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hyper-Threading: Enabled&lt;/strong&gt; — More threads help with prefill/decode overlap in llama.cpp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU C-States: C0/C1 only&lt;/strong&gt; — Eliminates wake latency from deep sleep states during inference&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  Power
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Restore on AC Power Loss: Power On&lt;/strong&gt; — Server auto-restarts after power outage&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;ErP/EuP Ready: Disabled&lt;/strong&gt; — Prevents deep S5 state that blocks wake-on-power&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Step 2: Install Intel oneAPI 2026.0
&lt;/h2&gt;

&lt;p&gt;This is where the journey diverges from NVIDIA. You need Intel's full proprietary compiler stack.&lt;/p&gt;
&lt;h3&gt;
  
  
  Download
&lt;/h3&gt;

&lt;p&gt;Grab the oneAPI Base Toolkit for Linux from &lt;a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html" rel="noopener noreferrer"&gt;intel.com&lt;/a&gt;. You want the offline installer (~1GB for the compiler components).&lt;/p&gt;
&lt;h3&gt;
  
  
  Install
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;-i&lt;/span&gt; intel-oneapi-dpcpp-cpp-2026.0_&lt;span class="k"&gt;*&lt;/span&gt;.deb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;If you encounter broken package state from a previous Intel install attempt (common — Intel's packages have had packaging bugs):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Force-remove all phantom Intel packages&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--remove&lt;/span&gt; &lt;span class="nt"&gt;--force-all&lt;/span&gt; &lt;span class="si"&gt;$(&lt;/span&gt;dpkg &lt;span class="nt"&gt;-l&lt;/span&gt; | &lt;span class="nb"&gt;grep &lt;/span&gt;intel-oneapi | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $2}'&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/lib/dpkg/info/intel-oneapi-&lt;span class="k"&gt;*&lt;/span&gt;.list
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/lib/dpkg/info/intel-oneapi-&lt;span class="k"&gt;*&lt;/span&gt;.postinst
&lt;span class="nb"&gt;sudo rm&lt;/span&gt; &lt;span class="nt"&gt;-f&lt;/span&gt; /var/lib/dpkg/info/intel-oneapi-&lt;span class="k"&gt;*&lt;/span&gt;.postrm
&lt;span class="nb"&gt;sudo &lt;/span&gt;dpkg &lt;span class="nt"&gt;--configure&lt;/span&gt; &lt;span class="nt"&gt;-a&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt clean &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nt"&gt;--fix-broken&lt;/span&gt; &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reinstall.&lt;/p&gt;

&lt;p&gt;Also install the missing runtime dependencies that 2026.0 doesn't pull in automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install &lt;/span&gt;intel-ocloc libsycl-dev libigc2 libigdfcl2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Verify
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;
which icx &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; which icpx
icpx &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# Should show IntelLLVM 2026.0&lt;/span&gt;
clang-offload-bundler &lt;span class="nt"&gt;--version&lt;/span&gt;  &lt;span class="c"&gt;# Must exist or cmake will fail&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$MKLROOT&lt;/span&gt;  &lt;span class="c"&gt;# Should be /opt/intel/oneapi/mkl/2026.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Compiler Matters
&lt;/h3&gt;

&lt;p&gt;You might see guides suggesting &lt;code&gt;dpclang-6&lt;/code&gt; (the open-source DPC++ compiler from &lt;code&gt;apt&lt;/code&gt;). &lt;strong&gt;Do not use it for llama.cpp.&lt;/strong&gt; Here's why:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;IntelLLVM 2026.0 (icx/icpx)&lt;/th&gt;
&lt;th&gt;dpclang-6 (open-source)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MKL BLAS SYCL&lt;/td&gt;
&lt;td&gt;Bundled, cmake finds it&lt;/td&gt;
&lt;td&gt;Not available, cmake fails&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;clang-offload-bundler&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;td&gt;Included&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Level Zero&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;td&gt;Native&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Status&lt;/td&gt;
&lt;td&gt;Works&lt;/td&gt;
&lt;td&gt;Cannot build SYCL+MKL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Root cause: &lt;code&gt;llama.cpp/ggml-sycl/CMakeLists.txt&lt;/code&gt; hardcodes &lt;code&gt;find_package(MKL REQUIRED)&lt;/code&gt;. IntelLLVM ships MKL with proper CMake config. dpclang-6 does not provide MKL at all. You &lt;em&gt;can&lt;/em&gt; patch CMakeLists.txt to make MKL optional, but you'll lose BLAS acceleration — defeating the purpose of GPU inference.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 3: Build llama.cpp with SYCL
&lt;/h2&gt;

&lt;p&gt;Clone and build:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd&lt;/span&gt; ~/llama.cpp
&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;

&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build CMakeCache.txt CMakeFiles .cmake

cmake &lt;span class="nt"&gt;-Bbuild&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_F16&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;OFF &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_DNN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_GRAPH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_HOST_MEM_FALLBACK&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_STMT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DGGML_SYCL_SUPPORT_LEVEL_ZERO&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_C_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/intel/oneapi/compiler/2026.0/bin/icx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_CXX_COMPILER&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/intel/oneapi/compiler/2026.0/bin/icpx &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DCMAKE_BUILD_TYPE&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;Release &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DMKL_ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/intel/oneapi/mkl/2026.0 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-DBUILD_SHARED_LIBS&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON

cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build &lt;span class="nt"&gt;-j4&lt;/span&gt; &lt;span class="nt"&gt;--target&lt;/span&gt; llama-server
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  What's happening in these flags
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL=ON&lt;/code&gt; — Enable the SYCL GPU backend in ggml&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_F16=OFF&lt;/code&gt; — Disable FP16 SYCL (crashes on Arc via dpct dev_mgr bug)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_DNN=ON&lt;/code&gt; — Use oneDNN for DNN operations (better than MKL for some ops)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_GRAPH=ON&lt;/code&gt; — Enable SYCL graph execution for reduced launch overhead&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_HOST_MEM_FALLBACK=ON&lt;/code&gt; — Allow fallback to host memory when GPU memory is tight&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_STMT=ON&lt;/code&gt; — Enable SYCL speculative token generation support&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;GGML_SYCL_SUPPORT_LEVEL_ZERO=ON&lt;/code&gt; — Use Level Zero runtime (Intel's native low-level GPU API)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;icx&lt;/code&gt; / &lt;code&gt;icpx&lt;/code&gt; — Intel's LLVM-based C/C++ compilers, required for SYCL + MKL interop&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DMKL_ROOT&lt;/code&gt; — Points to oneMKL so CMake can find the SYCL BLAS libraries&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;DBUILD_SHARED_LIBS=ON&lt;/code&gt; — Build shared libraries (required for production deployment)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-j4&lt;/code&gt; — Limit parallel jobs to avoid OOM during compilation of large translation units&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Critical Fix: RMS_NORM Crash at 131K Context
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Symptom:&lt;/strong&gt; Server crashes during model loading with &lt;code&gt;Error OP RMS_NORM&lt;/code&gt; when context &amp;gt;= 32K.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root Cause (two-part):&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;-ze-intel-greater-than-4GB-buffer-required&lt;/code&gt; linker flag in &lt;code&gt;ggml/src/ggml-sycl/CMakeLists.txt:162&lt;/code&gt; is only valid for GPU devices. When any SYCL operation falls back to the CPU device, the LLVM JIT compiler rejects the flag and crashes. This flag was unnecessary — the Arc 140T has 58GB shared VRAM.&lt;/li&gt;
&lt;li&gt;IntelLLVM 2026.0's &lt;code&gt;sycl-kernel-reduce-cross-barrier-values&lt;/code&gt; LLVM pass crashes with &lt;code&gt;free(): invalid pointer&lt;/code&gt; when compiling SYCL kernels for large KV cache allocations on the CPU SYCL device.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# In ggml/src/ggml-sycl/CMakeLists.txt, line 162:&lt;/span&gt;
&lt;span class="c"&gt;# COMMENT OUT this line:&lt;/span&gt;
&lt;span class="c"&gt;# target_link_options(ggml-sycl PRIVATE -Xs -ze-intel-greater-than-4GB-buffer-required)&lt;/span&gt;

&lt;span class="c"&gt;# Add before launching llama-server:&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ONEAPI_DEVICE_SELECTOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;level_zero:gpu
&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;ONEAPI_DEVICE_SELECTOR=level_zero:gpu&lt;/code&gt; prevents the CPU SYCL device from being registered entirely, avoiding both the flag incompatibility and the optimizer crash. All operations stay on the Arc GPU.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; All tested 9B models now load successfully at 131K context with NGL=99.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the build
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./build/bin/llama-server &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="c"&gt;# Should show SYCL support in the build info&lt;/span&gt;

&lt;span class="c"&gt;# Quick smoke test with a small model&lt;/span&gt;
./build/bin/llama-server &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/your-model.gguf &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="nt"&gt;--host&lt;/span&gt; 0.0.0.0 &lt;span class="nt"&gt;--port&lt;/span&gt; 8080
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the server log, look for:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[INFO] SYCL device: Intel(R) Arc 140T (GPU) (Level Zero)
[INFO] Offloading 99 layers to GPU
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see those lines, GPU offload is working.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 4: Linux Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CPU Governor
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"performance"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/devices/system/cpu/cpu&lt;span class="k"&gt;*&lt;/span&gt;/cpufreq/scaling_governor
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"performance"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/devices/system/cpu/cpu&lt;span class="k"&gt;*&lt;/span&gt;/cpufreq/energy_performance_preference
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Memory
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# With 64GB RAM, swapping kills inference&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;sysctl vm.swappiness&lt;span class="o"&gt;=&lt;/span&gt;10

&lt;span class="c"&gt;# Hugepages reduce TLB misses for large model allocations&lt;/span&gt;
&lt;span class="nb"&gt;echo &lt;/span&gt;1024 | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /proc/sys/vm/nr_hugepages

&lt;span class="c"&gt;# Lazy mount /tmp as tmpfs (4GB) for faster temp I/O&lt;/span&gt;
&lt;span class="nb"&gt;sudo &lt;/span&gt;mount &lt;span class="nt"&gt;-t&lt;/span&gt; tmpfs &lt;span class="nt"&gt;-o&lt;/span&gt; &lt;span class="nv"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;4G tmpfs /tmp
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  I/O Scheduler
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# NVMe: use 'none' (noop) to reduce latency&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"none"&lt;/span&gt; | &lt;span class="nb"&gt;sudo tee&lt;/span&gt; /sys/block/nvme0n1/queue/scheduler
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Make governor persistent
&lt;/h3&gt;

&lt;p&gt;Create &lt;code&gt;/etc/systemd/system/cpu-performance.service&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ini"&gt;&lt;code&gt;&lt;span class="nn"&gt;[Unit]&lt;/span&gt;
&lt;span class="py"&gt;Description&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;Set CPU governor to performance&lt;/span&gt;

&lt;span class="nn"&gt;[Service]&lt;/span&gt;
&lt;span class="py"&gt;Type&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;oneshot&lt;/span&gt;
&lt;span class="py"&gt;ExecStart&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;/bin/bash -c 'echo performance | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor'&lt;/span&gt;

&lt;span class="nn"&gt;[Install]&lt;/span&gt;
&lt;span class="py"&gt;WantedBy&lt;/span&gt;&lt;span class="p"&gt;=&lt;/span&gt;&lt;span class="s"&gt;multi-user.target&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;systemctl &lt;span class="nb"&gt;enable &lt;/span&gt;cpu-performance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Step 5: Running Models on the iGPU
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Memory Budget
&lt;/h3&gt;

&lt;p&gt;The Arc 140T shares system DDR5. With &lt;code&gt;-ngl 99&lt;/code&gt; (all layers on GPU), the model weights sit in memory that the GPU can access. The remaining system RAM holds the KV cache and non-offloaded tensors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Measured VRAM usage at 131K context, NGL=99:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Weights&lt;/th&gt;
&lt;th&gt;KV Cache (131K ctx)&lt;/th&gt;
&lt;th&gt;Total VRAM&lt;/th&gt;
&lt;th&gt;Headroom (of 58GB)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-9B-Sushi-Coder-RL Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.3GB&lt;/td&gt;
&lt;td&gt;~44GB (maxCtx) / ~4.4GB (at 10K used)&lt;/td&gt;
&lt;td&gt;~9.7GB (at 10K ctx)&lt;/td&gt;
&lt;td&gt;~48GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwopus3.5-9B-Coder-MTP Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.4GB&lt;/td&gt;
&lt;td&gt;Similar to above&lt;/td&gt;
&lt;td&gt;~9.8GB&lt;/td&gt;
&lt;td&gt;~48GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-9B-DS-V4-Flash Q4_K_M&lt;/td&gt;
&lt;td&gt;~5.3GB&lt;/td&gt;
&lt;td&gt;Similar to above&lt;/td&gt;
&lt;td&gt;~9.7GB&lt;/td&gt;
&lt;td&gt;~48GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-30B-A3B Q3_K_M&lt;/td&gt;
&lt;td&gt;~14GB&lt;/td&gt;
&lt;td&gt;~6.5GB (at 10K ctx)&lt;/td&gt;
&lt;td&gt;~20.5GB&lt;/td&gt;
&lt;td&gt;~37GB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-UD-IQ4_NL&lt;/td&gt;
&lt;td&gt;~17GB&lt;/td&gt;
&lt;td&gt;~7GB (at 10K ctx)&lt;/td&gt;
&lt;td&gt;~24GB&lt;/td&gt;
&lt;td&gt;~34GB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; At 131K context with NGL=99, the KV cache for a 9B model can consume up to ~44GB if the full context is utilized. In practice, agentic workloads with 10+ tool calls accumulating ~10K tokens use only ~4-5GB KV cache, leaving plenty of headroom. The 58GB shared memory is sufficient for 9B models at 131K ctx, but 27B+ models require careful context management.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Buffer size caution:&lt;/strong&gt; Don't set context to 131K unless needed. Each token of context consumes memory even if unused. For most agentic work, 32K-65K context is the practical sweet spot.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Command
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;ONEAPI_DEVICE_SELECTOR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;level_zero:gpu

./build/bin/llama-server &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-m&lt;/span&gt; ./models/your-model.gguf &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-ngl&lt;/span&gt; 99 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-c&lt;/span&gt; 131072 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;-b&lt;/span&gt; 2048 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--ubatch-size&lt;/span&gt; 512 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--no-warmup&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--mmap&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Flags explained:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-ngl 99&lt;/code&gt; — Offload all transformer layers to GPU (the heavy compute)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-c 131072&lt;/code&gt; — Context length in tokens (130K practical for 9B models)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-b 2048&lt;/code&gt; — Batch size for prompt processing (higher = faster prefill)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--ubatch-size 512&lt;/code&gt; — Micro-batch size for decode&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--no-warmup&lt;/code&gt; — Skip warmup (faster startup)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--mmap&lt;/code&gt; — Memory-map model files (lets OS manage page cache)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;ONEAPI_DEVICE_SELECTOR=level_zero:gpu&lt;/code&gt; — &lt;strong&gt;Critical:&lt;/strong&gt; restricts to GPU-only, prevents CPU SYCL JIT crashes&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Troubleshooting
&lt;/h2&gt;

&lt;h3&gt;
  
  
  CMake cannot find MKL
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nv"&gt;$MKLROOT&lt;/span&gt;  &lt;span class="c"&gt;# Must show /opt/intel/oneapi/mkl/2026.0&lt;/span&gt;

&lt;span class="c"&gt;# If empty, specify manually:&lt;/span&gt;
cmake &lt;span class="nt"&gt;-Bbuild&lt;/span&gt; &lt;span class="nt"&gt;-DMKL_ROOT&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;/opt/intel/oneapi/mkl/2026.0 ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Build fails with "cannot find -lsycl"
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clean rebuild from scratch&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; build CMakeCache.txt CMakeFiles .cmake
&lt;span class="nb"&gt;source&lt;/span&gt; /opt/intel/oneapi/setvars.sh &lt;span class="nt"&gt;--force&lt;/span&gt;
&lt;span class="c"&gt;# Re-run cmake&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Server starts but completions hang
&lt;/h3&gt;

&lt;p&gt;Check &lt;code&gt;dmesg&lt;/code&gt; for GPU fence timeouts. Most likely: context size too large for available memory. Reduce &lt;code&gt;-c&lt;/code&gt; from current value to 8192.&lt;/p&gt;

&lt;h3&gt;
  
  
  Phantom Intel packages after force deletion
&lt;/h3&gt;

&lt;p&gt;The 2026.0 release had packaging bugs where dpkg tracks packages but files are missing. Always force-remove and clean dpkg state files before reinstalling.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance Notes
&lt;/h2&gt;

&lt;p&gt;On the Arc 140T (128 Xe cores, DDR5-5600 shared memory), measured with NGL=99:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Prefill (pp512)&lt;/th&gt;
&lt;th&gt;Decode (tg128)&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-30B-A3B Q3_K_M&lt;/td&gt;
&lt;td&gt;81 t/s&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;11.52 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;131K&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;Local default&lt;/strong&gt; — MoE, fastest decode, 8/8 hard tests pass&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-9B-Sushi-Coder-RL Q4_K_M&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;166 t/s&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;8.24 t/s&lt;/td&gt;
&lt;td&gt;130K&lt;/td&gt;
&lt;td&gt;General purpose — fastest prefill, RL-tuned&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.6-35B-UD-IQ4_NL&lt;/td&gt;
&lt;td&gt;76 t/s&lt;/td&gt;
&lt;td&gt;7.98 t/s&lt;/td&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;MoE reasoning, slower decode&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Memory bandwidth is the bottleneck&lt;/strong&gt; — DDR5-5600 provides ~85 GB/s shared between CPU and GPU. For comparison, an NVIDIA RTX 3060 (12GB GDDR6, 360 GB/s) is 4-5x faster on memory-bound operations. The Arc iGPU is competitive with a laptop RTX 3050 for LLM inference despite having no dedicated VRAM.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The key advantage isn't raw speed&lt;/strong&gt; — it's that this is an iGPU inside a mini-PC that sips ~45W at full load, fits in your pocket, and costs a fraction of a discrete GPU setup. And with the SYCL fix applied, it runs 131K context models without crashing.&lt;/p&gt;

&lt;h3&gt;
  
  
  Agentic Task Performance
&lt;/h3&gt;

&lt;p&gt;Real-world agentic benchmark results (8 hard tests per model, 131K ctx):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tests Pass&lt;/th&gt;
&lt;th&gt;Total Time&lt;/th&gt;
&lt;th&gt;Output Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3-Coder-30B-A3B&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;td&gt;228s&lt;/td&gt;
&lt;td&gt;Clean, coherent, valid JSON — 4.3x faster than Sushi&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen3.5-9B-Sushi-Coder-RL&lt;/td&gt;
&lt;td&gt;8/8&lt;/td&gt;
&lt;td&gt;994s&lt;/td&gt;
&lt;td&gt;Clean, coherent, valid JSON&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both models pass all 8 hard multi-agent tests at 131K context. 30B-Coder is the local default because it delivers the same quality at 4.3x the speed. Sushi remains the general-purpose option with 2x faster prefill and smaller disk footprint.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Defaults Changed (2026-07-30)
&lt;/h3&gt;

&lt;p&gt;After completing the hard multi-agent benchmark suite (cross-doc reasoning, constrained JSON, subagent delegation, complex nested JSON, edge cases, multi-turn fact retention, arithmetic reasoning), the data showed 30B-Coder matched Sushi on every quality metric while being dramatically faster. The MoE architecture activates only ~3B parameters per token, delivering high quality at a fraction of the compute cost.&lt;/p&gt;

&lt;p&gt;On the remote side, gemma-4-31b-it:free replaced owl-alpha as the OpenRouter default after benchmarking 13 free models. Gemma-4-31b passed all 5 tests with the highest quality outputs, while owl-alpha took 2x longer for the same pass rate.&lt;/p&gt;




&lt;h2&gt;
  
  
  What About IPEX-LLM and Ollama?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;IPEX-LLM&lt;/strong&gt; (Intel Extension for PyTorch) offers OpenVINO/IPEX backends for LLM inference. It works on this hardware via the IPEX-LLM Python package, but it uses a different execution model than llama.cpp — PyTorch-based, OpenVINO-compiled graphs. Integration quality varies by model architecture. Worth experimenting with but less battle-tested than the llama.cpp SYCL path.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ollama&lt;/strong&gt; with Intel GPU support is maturing. As of 2026, Ollama can use Level Zero on Intel GPUs on Linux, but model selection and quantization options are more limited than llama.cpp's GGUF ecosystem. If you want the simplest possible setup and don't need fine-grained control, Ollama is worth trying first.&lt;/p&gt;




&lt;h2&gt;
  
  
  Ongoing Work
&lt;/h2&gt;

&lt;p&gt;This is a live research project. Hermes Agent continues to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Test new GGUF models as they're released, evaluating agentic capability at 131K+ context&lt;/li&gt;
&lt;li&gt;Run long-duration multi-agent benchmarks (24/7 stability, context accumulation, memory pressure)&lt;/li&gt;
&lt;li&gt;Profile VRAM usage across 9B, 27B, 30B, and 35B parameter models on the 58GB shared memory pool&lt;/li&gt;
&lt;li&gt;Validate that the SYCL stack survives days of continuous inference without memory leaks or fence timeout&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal: a reliable, completely local daily-driver AI agent running on pocketable Intel hardware — no cloud dependency, no API costs, no rate limits.&lt;/p&gt;

&lt;h3&gt;
  
  
  Content &amp;amp; Monetization
&lt;/h3&gt;

&lt;p&gt;Alongside the technical work, the system also drives content creation — blog posts documenting the research, benchmark results, and lessons learned. The content strategy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medium + dev.to&lt;/strong&gt; — publish technical deep-dives for developer audiences&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All content was researched and written by Hermes Agent. The agent handles research pipelines, draft production, cross-posting scheduling, and performance analytics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Monitoring
&lt;/h3&gt;

&lt;p&gt;The server runs automated security monitoring via Hermes cron:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Every 30 min&lt;/strong&gt; — SSH brute-force detection, fail2ban status, new device discovery, firewall health, unexpected listening services, gateway status&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Every 12 hours&lt;/strong&gt; — CVE feed monitoring (Ubuntu, kernel, Docker, Freebox OS, general advisories)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Alerts&lt;/strong&gt; posted to Discord with severity ratings (CRITICAL / HIGH / MEDIUM)&lt;/li&gt;
&lt;li&gt;Pentest tools available on the server: nmap, masscan, tcpdump, arp-scan, netcat, wireshark&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All monitoring was set up and configured by Hermes Agent.&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md" rel="noopener noreferrer"&gt;llama.cpp SYCL build docs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html" rel="noopener noreferrer"&gt;Intel oneAPI 2026.0 release&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/intel/llvm/blob/sycl/docs/GetStartedGuides/oneapi-mkl.md" rel="noopener noreferrer"&gt;IntelSYCL + MKL integration&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/canonical/oneapi-packaging/issues/30" rel="noopener noreferrer"&gt;Canonical oneAPI packaging fixes&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://hermes-agent.nousresearch.com" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — the autonomous AI agent platform behind this work&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;All local model research, SYCL GPU debugging, production inference setup, benchmark design, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>linux</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How I Built a Self-Managing AI Lab with Hermes Agent on a Intel Arc GPU</title>
      <dc:creator>I am Starrzan</dc:creator>
      <pubDate>Sat, 30 May 2026 21:28:10 +0000</pubDate>
      <link>https://dev.to/starrzan/how-i-built-a-self-managing-ai-workspace-with-hermes-agent-2lf6</link>
      <guid>https://dev.to/starrzan/how-i-built-a-self-managing-ai-workspace-with-hermes-agent-2lf6</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/hermes-agent-2026-05-15"&gt;Hermes Agent Challenge&lt;/a&gt;: Write About Hermes Agent&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;A self-managing AI workspace powered by &lt;a href="https://hermes-agent.nousresearch.com" rel="noopener noreferrer"&gt;Hermes Agent&lt;/a&gt; — where an autonomous agent runs the local inference stack on Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micro-management. The human directs goals; the agent executes everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.&lt;/p&gt;

&lt;p&gt;The system manages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM inference&lt;/strong&gt; via llama.cpp on Intel Arc SYCL (iGPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated research pipelines&lt;/strong&gt; feeding structured docs into a persistent vault&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-model testing and benchmarking&lt;/strong&gt; — 9+ models across 9B to 35B parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron-driven monitoring&lt;/strong&gt; — market data, system health, memory management&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-maintaining skills&lt;/strong&gt; — the agent updates its own skills and docs when things change&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ User Goals ]
      │
      ▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
      │                    ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx) ← daily driver
      │                    ├── Qwen3-Coder-30B-A3B (65K ctx) ← coding specialist
      │                    ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx) ← reasoning
      │                    └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx) ← stable but reasoning-only
      │
      ├── research-vault/   (research &amp;amp; docs)
      └── hermes-config/    (skills, plugins, cron jobs)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The agent runs as a Hermes session with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Persistent memory&lt;/strong&gt; — notes about the environment, user preferences, tool quirks, project conventions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Durable skills&lt;/strong&gt; — 40+ specialized procedures for devops, mlops, research, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Toolsets&lt;/strong&gt; — terminal, browser, file, cron, git, and more&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full system access&lt;/strong&gt; — builds, debugs, tunes, and documents everything autonomously&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GMKtec EVO-T1 Hardware
&lt;/h3&gt;

&lt;p&gt;The host is a &lt;strong&gt;GMKtec EVO-T1&lt;/strong&gt; mini-PC:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CPU:&lt;/strong&gt; Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iGPU:&lt;/strong&gt; Intel Arc 140T (128 Xe cores, shares system DDR5 as VRAM)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RAM:&lt;/strong&gt; 64GB DDR5-5600 (~58GB addressable by GPU)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power:&lt;/strong&gt; ~45W sustained under full load&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Form factor:&lt;/strong&gt; ~0.6L, pocketable&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The Intel Arc 140T iGPU is the inference engine. With llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical kernel-level SYCL fix (removing the &lt;code&gt;-ze-intel-greater-than-4GB-buffer-required&lt;/code&gt; CUDA-style linker flag and setting &lt;code&gt;ONEAPI_DEVICE_SELECTOR=level_zero:gpu&lt;/code&gt;) was required to prevent JIT compilation crashes at large context sizes — diagnosed and applied by the agent.&lt;/p&gt;




&lt;h2&gt;
  
  
  How It Was Built
&lt;/h2&gt;

&lt;p&gt;All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Local Inference Server (llama.cpp on Intel Arc)
&lt;/h3&gt;

&lt;p&gt;Built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, context sizing per model, and spec decode configuration.&lt;/p&gt;

&lt;p&gt;The critical subtlety: different models need different context sizes. CTX_SIZE must be set per-model, not globally. A 9B coder model gets 130k; a 27B model gets 65k. The agent handles this via model-specific startup configs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Major SYCL fix:&lt;/strong&gt; The SYCL backend had a critical bug — the &lt;code&gt;-ze-intel-greater-than-4GB-buffer-required&lt;/code&gt; linker flag in &lt;code&gt;ggml-sycl/CMakeLists.txt&lt;/code&gt; caused JIT compilation failures on the CPU SYCL device when any operation fell back from GPU. Removing this flag and setting &lt;code&gt;ONEAPI_DEVICE_SELECTOR=level_zero:gpu&lt;/code&gt; to restrict to GPU-only eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Hermes Agent Configuration
&lt;/h3&gt;

&lt;p&gt;Configured Hermes with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OpenRouter as default provider (cloud fallback)&lt;/li&gt;
&lt;li&gt;Local llama-server as local provider (primary for privacy-bound work)&lt;/li&gt;
&lt;li&gt;Skills system for recurring task patterns&lt;/li&gt;
&lt;li&gt;Memory persistence across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Cron Jobs for Automation
&lt;/h3&gt;

&lt;p&gt;The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Market data monitoring (Polymarket, Kalshi feeds)&lt;/li&gt;
&lt;li&gt;Workspace backup automation&lt;/li&gt;
&lt;li&gt;Codebase quality scans&lt;/li&gt;
&lt;li&gt;Security monitoring (SSH brute-force, system health, CVE feeds)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 4: Research Pipeline (research vault)
&lt;/h3&gt;

&lt;p&gt;The agent does autonomous research and documents findings in a structured vault:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;research-vault/
├── challenges/       # Dev challenge research, compatibility patches
├── research/         # Hardware, model, compatibility research
├── blogs/            # Technical blog articles
└── study/           # Learning notes, tutorials
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Model Lineup
&lt;/h2&gt;

&lt;p&gt;The system coordinates multiple GGUF models depending on task type:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Architecture&lt;/th&gt;
&lt;th&gt;Params&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Quant&lt;/th&gt;
&lt;th&gt;Role&lt;/th&gt;
&lt;th&gt;Notes&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-9B-Sushi-Coder-RL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.5 MoE&lt;/td&gt;
&lt;td&gt;9B&lt;/td&gt;
&lt;td&gt;130K&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;Daily driver&lt;/td&gt;
&lt;td&gt;RL-tuned, best agentic quality, clean JSON output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3-Coder-30B-A3B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3 MoE&lt;/td&gt;
&lt;td&gt;30B (3B active)&lt;/td&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;Q3_K_M&lt;/td&gt;
&lt;td&gt;Coding specialist&lt;/td&gt;
&lt;td&gt;Best decode throughput, strong at code generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.6-35B-UD-IQ4_NL&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.5 MoE&lt;/td&gt;
&lt;td&gt;35B&lt;/td&gt;
&lt;td&gt;65K&lt;/td&gt;
&lt;td&gt;UD-IQ4_NL&lt;/td&gt;
&lt;td&gt;Reasoning&lt;/td&gt;
&lt;td&gt;Highest reasoning quality, heavier VRAM cost&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwen3.5-9B-DeepSeek-V4-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.5 hybrid&lt;/td&gt;
&lt;td&gt;9B&lt;/td&gt;
&lt;td&gt;130K&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;Secondary&lt;/td&gt;
&lt;td&gt;Fastest prefill, but output is reasoning-only (content field empty)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwopus3.5-9B-Coder-MTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Qwen 3.5 w/ MTP&lt;/td&gt;
&lt;td&gt;9B&lt;/td&gt;
&lt;td&gt;8K effective&lt;/td&gt;
&lt;td&gt;Q4_K_M&lt;/td&gt;
&lt;td&gt;Deprecated&lt;/td&gt;
&lt;td&gt;MTP merge caused KV cache contamination, garbled output&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why These Models
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sushi 9B&lt;/strong&gt; is the only production-viable 9B model for agentic work on this hardware — passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, retained multi-turn context correctly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coder 30B&lt;/strong&gt; is a MoE model (30B total, 3B active parameters) so decode is fast despite the large parameter count — 11.52 t/s decode vs 8.24 t/s for the 9B model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DS-V4-Flash&lt;/strong&gt; is useful for quick reasoning tasks where you don't need structured output — 190 t/s prefill makes it fast for short prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;27B class models&lt;/strong&gt; fill the gap between 9B and 35B — reasonable quality without the VRAM overhead of the larger model in the shared memory pool&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Agentic Benchmark Results
&lt;/h2&gt;

&lt;p&gt;Ran comprehensive agentic evaluations across all 9B models at 131K context:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Tests Pass&lt;/th&gt;
&lt;th&gt;HTTP 500&lt;/th&gt;
&lt;th&gt;JSON Valid&lt;/th&gt;
&lt;th&gt;Total Time&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Sushi 9B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6/6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;Yes (3/3)&lt;/td&gt;
&lt;td&gt;561s&lt;/td&gt;
&lt;td&gt;Best&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DS-V4-Flash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;6/6&lt;/td&gt;
&lt;td&gt;0&lt;/td&gt;
&lt;td&gt;No (0/3)&lt;/td&gt;
&lt;td&gt;592s&lt;/td&gt;
&lt;td&gt;Reasoning-only&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Qwopus MTP&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;2/6&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;No (0/3)&lt;/td&gt;
&lt;td&gt;256s&lt;/td&gt;
&lt;td&gt;Broken&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Key Findings
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Sushi 9B (production daily driver):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only model to pass all 6 agentic tests without errors&lt;/li&gt;
&lt;li&gt;Correct multi-turn context retention across 3 turns&lt;/li&gt;
&lt;li&gt;Valid structured JSON output (T2: 3/3 score)&lt;/li&gt;
&lt;li&gt;Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk on 58GB headroom)&lt;/li&gt;
&lt;li&gt;Best instruction following (10 constraints, 4 paragraphs)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Qwopus MTP (deprecated):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;4 out of 6 tests returned HTTP 500 internal server errors&lt;/li&gt;
&lt;li&gt;Garbled output containing mixed Chinese/English pseudotext&lt;/li&gt;
&lt;li&gt;KV cache contamination — corrupted output poisons subsequent requests&lt;/li&gt;
&lt;li&gt;This is a model quality issue in the MTP merge — not fixable by configuration&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;DS-V4-Flash (secondary):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stable, but all output is in reasoning_content only (content field empty)&lt;/li&gt;
&lt;li&gt;Coherent reasoning but cannot produce valid structured JSON in content&lt;/li&gt;
&lt;li&gt;Fast prefill (190 t/s) but 8.24 t/s decode&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Technical Decisions Validated
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Local-first, cloud-fallback&lt;/strong&gt;: All inference runs local by default. Cloud only for models not running locally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Per-model context sizing&lt;/strong&gt;: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Skills over prompting&lt;/strong&gt;: Every recurring workflow is encoded as a skill file. The system maintains itself.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Git-backed vault&lt;/strong&gt;: All research auto-commits to GitHub. The workspace is the artifact.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated security monitoring&lt;/strong&gt;: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord — the workspace defends itself.&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Security Infrastructure
&lt;/h2&gt;

&lt;p&gt;The server runs automated security monitoring set up by Hermes Agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;UFW firewall&lt;/strong&gt; — default deny incoming, SSH only from LAN + Tailscale&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;fail2ban&lt;/strong&gt; — auto-ban after 3 failed SSH attempts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron: security-monitor&lt;/strong&gt; — every 30 min, checks brute-force, new devices, firewall, services, gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cron: vulnerability-feed-monitor&lt;/strong&gt; — every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Discord alerts&lt;/strong&gt; — CRITICAL and HIGH severity findings posted automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pentest tools&lt;/strong&gt; — nmap, masscan, tcpdump, arp-scan, netcat, wireshark&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Key Numbers
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;58GB&lt;/strong&gt; shared VRAM on Intel Arc 140T&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;130K&lt;/strong&gt; context window (Sushi 9B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9.7GB&lt;/strong&gt; total VRAM usage at 130K ctx for 9B models (weights + KV cache)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;48GB&lt;/strong&gt; VRAM headroom at 130K ctx&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;8.24 t/s&lt;/strong&gt; decode speed (Sushi 9B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;166 t/s&lt;/strong&gt; prefill speed (Sushi 9B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;190 t/s&lt;/strong&gt; prefill speed (DS-V4-Flash)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~36-37s&lt;/strong&gt; per generation turn (Sushi 9B at 256 max_tokens)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;0&lt;/strong&gt; HTTP 500 errors across 6 agentic tests (Sushi 9B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;9+&lt;/strong&gt; GGUF models tested (9B through 35B parameters)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6+ months&lt;/strong&gt; of continuous local inference development by Hermes Agent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Automated security monitoring&lt;/strong&gt; — log analysis, intrusion detection, CVE feed monitoring, Discord alerts&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Demo / How to Replicate
&lt;/h2&gt;

&lt;p&gt;The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.&lt;/p&gt;

&lt;p&gt;Minimal setup:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Clone and build llama.cpp with SYCL&lt;/span&gt;
git clone https://github.com/ggerganov/llama.cpp
&lt;span class="nb"&gt;cd &lt;/span&gt;llama.cpp &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;-B&lt;/span&gt; build &lt;span class="nt"&gt;-DGGML_SYCL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;ON &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; cmake &lt;span class="nt"&gt;--build&lt;/span&gt; build

&lt;span class="c"&gt;# 2. Install Hermes Agent&lt;/span&gt;
pip &lt;span class="nb"&gt;install &lt;/span&gt;hermes-agent

&lt;span class="c"&gt;# 3. Configure local server&lt;/span&gt;
hermes config &lt;span class="nb"&gt;set &lt;/span&gt;providers.local.base_url http://localhost:8080/v1

&lt;span class="c"&gt;# 4. Download and add your first model&lt;/span&gt;
&lt;span class="c"&gt;# (example: Qwen3.5-9B at Q4_K_M quantization)&lt;/span&gt;
hermes models add &lt;span class="nt"&gt;--alias&lt;/span&gt; coder &lt;span class="nt"&gt;--path&lt;/span&gt; ./models/your-model.gguf &lt;span class="nt"&gt;--context-size&lt;/span&gt; 131072
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;p&gt;&lt;em&gt;All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human-directed goals and validated results. The agent executed every step — from kernel flag surgery to final documentation.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>hermesagentchallenge</category>
      <category>ai</category>
      <category>agents</category>
      <category>productivity</category>
    </item>
  </channel>
</rss>
