I am Starrzan

Posted on May 30

How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

#hermesagentchallenge #ai #agents #productivity

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent

What I Built

A self-managing AI workspace powered by Hermes Agent: an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human sets the goals; the agent executes them.

Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

Local LLM inference via llama.cpp on Intel Arc SYCL (iGPU)
Automated research pipelines feeding structured docs into a persistent vault
Multi-model testing and benchmarking — 9+ models across 9B to 35B parameters
Cron-driven monitoring — market data, system health, memory management
Self-maintaining skills — the agent updates its own skills and docs when things change

Architecture

[ User Goals ]
      │
      ▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
      │                    ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx) ← daily driver
      │                    ├── Qwen3-Coder-30B-A3B (65K ctx) ← coding specialist
      │                    ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx) ← reasoning
      │                    └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx) ← stable but reasoning-only
      │
      ├── research-vault/   (research & docs)
      └── hermes-config/    (skills, plugins, cron jobs)

The agent runs as a Hermes session with:

Persistent memory — notes about the environment, user preferences, tool quirks, project conventions
Durable skills — 40+ specialized procedures for devops, mlops, research, etc.
Toolsets — terminal, browser, file, cron, git, and more
Full system access — builds, debugs, tunes, and documents everything autonomously

GMKtec EVO-T1 Hardware

The host is a GMKtec EVO-T1 mini-PC:

CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
iGPU: Intel Arc 140T (8 Xe-cores, 128 vector engines; shares system DDR5 as VRAM)
RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
Power: ~45W sustained under full load
Form factor: ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at up to 131K context. One build-level SYCL fix was required to prevent JIT compilation crashes at large context sizes: removing the -ze-intel-greater-than-4GB-buffer-required flag from the llama.cpp SYCL build and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu. The agent diagnosed and applied it.

How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

Step 1: Local Inference Server (llama.cpp on Intel Arc)

The agent built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, per-model context sizing, and speculative decoding configuration.

The critical subtlety: different models need different context sizes, so CTX_SIZE must be set per model, not globally. A 9B coder model gets 130K; a 27B model gets 65K. The agent handles this via model-specific startup configs.
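As a rough illustration of per-model context sizing, here is a minimal sketch. The aliases (sushi-9b, coder-30b, and so on), the exact token counts, the fallback value, and the helper names are all assumptions made for this example; they are not the agent's actual startup configs. The -m, -c, -ngl and --port options are standard llama-server options.

```shell
# Hypothetical per-model context table. Aliases and exact token counts are
# assumptions based on the "130K" and "65K" figures quoted in this post.
ctx_for_model() {
  case "$1" in
    sushi-9b|ds-v4-flash) echo 131072 ;;  # 9B models: ~130K context
    coder-30b|qwen-35b)   echo 65536  ;;  # larger models: ~65K context
    *)                    echo 8192   ;;  # conservative fallback for unknown aliases
  esac
}

# Print (do not run) the llama-server command for a model alias and GGUF path.
server_cmd() {
  local alias="$1" gguf="$2"
  echo "llama-server -m $gguf -c $(ctx_for_model "$alias") -ngl 99 --port 8080"
}
```

Calling server_cmd coder-30b ./models/your-model.gguf prints a command line with -c 65536, which can be inspected before anything is launched. Keeping the table in one function means an unknown model gets a small, safe context instead of inheriting a global default.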

Major SYCL fix: the -ze-intel-greater-than-4GB-buffer-required linker flag in ggml-sycl/CMakeLists.txt caused JIT compilation failures on the CPU SYCL device whenever an operation fell back from the GPU. Removing the flag and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu, which restricts SYCL to the GPU device, eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found, diagnosed, and fixed this.
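A small sketch of how one might check a checkout for this flag before rebuilding. The helper name has_4gb_flag is made up for this example, and the path to CMakeLists.txt is passed in as an argument because its exact location can differ between llama.cpp versions. How the flag is embedded in the file is not shown in this post, so the sketch only detects it and leaves the removal to a manual edit.

```shell
# Return 0 if the given CMakeLists.txt still contains the 4GB-buffer flag.
# The "--" stops grep from treating the pattern's leading dash as an option.
has_4gb_flag() {
  grep -q -- '-ze-intel-greater-than-4GB-buffer-required' "$1"
}

# Restrict the SYCL runtime to the Level Zero GPU device so that no
# operation falls back to the CPU SYCL device at run time.
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu
```

Typical use would be has_4gb_flag path/to/ggml-sycl/CMakeLists.txt && echo "flag still present", run before the build and again after editing the file.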

Step 2: Hermes Agent Configuration

Configured Hermes with:

OpenRouter as default provider (cloud fallback)
Local llama-server as local provider (primary for privacy-bound work)
Skills system for recurring task patterns
Memory persistence across sessions
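The local-first, cloud-fallback idea can be sketched outside Hermes as a simple probe. This is not how Hermes routes providers internally; the function names are hypothetical, and the address is the localhost:8080 base URL used elsewhere in this post. The /health path is llama-server's readiness endpoint.

```shell
# Print "local" if the probe command succeeds, otherwise "openrouter".
# The probe is passed in as arguments so it can be swapped or tested.
pick_provider() {
  if "$@" >/dev/null 2>&1; then
    echo local
  else
    echo openrouter
  fi
}

# Probe a llama-server on localhost:8080. -f makes curl fail on HTTP errors,
# and --max-time keeps a dead server from stalling the caller.
probe_local() {
  curl -sf --max-time 2 http://localhost:8080/health
}
```

Running pick_provider probe_local prints local while the server answers and openrouter otherwise, so a wrapper script can decide where to send a request before it is made.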

Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

Market data monitoring (Polymarket, Kalshi feeds)
Workspace backup automation
Codebase quality scans
Security monitoring (SSH brute-force, system health, CVE feeds)

Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault:

research-vault/
├── challenges/       # Dev challenge research, compatibility patches
├── research/         # Hardware, model, compatibility research
├── blogs/            # Technical blog articles
└── study/           # Learning notes, tutorials

Model Lineup

The system coordinates multiple GGUF models depending on task type:

Model	Architecture	Params	Context	Quant	Role	Notes
Qwen3.5-9B-Sushi-Coder-RL	Qwen 3.5 MoE	9B	130K	Q4_K_M	Daily driver	RL-tuned, best agentic quality, clean JSON output
Qwen3-Coder-30B-A3B	Qwen 3 MoE	30B (3B active)	65K	Q3_K_M	Coding specialist	Best decode throughput, strong at code generation
Qwen3.6-35B-UD-IQ4_NL	Qwen 3.5 MoE	35B	65K	UD-IQ4_NL	Reasoning	Highest reasoning quality, heavier VRAM cost
Qwen3.5-9B-DeepSeek-V4-Flash	Qwen 3.5 hybrid	9B	130K	Q4_K_M	Secondary	Fastest prefill, but output is reasoning-only (content field empty)
Qwopus3.5-9B-Coder-MTP	Qwen 3.5 w/ MTP	9B	8K effective	Q4_K_M	Deprecated	MTP merge caused KV cache contamination, garbled output

Why These Models

Sushi 9B is the only production-viable 9B model for agentic work on this hardware — passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, retained multi-turn context correctly
Coder 30B is a MoE model (30B total, 3B active parameters) so decode is fast despite the large parameter count — 11.52 t/s decode vs 8.24 t/s for the 9B model
DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output — 190 t/s prefill makes it fast for short prompts
27B class models fill the gap between 9B and 35B — reasonable quality without the VRAM overhead of the larger model in the shared memory pool

Agentic Benchmark Results

Ran comprehensive agentic evaluations across all 9B models at 131K context:

Model	Tests Pass	HTTP 500	JSON Valid	Total Time	Quality
Sushi 9B	6/6	0	Yes (3/3)	561s	Best
DS-V4-Flash	6/6	0	No (0/3)	592s	Reasoning-only
Qwopus MTP	2/6	4	No (0/3)	256s	Broken

Key Findings

Sushi 9B (production daily driver):

Only model to pass all 6 agentic tests without errors
Correct multi-turn context retention across 3 turns
Valid structured JSON output (T2: 3/3 score)
Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk in the 58GB shared pool)
Best instruction following (10 constraints, 4 paragraphs)

Qwopus MTP (deprecated):

4 out of 6 tests returned HTTP 500 internal server errors
Garbled output containing mixed Chinese/English pseudotext
KV cache contamination — corrupted output poisons subsequent requests
This is a model quality issue in the MTP merge — not fixable by configuration

DS-V4-Flash (secondary):

Stable, but all output is in reasoning_content only (content field empty)
Coherent reasoning but cannot produce valid structured JSON in content
Fast prefill (190 t/s) but 8.24 t/s decode
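The difference between valid JSON output and reasoning-only output can be checked mechanically. The sketch below is not the benchmark suite used for these results; it is a hypothetical classifier for one chat completion response body, and it assumes jq is installed. The content and reasoning_content field names are the ones referred to in the findings above.

```shell
# Classify one /v1/chat/completions response body read from stdin.
# Prints one of: json | text | reasoning-only | empty
classify_response() {
  local body content reasoning
  body=$(cat)
  # "// \"\"" turns a missing or null field into an empty string.
  content=$(printf '%s' "$body" | jq -r '.choices[0].message.content // ""')
  reasoning=$(printf '%s' "$body" | jq -r '.choices[0].message.reasoning_content // ""')
  if [ -z "$content" ]; then
    if [ -n "$reasoning" ]; then echo reasoning-only; else echo empty; fi
  elif printf '%s' "$content" | jq . >/dev/null 2>&1; then
    echo json    # content parses as JSON
  else
    echo text    # content is present but is not valid JSON
  fi
}
```

Piping a saved response through classify_response gives one label per turn. A model that always lands in reasoning-only cannot be used for structured output, however coherent its reasoning is.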

Technical Decisions Validated

Local-first, cloud-fallback: All inference runs locally by default. The cloud is used only for models that are not running locally.
Per-model context sizing: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
Skills over prompting: Every recurring workflow is encoded as a skill file. The system maintains itself.
Git-backed vault: All research auto-commits to GitHub. The workspace is the artifact.
Automated security monitoring: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord — the workspace defends itself.

Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

UFW firewall — default deny incoming, SSH only from LAN + Tailscale
fail2ban — auto-ban after 3 failed SSH attempts
Cron: security-monitor — every 30 min, checks brute-force, new devices, firewall, services, gateway
Cron: vulnerability-feed-monitor — every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
Discord alerts — CRITICAL and HIGH severity findings posted automatically
Pentest tools — nmap, masscan, tcpdump, arp-scan, netcat, wireshark
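To make the brute-force check concrete, here is a minimal sketch of one way to do it. It is not the actual security-monitor job: the function name is hypothetical, and it assumes the standard OpenSSH "Failed password ... from IP port N" log lines. The default threshold of 3 mirrors the fail2ban setting listed above.

```shell
# List source IPs with at least THRESHOLD failed SSH password attempts
# in the given log file (THRESHOLD defaults to 3).
flag_bruteforce() {
  local log="$1" threshold="${2:-3}"
  grep 'Failed password' "$log" \
    | sed -n 's/.* from \([0-9a-fA-F.:]*\) port .*/\1/p' \
    | sort | uniq -c \
    | awk -v t="$threshold" '$1 >= t { print $2 }'
}
```

On Ubuntu this would typically be pointed at /var/log/auth.log. In standard crontab notation a 30-minute schedule is */30 * * * * and a 12-hour schedule is 0 */12 * * *; the syntax Hermes cron itself expects is not shown in this post.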

Key Numbers

58GB shared VRAM on Intel Arc 140T
130K context window (Sushi 9B)
9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
48GB VRAM headroom at 130K ctx
8.24 t/s decode speed (Sushi 9B)
166 t/s prefill speed (Sushi 9B)
190 t/s prefill speed (DS-V4-Flash)
~36-37s per generation turn (Sushi 9B at 256 max_tokens)
0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
9+ GGUF models tested (9B through 35B parameters)
6+ months of continuous local inference development by Hermes Agent
Automated security monitoring — log analysis, intrusion detection, CVE feed monitoring, Discord alerts
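Two of these numbers can be cross-checked with simple arithmetic. The helper names below are made up for this example; the inputs are the figures listed above.

```shell
# Headroom in the shared pool: addressable VRAM minus usage at 130K context.
headroom_gb() {
  awk -v total="$1" -v used="$2" 'BEGIN { printf "%.1f\n", total - used }'
}

# Seconds needed to decode N tokens at a given tokens-per-second rate.
decode_seconds() {
  awk -v n="$1" -v tps="$2" 'BEGIN { printf "%.1f\n", n / tps }'
}

headroom_gb 58 9.7        # prints 48.3
decode_seconds 256 8.24   # prints 31.1
```

The first result matches the 48GB headroom figure. The second suggests that decoding accounts for about 31 of the ~36-37 seconds per turn; the remaining few seconds are presumably prefill and request overhead, which is an inference from these numbers and not a separate measurement.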

Demo / How to Replicate

The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.

Minimal setup:

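# 1. Clone and build llama.cpp with SYCL
# (load the oneAPI environment first; /opt/intel/oneapi is the default install path)
git clone https://github.com/ggerganov/llama.cpp
source /opt/intel/oneapi/setvars.sh
cd llama.cpp && cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx && cmake --build build --config Release -j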

# 2. Install Hermes Agent
pip install hermes-agent

# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1

# 4. Download and add your first model
# (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072

All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed the goals and validated the results. The agent executed every step, from build-flag surgery to final documentation.

Top comments (3)

Harjot Singh • May 31

Self-managing AI lab on an Intel Arc GPU is a fun setup - and the Arc choice is the interesting part: most local-LLM content assumes NVIDIA/CUDA, so getting a real agent loop running on Arc (oneAPI/IPEX, or Vulkan) is genuinely useful documentation for the people who bought into Arc on price and then hit the "everything assumes CUDA" wall. The compatibility story is half the value of a post like this.

The "self-managing" claim is the part I'd love detail on - true self-management (the agent monitoring/healing/restarting its own services) is the hard, interesting bit vs just "an agent that runs locally." If it actually recovers from its own failures, that's the durable pattern (and it needs verification/health-checks the agent can't fake - same gating discipline I lean on in Moonshift, a multi-agent pipeline shipping a prompt to a real SaaS). Cool build - how's Arc holding up for agent inference vs a comparable NVIDIA card, and what does "self-managing" concretely do when something crashes?

I am Starrzan • May 31

Thanks for the thoughtful comment. One of the reasons I wanted to post this was exactly as you said: the NVIDIA path is very well supported and mostly works "out of the box", but for Intel Arc GPUs, especially the 140T in my case, things just did not work easily. This is where Hermes played a big part: actually doing the research and the trial and error to get a working model. After many attempts, it finally got the right combination, with a lot of patching and rebuilding of core libraries in between, but it has been an interesting journey to see how it and AI solve problems.

As to your second point, this is an ongoing goal and process, and I am not there yet, but I have put many pieces in place so far. For one, I have a script that runs the backend and logs any issues, and a cron that regularly checks for them and attempts to fix them. Additionally, built into the script is llama-swap, which "in theory" should switch to another model if the current one fails, although I have not fully tested it or got it to work yet. So the idea here is a backup model to keep the machine going. The rest is all health, security and cleanup-related automations. So the question is, is this really self-managing/self-healing yet? Probably not, but I am on my way there and hope to write another post with a guide on how to achieve this in a similar way.

On the Arc vs NVIDIA front, I think NVIDIA is still king in terms of support, and this particular Arc, being an iGPU, does not help with VRAM, but it can run decently if you have the time and patience to tinker. So for now, the jury is still out.

I am Starrzan • May 31 • Edited

@harjjotsinghh Here is a more detailed follow-up post on how I actually got the Intel Arc 140T GPU to work with llama.cpp. dev.to/starrzan/running-local-llms...