
I am Starrzan


How I Built a Self-Managing AI Lab with Hermes Agent on an Intel Arc GPU

Hermes Agent Challenge Submission: Write About Hermes Agent

This is a submission for the Hermes Agent Challenge: Write About Hermes Agent


What I Built

A self-managing AI workspace powered by Hermes Agent, in which an autonomous agent runs the local inference stack on an Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micromanagement. The human directs goals; the agent executes everything.

Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.

The system manages:

  • Local LLM inference via llama.cpp on Intel Arc SYCL (iGPU)
  • Automated research pipelines feeding structured docs into a persistent vault
  • Multi-model testing and benchmarking — 9+ models across 9B to 35B parameters
  • Cron-driven monitoring — market data, system health, memory management
  • Self-maintaining skills — the agent updates its own skills and docs when things change

Architecture

[ User Goals ]
      │
      ▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
      │                    ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx) ← daily driver
      │                    ├── Qwen3-Coder-30B-A3B (65K ctx) ← coding specialist
      │                    ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx) ← reasoning
      │                    └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx) ← stable but reasoning-only
      │
      ├── research-vault/   (research & docs)
      └── hermes-config/    (skills, plugins, cron jobs)

The agent runs as a Hermes session with:

  • Persistent memory — notes about the environment, user preferences, tool quirks, project conventions
  • Durable skills — 40+ specialized procedures for devops, mlops, research, etc.
  • Toolsets — terminal, browser, file, cron, git, and more
  • Full system access — builds, debugs, tunes, and documents everything autonomously

GMKtec EVO-T1 Hardware

The host is a GMKtec EVO-T1 mini-PC:

  • CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
  • iGPU: Intel Arc 140T (8 Xe cores, 128 vector engines; shares system DDR5 as VRAM)
  • RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
  • Power: ~45W sustained under full load
  • Form factor: ~0.6L, pocketable

The Intel Arc 140T iGPU is the inference engine. With the llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. One build-level SYCL fix was required to prevent JIT compilation crashes at large context sizes: removing the -ze-intel-greater-than-4GB-buffer-required flag from the SYCL build and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu. The agent diagnosed and applied it.


How It Was Built

All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.

Step 1: Local Inference Server (llama.cpp on Intel Arc)

Built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, context sizing per model, and spec decode configuration.

The critical subtlety: different models need different context sizes. CTX_SIZE must be set per-model, not globally. A 9B coder model gets 130k; a 27B model gets 65k. The agent handles this via model-specific startup configs.
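
Per-model context sizing can be pictured as a small lookup in the launch script. This is a minimal sketch, not the agent's actual startup config: the function names, the `MODEL_DIR` default, the fallback size, and the exact token counts (131072 for "130K", 65536 for "65K") are assumptions, while `-m`, `-c`, `-ngl`, and `--port` are standard llama-server options.

```shell
# Hypothetical helper: map a model name to its context size in tokens.
ctx_for_model() {
  case "$1" in
    Qwen3.5-9B-*)      echo 131072 ;;  # 9B models: "130K" context (assumed 131072)
    Qwen3-Coder-30B-*) echo 65536  ;;  # larger models: "65K" context (assumed 65536)
    Qwen3.6-35B-*)     echo 65536  ;;
    *)                 echo 8192   ;;  # conservative default for unknown models (assumption)
  esac
}

# Print (rather than run) the llama-server command line for a model.
launch_cmd() {
  printf 'llama-server -m %s/%s.gguf -c %s -ngl 99 --port 8080\n' \
    "${MODEL_DIR:-./models}" "$1" "$(ctx_for_model "$1")"
}

launch_cmd "Qwen3.5-9B-Sushi-Coder-RL"
```

Keeping the size next to the model name means a model swap cannot silently inherit a context window sized for a smaller model.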

Major SYCL fix: The SYCL backend had a critical bug — the -ze-intel-greater-than-4GB-buffer-required linker flag in ggml-sycl/CMakeLists.txt caused JIT compilation failures on the CPU SYCL device when any operation fell back from GPU. Removing this flag and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu to restrict to GPU-only eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
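
The two parts of that fix can be sketched as a pre-build check. This is only an illustration under assumptions: `flag_present` and `check_cmake` are hypothetical helpers, and the CMakeLists.txt path is passed as an argument because its exact location depends on the llama.cpp checkout. Only the flag name and the `ONEAPI_DEVICE_SELECTOR` value come from the text above.

```shell
# Restrict the SYCL runtime to the Level Zero GPU device so that no
# operation falls back to the CPU SYCL device.
export ONEAPI_DEVICE_SELECTOR=level_zero:gpu

FLAG='-ze-intel-greater-than-4GB-buffer-required'

# Hypothetical helper: succeed if the flag still appears in the given file.
# -F treats the pattern as a fixed string; -e is needed because it starts with "-".
flag_present() {
  grep -q -F -e "$FLAG" "$1"
}

# Hypothetical helper: report whether the file still needs editing before a build.
check_cmake() {
  if flag_present "$1"; then
    echo "flag still present in $1; remove it before building"
    return 1
  fi
  echo "flag not found in $1"
}
```

A call such as `check_cmake path/to/ggml-sycl/CMakeLists.txt` before `cmake --build` would catch the case where an upstream update reintroduces the flag.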

Step 2: Hermes Agent Configuration

Configured Hermes with:

  • OpenRouter as default provider (cloud fallback)
  • Local llama-server as local provider (primary for privacy-bound work)
  • Skills system for recurring task patterns
  • Memory persistence across sessions
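
The split between a local primary and a cloud fallback can be pictured as a simple reachability probe. The sketch below is not Hermes Agent's routing logic, which this article does not describe; `choose_provider` and the provider labels are hypothetical. The only real interface it uses is llama-server's `/health` endpoint.

```shell
# Hypothetical helper: echo "local" if a llama-server answers at the base URL,
# otherwise echo "openrouter".
choose_provider() {
  base_url="${1:-http://localhost:8080}"
  # --fail makes curl return non-zero on HTTP errors (e.g. 503 while a model loads).
  if command -v curl >/dev/null 2>&1 &&
     curl --silent --fail --max-time 2 "$base_url/health" >/dev/null 2>&1; then
    echo local
  else
    echo openrouter
  fi
}

choose_provider "http://localhost:8080"
```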

Step 3: Cron Jobs for Automation

The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:

  • Market data monitoring (Polymarket, Kalshi feeds)
  • Workspace backup automation
  • Codebase quality scans
  • Security monitoring (SSH brute-force, system health, CVE feeds)

Step 4: Research Pipeline (research vault)

The agent does autonomous research and documents findings in a structured vault:

research-vault/
├── challenges/       # Dev challenge research, compatibility patches
├── research/         # Hardware, model, compatibility research
├── blogs/            # Technical blog articles
└── study/           # Learning notes, tutorials

Model Lineup

The system coordinates multiple GGUF models depending on task type:

| Model | Architecture | Params | Context | Quant | Role | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5-9B-Sushi-Coder-RL | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| Qwen3-Coder-30B-A3B | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| Qwen3.6-35B-UD-IQ4_NL | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| Qwen3.5-9B-DeepSeek-V4-Flash | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| Qwopus3.5-9B-Coder-MTP | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |

Why These Models

  • Sushi 9B is the only production-viable 9B model for agentic work on this hardware — passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, retained multi-turn context correctly
  • Coder 30B is a MoE model (30B total, 3B active parameters) so decode is fast despite the large parameter count — 11.52 t/s decode vs 8.24 t/s for the 9B model
  • DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output — 190 t/s prefill makes it fast for short prompts
  • 27B class models fill the gap between 9B and 35B — reasonable quality without the VRAM overhead of the larger model in the shared memory pool

Agentic Benchmark Results

Comprehensive agentic evaluations were run across all 9B models at 131K context:

| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|---|---|---|---|---|---|
| Sushi 9B | 6/6 | 0 | Yes (3/3) | 561s | Best |
| DS-V4-Flash | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| Qwopus MTP | 2/6 | 4 | No (0/3) | 256s | Broken |

Key Findings

Sushi 9B (production daily driver):

  • Only model to pass all 6 agentic tests without errors
  • Correct multi-turn context retention across 3 turns
  • Valid structured JSON output (T2: 3/3 score)
  • Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk within the 58GB shared pool)
  • Best instruction following (10 constraints, 4 paragraphs)

Qwopus MTP (deprecated):

  • 4 out of 6 tests returned HTTP 500 internal server errors
  • Garbled output containing mixed Chinese/English pseudotext
  • KV cache contamination — corrupted output poisons subsequent requests
  • This is a model quality issue in the MTP merge — not fixable by configuration

DS-V4-Flash (secondary):

  • Stable, but all output is in reasoning_content only (content field empty)
  • Coherent reasoning but cannot produce valid structured JSON in content
  • Fast prefill (190 t/s) but 8.24 t/s decode
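
The three failure modes above (HTTP 500, reasoning-only output, invalid JSON) can be told apart mechanically. The following is a rough sketch of such a check, not the benchmark suite itself: `classify_response` and its labels are hypothetical, it assumes an OpenAI-style response body with `choices[0].message.content` and `reasoning_content` fields, and it requires `jq`.

```shell
# Hypothetical classifier for one chat-completion result.
# $1 = HTTP status code, $2 = response body (OpenAI-style JSON). Requires jq.
classify_response() {
  status="$1"
  body="$2"
  if [ "$status" -ge 500 ]; then echo http_5xx; return; fi

  content=$(printf '%s' "$body" | jq -r '.choices[0].message.content // ""' 2>/dev/null)
  reasoning=$(printf '%s' "$body" | jq -r '.choices[0].message.reasoning_content // ""' 2>/dev/null)

  if [ -z "$content" ]; then
    # Empty content with non-empty reasoning is the "reasoning-only" case.
    if [ -n "$reasoning" ]; then echo reasoning_only; else echo empty; fi
    return
  fi

  # Structured output is counted as valid only if it parses to an object or array.
  if printf '%s' "$content" | jq -e 'type == "object" or type == "array"' >/dev/null 2>&1; then
    echo valid_json
  else
    echo invalid_json
  fi
}
```

The status and body could come from a `curl` call against the server's `/v1/chat/completions` endpoint; separating classification from the request keeps the check testable without a running model.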

Technical Decisions Validated

  1. Local-first, cloud-fallback: All inference runs local by default. Cloud only for models not running locally.
  2. Per-model context sizing: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
  3. Skills over prompting: Every recurring workflow is encoded as a skill file. The system maintains itself.
  4. Git-backed vault: All research auto-commits to GitHub. The workspace is the artifact.
  5. Automated security monitoring: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord — the workspace defends itself.

Security Infrastructure

The server runs automated security monitoring set up by Hermes Agent:

  • UFW firewall — default deny incoming, SSH only from LAN + Tailscale
  • fail2ban — auto-ban after 3 failed SSH attempts
  • Cron: security-monitor — every 30 min, checks brute-force, new devices, firewall, services, gateway
  • Cron: vulnerability-feed-monitor — every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
  • Discord alerts — CRITICAL and HIGH severity findings posted automatically
  • Pentest tools — nmap, masscan, tcpdump, arp-scan, netcat, wireshark
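
As an illustration of the kind of brute-force check a security-monitor job might run, the sketch below counts failed SSH password attempts per source IP. It is not the agent's actual script: `failed_ssh_ips` is a hypothetical helper, the log format assumed is the standard sshd "Failed password ... from IP port N" line, and the default threshold of 3 simply mirrors the fail2ban rule above.

```shell
# Hypothetical check: list source IPs with at least $2 failed SSH password
# attempts in log file $1 (threshold defaults to 3).
failed_ssh_ips() {
  awk -v min="${2:-3}" '
    /Failed password/ {
      # The source IP is the field that follows the word "from".
      for (i = 1; i < NF; i++) if ($i == "from") count[$(i + 1)]++
    }
    END { for (ip in count) if (count[ip] >= min) print ip, count[ip] }
  ' "$1" | sort
}

# Example call (log path is an assumption for an Ubuntu host):
#   failed_ssh_ips /var/log/auth.log
```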

Key Numbers

  • 58GB shared VRAM on Intel Arc 140T
  • 130K context window (Sushi 9B)
  • 9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
  • 48GB VRAM headroom at 130K ctx
  • 8.24 t/s decode speed (Sushi 9B)
  • 166 t/s prefill speed (Sushi 9B)
  • 190 t/s prefill speed (DS-V4-Flash)
  • ~36-37s per generation turn (Sushi 9B at 256 max_tokens)
  • 0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
  • 9+ GGUF models tested (9B through 35B parameters)
  • 6+ months of continuous local inference development by Hermes Agent
  • Automated security monitoring — log analysis, intrusion detection, CVE feed monitoring, Discord alerts
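
Two of these figures can be cross-checked with simple arithmetic. The helpers below are hypothetical and only restate numbers from the list: headroom is the shared pool minus usage, and decode time is tokens divided by decode speed.

```shell
# Hypothetical helpers; awk does the floating-point arithmetic.
headroom_gb() {      # $1 = shared pool (GB), $2 = usage (GB)
  LC_ALL=C awk -v total="$1" -v used="$2" 'BEGIN { printf "%.1f\n", total - used }'
}

decode_seconds() {   # $1 = generated tokens, $2 = decode speed (tokens/s)
  LC_ALL=C awk -v n="$1" -v rate="$2" 'BEGIN { printf "%.1f\n", n / rate }'
}

headroom_gb 58 9.7        # prints 48.3
decode_seconds 256 8.24   # prints 31.1
```

Decode alone accounts for roughly 31s of the reported ~36-37s per turn; the remainder is presumably prompt processing and request overhead, which the figures above do not break down.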

Demo / How to Replicate

The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.

Minimal setup:

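# 1. Clone and build llama.cpp with SYCL (requires the Intel oneAPI toolkit)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
source /opt/intel/oneapi/setvars.sh   # default oneAPI install path; adjust if yours differs
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j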

# 2. Install Hermes Agent
pip install hermes-agent

# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1

# 4. Download and add your first model
# (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072

All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human directed goals and validated results. The agent executed every step, from build-flag surgery to final documentation.
