This is a submission for the Hermes Agent Challenge: Write About Hermes Agent
What I Built
A self-managing AI workspace powered by Hermes Agent — where an autonomous agent runs the local inference stack on Intel Arc GPU, automates research and documentation, manages cron jobs, and coordinates multiple LLM backends without human micro-management. The human directs goals; the agent executes everything.
Hardware: GMKtec EVO-T1 mini-PC (Intel Core Ultra 9 285H, Intel Arc 140T iGPU, 64GB DDR5-5600) — a pocketable 45W system that runs autonomous AI agents 24/7.
The system manages:
- Local LLM inference via llama.cpp on Intel Arc SYCL (iGPU)
- Automated research pipelines feeding structured docs into a persistent vault
- Multi-model testing and benchmarking — 9+ models across 9B to 35B parameters
- Cron-driven monitoring — market data, system health, memory management
- Self-maintaining skills — the agent updates its own skills and docs when things change
Architecture
[ User Goals ]
│
▼
[ Hermes Agent ]─── llama-server (Intel Arc SYCL)
│ ├── Qwen3.5-9B-Sushi-Coder-RL (130K ctx) ← daily driver
│ ├── Qwen3-Coder-30B-A3B (65K ctx) ← coding specialist
│ ├── Qwen3.6-35B-UD-IQ4_NL (65K ctx) ← reasoning
│ └── Qwen3.5-9B-DeepSeek-V4-Flash (130K ctx) ← stable but reasoning-only
│
├── research-vault/ (research & docs)
└── hermes-config/ (skills, plugins, cron jobs)
The agent runs as a Hermes session with:
- Persistent memory — notes about the environment, user preferences, tool quirks, project conventions
- Durable skills — 40+ specialized procedures for devops, mlops, research, etc.
- Toolsets — terminal, browser, file, cron, git, and more
- Full system access — builds, debugs, tunes, and documents everything autonomously
GMKtec EVO-T1 Hardware
The host is a GMKtec EVO-T1 mini-PC:
- CPU: Intel Core Ultra 9 285H (Arrow Lake, 16 cores, up to 5.4GHz)
- iGPU: Intel Arc 140T (128 Xe cores, shares system DDR5 as VRAM)
- RAM: 64GB DDR5-5600 (~58GB addressable by GPU)
- Power: ~45W sustained under full load
- Form factor: ~0.6L, pocketable
The Intel Arc 140T iGPU is the inference engine. With llama.cpp SYCL backend and Intel oneAPI 2026.0, the agent runs GGUF models locally at 131K context. A critical kernel-level SYCL fix (removing the -ze-intel-greater-than-4GB-buffer-required CUDA-style linker flag and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu) was required to prevent JIT compilation crashes at large context sizes — diagnosed and applied by the agent.
How It Was Built
All implementation was done by Hermes Agent. The human directed high-level goals; the agent executed every technical step.
Step 1: Local Inference Server (llama.cpp on Intel Arc)
Built a llama.cpp inference server backed by Intel Arc SYCL. The server handles model loading, context sizing per model, and spec decode configuration.
The critical subtlety: different models need different context sizes. CTX_SIZE must be set per-model, not globally. A 9B coder model gets 130k; a 27B model gets 65k. The agent handles this via model-specific startup configs.
Major SYCL fix: The SYCL backend had a critical bug — the -ze-intel-greater-than-4GB-buffer-required linker flag in ggml-sycl/CMakeLists.txt caused JIT compilation failures on the CPU SYCL device when any operation fell back from GPU. Removing this flag and setting ONEAPI_DEVICE_SELECTOR=level_zero:gpu to restrict to GPU-only eliminated the RMS_NORM crash that prevented models from loading at 131K context. The agent found this, diagnosed it, and fixed it.
Step 2: Hermes Agent Configuration
Configured Hermes with:
- OpenRouter as default provider (cloud fallback)
- Local llama-server as local provider (primary for privacy-bound work)
- Skills system for recurring task patterns
- Memory persistence across sessions
Step 3: Cron Jobs for Automation
The agent uses Hermes cron to run scheduled research, commit/push cycles, and health checks:
- Market data monitoring (Polymarket, Kalshi feeds)
- Workspace backup automation
- Codebase quality scans
- Security monitoring (SSH brute-force, system health, CVE feeds)
Step 4: Research Pipeline (research vault)
The agent does autonomous research and documents findings in a structured vault:
research-vault/
├── challenges/ # Dev challenge research, compatibility patches
├── research/ # Hardware, model, compatibility research
├── blogs/ # Technical blog articles
└── study/ # Learning notes, tutorials
Model Lineup
The system coordinates multiple GGUF models depending on task type:
| Model | Architecture | Params | Context | Quant | Role | Notes |
|---|---|---|---|---|---|---|
| Qwen3.5-9B-Sushi-Coder-RL | Qwen 3.5 MoE | 9B | 130K | Q4_K_M | Daily driver | RL-tuned, best agentic quality, clean JSON output |
| Qwen3-Coder-30B-A3B | Qwen 3 MoE | 30B (3B active) | 65K | Q3_K_M | Coding specialist | Best decode throughput, strong at code generation |
| Qwen3.6-35B-UD-IQ4_NL | Qwen 3.5 MoE | 35B | 65K | UD-IQ4_NL | Reasoning | Highest reasoning quality, heavier VRAM cost |
| Qwen3.5-9B-DeepSeek-V4-Flash | Qwen 3.5 hybrid | 9B | 130K | Q4_K_M | Secondary | Fastest prefill, but output is reasoning-only (content field empty) |
| Qwopus3.5-9B-Coder-MTP | Qwen 3.5 w/ MTP | 9B | 8K effective | Q4_K_M | Deprecated | MTP merge caused KV cache contamination, garbled output |
Why These Models
- Sushi 9B is the only production-viable 9B model for agentic work on this hardware — passed all 6 agentic tests with 0 HTTP 500 errors, produced valid JSON, retained multi-turn context correctly
- Coder 30B is a MoE model (30B total, 3B active parameters) so decode is fast despite the large parameter count — 11.52 t/s decode vs 8.24 t/s for the 9B model
- DS-V4-Flash is useful for quick reasoning tasks where you don't need structured output — 190 t/s prefill makes it fast for short prompts
- 27B class models fill the gap between 9B and 35B — reasonable quality without the VRAM overhead of the larger model in the shared memory pool
Agentic Benchmark Results
Ran comprehensive agentic evaluations across all 9B models at 131K context:
| Model | Tests Pass | HTTP 500 | JSON Valid | Total Time | Quality |
|---|---|---|---|---|---|
| Sushi 9B | 6/6 | 0 | Yes (3/3) | 561s | Best |
| DS-V4-Flash | 6/6 | 0 | No (0/3) | 592s | Reasoning-only |
| Qwopus MTP | 2/6 | 4 | No (0/3) | 256s | Broken |
Key Findings
Sushi 9B (production daily driver):
- Only model to pass all 6 agentic tests without errors
- Correct multi-turn context retention across 3 turns
- Valid structured JSON output (T2: 3/3 score)
- Correct VRAM calculations (all 9B models: ~9.7GB at 130K ctx, no OOM risk on 58GB headroom)
- Best instruction following (10 constraints, 4 paragraphs)
Qwopus MTP (deprecated):
- 4 out of 6 tests returned HTTP 500 internal server errors
- Garbled output containing mixed Chinese/English pseudotext
- KV cache contamination — corrupted output poisons subsequent requests
- This is a model quality issue in the MTP merge — not fixable by configuration
DS-V4-Flash (secondary):
- Stable, but all output is in reasoning_content only (content field empty)
- Coherent reasoning but cannot produce valid structured JSON in content
- Fast prefill (190 t/s) but 8.24 t/s decode
Technical Decisions Validated
- Local-first, cloud-fallback: All inference runs local by default. Cloud only for models not running locally.
- Per-model context sizing: Context window sizes are model-specific, not global. This prevents OOM on the Arc GPU's shared VRAM.
- Skills over prompting: Every recurring workflow is encoded as a skill file. The system maintains itself.
- Git-backed vault: All research auto-commits to GitHub. The workspace is the artifact.
- Automated security monitoring: The agent watches for intrusions, monitors CVE feeds, and posts alerts to Discord — the workspace defends itself.
Security Infrastructure
The server runs automated security monitoring set up by Hermes Agent:
- UFW firewall — default deny incoming, SSH only from LAN + Tailscale
- fail2ban — auto-ban after 3 failed SSH attempts
- Cron: security-monitor — every 30 min, checks brute-force, new devices, firewall, services, gateway
- Cron: vulnerability-feed-monitor — every 12 hours, CVE monitoring for Ubuntu, kernel, Docker, Freebox OS
- Discord alerts — CRITICAL and HIGH severity findings posted automatically
- Pentest tools — nmap, masscan, tcpdump, arp-scan, netcat, wireshark
Key Numbers
- 58GB shared VRAM on Intel Arc 140T
- 130K context window (Sushi 9B)
- 9.7GB total VRAM usage at 130K ctx for 9B models (weights + KV cache)
- 48GB VRAM headroom at 130K ctx
- 8.24 t/s decode speed (Sushi 9B)
- 166 t/s prefill speed (Sushi 9B)
- 190 t/s prefill speed (DS-V4-Flash)
- ~36-37s per generation turn (Sushi 9B at 256 max_tokens)
- 0 HTTP 500 errors across 6 agentic tests (Sushi 9B)
- 9+ GGUF models tested (9B through 35B parameters)
- 6+ months of continuous local inference development by Hermes Agent
- Automated security monitoring — log analysis, intrusion detection, CVE feed monitoring, Discord alerts
Demo / How to Replicate
The entire setup — llama.cpp SYCL build, Hermes Agent config, benchmark suite, and documentation — was built and maintained by Hermes Agent.
Minimal setup:
# 1. Clone and build llama.cpp with SYCL
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp && cmake -B build -DGGML_SYCL=ON && cmake --build build
# 2. Install Hermes Agent
pip install hermes-agent
# 3. Configure local server
hermes config set providers.local.base_url http://localhost:8080/v1
# 4. Download and add your first model
# (example: Qwen3.5-9B at Q4_K_M quantization)
hermes models add --alias coder --path ./models/your-model.gguf --context-size 131072
All local model research, SYCL GPU debugging, production inference setup, benchmark design, security hardening, and this blog article were implemented by Hermes Agent. The human-directed goals and validated results. The agent executed every step — from kernel flag surgery to final documentation.
Top comments (0)