Sodiq Jimoh

Posted on May 23

I Build ML Infrastructure for a Living — Here's Why Hermes Agent Changes the Game for Platform Engineers

#hermesagentchallenge #devchallenge #agents #kubernetes

Hermes Agent Challenge Submission: Write About Hermes Agent

I've spent the past year building NeuroScale — an open-source AI inference platform on Kubernetes. 108 commits. 21 automated smoke checks across 6 milestones. The kind of platform where a developer fills in a Backstage form and gets a production-grade inference endpoint with drift control, policy guardrails, and cost attribution — no kubectl required.

I'm telling you this because I need you to understand where I'm coming from when I say: Hermes Agent isn't just another AI coding assistant. It's the first agent framework that actually thinks like a platform engineer.

I don't say that lightly.

The Problem Nobody Talks About: AI Agents Are Stateless in a Stateful World

Building ML infrastructure teaches you one thing fast: everything is state.

Your ArgoCD sync status is state. Your Kyverno policy violations are state. The drift between what's in Git and what's running in the cluster — state. The fact that someone ran kubectl apply directly at 2am and broke the GitOps contract — that's state too.

Every AI agent I've used before Hermes treats each conversation like a blank canvas. You explain your architecture. You describe the problem. You get a plausible answer. Then you close the tab and do it all over again tomorrow.

Groundhog Day for infrastructure debugging.

Hermes Agent is architecturally different, and the difference matters specifically for the kind of work platform engineers do.

Three-Layer Memory: What It Actually Means for Infrastructure

Most people writing about Hermes focus on the memory system as a convenience feature. "It remembers your preferences." "It knows your name."

That's not what makes it interesting.

Hermes runs a three-layer memory architecture:

Short-term — current conversation context (same as every other agent)
Medium-term — session summaries that persist between conversations, built through periodic "memory nudges"
Long-term — Skill Documents that capture how it solved specific types of problems, stored as reusable procedures

For a platform engineer, this maps directly to something we already understand: runbooks.

When I troubleshoot an ArgoCD sync failure, I don't start from first principles. I check the runbook. Token expiry? Webhook misconfiguration? Sync wave ordering? The runbook encodes prior incident resolution as a procedure.

Hermes does this automatically. After roughly 15 tasks, its GEPA loop kicks in (GEPA is short for Genetic-Pareto, a reflective prompt-evolution method published at ICLR 2026 as an Oral): it reviews its own performance, identifies patterns, and writes new Skill Documents. Agents with 20+ self-generated skills reportedly complete similar future tasks 40% faster than fresh instances.

That's not "remembering your name." That's an agent building its own runbook library. It's the difference between a junior on-call engineer and a senior who's seen every failure mode before.

Where Hermes Creates Real Value in an ML Platform Stack

Abstract possibilities are cheap. Let me be specific about where this matters in a stack like NeuroScale.

1. Configuration Drift Diagnosis

NeuroScale uses ArgoCD with selfHeal: true — drift is auto-corrected. But detecting drift before ArgoCD catches it, and understanding why it happened, is a different problem.

Here's what a Hermes scheduled audit looks like in practice:

hermes task add --cron "0 */6 * * *" \
  "Check the diff between Git-declared state in infrastructure/apps/ \
   and live cluster state. If they diverge, summarize what changed, \
   correlate with recent Kubernetes API audit logs, and flag whether the \
   change was human-initiated or a controller reconciliation. \
   Send results to Telegram."

Most agents can run a diff. Hermes does the part that matters: building a pattern library over time. After a month of audits, it knows that drift in the serving-stack namespace is almost always a Knative autoscaler update (harmless), while drift in kyverno/policies/ is almost always someone bypassing admission control (critical).

That context accumulates in Skill Documents. I haven't seen another agent framework that does this out of the box.
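To make the idea concrete, here is a minimal sketch of what such a pattern library boils down to, assuming drift findings can be keyed by a namespace or path prefix. `DriftPattern`, `DRIFT_PATTERNS`, and `classify_drift` are hypothetical names for illustration, not part of Hermes; the two entries mirror the examples above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DriftPattern:
    prefix: str       # namespace or Git path prefix the pattern applies to
    severity: str     # "harmless", "critical", or "unknown"
    usual_cause: str  # what past audits found behind this kind of drift


# Hypothetical pattern table; the entries mirror the two examples in the text.
DRIFT_PATTERNS = [
    DriftPattern("serving-stack", "harmless", "Knative autoscaler update"),
    DriftPattern("kyverno/policies/", "critical", "admission control bypass"),
]


def classify_drift(location: str) -> DriftPattern:
    """Return the first known pattern matching the location, else an 'unknown' pattern."""
    for pattern in DRIFT_PATTERNS:
        if location.startswith(pattern.prefix):
            return pattern
    return DriftPattern(location, "unknown", "no prior pattern, needs human review")
```

Anything that falls through to "unknown" is exactly the case where a human should look first, and where a new pattern would later be recorded.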

Here's what a drift report from Hermes actually looks like after a few weeks of accumulated context:

📋 Drift Audit — 2026-05-23 12:00 UTC

Cluster: neuroscale-prod
Namespaces scanned: 4

✅ serving-stack: 2 diffs detected
   → Both are Knative autoscaler reconciliations (harmless)
   → Matches pattern from Skill: "knative-autoscaler-drift"
   → No action required.

⚠️ kyverno/policies: 1 diff detected
   → ClusterPolicy "require-resource-limits" modified in-cluster
   → Not present in Git (infrastructure/policies/)
   → Kubernetes audit log: manual apply by user "ops-admin" at 03:12 UTC
   → FLAGGED: Possible admission control bypass.
   → Recommend: Revert in-cluster change or commit to Git.

📎 Context: This is the 3rd manual policy edit in 14 days.
   Previous incidents resolved by reverting. See Skill:
   "kyverno-drift-response" for standard procedure.

Notice the last three lines. That's not a generic diff. That's an agent referencing its own operational history — correlating today's anomaly with patterns it learned from previous audits. A fresh agent instance can't do that. One with a month of Skill Documents can.
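The "3rd manual policy edit in 14 days" line is just a windowed count over stored incident timestamps. A minimal sketch, assuming the agent keeps a timestamp per past incident; `count_recent` is a hypothetical helper, and every date below except the 03:12 UTC edit from the report is invented for illustration.

```python
from datetime import datetime, timedelta
from typing import List


def count_recent(incidents: List[datetime], now: datetime, window_days: int = 14) -> int:
    """Count incidents that fall inside the trailing window ending at `now`."""
    cutoff = now - timedelta(days=window_days)
    return sum(1 for ts in incidents if cutoff <= ts <= now)


# Illustrative history: only the last timestamp comes from the report above.
history = [
    datetime(2026, 4, 30, 9, 0),    # outside the 14-day window
    datetime(2026, 5, 11, 22, 40),
    datetime(2026, 5, 17, 1, 5),
    datetime(2026, 5, 23, 3, 12),   # the manual apply flagged in the report
]
recent = count_recent(history, now=datetime(2026, 5, 23, 12, 0))
```

With this history, `recent` is 3, which is the number the report surfaces as context.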

2. Policy Validation Before Merge

NeuroScale enforces 5 Kyverno ClusterPolicies, including required resource limits, standard labels, non-root containers, and no :latest tags. But a violation caught at admission means the deploy has already been rejected. The earlier you catch it, the cheaper the fix.

This is where Skill Documents become genuinely powerful. You write one that encodes your specific policies:

# Skill: NeuroScale Policy Pre-Check
## When to Use
When reviewing PRs that modify files under `apps/` or `infrastructure/`.
## Procedure
1. Check for `owner` and `cost-center` labels on all InferenceService manifests
2. Verify `resources.requests` and `resources.limits` are set
3. Flag any image tag that is `latest` or missing
4. Verify `securityContext.runAsNonRoot: true`
## Known False Positives
- ClusterServingRuntime objects are exempt from label requirements

That's not a prompt. It's a procedural memory document — loaded on-demand, zero tokens until needed, self-improving based on new violations it discovers.
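For comparison, the same four checks can be sketched as plain code. This is a rough illustration under stated assumptions, not how Hermes evaluates a skill: `precheck` and `image_tag` are hypothetical helpers, and the caller is assumed to have already pulled the kind, labels, and container specs out of the manifest.

```python
from typing import Dict, List, Optional

REQUIRED_LABELS = ("owner", "cost-center")
LABEL_EXEMPT_KINDS = {"ClusterServingRuntime"}  # the known false positive


def image_tag(image: str) -> Optional[str]:
    """Return the tag of an image reference, or None if it has no tag."""
    name = image.split("@", 1)[0]    # drop any digest suffix
    last = name.rsplit("/", 1)[-1]   # ignore a registry host:port
    return last.split(":", 1)[1] if ":" in last else None


def precheck(kind: str, labels: Dict[str, str], containers: List[dict]) -> List[str]:
    """Return human-readable findings; an empty list means the manifest passes."""
    findings = []
    if kind not in LABEL_EXEMPT_KINDS:
        for label in REQUIRED_LABELS:
            if label not in labels:
                findings.append(f"missing label: {label}")
    for c in containers:
        name = c.get("name", "<unnamed>")
        resources = c.get("resources", {})
        for field in ("requests", "limits"):
            if not resources.get(field):
                findings.append(f"{name}: resources.{field} not set")
        tag = image_tag(c.get("image", ""))
        if tag is None or tag == "latest":
            findings.append(f"{name}: image tag is latest or missing")
        if c.get("securityContext", {}).get("runAsNonRoot") is not True:
            findings.append(f"{name}: securityContext.runAsNonRoot is not true")
    return findings
```

Note that this sketch reports digest-only image references as untagged. The point of the Skill Document is that the agent can carry this procedure, plus the exemptions, without anyone maintaining a script like this by hand.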

3. Incident Response That Compounds

Real scenario from NeuroScale development: Backstage went into a CrashLoop. Root cause was a token refresh issue with the Kubernetes service account. I documented it in INCIDENT_BACKSTAGE_CRASHLOOP_RCA.md.

With Hermes running persistently — which you can do on a $5 VPS or a serverless backend that hibernates when idle — it would have:

Detected the CrashLoop via scheduled health check
Correlated with recent changes (cert rotation? secret update?)
Checked its Skill Documents for prior Backstage incidents
Either resolved it or escalated with a structured diagnosis

Next time a similar issue occurs, it resolves faster because the skill from incident #1 already exists. That's the compounding effect that makes experienced SREs more valuable over time — now encoded in an agent's memory.
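The "check prior skills, then resolve or escalate" step can be sketched as a simple keyword-overlap lookup. This is a deliberately naive illustration, assuming each skill is tagged with symptom keywords; `find_skill`, `handle_incident`, and the skill name below are hypothetical, and Hermes may select skills quite differently.

```python
from typing import Dict, Optional, Set


def find_skill(symptoms: Set[str], skills: Dict[str, Set[str]]) -> Optional[str]:
    """Return the skill sharing the most keywords with the symptoms, or None."""
    best_name, best_overlap = None, 0
    for name, keywords in skills.items():
        overlap = len(symptoms & keywords)
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


def handle_incident(symptoms: Set[str], skills: Dict[str, Set[str]]) -> str:
    """Apply a matching skill if one exists, otherwise escalate to a human."""
    skill = find_skill(symptoms, skills)
    if skill is None:
        return "escalate: no matching skill, attach structured diagnosis"
    return f"apply skill: {skill}"


# Hypothetical skill library containing one entry learned from incident #1.
skill_library = {
    "backstage-crashloop-token-refresh": {"backstage", "crashloop", "serviceaccount", "token"},
}
```

The first Backstage CrashLoop would escalate because the library is empty; the second would hit the skill written after the first.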

What Hermes Gets Right That Other Frameworks Don't

I've looked at the landscape — LangChain, CrewAI, AutoGen. Here's what Hermes gets structurally right for infrastructure:

Local-first data residency. Everything lives in a local SQLite database. For platform engineers working with cluster credentials and deployment configs, this isn't a feature, it's a prerequisite. Prompts still go to whichever model endpoint you configure, so pairing this with a self-hosted model is what keeps policy violations out of someone else's API.

Terminal backends that work. Seven backends including local, Docker, SSH, and serverless options. SSH means Hermes runs commands on your actual infrastructure. Docker means you can sandbox it. Serverless means it hibernates when idle, wakes on demand. This is infrastructure-native thinking, not "here's a chat UI that can run Python."

Built-in cron scheduling. Natural-language-configured scheduled tasks with delivery to Telegram, Discord, Slack, or Signal. For infrastructure monitoring, this is table stakes — and Hermes is one of the few agent frameworks that ships it natively, no external cron daemon or YAML required.

200+ model support. Switch between cheap models for routine audits and powerful ones for complex diagnosis with a single command. No code changes. Operational flexibility that platform engineers actually need.

What Hermes Doesn't Solve (Yet)

Honesty about limitations matters more than hype when we're talking about tools that touch production infrastructure.

Domain reasoning is shallow. Hermes can follow procedures and build skill documents, but it can't replace a senior engineer's intuition about why a particular autoscaler configuration causes cascading latency under specific traffic patterns. The skill system captures what to do, not why it works.

Multi-cluster coordination is manual. NeuroScale runs on a single cluster. For federated infrastructure across regions, Hermes' per-instance memory doesn't federate. Each agent builds its own skill library independently. There's no skill-sharing protocol between agents yet.

Approval workflows need hardening. The --yolo flag bypasses all approval prompts. For infrastructure work, that's terrifying. The approval system needs declarative rules about what the agent can and cannot do — something like Kyverno's admission policies, not just per-command approve/deny. The tools/ directory has approval pinning in progress, but it's not production-ready for high-stakes operations.
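Here is a minimal sketch of what I mean by declarative rules, assuming commands can be matched against an ordered allow/deny list before they run. `RULES` and `decide` are hypothetical and are not Hermes' approval system; the patterns are examples only.

```python
from fnmatch import fnmatchcase

# Ordered, first-match-wins rules. Hypothetical examples, not a Hermes config format.
RULES = [
    ("deny", "kubectl delete *"),
    ("deny", "kubectl apply *"),
    ("allow", "kubectl get *"),
    ("allow", "kubectl describe *"),
    ("allow", "argocd app diff *"),
]


def decide(command: str, rules=RULES) -> str:
    """Return 'allow', 'deny', or 'ask' (human approval) for a command string."""
    for action, pattern in rules:
        if fnmatchcase(command, pattern):
            return action
    return "ask"  # default: anything unmatched needs a human
```

String matching like this is only a starting point: a command such as `kubectl get pods; rm -rf /` would match an allow pattern, so a real policy engine has to parse the command rather than glob it. The design choice that matters is the default. Unmatched commands fall back to asking, never to running.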

The Bigger Picture: Agents as Infrastructure Primitives

Here's the perspective I haven't seen anyone else articulate.

Hermes Agent isn't just a tool for platform engineers. It's a new kind of infrastructure primitive.

Think about the trajectory: manual server management → configuration management → infrastructure as code → GitOps → platform engineering. Each layer abstracted the layer below and added intelligence.

Hermes represents the next step: infrastructure as conversation. Not in the shallow "chat with your cluster" sense. In the sense that an agent with persistent memory, self-improving procedures, and scheduled automation can become a layer in your control plane.

A layer that:

Observes continuously (cron + terminal access)
Learns from incidents (GEPA → Skill Documents)
Enforces patterns (skill-driven validation)
Communicates across channels (Telegram/Slack/Discord)
Costs almost nothing when idle (serverless backends)

That's not a chatbot. That's an operator — in the Kubernetes sense of the word.

Why I'm Betting on Hermes

The tools that win aren't the ones with the most features. They're the ones with the right architecture for compounding.

ArgoCD won over manual deploys because GitOps compounds — every deployment is auditable, reproducible, reversible. Kyverno won over manual policy checks because admission policies compound — every new policy protects every future deployment.

Hermes Agent's architecture compounds the same way. Every task makes it better at the next one. Every incident resolution becomes a skill document. Every audit pattern becomes a scheduled automation.

164,000 GitHub stars in under three months. MIT licensed. Runs on a $5 VPS. Data stays on your machine.

For platform engineers who've spent years building systems that self-heal, self-monitor, and self-govern — Hermes Agent is the first AI framework that actually speaks our language.

I'm Sodiq, and I build ML infrastructure platforms. NeuroScale is open source: github.com/sodiq-code/neuroscale-platform — PRs welcome. If you want to see how Hermes Agent could fit into a real Kubernetes-based ML platform, that's where I'd start.

Star the repo if this perspective was useful. And if you've tried Hermes against your own infrastructure — what broke first? I want to know.
