Vishal Keerthan

Posted on May 18

The Self-Trust Problem in Hermes Agent's Skill Architecture

#hermesagentchallenge #devchallenge #agents #opensource


This is a submission for the Hermes Agent Challenge

Hermes Agent is one of the most architecturally serious open-source agent frameworks to emerge in 2026. The three-layer memory system, the GEPA self-evolution engine presented at ICLR 2026, local skill persistence with no telemetry, provider-agnostic routing across 400+ models — these are substantive engineering decisions, not feature marketing.

This post is not a feature overview. It's an examination of a structural tension that runs beneath all of those features: the system that generates knowledge is also the sole judge of whether that knowledge is valid. After working through the architecture documentation and the public GitHub issue tracker in depth, I think this tension is sharper and more layered than it first appears — and understanding it precisely is what separates building well with Hermes from quietly accumulating a skill library full of confident, stale, or structurally brittle knowledge.

The Compounding Mechanism and Its Hidden Assumption

When Hermes completes a complex task, it can persist what it learned as a skill — a Markdown file in ~/.hermes/skills/ encoding the approach, edge cases, and domain knowledge the task required. The next time a similar task arrives, the agent loads that skill rather than reasoning from scratch. Skills compound: agents with 20+ self-generated skills complete similar tasks 40% faster than fresh instances.

That benchmark is real. But it measures speed, not correctness. It does not assess whether the skills encode sound approaches or fortunate ones. It does not account for whether those skills remain valid as APIs deprecate, model versions change, or project requirements evolve.

A skill system without external validation does not compound quality. It compounds confidence. These are meaningfully different things, and the difference is the subject of Issue #25833, which states the structural problem directly:

"The agent is simultaneously the author, executor, and quality inspector of its own skills. There is no external validation point or consistency check."

This is not a bug that can be cleanly patched. It is a property of how autonomous self-improvement works. What follows are the concrete ways it surfaces.

Where the Tension Shows Up

Transient Failures Encoded as Permanent Lessons

Issue #6051 — now fixed, but worth understanding — documented the following: when the agent encountered a transient failure such as a terminal timeout or a network error, it would encode the lesson as a skill. The lesson was not "this tool failed under specific transient conditions." It was "this tool does not work in this context." Permanently.

The result was an agent that progressively avoided tools it had briefly failed with. The fix was a prompt adjustment instructing the background reviewer not to capture transient failures. Issue #25833 points out what that fix does not address: the underlying mechanism — write skill, use skill, no re-validation — is unchanged. Prompt instructions can drift. Future changes can override them. The structural gap remains.
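One structural alternative to a prompt instruction is a deterministic gate in front of the skill writer. The sketch below is not Hermes code: `TRANSIENT_ERRORS`, `is_transient`, and `lesson_for` are hypothetical names, and it assumes failures reach the gate as Python exceptions. It only shows the shape of the check, where a transient failure never becomes a permanent lesson.

```python
from typing import Optional

# Assumption: these built-in exception types stand in for "transient" failures.
# A real classifier would need a richer taxonomy than this.
TRANSIENT_ERRORS = (TimeoutError, ConnectionError)


def is_transient(error: BaseException) -> bool:
    """Return True for failures that say nothing lasting about the tool."""
    return isinstance(error, TRANSIENT_ERRORS)


def lesson_for(tool: str, error: BaseException) -> Optional[str]:
    """Return a lesson worth persisting, or None if the failure was transient."""
    if is_transient(error):
        # A timeout or dropped connection is not evidence that the tool
        # "does not work in this context", so nothing is written.
        return None
    return f"{tool} failed with {type(error).__name__}: {error}"
```

Because the gate is code rather than prompt text, it cannot drift when the reviewer prompt changes.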

The Runtime Loop Cannot Classify Its Own Failures

Issue #22112 documents a gap that cuts deeper than the skill system.

The Hermes architecture documentation describes a hard cap of 90 turns per task as a "deterministic circuit breaker" against runaway loops. Issue #22112 shows that this cap is not enough at the execution layer. When the agent hits repeated terminal timeouts (during routine directory enumeration across an external volume, for instance), it does not classify the failure and escalate. It retries the same sequence, silently consuming context and API budget until the turn cap terminates it.

The missing piece is a low-level escalation path: a mechanism that recognizes repeated equivalent failures as a failure class, halts, and surfaces a structured diagnostic rather than continuing to retry. The framework currently has no deterministic guardrail at the runtime layer capable of making that distinction.
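A minimal sketch of what such an escalation path could look like, assuming failures can be reduced to a (tool, error kind) pair. `FailureTracker`, `EscalationRequired`, and the threshold of three repeats are all hypothetical; none of them come from the Hermes codebase.

```python
from collections import Counter
from dataclasses import dataclass, field


class EscalationRequired(Exception):
    """Raised when the same failure class repeats too often to keep retrying."""


@dataclass
class FailureTracker:
    # Hypothetical threshold, not a Hermes setting.
    max_repeats: int = 3
    counts: Counter = field(default_factory=Counter)

    def record(self, tool: str, error_kind: str) -> None:
        """Count a failure and halt once an equivalent failure has repeated."""
        # Two failures are treated as equivalent when the same tool
        # fails in the same way.
        key = (tool, error_kind)
        self.counts[key] += 1
        if self.counts[key] >= self.max_repeats:
            raise EscalationRequired(
                f"{tool} failed with {error_kind} {self.counts[key]} times; "
                "halting instead of retrying"
            )
```

The point of the design is that the halt happens after a handful of equivalent failures, long before a turn cap would fire, and the exception message doubles as the structured diagnostic.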

This exposes a meaningful architectural gap. GEPA — the self-evolution engine — operates by reading execution traces and identifying why tasks failed. That is only useful when the execution traces are clean enough to analyze. A sophisticated reasoning layer sitting on top of a runtime loop that cannot survive a network timeout produces traces too noisy to evolve from. Evolution cannot substitute for fault tolerance.

Skills Have No Record of When They Were True

Consider this scenario from Issue #25833:

A user requests a data processing pipeline. The agent builds one using an API endpoint current at the time and creates a skill encoding the approach. Three weeks later, the endpoint is deprecated. The user makes the same request. The agent loads the skill, generates code pointing at the broken endpoint, and the task fails — with nothing in the system tracing the failure back to the skill as its source.

The skill carries no last_verified_at timestamp, no created_with_model_version field, no expiration metadata. The agent is not making anything up. It is confidently applying knowledge that used to be true.

The proposed schema from the issue:

runtime:
  model_created: "claude-sonnet-4-20250514"
  execution_count: 0
  success_rate: 0.0
  last_verified_at: null
  consistency_score: null

With this metadata, skills with low success rates could receive lower priority when skills are selected for injection into the prompt. Skills created against older model versions could be flagged for re-verification. Skills never verified after creation could be surfaced as experimental rather than loaded silently as ground truth. None of this exists today. These are open feature requests.
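To make the idea concrete, here is a sketch of how a loader might consume those fields. The field names follow the proposed schema; everything else (`SkillRuntime`, `skill_status`, `injection_priority`, the 30-day window, and the 10-execution evidence cap) is an assumption chosen for illustration, not part of the proposal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class SkillRuntime:
    model_created: str
    execution_count: int = 0
    success_rate: float = 0.0
    last_verified_at: Optional[datetime] = None


def skill_status(meta: SkillRuntime, current_model: str, now: datetime,
                 max_age: timedelta = timedelta(days=30)) -> str:
    """Return 'experimental', 'needs_reverification', or 'trusted'."""
    if meta.last_verified_at is None:
        return "experimental"              # never verified after creation
    if meta.model_created != current_model:
        return "needs_reverification"      # written against another model
    if now - meta.last_verified_at > max_age:
        return "needs_reverification"      # verification has gone stale
    return "trusted"


def injection_priority(meta: SkillRuntime) -> float:
    """Scale success rate by how much evidence backs it (hypothetical rule)."""
    evidence = min(meta.execution_count, 10) / 10
    return meta.success_rate * evidence
```

Under these assumptions, a skill with a 90% success rate over two runs ranks below one with 80% over twenty, which is the humility about thin evidence the issue is asking for.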

Memory Synthesis Can Quietly Overwrite Explicit Policy

Honcho, the memory subsystem, runs asynchronous multi-pass dialectic reasoning after each conversation turn. It deduces the user's preferences, communication style, and working patterns and synthesizes them into a persistent user model across three sequential passes — from gap analysis to reconciliation to final synthesis. This is sophisticated architecture that genuinely solves problems that pure vector retrieval systems cannot.

The problem documented in Issue #17583 is that the dialectic engine cannot reliably distinguish between an explicit user directive and an inferred behavioral pattern. If a user has stated "never use Python 3.9 under any circumstances due to legacy dependency conflicts," and the agent subsequently observes the user working in Python 3.9 in some adjacent context, the synthesis process may reclassify that hard constraint as a soft preference. The engine is built to find consistent patterns, and observed behavior can outweigh stated instruction.

There is no hierarchical policy layer — no mechanism by which manually authored directives carry immutable weight over dialectically synthesized observations. The result is gradual, untraceable behavioral drift. A constraint clearly set weeks ago may have silently softened with no record of when or why it changed.
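A hierarchical policy layer could be as small as a store that records where each rule came from and refuses to let inference overwrite instruction. The sketch below is hypothetical throughout: `Source`, `Rule`, and `PolicyStore` are illustrative names and do not describe Honcho's actual data model.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional


class Source(Enum):
    EXPLICIT = "explicit"   # stated directly by the user
    INFERRED = "inferred"   # synthesized from observed behavior


@dataclass(frozen=True)
class Rule:
    key: str
    value: str
    source: Source


class PolicyStore:
    def __init__(self) -> None:
        self._rules: Dict[str, Rule] = {}
        self.rejected: List[Rule] = []  # audit trail of blocked overwrites

    def apply(self, rule: Rule) -> bool:
        """Store the rule unless it is an inference targeting an explicit rule."""
        existing = self._rules.get(rule.key)
        if (existing is not None
                and existing.source is Source.EXPLICIT
                and rule.source is Source.INFERRED):
            self.rejected.append(rule)
            return False
        self._rules[rule.key] = rule
        return True

    def get(self, key: str) -> Optional[Rule]:
        return self._rules.get(key)
```

An explicit directive can still be replaced by a later explicit directive, and every blocked inference lands in `rejected`, so a softened constraint would at least leave a trace.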

The Evolution Engine Can Fail Without Saying So

GEPA, the Genetic-Pareto prompt evolution engine, is the intended structural answer to the self-grading problem. Rather than asking the agent to evaluate its own performance, GEPA uses an external reflection model to read execution traces, identify why tasks failed, and generate targeted mutations to the relevant skill instructions. Research by Agrawal et al. (2025) reports that it outperforms weight-space reinforcement learning approaches in complex agentic scenarios, achieving up to 19% performance improvement with up to 35× fewer rollouts.

Two issues in the evolution repository document serious problems with the current implementation.

Issue #38 documented an architectural flaw in early versions of the Phase 1 SkillModule that prevented GEPA from mutating actual skill content at all. The evolution loop ran, produced no mutations, and gave no indication anything had gone wrong. Nothing evolved.

Issue #10 documents a separate failure mode: under certain DSPy 3.1+ configurations, GEPA silently falls back to MIPROv2 — an older, less capable optimizer — bypassing the Genetic-Pareto mechanisms entirely. This occurs due to configuration bugs involving the reflection_lm parameter and missing max_steps arguments. Constraint validators have also been observed throwing false positives on valid YAML structures, stalling the pipeline prematurely. In all of these cases, the system continues running without alerting the user that the external validation they are relying on is not actually operating.
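Both failure modes share one property: the run finishes and nothing says it did not do its job. A wrapper-level check can at least make that loud. The sketch below assumes the caller can obtain the skill text before and after a run and the name of the optimizer that actually executed. How to get that name from the evolution tooling is not shown and is an assumption, and `check_evolution_run` is a hypothetical helper, not part of GEPA or DSPy.

```python
import hashlib


class SilentEvolutionFailure(RuntimeError):
    """Raised when an evolution run finishes without doing what was asked."""


def content_digest(text: str) -> str:
    """Stable fingerprint of a skill's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def check_evolution_run(before: str, after: str,
                        expected_optimizer: str,
                        actual_optimizer: str) -> str:
    """Return the digest of the evolved skill, or raise if the run was hollow."""
    if actual_optimizer != expected_optimizer:
        # Covers the silent-fallback case: a different optimizer ran.
        raise SilentEvolutionFailure(
            f"expected {expected_optimizer}, but {actual_optimizer} ran"
        )
    if content_digest(before) == content_digest(after):
        # Covers the no-mutation case: the loop ran and changed nothing.
        raise SilentEvolutionFailure("run completed but skill content is unchanged")
    return content_digest(after)
```

An unchanged skill is not always a bug, since a run may legitimately find no better variant, so a caller might downgrade the second check to a warning. What matters is that the outcome is reported rather than assumed.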

An Undocumented Risk: Context Compression and Skill Provenance

This does not have a dedicated issue ticket, but it is embedded in the architecture documentation itself and its implications for the skill system are worth naming.

Hermes uses context_compressor.py, an aggressive lossy summarization module that activates as context window limits approach. It discards what the system classifies as low-signal transitional reasoning while preserving core deductive outputs — and the system itself determines what counts as low-signal.

In a long-running task where the approach evolves, earlier reasoning often explains why a particular direction was chosen over alternatives that seemed equally viable. When that reasoning is compressed away, subsequent decisions are made against an incomplete record of the task's own logic.

The interaction with the skill system is direct: a skill written from a compressed context encodes what was done without the why. A skill without its own reasoning cannot be safely extended, confidently modified, or trusted in edge cases that differ slightly from the original scenario. The agent's knowledge base can accumulate procedural steps that are locally correct but structurally brittle — correct for the exact scenario they were generated from, fragile everywhere adjacent to it.

The Curator: Where Maintenance Meets the Same Bias

Nous Research built the Curator as a background maintenance system for the skill library. It runs after 2+ hours of idle time or 7 days of inactivity, reviewing agent-authored skills, tracking usage frequency, marking stale ones unused for 30 days, archiving those unused for 90, and consolidating conflicting instructions.

The design is careful: snapshots before every pass, no automatic deletion, full recovery from ~/.hermes/skills/.archive/, and the ability to pin critical skills with hermes curator pin <skill>. The Curator never touches the 118 bundled skills — only the ones the agent generated.
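The age rules described above reduce to a small decision function. This is a sketch of the thresholds as stated (30 days to stale, 90 days to archive), not the Curator's implementation. `classify_skill` is a hypothetical name, and treating pinned skills as exempt from both transitions is an assumption about what pinning does.

```python
from datetime import datetime, timedelta

STALE_AFTER = timedelta(days=30)
ARCHIVE_AFTER = timedelta(days=90)


def classify_skill(last_used: datetime, now: datetime, pinned: bool = False) -> str:
    """Return 'active', 'stale', or 'archive' for an agent-authored skill."""
    if pinned:
        # Assumption: a pinned skill is never marked stale or archived.
        return "active"
    idle = now - last_used
    if idle >= ARCHIVE_AFTER:
        return "archive"
    if idle >= STALE_AFTER:
        return "stale"
    return "active"
```

Note what the function does not look at: whether the skill is correct. Usage recency is the only input, which is why a frequently used but wrong skill survives every pass.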

But the documentation acknowledges the system's known limitation directly:

"The agent tends toward self-congratulation. It almost always thinks it performed well, even when it didn't. Community feedback has confirmed this."

The Curator's LLM reviewer is the same probabilistic system that generated the skills it is now evaluating. The maintenance loop does not correct the self-congratulatory bias; it inherits it. The loop runs every seven days. The bias runs continuously.

GEPA is the intended structural fix for exactly this: an external reflection model that does not ask the agent to grade its own work. But GEPA lives in a separate repository, requires explicit setup, costs $2–$10 per optimization run, and is currently in a volatile alpha state with the blocking issues described above. The online learning loop (Curator) and the offline evolution loop (GEPA) are designed to be complementary. Right now, the decoupling between them means most users are relying entirely on the self-congratulatory system.

What Hermes Actually Got Right

The failure modes above are real, but the honest picture of the system is not negative. The most architecturally significant thing Hermes built is something that deserves more attention than it gets in benchmark discussions: verifiable knowledge persistence.

Skills are Markdown files on disk. They can be read, edited, and diffed. When the Curator makes a decision, it produces a rationale that can be audited. This is fundamentally different from learning that happens inside model weights — weight-based systems offer no equivalent transparency. You cannot open a file and read what a neural network learned from a task. With Hermes, you can.

That transparency is the correct foundation. The current gap is not that the system lacks insight into its own behavior. It is that the validation infrastructure has not kept pace with the generation infrastructure. The agent can produce skills faster than it can verify them. The knowledge base grows faster than the quality controls around it.

Issue #10666 proposes skill quality tiers — core, recommended, experimental — based on verification status and usage history. This is the right direction. It would let the system be appropriately humble about freshly generated knowledge while still giving full weight to skills that have been battle-tested across many executions and verified over time.

The longer-term answer may be at the community level: if skills become portable, signable, peer-reviewed artifacts — analogous to what is beginning to emerge with Cursor rules or Claude Code plugins — then external validation replaces self-validation entirely. That is a structurally stronger trust model than any individual agent reviewing its own outputs.

What This Means in Practice

Pin your critical skills. Use hermes curator pin <skill> for any skill that is load-bearing in your workflow. The Curator's autonomous maintenance decisions should not touch knowledge your processes depend on.

Treat agent-generated skills as drafts by default, especially anything touching external APIs, file system paths, or version-specific tooling. The skill may be weeks old and the knowledge it encodes may have quietly expired.

Audit your explicit policies periodically. Any hard constraint you have set in the system deserves a manual review every few weeks. Given Issue #17583, you cannot assume the dialectic engine has preserved it unchanged.

Check the archive directory. ~/.hermes/skills/.archive/ is not a graveyard. It is a record of what the system decided to retire. What it archived may still be valuable; what it chose to keep may already be stale.

Run GEPA on your highest-frequency skills. The alpha is rough and the setup requires effort, but it is the only external validator currently available in the ecosystem. If you care about skill quality rather than just task speed, the setup cost is worth it.

Read your skills. They are Markdown files. If the agent encoded a flawed approach, it is there to see. The transparency that makes this system auditable is only useful if you actually audit it.
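As a starting point for that audit, a short script can list the skills that have not been modified recently. It assumes skills are stored as `.md` files under `~/.hermes/skills/` and uses file modification time as a rough proxy for age. `skills_older_than` and the 21-day window are hypothetical choices, and the script only reads the directory.

```python
import time
from pathlib import Path
from typing import List, Optional


def skills_older_than(root: Path, days: int,
                      now: Optional[float] = None) -> List[Path]:
    """List Markdown files under root not modified in the last `days` days."""
    if not root.is_dir():
        return []
    now = time.time() if now is None else now
    cutoff = now - days * 86400
    old = []
    for path in sorted(root.rglob("*.md")):
        # Skip archived skills; they are already retired.
        if ".archive" in path.relative_to(root).parts:
            continue
        if path.stat().st_mtime < cutoff:
            old.append(path)
    return old


if __name__ == "__main__":
    # Assumption: three weeks is a reasonable window for re-reading a skill.
    for skill in skills_older_than(Path.home() / ".hermes" / "skills", days=21):
        print(skill)
```

Modification time is a weak signal, because a skill can be old and still correct, but it gives a reading list ordered by something other than the agent's own opinion of its work.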

The Honest Assessment

Hermes Agent is a serious piece of work. The self-improving skill loop is a genuine architectural innovation. The GEPA evolution engine represents a meaningful departure from both static prompt engineering and weight-space reinforcement learning, when it functions as designed. The local-first, privacy-preserving deployment model is the right infrastructure bet for where AI development is heading.

The structural self-trust problem is also real. Skills accumulate without provenance metadata. The runtime loop cannot gracefully handle low-level failures. The memory system has no mechanism to protect explicit directives from being overwritten by behavioral inference. The evolution engine can fail silently. These are not surface-level issues.

The deepest tension in the architecture is between probabilistic reasoning and deterministic execution. A system sophisticated enough to write its own cognitive instructions needs to be robust enough to survive a network timeout. A system that builds behavioral models from conversation needs mechanisms to preserve explicit human directives against that synthesis. Evolution cannot substitute for fault tolerance. Inference cannot substitute for verification.

What makes this project worth taking seriously, beyond the benchmarks, is the quality of the issue tracker. It is detailed, technically honest, and full of proposals for mechanism-level solutions rather than prompt patches. The problems documented here are known, named, and being worked on by a community that understands them clearly. Building well with Hermes right now means building with that understanding — not despite it.

Top comments (3)

Mike Czerwinski • Jun 22

"Compounding confidence, not compounding quality" is the sharpest formulation of this I've seen — and it cuts deeper than "agent gets faster on similar tasks." Speed measured against unknown correctness is the brand of failure that doesn't show up in benchmarks because the benchmark is the rate, not the truth.

The schema you propose (last_verified_at, execution_count, success_rate, consistency_score) tracks almost exactly with what I've been building independently from the decision-store side: a lifecycle (proposed → accepted → locked → rejected → superseded → stale) plus verifiable_by pointers plus replaced_by chains on supersession. The convergence is real — Hermes skills and decision-store entries are different artifact types but the same architectural shape: file-on-disk, status-tracked, provenance-anchored, operator-authored. That you arrived at the equivalent schema from issue-tracker reading and that I arrived from running my own pipeline suggests it's not a design choice; it's the only shape that survives.

One floor up, though — and this came up sharply in a parallel cluster of conversations this week: GEPA's external reflection model is suspect for the same structural reason the Curator is. If the external reviewer is another LLM trained on similar data, you don't get independent validation; you get one signal wearing two hats. The honest second view has to be either operator-authored (signature on a hash-bound artifact) or genuinely structurally disjoint (deterministic check that doesn't depend on a model's interpretation at all). Anything else is the self-trust problem moved up one floor.

Question back: do you read the convergence — Hermes skills, decision schemas, Claude Code plugins, Cursor rules all landing on file-based, status-tracked, peer-reviewable artifacts — as the problem-space dictating the shape, or as vendors copying each other? The answer changes whether community-level peer review is the natural endpoint or just one possible direction.

Vishal Keerthan • Jun 23

I lean toward the problem space dictating the shape. The constraints are tight enough that there isn't much design freedom left. If knowledge needs to be readable, editable, versioned, and portable across model versions, file-based artifacts with metadata become the obvious choice. The fact that independent teams keep arriving at similar designs makes me think this is convergence driven by constraints, not vendors copying each other.

Mike Czerwinski • Jun 23

Constraints-driven-convergence is the formulation, yes. The cluster I keep watching this week ships at three different scales (peer-stack, prize-winning skill architectures, ByteDance opensource at 70k+ stars) and arrives at the same first half of the answer through different paths — readable/editable/versioned/portable as base constraints, file-based artifacts with metadata as inevitable shape. What's distinctive is what nobody has shipped yet across those three scales: verifiable_by diagnostics on the artifacts, fault-injection as a discipline applied to the harness itself, supersede chains that resolve forward against reality rather than their own previous wording. The infra-layer convergence is closing; the verification layer above it is open. Looking forward to where Apprentice lands once the ingestion shape settles.