DEV Community

Levash0v

I Turned Hermes Agent into a Verifiable Agent Operating System

Hermes Agent Challenge Submission: Build With Hermes Agent

What I Built

I did not build another chatbot.

I built a memory hygiene system around Hermes Agent: a workflow that tells the agent what to remember, what to turn into a skill, what to write into the repo, what to track in a task system, and what to leave behind.

The core idea is simple:

Agent memory is not one bucket.

Long-running agent work breaks when chat history, global memory, project state, reusable procedures, task ownership, and public side effects are treated as the same thing. They have different lifetimes. Putting all of them into “memory” creates drift.

So I built a small repo-local harness and operating discipline around Hermes Agent.

Hermes Agent is the local agent runtime I use for tool-calling work: terminal commands, file edits, browser/search workflows, persistent memory, reusable skills, scheduled jobs, and gateway integrations.

Multica is the external task layer I use for active work ownership and routing. In this setup, it replaced local Hermes Kanban as the source of truth for current tasks.

The system separates agent work into durable layers:

| Layer | Responsibility |
| --- | --- |
| Hermes memory | Stable facts only |
| Hermes skills | Reusable procedures |
| Repo files | Project-local state and conventions |
| Multica | Task ownership and routing |
| Session search | Historical recall |
| Human approval | External side effects |

The operating rule:

Memory for stable facts. Skills for reusable procedures. Repos for project state. Multica for task ownership. Session search for history. Human approval for side effects.

That turns Hermes from a chat assistant into a small agent operating layer.

Before / after

| Before | After |
| --- | --- |
| Task state buried in chat | Task state lives in Multica |
| Reusable fixes lost in history | Reusable fixes become Hermes skills |
| Project rules mixed with global memory | Project rules live in AGENTS.md / CLAUDE.md |
| Agent repeats setup mistakes | Skills + repo harness reduce rediscovery |
| Local Kanban drifts from reality | Multica becomes the source of truth |
| Claims of completion are implicit | Evidence reports verify artifacts |

The important shift is not more memory. It is routing each kind of state to the layer with the right durability.

The lowest durable layer rule

The key rule is:

Store information in the lowest layer that is durable enough for its expected lifetime.

Examples:

  • A stable user preference goes to Hermes memory.
  • A repeated procedure becomes a Hermes skill.
  • A project convention goes to AGENTS.md or CLAUDE.md.
  • Current task ownership belongs in Multica.
  • Historical context can stay in session search.
  • Public side effects require human approval.

This keeps memory useful instead of turning it into a junk drawer.
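The examples above amount to a routing table. Here is a minimal sketch of that idea in Python. The category labels (`stable_preference`, `repeated_procedure`, and so on) and the `route` function are hypothetical names chosen for illustration; they are not part of Hermes Agent or Multica.

```python
from enum import Enum


class Layer(Enum):
    """The durable layers named in this post."""
    MEMORY = "Hermes memory"
    SKILL = "Hermes skill"
    REPO = "Repo files (AGENTS.md / CLAUDE.md)"
    TASKS = "Multica"
    HISTORY = "Session search"
    APPROVAL = "Human approval"


# Hypothetical category labels; the mapping mirrors the examples above.
ROUTES = {
    "stable_preference": Layer.MEMORY,
    "repeated_procedure": Layer.SKILL,
    "project_convention": Layer.REPO,
    "task_ownership": Layer.TASKS,
    "historical_context": Layer.HISTORY,
    "public_side_effect": Layer.APPROVAL,
}


def route(kind: str) -> Layer:
    """Return the layer for a kind of information, or fail loudly."""
    try:
        return ROUTES[kind]
    except KeyError:
        # Unclassified information is rejected instead of defaulting to memory.
        raise ValueError(f"unclassified information kind: {kind!r}") from None


if __name__ == "__main__":
    for kind in ROUTES:
        print(f"{kind:20} -> {route(kind).value}")
```

The one deliberate choice in this sketch is that unknown kinds raise an error. A silent default to memory is exactly how the junk drawer forms.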

Demo

The architecture is intentionally small:

Multica task layer ←→ Hermes Agent ←→ Session search
                         ↓
                  Evidence Loop
        Intent → Action → Artifact → Verification → Report
                         ↓
              Human Approval Gate, if external
                         ↓
              publish / send / deploy / push

Durable layers:
- Hermes memory: stable facts only
- Hermes skills: reusable procedures
- Repo harness: project-local state

[Image: architecture diagram of Hermes as a verifiable agent operating system, with Multica, session search, the evidence loop, the human approval gate, memory, skills, and the repo harness]

Hermes routes work through durable layers, then through an evidence loop. External side effects stop at the Human Approval Gate.

The concrete task was:

Create a repeatable convention for repo-local agent state, verify it, and keep task ownership outside chat.

The workflow:

  1. A Multica issue defined the work.
  2. Hermes recovered prior context through session search.
  3. Hermes wrote the repo-local harness files:
     • AGENTS.md
     • CLAUDE.md
     • agent-progress.md
     • AGENT_LESSONS.md
     • session-handoff.md
     • feature_list.json
     • .agent-harness/validate_feature_list.py
  4. Reusable procedures were promoted into Hermes skills.
  5. Project-specific state stayed in the repository.
  6. Active ownership stayed in Multica.
  7. The harness was verified with tests and a validator command.

[Image: Multica task board for the Hermes Agent Operating System project, with the repo harness and validator tasks completed and the skill promotion and DEV.to submission tasks in progress]

Task ownership in Multica: repo harness setup and validator test suite are done, while skill promotion and the DEV.to submission are still in progress.

The point is not that an agent edited files. The point is that the workflow forced each kind of information into the correct durability layer.

Evidence loop

The workflow uses this loop:

Intent -> Tool action -> Artifact -> Verification -> Evidence report -> Approval if external

Examples:

  • A repo update is verified by reading the changed file or checking the diff.
  • A harness update is verified by running tests.
  • A task completion is verified by a Multica comment or linked artifact.
  • A reusable procedure is verified by a committed Hermes skill.
  • A public action, like pushing a repo or publishing a post, stops at the approval gate.

This changes the agent contract from “trust me, I did it” to “here is the artifact and here is how it was verified.”
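One way to make that contract concrete is to treat each report as a small record. The sketch below is illustrative only: the `EvidenceReport` class and its field names are assumptions made for this example, not a format that Hermes Agent or Multica defines.

```python
from dataclasses import dataclass


@dataclass
class EvidenceReport:
    """Hypothetical record for one pass through the evidence loop."""
    intent: str          # what the agent set out to do
    action: str          # the tool action it took
    artifact: str        # path, diff, task comment, or other pointer
    verification: str    # how the artifact was checked
    verified: bool       # did the check pass?
    external: bool = False   # True for push / publish / deploy / send
    approved: bool = False   # set only by a human

    def may_claim_done(self) -> bool:
        """Completion requires verification, plus approval if external."""
        if not self.verified:
            return False
        if self.external and not self.approved:
            return False
        return True

    def render(self) -> str:
        """Format the report as plain text for a task comment."""
        status = "DONE" if self.may_claim_done() else "NOT DONE"
        return "\n".join([
            f"Status: {status}",
            f"Intent: {self.intent}",
            f"Action: {self.action}",
            f"Artifact: {self.artifact}",
            f"Verification: {self.verification}",
        ])


if __name__ == "__main__":
    report = EvidenceReport(
        intent="Update harness validator",
        action="Edited validator and ran tests",
        artifact="agent_harness/ (see diff)",
        verification="python3 -m unittest discover -s tests -v",
        verified=True,
    )
    print(report.render())
```

In this sketch an external action that passed verification still reports NOT DONE until a human sets `approved`, which matches the approval gate in the loop.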

Code

Repository: https://github.com/Levash0v/verifiable-agent-harness

The public artifact is intentionally small, but it has a real project shape:

templates/      AGENTS.md, CLAUDE.md, handoff files
examples/       feature_list.example.json
agent_harness/  validator
tests/          validator tests
docs/           evidence loop, diagram, article draft

Each repository gets a small operating contract.

Excerpt from templates/AGENTS.md:

# Agent Guide

This repository uses a repo-local agent harness. Treat these files as source of truth for agent work state:

- feature_list.json
- agent-progress.md
- session-handoff.md
- AGENT_LESSONS.md

## Startup protocol

1. Run `pwd`.
2. Run `git status --short --branch`.
3. Read this file and `CLAUDE.md` if present.
4. Read `feature_list.json`, `agent-progress.md`, `session-handoff.md`, and `AGENT_LESSONS.md`.
5. Run `python .agent-harness/validate_feature_list.py`.
6. Pick one unfinished feature only.

That contract means the next agent session does not need to reconstruct the project from chat. The repository carries its own operating state: current features, verified progress, and repo-specific lessons.
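Step 4 of the startup protocol only works if the state files exist, so a session can check for them before doing anything else. The following is a minimal sketch of such a pre-flight check. The `missing_state_files` helper is hypothetical and is not part of the published harness; only the file names come from the excerpt above.

```python
import tempfile
from pathlib import Path

# File names taken from the AGENTS.md excerpt above.
STATE_FILES = (
    "feature_list.json",
    "agent-progress.md",
    "session-handoff.md",
    "AGENT_LESSONS.md",
)


def missing_state_files(repo_root: str) -> list[str]:
    """Return the harness state files that are absent from repo_root."""
    root = Path(repo_root)
    return [name for name in STATE_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Demo against a throwaway directory that holds only two of the files.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "feature_list.json").write_text("{}")
        (Path(tmp) / "agent-progress.md").write_text("")
        missing = missing_state_files(tmp)
        if missing:
            print("Missing state files:", ", ".join(missing))
        else:
            print("All state files present.")
```

A check like this could run before the validator in step 5, so a half-initialized repository fails early with a readable message.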

The repo is not only documentation. It has an executable validator path:

python3 -m agent_harness validate examples/feature_list.example.json
python3 -m unittest discover -s tests -v

[Image: terminal output showing the agent harness validator passing and four unit tests completing successfully]

The harness is executable: the feature list validator passes, and the test suite verifies both valid and invalid project-state files.

This is deliberately small. The goal is to make the convention executable and testable instead of purely narrative.
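To give a sense of what "executable and testable" can mean here, the sketch below validates a feature list against a made-up schema. The schema is an assumption for illustration only: a top-level `features` list whose entries carry `id`, `description`, and `status`. The real validator in the repository may check different fields.

```python
import json

# Assumed schema, for illustration only; see the repository for the real one.
REQUIRED_KEYS = ("id", "description", "status")
ALLOWED_STATUS = {"todo", "in_progress", "done"}


def validate_feature_list(data) -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    if not isinstance(data, dict):
        return ["top level must be a JSON object"]
    features = data.get("features")
    if not isinstance(features, list):
        return ["'features' must be a list"]

    errors = []
    seen_ids = set()
    for i, feat in enumerate(features):
        if not isinstance(feat, dict):
            errors.append(f"features[{i}]: must be an object")
            continue
        for key in REQUIRED_KEYS:
            if key not in feat:
                errors.append(f"features[{i}]: missing '{key}'")
        status = feat.get("status")
        if "status" in feat and status not in ALLOWED_STATUS:
            errors.append(f"features[{i}]: invalid status {status!r}")
        fid = feat.get("id")
        if "id" in feat:
            if not isinstance(fid, str):
                errors.append(f"features[{i}]: 'id' must be a string")
            elif fid in seen_ids:
                errors.append(f"features[{i}]: duplicate id {fid!r}")
            else:
                seen_ids.add(fid)
    return errors


if __name__ == "__main__":
    sample = json.loads(
        '{"features": [{"id": "F1", "description": "validator", "status": "done"},'
        ' {"id": "F1", "status": "finished"}]}'
    )
    for err in validate_feature_list(sample):
        print("ERROR:", err)
```

Returning a list of errors, not raising on the first one, lets a single run report every problem in the file, which suits an agent that has to fix them all before continuing.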

My Tech Stack

  • Hermes Agent — agent runtime, memory, skills, tools, session search, scheduled jobs, and gateways
  • Multica — active task ownership and routing
  • Python — repo harness validator
  • unittest — validation tests
  • Markdown — repo-local operating contracts
  • JSON — machine-readable feature state
  • Git / GitHub — versioned repo artifacts and proof trail
  • DEV.to — publication and challenge submission

How I Used Hermes Agent

Hermes Agent powered the project as the orchestrator and verifier.

I used Hermes memory only for stable facts: user preferences, environment facts, and long-lived workflow conventions.

I used Hermes skills as procedural memory: repo harness setup, publication workflow, clean-state checks, task handoff patterns, and debugging or routing procedures discovered during work.

I used session search for historical recall: prior decisions, old implementation attempts, and context reconstruction before updating a repo or task.

I used Hermes tools for concrete work: reading and editing files, running terminal commands, checking diffs, executing validators, and verifying test output.

Repo-local state lives in files such as:

AGENTS.md
CLAUDE.md
feature_list.json
agent-progress.md
AGENT_LESSONS.md
session-handoff.md
clean-state-checklist.md
evaluator-rubric.md
.agent-harness/validate_feature_list.py

Multica handles active task ownership and routing: what is being worked on, who owns it, what needs approval, and what result was reported back.

External side effects remain gated: GitHub pushes, DEV.to publishing, social posts, Discord messages, infrastructure deploys, and irreversible task comments.

Hermes can draft, edit, verify, and stage. The human approves the public action.
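The gate itself can be very small. Below is a minimal sketch, assuming the four external action kinds from the architecture diagram (publish, send, deploy, push). The `run_gated` function and the `ApprovalRequired` exception are hypothetical names for this example and are not a Hermes Agent API.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Action kinds that leave the local machine, as listed in the diagram.
EXTERNAL_ACTIONS = {"publish", "send", "deploy", "push"}


class ApprovalRequired(Exception):
    """Raised when an external action is attempted without human approval."""


def run_gated(kind: str, action: Callable[[], T],
              approved_by: Optional[str] = None) -> T:
    """Run action() unless it is external and nobody has approved it."""
    if kind in EXTERNAL_ACTIONS and not approved_by:
        raise ApprovalRequired(f"'{kind}' needs human approval before it runs")
    return action()


if __name__ == "__main__":
    # Local work passes straight through.
    print(run_gated("edit", lambda: "file staged"))

    # External work stops at the gate until a human signs off.
    try:
        run_gated("push", lambda: "pushed")
    except ApprovalRequired as exc:
        print("Blocked:", exc)

    print(run_gated("push", lambda: "pushed", approved_by="human reviewer"))
```

The gate in this sketch wraps the action; it does not advise the agent. An unapproved push never executes, so the safety property does not depend on the agent remembering the rule.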

The biggest change was operating discipline:

  • Hermes stopped using global memory as a scratchpad.
  • Repeated fixes became skills instead of disappearing into chat history.
  • Project rules moved into repo-local files.
  • Task ownership moved from local Kanban to Multica.
  • Completion claims became evidence-backed reports.

This made the system less magical and more reliable.

Limitations

This is not a full agent platform by itself.

  • The harness validates conventions, not semantic correctness.
  • Multica is an external coordination layer, not required by the repo template.
  • Human approval is still required for external effects.
  • Evidence quality depends on disciplined updates to files, tasks, and skills.

That is intentional. The system is boring at the boundaries because those boundaries are where long-running agents usually fail.

Next steps

Next, I want to add more validators, richer handoff examples for Hermes / Claude Code / Codex, a stricter approval protocol, and more examples of skill promotion from repeated work.

The lesson I took from this build is simple:

Agent memory should be designed like infrastructure, not treated like a magic notebook.

Hermes gave me the primitives: memory, skills, tools, session search, scheduled jobs, and gateways.

The harness turns those primitives into an operating discipline.
