Davincc77

Posted on May 23

One Soul, Any Model: Portable Memory for Open-Source Agents with .klickd

#hermesagentchallenge #devchallenge #agents #ai


This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

What I Built

I built a prototype integration between Hermes Agent and .klickd, an open portable memory format for AI agents.

The problem I wanted to explore is simple:

A new agent session often pays again to rediscover context that already exists.

That repeated context cost shows up as:

re-explaining project state;
reloading constraints;
rediscovering previous decisions;
rebuilding handoff notes;
rerunning tests just to find the same failure;
losing track of which actions require human approval.

.klickd is designed to turn that repeated context into a portable, encrypted, versioned file that an agent can load before work starts.

Hermes Agent is a good fit for testing this because it is an open-source, self-hosted agent runtime with skills, plugins, hooks, approvals, local execution, and agentic workflow orchestration.

In this project:

Hermes runs the workflow. .klickd carries the state.

The prototype centers on the Context Cost Benchmark, which compares two modes:

Baseline cold start: the full context is pasted into the prompt every time.

.klickd-loaded mode: structured context is loaded from a .klickd fixture and injected into the agent workflow.

The benchmark is designed to measure:

repeated input tokens;
output tokens;
estimated cost;
latency;
continuity errors;
violations of locked decisions;
violations of tool permissions;
handoff quality;
unnecessary reruns of expensive commands.
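
To make those measurements concrete, one run could be captured in a record like the one below. This is a minimal sketch: the class name and every field name are assumptions chosen to mirror the list above, not the schema the repository's runner actually writes.

```python
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class RunMetrics:
    """One benchmark run. Field names are illustrative assumptions."""

    mode: str  # e.g. "baseline" or "klickd"
    input_tokens: int
    output_tokens: int
    estimated_cost: float
    latency_seconds: float
    continuity_errors: int = 0
    locked_decision_violations: int = 0
    tool_permission_violations: int = 0
    unnecessary_reruns: int = 0
    handoff_quality: Optional[float] = None  # left unset until it is scored

    def total_violations(self) -> int:
        """Locked-decision and tool-permission violations combined."""
        return self.locked_decision_violations + self.tool_permission_violations
```

Keeping each run as a flat record means `asdict(run)` can be serialized directly into a JSONL line or a CSV row.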

The goal is not to claim a magic percentage improvement. The goal is to measure, reproducibly:

How many tokens and errors are we paying for simply because the agent has to rediscover state we already produced?

Demo

For the Hermes Agent Challenge, I created an experimental Hermes integration inside the klickdskill repository.

The demo uses Hermes Agent to drive the local .klickd Context Cost Benchmark.

hermes_klickd_agent_session_messages_json

If the embedded agent session does not render correctly, here is the relevant Hermes output:

session_id: 20260523_004058_85115c

Existing artifacts from 2026-05-23 were used. No rerun was needed.

Token-proxy totals:
- Cold: 310
- Paste: 6570
- Klickd: 5270

Verified artifacts:
- report.md
- summary.csv
- raw_runs.jsonl
- artifacts/sample_test.log

No publishes, git pushes, or external tool calls were performed.

The live Hermes run used:

Hermes Agent v0.14.0
OpenRouter free model route
capped API key with no paid budget
local dry-run benchmark
no production deployment
no package publishing
no external posting

Hermes session:

20260523_004058_85115c

Hermes was asked to use the klickd-context-cost skill, inspect the benchmark outputs, and avoid rerunning work if durable artifacts already existed.

The key result:

Existing artifacts from 2026-05-23 were used. No rerun was needed.

That matters because one of the core ideas in .klickd v4 is that agents should not spend tokens or compute rediscovering output that already exists.

The dry-run produced these local artifacts:

benchmarks/context_cost/results/2026-05-23/
├── report.md
├── summary.csv
├── raw_runs.jsonl
└── artifacts/
    └── sample_test.log
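
The "do not rerun if durable artifacts exist" rule can be approximated with a simple presence check over that directory. The sketch below is an assumption about how such a check might look; the function names are hypothetical and are not taken from the Hermes skill or the benchmark runner.

```python
from pathlib import Path

# File names taken from the results tree above.
REQUIRED_ARTIFACTS = (
    "report.md",
    "summary.csv",
    "raw_runs.jsonl",
    "artifacts/sample_test.log",
)


def missing_artifacts(results_dir: Path, required=REQUIRED_ARTIFACTS) -> list:
    """Return the required artifact paths that are not present as files."""
    return [name for name in required if not (results_dir / name).is_file()]


def needs_rerun(results_dir: Path) -> bool:
    """A rerun is only justified when at least one artifact is missing."""
    return bool(missing_artifacts(results_dir))
```

For the dry-run above, the check would be called with `Path("benchmarks/context_cost/results/2026-05-23")`. A presence check says nothing about whether the artifacts are stale, so a real gate would probably also compare timestamps or input hashes.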

The benchmark output was explicitly marked as a whitespace token proxy, not a provider-token measurement. This is important: these are not OpenAI, Anthropic, or OpenRouter tokenizer counts. They are deterministic local proxy values for early validation.
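
For readers unfamiliar with the idea, a whitespace token proxy can be as small as the function below. This is a sketch of the general technique, assuming the proxy simply counts whitespace-separated chunks; the runner in the repository may normalize or count differently.

```python
def whitespace_token_proxy(text: str) -> int:
    """Count whitespace-separated chunks as a deterministic stand-in for tokens."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())
```

The count is stable across machines and needs no tokenizer dependency, which is what makes it usable for early validation. It will not match any provider's tokenizer.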

Current dry-run totals:

Condition	Token-proxy total
Cold start	310
Full context pasted	6570
.klickd structured context	5270

The useful result is not “.klickd reduces cost by X%.” That would be premature.

The useful result is:

The benchmark harness can now compare repeated context strategies, produce raw evidence, persist artifacts, and let Hermes inspect those artifacts instead of rerunning the same work.

Verification artifacts

One lesson from real agent workflows is that agents often rerun expensive commands just to recover output they already produced.

The benchmark therefore includes a verification_artifacts[] pattern inspired by this idea:

command 2>&1 | tee .test-output/<scope>.log

Instead of rerunning the test suite to find a failure, the agent can inspect the persisted artifact:

grep -n FAIL .test-output/full.log
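
The same persist-then-query pattern can be sketched in Python for cases where the agent drives commands from a script rather than a shell. The helper names here are hypothetical, and unlike `tee` this version echoes the output only after the command has finished instead of streaming it.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Tuple


def run_and_persist(command: List[str], log_path: Path) -> int:
    """Run a command, save stdout and stderr to log_path, return the exit code."""
    log_path.parent.mkdir(parents=True, exist_ok=True)
    completed = subprocess.run(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # same effect as 2>&1
        text=True,
    )
    log_path.write_text(completed.stdout)
    sys.stdout.write(completed.stdout)  # echo like tee, but only after exit
    return completed.returncode


def grep_lines(log_path: Path, needle: str = "FAIL") -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs containing needle, like grep -n."""
    lines = log_path.read_text().splitlines()
    return [(n, line) for n, line in enumerate(lines, start=1) if needle in line]
```

Once the log exists, `grep_lines` answers "what failed?" without invoking the test suite again.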

In .klickd v4, that becomes structured state:

{
  "command": "npm test",
  "artifact_path": ".test-output/vitest.log",
  "status": "failed",
  "query_hint": "grep -n FAIL .test-output/vitest.log",
  "checked_at": "2026-05-23T00:00:00Z",
  "retention": "latest",
  "scope": "project"
}
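
An agent-side consumer of such a record might look like the sketch below. It assumes only the `artifact_path` key shown above; the function name is hypothetical, and the search is done in Python rather than by executing the `query_hint` string.

```python
from pathlib import Path
from typing import List, Optional


def failures_from_record(
    record: dict, root: Path, needle: str = "FAIL"
) -> Optional[List[str]]:
    """Read failures from the persisted artifact named in a verification record.

    Returns None when the artifact is missing, which is the only case where
    rerunning record["command"] would be justified.
    """
    artifact = root / record["artifact_path"]
    if not artifact.is_file():
        return None
    lines = artifact.read_text().splitlines()
    # Same "number:line" shape that grep -n prints.
    return [f"{n}:{line}" for n, line in enumerate(lines, start=1) if needle in line]
```

Distinguishing `None` (no evidence on disk) from an empty list (evidence exists and shows no failures) keeps the rerun decision explicit.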

This turns agent memory into something more operational:

what the agent knows;
what the agent must verify;
what the agent is not allowed to do without approval;
where the evidence lives;
what happened last time.

Code

Repository:

https://github.com/Davincc77/klickdskill

Hermes POC integration path:

integrations/hermes/
├── README.md
├── skill/
│   └── SKILL.md
├── plugin/
│   ├── plugin.yaml
│   └── __init__.py
├── scripts/
│   └── run_context_cost_benchmark.py
└── tests/

Context Cost Benchmark path:

benchmarks/context_cost/
├── RFC.md
├── runner.py
├── fixtures/
│   ├── baseline/
│   ├── klickd/
│   ├── prompts/
│   ├── validation/
│   ├── verification_artifacts/
│   └── edge_cases/
├── results/
└── tests/

Current benchmark pieces:

RFC-003: Context Cost Benchmark
local dry-run runner
fixture validation
deterministic token proxy
CSV / JSONL / Markdown reports
edge-case fixtures for:
- migration/version break;
- tool-call failure recovery;
- multi-session handoff.
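
As an illustration of the report step, raw runs can be written as JSONL and rolled up into a per-condition CSV using only the standard library. This is a sketch under assumptions: the `condition` and `token_proxy` keys and the column names are hypothetical, not the layout of the repository's `raw_runs.jsonl` or `summary.csv`.

```python
import csv
import io
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def to_jsonl(runs: Iterable[dict]) -> str:
    """Serialize each run as one JSON object per line."""
    return "".join(json.dumps(run, sort_keys=True) + "\n" for run in runs)


def summarize(runs: Iterable[dict]) -> Dict[str, int]:
    """Sum the token proxy per condition."""
    totals: Dict[str, int] = defaultdict(int)
    for run in runs:
        totals[run["condition"]] += run["token_proxy"]
    return dict(totals)


def to_csv(totals: Dict[str, int]) -> str:
    """Render per-condition totals as CSV text, sorted by condition name."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["condition", "token_proxy_total"])
    for condition in sorted(totals):
        writer.writerow([condition, totals[condition]])
    return buffer.getvalue()
```

Keeping the raw JSONL alongside the summary lets anyone recompute the totals instead of trusting the rolled-up numbers.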

The Hermes integration currently includes:

a Hermes-facing skill;
an experimental plugin scaffold;
a wrapper script that runs the local benchmark;
tests for the wrapper;
explicit safety constraints:
- no provider calls from the wrapper;
- no paid resources;
- no publishing;
- no production deployment;
- no secrets.

My Tech Stack

Hermes Agent: open-source, self-hosted agent runtime
https://github.com/NousResearch/hermes-agent

Hermes Agent docs
https://hermes-agent.app/en/docs

.klickd / klickdskill: portable encrypted AI context format
https://github.com/Davincc77/klickdskill

.klickd official page
https://klickd.app/klickdskill

Python SDK: local .klickd loading / saving

Current development install, until PyPI is updated:

pip install "git+https://github.com/Davincc77/klickdskill.git@main#subdirectory=packages/pypi/klickd"

Current Python import:

from klickd import load_klickd, save_klickd

GitHub Actions: test vectors and package integrity checks
CSV / JSONL / Markdown: benchmark reports
Local verification artifacts: persisted logs for agent inspection
OpenRouter free model route: used only to run the Hermes agent session for the demo

How I Used Hermes Agent

Hermes Agent is used as the workflow runner for the benchmark.

The .klickd file is not meant to replace Hermes memory or Hermes skills. Instead, it gives Hermes a portable external state artifact it can load before work starts.

Hermes is responsible for:

running the benchmark task;
reading fixture context;
executing local dry-run commands;
inspecting generated artifacts;
summarizing benchmark results;
respecting approval and verification boundaries.

.klickd is responsible for carrying:

project state;
locked decisions;
tool permissions;
handoff notes;
verification gates;
human veto rules;
claim sources;
verification artifacts.
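
To show how tool permissions and human veto rules could gate an action before it runs, here is a minimal sketch. The `tool_permissions` and `human_veto` keys, the `Decision` enum, and `check_action` are all hypothetical names for illustration; they are not the .klickd v4 schema or a Hermes API.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


def check_action(state: dict, tool: str, action: str) -> Decision:
    """Decide whether an action may run, given loaded state (assumed layout)."""
    allowed = state.get("tool_permissions", {}).get(tool, [])
    if action not in allowed:
        return Decision.DENY  # anything not explicitly permitted is denied
    if action in state.get("human_veto", []):
        return Decision.NEEDS_APPROVAL  # permitted, but a human must confirm
    return Decision.ALLOW
```

The sketch denies by default, so a missing or incomplete state file can never widen what the agent is allowed to do.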

This is useful because multi-agent systems need more than agent-to-agent communication.

If A2A defines how agents talk, .klickd explores what portable state they carry between tasks, tools, models, and sessions.

The Hermes integration is therefore not about making a chatbot remember more. It is about testing whether an open-source agent runtime can operate with structured, portable context instead of repeatedly reconstructing the same state.

The goal is to reduce:

repeated prompt context;
hallucinated continuations;
forgotten decisions;
unsafe actions;
unnecessary reruns;
handoff failures.

The larger idea is that agent memory should become infrastructure:

Portable state, explicit constraints, verification artifacts, and human approval boundaries.

In short:

Hermes runs the workflow. .klickd carries the state.

What I Learned

The first useful result was not a performance number. It was a workflow result.

Hermes correctly used the existing benchmark artifacts instead of rerunning the dry-run unnecessarily.

That matters because much agent waste is not token waste alone. It is also repeated-execution waste.

Agents often:

rerun tests to rediscover failures;
reread long logs from context;
rebuild state from previous messages;
regenerate summaries that already exist;
ask the model to infer what a file could have told it deterministically.

The benchmark and Hermes POC make that waste visible.

This also clarified the role of .klickd:

.klickd should not only remember preferences. It should help agents know:

what state exists;
what evidence exists;
what claims were executed, inspected, or assumed;
what actions require human approval;
what artifacts should be read before rerunning work.

That is why .klickd v4 is moving beyond portable memory toward a more operational layer:

portable encrypted context
+ project memory
+ verification gates
+ human veto
+ claim sources
+ verification artifacts
+ migration safety

Sources

Hermes Agent Challenge:

https://dev.to/challenges/hermes-agent-2026-05-15

Hermes Agent repository:

https://github.com/NousResearch/hermes-agent

Hermes Agent documentation:

https://hermes-agent.app/en/docs

.klickd / klickdskill repository:

https://github.com/Davincc77/klickdskill

.klickd official page:

https://klickd.app/klickdskill

Related article on preserving command output for agents:

https://dev.to/tacoda/dont-make-the-agent-re-run-the-test-suite-to-find-the-failure-427

Final Note

This is still early.

The benchmark does not yet claim provider-token savings. The current numbers are a deterministic local proxy. The next step is to run the same structure against real provider usage and compare actual input/output tokens, latency, and continuity failures.

But the architecture is now testable:

Hermes can act as the workflow runner.
.klickd can act as the portable state layer.
The benchmark can produce raw evidence.
Verification artifacts can prevent unnecessary reruns.
The system can evolve without breaking older .klickd files.

That is the direction I want to keep exploring.

One soul. Any model. Any agent.
