DEV Community: Levash0v

Agent Leaderboards Measure Score. We Added Price.

Levash0v — Tue, 07 Jul 2026 13:31:46 +0000

Agent benchmarks tell us who can solve a task. We built the missing layer: what each verified result cost.

Most leaderboard views answer only half the question

Agent benchmarks are everywhere now: SWE-bench, Terminal-Bench, OSWorld, GAIA. They show who scores highest — and some also expose cost at the run level.

But before hiring anyone — human or machine — the practical question is different:

What does it cost to get the task done?

Some of that data already exists. Princeton's HAL publishes recorded run costs alongside scores. Terminal-Bench 2.0 submissions include a result.json for individual task attempts, including dollar cost and pass/fail outcome.

What has been missing is a shared, task-level view that connects:

agent configuration → benchmark task → verified pass → recorded cost

We assembled that layer as part of the Agent Taxonomy Graph.

What we ingested

HAL (Princeton): 46 evaluation runs across 9 benchmarks, with reported run-level cost.
Terminal-Bench 2.0: 75 public submissions and 32,803 per-task result.json files, each carrying a recorded cost and pass/fail result.
Curated benchmark observations: merged into a dataset of 131 evaluation runs and 29 model configurations with both a score and a price attached.

Every run keeps its source URL and verification status. This is intended as an evidence layer, not a scraped leaderboard table.

Scope: run-level comparisons are directional, not a normalized ranking. Benchmark mix, harness design, retry policy, context limits, and cost accounting differ across public evaluations. The cleaner comparison is at the level of the same benchmark task and the same verifier pass condition.

What the money says

The spread is the story.

Across the evaluated configurations, recorded cost per benchmark run ranges from roughly $0.03 to more than $1,600: a spread of nearly five orders of magnitude for systems that can appear side by side on agent leaderboards.

The current cost × score Pareto frontier, as of July 2026:

Configuration	Recorded cost per evaluation run	Avg score
DeepSeek-V3.2	$0.03	39.6%
GPT-5.1 Codex mini	$0.06	61.6%
GPT-5.3 Codex	$0.18	71.9%
Frontier mixes	~$1.87+	~87%

Read it from the cheap end up:

A configuration costing around $0.06 already clears 60%.

Moving from the 62–72% band to roughly 87% costs around 10–30× more in this sample ($0.06–$0.18 versus about $1.87 per run).
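A frontier like this can be derived from a flat list of (cost, score) observations. The sketch below is one minimal way to do it, not the ATG pipeline: the `Observation` type and `pareto_frontier` function are illustrative names, the four real rows come from the table above with the approximate "Frontier mixes" values entered as 1.87 and 87.0, and the fifth row is a made-up dominated point added only to show the filter working.

```python
from typing import NamedTuple


class Observation(NamedTuple):
    name: str
    cost_usd: float   # recorded cost per evaluation run
    score_pct: float  # average benchmark score


def pareto_frontier(observations: list[Observation]) -> list[Observation]:
    """Keep configurations that no cheaper-or-equal configuration matches on score."""
    # Sort by ascending cost; on equal cost, the higher score comes first.
    ordered = sorted(observations, key=lambda o: (o.cost_usd, -o.score_pct))
    frontier: list[Observation] = []
    best_score = float("-inf")
    for obs in ordered:
        # A pricier configuration stays only if it scores strictly higher
        # than everything cheaper than it.
        if obs.score_pct > best_score:
            frontier.append(obs)
            best_score = obs.score_pct
    return frontier


rows = [
    Observation("DeepSeek-V3.2", 0.03, 39.6),
    Observation("GPT-5.1 Codex mini", 0.06, 61.6),
    Observation("GPT-5.3 Codex", 0.18, 71.9),
    Observation("Frontier mixes", 1.87, 87.0),
    # Hypothetical dominated point: costs more than GPT-5.3 Codex, scores less.
    Observation("hypothetical-dominated", 0.50, 60.0),
]

for obs in pareto_frontier(rows):
    print(f"{obs.name}: ${obs.cost_usd:.2f} -> {obs.score_pct:.1f}%")
```

The hypothetical row is dropped because a cheaper configuration already scores higher; the four table rows survive.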

That does not prove that cheap agents plus retries always beat a more expensive single pass. Reliability, verifier cost, retry budgets, and failure modes still matter.

But it establishes the question that should be measured:

What is the cost per accepted result, not just the cost per attempt?

For tasks with cheap verification and room for retries, lower-cost configurations may be much more competitive than leaderboard rankings alone suggest.
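To make "cost per accepted result" concrete, here is a back-of-the-envelope sketch. It assumes independent attempts with a constant pass probability and unlimited retries, so the expected number of attempts is 1 / pass rate. It also treats the table's average score as a per-attempt pass rate and the recorded run cost as a per-attempt cost. Both are simplifications for illustration, not measurements from the dataset, and the function name and `verify_cost` parameter are hypothetical.

```python
def cost_per_accepted_result(cost_per_attempt: float, pass_rate: float,
                             verify_cost: float = 0.0) -> float:
    """Expected spend until the first verified pass, retrying until success.

    Assumes independent attempts with a constant pass probability, so the
    expected number of attempts is 1 / pass_rate. Every attempt is verified.
    """
    if not 0.0 < pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return (cost_per_attempt + verify_cost) / pass_rate


# Table values used as stand-ins (see the assumptions in the text above).
configs = {
    "DeepSeek-V3.2": (0.03, 0.396),
    "GPT-5.1 Codex mini": (0.06, 0.616),
    "GPT-5.3 Codex": (0.18, 0.719),
    "Frontier mixes": (1.87, 0.87),
}

for name, (cost, rate) in configs.items():
    print(f"{name}: ${cost_per_accepted_result(cost, rate):.3f} per accepted result")
```

A nonzero `verify_cost` or a capped retry budget shifts the comparison toward configurations with higher pass rates, which is why those terms need to be measured rather than assumed.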

Per-task is where it gets interesting

Aggregate scores hide the strongest signal.

When the benchmark task and its verifier are held fixed, the pricing spread becomes hard to ignore:

pytorch-model-cli: 47 recorded successful runs, priced from roughly $0.001 to $1.47. Same benchmark task, same verified pass condition, 1,109× price difference.
qemu-startup: 48 recorded successful runs, with an 810× cost spread.
log-summary-date-ranges: 58 recorded successful runs; the cheapest, Terminus 2 + DeepSeek-V3.2, cost roughly half a cent.

These are successful runs, not deduplicated configurations. The same agent × model pair can appear more than once because each trial takes its own trajectory: steps, retries, tokens, and tool use.

They are not identical agent trajectories. But they are comparable verified completions of the same externally checked task.
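The spread figures above come down to a simple grouping: keep only verifier passes, group by task, and divide the highest recorded cost by the lowest. A minimal sketch, using made-up rows and a hypothetical tuple layout rather than the real result.json schema:

```python
from collections import defaultdict

# Hypothetical rows: (task, configuration, recorded cost in USD, verifier passed).
rows = [
    ("task-a", "agent-x + model-1", 0.002, True),
    ("task-a", "agent-y + model-2", 0.400, True),
    ("task-a", "agent-x + model-1", 0.004, True),   # same pair, second trial
    ("task-a", "agent-z + model-3", 0.900, False),  # failed attempts are excluded
    ("task-b", "agent-y + model-2", 0.050, True),
]


def price_spread(rows):
    """Per task: number of successful runs, cheapest, priciest, max/min ratio."""
    costs = defaultdict(list)
    for task, _config, cost, passed in rows:
        # Zero-cost rows are skipped so the ratio stays defined.
        if passed and cost > 0:
            costs[task].append(cost)
    return {
        task: {
            "runs": len(values),
            "min": min(values),
            "max": max(values),
            "spread": max(values) / min(values),
        }
        for task, values in costs.items()
    }


for task, stats in price_spread(rows).items():
    print(f"{task}: {stats['runs']} passes, {stats['spread']:.0f}x spread")
```

Runs are counted, not deduplicated, so the same agent × model pair contributes one row per successful trial.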

We have not found a common public price layer for that result.

A reader can see that one configuration passed. They can see that another scored higher overall. What they usually cannot see is that both passed the same task — and one paid three orders of magnitude less.

That is the gap.

Method, in plain terms

The current dataset joins public evaluation artifacts into a common graph:

Unit of analysis: a recorded evaluation run or task attempt
Cost: reported API spend in the source artifact
Success: benchmark verifier pass
Coverage: 131 evaluation runs, 29 model configurations, and 31,913 extracted per-task price rows
Evidence: source URLs and verification status retained per run
Not yet normalized for: infrastructure cost, harness differences, token accounting differences, retry policy, context limits, or reliability across repeated runs

The purpose is not to declare a universal “best agent.”

It is to make the cost of a verified outcome visible enough to compare.

The price graph is live

The 31,913 extracted task-level price rows are no longer just feedstock. The task-anchored price view now ships on the ATG dashboard as a dedicated screen.

For each of the 87 Terminal-Bench 2.0 tasks (out of 90) that have at least one solved, priced run, it shows:

the cheapest and median cost of a verified pass;
the full min→max price range on a shared log scale;
the solve rate across all recorded attempts;
the 20 cheapest recorded successful runs, linked to their source evidence.
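The four figures listed above can all be computed from one flat list of attempts. The sketch below shows one way to do that; the `Attempt` fields, the `summarize_task` function, and the example.org URLs are illustrative assumptions, not the dashboard's actual schema or code.

```python
import heapq
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Attempt:
    task: str
    config: str
    cost_usd: float
    passed: bool
    source_url: str  # evidence link kept with every row


def summarize_task(attempts: list[Attempt], task: str, top_k: int = 20) -> dict:
    """Summary figures for one task, computed from all of its recorded attempts."""
    rows = [a for a in attempts if a.task == task]
    passes = [a for a in rows if a.passed]
    if not passes:
        return {"task": task, "attempts": len(rows), "solve_rate": 0.0}
    costs = [a.cost_usd for a in passes]
    return {
        "task": task,
        "attempts": len(rows),
        "solve_rate": len(passes) / len(rows),  # passes over all attempts
        "cheapest": min(costs),
        "median": median(costs),
        "range": (min(costs), max(costs)),
        "cheapest_runs": heapq.nsmallest(top_k, passes, key=lambda a: a.cost_usd),
    }


# Hypothetical attempts for a single task.
attempts = [
    Attempt("task-a", "agent-x + model-1", 0.01, True, "https://example.org/run/1"),
    Attempt("task-a", "agent-y + model-2", 0.30, True, "https://example.org/run/2"),
    Attempt("task-a", "agent-y + model-2", 0.90, False, "https://example.org/run/3"),
    Attempt("task-a", "agent-z + model-3", 0.05, True, "https://example.org/run/4"),
]
summary = summarize_task(attempts, "task-a", top_k=2)
print(summary["solve_rate"], summary["cheapest"], summary["median"])
```

Note that the solve rate uses every attempt as the denominator, while the cost figures use verified passes only.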

The hardest priced task in the dataset, make-doom-for-mips, was solved in only 12 of 358 attempts (about 3.4%).

The dashboard also plots cheapest verified-pass cost against the number of distinct configurations that managed to solve each task. That makes the trade-off visible: some tasks are cheap once solved, while others remain expensive because only a small number of configurations can complete them at all.

The next step is publishing the same structure as a knowledge graph: each canonical task becomes a node, and every priced verified pass becomes an edge weighted by recorded cost and linked back to its evidence.
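As a rough picture of that structure, the sketch below models tasks as nodes and priced verified passes as cost-weighted edges that carry their evidence link. The class and field names are hypothetical and the data is invented; the published graph schema may look quite different.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class PricedPass:
    """Edge: one verified pass, from a configuration to a task."""
    config: str
    task: str
    cost_usd: float    # edge weight
    evidence_url: str  # link back to the source artifact


class PriceGraph:
    def __init__(self) -> None:
        # task node -> list of priced-pass edges pointing at it
        self._edges: dict[str, list[PricedPass]] = defaultdict(list)

    def add_pass(self, edge: PricedPass) -> None:
        self._edges[edge.task].append(edge)

    def who_solved(self, task: str) -> list[PricedPass]:
        """Every priced pass recorded for a task, cheapest first."""
        return sorted(self._edges.get(task, []), key=lambda e: e.cost_usd)


# Hypothetical edges.
graph = PriceGraph()
graph.add_pass(PricedPass("agent-y + model-2", "task-a", 0.40, "https://example.org/run/2"))
graph.add_pass(PricedPass("agent-x + model-1", "task-a", 0.002, "https://example.org/run/1"))

for edge in graph.who_solved("task-a"):
    print(f"{edge.config}: ${edge.cost_usd} ({edge.evidence_url})")
```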

The result is not just another leaderboard.

It is a public price layer for the agent market:

Which configurations can complete this verified task — and what did each one pay?

The practical view is live on the ATG dashboard: 133 agents, verified capabilities, protocols — and now per-task prices.

The dashboard is the analytical view.

Geo Protocol is the provenance layer for the Agent Taxonomy Graph: agent systems, capabilities, benchmark observations, and their evidence are published as queryable entities rather than static rows.

Both this research post and the current Agent Taxonomy Graph hub are live on Geo Protocol.

The next publication step is to add task-level price observations to that graph: a canonical benchmark task as a node, and each verified priced result as an evidence-linked relation.

Feedback: @maximl

I Turned Hermes Agent into a Verifiable Agent Operating System

Levash0v — Sat, 30 May 2026 14:06:53 +0000

What I Built

I did not build another chatbot.

I built a memory hygiene system around Hermes Agent: a workflow that tells the agent what to remember, what to turn into a skill, what to write into the repo, what to track in a task system, and what to leave behind.

The core idea is simple:

Agent memory is not one bucket.

Long-running agent work breaks when chat history, global memory, project state, reusable procedures, task ownership, and public side effects are treated as the same thing. They have different lifetimes. Putting all of them into “memory” creates drift.

So I built a small repo-local harness and operating discipline around Hermes Agent.

Hermes Agent is the local agent runtime I use for tool-calling work: terminal commands, file edits, browser/search workflows, persistent memory, reusable skills, scheduled jobs, and gateway integrations.

Multica is the external task layer I use for active work ownership and routing. In this setup, it replaced local Hermes Kanban as the source of truth for current tasks.

The system separates agent work into durable layers:

Layer	Responsibility
Hermes memory	Stable facts only
Hermes skills	Reusable procedures
Repo files	Project-local state and conventions
Multica	Task ownership and routing
Session search	Historical recall
Human approval	External side effects

The operating rule:

Memory for stable facts. Skills for reusable procedures. Repos for project state. Multica for task ownership. Session search for history. Human approval for side effects.

That turns Hermes from a chat assistant into a small agent operating layer.

Before / after

Before	After
Task state buried in chat	Task state lives in Multica
Reusable fixes lost in history	Reusable fixes become Hermes skills
Project rules mixed with global memory	Project rules live in `AGENTS.md` / `CLAUDE.md`
Agent repeats setup mistakes	Skills + repo harness reduce rediscovery
Local Kanban drifts from reality	Multica becomes the source of truth
Claims of completion are implicit	Evidence reports verify artifacts

The important shift is not more memory. It is routing each kind of state to the layer with the right durability.

The lowest durable layer rule

The key rule is:

Store information in the lowest layer that is durable enough for its expected lifetime.

Examples:

A stable user preference goes to Hermes memory.
A repeated procedure becomes a Hermes skill.
A project convention goes to AGENTS.md or CLAUDE.md.
Current task ownership belongs in Multica.
Historical context can stay in session search.
Public side effects require human approval.
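The examples above amount to a lookup from "kind of information" to "layer". A minimal sketch of that routing follows; the kind labels and the `route` function are hypothetical names for illustration and are not part of Hermes Agent or Multica.

```python
from enum import Enum


class Layer(Enum):
    MEMORY = "Hermes memory"
    SKILL = "Hermes skill"
    REPO = "Repo files (AGENTS.md / CLAUDE.md)"
    TASKS = "Multica"
    HISTORY = "Session search"
    APPROVAL = "Human approval"


# Hypothetical kind labels; the mapping mirrors the examples above.
ROUTES = {
    "stable_preference": Layer.MEMORY,
    "repeated_procedure": Layer.SKILL,
    "project_convention": Layer.REPO,
    "task_ownership": Layer.TASKS,
    "historical_context": Layer.HISTORY,
    "public_side_effect": Layer.APPROVAL,
}


def route(kind: str) -> Layer:
    """Return the layer for a kind of information; refuse to guess for unknown kinds."""
    try:
        return ROUTES[kind]
    except KeyError:
        raise ValueError(f"no layer defined for {kind!r}; do not default to memory") from None


print(route("project_convention").value)
```

Raising on unknown kinds is a deliberate choice in this sketch: a silent fallback to memory is exactly how the junk drawer fills up.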

This keeps memory useful instead of turning it into a junk drawer.

Demo

The architecture is intentionally small:

Multica task layer ←→ Hermes Agent ←→ Session search
                         ↓
                  Evidence Loop
        Intent → Action → Artifact → Verification → Report
                         ↓
              Human Approval Gate, if external
                         ↓
              publish / send / deploy / push

Durable layers:
- Hermes memory: stable facts only
- Hermes skills: reusable procedures
- Repo harness: project-local state

Hermes routes work through durable layers, then through an evidence loop. External side effects stop at the Human Approval Gate.

The concrete task was:

Create a repeatable convention for repo-local agent state, verify it, and keep task ownership outside chat.

The workflow:

1. A Multica issue defined the work.
2. Hermes recovered prior context through session search.
3. Hermes wrote the repo-local harness files:

   AGENTS.md
   CLAUDE.md
   agent-progress.md
   AGENT_LESSONS.md
   session-handoff.md
   feature_list.json
   .agent-harness/validate_feature_list.py

4. Reusable procedures were promoted into Hermes skills.
5. Project-specific state stayed in the repository.
6. Active ownership stayed in Multica.
7. The harness was verified with tests and a validator command.

Task ownership in Multica: repo harness setup and validator test suite are done, while skill promotion and the DEV.to submission are still in progress.

The point is not that an agent edited files. The point is that the workflow forced each kind of information into the correct durability layer.

Evidence loop

The workflow uses this loop:

Intent -> Tool action -> Artifact -> Verification -> Evidence report -> Approval if external

Examples:

A repo update is verified by reading the changed file or checking the diff.
A harness update is verified by running tests.
A task completion is verified by a Multica comment or linked artifact.
A reusable procedure is verified by a committed Hermes skill.
A public action, like pushing a repo or publishing a post, stops at the approval gate.
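The loop and its approval gate can be expressed as a small record plus one check. This is a sketch of the idea only; `EvidenceReport` and `may_proceed` are hypothetical names, not part of Hermes Agent or the repository.

```python
from dataclasses import dataclass


@dataclass
class EvidenceReport:
    intent: str
    action: str
    artifact: str            # e.g. a file path, diff, or task comment link
    verification: str        # how the artifact was checked
    verified: bool
    external: bool = False   # True for push / publish / send / deploy
    approved: bool = False   # set by a human, never by the agent


def may_proceed(report: EvidenceReport) -> bool:
    """Unverified work never proceeds; external actions also need approval."""
    if not report.verified:
        return False
    if report.external:
        return report.approved
    return True


local_edit = EvidenceReport(
    intent="update harness docs",
    action="edit AGENTS.md",
    artifact="AGENTS.md",
    verification="read the diff",
    verified=True,
)
push = EvidenceReport(
    intent="publish the repo",
    action="git push",
    artifact="remote branch",
    verification="tests passed locally",
    verified=True,
    external=True,
)

print(may_proceed(local_edit))  # verified and local
print(may_proceed(push))        # verified but external and not yet approved
```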

This changes the agent contract from “trust me, I did it” to “here is the artifact and here is how it was verified.”

Code

Repository: https://github.com/Levash0v/verifiable-agent-harness

The public artifact is intentionally small, but it has a real project shape:

templates/      AGENTS.md, CLAUDE.md, handoff files
examples/       feature_list.example.json
agent_harness/  validator
tests/          validator tests
docs/           evidence loop, diagram, article draft

Each repository gets a small operating contract.

Excerpt from templates/AGENTS.md:

# Agent Guide

This repository uses a repo-local agent harness. Treat these files as source of truth for agent work state:

- feature_list.json
- agent-progress.md
- session-handoff.md
- AGENT_LESSONS.md

## Startup protocol

1. Run `pwd`.
2. Run `git status --short --branch`.
3. Read this file and `CLAUDE.md` if present.
4. Read `feature_list.json`, `agent-progress.md`, `session-handoff.md`, and `AGENT_LESSONS.md`.
5. Run `python .agent-harness/validate_feature_list.py`.
6. Pick one unfinished feature only.

That contract means the next agent session does not need to reconstruct the project from chat. The repository carries its own operating state: current features, verified progress, and repo-specific lessons.

The repo is not only documentation. It has an executable validator path:

python3 -m agent_harness validate examples/feature_list.example.json
python3 -m unittest discover -s tests -v

The harness is executable: the feature list validator passes, and the test suite verifies both valid and invalid project-state files.

This is deliberately small. The goal is to make the convention executable and testable instead of purely narrative.
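To give a feel for what "validating conventions" can mean, here is a sketch of a feature-list check. It is not the repository's validator: the schema (a top-level `features` list whose items carry `id`, `status`, and `evidence`) and the allowed status values are assumptions made up for this example.

```python
import json

ALLOWED_STATUS = ("todo", "in_progress", "done")  # assumed status values


def validate_feature_list(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file passes."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top level must be an object"]
    features = data.get("features")
    if not isinstance(features, list):
        return ["'features' must be a list"]
    errors = []
    seen = set()
    for i, feature in enumerate(features):
        if not isinstance(feature, dict):
            errors.append(f"features[{i}] must be an object")
            continue
        fid = feature.get("id")
        if not isinstance(fid, str) or not fid:
            errors.append(f"features[{i}] needs a non-empty string 'id'")
        elif fid in seen:
            errors.append(f"duplicate id {fid!r}")
        else:
            seen.add(fid)
        status = feature.get("status")
        if status not in ALLOWED_STATUS:
            errors.append(f"features[{i}] has an unknown status")
        # A feature marked done should point at something that can be checked.
        if status == "done" and not feature.get("evidence"):
            errors.append(f"features[{i}] is done but has no evidence")
    return errors


valid = '{"features": [{"id": "harness", "status": "done", "evidence": "tests/"}]}'
invalid = '{"features": [{"id": "harness", "status": "done"}, {"id": "harness", "status": "maybe"}]}'
print(validate_feature_list(valid))
print(validate_feature_list(invalid))
```

A check like this catches structural drift (missing ids, unknown states, completion without evidence) but says nothing about whether the feature actually works.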

My Tech Stack

Hermes Agent — agent runtime, memory, skills, tools, session search, scheduled jobs, and gateways
Multica — active task ownership and routing
Python — repo harness validator
unittest — validation tests
Markdown — repo-local operating contracts
JSON — machine-readable feature state
Git / GitHub — versioned repo artifacts and proof trail
DEV.to — publication and challenge submission

How I Used Hermes Agent

Hermes Agent powered the project as the orchestrator and verifier.

I used Hermes memory only for stable facts: user preferences, environment facts, and long-lived workflow conventions.

I used Hermes skills as procedural memory: repo harness setup, publication workflow, clean-state checks, task handoff patterns, and debugging or routing procedures discovered during work.

I used session search for historical recall: prior decisions, old implementation attempts, and context reconstruction before updating a repo or task.

I used Hermes tools for concrete work: reading and editing files, running terminal commands, checking diffs, executing validators, and verifying test output.

Repo-local state lives in files such as:

AGENTS.md
CLAUDE.md
feature_list.json
agent-progress.md
AGENT_LESSONS.md
session-handoff.md
clean-state-checklist.md
evaluator-rubric.md
.agent-harness/validate_feature_list.py

Multica handles active task ownership and routing: what is being worked on, who owns it, what needs approval, and what result was reported back.

External side effects remain gated: GitHub pushes, DEV.to publishing, social posts, Discord messages, infrastructure deploys, and irreversible task comments.

Hermes can draft, edit, verify, and stage. The human approves the public action.

The biggest change was operating discipline:

Hermes stopped using global memory as a scratchpad.
Repeated fixes became skills instead of disappearing into chat history.
Project rules moved into repo-local files.
Task ownership moved from local Kanban to Multica.
Completion claims became evidence-backed reports.

This made the system less magical and more reliable.

Limitations

This is not a full agent platform by itself.

The harness validates conventions, not semantic correctness.
Multica is an external coordination layer, not required by the repo template.
Human approval is still required for external effects.
Evidence quality depends on disciplined updates to files, tasks, and skills.

That is intentional. The system is boring at the boundaries because those boundaries are where long-running agents usually fail.

Next steps

Next, I want to add more validators, richer handoff examples for Hermes / Claude Code / Codex, a stricter approval protocol, and more examples of skill promotion from repeated work.

The lesson I took from this build is simple:

Agent memory should be designed like infrastructure, not treated like a magic notebook.

Hermes gave me the primitives: memory, skills, tools, session search, scheduled jobs, and gateways.

The harness turns those primitives into an operating discipline.