cucoleadan

Posted on Jun 3 • Originally published at vibestacklab.substack.com on Jun 2

Why AI Benchmarks Fail Real Hermes Agent Workflows

#agents #benchmarks #workflows #routing

This post was originally published on my Substack publication as Why AI Benchmarks Fail Real Hermes Agent Workflows.

The day after Opus 4.8 launched, I gave it a job that should've taken two minutes. Find a named file, summarize it under a strict word limit, and return the result in a specific format so the next step in the pipeline could parse it correctly.

Opus handled it with clean output and solid reasoning, but it took its time making sure every move was right. By the time it finished, a cheaper model would've done the same work three times over.

That was the moment I stopped trusting benchmarks. A leaderboard score tells you how a model performs on a clean task under controlled conditions. It says nothing about whether that model can survive a twenty-step workflow.

I've wanted to put models through a proper test for a while now. I finally found the time and the right config to do it. Here's how I run every model through the same real tasks, and what four models taught me about real Hermes work.

In this article:

Tool call discipline matters more than reasoning quality. A model that calls the right tool once beats one that explores and verifies three times over.
Route by task, not by preference. Use the cheapest model that reliably finishes the job, and step up only when it fails.
A simple testing framework that puts models through real Hermes tasks instead of synthetic benchmarks.

Bottom line: There is no single survivor. I route fast background tasks to a lightweight model (GLM 5.1), complex workflows to a capable mid-tier model (Qwen 3.7 Max), and deep debugging to a flagship (Opus), and only when the others fail.

Why AI Benchmarks Fail Agent Workflows

Benchmarks test the thing they can measure cleanly. Can the model solve this logic puzzle, answer this math question, write code that passes these test cases? Those are useful questions for evaluating raw capability.

But agent work needs different skills. The model needs to be obedient, fast, disciplined with tools, and able to stop when the task is done. None of these show up on a benchmark because none of them are easy to measure in a controlled test.

A model can score in the top tier on a reasoning benchmark and still be the wrong fit for unattended workflow automation. It might overthink simple tasks, waste tokens on unnecessary reasoning, or call tools it doesn't need because it's trying to be thorough. In a chat interface that thoroughness feels impressive. In a scheduled job running at 6 AM it means the session times out before the work finishes.

Benchmarks also miss the compounding effect of small failures. A model adds an extra section, ignores a format constraint, or calls a tool twice when once would suffice. Each is minor on its own. In an agent workflow where each step feeds the next one, minor failures cascade into broken jobs.

In my opinion, ClawEval comes close to a valid benchmark. The paper runs 300 human-verified agent tasks across 9 categories with a Pass^3 rule that eliminates lucky runs. A task only passes if the model meets the success criteria in all three independent trials, which is a meaningfully stricter bar than Pass@3. It's the most serious attempt at realistic agent evaluation I've seen, and it has already spawned related work like WildClawBench that tests agents inside live OpenClaw instances. But that's beside the point.
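To make the gap between the two rules concrete, here is a minimal Python sketch. It assumes each task is logged as three pass/fail trial outcomes. The function names and sample data are mine for illustration, not from the ClawEval paper, and Pass@3 is simplified to "at least one of the three trials passed".

```python
from typing import Callable, Dict, List


def pass_at_k(trials: List[bool]) -> bool:
    """Simplified pass@k: the task counts if ANY trial succeeded."""
    return any(trials)


def pass_hat_k(trials: List[bool]) -> bool:
    """Pass^k: the task counts only if EVERY trial succeeded."""
    return all(trials)


def score(results: Dict[str, List[bool]],
          rule: Callable[[List[bool]], bool]) -> float:
    """Fraction of tasks that pass under the given rule."""
    return sum(rule(trials) for trials in results.values()) / len(results)


# Hypothetical outcomes: four tasks, three independent trials each.
results = {
    "find_file": [True, True, True],
    "summarize": [True, False, True],     # flaky: passes two of three
    "build_skill": [False, False, True],  # lucky: passes once
    "report": [False, False, False],
}

print(f"pass@3: {score(results, pass_at_k):.2f}")   # pass@3: 0.75
print(f"pass^3: {score(results, pass_hat_k):.2f}")  # pass^3: 0.25
```

Same runs, same model, and the score drops from 0.75 to 0.25 once lucky and flaky passes stop counting.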

My test is less scientific and more practical. I put models through the same jobs I schedule on Hermes, using the same config I run every day, and I watch what happens.

Speed Is a Workflow Cost

Speed matters in two completely different ways depending on how you use the model.

When you're in a session talking to Hermes, time is something you're spending. A model that takes four seconds instead of one is the difference between staying in flow and getting distracted between tool calls. Fast models give that time back to you.

In a cron job, nobody's watching the clock. But speed still matters because fast models tend to produce fewer reasoning tokens per tool call, and that directly affects cost and reliability. Less verbosity means less context consumed across twenty steps, which keeps context degradation from compounding into broken jobs by step fifteen.

This is why GLM 5.1 on Ollama Cloud carries most of my daily workload. It's fast enough to feel instant in interactive sessions and tight enough with tokens to keep scheduled jobs cheap and stable. I use it for heartbeat checks, morning briefings that synthesize Asana tasks before coffee, and anything that needs to be fast and correct without deep reasoning. The boring work that makes up most of a Hermes workday.

Speed also compounds with cost. A fast mid-tier model that finishes a workflow in three minutes is cheaper than a slow flagship that takes fifteen minutes. I've been tracking that cost math across providers, and it's where a fast model pays for itself.

For tasks that require massive token input and a lot of tool calls to read and process files, I reach for DeepSeek v4 Flash. When a job needs to chew through a hundred thousand tokens across a dozen files, speed is the only thing that keeps the session from becoming an exercise in patience.

Tool Call Discipline Beats Reasoning

This is the sharpest opinion I hold about models inside agent loops, and it's the one benchmarks almost never test.

Benchmarks measure whether a model can think hard about a problem. Hermes needs a model that calls the right tool once, reads the result, and moves on without second-guessing itself. Those are different skills, and the second one matters more for unattended work.

A model that makes twelve tool calls when four would do is being expensive and fragile. Every extra call adds API cost, creates another failure point, and fills the context window with noise the model has to process on the next step. Most of what people call "context engineering" inside an agent loop is just preventing this kind of noise from ever entering the window in the first place.

I've seen top-tier reasoning models call a search tool, read the results, then call a different search tool to verify what the first one returned, then call a third tool to format the output when a simple string operation would've worked. The net effect was a session that cost three times as much and took three times as long.

The same failure shows up with instruction obedience. A model that ignores format constraints and adds helpful extra sections breaks downstream parsing. A model that keeps writing after the task is done wastes tokens. A model that skips a negative constraint includes something you told it to avoid at the worst possible moment.

In a chat, each of these looks like helpfulness. In an agent workflow, each one becomes a liability because each step feeds the next one.

Tool call discipline separates a model I trust with unattended work from a model I keep supervised. A disciplined model reads the task, decides which tools it needs, calls each one once, and stops. An undisciplined model explores and adds helpful extra steps nobody asked for.

If you're running Hermes with approval gates, tool call discipline becomes even more important. A model that makes unnecessary tool calls also tends to ignore "ask before destructive action" instructions. The model thinks it knows better than the prompt.

This also ties into the interface question I've been tracking. A disciplined model calls the right tool the right way, and sometimes the right tool is a lightweight CLI instead of a bloated MCP server that chokes the context window before the model even starts thinking.

How I Route AI Models in Hermes

Model selection inside an agent loop is a routing problem, not a ranking problem. There's no single best model. There's a best model for each job, and the job changes throughout the day.

GLM 5.1 carries most of my workload: heartbeat checks, simple scheduled jobs, structured data parsing, and quick research tasks. These are tool calls that need to be fast and correct but don't need deep reasoning. GLM on Ollama Cloud is cheap enough that I don't think twice about spinning up a session. This is the boring work that makes up the bulk of a Hermes workday.

GPT-5.5 is what I use through my Codex subscription for the heavy work. I tried it inside Hermes first and it burned through usage by making tons of unnecessary tool calls. The model doesn't know when to stop, which makes it terrible for agent loops where each tool call costs tokens and fills the context window. So I stopped using it in Hermes and shifted it to Codex, where I control the loop. In Codex it handles most of my coding and the research pulls where I want thorough coverage. It is more verbose than the other models in my stack, which helps for research and hurts for tight format constraints, so I shape the prompt accordingly. The subscription removes cost as a gating factor, which means I can run it as often as the job needs without watching a counter.

Opus gets the tasks GPT-5.5 can't finish. I run it through the API and it's expensive, so I keep it scoped to debugging. When a problem doesn't reproduce cleanly or when a code review needs a different reasoning style, Opus handles it. That's it. Outside of those sessions it stays idle.

Qwen 3.7 Max is my surgical tool. I reach for it when a task needs more reasoning than GLM can deliver but I don't want to pay Opus prices. The step up from the cheaper Qwen tier is noticeable on tasks involving multi-step logic or ambiguous instructions. The cheaper version guesses and moves on. 3.7 Max pauses and works through it. For most structured agent work, the cheaper version gets the job done. I use 3.7 Max sparingly and mostly for content and deep research.

My pattern is simple. Use the cheapest model that reliably completes the specific task type. When GLM fails, step up. When the mid-tier fails, step up again. Escalation stays task-driven, not model-driven. Pricing only matters once the model can finish the job.
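That escalation pattern can be sketched in a few lines of Python. This is a rough illustration, not how Hermes routes internally. The ladder order, the `run` callable, and the `accept` check are all assumptions; in practice `run` would be whatever starts an agent session and `accept` would be the task's own pass/fail test.

```python
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical ladder, cheapest first. The names mirror my stack;
# the ordering is the only thing this sketch relies on.
LADDER = ("glm-5.1", "qwen-3.7-max", "opus")


def route(task: str,
          run: Callable[[str, str], str],
          accept: Callable[[str], bool],
          ladder: Sequence[str] = LADDER) -> Tuple[Optional[str], Optional[str]]:
    """Try models cheapest-first and return (model, output) for the
    first output that passes the task's acceptance check."""
    for model in ladder:
        output = run(model, task)
        if accept(output):
            return model, output
    return None, None  # every tier failed: hand the task to a human


# Stub standing in for a real agent call (assumption): the cheapest
# model drifts off-format, the others return a clean bullet.
def fake_run(model: str, task: str) -> str:
    return "Sure! Here is a long intro..." if model == "glm-5.1" else "- ok"


model, output = route("summarize notes.md", fake_run,
                      accept=lambda out: out.startswith("- "))
print(model)  # qwen-3.7-max
```

The acceptance check belongs to the task, not the model, which is what keeps escalation task-driven.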

The first rule of routing is reliability. The second rule is cost. Get them in the wrong order and you'll pay for it.

I pay per token every time Hermes calls a tool, so the cost math matters. Every tool call sends the current context window as input tokens and gets output tokens back in the response. A session with a dozen tool calls can consume as many tokens as a long chat conversation.
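Here is a back-of-the-envelope Python sketch of why call count matters so much. Every number in it is a made-up assumption (base context, tokens per response, tokens per tool result), and it assumes the full context is re-sent on each call with no prompt caching.

```python
from typing import Tuple


def session_tokens(calls: int,
                   base_context: int = 2_000,
                   output_per_call: int = 150,
                   result_per_call: int = 800) -> Tuple[int, int]:
    """Return (input_tokens, output_tokens) for one session.

    Assumes the whole context is re-sent on every call, and that each
    call appends its own output plus the tool result to the context.
    """
    context = base_context
    total_in = total_out = 0
    for _ in range(calls):
        total_in += context            # pay for everything seen so far
        total_out += output_per_call
        context += output_per_call + result_per_call
    return total_in, total_out


print(session_tokens(4))   # (13700, 600)
print(session_tokens(12))  # (86700, 1800)
```

Under these assumptions, three times the calls means roughly six times the input tokens, because every extra call also pays again for the context that came before it.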

My AI Agent Evaluation Ladder

I don't trust leaderboards because they don't test what I need. So I run models through the same set of real tasks from my actual work. Same files, same prompts, same conditions for every model so the comparison stays honest. Each task tests a different dimension of what I need from an agent model.

The tasks escalate from simple to demanding.

First, whether the model stays restrained when given no task. A good Hermes model sends a short greeting back and waits. A bad one starts searching files and scanning memory before you ask it to. The model that won't wait is the model that won't stop.

Then whether it finds a document and summarizes under strict format rules without drifting into extra sections. I cap the summary at a specific number of bullets with a word limit on each one. The task doesn't require deep reasoning. It requires the model to follow directions and stop.
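A check like this is easy to automate. The sketch below is one way to do it in Python, assuming the summary is plain text with `- ` bullets; the limits of 5 bullets and 20 words are placeholders, not the numbers I use.

```python
from typing import List


def check_summary(text: str, max_bullets: int = 5,
                  max_words: int = 20) -> List[str]:
    """Return a list of violations; an empty list means the summary passes."""
    problems = []
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    bullets = [ln for ln in lines if ln.startswith("- ")]
    extras = [ln for ln in lines if not ln.startswith("- ")]

    if extras:  # intros, headings, sign-offs: anything that isn't a bullet
        problems.append(f"{len(extras)} non-bullet line(s), e.g. {extras[0]!r}")
    if len(bullets) > max_bullets:
        problems.append(f"{len(bullets)} bullets, limit is {max_bullets}")
    for i, bullet in enumerate(bullets, start=1):
        words = len(bullet[2:].split())
        if words > max_words:
            problems.append(f"bullet {i} has {words} words, limit is {max_words}")
    return problems


good = "- Ships Friday\n- Two blockers remain"
drifted = "Here is your summary:\n- Ships Friday\n\nNext steps\n- Ping QA"

print(check_summary(good))     # []
print(check_summary(drifted))
```

The drifted example fails on the friendly intro and the extra "Next steps" heading, which is exactly the kind of helpfulness that breaks a downstream parser.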

Third, whether it packages a CLI tool into a reusable skill without overbuilding. Some models create five files when one would do. The disciplined model reads the help output before it writes anything.

Fourth, whether it uses that skill correctly in a fresh session with specific source rules. No Reddit, no arXiv, no turning search snippets into facts. This tests whether the model can follow negative constraints, which are harder than positive ones because the model has to actively suppress the instinct to include everything it finds.
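The domain half of those source rules can be checked mechanically. Here is a minimal Python sketch, assuming the report cites sources as plain URLs; the function name and sample report are illustrative. The third rule, not turning search snippets into facts, still needs a human read.

```python
import re
from typing import List
from urllib.parse import urlparse

# Domains taken from the rule "No Reddit, no arXiv".
BLOCKED = ("reddit.com", "arxiv.org")


def blocked_sources(report: str) -> List[str]:
    """Return every URL in the report whose host is a blocked domain
    or one of its subdomains."""
    hits = []
    for url in re.findall(r"https?://[^\s)\]>]+", report):
        host = (urlparse(url).hostname or "").lower()
        if any(host == d or host.endswith("." + d) for d in BLOCKED):
            hits.append(url)
    return hits


# Placeholder report text, not output from a real session.
report = (
    "Sources: https://old.reddit.com/r/agents "
    "https://example.com/post "
    "https://arxiv.org/abs/0000.00000"
)
print(blocked_sources(report))
# ['https://old.reddit.com/r/agents', 'https://arxiv.org/abs/0000.00000']
```

Matching on the parsed hostname rather than a substring keeps an unrelated domain like `notreddit.com` from triggering a false failure.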

And finally, whether it pulls together sources, cross-checks claims, and produces a decision-ready report under a word limit. This combines everything. Format constraints, tool judgment, instruction obedience, and the discipline to stop writing when the task is done.

That last requirement is the one benchmarks never test. In a chat, extra writing is harmless. In an agent workflow, extra writing is a tax on every step that follows.

What's Coming Every Thursday

I'm running Opus, GPT-5.5, Qwen 3.7 Max via OpenCode Go, and GLM 5.1 via Ollama Cloud through these tasks and publishing a short verdict card every Thursday. Each card covers the best use case, the conditions where you should skip it, cost notes from real sessions, and whether the model belongs in a Hermes setup at all.

The goal is a repeatable testing standard that accumulates over time instead of a one-off leaderboard that goes stale the week after publishing.

I'll also note when a model fails because of provider instability rather than model quality. A weak model needs replacing. An unreliable route needs a backup. I've hit this with Hindsight memory timeouts and gateway drift, things that look like the model broke but were actually a layer below it.

Each Thursday card follows the same format so you can compare over time. The verdict will be one of five categories: daily driver, strong specialist, background worker, backup only, or skip entirely. I'll add new tasks to the ladder based on what people suggest in the comments.
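For anyone who wants to keep their own cards in a comparable shape, here is one possible Python structure. The field names are my guess at a reasonable layout based on what each card covers, not a published schema, and the sample values are placeholders rather than real results.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    DAILY_DRIVER = "daily driver"
    STRONG_SPECIALIST = "strong specialist"
    BACKGROUND_WORKER = "background worker"
    BACKUP_ONLY = "backup only"
    SKIP = "skip entirely"


@dataclass(frozen=True)
class VerdictCard:
    model: str
    route: str          # provider or gateway the model ran through
    best_use_case: str
    skip_when: str
    cost_notes: str
    verdict: Verdict

    def headline(self) -> str:
        return f"{self.model} via {self.route}: {self.verdict.value}"


# Placeholder values only; not a real result from the ladder.
card = VerdictCard(
    model="example-model",
    route="example-provider",
    best_use_case="(filled in from test sessions)",
    skip_when="(filled in from test sessions)",
    cost_notes="(filled in from test sessions)",
    verdict=Verdict.BACKUP_ONLY,
)
print(card.headline())  # example-model via example-provider: backup only
```

Keeping the route as its own field makes it possible to record the same model twice when it behaves differently through two providers.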

Stay tuned and subscribe to receive these reports as soon as I publish them.

Share this with a builder choosing their next AI model from a leaderboard or plan page.

Which model should go through the ladder first, and what real workday task should I add to the test? Let me know in the comments.
