Yitao

Posted on May 28

Same 3 tasks, they spent $30, we spent $5 — OpenClacky 1.0, an open-source AI agent optimized for token cost

#ai #opensource #ruby #agents

Most AI agents don't have a cost problem on paper. They have a cost problem at the end of the month.

More and more people use AI agents for real work now — not just writing code, but making slide decks, writing marketing plans, running competitor research, doing daily office automation. Use them for a while and everyone hits the same wall: the bill.

A lot of well-known agents are structurally token-hungry. $30 for a single complete task isn't unusual. The model isn't to blame. It's the harness layer around it: cache design that doesn't hold up, tool sets that keep growing, compression that destroys cache prefixes, context windows rebuilt from scratch every few turns. Each layer quietly burns money, and the user only gets educated once — when the monthly invoice arrives.

We decided from day one to treat "spend less" as a top-level harness design goal, not an optimization pass you bolt on later. Our first two architectures (Gen 1: RAG-based, Gen 2: cloud multi-agent orchestration) taught us plenty of painful lessons. The conclusion we came to: users just want the task done well and fast. The best architecture isn't blindly chasing multi-agent coordination and complex orchestration — it's pushing quality and cost control to the limit on a single agent.

That's how the third-generation architecture was born: a from-scratch Ruby rewrite, four months of work, redesigned around seven core decisions covering cache, tool set, compression, self-evolution, and more. The result is today's OpenClacky 1.0, MIT licensed.

GitHub: github.com/clacky-ai/openclacky
Site: openclacky.com

Benchmark: 4 agents, same 3 tasks, same model

Architecture done — but does it actually work? We spent over two weeks running side-by-side evaluations, pulling the major agents — Claude Code, OpenClaw, Hermes — onto the same playing field.

All four agents ran claude-opus-4-7, currently the strongest and most expensive model. That's deliberate: the pricier the model, the more clearly it exposes each harness's real efficiency. Every token saved is real money.

Same prompts, same time window. Cost data comes from OpenRouter's per-request CSV billing export, so the numbers are a third-party bill, not our own logs.

Agent	Total Cost	Cache Hit Rate	Requests
OpenClacky	$5.10	90.6%	51
Claude Code	$5.49	95.2%	70
OpenClaw	$15.70	88.7%	81
Hermes	$30.14	60.3%	218

OpenClacky: 51 requests at a 90.6% hit rate came to $5.10. Hermes: 218 requests at a 60.3% hit rate came to $30.14.

Claude Code's cache hit rate (95.2%) is actually higher than ours. That's a world-class closed-source harness with internal model-switching to cheaper models for sub-tasks, and they've had years to tune it. Our edge comes from combining a low request count with a high hit rate: fewer round trips, each one mostly cached.
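To make that comparison concrete, here is a small sketch that derives cost per request and cost relative to OpenClacky from the table above. The figures are copied from the table; the constant and method names are illustrative only.

```ruby
# Figures copied from the benchmark table above.
RUNS = {
  "OpenClacky"  => { cost: 5.10,  hit_rate: 0.906, requests: 51 },
  "Claude Code" => { cost: 5.49,  hit_rate: 0.952, requests: 70 },
  "OpenClaw"    => { cost: 15.70, hit_rate: 0.887, requests: 81 },
  "Hermes"      => { cost: 30.14, hit_rate: 0.603, requests: 218 }
}.freeze

# Returns cost per request and total cost as a multiple of the baseline agent.
def summarize(runs, baseline: "OpenClacky")
  base_cost = runs.fetch(baseline)[:cost]
  runs.transform_values do |r|
    {
      per_request: (r[:cost] / r[:requests]).round(3),
      vs_baseline: (r[:cost] / base_cost).round(2)
    }
  end
end

summarize(RUNS).each do |agent, s|
  puts format("%-12s $%.3f/request  %.2fx", agent, s[:per_request], s[:vs_baseline])
end
```

Run against these numbers, Claude Code's average request is cheaper than ours (about $0.078 versus $0.10). Our lower total comes from making fewer requests.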

Full benchmark with per-request data, original prompts, raw outputs, and screen recordings: openclacky.com/benchmark

The three tasks

We picked everyday work scenarios, not pure coding challenges — because agents are being used by more than just developers now:

Task 1: 10-slide business presentation (AI agent industry trends)

Agent	Cost
OpenClacky	$1.23
Claude Code	$1.45
OpenClaw	$5.07
Hermes	$10.96

Task 2: AI customer service SaaS — marketing plan + working landing page

Agent	Cost
OpenClacky	$1.72
Claude Code	$1.20
OpenClaw	$7.47
Hermes	$4.65

Claude Code won this one. Their harness is genuinely good at multi-step creative tasks where the model needs room to iterate.

Task 3: B2B SaaS competitor analysis + one-week social media content calendar (6-step pipeline)

Agent	Cost
OpenClacky	$2.14
Claude Code	$2.84
OpenClaw	$3.15
Hermes	$14.53

Each benchmark page has the full prompt, all four agents' raw outputs, screen recordings, and per-request cost tables.

Why it's cheaper — 4 harness decisions

This isn't "fewer features = less cost." The problem with most agents isn't ambition; it's that nobody thought about cache during the architecture phase. We kept the feature set full and made every layer cache-aware. Here are the four decisions that matter most (there are seven in total, all covered in the full technical write-up):

1. Near-100% cache hit rate by design

The system prompt never rebuilds mid-session. Dynamic changes (skill list updates, model switches, date changes) go into a separate [session context] message block that doesn't break cache breakpoints. We place dual cache_control markers on the last two messages each turn, preventing the N+1 turn marker misalignment that kills cache in most implementations.

Most agents restart the session when skills reload or config changes. All cache gone instantly. We spent a lot of time making sure that never happens.
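A minimal sketch of the marker placement described above, assuming Anthropic-style messages whose content is an array of blocks with an optional cache_control field. The helper name mark_last_two and the constant are hypothetical. This illustrates the idea and is not OpenClacky's source.

```ruby
# Hypothetical helper, not OpenClacky's actual code.
# Assumes Anthropic-style messages: { role:, content: [ { type:, text:, ... } ] }.
CACHE_MARKER = { type: "ephemeral" }.freeze

def mark_last_two(messages)
  # Copy every message and drop markers left over from earlier turns,
  # so the caller's history is never mutated.
  cleaned = messages.map do |msg|
    blocks = msg[:content].map { |b| b.reject { |k, _| k == :cache_control } }
    msg.merge(content: blocks)
  end

  # Put a marker on the final block of each of the last two messages.
  cleaned.last(2).each do |msg|
    next if msg[:content].empty?
    msg[:content][-1] = msg[:content][-1].merge(cache_control: CACHE_MARKER)
  end
  cleaned
end

history = [
  { role: "user",      content: [{ type: "text", text: "Build a 10-slide deck" }] },
  { role: "assistant", content: [{ type: "text", text: "Drafting the outline", cache_control: CACHE_MARKER }] },
  { role: "user",      content: [{ type: "text", text: "Add a pricing slide" }] }
]
marked = mark_last_two(history)
```

Calling the helper before every request keeps exactly two markers in the payload no matter how long the conversation grows. The stored history stays unmarked apart from whatever it already contained.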

2. Minimal tool set — everything else is a skill

16 core tools. Claude Code has 40+. OpenClaw has 23. Hermes has 52.

The key is invoke_skill — a meta-tool that spawns sub-agents for complex capabilities: code exploration, memory recall, cron tasks, document generation. Users install new skills without changing the tool schema. Tool count stays at 16, cache stays warm.

Tool count isn't a competitive advantage. Task completion rate is.
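The sketch below shows why a meta-tool keeps the tool schema stable. The registry class, the schema shape, and the skill names are assumptions for illustration, and a plain block stands in for the sub-agent a real skill would spawn.

```ruby
require "json"
require "digest"

# Hypothetical registry; not OpenClacky's actual classes.
class SkillRegistry
  def initialize
    @skills = {}
  end

  def register(name, &handler)
    @skills[name] = handler
  end

  # One fixed tool definition. The skill name is a plain string argument,
  # so installing a skill never changes these bytes.
  def tool_schema
    {
      name: "invoke_skill",
      description: "Run an installed skill by name.",
      input_schema: {
        type: "object",
        properties: {
          skill: { type: "string" },
          task:  { type: "string" }
        },
        required: %w[skill task]
      }
    }
  end

  # The changing skill list travels in a message, not in the tool schema.
  def session_context
    "[session context] installed skills: #{@skills.keys.sort.join(', ')}"
  end

  def invoke(skill, task)
    handler = @skills.fetch(skill) { raise ArgumentError, "unknown skill: #{skill}" }
    handler.call(task)
  end
end

registry = SkillRegistry.new
registry.register("code_explore") { |task| "explored: #{task}" }
schema_before = Digest::SHA256.hexdigest(JSON.generate(registry.tool_schema))

registry.register("doc_gen") { |task| "generated: #{task}" }
schema_after = Digest::SHA256.hexdigest(JSON.generate(registry.tool_schema))

puts schema_before == schema_after
puts registry.session_context
```

The schema digest is identical before and after the second skill is installed, which is the property that keeps a cached tool definition valid.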

3. Insert-then-Compress: compression that hits cache

Standard approach: open a separate LLM call with a "summarize this" system prompt. That call has zero shared cache prefix — 100% miss. And it destroys the main session's cache too.

We insert the compression instruction directly into the current conversation flow, executed on the next normal request. The compression call reuses existing cache naturally. Cost approaches zero.
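The difference between the two approaches can be sketched with plain message arrays. Everything here is a simplified assumption: messages use string content, the method names are hypothetical, and llm is any callable standing in for a real chat client.

```ruby
# Hypothetical sketch; not OpenClacky's actual code.
COMPRESS_INSTRUCTION = "Summarize the conversation so far, keeping decisions, " \
                       "open tasks, and file paths."

# Separate-call approach: a new system prompt, so nothing is shared
# with the main session's prefix.
def separate_call_request(history)
  transcript = history.map { |m| "#{m[:role]}: #{m[:content]}" }.join("\n")
  [
    { role: "system", content: "You summarize conversations." },
    { role: "user",   content: transcript }
  ]
end

# Insert-then-compress: existing messages stay identical as the prefix;
# only one instruction message is appended.
def inline_compress_request(history)
  history + [{ role: "user", content: COMPRESS_INSTRUCTION }]
end

# Counts how many leading messages two requests have in common.
def shared_prefix_length(a, b)
  a.zip(b).take_while { |x, y| x == y }.length
end

# Assumed follow-up step: keep the system prompt, replace the rest with the summary.
def compress(history, llm)
  summary = llm.call(inline_compress_request(history))
  [history.first, { role: "user", content: "Summary of earlier work: #{summary}" }]
end

history = [
  { role: "system",    content: "You are an agent." },
  { role: "user",      content: "Research three competitors." },
  { role: "assistant", content: "Found A, B, and C." }
]

puts shared_prefix_length(history, inline_compress_request(history))
puts shared_prefix_length(history, separate_call_request(history))
```

With this three-message history, the inline request shares all three leading messages with the session, while the separate call shares none.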

4. BYOK, any model, any provider

Any OpenAI-compatible API works out of the box. Run the main task on Claude, sub-tasks on a cheaper model. You own the keys and pick the providers.
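As a rough illustration of per-role routing, the sketch below builds (but does not send) an OpenAI-compatible chat completions request. The role names, model IDs, environment variable names, and the build_request helper are placeholders, and OpenClacky's own configuration format may look different.

```ruby
require "json"

# Hypothetical routing table; model IDs and env var names are placeholders.
PROVIDERS = {
  main: { base_url: "https://openrouter.ai/api/v1", model: "claude-opus-4-7",    key_env: "MAIN_API_KEY" },
  sub:  { base_url: "https://api.openai.com/v1",    model: "your-cheaper-model", key_env: "SUB_API_KEY" }
}.freeze

# Builds the pieces of a POST to an OpenAI-compatible /chat/completions endpoint.
# `env` is injectable so keys can come from ENV or from a test hash.
def build_request(role, messages, providers: PROVIDERS, env: ENV)
  cfg = providers.fetch(role)
  {
    url: "#{cfg[:base_url]}/chat/completions",
    headers: {
      "Authorization" => "Bearer #{env.fetch(cfg[:key_env])}",
      "Content-Type"  => "application/json"
    },
    body: JSON.generate(model: cfg[:model], messages: messages)
  }
end

request = build_request(
  :sub,
  [{ role: "user", content: "Extract the table from this page." }],
  env: { "SUB_API_KEY" => "sk-example" }
)
puts request[:url]
```

Because only the base URL, key, and model name differ between entries, swapping a provider is a one-line change in the table.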

What OpenClacky actually is

It's not just a cost-optimized shell around an LLM. It's a full agent platform:

Web UI + CLI — browser-based interface with session list, conversation view, and artifact preview. Or terminal mode if you prefer.
Skill system — built-in skills for common workflows, community skills installable with one command. Skills self-evolve: after a task completes, the agent evaluates whether the workflow is worth persisting as a reusable skill.
Long-term memory — decisions, preferences, and context persisted to disk, recalled by relevance without polluting the active context window.
Scheduled tasks — describe what you want in natural language, get a working cron job.
Browser automation — drives your real Chrome/Edge (not headless), so it has your sessions, cookies, and you can watch what it's doing.
Three permission levels — from step-by-step confirmation to full autonomy. Destructive operations have guardrails regardless.

Full feature list: openclacky.com/features

About the Ruby rewrite

Gen 1 and Gen 2 were Python. By the end of Gen 2 it was clear that the bottleneck for agents is LLM call orchestration, not language runtime performance. What matters is how cleanly you can express the Session / Cache / Tool layer relationships.

Ruby's DSL and metaprogramming capabilities made the third-generation architecture significantly easier to write and maintain. The session/cache/tool boundaries fell out more naturally. Four months from first commit to working product — and we genuinely enjoyed writing it, which probably matters more than it should.

If you're thinking "but the AI ecosystem is all Python" — fair point. Document parsing (PDF, Excel, Word) still uses Python scripts under the hood. The agent calls them via the terminal tool. Best tool for each job.

Getting started

Desktop installer (easiest): macOS / Windows / Linux. The installers handle dependencies automatically.

CLI:

gem install openclacky
openclacky

Bring your own API key (OpenRouter, Anthropic, OpenAI, or any compatible endpoint). There's also an optional managed key service if you don't want to deal with provider accounts.

Try it, challenge it

The benchmark prompts are public. Run them yourself with your own OpenRouter key and compare the CSV bills.
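If you want to total a bill yourself, something like the sketch below would do. The column names (cost_usd, prompt_tokens, cached_tokens) are hypothetical, so check the header row of your own export and adjust them. The hit-rate formula is one plausible definition (cached prompt tokens over total prompt tokens) and may not match the benchmark's exact calculation.

```ruby
require "csv"

# Column names are hypothetical; adjust them to your export's header row.
def summarize_bill(csv_text)
  rows   = CSV.parse(csv_text, headers: true)
  cost   = rows.sum { |r| r["cost_usd"].to_f }
  prompt = rows.sum { |r| r["prompt_tokens"].to_i }
  cached = rows.sum { |r| r["cached_tokens"].to_i }
  {
    requests: rows.size,
    total_cost: cost.round(2),
    # Assumed definition: share of prompt tokens served from cache.
    cache_hit_rate: prompt.zero? ? 0.0 : (cached.to_f / prompt).round(3)
  }
end

# Made-up sample rows, only to show the shape of the input.
sample = <<~ROWS
  cost_usd,prompt_tokens,cached_tokens
  0.12,10000,9000
  0.08,12000,11000
ROWS

puts summarize_bill(sample)
```

Run it once per agent's export and compare the resulting request counts, totals, and hit rates side by side.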

If you get cheaper results, we genuinely want to know how — open a PR. If you get more expensive results, open an issue and we'll dig into it. We're not claiming perfection; Claude Code still beats us on Task 2 and has a higher cache hit rate overall. But we think the cost/capability trade-off is honest, and the code is right there for you to verify.

GitHub: github.com/clacky-ai/openclacky
Benchmark: openclacky.com/benchmark