Same model. Different results. — AgentKit Benchmark + OpenCode Integration

AJAY SABLE — Thu, 09 Apr 2026 09:59:51 +0000

We open-sourced AgentKit two weeks ago with zero guarantees anyone would care.

400+ clones later — we're shipping the biggest update yet. And we have benchmark data to back it up.

Quick note: AgentKit Preview is our closed, in-development intelligence layer. The fully open-source AgentKit is live and ready to use today at github.com/Ajaysable123/AgentKit — npx agentkit-ai@latest init gets you running in seconds.

Live Benchmark — Gemma 4 31b · Same Model · Same Task

Both runs used Gemma 4 31b via OpenCode on the same task. The only difference was whether AgentKit Preview's workflow enforcement, skill injection, and plan gates were enabled.

Benchmark	Vanilla OpenCode	+ AgentKit Preview
Structured planning before coding	0%	100%
Plan approved before first edit	—	✅ Yes (40.6s review)
Task interruptions	1x	0x
Task completion	20% (scaffolding only)	80% (DER parser implemented)
Hard problem solved	❌ No	✅ Yes

Without AgentKit: Gemma 4 31b gave up on the hard part and shipped placeholder strings ([ASN.1 Decoding Required]). No plan, no verification, and one interruption.

With AgentKit: the same Gemma 4 31b implemented a real custom ASN.1 DER parser, handled both UTCTime and GeneralizedTime, and built the expiration logic, reaching 80% task completion.
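To make the "hard part" concrete, here is a minimal sketch of the time-decoding step. It is written for illustration and is not the code from either benchmark run. It assumes short-form DER lengths and timestamps that end in Z, and the names parse_der_time and is_expired are hypothetical.

```python
from datetime import datetime, timezone

UTC_TIME_TAG = 0x17          # ASN.1 UTCTime: YYMMDDHHMMSSZ
GENERALIZED_TIME_TAG = 0x18  # ASN.1 GeneralizedTime: YYYYMMDDHHMMSSZ


def parse_der_time(data: bytes) -> datetime:
    """Decode one DER-encoded UTCTime or GeneralizedTime (short-form length only)."""
    tag, length = data[0], data[1]
    if length & 0x80:
        raise ValueError("long-form lengths are not handled in this sketch")
    if len(data) < 2 + length:
        raise ValueError("truncated value")
    text = data[2:2 + length].decode("ascii")
    if not text.endswith("Z"):
        raise ValueError("expected a UTC time ending in 'Z'")

    if tag == UTC_TIME_TAG:
        # RFC 5280 pivot: two-digit years 50-99 mean 19xx, 00-49 mean 20xx.
        century = "19" if int(text[:2]) >= 50 else "20"
        text = century + text
    elif tag != GENERALIZED_TIME_TAG:
        raise ValueError(f"unsupported tag 0x{tag:02x}")

    parsed = datetime.strptime(text, "%Y%m%d%H%M%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_expired(not_after: bytes, now: datetime) -> bool:
    """A certificate is expired once `now` is later than its notAfter time."""
    return now > parse_der_time(not_after)
```

The two tags need different handling because UTCTime carries only a two-digit year, which is the detail a placeholder string skips entirely.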

The model didn't get smarter. AgentKit's workflow gates changed its behavior:

Plan gate forced it to think through the DER parsing approach before writing code
Approval step made it commit to solving the hard problem instead of sidestepping it
State machine kept it accountable through RESEARCH → PLAN → EXECUTE → REVIEW

What else just landed

🔌 Native OpenCode Integration

AgentKit now ships a native TUI plugin for OpenCode that lives inside the terminal UI — not just in the system prompt.

Select the agentkit agent from the agent switcher and you get:

Pre-loaded skills injected automatically
Workflow gates (RESEARCH → PLAN → EXECUTE → REVIEW → SHIP)
Mandatory approval dialogs before any code edit
Memory context from previous sessions

🤖 Works With Any Model

The skill router, workflow engine, and marketplace run entirely via CLI — no Claude API required. Tested on Gemma 4 31b, MiniMax M2.5, and Claude.

# Works with any model in OpenCode
agentkit workflow transition RESEARCH
agentkit workflow approve
agentkit workflow transition EXECUTE

Get started

Open-source AgentKit (free — stable & ready to use):

npx agentkit-ai@latest init

👉 github.com/Ajaysable123/AgentKit

AgentKit Preview (closed beta — in active development)

To everyone who cloned, starred, or tried AgentKit — thank you. This is just getting started. 🚀





I studied 5 repos with 200K+ combined stars and built the tool they were all missing

AJAY SABLE — Tue, 31 Mar 2026 07:02:07 +0000

I build agentic AI systems for a living — multi-agent compliance pipelines, document orchestration, RAG-powered assistants. Claude Code is my daily driver.

Last month, my Claude Code bill hit $213.

Not because I was doing anything unusual. Standard development work. But I was burning tokens on skills that weren't relevant to my current task, re-explaining my project architecture every new session, and running Opus for tasks that Haiku could handle fine.

So I spent a few weeks studying the most popular tools trying to solve pieces of this problem:

Repo	⭐ Stars	What it solves	What it doesn't solve
Superpowers	108K	Workflow methodology, TDD, subagent development	No memory, no token optimization, no cost tracking
claude-mem	39.9K	Session memory persistence	No skills, no workflow, no model routing
awesome-claude-code	30.9K	Curates 1,234+ skills	It's a directory — no intelligence, no routing
ruflo	24.8K	Multi-agent swarm orchestration	Complex, heavy, uses 7x tokens
ui-ux-pro-max-skill	~500	Design-specific SKILL.md	Single domain only

The pattern was obvious: everyone built one layer. Nobody built the intelligence layer that ties them together while cutting your costs.

I built that layer. It's called AgentKit.

npx agentkit-ai init

One command. Detects your platform. Installs everything. Starts saving tokens immediately.

What AgentKit actually does

Five layers, each solving a specific problem:

Layer 1: Intelligent Skill Router

This is the single biggest token saver.

The problem: You install 50 skills or dump everything into CLAUDE.md. All of it loads into context on every prompt — even when you're debugging Python and your React, Docker, and GraphQL skills are just sitting there burning tokens.

The fix: A 3-tier classifier that runs on every prompt:

# Tier 1: Keyword regex (instant, free)
# Tier 2: Heuristic scoring (instant, free)  
# Tier 3: Haiku fallback for ambiguous prompts (~$0.0003)

"Python AttributeError..."     → debugging    (confidence: 1.00)
"Write jest tests..."          → testing      (confidence: 1.00)
"Add JWT auth to REST API..."  → api-work     (confidence: 0.50)
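A rough sketch of how a three-tier classifier like this could be wired together. The rule tables, the 0.5 threshold, and the llm_fallback callable are all assumptions made for illustration; AgentKit's real patterns and scoring may differ.

```python
import re
from typing import Callable, Optional, Tuple

# Tier 1: strong patterns that settle the category on their own (hypothetical rules).
KEYWORD_RULES = {
    "debugging": re.compile(r"\b(\w+Error|traceback|stack trace)\b", re.IGNORECASE),
    "testing": re.compile(r"\b(jest|pytest|unit tests?)\b", re.IGNORECASE),
}

# Tier 2: weaker hints; confidence is the fraction of a category's hints present.
HEURISTIC_HINTS = {
    "api-work": ("rest", "api", "endpoint", "route", "jwt", "oauth"),
    "frontend": ("react", "css", "component", "layout"),
}

Result = Tuple[str, float]


def classify(
    prompt: str,
    llm_fallback: Optional[Callable[[str], Result]] = None,
    threshold: float = 0.5,
) -> Result:
    """Return (category, confidence), trying the cheapest tier first."""
    # Tier 1: keyword regex.
    for category, pattern in KEYWORD_RULES.items():
        if pattern.search(prompt):
            return category, 1.0

    # Tier 2: heuristic scoring over lowercase word tokens.
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    scores = {
        category: sum(hint in words for hint in hints) / len(hints)
        for category, hints in HEURISTIC_HINTS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] >= threshold:
        return best, scores[best]

    # Tier 3: only ambiguous prompts reach the (paid) model call.
    if llm_fallback is not None:
        return llm_fallback(prompt)
    return "general", 0.0
```

The ordering is the point of the design: the paid tier is reached only when both free tiers fail to produce a confident answer.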

It loads 2-5 relevant skills instead of all of them. And it uses progressive disclosure — skills load at 3 detail levels:

Level 1:  ~50 tokens   (trigger description — always loaded)
Level 2:  ~500 tokens  (core instructions — loaded when task confirmed)
Level 3:  ~2,000 tokens (full references — loaded for complex work)

Result: 45,000 tokens/session → ~5,000 tokens/session. 89% reduction.
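For a feel of where the ~5,000 figure can come from, here is a back-of-the-envelope estimate using the level sizes quoted above. It assumes the levels are cumulative (a Level 2 skill also pays for Level 1), and the function name and example counts are hypothetical.

```python
# Approximate per-skill sizes for each disclosure level, as quoted above.
LEVEL_TOKENS = {1: 50, 2: 500, 3: 2000}


def estimate_skill_tokens(installed: int, confirmed: int, complex_work: int) -> int:
    """Estimate skill tokens in context.

    installed:    skills whose Level 1 trigger description is always loaded
    confirmed:    skills promoted to Level 2 for the current task
    complex_work: confirmed skills that also load Level 3 references
    """
    if not 0 <= complex_work <= confirmed <= installed:
        raise ValueError("expected 0 <= complex_work <= confirmed <= installed")
    return (
        installed * LEVEL_TOKENS[1]
        + confirmed * LEVEL_TOKENS[2]
        + complex_work * LEVEL_TOKENS[3]
    )
```

With 50 installed skills and 3 confirmed for the task, this gives 4,000 tokens; loading full references for one of them raises it to 6,000, which brackets the ~5,000 per session reported above.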

Plus a forced-eval hook that bumps skill activation from 20% to 84%:

# hooks/forced_eval.sh — PreToolUse hook
LOADED_SKILLS="${AGENTKIT_LOADED_SKILLS:-}"
if [ -z "$LOADED_SKILLS" ]; then
  exit 0
fi
echo "SKILL_EVAL: Before proceeding, check if any active skill applies: ${LOADED_SKILLS}"

This one hook is probably worth the entire install.

Layer 2: Project Memory Graph

The problem: Claude forgets everything between sessions. Every morning you re-explain your architecture, re-discover API patterns, re-establish conventions.

The fix: A SQLite knowledge graph that automatically captures entities from your coding session:

-- memory/schema.sql
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,  -- file, function, api_route, package, command
    context TEXT,
    confidence REAL DEFAULT 1.0,
    session_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE decisions (
    id INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    rationale TEXT,
    session_id TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- FTS5 for fast full-text search
CREATE VIRTUAL TABLE entities_fts USING fts5(name, context, content=entities);

At session end, it generates a Haiku-compressed handoff (~$0.0015). Next session, it injects only the memory nodes relevant to your current task — not everything.
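The capture-then-recall loop can be sketched with Python's built-in sqlite3 module. This is an illustration against a trimmed copy of the schema above, not AgentKit's own code: remember and recall are hypothetical helpers, and it assumes your SQLite build includes FTS5.

```python
import sqlite3

# Trimmed version of the schema above; content_rowid ties the index to entities.id.
SCHEMA = """
CREATE TABLE entities (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    type TEXT NOT NULL,
    context TEXT,
    session_id TEXT
);
CREATE VIRTUAL TABLE entities_fts USING fts5(
    name, context, content='entities', content_rowid='id'
);
"""


def open_memory(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def remember(conn, name: str, type_: str, context: str, session_id: str) -> int:
    """Store an entity and index it for full-text search."""
    cur = conn.execute(
        "INSERT INTO entities (name, type, context, session_id) VALUES (?, ?, ?, ?)",
        (name, type_, context, session_id),
    )
    # External-content FTS5 tables are not updated automatically,
    # so the new row is added to the index by hand.
    conn.execute(
        "INSERT INTO entities_fts (rowid, name, context) VALUES (?, ?, ?)",
        (cur.lastrowid, name, context),
    )
    return cur.lastrowid


def recall(conn, query: str, limit: int = 5):
    """Return only the entities that match the current task, best match first."""
    rows = conn.execute(
        "SELECT e.name, e.type, e.context "
        "FROM entities_fts JOIN entities AS e ON e.id = entities_fts.rowid "
        "WHERE entities_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

Querying by the words in the current prompt is what keeps the injected context small: rows that do not match never enter the session.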

Result: 10,000 tokens of context → ~2,000 tokens. 80% reduction. And your agent actually knows what happened yesterday.

Layer 3: Token Budget Intelligence

Three automatic optimizations that stack:

Auto model routing:

def route_model(prompt: str, is_subagent: bool = False) -> str:
    if is_subagent:
        return "claude-haiku-4-5"          # Always cheapest for subagents

    if has_complex_signals(prompt):         # "architect", "security audit", etc.
        return "claude-opus-4-6"           # $15/M tokens — only when needed

    if has_simple_signals(prompt):          # "find", "list", "rename", etc.
        return "claude-haiku-4-5"          # $0.25/M tokens

    return "claude-sonnet-4-6"             # $3/M — the 80% default
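The snippet above leaves has_complex_signals and has_simple_signals undefined. One plausible way to fill them in, so the router runs end to end, is plain keyword matching. The keyword lists below contain only the examples from the comments above, and the matching strategy is an assumption, not AgentKit's actual implementation.

```python
import re

# Keyword lists taken from the examples in the comments above; a real list would be longer.
COMPLEX_SIGNALS = ("architect", "security audit")
SIMPLE_SIGNALS = ("find", "list", "rename")


def has_complex_signals(prompt: str) -> bool:
    # Substring match so "architect" also catches "architecture".
    text = prompt.lower()
    return any(signal in text for signal in COMPLEX_SIGNALS)


def has_simple_signals(prompt: str) -> bool:
    # Whole-word match so "list" does not fire on "listener".
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return any(signal in words for signal in SIMPLE_SIGNALS)


def route_model(prompt: str, is_subagent: bool = False) -> str:
    if is_subagent:
        return "claude-haiku-4-5"
    if has_complex_signals(prompt):
        return "claude-opus-4-6"
    if has_simple_signals(prompt):
        return "claude-haiku-4-5"
    return "claude-sonnet-4-6"
```

Checking the complex signals first matters: a prompt like "find the flaw in this architecture" should go to the larger model even though it contains a simple keyword.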

Thinking budget tuning:

Trivial tasks (file search, formatting):  0 thinking tokens     → saves $0.48/request
Moderate tasks (bug fixes, features):     8,192 thinking tokens → saves $0.36/request
Complex tasks (architecture, security):   32,000 thinking tokens → full power
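The per-request savings above can be reproduced with simple arithmetic, assuming thinking tokens are billed at $15 per million and that an untuned request would spend the full 32,000-token budget. Both are assumptions chosen because they match the quoted figures; the function name is hypothetical.

```python
FULL_BUDGET_TOKENS = 32_000       # budget used for complex tasks
PRICE_PER_MILLION_USD = 15.00     # assumed price per million thinking tokens


def savings_per_request(budget_tokens: int) -> float:
    """Dollars saved versus always allowing the full thinking budget."""
    unused = FULL_BUDGET_TOKENS - budget_tokens
    return round(unused * PRICE_PER_MILLION_USD / 1_000_000, 2)
```

Under those assumptions, a 0-token budget saves $0.48 per request and an 8,192-token budget saves $0.36. These are upper bounds, since real requests rarely consume the whole budget.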

Real-time cost dashboard:

💰 $0.034 | 🧠 Sonnet | ⚡ 12,450 tok | 📈 saved 32% vs baseline

Combined result: ~$200/mo → ~$60/mo. 70% cost reduction.

Layer 4: Workflow Engine

The problem: AI agents jump straight to coding. No research. No plan. Then you spend 3 hours debugging something that a 5-minute plan would have prevented.

The fix: An enforced state machine:

IDLE → RESEARCH → PLAN → EXECUTE → REVIEW → SHIP

The plan gate hook literally blocks Edit/Write operations until a plan exists:

# hooks/plan_gate.sh — PreToolUse hook
# Blocks Edit/Write tools if workflow state is not PLAN or EXECUTE
TOOL="$1"
STATE=$(python3 "${AGENTKIT_HOME}/workflow/enforcer.py" --action check)

if [[ "$TOOL" =~ ^(Edit|Write) ]] && [[ "$STATE" != "PLAN" ]] && [[ "$STATE" != "EXECUTE" ]]; then
    echo "BLOCK: Cannot edit files without an approved plan. Run research first."
    exit 1
fi
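The hook delegates the state check to enforcer.py, which is not shown here. As a rough idea of what such an enforcer could track, the sketch below models the same states and the same edit rule in Python. The Workflow class and its method names are hypothetical, and the strict one-step-forward rule is an assumption.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    RESEARCH = "RESEARCH"
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    REVIEW = "REVIEW"
    SHIP = "SHIP"


ORDER = list(State)  # Enum iteration preserves definition order


class Workflow:
    def __init__(self) -> None:
        self.state = State.IDLE
        self.plan_approved = False

    def approve_plan(self) -> None:
        if self.state is not State.PLAN:
            raise ValueError("a plan can only be approved in the PLAN state")
        self.plan_approved = True

    def transition(self, target: State) -> None:
        """Move exactly one step forward; EXECUTE also needs an approved plan."""
        if ORDER.index(target) != ORDER.index(self.state) + 1:
            raise ValueError(f"cannot go from {self.state.value} to {target.value}")
        if target is State.EXECUTE and not self.plan_approved:
            raise PermissionError("plan has not been approved")
        self.state = target

    def may_use(self, tool: str) -> bool:
        """Mirror the hook: Edit/Write tools are allowed only in PLAN or EXECUTE."""
        if tool.startswith(("Edit", "Write")):
            return self.state in (State.PLAN, State.EXECUTE)
        return True
```

Read-only tools stay available in every state, so the agent can still research freely; only file modifications are gated.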

Quality gates run after every edit — syntax, lint, type checks, tests. Five languages supported: Python, TypeScript, JavaScript, Go, Rust.

Layer 5: Universal Platform Layer

One SKILL.md file → 10 platforms:

Claude Code   → Native SKILL.md + full hooks
Cursor        → .mdc rules in .cursor/rules/
Codex CLI     → AGENTS.md sections
Gemini CLI    → .gemini/GEMINI.md
Antigravity   → Plugin YAML
OpenCode      → Config JSON system prompt
Windsurf      → Cascade rules
Aider         → .aider.conf.yml
Kilo Code     → Plugin YAML
Augment       → Context file

npx agentkit-ai init detects which platforms you have installed and configures the right format for each. Zero manual conversion.

The numbers

Everything above has been smoke-tested with real prompts:

What	Before AgentKit	After AgentKit	Change
Tokens per session (skills)	~45,000	~5,000	89% ↓
Memory context tokens	~10,000	~2,000	80% ↓
Monthly cost	~$200	~$60	70% ↓
Skill activation rate	20%	84%	4.2x ↑
Platforms supported	1	10	10x
Can skip planning	Always	Never	Enforced

How it works with existing tools

AgentKit doesn't replace Superpowers or claude-mem — it complements them:

With Superpowers: AgentKit adds the memory, token optimization, and model routing that Superpowers doesn't have. Use Superpowers for methodology + AgentKit for intelligence.
With claude-mem: AgentKit's memory graph is more structured (entities + relationships + decisions vs flat text), but they solve the same core problem. Use whichever fits your workflow.
With Ruflo swarms: AgentKit can optimize Ruflo swarm costs by routing worker agents to Haiku and loading only relevant skills per agent. (Phase 3 roadmap.)

Try it

# One command install
npx agentkit-ai init

# Check what's running
npx agentkit-ai status

# See your savings
npx agentkit-ai costs

GitHub: github.com/Ajaysable123/AgentKit

npm: npm install -g agentkit-ai

MIT licensed. 16 open issues are tagged "good first issue" if you want to contribute. Our first external contributor submitted 4 new skills via PR within 48 hours of launch.

If it saves you money, star it ⭐. If something breaks, open an issue. PRs welcome — especially skills for languages and frameworks I haven't covered yet.

I'm Ajay — a Senior Gen AI Developer building agentic systems in production for FinTech and Logistics clients. I built AgentKit because I was tired of paying $200/month for Claude Code when 70% of those tokens were wasted. Follow me on GitHub or LinkedIn for updates on AgentKit and agentic AI development.

DEV Community: AJAY SABLE