I build agentic AI systems for a living — multi-agent compliance pipelines, document orchestration, RAG-powered assistants. Claude Code is my daily driver.
Last month, my Claude Code bill hit $213.
Not because I was doing anything unusual. Standard development work. But I was burning tokens on skills that weren't relevant to my current task, re-explaining my project architecture every new session, and running Opus for tasks that Haiku could handle fine.
So I spent a few weeks studying the most popular tools, each trying to solve a piece of this problem:
| Repo | ⭐ Stars | What it solves | What it doesn't |
|---|---|---|---|
| Superpowers | 108K | Workflow methodology, TDD, subagent development | No memory, no token optimization, no cost tracking |
| claude-mem | 39.9K | Session memory persistence | No skills, no workflow, no model routing |
| awesome-claude-code | 30.9K | Curates 1,234+ skills | It's a directory — no intelligence, no routing |
| ruflo | 24.8K | Multi-agent swarm orchestration | Complex, heavy, uses 7x tokens |
| ui-ux-pro-max-skill | ~500 | Design-specific SKILL.md | Single domain only |
The pattern was obvious: everyone built one layer. Nobody built the intelligence layer that ties them together while cutting your costs.
I built that layer. It's called AgentKit.
```bash
npx agentkit init
```
One command. Detects your platform. Installs everything. Starts saving tokens immediately.
## What AgentKit actually does
Five layers, each solving a specific problem:
### Layer 1: Intelligent Skill Router
This is the single biggest token saver.
The problem: You install 50 skills or dump everything into CLAUDE.md. All of it loads into context on every prompt — even when you're debugging Python and your React, Docker, and GraphQL skills are just sitting there burning tokens.
The fix: A 3-tier classifier that runs on every prompt:
```text
# Tier 1: Keyword regex (instant, free)
# Tier 2: Heuristic scoring (instant, free)
# Tier 3: Haiku fallback for ambiguous prompts (~$0.0003)

"Python AttributeError..."    → debugging (confidence: 1.00)
"Write jest tests..."         → testing   (confidence: 1.00)
"Add JWT auth to REST API..." → api-work  (confidence: 0.50)
```
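The first two tiers can be sketched in a few lines. This is an illustrative reconstruction, not AgentKit's actual classifier — the category names, regex tables, and the confidence heuristic in `classify` are all assumptions:

```python
import re

# Hypothetical keyword tables; AgentKit's real patterns will differ.
KEYWORDS = {
    "debugging": [r"\btraceback\b", r"\bexception\b", r"attributeerror"],
    "testing":   [r"\bjest\b", r"\bpytest\b", r"\btests?\b"],
    "api-work":  [r"\bjwt\b", r"\brest\b", r"\bendpoint\b"],
}

def classify(prompt: str) -> tuple[str, float]:
    """Return (category, confidence). A low confidence is what would
    trigger the tier-3 Haiku fallback; here we just report it."""
    text = prompt.lower()
    scores = {
        cat: sum(bool(re.search(p, text)) for p in pats)
        for cat, pats in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return "general", 0.0                # tier-3 territory
    return best, min(1.0, scores[best] / 2)  # crude heuristic score
```

The key property is that the common case never touches an LLM: regex and scoring are free, and the paid fallback only fires on genuinely ambiguous prompts.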
It loads 2-5 relevant skills instead of all of them. And it uses progressive disclosure — skills load at 3 detail levels:
- Level 1: ~50 tokens (trigger description — always loaded)
- Level 2: ~500 tokens (core instructions — loaded when task confirmed)
- Level 3: ~2,000 tokens (full references — loaded for complex work)
Result: 45,000 tokens/session → ~5,000 tokens/session. 89% reduction.
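A minimal sketch of what progressive disclosure looks like mechanically, assuming each skill stores its three detail levels as strings (the skill name and text here are invented):

```python
# Illustrative skill store; real skills live in SKILL.md files.
SKILLS = {
    "python-debugging": {
        1: "Trigger: Python errors, tracebacks, pdb.",              # ~50 tok
        2: "Core steps: reproduce, read the traceback bottom-up.",  # ~500 tok
        3: "Full reference: pdb commands, logging recipes, etc.",   # ~2,000 tok
    },
}

def load_skills(relevant: list[str], depth: int) -> str:
    """Assemble context up to the requested detail level (1-3)."""
    parts = []
    for name in relevant:
        for level in range(1, depth + 1):
            parts.append(f"[{name} L{level}] {SKILLS[name][level]}")
    return "\n".join(parts)
```

Level 1 stays resident for every installed skill; levels 2 and 3 are appended only for the 2-5 skills the router actually selected.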
Plus a forced-eval hook that bumps skill activation from 20% to 84%:
```bash
# hooks/forced_eval.sh — PreToolUse hook
LOADED_SKILLS="${AGENTKIT_LOADED_SKILLS:-}"
if [ -z "$LOADED_SKILLS" ]; then
  exit 0
fi
echo "SKILL_EVAL: Before proceeding, check if any active skill applies: ${LOADED_SKILLS}"
```
This one hook is probably worth the entire install.
### Layer 2: Project Memory Graph
The problem: Claude forgets everything between sessions. Every morning you re-explain your architecture, re-discover API patterns, re-establish conventions.
The fix: A SQLite knowledge graph that automatically captures entities from your coding session:
```sql
-- memory/schema.sql
CREATE TABLE entities (
  id INTEGER PRIMARY KEY,
  name TEXT NOT NULL,
  type TEXT NOT NULL,  -- file, function, api_route, package, command
  context TEXT,
  confidence REAL DEFAULT 1.0,
  session_id TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE decisions (
  id INTEGER PRIMARY KEY,
  description TEXT NOT NULL,
  rationale TEXT,
  session_id TEXT,
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- FTS5 for fast full-text search
CREATE VIRTUAL TABLE entities_fts USING fts5(name, context, content=entities);
```
At session end, it generates a Haiku-compressed handoff (~$0.0015). Next session, it injects only the memory nodes relevant to your current task — not everything.
Result: 10,000 tokens of context → ~2,000 tokens. 80% reduction. And your agent actually knows what happened yesterday.
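The "inject only relevant nodes" step comes almost for free with FTS5. Here is a sketch using the same table shapes as the schema above — the rows and the `relevant_memory` helper are invented for illustration:

```python
import sqlite3

# In-memory stand-in for memory/agentkit.db; rows are made-up examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, name TEXT, context TEXT);
CREATE VIRTUAL TABLE entities_fts USING fts5(
    name, context, content=entities, content_rowid=id
);
INSERT INTO entities (name, context) VALUES
    ('auth.py', 'JWT middleware for the REST API'),
    ('Dockerfile', 'multi-stage build for the worker image');
INSERT INTO entities_fts (rowid, name, context)
    SELECT id, name, context FROM entities;
""")

def relevant_memory(task: str, limit: int = 5) -> list[str]:
    # bm25() ranks matches; only the top few nodes reach the next prompt.
    rows = conn.execute(
        "SELECT name FROM entities_fts WHERE entities_fts MATCH ? "
        "ORDER BY bm25(entities_fts) LIMIT ?",
        (task, limit),
    ).fetchall()
    return [r[0] for r in rows]
```

Instead of dumping the whole graph into context, the next session asks the database which handful of entities matter for the task at hand.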
### Layer 3: Token Budget Intelligence
Three automatic optimizations that stack:
Auto model routing:
```python
def route_model(prompt: str, is_subagent: bool = False) -> str:
    if is_subagent:
        return "claude-haiku-4-5"    # Always cheapest for subagents
    if has_complex_signals(prompt):  # "architect", "security audit", etc.
        return "claude-opus-4-6"     # $15/M tokens — only when needed
    if has_simple_signals(prompt):   # "find", "list", "rename", etc.
        return "claude-haiku-4-5"    # $0.25/M tokens
    return "claude-sonnet-4-6"       # $3/M — the 80% default
```
Thinking budget tuning:
- Trivial tasks (file search, formatting): 0 thinking tokens → saves $0.48/request
- Moderate tasks (bug fixes, features): 8,192 thinking tokens → saves $0.36/request
- Complex tasks (architecture, security): 32,000 thinking tokens → full power
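The tuning itself can be a simple signal-to-budget mapping. The budget sizes below come from the tiers above, but the signal words and thresholds are my own illustrative assumptions:

```python
# Budgets mirror the three tiers above; signal lists are hypothetical.
BUDGETS = {"trivial": 0, "moderate": 8_192, "complex": 32_000}

COMPLEX_SIGNALS = ("architect", "security audit", "design review")
TRIVIAL_SIGNALS = ("find", "list", "rename", "format")

def thinking_budget(prompt: str) -> int:
    p = prompt.lower()
    if any(s in p for s in COMPLEX_SIGNALS):
        return BUDGETS["complex"]   # full extended thinking
    if any(s in p for s in TRIVIAL_SIGNALS):
        return BUDGETS["trivial"]   # skip thinking entirely
    return BUDGETS["moderate"]      # sensible default
```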
Real-time cost dashboard:
```text
💰 $0.034 | 🧠 Sonnet | ⚡ 12,450 tok | 📈 saved 32% vs baseline
```
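The arithmetic behind a dashboard line like that is straightforward. A back-of-envelope version, using the per-million-token prices from the routing snippet above (token counts are illustrative):

```python
# Prices follow the routing snippet's comments; assumed, not canonical.
PRICE_PER_M = {
    "claude-haiku-4-5": 0.25,
    "claude-sonnet-4-6": 3.00,
    "claude-opus-4-6": 15.00,
}

def cost_line(model: str, tokens: int, baseline_tokens: int) -> str:
    """Format a one-line cost summary for the status bar."""
    cost = tokens / 1_000_000 * PRICE_PER_M[model]
    saved = 1 - tokens / baseline_tokens
    return f"${cost:.3f} | {model} | {tokens:,} tok | saved {saved:.0%} vs baseline"
```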
Combined result: ~$200/mo → ~$60/mo. 70% cost reduction.
### Layer 4: Workflow Engine
The problem: AI agents jump straight to coding. No research. No plan. Then you spend 3 hours debugging something that a 5-minute plan would have prevented.
The fix: An enforced state machine:
```text
IDLE → RESEARCH → PLAN → EXECUTE → REVIEW → SHIP
```
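As a sketch, enforcement boils down to a transition table plus a gate check. The table below is inferred from the pipeline, not taken from AgentKit's actual enforcer:

```python
# Hypothetical transition table inferred from the pipeline above.
TRANSITIONS = {
    "IDLE": {"RESEARCH"},
    "RESEARCH": {"PLAN"},
    "PLAN": {"EXECUTE"},
    "EXECUTE": {"REVIEW"},
    "REVIEW": {"SHIP", "EXECUTE"},  # review can bounce work back
    "SHIP": {"IDLE"},
}

class Workflow:
    def __init__(self) -> None:
        self.state = "IDLE"

    def advance(self, target: str) -> None:
        if target not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

    def can_edit(self) -> bool:
        # Edits are only legal while planning or executing.
        return self.state in {"PLAN", "EXECUTE"}
```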
The plan gate hook literally blocks Edit/Write operations until a plan exists:
```bash
# hooks/plan_gate.sh — PreToolUse hook
# Blocks Edit/Write tools if workflow state is not PLAN or EXECUTE
TOOL="$1"
STATE=$(python3 "${AGENTKIT_HOME}/workflow/enforcer.py" --action check)
if [[ "$TOOL" =~ ^(Edit|Write) ]] && [[ "$STATE" != "PLAN" ]] && [[ "$STATE" != "EXECUTE" ]]; then
  echo "BLOCK: Cannot edit files without an approved plan. Run research first."
  exit 1
fi
```
Quality gates run after every edit — syntax, lint, type checks, tests. Five languages supported: Python, TypeScript, JavaScript, Go, Rust.
### Layer 5: Universal Platform Layer
One SKILL.md file → 10 platforms:
- Claude Code → Native SKILL.md + full hooks
- Cursor → .mdc rules in .cursor/rules/
- Codex CLI → AGENTS.md sections
- Gemini CLI → .gemini/GEMINI.md
- Antigravity → Plugin YAML
- OpenCode → Config JSON system prompt
- Windsurf → Cascade rules
- Aider → .aider.conf.yml
- Kilo Code → Plugin YAML
- Augment → Context file
`npx agentkit init` detects which platforms you have installed and configures the right format for each. Zero manual conversion.
## The numbers
Everything above has been smoke-tested with real prompts:
| What | Before AgentKit | After AgentKit | Change |
|---|---|---|---|
| Tokens per session (skills) | ~45,000 | ~5,000 | 89% ↓ |
| Memory context tokens | ~10,000 | ~2,000 | 80% ↓ |
| Monthly cost | ~$200 | ~$60 | 70% ↓ |
| Skill activation rate | 20% | 84% | 4.2x ↑ |
| Platforms supported | 1 | 10 | 10x |
| Can skip planning | Always | Never | Enforced |
## How it works with existing tools
AgentKit doesn't replace Superpowers or claude-mem — it complements them:
- With Superpowers: AgentKit adds the memory, token optimization, and model routing that Superpowers doesn't have. Use Superpowers for methodology + AgentKit for intelligence.
- With claude-mem: AgentKit's memory graph is more structured (entities + relationships + decisions vs flat text), but they solve the same core problem. Use whichever fits your workflow.
- With Ruflo swarms: AgentKit can optimize Ruflo swarm costs by routing worker agents to Haiku and loading only relevant skills per agent. (Phase 3 roadmap.)
## Try it
```bash
# One-command install
npx agentkit init

# Check what's running
npx agentkit status

# See your savings
npx agentkit costs
```
GitHub: github.com/Ajaysable123/AgentKit
npm: npm install -g agentkit-ai
MIT licensed. There are 16 open issues tagged "good first issue" if you want to contribute. Our first external contributor submitted a PR with 4 new skills within 48 hours of launch.
If it saves you money, star it ⭐. If something breaks, open an issue. PRs welcome — especially skills for languages and frameworks I haven't covered yet.
I'm Ajay — a Senior Gen AI Developer building agentic systems in production for FinTech and Logistics clients. I built AgentKit because I was tired of paying $200/month for Claude Code when 70% of those tokens were wasted. Follow me on GitHub or LinkedIn for updates on AgentKit and agentic AI development.