AJAY SABLE

Same model. Different results. — AgentKit Benchmark + OpenCode Integration

We open-sourced AgentKit two weeks ago with zero guarantees anyone would care.

400+ clones later — we're shipping the biggest update yet. And we have benchmark data to back it up.

Quick note: AgentKit Preview is our closed, in-development intelligence layer. The fully open-source AgentKit is live and ready to use today at github.com/Ajaysable123/AgentKit. Running `npx agentkit-ai@latest init` gets you started in seconds.


Live Benchmark — Gemma 4 31b · Same Model · Same Task

Both runs used Gemma 4 31b via OpenCode. The only difference between them was AgentKit Preview, which adds workflow enforcement, skill injection, and plan gates.

| Benchmark | Vanilla OpenCode | + AgentKit Preview |
| --- | --- | --- |
| Structured planning before coding | 0% | 100% |
| Plan approved before first edit | | ✅ Yes (40.6s review) |
| Task interruptions | 1x | 0x |
| Task completion | 20% (scaffolding only) | 80% (DER parser implemented) |
| Hard problem solved | ❌ No | ✅ Yes |

Without AgentKit — Gemma 4 31b gave up on the hard part and shipped placeholder strings ([ASN.1 Decoding Required]). No plan, no verification, interrupted once.

With AgentKit — Same Gemma 4 31b implemented a real custom ASN.1 DER parser, handled both UTCTime and GeneralizedTime, built expiration logic. Completed the task properly.

The model didn't get smarter. AgentKit's workflow gates changed its behavior:

  • Plan gate forced it to think through the DER parsing approach before writing code
  • Approval step made it commit to solving the hard problem instead of sidestepping it
  • State machine kept it accountable through RESEARCH → PLAN → EXECUTE → REVIEW
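To make the gate idea concrete, here's a toy state machine in Python. It's an illustration only, not AgentKit's implementation: the `Workflow` class, `GateError`, and the method names are made up for this sketch, and it assumes the single rule described above, that EXECUTE is unreachable until a plan has been approved.

```python
from enum import Enum


class Phase(Enum):
    RESEARCH = "RESEARCH"
    PLAN = "PLAN"
    EXECUTE = "EXECUTE"
    REVIEW = "REVIEW"


# Phases must be visited in this order; no skipping ahead.
ORDER = [Phase.RESEARCH, Phase.PLAN, Phase.EXECUTE, Phase.REVIEW]


class GateError(RuntimeError):
    """Raised when an action is attempted before its gate has been passed."""


class Workflow:
    def __init__(self) -> None:
        self.phase = Phase.RESEARCH
        self.plan_approved = False

    def approve_plan(self) -> None:
        # Approval only makes sense while a plan is on the table.
        if self.phase is not Phase.PLAN:
            raise GateError("a plan can only be approved during PLAN")
        self.plan_approved = True

    def advance(self) -> Phase:
        index = ORDER.index(self.phase)
        if index == len(ORDER) - 1:
            raise GateError("already in the final phase")
        next_phase = ORDER[index + 1]
        # The plan gate: no coding phase without an approved plan.
        if next_phase is Phase.EXECUTE and not self.plan_approved:
            raise GateError("plan must be approved before EXECUTE")
        self.phase = next_phase
        return next_phase

    def edit_file(self, path: str) -> str:
        # Edits are refused outside EXECUTE, so the model cannot jump to code.
        if self.phase is not Phase.EXECUTE:
            raise GateError(f"edits are blocked during {self.phase.value}")
        return f"edited {path}"
```

With this shape, calling `edit_file` straight after construction raises `GateError`. The only path to an edit runs through `advance()`, `approve_plan()`, and `advance()` again.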

What else just landed

🔌 Native OpenCode Integration


AgentKit now ships a native TUI plugin for OpenCode that lives inside the terminal UI — not just in the system prompt.

Select the agentkit agent from the agent switcher and you get:

  • Pre-loaded skills injected automatically
  • Workflow gates (RESEARCH → PLAN → EXECUTE → REVIEW → SHIP)
  • Mandatory approval dialogs before any code edit
  • Memory context from previous sessions

```shell
npx agentkit-preview@latest init
# then open OpenCode → tab → agents → select agentkit ⚡
```

🤖 Works With Any Model

The skill router, workflow engine, and marketplace run entirely via CLI — no Claude API required. Tested on Gemma 4 31b, MiniMax M2.5, and Claude.

```shell
# Works with any model in OpenCode
agentkit workflow transition RESEARCH
agentkit workflow approve
agentkit workflow transition EXECUTE
```

Get started

Open-source AgentKit (free — stable & ready to use):

```shell
npx agentkit-ai@latest init
```

👉 github.com/Ajaysable123/AgentKit

AgentKit Preview (closed beta — in active development):

```shell
npx agentkit-preview@latest init
```

To everyone who cloned, starred, or tried AgentKit — thank you. This is just getting started. 🚀



