I'm a big fan of my current agentic coding stack:
- OpenCode for harnessing basics, skills, and integrations
- OhMyOpenagent for superpower-level agents (Sisyphus in particular is astonishingly good)
- OpenRouter for managing token spend and controlling which models drive specific agents and/or categories
But you're probably familiar with the way Claude token rates exploded a month or two ago, not to mention all the other Anthropic drama. Sisyphus is borderline hard-coded to optimize for Opus, but with some modest configuration adjustments you can route through nearly anything OpenRouter supports. So, I've been poking around a lot of different options to see just how well various models can function as drop-in replacements for the primary orchestrator.
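For reference, the override itself is small: in OpenCode the orchestrator model is just a provider-prefixed string in `opencode.json`. The snippet below is a sketch of my current setup; the `$schema` URL and key names may drift across versions, so check the docs before copying it.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "openrouter/x-ai/grok-4.20-beta"
}
```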
And it's been a really interesting adventure! Lots of fun surprises. There's also a corollary to this experiment: which locally-hosted models (having purchased an RTX 5060ti w/ 16GB VRAM for my birthday back in March) best complement the high-power orchestrators for smaller-scoped tasks like code generation, basic research, and quick, low-stakes actions that can iterate against adjacent skills very quickly. But we'll save that topic for another article!
So, without any further ado, here are my notes. This represents the experiments I've run up through today (May 14th) and is definitely subjective. Your mileage may vary, but I think this could still be pretty interesting!
TL;DR: Top Three Overall
- x-ai/grok-4.20-beta (Runner-Ups) — Strongest alternate found: good orchestration, tooling, delegation, hooks; fresh training data and solid completion tracking; good potential split-role strategies
- minimax/minimax-m2.7 (Black Sheep) — Good across duration, token value, and integration with no major convergence issues; one of the more pleasant surprises
- poolside/laguna-m.1 (Black Sheep) — Best value discovery: free and far better than expected ("where did this come from?"). Meh on duration/advanced, but Good on token value and integration
Conditions, Categories, and Rankings
I focused (though not exclusively) on OpenRouter-hosted models filtered for specific conditions:
- Sizable token context (at least 128k)
- Tags under the "programming" category
- Must have support for tool integration
- Ordered by most recent / most popular (I switch between the two on a regular basis in an effort not to miss anything particularly interesting)
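These filters can be applied programmatically against OpenRouter's public model listing (`GET https://openrouter.ai/api/v1/models`). A minimal sketch, assuming the field names the endpoint currently returns (`context_length`, `supported_parameters`); the sample entries are made up for an offline demo, so verify against the live API before relying on this:

```python
def shortlist(models, min_context=128_000):
    """Return model IDs with a sizable context window and tool-calling support."""
    return [
        m["id"]
        for m in models
        # a model qualifies only if its context meets the floor...
        if m.get("context_length", 0) >= min_context
        # ...and it advertises tool-calling support
        and "tools" in m.get("supported_parameters", [])
    ]

# Offline demo with made-up entries standing in for the live listing:
sample = [
    {"id": "x-ai/grok-4.20-beta", "context_length": 256_000,
     "supported_parameters": ["tools", "temperature"]},
    {"id": "tiny/chat-8k", "context_length": 8_192,
     "supported_parameters": ["temperature"]},
]
print(shortlist(sample))  # only the large, tool-capable model survives
```

In practice you'd fetch the real listing with any HTTP client and feed `data` from the JSON response into `shortlist`.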
One interesting thing that emerged: there are very clearly a few recurring categories. I would sort most of these models into one of the following bins, which I've used to organize the rest of the article (though some of these allocations were poorly informed and may be hit-or-miss):
- "AAA" (from one of the big three: OpenAI, Anthropic, or Google; these tend to be heavy and expensive but at least 3-5 months ahead of the others)
- "Runner-Ups" (these are typically still from big organizations that are spending a lot of capital on keeping up with the AAA options, therefore they are a good value and only slightly behind; you'll find a lot of discounted preview models here, and despite the name of this "bin" they can be very good options that you should take seriously)
- "Distillations/Heretics" (a lot of the Chinese models go here; they tend to be smaller and more performant, but unless you're running inference against a hosted version--in which case there's an extra layer of overhead to pay--they're probably best if your project isn't related to sensitive political topics)
- "Black Sheep" (there are some interesting proprietary lab models that are heavily discounted to encourage adoption; effective but lesser-known options live here and there are some really interesting, surprising, and high-value discoveries to be made)
- "Open Weight" (these are options you could probably find on HuggingFace and would even be able to run locally--if you had enough VRAM for half a trillion parameters and/or million-token contexts; obviously, these are a notch or two behind pretty much everything else and included mainly for normalization/comparison purposes, even though there are thousands more we could evaluate)
We're going to focus on four specific (though still subjective) considerations, or ranking dimensions; these were mostly evaluated within the scope of the same project but across a wide variety of prompts:
- Orchestration duration (if the context window is too small or its tokens aren't used efficiently, you will notice a fall-off in effectiveness and may need to restart your sessions more regularly)
- Token value (not just dollars per token but how efficiently those tokens are used, cached, and iterated--orchestrators can be very verbose!)
- Integration efficacy (some smart models still have trouble leveraging tools or child models/agents/categories in the right way, which can even lead to some circular or drifting orchestration work)
- Advanced effectiveness (some models are great but still struggle at the orchestration level with more sophisticated codebases, architecture/design problems, over-iterating, or more recent/obscure--or maybe less popular is the right term--coding problems; some models can orchestrate a basic React app fine but completely fall apart if you take a crack at a 100k SLOC Zig project)
The ratings below are intentionally coarse: Good means the notes support using it for that dimension; Meh means usable or interesting with real caveats; Bad means that dimension currently blocks or materially degrades orchestration. Within each bin, entries are roughly ordered from most promising to least usable as a primary orchestrator.
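For sorting my own notes I ended up encoding that coarse scale numerically. A minimal sketch, using a small excerpt of the ratings from the tables below; the scoring weights are my own arbitrary choice, not anything principled:

```python
# Map the coarse scale to numbers so models can be compared and sorted.
SCALE = {"Good": 2, "Meh": 1, "Bad": 0}

# Excerpt of the per-dimension ratings from the summary table below.
ratings = {
    "x-ai/grok-4.20-beta":  {"duration": "Good", "token": "Meh",  "integration": "Good", "advanced": "Good"},
    "minimax/minimax-m2.7": {"duration": "Good", "token": "Good", "integration": "Good", "advanced": "Meh"},
    "z-ai/glm-5.1":         {"duration": "Bad",  "token": "Bad",  "integration": "Meh",  "advanced": "Meh"},
}

def candidates(ratings):
    """Keep models with no Bad dimension, ordered by total score (ties keep insertion order)."""
    ok = {m: r for m, r in ratings.items() if "Bad" not in r.values()}
    return sorted(ok, key=lambda m: -sum(SCALE[v] for v in ok[m].values()))

print(candidates(ratings))  # glm-5.1 is filtered out by its Bad dimensions
```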
AAA
google/gemini-3.1-pro-preview
- Orchestration duration: Meh. Takes longer to work through problems, but seems able to keep up
- Token value: Good. I'd describe it as reasonably cost-effective for a frontier-ish model
- Integration efficacy: Meh. Initial routing problems appear mostly addressed, but still friction
- Advanced effectiveness: Good. Reasonably effective even on modern/complicated problems
- Comments: Gating issue seems to have been harness/routing compatibility more than model capability
google/gemini-3-flash-preview
- Orchestration duration: Meh. Okay for modest task orchestration if left to chew, but needs nudges
- Token value: Good. Good rate and decent pricing
- Integration efficacy: Good. No significant tooling issues; recognizes appropriate delegation
- Advanced effectiveness: Meh. Unique/interesting reasoning, but not reliable enough for heavyweight work
- Comments: A strong signpost for the next Gemini Pro rather than a fully satisfying primary orchestrator
openai/gpt-5.4
- Orchestration duration: Bad. Far too slow for responsive orchestration, though reasonably strong
- Token value: Bad. Not cheap, and slow orchestration makes the spend harder to justify
- Integration efficacy: Meh. No specific tool failure, but latency prevents it from feeling useful
- Advanced effectiveness: Good. Presumed brilliant enough for hard work, but practicality is the blocker
- Comments: Capability is not the concern; usable throughput is
openai/gpt-5.5
- Orchestration duration: Bad. Not meaningfully tested with Sisyphus; the harness pushed back
- Token value: Meh. No clear value read because the model was effectively skipped for this role
- Integration efficacy: Bad. OMO gets upset when GPT models are used for roles other than Hephaestus
- Advanced effectiveness: Meh. Likely capable, but unproven here because of integration constraints
- Comments: This may be a role-fit issue rather than a model-quality issue; listen to the harness
Runner-Ups
x-ai/grok-4.20-beta
- Orchestration duration: Good. One of the best alternates; strong subtask tracking, stays on point
- Token value: Meh. Significantly cheaper than Opus, but spendier / less efficient than alternatives
- Integration efficacy: Good. Unusually strong with tooling, delegation, hooks, LSP usage, toolchain
- Advanced effectiveness: Good. Fresh cutoff, strong orchestration behavior, and reliable completion
- Comments: Illuminating split-role strategies here: pair 4.20-beta orchestration with 4.1-fast as junior; good stuff, underrated
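The split-role idea can be expressed in the harness config by giving the junior agent its own model override. The agent-level keys below are a sketch modeled on OpenCode's per-agent configuration; the agent name "junior" is a placeholder, not a verified OMO role name:

```json
{
  "model": "openrouter/x-ai/grok-4.20-beta",
  "agent": {
    "junior": {
      "model": "openrouter/x-ai/grok-4.1-fast"
    }
  }
}
```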
x-ai/grok-4.3
- Orchestration duration: Meh. A little uncertain at startup, but once it gets going it can move
- Token value: Meh. Unclear how the value holds up once delegation and child-model costs are included
- Integration efficacy: Meh. Tool integration is initially uncertain, though not necessarily broken
- Advanced effectiveness: Good. Once it cranks, it appears to handle substantive tasks very well
- Comments: Promising, but not clear whether delegation remains cost-effective and stable
x-ai/grok-4.1-fast
- Orchestration duration: Meh. Token generation is quick, but simple tasks can trigger iteration
- Token value: Meh. Speed helps, but wasted loops and restarts erode the value
- Integration efficacy: Meh. Tool integration exists, but it is awkward
- Advanced effectiveness: Bad. Problem-solving is too cautious and brittle for primary orchestration
- Comments: Still potentially useful as a lower-level worker if paired with a stronger orchestrator
Distillations/Heretics
deepseek/deepseek-v3.2
- Orchestration duration: Meh. Earlier runs felt slow and prone to loops; expect provider variance
- Token value: Meh. Cost is described as merely okay, and slow convergence can eat into value
- Integration efficacy: Meh. Requires little hand-holding and has solid planning, but the looping is bad
- Advanced effectiveness: Good. One of the better options so far when it is behaving
- Comments: This model's evaluation improved substantially after more testing; "promising but variable"
moonshotai/kimi-k2.6
- Orchestration duration: Bad. Limited context means it can handle only a few tasks before falling off
- Token value: Meh. Better than expected, but restart pressure hurts value
- Integration efficacy: Meh. Needing manual runtime specification adds friction
- Advanced effectiveness: Good. Did a decent job with architecture refactoring
- Comments: I've heard a lot of good things but was somewhat disappointed; maybe my expectations were too high
moonshotai/kimi-k2.5
- Orchestration duration: Meh. Variable generation rates and slower convergence, but it gets there
- Token value: Meh. Pricing is okay, but provider-load variability makes the value inconsistent
- Integration efficacy: Bad. Codegen integration leaves a lot to be desired, at least on the Zig tests
- Advanced effectiveness: Meh. Thinking not bad, rigor reasonable, but not smooth enough for orchestration
- Comments: Interesting to see the difference against its successor, which in some ways wasn't as good
z-ai/glm-5.1
- Orchestration duration: Bad. Verbosity and circularity shorten the useful session before fallbacks
- Token value: Bad. Circular verbosity burns tokens without enough payoff
- Integration efficacy: Meh. Decent orchestration and introspection, but not enough to feel sticky
- Advanced effectiveness: Meh. Capable of useful reasoning, but orchestration not compelling at this tier
- Comments: I feel like the GLM series models are probably underexplored; this one surprised me
qwen/qwen3.6-plus
- Orchestration duration: Meh. Seems pretty good, but prompt-to-prompt switching disrupts continuity
- Token value: Meh. No strong price/value signal beyond "pretty good"
- Integration efficacy: Bad. OpenCode wants to switch back to Opus each time there is a prompt
- Advanced effectiveness: Meh. Likely competent, but harness friction prevents a stronger conclusion
- Comments: Qwen in general is all over the place; wouldn't be surprised if other releases are better
deepseek/deepseek-v4-pro
- Orchestration duration: Meh. Slow and barely avoids circular behavior, but does get the job done
- Token value: Bad. Not cheap for mediocre efficacy
- Integration efficacy: Meh. No fatal tooling issues, but the orchestration loop is not strong enough
- Advanced effectiveness: Meh. Better than previous DeepSeeks, but not recommendable for orchestration
- Comments: One option that shows you there is a difference between good models and orchestration-quality models
tencent/hy3-preview
- Orchestration duration: Bad. Effectively untested because OMO does not appear to want to load/use it
- Token value: Bad. No practical value if it cannot be loaded into the stack
- Integration efficacy: Bad. Harness compatibility is the blocker
- Advanced effectiveness: Bad. No useful advanced-orchestration because it never gets into the loop
- Comments: Surprising, as I expected better from Tencent--makes me wonder if this was on me
Black Sheep
poolside/laguna-m.1
- Orchestration duration: Meh. Context is only 128k, but it is still surprisingly capable for a freebie
- Token value: Good. Free and far better than expected makes this a standout value discovery
- Integration efficacy: Good. No major tool/harness issues experienced
- Advanced effectiveness: Meh. Not AAA, maybe A-level, and it struggles with design/layout/CSS
- Comments: This is the strongest "where did this come from?" result in my notes
minimax/minimax-m2.7
- Orchestration duration: Good. Takes a little longer, but reliably converges for most orchestration
- Token value: Good. Decent token rates and good enough output quality to make the price attractive
- Integration efficacy: Good. No major convergence or orchestration-integration issues
- Advanced effectiveness: Meh. Broadly competent, but weaker on layout/styling/frontend/design tasks
- Comments: One of the more pleasant surprises: not flashy, but practically useful
inclusionai/ring-2.6-1t
- Orchestration duration: Meh. Moves quickly, but frequent rate limits interrupt even modest work
- Token value: Good. Free for now, which makes passable orchestration inherently interesting
- Integration efficacy: Meh. Initial OpenRouter/OMO friction, a Zen fallback, and uneven, hiccupy tooling
- Advanced effectiveness: Meh. Passable on architecture-oriented tasks, but hesitant, warning-prone
- Comments: Worth monitoring because the raw economics would be excellent without the rate-limit and confidence issues
openrouter/owl-alpha
- Orchestration duration: Bad. No useful orchestration run--the harness does not like supporting it
- Token value: Bad. Unsupported models have no practical token value in this stack
- Integration efficacy: Bad. OMO/OpenCode compatibility seems to be the blocker
- Advanced effectiveness: Bad. No meaningful effectiveness read without a working integration
- Comments: Reminds me of the OpenCode Zen options, nice to know it's there but not a heavy orchestrator
Open Weight
nvidia/nemotron-3-super-120b-a12b
- Orchestration duration: Meh. Comes up to speed pretty quickly for a free model
- Token value: Good. Free makes the initial value attractive
- Integration efficacy: Bad. Errors out when trying to do things, suggesting tool-integration problems
- Advanced effectiveness: Bad. Promising comprehension does not matter if action-taking fails
- Comments: First experiment with an NVIDIA model, and if they're all like this, the last--surprising
gpt-oss 20b
- Orchestration duration: Bad. Reasoning closure issues prevent reliable orchestration
- Token value: Meh. Local economics may be attractive, but the model is not orchestration-grade
- Integration efficacy: Bad. Does not integrate well with tools
- Advanced effectiveness: Bad. Older and definitely below the required orchestration level
- Comments: Really not sure what the proposition for this model is--out of date, mediocre at everything
gemma4 e4b
- Orchestration duration: Bad. Not ready to close/converge on orchestration solutions
- Token value: Bad. Even if cheap/local, poor convergence makes it a bad orchestrator value
- Integration efficacy: Bad. Has tool-integration problems
- Advanced effectiveness: Bad. Not yet at orchestration level
- Comments: Love me some Gemma for local codegen, etc., but obviously not for high/heavy work
In Summary
Here's a quick summary table of where things wound up.
| Model | Category | Duration | Token | Integration | Advanced |
|---|---|---|---|---|---|
| google/gemini-3.1-pro-preview | AAA | Meh | Good | Meh | Good |
| google/gemini-3-flash-preview | AAA | Meh | Good | Good | Meh |
| openai/gpt-5.4 | AAA | Bad | Bad | Meh | Good |
| openai/gpt-5.5 | AAA | Bad | Meh | Bad | Meh |
| x-ai/grok-4.20-beta | Runner-Ups | Good | Meh | Good | Good |
| x-ai/grok-4.3 | Runner-Ups | Meh | Meh | Meh | Good |
| x-ai/grok-4.20 | Runner-Ups | Meh | Meh | Bad | Meh |
| x-ai/grok-4.1-fast | Runner-Ups | Meh | Meh | Meh | Bad |
| deepseek/deepseek-v3.2 | Distillations/Heretics | Meh | Meh | Meh | Good |
| moonshotai/kimi-k2.6 | Distillations/Heretics | Bad | Meh | Meh | Good |
| moonshotai/kimi-k2.5 | Distillations/Heretics | Meh | Meh | Bad | Meh |
| z-ai/glm-5.1 | Distillations/Heretics | Bad | Bad | Meh | Meh |
| qwen/qwen3.6-plus | Distillations/Heretics | Meh | Meh | Bad | Meh |
| deepseek/deepseek-v4-pro | Distillations/Heretics | Meh | Bad | Meh | Meh |
| tencent/hy3-preview | Distillations/Heretics | Bad | Bad | Bad | Bad |
| minimax/minimax-m2.7 | Black Sheep | Good | Good | Good | Meh |
| poolside/laguna-m.1 | Black Sheep | Meh | Good | Good | Meh |
| inclusionai/ring-2.6-1t | Black Sheep | Meh | Good | Meh | Meh |
| openrouter/owl-alpha | Black Sheep | Bad | Bad | Bad | Bad |
| nvidia/nemotron-3-super-120b-a12b | Open Weight | Meh | Good | Bad | Bad |
| gemma4 e4b | Open Weight | Bad | Bad | Bad | Bad |
| gpt-oss 20b | Open Weight | Bad | Meh | Bad | Bad |

