I'm a big fan of my current agentic coding stack:
- OpenCode for harnessing basics, skills, and integrations
- OhMyOpenagent for superpower-level agents (Sisyphus in particular is astonishingly good)
- OpenRouter for managing token spend and controlling which models drive specific agents and/or categories
But you're probably familiar with the way Claude token rates exploded a month or two ago, not to mention all the other Anthropic drama. Sisyphus is borderline hard-coded to optimize for Opus, but with some modest configuration adjustments you can route through nearly anything OpenRouter supports. So, I've been poking around a lot of different options to see just how well various models can function as drop-in replacements for the primary orchestrator.
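For reference, the override itself is small: in OpenCode the orchestrator model is just a provider-prefixed string in `opencode.json`. The snippet below is a sketch of my current setup; the `$schema` URL and key names may drift across versions, so check the docs before copying it.

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "openrouter/x-ai/grok-4.20-beta"
}
```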
And it's been a really interesting adventure! Lots of fun surprises. There's also a corollary to this experiment: which locally-hosted models (having purchased an RTX 5060ti w/ 16GB VRAM for my birthday back in March) best complement the high-power orchestrators for smaller-scoped tasks like code generation, basic research, and quick, low-stakes actions that can iterate against adjacent skills very quickly. But we'll save that topic for another article!
So, without any further ado, here are my notes. This represents the experiments I've run up through today (May 14th) and is definitely subjective. Your mileage may vary, but I think this could still be pretty interesting!
TL;DR: Top Three Overall
- x-ai/grok-4.20-beta (Runner-Ups) — Strongest alternate found: good orchestration, tooling, delegation, hooks; fresh training data and solid completion tracking; good potential split-role strategies
- minimax/minimax-m2.7 (Black Sheep) — Good across duration, token value, and integration with no major convergence issues; one of the more pleasant surprises
- poolside/laguna-m.1 (Black Sheep) — Best value discovery: free and far better than expected ("where did this come from?"). Meh on duration/advanced, but Good on token value and integration
Conditions, Categories, and Rankings
I focused (though not exclusively) on OpenRouter-hosted models filtered for specific conditions:
- Sizable token context (at least 128k)
- Tags under the "programming" category
- Must have support for tool integration
- Ordered by most recent / most popular (I switch between the two on a regular basis in an effort not to miss anything particularly interesting)
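These filters can be applied programmatically against OpenRouter's public model listing (`GET https://openrouter.ai/api/v1/models`). A minimal sketch, assuming the field names the endpoint currently returns (`context_length`, `supported_parameters`); the sample entries are made up for an offline demo, so verify against the live API before relying on this:

```python
def shortlist(models, min_context=128_000):
    """Return model IDs with a sizable context window and tool-calling support."""
    return [
        m["id"]
        for m in models
        # a model qualifies only if its context meets the floor...
        if m.get("context_length", 0) >= min_context
        # ...and it advertises tool-calling support
        and "tools" in m.get("supported_parameters", [])
    ]

# Offline demo with made-up entries standing in for the live listing:
sample = [
    {"id": "x-ai/grok-4.20-beta", "context_length": 256_000,
     "supported_parameters": ["tools", "temperature"]},
    {"id": "tiny/chat-8k", "context_length": 8_192,
     "supported_parameters": ["temperature"]},
]
print(shortlist(sample))  # only the large, tool-capable model survives
```

In practice you'd fetch the real listing with any HTTP client and feed `data` from the JSON response into `shortlist`.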
One interesting thing that emerged: there are very clearly a few recurring categories. I would sort most of these models into one of the following bins, which I've used to organize the rest of the article (though some of these allocations were poorly informed and may be hit-or-miss):
- "AAA" (from one of the big three: OpenAI, Anthropic, or Google; these tend to be heavy and expensive but at least 3-5 months ahead of the others)
- "Runner-Ups" (these are typically still from big organizations that are spending a lot of capital on keeping up with the AAA options, therefore they are a good value and only slightly behind; you'll find a lot of discounted preview models here, and despite the name of this "bin" they can be very good options that you should take seriously)
- "Distillations/Heretics" (a lot of the Chinese models go here; they tend to be smaller and more performant, but unless you're running inference against a hosted version--in which case there's an extra layer of overhead to pay--they're probably best if your project isn't related to sensitive political topics)
- "Black Sheep" (there are some interesting proprietary lab models that are heavily discounted to encourage adoption; effective but lesser-known options live here and there are some really interesting, surprising, and high-value discoveries to be made)
- "Open Weight" (these are options you could probably find on HuggingFace and would even be able to run locally--if you had enough VRAM for half a trillion parameters and/or million-token contexts; obviously, these are a notch or two behind pretty much everything else and included mainly for normalization/comparison purposes, even though there are thousands more we could evaluate)
We're going to focus on four specific (though still subjective) considerations, or ranking dimensions; these were mostly evaluated within the scope of the same project but across a wide variety of prompts:
- Orchestration duration (if the context window is too small or its tokens aren't used efficiently, you will notice a fall-off in effectiveness and may need to restart your sessions more regularly)
- Token value (not just dollars per token but how efficiently those tokens are used, cached, and iterated--orchestrators can be very verbose!)
- Integration efficacy (some smart models still have trouble leveraging tools or child models/agents/categories in the right way, which can even lead to some circular or drifting orchestration work)
- Advanced effectiveness (some models are great but still struggle at the orchestration level with more sophisticated codebases, architecture/design problems, over-iterating, or more recent/obscure--or maybe less popular is the right term--coding problems; some models can orchestrate a basic React app fine but completely fall apart if you take a crack at a 100k SLOC Zig project)
The ratings below are intentionally coarse: Good means the notes support using it for that dimension; Meh means usable or interesting with real caveats; Bad means that dimension currently blocks or materially degrades orchestration. Within each bin, entries are roughly ordered from most promising to least usable as a primary orchestrator.
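For sorting my own notes I ended up encoding that coarse scale numerically. A minimal sketch, using a small excerpt of the ratings from the tables below; the scoring weights are my own arbitrary choice, not anything principled:

```python
# Map the coarse scale to numbers so models can be compared and sorted.
SCALE = {"Good": 2, "Meh": 1, "Bad": 0}

# Excerpt of the per-dimension ratings from the summary table below.
ratings = {
    "x-ai/grok-4.20-beta":  {"duration": "Good", "token": "Meh",  "integration": "Good", "advanced": "Good"},
    "minimax/minimax-m2.7": {"duration": "Good", "token": "Good", "integration": "Good", "advanced": "Meh"},
    "z-ai/glm-5.1":         {"duration": "Bad",  "token": "Bad",  "integration": "Meh",  "advanced": "Meh"},
}

def candidates(ratings):
    """Keep models with no Bad dimension, ordered by total score (ties keep insertion order)."""
    ok = {m: r for m, r in ratings.items() if "Bad" not in r.values()}
    return sorted(ok, key=lambda m: -sum(SCALE[v] for v in ok[m].values()))

print(candidates(ratings))  # glm-5.1 is filtered out by its Bad dimensions
```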
AAA
google/gemini-3.1-pro-preview
- Orchestration duration: Meh. Takes longer to work through problems, but seems able to keep up
- Token value: Good. I'd describe it as reasonably cost-effective for a frontier-ish model
- Integration efficacy: Meh. Initial routing problems appear mostly addressed, but still friction
- Advanced effectiveness: Good. Reasonably effective even on modern/complicated problems
- Comments: Gating issue seems to have been harness/routing compatibility more than model capability
google/gemini-3-flash-preview
- Orchestration duration: Meh. Okay for modest task orchestration if left to chew, but needs nudges
- Token value: Good. Good rate and decent pricing
- Integration efficacy: Good. No significant tooling issues; recognizes appropriate delegation
- Advanced effectiveness: Meh. Unique/interesting reasoning, but not reliable enough for heavyweight work
- Comments: A strong signpost for the next Gemini Pro rather than a fully satisfying primary orchestrator
openai/gpt-5.4
- Orchestration duration: Bad. Far too slow for responsive orchestration, though reasonably strong
- Token value: Bad. Not cheap, and slow orchestration makes the spend harder to justify
- Integration efficacy: Meh. No specific tool failure, but latency prevents it from feeling useful
- Advanced effectiveness: Good. Presumed brilliant enough for hard work, but practicality is the blocker
- Comments: Capability is not the concern; usable throughput is
openai/gpt-5.5
- Orchestration duration: Bad. Not meaningfully tested with Sisyphus; the harness pushed back
- Token value: Meh. No clear value read because the model was effectively skipped for this role
- Integration efficacy: Bad. OMO gets upset when GPT models are used for roles other than Hephaestus
- Advanced effectiveness: Meh. Likely capable, but unproven here because of integration constraints
- Comments: This may be a role-fit issue rather than a model-quality issue; listen to the harness
Runner-Ups
x-ai/grok-4.20-beta
- Orchestration duration: Good. One of the best alternates; strong subtask tracking, stays on point
- Token value: Meh. Significantly cheaper than Opus, but spendier / less efficient than alternatives
- Integration efficacy: Good. Unusually strong with tooling, delegation, hooks, LSP usage, toolchain
- Advanced effectiveness: Good. Fresh cutoff, strong orchestration behavior, and reliable completion
- Comments: Illuminating split-role strategies here: pair 4.20-beta orchestration with 4.1-fast as junior; good stuff, underrated
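The split-role idea can be expressed in the harness config by giving the junior agent its own model override. The agent-level keys below are a sketch modeled on OpenCode's per-agent configuration; the agent name "junior" is a placeholder, not a verified OMO role name:

```json
{
  "model": "openrouter/x-ai/grok-4.20-beta",
  "agent": {
    "junior": {
      "model": "openrouter/x-ai/grok-4.1-fast"
    }
  }
}
```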
x-ai/grok-4.3
- Orchestration duration: Meh. A little uncertain at startup, but once it gets going it can move
- Token value: Meh. Unclear how the value holds up once delegation and child-model costs are included
- Integration efficacy: Meh. Tool integration is initially uncertain, though not necessarily broken
- Advanced effectiveness: Good. Once it cranks, it appears to handle substantive tasks very well
- Comments: Promising, but not clear whether delegation remains cost-effective and stable
x-ai/grok-4.1-fast
- Orchestration duration: Meh. Token generation is quick, but simple tasks can trigger iteration
- Token value: Meh. Speed helps, but wasted loops and restarts erode the value
- Integration efficacy: Meh. Tool integration exists, but it is awkward
- Advanced effectiveness: Bad. Problem-solving is too cautious and brittle for primary orchestration
- Comments: Still potentially useful as a lower-level worker if paired with a stronger orchestrator
Distillations/Heretics
deepseek/deepseek-v3.2
- Orchestration duration: Meh. Earlier runs felt slow and prone to loops; expect provider variance
- Token value: Meh. Cost is described as merely okay, and slow convergence can eat into value
- Integration efficacy: Meh. Requires little hand-holding and has solid planning, but the looping is bad
- Advanced effectiveness: Good. One of the better options so far when it is behaving
- Comments: This model's evaluation improved substantially after more testing; "promising but variable"
moonshotai/kimi-k2.6
- Orchestration duration: Bad. Limited context means it can handle only a few tasks before falling off
- Token value: Meh. Better than expected, but restart pressure hurts value
- Integration efficacy: Meh. Needing manual runtime specification adds friction
- Advanced effectiveness: Good. Did a decent job with architecture refactoring
- Comments: I've heard a lot of good things but was somewhat disappointed; maybe my expectations were too high
moonshotai/kimi-k2.5
- Orchestration duration: Meh. Variable generation rates and slower convergence, but it gets there
- Token value: Meh. Pricing is okay, but provider-load variability makes the value inconsistent
- Integration efficacy: Bad. Codegen integration leaves a lot to be desired, at least on the Zig tests
- Advanced effectiveness: Meh. Thinking not bad, rigor reasonable, but not smooth enough for orchestration
- Comments: Interesting to see the difference against its successor, which in some ways wasn't as good
z-ai/glm-5.1
- Orchestration duration: Bad. Verbosity and circularity shorten the useful session before fallbacks
- Token value: Bad. Circular verbosity burns tokens without enough payoff
- Integration efficacy: Meh. Decent orchestration and introspection, but not enough to feel sticky
- Advanced effectiveness: Meh. Capable of useful reasoning, but orchestration not compelling at this tier
- Comments: I feel like the GLM series models are probably underexplored; this one surprised me
qwen/qwen3.6-plus
- Orchestration duration: Meh. Seems pretty good, but prompt-to-prompt switching disrupts continuity
- Token value: Meh. No strong price/value signal beyond "pretty good"
- Integration efficacy: Bad. OpenCode wants to switch back to Opus each time there is a prompt
- Advanced effectiveness: Meh. Likely competent, but harness friction prevents a stronger conclusion
- Comments: Qwen in general is all over the place; wouldn't be surprised if other releases are better
deepseek/deepseek-v4-pro
- Orchestration duration: Meh. Slow and barely avoids circular behavior, but does get the job done
- Token value: Bad. Not cheap for mediocre efficacy
- Integration efficacy: Meh. No fatal tooling issues, but the orchestration loop is not strong enough
- Advanced effectiveness: Meh. Better than previous DeepSeeks, but not recommendable for orchestration
- Comments: One option that shows you there is a difference between good models and orchestration-quality models
tencent/hy3-preview
- Orchestration duration: Bad. Effectively untested because OMO does not appear to want to load/use it
- Token value: Bad. No practical value if it cannot be loaded into the stack
- Integration efficacy: Bad. Harness compatibility is the blocker
- Advanced effectiveness: Bad. No useful advanced-orchestration because it never gets into the loop
- Comments: Surprising, as I expected better from Tencent--makes me wonder if this was on me
Black Sheep
poolside/laguna-m.1
- Orchestration duration: Meh. Context is only 128k, but it is still surprisingly capable for a freebie
- Token value: Good. Free and far better than expected makes this a standout value discovery
- Integration efficacy: Good. No major tool/harness issues experienced
- Advanced effectiveness: Meh. Not AAA, maybe A-level, and it struggles with design/layout/CSS
- Comments: This is the strongest "where did this come from?" result in my notes
minimax/minimax-m2.7
- Orchestration duration: Good. Takes a little longer, but reliably converges for most orchestration
- Token value: Good. Decent token rates and good enough output quality to make the price attractive
- Integration efficacy: Good. No major convergence or orchestration-integration issues
- Advanced effectiveness: Meh. Broadly competent, but weaker on layout/styling/frontend/design tasks
- Comments: One of the more pleasant surprises: not flashy, but practically useful
inclusionai/ring-2.6-1t
- Orchestration duration: Meh. Moves quickly, but frequent rate limits interrupt even modest work
- Token value: Good. Free for now, which makes passable orchestration inherently interesting
- Integration efficacy: Meh. Initial OpenRouter/OMO friction, a Zen fallback, and uneven, hiccupy tooling
- Advanced effectiveness: Meh. Passable on architecture-oriented tasks, but hesitant, warning-prone
- Comments: Worth monitoring because the raw economics would be excellent without the rate-limit and confidence issues
openrouter/owl-alpha
- Orchestration duration: Bad. No useful orchestration run--the harness does not like supporting it
- Token value: Bad. Unsupported models have no practical token value in this stack
- Integration efficacy: Bad. OMO/OpenCode compatibility seems to be the blocker
- Advanced effectiveness: Bad. No meaningful effectiveness read without a working integration
- Comments: Reminds me of the OpenCode Zen options, nice to know it's there but not a heavy orchestrator
Open Weight
nvidia/nemotron-3-super-120b-a12b
- Orchestration duration: Meh. Comes up to speed pretty quickly for a free model
- Token value: Good. Free makes the initial value attractive
- Integration efficacy: Bad. Errors out when trying to do things, suggesting tool-integration problems
- Advanced effectiveness: Bad. Promising comprehension does not matter if action-taking fails
- Comments: First experiment with an NVIDIA model, and if they're all like this, the last--surprising
gpt-oss 20b
- Orchestration duration: Bad. Reasoning closure issues prevent reliable orchestration
- Token value: Meh. Local economics may be attractive, but the model is not orchestration-grade
- Integration efficacy: Bad. Does not integrate well with tools
- Advanced effectiveness: Bad. Older and definitely below the required orchestration level
- Comments: Really not sure what the proposition for this model is--out of date, mediocre at everything
gemma4 e4b
- Orchestration duration: Bad. Not ready to close/converge on orchestration solutions
- Token value: Bad. Even if cheap/local, poor convergence makes it a bad orchestrator value
- Integration efficacy: Bad. Has tool-integration problems
- Advanced effectiveness: Bad. Not yet at orchestration level
- Comments: Love me some Gemma for local codegen, etc., but obviously not for high/heavy work
In Summary
Here's a quick summary table of where things wound up.
| Model | Category | Duration | Token | Integration | Advanced |
|---|---|---|---|---|---|
| google/gemini-3.1-pro-preview | AAA | Meh | Good | Meh | Good |
| google/gemini-3-flash-preview | AAA | Meh | Good | Good | Meh |
| openai/gpt-5.4 | AAA | Bad | Bad | Meh | Good |
| openai/gpt-5.5 | AAA | Bad | Meh | Bad | Meh |
| x-ai/grok-4.20-beta | Runner-Ups | Good | Meh | Good | Good |
| x-ai/grok-4.3 | Runner-Ups | Meh | Meh | Meh | Good |
| x-ai/grok-4.20 | Runner-Ups | Meh | Meh | Bad | Meh |
| x-ai/grok-4.1-fast | Runner-Ups | Meh | Meh | Meh | Bad |
| deepseek/deepseek-v3.2 | Distillations/Heretics | Meh | Meh | Meh | Good |
| moonshotai/kimi-k2.6 | Distillations/Heretics | Bad | Meh | Meh | Good |
| moonshotai/kimi-k2.5 | Distillations/Heretics | Meh | Meh | Bad | Meh |
| z-ai/glm-5.1 | Distillations/Heretics | Bad | Bad | Meh | Meh |
| qwen/qwen3.6-plus | Distillations/Heretics | Meh | Meh | Bad | Meh |
| deepseek/deepseek-v4-pro | Distillations/Heretics | Meh | Bad | Meh | Meh |
| tencent/hy3-preview | Distillations/Heretics | Bad | Bad | Bad | Bad |
| minimax/minimax-m2.7 | Black Sheep | Good | Good | Good | Meh |
| poolside/laguna-m.1 | Black Sheep | Meh | Good | Good | Meh |
| inclusionai/ring-2.6-1t | Black Sheep | Meh | Good | Meh | Meh |
| openrouter/owl-alpha | Black Sheep | Bad | Bad | Bad | Bad |
| nvidia/nemotron-3-super-120b-a12b | Open Weight | Meh | Good | Bad | Bad |
| gemma4 e4b | Open Weight | Bad | Bad | Bad | Bad |
| gpt-oss 20b | Open Weight | Bad | Meh | Bad | Bad |

