Edy Silva

Originally published at blog.codeminer42.com

Opus 4.7 vs GLM 5.1: is mixing models worth it?

A couple of months ago, I compared Opus and GLM by having each of them do the same task for me. It's not that surprising that Opus came out ahead.

But what if we got them cooperating, to optimize cost: one planning, the other executing?

I had run this comparison before on the same plugin (cm-multilingual), in three runs: Opus 4.7 alone, GLM 5.1 alone via Ollama Cloud, and GLM 5.1 following the plan Opus had written. It was a small task – just adding progress feedback to the AI translation – and the "cheap model follows the expensive model's plan" split delivered ~99% of Opus's scope for ~37% of the cost.

Too good to be true. In a task of that size, there isn't much room for things to go wrong, and I knew that the moment I posted it.

But it is the kind of result that becomes a Twitter thread: “Opus quality, GLM/Qwen/Minimax/Kimi price – just pass the plan”. Every week, someone posts a variation – the cheap model of the moment changes, the argument does not. And that was exactly the shortcut I wanted to test more rigorously before adopting it as the default.

So I raised the bar. Same plugin, same setup, but now the task was to implement translation by chunks, improving the resilience of the entire process, with the progress bar still in place: async coordination between PHP, REST, JS, and an external provider that can go quiet halfway through.

The response changed. And this is the round I think is worth recording.

The three branches:

  1. chunks-opus – Opus 4.7 plans and executes.
  2. chunks-glm – GLM 5.1 alone. Same prompt verbatim. GLM plans and executes.
  3. chunks-glm-follow-opus – GLM 5.1 with Opus’s plan loaded in the first turn (“follow strictly”).

All three passed the tests. All of them work, eventually. But to answer "is it worth mixing?", what matters is how each one got there.

Methodological note: I let each agent work until the feature ran and did not iterate to polish. Interactions were kept to the minimum needed for every branch to have a deliverable, nothing more.

What each one delivered

Meta box

From a UI perspective, the three branches deliver practically the same thing: progress bar in the “Translations” meta box, text indicating the current chunk. The difference lies in the guts – orchestrator, dispatch, tests – not on the surface.

Opus (chunks-opus). Refactors translation into three well-separated pieces: CM_Chunker (splits the post along HTML block boundaries), CM_Translation_Orchestrator (receives a ?callable $reporter in the constructor and orchestrates title -> chunks -> finalization), while CM_REST_API stays HTTP-only.

True parallel dispatch via translate_batch, which accepts an on_complete callback triggered each time a chunk returns. The bar advances as responses arrive in parallel. Wall time ≈ max(latency per chunk).
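
To make that shape concrete, here is a minimal sketch of the pattern described above. The class and method names (CM_Translation_Orchestrator, CM_Chunker, translate_batch, the on_complete callback) come from the branch, but the bodies are illustrative assumptions, not the actual plugin code:

```php
<?php
// Illustrative sketch, not the actual chunks-opus source. The orchestrator knows
// nothing about WordPress: progress flows through an optional callback, and the
// caller decides what to do with each event.
class CM_Translation_Orchestrator {

	private $provider;
	private $chunker;
	private $reporter;

	public function __construct( $provider, CM_Chunker $chunker, ?callable $reporter = null ) {
		$this->provider = $provider;
		$this->chunker  = $chunker;
		$this->reporter = $reporter;
	}

	private function report( string $stage, int $done, int $total ) {
		if ( $this->reporter ) {
			call_user_func( $this->reporter, $stage, $done, $total );
		}
	}

	public function translate_post( WP_Post $post, string $target_lang ) {
		$chunks = $this->chunker->chunk( $post->post_content );
		$total  = count( $chunks );
		$done   = 0;

		$this->report( 'title', 0, $total );
		$title = $this->provider->translate( $post->post_title, $target_lang );

		// translate_batch fires all chunk requests and invokes on_complete as each
		// response arrives, so the bar tracks arrivals instead of a loop index.
		$translated_chunks = $this->provider->translate_batch(
			$chunks,
			$target_lang,
			function ( $index, $result ) use ( &$done, $total ) {
				$done++;
				$this->report( 'chunk', $done, $total );
			}
		);

		$this->report( 'finalize', $total, $total );

		return array(
			'title'   => $title,
			'content' => $this->chunker->reassemble( $translated_chunks ),
		);
	}
}
```

In this shape, the REST handler can hand the orchestrator a closure that writes a transient (or logs, or pushes the event anywhere else), while the orchestrator itself never touches set_transient().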

GLM alone (chunks-glm). No separate orchestrator. ~150 lines of orchestration inlined in the REST handler.

It has CM_Chunker, but the REST handler does a sequential foreach over translate_with_context(), which sends each chunk's content. The bar advances chunk by chunk, but wall time ≈ sum(latency per chunk).

GLM with Opus plan (chunks-glm-follow-opus). Recreates the orchestrator the plan described, but makes the dispatch sequential (a foreach calling provider->translate( $chunk, ... )), with set_progress() before each call.
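
For contrast, a sketch of the sequential shape both GLM runs converged on. Again, only the overall pattern (a foreach over chunks, with progress written to a transient before each provider call) comes from the branches; names and bodies are illustrative:

```php
<?php
// Illustrative sketch of the sequential dispatch in the GLM branches. Progress is
// written straight to a transient, so the class is coupled to WordPress and to the
// (post, lang) identity, and wall time grows with the number of chunks.
class CM_Sequential_Translation {

	private $provider;
	private $post_id;          // mutable state, as in the chunks-glm-follow-opus branch
	private $target_lang_code;

	public function __construct( $provider ) {
		$this->provider = $provider;
	}

	public function translate_post( WP_Post $post, string $target_lang ) {
		$this->post_id          = $post->ID;
		$this->target_lang_code = $target_lang;

		$chunks     = ( new CM_Chunker() )->chunk( $post->post_content );
		$total      = count( $chunks );
		$translated = array();

		foreach ( $chunks as $i => $chunk ) {
			$this->set_progress( $i, $total );                                  // bar advances chunk by chunk
			$translated[] = $this->provider->translate( $chunk, $target_lang ); // blocks until the provider answers
		}

		$this->set_progress( $total, $total );

		return implode( '', $translated ); // wall time ≈ sum of per-chunk latency
	}

	private function set_progress( int $done, int $total ) {
		// Hypothetical transient key; the point is the direct set_transient() coupling.
		set_transient(
			"cm_translation_progress_{$this->post_id}_{$this->target_lang_code}",
			array( 'done' => $done, 'total' => $total ),
			5 * MINUTE_IN_SECONDS
		);
	}
}
```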

The numbers

[Bar chart] Comparison of the three branches on active time (minutes), cost (USD) and supervision (corrective interventions): Opus 4.7 at 47 min, $180, 0 interventions; GLM 5.1 alone at 129 min, $62, 8 interventions; GLM + Opus plan at 162 min, $75, 18 interventions.

The three axes where the trade-off appears most. The table below has the rest of the details (tokens, autocompacts), but the three metrics that matter for the verdict are in the chart.

                               Opus 4.7     GLM 5.1 alone   GLM + plan
Active time (90 s threshold)   46m 39s      2h 9m 27s       2h 41m 52s
Total tokens (in + out)        99.4M        64.5M           71.2M
Equivalent cost (reference)    $180.15      $61.63          $74.51 ($6.63 plan + $67.88 exec)
Corrective user check-ins      0            ~8              ~18
Autocompacts (Claude Code)     0            2               2
Final result                   shipped      shipped         shipped

The dollar costs are reference values: what that volume of tokens would cost if you paid per-API pricing. In practice, with the Claude and Ollama Cloud subscriptions, you pay nothing extra per session – the numbers exist only to compare the three runs on a common axis.

Some rows of this table deserve a closer look.

Dollar cost: the savings still exist

$67.88 (execution) + $6.63 (plan) = $74.51 for the mixed split. That is ~41% of the cost of Opus alone, in line with the ~37% I measured on the simple task. The billing savings are robust across both rounds.

If the only thing that matters is “how much this would cost in retail”, the split delivers what it promised.

Active time: the savings disappear

But active time (the time you spend monitoring the session, with idle gaps longer than 90 s discounted) flips the other way:

  • Opus: 46m 39s.
  • GLM alone: 2h 9m 27s. ~2.8× more.
  • GLM with plan: 2h 41m 52s. ~3.5× more – the worst of the three.

The agent following another model’s plan spends turns interpreting the plan, matching it with the reality of the code, and circling back to ask when reality doesn’t match. I expected the plan to shorten the run; instead, it lengthened it.
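
For reference, "active time" is not wall-clock: it sums the gaps between consecutive session events and drops any gap longer than 90 seconds as idle. A minimal sketch of that kind of calculation (the helper and its inputs are hypothetical, not something exported by Claude Code):

```php
<?php
// Hypothetical helper: given Unix timestamps of session events, sum the gaps
// between consecutive events, ignoring anything longer than the idle threshold.
function cm_active_time( array $event_timestamps, int $idle_threshold = 90 ): int {
	sort( $event_timestamps );
	$active = 0;
	for ( $i = 1, $n = count( $event_timestamps ); $i < $n; $i++ ) {
		$gap = $event_timestamps[ $i ] - $event_timestamps[ $i - 1 ];
		if ( $gap <= $idle_threshold ) {
			$active += $gap;
		}
	}
	return $active; // seconds of hands-on time
}
```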

Supervision: ~18 interventions in the run with plan

This was the number that bothered me most. I counted “corrective user check-ins” as messages like “did you actually validate?” or “open the browser again to make sure it works”.

  • Opus: 0.
  • GLM alone: ~8.
  • GLM with plan: ~18. More than double the free-form GLM.

The intuition was that the plan would reduce supervision because the agent would have less freedom to make mistakes. On the contrary, the same interventions from free-GLM appeared in plan-GLM, but earlier and in greater numbers.

Autocompacts: two in each GLM run, zero in Opus

Autocompact is Claude Code summarizing the conversation when it approaches the context limit. Each compaction replaces the turn-by-turn history with a lossy summary – the agent’s working memory becomes degraded.

  • Opus: zero autocompacts. Anthropic’s caching keeps the prompt lean, so even in 461 turns it didn’t hit the ceiling.
  • GLM alone: 2 autocompacts.
  • GLM with plan: 2 autocompacts.

Since I ran each scenario only once, I cannot say whether the autocompact caused the degradation or just measured it – longer runs hit the limit more easily, and the session that needs more turns is also the one most subject to compacting. But the signals move together: zero autocompacts, zero corrective check-ins, short run; two autocompacts, several interventions, long run.

Assisted validation was off the table with GLM

The chrome-devtools-mcp was central to this experiment – the agent clicks the button, waits for the translation, reads the DOM, and validates the feature without me needing to open the browser. It worked beautifully with Opus: the entire validation run was agent-driven, I just watched.

With GLM running via Ollama Cloud, it was not possible.

The MCP itself responds quickly (snapshot, screenshot, evaluate_script stay within the noise). But the calls that wait for the browser to finish doing something (wait_for, navigate_page) became impractical: wait_for 3.8× slower, navigate_page 27× slower, with a max of 199.8s in a single call. Total chrome-devtools roundtrips: 110s on Opus, 559s on GLM.

The cause is not the MCP, it is what the browser was waiting for. When the agent tells the browser to click "Translate with AI" and then calls wait_for, what it is actually waiting on is the backend's round trip to Ollama Cloud – the same latency as GLM itself. The model's slowness leaks into the validation loop and ruins the assisted UX.

It is worth noting that this is not a peculiarity of my setup. The community itself has reported degradation in Ollama Cloud serving in recent months, for example: r/ollama: “Ollama cloud has become unbearably slow”.

Under better serving conditions, the active time gap would shrink. However, the rest of what the post measures (supervision, validation DX) is from the model execution itself and does not change.

So I did the manual QA on both GLM branches myself: opened the post, clicked the button, watched the bar, and ran the request against long posts from my own instance.

That is what shows up as "GLM needs more supervision" in the check-in count. A good portion of the ~8 (free-form) and ~18 (with-plan) interventions is exactly this: me taking over work the agent couldn't do agent-driven.

It is not that GLM refused to validate. It is that, with Ollama Cloud latency, letting it validate end-to-end would have multiplied the wall-clock by another factor.

Code quality (quick)

Three axes where the difference is worth a closer look:

  • Encapsulation.

    Opus separates orchestrator, chunker and REST into their own files. The orchestrator doesn’t know WordPress: the constructor receives ?callable $reporter and the report() method only calls the callback. The REST handler decides what to do with the event (write transient, log, send elsewhere). Clean dependency inversion.

    GLM-with-plan also separates the orchestrator, but stores $post_id and $target_lang_code as mutable class properties. Set at the start of translate_post(), read inside set_progress() which calls set_transient() directly. This couples the orchestrator to WordPress and the (post, lang) identity. The class cannot be reused in a context without a transient.

    GLM alone has no orchestrator. ~150 lines of orchestration inlined in the REST handler, right in the middle of the file. Mixed with lock/transient/get_active_provider/copy_taxonomies in the same method.

  • Added tests (all files).

    Raw totals: Opus 85 tests / 194 assertions in 5 files, GLM-alone 76 / 171 in 4 files, GLM-with-plan 69 / 145 in 5 files. GLM-with-plan added the least coverage, across all cuts. Opus added the most.

    Raw numbers don't tell the whole story. Looking by dimension:

    • New feature contract. Opus pinned the on_complete callback of the translate_batch in 6 dedicated tests in Test_CM_Providers (test_translate_batch_on_complete_fires_for_each_chunk_via_pre_filter, ..._for_single_text_path, ..._for_build_time_errors, etc.). This callback is what makes the bar advance live, and it is guaranteed in tests.

      GLM-alone has 3 translate_chunks tests (they cover basic dispatch, without asserting anything about the callback, because there is no callback). GLM-with-plan has 0 tests of the parallel contract, because it also ended up not dispatching in parallel.

      In other words: each implementation tested exactly what it chose to do. The problem is that both GLM-runs chose a less demanding path, which means fewer tests pinning behavior.

    • Progress endpoint. All three runs test the REST progress endpoint, but with different depths. Opus checks idle state (test_progress_returns_idle_when_no_translation_in_progress), key isolation per (post, lang) pair (test_progress_keys_are_isolated_per_post_and_language), and the edit_posts permission (test_progress_endpoint_requires_edit_posts_capability). GLM-alone tests the happy path and one not-found, but does not test authentication or isolation. GLM-with-plan falls in between the two.

    • Chunker round-trip. All three have a reassembly test (test_reassembly_preserves_byte_for_byte_equality in Opus, test_reassembly_matches_original in GLM, test_preserves_content_through_chunk_and_reassemble in GLM-with-plan). The three are level on this cut. Subtle difference: Opus requires byte-for-byte equality, the others accept "matches" – one is an exact invariant, the other leaves room for interpretation.

    • Edge cases. Opus tests oversized_single_block_is_emitted_unsplit, placeholder_token_is_never_split_internally, and (in Test_CM_Meta_Box) 14 tests including meta_box_is_not_registered_when_block_editor_active – the case that historically broke in production. Both GLM-runs stop at 12 Meta_Box tests and do not cover this transition.

    Two sub-details worth noting:

    GLM-alone has 34 tests in Test_CM_Providers (more than Opus's 30) but with 72 assertions vs. 55 from Opus. On average shallower per test; many are close variations (test_translate_chunks_with_single_chunk_calls_translate, ..._with_multiple_chunks, ..._returns_error_on_provider_failure) where a single @dataProvider would do the job (there is a sketch of that after this list). Volume ≠ depth.

    GLM-with-plan also has the lowest average assertions per test of the three (~2.1), which suggests tests that "hit the surface" more than "interrogate state".

  • Dispatch strategy.

    Opus: true parallel, via translate_batch( $chunks, …, $on_complete ), which fires N simultaneous requests and closes each one via callback as soon as it responds. Wall time ≈ max(latency per chunk). The bar follows the arrival of responses.

    Both GLM runs converged to sequential dispatch with a lean payload (a foreach calling provider->translate( $chunk, ... )). The bar advances chunk by chunk, but wall time ≈ sum(latency per chunk) instead of max. For a post of 8 chunks, that is the difference between "2 minutes" and "16 minutes" – the same apparent feature, a very different waiting cost.

    Detail about GLM-with-plan: the comment in the code itself documents the choice – “Translate chunks sequentially so progress updates are visible”. The author preferred visible progress over shorter wall-clock.

    It is a defensible choice (seeing progress > waiting faster). But it is interesting that the Opus plan described the parallel via callback, and the agent preferred not to follow that part.
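
Closing the test aside from the volume-vs-depth point above: the three near-identical translate_chunks tests could collapse into a single data-provider-driven test. A generic PHPUnit sketch, where the class, the cases and the stub provider are made up for illustration and taken from none of the branches:

```php
<?php
// Generic PHPUnit sketch: one parameterized test instead of three near-copies.
// The class, cases and stub provider are illustrative, not from any branch.
class Test_Translate_Chunks_Example extends WP_UnitTestCase {

	public function chunk_cases(): array {
		return array(
			'single chunk'     => array( array( 'Hello' ), false ),
			'multiple chunks'  => array( array( 'Hello', 'World' ), false ),
			'provider failure' => array( array( 'Hello' ), true ),
		);
	}

	/**
	 * @dataProvider chunk_cases
	 */
	public function test_translate_chunks( array $chunks, bool $provider_fails ) {
		// Stub provider: echoes chunks back, or fails wholesale.
		$provider = new class( $provider_fails ) {
			public function __construct( private bool $fails ) {}
			public function translate_chunks( array $chunks, string $lang ) {
				return $this->fails ? new WP_Error( 'provider_down' ) : $chunks;
			}
		};

		$result = $provider->translate_chunks( $chunks, 'pt_BR' );

		if ( $provider_fails ) {
			$this->assertWPError( $result );
		} else {
			$this->assertCount( count( $chunks ), $result );
		}
	}
}
```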

In summary: Opus > GLM-with-plan > GLM-alone in encapsulation and tests, and Opus leads both GLM runs in dispatch (parallel vs. sequential). Each axis tells a version of the same story: how much the author thought about the class's consumer (encapsulation), the maintainer (contract tests), and the end user waiting for the bar (dispatch).

What this post is not

Before closing, four caveats to calibrate how you read the numbers above:

  • It is not a paper, it is an informed anecdote. I ran each configuration once, at different times (Ollama Cloud latency varies a lot from morning to afternoon), without fixing the seed. The numbers serve as a signal, not as a statistical benchmark.

    For those who want the structured benchmark of the same question – repeated rounds, more models compared, quality metrics – Akita published exactly that last week: Is it worth mixing 2 models?. His conclusion matches mine. This post here is the anecdotal signal of the same finding, on a real plugin instead of greenfield Rails.

  • Everything runs in Claude Code. It is the agent I use every day, it is the one I trust my workflow to. I did not test in Cursor, Aider, Windsurf, or another harness – some observations about supervision and DX may be specific to this particular environment.

  • Implicit conflict of interest. Claude Code is by Anthropic, and the Opus models are likely optimized for this harness. One cannot separate for sure “Opus is better” from “Opus fits better in Claude Code”. I believe the gap remains in another harness, but I cannot prove it with this experiment.

  • It is not rooting for Opus. Every time Anthropic has an outage, I switch to GLM 5.1 via Ollama Cloud without thinking twice.

    It is genuinely the second best model/harness pair in my routine, and is what pays the bills when the first one is down. This post is not “GLM is bad”. It is “split with plan doesn’t pay off when the problem grows”.

A detail that probably widens the gap

Worth noting one thing before the verdict: transient + polling was my choice as the plugin maintainer, not something the agents had to decide. I wanted the simplest and most effective thing for now in this case.

What this does to the experiment: it did not test the ability to architect difficult things. No SSE, no WebSocket, no Action Scheduler with queue and server-to-client push, none of that.
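
For readers who don't live in WordPress, "transient + polling" means roughly this on the PHP side: the translation writes its progress to a transient, and a REST route exposes it for the editor JS to poll. A minimal sketch, with the route and key names made up for the example (the real plugin's names differ):

```php
<?php
// Illustrative sketch of the transient + polling pattern. The JS in the meta box
// polls this route every second or two and moves the bar from the response.
add_action( 'rest_api_init', function () {
	register_rest_route( 'cm-multilingual/v1', '/progress/(?P<post_id>\d+)/(?P<lang>[a-z_]+)', array(
		'methods'             => 'GET',
		'permission_callback' => function () {
			return current_user_can( 'edit_posts' ); // the capability Opus pinned in its tests
		},
		'callback'            => function ( WP_REST_Request $request ) {
			// Key is scoped per (post, lang), so two translations never share state.
			$key   = sprintf( 'cm_translation_progress_%d_%s', $request['post_id'], $request['lang'] );
			$state = get_transient( $key );

			if ( false === $state ) {
				return array( 'status' => 'idle' );
			}

			return $state; // e.g. array( 'done' => 3, 'total' => 8 )
		},
	) );
} );
```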

Still, if the result already shows Opus 47m vs. GLM-with-plan 2h42m + ~18 corrective interventions + completely manual validation DX in this simple setup, it is reasonable to expect the gap to widen in a scenario where the architecture itself was part of the challenge. Here, the agent only had to stitch together the standard pattern well.

Is it worth mixing?

Therefore, the direct answer for this round: it didn’t pay off.

The three branches delivered code that works – same feature, same baseline. What changed around it was everything else:

  • 41% of the dollar cost (billing savings hold up)
  • 3.5× more active time than Opus, and worse than free-form GLM
  • ~18 corrective interventions vs. ~8 in free-form GLM vs. 0 in Opus
  • Validation DX moving from “agent drives the browser, and I watch” (Opus) to “I open the post myself, click, monitor” (any GLM)

If your time in front of the keyboard is worth anything above zero, the math flips. At 100 USD/hour, GLM-with-plan is already more expensive than Opus alone (~$74.51 + 2.7 h × $100 ≈ $345, vs. ~$180.15 + 0.78 h × $100 ≈ $258). At 150 USD/hour, Opus wins by a comfortable margin.

And, more important than the number: this task involves async coordination between PHP, REST, JS and an external provider that can go quiet halfway through. In this kind of problem, the plan you hand to the cheap executor has to be very good to close the execution gap. Opus's plan was good – and even then it didn't close it.

Where the split still makes sense: reasonably well-delimited UI tasks with a very clear scope. Basically, the simple scenario of the original experiment: no chunking, no async coordination, no timeouts fighting you. There, the plan closes the gap and the cheap executor delivers.

For anything involving timing, parallelism, or an external dependency, use Opus alone. You pay for "not needing to babysit", not for the tokens.

And for bugfixes in code you know well, GLM alone works: the plan already lives in your head.

I am not saying split never pays off. I am saying: the more complex the problem, the less the pre-ready plan compensates for the cheap executor. The plan closes the WHAT gap. It leaves the HOW gap open. And the HOW is where the extra time, extra supervision and degraded DX appear.

Beware of the universal shortcut

You just did the whole math with me. If a Twitter/X thread pops up promising "Opus quality, GLM/Qwen/Minimax/Kimi price – just pass the plan", it is showing you one column of the CSV. The other half of the cost is not on the bill.

And it is not just this experiment saying this. Akita ran a broader multi-model benchmark in the same month, with three harnesses and seven model combinations, and reached the same answer: solo Opus for typical greenfield, multi-model only pays off in genuinely parallel and decoupled tasks. Different methods, aligned conclusion.

I will merge chunks-opus into the plugin. The other two branches remain as reference: the free-form one is the minimum viable thing GLM delivers from a shallow prompt, and the with-plan one shows how expensive it is to try to close the capability gap with a plan. Neither goes to main.

The next step is to implement the feature the way it ideally should be built (SSE, a scheduler) and watch GLM and other models, such as Minimax and Kimi K2.6, work on it.

In three lines, for those who arrived here at the end:

  • Split works in simple and well-delimited tasks – the plan closes the cheap executor’s gap.
  • In complex tasks, the HOW escapes the plan. Extra time, extra supervision and degraded DX flip the math – and that math is not in the billing.
  • You pay for the expensive model so you “don’t have to babysit” and not because of the token difference.

That is all for today.


Session aggregates are in this gist: token counts, active time, tool breakdowns, corrective check-in transcripts, full comparison between the three sessions.

If you want to check the post numbers or disagree with the conclusions, this is the starting point.
