DEV Community

Cover image for Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark
Jenny Met
Jenny Met

Posted on • Originally published at crazyrouter.com

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Dynamic workflows are exciting, but they introduce a practical question: which model should run each step?

If a workflow has a planner, implementer, adversarial reviewer, and verifier, you can route all steps to one strong model. Or you can route different steps to different models.

The wrong answer is to guess.

So we ran a small real benchmark through Crazyrouter.

Dynamic workflow routing benchmark

Test setup

We used the Crazyrouter OpenAI-compatible endpoint:

https://cn.crazyrouter.com/v1
Enter fullscreen mode Exit fullscreen mode

The workflow task was:

Add a CSV export for account billing history with user-level authorization, timezone-safe timestamps, CSV escaping, tests, and rollback notes.
Enter fullscreen mode Exit fullscreen mode

The workflow had four steps:

  1. planner
  2. implementer
  3. adversarial reviewer
  4. verifier

We tested three routing policies:

Policy Planner Implementer Reviewer Verifier
all_opus_47 claude-opus-4-7 claude-opus-4-7 claude-opus-4-7 claude-opus-4-7
all_opus_48 claude-opus-4-8 claude-opus-4-8 claude-opus-4-8 claude-opus-4-8
routed_47_48 claude-opus-4-7 claude-opus-4-8 claude-opus-4-7 claude-opus-4-8

Raw benchmark artifact:

generated/dynamic_workflow_routing_20260603/benchmark_results.json
Enter fullscreen mode Exit fullscreen mode

Results

Route Calls Total latency Total tokens Output tokens Score
all Opus 4.7 4 100.939s 8,853 5,977 14/17
all Opus 4.8 4 82.598s 8,357 5,782 15/17
routed 4.7/4.8 4 85.975s 8,652 5,873 15/17

In this run, all_opus_48 won on latency, total tokens, and score.

That does not mean every workflow should use Opus 4.8 everywhere. It means routing needs evidence.

What the score measured

This was not a generic benchmark. Each workflow step had step-specific checks.

For example:

  • planner needed affected files, risks, acceptance criteria, tests, rollback;
  • implementer needed CSV handling, authorization, timestamps, tests;
  • reviewer needed security, privacy, tests, rollback;
  • verifier needed commands, tests, evidence, inspection.

The score was a simple keyword-based quality gate. It is not a perfect human evaluation, but it catches whether the output covered required workflow concerns.

Why this matters

Dynamic workflows can multiply model calls.

A simple AI coding request might become:

planner call
+ implementer call
+ reviewer call
+ verifier call
+ retry calls
+ patch-fix calls
+ final summary call
Enter fullscreen mode Exit fullscreen mode

If you always use the most expensive model, cost can grow quickly. If you always use the cheapest model, failures and retries can grow quickly.

The useful metric is not token price.

The useful metric is:

cost and latency per successful workflow
Enter fullscreen mode Exit fullscreen mode

What we learned

1. Opus 4.8 was faster in this workflow

The all-4.8 route finished in 82.598 seconds. The all-4.7 route took 100.939 seconds.

That is an 18.341 second difference for the same four-step workflow.

In a single call, that may not matter. In a background agent system with many workflow steps, it does.

2. Mixed routing was close, but not better here

The mixed route used 4.7 for planning/review and 4.8 for implementation/verification.

It scored 15/17, same as all-4.8, but took 85.975 seconds and used 8,652 tokens.

That is still good. But in this run, all-4.8 was simpler and slightly better.

3. A static rule is risky

A different task might favor a different route. Security review, legal summarization, long-context extraction, frontend implementation, and test generation are not the same workload.

The point is not β€œalways use model X.”

The point is to create a workflow trace and compare model routes on your actual task types.

Minimal reproduction code

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_CRAZYROUTER_API_KEY",
    base_url="https://cn.crazyrouter.com/v1"
)

steps = {
    "planner": "Create a concise implementation plan with risks, tests, and rollback.",
    "implementer": "Write a minimal pseudo-patch plan and code sketch.",
    "reviewer": "Adversarially review security, correctness, privacy, tests, and rollback.",
    "verifier": "Create a verification checklist with concrete commands and evidence."
}

route = {
    "planner": "claude-opus-4-8",
    "implementer": "claude-opus-4-8",
    "reviewer": "claude-opus-4-8",
    "verifier": "claude-opus-4-8"
}

for role, instruction in steps.items():
    response = client.chat.completions.create(
        model=route[role],
        messages=[{"role": "user", "content": instruction}],
        temperature=0
    )
    print(role, response.choices[0].message.content)
Enter fullscreen mode Exit fullscreen mode

Keep the API base URL clean. Do not add UTM parameters to code endpoints.

Recommended workflow routing practice

Start with this loop:

  1. define workflow steps;
  2. define success checks per step;
  3. run the same task across 2-3 routing policies;
  4. log model, latency, tokens, and score;
  5. choose the route based on successful workflow outcome, not model hype.

A gateway makes this practical because the application code can keep the same client and base URL while changing model IDs.

Why Crazyrouter fits this pattern

Dynamic workflows need three things:

  • model variety;
  • centralized routing;
  • traceable usage.

Crazyrouter gives one OpenAI-compatible API surface for multiple models. That makes it easier to test planner/reviewer/verifier routes without rewriting the product around each provider.

This matters more as AI coding moves from single-agent chats to orchestrated workflows.

Final take

Dynamic workflows are not just a Claude Code or Codex feature. They are an engineering pattern.

Once you split work into planner, implementer, reviewer, and verifier, model choice becomes a routing problem.

In this run, all Opus 4.8 was the best route. In your workflow, it might be a mixed route. The only way to know is to measure.

Try model routing here: Crazyrouter

Top comments (0)