Jenny Met

Posted on Jun 3 • Originally published at crazyrouter.com

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

#ai #api #claude #agents

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Dynamic workflows are exciting, but they introduce a practical question: which model should run each step?

If a workflow has a planner, implementer, adversarial reviewer, and verifier, you can route all steps to one strong model. Or you can route different steps to different models.

The wrong answer is to guess.

So we ran a small real benchmark through Crazyrouter.

Test setup

We used the Crazyrouter OpenAI-compatible endpoint:

https://cn.crazyrouter.com/v1

The workflow task was:

Add a CSV export for account billing history with user-level authorization, timezone-safe timestamps, CSV escaping, tests, and rollback notes.

The workflow had four steps:

planner
implementer
adversarial reviewer
verifier

We tested three routing policies:

Policy	Planner	Implementer	Reviewer	Verifier
all_opus_47	`claude-opus-4-7`	`claude-opus-4-7`	`claude-opus-4-7`	`claude-opus-4-7`
all_opus_48	`claude-opus-4-8`	`claude-opus-4-8`	`claude-opus-4-8`	`claude-opus-4-8`
routed_47_48	`claude-opus-4-7`	`claude-opus-4-8`	`claude-opus-4-7`	`claude-opus-4-8`

Raw benchmark artifact:

generated/dynamic_workflow_routing_20260603/benchmark_results.json

Results

Route	Calls	Total latency	Total tokens	Output tokens	Score
all Opus 4.7	4	100.939s	8,853	5,977	14/17
all Opus 4.8	4	82.598s	8,357	5,782	15/17
routed 4.7/4.8	4	85.975s	8,652	5,873	15/17

In this run, all_opus_48 won on latency, total tokens, and score.

That does not mean every workflow should use Opus 4.8 everywhere. It means routing needs evidence.

What the score measured

This was not a generic benchmark. Each workflow step had step-specific checks.

For example:

planner needed affected files, risks, acceptance criteria, tests, rollback;
implementer needed CSV handling, authorization, timestamps, tests;
reviewer needed security, privacy, tests, rollback;
verifier needed commands, tests, evidence, inspection.

The score was a simple keyword-based quality gate. It is not a perfect human evaluation, but it catches whether the output covered required workflow concerns.

Why this matters

Dynamic workflows can multiply model calls.

A simple AI coding request might become:

planner call
+ implementer call
+ reviewer call
+ verifier call
+ retry calls
+ patch-fix calls
+ final summary call

If you always use the most expensive model, cost can grow quickly. If you always use the cheapest model, failures and retries can grow quickly.

The useful metric is not token price.

The useful metric is:

cost and latency per successful workflow

What we learned

1. Opus 4.8 was faster in this workflow

The all-4.8 route finished in 82.598 seconds. The all-4.7 route took 100.939 seconds.

That is an 18.341 second difference for the same four-step workflow.

In a single call, that may not matter. In a background agent system with many workflow steps, it does.

2. Mixed routing was close, but not better here

The mixed route used 4.7 for planning/review and 4.8 for implementation/verification.

It scored 15/17, same as all-4.8, but took 85.975 seconds and used 8,652 tokens.

That is still good. But in this run, all-4.8 was simpler and slightly better.

3. A static rule is risky

A different task might favor a different route. Security review, legal summarization, long-context extraction, frontend implementation, and test generation are not the same workload.

The point is not “always use model X.”

The point is to create a workflow trace and compare model routes on your actual task types.

Minimal reproduction code

from openai import OpenAI

client = OpenAI(
    api_key="YOUR_CRAZYROUTER_API_KEY",
    base_url="https://cn.crazyrouter.com/v1"
)

steps = {
    "planner": "Create a concise implementation plan with risks, tests, and rollback.",
    "implementer": "Write a minimal pseudo-patch plan and code sketch.",
    "reviewer": "Adversarially review security, correctness, privacy, tests, and rollback.",
    "verifier": "Create a verification checklist with concrete commands and evidence."
}

route = {
    "planner": "claude-opus-4-8",
    "implementer": "claude-opus-4-8",
    "reviewer": "claude-opus-4-8",
    "verifier": "claude-opus-4-8"
}

for role, instruction in steps.items():
    response = client.chat.completions.create(
        model=route[role],
        messages=[{"role": "user", "content": instruction}],
        temperature=0
    )
    print(role, response.choices[0].message.content)

Keep the API base URL clean. Do not add UTM parameters to code endpoints.

Recommended workflow routing practice

Start with this loop:

define workflow steps;
define success checks per step;
run the same task across 2-3 routing policies;
log model, latency, tokens, and score;
choose the route based on successful workflow outcome, not model hype.

A gateway makes this practical because the application code can keep the same client and base URL while changing model IDs.

Why Crazyrouter fits this pattern

Dynamic workflows need three things:

model variety;
centralized routing;
traceable usage.

Crazyrouter gives one OpenAI-compatible API surface for multiple models. That makes it easier to test planner/reviewer/verifier routes without rewriting the product around each provider.

This matters more as AI coding moves from single-agent chats to orchestrated workflows.

Final take

Dynamic workflows are not just a Claude Code or Codex feature. They are an engineering pattern.

Once you split work into planner, implementer, reviewer, and verifier, model choice becomes a routing problem.

In this run, all Opus 4.8 was the best route. In your workflow, it might be a mixed route. The only way to know is to measure.

Try model routing here: Crazyrouter

DEV Community

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Same Agent Workflow, Three Model Routes: A Real Crazyrouter Benchmark

Test setup

Results

What the score measured

Why this matters

What we learned

1. Opus 4.8 was faster in this workflow

2. Mixed routing was close, but not better here

3. A static rule is risky

Minimal reproduction code

Recommended workflow routing practice

Why Crazyrouter fits this pattern

Final take

Top comments (0)