Claude gets 3 retries. Qwen gets 6. Everything else gets 5.
Those are the default fix_verify_retry_cap values in Codens Purple right now, after a few weeks of staring at fix-rate curves per model. It started as one global cap, the same number for every model the workflow could route to. We changed it once we had enough production data to see that the same number was both too high for one model and too low for another at the same time.
This is the story of the split, what the loop actually does, and the few lines of code that put the policy in place.
The fix_verify loop
Codens Purple runs an agent that proposes a code fix, then verifies it by running a test or a check, then decides whether to retry with feedback from the verification step. The loop looks roughly like this. Generate a candidate change, apply it, run the verify command, read the result. If verify passes, the loop is done. If verify fails, feed the failure output back into the next prompt and try again. Each retry is a new API call. Each API call costs per-token credits, and verify itself costs wall clock time plus whatever the test suite costs to run.
The retry cap is the integer that says how many of those iterations the loop is allowed before it gives up and surfaces the partial result to the user. A cap of 1 means one attempt, no retry. A cap of 3 means an initial attempt plus two retries. A cap of 6 means up to six attempts total.
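In outline, the loop looks something like the sketch below. The generate_fix, apply_patch, and run_verify helpers are stand-ins for the real agent and harness calls, not names from our codebase.

def fix_verify(task, model: str, cap: int):
    # The cap counts total attempts, so cap=3 means one initial try plus two retries.
    feedback = None
    for attempt in range(1, cap + 1):
        patch = generate_fix(task, model, feedback)   # one API call per attempt
        apply_patch(task, patch)
        result = run_verify(task)                     # run the test or check command
        if result.passed:
            return {"status": "fixed", "attempts": attempt}
        feedback = result.output                      # failure output feeds the next prompt
    return {"status": "gave_up", "attempts": cap}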
The cap matters because the curve of "fix succeeds at attempt N" is not flat. It is heavily front-loaded. Most successful fixes succeed on attempt 1 or 2. The question for any given model is how long the long tail is, and how much of that tail is worth paying for.
When we had one cap for all models, that one number had to be a compromise. The compromise was bad in two directions at once.
How we got to multi-model
Codens started with Claude as the only model. Specifically, Claude via the Anthropic API, using a raw API key with per-token billing. Not the subscription, not the bundled tier. We are a multi-tenant product running thousands of small fix_verify cycles per day across many customers, and a subscription does not cleanly support that shape of workload. Per-token billing lets us scale spend with usage and attribute cost back to the project that incurred it.
This came up again recently when Anthropic announced that the claude -p print mode, the Agent SDK, and CI use cases now require an API plan rather than a subscription. For us this was a non-event. We were already on the API. The announcement just confirmed that the path we picked is the path Anthropic wants production agent workloads to take.
Claude is excellent for fix_verify. The per-attempt success rate is high and the failure modes are usually informative, meaning when it does not fix the bug on attempt 1, the diff it produces and the verify output together give the next attempt a real signal. The downside is cost. At scale, with thousands of fix loops a day, the per-token bill is a real line item.
A few months in, we started evaluating Qwen as a secondary model to drive cost down on a subset of tasks. Qwen runs on our own infrastructure on AWS EC2 hosts, which gives us per-token cost well below the Anthropic API for the same task size. The tradeoff was the reliability profile. Per-attempt success rate is lower than Claude. Failure modes are noisier. Some of the time the model will produce a syntactically valid but semantically wrong patch, and the verify step is the only thing that catches it.
This is exactly the kind of model where retries earn their keep. Qwen's curve of cumulative success vs attempt number rises more slowly than Claude's, but it keeps rising further out. Attempt 5 is still adding meaningful success rate. With Claude, attempt 5 is mostly wasted credits on a fundamentally wrong understanding that more retries are not going to fix.
So we had two models in production with different shapes of success curve, and we were applying the same retry cap to both. Something had to give.
Why one cap did not work
Suppose we set the global cap to 3, tuned for Claude. Claude is fine. Qwen leaves real success on the table, because attempts 4, 5, and 6 would have converted a measurable fraction of failures into passes, and now they do not happen. Fix rate drops on Qwen-routed tasks. Users notice. They route more work to Claude, which is the opposite of what we wanted from introducing Qwen.
Suppose we set the global cap to 6, tuned for Qwen. Qwen is fine. Claude wastes credits. Attempts 4, 5, and 6 on a Claude-routed task that has already failed three times have a low chance of succeeding, because Claude's failure mode at attempt 3 is usually "I do not understand the bug" or "the test I am running is checking something I cannot see," and the same prompt with the same verify output is not going to flip that on attempt 6. We were paying full Sonnet-tier per-token cost for those attempts.
The compromise we ran for a while was a cap of 5 globally. It was bad on both axes. Claude wasted two attempts' worth of credits on its failure cases. Qwen left one attempt's worth of success on the floor. We could see this in the data once we started bucketing loop outcomes by model and attempt number. The right answer was clearly per-model, not global.
The per-model defaults
The implementation is small. We added a nullable integer column on the project table, fix_verify_retry_cap, with NULL meaning "use the model-based default." A helper function returns the default for a given model name. The use case layer combines the two when it kicks off a loop.
The helper:
def _default_fix_verify_cap(model: str) -> int:
    name = (model or "").lower()
    if name.startswith("claude"):
        return 3
    if name.startswith("qwen"):
        return 6
    return 5
The schema field, on the project update payload:
from typing import Optional
from pydantic import BaseModel, Field

class PurpleProjectUpdate(BaseModel):
    fix_verify_retry_cap: Optional[int] = Field(
        default=None, ge=1, le=20
    )
The Alembic migration adds the column:
op.add_column(
    "purple_projects",
    sa.Column("fix_verify_retry_cap", sa.Integer(), nullable=True),
)
And the use case resolves the effective cap when it starts a task:
effective_cap = (
    pp.fix_verify_retry_cap
    or _default_fix_verify_cap(execute_model)
)
The override range is 1 to 20. One on the low end because some projects have run a single attempt followed by a human review, and we do not want to break that pattern. Twenty on the high end because it is a reasonable ceiling for a customer who wants to push the long tail of a cheap self-hosted model further than our default. If they set 20 and burn through it, that is their cost. We log the effective cap on every task so it shows up in the project audit log alongside the outcome.
The defaults of 3, 5, 6 are not magic numbers pulled out of intuition. We picked them by plotting cumulative fix rate against attempt number for each model from a few weeks of production runs and looking at where the curve flattens. For Claude, the curve is essentially flat past attempt 3. For Qwen, it is still meaningfully rising at 5 and starts to flatten at 6. For other models we had less data, so 5 is the safe middle.
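As an illustration of that bucketing, here is a minimal sketch of the computation, assuming each run is logged as a (model, attempts_used, fixed) record; the field names are placeholders, not our real schema.

from collections import defaultdict

def cumulative_fix_rate(runs, max_attempts=10):
    totals = defaultdict(int)                          # total runs per model
    fixed_at = defaultdict(lambda: defaultdict(int))   # model -> attempt -> fixes landing there
    for model, attempts_used, fixed in runs:
        totals[model] += 1
        if fixed:
            fixed_at[model][attempts_used] += 1
    curves = {}
    for model, total in totals.items():
        cumulative, curve = 0, []
        for n in range(1, max_attempts + 1):
            cumulative += fixed_at[model][n]
            curve.append(cumulative / total)           # P(fixed by attempt <= n)
        curves[model] = curve
    return curves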
The tradeoff
The honest cost of this change is that adding a new model to the routing layer is no longer free. Before, we added a model and it inherited the global cap. Now we have to pick a default. If we do not pick one, the model falls through to the 5 default, which is usually fine but not always optimal.
In practice, this turned into a small ritual when introducing a new model. Route a small fraction of traffic to it at cap 8 or 10 for a week, plot the curve, find the elbow, set the default to one or two above the elbow. The ritual takes a few hours of analysis on top of the model integration itself. We considered automating it, computing the default from rolling fix rates per model on a cadence. We have not built that yet. The set of models we route to is small enough that a manual review every couple of months is fine. If the set grew to ten or more, automation would start to pay back.
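A rough sketch of the "find the elbow, add one or two" step, given one of those cumulative curves; the min_gain threshold is an illustrative knob, not a value from our production config.

def recommend_cap(curve, min_gain=0.01, margin=1):
    # curve[i] is the cumulative fix rate by attempt i + 1.
    # The elbow is the last attempt whose marginal gain still clears min_gain.
    elbow = 1
    previous = curve[0]
    for n, rate in enumerate(curve[1:], start=2):
        if rate - previous >= min_gain:
            elbow = n
        previous = rate
    return elbow + margin

The choice of min_gain is where the cost judgment hides: a cheaper model justifies chasing smaller marginal gains, which pushes the elbow, and the default, further out.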
The other tradeoff is that the policy is now opinionated in a way users can feel. If a customer on a Claude-routed project reports "fix gave up too early," the answer is sometimes "the default cap is 3, raise it to 5 on your project and try again." That is a real conversation we have had. It is the price of a default that is right on average but not for every codebase.
What the cap is, really
A retry cap is a budget. Specifically, it is a budget that integrates two things at once. The marginal probability of success at each attempt. The marginal cost of each attempt. The optimal cap is the largest N where the expected value of attempt N is still positive, which means attempt N's marginal success times the value of a fix exceeds attempt N's marginal cost in credits and verify time. That number is per-model because both factors are per-model.
When we set 3 for Claude and 6 for Qwen, we are saying the integral converges faster on Claude because high per-attempt success runs out of incremental room quickly, and converges slower on Qwen because lower per-attempt success keeps adding incremental room for longer at a much lower per-attempt cost. The split is what makes a multi-model workflow economically coherent.
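The same idea in code, as a sketch rather than anything we run; the probabilities, fix value, and per-attempt cost below are illustrative stand-ins.

def optimal_cap(marginal_success, fix_value, attempt_cost):
    # marginal_success[n-1] is the extra probability of a fix that attempt n adds,
    # i.e. the increment of the cumulative fix-rate curve at attempt n.
    cap = 1
    for n, gain in enumerate(marginal_success, start=1):
        if gain * fix_value > attempt_cost:   # attempt n still has positive expected value
            cap = n
    return cap

# Illustrative numbers only: a front-loaded curve and a fix worth about 33x one attempt.
print(optimal_cap([0.55, 0.15, 0.05, 0.02, 0.01], fix_value=1.0, attempt_cost=0.03))  # -> 3

Drop attempt_cost, as a self-hosted model does, and the same loop keeps returning larger caps for the same curve.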
If you are running anything like this loop in production, do not pick one number for all your models. Plot the curve. The number falls out.
Codens Purple is part of the harness at https://www.codens.ai/en/. The retry cap split lives in purple-codens under the project use case layer.