"Adding Cursor Composer 2.5 as a third executor lane: 10x cheaper than Opus at comparable scores, but smoke tells a different story"

#cursor #ai #python #aws

A roughly tenfold per-task cost drop at comparable accuracy is one of those numbers you do not get to ignore for very long. Composer 2.5 published SWE-Bench Multilingual figures in the same neighborhood as Opus, and the per-attempt API cost is about an order of magnitude lower. For an agent harness that runs hundreds of attempts per project per week, a 10x cost compression on a viable lane reshapes the unit economics enough to justify a real integration, not just a spike.

So I shipped Composer 2.5 as a third executor lane in Codens Purple, the orchestration service that decides which model runs each task. Codens was already running two lanes side by side: Claude via the raw Anthropic API and a self-hosted Qwen deployment. The third lane went in over two days, May 23-24, across a Phase 1 skeleton commit, a Phase 2 SDK wire, an ECS Fargate task definition change, an IAM credential isolation fix, and a one-project canary toggle.

Then I ran a smoke pass. Of 25 attempts across v4 through v17, 16 failed. The integration works. The benchmark numbers are not the production numbers. This is the writeup of both halves: what shipped, and what the smoke phase actually told me.

Why a third lane at all

The case for a third lane is the same case I made earlier this year for the per-model retry cap pattern. Each model has its own failure shape and its own cost curve. Pinning the whole harness to one provider means inheriting one bill, one rate-limit policy, and one definition of "the model got it wrong."

Composer 2.5 changes the cost arithmetic in a way that matters at our retry caps. Codens retries each task per model up to a cap: claude=3, qwen=6, composer-2.5=5 for now. At cap=3 with Opus, the worst-case attempt cost dominates the per-task budget. With Composer 2.5 at roughly 1/10 the per-attempt rate and comparable accuracy, the same three attempts cost roughly an order of magnitude less, and even a task that burns all five Composer attempts costs about half of one Opus attempt. That holds before factoring in any difference in first-pass success. That math is what made integration time worth spending.
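
A minimal sketch of that arithmetic, under stated assumptions: costs are normalized so that one Opus attempt equals 1.0, the 0.1 figure for Composer 2.5 is the rough 1/10 ratio quoted above rather than a measured price, and `expected_cost` assumes each attempt succeeds independently with the same probability, which real tasks may not satisfy. Qwen has no relative cost in this post, so it is left out of the cost table.

```python
from typing import Optional

# Relative per-attempt cost, normalized so one Opus attempt = 1.0.
# ASSUMPTION: 0.1 is the post's "roughly 1/10" estimate, not a measured price.
RELATIVE_ATTEMPT_COST = {"claude": 1.0, "composer-2.5": 0.1}

# Retry caps as stated in the post.
RETRY_CAP = {"claude": 3, "qwen": 6, "composer-2.5": 5}


def worst_case_cost(lane: str, cap: Optional[int] = None) -> float:
    """Cost of a task that burns every allowed attempt, in Opus-attempt units."""
    attempts = RETRY_CAP[lane] if cap is None else cap
    return attempts * RELATIVE_ATTEMPT_COST[lane]


def expected_cost(lane: str, p_success: float) -> float:
    """Expected cost per task if each attempt succeeds with probability p_success.

    ASSUMPTION: attempts are independent and identically likely to succeed.
    """
    cap = RETRY_CAP[lane]
    # Attempt k+1 only runs if the first k attempts all failed.
    expected_attempts = sum((1.0 - p_success) ** k for k in range(cap))
    return expected_attempts * RELATIVE_ATTEMPT_COST[lane]
```

Calling `worst_case_cost("claude")` against `worst_case_cost("composer-2.5")` compares the two lanes at their own caps, and passing `cap=3` to the second call gives the equal-cap comparison.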

The optionality argument also got stronger recently. Anthropic clarified that the Agent SDK and claude -p CLI workflows are not covered by subscription plans for agent use cases, which validates the API-direct path Codens already runs on. Adding a Cursor lane on top of that is the same bet, extended: do not get pinned to any one vendor's pricing or policy, and keep the harness free to route tasks to whichever lane wins on cost and reliability for the workload at hand.

Executor lane design

The pleasant part of the design was that PurpleTask.execute_model already supported per-task model switching, and PurpleProject.default_model already let an entire project pin a model. Adding the third lane was not an architecture change. It was an enum value plus a new runner module.

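from sqlalchemy import Column, Enum  # Base is the project's declarative base, defined elsewhere

class PurpleTask(Base):
    # existing fields elided
    execute_model = Column(
        # One value per routable model; "composer-2.5" is the new lane.
        Enum("opus", "sonnet", "qwen", "composer-2.5", name="execute_model"),
        nullable=True,
    )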

The runner dispatcher already had two branches: runner_claude.py for the Anthropic API path that wraps the claude -p CLI, and runner_qwen.py for the self-hosted endpoint. The third runner, runner_cursor.py, slots in next to those two with the same input contract (task spec, workspace dir, env) and the same output contract (workspace diff, structured result, failure_reason on non-zero).
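
The shared contract can be sketched roughly as follows. This is an illustration, not the actual Purple dispatcher: `RunnerResult`, `DISPATCH`, `dispatch`, and the stub runners are hypothetical names, and the mapping of both `opus` and `sonnet` to the Claude runner is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Optional


@dataclass
class RunnerResult:
    """Hypothetical output contract shared by every lane."""
    exit_code: int
    workspace_diff: str
    structured_result: dict
    failure_reason: Optional[str] = None  # populated when exit_code != 0


# Shared input contract: (task_spec, workspace_dir, env) -> RunnerResult
Runner = Callable[[dict, str, Mapping[str, str]], RunnerResult]


def _stub(module_name: str) -> Runner:
    """Stand-in for a real runner module; always fails with a labeled reason."""
    def run(task_spec: dict, workspace_dir: str, env: Mapping[str, str]) -> RunnerResult:
        return RunnerResult(1, "", {}, f"{module_name}: stub runner")
    return run


# ASSUMPTION: enum values map to runner modules like this.
DISPATCH: Dict[str, Runner] = {
    "opus": _stub("runner_claude"),
    "sonnet": _stub("runner_claude"),
    "qwen": _stub("runner_qwen"),
    "composer-2.5": _stub("runner_cursor"),
}


def dispatch(model: str, task_spec: dict, workspace_dir: str,
             env: Mapping[str, str]) -> RunnerResult:
    """Route a task to the runner registered for its model."""
    runner = DISPATCH.get(model)
    if runner is None:
        return RunnerResult(2, "", {}, f"no runner registered for model {model!r}")
    return runner(task_spec, workspace_dir, env)
```

Because every runner returns the same result type, adding a lane in this shape is one new dictionary entry plus one new module.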

I split the change into two commits on purpose. Phase 1 was a validation-only runner that exited non-zero on every invocation, plus the enum addition. It was shippable in isolation, with zero behavior change for existing tasks because nothing pointed at composer-2.5 yet. Phase 2 was the actual SDK call. Splitting like this means each commit can be reverted on its own, and the enum migration is not coupled to any SDK behavior question.

I have learned the hard way that bundling an enum addition with the runtime that depends on it produces commits you cannot cleanly revert when the runtime turns out to be the problem. Phase 1 / Phase 2 splits are cheap insurance.

Phase 1: the skeleton

Phase 1, commit 5a575031, did three things and nothing else. It added composer-2.5 to the model enum, registered runner_cursor.py in the dispatch table, and made the runner validate its inputs and exit non-zero with a clear "not yet implemented" failure_reason. The migration ran on staging. The dispatch table picked up the new entry. No production task pointed at the new lane, so the runner was never invoked in the live path.
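
A skeleton runner in that spirit might look like the sketch below. It is not the code from commit 5a575031: the `prompt` field on the task spec, the function names, and the exact failure strings are assumptions made for illustration.

```python
import os
from typing import Mapping, Optional, Tuple

NOT_IMPLEMENTED = "runner_cursor: not yet implemented (Phase 1 skeleton)"


def validate_inputs(task_spec: dict, workspace_dir: str,
                    env: Mapping[str, str]) -> Optional[str]:
    """Return a failure_reason for the first invalid input, or None if all pass."""
    # ASSUMPTION: the task spec carries its instructions under a "prompt" key.
    if not task_spec.get("prompt"):
        return "runner_cursor: task spec has no prompt"
    if not os.path.isdir(workspace_dir):
        return f"runner_cursor: workspace dir does not exist: {workspace_dir}"
    return None


def run(task_spec: dict, workspace_dir: str,
        env: Mapping[str, str]) -> Tuple[int, str]:
    """Phase 1 behavior: validate, then fail on purpose. Never calls the SDK."""
    problem = validate_inputs(task_spec, workspace_dir, env)
    # Exit code is 1 either way; only the failure_reason differs.
    return 1, problem or NOT_IMPLEMENTED
```

The value of a runner like this is that any invocation exercises the dispatcher, the workspace mount, and the failure_reason path without involving the SDK at all.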

This is the kind of commit that looks like it does nothing and is actually doing the most important thing: proving the surrounding plumbing is correct before the new code can hide bugs in the plumbing. If Phase 2 had landed in one shot and the SDK call had failed, I would have spent the next hour trying to figure out whether the failure was in the dispatcher, the env wiring, the IAM role, or the SDK. With Phase 1 already in production for an hour, the only thing Phase 2 could break was the SDK call itself.

Phase 2: wiring the Cursor SDK

Phase 2, commit b1e7ebcd, is where the real work happened. The Cursor Python SDK exposes a session that walks Bridge → Client → Agent → events. The shape in the runner is:

bridge = await Bridge.launch(...)
client = Client(bridge=bridge)
agent = await client.agent.create(
    model=ModelSelection(id=model_id),
    local=LocalAgentOptions(cwd=workspace_dir),
)
run = await agent.send(prompt, SendOptions(...))
async for event in run.events():
    handle_event(event)

The local=LocalAgentOptions(cwd=workspace_dir) part matters: Cursor agents can run remotely or locally, and for Codens the workspace is already mounted into the Fargate task at a known path, so local-mode keeps the file IO inside the task and avoids round-tripping the diff over the wire. agent.send returns a run handle whose events() async iterator yields the structured event stream we already know how to consume from the Claude path. The translation layer in runner_cursor.py normalizes Cursor's event shapes to the internal event schema that the rest of Purple already speaks.
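
The translation layer could be sketched along these lines. The event type names, the `TYPE_MAP` table, and the internal schema with `source`, `type`, and `payload` keys are all hypothetical; the post does not show either schema, so treat this as one possible shape rather than what runner_cursor.py does.

```python
from typing import Any, Dict

# ASSUMPTION: hypothetical vendor event types and their internal equivalents.
TYPE_MAP = {
    "assistant_message": "message",
    "tool_call": "tool_use",
    "error": "error",
}


def normalize_event(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one vendor event dict into a hypothetical internal schema."""
    internal_type = TYPE_MAP.get(raw.get("type", ""), "unknown")
    # Keep vendor fields nested under "payload" instead of merging them
    # into the top level, so they cannot overwrite internal keys.
    payload = {key: value for key, value in raw.items() if key != "type"}
    return {"source": "cursor", "type": internal_type, "payload": payload}
```

Nesting vendor fields under a single key is one way to keep two schemas that share field names from clobbering each other during translation.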

CURSOR_API_KEY is the obvious blocker. We store it in AWS Secrets Manager at purple-codens-prod/cursor-api-key and inject it into the per-task environment so the SDK picks it up automatically. The ECS Fargate task definition change in PR #1156 (commits d1ef5db4 and 656f42e4) exposes the secret ARN as an environment variable:

{
  "name": "CURSOR_API_KEY_SECRET_ARN",
  "value": "arn:aws:secretsmanager:ap-northeast-1:...:secret:purple-codens-prod/cursor-api-key"
}

The entrypoint script resolves it before launching the runner:

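# Resolve the Cursor API key from Secrets Manager, using the ARN set in the task definition.
CURSOR_API_KEY=$(aws secretsmanager get-secret-value \
    --secret-id "$CURSOR_API_KEY_SECRET_ARN" \
    --query SecretString --output text)
export CURSOR_API_KEY
# Replace the shell with the runner process, forwarding all arguments.
exec python -m purple.runner_cursor "$@"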

This is where I introduced a bug I want to flag specifically, because it is the kind of bug a multi-tenant SaaS should never ship. The initial commit pulled the secret using whatever AWS_PROFILE was active in the task environment, which in some code paths was inherited from the customer's connected AWS credentials. That is wrong in a multi-tenant harness. The fix in commit 6210a052 makes the entrypoint use the ECS task IAM role for the Secrets Manager call, never the customer's profile. Customer credentials are scoped to customer resources only. Platform credentials, including our Cursor API key, must resolve through the task role. It was an easy mistake to make and an important one to fix.
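
One way to enforce that separation is to scrub customer-supplied credential settings from the environment before any platform AWS call. The sketch below is not the code from commit 6210a052; `platform_env` and `CUSTOMER_CREDENTIAL_VARS` are hypothetical names. It relies on the default AWS credential chain consulting environment variables and profiles before container credentials, so removing them leaves the task role as the provider that resolves. It does not cover credentials files sitting at the default path in the home directory.

```python
from typing import Dict, Mapping

# Standard AWS environment variables that can steer the credential chain
# away from the ECS task role.
CUSTOMER_CREDENTIAL_VARS = (
    "AWS_PROFILE",
    "AWS_DEFAULT_PROFILE",
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_SESSION_TOKEN",
    "AWS_SHARED_CREDENTIALS_FILE",
    "AWS_CONFIG_FILE",
)


def platform_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Return a copy of env with customer credential settings removed.

    AWS_CONTAINER_CREDENTIALS_RELATIVE_URI, which ECS sets for the task
    role, is deliberately left in place.
    """
    return {k: v for k, v in env.items() if k not in CUSTOMER_CREDENTIAL_VARS}
```

The scrubbed mapping would then be passed as the environment of whatever process performs the Secrets Manager call, while the customer's variables stay available to the code that touches customer resources.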

The canary procedure

I do not trust new lanes in production until a real project has run on them for at least a day. The canary procedure (commit d6fe3cb3) is intentionally small: flip purple_projects.default_model = 'composer-2.5' on exactly one internal Corevice-org project, dogfood it, and watch the metrics. Every other project stays on whatever model it was already on, which means the canary is fully isolated.

The SQL is one row:

UPDATE purple_projects
SET default_model = 'composer-2.5'
WHERE id = '<internal-project-id>';

Rollback is the same statement with the prior value. No code deploy involved. This is one of the upsides of keeping model selection as runtime data rather than baking it into deploy artifacts: rollback is a transaction, not a release.
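
The toggle and its rollback can be sketched as a small helper that records the prior value as it flips the row. This uses an in-memory SQLite table as a stand-in for the real database, and `set_default_model` and `make_demo_db` are hypothetical names; only the table and column names come from the statement above.

```python
import sqlite3
from typing import Optional


def set_default_model(conn: sqlite3.Connection, project_id: str,
                      model: Optional[str]) -> Optional[str]:
    """Switch one project's default model and return the prior value."""
    with conn:  # commits on success, rolls back if anything below raises
        row = conn.execute(
            "SELECT default_model FROM purple_projects WHERE id = ?",
            (project_id,),
        ).fetchone()
        if row is None:
            raise LookupError(f"no such project: {project_id}")
        conn.execute(
            "UPDATE purple_projects SET default_model = ? WHERE id = ?",
            (model, project_id),
        )
    return row[0]


def make_demo_db() -> sqlite3.Connection:
    """Build a throwaway table with one canary project and one bystander."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purple_projects (id TEXT PRIMARY KEY, default_model TEXT)"
    )
    conn.executemany(
        "INSERT INTO purple_projects VALUES (?, ?)",
        [("canary", "opus"), ("other", "qwen")],
    )
    conn.commit()
    return conn
```

Turning the canary on is `prior = set_default_model(conn, "canary", "composer-2.5")`, and rolling back is `set_default_model(conn, "canary", prior)`. Returning the prior value means the rollback does not depend on anyone remembering what the project ran on before.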

The comparison axes we track on the canary versus the same project's last 30 days on Opus:

Completion rate (task finishes without exhausting retries)
Verify pass rate (Codens verify steps succeed against the final diff)
Wall time per task
Cost per completed task
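
Those four axes could be computed from per-task records along these lines. `TaskRecord` and its fields are a hypothetical shape, not the Codens schema, and the sketch makes one choice worth stating: cost per completed task divides total spend, including spend on tasks that never finished, by the number of completed tasks.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, Iterable


@dataclass
class TaskRecord:
    """Hypothetical per-task record; field names are assumptions."""
    completed: bool
    verify_passed: bool
    wall_seconds: float
    cost_usd: float


def lane_metrics(records: Iterable[TaskRecord]) -> Dict[str, float]:
    """Compute the four comparison axes for one lane over one window."""
    records = list(records)
    if not records:
        raise ValueError("no records to summarize")
    done = [r for r in records if r.completed]
    total_cost = sum(r.cost_usd for r in records)
    return {
        "completion_rate": len(done) / len(records),
        # Verify rate is measured over completed tasks only.
        "verify_pass_rate": (
            sum(1 for r in done if r.verify_passed) / len(done) if done else 0.0
        ),
        "mean_wall_seconds": mean(r.wall_seconds for r in records),
        # Failed tasks still cost money, so they stay in the numerator.
        "cost_per_completed_task": (
            total_cost / len(done) if done else float("inf")
        ),
    }
```

Running the same function over the canary window and the Opus baseline window gives two dictionaries that can be compared key by key.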

The point of the canary is not to certify the lane is good. The point is to surface the failure modes that benchmarks do not surface, before any real customer touches the new lane.

What the smoke runs actually showed

Across v4 through v17, the smoke pass ran 25 attempts on the canary project. Nine finished. Sixteen failed. That is a 36% completion rate on a workload where the equivalent Opus runs were sitting around 80%+. The benchmark numbers and the production numbers were not the same numbers.

Two failure modes accounted for almost all of the misses.

The Cursor SDK bridge dropped mid-session on a handful of long-running tasks. When the bridge dropped, the workspace diff in progress was lost, the run handle errored, and the runner reported a generic SDK exception. Salvaging the partial diff at the moment the bridge dropped was the obvious fix. Commit 0f95f020 catches the bridge-drop exception, snapshots whatever is currently on disk in the workspace, and feeds that diff into the retry attempt's context so the next attempt does not start from zero.
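
The salvage idea can be sketched as a retry loop that snapshots the workspace whenever the bridge drops. This is not the code from commit 0f95f020: `BridgeDropped` is a hypothetical stand-in for whatever the SDK raises, and the snapshot here is a plain map of file contents, where a real runner would more likely compute a diff against the task's base revision.

```python
import os
from typing import Callable, Dict, Optional


class BridgeDropped(Exception):
    """Hypothetical stand-in for the SDK error raised when the bridge drops."""


def snapshot_workspace(workspace_dir: str) -> Dict[str, str]:
    """Read every file under workspace_dir into {relative_path: text}."""
    snapshot: Dict[str, str] = {}
    for root, _dirs, files in os.walk(workspace_dir):
        for name in files:
            path = os.path.join(root, name)
            with open(path, "r", encoding="utf-8", errors="replace") as fh:
                snapshot[os.path.relpath(path, workspace_dir)] = fh.read()
    return snapshot


Attempt = Callable[[str, Optional[Dict[str, str]]], str]


def run_with_salvage(attempt: Attempt, workspace_dir: str,
                     max_attempts: int) -> Optional[str]:
    """Retry attempt(), handing each retry the previous partial work."""
    salvaged: Optional[Dict[str, str]] = None
    for _ in range(max_attempts):
        try:
            return attempt(workspace_dir, salvaged)
        except BridgeDropped:
            # Capture whatever reached disk before the drop.
            salvaged = snapshot_workspace(workspace_dir)
    return None  # cap exhausted; the caller reports the failure
```

The key property is that the second attempt receives the first attempt's partial output as context instead of `None`.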

The other failure mode was uglier. When a task exhausted its retry cap, the runner reported failure_reason = "exceeded max executions (5)" and that was it. The operator on the other side had no visibility into why each of those five attempts had failed. The fix in the same commit (0f95f020) enriches failure_reason with the last attempt's actual error string. Now when the cap is exhausted, the operator sees "exceeded max executions (5): last attempt failed with: <real error>" and can route the task to a different lane or escalate.
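
The enriched message is simple to build once the runner keeps each attempt's error string. A sketch, with `cap_exhausted_reason` as a hypothetical helper name and the output format taken from the message quoted above:

```python
from typing import Sequence


def cap_exhausted_reason(cap: int, attempt_errors: Sequence[str]) -> str:
    """Build failure_reason for a task that used up its retry cap."""
    base = f"exceeded max executions ({cap})"
    # Fall back to the bare message if no error text was recorded.
    last = attempt_errors[-1].strip() if attempt_errors else ""
    if not last:
        return base
    return f"{base}: last attempt failed with: {last}"
```

Only the last error is appended here, matching the fix described above. Keeping the full list of per-attempt errors elsewhere would let an operator see whether all five attempts failed the same way.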

Two smaller fixes shipped alongside. Commit 1be0614f surfaces the AWS CLI failure when the Secrets Manager call fails. Previously the entrypoint swallowed it silently and the runner started with an empty CURSOR_API_KEY, producing an opaque 401 from the SDK three seconds later. Now the entrypoint exits non-zero with the AWS CLI error before the runner even starts. Commit 64af2b50 cleans up the per-task env injection and resolves a collision on the message field between the Cursor event schema and our internal one, which was causing some events to lose their payload during translation.
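
The fail-fast behavior can be illustrated in Python, although the actual fix lives in the shell entrypoint. `resolve_secret`, `aws_secret_command`, and `SecretResolutionError` are hypothetical names; the AWS CLI arguments mirror the entrypoint snippet shown earlier.

```python
import subprocess
from typing import List, Sequence


class SecretResolutionError(RuntimeError):
    """Raised when the secret cannot be resolved to a non-empty value."""


def aws_secret_command(secret_arn: str) -> List[str]:
    """AWS CLI invocation equivalent to the one in the entrypoint script."""
    return [
        "aws", "secretsmanager", "get-secret-value",
        "--secret-id", secret_arn,
        "--query", "SecretString", "--output", "text",
    ]


def resolve_secret(command: Sequence[str]) -> str:
    """Run command and return its stdout, failing loudly instead of silently."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode != 0:
        # Surface the CLI's own error text rather than continuing unkeyed.
        raise SecretResolutionError(
            f"secret lookup exited {proc.returncode}: {proc.stderr.strip()}"
        )
    value = proc.stdout.strip()
    if not value:
        raise SecretResolutionError("secret lookup returned an empty value")
    return value
```

A caller would run `resolve_secret(aws_secret_command(arn))` before starting the runner and exit non-zero on `SecretResolutionError`, so a permissions problem shows up as an AWS error rather than as a 401 from the SDK.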

None of these fixes turn Composer 2.5 into a production-grade lane for our workload. They turn it into a lane I can operate, observe, and reason about while we keep iterating on it. The canary stays canary. Customer-facing projects stay on the lanes they were on.

Closing

Multi-lane executor architecture is a hedge, and like all hedges, the value shows up only when you actually need it. Composer 2.5 may or may not become a default-routing lane for Codens in the coming weeks. The 10x cost compression is real, the benchmark numbers are real, and the smoke phase is also real. The point of the canary procedure is that we get to find out which of those three numbers matters for our workload before any customer feels it.

The integration cost was a Phase 1 skeleton, a Phase 2 SDK wire, an ECS task definition change, an IAM fix, and a one-row SQL toggle. The integration value, regardless of whether Composer 2.5 sticks, is one more lane the harness can route through next time a pricing announcement or a model release reshapes the cost curve. That optionality is what an AI dev harness is supposed to give you.

Codens is at https://www.codens.ai/en/ if you want to see what a multi-lane harness for autonomous code repair and QA looks like in production.