Amit

Posted on Jun 6 • Originally published at artificialcuriositylabs.ai

Composer 2.5 Earned a Daily-Driver Slot — After Two Days of Real Wiring Work

#ainative #agents #cursor #patterns

TL;DR

Composer 2.5 scores 63.2% on CursorBench v3.1 (up from 52.2%) and 79.8% on SWE-Bench Multilingual — trained with 25x more synthetic tasks and mid-task localized feedback, not just terminal rewards.
328 requests and ~202M tokens over two heavy days; ~95% of tokens were cache reads and ~93% of usage landed on the included pool. Long agent loops re-read cached context at roughly 20:1 against fresh input.
It held trust boundary splits across turns without re-narration — human onboarding on portal, agent operations through tool gateway, inference on HTTP APIs — and got deploy order right unprompted.
One confirmed incident: background terminal processes spawned by a sub-agent weren't cleaned up, and the main agent kept waiting on them until I told it to kill them; subprocess state awareness is unreliable under heavy multi-terminal load.
Run real work through new tools and export the data — loyalty to a single harness at this pace of change is the wrong frame.

I've been running Claude Code as my primary agent harness for months. When Composer 2.5 landed in Cursor's agents window, I gave it a run — on real work, not a demo repo. After two days, something shifted.

Claude Code is my default harness for everything from deploy automation to multi-repo architecture work. I wasn't looking for a replacement.

I had tried Cursor in the past, but left after a week — the IDE-shaped flow wasn't how I work. Working inside a file tree, in a tab, in a window felt like I was doing the agent's job for it.

When Composer 2.5 dropped as the default model inside the agents window, I came back out of curiosity. Two days of real integration work later — deploy paths, identity boundaries, multi-repo wiring — it earned a slot alongside Claude Code. This is a first-impression report, not a verdict.

Why this model is different

Cursor disclosed the training approach: Composer 2.5 is built on Kimi K2.5 (Moonshot AI's open-weight model), trained with 25x more synthetic tasks than Composer 2. The mechanism that changed: instead of only terminal rewards at the end of a task, the training system injects localized feedback mid-task — hints when the model tries an unavailable tool call, corrections at the point of error rather than the end of the run. The result is better course-correction during a task, not just a better final output.

The benchmarks back this up. On CursorBench v3.1, a harder version of the benchmark, Composer 2.5 scores 63.2%, up from 52.2% for Composer 2. On SWE-Bench Multilingual it hits 79.8% (vs. 73.7%), and on Terminal-Bench 2.0 it reaches 69.3% (vs. 61.7%). For context, Opus 4.7 and GPT-5.5 score 69.4% and 82.7% on Terminal-Bench respectively. That puts Composer 2.5 level with Opus 4.7 and behind GPT-5.5 on the tasks it's trained for, at a fraction of the cost.

Those are vendor benchmarks, and I didn't rerun them. What I care about is whether the upgrade shows up in the sessions I actually run.

It does.

The work that backed the impression

The sessions that counted weren't single-file edits. Two repos in flight simultaneously: one tightening a deploy automation path (stack → DNS → CDN invalidation), one standing up a multi-tenant control plane with separate human and agent surfaces.

The ask on the second one: wire human onboarding and API keys on a dashboard, expose operations to agents through a tool gateway, keep inference on HTTP APIs — don't let chat completions leak into the portal. I described the trust boundaries. I didn't paste a full architecture doc.

Composer separated those layers without me re-narrating the graph every turn. Deploy order came out right: provision before DNS, invalidate after assets land. On identity, it tracked which surface holds credentials, which path is connect for humans, which is tools for agents.
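The ordering rule above (provision before DNS, invalidate only after assets land) can be written down as a dependency graph. This is a minimal sketch, not the deploy tooling used in these sessions: the step names are hypothetical, and only the ordering constraints come from the text.

```python
from graphlib import TopologicalSorter

# Each step maps to the set of steps that must finish before it starts.
# Step names are hypothetical; only the ordering rules come from the article.
DEPENDS_ON = {
    "provision_stack": set(),
    "update_dns": {"provision_stack"},
    "upload_assets": {"provision_stack"},
    "invalidate_cdn": {"update_dns", "upload_assets"},
}


def deploy_order(depends_on):
    """Return one valid execution order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


if __name__ == "__main__":
    for step in deploy_order(DEPENDS_ON):
        print(step)
```

Encoding the order as data means a wrong sequence shows up as a failed check, not as a stale CDN cache after the fact.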

One miss: it once suggested inference-style chat behind the human onboarding UI. I said no — inference on APIs, portal human-first — and the next turns held the split.
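One way to make that split checkable is a small allow-list per surface. This is an illustrative sketch under assumptions: the surface and capability names are hypothetical labels for the boundaries described above, not identifiers from the real control plane.

```python
# Hypothetical labels for the three surfaces and what each may expose.
ALLOWED = {
    "portal": {"human_onboarding", "api_key_management"},
    "tool_gateway": {"agent_operations"},
    "http_api": {"inference"},
}


def is_allowed(surface, capability):
    """True if the capability belongs on the given surface."""
    return capability in ALLOWED.get(surface, set())


def violations(routes):
    """Return the (surface, capability) pairs that cross a trust boundary."""
    return [(s, c) for s, c in routes if not is_allowed(s, c)]


if __name__ == "__main__":
    proposed = [("portal", "human_onboarding"), ("portal", "inference")]
    print(violations(proposed))
```

Under this sketch, the suggestion to put chat inference behind the onboarding UI is flagged as `("portal", "inference")`.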

That's the collaboration shape I want: I hold the diagram; the agent holds the files.

On Composer 2 that re-narration cost me a turn or two every time the context drifted. On 2.5, it mostly didn't drift. That's the practical difference.

Why the agents UI was the other half of this

The model alone wouldn't have done it. The agents window is what made the two days feel like flow, not friction.

The throughput is a delight. Responses come fast enough that I never left the problem — I stayed in the architecture while the agent worked the files. That's the thing most AI coding tools get wrong: they make you wait in a way that breaks your mental model of what you're building. You drift. You pick up your phone. You lose the thread. The agents window didn't do that. I was batching work — multiple threads in the same repo, separate agents across repos — because I could keep up with what was coming back.

The surface itself deserves credit. It's not an IDE with an AI bolted on. The agents window is built around the work, not around the files. I never had to navigate a file tree to stay oriented. I described what I needed at the architecture level; the agent handled the file-level execution. That separation — human on structure, agent on files — is the correct division of labor. Most tools blur it.

What I didn't expect: the interface made the parallel work feel manageable rather than chaotic. I could see what each agent was doing, pick up a thread, hand something off, and move on without losing context on the other sessions. The whole thing felt designed for someone who has too much to do and needs the agents to carry the load without dropping the thread. That's exactly what it is.

What the numbers show

I exported the usage CSV after the heavy stretch. No account details — just aggregates.

Metric	My export
Requests	328
Total tokens	~202M
Window	~two days heavy use
Billing	~93% Included on Auto + Composer pool

The model labels break down as auto for 313 requests (~199M tokens) and composer-2.5-fast for 15 (~2.5M tokens). Most work routes as auto rather than under an explicit model name: Cursor's Auto and Composer 2.5 share the same pricing pool, and the routing decision is Cursor's. Don't read "few composer-2.5-fast rows" as "wasn't on 2.5."

The more interesting number is cache shape:

Token type	Share of ~202M
Cache read	~95%
Fresh input	~5%
Output	~0.5%

On the input side, cached tokens outnumber fresh tokens roughly 20:1 (95% against 5% works out to 19:1 before rounding). The median per-request cache hit rate is 98%. That's what long agent loops look like: the same context re-read across tool rounds, not a full re-ingest every message. It's part of why 328 requests across parallel agents felt affordable on the included pool.

The CSV has no duration or tokens-per-second. Speed stays subjective. What the export proves: the intensity was real, and 93% of it was on the included pool.
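For anyone who wants to reproduce these aggregates from their own export, here is a minimal sketch. The column names (`cache_read_tokens`, `input_tokens`, `output_tokens`) are assumptions and the real export headers may differ. The per-request hit rate is assumed to be cache reads divided by cache reads plus fresh input.

```python
import csv
import io
from statistics import median


def summarize(csv_text):
    """Aggregate token counts from a usage export (column names are assumed)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    cache = sum(int(r["cache_read_tokens"]) for r in rows)
    fresh = sum(int(r["input_tokens"]) for r in rows)
    output = sum(int(r["output_tokens"]) for r in rows)
    total = cache + fresh + output

    # Per-request hit rate: cached input as a share of all input tokens.
    hit_rates = []
    for r in rows:
        c, f = int(r["cache_read_tokens"]), int(r["input_tokens"])
        if c + f > 0:
            hit_rates.append(c / (c + f))

    return {
        "requests": len(rows),
        "total_tokens": total,
        "cache_share": cache / total if total else 0.0,
        "cache_to_fresh": cache / fresh if fresh else float("inf"),
        "median_hit_rate": median(hit_rates) if hit_rates else 0.0,
    }


if __name__ == "__main__":
    sample = (
        "model,cache_read_tokens,input_tokens,output_tokens\n"
        "auto,950,50,5\n"
        "auto,1900,100,10\n"
    )
    print(summarize(sample))
```

The sample rows are made up to exercise the function; they are not taken from the export discussed here.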

What's still open

The two-day window is enough to earn a trial slot, not a final verdict. A few things I haven't stress-tested:

Fast-tier throughput. Only 15 explicit fast-tier runs in my export. Others report uneven throughput on fast mode. Worth noting: Composer 2.5 fast pricing doubled vs Composer 2 — $3.00/$15.00 per million tokens vs $1.50/$7.50. If you're running high volume on fast, model that cost before you scale.
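A rough way to model that cost before scaling is sketched below. It assumes each quoted price pair is (input, output) in USD per million tokens, which the text does not state explicitly. It also ignores any separate pricing for cached tokens, so treat it as an upper-bound estimate, not a bill.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens (assumed meaning)
    output_per_m: float  # USD per million output tokens (assumed meaning)


# Assumption: the quoted pairs are (input, output) prices.
COMPOSER_2_FAST = Price(1.50, 7.50)
COMPOSER_2_5_FAST = Price(3.00, 15.00)


def cost_usd(price, input_tokens, output_tokens):
    """Estimate cost with no cache discount applied."""
    return (
        input_tokens / 1e6 * price.input_per_m
        + output_tokens / 1e6 * price.output_per_m
    )


if __name__ == "__main__":
    # Hypothetical workload: 1M input tokens and 100k output tokens.
    for name, price in [("2 fast", COMPOSER_2_FAST), ("2.5 fast", COMPOSER_2_5_FAST)]:
        print(name, round(cost_usd(price, 1_000_000, 100_000), 2))
```

Because both prices doubled, the same workload costs twice as much regardless of the input-to-output mix.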

Open-ended depth. My integration sessions held up better than one-off "explain this concept" probes. Multi-step architecture asks outperformed exploratory ones. That's the pattern to watch.

Ghost terminal sessions. I hit one incident where a local agent had spawned ten background terminal processes, finished its work, and moved on — but hadn't cleaned them up. The main agent kept treating those sessions as live and waited on them. I had to explicitly tell it to kill them before it would proceed. This is consistent with a forum-confirmed incident on May 21 where agents consumed tokens silently without output. Not a dealbreaker — but the agent's awareness of its own subprocess state isn't reliable yet on heavy multi-terminal sessions.
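The failure mode is easy to reproduce outside any agent harness. The sketch below is a generic illustration with Python's `subprocess` module, not a description of how Cursor manages terminals: it spawns background children, checks which are still alive, and cleans them up explicitly so nothing waits on them.

```python
import subprocess
import sys


def spawn_background(n):
    """Start n long-running children, standing in for background terminals."""
    cmd = [sys.executable, "-c", "import time; time.sleep(60)"]
    return [subprocess.Popen(cmd) for _ in range(n)]


def still_running(procs):
    """poll() returns None while a child is alive."""
    return [p for p in procs if p.poll() is None]


def cleanup(procs, timeout=5.0):
    """Terminate every live child, escalating to kill if one ignores the request."""
    for p in still_running(procs):
        p.terminate()
    for p in procs:
        try:
            p.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            p.kill()
            p.wait()


if __name__ == "__main__":
    procs = spawn_background(3)
    print(len(still_running(procs)), "running before cleanup")
    cleanup(procs)
    print(len(still_running(procs)), "running after cleanup")
```

The point of the sketch is the explicit `still_running` check: whoever spawns the processes has to own the bookkeeping, or a later waiter has no reliable way to know they are orphaned.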

Cursor 3.5 shipped May 20 with automations management directly in the agents window — recurring agents, not just one-shot sessions. That's the next surface to test.

So what

The pace of change in this space is fast enough that loyalty to any single tool is the wrong frame. Learning AI is building with AI — the two don't happen in isolation. You learn what a model can actually do by putting it on real work, not by reading the release post.

Composer 2.5 surprised me. I wasn't expecting it to hold the trust-boundary split across turns, or to get deploy order right without me re-narrating the graph. It did. That's worth noting — and worth continuing to test.

The right posture isn't switching. It's staying curious. Different tools expose different strengths. Claude Code is still my primary harness. Composer 2.5 in the agents window earned a slot alongside it. Whether it expands, shrinks, or disappears from my rotation depends on the next stretch of mixed, boring, and hard tasks.

Keep an open mind. Run real work through new things. Export the data and compare it to how it felt. That's how you know what's actually changed.

Daily driver is the right phrase for this week. A month from now the answer will come from the next hard session — not from this one.
