This post was created with AI assistance and reviewed for accuracy before publishing.
On March 19, 2026, Cursor shipped Composer 2, the next version of its in-house agentic coding model. It is positioned as frontier-level on coding work while Standard pricing stays at $0.50 per million input tokens and $2.50 per million output tokens. If you use Cursor’s Agent regularly, Composer 2 is the headline change: better scores on Cursor’s own suite and public agent benchmarks, with a faster tier available if you prioritize latency over token cost.
## What Composer 2 is
Composer is Cursor’s model built for software engineering inside the Cursor harness: tools for search, edits, terminal work, and multi-step tasks. Composer 2 replaces the previous generation for users who select it in the product. Cursor describes it as a jump in quality from continued pretraining, which gives reinforcement learning a stronger base, and from training on long-horizon coding problems where success takes hundreds of actions.
That matters because agent workflows are not single-shot completions. They are sequences of reads, edits, test runs, and retries. A model tuned for that loop is a different product decision than bolting a general chat model onto an IDE.
## What the benchmarks show
Cursor publishes numbers comparing Composer 2 to Composer 1.5 and Composer 1 on three tracks: CursorBench, Terminal-Bench 2.0, and SWE-bench Multilingual. The announcement post reports:
| Model | CursorBench | Terminal-Bench 2.0 | SWE-bench Multilingual |
|---|---|---|---|
| Composer 2 | 61.3 | 61.7 | 73.7 |
| Composer 1.5 | 44.2 | 47.9 | 65.9 |
| Composer 1 | 38.0 | 40.0 | 56.9 |
CursorBench is Cursor’s internal suite built from real sessions and graded for agent behavior, not just a single correct patch. Terminal-Bench 2.0 is an external terminal-oriented agent benchmark (Cursor documents using the Harbor evaluation framework for their reported score). SWE-bench Multilingual is a broader software engineering benchmark suite.
Treat any benchmark as a signal, not a guarantee for your repo. Your stack, tests, and conventions still decide whether a change ships. The useful takeaway is directional: Composer 2 is a large step from 1.5 and 1 on the metrics Cursor uses to ship the model.
In combination, the three tracks cover in-editor work, shell-driven workflows, and patch quality across languages, which matches how people actually split their time between the editor and the terminal. Cursor's post also compares its Terminal-Bench 2.0 score against public leaderboard figures where applicable.
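To make the generational jump concrete, here is a small sketch that computes absolute and relative gains from the table above. The scores are the ones Cursor published; the calculation itself is just arithmetic on that table.

```python
# Scores from Cursor's announcement table (Composer 2 vs. 1.5 vs. 1).
scores = {
    "CursorBench":            {"Composer 2": 61.3, "Composer 1.5": 44.2, "Composer 1": 38.0},
    "Terminal-Bench 2.0":     {"Composer 2": 61.7, "Composer 1.5": 47.9, "Composer 1": 40.0},
    "SWE-bench Multilingual": {"Composer 2": 73.7, "Composer 1.5": 65.9, "Composer 1": 56.9},
}

for bench, by_model in scores.items():
    v2, v15 = by_model["Composer 2"], by_model["Composer 1.5"]
    delta = v2 - v15
    rel = delta / v15 * 100
    print(f"{bench}: +{delta:.1f} points over Composer 1.5 ({rel:.0f}% relative)")
```

The relative gains are largest on CursorBench, which is consistent with Cursor tuning against its internal suite; the SWE-bench Multilingual gain is smaller in relative terms but on a higher base.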
## Training story: pretraining plus long-horizon RL
Cursor ties the quality gain to continued pretraining before scaling reinforcement learning. The blog also links long-horizon behavior to work on self-summarization and self-driving codebases, where agents must persist across many steps without losing the thread.
If you skim one thing besides the announcement, read "How we compare model quality in Cursor," which explains why Cursor invests in CursorBench: public benchmarks often drift away from how developers actually use agents, and grading underspecified tasks is hard. Composer 2's scores are meant to align with that internal bar, not only with leaderboard trivia.
The same post is candid about limitations of many public evals. SWE-style sets can be contaminated by training data. Some terminal puzzles do not resemble day-to-day product work. Cursor argues that CursorBench separates frontier models more cleanly in cases where public numbers look saturated. Whether you buy that claim wholesale or not, it explains why Cursor still publishes both internal and external numbers: internal for alignment with the product, external for comparability with the rest of the industry.
## Pricing: Standard, Fast, and usage pools
Composer 2 has two price points on the model side:
- Standard: $0.50 per million input tokens, $2.50 per million output tokens.
- Fast: $1.50 per million input, $7.50 per million output, which Cursor says offers the same intelligence as Standard, optimized for speed. Fast is the default option in the product.
On individual plans, Composer usage sits in a standalone usage pool with included usage; exact allowances change with plan details, so read the current pricing page when you budget.
For full parameters, limits, and defaults, use the Composer 2 model documentation.
Choosing Standard versus Fast is mostly economics and feel. Standard minimizes cost per token if you are comfortable with longer waits between turns. Fast costs more per token but targets lower latency, and Cursor makes it the default so interactive sessions stay snappy. For batch-like work, such as kicking off a long agent run while you review elsewhere, Standard may be enough. For tight feedback loops where you are steering step by step, Fast is the product default for a reason.
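The economics are easy to estimate from the published rates. Here is a minimal cost sketch; the session token counts are hypothetical, chosen to resemble a long agent run where repeated context reads dominate input.

```python
# Rates from the pricing section (USD per million tokens).
RATES = {
    "standard": {"input": 0.50, "output": 2.50},
    "fast":     {"input": 1.50, "output": 7.50},
}

def session_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one agent session at the given tier."""
    r = RATES[tier]
    return (input_tokens * r["input"] + output_tokens * r["output"]) / 1_000_000

# Hypothetical long session: 2M input tokens (context re-reads), 200k output.
print(session_cost("standard", 2_000_000, 200_000))  # 1.5
print(session_cost("fast", 2_000_000, 200_000))      # 4.5
```

Since Fast is exactly 3x Standard on both input and output rates, every session costs 3x more on Fast regardless of the input/output mix; the question is purely whether the latency improvement is worth that multiplier for your workflow.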
## Where to use it
Composer 2 is available inside Cursor. Cursor also points to an early alpha of Glass, a separate interface experiment, for trying the model there.
If you are evaluating whether to switch from another model in Cursor, run a few real tasks you already do: a refactor across files, a bug hunt with tests, or a dependency upgrade. Benchmarks narrow the field; your project confirms it.
A simple evaluation checklist helps avoid placebo conclusions:
- Pick one task class you repeat often, for example “extract a shared hook from three components” or “add structured logging across an API layer,” not a one-off trivial edit.
- Keep the prompt stable between models so you are comparing models, not prompt luck.
- Measure what you care about: time to green tests, number of files touched unnecessarily, and how often you had to revert or rewrite agent output.
- Run twice on different days. Agent variance is real; one heroic run is not a trend.
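The checklist above can be turned into a lightweight log so comparisons are recorded rather than remembered. This is a sketch under stated assumptions: the class names, fields, and sample numbers are illustrative, not part of any Cursor tooling.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class AgentRun:
    """One evaluation run of a model on a fixed task (fields are illustrative)."""
    model: str
    task: str
    minutes_to_green: float   # wall time until tests pass
    files_touched: int        # total files the agent edited
    unnecessary_edits: int    # files touched that review judged irrelevant
    reverted: bool            # did you throw the result away?

def summarize(runs: list[AgentRun]) -> dict:
    """Aggregate repeated runs so one heroic outlier doesn't decide the comparison."""
    return {
        "mean_minutes_to_green": mean(r.minutes_to_green for r in runs),
        "mean_unnecessary_edits": mean(r.unnecessary_edits for r in runs),
        "revert_rate": sum(r.reverted for r in runs) / len(runs),
    }

# Hypothetical results: the same task run twice on different days.
runs = [
    AgentRun("composer-2", "extract-shared-hook", 12.0, 4, 1, False),
    AgentRun("composer-2", "extract-shared-hook", 18.5, 6, 2, False),
]
print(summarize(runs))
```

Keeping the prompt and task fixed while only the model varies is what makes the aggregate numbers comparable across models.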
Composer 2’s advertised strength is sustained, tool-heavy work. Your evaluation should include at least one multi-step task that spans search, edit, and verification, not only a single completion in one file.
## Practical takeaway for developers
Composer 2 does not change the rule that you own the architecture and the merge. It raises the ceiling for agentic coding inside Cursor’s tool ecosystem and keeps the Standard token price below many frontier third-party models, with a clear Fast tier when responsiveness matters more than output token spend.
Watch how it behaves on your longest agent sessions, especially where context and discipline (tests, types, review) already matter. That is where a long-horizon model either earns trust or does not, regardless of leaderboard scores.
Teams should align on when Composer 2 is required versus optional. If one person uses it for security-sensitive paths and another uses a cheaper model without review, you have a process gap, not a model gap. Pair Composer 2 with the same practices you would use for any high-autonomy tool: branch protection, CI, and explicit review for risky areas.
Vendor lock-in framing is worth stating plainly. Composer 2 is a Cursor product, tuned for Cursor’s harness. That is a feature if you live in Cursor daily; it is a constraint if you need a portable API for non-Cursor pipelines. For most individual developers choosing a daily driver IDE, the question is whether the loop inside Cursor got better, not whether the model exists on every platform.
Finally, keep an eye on the changelog and model pages. Pricing tiers and defaults can shift as capacity and product strategy evolve. The numbers in this post come from Cursor’s March 2026 announcement; verify before you budget at scale.
Sources: Introducing Composer 2, Changelog: Composer 2, How we compare model quality in Cursor, Composer 2 docs.