
Gabriel Anhaia


Cursor Composer 2 Is a $200/Month Habit Now. Was It Worth It?


$200/month per seat. That is what Cursor's Ultra plan costs, and Composer 2 is the reason a lot of teams just signed off on it. An engineer I work with described the shift bluntly: he used to expense a $20 Cursor seat, now it's $200, and finance is asking why. Cursor released Composer 2 on March 19, 2026; the Ultra plan that actually unlocks running it without rate-limit anxiety sits at $200/month, and a lot of teams have spent the last five weeks deciding whether the line item is justified.

This post covers what Composer 2 does that a tab-completion model doesn't, where it earns its money, where it sets fire to your token budget, and a decision rule for whether your workflow actually needs the Ultra tier.

The thing the headlines miss

Composer 2 is not a chat model. It is an agentic coding model trained on tasks that require hundreds of sequential actions. The benchmarks people cite (CursorBench 61.3 on Cursor's own evaluation, and Terminal-Bench 2.0 at 61.7 per the project's leaderboard) are measuring a different shape of work than tab-completion or chat-with-codebase.

What that means in practice: when you ask Composer 2 to "split the auth module out of users into its own package, update every import, fix the broken tests, and run the build until it passes," it does that. It plans. It edits across files. It runs commands. It reads the failure output and tries again. It keeps going until either the build is green or it gives up and tells you why.

A tab-completion model cannot do that. A chat model can describe how to do it. Composer 2 actually does it. That is the whole pitch, and it is the only frame in which the $200 makes sense.

Where Composer 2 earns the line item

Multi-file refactors with cascading changes. Renaming a function used in 80 places, splitting a module, extracting an interface, migrating a class hierarchy to composition. The model can chase imports across the repo, update the call sites, regenerate snapshots, and re-run tests. A Pydantic v1-to-v2 migration that a quarterly plan had penciled in for three days can finish in an afternoon (illustrative; your repo's test coverage and import graph drive the variance).
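
For a sense of what that migration actually touches, here is a minimal, hypothetical before/after; the `Invoice` model is invented, but the renamed APIs (`field_validator`, `model_validate`, `model_dump`) are the kind of cascading change the agent has to chase across every call site:

```python
# Hypothetical model illustrating a Pydantic v1 -> v2 change; the Invoice
# class is made up, the API renames are the real migration work.

# --- v1 style (what the repo starts with) ---
# from pydantic import BaseModel, validator
#
# class Invoice(BaseModel):
#     amount_cents: int
#
#     @validator("amount_cents")
#     def non_negative(cls, v):
#         if v < 0:
#             raise ValueError("amount_cents must be >= 0")
#         return v
#
# payload = Invoice.parse_obj(raw).dict()

# --- v2 equivalent (what every call site has to move to) ---
from pydantic import BaseModel, field_validator

class Invoice(BaseModel):
    amount_cents: int

    @field_validator("amount_cents")
    @classmethod
    def non_negative(cls, v: int) -> int:
        if v < 0:
            raise ValueError("amount_cents must be >= 0")
        return v

raw = {"amount_cents": 1200}
payload = Invoice.model_validate(raw).model_dump()  # parse_obj()/.dict() are deprecated in v2
```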

Long-tail bug hunts in unfamiliar code. "The webhook handler started 502-ing for tenant X last Tuesday. Find out why." Composer 2 reads the logs, traces the request through the handler, checks recent commits, runs the failing case in isolation, and either fixes it or comes back with a write-up. The agentic loop is doing the work that used to be 90 minutes of tab-switching between Datadog and the IDE.

Test backfill on a module with no coverage. Pointing it at services/billing/ and saying "get unit-test coverage to 80%" is a task it can grind through. The output isn't always great. You get tests that assert on implementation details and tests that mock too aggressively. The volume and the scaffolding are still real, and spending an hour curating its output beats spending six writing it from scratch.
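
To make "asserts on implementation details and mocks too aggressively" concrete, here is a hypothetical pair of tests for an invented billing helper; the first is the shape you often get back, the second is the one worth keeping:

```python
# Invented billing helpers; the point is the contrast between the two tests.
from unittest.mock import patch

def tax_rate(country: str) -> float:
    return {"US": 0.07, "DE": 0.19}.get(country, 0.0)

def total_with_tax(amount_cents: int, country: str) -> int:
    return round(amount_cents * (1 + tax_rate(country)))

# Over-mocked: pins the collaborator call and mocks the math away, so it still
# passes if the tax calculation is wrong.
def test_total_with_tax_calls_tax_rate():
    with patch(__name__ + ".tax_rate", return_value=0.0) as mocked:
        total_with_tax(1000, "US")
        mocked.assert_called_once_with("US")

# Behavioral: checks observable output for real inputs, including the default.
def test_total_with_tax_behavior():
    assert total_with_tax(1000, "US") == 1070
    assert total_with_tax(1000, "DE") == 1190
    assert total_with_tax(1000, "BR") == 1000
```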

Greenfield scaffolding inside an existing repo. "Add a feature-flag service that uses Postgres for storage, exposes a gRPC API, and integrates with our existing auth middleware." It can plan that across files, follow your repo conventions if you point it at examples, and produce a PR that passes CI and matches the surrounding code style. The work after is review and tightening, not authorship.

Build-and-deploy plumbing. CI pipeline edits, Dockerfile changes, Terraform module composition. Tasks where the answer requires running the thing and reading errors. Composer 2's Terminal-Bench score shows up in real life here.

In each of these, the value isn't "the model wrote the code." It's "the model held the loop." On the bug-hunt and the refactor especially, the senior engineer's time goes to reviewing five small commits and a write-up rather than babysitting a 90-minute task.

Where it wastes your money

The same model that pays for itself on a multi-file refactor will burn $40 of tokens on a problem that wanted a one-line answer. Workloads where Composer 2 is the wrong tool:

Tab-completion-shaped work. Writing the next line, completing a function signature, suggesting a parameter. Composer 2 will route to Cursor's tab model anyway, but the agentic surface tempts you to ask the wrong tool the wrong question. A chat-style "what does this regex do" prompt does not need an agent.

Single-file logic puzzles. You have a 200-line function and a clear bug. You don't need a planner. You need a fast model with the relevant code in context. Asking Composer 2 to plan-and-execute on a one-file fix wastes its strengths and your tokens.

Anything where the spec is wrong. The agent can grind for 20 minutes building exactly the wrong thing. If you're not sure what you want, talk to a chat model first. Don't hand the agentic loop a fuzzy prompt and pay for the exploration.

Codebases with weak signals. No tests, no types, no clear conventions. The agent has nothing to verify against. It writes code that looks plausible and breaks in production. Composer 2 amplifies whatever signals it finds, good or bad.

Greenfield projects with no scaffolding yet. Counterintuitive but real. The agent shines when it has examples to mimic. On day one of a new repo, you spend more time correcting its choices than just writing the first 500 lines yourself.

A 12-task list worth running it on

Not benchmark numbers. The shape of work where Composer 2 beats a chat model. Run this list on your own repo and you'll know within a week whether the Ultra seat is paying off.

  1. Module extraction. Pick a tightly coupled directory. Ask the agent to split it into a separate package and update all imports.
  2. Test coverage backfill. Pick a module under 70% coverage. Target 85%. Read what it produces.
  3. Dependency upgrade. A non-trivial one. Pydantic v1→v2, Express 4→5, Rails major version. Let it chase the breaking changes.
  4. Feature flag rollout. Add a feature flag, gate three call sites, write the cleanup script that removes it once the flag is fully on.
  5. Cross-file rename. Rename a function used in 50+ places, including one usage inside a string template that grep would miss.
  6. API client migration. Switch one HTTP library for another across a service. Update tests and mocks.
  7. Schema migration. Add a column, generate the migration, update the model, the serializer, the OpenAPI doc, and the seed data.
  8. CI pipeline rewrite. Convert a CircleCI config to GitHub Actions, or vice versa. Make the build pass.
  9. Logging cleanup. Find every print and console.log, replace with structured logging, preserve the message content (a short before/after sketch follows this list).
  10. Type backfill. Add type annotations to a Python or TypeScript module that has none. Run mypy/tsc until clean.
  11. Bug-hunt with a stack trace. Give it a real production stack trace and the relevant log window. Let it propose a fix and a regression test.
  12. Doc generation from code. Generate or update reference docs from a service's public interface. Catch the drift between code and prose.
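
The before/after for task 9 is the easiest to picture. A minimal sketch, with an invented handler and field names:

```python
# Hypothetical webhook handler showing the task-9 rewrite: same message
# content, but as queryable structured fields instead of a bare print.
import logging

logger = logging.getLogger("billing.webhooks")

def handle_webhook(tenant_id: str, event: str, status: int) -> None:
    # Before:
    # print(f"webhook failed for tenant {tenant_id}: {event} -> {status}")

    # After:
    logger.error(
        "webhook failed",
        extra={"tenant_id": tenant_id, "event": event, "status": status},
    )
```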

Score each one yourself: did it save you time, did it match what a senior reviewer would do, and would you ship its output. Three yeses across most of the list and the seat is paying off. Fewer than two yeses on most and you're holding it wrong, or the workload doesn't fit.

The token-economics gotcha

The Ultra tier at $200 sits on top of Cursor's compute pool. The pitch is that Ultra removes practical rate limits for the workloads above. The fine print: heavy users running long background agents can still hit overages (illustrative; check Cursor's own pricing FAQ and community forum for current limits before you commit a team budget).

Composer 2 is meaningfully cheaper per token than Composer 1.5, with API pricing in the range of $0.50/M input and $2.50/M output (figures aggregated by DataCamp's analysis; confirm against Cursor's pricing page before quoting them). That price is not what you pay on the Cursor IDE plans; it's what you'd pay if you wired the model into a custom integration. The seat-pricing model abstracts this, which is part of why it took finance teams a few weeks to figure out whether $200 was the right number.
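
If you did wire the model into a custom integration at those rates, the back-of-envelope per-session cost looks like this; the token counts are assumptions for illustration, not measurements:

```python
# Rough cost of one long agentic session at the quoted API rates
# ($0.50 per million input tokens, $2.50 per million output tokens).
INPUT_RATE_PER_M = 0.50
OUTPUT_RATE_PER_M = 2.50

def session_cost(input_tokens: int, output_tokens: int) -> float:
    return (input_tokens / 1e6) * INPUT_RATE_PER_M + (output_tokens / 1e6) * OUTPUT_RATE_PER_M

# A multi-hour refactor that re-reads big chunks of the repo on every loop
# might consume several million input tokens and a few hundred thousand output.
print(f"${session_cost(8_000_000, 400_000):.2f}")  # $5.00 under these assumptions
```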

If your team includes more than two developers running multi-hour background agents on production codebases, Ultra is probably the right tier. For everyone else, Pro+ at $60 is enough Composer 2 to do the work above without rate-limit pain.

A decision rule

A simple test for an engineer asking whether the Ultra seat is worth it:

If at least 30% of your week is spent on tasks from the 12-task list above, the seat pays for itself. If less than 10%, the seat is a luxury. Between 10% and 30%, run the trial month and count.

The 30% threshold is illustrative; the math under it is straightforward. A senior engineer billable at roughly $150–$250/hour (use your own loaded rate) is paid for hours, not lines. At 30% of a 40-hour week, you're looking at 12 hours where the agent is doing real work. Saving 4–6 of those hours covers the seat several times over.
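
A minimal sketch of that math, using the illustrative numbers above; swap in your own loaded rate and hours saved:

```python
# Seat ROI with the illustrative figures from the paragraph above.
def seat_roi(loaded_rate_per_hour: float = 200.0,  # pick within your $150-$250 range
             hours_saved_per_week: float = 5.0,    # midpoint of "saving 4-6 hours"
             weeks_per_month: float = 4.0,
             seat_cost_per_month: float = 200.0) -> float:
    monthly_savings = hours_saved_per_week * weeks_per_month * loaded_rate_per_hour
    return monthly_savings / seat_cost_per_month

print(seat_roi())  # ~20x: roughly $4,000/month of engineer time against a $200 seat
```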

The trap is paying $200 for a tool you use as a $20 tool. The other trap is staying on the $20 tool and doing $200 work the slow way.

What I tell teams

Three rules for a rollout:

  • Pick the workloads first. Don't hand every developer Ultra and hope for the best. Identify the two or three engineers whose week looks like the 12-task list and give them Ultra.
  • Write the prompts down. The good outputs come from prompts that include codebase conventions, testing approach, and constraint set. Treat them as engineering artifacts.
  • Audit the diffs. Composer 2 ships PRs, and the model is good enough to fool a tired reviewer at 5 PM on a Friday. Your code-review bar matters more now, not less.

The seat is worth it for some teams and a vanity expense for others. The honest answer doesn't fit on a pricing page. It fits on a list of tasks and an audit of which ones make up your week.

If this was useful

The Prompt Engineering Pocket Guide covers the prompt patterns that make agentic models actually finish the task — how to structure constraints, how to give the model the codebase context it needs, how to write a spec it can verify against. The AI Agents Pocket Guide covers the loop itself: how the planning, tool-use, and verification cycle works, and where it breaks. If you're paying for an agentic seat and want to get more out of it, those two together are the playbook.

Prompt Engineering Pocket Guide

AI Agents Pocket Guide
