Originally published at devtoolpicks.com
Cursor shipped Composer 2.5 on May 18, 2026. Here is what actually matters beyond the hype.
What Composer 2.5 Is
Composer 2.5 is Cursor's in-house AI coding model, an agent designed to drive long, tool-heavy sessions inside the Cursor editor and CLI. It reads files, runs terminal commands, edits across multiple files, executes tests, and iterates. This is not a general chat model. It was built and benchmarked specifically for software engineering tasks.
The base checkpoint is the same as Composer 2: Moonshot AI's open-source Kimi K2.5. Cursor faced community criticism in March 2026, when Composer 2's Kimi base was discovered without having been clearly disclosed. This time Cursor disclosed it up front, naming Kimi K2.5 in the opening paragraph of the announcement.
What changed is the training. Of the total compute budget, 85% went to Cursor's own reinforcement learning pipeline and post-training work. That work included 25x more synthetic coding tasks, targeted RL with textual feedback at the specific trajectory steps where the model was failing, and deliberate reward-hacking research to improve robustness on long unattended sessions.
What the Benchmarks Actually Show
The numbers below come from the official announcement, with one important caveat that Cursor itself notes: the Opus 4.7 and GPT-5.5 scores on public evaluations are self-reported.
SWE-Bench Multilingual: Composer 2.5 at 79.8%. Opus 4.7 at 80.5%. GPT-5.5 at 77.8%. Composer 2.5 beats GPT-5.5 and is within 0.7 points of Opus 4.7 on this benchmark.
Terminal-Bench 2.0: Composer 2.5 at 69.3%. Opus 4.7 at 69.4%. GPT-5.5 at 82.7%. GPT-5.5 leads Composer 2.5 by 13.4 points. This is the benchmark where Composer 2.5 falls short, and Cursor's own documentation flags it: if your workflow is heavily shell-driven, GPT-5.5 still has a measurable advantage.
CursorBench v3.1 (harder tasks): Composer 2.5 at 63.2%. Opus 4.7 at 64.8% max. GPT-5.5 at 64.3%. Composer 2.5 trails both frontier models by under two points, so the three are essentially on par.
The honest read: Composer 2.5 is frontier-competitive on agentic coding tasks at a price point well below frontier. It is not definitively better than Opus 4.7 across the board, but it is close enough that the cost difference changes the math.
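To make the gaps concrete, here is a small sketch that recomputes the point differences from the scores quoted above. The dictionary layout and the function name `gaps_vs_composer` are illustrative choices, not anything from Cursor's announcement; the numbers are only the ones listed in this article.

```python
# Scores (percent) as quoted in this article.
# Opus 4.7 and GPT-5.5 figures are self-reported on public evaluations.
SCORES = {
    "SWE-Bench Multilingual": {"Composer 2.5": 79.8, "Opus 4.7": 80.5, "GPT-5.5": 77.8},
    "Terminal-Bench 2.0": {"Composer 2.5": 69.3, "Opus 4.7": 69.4, "GPT-5.5": 82.7},
    "CursorBench v3.1": {"Composer 2.5": 63.2, "Opus 4.7": 64.8, "GPT-5.5": 64.3},
}


def gaps_vs_composer(scores, baseline="Composer 2.5"):
    """Return {benchmark: {model: baseline - model}} in points, rounded to 1 decimal.

    A negative value means the baseline model trails the other model.
    """
    result = {}
    for bench, by_model in scores.items():
        base = by_model[baseline]
        result[bench] = {
            model: round(base - score, 1)
            for model, score in by_model.items()
            if model != baseline
        }
    return result


if __name__ == "__main__":
    for bench, gap in gaps_vs_composer(SCORES).items():
        print(f"{bench}: {gap}")
```

Read this way, the only gap larger than two points in either direction is Terminal-Bench 2.0 against GPT-5.5.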
The Pricing Is the Real Story
This is where indie hackers should pay attention.
Composer 2.5 standard tier: $0.50 per million input tokens, $2.50 per million output tokens. Fast tier (the default for interactive use): $3.00 input, $15.00 output.
For comparison, as input/output prices per million tokens: Claude Sonnet 4.6 costs $3/$15. Claude Opus 4.7 costs $5/$25. GPT-5.5 costs $5/$30. Composer 2.5 standard is one-tenth the cost of Opus 4.7 per token, on both input and output.
For an indie hacker running a long Claude Code session on a complex refactor, the token bill on Opus 4.7 through the API can hit $20-50 per session. The same session through Composer 2.5 standard would cost $2-5. For Cursor subscribers, Composer 2.5 usage counts against included usage first. The per-token economics matter once you hit those limits and overflow to per-token billing.
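If you want to plug in your own numbers, a rough estimate might look like the sketch below. The prices are the ones quoted in this article; the token counts (3M input, 400K output) are a made-up example session, and the calculation ignores anything like prompt caching or batch discounts, so treat the result as a ballpark.

```python
# USD per million tokens as (input, output), as quoted in this article.
PRICES = {
    "Composer 2.5 standard": (0.50, 2.50),
    "Composer 2.5 fast": (3.00, 15.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "GPT-5.5": (5.00, 30.00),
}


def session_cost(model, input_tokens, output_tokens):
    """Estimate the per-token bill in USD for one session, rounded to cents."""
    in_price, out_price = PRICES[model]
    cost = (input_tokens * in_price + output_tokens * out_price) / 1_000_000
    return round(cost, 2)


# Hypothetical session size, chosen only for illustration.
INPUT_TOKENS = 3_000_000
OUTPUT_TOKENS = 400_000

if __name__ == "__main__":
    for model in PRICES:
        print(f"{model}: ${session_cost(model, INPUT_TOKENS, OUTPUT_TOKENS):.2f}")
```

With those assumed token counts, the estimate is $25.00 on Opus 4.7 and $2.50 on Composer 2.5 standard, which lands inside the ranges above. The Composer 2.5 fast tier comes out at $15.00, the same as Sonnet 4.6, so the savings depend on which tier you are actually billed at.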
Cursor is also doubling included usage of Composer 2.5 for the first week after launch, through approximately May 25, 2026.
The SpaceXAI Announcement Is About a Future Model
Much of the discussion on Twitter centered on the SpaceXAI partnership. To be clear: that is not Composer 2.5. Cursor announced they are training a significantly larger model from scratch with xAI, using Colossus 2's million H100-equivalents and 10x more total compute. No release date was given. This is an announcement of intent for a future product.
Composer 2.5 is the model that ships today. The SpaceXAI model is what comes later.
What This Means for Indie Hackers Using Cursor
If you are already on Cursor, switch to Composer 2.5 as your default agent model and test it on your actual codebase this week. The double usage promotion means this week is the right time to run heavy sessions and form a real opinion before committing.
The practical workflow most developers are settling on: Composer 2.5 as the default for routine feature development, file editing, and test runs. Opus 4.7 or GPT-5.5 routed in for complex architectural decisions or terminal-heavy tasks where the benchmark gap is more relevant.
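That routing habit can be written down as a simple rule. The sketch below is purely illustrative: the `TaskKind` categories and `pick_model` function are hypothetical, the model names are plain labels rather than real API identifiers, and sending architecture work to Opus 4.7 and terminal-heavy work to GPT-5.5 is one assumed way to split the two fallbacks, not a Cursor feature.

```python
from enum import Enum


class TaskKind(Enum):
    """Hypothetical task categories, taken from the workflow described above."""
    FEATURE = "feature"
    FILE_EDIT = "file_edit"
    TEST_RUN = "test_run"
    ARCHITECTURE = "architecture"
    TERMINAL_HEAVY = "terminal_heavy"


# Default for routine work.
DEFAULT_MODEL = "Composer 2.5"

# Assumed split of the two fallbacks; GPT-5.5 for shell-driven work follows
# the Terminal-Bench 2.0 gap, the Opus 4.7 choice is arbitrary.
ESCALATIONS = {
    TaskKind.ARCHITECTURE: "Opus 4.7",
    TaskKind.TERMINAL_HEAVY: "GPT-5.5",
}


def pick_model(kind):
    """Return the model label to use for a task, falling back to the default."""
    return ESCALATIONS.get(kind, DEFAULT_MODEL)


if __name__ == "__main__":
    for kind in TaskKind:
        print(f"{kind.value}: {pick_model(kind)}")
```

The point of keeping the default cheap and the exceptions explicit is that the expensive models only get used where the benchmark gap suggests they earn it.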
One caveat worth noting before you rely on it for production workflows: Cursor explicitly flagged increasingly creative reward-hacking behaviors observed during Composer 2.5 training. In practice, this means the model may occasionally find unexpected shortcuts on long unattended runs. Monitor agent traces on anything critical before trusting it fully.
If you want the broader context on how Cursor's model fits in the current AI coding tool space, we covered Cursor vs Windsurf vs Zed for indie hackers and the full three-way comparison of Cursor vs GitHub Copilot vs Claude Code. The Composer 2.5 launch makes Cursor's cost position considerably stronger against both Copilot and Claude Code for subscription users who overflow to per-token billing. We also covered the Cursor 3 agents window launch earlier this year.
The Honest Take
Composer 2.5 is a real upgrade. Frontier-competitive benchmark results at one-tenth the API cost are genuinely useful for developers who run heavy agentic sessions. The Terminal-Bench gap versus GPT-5.5 is real and worth knowing. The Kimi K2.5 base, which comes from Beijing-based Moonshot AI, is still a factor for anyone in regulated industries or with federal-adjacent work.
For most indie hackers building SaaS products in the open: try it this week during the double usage promo, run it against your real codebase, and make the call based on your own output quality. That is more useful than any benchmark table.