Cursor Composer 2.5: Build in Parallel, 10x Cheaper, and What Changes for Your Workflow
Cursor's third-generation agentic coding model launched May 18, 2026. Composer 2.5 comes within 0.7 points of Claude Opus 4.7 on SWE-bench Multilingual at one-tenth the cost, introduces a 100-sub-agent parallel architecture, and uses targeted RL training aimed at making it more reliable on long-running tasks.
Cursor shipped Composer 2.5 on May 18, 2026, five days ago as of this writing. If you blinked, you might have missed it in the noise of Anthropic's SpaceX compute deal and Google's Antigravity launch the same week. That would be a mistake. Composer 2.5 comes within 0.7 points of Claude Opus 4.7 on SWE-bench Multilingual at one-tenth the token cost, and it introduces a parallel task architecture that changes how long-running agentic work actually behaves.
This post covers what's new, how the model was built, what the numbers mean in practice, and how to adjust your workflow to take advantage of it.
What changed in Composer 2.5
Cursor describes it in one line: "better at sustained work on long-running tasks, follows complex instructions more reliably, and more pleasant to collaborate with." That's marketing. The real changes are three technical bets that shipped together.
First: a training method called targeted RL with textual feedback, where corrections are inserted at the exact decision point that went wrong rather than applied as a blanket signal across the full conversation. Second: 25x more synthetic training tasks than Composer 2, generated with a "feature deletion" technique — strip real code from repos with test suites, then train the model to reimplement it using the tests as ground truth. Third: a parallel execution architecture (more on that below) that lets the model decompose tasks into concurrent subtasks rather than grinding through them serially.
Build in Parallel: the architecture
The headline feature is what Cursor calls Build in Parallel. The underlying model uses an Agent Swarm design: when the model encounters a task it can decompose, it spawns up to 100 specialized sub-agents and delegates portions of the work concurrently. The parent coordinates; the children execute. This isn't a UI feature — it's how the model reasons about multi-step work.
The infrastructure underneath it runs Sharded Muon with a dual-grid HSDP (Hybrid Sharded Data Parallel) layout that overlaps parallel training dimensions. The practical effect: optimizer step times on Cursor's 1-trillion-parameter model came down to 0.2 seconds through asynchronous communication overlapping. The training infrastructure and the runtime architecture share the same parallel decomposition principle.
Max parallel sub-agents: 100 (Agent Swarm decomposition)
Optimizer step time: 0.2s (1T-param model, async overlap)
Synthetic training tasks: 25x more (vs Composer 2)
In practice, this means tasks that used to require you to manually fan out work — "implement this feature, then separately write tests, then separately update the docs" — can now be issued as a single instruction and the model will handle the decomposition. Whether it does so correctly depends on how well you've specified the task, but the capability is there in a way it wasn't in Composer 2.
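To make the parent/child split concrete, here is a minimal sketch of fanning out subtasks under a concurrency cap. Cursor has not published the Agent Swarm interface, so `Subtask`, `runInParallel`, and `MAX_SUB_AGENTS` are hypothetical names; only the cap of 100 comes from the release. Treat it as an illustration of the pattern, not of how the model is implemented.

```typescript
// Hypothetical types: the real Agent Swarm interface is not public.
type Subtask = { id: string; run: () => Promise<string> };

const MAX_SUB_AGENTS = 100; // cap quoted in this post

// Run subtasks concurrently, never more than `limit` in flight at once.
async function runInParallel(
  subtasks: Subtask[],
  limit: number = MAX_SUB_AGENTS,
): Promise<Map<string, string>> {
  const results = new Map<string, string>();
  let next = 0;

  // Each worker claims the next unclaimed subtask until none remain.
  async function worker(): Promise<void> {
    while (next < subtasks.length) {
      const task = subtasks[next++];
      results.set(task.id, await task.run());
    }
  }

  const workerCount = Math.min(limit, subtasks.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}
```

A worker pool is used here rather than a bare `Promise.all` over every subtask so that the cap holds even when a task decomposes into more pieces than the limit allows.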
How it was trained
The two training innovations are worth unpacking because they explain the behavioral improvements people are reporting.
Targeted RL with textual feedback
Standard RLHF applies a reward signal to the full trajectory. If the model calls a tool that doesn't exist, the penalty is diffused across the entire response. Targeted RL inserts a correction at the exact step: when the model attempts an unavailable tool, the training inserts a local hint — "Reminder: Available tools are…" — immediately after that decision point and continues training from there. The signal is surgical.
This is why Composer 2.5 is noticeably better at staying in bounds on complex codebases. It's not that the model learned to follow rules in general — it learned to catch itself at the specific moment it was about to break them.
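The idea of a correction placed at the decision point can be sketched in a few lines. The trajectory format below is hypothetical (the real training data layout is not public), and two details are assumptions: the hint simply appends the tool list to the quoted "Reminder" prefix, and steps after the bad call are dropped so a rollout can resume from the hint.

```typescript
// Hypothetical trajectory format, for illustration only.
type Step =
  | { kind: "tool_call"; tool: string }
  | { kind: "hint"; text: string };

// Insert a corrective hint immediately after the first call to an
// unavailable tool. Later steps are dropped (an assumption) so that
// generation can continue from the hint.
function insertTargetedHint(steps: Step[], available: string[]): Step[] {
  const badIndex = steps.findIndex(
    (s) => s.kind === "tool_call" && !available.includes(s.tool),
  );
  if (badIndex === -1) return steps; // nothing to correct

  const hint: Step = {
    kind: "hint",
    // Wording after the prefix is an assumption; the source elides it.
    text: `Reminder: Available tools are ${available.join(", ")}`,
  };
  return [...steps.slice(0, badIndex + 1), hint];
}
```

The contrast with a trajectory-level reward is that the correction is attached to one index in the sequence rather than to the sequence as a whole.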
Feature deletion synthetic data
To generate realistic training tasks, Cursor built a pipeline that takes real open-source repos with test suites, removes specific features from the production code, and trains the model to reimplement them using the tests as the only specification. The ground truth is testable — the model either makes the tests pass or it doesn't.
Running this at 25x the scale of Composer 2's training set produces a model that has seen an enormous variety of realistic coding tasks with machine-verifiable success criteria. That's a different quality of training data than human-labeled examples.
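"Machine-verifiable" here means a grader can score a candidate reimplementation without a human in the loop. The sketch below assumes a binary reward and a test format where each check throws on failure; both are assumptions, and `TestCase` and `gradeCandidate` are hypothetical names rather than anything Cursor has described.

```typescript
// Hypothetical grader: each test is a function that throws on failure.
type TestCase = { name: string; check: () => void };

// Binary reward (an assumption): 1 only if every test passes, otherwise 0.
function gradeCandidate(tests: TestCase[]): { reward: 0 | 1; failed: string[] } {
  const failed: string[] = [];
  for (const t of tests) {
    try {
      t.check();
    } catch (_err) {
      failed.push(t.name);
    }
  }
  return { reward: failed.length === 0 ? 1 : 0, failed };
}

// A candidate reimplementation of a deleted config parser.
const candidate = (raw: string): unknown => JSON.parse(raw);

const result = gradeCandidate([
  {
    name: "parses valid JSON",
    check: () => {
      const parsed = candidate('{"key":"val"}') as { key: string };
      if (parsed.key !== "val") throw new Error("wrong value");
    },
  },
  {
    name: "throws on invalid input",
    check: () => {
      let threw = false;
      try {
        candidate("invalid");
      } catch (_err) {
        threw = true;
      }
      if (!threw) throw new Error("expected a throw");
    },
  },
]);
```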
// BEFORE (in training corpus): working implementation
function parseConfig(raw: string): Config {
  return JSON.parse(raw) as Config;
}
// AFTER deletion (what the model sees):
// (function body removed; tests still present)
function parseConfig(raw: string): Config {
  // TODO: implement
}
// Model must reimplement to pass:
// ✓ test: parseConfig('{"key":"val"}') returns {key:"val"}
// ✓ test: parseConfig('invalid') throws SyntaxError
Benchmarks and pricing
SWE-bench Multilingual is the most widely cited benchmark for agentic coding as of 2026 because it tests real GitHub issue resolution across multiple programming languages, not just Python. Composer 2.5 scores 79.8%. Claude Opus 4.7 scores 80.5%. The gap is 0.7 percentage points.
SWE-bench Multilingual: 79.8% for Composer 2.5 vs 80.5% for Claude Opus 4.7 (0.7 pp gap)
The pricing delta is larger. Opus 4.7 costs roughly $15/M input and $75/M output (Anthropic API pricing). Composer 2.5 is $0.50/M input and $2.50/M output via the Cursor API, which at those rates makes the standard variant 30x cheaper on both input and output. The fast variant ($3.00/M input, $15.00/M output) is still 5x cheaper than Opus 4.7 at comparable latency.
Standard variant: $0.50 per million input tokens, $2.50 per million output tokens
Fast variant: $3.00 / $15.00 per million tokens (in / out); lower latency, still cheaper than Opus
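To see what those per-token rates mean for a real bill, a small calculator helps. The prices are the ones quoted in this post; the model keys and the `costUsd` helper are hypothetical labels for illustration, not API identifiers.

```typescript
// Prices in USD per million tokens, as quoted in this post.
type Price = { inputPerM: number; outputPerM: number };

// Keys are illustrative labels, not official model identifiers.
const PRICES: Record<string, Price> = {
  "composer-2.5-standard": { inputPerM: 0.5, outputPerM: 2.5 },
  "composer-2.5-fast": { inputPerM: 3.0, outputPerM: 15.0 },
  "opus-4.7": { inputPerM: 15.0, outputPerM: 75.0 },
};

// Total cost in USD for a given number of input and output tokens.
function costUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES[model];
  if (!p) throw new Error(`unknown model: ${model}`);
  return (inputTokens / 1e6) * p.inputPerM + (outputTokens / 1e6) * p.outputPerM;
}
```

For a hypothetical workload of 2M input tokens and 1M output tokens, this comes to $3.50 on the standard variant, $21.00 on the fast variant, and $105.00 on Opus 4.7 at the rates above.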
What changes for your workflow
Three workflow changes are worth making today.
1. Give it compound tasks
The old pattern with Composer 2 was to break work into single-step prompts because the model struggled to track compound state. With 2.5, issuing a prompt like "implement the auth refresh logic, write unit tests, and update the README" is more likely to produce a coherent result because the Agent Swarm handles the decomposition. You'll still want to review each piece — parallel doesn't mean correct — but the scaffolding work is gone.
2. Use acceptance criteria in every prompt
Targeted RL trained the model to respond well to in-prompt constraints. Being explicit about what done looks like, for example "done means all existing tests still pass and the new feature has ≥2 test cases", gives the model the same kind of signal its training was built around. Vague prompts produce vaguer results in 2.5 than they did in Composer 2; precise prompts produce more precise ones.
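One way to make the habit stick is to build prompts from a structured spec that refuses to render without a definition of done. This is a sketch under assumptions: `TaskSpec` and `buildPrompt` are hypothetical helpers for your own tooling, not part of Cursor.

```typescript
// Hypothetical task spec for your own tooling; not a Cursor API.
interface TaskSpec {
  task: string;
  acceptanceCriteria: string[];
  constraints?: string[];
}

// Render a prompt, failing loudly if no acceptance criteria are given.
function buildPrompt(spec: TaskSpec): string {
  if (spec.acceptanceCriteria.length === 0) {
    throw new Error("refusing to build a prompt without acceptance criteria");
  }
  const lines = [
    `Task: ${spec.task}`,
    "Acceptance criteria:",
    ...spec.acceptanceCriteria.map((c) => `- ${c}`),
    ...(spec.constraints ?? []),
  ];
  return lines.join("\n");
}
```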
Task: Add rate limiting to the /api/auth/token endpoint.
Acceptance criteria:
- Existing test suite passes (run: pnpm test)
- New tests cover: (a) requests within limit succeed,
  (b) requests over limit return 429 with Retry-After header
- No changes to other endpoints
- Implementation uses existing Redis client (src/lib/redis.ts)
Do not modify the database schema.
3. Stack SKILL.md files for sub-agent specialization
When Composer 2.5 decomposes a task into parallel sub-agents, each sub-agent benefits from having a narrow, well-defined scope. Loading a code-review skill from the skills-hub registry before asking for a review pass, or a unit-test skill before a test generation pass, narrows the sub-agent's behavior in exactly the way the model's training expects.
# install skills that compose well with Composer 2.5 parallel tasks
npx @skills-hub-ai/cli install unit-test
npx @skills-hub-ai/cli install code-review
npx @skills-hub-ai/cli install tech-debt
# or install the full ship-it composition (all three + orchestrator)
npx @skills-hub-ai/cli install ship-it
What's next
Cursor confirmed a collaboration with SpaceX AI on the next model, using 10x more total compute on Colossus 2's infrastructure. The timing is notable: Anthropic signed its own Colossus 1 deal the same week Composer 2.5 shipped, doubling Claude Code's rate limits in the process. Both major agentic coding platforms are now betting on compute scale as the path to the next capability jump.
For users, the near-term implication is that Composer 2.5 is effectively a base to optimize against, not a ceiling. The parallel architecture and targeted RL training are production-proven now. Whatever ships next will run the same patterns at larger scale.
If you're already on Cursor Pro, Composer 2.5 is available now with a double-usage launch promotion through the end of May. The fast variant is worth testing on latency-sensitive workflows — it's priced below what most teams were paying for frontier models six months ago.
For a broader view of how Composer 2.5 fits into the current AI coding landscape, see our Windsurf 2.0 deep dive and the three-way Cursor vs Windsurf vs Claude Code comparison.
Written by
Skills-Hub Team
AI coding tool ecosystem coverage