By May 2026 there were seven coding tools serious enough to argue about: Claude Code, OpenAI Codex, Cursor, GitHub Copilot, Google Antigravity, Kiro, and Windsurf. What is new is that almost all of them now ship a background mode. Not "AI in your editor" but "AI that runs in the cloud, on its own machine, and opens a pull request while you go do literally anything else."
I have been running real work through the main contenders for a few months. Not toy tasks, actual production work in actual repos. This is the comparison I wish I had before I started, because the marketing pages make them all sound identical and they are not.
If you want the conceptual version of when to use this kind of tool at all, I wrote that up separately in the piece on background AI coding agents and when to delegate. This article is the other half: assuming you have decided to delegate, which tool do you actually pick.
What "Background Agent" Means Across These Tools
First, the shared model, because it is genuinely the same shape everywhere.
You describe a task. The tool provisions a fresh, isolated virtual machine. That VM clones your repository at the current HEAD of a branch, runs your setup commands so it has a working environment, then the agent executes the task, runs your checks, and opens a draft pull request with a summary. You review the PR whenever you get to it. You were not watching while it worked.
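To make that shared shape concrete, here is a minimal sketch of the lifecycle as a plain Python function. Nothing in it is any vendor's API: `AgentRun`, `run_background_task`, and `run_checks` are hypothetical names, and the choice to open a draft PR even when checks fail is my assumption, not something every tool guarantees.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentRun:
    """Hypothetical record of one background run (not any vendor's API)."""
    task: str
    branch: str
    steps: List[str] = field(default_factory=list)
    pr_state: str = "none"


def run_background_task(task: str, branch: str,
                        run_checks: Callable[[], bool]) -> AgentRun:
    """Walk the shared lifecycle in order and record each step."""
    run = AgentRun(task=task, branch=branch)
    run.steps.append("provision fresh isolated VM")
    run.steps.append(f"clone repository at HEAD of {branch}")
    run.steps.append("run setup commands")
    run.steps.append(f"agent executes task: {task}")
    checks_ok = run_checks()
    run.steps.append(f"run checks: {'passed' if checks_ok else 'failed'}")
    # Assumption: the PR is opened as a draft either way, so a human decides.
    run.pr_state = "draft"
    run.steps.append("open draft pull request with summary")
    return run


run = run_background_task("add retry to HTTP client", "main", lambda: True)
for step in run.steps:
    print(step)
```

The point of the sketch is the ordering: environment first, task second, checks third, and a draft PR as the only output that reaches you.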
The differences are in the details that matter: where you trigger it from, how good the isolated environment is, how it handles setup, how it reports back, how much it costs, and how strong the underlying model is. Those details are the whole comparison.
One thing worth saying up front. None of these tools should be merging their own code. Every one of them opens a draft PR specifically so your existing quality gates stay in charge. The background agent is a contributor, not a committer. Keep branch protection and required review on, no matter which tool you pick.
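The "contributor, not a committer" rule can be written down as a tiny merge gate. This is only a model of the policy, assuming a hypothetical `PullRequest` record with the four fields shown. Real enforcement belongs in your Git host's branch protection settings, not in a script like this.

```python
from dataclasses import dataclass


@dataclass
class PullRequest:
    """Hypothetical, simplified view of a PR for illustrating the policy."""
    author_is_agent: bool
    is_draft: bool
    human_approvals: int
    checks_passed: bool


def may_merge(pr: PullRequest, required_approvals: int = 1) -> bool:
    """Return True only if the usual quality gates are all satisfied."""
    if pr.is_draft:
        return False  # agent PRs arrive as drafts; a human must mark them ready
    if not pr.checks_passed:
        return False  # required checks stay in charge
    # Approvals are counted from humans only, so an agent cannot approve itself.
    return pr.human_approvals >= required_approvals


fresh_agent_pr = PullRequest(author_is_agent=True, is_draft=True,
                             human_approvals=0, checks_passed=True)
reviewed_pr = PullRequest(author_is_agent=True, is_draft=False,
                          human_approvals=1, checks_passed=True)
print(may_merge(fresh_agent_pr), may_merge(reviewed_pr))
```

The gate never looks at `author_is_agent`. Agent PRs get the same rules as everyone else, and that is the whole idea.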
OpenAI Codex Cloud Tasks
Codex leans hardest into the cloud-native model. The whole product is built around the idea that you describe a task and it runs somewhere else, returns a reviewable diff, and you decide what to do with it.
Where it shines. Independent, well-bounded work that returns a clean diff. Codex is genuinely good at the "here is a described change, go make it in a clean environment and show me the result" loop. The isolation is solid, the environment setup is straightforward, and the diffs come back in a form that is easy to review. For draining a backlog of tightly scoped tasks, it is one of the strongest options.
Where it struggles. Anything that needs your local state. Because it is so cloud-first, the things that depend on your actual machine (local services, browser state, uncommitted changes) are exactly the things it cannot help with. That is not a flaw; it is the design. But you feel it if you try to use it for the wrong tasks.
Pricing shape. Codex runs on a usage-based model rather than a flat seat. That cuts both ways. If you delegate sporadically, you pay for what you use and it is cheap. If you run a fleet of agents around the clock, usage-based billing can climb fast, so watch the meter.
The mental model that works for Codex: it is a queue you throw repo-scoped tasks into. Treat it like a build server that writes code.
Cursor Cloud Agents
Cursor came from the IDE side, and its background story reflects that. You can start work interactively in the editor and then push it to a cloud agent that runs in an isolated VM, or kick off cloud agents directly.
Where it shines. The handoff between foreground and background is the smoothest of the bunch. You can be working interactively, realize a chunk of it is delegatable, and send it to the cloud without leaving your flow. Each cloud task gets a fresh VM with its own filesystem, terminal, network, and package environment, clones your repo at HEAD, runs your setup, and works from there. For developers who already live in Cursor, the continuity is the selling point.
Where it struggles. It is the most expensive of the IDE-rooted options at the team tier, and the cloud agent experience, while good, is layered onto a product that was fundamentally designed around the editor. If you do not otherwise use Cursor, adopting it just for background agents is a lot of surface area for the feature.
Pricing shape. Business tier runs in the higher range for team seats, reportedly around the $4,800 per year mark for ten developers, in the same neighborhood as Windsurf. Pricing across all these tools moves constantly, so treat any number as a snapshot, but Cursor sits at the pricier end.
If you already work in Cursor, the cloud agents are an easy yes. If you do not, the value proposition is narrower.
GitHub Copilot Coding Agent
Copilot's background story is the most tightly woven into where the work already lives: GitHub itself. You assign an issue to the coding agent, or mention it, and it works in the background, opens a draft PR, and requests your review. There is also a delegate flow from the CLI and from inside editors.
Where it shines. If your team runs on GitHub issues and PRs, the integration is hard to beat. The agent reads the issue description and comments as context, opens a draft PR against your branch protection rules, and the whole thing lives inside the workflow you already use. Turning acceptance criteria in an issue directly into a draft PR, without anyone bypassing branch protection, is a genuinely clean loop. Copilot has also split its agents into distinct flavors (local, background, cloud, and sub-agents), which is confusing at first but useful once you know which one you are invoking.
Where it struggles. The model ceiling has historically trailed the frontier, and for hard, ambiguous reasoning work it can show. It is excellent at well-specified issue-to-PR work and less impressive when the task needs deep judgment.
Pricing shape. Copilot is among the most affordable team options, reportedly around $2,280 per year for ten developers on the Business tier, which makes it an easy default for teams already paying for GitHub.
The fit here is organizational. If your team's source of truth is GitHub issues, Copilot's coding agent slots in with almost no new process.
Claude Code Async
Claude Code is a terminal-first agent, and its async story extends that. You can run it in the background and have it work through tasks while you do other things, and it keeps the deepest reasoning ceiling of the group thanks to the underlying Opus models.
Where it shines. Hard tasks. When the work needs real reasoning, multi-step planning, or wrangling a gnarly codebase, the model quality shows. The latest Claude Opus release pushed the reliability of long agentic sessions up noticeably, which is exactly what you want when an agent is running unattended for twenty minutes. It is also the most loved tool among developers in the 2026 surveys by a wide margin, which is not nothing.
Where it struggles. Cost, mainly. The team tier is dramatically more expensive than the others, reportedly an order of magnitude above Copilot for ten seats. For an individual or small team the per-use cost can be very reasonable, but at scale it is the priciest reasoning you can buy. The terminal-first nature also means the background experience is less GUI-polished than Cursor's or Copilot's GitHub-native flow.
Pricing shape. Individual usage is reasonable. The team tier is the most expensive in this group by a large margin, so it is a deliberate spend you make because you want the reasoning ceiling, not a default.
The Claude Code ecosystem also goes deep beyond the core agent. If you are standardizing how your team works with it, the plugin marketplace and skills turn your patterns into shareable installable pieces, which matters more once multiple people are delegating against the same conventions.
How I Actually Choose Between Them
Forget the feature matrices. Here is the decision I actually make, by situation.
Your team lives on GitHub issues and PRs. Copilot coding agent. The integration tax is near zero and the price is the lowest. You are not adopting a new tool, you are turning on a feature in one you already pay for.
You already work in Cursor all day. Cursor cloud agents. The foreground-to-background handoff is the smoothest you will get, and you are not adding new surface area. If you do not already use Cursor, this is a weaker reason to start.
You want a pure delegation queue for backlog work. Codex cloud tasks. The usage-based pricing fits sporadic delegation, and the cloud-first design is built precisely for "describe it, run it elsewhere, review the diff." Just watch the meter if you scale up.
The work is genuinely hard and you will pay for quality. Claude Code async. When the task needs the strongest reasoning available and getting it right the first time is worth real money, this is the ceiling. For individuals the cost is fine. For large teams it is a deliberate, expensive choice.
Most serious users do not pick one. They use the cheap, integrated option for the bulk of well-scoped work and reach for the expensive, high-reasoning option for the genuinely hard tasks. That is the same split I described in the single-agent versus multi-agent tradeoff: match the tool to the difficulty of the task, do not pay frontier prices for boilerplate.
The Pricing Reality Nobody Likes
A quick, honest aside on cost, because the numbers move every quarter and the framing matters more than any specific figure.
Two pricing models dominate: flat per-seat (Copilot, Cursor, Windsurf, Kiro, Antigravity) and usage-based (Codex, and Claude Code at the API level). For ten developers per year, the reported flat-tier spread in 2026 ran roughly from the low thousands for Copilot up to about double that for Cursor and Windsurf, with Claude Code's team tier sitting far above the rest. Usage-based options are cheap when you delegate occasionally and can become the most expensive when you run agents constantly.
The trap is optimizing for the seat price and ignoring the usage pattern. A flat seat you barely use is wasted money. A usage-based tool you hammer around the clock can blow past a flat tier you would have been better off buying. Match the billing model to how you actually work, not to which sticker looks lowest.
And treat every number you see, including the ones in this article, as a snapshot. This category reprices constantly. Verify current pricing before you commit a team to anything.
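The seat-versus-usage comparison comes down to simple arithmetic, sketched here. The flat figure is the reported $2,280 per year for ten Copilot Business seats quoted earlier. The $1.50 cost per delegated task is a made-up placeholder, not a real price for any tool, so substitute your own measured number.

```python
def annual_usage_cost(tasks_per_dev_per_week: float, cost_per_task: float,
                      devs: int = 10, weeks: int = 52) -> float:
    """Yearly spend on a usage-based tool for the whole team."""
    return tasks_per_dev_per_week * cost_per_task * devs * weeks


def break_even_tasks(flat_annual: float, cost_per_task: float,
                     devs: int = 10, weeks: int = 52) -> float:
    """Tasks per developer per week at which usage billing equals the flat tier."""
    return flat_annual / (cost_per_task * devs * weeks)


def cheaper_option(flat_annual: float, tasks_per_dev_per_week: float,
                   cost_per_task: float, devs: int = 10) -> str:
    """Name the cheaper billing model for a given usage pattern."""
    usage = annual_usage_cost(tasks_per_dev_per_week, cost_per_task, devs)
    return "usage-based" if usage < flat_annual else "flat seat"


FLAT_ANNUAL = 2280.0   # reported figure from this article, ten seats per year
COST_PER_TASK = 1.50   # hypothetical placeholder, not a vendor price

print(round(break_even_tasks(FLAT_ANNUAL, COST_PER_TASK), 1))  # about 2.9
print(cheaper_option(FLAT_ANNUAL, 1, COST_PER_TASK))   # light use
print(cheaper_option(FLAT_ANNUAL, 10, COST_PER_TASK))  # heavy use
```

Under those assumed numbers, the crossover sits near three delegated tasks per developer per week. Below that, usage billing wins. Above it, the flat seat does.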
The Mistakes That Make Any of Them Look Bad
The tool matters less than how you use it, and the same mistakes sink every one of them.
Vague tasks. "Improve the codebase" produces garbage on all four. Tight, well-scoped tasks with a clear definition of done produce mergeable PRs. The merge rate gap between good and bad prompts dwarfs the quality gap between tools.
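One way to keep yourself honest about scoping is to check a task before you send it. The sketch below is a rough heuristic, not a standard: `TaskSpec`, its fields, and the eight-word threshold are all things I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TaskSpec:
    """Hypothetical shape for a delegated task."""
    description: str
    files_in_scope: List[str] = field(default_factory=list)
    definition_of_done: List[str] = field(default_factory=list)


def scoping_problems(spec: TaskSpec) -> List[str]:
    """Return the reasons a task is probably too vague to delegate."""
    problems = []
    if len(spec.description.split()) < 8:  # arbitrary illustrative threshold
        problems.append("description is too short to be unambiguous")
    if not spec.files_in_scope:
        problems.append("no files or directories named as in scope")
    if not spec.definition_of_done:
        problems.append("no definition of done")
    return problems


vague = TaskSpec("Improve the codebase")
scoped = TaskSpec(
    description="Add retry with exponential backoff to the billing HTTP "
                "client, capped at three attempts",
    files_in_scope=["billing/http_client.py"],
    definition_of_done=["unit tests cover retry exhaustion",
                        "existing test suite passes"],
)
print(scoping_problems(vague))
print(scoping_problems(scoped))
```

The vague task trips all three checks and the scoped one trips none. The exact rules matter less than having a definition of done written down before the agent starts.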
No instructions file. A background agent starts in a clean VM with none of your context. An AGENTS.md or equivalent at the repo root, covering setup, test commands, and conventions, is the highest-leverage thing you can add. Skip it and every agent looks dumber than it is.
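A cheap guard is to check that the instructions file actually covers those three topics before you delegate anything. This is a minimal sketch: the required headings, the sample file contents, and the commands inside it are illustrative assumptions about a hypothetical Python repo, not a format any tool mandates.

```python
from typing import List

# Topics the instructions file should cover, per the paragraph above.
REQUIRED_TOPICS = ("setup", "test", "conventions")

# Hypothetical file contents for an imaginary Python repository.
EXAMPLE_AGENTS_MD = """\
# AGENTS.md

## Setup
Run: pip install -r requirements.txt

## Test commands
Run: pytest -q

## Conventions
Keep functions small and add type hints to new code.
"""


def missing_topics(agents_md: str) -> List[str]:
    """Return required topics that no heading in the file mentions."""
    headings = [line.lstrip("#").strip().lower()
                for line in agents_md.splitlines()
                if line.startswith("#")]
    return [topic for topic in REQUIRED_TOPICS
            if not any(topic in heading for heading in headings)]


print(missing_topics(EXAMPLE_AGENTS_MD))
print(missing_topics("# Notes\n\n## Setup\nInstall things.\n"))
```

The first call reports nothing missing, and the second flags the test and conventions sections. Running a check like this in CI keeps the file from quietly rotting.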
Rubber-stamping the diff. AI-generated code carries measurably higher bug density, and a convincing PR summary is not a substitute for reading the actual change. I run agent PRs through the same testing process for AI-generated code regardless of which tool produced them. The review discipline is tool-independent.
Ignoring the debt. Ship a lot of agent code fast and you accumulate complexity and duplication that no single PR review catches. This new shape of technical debt is real across all of these tools, because it is a property of fast machine-authored code, not of any one vendor.
Pick the wrong tool and you lose a bit of efficiency. Make these mistakes and the best tool in the world produces a pile of plausible-looking PRs you cannot trust. The leverage is in your process, not the logo.
The Verdict
If I had to hand someone a single default, it is the Copilot coding agent, purely on integration and price, assuming their team already lives on GitHub. It is the lowest-friction way to start delegating real work, and most teams already pay for it.
If you live in Cursor, use Cursor's cloud agents and do not overthink it. If you want a delegation queue and you delegate in bursts, Codex is the cleanest fit. And when the task is genuinely hard and correctness is worth real money, Claude Code's reasoning ceiling is the one I reach for, cost accepted.
But the meta-point is the one I keep landing on. These tools have converged on the same model, and within a release cycle they tend to leapfrog each other on benchmarks anyway. The durable advantage is not which agent you picked. It is whether you got good at scoping tasks, writing instructions, and reviewing output. Do that, and any of these four serves you well. Skip it, and none of them will.
The background agent category is a year old and already feels permanent. The right move is not to bet everything on one vendor. It is to build the delegation muscle that makes all of them useful, and stay light enough to switch when the next one ships something better. Because it will, probably before you finish reading the changelog.