When Cognition introduced Devin in March 2024 as "the first AI software engineer," the launch demo drew both enormous attention and pointed skepticism — several engineers picked apart the recording and argued the agent had quietly failed parts of the task it appeared to finish. Two years later the marketing has cooled and the product has been rebuilt. Devin in 2026 is a working async coding agent you assign tickets to, not a humanoid replacement for your team. We spent time pointing it at real tasks to figure out where that framing holds up.
The short version: Devin is good at well-scoped, repetitive work that you can describe precisely and verify automatically. It is unreliable on ambiguous, architecture-heavy changes — the exact work senior engineers are paid for. Whether it's worth the money depends almost entirely on which of those two buckets your backlog falls into.
What Devin actually is now
Devin runs as a cloud agent with its own sandboxed environment: a shell, a code editor, a browser, and access to your repository. You give it a task in a Slack thread or its web UI, and it works asynchronously — planning, writing code, running commands, hitting the browser to check its work, and opening a pull request when it thinks it's done. You watch the plan and the command log in real time and can interrupt to redirect it.
The redesign Cognition shipped as Devin 2.0 leaned into this async, parallel model. You can fan out several Devin sessions at once, each on a separate ticket, and check back on them the way you'd check on contractors. It integrates with GitHub, Slack, and Jira-style trackers, so the intended loop is: file a ticket, tag Devin, review the PR. There's also an interactive planning mode where you confirm the approach before it burns time executing.
This is a genuinely different shape from an in-editor assistant. Tools like Cursor or Copilot sit in your editor and accelerate the code you are writing. Devin is meant to take a unit of work off your plate entirely and report back. That distinction matters more than any benchmark, because it changes what "working" even means.
Where it earns its keep, and where it stalls
Devin is at its best on tasks that are tedious but mechanically clear. Bumping a dependency across a monorepo and fixing the resulting breakages. Adding test coverage to an under-tested module. Migrating a batch of files from one API to another. Wiring up a CRUD endpoint that mirrors five existing ones. In these cases the task is legible, the success criteria are checkable (the build passes, the tests are green), and the agent can iterate against fast feedback without needing your judgment.
The failure mode is just as consistent. On open-ended work — "redesign how we handle auth," "figure out why this is slow and fix it" — Devin tends to produce confident, plausible code that misses the actual point, or it churns through expensive iterations chasing a problem it doesn't understand. It does not push back the way a human engineer would when a ticket is underspecified; it picks an interpretation and runs. The more context lives in your head rather than in the repo, the worse it does.
Devin's autonomy is also its risk surface. An agent that can run shell commands and open PRs unattended will occasionally do something you didn't intend — delete a file, rewrite a config, or commit a credential it found. Run it against a sandbox or a fork first, require human PR review before anything merges, and never give it write access to production systems. Treat its output as an untrusted contribution, because that is exactly what it is.
The honest mental model: Devin is a fast, tireless junior engineer who never asks clarifying questions and never tells you when it's out of its depth. That's enormously useful for the right tasks and quietly dangerous for the wrong ones. The skill you have to develop is triage — knowing which tickets to hand it and which to keep.
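One way to make that triage habit explicit is a small checklist you run before tagging the agent. The sketch below is illustrative only: the `TicketTraits` fields and the `suggest_owner` function are hypothetical names, not part of Devin or any Cognition API. The three questions are taken from the pattern described above: is the task precisely described, can success be checked automatically, and does the needed context live in the repo rather than in someone's head.

```python
from dataclasses import dataclass


@dataclass
class TicketTraits:
    """Answers to three triage questions. All names here are hypothetical."""
    precisely_described: bool       # could a stranger tell what "done" means?
    automatically_verifiable: bool  # does a build or test run confirm success?
    context_in_repo: bool           # is the needed context in the code, not in your head?


def suggest_owner(traits: TicketTraits) -> str:
    """Return "agent" only when every triage question is answered yes."""
    checks = (
        traits.precisely_described,
        traits.automatically_verifiable,
        traits.context_in_repo,
    )
    return "agent" if all(checks) else "human"


# A dependency bump with a green-build criterion passes all three checks.
print(suggest_owner(TicketTraits(True, True, True)))
# "Redesign how we handle auth" fails on description and context.
print(suggest_owner(TicketTraits(False, True, False)))
```

The all-or-nothing rule is deliberately conservative. A ticket that fails even one check is the kind where the agent picks an interpretation and runs, so it stays with a person until it is rewritten.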
Pricing, and whether the math works
Devin's original go-to-market was a $500/month team plan, which put it out of reach for individual developers and most small teams. The 2.0 relaunch replaced that with a lower entry point — a Core plan starting around $20 — built on consumption-based billing measured in ACUs (Agent Compute Units). You pay for the compute the agent burns, and complex or long-running tasks consume far more than simple ones.
That usage-based model is the part to scrutinize before committing. A clean, well-scoped task that Devin nails on the first pass is cheap. A task where it spirals — re-running tests, re-reading files, retrying a broken approach — can quietly rack up ACUs while producing nothing mergeable. Your effective cost per shipped PR depends heavily on how good you are at scoping tasks it can actually complete, which you won't know until you've spent some money learning.
Before you scale up, run a two-week trial on a fixed set of real tickets and track two numbers: ACUs spent and PRs you actually merged without major rework. Cost-per-merged-PR is the only metric that tells you whether Devin is cheaper than the engineering hours it's replacing. A high success rate on trivial tickets can hide a terrible rate on the work you actually care about.
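As a rough sketch of that bookkeeping, the Python below computes cost per merged PR from a hand-kept trial log and splits it by ticket category so that cheap trivial wins cannot mask expensive failures. Everything in it is an assumption for illustration: the `TrialTicket` fields, the category labels, the sample numbers, and the `dollars_per_acu` rate are placeholders, not Cognition's data format or pricing.

```python
from dataclasses import dataclass


@dataclass
class TrialTicket:
    """One ticket from the trial. Field names are hypothetical."""
    ticket_id: str
    category: str         # your own label, e.g. "trivial" or "core"
    acus_spent: float     # ACUs the session consumed, copied from your usage records
    merged_cleanly: bool  # True only if the PR merged without major rework


def cost_per_merged_pr(tickets, dollars_per_acu):
    """Total spend divided by cleanly merged PRs, or None if nothing merged."""
    merged = sum(1 for t in tickets if t.merged_cleanly)
    if merged == 0:
        return None
    total_acus = sum(t.acus_spent for t in tickets)
    # Failed sessions still count toward spend; that is the point of the metric.
    return total_acus * dollars_per_acu / merged


def cost_by_category(tickets, dollars_per_acu):
    """Same metric, computed separately for each category label."""
    groups = {}
    for t in tickets:
        groups.setdefault(t.category, []).append(t)
    return {cat: cost_per_merged_pr(group, dollars_per_acu)
            for cat, group in groups.items()}


# Made-up sample log and a placeholder rate; substitute your own figures.
sample = [
    TrialTicket("T-1", "trivial", 1.0, True),
    TrialTicket("T-2", "trivial", 2.0, True),
    TrialTicket("T-3", "trivial", 3.0, True),
    TrialTicket("T-4", "core", 6.0, True),
    TrialTicket("T-5", "core", 4.0, False),
]
print(cost_per_merged_pr(sample, dollars_per_acu=2.0))
print(cost_by_category(sample, dollars_per_acu=2.0))
```

Dividing total spend by merged PRs, rather than averaging per-ticket cost, charges every abandoned session to the PRs that did ship. Compare the result against what the same tickets would cost in engineering hours, including the time spent reviewing the agent's PRs.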
Because pricing and plan structure have changed more than once, treat any specific dollar figure here as a starting point and confirm current rates on Cognition's site before you budget.
If what you actually want is to write code faster yourself rather than delegate whole tickets to an unsupervised agent, an in-editor tool is a different and often safer bet for the money.
Who should actually buy it
Devin makes sense for teams with a steady stream of well-defined, low-ambiguity work and the discipline to review every PR it produces — agencies doing repetitive migrations, teams paying down test-coverage debt, or anyone with a backlog of mechanical tickets nobody wants to do. For those uses it can genuinely clear work while you sleep.
It does not make sense as a senior-engineer replacement, a solution for vague problems, or a tool you can trust unsupervised. The 2024 launch oversold it on exactly those points, and the 2026 product is more honest precisely because Cognition stopped pretending otherwise. Buy it for what it is — a parallel async agent for legible work — and the value is real. Buy it expecting an autonomous engineer, and you'll spend ACUs learning the same lesson its early critics did.
Originally published at pickuma.com. Subscribe to the RSS feed or follow @pickuma.bsky.social for new reviews.