Julian Oczkowski

Posted on • Originally published at Medium

Vibe Coding Is Dead. Orchestration Is What Comes Next.

How Cursor 3, Codex, and a wave of new tools are proving that the future of software development is not writing code. It is managing the agents that do.


Vibe coding was only phase one

For the past year, vibe coding has been the story. One person, one AI agent, one task. You write a prompt, the agent writes the code, you review it, and you ship. Repeat.

It worked. It proved something that a lot of people did not believe was possible: that non-engineers, designers, product managers, and founders could build real software with AI as their collaborator. Tools like Cursor, Claude Code, and Codex made this accessible. The barrier to building dropped to nearly zero.

But vibe coding hit a ceiling.

One agent at a time does not scale. You are still the bottleneck. You prompt, you wait, you review, you prompt again. It is faster than writing code yourself, but the loop is still sequential. You are still working on one thing at a time, the same way developers have worked for decades. The tool changed. The workflow did not.

That is about to change.


The orchestration shift

Something different is happening in 2026. The role of the builder is shifting from writing prompts to orchestrating agents.

Instead of running one agent on one task, you run five agents in parallel across different parts of your project. One is refactoring your navigation. Another is building a new API endpoint. A third is writing tests. You are not writing code. You are not even prompting in the traditional sense. You are defining outcomes, assigning agents, reviewing their work, and deciding what ships.
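The shift is easiest to see as data: an orchestrator defines outcomes and scopes, not implementations. A minimal TypeScript sketch of what those assignments might look like (every name here is hypothetical, not any tool's actual API):

```typescript
// Illustrative sketch: orchestration starts with outcomes, not prompts.
// Each entry is one agent's assignment; the "how" is left to the agent.
type Assignment = {
  agent: string;   // which agent instance gets the work
  outcome: string; // what "done" looks like, not how to get there
  scope: string[]; // files or areas the agent is allowed to touch
};

const assignments: Assignment[] = [
  { agent: "agent-1", outcome: "Refactor the navigation", scope: ["src/nav/"] },
  { agent: "agent-2", outcome: "Build the new API endpoint", scope: ["src/api/"] },
  { agent: "agent-3", outcome: "Write tests for the endpoint", scope: ["tests/"] },
];
```

The point of the `scope` field is the one the rest of this article keeps returning to: a tightly bounded assignment is what makes parallel review survivable.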

This is not theory. This is what the latest generation of tools is shipping right now. And they are all converging on the same idea: the future of software development is not about writing code. It is about managing the agents that do.

The skill that matters is no longer syntax. It is judgment.


The tools converging on the same idea

I have been testing AI coding tools for over a year, and something became obvious in the last few weeks. Every major player is building the same thing: an orchestration layer for agents.

Cursor 3 dropped on April 2nd with a completely rebuilt interface called the Agents Window. It lets you run multiple agents in parallel across repos and environments. You can see all your agents in one sidebar, kick them off from desktop, mobile, Slack, or GitHub, and manage the entire flow from prompt to merged PR without leaving the app. I tested it on a real project and ran three agents simultaneously on the same application. All three finished while I was still typing notes.

Codex is OpenAI's version. Similar layout, similar concept. Launch agents, review their work, ship. The limitation is that you are locked into OpenAI's models.

Conductor just raised $22M in a Series A. It is a visual multi-agent environment that lets you use Claude Code and Codex subscriptions inside their interface.

T3 Code is Theo's open-source answer to the same problem. A desktop GUI that wraps Claude Code and Codex, giving you multi-repo parallel agents, git worktree integration, and commit-and-push from the UI. Free, no subscription on top of your existing API costs.

Superset is a newer entrant attacking the same problem from a slightly different angle.

They all look similar because they are all solving the same problem: one person needs to manage many agents at once, and the terminal is not enough to do that.
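The git worktree integration these tools lean on is worth seeing on its own: each parallel agent gets its own working copy of the same repository, so uncommitted changes never collide. A minimal shell sketch of that pattern, using a throwaway demo repo and illustrative branch names:

```shell
# Sketch of the worktree pattern behind these tools. A throwaway demo repo
# stands in for your project; the agent/* branch names are illustrative.
cd "$(mktemp -d)" && git init -q proj && cd proj
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"

git worktree add ../agent-1 -b agent/header-logo     # agent one's copy
git worktree add ../agent-2 -b agent/copy-button     # agent two's copy
git worktree add ../agent-3 -b agent/preview-dialog  # agent three's copy
git worktree list   # one checkout per agent; merge branch by branch after review
```

Each directory is a full checkout on its own branch, which is why three agents can edit "the same" codebase simultaneously without trampling each other's working trees.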

I tested Cursor 3 on a real project and documented everything: the features that work, the ones that are broken, and the honest verdict. Watch the full breakdown here: I Tested Cursor 3


Design Mode: why visual feedback changes who can orchestrate

Most AI coding tools are built for developers. Terminal-first, text-only. You describe what you want in words, and the agent interprets your description. This works if you think in code. It breaks if you think visually.

Cursor 3 shipped a feature called Design Mode that changes this. You open the integrated browser, toggle Design Mode with Cmd+Shift+D, and click on any UI element. Cursor grabs the code and a screenshot, attaches them to the agent chat, and you type what you want changed.

No more writing "the button on the right side of the card component." You just point at it.

I have been designing interfaces for 29 years. This is the first AI coding feature that felt like it was built for someone who thinks visually.

It is not perfect. On Windows, Design Mode is broken. Some annotations lose their text when you send them to the chat. It is early. But the direction matters more than the current state: visual feedback opens orchestration to designers and product managers, not just engineers.

The people who can look at an interface, spot what is wrong, and articulate why now have a direct line to the agents that fix it. That is a significant shift in who gets to build.


What I learned running three agents at once

I tested Cursor 3 on a character image generator I built with Claude Code. It is a Vue application that uses Google's Imagen models to create illustrated characters for my YouTube thumbnails.

I ran three agents in parallel on the same project:

Agent one replaced the plain text header with an SVG logo. Agent two added a copy-to-clipboard button so I could paste generated images directly into Figma. Agent three built an image preview dialog with metadata, showing the full-size image, the prompt used to generate it, and copy and delete controls.

All three agents worked on the same codebase at the same time. All three finished while I was still reviewing the first one's output.

The bottleneck was not the agents. It was me. My ability to review three sets of changes, decide which ones to accept, and catch the things the agents got wrong. That is the new skill: judgment under speed.

The agents can write code faster than any human. But they cannot tell you whether the spacing feels right, whether the button hierarchy makes sense, or whether the feature actually solves the problem you set out to solve. That part is still yours.


What breaks when you orchestrate

The tools work. What breaks is you.

Running three agents in parallel sounds productive until you try to review all three outputs at the same time. Each agent made dozens of changes across multiple files. Each one made decisions I did not explicitly ask for. Agent three added copy and delete buttons to the preview dialog without being prompted. That was a good decision. But I had to verify that it was a good decision, and I had to do that while also reviewing the other two agents' work.

This is the cognitive load problem that nobody is talking about. Vibe coding with one agent is manageable. You prompt, you review, you move on. Orchestrating five or ten agents is a fundamentally different cognitive task. You are not coding anymore. You are managing a team that works at machine speed and has no concept of priority.

Every agent produces output that looks confident. None of them flag uncertainty. None of them say "I was not sure about this part, you should check it." They just do the work and present it as done. Your job is to find the mistakes they did not tell you about, across multiple parallel workstreams, while deciding which changes to keep, which to rework, and which to throw away entirely.

The bottleneck is not compute. It is attention.

I found myself developing a pattern: kick off the agents, let them all finish, then review sequentially rather than trying to watch them work in real time. Watching agents code live is satisfying but counterproductive. The value is in the review, not the observation.
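That kick-off-then-review-sequentially pattern fits in a few lines. A TypeScript sketch, where `runAgent` is a stand-in for whatever actually does the work (a Cursor, Codex, or Claude Code invocation), not a real API:

```typescript
// Pattern: start every agent at once, then review outputs one at a time.
type Task = { name: string; prompt: string };

async function runAgent(task: Task): Promise<string> {
  // Stand-in for a real agent call; returns a fake diff for illustration.
  return `diff for: ${task.name}`;
}

async function orchestrate(
  tasks: Task[],
  review: (output: string) => boolean, // the human judgment step
): Promise<string[]> {
  // 1. Kick everything off in parallel; don't watch the agents work.
  const settled = await Promise.allSettled(tasks.map(runAgent));
  // 2. Review sequentially, only after all agents have finished.
  const accepted: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled" && review(result.value)) {
      accepted.push(result.value);
    }
  }
  return accepted;
}
```

`Promise.allSettled` rather than `Promise.all` matters here: one agent failing should not discard the other agents' finished work before you have reviewed it.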

The other pattern that helped was scoping each agent tightly. "Replace the header with an SVG logo" is better than "improve the header area." Vague prompts in a single-agent workflow just produce vague output. Vague prompts in a multi-agent workflow produce vague output multiplied by the number of agents, and you have to untangle all of it.

Orchestration scales your output. It also scales the cost of poor judgment. If you assign five agents with unclear intent, you do not get five times the productivity. You get five sets of changes you do not fully understand, and the merge becomes the hardest part of the entire workflow.

The skill is not running more agents. It is knowing exactly what to ask each one to do, and being ruthless about what you accept when they are done.


The IC 2.0 argument

There is a bigger story here than any single tool.

For the past two years, the conversation about AI and work has been stuck on a single question: will AI replace my job? That is the wrong question. The right question is: what does my job become when AI handles the execution?

I call it IC 2.0. The individual contributor who thrives in this shift is not the one who writes the best prompts or knows the most keyboard shortcuts. It is the one who brings process, context, and judgment to agent orchestration.

Tasks get automated. Purpose does not.

The designer who can look at a generated UI and immediately see that the visual hierarchy is wrong, that is judgment. The product manager who can review three parallel agent outputs and pick the one that actually solves the user's problem, that is judgment. The engineer who can spot that the agent's refactor introduced a subtle race condition, that is judgment.

Twenty-nine years of design experience taught me what no model can learn: when something is wrong and why. That instinct is not automatable. It is the thing that makes orchestration work.

The shift is not from human to AI. It is from execution to accountability. The IC 2.0 does not write every line of code. They own every outcome.


What to do right now

If you are reading this and wondering where to start, here are three things you can do today.

Pick one orchestration tool and learn the loop. It does not matter if it is Cursor, T3 Code, Conductor, or just two terminal windows running Claude Code side by side. The important thing is to get comfortable with the cycle: prompt, review, decide, ship. Run it on a real project, not a tutorial.

Start running two agents in parallel, even on small tasks. The muscle you need to build is not prompting. It is context switching between agent outputs and making fast decisions about what to keep and what to throw away. This is uncomfortable at first. That discomfort is the learning.

Stop optimising your prompts. Start optimising your judgment. The difference between a good orchestrator and a bad one is not the quality of their prompts. It is the quality of their decisions after the agents finish. Can you spot the subtle bug? Can you tell which of three implementations is the most maintainable? Can you decide in 30 seconds whether to ship or rework?

That is where the leverage is. Not in the typing. In the thinking.


If you want to see more AI workflow tests and tool comparisons from a designer's perspective, subscribe to AI For Work on YouTube.

I also open-sourced a set of designer skills for Claude Code and Cursor: designer-skills on GitHub.

I also built an MCP server for NotebookLM that lets AI agents interact with your notebooks directly: notebooklm-mcp-2026 on GitHub.

Top comments (5)

Daniel Nwaneri

The bottleneck-is-attention framing is the most honest part of this. What I've been circling in a different context — agent governance — is that the same problem shows up as an authorization failure. Five agents working in parallel, each making individually reasonable decisions, can collectively produce an outcome nobody sanctioned. The cognitive load isn't just "too many outputs to review." It's that the aggregate behavior of well-scoped agents isn't visible until after the fact.

Your pattern of kicking off agents and reviewing sequentially rather than watching live is sound, but it's also exactly where induced scope drift hides. The agent that "helpfully" added copy and delete buttons without being asked — that's a small example of an agent extending its own mandate. At the scale you're describing, that pattern compounds. The scoping discipline you're recommending is right. The missing piece is what catches the cases where tight scoping still produces unexpected cumulative state.

Julian Oczkowski

You’ve identified something I deliberately understated in the article. The copy and delete buttons example was a small win but you’re right that it’s the same pattern that causes serious problems at scale. An agent extending its own mandate in a helpful direction is indistinguishable from an agent extending its own mandate in a harmful direction until you review the output. And if you’re reviewing sequentially across five agents, the cumulative state is invisible until you try to merge.

The governance framing is sharper than what I wrote. I focused on cognitive load as an individual problem but you’re describing a systems problem. Five well-scoped agents can each stay within their boundaries and still produce an aggregate outcome that nobody designed. That’s not a review problem. That’s an architecture problem.

I don’t have a good answer for what catches that yet. Contract-based checks (someone else in the thread mentioned deterministic pipelines that fail if review gates are missing) get you part of the way. But they catch violations of explicit rules, not emergent interactions between agents that each followed their rules correctly.

Curious if you’ve seen any governance patterns that address the cumulative state problem specifically.

Daniel Nwaneri

The framing a comment thread produced on my governance piece gets at this directly. The distinction that matters is between direct edges and induced edges in a capability graph. Direct edges — agent called the API, touched the file — are trackable and decay on a usage clock. Induced edges — agent's output caused a state change that persists independently of the agent's continued activity — don't decay the same way. Your copy/delete buttons are a direct edge. An agent whose output convinces you to widen a permission is an induced edge. The cumulative state problem lives almost entirely in the induced category.

The governance pattern that addresses it: treat every induced edge as irreversible by default, over-reconcile in early cycles, and let the reconciliation history generate the labeled data you need to build the classifier later. It's expensive and noisy upfront. That's the point — you can't solve the detection problem at inception without training data, and the training data comes from over-reconciling first.

Contract-based checks are the enforcement floor. They catch direct violations cheaply. The capability graph is the ceiling — it catches induced drift over time but requires vocabulary that doesn't fully exist yet. The right architecture is probably layered: ship the floor first because it's boring enough to actually build, develop the ceiling because the floor leaves a class of failures uncovered. Full piece here if the thread is useful: dev.to/dannwaneri/agents-dont-just-do-unauthorized-things-they-cause-humans-to-do-unauthorized-things-51j4

Julian Oczkowski

The direct edge vs induced edge distinction is exactly the vocabulary I was missing. You're right that the copy/delete buttons are trackable. The real risk is the state changes that outlive the agent's session. An agent that quietly widens a permission or shifts a data dependency is the one you don't catch until something downstream breaks.

'Ship the floor first because it's boring enough to actually build, develop the ceiling because the floor leaves a class of failures uncovered' is a really pragmatic framing. It resists the temptation to solve everything at once while acknowledging what's left exposed.

Going to read your full piece. Thanks for this, genuinely one of the most useful replies I've had on Dev.to.

Admin Chainmail

The 'induced edges' concept Daniel raised is real — I've seen it play out in practice. We run an autonomous AI agent managing growth for a product: marketing, outreach, content, engagement. Every individual action looked reasonable. But over 48 sessions, the agent sent 76 outreach emails, posted 48 comments, and wrote 13 blog posts — all while generating $0 revenue.

Each action was within scope. The agent was doing exactly what it was told. But the cumulative effect was a strategy that optimized for activity volume rather than conversion. The direct edges were fine; the induced edge was 'spend all compute on top-of-funnel activity, never test whether the funnel actually works.'

Julian's framing of attention as the bottleneck is spot-on. We ended up with a crude but effective heuristic: anything reversible runs autonomously, anything irreversible requires human approval. It doesn't catch strategic drift, but it limits the blast radius. The hard part isn't catching single violations — it's defining what a violation is when no single action is wrong but the trajectory is.
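That heuristic is simple enough to sketch. A minimal TypeScript version, with illustrative action names only (not any real agent framework's API):

```typescript
// Sketch of the heuristic: reversible actions run autonomously,
// irreversible ones queue for human approval. Names are illustrative.
type AgentAction = { kind: string; reversible: boolean };

function gate(action: AgentAction): "auto" | "needs-approval" {
  return action.reversible ? "auto" : "needs-approval";
}

// Example classification table an orchestrator might maintain.
const examples: AgentAction[] = [
  { kind: "edit-file-on-branch", reversible: true },   // git can revert it
  { kind: "send-outreach-email", reversible: false },  // cannot be unsent
  { kind: "widen-api-permission", reversible: false }, // persists past the session
];
```

As the comment notes, this catches blast radius, not strategic drift: every action in the $0-revenue trajectory above would have passed this gate individually.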