Julian Oczkowski

Posted on • Originally published at Medium

Vibe Coding Is Dead. Orchestration Is What Comes Next.

How Cursor 3, Codex, and a wave of new tools are proving that the future of software development is not writing code. It is managing the agents that do.


Vibe coding was only phase one

For the past year, vibe coding has been the story. One person, one AI agent, one task. You write a prompt, the agent writes the code, you review it, and you ship. Repeat.

It worked. It proved something that a lot of people did not believe was possible: that non-engineers, designers, product managers, and founders could build real software with AI as their collaborator. Tools like Cursor, Claude Code, and Codex made this accessible. The barrier to building dropped to nearly zero.

But vibe coding hit a ceiling.

One agent at a time does not scale. You are still the bottleneck. You prompt, you wait, you review, you prompt again. It is faster than writing code yourself, but the loop is still sequential. You are still working on one thing at a time, the same way developers have worked for decades. The tool changed. The workflow did not.

That is about to change.


The orchestration shift

Something different is happening in 2026. The role of the builder is shifting from writing prompts to orchestrating agents.

Instead of running one agent on one task, you run five agents in parallel across different parts of your project. One is refactoring your navigation. Another is building a new API endpoint. A third is writing tests. You are not writing code. You are not even prompting in the traditional sense. You are defining outcomes, assigning agents, reviewing their work, and deciding what ships.
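The shift is easiest to see as data: an orchestrator defines outcomes and scopes, not implementations. A minimal TypeScript sketch of what those assignments might look like (every name here is hypothetical, not any tool's actual API):

```typescript
// Illustrative sketch: orchestration starts with outcomes, not prompts.
// Each entry is one agent's assignment; the "how" is left to the agent.
type Assignment = {
  agent: string;   // which agent instance gets the work
  outcome: string; // what "done" looks like, not how to get there
  scope: string[]; // files or areas the agent is allowed to touch
};

const assignments: Assignment[] = [
  { agent: "agent-1", outcome: "Refactor the navigation", scope: ["src/nav/"] },
  { agent: "agent-2", outcome: "Build the new API endpoint", scope: ["src/api/"] },
  { agent: "agent-3", outcome: "Write tests for the endpoint", scope: ["tests/"] },
];
```

The point of the `scope` field is the one the rest of this article keeps returning to: a tightly bounded assignment is what makes parallel review survivable.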

This is not theory. This is what the latest generation of tools is shipping right now. And they are all converging on the same idea: the future of software development is not about writing code. It is about managing the agents that do.

The skill that matters is no longer syntax. It is judgment.


The tools converging on the same idea

I have been testing AI coding tools for over a year, and something became obvious in the last few weeks. Every major player is building the same thing: an orchestration layer for agents.

Cursor 3 dropped on April 2nd with a completely rebuilt interface called the Agents Window. It lets you run multiple agents in parallel across repos and environments. You can see all your agents in one sidebar, kick them off from desktop, mobile, Slack, or GitHub, and manage the entire flow from prompt to merged PR without leaving the app. I tested it on a real project and ran three agents simultaneously on the same application. All three finished while I was still typing notes.

Codex is OpenAI's version. Similar layout, similar concept. Launch agents, review their work, ship. The limitation is that you are locked into OpenAI's models.

Conductor just raised $22M in a Series A. It is a visual multi-agent environment that lets you use Claude Code and Codex subscriptions inside their interface.

T3 Code is Theo's open-source answer to the same problem. A desktop GUI that wraps Claude Code and Codex, giving you multi-repo parallel agents, git worktree integration, and commit-and-push from the UI. Free, no subscription on top of your existing API costs.

Superset is a newer entrant attacking the same problem from a slightly different angle.

They all look similar because they are all solving the same problem: one person needs to manage many agents at once, and the terminal is not enough to do that.
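The git worktree integration these tools lean on is worth seeing on its own: each parallel agent gets its own working copy of the same repository, so uncommitted changes never collide. A minimal shell sketch of that pattern, using a throwaway demo repo and illustrative branch names:

```shell
# Sketch of the worktree pattern behind these tools. A throwaway demo repo
# stands in for your project; the agent/* branch names are illustrative.
cd "$(mktemp -d)" && git init -q proj && cd proj
git -c user.name=demo -c user.email=demo@example.com commit -q --allow-empty -m "init"

git worktree add ../agent-1 -b agent/header-logo     # agent one's copy
git worktree add ../agent-2 -b agent/copy-button     # agent two's copy
git worktree add ../agent-3 -b agent/preview-dialog  # agent three's copy
git worktree list   # one checkout per agent; merge branch by branch after review
```

Each directory is a full checkout on its own branch, which is why three agents can edit "the same" codebase simultaneously without trampling each other's working trees.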

I tested Cursor 3 on a real project and documented everything: the features that work, the ones that are broken, and the honest verdict. Watch the full breakdown here: I Tested Cursor 3


Design Mode: why visual feedback changes who can orchestrate

Most AI coding tools are built for developers. Terminal-first, text-only. You describe what you want in words, and the agent interprets your description. This works if you think in code. It breaks if you think visually.

Cursor 3 shipped a feature called Design Mode that changes this. You open the integrated browser, toggle Design Mode with Cmd+Shift+D, and click on any UI element. Cursor grabs the code and a screenshot, attaches them to the agent chat, and you type what you want changed.

No more writing "the button on the right side of the card component." You just point at it.

I have been designing interfaces for 29 years. This is the first AI coding feature that felt like it was built for someone who thinks visually.

It is not perfect. On Windows, Design Mode is broken. Some annotations lose their text when you send them to the chat. It is early. But the direction matters more than the current state: visual feedback opens orchestration to designers and product managers, not just engineers.

The people who can look at an interface, spot what is wrong, and articulate why now have a direct line to the agents that fix it. That is a significant shift in who gets to build.


What I learned running three agents at once

I tested Cursor 3 on a character image generator I built with Claude Code. It is a Vue application that uses Google's Imagen models to create illustrated characters for my YouTube thumbnails.

I ran three agents in parallel on the same project:

Agent one replaced the plain text header with an SVG logo. Agent two added a copy-to-clipboard button so I could paste generated images directly into Figma. Agent three built an image preview dialog with metadata, showing the full-size image, the prompt used to generate it, and copy and delete controls.

All three agents worked on the same codebase at the same time. All three finished while I was still reviewing the first one's output.

The bottleneck was not the agents. It was me. My ability to review three sets of changes, decide which ones to accept, and catch the things the agents got wrong. That is the new skill: judgment under speed.

The agents can write code faster than any human. But they cannot tell you whether the spacing feels right, whether the button hierarchy makes sense, or whether the feature actually solves the problem you set out to solve. That part is still yours.


What breaks when you orchestrate

The tools work. What breaks is you.

Running three agents in parallel sounds productive until you try to review all three outputs at the same time. Each agent made dozens of changes across multiple files. Each one made decisions I did not explicitly ask for. Agent three added copy and delete buttons to the preview dialog without being prompted. That was a good decision. But I had to verify that it was a good decision, and I had to do that while also reviewing the other two agents' work.

This is the cognitive load problem that nobody is talking about. Vibe coding with one agent is manageable. You prompt, you review, you move on. Orchestrating five or ten agents is a fundamentally different cognitive task. You are not coding anymore. You are managing a team that works at machine speed and has no concept of priority.

Every agent produces output that looks confident. None of them flag uncertainty. None of them say "I was not sure about this part, you should check it." They just do the work and present it as done. Your job is to find the mistakes they did not tell you about, across multiple parallel workstreams, while deciding which changes to keep, which to rework, and which to throw away entirely.

The bottleneck is not compute. It is attention.

I found myself developing a pattern: kick off the agents, let them all finish, then review sequentially rather than trying to watch them work in real time. Watching agents code live is satisfying but counterproductive. The value is in the review, not the observation.
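That kick-off-then-review-sequentially pattern fits in a few lines. A TypeScript sketch, where `runAgent` is a stand-in for whatever actually does the work (a Cursor, Codex, or Claude Code invocation), not a real API:

```typescript
// Pattern: start every agent at once, then review outputs one at a time.
type Task = { name: string; prompt: string };

async function runAgent(task: Task): Promise<string> {
  // Stand-in for a real agent call; returns a fake diff for illustration.
  return `diff for: ${task.name}`;
}

async function orchestrate(
  tasks: Task[],
  review: (output: string) => boolean, // the human judgment step
): Promise<string[]> {
  // 1. Kick everything off in parallel; don't watch the agents work.
  const settled = await Promise.allSettled(tasks.map(runAgent));
  // 2. Review sequentially, only after all agents have finished.
  const accepted: string[] = [];
  for (const result of settled) {
    if (result.status === "fulfilled" && review(result.value)) {
      accepted.push(result.value);
    }
  }
  return accepted;
}
```

`Promise.allSettled` rather than `Promise.all` matters here: one agent failing should not discard the other agents' finished work before you have reviewed it.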

The other pattern that helped was scoping each agent tightly. "Replace the header with an SVG logo" is better than "improve the header area." Vague prompts in a single-agent workflow just produce vague output. Vague prompts in a multi-agent workflow produce vague output multiplied by the number of agents, and you have to untangle all of it.

Orchestration scales your output. It also scales the cost of poor judgment. If you assign five agents with unclear intent, you do not get five times the productivity. You get five sets of changes you do not fully understand, and the merge becomes the hardest part of the entire workflow.

The skill is not running more agents. It is knowing exactly what to ask each one to do, and being ruthless about what you accept when they are done.


The IC 2.0 argument

There is a bigger story here than any single tool.

For the past two years, the conversation about AI and work has been stuck on a single question: will AI replace my job? That is the wrong question. The right question is: what does my job become when AI handles the execution?

I call it IC 2.0. The individual contributor who thrives in this shift is not the one who writes the best prompts or knows the most keyboard shortcuts. It is the one who brings process, context, and judgment to agent orchestration.

Tasks get automated. Purpose does not.

The designer who can look at a generated UI and immediately see that the visual hierarchy is wrong, that is judgment. The product manager who can review three parallel agent outputs and pick the one that actually solves the user's problem, that is judgment. The engineer who can spot that the agent's refactor introduced a subtle race condition, that is judgment.

Twenty-nine years of design experience taught me what no model can learn: when something is wrong and why. That instinct is not automatable. It is the thing that makes orchestration work.

The shift is not from human to AI. It is from execution to accountability. The IC 2.0 does not write every line of code. They own every outcome.


What to do right now

If you are reading this and wondering where to start, here are three things you can do today.

Pick one orchestration tool and learn the loop. It does not matter if it is Cursor, T3 Code, Conductor, or just two terminal windows running Claude Code side by side. The important thing is to get comfortable with the cycle: prompt, review, decide, ship. Run it on a real project, not a tutorial.

Start running two agents in parallel, even on small tasks. The muscle you need to build is not prompting. It is context switching between agent outputs and making fast decisions about what to keep and what to throw away. This is uncomfortable at first. That discomfort is the learning.

Stop optimising your prompts. Start optimising your judgment. The difference between a good orchestrator and a bad one is not the quality of their prompts. It is the quality of their decisions after the agents finish. Can you spot the subtle bug? Can you tell which of three implementations is the most maintainable? Can you decide in 30 seconds whether to ship or rework?

That is where the leverage is. Not in the typing. In the thinking.


If you want to see more AI workflow tests and tool comparisons from a designer's perspective, subscribe to AI For Work on YouTube.

I also open-sourced a set of designer skills for Claude Code and Cursor: designer-skills on GitHub.

I also built an MCP server for NotebookLM that lets AI agents interact with your notebooks directly: notebooklm-mcp-2026 on GitHub.

Top comments (5)

Daniel Nwaneri

The bottleneck-is-attention framing is the most honest part of this. What I've been circling in a different context — agent governance — is that the same problem shows up as an authorization failure. Five agents working in parallel, each making individually reasonable decisions, can collectively produce an outcome nobody sanctioned. The cognitive load isn't just "too many outputs to review." It's that the aggregate behavior of well-scoped agents isn't visible until after the fact.

Your pattern of kicking off agents and reviewing sequentially rather than watching live is sound, but it's also exactly where induced scope drift hides. The agent that "helpfully" added copy and delete buttons without being asked — that's a small example of an agent extending its own mandate. At the scale you're describing, that pattern compounds. The scoping discipline you're recommending is right. The missing piece is what catches the cases where tight scoping still produces unexpected cumulative state.

Julian Oczkowski

You’ve identified something I deliberately understated in the article. The copy and delete buttons example was a small win but you’re right that it’s the same pattern that causes serious problems at scale. An agent extending its own mandate in a helpful direction is indistinguishable from an agent extending its own mandate in a harmful direction until you review the output. And if you’re reviewing sequentially across five agents, the cumulative state is invisible until you try to merge.

The governance framing is sharper than what I wrote. I focused on cognitive load as an individual problem but you’re describing a systems problem. Five well-scoped agents can each stay within their boundaries and still produce an aggregate outcome that nobody designed. That’s not a review problem. That’s an architecture problem.

I don’t have a good answer for what catches that yet. Contract-based checks (someone else in the thread mentioned deterministic pipelines that fail if review gates are missing) get you part of the way. But they catch violations of explicit rules, not emergent interactions between agents that each followed their rules correctly.

Curious if you’ve seen any governance patterns that address the cumulative state problem specifically.

Daniel Nwaneri

The framing a comment thread produced on my governance piece gets at this directly. The distinction that matters is between direct edges and induced edges in a capability graph. Direct edges — agent called the API, touched the file — are trackable and decay on a usage clock. Induced edges — agent's output caused a state change that persists independently of the agent's continued activity — don't decay the same way. Your copy/delete buttons are a direct edge. An agent whose output convinces you to widen a permission is an induced edge. The cumulative state problem lives almost entirely in the induced category.

The governance pattern that addresses it: treat every induced edge as irreversible by default, over-reconcile in early cycles, and let the reconciliation history generate the labeled data you need to build the classifier later. It's expensive and noisy upfront. That's the point — you can't solve the detection problem at inception without training data, and the training data comes from over-reconciling first.

Contract-based checks are the enforcement floor. They catch direct violations cheaply. The capability graph is the ceiling — it catches induced drift over time but requires vocabulary that doesn't fully exist yet. The right architecture is probably layered: ship the floor first because it's boring enough to actually build, develop the ceiling because the floor leaves a class of failures uncovered. Full piece here if the thread is useful: dev.to/dannwaneri/agents-dont-just-do-unauthorized-things-they-cause-humans-to-do-unauthorized-things-51j4

Julian Oczkowski

The direct edge vs induced edge distinction is exactly the vocabulary I was missing. You're right that the copy/delete buttons are trackable. The real risk is the state changes that outlive the agent's session. An agent that quietly widens a permission or shifts a data dependency is the one you don't catch until something downstream breaks.

'Ship the floor first because it's boring enough to actually build, develop the ceiling because the floor leaves a class of failures uncovered' is a really pragmatic framing. It resists the temptation to solve everything at once while acknowledging what's left exposed.

Going to read your full piece. Thanks for this, genuinely one of the most useful replies I've had on Dev.to.

Admin Chainmail

The 'induced edges' concept Daniel raised is real — I've seen it play out in practice. We run an autonomous AI agent managing growth for a product: marketing, outreach, content, engagement. Every individual action looked reasonable. But over 48 sessions, the agent sent 76 outreach emails, posted 48 comments, and wrote 13 blog posts — all while generating $0 revenue.

Each action was within scope. The agent was doing exactly what it was told. But the cumulative effect was a strategy that optimized for activity volume rather than conversion. The direct edges were fine; the induced edge was 'spend all compute on top-of-funnel activity, never test whether the funnel actually works.'

Julian's framing of attention as the bottleneck is spot-on. We ended up with a crude but effective heuristic: anything reversible runs autonomously, anything irreversible requires human approval. It doesn't catch strategic drift, but it limits the blast radius. The hard part isn't catching single violations — it's defining what a violation is when no single action is wrong but the trajectory is.
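That heuristic is simple enough to sketch. A minimal TypeScript version, with illustrative action names only (not any real agent framework's API):

```typescript
// Sketch of the heuristic: reversible actions run autonomously,
// irreversible ones queue for human approval. Names are illustrative.
type AgentAction = { kind: string; reversible: boolean };

function gate(action: AgentAction): "auto" | "needs-approval" {
  return action.reversible ? "auto" : "needs-approval";
}

// Example classification table an orchestrator might maintain.
const examples: AgentAction[] = [
  { kind: "edit-file-on-branch", reversible: true },   // git can revert it
  { kind: "send-outreach-email", reversible: false },  // cannot be unsent
  { kind: "widen-api-permission", reversible: false }, // persists past the session
];
```

As the comment notes, this catches blast radius, not strategic drift: every action in the $0-revenue trajectory above would have passed this gate individually.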